feat: Add option to adjust writer buffer size for query output #15747
Conversation
Thank you @m09526 -- in general this looks great. I have just one small API suggestion
```diff
@@ -88,6 +91,21 @@ pub async fn create_writer(
     file_compression_type.convert_async_writer(buf_writer)
 }

 /// Returns an [`AsyncWrite`] which writes to the given object store location
```
If we are going to add a new API, can we please add a builder-style one here so future additional options are easier to add?
So instead of

```rust
let mut object_store_writer = create_writer_with_size(
    FileCompressionType::UNCOMPRESSED,
    &path,
    Arc::clone(&object_store),
    context
        .session_config()
        .options()
        .execution
        .objectstore_writer_buffer_size,
);
```
Something more like
```rust
let mut object_store_writer = ObjectStoreWriterBuilder::new(
    FileCompressionType::UNCOMPRESSED,
    &path,
    Arc::clone(&object_store),
)
.with_buffer_size(
    context
        .session_config()
        .options()
        .execution
        .objectstore_writer_buffer_size,
)
.build();
```
And then maybe mark `create_writer` as deprecated per this page: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines (the next release will be 48.0.0), and update `create_writer` to use the new builder?
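The guideline linked above boils down to Rust's built-in `#[deprecated]` attribute. A minimal sketch of what that could look like here (the note text, the synchronous signature, and the 10 MiB return value are illustrative assumptions, not the PR's actual code):

```rust
/// Hypothetical stand-in for the existing function; in the real crate it is
/// async and takes the compression type, path, and object store.
#[deprecated(since = "48.0.0", note = "use ObjectStoreWriterBuilder instead")]
pub fn create_writer() -> usize {
    // A deprecated wrapper would delegate to the new builder internally;
    // here we just return an assumed 10 MiB default buffer size.
    10 * 1024 * 1024
}

fn main() {
    // Callers still compile, but rustc emits a deprecation warning unless
    // the call site opts out with #[allow(deprecated)].
    #[allow(deprecated)]
    let buffer_size = create_writer();
    println!("default buffer: {buffer_size} bytes");
}
```

Because `#[deprecated]` only produces a warning, existing callers keep working for a release cycle while being nudged toward the builder.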
Thank you for the review @alamb, I'll get on to those improvements.
I've added a builder API which tries to match the requested approach.
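For readers following the thread, the overall shape of such a builder can be sketched without the `object_store` dependency. Everything below is a simplified illustration: the real `ObjectStoreWriterBuilder` also takes an `Arc<dyn ObjectStore>`, its `build()` would return an async writer rather than plain values, and the 10 MiB default is an assumption:

```rust
/// Simplified stand-in for the real compression-type argument.
#[derive(Clone, Copy)]
struct FileCompressionType;

// Assumed default; the real default comes from the execution options.
const DEFAULT_BUFFER_SIZE: usize = 10 * 1024 * 1024;

struct ObjectStoreWriterBuilder {
    compression: FileCompressionType,
    path: String,
    buffer_size: usize,
}

impl ObjectStoreWriterBuilder {
    /// Required arguments go in the constructor.
    fn new(compression: FileCompressionType, path: &str) -> Self {
        Self {
            compression,
            path: path.to_string(),
            buffer_size: DEFAULT_BUFFER_SIZE,
        }
    }

    /// Optional knob; more `with_*` options can be added later without
    /// breaking existing callers.
    fn with_buffer_size(mut self, buffer_size: usize) -> Self {
        self.buffer_size = buffer_size;
        self
    }

    /// The real build() would create the buffered object-store writer;
    /// here we just surface the configured values.
    fn build(self) -> (String, usize) {
        (self.path, self.buffer_size)
    }
}

fn main() {
    let (path, size) =
        ObjectStoreWriterBuilder::new(FileCompressionType, "out/part-0.parquet")
            .with_buffer_size(100 * 1024 * 1024)
            .build();
    println!("{path} -> {size} byte buffer");
}
```

The key property is the one the review asked for: callers that don't care about the buffer size never mention it, so new options never force signature changes.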
Which issue does this PR close?
Rationale for this change
DataFusion uses `object_store`'s `BufWriter` to write query results to remote object stores. Internally, this may be implemented using a remote store API that uploads an object in separate chunks. For example, AWS S3 imposes a maximum count of 10,000 chunks, so with a default chunk size of 10 MiB the total output size is limited to 10,000 × 10 MiB ≈ 100 GiB. To allow DataFusion to write objects larger than this limit, we want to increase the buffer size. Clients expecting to write a large output can increase the chunk size via the execution option `objectstore_writer_buffer_size`.
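The part-count arithmetic is easy to sanity check. A standalone sketch (the 10,000-part cap is S3's multipart limit from the paragraph above; the buffer sizes are example values standing in for `objectstore_writer_buffer_size`):

```rust
/// Maximum object size reachable when each uploaded part is `buffer_size`
/// bytes and the store caps an upload at `max_parts` parts.
fn max_object_bytes(buffer_size: u64, max_parts: u64) -> u64 {
    buffer_size * max_parts
}

fn main() {
    const MIB: u64 = 1024 * 1024;
    const S3_MAX_PARTS: u64 = 10_000;

    // How the writable object-size cap scales with the buffer size.
    for buffer_mib in [10u64, 50, 100] {
        let cap = max_object_bytes(buffer_mib * MIB, S3_MAX_PARTS);
        println!(
            "{buffer_mib} MiB buffer -> {:.1} GiB max object",
            cap as f64 / (1024.0 * MIB as f64)
        );
    }
}
```

With the default 10 MiB buffer this lands just under 100 GiB; raising the buffer raises the cap proportionally, which is the whole point of the new option.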
What changes are included in this PR?
Allow the client to set the size of the buffer used when DataFusion writes output data to `object_store`. This is controlled via the new execution option `objectstore_writer_buffer_size`.
Are these changes tested?
Should all be covered by existing tests in DataFusion.
Are there any user-facing changes?
New configuration option `objectstore_writer_buffer_size` in `ExecutionOptions`.
New function `create_writer_with_size` in `datafusion::datasource::file_format::write`.
Documentation added to `datafusion/docs/source/user-guide/configs.md`.