feat: Add option to adjust writer buffer size for query output #15747

Open
wants to merge 7 commits into
base: main

@m09526 m09526 commented Apr 17, 2025

Which issue does this PR close?

Rationale for this change

DataFusion uses object_store's BufWriter to write query results to remote object stores. Internally, BufWriter may use the remote store's multipart upload API to upload an object in separate chunks, and the store may cap the number of chunks per object. For example, AWS S3 allows at most 10,000 parts per upload. With the default chunk size of 10 MiB, this limits total output size to 10,000 x 10 MiB = 100,000 MiB, or roughly 98 GiB.

To allow DataFusion to write objects larger than this limit, we want to increase the buffer size. Clients expecting to write a large output can increase the chunk size via the execution option objectstore_writer_buffer_size.
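As a rough illustration of the arithmetic above, the sketch below computes the output size limit for a given buffer size and the smallest buffer size needed for a target output size. It assumes every uploaded part is exactly one buffer in size and that the only constraint is the 10,000-part cap; the function and constant names are hypothetical and not part of this PR.

```rust
/// Maximum number of parts AWS S3 accepts in one multipart upload.
const S3_MAX_PARTS: u64 = 10_000;
const MIB: u64 = 1024 * 1024;

/// Largest object (in bytes) that fits within the part limit,
/// assuming every part is exactly `buffer_size` bytes.
fn max_object_size(buffer_size: u64) -> u64 {
    S3_MAX_PARTS * buffer_size
}

/// Smallest buffer size (in bytes, rounded up to a whole MiB) that lets
/// `target_size` bytes fit within the part limit.
fn min_buffer_size(target_size: u64) -> u64 {
    let bytes_per_part = target_size.div_ceil(S3_MAX_PARTS);
    bytes_per_part.div_ceil(MIB) * MIB
}

fn main() {
    // Default 10 MiB buffer: 100,000 MiB in total.
    println!("limit at 10 MiB: {} MiB", max_object_size(10 * MIB) / MIB);

    // Buffer size needed to write a 1 TiB object.
    let one_tib = 1024 * 1024 * MIB;
    println!("buffer for 1 TiB: {} MiB", min_buffer_size(one_tib) / MIB);
}
```

A value computed this way is what a client would pass to objectstore_writer_buffer_size before writing a large output.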

What changes are included in this PR?

Allow clients to set the size of the buffer used when DataFusion writes output data to object_store. This is controlled via the new execution option objectstore_writer_buffer_size.

Are these changes tested?

These changes should all be covered by existing tests in DataFusion.

Are there any user-facing changes?

New configuration option in ExecutionOptions.
New function create_writer_with_size in datafusion::datasource::file_format::write.
Documentation added to datafusion/docs/source/user-guide/configs.md.

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels Apr 17, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 17, 2025
@alamb alamb left a comment

Thank you @m09526 -- in general this looks great. I have just one small API suggestion.

@@ -88,6 +91,21 @@ pub async fn create_writer(
file_compression_type.convert_async_writer(buf_writer)
}

/// Returns an [`AsyncWrite`] which writes to the given object store location
Contributor

If we are going to add a new API, can we please add a builder-style one here so that adding further options in the future is easier?

So instead of

            let mut object_store_writer = create_writer_with_size(
                FileCompressionType::UNCOMPRESSED,
                &path,
                Arc::clone(&object_store),
                context
                    .session_config()
                    .options()
                    .execution
                    .objectstore_writer_buffer_size,

Something more like

            let mut object_store_writer = ObjectStoreWriterBuilder::new(
                FileCompressionType::UNCOMPRESSED,
                &path,
                Arc::clone(&object_store),
            )
            .with_buffer_size(
                context
                    .session_config()
                    .options()
                    .execution
                    .objectstore_writer_buffer_size,
            )
            .build();

Contributor

And then maybe mark create_writer as deprecated per this page: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines (next release will be 48.0.0)

And update create_writer to use the new builder?
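To make the suggestion concrete, here is a minimal, self-contained sketch of the builder-plus-deprecated-wrapper shape. It is not the PR's code: WriterBuilder, WriterSettings and create_writer_settings are hypothetical stand-ins, and the real builder would take the compression type, path and object store and return an AsyncWrite rather than a plain struct.

```rust
/// Hypothetical stand-in for what the real builder would produce.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterSettings {
    pub location: String,
    /// `None` means "use the writer's default buffer size".
    pub buffer_size: Option<usize>,
}

/// Hypothetical builder: required arguments go to `new`, optional ones
/// get their own `with_*` method, so new options do not break callers.
pub struct WriterBuilder {
    location: String,
    buffer_size: Option<usize>,
}

impl WriterBuilder {
    pub fn new(location: impl Into<String>) -> Self {
        Self { location: location.into(), buffer_size: None }
    }

    /// Override the buffer size in bytes.
    pub fn with_buffer_size(mut self, buffer_size: usize) -> Self {
        self.buffer_size = Some(buffer_size);
        self
    }

    pub fn build(self) -> WriterSettings {
        WriterSettings { location: self.location, buffer_size: self.buffer_size }
    }
}

/// The old entry point stays for now, marked deprecated, and simply
/// delegates to the builder.
#[deprecated(since = "48.0.0", note = "use WriterBuilder instead")]
pub fn create_writer_settings(location: &str) -> WriterSettings {
    WriterBuilder::new(location).build()
}

#[allow(deprecated)]
fn main() {
    let settings = WriterBuilder::new("out/part-0.parquet")
        .with_buffer_size(64 * 1024 * 1024)
        .build();
    println!("{settings:?}");

    // Existing callers keep working, but see a deprecation warning
    // unless they opt out as this function does.
    let legacy = create_writer_settings("out/part-0.parquet");
    println!("{legacy:?}");
}
```

With this shape, a later option only needs another with_* method, and the function signature that existing callers depend on never changes.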

Contributor Author

Thank you for the review @alamb, I'll get on to those improvements.

Contributor Author

I've added a builder API which tries to match the requested approach.

@m09526 m09526 requested a review from alamb April 23, 2025 14:04
Successfully merging this pull request may close these issues.

Add configuration option to adjust ObjectStore's BufWriter upload size to support large file uploading to S3