Skip to content

predicate pruning: support cast and try_cast for more types #15764

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 24, 2025

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Apr 18, 2025

Adds support for dictionaries and some more primitive types to pruning predicate.
I encountered this when feeding in a stats schema with dictionary encoded columns.
I see no reason we can't support this.

@github-actions github-actions bot added the optimizer Optimizer rules label Apr 18, 2025
@adriangb adriangb changed the title predicate pruning: support dictionaries predicate pruning: support cast and try_cast for more types Apr 18, 2025
@adriangb
Copy link
Contributor Author

@appletreeisyellow would you mind reviewing 🙏🏻 ?

@appletreeisyellow
Copy link
Contributor

@adriangb Yes, I'm happy to review. I'll have some time to review it this afternoon or evening

if from_type == to_type {
return Ok(());
}
// TODO: add an `is_string()` method to DataType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -1544,7 +1572,10 @@ fn build_predicate_expression(
Ok(builder) => builder,
// allow partial failure in predicate expression generation
// this can still produce a useful predicate when multiple conditions are joined using AND
Err(_) => return unhandled_hook.handle(expr),
Err(e) => {
dbg!(format!("Error building pruning expression: {e}"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend changing this to debug! (dbug logging rather than printing to stdout on failure)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb -- this seems like an improvement to me.

However, I am thinking a better solution might be to avoid checking cast behavior entirely in the pruning code and assume any casting has already been done (which I think is true). Maybe we can file a follow on ticket to explore that idea.

This would likely both fix @etsidel's report on #15742 (comment) and would likely be simpler code

@@ -1215,8 +1215,33 @@ fn is_compare_op(op: Operator) -> bool {
// For example, casts from string to numbers is not correct.
// Because the "13" is less than "3" with UTF8 comparison order.
fn verify_support_type_for_prune(from_type: &DataType, to_type: &DataType) -> Result<()> {
// TODO: support other data type for prunable cast or try cast
if matches!(
// Dictionary casts are always supported as long as the value types are supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember why we we even have this code -- I think it may predate unwrapping casts in the logical planning phase so by the time this code sees physical expressions it shouldn't have to add casts at all 🤔

Copy link

@etseidl etseidl Apr 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the big issue is when comparing string to numeric (both sides will be coerced to string IIUC). Rather than a whitelist here, maybe instead disallow casts we know will fail, so this could maybe just return Err if the LHS is numeric or temporal and the RHS is string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl, I'll implement that instead. I think we do need to keep the dict part in here.

Tangential but do any of you know where we have functionality to simplify/adapt a PhysicalExpr given a schema eg to remove unnecessary casts or add cases where necessary?

@etseidl
Copy link

etseidl commented Apr 19, 2025

Thanks @adriangb! I think if we clean up the pruning allowed types this can close #15742.

We can then tackle the special handling for floats later. They're already wrong for non-casts cases anyway.

Copy link
Contributor

@appletreeisyellow appletreeisyellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Yes, Dictionary type can be easily missed in the verification. Thank you for adding the unit tests!

}
// TODO: add an `is_string()` method to DataType
let supported_string_cast = (matches!(
// String -> String casts are suppoted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// String -> String casts are suppoted
// String -> String casts are supported

@adriangb
Copy link
Contributor Author

@etseidl I just pushed a change that does what I think you suggested and only denies a cast between a string and a non string type but otherwise assumes that the general casting rules have already been sorted out.

@appletreeisyellow I pushed several more tests for the dict cases.

Would love to see this merged since it ended up blocking some internal work 🙏🏻

@etseidl
Copy link

etseidl commented Apr 20, 2025

Thanks @adriangb. I'm testing now...

@etseidl
Copy link

etseidl commented Apr 20, 2025

Yep, fixes my issue. Thanks again!

@adriangb
Copy link
Contributor Author

Given approvals and relatively simple change, could a committer merge this please?

@adriangb
Copy link
Contributor Author

Random ping to @xudong963 🙏🏻

@comphead
Copy link
Contributor

Thanks everyone!

@comphead comphead merged commit 1fe856b into apache:main Apr 24, 2025
27 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…5764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unecessary now confusing clause
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants