-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Add warning to .groupby
when null keys would be dropped due to default dropna
#61351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
tehunter
wants to merge
10
commits into
pandas-dev:main
Choose a base branch
from
tehunter:groupby-dropna-warning
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
1705532
add NullKeyWarning
tehunter bfa5846
Add tests
tehunter 992eaff
fix index and Series tests
tehunter d1c5053
add multi-index and categorical tests
tehunter 47cabb2
implement dropna null key warning
tehunter 0bf986e
add test for `Series.groupby`
tehunter cee2378
implement for `Series.groupby`
tehunter 41131a1
add mode.null_grouper_warning option
tehunter 692c153
fix tests which trigger NullKeyWarning
tehunter 9822a66
resolve repr change and empty grouper bug
tehunter File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,12 +9,20 @@ | |
TYPE_CHECKING, | ||
final, | ||
) | ||
import warnings | ||
|
||
import numpy as np | ||
|
||
from pandas._config.config import get_option | ||
|
||
from pandas._libs import lib | ||
from pandas._libs.tslibs import OutOfBoundsDatetime | ||
from pandas.errors import InvalidIndexError | ||
from pandas.errors import ( | ||
InvalidIndexError, | ||
NullKeyWarning, | ||
) | ||
from pandas.util._decorators import cache_readonly | ||
from pandas.util._exceptions import find_stack_level | ||
|
||
from pandas.core.dtypes.common import ( | ||
is_list_like, | ||
|
@@ -55,6 +63,13 @@ | |
from pandas.core.generic import NDFrame | ||
|
||
|
||
_NULL_KEY_MESSAGE = ( | ||
"`dropna` is not specified but grouper encountered null group keys. These keys " | ||
"will be dropped from the result by default. To keep null keys, set `dropna=True`, " | ||
"or to hide this warning and drop null keys, set `dropna=False`." | ||
) | ||
Comment on lines
+66
to
+70
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this a standard approach for a warning message that could be hit from two lines of code? |
||
|
||
|
||
class Grouper: | ||
""" | ||
A Grouper allows the user to specify a groupby instruction for an object. | ||
|
@@ -246,7 +261,7 @@ class Grouper: | |
""" | ||
|
||
sort: bool | ||
dropna: bool | ||
dropna: bool | lib.NoDefault | ||
_grouper: Index | None | ||
|
||
_attributes: tuple[str, ...] = ("key", "level", "freq", "sort", "dropna") | ||
|
@@ -264,19 +279,25 @@ def __init__( | |
level=None, | ||
freq=None, | ||
sort: bool = False, | ||
dropna: bool = True, | ||
dropna: bool | lib.NoDefault = lib.no_default, | ||
) -> None: | ||
self.key = key | ||
self.level = level | ||
self.freq = freq | ||
self.sort = sort | ||
self.dropna = dropna | ||
self._dropna = dropna | ||
|
||
self._indexer_deprecated: npt.NDArray[np.intp] | None = None | ||
self.binner = None | ||
self._grouper = None | ||
self._indexer: npt.NDArray[np.intp] | None = None | ||
|
||
@property | ||
def dropna(self): | ||
if self._dropna is lib.no_default: | ||
return True | ||
return self._dropna | ||
|
||
def _get_grouper( | ||
self, obj: NDFrameT, validate: bool = True | ||
) -> tuple[ops.BaseGrouper, NDFrameT]: | ||
|
@@ -442,7 +463,7 @@ def __init__( | |
sort: bool = True, | ||
observed: bool = False, | ||
in_axis: bool = False, | ||
dropna: bool = True, | ||
dropna: bool | lib.NoDefault = lib.no_default, | ||
uniques: ArrayLike | None = None, | ||
) -> None: | ||
self.level = level | ||
|
@@ -599,9 +620,16 @@ def codes(self) -> npt.NDArray[np.signedinteger]: | |
def uniques(self) -> ArrayLike: | ||
return self._codes_and_uniques[1] | ||
|
||
@property | ||
def dropna(self) -> bool: | ||
if self._dropna is lib.no_default: | ||
return True | ||
return self._dropna | ||
|
||
@cache_readonly | ||
def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | ||
uniques: ArrayLike | ||
unspecified_dropna = self._dropna is lib.no_default | ||
if self._passed_categorical: | ||
# we make a CategoricalIndex out of the cat grouper | ||
# preserving the categories / ordered attributes; | ||
|
@@ -617,11 +645,11 @@ def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | |
else: | ||
ucodes = np.arange(len(categories)) | ||
|
||
has_dropped_na = False | ||
if not self._dropna: | ||
na_mask = cat.isna() | ||
if np.any(na_mask): | ||
has_dropped_na = True | ||
has_na_values = False | ||
na_mask = cat.isna() | ||
if np.any(na_mask): | ||
has_na_values = True | ||
if not self.dropna: | ||
if self._sort: | ||
# NA goes at the end, gets `largest non-NA code + 1` | ||
na_code = len(categories) | ||
|
@@ -637,11 +665,18 @@ def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | |
) | ||
codes = cat.codes | ||
|
||
if has_dropped_na: | ||
if not self._sort: | ||
# NA code is based on first appearance, increment higher codes | ||
codes = np.where(codes >= na_code, codes + 1, codes) | ||
codes = np.where(na_mask, na_code, codes) | ||
if has_na_values: | ||
if not self.dropna: | ||
if not self._sort: | ||
# NA code is based on first appearance, increment higher codes | ||
codes = np.where(codes >= na_code, codes + 1, codes) | ||
codes = np.where(na_mask, na_code, codes) | ||
elif get_option("null_grouper_warning") and unspecified_dropna: | ||
warnings.warn( | ||
_NULL_KEY_MESSAGE, | ||
NullKeyWarning, | ||
stacklevel=find_stack_level(), | ||
) | ||
|
||
return codes, uniques | ||
|
||
|
@@ -660,8 +695,19 @@ def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | |
# error: Incompatible types in assignment (expression has type "Union[ | ||
# ndarray[Any, Any], Index]", variable has type "Categorical") | ||
codes, uniques = algorithms.factorize( # type: ignore[assignment] | ||
self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna | ||
self.grouping_vector, sort=self._sort, use_na_sentinel=self.dropna | ||
) | ||
if ( | ||
get_option("null_grouper_warning") | ||
and unspecified_dropna | ||
and codes.min(initial=0) == -1 | ||
): | ||
warnings.warn( | ||
_NULL_KEY_MESSAGE, | ||
NullKeyWarning, | ||
stacklevel=find_stack_level(), | ||
) | ||
|
||
return codes, uniques | ||
|
||
@cache_readonly | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the implementation is trivial, but this is redundant with
Grouper
. I'm not sure we can get around it while still being a class property, but should the default value be referenced as a constant defined just once?