Skip to content

(fix): remove _getattr__ method for PandasExtensionArray #10250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 258 commits into from
Apr 29, 2025

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Apr 25, 2025

This PR makes explicit the methods needed for duck array typing instead of using getattr on the underlying array (thereby resolving the linked issue). It also fixes a bug in __getitem__ which should further help solidify the API (which was typed correctly).

kmuehlbauer and others added 30 commits October 18, 2024 07:31
…t resolution, fix code and tests to allow this
… more carefully, for now using pd.Series to covert `OMm` type datetimes/timedeltas (will result in ns precision)
…rray` series creating an extension array when `.array` is accessed
Co-authored-by: Stephan Hoyer <shoyer@google.com>
@dcherian
Copy link
Contributor

The core issue seems to be the __getattr__ override, which is overly broad and tricky to get right. Can we get rid of that instead?

@ilan-gold
Copy link
Contributor Author

@dcherian We can get rid of it, but then I think we would have to implement everything explicitly. I handled the getattr recursion issue directly, which obviated the need for any new methods, great suggestions.

So the only question is whether we want to implement ndims shape dtype and whatever else is currently routed through getattr explicitly.

@ilan-gold ilan-gold changed the title (fix): add copy + deepcopy to extension array (fix): _getattr__ method for PandasExtensionArray Apr 28, 2025
@@ -109,7 +111,7 @@ def __repr__(self):
return f"PandasExtensionArray(array={self.array!r})"

def __getattr__(self, attr: str) -> object:
return getattr(self.array, attr)
return getattr(super().__getattribute__("array"), attr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand this :(. I still lean toward explicit attributes. Subclassing from NdimLenMixin or whatever (see indexing.py) might reduce some pain

Copy link
Contributor Author

@ilan-gold ilan-gold Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So without __deepcopy__ or __copy__, the object is first constructed and then the sub-objects are attached:
"A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original" + "A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original."

So without array, this method then calls getattr for "array" again while looking for __setstate__ (which is apparently the first thing sought in copy.copy from the under-construction copied object) because of self.array. __getattribute__ bypasses the lookup mechanism of __getattr__.

See https://stackoverflow.com/questions/40583131/python-deepcopy-with-custom-getattr-and-setattr as well. I will push a commit with all the methods needed enumerated and try to use inheritance

@ilan-gold ilan-gold changed the title (fix): _getattr__ method for PandasExtensionArray (fix): remove _getattr__ method for PandasExtensionArray Apr 29, 2025
@ilan-gold
Copy link
Contributor Author

@dcherian Removing the method revealed some other small things I cleaned up including a bug in __getitem__ where PandasExtensionArray was not returning itself. I like where things landed, thanks for the review and sticking with it

return item
return PandasExtensionArray(item)
if np.isscalar(item) or isinstance(key, int):
return PandasExtensionArray(type(self.array)._from_sequence([item])) # type: ignore[call-arg] # only subclasses with proper __init__ allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the _from_sequence, looks sketchy in the long-term

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, not sure why these underscore methods are in the public api, but they appear to be https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray._from_sequence.html so that's what im going on

ilan-gold and others added 2 commits April 29, 2025 16:19
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@dcherian dcherian enabled auto-merge (squash) April 29, 2025 15:24
@dcherian dcherian merged commit 93a4c03 into pydata:main Apr 29, 2025
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecursionError when reindexing Dataset with pandas extension arrays inside
4 participants