Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Enable pytables to round-trip with StringDtype #60663

Merged
merged 7 commits into from
Jan 23, 2025

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Jan 5, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Replaces #60625

Writing out with object dtype still infers str when infer_string=True.

Comment on lines 3033 to 3041
if dtype == "str[python]":
dtype = StringDtype("python", np.nan)
elif dtype == "string[python]":
dtype = StringDtype("python", NA)
elif dtype == "str[pyarrow]":
dtype = StringDtype("pyarrow", np.nan)
else:
assert dtype == "string[pyarrow]"
dtype = StringDtype("pyarrow", NA)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there is a better approach here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringDtype.construct_from_string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wouldn't allow round-tripping if you e.g. write out a Python-backed string with NaN-semantics, and read it in an environment with PyArrow installed.

@rhshadrach rhshadrach marked this pull request as ready for review January 6, 2025 01:54
@mroeschke mroeschke added the IO HDF5 read_hdf, HDFStore label Jan 6, 2025
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks for getting this started. I guess this should be added as an enhancement to 2.3?

vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
vlarr.append(value.to_numpy())
node = getattr(self.group, key)
if value.dtype == StringDtype("python", np.nan):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all the branches here or can you just do str(values.dtype)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if using_string_dtype() and is_string_array(values, skipna=True):
if (
using_string_dtype()
and isinstance(values, np.ndarray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming read_array can return a non-ndarray type here? Does this prevent against pyarrow arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming read_array can return a non-ndarray type here?

Correct - datetime and string currently.

@jorisvandenbossche
Copy link
Member

More general question here: do we actually need (or want) to fully roundtrip the string dtype in HDF IO?

I don't think we do that in other IO methods (except for pickle, I assume). It seems we don't support other nullable or pyarrow dtypes right now in HDF IO, so can't compare with how we roundtrip those. But for example for Parquet we right now only (will, with the latest pyarrow) roundtrip "str" vs "string", but not the exact storage.

I think in many cases it is good to "just" get the default string dtype for a file where a column has been stored "as a string column" (for example assume we already had StringDtype by default for a while but would now be changing the default storage from python to payrrow, then I think ideally many people just start getting that default storage when reading existing files. As long as the idea is that the difference should not matter in 99.9% of the cases (except for performance benefits etc, of course)).
In general we might not want to store every single detail of the dtype in the file, but only the general ("logical"?) type (while we of course have the specific NaN vs NA case for now, which might be a good exception)

@WillAyd
Copy link
Member

WillAyd commented Jan 13, 2025

That's a good question - yes generally I don't think it is important to restore the storage of the string types through I/O operations (even for Pickle). I think that even hurts the portability of pandas by potentially encoding information about the execution environment into the I/O storage

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 14, 2025

@jorisvandenbossche

(while we of course have the specific NaN vs NA case for now, which might be a good exception)

I take this to mean store sentinel values like string-na and string-nan and when reading in the data we infer the backend (python vs pyarrow) from the environment. Is that right?

What if str / string is written out but then read in with infer_string set to False?

@jorisvandenbossche
Copy link
Member

I take this to mean store sentinel values like string-na and string-nan and when reading in the data we infer the backend (python vs pyarrow) from the environment. Is that right?

Yes, but then it could also be "str" vs "string" since we already have that distinguishing string repr.

(now, I am fine with distinguishing those two, but personally I would also be fine with just storing string data the same and reading them the same depending on the option being set)

What if str / string is written out but then read in with infer_string set to False?

Good question. I should check to be sure but for parquet I think it will then still use StringDtype, as in general it will try to preserve any extension dtype. I am personally fine with doing the same here, but also OK to explicitly not restore "str" in that case but read it as object dtype (while "string" is more explicit opt-in, so that I would restore regardless of the option?)

@WillAyd
Copy link
Member

WillAyd commented Jan 14, 2025

Good question. I should check to be sure but for parquet I think it will then still use StringDtype, as in general it will try to preserve any extension dtype. I am personally fine with doing the same here, but also OK to explicitly not restore "str" in that case but read it as object dtype (while "string" is more explicit opt-in, so that I would restore regardless of the option?)

I think this proposal makes sense and works in both current and future versions of pandas. +1

@rhshadrach
Copy link
Member Author

@WillAyd @jorisvandenbossche - updated the logic here. Any thoughts on the whatsnew line?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool - this implementation looks great

doc/source/whatsnew/v2.3.0.rst Outdated Show resolved Hide resolved
@rhshadrach rhshadrach requested a review from WillAyd January 22, 2025 22:44
@jorisvandenbossche jorisvandenbossche merged commit 60325b8 into pandas-dev:main Jan 23, 2025
53 of 54 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @rhshadrach!

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Jan 23, 2025
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jan 23, 2025
@jorisvandenbossche
Copy link
Member

Backport -> #60771

@rhshadrach rhshadrach deleted the enh_pytables_str branch January 23, 2025 16:08
mroeschke pushed a commit that referenced this pull request Jan 23, 2025
…60663) (#60771)

ENH: Enable pytables to round-trip with StringDtype (#60663)

Co-authored-by: William Ayd <[email protected]>
(cherry picked from commit 60325b8)

Co-authored-by: Richard Shadrach <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants