ENH: Enable pytables to round-trip with StringDtype #60663

rhshadrach · 2025-01-05T20:56:47Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Replaces #60625

Writing out with object dtype still infers str when infer_string=True.

rhshadrach · 2025-01-05T21:02:02Z

pandas/io/pytables.py

+                if dtype == "str[python]":
+                    dtype = StringDtype("python", np.nan)
+                elif dtype == "string[python]":
+                    dtype = StringDtype("python", NA)
+                elif dtype == "str[pyarrow]":
+                    dtype = StringDtype("pyarrow", np.nan)
+                else:
+                    assert dtype == "string[pyarrow]"
+                    dtype = StringDtype("pyarrow", NA)


Not sure if there is a better approach here.

StringDtype.construct_from_string?

This wouldn't allow round-tripping if you e.g. write out a Python-backed string with NaN-semantics, and read it in an environment with PyArrow installed.

WillAyd

Cool thanks for getting this started. I guess this should be added as an enhancement to 2.3?

WillAyd · 2025-01-07T19:14:12Z

pandas/io/pytables.py

+            vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
+            vlarr.append(value.to_numpy())
+            node = getattr(self.group, key)
+            if value.dtype == StringDtype("python", np.nan):


Do we need all the branches here or can you just do str(values.dtype)?

Same as #60663 (comment)

WillAyd · 2025-01-07T19:15:09Z

pandas/io/pytables.py

-        if using_string_dtype() and is_string_array(values, skipna=True):
+        if (
+            using_string_dtype()
+            and isinstance(values, np.ndarray)


I'm assuming read_array can return a non-ndarray type here? Does this prevent against pyarrow arrays?

I'm assuming read_array can return a non-ndarray type here?

Correct - datetime and string currently.

jorisvandenbossche · 2025-01-13T09:45:54Z

More general question here: do we actually need (or want) to fully roundtrip the string dtype in HDF IO?

I don't think we do that in other IO methods (except for pickle, I assume). It seems we don't support other nullable or pyarrow dtypes right now in HDF IO, so can't compare with how we roundtrip those. But for example for Parquet we right now only (will, with the latest pyarrow) roundtrip "str" vs "string", but not the exact storage.

I think in many cases it is good to "just" get the default string dtype for a file where a column has been stored "as a string column" (for example assume we already had StringDtype by default for a while but would now be changing the default storage from python to payrrow, then I think ideally many people just start getting that default storage when reading existing files. As long as the idea is that the difference should not matter in 99.9% of the cases (except for performance benefits etc, of course)).
In general we might not want to store every single detail of the dtype in the file, but only the general ("logical"?) type (while we of course have the specific NaN vs NA case for now, which might be a good exception)

WillAyd · 2025-01-13T15:36:05Z

That's a good question - yes generally I don't think it is important to restore the storage of the string types through I/O operations (even for Pickle). I think that even hurts the portability of pandas by potentially encoding information about the execution environment into the I/O storage

rhshadrach · 2025-01-14T02:45:03Z

@jorisvandenbossche

(while we of course have the specific NaN vs NA case for now, which might be a good exception)

I take this to mean store sentinel values like string-na and string-nan and when reading in the data we infer the backend (python vs pyarrow) from the environment. Is that right?

What if str / string is written out but then read in with infer_string set to False?

jorisvandenbossche · 2025-01-14T07:45:41Z

I take this to mean store sentinel values like string-na and string-nan and when reading in the data we infer the backend (python vs pyarrow) from the environment. Is that right?

Yes, but then it could also be "str" vs "string" since we already have that distinguishing string repr.

(now, I am fine with distinguishing those two, but personally I would also be fine with just storing string data the same and reading them the same depending on the option being set)

What if str / string is written out but then read in with infer_string set to False?

Good question. I should check to be sure but for parquet I think it will then still use StringDtype, as in general it will try to preserve any extension dtype. I am personally fine with doing the same here, but also OK to explicitly not restore "str" in that case but read it as object dtype (while "string" is more explicit opt-in, so that I would restore regardless of the option?)

WillAyd · 2025-01-14T14:37:56Z

Good question. I should check to be sure but for parquet I think it will then still use StringDtype, as in general it will try to preserve any extension dtype. I am personally fine with doing the same here, but also OK to explicitly not restore "str" in that case but read it as object dtype (while "string" is more explicit opt-in, so that I would restore regardless of the option?)

I think this proposal makes sense and works in both current and future versions of pandas. +1

rhshadrach · 2025-01-22T01:52:11Z

@WillAyd @jorisvandenbossche - updated the logic here. Any thoughts on the whatsnew line?

WillAyd

Very cool - this implementation looks great

doc/source/whatsnew/v2.3.0.rst

Co-authored-by: William Ayd <[email protected]>

jorisvandenbossche · 2025-01-23T07:38:39Z

Thanks @rhshadrach!

Co-authored-by: William Ayd <[email protected]> (cherry picked from commit 60325b8)

jorisvandenbossche · 2025-01-23T07:45:28Z

Backport -> #60771

…60663) (#60771) ENH: Enable pytables to round-trip with StringDtype (#60663) Co-authored-by: William Ayd <[email protected]> (cherry picked from commit 60325b8) Co-authored-by: Richard Shadrach <[email protected]>

ENH: Enable pytables to round-trip with StringDtype

91c8699

rhshadrach mentioned this pull request Jan 5, 2025

TST(string dtype): Resolve HDF5 xfails in test_put.py #60625

Closed

5 tasks

Cleanups

621ea78

rhshadrach commented Jan 5, 2025

View reviewed changes

Remove type-ignore

11f8c0d

rhshadrach marked this pull request as ready for review January 6, 2025 01:54

rhshadrach requested review from WillAyd and jorisvandenbossche January 6, 2025 01:54

mroeschke added the IO HDF5 read_hdf, HDFStore label Jan 6, 2025

WillAyd reviewed Jan 7, 2025

View reviewed changes

Add whatsnew note

52dbb5e

rhshadrach added 2 commits January 21, 2025 20:41

Merge main

7265e5c

Rework roundtripping logic

4c20a6a

WillAyd reviewed Jan 22, 2025

View reviewed changes

doc/source/whatsnew/v2.3.0.rst Outdated Show resolved Hide resolved

jorisvandenbossche approved these changes Jan 22, 2025

View reviewed changes

Update doc/source/whatsnew/v2.3.0.rst

275c3b8

Co-authored-by: William Ayd <[email protected]>

rhshadrach requested a review from WillAyd January 22, 2025 22:44

jorisvandenbossche merged commit 60325b8 into pandas-dev:main Jan 23, 2025
53 of 54 checks passed

jorisvandenbossche added this to the 2.3 milestone Jan 23, 2025

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jan 23, 2025

ENH: Enable pytables to round-trip with StringDtype (pandas-dev#60663)

358577d

Co-authored-by: William Ayd <[email protected]> (cherry picked from commit 60325b8)

jorisvandenbossche mentioned this pull request Jan 23, 2025

[backport 2.3.x] ENH: Enable pytables to round-trip with StringDtype (#60663) #60771

Merged

rhshadrach deleted the enh_pytables_str branch January 23, 2025 16:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Enable pytables to round-trip with StringDtype #60663

ENH: Enable pytables to round-trip with StringDtype #60663

rhshadrach commented Jan 5, 2025 •

edited

Loading

rhshadrach Jan 5, 2025

mroeschke Jan 6, 2025

rhshadrach Jan 7, 2025

WillAyd left a comment

WillAyd Jan 7, 2025

rhshadrach Jan 7, 2025

WillAyd Jan 7, 2025

rhshadrach Jan 7, 2025

jorisvandenbossche commented Jan 13, 2025

WillAyd commented Jan 13, 2025

rhshadrach commented Jan 14, 2025 •

edited

Loading

jorisvandenbossche commented Jan 14, 2025

WillAyd commented Jan 14, 2025

rhshadrach commented Jan 22, 2025

WillAyd left a comment

jorisvandenbossche commented Jan 23, 2025

jorisvandenbossche commented Jan 23, 2025

ENH: Enable pytables to round-trip with StringDtype #60663

ENH: Enable pytables to round-trip with StringDtype #60663

Conversation

rhshadrach commented Jan 5, 2025 • edited Loading

rhshadrach Jan 5, 2025

Choose a reason for hiding this comment

mroeschke Jan 6, 2025

Choose a reason for hiding this comment

rhshadrach Jan 7, 2025

Choose a reason for hiding this comment

WillAyd left a comment

Choose a reason for hiding this comment

WillAyd Jan 7, 2025

Choose a reason for hiding this comment

rhshadrach Jan 7, 2025

Choose a reason for hiding this comment

WillAyd Jan 7, 2025

Choose a reason for hiding this comment

rhshadrach Jan 7, 2025

Choose a reason for hiding this comment

jorisvandenbossche commented Jan 13, 2025

WillAyd commented Jan 13, 2025

rhshadrach commented Jan 14, 2025 • edited Loading

jorisvandenbossche commented Jan 14, 2025

WillAyd commented Jan 14, 2025

rhshadrach commented Jan 22, 2025

WillAyd left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Jan 23, 2025

jorisvandenbossche commented Jan 23, 2025

rhshadrach commented Jan 5, 2025 •

edited

Loading

rhshadrach commented Jan 14, 2025 •

edited

Loading