fmow and Pandas 2.0.0 datetime conversion #146

Open
somussma opened this issue Apr 4, 2023 · 1 comment


somussma commented Apr 4, 2023

I'm getting an error when initializing the "fmow" dataset. The conversion of the timestamps to datetime with Pandas fails with:

ValueError: time data "2011-02-07T02:48:56.643Z" doesn't match format "%Y-%m-%dT%H:%M:%S%z", at position 92. You might want to try:
- passing format if your strings have a consistent format;
- passing format='ISO8601' if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing format='mixed', and the format will be inferred for each element individually. You might want to use dayfirst alongside this.

I noticed I was using Pandas 2.0.0 (presumably the most recent version), and when I reverted to Pandas 1.5.3 the error went away. I'm guessing datetime parsing changed in version 2, so it might be good to update WILDS so that it also works with the new version. Thanks!


j-bl commented Jan 13, 2025

Hi everyone,

After investigating the issue, I believe I've identified the cause and how it can be avoided.

It turns out that most timestamps in the FMoW dataset, which contains over 500,000 entries, have whole-second precision, as in 2013-10-05T02:27:17Z. However, fewer than 2,700 timestamps include fractional seconds, such as 2011-02-07T02:48:56.643Z (note the three additional digits after the decimal point).
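One way to check this split is to bucket the raw strings by shape before involving pandas at all. The following is only a minimal sketch: `timestamps` is a hypothetical three-element stand-in for the real timestamp column, and `classify` is a helper made up for illustration, not part of WILDS.

```python
import re
from collections import Counter

# Hypothetical sample; in practice this would be the timestamp column
# from the FMoW metadata.
timestamps = [
    "2013-10-05T02:27:17Z",
    "2011-02-07T02:48:56.643Z",
    "2015-06-01T10:00:00Z",
]

# UTC timestamps with whole seconds vs. with a fractional-second part.
WHOLE_SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
FRACTIONAL = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z")


def classify(ts: str) -> str:
    """Return a label describing the shape of a timestamp string."""
    if WHOLE_SECONDS.fullmatch(ts):
        return "whole_seconds"
    if FRACTIONAL.fullmatch(ts):
        return "fractional_seconds"
    return "other"


counts = Counter(classify(ts) for ts in timestamps)
print(counts)
```

Any string that lands in the "other" bucket would point to a third format that neither pattern covers.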

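In versions of pandas prior to 2.0.0, this discrepancy wasn't a problem because pandas.to_datetime inferred the format for each element individually. From 2.0.0 onward, however, pandas infers the format once, from the first entry, and expects all subsequent entries to match it. That explains the error above: the inferred format "%Y-%m-%dT%H:%M:%S%z" has no fractional-second field, so the first timestamp with milliseconds (position 92) fails to parse. To resolve this, we can explicitly pass the more flexible format="ISO8601", as the error message suggests.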

Here's an example:

import pandas as pd

# Two timestamps; the second one has millisecond precision.
dates = ["2013-10-05T02:27:17Z", "2011-02-07T02:48:56.643Z"]

# Individually, each of the two timestamps can be parsed.
print(pd.to_datetime(dates[0]))  # Prints: "2013-10-05 02:27:17+00:00"
print(pd.to_datetime(dates[1]))  # Prints: "2011-02-07 02:48:56.643000+00:00"

# Parsing both at once reproduces the problem on pandas >= 2.0.0.
try:
    print(pd.to_datetime(dates))
except ValueError as err:
    # Message begins: time data "2011-02-07T02:48:56.643Z" doesn't match format "%Y-%m-%dT%H:%M:%S%z", at position 1
    print(f"ValueError: {err}")

# With format="ISO8601", as proposed by the error message, pandas handles the differing precision.
print(pd.to_datetime(dates, format="ISO8601"))
# Prints: DatetimeIndex(['2013-10-05 02:27:17+00:00', '2011-02-07 02:48:56.643000+00:00'], dtype='datetime64[ns, UTC]', freq=None)
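
If the fix also needs to keep working on pandas 1.x, note that format="ISO8601" was only added in pandas 2.0.0, so older versions would not accept it. A minimal sketch of a version-tolerant wrapper follows; `parse_timestamps` is a hypothetical helper, not the actual WILDS loading code, and the check on the major version is an assumption about how one might branch.

```python
import pandas as pd


def parse_timestamps(values):
    """Parse ISO 8601 strings that may differ in sub-second precision.

    Hypothetical helper for illustration only.
    """
    major = int(pd.__version__.split(".")[0])
    if major >= 2:
        # format="ISO8601" is available from pandas 2.0.0 onward.
        return pd.to_datetime(values, format="ISO8601")
    # pandas 1.x parses each element individually, so no format is needed.
    return pd.to_datetime(values)


parsed = parse_timestamps(["2013-10-05T02:27:17Z", "2011-02-07T02:48:56.643Z"])
print(parsed)
```

Pinning pandas to 2.0.0 or later and always passing format="ISO8601" would be simpler; the branch only matters if both major versions have to be supported.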
