fmow and Pandas 2.0.0 datetime conversion #146

somussma · 2023-04-04T02:40:08Z

I'm getting an error when initializing the "fmow" dataset. The conversion of the timestamps to datetime with Pandas fails with:

ValueError: time data "2011-02-07T02:48:56.643Z" doesn't match format "%Y-%m-%dT%H:%M:%S%z", at position 92. You might want to try:
- passing format if your strings have a consistent format;
- passing format='ISO8601' if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing format='mixed', and the format will be inferred for each element individually. You might want to use dayfirst alongside this.

I noticed I was using Pandas 2.0.0 (presumably the most recent version) and when I reverted to Pandas 1.5.3, the issue seemed to go away. I'm guessing the datetime formatting was changed in version 2 and it might be good to update WILDS to still work with the new version. Thanks!


j-bl · 2025-01-13T09:57:24Z

Hi everyone,

After investigating the issue, I believe I've identified the cause and how it can be avoided.

It turns out that most timestamps in the FMoW dataset, which contains over 500,000 elements, follow the format 2013-10-05T02:27:17Z. However, fewer than 2,700 timestamps include higher precision, such as 2011-02-07T02:48:56.643Z (note the three additional digits after the decimal point).
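To check how many entries carry fractional seconds, something along these lines should work. This is only a sketch: the `timestamps` series below is a small made-up sample standing in for the real metadata column, and the regular expression assumes every timestamp ends in `Z`.

```python
import pandas as pd

# Hypothetical sample standing in for the FMoW timestamp column.
timestamps = pd.Series([
    "2013-10-05T02:27:17Z",
    "2011-02-07T02:48:56.643Z",
    "2016-01-19T10:12:03Z",
])

# A literal "." followed by digits right before the trailing "Z"
# marks a timestamp with fractional seconds.
has_fraction = timestamps.str.contains(r"\.\d+Z$", regex=True)

print(has_fraction.sum())                 # number of higher-precision entries
print(timestamps[has_fraction].tolist())  # the entries themselves
```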

In versions of pandas prior to 2.0.0, this discrepancy wasn't a problem because pandas.to_datetime parsed each element individually. From 2.0.0 onward, however, pandas infers the format from the first element and expects all subsequent entries to adhere to it. To resolve this, we can explicitly pass format="ISO8601", which accepts any ISO 8601 string regardless of its precision.

Here's an example:

import pandas as pd

# Two dates, one of them with additional precision.
dates = ["2013-10-05T02:27:17Z", "2011-02-07T02:48:56.643Z"]

# Individually, each of the two dates can be loaded.
print(pd.to_datetime(dates[0])) # Prints: "2013-10-05 02:27:17+00:00"
print(pd.to_datetime(dates[1])) # Prints: "2011-02-07 02:48:56.643000+00:00"

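# Loading both elements at once causes the problem specified above (pandas >= 2.0.0).
# The error is caught here so that the rest of the example still runs.
try:
    print(pd.to_datetime(dates))
except ValueError as err:
    print(err) # Message starts with: time data "2011-02-07T02:48:56.643Z" doesn't match format "%Y-%m-%dT%H:%M:%S%z", at position 1.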

# If we specify the format "ISO8601" as proposed by the error message, pandas is able to handle the deviation in precision.
print(pd.to_datetime(dates, format="ISO8601")) # Prints: DatetimeIndex(['2013-10-05 02:27:17+00:00', '2011-02-07 02:48:56.643000+00:00'], dtype='datetime64[ns, UTC]', freq=None)
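If the code also has to keep working on pandas 1.x, where format="ISO8601" is not available, a small wrapper could pick the call based on the installed version. This is a rough sketch, not the actual WILDS code: the helper name `parse_iso_timestamps`, the `metadata` frame and its `timestamp` column are all made up for illustration.

```python
import pandas as pd


def parse_iso_timestamps(values):
    """Parse ISO 8601 strings whose sub-second precision may vary.

    Hypothetical helper, not part of pandas or WILDS.
    """
    major = int(pd.__version__.split(".")[0])
    if major >= 2:
        # pandas >= 2.0.0 accepts the special format string "ISO8601".
        return pd.to_datetime(values, format="ISO8601", utc=True)
    # pandas 1.x parses each element individually without a format.
    return pd.to_datetime(values, utc=True)


# Assumed layout: a metadata table with a string "timestamp" column.
metadata = pd.DataFrame(
    {"timestamp": ["2013-10-05T02:27:17Z", "2011-02-07T02:48:56.643Z"]}
)
metadata["timestamp"] = parse_iso_timestamps(metadata["timestamp"])

print(metadata["timestamp"].dt.year.tolist())
```

Passing utc=True keeps the result timezone-aware on both branches, so code that reads the year or compares timestamps afterwards sees the same type either way.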
