Pydantic errors #6

Open
kiranzo opened this issue Aug 26, 2024 · 2 comments

kiranzo commented Aug 26, 2024

I'm researching different formats for storing tabular data and came across this one. When I tried to test it, I got a lot of Pydantic validation errors.

from bintablefile import BinTableFile
record_format = (str, str, str, str, str)  # actually, 3 out of 5 columns are categorical data
record_file = BinTableFile("/home/testuser/meta_df.btf", record_format=record_format,
                           columns=tuple(meta_df.columns), opener=open)
# meta_df is a pandas DataFrame; convert its rows to a list of tuples
records = [tuple(item[key] for key in meta_df.columns) for item in meta_df.to_dict(orient='records')]
record_file.extend(records)
record_file.flush()

Errors:

pydantic.error_wrappers.ValidationError: 40 validation errors for Init
record_format -> 0
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 0
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 0
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 0
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 0
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 0
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 0
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 0
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 1
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 1
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 1
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 1
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 1
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 1
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 1
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 1
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 2
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 2
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 2
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 2
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 2
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 2
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 2
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 2
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 3
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 3
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 3
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 3
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 3
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 3
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 3
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 3
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 4
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 4
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 4
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 4
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 4
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 4
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 4
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 4
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)

So, it doesn't support string data at all?

asuiu (Member) commented Aug 27, 2024

@kiranzo
Yes, bintablefile doesn't support string data at all, because strings have variable length. The whole point of bintablefile is fast random access to records anywhere in the file (for example, reading the last records without having to read the whole file up front), so records need a fixed width, i.e. they can only be made of fixed-size primitive types such as ints, floats, and booleans.
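
To illustrate why fixed-width records make random access cheap, here is a minimal sketch using only Python's standard `struct` module. It is not bintablefile's actual implementation: the record layout (`<qd?`) and the helper names `write_records` and `read_record` are assumptions made up for this example.

```python
import io
import struct

# Hypothetical layout: int64, float64, bool -> always 17 bytes per record.
RECORD = struct.Struct("<qd?")


def write_records(f, records):
    """Append fixed-width records to a binary file object."""
    for rec in records:
        f.write(RECORD.pack(*rec))


def read_record(f, index):
    """Read one record by index; negative indexes count from the end."""
    size = f.seek(0, io.SEEK_END)
    count = size // RECORD.size
    if index < 0:
        index += count
    if not 0 <= index < count:
        raise IndexError("record index out of range")
    # The byte offset follows directly from the index, so no scan is needed.
    f.seek(index * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))


buf = io.BytesIO()
write_records(buf, [(1, 0.5, True), (2, 1.5, False), (3, 2.5, True)])
last = read_record(buf, -1)  # reads only the final 17 bytes
```

With a variable-length string column, the offset of record `index` could no longer be computed as `index * RECORD.size`, which is the property the format relies on.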

For storing records with strings, I'd recommend Apache ORC or Parquet.

kiranzo (Author) commented Aug 28, 2024

I tried ORC, and wow, the file is really small on my data compared to maximum-compression Parquet and Feather. Thank you for the suggestion.
If variable length is the problem, maybe strings could be represented as padded byte arrays, with a maximum length as a mandatory field parameter?
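
The padded-byte-array idea could look roughly like the sketch below. It uses only the standard `struct` module and is not an existing bintablefile feature: `MAX_LEN`, `pack_row`, and `unpack_row` are hypothetical names chosen for the example.

```python
import struct

MAX_LEN = 8  # hypothetical mandatory per-field maximum, in bytes

# int64 key followed by a fixed 8-byte string slot -> 16 bytes per record.
RECORD = struct.Struct(f"<q{MAX_LEN}s")


def pack_row(key, text):
    """Encode a (key, text) row into a fixed-width record."""
    raw = text.encode("utf-8")
    if len(raw) > MAX_LEN:
        raise ValueError(f"{text!r} exceeds {MAX_LEN} bytes when encoded")
    # struct pads values shorter than the slot with NUL bytes.
    return RECORD.pack(key, raw)


def unpack_row(blob):
    """Decode a fixed-width record back into (key, text)."""
    key, raw = RECORD.unpack(blob)
    # Strip the NUL padding; assumes values never end in NUL themselves.
    return key, raw.rstrip(b"\x00").decode("utf-8")


blob = pack_row(7, "cat")
row = unpack_row(blob)
```

The trade-off is that every record pays for the full `MAX_LEN` bytes, so this fits short, bounded values (such as category labels) better than free text.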
