I'm researching different formats for storing tabular data, and I came across this one. When I tried to test it, I got a lot of Pydantic validation errors.
from bintablefile import BinTableFile
record_format = (str, str, str, str, str) # actually, 3 out of 5 columns are categorical data
record_file = BinTableFile("/home/testuser/meta_df.btf", record_format=record_format,
                           columns=tuple(meta_df.columns), opener=open)
records = [tuple(item[key] for key in meta_df.columns) for item in meta_df.to_dict(orient='records')] # data from pandas df
record_file.extend(records)
record_file.flush()
Errors:
pydantic.error_wrappers.ValidationError: 40 validation errors for Init
record_format -> 0
subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 0
subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 0
subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 0
subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 0
subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 0
subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 0
subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 0
subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 1
subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 1
subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 1
subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 1
subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 1
subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 1
subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 1
subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 1
subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 2
subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 2
subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 2
subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 2
subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 2
subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 2
subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 2
subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 2
subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 3
subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 3
subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 3
subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 3
subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 3
subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 3
subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 3
subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 3
subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 4
subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 4
subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 4
subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 4
subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 4
subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 4
subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 4
subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 4
subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
So, it doesn't support string data at all?
@kiranzo
Yes, bintablefile doesn't support string data at all, because strings have variable length. The whole point of bintablefile is fast random access to records anywhere in the file (for example, reading the last records without having to read the whole file up front), so records need to have a fixed width, i.e. be made up only of fixed-size primitive types such as ints, floats, and booleans.
For storing records with strings, I'd recommend Apache ORC or Parquet.
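To illustrate why fixed width matters, here is a minimal sketch of the general technique using Python's standard `struct` module. It is not bintablefile's actual implementation or file layout; the record layout (`int64`, `float64`, `bool`) and the helper names are assumptions made up for this example.

```python
import io
import struct

# Hypothetical layout: one int64, one float64, one bool per record.
# "<" means little-endian with no alignment padding: 8 + 8 + 1 = 17 bytes.
RECORD = struct.Struct("<qd?")


def write_records(stream, records):
    """Append records back to back; each one occupies RECORD.size bytes."""
    for record in records:
        stream.write(RECORD.pack(*record))


def read_record(stream, index):
    """Read record `index` by seeking straight to its byte offset."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


def read_last(stream):
    """Read the last record without scanning the ones before it."""
    end = stream.seek(0, io.SEEK_END)
    return read_record(stream, end // RECORD.size - 1)


buffer = io.BytesIO()  # stands in for a file opened in binary mode
write_records(buffer, [(1, 0.5, True), (2, 1.5, False), (3, 2.5, True)])
print(read_record(buffer, 1))
print(read_last(buffer))
```

Because every record has the same size, the offset of record `i` is simply `i * RECORD.size`. With variable-length strings that arithmetic no longer works without a separate index.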
I tried ORC, and wow, it's really small on my data compared to Parquet and Feather at maximum compression. Thank you for the suggestion.
If variable length is the problem, maybe represent strings as padded byte arrays? And add a maximum-length restriction as a mandatory field parameter.
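A rough sketch of what that could look like, again using only Python's standard `struct` module and not anything bintablefile provides today. `MAX_LEN`, `pack_row`, and `unpack_row` are hypothetical names, and the 16-byte limit is an arbitrary choice for the example.

```python
import struct

MAX_LEN = 16  # hypothetical per-column limit, in bytes after UTF-8 encoding
RECORD = struct.Struct(f"<q{MAX_LEN}s")  # int64 id + fixed 16-byte string slot


def pack_row(row_id, text):
    """Encode one row; reject strings that do not fit the fixed slot."""
    raw = text.encode("utf-8")
    if len(raw) > MAX_LEN:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, limit is {MAX_LEN}")
    # struct pads short byte strings with NUL bytes up to the declared width.
    return RECORD.pack(row_id, raw)


def unpack_row(blob):
    """Decode one row; assumes values never end with a NUL byte themselves."""
    row_id, raw = RECORD.unpack(blob)
    return row_id, raw.rstrip(b"\x00").decode("utf-8")


blob = pack_row(7, "héllo")
print(len(blob), unpack_row(blob))
```

The trade-off is that every row pays for the full slot, so one long outlier value inflates the whole column, and the limit has to be counted in encoded bytes, not characters.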