Can you split a large vaex dataframe into smaller hdf5 files? #2156
I have a Vaex dataframe with 190 columns and 5 million rows. I read the dataframe from its original format (.csv) and then do some preprocessing on it. Next, I'd like to split the preprocessed dataframe into smaller .hdf5 files. I tried enumerating dataframe.split(), but that didn't work because dataframe.split() didn't seem to be iterable. We used this approach when reading a .csv with vaex and then splitting it into smaller .hdf5 files using export_hdf5, but somehow it doesn't work for other use cases.
Replies: 2 comments
Hey, maybe I don't understand, but isn't this what you want?

import vaex
df = vaex.example()
for i, df_part in enumerate(df.split([0.2, 0.2, 0.2, 0.2, 0.2])):
    print(i, df_part.shape)

Or do you want to export a single big dataframe (in hdf5 or otherwise) into smaller files? In that case there is df.export_many(..)

Maybe you should explain your use case a bit better, i.e. what you want to achieve. If you want to pass data to an ML model or some kind of service / process, there are probably better ways than the one above.
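For anyone who wants to write the pieces by hand rather than in one call, here is a minimal sketch of the manual route. The helper names chunk_bounds and export_in_chunks and the file-name template are hypothetical, not part of vaex. The sketch only assumes that the dataframe supports len(), row slicing (df[start:stop]) and export_hdf5, which vaex dataframes do.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row ranges that cover n_rows in pieces of chunk_size."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


def export_in_chunks(df, path_template, chunk_size):
    """Write consecutive row slices of df to separate HDF5 files.

    path_template is a hypothetical naming scheme that must contain an
    {i} placeholder for the chunk index, e.g. "part-{i:03}.hdf5".
    """
    paths = []
    for i, (start, stop) in enumerate(chunk_bounds(len(df), chunk_size)):
        path = path_template.format(i=i)
        # Slicing selects rows start..stop-1; export_hdf5 writes them to disk.
        df[start:stop].export_hdf5(path)
        paths.append(path)
    return paths
```

With a real dataframe this could be called as export_in_chunks(df, "part-{i:03}.hdf5", 1_000_000). The df.export_many(..) method mentioned above targets the same use case in a single call; check its documentation for the exact parameters.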
Hello Jovan,
df.export_many(...) is the exact function I'm looking for. My problem is
solved, thank you for your fast and helpful response!
Sincerely,
Andreas Parasian
On Sat, Aug 6, 2022 at 4:35 PM Jovan Veljanoski wrote:
Hey,
Maybe I don't understand, but isn't this what you want?
import vaex
df = vaex.example()
for i, df_part in enumerate(df.split([0.2, 0.2, 0.2, 0.2, 0.2])):
    print(i, df_part.shape)
Or do you want to export a single big dataframe (in hdf5 or otherwise)
into smaller files? In that case there is df.export_many(..)
Maybe you should explain your use case a bit better, i.e. what you want to
achieve. If you want to pass data to an ML model or some kind of service /
process, there are probably better ways than the one above.