Slice question #66
One quick question that has profound performance implications: is your data compressed, or chunked for some other reason? If so, what are the chunk dimensions?
This data is in an HDF5 file; I just read it directly with the library.
There's nothing that suggests that this data is compressed or chunked in any way. Two important notes: this is just an example file, but all the files follow the same format, and I have no control over these data; I just have to process them. Thanks.
Can you also paste `dataSet.metadata`?
Sure thing, this is the metadata for the main group and the dataset.
Ah, sorry - I meant the HDF5 metadata, not the attributes you pasted in your snippet above. E.g., in a test file I made:

```
> f.get('uncompressed').shape
[ 300000, 4492 ]
> f.get('uncompressed').metadata
{
  signed: true,
  type: 0,
  cset: -1,
  vlen: false,
  littleEndian: true,
  size: 4,
  shape: [ 300000, 4492 ],
  maxshape: [ 300000, 4492 ],
  chunks: null,
  total_size: 1347600000
}
```
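Here `chunks: null` means this test dataset is stored contiguously; a chunked dataset reports its chunk dimensions in that same field, so the check can be automated. A minimal sketch, assuming `f` is an open file handle and `'data'` is a hypothetical dataset path:

```js
// Hedged sketch: decide a slice strategy from the dataset's chunk layout.
// 'data' is a placeholder path; the metadata fields match the output above.
const ds = f.get('data');
const { shape, chunks } = ds.metadata;

if (chunks === null) {
  // Contiguous storage: any slice shape reads sequentially from disk.
  console.log('dataset is contiguous, shape', shape);
} else {
  // Chunked storage: align slices with the chunk grid, e.g. read
  // chunks[1] columns at a time for a 2-D dataset.
  console.log('dataset is chunked, chunk dims', chunks);
}
```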
Alright.
Ok - my first piece of advice is to retrieve data in slices that align with the chunk size as much as possible. Chunked data is stored non-contiguously on disk, so chunks are retrieved individually as needed. For the dataset you posted above, this would mean taking slices of 71 columns at a time. It will take about the same amount of time to get a slice of a single column as it does to get all 71 columns of a chunk, since the whole chunk has to be read either way.

Second, I looked through the documentation for the HDF5 library and there is no simple way to retrieve data from an HDF5 Dataset in column-major order. Even for the Fortran HDF5 library, it looks like the data is reordered after it is retrieved. So unfortunately my advice is to continue doing what you're doing, mostly. Depending on what your processing looks like, you might be able to speed it up by hardcoding an indexing function for walking through the data, e.g.

```js
// Row-major index of (row, col) in a slice that is num_cols wide.
function get_index(row, col, num_cols = 71) {
  return row * num_cols + col;
}

const total_num_rows = 300000;
const total_num_cols = 4492;
const col_chunk_size = 71; // matches the dataset's chunk width

for (let start_col = 0; start_col < total_num_cols; start_col += col_chunk_size) {
  const end_col = Math.min(total_num_cols, start_col + col_chunk_size);
  // One chunk-aligned read: all rows, up to col_chunk_size columns.
  const slice_data = f.get('data').slice([[], [start_col, end_col]]);
  const num_cols = end_col - start_col; // the last block may be narrower
  for (let col = 0; col < num_cols; col++) {
    for (let row = 0; row < total_num_rows; row++) {
      process_data(slice_data[get_index(row, col, num_cols)]);
    }
  }
}
```

I inferred from what you wrote above that you want to process an entire column at a time... if you don't need to do that, in general it's still true that it will be faster to retrieve blocks of data that align with chunk boundaries.
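Rather than hardcoding the chunk width, the same loop can read the column block size straight from the dataset's metadata. A minimal sketch, assuming the same `f` handle, a hypothetical dataset path `'data'`, and the `metadata.chunks` field shown earlier in this thread:

```js
// Hedged sketch: derive the column block size from the chunk layout
// instead of hardcoding 71. Falls back to a fixed width for
// contiguous (unchunked) datasets, where any block size works.
const ds = f.get('data');
const [totalRows, totalCols] = ds.metadata.shape;
const chunks = ds.metadata.chunks;           // e.g. [someRows, 71], or null
const colBlock = chunks ? chunks[1] : 1024;  // 1024 is an arbitrary fallback

for (let start = 0; start < totalCols; start += colBlock) {
  const end = Math.min(totalCols, start + colBlock);
  const block = ds.slice([[], [start, end]]); // chunk-aligned read
  const width = end - start;
  for (let col = 0; col < width; col++) {
    for (let row = 0; row < totalRows; row++) {
      process_data(block[row * width + col]);
    }
  }
}
```

Reading `chunks[1]` columns per slice keeps every read aligned to whole chunks, so no chunk on disk has to be fetched twice.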
I will give it a try. Thank you so much for the chunk explanation.
Hi, thanks in advance for the help with this library. I've got a quick question about using the `slice` function on a `Dataset`. Is there a way to slice a `Dataset` by columns? I'm asking because when you slice multiple columns together, it gives you a `TypedArray` where the rows come first and then the columns.

The issue is, I'm working with a massive `Dataset` and need to process it by column. Slicing it column by column is just too time-consuming (it's 4492 columns and 300k rows), so I've been slicing the `Dataset` into several arrays and then converting the data to a column format in memory. This approach isn't very efficient, and the logic for calculating the position of each element in a given column is pretty complex.

Any advice or suggestions would be really helpful. Thank you.
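For illustration, the conversion described above might look something like the following. This is a hypothetical sketch, not the poster's actual code: `f`, the dataset path `'data'`, the block width, and the element type are all assumptions; the row-major layout of the returned `TypedArray` is as described above.

```js
// Hedged sketch of the described workaround: slice a block of columns,
// then regroup the row-major TypedArray into one array per column.
function sliceColumns(f, startCol, endCol) {
  const ds = f.get('data');                        // hypothetical path
  const [numRows] = ds.metadata.shape;
  const width = endCol - startCol;
  const flat = ds.slice([[], [startCol, endCol]]); // row-major block

  const columns = [];
  for (let c = 0; c < width; c++) {
    const col = new Float64Array(numRows);         // element type assumed
    for (let r = 0; r < numRows; r++) {
      col[r] = flat[r * width + c]; // position of (r, c) in row-major order
    }
    columns.push(col);
  }
  return columns;
}
```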