Write block tutorial #696

Open
wants to merge 8 commits into
base: main
Changes from 5 commits
43 changes: 43 additions & 0 deletions docs/source/tutorials/writing.md
@@ -66,6 +66,49 @@ Write array and tabular data.
<DataFrameClient ['x', 'y']>
```

In some scenarios, you may want to send data one chunk at a time rather than sending the entire array at once, for example when the full data is not yet available or is too large to fit in memory. There are two ways to do this:

The first is to stack the individual arrays and save the result with the `write_array` method shown above. This works when the stacked data is small enough to fit in memory.
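A minimal sketch of the stacking approach, assuming five small 2D arrays that are all available in memory. The `images` list is placeholder data, and the final `write_array` call is left as a comment because it needs a writable tiled client:

```python
import numpy

# Five small 2D arrays standing in for images produced one at a time.
images = [numpy.full((32, 32), i, dtype=numpy.int8) for i in range(5)]

# Stack along a new leading axis to get a single 3D array.
stacked = numpy.stack(images)
print(stacked.shape)  # (5, 32, 32)

# With a writable tiled client, the stacked array can then be saved in one call:
# client.write_array(stacked)
```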

The second is to pre-allocate the array in tiled and write each piece with the `write_block` method. Use this when the stacked data would not fit in memory, or when you want to save each array as soon as it is generated.

```python
# This approach will require you to know the final array dimension beforehand.

# Assume you have five 2D arrays (e.g. images), each of shape 32 by 32.
>>> stacked_array_shape = (5, 32, 32)

# Define a tiled ArrayStructure based on shape
>>> import numpy
>>> from tiled.structures.array import ArrayStructure

# Keep the dtype the same as your final results to avoid a mismatch.
>>> structure = ArrayStructure.from_array(numpy.zeros(stacked_array_shape, dtype=numpy.int8))
>>> structure
ArrayStructure(data_type=BuiltinDtype(endianness='not_applicable', kind=<Kind.integer: 'i'>, itemsize=1), chunks=((5,), (32,), (32,)), shape=(5, 32, 32), dims=None, resizable=False)

# Redefine the chunks so that each 2D array can be written as its own block.
>>> structure.chunks = ((1,) * stacked_array_shape[0], (stacked_array_shape[1],), (stacked_array_shape[2],))

# The first axis is now divided into five chunks of length 1.
>>> structure
ArrayStructure(data_type=BuiltinDtype(endianness='not_applicable', kind=<Kind.integer: 'i'>, itemsize=1), chunks=((1, 1, 1, 1, 1), (32,), (32,)), shape=(5, 32, 32), dims=None, resizable=False)

# Allocate a new array client in tiled
>>> array_client = client.new(structure_family="array", structure=structure, key ="stacked_result", metadata={"color": "yellow", "barcode": 13})
Contributor

Which version of Tiled is this suggested for? With the current most up-to-date Tiled in the main branch I get the following error.

TypeError: Container.new() got an unexpected keyword argument 'structure'

You may want to consider defining array_client the following way instead.

>>> from tiled.structures.data_source import DataSource
>>> data_source = DataSource(structure=structure, structure_family="array")
>>> array_client = client.new(structure_family="array", data_sources=[data_source], key ="stacked_result", metadata={"color": "yellow", "barcode": 13})

Also, PEP8:

Suggested change
- >>> array_client = client.new(structure_family="array", structure=structure, key ="stacked_result", metadata={"color": "yellow", "barcode": 13})
+ >>> array_client = client.new(structure_family="array", structure=structure, key="stacked_result", metadata={"color": "yellow", "barcode": 13})

Finally, for @danielballan: would it be useful to put this part in a utility function that returns the client for this particular use case (where we know the eventual array size but don't have the full array yet)? The user could then use that client to write chunks in streaming fashion without having to fiddle with the lower-level client.new() method.

Author

Thanks for the suggestion. To answer the first part: I am using tiled == 0.1.0a113 at this moment, and it worked out fine on my local end.

I will test the suggested DataSource and see if it's compatible with my pipeline. Will follow up on a new comment.

Author

Follow-up: I went through the release notes. DataSource was introduced in version 0.1.0a115, so I was not able to run the suggested code in my development environment (0.1.0a113).

To accommodate this, I will add both approaches, with notes on which tiled version each one applies to. This could serve as a temporary solution until a higher-level API becomes available.

Contributor

I think we should write the tutorial to the most recent version of tiled, given that it is still in alpha.


>>> array_client
<ArrayClient shape=(5, 32, 32) chunks=((1, 1, 1, 1, 1), (32,), (32,)) dtype=int8>

# Write a single 2D array into a specific block.
# First array: block index 0 along the first axis.
>>> first_array = numpy.random.randint(0, 100, size=(32, 32), dtype=numpy.int8)
>>> array_client.write_block(first_array, block=(0, 0, 0))

# Third array: block index 2 along the first axis.
>>> third_array = numpy.random.randint(0, 100, size=(32, 32), dtype=numpy.int8)
>>> array_client.write_block(third_array, block=(2, 0, 0))
```
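The chunk layout and block indices used above follow a simple pattern, and the allocation step could be wrapped in a helper along the lines suggested in the review. The following is a rough sketch, not part of tiled: `per_slice_chunks`, `block_index`, and `allocate_stacked_array` are hypothetical names, and the `client.new()` call assumes the `DataSource`-based form from the review comment (tiled 0.1.0a115 or later).

```python
def per_slice_chunks(shape):
    """Chunk layout with one block per index along the first axis."""
    first, *rest = shape
    return ((1,) * first, *((n,) for n in rest))


def block_index(i, ndim):
    """Block index of the i-th slice under the layout above."""
    return (i,) + (0,) * (ndim - 1)


def allocate_stacked_array(client, shape, dtype, key, metadata=None):
    """Hypothetical helper: pre-allocate an array that is written slice by slice.

    Assumes the DataSource-based client.new() call from the review comment.
    """
    # Imports are local so the two helpers above run without numpy or tiled.
    import numpy
    from tiled.structures.array import ArrayStructure
    from tiled.structures.data_source import DataSource

    structure = ArrayStructure.from_array(numpy.zeros(shape, dtype=dtype))
    structure.chunks = per_slice_chunks(shape)
    data_source = DataSource(structure=structure, structure_family="array")
    return client.new(
        structure_family="array",
        data_sources=[data_source],
        key=key,
        metadata=metadata or {},
    )


shape = (5, 32, 32)
chunks = per_slice_chunks(shape)
print(chunks)                       # ((1, 1, 1, 1, 1), (32,), (32,))
print(block_index(2, len(shape)))   # (2, 0, 0)
```

With such a helper, the streaming loop would reduce to something like `array_client.write_block(image, block=block_index(i, 3))` for each incoming image.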

Search to find the data again.

```py