Commit

Merge pull request #5 from genematx/consolidated-structure
Rename and Refactor Consolidated Structure
danielballan authored Dec 14, 2024
2 parents 9dc3a1f + c7ee962 commit 2548cea
Showing 26 changed files with 429 additions and 190 deletions.
3 changes: 2 additions & 1 deletion docs/source/explanations/catalog.md
Original file line number Diff line number Diff line change
@@ -54,7 +54,8 @@ and `assets`, describes the format, structure, and location of the data.
to the Adapter
- `management` --- enum indicating whether the data is registered `"external"` data
or `"writable"` data managed by Tiled
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`, ...)
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`,
etc. -- except for `consolidated`, which cannot be assigned to a Data Source)
- `structure_id` --- a foreign key to the `structures` table
- `node_id` --- foreign key to `nodes`
- `id` --- integer primary key
76 changes: 75 additions & 1 deletion docs/source/explanations/structures.md
Expand Up @@ -11,7 +11,8 @@ The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a of other structures, akin to a dictionary or a directory
* consolidated --- a container-like structure to combine tables and arrays in a common namespace
* container --- a collection of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
[pandas](https://pandas.pydata.org/)
@@ -575,3 +576,76 @@ response.
"count": 5
}
```

### Consolidated

This is a specialized container-like structure designed to link together multiple tables and arrays that store
related scientific data. It does not support nesting, but it provides a common namespace across all columns of the
contained tables along with the arrays (name collisions are therefore forbidden). This makes it possible to further
abstract away the disparate internal storage mechanisms (e.g. Parquet for tables and Zarr for arrays) and present
the user with a single homogeneous interface for data access. Consolidated structures do not support pagination and
are not recommended for "wide" datasets with more than ~1000 items (columns and arrays) in the namespace.
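The flat-namespace rule can be sketched in a few lines of Python. This is an illustrative model only, not Tiled's actual implementation: each table contributes its column names, each non-table part contributes its own name, and any duplicate key is rejected.

```python
# Illustrative sketch of the consolidated namespace rule -- not Tiled's
# actual implementation. Tables contribute their column names; arrays
# (and other non-table parts) contribute their own names. Name
# collisions are forbidden.
def build_all_keys(parts):
    all_keys = []
    for part in parts:
        if part["structure_family"] == "table":
            keys = part["structure"]["columns"]
        else:
            keys = [part["name"]]
        for key in keys:
            if key in all_keys:
                raise ValueError(f"Name collision: {key!r}")
            all_keys.append(key)
    return all_keys

parts = [
    {"structure_family": "table", "name": "table1",
     "structure": {"columns": ["A", "B"]}},
    {"structure_family": "array", "name": "F", "structure": {}},
]
print(build_all_keys(parts))  # ['A', 'B', 'F']
```

Note that the array's *name* ("F") enters the same namespace as the tables' *columns* ("A", "B"), which is why a column and an array may not share a key.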

Below is an example of a Consolidated structure that describes two tables and two arrays of various sizes. Their
respective structures are specified in the `parts` list, and `all_keys` defines the internal namespace of directly
addressable columns and arrays.

```json
{
"parts": [
{
"structure_family": "table",
"structure": {
"arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
"npartitions": 1,
"columns": ["A", "B"],
"resizable": false
},
"name": "table1"
},
{
"structure_family": "table",
"structure": {
"arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
"npartitions": 1,
"columns": ["C", "D", "E"],
"resizable": false
},
"name": "table2"
},
{
"structure_family": "array",
"structure": {
"data_type": {
"endianness": "little",
"kind": "f",
"itemsize": 8,
"dt_units": null
},
"chunks": [[3], [5]],
"shape": [3, 5],
"dims": null,
"resizable": false
},
"name": "F"
},
{
"structure_family": "array",
"structure": {
"data_type": {
"endianness": "not_applicable",
"kind": "u",
"itemsize": 1,
"dt_units": null
},
"chunks": [[5], [7], [3]],
"shape": [5, 7, 3],
"dims": null,
"resizable": false
},
"name": "G"
}
],
"all_keys": ["A", "B", "C", "D", "E", "F", "G"]
}
```
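A quick consistency check on the array parts above: `chunks` lists the chunk sizes along each dimension, so the per-dimension sums must reproduce `shape`. A minimal sketch, with the literals copied from the example:

```python
def chunks_match_shape(structure):
    """True if chunk sizes sum to the shape extent along every dimension."""
    return all(
        sum(dim_chunks) == extent
        for dim_chunks, extent in zip(structure["chunks"], structure["shape"])
    )

# Literals copied from the array parts "F" and "G" in the example above.
print(chunks_match_shape({"chunks": [[3], [5]], "shape": [3, 5]}))          # True
print(chunks_match_shape({"chunks": [[5], [7], [3]], "shape": [5, 7, 3]}))  # True
```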
38 changes: 37 additions & 1 deletion docs/source/how-to/register.md
@@ -72,7 +72,10 @@ Sometimes it is necessary to take more manual control of this registration
process, such as if you want to take advantage of particular knowledge
about the files to specify particular `metadata` or `specs`.

Use the Python client, as in this example.
#### Registering external data

To register data from external files in Tiled, one can use the Python client to
construct a `DataSource` object explicitly, passing the list of assets, as in the following example.

```py
import numpy
@@ -112,3 +115,36 @@ client.new(
specs=[],
)
```

#### Writing a consolidated structure

Similarly, to create a consolidated container structure, one needs to specify
its constituents as separate Data Sources. For example, to consolidate a table
and an array:

```python
import numpy
import pandas

from tiled.structures.array import ArrayStructure
from tiled.structures.core import StructureFamily
from tiled.structures.data_source import DataSource
from tiled.structures.table import TableStructure

rng = numpy.random.default_rng(12345)
arr = rng.random(size=(3, 5), dtype="float64")
df = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})

node = client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
structure=TableStructure.from_pandas(df),
name="table1",
),
DataSource(
structure_family=StructureFamily.array,
structure=ArrayStructure.from_array(arr),
name="C",
)
]
)

# Write the data
node.parts["table1"].write(df)
node.parts["C"].write_block(arr, (0, 0))
```
2 changes: 2 additions & 0 deletions docs/source/reference/service.md
@@ -104,6 +104,8 @@ See {doc}`../explanations/structures` for more context.
tiled.structures.array.BuiltinDtype
tiled.structures.array.Endianness
tiled.structures.array.Kind
tiled.structures.consolidated.ConsolidatedStructure
tiled.structures.consolidated.ConsolidatedStructurePart
tiled.structures.core.Spec
tiled.structures.core.StructureFamily
tiled.structures.table.TableStructure
88 changes: 88 additions & 0 deletions tiled/_tests/test_consolidated.py
@@ -0,0 +1,88 @@
import numpy
import pandas
import pandas.testing
import pytest

from ..catalog import in_memory
from ..client import Context, from_context
from ..server.app import build_app
from ..structures.array import ArrayStructure
from ..structures.core import StructureFamily
from ..structures.data_source import DataSource
from ..structures.table import TableStructure

rng = numpy.random.default_rng(12345)

df1 = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})
df2 = pandas.DataFrame(
{
"C": ["red", "green", "blue", "white"],
"D": [10.0, 20.0, 30.0, 40.0],
"E": [0, 0, 0, 0],
}
)
arr1 = rng.random(size=(3, 5), dtype="float64")
arr2 = rng.integers(0, 255, size=(5, 7, 3), dtype="uint8")
md = {"md_key1": "md_val1", "md_key2": 2}


@pytest.fixture(scope="module")
def tree(tmp_path_factory):
return in_memory(writable_storage=tmp_path_factory.getbasetemp())


@pytest.fixture(scope="module")
def context(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
x = client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
structure=TableStructure.from_pandas(df1),
name="table1",
),
DataSource(
structure_family=StructureFamily.table,
structure=TableStructure.from_pandas(df2),
name="table2",
),
DataSource(
structure_family=StructureFamily.array,
structure=ArrayStructure.from_array(arr1),
name="F",
),
DataSource(
structure_family=StructureFamily.array,
structure=ArrayStructure.from_array(arr2),
name="G",
),
],
key="x",
metadata=md,
)
# Write by data source.
x.parts["table1"].write(df1)
x.parts["table2"].write(df2)
x.parts["F"].write_block(arr1, (0, 0))
x.parts["G"].write_block(arr2, (0, 0, 0))

yield context


def test_iterate_parts(context):
client = from_context(context)
for part in client["x"].parts:
client["x"].parts[part].read()


def test_iterate_columns(context):
client = from_context(context)
for col in client["x"]:
client["x"][col].read()
client[f"x/{col}"].read()


def test_metadata(context):
client = from_context(context)
assert client["x"].metadata == md
22 changes: 22 additions & 0 deletions tiled/_tests/test_dataframe.py
@@ -41,6 +41,17 @@
pandas.DataFrame({f"column_{i:03d}": i * numpy.ones(5) for i in range(10)}),
npartitions=1,
),
# a dataframe with mixed types
"diverse": DataFrameAdapter.from_pandas(
pandas.DataFrame(
{
"A": numpy.array([1, 2, 3], dtype="|u8"),
"B": numpy.array([1, 2, 3], dtype="<f8"),
"C": ["one", "two", "three"],
}
),
npartitions=1,
),
}
)

@@ -100,6 +111,17 @@ def test_dataframe_single_partition(context):
pandas.testing.assert_frame_equal(actual, expected)


def test_reading_diverse_dtypes(context):
client = from_context(context)
expected = tree["diverse"].read()
actual = client["diverse"].read()
pandas.testing.assert_frame_equal(actual, expected)

for col in expected.columns:
actual = client["diverse"][col].read()
assert numpy.array_equal(expected[col], actual)


def test_dask(context):
client = from_context(context, "dask")["basic"]
expected = tree["basic"].read()
24 changes: 12 additions & 12 deletions tiled/_tests/test_writing.py
@@ -676,7 +676,7 @@ def test_append_partition(
assert_frame_equal(x.read(), df3, check_dtype=False)


def test_union_one_table(tree):
def test_consolidated_one_table(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df = pandas.DataFrame({"A": [], "B": []})
@@ -686,17 +686,17 @@ def test_union_one_table(tree):
structure=structure,
name="table",
)
client.create_union([data_source], key="x")
client.create_consolidated([data_source], key="x")


def test_union_two_tables(tree):
def test_consolidated_two_tables(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df1 = pandas.DataFrame({"A": [], "B": []})
df2 = pandas.DataFrame({"C": [], "D": [], "E": []})
structure1 = TableStructure.from_pandas(df1)
structure2 = TableStructure.from_pandas(df2)
x = client.create_union(
x = client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
@@ -717,15 +717,15 @@ def test_union_two_tables(tree):
x.parts["table2"].read()


def test_union_two_tables_colliding_names(tree):
def test_consolidated_two_tables_colliding_names(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df1 = pandas.DataFrame({"A": [], "B": []})
df2 = pandas.DataFrame({"C": [], "D": [], "E": []})
structure1 = TableStructure.from_pandas(df1)
structure2 = TableStructure.from_pandas(df2)
with fail_with_status_code(422):
client.create_union(
client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
@@ -742,15 +742,15 @@ def test_union_two_tables_colliding_names(tree):
)


def test_union_two_tables_colliding_keys(tree):
def test_consolidated_two_tables_colliding_keys(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df1 = pandas.DataFrame({"A": [], "B": []})
df2 = pandas.DataFrame({"A": [], "C": [], "D": []})
structure1 = TableStructure.from_pandas(df1)
structure2 = TableStructure.from_pandas(df2)
with fail_with_status_code(422):
client.create_union(
client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
@@ -767,7 +767,7 @@ def test_union_two_tables_colliding_keys(tree):
)


def test_union_two_tables_two_arrays(tree):
def test_consolidated_two_tables_two_arrays(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df1 = pandas.DataFrame({"A": [], "B": []})
@@ -778,7 +778,7 @@ def test_union_two_tables_two_arrays(tree):
structure2 = TableStructure.from_pandas(df2)
structure3 = ArrayStructure.from_array(arr1)
structure4 = ArrayStructure.from_array(arr2)
x = client.create_union(
x = client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
@@ -820,15 +820,15 @@ def test_union_two_tables_two_arrays(tree):
x[column].read()


def test_union_table_column_array_key_collision(tree):
def test_consolidated_table_column_array_key_collision(tree):
with Context.from_app(build_app(tree)) as context:
client = from_context(context)
df = pandas.DataFrame({"A": [], "B": []})
arr = numpy.array([], dtype=numpy.float64)
structure1 = TableStructure.from_pandas(df)
structure2 = ArrayStructure.from_array(arr)
with fail_with_status_code(422):
client.create_union(
client.create_consolidated(
[
DataSource(
structure_family=StructureFamily.table,
2 changes: 1 addition & 1 deletion tiled/adapters/arrow.py
@@ -129,7 +129,7 @@ def generate_data_sources(
"""
return [
DataSource(
structure_family=self.structure_family,
structure_family=StructureFamily.table,
mimetype=mimetype,
structure=dict_or_none(self.structure()),
parameters={},