data curation guide with narrative, 1st draft #7

Draft: wants to merge 20 commits into base `main`
@@ -151,32 +151,27 @@ profiled using different write configurations. The replicability of writes
should also be tested by checking the hash of data files while varying parallel
configurations, write frequencies and data chunking strategies. I/O performance
can be a major bottleneck for larger parallel simulations, but there are tools
to help characterize I/O, see {cite}`Ather2024` for an overview.
to help characterize I/O, see {cite}`Ather2024` for a thorough overview.
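
The hash check described above can be sketched with the Python standard library. This is a minimal illustration rather than a prescribed workflow: the helper names `file_sha256` and `outputs_match` and the file names are hypothetical, and the small byte strings stand in for real simulation output.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(paths):
    """Return True if every file in `paths` has the same SHA-256 digest."""
    return len({file_sha256(p) for p in paths}) == 1


# Stand-ins for output files written by runs with different configurations.
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a.dat"
    run_b = Path(tmp) / "run_b.dat"
    run_c = Path(tmp) / "run_c.dat"
    run_a.write_bytes(b"\x00\x01\x02\x03")
    run_b.write_bytes(b"\x00\x01\x02\x03")
    run_c.write_bytes(b"\x00\x01\x02\x04")  # differs in the last byte
    same_ab = outputs_match([run_a, run_b])
    same_ac = outputs_match([run_a, run_c])
```

A byte-level hash is a strict test. Files that embed timestamps, or whose internal chunk layout depends on the parallel configuration, may hash differently even when the numerical data agree, so a mismatch is a prompt to compare the array contents rather than proof of a non-reproducible write.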

### File formats

In general, when running phase-field simulations, the user is limited to the
file format that the software supports. For example, if the research is using
PRISMS-PF the default data format is VTK and there is no reason to seek an
alternative. If an alternative file format is required then the researcher
could code a C++ function to write data in an alternative format to VTK such as
NetCDF.

As a general rule it is best to choose file formats that work with the tools
already in use and / or that your colleagues are using. There are other
considerations to be aware of though. Human readable formats such as CSV and
JSON are often useful for small medium data sets (such as derived quantities)
as some metadata can be embedded alongside the raw data resulting in a FAIRer
data product than standard binary formats. Some binary file formats also
support metadata and might be more useful for final data curation of a
considerations to be aware of though. Human readable formats such as CSV, JSON
or even YAML are often useful for small to medium data sets (such as derived
quantities) as some metadata can be embedded alongside the raw data resulting
in a FAIRer data product than standard binary formats. Some binary file formats
also support metadata and might be more useful for final data curation of a
phase-field study even if not used during the research process. One main
benefit of using binary data (beyond saving disk space) is the ability to
preserve full precision for floating point numbers. The longevity of file
formats should be considered as well. A particularly egregious case of ignoring
longevity would be using the Pickle file format in Python, which is both
language dependent and code dependent. It is an example of data serialization,
which is used mainly for in-process data storage for asynchronous tasks, but
not good for long term data storage.
preserve full precision for floating point numbers. See the [Working with
Data][working-with-data] section of the Python for Scientific Computing
document for a comparison of binary versus text based formats. The longevity of
file formats should be considered as well. A particularly egregious case of
ignoring longevity would be using the Pickle file format in Python, which is
both language dependent and code dependent. It is an example of data
serialization, which is used mainly for in-process data storage for
asynchronous tasks and checkpointing, but is not suitable for long-term data
storage.
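
The precision point can be illustrated with a short sketch using only the standard library. It assumes IEEE 754 double precision floats and an arbitrary fixed-width text format (`.6f`), which is an illustrative choice and not taken from any particular phase-field code.

```python
import struct

value = 0.1 + 0.2  # not exactly 0.3 in IEEE 754 double precision

# Binary: eight bytes store the double exactly.
packed = struct.pack("<d", value)
from_binary = struct.unpack("<d", packed)[0]

# Text with a short fixed format: digits past the sixth decimal are dropped.
from_short_text = float(f"{value:.6f}")

# Text with 17 significant digits: enough to round-trip any double.
from_long_text = float(f"{value:.17g}")
```

Text formats only lose precision when values are written with too few digits. Writing 17 significant digits preserves a double exactly, at the cost of more bytes per value than the eight used by the binary representation.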

There are many binary formats used for storing field data based on an Eulerian
mesh or grid. Common formats for field data are NetCDF, VTK, XDMF and
@@ -186,28 +181,24 @@ different native file formats based on both XML and HDF5 (both non-binary and
binary). The VTK library works well with FE simulations supporting many
different element types as well as parallel data storage for domain
decomposition. See the [XML file formats documentation][vtk-xml] for VTK for
an overview of zoo of different file extensions and their meaning. In contrast
to VTK, NetCDF is more geared towards gridded data having arisen from
atmospheric research, which uses more FD and FV than FE. For a comparison of
performance and metrics for different file types see the
[Python MeshIO tools's README.md][meshio].

The Python MeshIO tool is a good place to start for IO when writing custom
phase-field codes in Python (or Julia using `pyimport`). MeshIO is also a good
place to start for exploring, debugging or picking apart file data in an
interactive Python environment, which can be harder to do with dedicated
viewing tools like Paraview. The scientific Python ecosystem is very rich with
tools for data manipulation and storage such as Pandas, which supports storage
in many different formats, and xarray for higher dimensional data. xarray
supports NetCDF file storage, which includes coordinate systems and metadata in
an overview of the many different file extensions and their meanings. In
contrast to VTK, NetCDF is more geared towards gridded data having arisen from
atmospheric research (using finite difference grids rather than finite element
meshes). For a comparison of performance and metrics for different file types
see the [MeshIO README.md][meshio].

The MeshIO tool {cite}`Schlomer` is a good place to start for IO when writing
custom phase-field codes in Python (or Julia using `pyimport`). MeshIO is also
a good place to start for exploring, debugging or picking apart file data in an
interactive Python environment. Debugging data can be much more difficult with
GUI-style data viewers such as ParaView. The scientific Python ecosystem is
very rich with tools for data manipulation and storage such as Pandas, which
supports table data storage in many different formats, and xarray
{cite}`Hoyer2017` for higher dimensional data storage. [xarray supports NetCDF
file storage][xarray-io], which includes coordinate systems and metadata in
HDF5. Both Pandas and xarray can be used in a parallel or a distributed manner
in conjucntion with Dask. Dask along with xarray supports writing to the Zarr
data format. Zarr allows data to be stored on disk during analysis to avoid
loading the entire data object into memory.

- https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/
- https://docs.vtk.org/en/latest/index.html
- https://docs.xarray.dev/en/stable/user-guide/io.html=
in conjunction with Dask. Dask along with xarray supports writing to the Zarr
data format, which supports out-of-core operations on data that does not fit
in memory.
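
As a rough sketch of storing field data with coordinates and metadata, the following writes a small gridded field to NetCDF with xarray and reads it back. The grid size, variable name `phi`, attribute values and file name are all hypothetical, and the example assumes that xarray and NumPy are installed along with one NetCDF backend (such as netCDF4, h5netcdf or SciPy).

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Hypothetical 2D order-parameter field on a uniform grid.
nx, ny = 8, 4
x = np.linspace(0.0, 1.0, nx)
y = np.linspace(0.0, 0.5, ny)
phi = np.random.default_rng(0).random((nx, ny))

field = xr.DataArray(
    phi,
    dims=("x", "y"),
    coords={"x": x, "y": y},
    name="phi",
    attrs={"long_name": "order parameter", "units": "1"},
)
ds = field.to_dataset()
ds.attrs["title"] = "example phase-field snapshot"  # illustrative metadata

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snapshot.nc")
    ds.to_netcdf(path)
    with xr.open_dataset(path) as opened:
        loaded = opened.load()  # read everything before the file is deleted
```

The coordinates and attributes travel with the array, so `loaded["phi"].attrs` and `loaded.attrs` return the metadata without a separate sidecar file.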

(label-restarts)=
### Recovering from crashes and restarts
@@ -456,5 +447,8 @@ Dockstore and Workflowhub https://arxiv.org/pdf/2410.03490
[schemaorg]: https://github.com/openschemas/schemaorg
[structured data schema]: https://en.wikipedia.org/wiki/Data_model
[link1]: https://workflows.community/groups/fair/best-practices/
[mehio]: https://github.com/nschloe/meshio?tab=readme-ov-file#performance-comparison
[meshio]: https://github.com/nschloe/meshio?tab=readme-ov-file#performance-comparison
[vtk-xml]: https://docs.vtk.org/en/latest/design_documents/VTKFileFormats.html#xml-file-formats
[working-with-data]: https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#binary-file-formats
[xarray-io]: https://docs.xarray.dev/en/stable/user-guide/io.html

19 changes: 19 additions & 0 deletions pf-recommended-practices/references.bib
@@ -375,3 +375,22 @@ @misc{Ather2024
primaryClass={cs.DC},
url={https://arxiv.org/abs/2501.00203},
}

@misc{Schlomer,
author={Schlömer, Nico},
doi={10.5281/zenodo.1173115},
license = {MIT},
title={{meshio: Tools for mesh files}},
url={https://github.com/nschloe/meshio}
}

@article{Hoyer2017,
author={Hoyer, Stephan and Hamman, Joseph},
doi={10.5334/jors.148},
journal={Journal of Open Research Software},
month=apr,
number={1},
title={{xarray: N-D labeled Arrays and Datasets in Python}},
volume={5},
year={2017}
}