Terminology: most rapidly varying dimension mislabelling of storage order #583

Open
pvanlaake opened this issue Jan 8, 2025 · 4 comments
Labels
defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors new contributor This issue was worked on by new contributors to the CF conventions

@pvanlaake
pvanlaake commented Jan 8, 2025

Mis-labelling of storage order in definition of "most rapidly varying dimension"

In #530 a new definition for "most rapidly varying dimension" was developed, and it has since been included in 1.12. Unfortunately, the labels for row-major and column-major ordering are reversed in that text. In fact, C-style is row-major and Fortran-style is column-major.

Moderator

Not assigned.

Requirement Summary

Update the text in the terminology section to correctly reflect the storage order.

Additionally, some minor textual changes are proposed to make the text more accurate and inclusive:

  • Change "When netCDF is represented in CDL..." to "When a netCDF dataset is represented in CDL...".
  • Change "C and Python NumPy use the same order as C, also called "column-major order", but Fortran uses the opposite convention, also called "row-major order", so that when netCDF variables are accessed in Fortran the most rapidly varying dimension is the first one." to "C and Python NumPy use the same order as CDL, called "row-major order", while R and Fortran use the alternative arrangement, called "column-major order", so that when netCDF variables are accessed in R or Fortran the most rapidly varying dimension is the first one."
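To make the corrected labels concrete, here is a minimal sketch in plain Python of how a multi-dimensional index maps to a linear storage offset under each order. The function names `row_major_offset` and `column_major_offset` are illustrative only and are not part of any netCDF API.

```python
def row_major_offset(index, shape):
    """Linear offset when the LAST dimension varies most rapidly
    (the order used by CDL, C and NumPy's default layout)."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def column_major_offset(index, shape):
    """Linear offset when the FIRST dimension varies most rapidly
    (the order used by Fortran and R)."""
    offset = 0
    for i, n in zip(reversed(index), reversed(shape)):
        offset = offset * n + i
    return offset


shape = (2, 3)  # hypothetical 2 x 3 array

# Row-major: stepping the last index moves one element in storage.
print(row_major_offset((0, 1), shape), row_major_offset((1, 0), shape))

# Column-major: stepping the first index moves one element in storage.
print(column_major_offset((1, 0), shape), column_major_offset((0, 1), shape))
```

In both cases the "most rapidly varying dimension" is the one whose index step corresponds to an offset step of 1.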

Associated pull request

PR will be made after any comments and suggestions have been processed.

@pvanlaake added the defect (Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors) label Jan 8, 2025
@ChrisBarker-NOAA
Contributor

I think we should use:

"C and Python NumPy use the same order as CDL..."

Since it begins with the definition for CDL.

(and I suspect that was the original intent, as saying C uses the same order as C wouldn't have been intentional :-)

Agree with the addition of R as an example.

What does confuse me (and maybe the docs aren't the place to clarify this): if a netCDF file has:

variable(x, y, z)

in it, and you open it in Fortran or R, do you access it as:

variable[z, y, x] ?

(and the same for writing?)

(I don't use Fortran or R, so ....)
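One way to reason about this question is a small plain-Python sketch. It assumes, and this thread does not confirm it, that a Fortran or R interface presents the dimensions in reversed order, so that `variable(x, y, z)` in CDL appears with shape `(z, y, x)`. Under that assumption, element `[i, j, k]` in row-major order and element `(k, j, i)` in column-major order sit at the same storage offset. The helper `offset` and the dimension lengths are hypothetical.

```python
from itertools import product


def offset(index, shape, order):
    """Linear storage offset of `index` in an array of `shape`.

    order="C": last dimension varies fastest (row-major).
    order="F": first dimension varies fastest (column-major).
    """
    pairs = list(zip(index, shape))
    if order == "F":
        pairs.reverse()
    result = 0
    for i, n in pairs:
        result = result * n + i
    return result


nx, ny, nz = 2, 3, 4  # hypothetical dimension lengths

# Compare every element: [i, j, k] row-major versus (k, j, i) column-major.
same_layout = all(
    offset((i, j, k), (nx, ny, nz), "C") == offset((k, j, i), (nz, ny, nx), "F")
    for i, j, k in product(range(nx), range(ny), range(nz))
)
print(same_layout)
```

If the check holds, the same bytes can serve both views without any reshuffling. Only the order in which the dimensions are listed differs.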

@pvanlaake
Author

Proposed text updated as suggested.

On confusion: this is a common thing among the best of us, but in the end it really doesn't matter. Dimensions can be stored in any order, so a reader has to examine the relevant attributes to determine how to orient the data. I am not sure about the details of the netCDF library, of which there are versions in C and Fortran: whether each writes in its native storage order or whether there is a default arrangement that both versions use. In R I use package RNetCDF, which is written and maintained by UCAR staff, for low-level access to the library, and that produces data in row-major order. That leads to fun stuff like flipped maps etc.; you may find any number of nonplussed users on StackOverflow or similar platforms.

Where it does matter is in the processing of the data. Getting a time profile for a specific location from a COARDS-compliant 3-dimensional data set is painfully slow compared to getting an area of data for a specific time, because of the contiguity of the data on file and thus the more efficient I/O. That, however, holds for both storage orders; it just applies to different dimensions.
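The contiguity argument can be sketched in plain Python by counting how many separate contiguous runs each access pattern touches. This assumes an unchunked variable `data(time, lat, lon)` stored in row-major order. The dimension sizes and the helpers `contiguous_runs` and `off` are hypothetical.

```python
def contiguous_runs(offsets):
    """Count runs of consecutive offsets, i.e. separate contiguous reads."""
    offsets = sorted(offsets)
    return 1 + sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + 1)


nt, nlat, nlon = 365, 180, 360  # hypothetical dimension lengths


def off(t, y, x):
    """Row-major offset for data(time, lat, lon); lon varies most rapidly."""
    return (t * nlat + y) * nlon + x


# Time profile at one grid point: one element per time step.
time_profile = [off(t, 10, 20) for t in range(nt)]

# Full map for one time step: every (lat, lon) pair at t = 5.
one_map = [off(5, y, x) for y in range(nlat) for x in range(nlon)]

print(contiguous_runs(time_profile))  # 365 separate reads
print(contiguous_runs(one_map))       # 1 contiguous read
```

With column-major storage the same asymmetry appears, only with the roles of the dimensions swapped.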

@ChrisBarker-NOAA
Contributor

"Where it does matter is in processing of the data. Getting a time profile for a specific location from a COARDS compliant 3-dimensional data set is painfully slow compared to getting an area of data for a specific time, due to the contiguity of the data on file and thus the more efficient I/O."

Well, yes, which is why CF recommends an order but does not require it, and why it uses "most rapidly varying" rather than first [last] dimension.

Though with modern file formats (netCDF4, zarr, ???) this ends up being more an issue of how the data are chunked, rather than the dimension order.
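As a rough illustration of the chunking point, the plain-Python sketch below counts how many chunks a hyperslab request intersects for two hypothetical chunk shapes. The function `chunks_touched`, the variable shape and the chunk shapes are all assumptions made for this example. None of them comes from a real library.

```python
def chunks_touched(start, count, chunk_shape):
    """Number of chunks intersected by a hyperslab (start, count per dimension)."""
    total = 1
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k            # index of the first chunk along this dimension
        last = (s + c - 1) // k   # index of the last chunk along this dimension
        total *= last - first + 1
    return total


# Hypothetical variable data(time=365, lat=180, lon=360).
profile = ((0, 10, 20), (365, 1, 1))   # time profile at one grid point
one_map = ((5, 0, 0), (1, 180, 360))   # full map at one time step

map_chunks = (1, 180, 360)   # one chunk per time step
time_chunks = (365, 10, 10)  # full time axis per small spatial tile

print(chunks_touched(*profile, map_chunks), chunks_touched(*one_map, map_chunks))
print(chunks_touched(*profile, time_chunks), chunks_touched(*one_map, time_chunks))
```

The dimension order is identical in both cases, yet which request is cheap depends entirely on the chunk shape.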

@JonathanGregory added the new contributor (This issue was worked on by new contributors to the CF conventions) label Jan 9, 2025
@JonathanGregory
Contributor

Thanks for opening the issue, @pvanlaake, and for spotting the mistake. I agree with the suggested change of Chris's, which you've made, and also agree with his explanation of why CF relaxed the COARDS requirement for ordering of dimensions.

Patrick should be added to the list of contributors to the convention once this issue has been concluded. I've added the new contributor label to remind us.
