Support localization of natural language attributes and variables #528
Comments
Here is the draft text we were looking at in the discussion thread.

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)
Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)
Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION TBD. Localization
Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. English alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as
Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files
A "localized file" is one that provides the global attribute
The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.
An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section. Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.

TBD.2 Localized Attributes
Localized attributes are created by appending a locale suffix to the usual attribute name. For example:

TBD.3 Localized Variables
Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e.
ADDITION TO APPENDIX A
References EDIT NOTES:
|
This looks good. Nice work!
I think it could be hard to make |
I'm struggling with this for reasons I'm having trouble articulating (and am not sure are even valid reasons), I'll try. I don't think CF should introduce attribute and variable name modifications as a concept. Section 2.5 starts out with:
With this proposal, it would be except in cases of localization. This proposal also places restrictions on variable/attribute names and their suffixes. I think the ERDDAP "elephant in the room" is important and relevant here. For ERDDAP adoption in Canada, it needs to display localized content, so this proposal must work in a real world application. I'm fearful that this will not work in ERDDAP for some reason we don't know and won't know until an implementation is worked on and the standard has already been codified. Would there be willingness on the ERDDAP side to implement something that isn't quite in CF yet? And on the CF side, would we be willing to codify whatever ends up working best in ERDDAP? Realizing that someone involved in both communities might need to actually do the coding. |
@DocOtak @turnbullerin As I told Erin, the best way to help see if this can be done is to contribute code to ERDDAP that does so and can be tested. There are ready-to-go development environments, and compiling and testing have been simplified. Chris is only part-time, but I am sure he would be happy to work with people on this. |
For an update, the ERDDAP issue now has a link to a working prototype I developed based on this proposal for the title attribute - see ERDDAP/erddap#114 |
This is heading towards resolution and I don't want to impede things if the answer is 'no', but over in discussion #341 we're talking about whether string arrays should be allowed, so I just want to ask the question: Would it be dramatically simpler to solve this problem if we used an array of strings for different localizations of free-text attributes? The current solution is effectively creating an ad hoc array of attributes by appending suffixes to the attribute name. Would it make things easier on the implementation side to have a for-real array of attribute values with, say, a localization prefix (e.g., My guess is that the answer on the ERDDAP side is probably 'no', and since that's the primary driver here, ERDDAP-viability is paramount, so if that's the case, I think things should proceed as if I had not commented at all. But I wanted to bring it up before we get locked in to a solution, just in case the alternative would make everyone's lives a lot easier. |
Interesting idea -- thanks for a potential use case! But I fully agree with you @sethmcg that for this particular issue of localization we should go with whatever moves this forward towards a conclusion most effectively and with the least effort. |
I think more progress will be made at the CF Workshop hackathon. I'd personally like to revisit/present again a container variable solution that does not need any string parsing or manipulation, other than the BCP 47 language tag which tends to have library support. I'm hopeful I'll be able to attend that workshop in person. |
@larsbarring filled out the form yesterday, hopefully it's all ok. |
All good -- thanks! /Lars |
I suggest that a container solution will be workable, and will have about the same complexity compared to the original concept for multiple attribute names with suffixes. I favor suffixes over container, for readability and transparency. By transparency I mean that the meaning of suffixes would be obvious to the casual user, with only the knowledge that the suffixes are BCP 47, and no other details from the CF document. One particular advantage of suffixes is that you get backward compatibility with no complications. If you have e.g. |
Erin, you have put a lot of thought and care into your current proposal seen above. What you have envisioned is quite workable. However I recommend simplification. We have discussed some of this before.
|
I was really hoping to participate in the hackathon for this. I had even managed to get the ERDDAP source working on my computer but couldn't do much more than build/run it. I'm not a Java person and was really hoping for help on this. While at the hackathon, I was planning (see above thread) to have a container variable possible solution implemented in ERDDAP that could be compared to the one done already. I'm motivated by two things:
I think I made some example files a year or two ago, but IIRC they were complex to show capabilities. I'll try to make some simple files soon and show them here. |
I was mistaken, so I retract this comment. This backward compatibility could be done exactly the same way with either containers or suffixes. Either way, I expect there would normally be a traditional attribute in the default locale (to be determined elsewhere). Then the "extended attributes" would consist ONLY of the non-default values. Here is a simple imagined container example.
|
Hey all, sorry for missing the hackathon, I had some significant health
issues come up.
In terms of using `.fr-CA`, I think this is not a feasible solution for two
reasons.
One, the CF conventions state "It is recommended that variable, dimension,
attribute and group names begin with a letter and be composed of letters,
digits, and underscores.". I would not want to introduce a specification
that makes it mandatory to disobey a recommended practice.
Two, the ERDDAP folks have made it pretty clear that they won't change
their own policy of requiring attributes to follow the CF recommendation
because of some valid concerns. So any pattern using non recommended
characters would make it incompatible with ERDDAP.
For these reasons, if we are going to use a single convention, it would
either require breaking compatibility with ERDDAP and changing the
current recommendation (which was extensively debated and decided against)
or have to be something that uses only letters and underscores, like
"_fr_CA" or "__fr_CA". I'm ok with this. I think it is still worth listing
them in an attribute in a way that makes it clear what the default is and
what the available options are - otherwise, to identify the languages
available would require checking every possible option which is an ever
growing list. We'd just need to agree on a pattern for doing it, which was
what opened a big debate and resulted in me suggesting we just let users
specify their own suffix so that you can break compatibility with ERDDAP
and the CF recommendations if you want, at your own risk.
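To make the underscore-only pattern concrete, here is a minimal Python sketch. It assumes the "_fr_CA" style mentioned above; the helper names `tag_to_suffix` and `localized_name` are hypothetical and not part of any proposal text, and the regular expression is only one reading of the CF recommendation quoted earlier in this comment.

```python
import re

# One reading of the CF recommendation: a name begins with a letter and
# is composed of letters, digits, and underscores.
CF_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def tag_to_suffix(tag: str) -> str:
    """Hypothetical helper: turn a BCP 47 tag such as "fr-CA" into "_fr_CA"."""
    return "_" + tag.replace("-", "_")


def localized_name(base: str, tag: str) -> str:
    """Build a suffixed name and check it against the recommended pattern."""
    name = base + tag_to_suffix(tag)
    if not CF_NAME.fullmatch(name):
        raise ValueError(f"{name!r} does not follow the recommended pattern")
    return name


print(localized_name("title", "fr-CA"))
print(CF_NAME.fullmatch("title.fr-CA") is None)
```

A name such as `long_name_fr` cannot be split back into base name and suffix without knowing the suffix in advance, which is consistent with the suggestion to list the available suffixes in an attribute.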
On the topic of containers, I don't see how it's implementable in ERDDAP
since ERDDAP only allows for global and variable attributes and ERDDAP, as
far as I know, really doesn't like "dummy" variables. Plus I think the
implementation is still too complex for it to be easy to add languages.
I'll try to keep tabs on this but I have some healing to do for the next
couple of weeks.
…On Tue, Sep 24, 2024, 8:10 PM Dave Allured ***@***.***> wrote:
One particular advantage of suffixes is that you get backward
compatibility with no complications. If you have e.g. title and title.fr,
then existing naive applications will continue to operate normally and
display title, with no awareness nor interference from alternative
attributes. With a container, I am afraid that you would need either
duplicate copies of the default string, or else some other awkward device
to tell the container to look elsewhere for the default string.
I was mistaken, so I retract this comment. This backward compatibility
could be done exactly same way with either containers or suffixes. Either
way, I expect there would normally be a traditional attribute in the
default locale (to be determined elsewhere). Then the "extended attributes"
would consist ONLY of the non-default values. Here is a simple imagined
container example.
:title = "English Title";
:title_localized = "fr-CA:Titre française; es-MX:Título en español";
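For comparison, a reader for the quoted `title_localized` example might look like the Python sketch below. The "tag:text; tag:text" layout is taken from that one imagined example only; the function name `parse_localized` and the error handling are assumptions made for illustration.

```python
def parse_localized(value: str) -> dict[str, str]:
    """Split a "tag:text; tag:text" string into a {tag: text} mapping.

    Assumes the layout of the imagined example above; nothing here is
    defined by CF.
    """
    result = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        tag, sep, text = entry.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in entry {entry!r}")
        result[tag.strip()] = text.strip()
    return result


localized = parse_localized("fr-CA:Titre française; es-MX:Título en español")
print(localized["fr-CA"])
```

The sketch splits on the first colon of each entry, so colons inside the text survive, but a semicolon inside a translated string would break it. Any packed-string format of this kind would need an escaping rule.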
|
Dear Erin Sorry for your illness; I'm sure we all hope you recover soon. In Sect 2.3 of the working draft, we quite recently added, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." Therefore Best wishes Jonathan |
Erin, I hope that your situation is improving. I just want to make a couple of comments in relation to your status update.
Ping @ChrisJohnNOAA |
The main concern I have from ERDDAP is that we allow for exporting of data in many different formats intended for use in a variety of programming languages, both of which can have limitations on what characters can be used in variable names. If this suffix is included in the data files that ERDDAP exports, we are severely limited in what characters can be included. If you need more context than that, this part of the ERDDAP discussion might be useful: ERDDAP/erddap#114 (comment) |
@ChrisJohnNOAA, okay. How about we focus on attribute names only, and skip variable names for now? By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default. Will that be okay for ERDDAP? |
Here is my "Container Variable" proposal for localizations. Define two new attributes:
Localization data is contained in a container variable, like the existing geometry container and the newly added quantization container variables. Localization containers must contain a locale attribute with a BCP 47 language tag. All other attributes on this container variable are localized versions of attributes in the referencing scope (global or variable). Why I like this:
The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it. Here is a simple example for what this looks like in a full CDL:
If the user's locale were Canadian French, the attributes from the available Canadian French localization container variable would replace those in the global attributes. BCP 47 defines some algorithm on quality weights and how to fall back to find a localized string; we must define a default behavior if no matching locale is found, which for us would be the locale that is not in a container variable. In the above example, the default locale is For ERDDAP think the combination of Other situations:
|
@DocOtak @larsbarring - Above it says: "The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it." Besides the points made by @ChrisJohnNOAA that we try to be compatible with as many platforms as possible, which puts constraints on us that we can't control, we get all sorts of security scans, many of which flag everything under the sun. These are scanners like Nessus and Qualys, and we try to make sure that to the best of our knowledge an ERDDAP release will not fail a security scan. This can also place some restrictions on what we can do (and it is why we encourage people to upgrade to the latest version - the last thing we want is for an ERDDAP running anywhere to fail a scan, which could lead to a lot of shutdowns until fixed - many of you may not remember when that happened to OPeNDAP/TDS quite a few years ago). |
@rmendels I knew ERDDAP has some additional constraints like that, but wasn't sure the details. I do know that input sanitization is "hard" and want something that the code owners of ERDDAP can be confident in. |
@DocOtak In principle I like the 10 bullet points of your "Container Variable" solution, but I am not sure how it plays out when other components of a file are also localized. Would you be able to create a small but complete CDL file? At the same time I (still) kind of like my own suggestion based on Erin's previous ideas. Here is the CDL for a small complete mockup example:
It does not tick off as many of the 10 bullet points, and I will in no way try to push for this solution. But maybe @DocOtak you could do something similar for your "Container Variable" solution to let the ERDDAP folks -- and everyone else of course -- see for themselves and create small test files using |
I was asked what the ERDDAP concern with ".fr-CA" was and I mentioned the concerns because in the past I've seen recommendations to treat Attribute and Variable names the same in CF. To be clear I think allowing '.' and '-' in variable names is a very bad idea. I don't think attribute names are generally exported from ERDDAP for non-nc file formats. Most likely using '.' and '-' in attribute names would be fine from the ERDDAP perspective, though I haven't fully audited the ERDDAP code for that. |
I'm not convinced that's a safe assumption. I mean, if you want to use that as a criterion for whether a user program is reasonable, fair enough. It's definitely what programs should do. But in terms of software that people actually use, I don't know that we can rely on that being true. We have to remember that plenty of scientific software is written by people who are scientists first and coders second, and who may not follow best practices of software engineering. I will freely admit to being one of those people, and I have written lots of code that is very cavalier about things like checking inputs... |
@larsbarring I have two concerns with the way your example encodes this information:
While the following example is long and verbose, I think that it is worth it because everything is mostly in data structures that are in some ways more "ready to go" in that no transformations need to occur. I also think that by including locale information within the netCDF file, we are explicitly making something that is not meant for humans to look at without the computer processing the locale data first. I don't ever look at all the
|
Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, as discussed in this issue. If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently released version. @ChrisJohnNOAA commented above that "allowing '.' and '-' in variable names is a very bad idea". Please add your support to conventions issue 548 if you agree with @Dave-Allured that they should not be allowed. |
Dear Andrew @DocOtak, @larsbarring, @turnbullerin et al. I agree with Andrew that using an algorithm to predict the name of an attribute would be unlike previous CF practice. Although we choose meaningful names for CF attributes, all those names are explicitly defined (in Appendix A and elsewhere). They have to be hard-coded in software, and in that sense they are treated as if they were arbitrary, like variable and dimension names are. Furthermore, although we could make a convention with suffixes for attributes work in netCDF, it might not work in other formats CF data could be converted into. Another format might have different rules about characters allowed in names, or it might not even have names at all. Therefore I prefer the container variable as demonstrated by @DocOtak, but I'd combine it with a "keyword If I have understood correctly, @DocOtak, you don't like this kind of syntax. However, quite a lot of CF attributes use it. I don't think it's difficult to parse. It's a blank separated list of words, some of which (the keywords) end in With this syntax, the example would be as below. Best wishes Jonathan
|
@JonathanGregory I strongly disagree that needing to search the attributes of referenced variables is not a general CF pattern. It is very pervasive and probably even a fundamental CF pattern with usage prominently in In your specific example, I'm not sure I like the lack of locale on the localization variables themselves. Without the context of the referencing variables, what locale should I assume they are? Would it be the global locale attribute which is "en-US" in this case? Continuing with that line of thinking, I don't think the repeated "en-US" locale on the Re the
The most recent addition of something that looks like I feel somewhat strongly that if CF wants to continue to use its own bespoke syntax for these string attributes, it needs to define the grammar of them formally in some way, e.g. using EBNF/ISO 14977. I don't think that a regex would be acceptable here either. |
Good morning, @DocOtak Regarding your comment that the need to search the attributes of referenced variables is a pervasive CF pattern, whereas I said that it isn't CF-like. I'm sorry that I didn't consider this remark more carefully! You're right that there are cases where an attribute names several variables (such as I was thinking instead of the situations where an attribute identifies variables by their purpose, such as In my version of the localization example, you have the data variable My version is no more than a rearrangement of yours. I have replaced the I don't think the container variables need a I agree with you that we don't need the Best wishes Jonathan |
Dear Andrew @DocOtak You made a good point that "attributes are key-value pairs". We use that idea for various kinds of container variable, such as grid mapping. Here's a modified version of my previous example (itself a modified version of yours), in which I use a "supercontainer" variable, instead of an attribute containing key-value pairs, to point to the localized metadata containers. The supercontainer may have any attribute name which is a legal language tag, and no other attributes. Do you prefer this? I have also assumed that the file Best wishes Jonathan
|
Moderator
TBD
Moderator Status Review
None
Requirement Summary
Metadata includes natural language text in several places, notably the `title` and `long_name` attributes, as well as potentially in character data variables. Other metadata standards, such as ISO-19115, support the translation of these variables, and translation is mandatory in some places, such as in files generated by the Canadian Government. By standardizing how these elements are specified in a fashion that is both human-readable and machine-readable, users can identify metadata in their preferred language more easily and computer applications can display metadata to match users' preferences and, where this is not possible, at least while using appropriate accessible techniques. Of key importance is also compatibility with applications such as ERDDAP, which is an application that uses NetCDF files following the CF conventions to create a web interface to select and download data. For this reason, we decided not to use the new `.fr-CA` suffix as a required format as it would not be compatible with ERDDAP - instead, data providers are free to choose suffixes that meet their use case.
Technical Proposal Summary
Based on discussions in cf-convention/discuss#244, the following proposal seemed acceptable: (1) the creation of a new global attribute that maps suffixes to BCP 47 language tags as well as specifying the default language tag in the file, (2) designating that any attribute or data variable with such a suffix is a localized version of the text in the non-suffixed attribute or variable.
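As an illustration of how an application might consume such a file, here is a minimal Python sketch. It assumes the attribute value format shown in the detailed proposal in this issue (`"default: en-US _fr: fr-CA _es: es-MX"`); the function names, the plain-dict stand-in for a file's attributes, and the exact-match lookup are all assumptions made for the example, not part of the proposal.

```python
def parse_localizations(value: str) -> dict[str, str]:
    """Parse "default: en-US _fr: fr-CA" into {"default": "en-US", "_fr": "fr-CA"}."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected 'key: tag' pairs")
    mapping = {}
    for key, tag in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"expected a key ending in ':', got {key!r}")
        mapping[key[:-1]] = tag
    return mapping


def resolve(attrs: dict[str, str], name: str, user_tag: str) -> str:
    """Return the localized value of `name`, or the non-suffixed default."""
    mapping = parse_localizations(attrs["localizations"])
    for suffix, tag in mapping.items():
        # Exact, case-insensitive tag match only; a real application would
        # more likely apply proper BCP 47 matching with fallback.
        if suffix != "default" and tag.lower() == user_tag.lower():
            localized = attrs.get(name + suffix)
            if localized is not None:
                return localized
    return attrs[name]


# Stand-in for the global attributes of a file (illustrative values only).
attrs = {
    "localizations": "default: en-US _fr: fr-CA _es: es-MX",
    "title": "English Title",
    "title_fr": "Titre français",
}
print(resolve(attrs, "title", "fr-CA"))
print(resolve(attrs, "title", "es-MX"))
```

In this sketch a locale that is declared but has no suffixed attribute falls back to the non-suffixed value, matching the intent that the default locale is used when nothing more suitable is available.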
Benefits
Data producers who are required to produce metadata or data in multiple languages, applications that offer multilingual interfaces for viewing or manipulating NetCDF data based on the CF standards, data users who wish to access metadata in the language of their choice
Status Quo
Currently, no NetCDF standard provides a way to express localized metadata. However, such standards exist in other metadata formats, such as ISO-19115.
Associated pull request
Not present yet
Detailed Proposal
The addition of a new section to the CF conventions that specifies the following:
- A global attribute `localizations`, which will be a space-separated list of paired suffixes and BCP 47 language tags (similar to how `cell_methods` is formatted), for example: `localizations = "default: en-US _fr: fr-CA _es: es-MX";`
- `default` is used instead of a suffix to indicate the default locale of the document
- When the `localizations` attribute is present, attributes and variables may not be named with a suffix unless they indicate localized versions of non-suffixed attributes or variables

In addition, the following changes would be proposed:
- `title`, `comment`, `institution`, `long_name`, `references`, `sources` and, to be discussed, `history` and `flag_meanings`)