(input): Add EDTF as an option for date representation #284

Merged: 6 commits, Jun 30, 2020

Conversation

bdarcus
Member

@bdarcus bdarcus commented Jun 28, 2020

Description

As part of #278, and to harmonize the JSON and YAML representations around a much more concise and expressive date format, this adds an option to use EDTF.

While EDTF was originally an initiative of the US Library of Congress, it seems that in 2019, ISO adopted it as part of 8601-2.

Example of a date-range with a circa qualifier in the new alternative (using the YAML syntax):

issued: 2004-02-01/2005-02-08?

TODO

  1. settle on the precise compliance levels and features we support (for documentation); see below
  2. modify the regex pattern for validation so that it reflects item 1, or at least tightens validation beyond the current pattern
  3. long-term status of date-parts; do we say deprecated now and remove later, for example, or recommend EDTF?

Compliance Details

We need to say "Level 0 is supported, and in addition the following features of levels 1 and 2 are supported (list features)."

Note: biblatex formally supports levels 0 and 1.

Level 0

  • Date (2010-03-10)
  • Time Interval (2010-03-10/2010-03-11)

Note: level 0 includes date-times, which I don't think we need.

Level 1

  • Seasons (2010-21)
  • Negative calendar year (-0500)

I'm confident in the above; the rest below are uncertainties; I think I'd prefer to not support, on input at least.

Uncertain vs Approximate Dates

  • Qualification of Date (2010-03-04? means "uncertain"; ~ indicates "approximate" and % both)

I remember this discussion during EDTF development, but never understood the practical distinction.

In our case, do we support one or the other, or both?

How?

I vote for only one, which we treat as equivalent to our "circa." But no pun intended, I'm not 100% sure.

Other Issues

I'm not sure we need this:

  • Unspecified digit(s) from the right (2010-03-XX)

We need these "Extended Interval" features on styles, but on input?

  • Open end time interval (1900/..)
  • Open start time interval (../1900)
  • Time interval with unknown end (1900/)
  • Time interval with unknown start (/1900)
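To make the four interval forms above concrete, here is a minimal sketch that labels each end of an interval. It assumes only the notation listed above (`..` for an open end, an empty side for an unknown end); the function name `classify_interval` is hypothetical and not part of the schema or of any EDTF library.

```python
def classify_interval(value):
    """Describe each end of an EDTF interval as 'open', 'unknown', or a date string."""
    start, sep, end = value.partition("/")
    if not sep:
        raise ValueError(f"not an interval: {value!r}")

    def describe(part):
        if part == "..":   # open end, e.g. 1900/..
            return "open"
        if part == "":     # unknown end, e.g. 1900/
            return "unknown"
        return part        # a concrete date, returned unparsed

    return describe(start), describe(end)
```

A processor that does not distinguish open from unknown could simply treat both labels the same way.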

Level 2

Do we need "all member" sets? ({2010-03-04,2010-03-05})

Extension

  • season ranges (2020-21/2020-22).

EDTF libraries

@bdarcus bdarcus added the 1.1 label Jun 28, 2020
@jgm

jgm commented Jun 28, 2020

Adopting EDTF sounds like a good plan to me -- much more compact, still readable, allows ranges (pandoc currently handles these by using an array of the year/month/day objects).
Also, it makes it impossible to specify, say, just a month but no year.

@bdarcus
Member Author

bdarcus commented Jun 28, 2020 via email

@bwiernik
Member

We should just adopt EDTF. It has defined levels of features, right? For citations, we need date ranges, uncertain dates, approximate dates, CE/BCE, seasons, and dates to specified precision (e.g., 194x). I think that covers it. What level of EDTF had that?

@retorquere you’ve explored this in BBT right?

@fbennett you’ve also explored accommodating Japanese imperial dates in CSLm. How does that work? Could we incorporate that and extend to Arabic calendar dates?

@bdarcus
Member Author

bdarcus commented Jun 28, 2020

I just pushed a commit that removes the date-part pattern and replaces it with an edtf key. Result would be:

issued: 
   edtf: 2004-02-01?

Note:

  1. I'm not a fan of that, and only added that key because I wasn't sure if we could get rid of the literal and raw variables; would be simpler still if we could. @fbennett - can you comment on this please?
  2. I just defined it as a string for now.

FWIW, EDTF has been around for a while, and I participated in the discussions largely to ensure it would work for our use case. It's been a few years since I have participated, but it seems a) they finished, and b) it does.

@bdarcus bdarcus force-pushed the input-yaml-json-harmonize branch from 7603cd1 to d6daada Compare June 28, 2020 18:53
@bwiernik
Member

Emiliano has done a lot of work on date parsing, so I think we can draw on that.

@bdarcus bdarcus force-pushed the input-yaml-json-harmonize branch from d6daada to 20a86fd Compare June 28, 2020 18:55
@bdarcus bdarcus changed the title (input): Convert date-part from array to object (input): Convert date-variable definition to use EDTF instead Jun 28, 2020
@bdarcus bdarcus changed the title (input): Convert date-variable definition to use EDTF instead (input): Convert date-variable definition to use EDTF Jun 28, 2020
@bdarcus
Member Author

bdarcus commented Jun 28, 2020

What level of EDTF had that?

I think we'll need base + parts of 1 and 2.

So we'd choose the "Level 0 is supported, and in addition the following features of levels 1 are supported (list features)" option under "compliance," and so itemize what features we use beyond base.

@denismaier
Member

EDTF: yes please! Would be a massive simplification. I don't see what could speak against this...

@denismaier
Member

This could also be useful for the proposed date range condition: #251

@bdarcus bdarcus force-pushed the input-yaml-json-harmonize branch from 20a86fd to cb76bd8 Compare June 28, 2020 19:17
@njbart

njbart commented Jun 28, 2020

I’m all for using the EDTF date format, but there’s one thing I am aware of that can be expressed in the current CSL JSON and CSL YAML date format, but, regrettably, has not been included in EDTF, and that’s season ranges, e.g., 2020-21/2020-22 for “Spring 2020 to Summer 2020”, something that appears not too infrequently in bibliographic data, and is recognised by style guides such as CMOS.

edtf.js does support these (“Seasons in intervals are supported at the experimental/non-standard level 3.”), and I feel CSL should declare support for EDTF dates extended by season ranges in similar terms. See also this discussion on the edtf.js issue tracker.

biblatex, BTW, supports the 2020-21/2020-22 notation, too.
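As an illustration of the season-range extension, here is a minimal sketch that reads a string such as `2020-21/2020-22`. The season codes 21 to 24 (spring, summer, autumn, winter) are the Level 1 codes from the EDTF specification; the function names are hypothetical, and this is not how edtf.js or biblatex implement the feature.

```python
import re

# Level 1 season codes as defined by EDTF / ISO 8601-2.
SEASONS = {21: "spring", 22: "summer", 23: "autumn", 24: "winter"}

# Four-digit year, optionally negative, followed by a two-digit season code.
_SEASON_RE = re.compile(r"(-?\d{4})-(2[1-4])")


def parse_season(value):
    """Return (year, season_name) for a string like '2020-21', else None."""
    match = _SEASON_RE.fullmatch(value)
    if match is None:
        return None
    return int(match.group(1)), SEASONS[int(match.group(2))]


def parse_season_range(value):
    """Parse 'YYYY-SS/YYYY-SS'; return None unless both ends are seasons."""
    start, sep, end = value.partition("/")
    if not sep:
        return None
    parsed = (parse_season(start), parse_season(end))
    return None if None in parsed else parsed
```

With this sketch, `parse_season_range("2020-21/2020-22")` yields spring 2020 to summer 2020, matching the reading given above.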

@bdarcus
Member Author

bdarcus commented Jun 28, 2020

edtf.js does support these (“Seasons in intervals are supported at the experimental/non-standard level 3.”), and I feel CSL should declare support for EDTF dates extended by season ranges in similar terms. See also this discussion on the edtf.js issue tracker.

Thanks much @njbart; very helpful!

I agree, this is what we should do.

I revised the first post to include a TODO, of which this is part of the first one.

@bdarcus bdarcus added the input label Jun 28, 2020
@retorquere

Emiliano has done a lot of work on date parsing, so I think we can draw on that.

My date parser works in stages (one of which is EDTF, done by edtf.js), but that's mostly because I try to make sense of dates that I should hope will never turn up in CSL:

  1. I regex against a number of patterns to parse textual dates in various languages
  2. I regex against a number of numerically formatted dates, taking into account the Zotero locale when there's ambiguity. These dates can have approx and seasons, but not both
  3. when none of those match, I try EDTF, "upgrading" the input to the EDTF that edtf.js supports (replacing unknown by * etc)
  4. when none of that matches, try to run it through edtfy and try edtf.js again
  5. when none of those match, try to split on some common delimiters (/, --, etc) and try step 1-4 again to create a range; one of the sides may be empty to have an open-ended range.

You'll understand this is slow as molasses, but it makes sense of a lot of stuff I've been offered in test cases over the years.
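The staged approach described above can be sketched roughly as follows. This is only an illustration of the fallback structure, not the actual parser: the two stage functions are simplified stand-ins, the EDTF and edtfy stages are omitted, and every name here is hypothetical.

```python
import re


def parse_textual(text):
    """Stand-in stage: accept 'Month YYYY' for a tiny English month list."""
    months = {"january": 1, "february": 2, "march": 3}
    match = re.fullmatch(r"([A-Za-z]+) (\d{4})", text)
    if match and match.group(1).lower() in months:
        return {"date": f"{match.group(2)}-{months[match.group(1).lower()]:02d}"}
    return None


def parse_numeric(text):
    """Stand-in stage: accept plain YYYY, YYYY-MM or YYYY-MM-DD."""
    return {"date": text} if re.fullmatch(r"\d{4}(-\d{2}){0,2}", text) else None


STAGES = [parse_textual, parse_numeric]


def parse_single(text):
    """Try each stage in order and return the first successful result."""
    for stage in STAGES:
        result = stage(text)
        if result is not None:
            return result
    return None


def parse_date(text):
    text = text.strip()
    result = parse_single(text)
    if result is not None:
        return result
    # Last resort: split on a common range delimiter and retry each side.
    for delimiter in ("--", "/"):
        if delimiter not in text:
            continue
        parts = [part.strip() for part in text.split(delimiter, 1)]
        sides = [parse_single(part) if part else None for part in parts]
        # Every non-empty side must parse; one side may be empty (open-ended).
        if any(parts) and all(side is not None for part, side in zip(parts, sides) if part):
            return {"range": sides}
    return None
```

Each extra stage adds another pass over the input, which is why a pipeline like this gets slow as the number of patterns grows.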

This adds EDTF, part of ISO 8601-2, as an alternative to the current,
more verbose, representation.
@bdarcus bdarcus force-pushed the input-yaml-json-harmonize branch from cb76bd8 to 69b4e49 Compare June 29, 2020 17:42
@bdarcus bdarcus changed the title (input): Convert date-variable definition to use EDTF (input): Add EDTF as an option for data representation Jun 29, 2020
@bdarcus
Member Author

bdarcus commented Jun 29, 2020

OK, all, after some back-and-forth with @bwiernik and @dhimmel, I revised the PR to give two choices:

  1. current JSON representation
  2. an EDTF string

I also added choice 2 to choice 1, via a new edtf property on the object.
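A consumer then has to cope with both shapes. The following is a minimal sketch of one way to normalize them to a single EDTF string. It assumes the existing CSL JSON `date-parts` layout (a list of one or two `[year, month, day]` lists), ignores `circa`, `season`, `literal` and `raw`, and uses hypothetical function names.

```python
def date_parts_to_edtf(date_parts):
    """Convert CSL date-parts, e.g. [[2004, 2, 1], [2005, 2, 8]], to an EDTF string."""

    def one(parts):
        year = int(parts[0])
        sign = "-" if year < 0 else ""
        pieces = [f"{sign}{abs(year):04d}"]
        pieces += [f"{int(part):02d}" for part in parts[1:]]
        return "-".join(pieces)

    return "/".join(one(parts) for parts in date_parts)


def to_edtf(date_variable):
    """Return an EDTF string for either representation, or None if there is none."""
    if isinstance(date_variable, str):       # choice 2: a bare EDTF string
        return date_variable
    if "edtf" in date_variable:              # choice 1 with the new edtf property
        return date_variable["edtf"]
    if "date-parts" in date_variable:        # choice 1 with classic date-parts
        return date_parts_to_edtf(date_variable["date-parts"])
    return None                              # e.g. only "literal" or "raw" is present
```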

TBD still:

  1. are we all OK with this in general? this does not, for example, support the current pandoc model.
  2. language: right now, the language I put in the docstrings puts the two on equal footing; one might make the argument we should more strongly recommend EDTF over the current. if you have suggested tweaks to the language, please add as a line comment, with a "suggestion."
  3. what should be the regular expression on the edtf-datatype definition I just added (currently just defines valid characters)?

@bdarcus bdarcus changed the title (input): Add EDTF as an option for data representation (input): Add EDTF as an option for date representation Jun 29, 2020
Contributor

@dhimmel dhimmel left a comment

haven't tested the new JSON schema, but the changes look great @bdarcus (if they achieve the intended effect). Defining edtf-datatype separately was a smart move.

@bdarcus
Member Author

bdarcus commented Jun 29, 2020

haven't tested the new JSON schema, but the changes look great @bdarcus (if they achieve the intended effect). Defining edtf-datatype separately was a smart move.

I did test it, and it works as expected.

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

I forgot about season ranges, but just added it.

  • The literal and season elements are useful for entering information that does not fit into the EDTF format, e.g. "literal" : "13th century" or "season" : "Trinity" (both examples from the old citeproc-js docs, archived here; without date-parts elements, the latter will only make sense though if accompanied by a year in the edtf element). I feel it could be useful to keep the option of entering unstructured information – though neither CMOS nor APA seem to mention anything similar.

OK. To be clear, the current code allows one representation, or the other.

But you could indeed do this:

issued:
   edtf: 1950
   season: Trinity

The longer-term question, beyond the status of date-parts, is whether the above would still be allowed, or whether a date would just be the EDTF string.

PS - wouldn't this be an EDTF representation of 13th century: 12XX?

@njbart

njbart commented Jun 30, 2020

We need these "Extended Interval" features on styles, but on input?

We do, e.g., for multivolume works where not all planned volumes have appeared yet (e.g., 1999/ or 1999/..). (Strictly speaking, in this case I guess "unknown" would apply, since we expect some final volume, we just don’t know its publication date yet. For journals OTOH, "open" would seem more appropriate. For our purposes, I think the difference is moot.) Open/unknown start dates seem to be less common in bibliographic contexts, but, FWIW, biblatex supports all four forms, and does not distinguish open and unknown.

@njbart

njbart commented Jun 30, 2020

PS - wouldn't this be an EDTF representation of 13th century: 12XX?

Yes, and newer biblatex versions do support this as EDTF input – but I doubt whether even biblatex can actually render this as “13th century” (haven’t tested though).
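For what it is worth, a year with unspecified digits maps naturally onto a year range. Here is a minimal sketch, assuming only four-character years with `X` digits on the right; the function name `year_bounds` is hypothetical. Note that `12XX` covers 1200 to 1299, which is close to but not identical with the strict definition of the 13th century, and rendering it as “13th century” would be a separate step.

```python
import re


def year_bounds(value):
    """Return (first_year, last_year) for a year like '12XX', or None if invalid."""
    # Digits first, then unspecified 'X' digits from the right, four characters total.
    if len(value) != 4 or not re.fullmatch(r"\d{1,4}X{0,3}", value):
        return None
    return int(value.replace("X", "0")), int(value.replace("X", "9"))
```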

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

We need these "Extended Interval" features on styles, but on input?

We do, e.g., for multivolume works where not all planned volumes have appeared yet (e.g., 1999/ or 1999/..). (Strictly speaking, in this case I guess "unknown" would apply, since we expect some final volume, we just don’t know its publication date yet. For journals OTOH, "open" would seem more appropriate. For our purposes, I think the difference is moot.) Open/unknown start dates seem to be less common in bibliographic contexts, but, FWIW, biblatex supports all four forms, and does not distinguish open and unknown.

I think at this point, I just want the regex to reject dates that have characters for features we don't want/need to support.

The remaining questions are sets (I agree, we can add), and approximate (~) vs uncertain (?), and we can address the rest in documentation.

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

Maybe we should just do as biblatex does?

See its manual, section 2.3.8 Date and Time Specifications, which explicitly discusses its support for ISO 8601-2.

FWIW, it supports both approximate (circa) and uncertain dates. Maybe we support both in this PR, and then finalize the decision on the documentation?

@njbart

njbart commented Jun 30, 2020

Maybe we should just do as biblatex does?

I'm all in favour.

CSL could still support literal and season elements (biblatex recommends its year field if literal/unstructured date information is to be used, and suggests putting season(-ish) terms such as “Michaelmas term” in the issue field).

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

We need these "Extended Interval" features on styles, but on input?

We do, e.g., for multivolume works where not all planned volumes have appeared yet (e.g., 1999/ or 1999/..).

What about this though: {2010-03,2010-05,2010-03..2010-05}?

@njbart

njbart commented Jun 30, 2020

What about this though: {2010-03,2010-05,2010-03..2010-05}?

Probably not. I can’t remember having encountered anything like this, neither in the wild, nor in CMOS or APA. Theoretically, such a feature might make sense if you wanted to highlight a particularly unusual publication history (“Doe (1901, 1902, 1910, 1969), My Autobiography in Four Volumes.”), but, again, I’ve never actually seen this anywhere.

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

OK. I'm going to remove support for the brackets and comma then. We can revisit later.
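For illustration, a character-level check of this kind might look like the sketch below. The exact pattern in the schema is not reproduced here; this whitelist is an assumption based on the features discussed in this thread (dates, intervals, `..`, unspecified `X` digits, and the `?`, `~`, `%` qualifiers), with braces and commas for sets left out and no support for date-times.

```python
import re

# Hypothetical character whitelist: digits, 'X' for unspecified digits,
# '-' and '/' separators, '.' for the open-ended '..', and the three
# qualifiers '?', '~' and '%'. Braces and commas (sets) are excluded.
EDTF_CHARS = re.compile(r"[0-9X/.?~%-]+")


def has_only_edtf_chars(value):
    """Return True if value is non-empty and uses only whitelisted characters."""
    return EDTF_CHARS.fullmatch(value) is not None
```

A check like this only rejects unwanted characters. It still accepts nonsense such as `--//`, so it is not a substitute for a real EDTF parser.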

@fbennett
Member

Just one question, I guess. Will the edtf.js library be able to parse all valid CSL inputs on the edtf segment?

@njbart

njbart commented Jun 30, 2020

FWIW, it supports both approximate (circa) and uncertain dates. Maybe we support both in this PR, and then finalize the decision on the documentation?

I’d support all three “flags” in EDTF input data, i.e., ~, ?, and %, regardless of whether CSL treats all as “circa” or starts making any distinctions.
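A minimal sketch of that idea: accept all three flags on input and let the processor decide whether to collapse them into a single "circa". It handles only a trailing qualifier on a single date, and the names `split_qualifier` and `is_circa` are hypothetical.

```python
# Meaning of the three EDTF qualifiers.
QUALIFIERS = {
    "?": {"uncertain": True, "approximate": False},
    "~": {"uncertain": False, "approximate": True},
    "%": {"uncertain": True, "approximate": True},
}


def split_qualifier(value):
    """Split a trailing qualifier off a date string and return (date, flags)."""
    flags = QUALIFIERS.get(value[-1:])
    if flags is None:
        return value, {"uncertain": False, "approximate": False}
    return value[:-1], dict(flags)


def is_circa(value):
    """Treat any of the three qualifiers as 'circa', making no finer distinction."""
    _, flags = split_qualifier(value)
    return flags["uncertain"] or flags["approximate"]
```

Keeping the two flags separate in the parsed result leaves room to distinguish them later without changing the input data.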

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

Will the edtf.js library be able to parse all valid CSL inputs on the edtf segment?

EDTF.js claims to support the full EDTF spec, which is a superset of what we need.

To go back to your question, @fbennett, EDTF does not have support for literal or raw content in the current date representation. But it's a superset of date-parts + circa + season.

@bwiernik
Member

From the above, I’m not sure where we stand on open ended date ranges like 2006/... I think we should support them—e.g., with the new periodical type, such a format might be used to refer to the journal under its current editor tenure.

@denismaier
Member

From the above, I’m not sure where we stand on open ended date ranges like 2006/... I think we should support them—e.g., with the new periodical type, such a format might be used to refer to the journal under its current editor tenure.

Yes, open ended ranges are needed for journals or multi-volume works that are not complete yet.

@bdarcus
Member Author

bdarcus commented Jun 30, 2020

From the above, I’m not sure where we stand on open ended date ranges like 2006/... I think we should support them—e.g., with the new periodical type, such a format might be used to refer to the journal under its current editor tenure.

Nothing here would preclude that. Just a topic for documentation.

I think this PR is RTM.

Would it be possible for @bwiernik, @denismaier, and @njbart to collaborate on a documentation PR sometime in the near future? As I suggested, I think biblatex (which claims support for level 0 and 1) would be the place to start, given a) it's well-designed in general, and b) they already integrated and implemented this support.

@bdarcus bdarcus merged commit bb22f12 into v1.1 Jun 30, 2020
@bdarcus
Member Author

bdarcus commented Jul 1, 2020

I think biblatex (which claims support for level 0 and 1) would be the place to start, given a) it's well-designed in general, and b) they already integrated and implemented this support.

PS - the other advantage of following biblatex as closely as possible is compatibility.

@larsgw

larsgw commented Jul 2, 2020

I missed this discussion, but I am personally not a fan of serializing information that could be represented differently. Converting CSL-JSON with EDTF to BibTeX or RIS now requires parsing EDTF, which is potentially pretty slow, versus an array access. I'd rather improve the current format, I think something like below could encode at least the same as EDTF.

{
  accessed: [{ date: [2020, 7, 1] }],
  issued: [{ date: [1960], season: 'Spring' }],
  'original-date': [{ literal: '13th century' }],
  'event-date': [
    { date: [1960, 3, 2] },
    { date: [1960, 3, 5] }
  ]
}

(You could even remove the outer array for non-range dates)

However, I do see the use with YAML and I get it if this is considered a very minor concern.

@bdarcus
Member Author

bdarcus commented Jul 2, 2020 via email

@larsgw

larsgw commented Jul 2, 2020

I meant specifically BibTeX, not BibLaTeX, which has year and month fields instead. I think it would be good to enhance the structured option.

@bwiernik
Member

bwiernik commented Jul 2, 2020

That’s basically adding the pandoc fields as an option, but providing for all of those options would mean requiring processors to support every format.

Personally I think BibLaTeX is a more important thing to be compatible with; supporting both requires EDTF anyway.

@bdarcus bdarcus deleted the input-yaml-json-harmonize branch July 2, 2020 10:18
bwiernik pushed a commit to bwiernik/schema that referenced this pull request Jul 8, 2020
…language#284)

As part of citation-style-language#278, and to harmonize the JSON and YAML representations 
around a much more concise and expressive date format, this adds
an option to use EDTF; either as a preferred string on any date, or as
an "edtf" string property on the more verbose alternative object 
representation.

While EDTF was originally an initiative of the US Library of Congress, 
ISO adopted it as part of 8601-2 in 2019.

Note: The current regular expression pattern only checks for valid 
characters.
bdarcus added a commit that referenced this pull request Jul 26, 2020
As part of #278, and to harmonize the JSON and YAML representations 
around a much more concise and expressive date format, this adds
an option to use EDTF; either as a preferred string on any date, or as
an "edtf" string property on the more verbose alternative object 
representation.

While EDTF was originally an initiative of the US Library of Congress, 
ISO adopted it as part of 8601-2 in 2019.

Note: The current regular expression pattern only checks for valid 
characters.
9 participants