-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Content Iterator #177
base: master
Are you sure you want to change the base?
Conversation
* `previous() -> Content?` | ||
* Returns the previous `Content` element in the iterator, or `null` when reaching the beginning. | ||
* `next() -> Content?` | ||
* Returns the next `Content` element in the iterator, or `null` when reaching the end. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should make it explicit that these methods make the iterator move forward or backward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the documentation comment? I'm fine making it more explicit there.
I wouldn't change the function signature as it's pretty common though:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was talking about the documentation.
##### Properties | ||
|
||
* `locator: Locator` | ||
* Locator targeting this element in the `Publication`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so.
With image elements, you can use a cssSelector
in the locations
object. I filled it for all locators in the HtmlResourceContentIterator
, as even with a text context it can help limit the search scope to a parent element.
If there's ever a case where we can't target precisely, this Locator
should at least match the closest targetable parent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see.
* `style: TextStyle` | ||
* Semantic style for this element. | ||
* `spans: [TextSpan]` | ||
* List of text spans in this element. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind grouping multiple TextSpan
s into a single Content.Data.Text
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example if you have the body paragraph:
<body lang="en">
<p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p>
</body>
We want the whole paragraph as a single semantic Text
element. However to not loose the information about the language (which is important for TTS), you can split the Text
element into a list of spans which are basically "attributed ranges" in the text element:
{
"locator": {},
"data": {
"style": "body",
"spans": [
{ "text": "The correct pronunciation is ", "language": "en" },
{ "text": "croissant", "language": "fr" },
{ "text": ", and not croissant.", "language": "en" }
]
}
}
Like @danielweck mentioned on the call, the term span
can be confusing when parsing HTML, as they are no 1-1 relationship between the HTML spans and the produced elements.
However thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types and the semantic seems to be correct:
- https://developer.android.com/guide/topics/text/spans
- https://api.flutter.dev/flutter/painting/TextSpan-class.html
I think it even matches the meaning from the HTML spec:
The span element doesn't mean anything on its own, but can be useful when used together with the global attributes, e.g. class, lang, or dir.
https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-span-element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielweck I'm renaming span
into segment
to lift any ambiguity. (Other candidates: part
, chunk
, portion
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also rename style
into role
, to remove the notion of appearance that style
brings.
|
||
### Extracting the data from `Content` elements | ||
|
||
The `Content` elements are value objects containing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some new elements to take into account:
- SVG
- can have an
aria-label
ortitle
child element
- can have an
- MathML
- extracting the
tt
child elements could be useful
- extracting the
- MathJax
See the full proposal
Summary
This proposal introduces a new Publication Service to iterate through a publication's content extracted as semantic elements.
The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as:
Today, implementing such features is complex because you need to:
The Content Iterator handles all that in a media type agnostic way.
Review notes
Content
class and its components need some review to account for all needs.