Proposal: Content Iterator #177

mickael-menu · 2022-06-15T13:30:56Z

Summary

This proposal introduces a new Publication Service to iterate through a publication's content extracted as semantic elements.

The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as:

Text to speech
Accessibility readers
Basic search
Full-text search indexing
Image or audio indexes

Today, implementing such features is complex because you need to:

(Optionally) find a starting location in one of the reading order resources.
Iterate through the reading order, opening and closing resources when needed.
Extract the textual or media content from the raw resource, which is different for every supported media type.

The Content Iterator handles all of that in a media-type-agnostic way.
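
To make the three steps above concrete, here is a minimal sketch of how a reading-order traversal could be hidden behind a single iterator. All names (`ForwardContentIterator`, `ReadingOrderIterator`, `ResourceOpener`, `iteratorOf`, `search`) and the simplified `Content` shape are assumptions made for illustration, not the API from the proposal, and the sketch is forward-only for brevity.

```typescript
// Hypothetical shapes for illustration only; not the actual Readium API.
interface Content {
  href: string; // reading-order resource the element comes from
  text: string; // textual content extracted from that resource
}

// Forward-only for brevity; the proposal also defines previous().
interface ForwardContentIterator {
  next(): Content | null;
}

// An opener "opens" a resource and returns an iterator over its elements.
type ResourceOpener = () => ForwardContentIterator;

// Stand-in for a media-type-specific iterator (HTML, PDF, ...).
function iteratorOf(elements: Content[]): ForwardContentIterator {
  let index = 0;
  return {
    next: () => (index < elements.length ? elements[index++] ?? null : null),
  };
}

// Walks the reading order, opening each resource only when it is reached.
class ReadingOrderIterator implements ForwardContentIterator {
  private readonly openers: ResourceOpener[];
  private position = 0;
  private current: ForwardContentIterator | null = null;

  constructor(openers: ResourceOpener[]) {
    this.openers = openers;
  }

  next(): Content | null {
    while (this.position < this.openers.length) {
      if (this.current === null) {
        const open = this.openers[this.position];
        if (open === undefined) return null;
        this.current = open();
      }
      const content = this.current.next();
      if (content !== null) return content;
      // Current resource is exhausted: drop it and move to the next one.
      this.current = null;
      this.position++;
    }
    return null;
  }
}

// A "basic search" written once, with no knowledge of media types.
function search(iterator: ForwardContentIterator, query: string): Content[] {
  const matches: Content[] = [];
  for (let c = iterator.next(); c !== null; c = iterator.next()) {
    if (c.text.includes(query)) matches.push(c);
  }
  return matches;
}
```

With this shape, the caller of `search` never sees resource boundaries or media types; only the per-resource iterators would need to know how to extract content.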

Review notes

The whole Content class and its components need some review to account for all needs.

qnga · 2022-06-17T08:45:49Z

proposals/000-content-iterator.md

+* `previous() -> Content?`
+    * Returns the previous `Content` element in the iterator, or `null` when reaching the beginning.
+* `next() -> Content?`
+    * Returns the next `Content` element in the iterator, or `null` when reaching the end.


You should make it explicit that these methods make the iterator move forward or backward.

In the documentation comment? I'm fine making it more explicit there.

I wouldn't change the function signature though, as it's pretty common:

https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-iterator/next.html

https://developer.apple.com/documentation/swift/iteratorprotocol/next()

Yeah, I was talking about the documentation.
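
For reference, the behaviour being documented could look like the following in-memory sketch. `ListContentIterator` is a hypothetical class, and the choice of a cursor sitting between elements (similar to Java's `ListIterator`) is an assumption; the proposal only states that both methods return `null` at the respective end.

```typescript
// Hypothetical in-memory iterator, used only to illustrate cursor movement.
class ListContentIterator<T> {
  private readonly elements: T[];
  // The cursor sits between elements: it is the index next() will return.
  private cursor = 0;

  constructor(elements: T[]) {
    this.elements = elements;
  }

  /** Moves the cursor forward and returns the element passed over, or null at the end. */
  next(): T | null {
    if (this.cursor >= this.elements.length) return null;
    const element = this.elements[this.cursor];
    this.cursor++;
    return element ?? null;
  }

  /** Moves the cursor backward and returns the element passed over, or null at the beginning. */
  previous(): T | null {
    if (this.cursor <= 0) return null;
    this.cursor--;
    return this.elements[this.cursor] ?? null;
  }
}
```

Under this assumption, calling `next()` and then `previous()` returns the same element twice, which is exactly the kind of detail the documentation comment would need to spell out.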

qnga · 2022-06-17T08:54:22Z

proposals/000-content-iterator.md

+##### Properties
+
+* `locator: Locator`
+    * Locator targeting this element in the `Publication`.


Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.

I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so.

With image elements, you can use a cssSelector in the locations object. I filled it for all locators in the HtmlResourceContentIterator, as even with a text context it can help limit the search scope to a parent element.

If there's ever a case where we can't target precisely, this Locator should at least match the closest targetable parent.
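
As a rough sketch of the idea, three successive images without ids can still be told apart with a positional CSS selector. The `NodeInfo` type, the helper functions and the simplified `Locator` shape below are hypothetical stand-ins (a real implementation would work on an actual DOM), and the sketch assumes ids need no CSS escaping.

```typescript
// Hypothetical, simplified Locator; only the fields discussed above.
interface Locator {
  href: string;
  locations: { cssSelector?: string };
}

// Minimal stand-in for a DOM node.
interface NodeInfo {
  tag: string;
  id?: string;
  parent?: NodeInfo;
  nthOfType: number; // 1-based position among same-tag siblings
}

// Builds a selector from the nearest ancestor with an id down to the node.
// Assumes ids contain no characters that need CSS escaping.
function cssSelectorFor(node: NodeInfo): string {
  if (node.id !== undefined) return `#${node.id}`;
  const own = `${node.tag}:nth-of-type(${node.nthOfType})`;
  return node.parent ? `${cssSelectorFor(node.parent)} > ${own}` : own;
}

function locatorFor(href: string, node: NodeInfo): Locator {
  return { href, locations: { cssSelector: cssSelectorFor(node) } };
}

// Fallback: walk up until a node that can be targeted is found.
function closestTargetable(
  node: NodeInfo,
  canTarget: (n: NodeInfo) => boolean,
): NodeInfo | null {
  for (let n: NodeInfo | undefined = node; n !== undefined; n = n.parent) {
    if (canTarget(n)) return n;
  }
  return null;
}

// Three successive images without ids, inside <section id="ch1">.
const exampleSection: NodeInfo = { tag: "section", id: "ch1", nthOfType: 1 };
const exampleImages: NodeInfo[] = [1, 2, 3].map((n) => ({
  tag: "img",
  parent: exampleSection,
  nthOfType: n,
}));
const exampleLocators = exampleImages.map((img) => locatorFor("chapter1.html", img));
```

Each image gets a distinct selector relative to the closest ancestor that has an id, and `closestTargetable` illustrates the fallback to a parent when the element itself cannot be addressed.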

qnga · 2022-06-17T08:55:55Z

proposals/000-content-iterator.md

+* `style: TextStyle`
+    * Semantic style for this element.
+* `spans: [TextSpan]`
+    * List of text spans in this element.


What's the rationale behind grouping multiple TextSpans into a single Content.Data.Text?

For example if you have the body paragraph:

<body lang="en">
  <p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p>
</body>

We want the whole paragraph as a single semantic Text element. However, to avoid losing the language information (which is important for TTS), you can split the Text element into a list of spans, which are basically "attributed ranges" in the text element:

{
  "locator": {},
  "data": {
    "style": "body",
    "spans": [
      { "text": "The correct pronunciation is ", "language": "en" },
      { "text": "croissant", "language": "fr" },
      { "text": ", and not croissant.", "language": "en" }
    ]
  }
}
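
A small sketch of how a consumer might use such an element: the full paragraph is recovered by joining the spans, while the per-language ranges stay available for TTS. The type and function names (`TextSpan`, `TextData`, `LanguageRange`, `fullText`, `languageRanges`) are assumptions mirroring the JSON above, not the final API.

```typescript
// Hypothetical shapes mirroring the JSON example.
interface TextSpan {
  text: string;
  language: string;
}

interface TextData {
  style: string;
  spans: TextSpan[];
}

// A language attributed to a range of the full text (UTF-16 offsets).
interface LanguageRange {
  start: number;
  end: number; // exclusive
  language: string;
}

// The whole paragraph as one string, e.g. for search indexing.
function fullText(data: TextData): string {
  return data.spans.map((span) => span.text).join("");
}

// The "attributed ranges", merging adjacent spans that share a language.
function languageRanges(data: TextData): LanguageRange[] {
  const ranges: LanguageRange[] = [];
  let offset = 0;
  for (const span of data.spans) {
    const end = offset + span.text.length;
    const last = ranges[ranges.length - 1];
    if (last !== undefined && last.language === span.language) {
      last.end = end;
    } else {
      ranges.push({ start: offset, end, language: span.language });
    }
    offset = end;
  }
  return ranges;
}

const paragraph: TextData = {
  style: "body",
  spans: [
    { text: "The correct pronunciation is ", language: "en" },
    { text: "croissant", language: "fr" },
    { text: ", and not croissant.", language: "en" },
  ],
};
```

A TTS engine could then speak `fullText(paragraph)` while switching voices at the boundaries returned by `languageRanges(paragraph)`.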

Like @danielweck mentioned on the call, the term span can be confusing when parsing HTML, as there is no 1-1 relationship between the HTML spans and the produced elements.

However, thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types, and the semantics seem to be correct:

https://developer.android.com/guide/topics/text/spans

https://api.flutter.dev/flutter/painting/TextSpan-class.html

I think it even matches the meaning from the HTML spec:

The span element doesn't mean anything on its own, but can be useful when used together with the global attributes, e.g. class, lang, or dir.
https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-span-element

Sounds good!

@danielweck I'm renaming span into segment to lift any ambiguity. (Other candidates: part, chunk, portion)

I will also rename style into role, to remove the notion of appearance that style brings.

mickael-menu · 2023-01-13T10:30:09Z

proposals/000-content-iterator.md

+
+### Extracting the data from `Content` elements
+
+The `Content` elements are value objects containing:


Some new elements to take into account:

SVG

can have an aria-label attribute or a title child element

MathML

extracting the tt child elements could be useful

MathJax

mickael-menu added 2 commits June 15, 2022 15:21

Add the Content Iterator proposal

32fd274

Update PR link

a3ff543



