Inner HTML elements are processed after the tail text #333

HiromuHota · 2019-10-15T23:24:58Z

Describe the bug

In tests/data/html_simple/md.html, the text "italics and later bold. Even" (where "bold" is an inner element nested inside the italic element) is processed in the following order:

italics and later
. Even
bold

This is illustrated in #12 too.

To Reproduce
Steps to reproduce the behavior:

Run tests/parser/test_parser.py::test_parse_md_details and set a breakpoint

Expected behavior

The said sentence is processed in the following order:

italics and later
bold
. Even

Environment (please complete the following information):

Fonduer Version: [0.7.0 and master(3d5392c)]

Additional context

Why does this happen?

When the node is the element containing "italics and later bold", both node.text (= "italics and later ") and node.tail (= ". Even") are processed first, and only then is the next node, "bold", processed.
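To make the text/tail behavior concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, which follows the same text/tail model as lxml. The markup (em/strong inside a p) is an assumption for illustration and may differ from the actual tags in md.html, and tail_before_children is a hypothetical helper that only mimics the order described above, not Fonduer's actual parser code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real md.html may use different tags.
root = ET.fromstring(
    "<p><em>italics and later <strong>bold</strong></em>. Even</p>"
)
em = root[0]      # the italic element
strong = em[0]    # the inner bold element

# Text that follows an element's closing tag is stored on that element
# as .tail, not on its parent.
print(repr(em.text))      # 'italics and later '
print(repr(em.tail))      # '. Even'
print(repr(strong.text))  # 'bold'


def tail_before_children(node):
    """Collect node.text and node.tail together, then descend into children.

    This reproduces the reported (undesired) order.
    """
    parts = []
    if node.text:
        parts.append(node.text)
    if node.tail:
        parts.append(node.tail)
    for child in node:
        parts.extend(tail_before_children(child))
    return parts


print(tail_before_children(em))
# ['italics and later ', '. Even', 'bold']
```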

senwu · 2020-06-20T20:44:45Z

So, the expected order should be (1) node.text, (2) text in the inner node, and (3) node.tail?

HiromuHota · 2020-06-24T16:51:04Z

I think so.
This is especially true when you want to align words between HTML and PDF (related to #12).
I don't think this is a major issue, though.
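The order agreed on in this exchange (node.text, then inner nodes, then node.tail) can be sketched as a recursive traversal. This is only an illustration with the standard library's xml.etree.ElementTree; the markup and the helper name document_order are assumptions, and it is not the code that was eventually merged.

```python
import xml.etree.ElementTree as ET


def document_order(node):
    """Collect text in reading order: node.text, children, then node.tail."""
    parts = []
    if node.text:
        parts.append(node.text)
    for child in node:
        parts.extend(document_order(child))
    # The tail is emitted only after every child has been visited.
    if node.tail:
        parts.append(node.tail)
    return parts


# Hypothetical markup; the real md.html may use different tags.
root = ET.fromstring(
    "<p><em>italics and later <strong>bold</strong></em>. Even</p>"
)
print(document_order(root[0]))
# ['italics and later ', 'bold', '. Even']
```

Emitting the tail last keeps the pieces in the same order as the rendered text, which is what matters when aligning words between HTML and PDF.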

HiromuHota pushed a commit to HiromuHota/fonduer that referenced this issue Sep 30, 2020

Process the tail text only after child elements (HazyResearch#333)

1503683

HiromuHota mentioned this issue Sep 30, 2020

Process the tail text only after child elements (#333) #520

Merged

4 tasks

HiromuHota pushed a commit to HiromuHota/fonduer that referenced this issue Oct 5, 2020

Process the tail text only after child elements (HazyResearch#333)

2f896ca

lukehsiao closed this as completed in #520 Oct 5, 2020

lukehsiao pushed a commit that referenced this issue Oct 5, 2020

Process the tail text only after child elements (#333)

8ec5a85
