Update pdf2parquet to Docling v2 #756

Merged
merged 16 commits into from
Oct 31, 2024

Member

@dolfim-ibm dolfim-ibm commented Oct 30, 2024

Why are these changes needed?

Docling v2 brings the following:

  1. Support for more input formats: PDF, DOCX, PPTX, HTML, Markdown, AsciiDoc
  2. A faster PDF backend
  3. Improvements in the generated document
  4. The new DoclingDocument
  5. An additional export format: plain text

In progress

  • upgrade dependencies and adapt code
  • add a safeguard for downloading model weights when running multiple processes
  • implement a flushing mechanism for creating batches of files
  • update the doc_chunk transform
  • propagate parameters to kfp
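
To illustrate the flushing item above, here is a minimal sketch of a buffer-and-flush pattern. It is not the code in this PR: the class `BufferedBatchTransform`, its `batch_size` parameter, and the row layout are hypothetical, and only the names `buffer` and `flush_binary` echo identifiers mentioned in the review below.

```python
from typing import Any


class BufferedBatchTransform:
    """Hypothetical stand-in, not the pdf2parquet implementation."""

    def __init__(self, batch_size: int = 3) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.buffer: list[dict[str, Any]] = []

    def _emit(self) -> list[list[dict[str, Any]]]:
        # Hand the buffered rows over as one batch and start a fresh buffer.
        batch, self.buffer = self.buffer, []
        return [batch] if batch else []

    def transform_binary(self, file_name: str, data: bytes) -> list[list[dict[str, Any]]]:
        # Buffer one row per input file; emit only when the batch is full.
        self.buffer.append({"filename": file_name, "size": len(data)})
        if len(self.buffer) >= self.batch_size:
            return self._emit()
        return []

    def flush_binary(self) -> list[list[dict[str, Any]]]:
        # Called once after the last input so a partial batch is not lost.
        return self._emit()


# Three inputs with batch_size=2: one full batch, then one flushed remainder.
transform = BufferedBatchTransform(batch_size=2)
batches = []
for name in ["a.pdf", "b.docx", "c.html"]:
    batches.extend(transform.transform_binary(name, b"dummy"))
batches.extend(transform.flush_binary())
batch_sizes = [len(b) for b in batches]
```

The point of the separate flush step is that the last, partially filled batch is only written if the caller invokes it after the final input, which is why a test that exercises it is worth having.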

Related issue number (if any).

new parameters and input formats
faster backend
revalidated the test results

Signed-off-by: Michele Dolfi <[email protected]>
@touma-I touma-I requested a review from daw3rd October 30, 2024 18:10
@dolfim-ibm dolfim-ibm marked this pull request as ready for review October 30, 2024 18:27
@touma-I touma-I self-requested a review October 30, 2024 18:32
Member

@daw3rd daw3rd left a comment

Do any of your tests exercise self.buffer and flush_binary()?

@@ -1,6 +1,6 @@
data-prep-toolkit==0.2.2.dev1
Member

Should this be >= 0.2.2.dev1? dev1 may eventually go away.

Member Author

Here I rely on @daw3rd's and @touma-I's expertise. Let me know what the correct value is.

Member

@touma-I I defer to you on this one.

Collaborator

@dolfim-ibm @daw3rd This will be reset during the release process or whenever we push to PyPI. For now, I wouldn't worry about it. You're good.

Member

@daw3rd daw3rd left a comment

LGTM modulo @touma-I's call on the >= versioning.

3 participants