XHTML support, does not extract text #743

Open
simjak opened this issue Jan 14, 2025 · 0 comments
Labels
enhancement New feature or request

simjak commented Jan 14, 2025

When I pass `*.xhtml` files to the converter, no text is extracted. The entire output is a series of image placeholders:

<!-- image -->

<!-- image -->

<!-- image -->

Logs:

2025-01-14 12:02:29 - docling.backend.html_backend - DEBUG - html_backend.py:26 - __init__() - About to init HTML backend...
2025-01-14 12:02:29 - charset_normalizer - DEBUG - api.py:461 - from_bytes() - Encoding detection: utf_8 is most likely the one.
2025-01-14 12:02:30 - docling.document_converter - INFO - document_converter.py:238 - _convert() - Going to convert document batch...
2025-01-14 12:02:30 - docling.pipeline.base_pipeline - INFO - base_pipeline.py:37 - execute() - Processing document airbus.xhtml
2025-01-14 12:02:30 - docling.backend.html_backend - DEBUG - html_backend.py:77 - convert() - Trying to convert HTML...
2025-01-14 12:02:30 - docling.document_converter - INFO - document_converter.py:253 - _convert() - Finished converting document airbus.xhtml in 0.80 s

Pipeline options:

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options, do_table_structure=True
    )

    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV2DocumentBackend,
            ),
            InputFormat.HTML: HTMLFormatOption(
                pipeline_options=pipeline_options,
                backend=HTMLDocumentBackend,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=params.file_path)
    return conversion_result
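
To help narrow this down, here is a minimal sketch for checking which elements of an XHTML file actually carry text, independent of docling. It uses only the Python standard library. The helper name `xhtml_text_by_tag` and the sample markup are assumptions made for illustration; they are not taken from `airbus.xhtml` or from docling's code. It also assumes the file is well-formed XML: the parser will raise on HTML-only entities such as `&nbsp;` when no DTD is loaded.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the root <html> element.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_text_by_tag(xhtml: str) -> dict[str, int]:
    """Count elements with non-empty direct text, grouped by tag name.

    Hypothetical diagnostic helper, not part of docling.
    """
    root = ET.fromstring(xhtml)
    counts: dict[str, int] = {}
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip any non-element nodes
        name = el.tag.removeprefix(XHTML_NS)
        if el.text and el.text.strip():
            counts[name] = counts.get(name, 0) + 1
    return counts


# Illustrative markup only: text lives in <span> inside <div>, plus one <p>.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Report</title></head>
  <body>
    <div><span>Revenue grew.</span></div>
    <p>Outlook</p>
    <img src="a.png" alt=""/>
  </body>
</html>"""

print(xhtml_text_by_tag(sample))
```

If a check like this reports text mostly under tags such as `div` or `span`, while the converter output contains only `<!-- image -->` placeholders, that would suggest the text is present in the file but sits in elements the HTML backend does not turn into text items. That is a hypothesis to verify against the backend, not a confirmed cause.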

simjak added the enhancement (New feature or request) label on Jan 14, 2025