XHTML support, does not extract text #743

simjak · 2025-01-14T10:07:22Z

When I pass *.xhtml files to the converter, the only output is:

<!-- image -->

<!-- image -->

<!-- image -->

Logs:

2025-01-14 12:02:29 - docling.backend.html_backend - DEBUG - html_backend.py:26 - __init__() - About to init HTML backend...
2025-01-14 12:02:29 - charset_normalizer - DEBUG - api.py:461 - from_bytes() - Encoding detection: utf_8 is most likely the one.
2025-01-14 12:02:30 - docling.document_converter - INFO - document_converter.py:238 - _convert() - Going to convert document batch...
2025-01-14 12:02:30 - docling.pipeline.base_pipeline - INFO - base_pipeline.py:37 - execute() - Processing document airbus.xhtml
2025-01-14 12:02:30 - docling.backend.html_backend - DEBUG - html_backend.py:77 - convert() - Trying to convert HTML...
2025-01-14 12:02:30 - docling.document_converter - INFO - document_converter.py:253 - _convert() - Finished converting document airbus.xhtml in 0.80 s

Pipeline options:

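    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.backend.html_backend import HTMLDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.document import ConversionResult
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import (
        DocumentConverter,
        HTMLFormatOption,
        PdfFormatOption,
    )

    # ocr_options, IMAGE_RESOLUTION_SCALE and params are defined elsewhere
    # in the calling code and are not shown in this report.
    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options, do_table_structure=True
    )

    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_picture_images = True

    # The same PDF pipeline options are reused for the HTML format option.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV2DocumentBackend,
            ),
            InputFormat.HTML: HTMLFormatOption(
                pipeline_options=pipeline_options,
                backend=HTMLDocumentBackend,
            ),
        },
    )

    # This snippet sits inside a function; params.file_path points at the .xhtml file.
    conversion_result: ConversionResult = converter.convert(source=params.file_path)
    return conversion_result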
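
To rule out the input file itself, it may help to check that the XHTML really contains text next to its images. The sketch below is a standard-library check, not part of docling, and it does not diagnose the backend. The names `summarize_xhtml` and `SAMPLE` are hypothetical and used only for illustration, and it assumes the file is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; a real check would read the .xhtml file instead.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Report</title></head>
  <body>
    <p>Revenue grew.</p>
    <img src="chart.png" alt="chart"/>
    <p>Costs fell.</p>
  </body>
</html>"""

SKIPPED_TAGS = {"script", "style"}


def local_name(elem: ET.Element) -> str:
    """Return the tag without its '{namespace}' prefix."""
    return elem.tag.split("}", 1)[-1]


def summarize_xhtml(markup: str) -> dict:
    """Count <img> elements and collect text in document order."""
    root = ET.fromstring(markup)
    chunks = []
    images = 0

    def walk(elem: ET.Element) -> None:
        nonlocal images
        tag = local_name(elem)
        if tag == "img":
            images += 1
        # Text directly inside the element (ignored for script/style).
        if tag not in SKIPPED_TAGS and elem.text and elem.text.strip():
            chunks.append(" ".join(elem.text.split()))
        for child in elem:
            walk(child)
            # Text that follows the child belongs to the current element.
            if tag not in SKIPPED_TAGS and child.tail and child.tail.strip():
                chunks.append(" ".join(child.tail.split()))

    walk(root)
    return {"images": images, "text": " ".join(chunks)}


if __name__ == "__main__":
    print(summarize_xhtml(SAMPLE))
```

If this reports non-empty text for the file while the converter output still shows only image placeholders, the text is being lost during conversion rather than missing from the input.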

simjak added the enhancement New feature or request label Jan 14, 2025
