Releases: OCR-D/core

v3.0.0

09 Jan 10:54
@kba kba

Changed:

  • Merge v2 master into new-processor-api
  • PAGE API: Update to latest generateDS 2.44.1, bertsky#21
  • 🔥 logging: increase default root (not ocrd) level from INFO to WARNING
  • 🔥 initLogging: do not remove any previous handlers/levels, unless force_reinit
  • 🔥 disableLogging: remove all handlers, reset all levels - instead of being selective
  • 🔥 Processor: replace weakref with __del__ to trigger shutdown
  • 🔥 OCRD_MAX_PARALLEL_PAGES>1: log via QueueHandler in subprocess, QueueListener in main
  • 🔥 ocrd_utils.initLogging: also add handler to root logger (as in file config),
    but disable message propagation to avoid duplication
  • only import ocrd_network in src/ocrd/decorators/__init__.py once needed
  • Processor.process_page_file: skip computing process_page_pcgts if output already exists,
    but OCRD_EXISTING_OUTPUT!=OVERWRITE
  • 🔥 OCRD_MAX_PARALLEL_PAGES>1: switch from multithreading to multiprocessing, depend on
    loky instead of stdlib concurrent.futures
  • OCRD_PROCESSING_PAGE_TIMEOUT>0: actually enforce timeout within worker
  • OCRD_MAX_MISSING_OUTPUTS>0: abort early if too many failures already, prospectively
  • Processor.process_workspace: split up into overridable sub-methods:
    • process_workspace_submit_tasks (iterate input file group and schedule page tasks)
    • process_workspace_submit_page_task (download input files and submit single page task)
    • process_workspace_handle_tasks (monitor page tasks and aggregate results)
    • process_workspace_handle_page_task (await single page task and handle errors)
  • 🔥 Processor / Workspace.add_file: always force if OCRD_EXISTING_OUTPUT==OVERWRITE
  • 🔥 Processor.verify: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
  • 🔥 Processor.verify: check output fileGrps, too
    (must not exist unless OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP or disjoint --page-id range)
  • lib.bash input-files: do not try to validate tasks here (now covered by Processor.verify())
  • run_processor: be robust if ocrd_tool is missing steps
  • PcGtsType.PageType.id via make_xml_id: replace / with _
  • 🔥 ocrd_utils, ocrd_models, ocrd_modelfactory, ocrd_validators and ocrd_network are not published
    as separate packages anymore, everything is contained in ocrd - you should adapt your requirements.txt accordingly
  • 🔥 Processor.parameter now a property (attribute always exists, but None for non-processing contexts)
  • 🔥 Processor.parameter is now a frozendict (contents immutable)
  • 🔥 Processor.parameter is validated whenever it is set, instead of only in the constructor
  • setting Processor.parameter will also trigger (Processor.shutdown() and) Processor.setup()
  • get_processor(... instance_caching=True): use min(max_instances, OCRD_MAX_PROCESSOR_CACHE)
  • 🔥 Processor.verify always validates fileGrp cardinalities (because we have ocrd-tool.json defaults now)
  • 🔥 OcrdMets.add_agent without positional arguments
  • ocrd bashlib input-files now uses normal Processor decorator, and gets passed actual ocrd-tool.json and tool name
    from bashlib's ocrd__wrap
  • 🔥 OcrdPage as proxy of PcGtsType instead of alias; also contains etree and mapping now
  • 🔥 page_from_file: removed kwarg with_tree - use OcrdPage.etree and OcrdPage.mapping instead
  • 🔥 Processor.zip_input_files now can throw ocrd.NonUniqueInputFile and ocrd.MissingInputFile
    (the latter only if OCRD_MISSING_INPUT=ABORT)
  • 🔥 Processor.zip_input_files does not by default use require_first anymore
    (so the first file in any input file tuple per page can be None as well)
  • 🔥 no more Workspace.overwrite_mode, merely delegate to OCRD_EXISTING_OUTPUT=OVERWRITE
  • 🎨 improve on docs result for ocrd_utils.config
  • 🔥 Deprecate Processor.process
  • update spec to v3.25.0, which requires annotating fileGrp cardinality in ocrd-tool.json
  • 🔥 Remove passing non-processing kwargs to Processor constructor, add as members
    (i.e. show_help, dump_json, dump_module_dir, list_resources, show_resource, resolve_resource)
  • 🔥 Deprecate passing processing arg / kwargs to Processor constructor
    (i.e. workspace, page_id, input_file_grp, output_file_grp; now all set by run_processor)
  • 🔥 Deprecate passing ocrd-tool.json metadata to Processor constructor
  • ocrd.processor: Handle loading of bundled ocrd-tool.json generically
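
The OCRD_MAX_PARALLEL_PAGES>1 entry above describes logging via a QueueHandler in the subprocess and a QueueListener in the main process. The following is a minimal standard-library sketch of that pattern, not the code used in ocrd. The logger name demo.worker and the ListHandler class are illustrative assumptions, and a plain queue.Queue stands in for the inter-process queue that real worker processes would need.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


def run_demo():
    # Assumption: a multiprocessing-capable queue would replace this
    # in-process queue when workers are separate processes.
    log_queue = queue.Queue()

    # "Main" side: the listener drains the queue into the real handler(s).
    sink = ListHandler()
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()

    # "Worker" side: the logger only knows about the queue.
    worker_logger = logging.getLogger("demo.worker")
    worker_logger.setLevel(logging.INFO)
    worker_logger.propagate = False  # avoid duplicates via the root logger
    queue_handler = logging.handlers.QueueHandler(log_queue)
    worker_logger.addHandler(queue_handler)

    worker_logger.info("page %s done", "PHYS_0001")

    # stop() processes everything still queued before returning.
    listener.stop()
    worker_logger.removeHandler(queue_handler)
    return sink.messages


messages = run_demo()
print(messages)
```

The point of the design is that workers never touch files or streams directly, so records from parallel workers are serialized through a single listener.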

Fixed:

  • ocrd --help output was broken for multiline config options, bertsky#25
  • Call initLogging before instantiating processors in ocrd_cli_wrap_processor, bertsky#24, #1296
  • PAGE API: Fully reversible mapping from/to XML element/generateDS instances, bertsky#21
  • initLogging: only add root handler instead of multiple redundant handlers with propagate=false
  • setOverrideLogLevel: override all currently active loggers' level
  • OcrdMets.get_physical_pages: cover return_divs w/o for_fileIds and for_pageIds
  • tests: ensure ocrd_utils.config gets reset whenever changing it globally
  • OcrdMetsServer.add_file: pass on force kwarg
  • ocrd.cli.workspace: consistently pass on --mets-server-url and --backup
  • ocrd.cli.validate "tasks": pass on --mets-server-url
  • ocrd.cli.bashlib "input-files": pass on --mets-server-url
  • lib.bash input-files: pass on --mets-server-url, --overwrite, and parameters
  • lib.bash: fix errexit handling
  • ocrd.cli.ocrd-tool "resolve-resource": forgot to actually print result
  • Processor.metadata_location: src workaround respects namespace packages, qurator-spk/eynollah#134
  • Workspace.reload_mets: handle ClientSideOcrdMets as well
  • disableLogging: also re-instate root logger to Python defaults
  • actually apply CLI --log-filename, and show in --help
  • adapt to Pillow changes
  • ocrd workspace clone: do pass on --file-grp (for download filtering)

Added:

  • ocrd-filter processor to remove segments based on XPath expressions, bertsky#21
  • XPath function pc:pixelarea for the number of pixels of the bounding box (or sum area on node sets), bertsky#21
  • XPath function pc:textequiv for the first TextEquiv unicode string (or concatenated string on node sets), bertsky#21
  • OcrdPage: new PageType.get_ReadingOrderGroups() to retrieve recursive RO as dict
  • ocrd.cli.workspace server: add subcommands reload and save
  • METS Server: export and delegate physical_pages
  • processor CLI: delegate --resolve-resource, too
  • Processor.process_page_file / OcrdPageResultImage: allow None besides AlternativeImageType
  • OcrdConfig.reset_defaults to reset config variables to their defaults
  • Processor.max_workers: class attribute to control per-page parallelism of this implementation
  • Processor.max_page_seconds: class attribute to control per-page timeout of this implementation
  • OCRD_MAX_PARALLEL_PAGES for whether and how many workers should process pages in parallel
  • OCRD_PROCESSING_PAGE_TIMEOUT for whether and how long processors should wait for single pages
  • OCRD_MAX_MISSING_OUTPUTS for the maximum rate (fraction) of pages with missing output before aborting (as with OCRD_MISSING_OUTPUT=ABORT)
  • Processor.metadata_filename: expose to make local path of ocrd-tool.json in Python distribution reusable+overridable
  • Processor.metadata_location: expose to make absolute path of ocrd-tool.json reusable+overridable
  • Processor.metadata_rawdict: expose to make in-memory contents of ocrd-tool.json reusable+overridable
  • Processor.metadata: expose to make validated and default-expanded contents of ocrd-tool.json reusable+overridable
  • Processor.shutdown: to shut down processor after processing, optional
  • Processor.max_instances: class attribute to control instance caching of this implementation
  • 👉 OCRD_DOWNLOAD_INPUT for whether input files should be downloaded before processing
  • 👉 OCRD_MISSING_INPUT for how to handle missing input files (SKIP or ABORT)
  • 👉 OCRD_MISSING_OUTPUT for how to handle processing failures (SKIP or ABORT or COPY);
    the latter behaves like ocrd-dummy for the failed page(s)
  • 👉 OCRD_EXISTING_OUTPUT for how to handle existing output files (SKIP or ABORT or OVERWRITE)
  • new CLI option --debug as short-hand for ABORT choices above
  • Processor.logger set up by constructor already (for re-use by processor implementors)
  • default-expand and validate ocrd_tool.json in Processor constructor, log invalidities
  • handle JSON deprecation in ocrd_tool.json by reporting warnings
  • Processor.process_workspace: process a complete workspace, with default implementation
  • Processor.process_page_file: process an OcrdFile, with default implementation
  • Processor.process_page_pcgts: process a single OcrdPage, produce a single OcrdPage, required to implement
  • Processor.verify: handle fileGrp cardinality verification, with default implementation
  • Processor.setup: to set up processor before processing, optional
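
The 👉 environment variables above each accept a small set of values. As a hedged illustration, the sketch below builds an environment mapping with those settings and validates them against the values named in these notes. The helper build_ocrd_env and its debug flag are hypothetical and not part of ocrd. The debug flag only mimics the description of --debug as short-hand for the ABORT choices.

```python
import os

# Allowed values, copied from the release notes above.
ALLOWED = {
    "OCRD_MISSING_INPUT": {"SKIP", "ABORT"},
    "OCRD_MISSING_OUTPUT": {"SKIP", "ABORT", "COPY"},
    "OCRD_EXISTING_OUTPUT": {"SKIP", "ABORT", "OVERWRITE"},
}


def build_ocrd_env(base=None, debug=False, **choices):
    """Hypothetical helper: copy `base` (default: os.environ) and add
    validated OCRD_* settings."""
    env = dict(os.environ if base is None else base)
    # debug=True starts from ABORT everywhere; explicit choices still win.
    settings = {name: "ABORT" for name in ALLOWED} if debug else {}
    settings.update(choices)
    for name, value in settings.items():
        if name not in ALLOWED:
            raise KeyError(f"unknown variable: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}")
        env[name] = value
    return env


env = build_ocrd_env(
    base={},
    OCRD_EXISTING_OUTPUT="OVERWRITE",
    OCRD_MISSING_OUTPUT="COPY",
)
print(env)
```

Such a mapping could be passed as the env argument of subprocess.run when calling a processor CLI from a wrapper script.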

v2.71.0

20 Nov 12:20
@kba kba

Changed:

  • Rewrite ocrd_utils.logging, #1288
    • Handle only '' as the root logger
    • disableLogging: Remove handlers from root and all configured loggers
    • Do not do any module-level modification of the log config

Fixed:

  • Typo in processing_worker log message, #1293
  • Call initLogging at the right time in ocrd_network, #1292
  • make docs: fixed by using an absolute path to location, #1273

v2.70.0

10 Oct 14:17
@kba kba

Added:

  • ocrd network client workflow run: Add --print-status flag to periodically print the job status, #1277
  • Processing Server: DELETE /mets_server_zombies to kill any renegade METS servers, #1277
  • No more zombie METS Servers, by properly shutting them down, #1284
  • OCRD_NETWORK_RABBITMQ_HEARTBEAT to allow overriding the heartbeat behavior of RabbitMQ, #1285

Changed:

  • significantly more detailed logging for the METS Server and Processing Server, #1284
  • Only import ocrd_network in src/ocrd/decorators/__init__.py once needed, #1289
  • Automate release via GitHub Actions, #1290

Fixed:

  • ocrd/core-cuda-torch: Install torchvision as well, #1286
  • Processing Server: remove shut down METS servers from deployer's cache, #1287
  • typos, #1274

v2.69.0

30 Sep 16:34
@kba kba

Fixed:

  • tests: ensure ocrd_utils.config gets reset whenever changing it globally
  • ocrd.cli.workspace: consistently pass on --mets-server-url and --backup
  • ocrd.cli.workspace: make list-page work w/ METS Server
  • ocrd.cli.validate "tasks": pass on --mets-server-url
  • lib.bash: fix errexit handling
  • actually apply CLI --log-filename, and show in --help
  • adapt to Pillow changes
  • ocrd workspace clone: do pass on --file-grp (for download filtering)
  • OcrdMetsServer.add_file: pass on force kwarg
  • Workspace.reload_mets: handle ClientSideOcrdMets as well
  • OcrdMets.get_physical_pages: cover return_divs w/o for_fileIds and for_pageIds
  • disableLogging: also re-instate root logger to Python defaults
  • OcrdExif: handle multi-frame TIFFs gracefully in identify callout, #1276

Changed:

  • run_processor: be robust if ocrd_tool is missing steps
  • PcGtsType.PageType.id via make_xml_id: replace / with _
  • ClientSideOcrdMets: use same logger name prefix as METS Server
  • Processor.zip_input_files: when --page-id yields empty list, just log instead of raise

Added:

  • OcrdPage: new PageType.get_ReadingOrderGroups() to retrieve recursive RO as dict
  • METS Server: export and delegate physical_pages
  • ocrd.cli.workspace server: add subcommands reload and save
  • processor CLI: delegate --resolve-resource, too
  • OcrdConfig.reset_defaults to reset config variables to their defaults
  • ocrd_utils.scale_coordinates for resizing images

v3.0.0b5

16 Sep 11:36
@kba kba
Pre-release

Fixed:

  • tests: ensure ocrd_utils.config gets reset whenever changing it globally
  • OcrdMetsServer.add_file: pass on force kwarg
  • ocrd.cli.workspace: consistently pass on --mets-server-url and --backup
  • ocrd.cli.validate "tasks": pass on --mets-server-url
  • ocrd.cli.bashlib "input-files": pass on --mets-server-url
  • lib.bash input-files: pass on --mets-server-url, --overwrite, and parameters
  • lib.bash: fix errexit handling
  • ocrd.cli.ocrd-tool "resolve-resource": forgot to actually print result

Changed:

  • 🔥 Processor / Workspace.add_file: always force if OCRD_EXISTING_OUTPUT==OVERWRITE
  • 🔥 Processor.verify: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
  • 🔥 Processor.verify: check output fileGrps, too
    (must not exist unless OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP or disjoint --page-id range)
  • lib.bash input-files: do not try to validate tasks here (now covered by Processor.verify())
  • run_processor: be robust if ocrd_tool is missing steps
  • PcGtsType.PageType.id via make_xml_id: replace / with _

Added:

  • OcrdPage: new PageType.get_ReadingOrderGroups() to retrieve recursive RO as dict
  • ocrd.cli.workspace server: add subcommands reload and save
  • METS Server: export and delegate physical_pages
  • processor CLI: delegate --resolve-resource, too
  • Processor.process_page_file / OcrdPageResultImage: allow None besides AlternativeImageType

v3.0.0b4

02 Sep 09:37
@kba kba
Pre-release

Fixed:

  • Processor.metadata_location: src workaround respects namespace packages, qurator-spk/eynollah#134
  • Workspace.reload_mets: handle ClientSideOcrdMets as well

v3.0.0b3

30 Aug 13:44
@kba kba
Pre-release

Added:

  • OcrdConfig.reset_defaults to reset config variables to their defaults

v3.0.0b2

30 Aug 11:28
@kba kba
Pre-release

Added:

  • Processor.max_workers: class attribute to control per-page parallelism of this implementation
  • Processor.max_page_seconds: class attribute to control per-page timeout of this implementation
  • OCRD_MAX_PARALLEL_PAGES for whether and how many workers should process pages in parallel
  • OCRD_PROCESSING_PAGE_TIMEOUT for whether and how long processors should wait for single pages
  • OCRD_MAX_MISSING_OUTPUTS for the maximum rate (fraction) of pages with missing output before aborting (as with OCRD_MISSING_OUTPUT=ABORT)
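
To illustrate the idea behind OCRD_MAX_MISSING_OUTPUTS, here is a rough sketch of a page loop that tolerates failures up to a given fraction of all pages and aborts as soon as that fraction is exceeded. The functions too_many_failures and process_pages are hypothetical, and the exact comparison used by ocrd may differ. Only the notion of a failure rate, enabled by a value above zero, is taken from these notes.

```python
def too_many_failures(failed, total, max_rate):
    """True if the failures so far already exceed max_rate of all pages.

    Assumption: a max_rate of zero or below disables the check.
    """
    if max_rate <= 0 or total <= 0:
        return False
    return failed / total > max_rate


def process_pages(pages, process_one, max_rate):
    """Run process_one on each page; skip failures until the rate is exceeded."""
    done, failed = [], []
    for page in pages:
        try:
            done.append(process_one(page))
        except Exception:
            failed.append(page)
            # Check against the total page count, so the loop can stop
            # before the remaining pages are attempted.
            if too_many_failures(len(failed), len(pages), max_rate):
                raise RuntimeError(
                    f"{len(failed)} of {len(pages)} pages failed, aborting"
                )
    return done, failed


def flaky(page):
    # Hypothetical per-page function that fails on two pages.
    if page in (2, 3):
        raise ValueError(f"cannot process page {page}")
    return page * 10


done, failed = process_pages(list(range(10)), flaky, max_rate=0.5)
print(done, failed)
```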

Fixed:

  • disableLogging: also re-instate root logger to Python defaults

v3.0.0b1

26 Aug 09:30
@kba kba
Pre-release

Fixed:

  • actually apply CLI --log-filename
  • adapt to Pillow changes
  • ocrd workspace clone: do pass on --file-grp (for download filtering)

Changed:

  • 🔥 ocrd_utils, ocrd_models, ocrd_modelfactory, ocrd_validators and ocrd_network are not published
    as separate packages anymore, everything is contained in ocrd - you should adapt your requirements.txt accordingly
  • 🔥 Processor.parameter now a property (attribute always exists, but None for non-processing contexts)
  • 🔥 Processor.parameter is now a frozendict (contents immutable)
  • 🔥 Processor.parameter is validated whenever it is set, instead of only in the constructor
  • setting Processor.parameter will also trigger (Processor.shutdown() and) Processor.setup()
  • get_processor(... instance_caching=True): use min(max_instances, OCRD_MAX_PROCESSOR_CACHE)
  • 🔥 Processor.verify always validates fileGrp cardinalities (because we have ocrd-tool.json defaults now)
  • 🔥 OcrdMets.add_agent without positional arguments
  • ocrd bashlib input-files now uses normal Processor decorator, and gets passed actual ocrd-tool.json and tool name
    from bashlib's ocrd__wrap

Added:

  • Processor.metadata_filename: expose to make local path of ocrd-tool.json in Python distribution reusable+overridable
  • Processor.metadata_location: expose to make absolute path of ocrd-tool.json reusable+overridable
  • Processor.metadata_rawdict: expose to make in-memory contents of ocrd-tool.json reusable+overridable
  • Processor.metadata: expose to make validated and default-expanded contents of ocrd-tool.json reusable+overridable
  • Processor.shutdown: to shut down processor after processing, optional
  • Processor.max_instances: class attribute to control instance caching of this implementation

v2.68.0

23 Aug 11:28
@kba kba

Changed:

  • ocrd_network: Use ocrd-all-tool.json bundled by core instead of download from website, #1257, #1260
  • 🔥 ocrd network client processing processor renamed to ocrd network client processing run, #1269
  • ocrd network client processing run supports blocking behavior with --block by polling job status, #1265, #1269

Added:

  • ocrd network client workflow run: run a workflow on the processing server, optionally blocking, #1265, #1269
  • ocrd network client workflow check-status to get the status of a workflow job, #1269
  • ocrd network client processing check-status to get the status of a processing (processor) job, #1269
  • ocrd network client discovery processors to list the processors deployed in the processing server, #1269
  • ocrd network client discovery processor to get the ocrd-tool.json of a deployed processor, #1269
  • ocrd network client processing check-log to retrieve the log data for a processing job, #1269
  • Environment variables OCRD_NETWORK_CLIENT_POLLING_SLEEP and OCRD_NETWORK_CLIENT_POLLING_TIMEOUT to control polling interval and timeout for ocrd network client {processing processor,workflow run}, #1269
  • ocrd workspace clone/Resolver.workspace_from_url: with clobber_mets=False, raise a FileExistsError for existing mets.xml on disk, #563, #1268
  • ocrd workspace find --download: print the correct, up-to-date field, not None, #1202, #1266
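
The blocking client commands above poll the job status, with the interval and overall timeout controlled by OCRD_NETWORK_CLIENT_POLLING_SLEEP and OCRD_NETWORK_CLIENT_POLLING_TIMEOUT. A generic sketch of such a polling loop follows. The function poll_until_done, the status strings and the get_status callback are assumptions for illustration and do not reflect the actual client code or its job states.

```python
import time


def poll_until_done(get_status, sleep_seconds, timeout_seconds,
                    final_states=("SUCCESS", "FAILED"),
                    clock=time.monotonic, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a final state.

    Raises TimeoutError if no final state is seen within timeout_seconds.
    The state names are hypothetical placeholders.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in final_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"job still {status!r} after {timeout_seconds}s"
            )
        sleep(sleep_seconds)


# Usage with a fake status source instead of a real server request.
fake_statuses = iter(["QUEUED", "RUNNING", "SUCCESS"])
result = poll_until_done(lambda: next(fake_statuses),
                         sleep_seconds=0, timeout_seconds=5)
print(result)
```

Passing the clock and sleep functions as parameters keeps the loop testable without real waiting.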

Fixed:

  • Sanitize self.imageFilename for the pcGtsId to ensure it is a valid xml:id, #1271
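
As a rough illustration of what sanitizing a filename into a valid xml:id involves, the sketch below replaces path separators, drops characters outside a conservative ASCII subset, and prefixes the result if it would not start with a letter or underscore. The function to_xml_id is hypothetical and deliberately stricter than the XML name rules. It is not the implementation used in ocrd.

```python
import re


def to_xml_id(text):
    """Hypothetical helper: turn an arbitrary string into an ID-like token."""
    # Path separators and colons become underscores.
    cleaned = text.replace("/", "_").replace(":", "_")
    # Keep only a conservative ASCII subset of valid name characters.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "", cleaned)
    # An XML ID must not start with a digit, dot or hyphen.
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "id_" + cleaned
    return cleaned


print(to_xml_id("images/0001 page.tif"))
print(to_xml_id("0001.tif"))
```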