Decide which files to index based on dokumentobjekt.format too #12

Open
petterreinholdtsen opened this issue Nov 7, 2019 · 2 comments

@petterreinholdtsen
Contributor

At the moment the indexer decides which files to extract content from based on their file names. This assumes something about the content of dokumentobjekt.referanseDokumentfil that is not specified in Noark 5, and I have run into extractions where the file names did not include file extensions.

It would be better if the values in dokumentobjekt.format were consulted in addition to the file suffixes. According to Arkivverket, the values in this field are now standardized as PRONOM codes, so those values should at least be recognized.
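As a rough illustration of the proposal, the check could accept a file when either the declared format or the suffix is recognized. This is only a sketch: `should_index`, `INDEXABLE_PRONOM` and `INDEXABLE_SUFFIXES` are hypothetical names, and the PRONOM identifiers listed are example values that should be verified against the PRONOM registry before use.

```python
from pathlib import PurePath

# Example PRONOM identifiers only (assumption); verify against the registry.
INDEXABLE_PRONOM = {"fmt/95", "fmt/354", "x-fmt/111"}
# Suffixes the indexer is assumed to handle today (illustrative).
INDEXABLE_SUFFIXES = {".pdf", ".txt"}


def should_index(file_ref, declared_format=None):
    """Return True if the declared format or the file suffix is recognized."""
    # Prefer the value from dokumentobjekt.format when one is present.
    if declared_format and declared_format.strip().lower() in INDEXABLE_PRONOM:
        return True
    # Fall back to the suffix of dokumentobjekt.referanseDokumentfil.
    return PurePath(file_ref).suffix.lower() in INDEXABLE_SUFFIXES


print(should_index("dokumenter/000123", "fmt/95"))  # no suffix, known format
print(should_index("dokumenter/000123"))            # no suffix, no format
```

With this shape, a file without an extension is still indexed as long as its format field carries a recognized code.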

@oleliabo
Collaborator

Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above correctly, it cannot be trusted to always be present.

One possible "data driven" design would be like this:

  • XPath query to find the document nodes
  • XPath query to find the file-format node belonging to each document node
  • If the file-format node is not defined or not found: a file-format identification tool to use instead

For each file format id:

  • text extraction tool
  • viewer tool
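The lookup order described above could be sketched roughly as follows. All names here (`FormatTools`, `resolve_format`, `TOOLS`) are hypothetical, the format key `"PDF"` is illustrative, and the identification tool is passed in as a plain callable so the sketch does not assume any particular external program.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FormatTools:
    """Per-format tools: one for text extraction, one for viewing."""
    extract_tool: str
    viewer_tool: str


# Hypothetical table keyed by format id; the entry mirrors the example config.
TOOLS = {
    "PDF": FormatTools("pdf2text %FILENAME% %OUTPUTFILE%", "evince %FILENAME%"),
}


def resolve_format(declared: Optional[str], path: str,
                   identify: Callable[[str], str]) -> str:
    """Use the declared format if present, else ask the identification tool."""
    declared = (declared or "").strip()
    if declared:
        return declared
    return identify(path)


# Usage: the format node is missing, so the (stubbed) id tool decides.
fmt = resolve_format(None, "dokumenter/000123", identify=lambda p: "PDF")
print(TOOLS[fmt].viewer_tool)
```

Keeping the identification step behind a callable means the fallback can be swapped or stubbed out without touching the lookup logic.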

So the config files could look something like this (I don't remember the exact XPath syntax):
noark-5.conf:
documentNode=//dokumentobjekt/referanseDokumentfil/text()
documentFileFormatNode=//dokumentobjekt/format/text()
documentNodeFileFormatTool=file %FILENAME%

fileformats.conf:
fileFormatName=PDF
fileFormatPronom=pronom/...
fileFormatIdTool=file %FILENAME%
fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
fileFormatViewer=evince %FILENAME%
fileFormatExtension=*.pdf
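To make the config idea a little more concrete, here is a minimal sketch of how such key=value files might be read and how the %FILENAME% and %OUTPUTFILE% placeholders might be expanded. `parse_conf` and `build_command` are hypothetical helpers, and skipping lines that start with `#` is an assumption, not something the proposal specifies.

```python
import shlex


def parse_conf(text):
    """Parse key=value lines; blank lines and '#' lines are skipped (assumption)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def build_command(template, **values):
    """Split the template into argv first, then substitute %NAME% placeholders.

    Substituting after splitting keeps a file name with spaces as one argument.
    """
    argv = []
    for token in shlex.split(template):
        for name, value in values.items():
            token = token.replace(f"%{name}%", value)
        argv.append(token)
    return argv


conf = parse_conf("fileFormatName=PDF\nfileFormatViewer=evince %FILENAME%\n")
print(build_command(conf["fileFormatViewer"], FILENAME="my file.pdf"))
```

The resulting list can be handed to `subprocess.run` without going through a shell, which avoids quoting problems with unusual file names.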

Note 1: Currently I don't use XPath for node and leaf matching; I use regexps. I always wanted to use XPath, but because the XML files can be several GB in size I ended up using regexps as an optimisation. So this is a larger change, but it could also be solved using regexps.
Note 2: Currently I rely on Qt to view files; it uses the OS default viewer for the format. This works nicely so far, but at some point it would be good to be able to override it...
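One possible middle ground between full XPath and regexps is a streaming XML parser, which never loads the whole file at once. The sketch below uses Python's `xml.etree.ElementTree.iterparse`; it is not how the indexer currently works, `iter_document_objects` is a hypothetical name, and it assumes `referanseDokumentfil` and `format` are direct children of `dokumentobjekt`.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def iter_document_objects(source):
    """Yield (referanseDokumentfil, format) for each dokumentobjekt element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "dokumentobjekt":
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        # A missing or empty format field is reported as None.
        yield fields.get("referanseDokumentfil"), fields.get("format") or None
        # Drop the children and text of the element we are done with.
        elem.clear()


sample = io.BytesIO(
    b"<arkiv><dokumentobjekt><format>fmt/95</format>"
    b"<referanseDokumentfil>dokumenter/0001</referanseDokumentfil>"
    b"</dokumentobjekt></arkiv>"
)
for file_ref, fmt in iter_document_objects(sample):
    print(file_ref, fmt)
```

This only clears the `dokumentobjekt` elements themselves; for files of several GB the surrounding elements would also need to be released as parsing proceeds, so memory use should be measured before relying on it.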

@petterreinholdtsen
Contributor Author

petterreinholdtsen commented Nov 13, 2019 via email
