Decide which files to index based on dokumentobjekt.format too #12
Comments
Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present. One possible "data driven" design would be like this:
For each file format id:
So the config file could look something like this (I don't remember the exact XPath syntax):
fileformats.conf:
Note 1: Currently I don't use XPath for node and node leaf matching; I use regexps. I always wanted to use XPath, but because the XML files can be several GB in size, I ended up using regexps as an optimisation. So this is a larger change, but it could also be solved using regexps.
[Ole Liabø]
Yes, this is a difficult topic... Consulting dokumentobjekt.format
makes sense, but if I understand the above it cannot be trusted to
always be present.
I believe the format field can be trusted to be present (it is
required), but its value is not very consistent across systems, so one
would have to accept many values for the same format.
fileformats.conf:
`fileFormatName=PDF`
`fileFormatPronom=pronom/...`
`fileFormatIdTool=file %FILE%`
`fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%`
`fileFormatViewer=evince %FILENAME%`
`fileFormatExtension=*.pdf`
I suspect a format entry would need to take a list of both format values
and extensions, if not using "magic numbers" to identify a file format.
For PDF, one would for example accept "pdf/a", "pdf", "PDF", "RA-PDF",
"fmt/95", "fmt/354" and probably many others. :)
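
To make the idea of a list-valued format entry concrete, here is a minimal Python sketch. It is not the indexer's actual code: the `FormatEntry` class and the `find_format` helper are hypothetical names invented for illustration, and the set of accepted values contains only the PDF values mentioned above.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class FormatEntry:
    """One hypothetical format entry holding lists instead of single values."""
    name: str
    format_values: frozenset  # accepted dokumentobjekt.format values, lowercased
    extensions: frozenset     # accepted file suffixes, lowercased, with leading dot

    def matches(self, format_value, filename):
        # Prefer the declared format value, compared case-insensitively.
        if format_value and format_value.strip().lower() in self.format_values:
            return True
        # Fall back to the file suffix when the format value is not recognized.
        return PurePath(filename).suffix.lower() in self.extensions


# Only the values named in this thread; real extractions likely need more.
PDF = FormatEntry(
    name="PDF",
    format_values=frozenset({"pdf/a", "pdf", "ra-pdf", "fmt/95", "fmt/354"}),
    extensions=frozenset({".pdf"}),
)


def find_format(entries, format_value, filename):
    """Return the first entry matching the format value or the suffix, else None."""
    for entry in entries:
        if entry.matches(format_value, filename):
            return entry
    return None
```

With this shape, `find_format([PDF], "RA-PDF", "dokument_0001")` would match on the format value even though the file name has no extension, while a file named `x.PDF` with an unknown format value would still match on its suffix.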
--
Happy hacking
Petter Reinholdtsen
At the moment the indexer decides which files to extract content from based on their file names. This assumes something about the content of dokumentobjekt.referanseDokumentfil that is not specified in Noark 5, and I have run into extractions where the file names did not include file extensions.
It would be better if the values in dokumentobjekt.format were consulted in addition to looking at file suffixes. According to Arkivverket, the values in this field are now standardized as PRONOM codes, so those values should at least be recognized.
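
As a rough illustration of the proposal, the following Python sketch selects files by consulting the format field first and the file suffix second. It is an assumption-laden sketch, not the indexer's implementation: `files_to_index`, `local_name`, and the two constant sets are hypothetical, the accepted values are only examples, and it assumes `format` and `referanseDokumentfil` appear as direct children of each `dokumentobjekt` element. It uses `xml.etree.ElementTree.iterparse` so the XML is read incrementally rather than loaded whole.

```python
import io
import xml.etree.ElementTree as ET

# Assumed, illustrative values only; a real list would be longer.
INDEXABLE_FORMATS = {"pdf", "pdf/a", "ra-pdf", "fmt/95", "fmt/354"}
INDEXABLE_SUFFIXES = (".pdf",)


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{uri}format' -> 'format'."""
    return tag.rsplit("}", 1)[-1]


def files_to_index(xml_source):
    """Yield referanseDokumentfil for each dokumentobjekt worth indexing."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "dokumentobjekt":
            continue
        # Collect the text of the direct children, keyed by local tag name.
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        filename = fields.get("referanseDokumentfil", "")
        fmt = fields.get("format", "").lower()
        # Accept on a recognized format value, or fall back to the suffix.
        if fmt in INDEXABLE_FORMATS or filename.lower().endswith(INDEXABLE_SUFFIXES):
            yield filename
        elem.clear()  # drop the processed subtree to reduce memory use
```

Called on an open file or an `io.BytesIO` object, the generator would yield a file such as `dokumenter/0001` when its format is `fmt/95`, even though the name carries no extension.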