This version has been trimmed of all licensed and copyrighted materials. As a result, the test data is not included in this repository, and commands may not run as expected without additional setup.
make -j $(nproc) target/dnb18.i5.xml SRC_DIR=test/resources/DNB YEARS=18
make -j $(nproc) i5 SRC_DIR=test/resources/DNB
Prerequisite: KorAP-XML-CoNLL-U
make -j $(nproc) target/dnb18.zip SRC_DIR=test/resources/DNB YEARS=18
make -j $(nproc) index
The index will be in target/dnb.index.
Adjust the following line in your korap4dnb-compose.yml to point to your index (by default it is in target/dnb.index, but it is better to copy it to a safe place):
- "${PWD}/target/dnb.index:/kustvakt/index:z"
and start the Docker containers:
docker compose -p korap4dnb --profile=lite -f korap4dnb-compose.yml up -d
To shut the instance down again:
docker compose -p korap4dnb down
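The advice to keep the index in a safe place could be followed with a small helper along these lines. This is only a sketch: the function name backup_index and the default destination under $HOME/korap are assumptions, not part of this repository, and it assumes the index is a directory.

```shell
# Hypothetical helper: copy a built index to a location outside the build tree.
# Both default paths are assumptions; adjust them to your setup.
backup_index() {
    src="${1:-target/dnb.index}"
    dest="${2:-$HOME/korap/dnb.index}"
    if [ ! -d "$src" ]; then
        echo "no index directory at $src (run 'make index' first)" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing $dest" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")"
    cp -a "$src" "$dest"    # -a preserves timestamps and permissions
    echo "$dest"            # print the path to use in the compose file
}
```

Calling backup_index without arguments copies target/dnb.index to the assumed default location and prints the new path, which would then replace ${PWD}/target/dnb.index in the volume line of korap4dnb-compose.yml.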
Install the prerequisite korap/conllu2treetagger and korap/conllu2spacy Docker images if they are not already present:
docker image inspect korap/conllu2treetagger:latest || curl -Ls 'https://gitlab.ids-mannheim.de/KorAP/CoNLL-U-Treetagger/-/jobs/artifacts/master/raw/conllu2treetagger.xz?job=build-docker-image' | docker load
docker image inspect korap/conllu2spacy:latest || curl -Ls https://corpora.ids-mannheim.de/tools/conllu2spacy.tar.xz | docker load
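Both commands above follow the same "inspect, otherwise download and load" pattern. As a rough sketch, the pattern could be wrapped in a function; the name ensure_image is hypothetical and not provided by this repository, and it additionally silences the inspect output.

```shell
# Hypothetical helper: load a Docker image from a URL unless it already exists locally.
ensure_image() {
    image="$1"
    url="$2"
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "$image already present"
    else
        # docker load reads a (possibly compressed) image tarball from stdin
        curl -Ls "$url" | docker load
    fi
}
```

With this in place, the second command above would read: ensure_image korap/conllu2spacy:latest https://corpora.ids-mannheim.de/tools/conllu2spacy.tar.xz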
Make annotations for dnb20:
make -j $(nproc) target/dnb20.marmot-malt.zip target/dnb20.spacy.zip target/dnb20.tree_tagger.zip
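If annotations are needed for more than one annual volume, the target names can be generated instead of typed out. The sketch below assumes that other years follow the same target/dnbYY.LAYER.zip naming as dnb20; the function name annotation_targets is hypothetical.

```shell
# Hypothetical helper: print the three annotation targets for each given two-digit year.
# Assumes every year uses the naming scheme shown for dnb20.
annotation_targets() {
    for yy in "$@"; do
        for layer in marmot-malt spacy tree_tagger; do
            printf 'target/dnb%s.%s.zip\n' "$yy" "$layer"
        done
    done
}
```

For a single year, make -j "$(nproc)" $(annotation_targets 20) is equivalent to the command above; further years would be passed as additional arguments.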
Build everything for KorAP, up to the deployable index:
make -j $(nproc) all
- 2024-05-26
  - extended genre classification based on metadata keywords
  - Saxon XSLT processor and license updated from 9 to 12.4
- 2024-05-08
  - added idno elements with all ids given by dnb SRU api
  - fixed bug with ambiguous (dnb-id/isbn) ids
  - basic genre classification based on metadata keywords
- 2024-04-19
  - SRC_DIR now defaults to the production sample!
  - ISBN number recognition should be fixed now
  - ignore faulty xhtml input files and conversion errors – just issue a warning
- 2024-04-15
  - added pass2 and pass3 to xslt conversion to …
    - fix div, p, hi, ref … nestings
    - remove empty elements
    - join subsequent hi elements
  - improved korapxml2krill performance by using all cores (-1 does not work here)
  - sanitized the Makefile and dropped YY variable, use YEARS instead
- 2024-04-10
  - multiple authors (and non-authors) are now correctly handled
  - some more .(x)html files are now dropped (toc, cover, etc.)
  - PRELIMINARY support for splitting everything into annual volumes
    - use make YY=22 to select 2022
    - does not yet work for the index!
- 2024-03-24
  - slow udpipe2 dropped
  - added marmot POS and morpho-syntactic annotations
  - added malt dependency annotations
- 2024-03-18
  - added make deploy to install new index and restart local KorAP@DNB instance (also available as ci target)
  - added show-server-logs and show-server-status make targets to monitor the local KorAP@DNB instance
- 2024-03-17
  - added make all to build all targets, including the index
- 2024-03-16
  - CI/CD pipeline added
  - first working pipeline for EPub ⮕ TEI I5 ⮕ KorAP-XML ⮕ (UDPipe+TreeTagger+Spacy) ⮕ Krill ⮕ KorAP-JSON
- 2024-03-15: DNB test data added