
This code provides consistent MARC record parsing for deduplication, in order to compare how humans, a machine learning deduplication algorithm, and an implementation of the GoldRush algorithm deduplicate MARC records.

The output of the current MarcRecord methods is intended to be human-readable and to serve as input for the machine learning deduplication algorithm. The GoldRush methods are intended to build a string for literal matching.

The implementation of the GoldRush algorithm is based on the Colorado Alliance MARC record match key generation, documented January 12, 2024.

Decisions

This application will provide two layers of normalization.

First layer of normalization - humans and machine learning algorithm

The first layer of normalization consists of selecting a subset of MARC fields and subfields for comparison by humans and by the machine learning algorithm.

This will include showing fields in the vernacular script when available. Since not everyone is familiar with every script, these fields will be presented in both transliterated and vernacular form. The vernacular form is more likely to be matched accurately by the machine learning algorithm and by humans who are familiar with that script. The transliterated form is more likely to be matched accurately by humans who are not familiar with the vernacular script.
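As a rough illustration of how a transliterated field can be shown next to its vernacular counterpart, the sketch below pairs fields through MARC subfield 6, which links a regular field (for example `880-01` on a 245) to its 880 alternate-script field (for example `245-01/$1`). The dictionary layout and the names `linkage` and `pair_vernacular` are hypothetical and exist only for this example; they are not the MarcRecord API, and unlinked 880 fields (occurrence `00`) are not handled.

```python
def linkage(field):
    """Return (linked_tag, occurrence) parsed from subfield 6, or None."""
    value = field["subfields"].get("6")
    if not value:
        return None
    # Subfield 6 looks like "880-01" or "245-01/$1": tag, hyphen, occurrence.
    linked_tag, _, rest = value.partition("-")
    return linked_tag, rest[:2]


def pair_vernacular(fields):
    """Pair each transliterated field with its linked 880 (vernacular) field."""
    vernacular = {}
    for field in fields:
        link = linkage(field)
        if field["tag"] == "880" and link:
            # Key by the tag the 880 stands in for, plus the occurrence number.
            vernacular[link] = field

    pairs = []
    for field in fields:
        if field["tag"] == "880":
            continue
        link = linkage(field)
        match = vernacular.get((field["tag"], link[1])) if link else None
        pairs.append(
            {
                "tag": field["tag"],
                "transliterated": field["subfields"].get("a", ""),
                "vernacular": match["subfields"].get("a", "") if match else None,
            }
        )
    return pairs


# Hypothetical, hand-written fields; real records would come from pymarc.
fields = [
    {"tag": "245", "subfields": {"6": "880-01", "a": "Sensō to heiwa"}},
    {"tag": "880", "subfields": {"6": "245-01/$1", "a": "戦争と平和"}},
    {"tag": "260", "subfields": {"a": "Tōkyō"}},
]
pairs = pair_vernacular(fields)
```

Here the 245 entry carries both forms, while the 260 entry has no linked 880 and so its vernacular value is `None`.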

Second layer of normalization - GoldRush algorithm

The second layer of normalization will be built on the first layer of normalization, and will be an interpretation of the GoldRush algorithm, intended for exact string matching.

To this end, string normalization in this layer will be much stricter. Only vernacular versions of fields will be preserved.

  • Some normalization strongly favors English-language texts, e.g.
    • Replacing English-language articles at the beginnings of titles
      • This also seems to duplicate the purpose of the 245 second indicator, which records the number of non-filing characters
    • Replacing '&' with 'and'
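The English-language bias noted above can be seen in a small sketch. This is not the GoldRush implementation in this repository: the article list, the function names, and the choice to simply drop a leading article are all assumptions made for illustration. The second function shows the alternative of trusting the 245 second indicator instead of a word list.

```python
import re

# Assumption: a small illustrative article list; the real list may differ.
ENGLISH_ARTICLES = ("a", "an", "the")


def normalize_title(title):
    """Lowercase, expand '&', strip punctuation, drop a leading English article."""
    text = title.lower().replace("&", " and ")
    words = re.sub(r"[^\w\s]", "", text).split()
    if len(words) > 1 and words[0] in ENGLISH_ARTICLES:
        words = words[1:]
    return " ".join(words)


def skip_nonfiling(title, second_indicator):
    """Drop leading characters counted by the 245 second indicator."""
    count = int(second_indicator) if second_indicator.isdigit() else 0
    return title[count:]


english = normalize_title("The Cat & the Hat")   # article dropped, '&' expanded
german = normalize_title("Die Zeit")             # German article is not recognized
by_indicator = skip_nonfiling("Die Zeit", "4")   # indicator handles any language
```

With a word list, "Die Zeit" keeps its article, whereas the indicator-based version drops it because the cataloger recorded four non-filing characters.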

Using the code

These steps are provisional and need to be refined.

  1. Start the Python interactive interpreter
python
  2. Import the needed libraries
from pymarc import parse_xml_to_array
from src.marc_record import MarcRecord
from src.gold_rush import GoldRush
  3. Create an object with example MARC records from MARC XML
all_records = parse_xml_to_array("tests/alma_marc_records.xml")
  4. Create a dictionary of an example record
new_record = MarcRecord(all_records[0])
new_record.to_dictionary()
  5. Create a GoldRush string of an example record
gr = GoldRush(new_record)
gr.as_gold_rush()
  6. Create a list of GoldRush strings
list_of_records = []
for record in all_records:
    mr = MarcRecord(record)
    gr = GoldRush(mr)
    list_of_records.append(gr.as_gold_rush())
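Because the GoldRush strings are meant for literal matching, one possible next step is to group records whose strings are identical. The sketch below is an assumption about how the list might be used, not part of this repository; `group_by_key` is a hypothetical helper and the keys are stand-in strings rather than real GoldRush output.

```python
from collections import defaultdict


def group_by_key(keys):
    """Map each match key to the list positions of the records that produced it."""
    groups = defaultdict(list)
    for position, key in enumerate(keys):
        groups[key].append(position)
    # Keep only keys shared by more than one record: the candidate duplicates.
    return {key: positions for key, positions in groups.items() if len(positions) > 1}


# Stand-in strings; in practice this would be list_of_records from the steps above.
example_keys = ["key-one", "key-two", "key-one"]
duplicates = group_by_key(example_keys)
```

The positions in each group index back into `all_records`, so candidate duplicates can be pulled up for human review.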

Developing this application

Set-up and install dependencies

  1. Make a .venv
python3 -m venv .venv
  2. Activate the environment
. .venv/bin/activate
  3. Install dependencies
pip install -r requirements/development.txt

Testing

pytest

Linting

  1. ruff - fast
  • Formatter - with the --check flag it only reports files that would change. Run without --check to reformat files in place
ruff format . --check
  • Linter
ruff check .
  2. pylint - slower, does more in-depth checks
  • Currently excluding the missing-docstring checks (C0114, C0115, C0116) - remove these disables once the documentation is added
pylint src tests --disable=C0114,C0115,C0116