OPS Hackathon: Archive team problem definition notes

January 29th, 2019

Domain knowledge

Record series are developed when there is a good reason to develop them - they are relevant to Ontario’s history.

The archives team decides why certain topics are important, and then finds the related material
Rank documents based on relevance? Fei - that’s crazy

Want people to access archives database

Go to ‘Advance search options’

An office/business creates documents (for a specific business reason). At some point, the documents are sent to the archives to be stored.

Researchers would like to know what that business did, without

Organize it in different ways:

Context - where did they come from, which office, which ministry, why were they created, who created them, which other ministries use them
A given ministry creates different sets of documents - meeting minutes, policies, drug program, director’s office records - but all of these relate to the Ministry of Health
Groups of records
Item level

Records are usually one to one. A record can contain multiple files/items. Record - anything that is in a fixed format. Voicemail, cassette. It can be a collection or a single item.

F 4622 - Nelson Mandela Children’s Fund - unique identifier for a group of records

Title, date, description, doc, pdf, audio

Description: What information is within this group, what is in it, gives you the context, a description of the records (humans write this)

Creator - a separate database containing creator authorities

Every record is divided into series - each series id is associated with its parent - essentially a subset. Can go all the way to a single item.
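The parent/child structure described above can be sketched as a small tree. This is only an illustration, not the archives database schema: the `ArchiveNode` class, its field names, and the series and item reference codes below `F 4622` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ArchiveNode:
    """One level of description: a group of records, a series, or a single item."""
    ref_code: str                      # e.g. "F 4622" for a group of records
    title: str
    description: str = ""              # written by humans, per the notes
    # The parent link is excluded from repr/compare to avoid cycles.
    parent: Optional["ArchiveNode"] = field(default=None, repr=False, compare=False)
    children: List["ArchiveNode"] = field(default_factory=list)

    def add_child(self, child: "ArchiveNode") -> "ArchiveNode":
        """Attach a subset (series or item) to this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        """Reference codes from the top-level group down to this node."""
        codes = []
        node: Optional["ArchiveNode"] = self
        while node is not None:
            codes.append(node.ref_code)
            node = node.parent
        return list(reversed(codes))

    def walk(self) -> Iterator["ArchiveNode"]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical series and item codes, for illustration only.
group = ArchiveNode("F 4622", "Nelson Mandela Children's Fund")
series = group.add_child(ArchiveNode("F 4622-1", "Hypothetical series"))
item = series.add_child(ArchiveNode("F 4622-1-1", "Hypothetical item"))

print(item.path())
print([node.ref_code for node in group.walk()])
```

Keeping a link to the parent on every node means the context (which group, which series) can be recovered from any single item, which is the part researchers are said to need.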

Scope property of a series is the hardest to determine, done by humans

Problem: Getting terabytes of data pouring in

A need to protect individuals’ personal details - kids, sickness, lots of exemptions

Potential problems to solve

Relevant or irrelevant for archives
A lot of energy going into scope and label - automate it, suggest
Retrieval - researchers want to find why a politician took a certain action e.g. ‘why did the Wynne government …’
Need to exempt certain materials
Collection of digital photos
- Go through every single photo
- See what it is
- Sometimes you can’t do that - some series have too many photos to verify manually
- Filter out or note poorly tagged things, incorrectly tagged things

Developing the problem statement

Tags - terms or phrases: geographic names, subject names, terms

Determine if the document is actually what the collaborator says it is

Bottom line problem: Tagging on content and file name for search

Solutions:

What the model thinks are the most important keywords
Deep learning - not possible, training takes too long
Validate the tagging of a series (photos, text)
Don’t do images - too hard to define the object (Kathleen Wynne or random woman?), sticking to text
We have access to a validation set - a group of human generated tags
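One way the keyword and validation ideas above could fit together, without deep learning, is sketched below. It assumes TF-IDF as the keyword scorer, which is a choice made for this example and not something decided in the meeting. The function names, the stopword list, and the sample documents and tags are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List

# Tiny illustrative stopword list (an assumption, not a curated resource).
STOPWORDS = {"a", "an", "and", "by", "for", "from", "in", "of", "on", "the", "to", "with"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def suggest_tags(docs: Dict[str, str], top_k: int = 3) -> Dict[str, List[str]]:
    """Return the top_k single-word terms per document, ranked by TF-IDF."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq: Counter = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))  # count each term once per document
    n_docs = len(docs)

    suggestions = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        scores = {
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Ties are broken alphabetically so the output is deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        suggestions[doc_id] = ranked[:top_k]
    return suggestions


def tag_agreement(suggested: Iterable[str], human: Iterable[str]) -> Dict[str, float]:
    """Compare suggested tags with human-generated tags (exact match, case-insensitive)."""
    suggested_set = {s.lower() for s in suggested}
    human_set = {h.lower() for h in human}
    overlap = suggested_set & human_set
    precision = len(overlap) / len(suggested_set) if suggested_set else 0.0
    recall = len(overlap) / len(human_set) if human_set else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical documents; file names could be appended to the text before scoring.
docs = {
    "minutes-001": "hospital funding hospital budget minutes",
    "minutes-002": "highway construction minutes",
}
suggested = suggest_tags(docs, top_k=2)
print(suggested["minutes-001"])
print(tag_agreement(suggested["minutes-001"], {"hospital", "health"}))
```

A series whose agreement with the human tags is low could then be flagged for manual review. Note that this sketch only handles single-word tags; multi-word phrases such as geographic names would need a phrase-aware tokenizer.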
