Merge pull request #866 from nishio/patch-indexing
Draft of index for manual edit created
GlenWeyl authored Apr 10, 2024
2 parents 5887559 + 4bcdb2b commit 5c8d080
Showing 16 changed files with 6,012 additions and 4,755 deletions.
130 changes: 19 additions & 111 deletions scripts/index/Plurality Book Indexing Exercise - Candidates.csv

Large diffs are not rendered by default.

33 changes: 22 additions & 11 deletions scripts/index/README.md
@@ -1,29 +1,40 @@
# Making Indexes

- `in.pdf`: input PDF, currently I used `release/latest` on 4/9 14:42 JST
## second step (4/9~)
- `in.pdf`: input PDF, currently the latest PDF from SharePoint, 4/10 11:30 JST (the previous version used `release/latest` from 4/9 14:42 JST)
- `from_pdf.py`: read PDF `in.pdf` and output JSON `book.json`
- `main.py`
- `main.py`: maps keywords to page numbers and writes them to `keyword_occurrence.tsv` (a sketch follows this list)
- `index_with_claude.tsv`: merge of the Claude 3 Opus output and `keyword_occurrence.tsv` (see the sketch after the `claude.json` listing below)
- `index_for_manual_edit.tsv`: copy of `index_with_claude.tsv` for manual editing
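For orientation, a minimal sketch of the `main.py` step, assuming `book.json` maps page numbers to the normalized page text produced by `from_pdf.py`; the keyword list and TSV layout here are illustrative, not the repository's exact implementation:

```python
# Sketch: look up each index keyword in book.json and write the
# matching page numbers as TSV. book.json is assumed to be
# {"<page number>": "<normalized page text>", ...}.
import json

with open("book.json", encoding="utf-8") as f:
    book = json.load(f)

keywords = ["Parliament of Things", "PageRank", "Pigouvian taxes"]  # illustrative

with open("keyword_occurrence.tsv", "w", encoding="utf-8") as out:
    for kw in keywords:
        pages = [page for page, text in book.items() if kw.lower() in text.lower()]
        out.write(kw + "\t" + ", ".join(pages) + "\n")
```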

### memo
- Removed the keywords marked "not found" in `NotFound.csv`; these were added by a human at some point, then not found by the machine, and then not found by an additional human check either.
- Tried removing the space after ⿻, but it did not improve the output.
- For example, "Advanced Research Projects Agency" is recorded as appearing in section 2-0, but in the latest PDF it is found in section 3-3. Restricting the search to the user-specified section should therefore not be the default behavior; it should be limited to cases where there are too many hits.
- For example, "Parliament of Things" is in the text but was not matched, because a newline adjacent to spaces produced a doubled space in the extracted text ("Parliament of" followed by two spaces and "Things"). Fixing this reduced the number of "not found" keywords from 242 to 79.
- In the PDF, straight quotation marks `"..."` are sometimes (not always) converted to curly quotes `\u201c...\u201d`. I removed the quotation marks before matching.
- Since the manuscript will keep being updated, human corrections should be kept to a minimum. I gave Claude 3 Opus a subset of the JSON converted from the PDF and had it locate the "not found" keywords for each section. After spot-checking some cases by eye it seemed to work well, so I adopted it. Prompt:

```
You are great editor of books. Here are index candidates for a book, find where it is (page number) or output "NaN".
expected JSON format: {"<keyword>": "<page number or NaN>", ...}
```
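For reference, sending this prompt to Claude 3 Opus might look roughly like the following sketch, assuming the `anthropic` Python SDK; the model ID, `max_tokens`, and the way keywords and section text are packed into the message are assumptions, not the exact script used:

```python
# Sketch: ask Claude 3 Opus to locate "not found" keywords in one
# section's text and parse its JSON answer. The prompt is the one
# quoted above; everything else (model ID, inputs) is assumed.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    'You are great editor of books. Here are index candidates for a book, '
    'find where it is (page number) or output "NaN".\n'
    'expected JSON format: {"<keyword>": "<page number or NaN>", ...}'
)

def locate_keywords(section_json: str, keywords: list[str]) -> dict[str, str]:
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"{PROMPT}\n\nKeywords: {keywords}\n\nSection: {section_json}",
        }],
    )
    return json.loads(message.content[0].text)
```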

## first step (~3/26)
- `Plurality Book Indexing Exercise - Main.csv`: raw file exported from [Spreadsheet](https://docs.google.com/spreadsheets/d/1gmyjFbErt_CW8-qLKChSpciLlCDGUhLriYFov0HO3qA/edit#gid=0)
- `step1.py`: output the POC count, the occurrences of each keyword in each section, and the count of occurrences (a sketch of the counting logic follows this list)
- `ignore.txt`: keywords that should be excluded from the machine search
- `case_sensitive.txt`: keywords that require a case-sensitive search (e.g. `ROC`, `BERT`, `UN`)
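A rough sketch of the counting logic described in this list; the file names match those above, but how section texts are loaded and matched is an assumption:

```python
# Sketch of step1.py's counting: for each keyword, count hits per
# section, skipping ignore.txt entries and honoring case_sensitive.txt.
# The source of the section texts is assumed; the real script may differ.
def load_lines(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

ignore = load_lines("ignore.txt")
case_sensitive = load_lines("case_sensitive.txt")
sections: dict[str, str] = {}  # e.g. {"2-0": "...section text...", ...}

def count_occurrences(keyword: str) -> dict[str, int]:
    """Return {section id: hit count} for one keyword."""
    if keyword in ignore:
        return {}
    counts = {}
    for section, text in sections.items():
        n = (text.count(keyword) if keyword in case_sensitive
             else text.lower().count(keyword.lower()))
        if n:
            counts[section] = n
    return counts
```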

## output
### output
- `contributors.tsv`: number of contributions on the spreadsheet
- `1_keyword_occurrence.tsv`: occurrence of each keyword in each section
- `section_occurrence.tsv`: number of occurrences of any keyword in each section, used to find less-covered sections
- `1_keyword_occurrence.tsv`: occurrence of each keyword in each section (renamed to `1_*` to avoid overwriting)
- `1_no_occurence.txt`: keywords that do not occur in the contents
- `1_too_many_occurrence.tsv`: keywords that occur in more than 5 sections
- `section_occurrence.tsv`: number of occurrences of any keyword in each section, used to find less-covered sections
- `similar_keywords.txt`: reports cases like `Neural network` vs. `Neural Network` (a sketch follows this list)
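One way to produce `similar_keywords.txt` is to group keywords case-insensitively and report groups with more than one spelling; a sketch (the grouping rule is an assumption):

```python
# Sketch: flag keywords that differ only by letter case, such as
# "Neural network" vs. "Neural Network", by grouping on casefold().
from collections import defaultdict

def find_similar(keywords: list[str]) -> list[list[str]]:
    groups = defaultdict(set)
    for kw in keywords:
        groups[kw.casefold()].add(kw)
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(find_similar(["Neural network", "Neural Network", "PageRank"]))
# -> [['Neural Network', 'Neural network']]
```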


## memo
### memo

- At a minimum, the movie title "her" needs special care (case-insensitive matching would hit the common pronoun everywhere).
- cFQ or cFQ2f7LRuLYP is the GitHub ID dedededalus; ref: https://github.com/dedededalus
7 changes: 1 addition & 6 deletions scripts/index/View.csv
@@ -19,17 +19,12 @@
,"""left-right"" divide",Left-Right Divide ,cFQ,
,parliament of things,Parliament Of Things,cFQ,
,"""good old-fashioned AI"" (GOFAI)",Good Old-fashioned AI (GOFAI),cFQ,
,Page Rank,PageRank,cFQ,
,"""Pigouvian"" taxes",Pigouvian taxes,cFQ,
,public welfare shcemes,Public Welfare Schemes,cFQ,
,"""competitive"" effect",Competitive Effect,cFQ,
,"""hunter-gatherer"" model",Hunter-gatherer Model,cFQ,
,Steve Jobs,"Jobs, Steve",cFQ,
,Antiretroviral Therapie,Antiretroviral Therapy,cFQ,
,⿻istic Ignorance,Pluralistic Ignorance,cFQ,
,co-creation relationship,Cocreation Relationship,cFQ,
,Ganga,Ganga River,cFQ,
,Whanganui River,Whanganui,cFQ,
,"""competitive authoritarian"" regimes",Competitive Authoritarian Regimes,cFQ?,
,National Socialist German Workers party,National Socialist German Workers (Nazi) party,cFQ?,
,"""public"" data repositories",Public Data Repositories,cFQ?,
@@ -39,4 +34,4 @@
,Herbert Simon,"Simon, Herbert",cFQ?,
,Edward Snowden,"Snowden, Edward",cFQ?,
,Cross-cultural exchanges,Cross-cultural exchange,cFQ?,
,non-fungible tokens (NFTs),Non-fungible Tokens (NFTs),nishio,"04-04, should be singular in index"
,non-fungible tokens (NFTs),Non-fungible Tokens (NFTs),nishio,"04-04, should be singular in index"
68 changes: 68 additions & 0 deletions scripts/index/claude.json
@@ -0,0 +1,68 @@
{
"African Model": "NaN",
"Post-gender": "26",
"Robert Atkinson": "NaN",
"Robinson, James": "5",
"Saudia Arabia": "20",
"Edward Deming": "57",
"Georgist land value tax": "56",
"Global Anti-colonial Movement": "52",
"Hu Shih": "58",
"Nixon's visit to PRC": "59",
"Kao Chia-liang": "64",
"Taiwan's Digital Civic Infrastructure": "64",
"Bidirectional Equilibrium": "NaN",
"Einstein's Theories Of Relativity": "101",
"Gödel's Theorem": "98",
"Intersectional Identity": "124",
"Social Dynamics": "NaN",
"web of group-affiliation": "123",
"WEIRD societies": "118",
"modernity": "118",
"Digital-native currencies": "NaN",
"National Socialist German Workers party": "165",
"Blockchain-centric Identity Systems": "197",
"Digital-native Identity Infrastructure": "NaN",
"Licenses For Use": "NaN",
"On the Internet, Nobody Knows You're A Dog": "182",
"De Tocqueville, Alexis": "210",
"Decentralized Social Networking Protocol (DSNP)": "219",
"Distributed Ledger Technology (DLT)": "220",
"Edward Snowden": "223",
"The Emperor's New Clothes": "214",
"Moore's Law": "258",
"Quick Fixes": "NaN",
"Uniform Resource Locator (URL)": "256",
"non-fungible tokens (NFTs)": "256",
"Dao": "281",
"Cross-Cultural Exchanges": "326",
"Co-edited Project": "NaN",
"Online collaboration platform": "337",
"Collective Response Model": "NaN",
"Community Notes (CN)": "347",
"LLM-based Representative": "362",
"Left-right Divide": "348",
"Computer-simulated Neuron": "374",
"United States": "383",
"Weighted-voting": "383",
"Private Community-based Sponsorship": "410",
"Self-ownership": "396",
"Cross-cutting Benefit": "NaN",
"Cross-pollination Service": "436",
"White Collar": "435",
"Public Good": "451",
"UNDP": "NaN",
"Common Carriers of Public Discussion": "470",
"Open Source Intelligence": "466",
"Life-support System": "475",
"Paul Jozef Crutzen": "475",
"co-creation relationship": "479",
"For-profit Industry": "NaN",
"For-profit Private Corporations": "497",
"Non-profit ⿻ Infrastructure": "498",
"Open Collective Foundation": "501",
"Open Source Community": "496",
"Open Source Ecosystem": "486",
"Open Source Models": "498",
"Open Source Technology": "495"
}
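For context, the merge that produces `index_with_claude.tsv` (described in the README above) could look roughly like this sketch; the two-column `keyword<TAB>pages` layout of `keyword_occurrence.tsv` is an assumption:

```python
# Sketch: overlay Claude's page numbers (claude.json, shown above)
# onto keyword_occurrence.tsv, filling in keywords the mechanical
# search missed. The TSV layout (keyword<TAB>pages) is assumed.
import csv
import json

with open("claude.json", encoding="utf-8") as f:
    claude = json.load(f)

rows = []
with open("keyword_occurrence.tsv", encoding="utf-8") as f:
    for keyword, pages in csv.reader(f, delimiter="\t"):
        if not pages and claude.get(keyword, "NaN") != "NaN":
            pages = claude[keyword]  # fall back to Claude's page number
        rows.append((keyword, pages))

with open("index_with_claude.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
```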
10 changes: 9 additions & 1 deletion scripts/index/from_pdf.py
@@ -22,7 +22,15 @@
# text_file.write(text) # Write the extracted text to the file

text = text.replace("-\n", "") # remove hyphenation
text = text.replace("\n", " ") # replace newlines with spaces

text = text.replace("\u201c", "") # remove quotation
text = text.replace("\u201d", "")
text = text.replace('"', "")

# replace newlines with spaces (sometimes there are spaces around the newline)
text = text.replace(" \n", " ")
text = text.replace("\n ", " ")
text = text.replace("\n", " ")
data[page_num + 1] = text

# Close the PDF document
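For context, the surrounding loop in `from_pdf.py` plausibly looks like the following PyMuPDF sketch; only the `replace()` lines are confirmed by the diff above, while the library choice, loop, and JSON dump are assumptions based on the visible lines:

```python
# Sketch of the from_pdf.py context around the diff above, assuming
# PyMuPDF (fitz). Only the replace() calls are confirmed by the diff.
import json
import fitz  # PyMuPDF

doc = fitz.open("in.pdf")
data = {}
for page_num, page in enumerate(doc):
    text = page.get_text()
    text = text.replace("-\n", "")     # remove hyphenation
    text = text.replace("\u201c", "")  # remove curly quotes
    text = text.replace("\u201d", "")
    text = text.replace('"', "")
    text = text.replace(" \n", " ")    # newlines (and stray spaces) -> one space
    text = text.replace("\n ", " ")
    text = text.replace("\n", " ")
    data[page_num + 1] = text          # 1-based page numbers

doc.close()
with open("book.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```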
