Merge pull request #866 from nishio/patch-indexing
Draft of index for manual edit created
GlenWeyl authored Apr 10, 2024
2 parents 5887559 + 4bcdb2b commit 5c8d080
Showing 16 changed files with 6,012 additions and 4,755 deletions.
130 changes: 19 additions & 111 deletions scripts/index/Plurality Book Indexing Exercise - Candidates.csv

Large diffs are not rendered by default.

33 changes: 22 additions & 11 deletions scripts/index/README.md
@@ -1,29 +1,40 @@
# Making Indexes

- `in.pdf`: input PDF, currently I used `release/latest` on 4/9 14:42 JST
## second step (4/9~)
- `in.pdf`: input PDF, currently the latest PDF from SharePoint, 4/10 11:30 JST (the previous version used `release/latest` from 4/9 14:42 JST)
- `from_pdf.py`: read PDF `in.pdf` and output JSON `book.json`
- `main.py`
- `main.py`: maps keywords to page numbers and writes them to `keyword_occurrence.tsv` (a sketch follows this list)
- `index_with_claude.tsv`: merge of the Claude 3 Opus output and `keyword_occurrence.tsv` (see the sketch after the `claude.json` listing below)
- `index_for_manual_edit.tsv`: copy of `index_with_claude.tsv` for manual editing
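For orientation, a minimal sketch of the `main.py` step, assuming `book.json` maps page numbers to the normalized page text produced by `from_pdf.py`; the keyword list and TSV layout here are illustrative, not the repository's exact implementation:

```python
# Sketch: look up each index keyword in book.json and write the
# matching page numbers as TSV. book.json is assumed to be
# {"<page number>": "<normalized page text>", ...}.
import json

with open("book.json", encoding="utf-8") as f:
    book = json.load(f)

keywords = ["Parliament of Things", "PageRank", "Pigouvian taxes"]  # illustrative

with open("keyword_occurrence.tsv", "w", encoding="utf-8") as out:
    for kw in keywords:
        pages = [page for page, text in book.items() if kw.lower() in text.lower()]
        out.write(kw + "\t" + ", ".join(pages) + "\n")
```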

### memo
- Removed the keywords marked "not found" in `NotFound.csv`; these were added by a human at some point, then not found by the machine, and then not found by an additional human check either.
- Tried removing the space after ⿻, but it did not improve the output.
- For example, "Advanced Research Projects Agency" is recorded as appearing in section 2-0, but in the latest PDF it is found in section 3-3. Restricting the search to the user-specified section should therefore not be the default behavior; it should be limited to cases where there are too many hits.
- For example, "Parliament of Things" is in the text but was not matched, because a newline adjacent to spaces produced a doubled space in the extracted text ("Parliament of" followed by two spaces and "Things"). Fixing this reduced the number of "not found" keywords from 242 to 79.
- In the PDF, straight quotation marks `"..."` are sometimes (not always) converted to curly quotes `\u201c...\u201d`. I removed the quotation marks before matching.
- Since the manuscript will keep being updated, human corrections should be kept to a minimum. I gave Claude 3 Opus a subset of the JSON converted from the PDF and had it locate the "not found" keywords for each section. After spot-checking some cases by eye it seemed to work well, so I adopted it. Prompt:

```
You are great editor of books. Here are index candidates for a book, find where it is (page number) or output "NaN".
expected JSON format: {"<keyword>": "<page number or NaN>", ...}
```
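For reference, sending this prompt to Claude 3 Opus might look roughly like the following sketch, assuming the `anthropic` Python SDK; the model ID, `max_tokens`, and the way keywords and section text are packed into the message are assumptions, not the exact script used:

```python
# Sketch: ask Claude 3 Opus to locate "not found" keywords in one
# section's text and parse its JSON answer. The prompt is the one
# quoted above; everything else (model ID, inputs) is assumed.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    'You are great editor of books. Here are index candidates for a book, '
    'find where it is (page number) or output "NaN".\n'
    'expected JSON format: {"<keyword>": "<page number or NaN>", ...}'
)

def locate_keywords(section_json: str, keywords: list[str]) -> dict[str, str]:
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"{PROMPT}\n\nKeywords: {keywords}\n\nSection: {section_json}",
        }],
    )
    return json.loads(message.content[0].text)
```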

## first step (~3/26)
- `Plurality Book Indexing Exercise - Main.csv`: raw file exported from [Spreadsheet](https://docs.google.com/spreadsheets/d/1gmyjFbErt_CW8-qLKChSpciLlCDGUhLriYFov0HO3qA/edit#gid=0)
- `step1.py`: output the POC count, the occurrences of each keyword in each section, and the count of occurrences (a sketch of the counting logic follows this list)
- `ignore.txt`: keywords that should be excluded from the machine search
- `case_sensitive.txt`: keywords that require a case-sensitive search (e.g. `ROC`, `BERT`, `UN`)
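A rough sketch of the counting logic described in this list; the file names match those above, but how section texts are loaded and matched is an assumption:

```python
# Sketch of step1.py's counting: for each keyword, count hits per
# section, skipping ignore.txt entries and honoring case_sensitive.txt.
# The source of the section texts is assumed; the real script may differ.
def load_lines(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

ignore = load_lines("ignore.txt")
case_sensitive = load_lines("case_sensitive.txt")
sections: dict[str, str] = {}  # e.g. {"2-0": "...section text...", ...}

def count_occurrences(keyword: str) -> dict[str, int]:
    """Return {section id: hit count} for one keyword."""
    if keyword in ignore:
        return {}
    counts = {}
    for section, text in sections.items():
        n = (text.count(keyword) if keyword in case_sensitive
             else text.lower().count(keyword.lower()))
        if n:
            counts[section] = n
    return counts
```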

## output
### output
- `contributors.tsv`: number of contributions on the spreadsheet
- `1_keyword_occurrence.tsv`: occurrence of each keyword in each section
- `section_occurrence.tsv`: number of occurrences of any keyword in each section, used to find less-covered sections
- `1_keyword_occurrence.tsv`: occurrence of each keyword in each section (renamed to `1_*` to avoid overwriting)
- `1_no_occurence.txt`: keywords that do not occur in the contents
- `1_too_many_occurrence.tsv`: keywords that occur in more than 5 sections
- `section_occurrence.tsv`: number of occurrences of any keyword in each section, used to find less-covered sections
- `similar_keywords.txt`: reports cases like `Neural network` vs. `Neural Network` (a sketch follows this list)
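One way to produce `similar_keywords.txt` is to group keywords case-insensitively and report groups with more than one spelling; a sketch (the grouping rule is an assumption):

```python
# Sketch: flag keywords that differ only by letter case, such as
# "Neural network" vs. "Neural Network", by grouping on casefold().
from collections import defaultdict

def find_similar(keywords: list[str]) -> list[list[str]]:
    groups = defaultdict(set)
    for kw in keywords:
        groups[kw.casefold()].add(kw)
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(find_similar(["Neural network", "Neural Network", "PageRank"]))
# -> [['Neural Network', 'Neural network']]
```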


## memo
### memo

- At a minimum, the movie title "her" needs special care (case-insensitive matching would hit the common pronoun everywhere).
- cFQ or cFQ2f7LRuLYP is the GitHub ID dedededalus; ref: https://github.com/dedededalus
7 changes: 1 addition & 6 deletions scripts/index/View.csv
@@ -19,17 +19,12 @@
,"""left-right"" divide",Left-Right Divide ,cFQ,
,parliament of things,Parliament Of Things,cFQ,
,"""good old-fashioned AI"" (GOFAI)",Good Old-fashioned AI (GOFAI),cFQ,
,Page Rank,PageRank,cFQ,
,"""Pigouvian"" taxes",Pigouvian taxes,cFQ,
,public welfare shcemes,Public Welfare Schemes,cFQ,
,"""competitive"" effect",Competitive Effect,cFQ,
,"""hunter-gatherer"" model",Hunter-gatherer Model,cFQ,
,Steve Jobs,"Jobs, Steve",cFQ,
,Antiretroviral Therapie,Antiretroviral Therapy,cFQ,
,⿻istic Ignorance,Pluralistic Ignorance,cFQ,
,co-creation relationship,Cocreation Relationship,cFQ,
,Ganga,Ganga River,cFQ,
,Whanganui River,Whanganui,cFQ,
,"""competitive authoritarian"" regimes",Competitive Authoritarian Regimes,cFQ?,
,National Socialist German Workers party,National Socialist German Workers (Nazi) party,cFQ?,
,"""public"" data repositories",Public Data Repositories,cFQ?,
@@ -39,4 +34,4 @@
,Herbert Simon,"Simon, Herbert",cFQ?,
,Edward Snowden,"Snowden, Edward",cFQ?,
,Cross-cultural exchanges,Cross-cultural exchange,cFQ?,
,non-fungible tokens (NFTs),Non-fungible Tokens (NFTs),nishio,"04-04, should be singular in index"
,non-fungible tokens (NFTs),Non-fungible Tokens (NFTs),nishio,"04-04, should be singular in index"
68 changes: 68 additions & 0 deletions scripts/index/claude.json
@@ -0,0 +1,68 @@
{
"African Model": "NaN",
"Post-gender": "26",
"Robert Atkinson": "NaN",
"Robinson, James": "5",
"Saudia Arabia": "20",
"Edward Deming": "57",
"Georgist land value tax": "56",
"Global Anti-colonial Movement": "52",
"Hu Shih": "58",
"Nixon's visit to PRC": "59",
"Kao Chia-liang": "64",
"Taiwan's Digital Civic Infrastructure": "64",
"Bidirectional Equilibrium": "NaN",
"Einstein's Theories Of Relativity": "101",
"Gödel's Theorem": "98",
"Intersectional Identity": "124",
"Social Dynamics": "NaN",
"web of group-affiliation": "123",
"WEIRD societies": "118",
"modernity": "118",
"Digital-native currencies": "NaN",
"National Socialist German Workers party": "165",
"Blockchain-centric Identity Systems": "197",
"Digital-native Identity Infrastructure": "NaN",
"Licenses For Use": "NaN",
"On the Internet, Nobody Knows You're A Dog": "182",
"De Tocqueville, Alexis": "210",
"Decentralized Social Networking Protocol (DSNP)": "219",
"Distributed Ledger Technology (DLT)": "220",
"Edward Snowden": "223",
"The Emperor's New Clothes": "214",
"Moore's Law": "258",
"Quick Fixes": "NaN",
"Uniform Resource Locator (URL)": "256",
"non-fungible tokens (NFTs)": "256",
"Dao": "281",
"Cross-Cultural Exchanges": "326",
"Co-edited Project": "NaN",
"Online collaboration platform": "337",
"Collective Response Model": "NaN",
"Community Notes (CN)": "347",
"LLM-based Representative": "362",
"Left-right Divide": "348",
"Computer-simulated Neuron": "374",
"United States": "383",
"Weighted-voting": "383",
"Private Community-based Sponsorship": "410",
"Self-ownership": "396",
"Cross-cutting Benefit": "NaN",
"Cross-pollination Service": "436",
"White Collar": "435",
"Public Good": "451",
"UNDP": "NaN",
"Common Carriers of Public Discussion": "470",
"Open Source Intelligence": "466",
"Life-support System": "475",
"Paul Jozef Crutzen": "475",
"co-creation relationship": "479",
"For-profit Industry": "NaN",
"For-profit Private Corporations": "497",
"Non-profit ⿻ Infrastructure": "498",
"Open Collective Foundation": "501",
"Open Source Community": "496",
"Open Source Ecosystem": "486",
"Open Source Models": "498",
"Open Source Technology": "495"
}
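For context, the merge that produces `index_with_claude.tsv` (described in the README above) could look roughly like this sketch; the two-column `keyword<TAB>pages` layout of `keyword_occurrence.tsv` is an assumption:

```python
# Sketch: overlay Claude's page numbers (claude.json, shown above)
# onto keyword_occurrence.tsv, filling in keywords the mechanical
# search missed. The TSV layout (keyword<TAB>pages) is assumed.
import csv
import json

with open("claude.json", encoding="utf-8") as f:
    claude = json.load(f)

rows = []
with open("keyword_occurrence.tsv", encoding="utf-8") as f:
    for keyword, pages in csv.reader(f, delimiter="\t"):
        if not pages and claude.get(keyword, "NaN") != "NaN":
            pages = claude[keyword]  # fall back to Claude's page number
        rows.append((keyword, pages))

with open("index_with_claude.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
```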
10 changes: 9 additions & 1 deletion scripts/index/from_pdf.py
@@ -22,7 +22,15 @@
# text_file.write(text) # Write the extracted text to the file

text = text.replace("-\n", "") # remove hyphenation
text = text.replace("\n", " ") # replace newlines with spaces

text = text.replace("\u201c", "") # remove quotation
text = text.replace("\u201d", "")
text = text.replace('"', "")

# replace newlines with spaces (sometimes there are spaces around the newline)
text = text.replace(" \n", " ")
text = text.replace("\n ", " ")
text = text.replace("\n", " ")
data[page_num + 1] = text

# Close the PDF document
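For context, the surrounding loop in `from_pdf.py` plausibly looks like the following PyMuPDF sketch; only the `replace()` lines are confirmed by the diff above, while the library choice, loop, and JSON dump are assumptions based on the visible lines:

```python
# Sketch of the from_pdf.py context around the diff above, assuming
# PyMuPDF (fitz). Only the replace() calls are confirmed by the diff.
import json
import fitz  # PyMuPDF

doc = fitz.open("in.pdf")
data = {}
for page_num, page in enumerate(doc):
    text = page.get_text()
    text = text.replace("-\n", "")     # remove hyphenation
    text = text.replace("\u201c", "")  # remove curly quotes
    text = text.replace("\u201d", "")
    text = text.replace('"', "")
    text = text.replace(" \n", " ")    # newlines (and stray spaces) -> one space
    text = text.replace("\n ", " ")
    text = text.replace("\n", " ")
    data[page_num + 1] = text          # 1-based page numbers

doc.close()
with open("book.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```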
