Commit

Merge pull request #870 from nishio/patch-indexing
Update index
GlenWeyl authored Apr 11, 2024
2 parents 7e492f8 + a2772c4 commit 15be113
Showing 8 changed files with 2,278 additions and 352 deletions.
13 changes: 5 additions & 8 deletions scripts/index/Plurality Book Indexing Exercise - Candidates.csv
@@ -438,7 +438,6 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,Zero Trust,02-02,cFQ
,⿻,03-00,cFQ
,⿻ Image,03-00,cFQ
-,⿻ 數位 Plurality,03-00,cFQ
,A Connected Society,03-00,cFQ
,Audrey Tang,03-00,cFQ
,Collective Organization,03-00,cFQ
@@ -470,7 +469,6 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,Towards a Connected Society,03-00,cFQ
,Universal Coded Character (unicode),03-00,cFQ
,Vulcans,03-00,cFQ
-,數位,03-00,cFQ
,⿻ Foundations,03-01,cFQ
,⿻ Perspective,03-01,cFQ
,⿻ Social Science,03-01,cFQ
@@ -839,7 +837,7 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,Web3,04-01,cFQ
,Web3 Communities,04-01,cFQ
,Worldcoin,04-01,cFQ
-,⿻ publics,04-02,tsuzumik
+,⿻ Publics,04-02,tsuzumik
,ActivityPub,04-02,tsuzumik
,Association,04-02,tsuzumik
,Blockchain,04-02,tsuzumik
@@ -1506,7 +1504,7 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,⿻ Book,05-07,cFQ
,⿻ Funding Across Boundaries,05-07,cFQ
,⿻ Funding Formula,05-07,cFQ
-,⿻ funding:,05-07,cFQ
+,⿻ Funding,05-07,cFQ
,⿻ Future,05-07,cFQ
,⿻ Governance,05-07,cFQ
,⿻ Group,05-07,cFQ
@@ -1785,8 +1783,8 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,White Collar,06-01,cFQ
,Work-life Balance,06-01,cFQ
,Workplace,06-01,cFQ
-,⿻ good,06-02,cFQ
-,⿻ mechanism,06-02,cFQ
+,⿻ Good,06-02,cFQ
+,⿻ Mechanism,06-02,cFQ
,⿻ Public,06-02,cFQ
,⿻ Vision,06-02,cFQ
,Adverse Selection,06-02,cFQ
@@ -1981,13 +1979,12 @@ Plurality Book Indexing Exercise,Keyword,Chapter,POC
,Water Management,06-04,cFQ
,Whanganui,06-04,cFQ
,Yamuna River,06-04,cFQ
-,⿻,07-00,cFQ
,⿻ Competence Education,07-00,cFQ
,⿻ Infrastructure,07-00,cFQ
,⿻ Taxes,07-00,cFQ
,⿻ Technologies,07-00,cFQ
,⿻ Technology,07-00,cFQ
-,⿻ voting,07-00,cFQ
+,⿻ Voting,07-00,cFQ
,A New World Order,07-00,cFQ
,Active Public Investment,07-00,cFQ
,AI Act,07-00,cFQ
9 changes: 9 additions & 0 deletions scripts/index/README.md
@@ -6,6 +6,8 @@
- `main.py`: outputs the keyword-to-page-number mapping to `keyword_occurrence.tsv`
- `index_with_claude.tsv`: merge Claude 3 Opus output and `keyword_occurrence.tsv`
- `index_for_manual_edit.tsv`: Copy of `index_with_claude.tsv` for manual edit
+- `index.md`: Sample index generated from `index_for_manual_edit.tsv` for visual verification


### memo
- Removed the "not found" keywords listed in "NotFound.csv". These were originally added by a human, were not found by the machine, and were then not found by additional human review either.
@@ -20,6 +22,13 @@ You are great editor of books. Here are index candidates for a book, find where
expected JSON format: {"<keyword>": "<page number or NaN>", ...}
```
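
A reply in this format could be loaded roughly as follows. This is a minimal sketch, not code from this commit: `parse_page_answers` is a hypothetical helper, the page numbers in the usage lines are invented, and it assumes each value is either a single page number or `NaN`.

```python
import json


def parse_page_answers(raw_json):
    """Turn a {"<keyword>": "<page number or NaN>"} reply into {keyword: int or None}."""
    answers = json.loads(raw_json)
    pages = {}
    for keyword, value in answers.items():
        text = str(value).strip()
        # Anything that is not a plain page number (e.g. "NaN") becomes None.
        pages[keyword] = int(text) if text.isdecimal() else None
    return pages


# Illustrative reply only; these page numbers are invented.
reply = '{"Zero Trust": "57", "Vulcans": "NaN"}'
print(parse_page_answers(reply))
```

Mapping `NaN` to `None` keeps "not found" entries easy to separate from real page numbers before merging with `keyword_occurrence.tsv`.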

+- After merging the data, output it to `index_with_claude.tsv`, then copy it to `index_for_manual_edit.tsv` for manual updates.
+- `index.md` was created to check ordering and visual appearance.
+- The location of `⿻`'s appearance was set as p.88, which is `3-0 What is ⿻?`. `⿻ 數位 Plurality` is on p.89, and `數位` is on pp.2 and 92. These are important concepts, but extraction from the PDF fails because they are included in the chapter titles.
+- `⿻ Publics` appears in the 4-2 section title (p.209) and also on pages 451, 461, and 480; all of these are OK. Note: `⿻ Public` is part of `⿻ Public Media`.
+- FIX of `ignore continuous pages`: During keyword extraction from the PDF, section titles are repeated every two pages, which causes an abundance of hits for keywords contained in those titles. To address this, we decided not to pick up a keyword on a page if it also appeared two pages earlier. This fix decreases the number of keywords with >5 occurrences from 91 to 54.
+- `(anti-)social media 71` is split into "Anti-social Media" and "Social Media". `(In)dividual identity 126, 129` is handled the same way.
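
The last two memo items can be sketched in Python. This is an illustration under stated assumptions rather than the logic in `main.py`: both function names are hypothetical, the hit pages 211 and 213 in the usage lines are invented, and the filter assumes a hit is compared against the raw hits two pages earlier (not against the hits that survive filtering).

```python
import re


def drop_repeated_title_hits(pages, gap=2):
    """Drop a page if the same keyword was also hit `gap` pages earlier.

    Assumption: the comparison uses the raw hits, so a run such as
    209, 211, 213 collapses to its first page.
    """
    hits = set(pages)
    return [page for page in sorted(hits) if page - gap not in hits]


def split_optional_prefix(keyword):
    """Split a keyword like "(anti-)social media" into its two readings."""
    match = re.match(r"^\(([^)]+)\)(.+)$", keyword)
    if match is None:
        return [keyword]  # no parenthesised prefix: leave the keyword as-is
    prefix, rest = match.groups()
    variants = [prefix + rest, rest]
    # Capitalise the first letter of each space-separated word.
    return [
        " ".join(word[:1].upper() + word[1:] for word in variant.split())
        for variant in variants
    ]


# Pages 211 and 213 are invented stand-ins for running-title hits.
print(drop_repeated_title_hits([209, 211, 213, 451, 461, 480]))
print(split_optional_prefix("(anti-)social media"))
```

Comparing against the raw hits matters here: if surviving hits were used instead, every other running-title page would be kept.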

## first step (~3/26)
- `Plurality Book Indexing Exercise - Main.csv`: raw file exported from [Spreadsheet](https://docs.google.com/spreadsheets/d/1gmyjFbErt_CW8-qLKChSpciLlCDGUhLriYFov0HO3qA/edit#gid=0)
- `step1.py`: outputs the POC count, the occurrences of each keyword in each section, and the occurrence counts