clean up notebook
go to larger corpus
characterize different kinds of files and resources
Filter out Tableaux and other files
Filter out tokens that aren't terribly helpful.
- d7 - matches subset of file name but not clear where it comes out in tokens
- t6, t7, a80, t12 - don't seem to show up in file names
- write a routine that will grab the tokens for something that matches
- ok. parsing is making a mess. look at that.
- 'cti' is another problem
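A rough sketch of the token-grabbing routine mentioned above. The tokenizer here is hypothetical (lowercase, split on anything non-alphanumeric) and stands in for whatever the notebook's parser really does; `tokens_for_matches` and its parameters are made-up names for illustration.

```python
import re

# Assumption: a stand-in tokenizer, not the notebook's actual parser.
_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def tokenize(name):
    """Lowercase a file name and split it on non-alphanumeric characters."""
    return [t for t in _SPLIT.split(name.lower()) if t]


def tokens_for_matches(names, pattern, tokenizer=tokenize):
    """Return {file name: tokens} for every name the regex pattern matches."""
    rx = re.compile(pattern)
    return {name: tokenizer(name) for name in names if rx.search(name)}


names = [
    "Sheet_48_Benewah_2022-05-06_212304.csv",
    "illinois_demo_race05_20_2022.csv",
]
matches = tokens_for_matches(names, r"Benewah")
```

Swapping in the real parser via the `tokenizer` argument would show where a fragment like d7 ends up once a matching file name is tokenized.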
make notes of file name patterns that we have pulled out.
work on nnmf
- leave as is. zip codes are ok. we'll just have to characterize those.
- 'illinois_demo_race05_20_2022.csv',
- 'illinois_demo_gender05_21_2022.csv',
- 'county_historical_cases_2022-05-10_162303.cs
- 06 07 05 03 'Sheet_48_Benewah_2022-05-06_212304.csv'
- push pre-compile all regex
- filter out time stamp patterns
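One way the last two items could look, as a sketch only. The two patterns are guesses based solely on the file names listed above (`YYYY-MM-DD_HHMMSS` and `MM_DD_YYYY`); `TIMESTAMP_PATTERNS` and `strip_timestamps` are hypothetical names, and other timestamp layouts in the larger corpus would need their own patterns.

```python
import re

# Assumption: patterns inferred from the example file names in these notes.
# Compiled once at import time so they are not rebuilt for every file name.
TIMESTAMP_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}_\d{6}"),  # e.g. 2022-05-06_212304
    re.compile(r"\d{2}_\d{2}_\d{4}"),        # e.g. 05_20_2022
]


def strip_timestamps(name):
    """Remove every timestamp-like substring from a file name."""
    for pattern in TIMESTAMP_PATTERNS:
        name = pattern.sub("", name)
    return name


cleaned = [
    strip_timestamps("illinois_demo_race05_20_2022.csv"),
    strip_timestamps("illinois_demo_gender05_21_2022.csv"),
    strip_timestamps("Sheet_48_Benewah_2022-05-06_212304.csv"),
]
```

Running this before tokenizing should keep date fragments such as 06, 07, 05, 03 out of the token list, while zip codes are left alone as noted above.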