Scrapes artists and metadata from Spotify, and lyrics from lyrics.wikia.com. Processes lyrics with spell checking and pronunciation lookup, including for slang and other non-dictionary words (the hardest part). Builds and trains RNNs that combine word-level (via embeddings), character-level, pronunciation, syllabic-stress, and meta-contextual data. Generates new lyrics from a seed.
Scrapes metadata about artists on Spotify. Starts with a seed list of artists, then recursively finds related artists. Saves to MongoDB. Why is this written in JavaScript/CoffeeScript and not Python like the rest of the project? I don't know. I wrote this part 2 years ago and forgot. It might have been easier to initialize Mongo models with Mongoose than with PyMongo.
Files:
- scraper.coffee
- run_scraper.js
- artist.coffee
- genre.coffee
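The crawl described above (a seed list, then related artists, recursively) can be sketched in Python as a breadth-first traversal. This is an illustration only, not the code in scraper.coffee: `fetch_related` is a hypothetical callback standing in for whatever Spotify request the scraper makes, and the `max_artists` cap is an assumption added to keep the crawl bounded.

```python
from collections import deque


def crawl_artists(seeds, fetch_related, max_artists=1000):
    """Breadth-first crawl outward from a seed list of artist names.

    fetch_related is a hypothetical callable mapping an artist name to a
    list of related artist names (a stand-in for the real Spotify lookup).
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue and len(visited) < max_artists:
        artist = queue.popleft()
        visited.append(artist)  # a real scraper would save the artist here
        for related in fetch_related(artist):
            if related not in seen:  # skip artists already queued or visited
                seen.add(related)
                queue.append(related)
    return visited


# Toy in-memory "related artists" graph, used instead of the network.
toy_graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
found = crawl_artists(["A"], lambda name: toy_graph.get(name, []))
print(found)  # ['A', 'B', 'C', 'D']
```

The `seen` set is what keeps the recursion from looping when two artists list each other as related.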
Generates histograms of various features of the data collected from Spotify, for instance the number of followers or the number of plays.
Files:
- db_viz.py
Using the names of artists found on Spotify via the Genre/Artist Scraper, scrapes lyrics.wikia.com for all their lyrics. Also does initial preprocessing of the lyrics to find the names of the artists on each verse, and metadata such as the song name, album, and year. Saves lyrics to MongoDB.
Files:
- scraper.py
Utility tool to quickly annotate pronunciations of slang words. Breaks words into syllables, and pattern-matches syllables from new words against existing syllabic annotations. Handles plurals and other forms of words.
Files:
- find_pronunciations.py
Tries to replace misspelled words. Tries swapping, adding, and removing vowels, and adding "g" to words ending in "in" (to restore "ing"). Then searches for the most probable word (by frequency in English or in a given corpus). Finding words that are 1 or 2 away by edit distance is also implemented, but I found that it produced too many erroneous corrections.
Files:
- spell_checker.py
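A minimal Python sketch of the correction strategy described above: generate respellings by editing vowels and restoring a final "g", then keep the most frequent known word. It is not the code in spell_checker.py. The function names and the toy `WORD_FREQ` table are assumptions made for illustration, and a real frequency table would come from English or from the lyrics corpus.

```python
VOWELS = "aeiou"


def candidates(word):
    """Generate respellings by editing vowels and restoring a final 'g'."""
    results = set()
    for i, ch in enumerate(word):
        if ch in VOWELS:
            # Remove this vowel.
            results.add(word[:i] + word[i + 1:])
            # Swap this vowel for each other vowel.
            for v in VOWELS:
                if v != ch:
                    results.add(word[:i] + v + word[i + 1:])
    # Add a vowel at every position, including the end.
    for i in range(len(word) + 1):
        for v in VOWELS:
            results.add(word[:i] + v + word[i:])
    # Turn a trailing "in" into "ing" (e.g. "runnin" -> "running").
    if word.endswith("in"):
        results.add(word + "g")
    return results


def correct(word, word_freq):
    """Return the most frequent known candidate, or the word unchanged."""
    if word in word_freq:
        return word
    known = [c for c in candidates(word) if c in word_freq]
    if not known:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(known, key=lambda c: (word_freq[c], c))


# Toy frequency table (assumed values, for illustration only).
WORD_FREQ = {"running": 120, "money": 300, "love": 500, "live": 450}

print(correct("runnin", WORD_FREQ))  # running
print(correct("mony", WORD_FREQ))    # money
print(correct("lave", WORD_FREQ))    # love
```

Restricting edits to vowels and the "ing" ending keeps the candidate set much smaller than a general edit-distance search, which fits the observation above that edit distance produced too many erroneous corrections.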
Generates text files from lyrics in MongoDB. Tokenizes words using StanfordTokenizer, then looks up a pronunciation for each word using a combination of SpellChecker, the CMU Pronouncing Dictionary, and manually annotated pronunciations.
Files:
- generate_lyric_files.py
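One way the pronunciation lookup could combine its three sources is a simple fallback chain, sketched below. The order (manual annotations, then the CMU dictionary, then a spell-corrected retry) is an assumption, since the description above does not specify it. The dictionaries are toy stand-ins, the regex tokenizer replaces StanfordTokenizer so the example runs without Java, and the "finna" entry is an invented annotation.

```python
import re

# Toy stand-ins for the CMU dictionary and the hand-annotated slang file.
CMU = {"money": "M AH1 N IY0", "getting": "G EH1 T IH0 NG"}
MANUAL = {"finna": "F IH1 N AH0"}  # invented annotation, for illustration


def tokenize(line):
    """Crude regex tokenizer (a stand-in for StanfordTokenizer)."""
    return re.findall(r"[a-z']+", line.lower())


def lookup(word):
    """Check manual annotations first, then the CMU stand-in."""
    for source in (MANUAL, CMU):
        if word in source:
            return source[word]
    return None


def pronounce(word, spell_correct=lambda w: w):
    """Look the word up directly, then retry after spell correction."""
    found = lookup(word)
    if found is None:
        fixed = spell_correct(word)
        if fixed != word:
            found = lookup(fixed)
    return found  # None means the word still needs manual annotation


def annotate(line, spell_correct=lambda w: w):
    """Pair every token in a lyric line with its pronunciation (or None)."""
    return [(tok, pronounce(tok, spell_correct)) for tok in tokenize(line)]


# Hypothetical spell corrector backed by a fixed table.
fixes = {"gettin": "getting"}
pairs = annotate("Finna be gettin money", lambda w: fixes.get(w, w))
for token, pron in pairs:
    print(token, pron)
```

Tokens that come back with `None` ("be" in this toy run, because the stand-in dictionary is tiny) are the ones a tool like find_pronunciations.py would be used to annotate.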
Unfinished module to find which words have word embeddings in GloVe or Word2Vec.
Files:
- find_word_vectors.py
Generates the actual input data for the deep network. Pulls phonemes and stresses from pronunciations, separates words into characters, and generates context vectors for the metadata (e.g. artist, genre, album, year). Metadata is saved via rapper_matrix.py.
Files:
- extract_features.py
- rapper_matrix.py
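The per-word features described above can be sketched as follows, assuming pronunciations are stored in the CMU dictionary's ARPAbet format, where each vowel carries a trailing stress digit (0, 1, or 2). This is not the code in extract_features.py: the function names are hypothetical, and one-hot encoding of metadata is an assumption, since the actual context-vector layout is not described here.

```python
def split_pronunciation(pron):
    """Split an ARPAbet string into phonemes and per-vowel stress digits."""
    phonemes, stresses = [], []
    for symbol in pron.split():
        if symbol[-1].isdigit():
            # Vowel: strip the stress digit and record it separately.
            phonemes.append(symbol[:-1])
            stresses.append(int(symbol[-1]))
        else:
            # Consonant: no stress marker.
            phonemes.append(symbol)
    return phonemes, stresses


def one_hot(value, vocabulary):
    """One-hot encode a metadata value against a fixed vocabulary list."""
    return [1 if value == item else 0 for item in vocabulary]


def word_features(word, pron):
    """Bundle character, phoneme, and stress sequences for one word."""
    phonemes, stresses = split_pronunciation(pron)
    return {"chars": list(word), "phonemes": phonemes, "stresses": stresses}


features = word_features("money", "M AH1 N IY0")
print(features["chars"])     # ['m', 'o', 'n', 'e', 'y']
print(features["phonemes"])  # ['M', 'AH', 'N', 'IY']
print(features["stresses"])  # [1, 0]

# Hypothetical genre vocabulary, for illustration only.
context = one_hot("trap", ["boom bap", "trap", "drill"])
print(context)  # [0, 1, 0]
```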
Trains the model and generates new lyrics. Generates embeddings for words first. Uses separate networks for words, characters, phonemes, stresses, and context. The context network is non-recurrent. GloVe or Word2Vec vectors can optionally be preloaded into the embeddings for known words, to give them some initial semantic meaning. Not currently tested.
Files:
- tf_reader.py
- tf_rnn_model.py
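The idea of preloading pretrained vectors for known words can be illustrated without TensorFlow. The sketch below builds an initial embedding matrix in plain Python: rows for words with a pretrained vector are copied, and all other rows are randomly initialized. It is not the code in tf_rnn_model.py. The function name, the initialization range, and the tiny hand-written vectors (standing in for GloVe or Word2Vec) are all assumptions.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one row per vocab word: pretrained if known, else random."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            # Known word: start from its pretrained vector.
            matrix.append(list(vector))
        else:
            # Unknown word (e.g. slang): small random initialization.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


vocab = ["money", "finna", "love"]
# Made-up 3-dimensional vectors standing in for real pretrained embeddings.
pretrained = {"money": [0.2, -0.4, 0.1], "love": [0.5, 0.3, -0.2]}
matrix = build_embedding_matrix(vocab, pretrained, dim=3)
for word, row in zip(vocab, matrix):
    print(word, row)
```

A matrix built this way could serve as the initial value of the word-embedding variable, so that training starts from pretrained vectors for dictionary words and learns the slang rows from scratch.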
Ensure you set the CLASSPATH environment variable to the path of stanford-postagger.jar:
export CLASSPATH=stanford-postagger-2015-12-09/stanford-postagger.jar