About Dataset
Each sample in the dataset has the following information:
id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (may be blank)
keyword - a particular keyword from the tweet (may be blank)
target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
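As a rough illustration of the fields above, the following sketch reads rows in that layout using only the Python standard library. The two sample rows, the column order, and the helper name load_tweets are assumptions made for illustration; they are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the layout described above; not real data.
SAMPLE = """id,keyword,location,text,target
1,,,Example tweet about a flood downtown,1
2,ablaze,London,This new album is ablaze,0
"""

def load_tweets(handle):
    """Read tweets from a CSV handle; blanks become None, target becomes int."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "id": int(row["id"]),
            "text": row["text"],
            "location": row["location"] or None,  # may be blank
            "keyword": row["keyword"] or None,    # may be blank
            # target exists in train.csv only, so fall back to None elsewhere
            "target": int(row["target"]) if row.get("target") else None,
        })
    return rows

tweets = load_tweets(io.StringIO(SAMPLE))
```

For the real file, the same helper could be given an open handle to train.csv instead of the in-memory sample.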
About Notebook:
- Cleaning the data
i) Making text all lower case.
ii) Removing punctuation.
iii) Removing numerical values.
iv) Removing common nonsensical text.
v) Tokenizing text.
vi) Removing stop words.
- Organizing the data
i) Generating corpus.
ii) Generating the document-term matrix (dtm).
- Exploring the data
i) Creating word cloud for most common words.
ii) Finding the count of top 30 words associated with top 10 keywords.
iii) Adding most common words to stopword list.
iv) Updating the document-term matrix with the new stop_words.
v) Creating word cloud for top 5 keywords.
vi) Finding the number of unique words associated with each unique keyword.
vii) Checking for profanity by analysing common bad words.
- Sentiment Analysis
i) Finding the polarity and subjectivity of each tweet.
ii) Visualizing the results through a scatter plot.
iii) Splitting each tweet into 10 parts and finding the polarity of each part.
iv) Visualizing the results through subplots.
- Topic Modeling
i) Converting the dtm into gensim format.
ii) Generating a dictionary of all terms and their respective locations in the dtm.
iii) Applying Latent Dirichlet Allocation (LDA) for all text.
iv) Applying Latent Dirichlet Allocation (LDA) for nouns only.
v) Applying Latent Dirichlet Allocation (LDA) for nouns and adjectives.
- Text Generation
i) Building a Markov Chain Function.
ii) Creating the dictionary of text data.
iii) Creating a Text Generator Function.
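
The text generation steps above can be sketched roughly as follows. This is a minimal first-order, word-level Markov chain written for illustration, not the notebook's actual code; the names markov_chain and generate_text, the length and seed parameters, and the restart-on-dead-end behaviour are all assumptions.

```python
import random
from collections import defaultdict

def markov_chain(text):
    """Map each word to the list of words that directly follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

def generate_text(chain, length=15, seed=None):
    """Walk the chain for `length` words, starting from a random key."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            # Dead end (word never had a successor): restart from a random key.
            word = rng.choice(list(chain))
        output.append(word)
    return " ".join(output)

# Usage with a hypothetical cleaned tweet corpus joined into one string.
corpus = "fire in the forest fire near the town flood in the town"
chain = markov_chain(corpus)
sentence = generate_text(chain, length=8, seed=42)
```

Because repeated followers are kept in each list, words that follow a given word more often are proportionally more likely to be sampled.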