FlairifyMe

FlairifyMe is a Reddit flair detector for the r/india subreddit. It takes a post's URL as user input and predicts the post's flair using a Logistic Regression model. The web-application is hosted on Heroku at FlairifyMe (https://flairify-me.herokuapp.com/).

The web-application also offers visual content and temporal analysis of the collected data.

Directory Structure

The project has been developed using Python and several of its libraries and frameworks:

  • Scikit-learn
  • PRAW
  • NLTK
  • Flask
  • numpy
  • pandas
  • PyMongo

The scraped data is stored in, and loaded from, a MongoDB instance. The web-application is built on Flask and deployed using Heroku.

Following is the description of the files and folders in the repository:

  • Data: Contains CSV files with the preprocessed scraped data, the MongoDB collections, and the scripts for scraping, preprocessing and analysing the data.
  • Models: Contains the machine learning model used for predicting flairs.
  • Training: Contains the script for text-classification.
  • templates: Contains the HTML templates for the web-application.
  • app.py: Used to start up the Flask server.
  • flair_predictor.py: Module to accept a valid URL and predict the post's flair by loading the model.
  • nltk.txt: Contains NLTK library dependencies for deployment on Heroku.
  • requirements.txt: Contains all dependencies for the project.

Usage

The web-application allows the user to enter an r/india post URL and displays the predicted flair for the submitted post. The user can view content and temporal analysis of the scraped data by clicking the 'Post Analysis' button in the top right corner of the page.

To run on a local server:

  1. Clone the repository
git clone https://github.com/BhavyaC16/FlairifyMe.git
  2. Create a virtual environment
python3 -m venv FlairifyMe
source FlairifyMe/bin/activate
cd FlairifyMe/
  3. Install the project dependencies
pip3 install -r requirements.txt
  4. Create the file RedditAPI.py as follows:
def accinfo():
	# Credentials of your Reddit app and account
	personalScript = '<enter_Reddit_App_personal_script_here>'
	secretKey = '<enter_Reddit_App_secret_key_here>'
	app = 'FlairifyMe'
	username = '<enter_your_Reddit_Username_here>'
	password = '<enter_your_Reddit_password>'
	return [personalScript, secretKey, app, username, password]

Copy the same file to the directory ./Data/Scripts/ as well if you want to scrape posts from Reddit.

  5. To run the server, execute the following command
python3 app.py

Approach

Data Scraping

The Python library PRAW has been used to scrape data from the r/india subreddit, with a total of 3,156 posts across 13 different flairs.

[Figure: number of posts scraped per flair]
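As a rough illustration of the scraping step, the sketch below searches r/india by flair with PRAW and flattens each submission into a plain record. The function names, the flair search query, the per-flair limit, the comment cap and the record fields are all assumptions made for illustration and are not taken from the repository's scraping script. `accinfo()` is the helper from the RedditAPI.py file described under Usage, and the mapping of its values onto PRAW's `client_id` and `client_secret` is likewise assumed.

```python
def make_reddit():
    """Build an authenticated PRAW client from RedditAPI.accinfo()."""
    import praw  # third-party: pip install praw
    from RedditAPI import accinfo  # file created in the Usage section

    personal_script, secret_key, app, username, password = accinfo()
    return praw.Reddit(
        client_id=personal_script,  # assumed mapping of the credentials
        client_secret=secret_key,
        user_agent=app,
        username=username,
        password=password,
    )


def submission_to_record(submission, max_comments=10):
    """Flatten one submission into title, body and comment text."""
    # Drop "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    comments = [c.body for c in submission.comments.list()[:max_comments]]
    return {
        "flair": submission.link_flair_text,
        "title": submission.title,
        "body": submission.selftext,
        "comments": " ".join(comments),
        "url": submission.url,
    }


def scrape_flair(reddit, flair, limit=250):
    """Collect up to `limit` r/india posts carrying the given flair."""
    query = f'flair:"{flair}"'  # hypothetical query; adjust as needed
    results = reddit.subreddit("india").search(query, limit=limit)
    return [submission_to_record(s) for s in results]
```

With valid credentials in RedditAPI.py, `scrape_flair(make_reddit(), "Politics")` would return a list of such records, ready to be cleaned and stored.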

Data preprocessing

The data has been preprocessed using the NLTK library. The following procedures have been executed on the title, body and comments to clean the data:

  1. Tokenizing and removing symbols
  2. Removing stopwords
  3. Stemming
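The three steps above can be sketched in a few lines. To stay dependency-free, this version uses a regular expression for tokenizing and a tiny stand-in stopword set, whereas the project itself relies on NLTK; the lowercasing, the stopword set and the function name are assumptions for illustration. Stemming is left optional so that the same function can produce both the stemmed and the non-stemmed variant of the data.

```python
import re

# Tiny stand-in list for illustration; the project uses NLTK's stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "for"}

# Runs of letters and digits count as tokens; symbols are discarded.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def clean_text(text, stem=None):
    """Tokenize, remove symbols and stopwords, and optionally stem."""
    tokens = TOKEN_RE.findall(text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if stem is not None:
        # `stem` is any callable mapping a word to its stem.
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


print(clean_text("The roads in Delhi are flooding!"))
# prints: roads delhi flooding
```

To approximate the stemmed dataset, a stemmer such as `nltk.stem.PorterStemmer().stem` can be passed as the `stem` argument.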

Two separate datasets have been prepared and saved in MongoDB for training: one with stemming and one without, since stemming is reported to reduce prediction accuracy in certain cases.

Training

The data has been loaded from MongoDB into a pandas DataFrame and split into training and testing sets in an 80-20 ratio using scikit-learn. Three algorithms (Naive Bayes, Linear SVM and Logistic Regression) were trained on each of the following post features: Title, Body, Comments, Title+Comments and Title+Body+Comments, for both datasets (with and without stemming).
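A comparison of this kind could look roughly like the following scikit-learn sketch. The TF-IDF vectorizer, the hyperparameters, the random seed and the function name are assumptions, since the training script's actual settings are not described here; only the 80-20 split and the three algorithms come from the text above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, flairs, seed=42):
    """Return the test accuracy of each algorithm on one text feature."""
    # 80-20 train-test split, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, flairs, test_size=0.2, random_state=seed
    )
    classifiers = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, clf in classifiers.items():
        # Vectorizer choice is an assumption, not the project's setting.
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy
    return scores
```

Calling this once per feature (for example on the cleaned titles, then on title, body and comments joined together) and once per dataset would fill in one row of a results table at a time.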

The resulting prediction accuracies are summarized in the tables below:

DATA WITHOUT STEMMING:

| Feature \ Algorithm | Naive Bayes | Linear SVM | Logistic Regression |
|---|---|---|---|
| Title | 0.59177 | 0.58386 | 0.54430 |
| Body | 0.20569 | 0.24367 | 0.24051 |
| Comments | 0.31171 | 0.59494 | 0.58069 |
| Title+Comments | 0.37500 | 0.64082 | 0.63449 |
| Title+Body+Comments | 0.37816 | 0.64399 | 0.65189 |

DATA WITH STEMMING:

| Feature \ Algorithm | Naive Bayes | Linear SVM | Logistic Regression |
|---|---|---|---|
| Title | 0.57753 | 0.57120 | 0.54430 |
| Body | 0.18354 | 0.23101 | 0.24051 |
| Comments | 0.30063 | 0.55538 | 0.56013 |
| Title+Comments | 0.36076 | 0.58703 | 0.60126 |
| Title+Body+Comments | 0.36551 | 0.59335 | 0.61392 |

After comparing the flair-wise and overall prediction accuracies, the model trained on Title+Body+Comments from the non-stemmed data using Logistic Regression was chosen.
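As a quick check of the overall numbers, the snippet below encodes the two result tables and looks up the highest accuracy. It covers only the overall accuracies reported above; the flair-wise accuracies that also informed the choice are not listed in this document, so they are not part of the check. The variable and function names are illustrative.

```python
ALGORITHMS = ("Naive Bayes", "Linear SVM", "Logistic Regression")

# Overall accuracies copied from the two tables above.
ACCURACY = {
    "without stemming": {
        "Title": (0.59177, 0.58386, 0.54430),
        "Body": (0.20569, 0.24367, 0.24051),
        "Comments": (0.31171, 0.59494, 0.58069),
        "Title+Comments": (0.37500, 0.64082, 0.63449),
        "Title+Body+Comments": (0.37816, 0.64399, 0.65189),
    },
    "with stemming": {
        "Title": (0.57753, 0.57120, 0.54430),
        "Body": (0.18354, 0.23101, 0.24051),
        "Comments": (0.30063, 0.55538, 0.56013),
        "Title+Comments": (0.36076, 0.58703, 0.60126),
        "Title+Body+Comments": (0.36551, 0.59335, 0.61392),
    },
}


def best_configuration(accuracy=ACCURACY):
    """Return (dataset, feature, algorithm, score) with the top accuracy."""
    candidates = (
        (score, dataset, feature, algorithm)
        for dataset, rows in accuracy.items()
        for feature, scores in rows.items()
        for algorithm, score in zip(ALGORITHMS, scores)
    )
    score, dataset, feature, algorithm = max(candidates)
    return dataset, feature, algorithm, score


print(best_configuration())
# prints: ('without stemming', 'Title+Body+Comments', 'Logistic Regression', 0.65189)
```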

Flair Prediction

The saved model is loaded for predicting the flair once the post features (title, body and comments) have been cleaned using NLTK. The returned result is displayed on the web-application.
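The prediction step might be sketched as follows. The pickle format, the order in which the fields are joined, the default cleaning function and both function names are assumptions; the repository's flair_predictor.py may differ on each point. The model is assumed to be a fitted scikit-learn pipeline that accepts raw text.

```python
import pickle


def load_model(path):
    """Load a previously saved model; pickle is an assumed format."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_flair(model, title, body, comments, clean=str.lower):
    """Clean the post features, join them and return the predicted flair."""
    # `clean` stands in for the NLTK-based cleaning used in training.
    parts = (title, body, comments)
    combined = " ".join(clean(part) for part in parts if part)
    return model.predict([combined])[0]
```

Whatever cleaning function is used here should match the one applied to the training data; otherwise the model sees text in a different form than it was trained on.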

API for querying FlairifyMe

A developer API has been implemented using Flask. It returns JSON containing the predicted flair of the Reddit post queried by the user.

The API can be accessed by querying:

flairify-me.herokuapp.com/api/resource?redditURL=<enter_url_here>

Returns JSON of the following format when successful:

{'status': 'successful', 'status_code': 200, 'result': {'flair': '<predicted_flair>'}}

Else, returns JSON of the format:

{'status': 'failed', 'status_code': <error_code>, 'result': {'error': '<error_message>'}}
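A small client for this endpoint could look like the sketch below, using only the standard library. The https scheme, the timeout, the function names and the choice to raise RuntimeError on a failed status are assumptions; the query parameter and the response fields follow the formats shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://flairify-me.herokuapp.com/api/resource"


def build_query_url(reddit_url, endpoint=API_ENDPOINT):
    """Percent-encode the post URL into the redditURL query parameter."""
    return f"{endpoint}?{urlencode({'redditURL': reddit_url})}"


def extract_flair(payload):
    """Return the flair from a successful response, else raise."""
    if payload.get("status") == "successful":
        return payload["result"]["flair"]
    code = payload.get("status_code")
    message = payload.get("result", {}).get("error", "unknown error")
    raise RuntimeError(f"FlairifyMe API error {code}: {message}")


def query_flair(reddit_url, timeout=10):
    """Query the API for one post and return its predicted flair."""
    # urlopen raises HTTPError itself if the server answers with an
    # HTTP error status instead of a JSON error body.
    with urlopen(build_query_url(reddit_url), timeout=timeout) as response:
        return extract_flair(json.load(response))
```

Encoding the post URL matters because it contains characters such as `:` and `/` that would otherwise be ambiguous inside a query string.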

Future Extension

I plan on adding the following features to the project:

  1. Improving the prediction by training the model on user inputs.
  2. Automating the script to allow users to build a prediction model for any subreddit they enter.

Learnings

This task has been a great learning experience for me, as it was my first time working with Machine Learning and Natural Language Processing, with tools like Heroku and MongoDB, and with libraries like scikit-learn, NLTK, PRAW and Flask.
