Merge pull request #705 from tanuj437/main
Customer Review Sentiment Anaylsis
Showing 22 changed files with 388 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# Dataset Overview | ||
This dataset contains customer reviews from a marketplace, with various attributes related to the reviews and products. The dataset file is approximately 475 MB in size. | ||
|
||
## File Description | ||
The dataset file includes the following columns: | ||
|
||
**marketplace**: The marketplace where the review was posted. | ||
|
||
**customer_id**: Unique identifier for the customer. | ||
**review_id**: Unique identifier for the review. | ||
|
||
**product_id**: Unique identifier for the product. | ||
**product_parent**: Parent identifier for the product. | ||
**product_title**: Title of the product. | ||
|
||
**product_category**: Category of the product. | ||
**star_rating**: Star rating given by the customer. | ||
**helpful_votes**: Number of helpful votes the review received. | ||
|
||
**total_votes**: Total number of votes the review received. | ||
## Unique Values Overview | ||
### Marketplace | ||
Unique values: 0 | ||
Total values: [null] 55%, Other (405658) 45% | ||
### customer_id | ||
Unique values: 0 | ||
Total values: [null] 68%, Banjo 0%, Other (290242) 32% | ||
### review_id | ||
Unique values: 0 | ||
Total values: [null] 82%, Banjo 0%, Other (159357) 18% | ||
### product_id | ||
Unique values: 0 | ||
Total values: [null] 86%, Craft Work 0%, Other (122616) 14% | ||
## Numerical Data Overview | ||
### Star Rating | ||
Label Count | ||
-32.00 - 371.20 94 | ||
371.20 - 774.40 4 | ||
774.40 - 1177.60 3 | ||
1177.60 - 1580.80 1 | ||
1984.00 - 2387.20 2 | ||
3596.80 - 4000.00 1 | ||
-32 | ||
4000 | ||
### Helpful Votes | ||
Label Count | ||
-5.00 - 427.60 54 | ||
427.60 - 860.20 7 | ||
860.20 - 1292.80 1 | ||
1725.40 - 2158.00 4 | ||
2158.00 - 2590.60 1 | ||
3888.40 - 4321.00 1 | ||
-5 | ||
4321 | ||
|
||
|
||
|
||
|
||
### Summary | ||
File Size: 475.5MB | ||
Number of Records: 904,615 | ||
### Usage | ||
This dataset can be used for sentiment analysis, customer behavior analysis, and various other machine learning tasks related to product reviews and ratings. | ||
|
||
# Dataset | ||
This dataset can be downloaded from [Kaggle](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset). |
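As a rough illustration of how the columns described in this README could be read, here is a minimal sketch that uses only Python's standard `csv` module. It assumes the file is tab-separated (the Kaggle files carry a `.tsv` extension); the helper name `read_reviews` and the sample rows are hypothetical, and the notebook itself may load the data differently (for example with pandas). Because the overview above reports many null values, the sketch skips rows whose numeric fields do not parse.

```python
import csv
import io

# Column names taken from the File Description section of this README.
EXPECTED_COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id",
    "product_parent", "product_title", "product_category",
    "star_rating", "helpful_votes", "total_votes",
]


def read_reviews(handle):
    """Yield one dict per review from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Convert the numeric columns; skip rows whose values do not parse.
        try:
            for column in ("star_rating", "helpful_votes", "total_votes"):
                row[column] = int(row[column])
        except (KeyError, TypeError, ValueError):
            continue
        yield row


# Tiny in-memory sample standing in for the real file (hypothetical rows).
SAMPLE_TSV = (
    "\t".join(EXPECTED_COLUMNS) + "\n"
    "US\t1\tR1\tB001\t10\tAcoustic Guitar\tMusical Instruments\t5\t3\t4\n"
    "US\t2\tR2\tB002\t11\tBanjo Strings\tMusical Instruments\tbad\t0\t0\n"
)

reviews = list(read_reviews(io.StringIO(SAMPLE_TSV)))
print(len(reviews), reviews[0]["star_rating"])
```

To read the real file, the same function can be given an open file handle, for instance `open(path, encoding="utf-8", newline="")`, where `path` points to the downloaded TSV.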
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
# Customer Review Sentiment Analysis - Model | ||
## 📝 Description | ||
This folder contains the pre-trained machine learning models and scripts used for sentiment analysis on the customer review dataset. The goal is to automatically categorize customer reviews into positive, neutral, or negative sentiments, helping to understand public perceptions of products. | ||
|
||
## 📂 Contents | ||
- **customer-review-sentiment-analysis.ipynb**: Jupyter Notebook containing the complete process of data preprocessing, model training, evaluation, and visualization. | ||
|
||
- **model.pkl**: Pre-trained Logistic Regression model used for sentiment prediction. | ||
|
||
- **tfidf_vectorizer.pkl**: Pre-trained TF-IDF vectorizer used for transforming text data. | ||
|
||
- **README.md**: This document. | ||
|
||
## 🎯 Goal | ||
The goal of this sentiment analysis project is to enhance understanding of customer perceptions by organizing and analyzing reviews. By automatically classifying these reviews as positive, neutral, or negative, the project aims to provide insights into public opinion trends. | ||
|
||
## 🧮 What I Did | ||
In this sentiment analysis project, various models were evaluated to find the most effective one for classifying customer reviews. The models evaluated include: | ||
|
||
### Logistic Regression | ||
|
||
A simple linear model for binary and multi-class classification. | ||
Achieved a high accuracy and balanced precision-recall performance. | ||
### LightGBM Classifier | ||
|
||
A Light Gradient Boosting Machine known for its efficiency and performance with large datasets. | ||
Achieved competitive accuracy and was used as one of the benchmark models. | ||
### XGBoost Classifier | ||
|
||
An implementation of gradient-boosted decision trees designed for speed and performance. | ||
Achieved competitive accuracy and served as another benchmark model. | ||
### AdaBoost Classifier | ||
|
||
An ensemble method that combines multiple weak classifiers to create a strong classifier. | ||
Achieved good performance, particularly in precision and recall. | ||
|
||
### Data Preprocessing and Augmentation | ||
|
||
**Data Cleaning:** Normalized the text and removed missing values and duplicates. | ||
|
||
**Tokenization:** Processed text data to remove stop words and perform lemmatization. | ||
|
||
**TF-IDF Vectorization:** Converted text data into numerical features using TF-IDF. | ||
|
||
|
||
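The TF-IDF step described above can be illustrated with a small pure-Python sketch. This is not the project's code: the README indicates a fitted vectorizer saved as `tfidf_vectorizer.pkl`, whose settings are not stated here. The function name `tfidf_vectors` is hypothetical, and the weighting (smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization) is an assumption chosen to resemble the defaults documented for scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of whitespace-tokenized strings."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        # L2-normalize so long and short reviews are comparable.
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(
    ["great guitar", "bad guitar", "great sound great value"]
)
print(vocabulary)
```

Each review becomes one row whose columns follow `vocabulary`; terms that appear in fewer reviews receive a larger weight than terms that appear in many.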
## 🚀 Models Implemented | ||
|
||
**Logistic Regression Model** | ||
|
||
- Achieved an accuracy of 90.0%. | ||
- Precision: 0.89, Recall: 0.90 | ||
- F1-score: 0.89 (weighted average). | ||
|
||
|
||
**XGBoost Classifier** | ||
|
||
- Achieved an accuracy of 89.0%. | ||
- Precision: 0.88, Recall: 0.89 | ||
- F1-score: 0.87 (weighted average). | ||
|
||
**AdaBoost Classifier** | ||
|
||
- Achieved an accuracy of 88.0%. | ||
- Precision: 0.86, Recall: 0.88 | ||
- F1-score: 0.86 (weighted average). | ||
|
||
**LightGBM Classifier** | ||
|
||
- Achieved an accuracy of 89.0%. | ||
- Precision: 0.88, Recall: 0.89 | ||
- F1-score: 0.88 (weighted average). | ||
|
||
**Multi-Layer Perceptron (MLP)** | ||
|
||
- Achieved an accuracy of 90.0%. | ||
- Precision: 0.89, Recall: 0.90 | ||
- F1-score: 0.89 (weighted average). | ||
|
||
|
||
**Model Performance Analysis** | ||
Training and Validation: Evaluated models based on accuracy, precision, and loss to select the best-performing model. | ||
|
||
|
||
**Best Model** | ||
The best-performing model, Logistic Regression, has been saved as `model.pkl` and is ready for deployment using Streamlit. | ||
|
||
## 📢 Conclusion | ||
The customer review sentiment analysis project demonstrates the effectiveness of machine learning models, particularly Logistic Regression, in accurately predicting customer sentiment. The models help in organizing and prioritizing customer reviews, providing valuable insights for stakeholders. | ||
|
||
## ✒️ Your Signature | ||
Tanuj Saxena [LinkedIn](https://linkedin.com/in/tanuj-saxena-970271252/) |
1 change: 1 addition & 0 deletions
1
Customer Review Sentiment Anaylsis/Model/customer-review-sentiment-anaylsis.ipynb
Large diffs are not rendered by default.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
|
||
# Customer Review Sentiment Analysis | ||
Customer Review Sentiment Analysis is a project focused on automatically analyzing and classifying customer reviews based on their sentiment. The project leverages machine learning techniques to understand and categorize customer sentiments expressed in reviews. | ||
|
||
Note: Sentiment labels are derived from the star rating of each review. | ||
<img width="922" alt="webapp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/5528ae7e-6ded-46d3-a845-026451cf40e6"> | ||
|
||
## 📝 Abstract | ||
Customer Review Sentiment Analysis involves automatically identifying and classifying sentiment from customer reviews. Techniques such as natural language processing (NLP), machine learning models, and sentiment analysis algorithms are employed to achieve this. | ||
|
||
## 🔍 Methodology | ||
**Importing Libraries** | ||
- Libraries such as NumPy, Pandas, Sklearn, Transformers, and others are imported for data manipulation, visualization, and machine learning model building. | ||
|
||
**Loading the Dataset** | ||
- The dataset contains multiple rows of comments labeled with their sentiment based on the star rating. | ||
|
||
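The README states that sentiment labels come from the star rating but does not give the cut-offs. The sketch below shows one common convention, stated here purely as an assumption: 1 to 2 stars as negative, 3 as neutral, and 4 to 5 as positive. The function name `star_to_sentiment` is hypothetical and the notebook may use different thresholds.

```python
def star_to_sentiment(star_rating):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if star_rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected star rating: {star_rating!r}")
    if star_rating <= 2:
        return "negative"
    if star_rating == 3:
        return "neutral"
    return "positive"


labels = [star_to_sentiment(stars) for stars in (1, 3, 5)]
print(labels)
```

Rejecting out-of-range values explicitly keeps malformed rows from being silently labeled as positive or negative.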
**Data Preprocessing** | ||
- Prepare the data for analysis: handle missing values, encode categorical data, scale features, perform feature engineering, split into train-test sets, and normalize data. Ensure the data is in a suitable format for machine learning algorithms. | ||
|
||
**Training the Models** | ||
- Models such as LightGBM and Logistic Regression are trained on the training dataset and then evaluated. | ||
|
||
**Model Performance Analysis** | ||
- Training and validation loss and accuracy are plotted to visualize the models' performance. | ||
|
||
<img width="404" alt="precision_cmp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/f34e3bfc-c14c-4c9e-8a58-e9de9f27c2d9"> | ||
|
||
|
||
**Model Prediction** | ||
- The model is run on a test dataset to check the accuracy and precision of its predictions. | ||
|
||
<img width="416" alt="recall_cmp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/748c1245-c55c-4b4b-850a-e687c6c9ffe7"> | ||
|
||
|
||
**Deploy** | ||
- Using the Streamlit library, the model is deployed for real-time sentiment analysis. | ||
|
||
**Data and Model File Download** | ||
- The dataset used in the project is taken from the Kaggle Customer Review Dataset. [Kaggle Dataset Link](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset?select=amazon_reviews_us_Musical_Instruments_v1_00.tsv) | ||
|
||
### Project Directory Structure | ||
``` | ||
Customer Review Sentiment Anaylsis | ||
|- Dataset | ||
   |- column_overview.png | ||
   |- dataset_view.png | ||
   |- README.md | ||
|- Model | ||
   |- customer-review-sentiment-anaylsis.ipynb | ||
   |- README.md | ||
   |- model.pkl | ||
   |- tfidf_vectorizer.pkl | ||
|- Web App | ||
   |- app.py | ||
   |- README.md | ||
|- Images | ||
   |- f1_cmp.png | ||
   |- README.md | ||
   |- precision_cmp.png | ||
   |- recall_cmp.png | ||
   |- review_length.png | ||
   |- sentiment_distribution.png | ||
   |- star_rating.png | ||
   |- star_ratingtocount.png | ||
   |- webapp.png | ||
   |- running_test.mp4 | ||
   |- wordcloud.png | ||
|- requirements.txt | ||
|- README.md | ||
``` | ||
|
||
## How to Use | ||
**Requirements** | ||
- Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the `requirements.txt` file. | ||
|
||
**Download Data** | ||
- Download the `amazon_reviews_us_Musical_Instruments_v1_00.tsv` file from the Kaggle dataset linked in the Data and Model File Download section above. | ||
|
||
**Run the Jupyter Notebook** | ||
- Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment. | ||
|
||
**Training and Evaluation** | ||
- Train the models using the provided data and evaluate their performance using metrics such as accuracy and loss. | ||
|
||
**Interpret Results** | ||
- Analyze the model's performance using the visualizations and metrics provided in the notebook. | ||
|
||
Feel free to reach out if you encounter any issues or need further assistance with running the notebook. | ||
|
||
## Connect with Me | ||
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# Customer Review Sentiment Analysis Web App | ||
|
||
## Goal 🎯 | ||
The goal of this sentiment analysis web application is to understand public perceptions about customer reviews on various products. By analyzing these reviews, the app helps in organizing and prioritizing insights, detecting sentiment trends, and ensuring that diverse viewpoints are represented. It streamlines the process of understanding customer opinions and provides valuable feedback for stakeholders. 🌍🔍 | ||
|
||
## Model(s) Used for the Web App 🧮 | ||
The web app uses a pre-trained Logistic Regression model for sentiment analysis. A TF-IDF vectorizer converts the review text into numerical features, and the Logistic Regression model then predicts the sentiment. | ||
|
||
## Video Demonstration 🎥 | ||
|
||
|
||
|
||
https://github.com/user-attachments/assets/46cada4d-cefd-41ac-af8d-415a23a035b9 | ||
|
||
|
||
|
||
|
||
## How to Run the Web App | ||
|
||
### Requirements | ||
Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the `requirements.txt` file. | ||
|
||
### Installation | ||
1. **Clone the repository:** | ||
```bash | ||
gh repo clone tanuj437/Customer-Review-Sentiment-Analysis | ||
cd Customer-Review-Sentiment-Analysis/WebApp | ||
``` | ||
2. **Install the Dependencies** | ||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
3. **Run the Streamlit app** | ||
```bash | ||
streamlit run app.py | ||
``` | ||
### Signature ✒️ | ||
Tanuj Saxena | ||
|
||
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/tanuj-saxena-970271252/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
import streamlit as st | ||
import pickle | ||
import re | ||
import nltk | ||
from nltk.corpus import stopwords | ||
|
||
# Load NLTK stopwords, downloading the corpus first if it is missing | ||
nltk.download('stopwords', quiet=True) | ||
stop_words = set(stopwords.words('english')) | ||
|
||
# Define text preprocessing function | ||
def preprocess_text(text): | ||
    text = text.lower() | ||
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags | ||
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters | ||
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords | ||
    return text | ||
|
||
# Load the pretrained model | ||
# (paths are relative to the directory Streamlit is launched from) | ||
model_filename = 'Model/model.pkl' | ||
with open(model_filename, 'rb') as file: | ||
    logistic_model = pickle.load(file) | ||
|
||
# Load the pretrained TF-IDF vectorizer | ||
vectorizer_filename = 'Model/tfidf_vectorizer.pkl' | ||
with open(vectorizer_filename, 'rb') as file: | ||
    vectorizer = pickle.load(file) | ||
|
||
# Define a function to make predictions | ||
def predict_sentiment(text): | ||
    preprocessed_text = preprocess_text(text) | ||
    transformed_text = vectorizer.transform([preprocessed_text]) | ||
    prediction = logistic_model.predict(transformed_text) | ||
    return prediction[0] | ||
|
||
# Streamlit app | ||
st.title('Sentiment Analysis Web App') | ||
|
||
st.write('This is a web app to classify the sentiment of customer reviews as positive, neutral, or negative.') | ||
|
||
# User input | ||
user_input = st.text_area('Enter a customer review:', '') | ||
|
||
if st.button('Predict'): | ||
    if user_input: | ||
        prediction = predict_sentiment(user_input) | ||
        st.write(f'The sentiment of the review is: **{prediction}**') | ||
    else: | ||
        st.write('Please enter a review to get a prediction.') | ||
|
||
# To run the app, save this script and use the command: streamlit run your_script_name.py |
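To see what the preprocessing step in `app.py` does to a review before it reaches the vectorizer, the following standalone sketch repeats the same three regular-expression and filtering steps. The small `DEMO_STOP_WORDS` set is a stand-in introduced only for this example (an assumption); the app itself uses NLTK's English stopword list, so real output may drop more words.

```python
import re

# Small stand-in stopword set (assumption); app.py uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "is", "a", "and", "this", "it", "i"}


def preprocess_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip HTML tags and non-letters, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # Remove HTML tags
    text = re.sub(r"[^a-z\s]", "", text)   # Remove non-alphabetic characters
    return " ".join(word for word in text.split() if word not in stop_words)


cleaned = preprocess_text("This guitar is <b>GREAT</b>, 5 stars!!")
print(cleaned)  # guitar great stars
```

Note that digits and punctuation are removed entirely, so a phrase such as "5 stars" contributes only the word "stars" to the features.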
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Image Folder Overview | ||
This folder contains various visualizations that represent different aspects of the dataset and model performance. Below is a detailed description of each visualization. | ||
|
||
## Visualizations | ||
## Sentiment Distribution | ||
|
||
Description: This chart shows the distribution of sentiments (positive, neutral, negative) across the dataset. It provides an overview of how many reviews fall into each sentiment category. | ||
## Star Rating Distribution | ||
|
||
Description: This chart displays the distribution of star ratings given by customers. It helps to understand the overall satisfaction level of the customers based on the star ratings. | ||
## Star Rating to Count | ||
|
||
Description: This chart shows the count of reviews for each star rating. It is useful to see the frequency of each rating given by the customers. | ||
## Word Cloud | ||
|
||
Description: A word cloud visualization that highlights the most frequently occurring words in the reviews. Larger words represent higher frequency, providing insight into common themes and topics in the reviews. | ||
## Review Length Distribution | ||
|
||
Description: This histogram shows the distribution of review lengths in terms of word count. It helps to understand the typical length of customer reviews in the dataset. | ||
|
||
## F1 Score Comparison | ||
|
||
Description: This bar chart compares the F1 scores of different models used in the analysis. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. | ||
## Recall Comparison | ||
|
||
Description: This bar chart compares the recall scores of different models used in the analysis. Recall measures the ability of a model to identify all relevant instances in the dataset. | ||
|
||
|
||
## Precision Comparison | ||
|
||
Description: This bar chart compares the precision scores of different models used in the analysis. Precision measures the accuracy of the positive predictions made by the model. | ||
|
||
### Usage | ||
These visualizations provide a comprehensive view of the dataset's characteristics and the performance of various models used for sentiment analysis. They can be used to gain insights into customer reviews, model effectiveness, and areas for improvement in analysis. |
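The precision, recall, and F1 comparisons described above can be made concrete with a short sketch. It computes support-weighted averages over the three sentiment classes in plain Python; the function name `weighted_scores` and the toy labels are hypothetical, and the notebook presumably relies on a library routine rather than this code.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted (precision, recall, f1) over all classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_label = tp / predicted if predicted else 0.0
        r_label = tp / count
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        weight = count / total
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return precision, recall, f1


y_true = ["positive", "positive", "positive", "negative", "neutral", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "positive", "negative"]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With support weighting, the averaged recall equals overall accuracy (4 of 6 correct in this toy example), which is why weighted recall and accuracy tend to be reported as the same number.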
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
pandas==1.3.3 | ||
matplotlib==3.4.3 | ||
seaborn==0.11.2 | ||
nltk==3.6.3 | ||
wordcloud==1.8.1 | ||
numpy==1.21.2 | ||
scikit-learn==0.24.2 | ||
keras==2.6.0 | ||
tensorflow==2.6.0 | ||
lightgbm==3.2.1 | ||
xgboost==1.4.2 | ||
streamlit==0.86.0 |