The task was to develop a machine learning model that predicts whether a given tweet is about a real disaster. The model should classify tweets into two categories: '1' for tweets about real disasters, and '0' for those that are not.
The input data consists of two sets, a training set and a test set, each containing the following attributes (a short loading sketch follows the list):
- id: A unique identifier for each tweet.
- keyword: A specific keyword from the tweet (may be blank).
- location: The location the tweet was sent from (may be blank).
- text: The actual text of the tweet.
- target: Present only in the training set; indicates whether the tweet is about a real disaster (1) or not (0).
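As a minimal sketch of inspecting this data with pandas (the file names train.csv and test.csv are assumptions for illustration; the column names follow the attribute list above):

```python
import pandas as pd

# File names are assumed for illustration; the column names ("id", "keyword",
# "location", "text", "target") follow the attribute list above.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)                        # test has no "target" column
print(train["target"].value_counts())                 # class balance: disaster vs. not
print(train[["keyword", "location"]].isna().sum())    # blanks in the optional fields
```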
The approach involved several steps:
- Data Preprocessing: Cleaning and preprocessing the text data, including tokenization, normalization, and stop-word removal.
- Feature Extraction: Using the TF-IDF (Term Frequency-Inverse Document Frequency) technique to convert text data into numerical features.
- Model Selection and Training: A Naive Bayes classifier was chosen for its effectiveness in text classification tasks. The model was trained using the training dataset.
- Model Evaluation: The model was evaluated on a validation set held out from the training data (a sketch of the full pipeline follows this list).
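A minimal end-to-end sketch of these steps with scikit-learn is shown below. The exact cleaning rules, the Naive Bayes variant (MultinomialNB here), and the validation split size are assumptions for illustration; the report does not pin them down.

```python
import re
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def clean_text(text: str) -> str:
    """Lowercase, strip URLs and punctuation; TF-IDF handles tokenization."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)                       # drop URLs
    return text.translate(str.maketrans("", "", string.punctuation))


train = pd.read_csv("train.csv")                                    # assumed file name
train["clean_text"] = train["text"].apply(clean_text)

# Hold out part of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    train["clean_text"], train["target"], test_size=0.2, random_state=42
)

# TF-IDF converts the cleaned tweets into sparse numerical features;
# English stop words are removed at this stage.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Multinomial Naive Bayes is a common choice for TF-IDF text features.
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Evaluate on the held-out validation set.
y_pred = model.predict(X_val_tfidf)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))
```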
The model achieved an accuracy of approximately 79.12% on the validation set, with detailed metrics like precision, recall, and F1-score for both classes provided in the classification report.
Several metrics were used to evaluate the model:
- Accuracy: The proportion of validation tweets classified correctly; a general measure of overall performance.
- Precision, Recall, and F1-Score: Metrics that provide insights into the model's ability to correctly identify disaster-related tweets and its overall reliability.
- Confusion Matrix: Offers a detailed view of the model's performance in terms of true positives, false positives, true negatives, and false negatives.
- Precision-Recall Curve: Illustrates the trade-off between precision and recall for different threshold settings.
- ROC Curve and AUC: Summarizes the model's ability to distinguish between the two classes (a snippet computing these metrics follows this list).
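Continuing the pipeline sketch above (the fitted model, X_val_tfidf, y_val, and y_pred are reused), these metrics can be computed with scikit-learn as follows:

```python
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

# Confusion matrix: rows are true classes, columns are predicted classes.
print(confusion_matrix(y_val, y_pred))

# Probability of the positive class (real disaster) drives both curves.
y_scores = model.predict_proba(X_val_tfidf)[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_val, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_val, y_scores)
print("ROC AUC:", roc_auc_score(y_val, y_scores))
```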
Merits:
- Effectiveness: The Naive Bayes classifier performed well in classifying text data.
- Simplicity and Efficiency: The method is simple to implement and computationally efficient.
Limitations:
- Model Bias: Potential bias stemming from the training data or from the simplifying conditional-independence assumption of Naive Bayes.
- Lack of Contextual Understanding: The bag-of-words TF-IDF representation ignores word order and context, so sarcasm, negation, or figurative uses of disaster-related words can be misclassified.
- Dependence on Preprocessing: The performance heavily relies on the quality of data preprocessing.