This project builds an email spam filter using supervised learning that classifies emails as either spam (unwanted) or ham (legitimate). I built it for my data analysis and visualization class.
I used two supervised learning algorithms, K Nearest Neighbors (KNN) and Naive Bayes, and compared their performance. To train and evaluate these classifiers, I used the Enron spam email dataset, which consists of approximately 34,000 emails. Once the classifiers were trained, I ran them in a Jupyter Notebook to predict whether new emails are spam or ham.
- Explore and implement the KNN and Naive Bayes algorithms.
- Gain hands-on experience in preprocessing text data, specifically converting emails into numeric features suitable for model processing.
- Set up a supervised learning problem and analyze the results.
- Understand and follow a typical end-to-end supervised machine learning workflow.
- Work with a large, real text dataset.
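
To make these goals concrete, here is a minimal sketch of the ideas involved: turning emails into word-count vectors, then classifying with KNN and with multinomial Naive Bayes. It is an illustration only, not the project's actual code. The toy emails, the function names (`build_vocab`, `to_counts`, `knn_predict`, `nb_predict`), whitespace tokenization, Euclidean distance, and add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

# Made-up toy emails (NOT from the Enron dataset); 1 = spam, 0 = ham.
train_emails = [
    ("win cash prize now", 1),
    ("cheap meds win big", 1),
    ("meeting agenda for monday", 0),
    ("please review the quarterly report", 0),
]

def build_vocab(texts):
    """Map each distinct word to a column index (assumed whitespace tokenizer)."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_counts(text, vocab):
    """Convert one email into a word-count vector; unknown words are ignored."""
    vec = [0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

def knn_predict(x, X, y, k=1):
    """Majority vote among the k training vectors closest to x (Euclidean)."""
    nearest = sorted((math.dist(x, xi), yi) for xi, yi in zip(X, y))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def nb_predict(x, X, y):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing, in log space."""
    best_label, best_score = None, -math.inf
    for c in sorted(set(y)):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        word_totals = [sum(col) for col in zip(*rows)]  # per-word counts in class c
        total = sum(word_totals)
        score = math.log(len(rows) / len(X))            # log prior
        for count, wt in zip(x, word_totals):
            score += count * math.log((wt + 1) / (total + len(x)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

texts = [t for t, _ in train_emails]
y = [label for _, label in train_emails]
vocab = build_vocab(texts)
X = [to_counts(t, vocab) for t in texts]

new_email = to_counts("win a cash prize", vocab)
print("KNN:", knn_predict(new_email, X, y), "Naive Bayes:", nb_predict(new_email, X, y))
```

On a dataset the size of Enron, the same steps apply, but the vocabulary is usually capped to the most frequent words so the feature vectors stay manageable.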
I used the Enron spam email dataset for this project. You can download the dataset using the following links: