This is an implementation of the Naive Bayes classification technique as a spam classifier,
coded from scratch in Python.
- Scan all mails to get the 3000 most used words.
- Convert each mail into a row of the feature matrix based on these words.
- Calculate the summary of these 3000 words for each label.
- For a given mail, calculate the log Gaussian probability for each class.
- The label with the highest probability wins.
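
The steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code. The function names (`build_vocabulary`, `to_features`, `summarise`, `log_gaussian`, `classify`), the whitespace tokenisation, the use of word counts as features, the per-label summary being a mean and variance, the class prior term, and the `EPSILON` variance floor are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Assumed variance floor so that a word never seen in a class does not
# produce a division by zero (hypothetical value, not from the project).
EPSILON = 1e-3


def build_vocabulary(mails, size=3000):
    """Return the `size` most used words across all mails."""
    counts = Counter(word for mail in mails for word in mail.lower().split())
    return [word for word, _ in counts.most_common(size)]


def to_features(mail, vocabulary):
    """Count how often each vocabulary word occurs in one mail."""
    counts = Counter(mail.lower().split())
    return [counts[word] for word in vocabulary]


def summarise(feature_rows, labels):
    """Per label: the class prior and a (mean, variance) pair per word."""
    rows_by_label = defaultdict(list)
    for row, label in zip(feature_rows, labels):
        rows_by_label[label].append(row)

    summaries = {}
    for label, rows in rows_by_label.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            variance = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, variance + EPSILON))
        summaries[label] = (len(rows) / len(labels), stats)
    return summaries


def log_gaussian(x, mean, variance):
    """Log of the Gaussian density N(x; mean, variance)."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def classify(mail, vocabulary, summaries):
    """Return the label with the highest total log probability."""
    features = to_features(mail, vocabulary)
    scores = {}
    for label, (prior, stats) in summaries.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mean, variance)
            for x, (mean, variance) in zip(features, stats)
        )
    return max(scores, key=scores.get)


# Toy data, made up purely for illustration.
train_mails = [
    "win free money now",
    "free prize win win",
    "meeting at noon tomorrow",
    "lunch meeting tomorrow please",
]
train_labels = ["spam", "spam", "ham", "ham"]

vocabulary = build_vocabulary(train_mails)
feature_matrix = [to_features(mail, vocabulary) for mail in train_mails]
summaries = summarise(feature_matrix, train_labels)

print(classify("win free prize", vocabulary, summaries))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when thousands of per-word terms are combined.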
Each version has its own branch.
Master is the latest version (and possibly under development).
Working on making this project more generic.
No further changes will be made to the actual classifier logic.
Future plans for this project include creating Flask-based APIs that will:
- Trigger creation of the class summary.
- Read test emails from a specified location.
- Added docstrings and comments.
- Optimised variable usage.
- Fixed bugs.
- Better logging.
- Currently, classification works on the specified set of mails.
- Modularised source code into separate files.
- Removed hard-coded paths. (config.txt)
- Single script for preprocessing, training, testing.
- Simple implementation I did as part of a class project during my master's degree.
https://github.com/savanpatel
https://medium.com/machine-learning-101/chapter-1-supervised-learning-and-naive-bayes-classification-part-2-coding-5966f25f1475