This is the project for my Computer Science thesis, completed at the University of Palermo under the supervision of Professor Roberto Pirrone.
The goal was to build a data analysis pipeline using Big Data technologies. The pipeline covers the following stages:
- Data collection
- Data pre-processing
- Data labeling
- Machine Learning model tuning
- Application of the Naive Bayes algorithm
- Model evaluation
- Insight extraction
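
To make the Naive Bayes and evaluation stages concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python. It is not the code from this project (the real model runs on Apache Spark); the function names, the `alpha` parameter and the four toy tweets are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenised documents.

    alpha is the Laplace smoothing constant (hypothetical default).
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> token frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total = len(labels)
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: tokens seen in class c plus alpha per vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict(model, tokens):
    """Return the label with the highest posterior log-score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Tokens never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Toy, made-up training data (already tokenised); not taken from the real dataset.
train_docs = [
    ["adoro", "questa", "canzone"],
    ["concerto", "bellissimo"],
    ["odio", "questa", "canzone"],
    ["concerto", "pessimo"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["adoro", "concerto"]))
print(accuracy(model, train_docs, train_labels))
```

On real data the evaluation should of course be done on a held-out test set rather than on the training tweets, as in the last line of this toy example.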
The technologies used are:
- Python 3.7
- Tweepy, a Python client for the Twitter API
- Pandas, Python Data Analysis Library
- NLTK, Natural Language Toolkit
- Apache Spark 2.4
The project consists of four Python scripts:
- tweetSave.py collects the tweets; it is set up to collect Italian tweets filtered by music keywords
- tweetClean.py cleans and pre-processes the data
- tweetSentimentRadici.py labels each tweet with a positive, negative or neutral sentiment
- tweetSpark.py applies the machine learning tools (runs on Spark)
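
As a rough idea of what the cleaning and labeling steps involve, here is a small sketch using only the Python standard library. It is an assumption about the general approach, not the content of tweetClean.py or tweetSentimentRadici.py: the function names, the regular expressions and the tiny word lists are hypothetical, and the real scripts may work differently (for instance with NLTK).

```python
import re

# Hypothetical patterns for the noise typically found in tweets.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s]")  # punctuation, emoji and other symbols

# Tiny made-up sentiment word lists, for illustration only.
POSITIVE = {"adoro", "bellissimo", "fantastico"}
NEGATIVE = {"odio", "pessimo", "brutto"}


def clean_tweet(text):
    """Remove URLs, mentions, '#' signs and punctuation, then lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.lower().split())


def label_tweet(cleaned):
    """Label a cleaned tweet by counting positive and negative words."""
    tokens = cleaned.split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


raw = "RT @user: Adoro questa canzone!!! https://t.co/abc #musica"
cleaned = clean_tweet(raw)
print(cleaned)
print(label_tweet(cleaned))
```

A word-count rule like this is only a starting point: it ignores negation and irony, which is one reason to train a classifier on the labeled tweets afterwards.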
Feel free to write to me if you have questions or suggestions for improving the solution.