In this project, I will try to predict whether the stock price increases or decreases based on the top 25 news headlines of the day.
It is widely accepted that media, news and publicity can have a profound effect on stock prices. To test this hypothesis, I decided to carry out this project.
The Dow Jones Industrial Average (DJIA) is a stock market index that tracks the value of 30 large, publicly traded companies based in the US, and so gives an idea of the overall trend in the stock market.
News and global events, political or otherwise, play a major role in changing stock values. Every stock exchange, after all, reflects how much trust investors are willing to place in companies.
Source: Kaggle
Data: This dataset contains 16 years of daily news headlines, from 2000 to 2016, from Reddit WorldNews, together with the Dow Jones Industrial Average (DJIA) close value from Yahoo Finance for the same dates.
This is a binary classification problem: when the target is "0", the same-day DJIA close value decreased compared with the previous day; when the target is "1", the value rose or stayed the same.
Downloaded the CSV file from Kaggle; a minimal loading sketch follows the column descriptions below.
- Date: the dates from 2000 to 2016 on which the news items were published.
- Label: binary numeric; '0' represents that the price went down and '1' represents that the price went up or stayed the same.
- Top1 to Top25: strings containing the top 25 news headlines for the day.
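A minimal loading sketch; the file name and encoding are assumptions, not taken from the original code:
import pandas as pd

df = pd.read_csv('Data.csv', encoding='ISO-8859-1')   # file name and encoding are assumptions
print(df.shape)   # rows x columns (Date, Label, Top1..Top25)
df.head()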
Split the data at the very start so that there is no scope for information leaking from the train dataset into the test dataset.
# train on days before 2015, test on 2015 onwards
train1 = df[df.Date < '20150101']
test1 = df[df.Date > '20141231']

# columns 2 to 26 hold the 25 headline columns (Top1-Top25)
train = train1.iloc[:, 2:27]
test = test1.iloc[:, 2:27]
train_label = train1.Label
test_label = test1.Label
Performed the following steps in order to clean and prepare the data:
- Removed punctuation marks
- Converted headlines to lowercase
- Combined the 25 headlines of each day into one string
- Removed English stopwords and lemmatized the remaining words
# keep only letters: replace punctuation and digits with spaces
train.replace(to_replace="[^a-zA-Z]", value=' ', regex=True, inplace=True)
test.replace(to_replace="[^a-zA-Z]", value=' ', regex=True, inplace=True)

# lowercase every headline column
for col in train.columns:
    train[col] = train[col].str.lower()
    test[col] = test[col].str.lower()

# combine the 25 headlines of each day into a single string
headlines = []
for row in range(0, len(train.index)):
    headlines.append(' '.join(str(x) for x in train.iloc[row, 0:25]))

# remove English stopwords and lemmatize (requires the nltk punkt, stopwords and wordnet corpora)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
for i in range(len(headlines)):
    words = nltk.word_tokenize(headlines[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    headlines[i] = ' '.join(words)
Used both Bag of Words and TF-IDF to vectorize the headlines, since the choice of embedding can affect model performance.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

countvector = CountVectorizer(analyzer='word')         # Bag of Words features
traindataset = countvector.fit_transform(headlines)
cv = TfidfVectorizer(analyzer='word')                  # TF-IDF features
traindataset1 = cv.fit_transform(headlines)
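Before evaluation, the test headlines need the same combining step and must be transformed with the vectorizers already fitted on the train data. A minimal sketch; the names testheadlines, testdataset and testdataset1 are introduced here, not taken from the original code:
testheadlines = []
for row in range(0, len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row, 0:25]))
# (ideally the same stopword-removal/lemmatization step used on the train headlines is applied here too)

# transform, not fit_transform, so the test set uses the vocabulary learned from the train set
testdataset = countvector.transform(testheadlines)     # Bag of Words
testdataset1 = cv.transform(testheadlines)             # TF-IDF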
First further split the train data into train and validation sets, then chose the following models for evaluation.
from sklearn.linear_model import LogisticRegression
from sklearn import naive_bayes, svm, ensemble

lr_model = LogisticRegression(n_jobs=-1)
nb_model = naive_bayes.MultinomialNB()
svc_model = svm.SVC(probability=True, gamma="scale")
rf_model = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
models = ["lr_model", "nb_model", "svc_model", "rf_model"]
Used the following function to view the classification report of each model on the validation data and select a baseline model.
from sklearn.metrics import classification_report

def baseline_model_filter(modellist, X, y):
    '''1. Split the train data further into train and validation (17%) sets.
       2. Fit the train portion to each model in the model list.
       3. Print the classification report for each model on the validation data.
    '''
    X_train, X_valid, y_train, y_valid = X[:3471], X[3471:], y[:3471], y[3471:]
    for model_name in modellist:
        curr_model = eval(model_name)
        curr_model.fit(X_train, y_train)
        print(f'{model_name} \n report:{classification_report(y_valid, curr_model.predict(X_valid))}')
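A usage sketch of how the function is called, once per embedding:
baseline_model_filter(models, traindataset, train_label)     # Bag of Words features
baseline_model_filter(models, traindataset1, train_label)    # TF-IDF features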
Each model generated a report for both the Bag of Words and TF-IDF embeddings.
Proceeded with the random forest model and the Bag of Words embedding based on these reports.
Used GridsearchCV to tune the hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_jobs=-1, warm_start=True)
param = {'n_estimators': [100, 200, 300],
         'criterion': ['gini', 'entropy']}
gcv = GridSearchCV(model, param_grid=param, n_jobs=-1, scoring='accuracy')
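A sketch of fitting the grid search and scoring the tuned model on the 2015-2016 test data; it assumes the testdataset built in the earlier sketch:
gcv.fit(traindataset, train_label)
print(gcv.best_params_)

# final classification report on the held-out test set
predictions = gcv.best_estimator_.predict(testdataset)
print(classification_report(test_label, predictions))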
Thus obtained the final classification report.
- The current analysis is based on 16 years of data, from 2000 to 2016. I would like to collect more, and more recent, data to improve the model.
- News from a single relevant source would be more efficient for building the model: it would simplify cleaning and produce higher-quality data for modelling.
- In the future I would like to try deep learning models, since deep learning works very well on natural language processing problems.