Abstract
This thesis investigates the prediction of stock price changes immediately after the publication of
news articles, based on automatic analysis of those articles. It gives background information on
financial trading theory and text mining, together with an overview of earlier related research on
automatic news article analysis for the purpose of predicting future stock prices.
In this thesis, a system is designed and implemented to predict stock price trends for the time
immediately after the publication of news articles. The system consists of four main components.
The first component gathers news articles and stock prices automatically from the Internet. The second
component prepares the news articles by running them through document preprocessing steps and
selecting relevant features before passing them to a document representation process. The third
component categorizes the news articles into predefined categories, and the fourth component
applies an appropriate trading strategy depending on the category of the news article.
The system requires a labeled data set to train the categorization component. This data set is labeled
automatically on the basis of the price trend directly after the publication of each news article. An
additional label refining step using clustering is added in an attempt to improve the labels produced by
this basic price-trend labeling method.
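
As an illustration only, labeling by price trend could be sketched as follows. The abstract does not
state the actual rule, so the function name `label_by_price_trend`, the parameters `start_offset`,
`window`, and `threshold`, and their default values are all assumptions made for this sketch. The
label refining step by clustering is not shown.

```python
from bisect import bisect_left


def label_by_price_trend(times, prices, published,
                         start_offset=0, window=20, threshold=0.002):
    """Label an article from the price change after its publication.

    times        -- sorted timestamps of the price series (e.g. minutes)
    prices       -- prices aligned with `times`
    published    -- publication timestamp of the article
    start_offset -- assumed delay before the trend starts (hypothetical)
    window       -- assumed length of the trend (hypothetical)
    threshold    -- assumed relative change separating neutral from
                    positive/negative (hypothetical)
    Returns "positive", "neutral", "negative", or None when the price
    series does not cover the whole trend window.
    """
    # First price observation at or after the start and end of the trend.
    start = bisect_left(times, published + start_offset)
    end = bisect_left(times, published + start_offset + window)
    if start >= len(times) or end >= len(times):
        return None

    # Relative price change over the trend window.
    change = (prices[end] - prices[start]) / prices[start]
    if change > threshold:
        return "positive"
    if change < -threshold:
        return "negative"
    return "neutral"
```

Varying `start_offset` in such a sketch corresponds to changing when the price trend used for
labeling starts.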
The findings indicate that categorizing news articles into positive, neutral, and negative categories
provides enough information to forecast stock price trends. Experiments showed that the label refining
method greatly improves the performance of the system. They also showed that the starting time of the
price trend used to label the data sets had a significant impact on the results. Trading simulations
performed with the system achieved positive returns (profits) on most of their trades. Some of the
methods also gave better results than trades based on the manually labeled data set.