Stock performance prediction for Udacity Data Science Nanodegree

Maiweijie
11 min read · Dec 15, 2020
Variation of a few stocks over time.

Project Overview

Investors, be they mutual fund managers or individual players, have long dreamed of being able to predict future stock prices. If successful, a large fortune could be made easily. However, according to Burton G. Malkiel, stocks take a random and unpredictable path that makes all methods of predicting stock prices futile in the long run. Still, the hunt for treasure never ends, and with increasing computing power, more sophisticated techniques have been developed for stock price prediction. Among them, the most exciting is machine learning, which tries to learn from historical stock data and make predictions about the future.

In this project, I first created machine learning regression models that try to predict the price changes of specific stocks over the next few days. After showing that the price information of the past few days alone cannot be used to train an accurate regression model, I instead trained a classification model that predicts the trend of specific stocks in the near future with higher accuracy than random guessing.

Problem Statement

The goal of this project is to predict future performances (price change or trend) of specific stocks using machine learning. The tasks involved are as follows:

  1. Load historical stock price data from Yahoo! Finance
  2. Perform data cleaning to fill missing values
  3. Perform data preprocessing and feature engineering for machine learning
  4. Train regression/classification models using the processed data
  5. Validate the models
  6. Use the validated model to recommend stocks

The final recommendation is expected to provide investors with the stocks offering the highest profitability.

Metrics

For the regression models, the mean squared error (MSE) was used as the evaluation metric, because it quantifies the difference between the predicted and actual stock price changes. A small MSE indicates that the predicted price change is close to the real value, which gives users more confidence. MSE = 0 would mean the model predicts stock prices perfectly, which is ideal but unattainable.
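
A minimal sketch of how MSE is computed with scikit-learn; the `y_true`/`y_pred` values below are made-up daily price changes, not results from the project:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual vs. predicted daily price changes (fractions, e.g. 0.01 = +1%)
y_true = np.array([0.010, -0.005, 0.002])
y_pred = np.array([0.008, -0.001, 0.004])

# MSE = mean of the squared differences between predictions and true values
mse = mean_squared_error(y_true, y_pred)
print(f"MSE = {mse:.6f}")
```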

Classification reports were generated for the classification models to evaluate precision, recall, and overall accuracy. Among these metrics, precision may be the most important: even though a low recall may cause users to miss many good stocks, a sufficiently high precision helps users find the stocks with the highest probability of a price increase.
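
Similarly, scikit-learn's classification_report summarizes precision, recall, and accuracy per class; a toy sketch with hypothetical up/down labels:

```python
from sklearn.metrics import classification_report

# Hypothetical trend labels: True = price went up, False = price went down
y_true = [True, False, True, True, False, True]
y_pred = [True, False, False, True, True, True]

# Per-class precision/recall plus the overall accuracy
print(classification_report(y_true, y_pred))
```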

Data Exploration

Yahoo! Finance provides free access to historical stock price data. While a Yahoo Finance API is available for fetching financial information, I chose to use pandas-datareader (https://github.com/pydata/pandas-datareader), as it is easy to use and provides more flexibility should I want to get data from other providers in the future.
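
A minimal sketch of how such a download might look with pandas-datareader; the ticker list and date range are illustrative, and the availability of the Yahoo! endpoint may change over time:

```python
import pandas_datareader.data as web
from datetime import datetime

# Illustrative tickers and date range; adjust to the stocks of interest
tickers = ["AAPL", "MSFT", "GOOG", "TSLA"]
start, end = datetime(2010, 1, 1), datetime(2020, 12, 1)

# Fetch daily High/Low/Open/Close/Volume/Adj Close for each ticker from Yahoo! Finance
data = {t: web.DataReader(t, "yahoo", start, end) for t in tickers}
print(data["AAPL"].head())
```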

The data provided by Yahoo! Finance include the highest price, lowest price, open price, close price, volume, and adjusted close price of each stock for every trading day. To obtain sufficient data for model training, I fetched the price data of multiple stocks of interest (Apple, Microsoft, Google, Tesla, etc.) from 2010 onward. The following table shows example stock price data for Apple at the beginning of 2010.

Stock price data of Apple at the beginning of 2010.

Data Visualization

The following figure shows the price variation of multiple stocks of interest. All stocks show an increasing trend over the 11-year period. However, the absolute prices differ so much between stocks that it is hard to tell which one offers the higher profitability to investors.

Price variation of stocks of interest.

The following figure instead shows the cumulative log returns of each stock, which compares the price variation of all stocks on the same scale. It can be seen that investing in Netflix and Tesla in 2010 would have yielded the highest returns.

Cumulative log returns of stocks of interest.
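
A minimal sketch of computing cumulative log returns, using a made-up table of adjusted closing prices in place of the real data:

```python
import numpy as np
import pandas as pd

# Made-up adjusted closing prices, one column per ticker (in practice use the fetched data)
prices = pd.DataFrame(
    {"AAPL": [7.6, 7.7, 7.5, 7.9], "TSLA": [4.8, 5.0, 5.3, 5.1]},
    index=pd.date_range("2010-07-01", periods=4, freq="B"),
)

# Daily log return: ln(P_t / P_{t-1}); the cumulative sum gives the log return since the start
log_returns = np.log(prices / prices.shift(1))
cum_log_returns = log_returns.cumsum()
print(cum_log_returns)
```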

Data Preprocessing

Not every stock has data from 2010. For example, Tesla's IPO was on June 29, 2010, so it has no data before that date. Forward filling followed by backward filling is used to fix the problem of missing values. The adjusted closing price column of each stock is then combined to form the following dataframe.

Adjusted closing price of multiple stocks from 2010.
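
A minimal sketch of the filling step, again with made-up prices; Tesla's leading NaNs stand in for the missing pre-IPO data:

```python
import numpy as np
import pandas as pd

# Made-up adjusted close prices; TSLA has no data before its IPO, hence the leading NaNs
adj_close = pd.DataFrame(
    {"AAPL": [6.4, 6.5, 6.6, 6.7], "TSLA": [np.nan, np.nan, 4.8, 5.0]},
    index=pd.date_range("2010-06-25", periods=4, freq="B"),
)

# Forward-fill first (propagate the last known price), then backward-fill the remaining
# leading gaps so every column starts with a value
adj_close = adj_close.ffill().bfill()
print(adj_close)
```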

There are two ways to use the data to train a machine learning model. The first is to use the historical data as a time series and train a model directly. However, the prices of different stocks differ greatly in magnitude, which could affect the learning rate and accuracy of machine learning models. Therefore, the day-to-day percent price change of each stock is calculated, as shown in the following table.

Day-to-day percent price change of each stock.

The percent price changes of a stock over the previous n days can then be used to predict the price change of the same stock on the next day. A dataframe created for this purpose is shown in the following table. I used the percent price change data of the past 60 days to predict the change on the following day, which leads to 61 columns in total. The dataframe has about 27,000 rows.

Dataframe of time series data of each stock.
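
One way such a dataframe could be assembled from the daily percent changes is sketched below; the helper name and the assumption that `adj_close` is the combined price table from the previous step are mine, not from the original notebook:

```python
import pandas as pd

def make_lagged_frame(pct_change: pd.DataFrame, n: int = 60) -> pd.DataFrame:
    """For every stock and date, use the percent changes of the previous n days as
    features and the percent change on the following day as the target (n + 1 columns)."""
    rows = []
    for ticker in pct_change.columns:
        series = pct_change[ticker].dropna()
        for i in range(n, len(series)):
            features = series.iloc[i - n:i].to_list()  # previous n daily percent changes
            target = series.iloc[i]                    # percent change on the following day
            rows.append(features + [target])
    columns = [f"lag_{n - j}" for j in range(n)] + ["target"]
    return pd.DataFrame(rows, columns=columns)

# Hypothetical usage: lagged = make_lagged_frame(adj_close.pct_change(), n=60)
```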

While directly using the historical data to make predictions is straightforward, it may require a more sophisticated model and a larger data set for the model to learn the general pattern underneath the data. Alternatively, one can use feature engineering to extract informative features from the original data and improve the performance of a relatively simple model. In this project, six features are extracted from the historical price data: highest price minus lowest price (H-L), close price minus open price (C-O), the 7-day, 14-day, and 21-day moving averages of the price (7d_MA, 14d_MA, 21d_MA), and the 7-day standard deviation (7d_std), as shown in the following table. The last column (Shift_1) is the stock price of the following day, which is the desired output of the model.

Dataframe for training machine learning models with extracted features.
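
A sketch of how these features and the target might be computed for a single stock, assuming `df` is the raw Yahoo! Finance frame for that ticker (the helper name is hypothetical):

```python
import pandas as pd

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the six engineered features plus the next-day target for one stock.
    df is assumed to contain the High, Low, Open, Close, and Adj Close columns."""
    out = pd.DataFrame(index=df.index)
    out["H-L"] = df["High"] - df["Low"]             # daily trading range
    out["C-O"] = df["Close"] - df["Open"]           # daily gain or loss
    out["7d_MA"] = df["Adj Close"].rolling(7).mean()
    out["14d_MA"] = df["Adj Close"].rolling(14).mean()
    out["21d_MA"] = df["Adj Close"].rolling(21).mean()
    out["7d_std"] = df["Adj Close"].rolling(7).std()
    out["Shift_1"] = df["Adj Close"].shift(-1)      # price on the following day (the target)
    return out.dropna()
```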

Implementation

Regression model with time series data

A simple neural network (NN) model was trained to predict the stock price percent change based on the percent changes of the previous n days. Although a recurrent network (RNN, LSTM, or GRU) may be more suitable for this task, since the price change data can be treated as a time series, an NN was chosen in this preliminary study for its simplicity. If the model proves promising, more advanced models can be developed later to improve prediction performance.

Model refinement: As mentioned, the stock price changes of the previous n days are used to predict the price change on the following day. The selection of n is important. A larger value of n means the model tries to learn the long-term pattern of a stock, while a smaller value means the model predicts based on short-term behavior. In this project, n is chosen from (5, 10, 20, 40, 60) so that the model achieves the highest accuracy (lowest MSE). n = 5 means the model predicts the future price change based on the past week (5 trading days), while n = 20 means it predicts based on the past month (4 weeks per month). For each value of n, 90% of the resulting dataframe is used as the training set and the rest as the test set.

In addition, the GridSearchCV method of scikit-learn was used to tune the hyperparameters of the NN model, which include the sizes of the hidden layers (10–50), the activation function (tanh or ReLU), and the learning rate schedule (constant or adaptive). Each hidden layer is assumed to have the same number of neurons. All models have three hidden layers, which should provide sufficient complexity to fit the data without severe overfitting.
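
The grid search might look roughly like the sketch below, using scikit-learn's MLPRegressor; the exact grid values and the `X_train`/`y_train` names are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Three hidden layers of equal size; grid values are illustrative
param_grid = {
    "hidden_layer_sizes": [(10, 10, 10), (20, 20, 20), (50, 50, 50)],
    "activation": ["tanh", "relu"],
    "learning_rate": ["constant", "adaptive"],
}

search = GridSearchCV(
    MLPRegressor(max_iter=1000, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=3,
)
# search.fit(X_train, y_train)  # X_train/y_train: lagged percent changes and next-day targets
# print(search.best_params_)
```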

After the grid search, the model with n = 5, an NN structure of (20, 20, 20) with the tanh activation function, and a constant learning rate yields the best accuracy. A reference model is used to benchmark its performance. The reference model simply uses the price percent change of the last day of the n-day window as the prediction for the following day. In other words, if a stock increases by 1% today, the reference model predicts that the stock will increase by another 1% on the next trading day.

The MSE of the reference model on the test set is 0.0011, while the MSE of the NN model is 0.00056, about half that of the reference model. While this seems promising, plotting the predicted daily price change against the true daily price change clearly shows that there is no correlation between the predicted and true values for either the reference model or the NN model. It should also be noted that the grid search of hyperparameters did not significantly affect the model performance. The failure of the NN model seems to support the random walk theory that one cannot predict future stock prices from historical price data alone.

Predicted daily price change versus real price change for (left) reference and (right) NN models.

Classification model with time series

To further test the random walk theory, a neural network classifier was trained to predict whether the price of a specific stock is going up or down, with the prices of the previous n days as inputs. A grid search of model hyperparameters, similar to that of the NN regressor, was performed to optimize the classifier.

The resulting optimized model has n = 5 and an NN structure of (50, 50, 50) with the ReLU activation function and a constant learning rate. As with the NN regressor, a simple reference classifier is used to evaluate the NN classifier. The reference classifier simply uses the daily price change of the last day to predict the trend of the next day. In other words, if the price of a stock increases today, it is predicted to keep increasing tomorrow.

Classification report of the reference model

The table on the left shows the classification report of the reference model (True: price increase, False: price decrease). It shows that the accuracy of the model is 50%, meaning that a stock has equal probabilities of going up or down even if it increased today.

Classification report of the NN classifier

The second table shows the classification report of the NN classifier. Its precision in predicting stock price increases is higher than that of the reference model, but the difference is negligible, and the overall accuracy is the same. Again, the grid search of hyperparameters did not significantly affect the model performance. These results suggest that stock performance cannot be predicted using only historical stock price data.

Classification model with extracted data

Now let’s try to use feature engineering to create more informative features and see whether we can obtain a model with higher predictive power. As mentioned, six simple features are extracted from the historical stock price data. Instead of an NN classifier, a random forest ensemble classifier was trained to predict whether the price of a stock will go up or down the next day. Compared with an NN classifier, a random forest is less computationally expensive, requires less data, and can achieve better performance by combining many decision trees into a single model.

Model refinement: A grid search was performed on the hyperparameters of the random forest model. The searched hyperparameters included the number of decision trees (4, 8, 12, 20), max_features (‘auto’, ‘sqrt’, ‘log2’), and max_depth (2, 4, 8, None), whose ranges were selected to achieve high accuracy while avoiding overfitting. The grid search yielded a random forest model with 12 decision trees, max_features = ‘auto’, and max_depth = 8. Compared with a model without refinement (4 decision trees, max_features = 2, max_depth = 2), the classification accuracy on the training set improved from 59% to 86%, which is clearly better than that of a random walk model (50%).
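
A sketch of the corresponding grid search with RandomForestClassifier, using the values quoted above; `X_train`/`y_train` are assumed to hold the engineered features and the up/down labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [4, 8, 12, 20],           # number of decision trees
    "max_features": ["auto", "sqrt", "log2"],  # note: newer scikit-learn drops "auto"
    "max_depth": [2, 4, 8, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,
)
# search.fit(X_train, y_train)  # X_train: engineered features, y_train: up/down labels
# print(search.best_params_)    # best found here: 12 trees, max_features="auto", max_depth=8
```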

Classification report of the random forest ensemble model on the training set.

Model Validation and Evaluation

The following table shows the performance of the random forest ensemble model on the test set. The accuracy is 80%, and the precision on predicted price increases is 82%. While this is lower than on the training set, it is still notably better than the 50% expected under the random walk theory.

Classification report of the random forest ensemble model on the test set.

The obtained random forest model contains 12 decision trees with different bias and variance, and the final prediction is the average of their individual predictions. In addition to predicting the class of each stock, the random forest model can also output the probability of each class. The following figure shows the class probability of each stock of interest as predicted by the model. With these probabilities, investors can select the stocks that have the highest probability of turning a profit.

Note that while the model cannot predict how much the price is going to change, knowing just the trend of a stock may already be good enough, thanks to compounding. For example, suppose at the beginning of a year you invested $1 in a stock selected using the model, and its price increased by 0.1% in a day. You then cash out and invest again in another stock selected by the model. If the model were always correct (it is not) and the daily gain were kept at 0.1%, after one year your profit would be more than 28% (~250 trading days). Although this strategy would lead to a significant increase in taxable income, the point is that knowing only the trend of a stock can already be highly profitable for an investor.

Predicted probability of stock trend in the following day.
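
The class probabilities come from averaging the predictions of the individual trees (predict_proba in scikit-learn), and the compounding claim above is easy to verify:

```python
# Hypothetical: probability of up/down for the latest feature rows of each stock
# probs = search.best_estimator_.predict_proba(X_latest)

# Compounding check: a 0.1% gain per day over ~250 trading days
print((1.001 ** 250 - 1) * 100)  # ≈ 28.4% annual return
```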

Conclusion

To sum up, in this post I investigated several methods to predict the future performance of a stock. Most of the models fail to provide reasonable predictions, lending support to the random walk theory. However, using the random forest ensemble model along with extracted features, the short-term trend of a stock can be predicted with a higher accuracy than coin flipping.

In this preliminary study, only six features are extracted from historical price data to train the machine learning models. In the future, more features, especially those related to the performance of the entire market, such as the S&P 500 index, Dow Jones Industrial Average, and Nasdaq Composite, should be incorporated to further improve the accuracy of the model. The model should also be extended to provide medium- to long-term predictions to minimize income tax.

For details of the code, please refer to my Github Repo:

https://github.com/maiweijie2009/Stock-prediction/blob/main/Stock_prediction.ipynb
