This post outlines a framework for forecasting short-term directional movements of equity prices from daily data. The method relies on support vector machines (SVMs) and treats the system like a Markov chain. Historical data is downloaded from stooq.com. This is not investment advice or a recommendation of an investment strategy; it is provided for educational purposes only. The following code comes with no warranties whatsoever. The code, which can be found in its entirety on GitHub, attempts to model the directional movement (i.e. above or below the previous close) of a stock's closing price using the following variables:

  • return of the equity at lag = 1
  • return of the equity at lag = 3
  • return of SPY at lag = 1
  • return of SPY at lag = 3
  • return of QQQ at lag = 1
  • return of QQQ at lag = 3
  • return of UVXY at lag = 1
  • return of UVXY at lag = 3

Return at time t is defined as $r_t = (\mathrm{close}_t / \mathrm{close}_{t-1}) - 1$.
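
As a small worked example of the return definition and the lagged features, here is a base R sketch. The prices are made up for illustration, and LagBy is a hypothetical stand-in for the lag() call on xts objects used later in the post.

```r
# Hypothetical closing prices, oldest first (illustrative values only).
close.px <- c(100, 102, 101, 104, 104)

# Return at time t: close_t / close_{t-1} - 1.
# The first entry has no previous close, so it is NA.
ret <- c(NA, close.px[-1] / close.px[-length(close.px)] - 1)

# Hypothetical helper: shift a vector back by k periods, padding with NA.
LagBy <- function(x, k) c(rep(NA, k), head(x, -k))

features <- data.frame(close = close.px,
                       ret   = ret,
                       LR1   = LagBy(ret, 1),  # yesterday's return
                       LR3   = LagBy(ret, 3))  # return from three days ago
print(round(features, 4))
```

Rows with NA in any feature carry no usable history, which is why the setup code later drops them with na.omit.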

The code is broken up into four parts:

  • helpers.r – a file containing helper functions
  • setup.r – a file that specifies which data to mine (ticker symbols, start and end dates). This file also reduces the necessary data to a matrix-like extensible time series object and splits this data into a training set and testing set.
  • svm_trend_follower.r – a file that specifies and evaluates models. Evaluation results are saved to a text file.
  • svm_predict.r – a file that predicts directional movement 1 step ahead. Prediction is saved to a text file.

This code relies on the xts, formula.tools, and e1071 packages.

imports <- c('xts', 'formula.tools', 'e1071')
invisible(lapply(imports, require, character.only = TRUE))

From the helpers file, the functions GrabData, Frame2Xts, PLag4, TTSplit, and LiveTestSplit are used to acquire, format, and manipulate data for use in SVM models.

GrabData downloads data from stooq.com given a stock symbol and start and end dates.

GrabData <- function(x, from = '20130225', to = '20181228'){
  # Retrieves historical price data from stooq.com.
  # Args:
  #   x: Stock symbol. Lowercase string.
  #   from: First date in the series. String in the format YYYYMMDD.
  #   to: Last date in the series. String in the format YYYYMMDD.
  # Returns: A dataframe with Date, Open, High, Low, Close, and Volume columns
  return(read.csv(sprintf('https://stooq.com/q/d/l/?s=%s.us&d1=%s&d2=%s&i=d', 
                          x, from, to)))
}

Frame2Xts will do some formatting and convert a data frame to an extensible time series object.

Frame2Xts <- function(x){
  # Convert dataframe pulled from stooq to xts object
  # Args:
  #   x: dataframe object from stooq.com
  # Returns: xts equivalent of original data frame
  x$Date <- as.Date(x$Date)
  # Keep Date out of the data matrix so the remaining columns stay numeric
  x.xts <- xts(x[, names(x) != 'Date', drop = FALSE], order.by = x$Date)
  storage.mode(x.xts) <- "numeric"
  x.xts
}

PLag4 calculates and stores daily returns, dummy variables for whether returns were positive or negative, and vectors of lagged returns up to lag 4. Models that require more lags will need this function to be extended accordingly.

PLag4 <- function(x){
  # Calculate and store daily returns, whether return was positive or negative,
  # calculate lagged returns up to 4 lags, return new xts object
  # Args:
  #   x: an xts object with 'Close' column
  # Returns: xts object with daily returns, up to 4 lags, 
  # and dummy for whether return was positive or negative
  x$Return <- ((x$Close/lag(x$Close, 1)) - 1 )
  x$PosR <- ifelse(x$Return > 0.0, 1, 0)
  x$NegR <- ifelse(x$Return < 0.0, 1, 0)
  x$LR1 <- lag(x$Return, 1)
  x$LR2 <- lag(x$Return, 2)
  x$LR3 <- lag(x$Return, 3)
  x$LR4 <- lag(x$Return, 4)
  x
}

TTSplit splits data into training and testing sets given an xts object and a proportion of data to be used as the training set.

TTSplit <- function(x, p = 0.7){
  # Splits data set in training and testing sets. Produce index of split.point
  # Args:
  #   x: xts object
  #   p: proportion of entries to be used in the training data set
  # Returns: train data (xts), test data (xts), split.point (integer)
  nm <- c('train.data', 'test.data', 'split.point')
  split.point <- floor(nrow(x)*p)
  train.data <- x[(1:split.point),]
  test.data <- x[(split.point+1):nrow(x),]
  out <- list(train.data, test.data, split.point)
  names(out) <- nm
  out
}
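
The split above is chronological rather than random, so every test observation comes after every training observation. A minimal base R sketch of the same index arithmetic on a made-up data frame (the data are illustrative, not market data):

```r
# Hypothetical series of 10 daily observations, oldest first.
obs <- data.frame(day = 1:10,
                  ret = c(0.01, -0.02, 0.00, 0.03, -0.01,
                          0.02, 0.01, -0.03, 0.02, 0.00))

p <- 0.7                               # share of rows used for training
split.point <- floor(nrow(obs) * p)    # 7 for this example

train.data <- obs[seq_len(split.point), ]
test.data  <- obs[(split.point + 1):nrow(obs), ]

# The training window ends before the test window starts.
print(range(train.data$day))
print(range(test.data$day))
```

Keeping the order intact avoids training on observations that occur after the ones being tested.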

LiveTestSplit is a convenience function for formatting dates when subsetting extensible time series objects.

LiveTestSplit <- function(x, start, stop){
  # Reduces data sets to index from start to stop
  # Args:
  #   x: xts object
  #   start: string providing start date in the format MM/DD/YY
  #   stop: string providing stop date in the format MM/DD/YY
  # Returns: xts object indexed from start to stop date
  dates <- c(start,stop)
  dates <- as.Date(dates, '%m/%d/%y')
  formated.date <- sprintf("%s/%s", dates[1], dates[2])
  x[formated.date]
}

We put all of this together in the setup.r file:

###############################
## SetUp & Data Acquisition
###############################
###############################
## IMPORTS & WORKING DIRECTORY
###############################
imports <- c('xts','pROC','formula.tools', 'e1071')
invisible(lapply(imports, require, character.only = TRUE))
###############################
## DATA MINING - FROM STOOQ.COM
###############################
basket <- c('spy','qqq', 'tsla', 'pypl', 'sq', 'aapl', 'v', 'fb', 'amd','iq', 'uvxy')
basket2 <- c('tsla', 'pypl', 'sq', 'aapl', 'v', 'fb', 'amd')
start.date <- '20181001'
end.date <- '20190123'
###############################
## DATA MANIPULATION
###############################
stocks.df <- lapply(basket, GrabData, from = start.date, to = end.date)
names(stocks.df) <- basket
stocks.xts <- lapply(stocks.df, Frame2Xts) # Turn data frames into xts objects
stocks.xts <- lapply(stocks.xts, PLag4) # Calculate returns and lags
stocks.xts <- lapply(stocks.xts, na.omit) # Remove entries with NAs
##############################
## TRAIN TEST SPLIT
##############################
c.stocks.xts <- as.xts(Reduce(merge, stocks.xts))
data.split <- TTSplit(c.stocks.xts['2018-10-11/2019-01-10'], p = 0.75)
oos.xts <- LiveTestSplit(c.stocks.xts,"01/14/19","01/18/19") # Out of Sample/Live Test (two-digit years, per '%m/%d/%y')
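
LiveTestSplit parses both dates with the format '%m/%d/%y', which expects a two-digit year. The base R sketch below illustrates why the year format matters: a four-digit year is not rejected but silently misread. ParseShortDate is a hypothetical stricter parser, not part of the original helpers.

```r
# Two-digit year, as the '%m/%d/%y' format expects.
ok <- as.Date("01/18/19", "%m/%d/%y")

# Four-digit year: %y consumes only the first two digits ("20"),
# so the date lands in the year 2020 instead of 2019.
bad <- as.Date("01/18/2019", "%m/%d/%y")

print(format(ok))
print(format(bad))

# Hypothetical helper: refuse anything that is not exactly MM/DD/YY.
ParseShortDate <- function(s) {
  stopifnot(grepl("^[0-9]{2}/[0-9]{2}/[0-9]{2}$", s))
  as.Date(s, "%m/%d/%y")
}
```

A check like this fails loudly instead of returning a window that extends a year too far.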

Next we discuss the svm_trend_follower.r file, where we specify and evaluate models. We use the helper function SvmEval2 to measure a model's accuracy on out-of-sample data. The function returns the proportion of correct directional predictions and a confusion matrix.

SvmEval2 <- function(model, data){
  # Evaluate a fitted SVM model on a data set it was not trained on
  # Args:
  #   model: svm model object (as returned by svm() from e1071)
  #   data: data series for the model to be tested on
  # Returns: Accuracy, Confusion Matrix
  # CONFUSION MATRIX ---------------
  pred <- predict(model, data)
  m.confusion <- table(Predicted = pred, 
                       Actual = data[, gsub('as.factor, ', '', 
                                            toString(get(toString(model$call$formula))[[2]])
                                            )
                                     ]
                       )
  accuracy <- as.numeric(sum(diag(m.confusion))/sum(m.confusion))
  # OUTPUT -----------------------
  out <- list(accuracy, m.confusion)
  names(out) <- c("accuracy", "m.confusion")
  out
}
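
To make the accuracy calculation concrete, here is a base R sketch of the same confusion-matrix arithmetic on made-up labels. The vectors are hypothetical. Fixing the factor levels is an assumption added here so that the table stays square even if one class never appears.

```r
# Hypothetical actual and predicted directions (1 = up day, 0 = not up).
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are realized outcomes.
m.confusion <- table(Predicted = predicted, Actual = actual)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(m.confusion)) / sum(m.confusion)

print(m.confusion)
print(accuracy)
```

Six of the eight predictions match, so the accuracy is 0.75. The diagonal only counts correct predictions when the row and column categories line up, which is why the levels are set explicitly.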

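Given our ‘basket’ setup in the setup.r file, here is an example of how we would model AMD. Merging the per-stock objects leaves the first ticker's column names unchanged and appends a numeric suffix to the rest in basket order, so SPY columns have no suffix, QQQ columns end in .1, AMD columns end in .8, and UVXY columns end in .10: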

kern_type = "polynomial"
#AMD-----------------------
amd.svm.factors <- c('LR1.8', 'LR3.8','LR1','LR3','LR1.1','LR3.1','LR1.10', 'LR3.10')
amd.svm.formula <- as.formula(paste('as.factor(PosR.8)~', paste(amd.svm.factors, collapse = '+')))
amd.svm.model <- svm(amd.svm.formula, data = data.split$train.data, kernel = kern_type)
amd.svm.eval <- SvmEval2(amd.svm.model, data.split$test.data)
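
The suffixed column names in the AMD example are easy to mistype. As a sketch, a small helper can derive them from a ticker's position in the basket. ColFor is hypothetical (not in the original code) and assumes the naming seen above: the first ticker keeps the bare name and the k-th ticker gets the suffix .(k-1).

```r
basket <- c('spy', 'qqq', 'tsla', 'pypl', 'sq', 'aapl', 'v', 'fb',
            'amd', 'iq', 'uvxy')

# Hypothetical helper: merged column name for one ticker.
ColFor <- function(col, ticker, basket) {
  k <- match(ticker, basket)
  if (is.na(k)) stop("ticker not in basket: ", ticker)
  if (k == 1) col else paste0(col, ".", k - 1)
}

# Lag-1 and lag-3 returns for the stock itself, SPY, QQQ, and UVXY.
tickers <- c('amd', 'spy', 'qqq', 'uvxy')
amd.svm.factors <- unlist(lapply(tickers, function(tk) {
  c(ColFor('LR1', tk, basket), ColFor('LR3', tk, basket))
}))
response <- ColFor('PosR', 'amd', basket)

print(amd.svm.factors)
print(response)
```

If the basket order changes, the names follow automatically; checking the result against colnames() of the merged object is still a sensible safeguard.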

After we similarly specify models for the other stocks in our basket, we can filter the models by their out-of-sample accuracy and write the results to a text file. Here we keep only models that are accurate more than 50% of the time:

############################
## BULK LIST BASED ACTIONS
############################
evaluations <- list(tsla.svm.eval, pypl.svm.eval, sq.svm.eval, 
                    aapl.svm.eval, v.svm.eval, fb.svm.eval, amd.svm.eval)
names(evaluations) <- basket2
models <- list(tsla.svm.model, pypl.svm.model, sq.svm.model, 
               aapl.svm.model, v.svm.model, fb.svm.model, amd.svm.model)
names(models) <- basket2
##############################
## FILENAME DETAILS
##############################
filename <- sprintf("%s ANALYSIS %s.txt", "SVM", toString(format(Sys.time(), "%Y-%m-%d %H-%M-%S")))
############################
## SAVE TO FILE
############################
sink(filename)
print(Filter(function(x) x$accuracy > 0.5, evaluations)) # explicit print so output is written when run via source()
sink()
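
A toy version of the filtering step, in base R, shows two details: the comparison is strict, so a model at exactly 50% is dropped, and results only reach the sink file if they are printed (top-level expressions are not auto-printed when a script is run with source()). The evaluation values and the temporary file are made up for illustration.

```r
# Hypothetical evaluation results, shaped like the SvmEval2 output.
evaluations <- list(
  aaa = list(accuracy = 0.62),
  bbb = list(accuracy = 0.48),
  ccc = list(accuracy = 0.50)
)

# Keep only models that beat 50%; 'ccc' at exactly 0.50 is excluded.
kept <- Filter(function(x) x$accuracy > 0.5, evaluations)

# Write the survivors to a temporary file with an explicit print().
out.file <- tempfile(fileext = ".txt")
sink(out.file)
print(kept)
sink()

written <- readLines(out.file)
```

With only a few dozen test observations per model, an accuracy slightly above 50% can easily arise by chance, so the filter is best read as a first screen.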

Finally, if we find a model that appears promising, we can use it for forecasting, as demonstrated in svm_predict.r. We create a data frame containing the variables our model needs and attempt to forecast the next day's directional movement. (The SvmPred helper called below is not reproduced in this post.)

##############################
### SVM PREDICT
##############################
##############################
### FILENAME DETAILS
##############################
filename <- sprintf("%s PREDICT %s.txt", "SVM", toString(format(Sys.time(), "%Y-%m-%d %H-%M-%S")))
##############################
### PREDICTION DATAFRAME
##############################
# NOTE: the values below are entered in percent (5.3 means 5.3%). PLag4 stores
# returns as fractions (0.053), so rescale with predict.df / 100 if the model
# was trained on PLag4 output.
predict.df <- #Prediction for 1/25/19
  data.frame("LR1.8" = c(5.3), # AMD LAG 1 - i.e. 1/24/19
             "LR3.8" = c(-4.86), # AMD LAG 3 - i.e. 1/22/19
             "LR1" = c(0.05), # SPY LAG 1 - i.e. 1/24/19
             "LR3" = c(-1.35), # SPY LAG 3 - i.e. 1/22/19
             "LR1.1" = c(0.65), # QQQ LAG 1 - i.e. 1/24/19
             "LR3.1" = c(-2.0), # QQQ LAG 3 - i.e. 1/22/19
             "LR1.3" = c(1.88), # PYPL LAG 1 - i.e. 1/24/19
             "LR3.3" = c(-1.29), # PYPL LAG 3 - i.e. 1/22/19
             "LR1.10" = c(-4.79), # UVXY LAG 1 - i.e. 1/24/19
             "LR3.10" = c(13.46) # UVXY LAG 3 - i.e. 1/22/19
  )
#############################
### SAVE TO FILE
#############################
models <- list(amd.svm.model, pypl.svm.model)
names(models) <- c('amd', 'pypl')
sink(filename)
print(lapply(models, SvmPred, predict.df)) # explicit print so output is written when run via source()
sink()

Although the models for AMD and PayPal appear promising based on their accuracy, the prediction for the following day's directional movement is incorrect. To improve accuracy, one can consider optimizing the SVM parameters, selecting a different kernel type, or changing the sampling procedure (dates sampled, sampling methods, etc.). This code and framework can be adapted to other methods that require matrix-like objects and lagged values for forecasting. If one were to hypothetically trade such a strategy, short-term option positions near the money may be interesting trading vehicles.
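
As one possible starting point for the parameter optimization mentioned above, the e1071 package provides tune.svm, which runs a grid search over SVM parameters. The sketch below uses synthetic data as a stand-in for the lagged-return features. The grid values and the radial kernel are arbitrary assumptions, not settings from the original code.

```r
library(e1071)  # same SVM package the post relies on

set.seed(1)

# Synthetic stand-in for lagged-return features (NOT real market data).
n <- 120
toy <- data.frame(LR1 = rnorm(n, sd = 0.01),
                  LR3 = rnorm(n, sd = 0.01))
toy$PosR <- as.factor(ifelse(toy$LR1 + rnorm(n, sd = 0.01) > 0, 1, 0))

# Grid search over cost and gamma; each combination is cross-validated.
tuned <- tune.svm(PosR ~ LR1 + LR3, data = toy,
                  kernel = "radial",
                  cost = c(0.1, 1, 10),
                  gamma = c(0.1, 1))

print(tuned$best.parameters)    # parameter combination with the lowest error
best.model <- tuned$best.model  # SVM refit with those parameters
```

By default the search uses cross-validation on randomly assigned folds, which ignores the time ordering of the observations, so any tuned model should still be checked on a later, untouched period as in the train/test split used earlier.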