Investing in the markets? You'd better backtest your theories... BUT do it the right way
I was prompted to write this article on backtesting mainly by a video from Tucker Balch called "10 Ways Backtests Lie" (see below). In it he aims to help new quants (quantitative analysts) who are just getting started in finance. Even though I don't conduct these kinds of tests anymore (other than for personal use), there are some very good lessons here that I hope can help you (and that I sometimes had to learn the hard way).
Tucker has an interesting background, and I first came across his work through one of the MOOCs (massive open online courses) he teaches. Check out his LinkedIn bio (https://www.linkedin.com/in/tuckerbalch/) for more details.
In showcasing these areas that a seasoned quant (Tucker) highlights, I relate my own thoughts and first-hand experiences in doing backtest work as a method of showcasing the validity of the data models/trading ideas I was selling to asset managers, hedge funds and other buyside firms. I have also added another couple of backtesting mistakes commonly made at the end that weren't covered in the above video, so hopefully you read to the end!
They say the first step towards improvement is to understand where you went wrong but hopefully you'll get the benefit of hindsight upfront after reading this article.
But first, a disclaimer
Nothing like a disclaimer to get us in the right mood but it needs to be done since I am not promoting any investments but rather commenting on lessons I learned when I was giving this sort of advice professionally.
The opinions below are my own and do not in any way reflect the views, opinions, forecasts or recommendations of ASX. Furthermore, these opinions are based on work I did whilst I worked at QMG and Canaccord Genuity as well as through my own personal trading experiences.
What is backtesting?
First of all, many of you might ask, what is backtesting? Quite simply, it's about testing the predictive capability of your data model or algorithm on historical data. Backtests exist as a way for you to showcase how your particular data model would have performed during a certain historical period. Investopedia has a good explanation of it here: https://www.investopedia.com/terms/b/backtesting.asp.
Whilst research in the financial markets is full of disclaimers like "past performance is not indicative of future results", many still rely on backtest performance as a way to compare and contrast different strategies.
A common saying in finance is that "I've never seen a bad backtest" and if you've heard this it wouldn't be wrong when you consider that there are many ways in which this showcase of historical performance can potentially be manipulated to produce better looking results.
Top 10 ways backtests lie
Number 1 - In Sample Backtesting
This leads us to the first area in Tucker's list and what he considers the most egregious, but very common, backtesting failure (one which he says is "doomed to succeed spectacularly"). The simple explanation is that known future outcomes are used as part of the data that trains your model. It's similar to the notion that hindsight is 20/20, and when you're trying to convince investors to make a certain type of bet on that basis, you can see why it's egregious.
My opinion - This is definitely the first type of bias I learnt about when I was getting started. What's considered best practice when creating a backtest in the financial markets is to split your data into in-sample and out-of-sample groups. Investopedia has a good explanation of this (https://www.investopedia.com/articles/trading/10/backtesting-walkforward-important-correlation.asp).
Splitting up your data this way is done to make sure that your model is not being influenced by outcomes we know now but the model would not have known at the time.
It might not be something you're used to but when you get into the habit of splitting up your data into training/testing sets and looking at in-sample and out-of-sample data sets you'll ensure that your model and results are not informed by information that it would not have known at the time.
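To make the split concrete, here is a minimal sketch in Python. The function name, the 75/25 ratio and the prices are all made up for illustration; the one idea it is meant to show is that the split is done by time, with no shuffling, so the out-of-sample data is strictly later than anything the model was fitted on.

```python
def chronological_split(rows, train_fraction=0.75):
    """Split time-ordered rows into in-sample and out-of-sample parts.

    rows must already be sorted oldest to newest. Nothing is shuffled,
    so every out-of-sample row is later than every in-sample row.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical daily closes, oldest first (made-up numbers).
prices = [
    ("2020-01-02", 100.0), ("2020-01-03", 101.5),
    ("2020-01-06", 99.8), ("2020-01-07", 102.1),
    ("2020-01-08", 103.0), ("2020-01-09", 102.4),
    ("2020-01-10", 104.2), ("2020-01-13", 105.0),
]

# Fit or tune the model on in_sample only; judge it on out_of_sample.
in_sample, out_of_sample = chronological_split(prices)
print(len(in_sample), "in-sample rows,", len(out_of_sample), "out-of-sample rows")
```

A random shuffle, which is common in other machine learning work, would leak later prices into the training set, which is exactly the problem described above.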
Number 2 - Survivorship Bias
Tucker starts this section with a healthcare/pharmaceutical example in the video and then moves on to the finance-related one: testing only on stocks that exist now, instead of also including those your model would have traded at the time but which have not survived until now. Depending on how far back you go for an index like the S&P 500, the constituent stocks have changed over time. Running your strategy back then may have meant those stocks would have been included in your buy/sell trades.
If the companies those stocks belonged to fell out of the index due to poor performance, your results will be skewed to ignoring those unfavourable results.
My opinion - This is a simple issue to understand, but a hard problem to solve for because the data is often not easily available. If you started today and collected information on what stocks move in and out of something like the S&P 500, then you would have a good dataset against which to test but that would take time. In your situation, you're likely wanting to get historical movements to show why your strategy is good today for the index/universe you're looking to test against.
There are more services coming online that make this data available but without accounting for this bias, you're going to have skewed results.
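If you do get hold of membership history, the check itself is simple. The sketch below assumes a hypothetical table of (ticker, date added, date removed) records; the tickers and dates are invented and are not real index data. The idea is to ask "what was in the index on this day?" rather than "what is in the index today?".

```python
from datetime import date

# Hypothetical membership records: (ticker, date_added, date_removed).
# date_removed is None while the stock is still in the index.
MEMBERSHIP = [
    ("AAA", date(2005, 1, 3), None),
    ("BBB", date(2005, 1, 3), date(2012, 6, 29)),  # later dropped
    ("CCC", date(2013, 3, 1), None),               # added later
]


def universe_on(as_of, membership=MEMBERSHIP):
    """Return the tickers that were index members on the given date."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )


# A 2010 backtest should be able to trade BBB, even though it is gone today.
print(universe_on(date(2010, 1, 4)))
print(universe_on(date(2020, 1, 2)))
```

Testing a 2010 strategy on today's list would silently drop BBB, and with it whatever losses led to its removal.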
Number 3 - Assuming you can get specific market prices in your backtest
In this example, it's about people assuming that they can get a certain price at a certain point in time. For example, having a model that states you can secure a trade at a certain price like the close price of a certain day.
Realistically, however, you'll likely be able to get in, at the earliest, at the opening (or near there) of the next trading day.
My opinion - This was one I faced early on. It made sense as to why this would not work so we played it safe and looked at the close of the next day. This was okay since the majority of the strategies I was looking at were focused on mid-long term, not intra-day turnover.
Trading strategies which don't take this into account will likely skew results more positively than they should because the market will likely have already moved against you as it reprices on new information (not quite as bad as South Park's take on it below but the market does work to outwit you).
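Here is a small sketch of that delay, with made-up bars and a hypothetical helper name. It forms the signal after one day's close and only takes a price from the following day, either the open or (the more conservative choice described above) the close.

```python
# Hypothetical daily bars, oldest first: (day, open, close). Made-up numbers.
BARS = [
    ("Mon", 100.0, 102.0),
    ("Tue", 103.0, 104.0),
    ("Wed", 104.5, 103.5),
]


def delayed_fill(bars, signal_index, use="close", lag=1):
    """Entry price for a signal formed after the close of bars[signal_index].

    The fill is taken `lag` bars later, at that bar's open or close.
    Returns None when there is no later bar to trade on.
    """
    column = {"open": 1, "close": 2}[use]
    fill_index = signal_index + lag
    if fill_index >= len(bars):
        return None
    return bars[fill_index][column]


same_day_close = BARS[0][2]                        # the optimistic assumption
next_day_open = delayed_fill(BARS, 0, use="open")
next_day_close = delayed_fill(BARS, 0, use="close")
print(same_day_close, next_day_open, next_day_close)
```

In this toy data the market has already moved up by the time you can act, so the delayed entry is worse than the price the naive backtest assumed.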
Number 4 - Ignoring market impact
In this example, the backtest would see positive results during testing but when applied to the market, it does not perform as well. What is going on here? The model is ignoring its market impact which would happen as the market is dynamic and is built to act against you. Tucker spoke from his own experience here:
"Our method actually could forecast correctly prices in the future as long as we didn't act on that information"
The lesson is to account for market impact and, as Tucker says, put in a rule of thumb that you will be impacted by at least 5 bps when you make your trade.
My opinion - To get a true test, you'd need to reconstruct the market and how it behaved at that point in time. It either requires access to a large amount of data and the capacity to store this or subscribing to a service that can provide it to you. Because none such were available in my historical work, we accounted for this by adding a buffer on either our long or short trade ideas. If we had a stop loss of 5% in our strategy, our backtest would allow this to stretch further to allow for us not being able to get out at the prices we wanted.
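The 5 bps rule of thumb mentioned above can be sketched as a simple price adjustment. This is only an illustration with a hypothetical function name; real market impact depends on order size and liquidity, which a flat haircut does not capture.

```python
def apply_impact(price, side, impact_bps=5.0):
    """Worsen a quoted price by impact_bps basis points (1 bp = 0.01%).

    Buys are assumed to fill slightly higher, sells slightly lower.
    """
    adjustment = price * impact_bps / 10_000
    if side == "buy":
        return price + adjustment
    if side == "sell":
        return price - adjustment
    raise ValueError("side must be 'buy' or 'sell'")


# Made-up round trip: buy at a quoted 100, sell at a quoted 110.
entry = apply_impact(100.0, "buy")
exit_price = apply_impact(110.0, "sell")

naive_return = 110.0 / 100.0 - 1
impacted_return = exit_price / entry - 1
print(f"naive {naive_return:.4%} vs with impact {impacted_return:.4%}")
```

The difference on one trade is small, but a strategy that trades often pays it every time.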
Number 5 - Buying $10m of a $1m company
This is about capacity limits, and it occurs especially with smaller stocks (penny stocks). Having your model try to allocate more money to a company than normally trades in it on a daily basis is a recipe for backtesting failure.
If your model is based on penny stocks and you assume you can book in a certain level of trade, you are probably focused on percentage improvements without considering how that trade would likely happen in real life. A good way to fix this is to look at the ADV (average daily volume) which can tell you just how much of that company's shares move around day to day.
My opinion - I was shown early on that a way to help alleviate this issue was to filter for stocks with an ADV (average daily volume) above a certain amount. In doing this you can better ensure that your trading algorithm would likely get filled at the level your model is trying to allocate at that point in time.
It may be fine to see a stock rocket to the moon and assume you can allocate a large proportion of your trades to it, but had you actually been there and seen those signals at that time, you may not have found enough willing sellers to put you in that position.
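One way to put the ADV idea into a backtest is to cap each order at a fraction of average daily volume. The sketch below uses made-up volumes, and the 10% participation limit is my assumption for illustration, not a figure from the video.

```python
def average_daily_volume(volumes):
    """Mean number of shares traded per day over the lookback window."""
    return sum(volumes) / len(volumes)


def capped_order(desired_shares, volumes, max_participation=0.10):
    """Limit an order to a fraction of ADV (the 10% default is an assumption)."""
    limit = int(average_daily_volume(volumes) * max_participation)
    return min(desired_shares, limit)


# Hypothetical recent daily volumes for a small, thinly traded stock.
recent_volumes = [40_000, 60_000, 50_000]

print(capped_order(200_000, recent_volumes))  # far more than the stock trades
print(capped_order(1_000, recent_volumes))    # comfortably within the limit
```

If your model wants 200,000 shares and the cap says 5,000, the backtest should only be credited with the smaller position.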
Number 6 - Data mining fallacy / Overfitting
Tucker's example here is about the pitfall that comes from looking, over and over, for a strategy that works. Eventually you will find one. What to tweak becomes apparent once your previous attempts have produced errors.
Just playing around with the model will allow you to see what does not work. But getting that first good set of results (i.e. outperforming the benchmark you set) may lead you to think you've struck gold. You won't really know, however, if you're just getting lucky or you're good.
He says you'll find strategies that work in the past but don't work now but a way you can check is to forward test it (i.e. paper trade - https://www.investopedia.com/terms/p/papertrade.asp).
My opinion - I had first-hand experience with this when I would see models that worked well in backtests but did not work as well once we ran them in real life and got them in front of clients to promote our trading ideas. This came about because we didn't forward test them behind the scenes as much as we should have. Had we done this, we would have seen the errors we were making, adjusted the model and rerun another test. We would have been careful to take into account other biases as we did so, but it is certainly not an easy problem to solve for.
For a good explanation on overfitting check out this article from Anup Bhande on Medium (https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)
Number 7 - Using the wrong benchmark
In this case, Tucker talks about how using the wrong benchmark to test against can inflate your model's performance. In his example, the shorting strategy of a particular model would revert to cash when the market was bullish, which does not give a like-for-like comparison of your model against a benchmark.
A better way would be to adjust the benchmark to make a more fair comparison.
My opinion - This was something that rarely came up since short only strategies were not my focus. At most we did long only and long-short. Additionally, our models would remain active (e.g. not fall back to cash) as we did not always deploy a stop-loss to them.
However, we did run some tests on US equities years ago that used a more closely matched benchmark to measure performance against. For example, a stock like Home Depot (a company like Bunnings here in Australia) would be better compared against a home-related benchmark like the S5HOME Index provided by Bloomberg instead of against the S&P 500. Doing the latter would mean less of a like-for-like comparison and could lead to bias when making your asset allocation decisions.
Number 8 - Stateful strategy bias
In this case, strategies are stateful in that they are being seen to change depending on what day you start them. For example, a strategy that starts on a Monday versus a Tuesday and having vastly different returns as a result.
To combat this, Tucker recommends trying out the strategy on different days to see whether the results of your model vary widely based on starting dates. If the differences are small then you may have a robust strategy, if they're not you may need to rethink your work.
My opinion - At QMG, we formed our views of the market on a monthly basis so, of course, our trading strategy would also look to rebalance monthly. However, in the interest of transparency, we also ran the results on a longer time frame (quarterly) to see how those tweaks might affect results.
Whatever your timeframe, it's best you test against different starting points and different holding periods because otherwise, you are going out into a bigger unknown than you think and that's dangerous.
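As a rough sketch of this kind of check, the code below runs one toy rule from several starting offsets and compares the results. The rule, the prices and the function name are all invented for illustration; the point is only the loop over start dates.

```python
def toy_momentum_return(prices, period, offset):
    """Total return of a toy rule, rebalanced every `period` bars.

    At each rebalance, hold the asset for the next `period` bars only
    if it rose over the previous `period` bars; otherwise sit in cash.
    `offset` shifts the bar on which the rebalance schedule starts.
    """
    growth = 1.0
    for i in range(offset + period, len(prices) - period, period):
        if prices[i] > prices[i - period]:
            growth *= prices[i + period] / prices[i]
    return growth - 1


# Hypothetical price series (made-up numbers).
prices = [100, 102, 101, 105, 107, 104, 108, 110, 109, 113, 111, 115]
period = 3

results = [toy_momentum_return(prices, period, offset) for offset in range(period)]
for offset, result in enumerate(results):
    print(f"start offset {offset}: {result:.2%}")
print(f"spread between best and worst start: {max(results) - min(results):.2%}")
```

A wide spread between start dates, relative to the average result, is a hint that the headline number owes a lot to where the backtest happened to begin.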
Number 9 - Assuming you can buy at the open
In this example, Tucker talks about how having your model take opening prices for a stock is not a great assumption to make because it's not guaranteed that you can get the opening price. Many other forces are at play here and depending on the size of your order, you will very likely not be able to get the opening price (not even the cast of Billions would likely get the prices they want at the open, even if they are literally inside the NYSE below).
My opinion - A way to account for this is to not assume that you'll get the opening price. Elsewhere I mention the use of the adjusted close price but another way to combat this is to assume a buffer amount of time before your order would get filled. When I was getting help in my early days of backtesting work, I was shown a technique that would form an opinion on what to buy on a certain day but only take the entry price for the stock/s at the end of the next trading day.
It might not seem like much but by doing this you avoid assuming more positive returns than likely would have happened.
Number 10 - Placing too much trust in complex models
Put simply, the more complex a model is, the less you can trust it. The reason is mainly due to overfitting which means you might be incorporating noise instead of having your model focus on the real underlying factors that are driving returns.
If you have a model that is based on 3 factors but it has 150% returns over a certain amount of time, it's likely going to be more favourable than one which is based on 8 factors but produces 200% returns.
My opinion - Having worked with multi-factor strategies, this is perhaps my favourite of all the lessons here. If I was still in my former roles (with hindsight 20/20), I would review how the various scoring systems worked to test whether or not we were doubling up on factors which could already be correlated. If they were, then we were potentially putting unfair bias on certain factor groups. For example, if part of your model relied on momentum-based factors but many of them were correlated with one another and you don't remove this bias, you're likely going to skew your model to favour these momentum factors more than it should.
It would be better to have a simple model which can explain a large proportion of what's going on than a more complex model which gives better results.
A simple statistical method you can deploy to take care of this would be to use PCA (principal component analysis - https://en.wikipedia.org/wiki/Principal_component_analysis) in your testing. This technique will allow you to simplify your model to focus on the key factors that it will benefit from without losing too much of the expected returns.
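Before reaching for PCA, a simpler first check for doubled-up factors is to look at pairwise correlations. To be clear, the sketch below is not PCA; it is a plain Pearson correlation scan over made-up factor scores, with hypothetical factor names and an arbitrary 0.9 threshold.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(factors, threshold=0.9):
    """List factor pairs whose absolute correlation meets the threshold."""
    names = sorted(factors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(factors[a], factors[b])) >= threshold
    ]


# Hypothetical factor scores for five stocks (made-up numbers).
factors = {
    "momentum_3m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "momentum_6m": [1.1, 2.1, 2.9, 4.2, 4.8],
    "value": [3.0, 1.0, 4.0, 1.0, 5.0],
}

print(correlated_pairs(factors))
```

Any pair that shows up here is effectively being counted twice in a simple scoring system, so it is a candidate for dropping, merging or down-weighting.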
Tucker actually had 2 more factors to watch out for so this is more of a top 12 from him. They are listed as follows:
Number 11 (Bonus) - Base strategies on what you know happened in the past
In this example, Tucker talks about how you incorporate things you know about the market (your own biases), such as the bull market going on in the USA since 2009, while ignoring how your strategy would run in down-market periods.
My Opinion - I've experienced this example first hand and it is a bit of a dichotomy. On the one hand you want to ensure that your model takes into account regime change. In market-speak this is when the underlying factors that appear to be driving the market (e.g. momentum, low interest rates, etc.) appear to change. As such you will be tempted to ignore historical data and focus on more recent results.
On the other hand, you might only be testing your model in a favourable environment for it and will not know how it could have performed in the past when the market regime was different.
A way to combat this is to include a longer time period but to put more weighting on the recent past. Additionally, you might have a smart model that detects regime change and adjusts accordingly. Furthermore, an easier fix is to just test your model on other historical periods (if possible) to see if it also worked then.
The choice you make on this will depend on your model but by at least taking this into account you'll be in a better position.
Number 12 (Bonus) - Not forward testing
His last bonus example here was to make sure to forward test which relates to paper trading which I mentioned above.
My Opinion - I touch on this in my response to the data mining problem. The only real way to know whether your model is good or not is to forward test/paper trade it. If you don't, you'll likely get a nasty surprise when your real life performance does not match the lofty expectations your backtest model had created for you.
The following items are ones that I did not see on Tucker's list but that I think are equally important to consider when you are doing your own testing. Below I describe the potential bias as well as my opinion and experience with it.
Number 13 (Bonus Bonus) - Applying the wrong type of statistical analysis
In this example, you might have data that is non-stationary (see explanation here https://www.investopedia.com/articles/trading/07/stationary.asp) and you apply a technique like correlation/linear regression analysis to infer a signal or pattern.
The mistake here is that you would be using correlation analysis on share price movement to predict where prices are going to move in the future. But, most share price movements are non-stationary so applying correlation analysis to it would be the wrong technique.
This problem mostly comes about when you haven't done enough research on backtesting methodology or statistics. It happened to me when I didn't know enough about the best ways to conduct time-series analysis and ended up performing correlation analysis of a specific signal on non-stationary data.
Luckily, there are so many articles and research about the right type of analysis that you can access (mostly free) online. Putting in the time to make sure you apply the right type of research will save you a lot of grief later on.
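One common first step (my choice of illustration, not the only fix) is to analyse returns rather than price levels. The sketch below uses two made-up series that both trend upwards: their price levels look highly correlated, but their day-to-day returns move in opposite directions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simple_returns(prices):
    """Period-over-period percentage changes of a price series."""
    return [later / earlier - 1 for earlier, later in zip(prices, prices[1:])]


# Two hypothetical price series that both drift upwards (made-up numbers).
stock_a = [100, 103, 104, 107, 108, 111, 112, 115]
stock_b = [50, 51, 54, 55, 58, 59, 62, 63]

level_corr = pearson(stock_a, stock_b)
return_corr = pearson(simple_returns(stock_a), simple_returns(stock_b))
print(f"correlation of price levels: {level_corr:.2f}")
print(f"correlation of returns:      {return_corr:.2f}")
```

The high correlation in levels here comes almost entirely from the shared upward drift, which is the trap of running correlation analysis directly on non-stationary prices.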
Number 14 (Bonus Bonus) - Not using the adjusted close price
In this example, you might have a strategy that takes the close price as a factor to inform your model. But when a stock pays a dividend, this needs to get taken into account since the price effectively drops by the amount of the dividend. If it closes at $100 and pays a $5 dividend, the price you get will be $95, so a model that only sees the raw close will misread that drop and your returns might be skewed as a result.
This is easily fixed by ensuring you take the adjusted close when you get your data inputs for your model.
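To see why the raw close misleads, here is the arithmetic as a small sketch. It assumes a $5 dividend (the gap between the $100 and $95 figures above) and that the price falls by exactly the dividend, which is a simplification.

```python
def price_return(previous_close, close):
    """Return measured from raw closing prices only."""
    return close / previous_close - 1


def total_return(previous_close, close, dividend):
    """Return that also counts the dividend paid over the period."""
    return (close + dividend) / previous_close - 1


# Assumed figures: $100 close, $5 dividend, price falls to $95.
raw = price_return(100.0, 95.0)
adjusted = total_return(100.0, 95.0, 5.0)
print(f"raw close says {raw:.1%}, dividend-adjusted says {adjusted:.1%}")
```

The raw series shows a 5% fall that a holder of the stock never actually experienced, and any signal built on that series will react to it as if it were real.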
Number 15 - Failing to take into account slippage, trading costs and taxes
Nothing hurts your returns like failing to take into account the extra costs of trading. It's not simply a matter of having a 100% return on your $10k investment and gloating to your friends about how you're a stock market genius. Whilst the initial return percentage might sound good, it will get affected by these other costs.
If your strategy trades often then you should consider the associated trading cost as well as what it ends up costing you in tax when you get that money back out of your online brokerage. For example, if you had 10 stocks giving your $1,000 investment a 10% return in the month, you'd be up $100. However, if you then rebalance and move out of 5 of those stocks, and it costs you $10 for each new stock you trade, expect that $100 profit to drop to $50 at best.
A model that does not account for these costs will always look better than it really is.
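The example above works out as follows. The function and its tax parameter are hypothetical; the 30% rate in the second call is only a placeholder, since the real figure depends on your own tax situation.

```python
def net_profit(capital, gross_return, trades, cost_per_trade, tax_rate=0.0):
    """Profit after flat per-trade costs and a simple tax on any gain."""
    gross = capital * gross_return
    after_costs = gross - trades * cost_per_trade
    tax = max(after_costs, 0.0) * tax_rate
    return after_costs - tax


# Figures from the example: $1,000 invested, 10% return, 5 trades at $10 each.
after_costs = net_profit(1_000, 0.10, trades=5, cost_per_trade=10)

# Same trade with a placeholder 30% tax rate (an assumption, not advice).
after_costs_and_tax = net_profit(1_000, 0.10, trades=5, cost_per_trade=10, tax_rate=0.30)

print(f"after costs: ${after_costs:.2f}, after costs and tax: ${after_costs_and_tax:.2f}")
```

Half of the headline profit disappears before tax, and that is on a single monthly rebalance.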
As you can see here, with backtesting, there are lots of potential biases to take into account. If you don't consider these, you run the risk of your model showcasing false positives and being skewed towards more positive outcomes than would have been true at the time.
When I look at my own backtests (like this US-focused one below), knowing what I do now versus back then, even I'd want to run them again and fix the things that I know are biased in there. You should do this to your work too.
The key here is to be aware of these pitfalls when you backtest your strategies and this can be especially important if you're just starting out with this sort of thing.
So, now that you know of what to watch out for, you can do some more research and build a better mousetra.... I mean backtest.