Machine learning is not magic—It is just a tool! | Portfolio for the Future

Keith Black, PhD, CFA, CAIA, FDP, Managing Director of Content Strategy, CAIA Association

“Building your dream has to start now
There's no other road to take
You won't make a mistake
I'll be guiding you
You have to believe we are magic
Nothin' can stand in our way” - Olivia Newton-John

The next conversation in the CAIA Association and FDP webinar series is with Tony Guida. Guida is the editor of the Journal of Machine Learning in Finance and author of Big Data and Machine Learning in Quantitative Investment, a book that is part of the FDP curriculum. As always, candidates should listen carefully to conversations with authors who have written portions of the FDP curriculum. There might be a quiz later!

Guida wants everyone to know that machine learning need not be feared as a magic black box; it is simply a new tool that facilitates science and research. Machine learning uses data to train models that make predictions, then uses live observations to supplement the data and make new predictions.

Econometrics is widely used in quantitative research, and one of its main goals is unbiased estimation. Machine learning is less concerned with statistical purity: its main goal is to build predictions that work well out of sample. “Econometrics is a beta question, while machine learning is an alpha answer,” Guida notes. Much of the work in econometrics is linear, whereas machine learning is better suited to high-dimensional, non-linear models that likely provide a better fit to real-world data.

One of the most popular machine learning methods is the regression tree, which aims for both good performance and explainable parameters. This is one example of a model that can learn from its mistakes: if stocks are misclassified in a regression tree, the model can grow another branch to reclassify them. A key strength of tree methods is that they can be combined in ensembles that average many models together. Rather than relying on a single model or point estimate, you can use the wisdom of crowds to build better out-of-sample outcomes.
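To make the averaging idea concrete, here is a minimal sketch of an ensemble of very small regression trees. It assumes bagging (each one-split "stump" is fit on a bootstrap resample and the predictions are averaged), which is only one of several ensemble techniques and is not necessarily the method Guida uses. The function names and the toy factor scores and returns are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimizing squared error."""
    best = None
    # Candidate thresholds exclude the largest x so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: predict the overall mean.
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_ensemble(models, x):
    """Average the individual predictions instead of trusting one model."""
    return mean(predict_stump(m, x) for m in models)


# Hypothetical toy data: one factor score per stock and its later return.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
returns = [-0.02, -0.01, -0.03, 0.00, 0.02, 0.03, 0.01, 0.04]

models = fit_bagged_stumps(scores, returns)
low = predict_ensemble(models, 0.1)
high = predict_ensemble(models, 0.9)
```

Any single stump depends heavily on which observations landed in its resample; the averaged prediction is less sensitive to that, which is the "wisdom of crowds" effect in miniature.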

The quality of a machine learning model can be measured using statistics such as true positives and true negatives. True positives are predictions of outperformance that were realized, while true negatives are predictions of underperformance that were realized. Both feed into evaluative measures such as accuracy, recall, and precision, formulas that might come in handy during your FDP exam.
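As a quick refresher, the standard definitions of these measures can be sketched in a few lines. The function name and the toy predictions below are hypothetical; `True` stands for "outperforms" and `False` for "underperforms".

```python
def classification_metrics(predicted, realized):
    """Compute accuracy, precision, and recall from paired boolean labels."""
    pairs = list(zip(predicted, realized))
    tp = sum(1 for p, r in pairs if p and r)          # predicted up, was up
    tn = sum(1 for p, r in pairs if not p and not r)  # predicted down, was down
    fp = sum(1 for p, r in pairs if p and not r)      # predicted up, was down
    fn = sum(1 for p, r in pairs if not p and r)      # predicted down, was up
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical predictions and outcomes for eight stocks.
predicted = [True, True, True, False, False, True, True, False]
realized = [True, False, True, False, True, False, True, False]

metrics = classification_metrics(predicted, realized)
```

Here there are 3 true positives, 2 true negatives, 2 false positives, and 1 false negative, so accuracy is 5/8, precision is 3/5, and recall is 3/4.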

Guida uses machine learning and tree methods to rank stocks into deciles from outperformance to underperformance relative to each stock’s sector peers. As much as 70% of any project’s time is spent on collecting, preparing, and examining data, while the machine learning programming is relatively straightforward given the open-source magic that has been built into the R and Python ecosystems. He prefers to maximize the signal-to-noise ratio on the bulk of the distribution, while modeling the tails of the distribution separately. Interested readers are referred to Guida’s paper with Coqueret, “Training Trees on Tails with Applications to Portfolio Choice,” which can be found on SSRN.
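For readers who want to see what a sector-relative decile ranking might look like, here is a rough sketch. It assumes each stock already has a model score (higher meaning more expected outperformance) and a sector label; the tickers, sectors, scores, and function name are made up for illustration and do not come from Guida's work.

```python
from collections import defaultdict


def sector_relative_deciles(scores, sectors):
    """Map each ticker to a decile from 1 (lowest) to 10 (highest) within its sector."""
    by_sector = defaultdict(list)
    for ticker, score in scores.items():
        by_sector[sectors[ticker]].append((score, ticker))

    deciles = {}
    for members in by_sector.values():
        members.sort()  # ascending by score, ties broken by ticker
        n = len(members)
        for rank, (_, ticker) in enumerate(members):
            # Integer arithmetic spreads ranks 0..n-1 across buckets 1..10.
            deciles[ticker] = rank * 10 // n + 1
    return deciles


# Hypothetical universe: ten tech stocks and five energy stocks.
scores = {f"T{i}": i / 10 for i in range(10)}
scores.update({f"E{i}": 5.0 + i for i in range(5)})
sectors = {ticker: ("tech" if ticker.startswith("T") else "energy") for ticker in scores}

deciles = sector_relative_deciles(scores, sectors)
```

Note that the energy scores are all far higher than the tech scores, yet the weakest energy stock still lands in decile 1: the ranking only compares a stock with its own sector peers. With fewer than ten stocks in a sector, some deciles are simply left empty.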

Before you actually program your model, you have to know your data. Perform some exploratory data analysis and look for correlations and interactions during your training period. While it is important to model over multiple time horizons, the risk of data mining is reduced when the same parameters are used over all time horizons. The accuracy of the model needs to be tested on both the in-sample training data and the out-of-sample testing data. Estimated trading costs need to be included in the modeling, especially at shorter time horizons when turnover is greater.
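Two of these checks are easy to sketch: comparing in-sample with out-of-sample accuracy, and netting estimated trading costs against gross returns. The example below assumes a simple chronological split and a cost proportional to turnover; all function names and numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def chronological_split(rows, n_train):
    """Split time-ordered rows into a training window and a later testing window."""
    return rows[:n_train], rows[n_train:]


def hit_rate(predicted, realized):
    """Share of predictions that matched the realized outcome."""
    hits = sum(1 for p, r in zip(predicted, realized) if p == r)
    return hits / len(predicted)


def net_return(gross_return, turnover, cost_per_unit_turnover):
    """Subtract an estimated cost that scales with portfolio turnover."""
    return gross_return - turnover * cost_per_unit_turnover


# Hypothetical monthly predictions (True = outperform) and outcomes, oldest first.
predicted = [True, True, False, True, False, True, False, True, False, False]
realized = [True, True, False, True, False, True, True, True, True, False]

rows = list(zip(predicted, realized))
train, test = chronological_split(rows, n_train=7)
in_sample = hit_rate(*zip(*train))      # 6 of 7 correct
out_of_sample = hit_rate(*zip(*test))   # 2 of 3 correct

# Hypothetical strategies: the short-horizon one earns more gross but trades more.
short_horizon_net = net_return(gross_return=0.012, turnover=2.0, cost_per_unit_turnover=0.002)
long_horizon_net = net_return(gross_return=0.010, turnover=0.5, cost_per_unit_turnover=0.002)
```

In this toy case the short-horizon strategy looks better before costs but worse after them, which is why costs belong inside the model evaluation rather than being bolted on afterward.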

Model interpretation focuses on the evaluation of simple average feature importance. Features can vary in importance across time horizons, as some signals work better in the long term and others in the short term. Feature importance is what lets you distinguish your model from a magic black box and present it as an important tool with good performance and interpretable outcomes. Longer-term, non-linear ensemble models seem to have higher risk-adjusted performance with lower turnover and trading costs.
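A simple average of feature importance can be computed by hand once each model reports its own importances. The sketch below assumes each model supplies a dictionary of importances that sum to one; the feature names, horizons, and values are invented for illustration, and the helper name is hypothetical.

```python
from statistics import mean


def average_feature_importance(importances_per_model):
    """Average each feature's importance across models, treating a missing feature as 0."""
    features = sorted({f for imp in importances_per_model for f in imp})
    return {
        f: mean(imp.get(f, 0.0) for imp in importances_per_model)
        for f in features
    }


# Hypothetical importances from a short-horizon and a long-horizon model.
short_horizon = {"momentum": 0.6, "value": 0.1, "volatility": 0.3}
long_horizon = {"momentum": 0.2, "value": 0.6, "volatility": 0.2}

average = average_feature_importance([short_horizon, long_horizon])
ranked = sorted(average, key=average.get, reverse=True)
```

In this made-up example momentum dominates at the short horizon and value at the long horizon, while the simple average (roughly 0.40, 0.35, and 0.25) gives a single ranking that is easy to explain.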

Especially today, we don’t need reminders that the world is non-stationary. Volatility regimes change and factors rotate into and out of importance. In such a rapidly changing world, we should use all the tools available to us, such as sentiment analysis of social media signals and machine learning models. They aren’t magic; they are just tools.

Even if you still listen to Olivia Newton-John (and that’s OK), you don’t have to continue to use linear regression and standard econometric models. Those are, so, like 1980. A-duh!

Watch the webinar and get a list of upcoming webinars. Learn more and sign up for the FDP exam.

“A Machine Learning Approach to Risk Factors” can be accessed through the readings packet available to FDP candidates or from the Journal of Financial Data Science.