How to Improve Prepayment Modelling
Exploring the Added Value of Machine Learning
How can the most important risk drivers for capturing prepayment behaviour be identified? What is the relationship between risk drivers and prepayment prediction? And how can machine learning models be used to complement traditional prepayment models? This article examines what machine learning adds to modelling prepayment behaviour in mortgages.
Advancements of Machine Learning in Modelling Prepayments
Machine learning models are a hot topic. In a data-rich environment, they often offer greater insights than traditional models. Machine learning models are able to identify complex relations between the input variables and the target variable that are hard to capture in traditional models. This article explores the added value of machine learning in modelling prepayment behaviour, based on a study that was conducted in collaboration with the Market Risk Management team of Nationale Nederlanden (NN) Bank.
This study offers insights into (1) methods for identifying the most important risk drivers for capturing prepayment behaviour; (2) methods for revealing relations between risk drivers and prepayment prediction; and (3) how machine learning models can be used to complement traditional prepayment models.
Machine learning in prepayment modelling
The machine learning model for prepayments in this study is designed to reveal the risk drivers that are most important for predicting the Conditional Prepayment Rate (CPR) caused by relocation in the upcoming month. The prediction is based on contractual and macro-economic information known to NN Bank at the start of the month. A total of 52 potential drivers are included in the research. Examples of these drivers are the mortgage age, the annual income of the client and the unemployment rate in the Netherlands.
We have used the popular machine learning technique “Random Forest” in this study, which is based on classification trees. The goal of training a classification tree is to split the training observations according to an input variable at each step in order to distinguish the “positive” from the “negative” observations. In this application, the “positive” observations are observations in which a prepayment is observed.
Figure 1 shows a basic representation of a classification tree. At each node, the training observations are split according to one of the variables.
An example of a split in the classification tree is whether the client is younger or older than 30 years, as indicated in Figure 1. All training observations with a client age below 30 years move to the left in the tree and all observations with a client age of 30 years or higher move to the right. Afterwards, the algorithm performs additional splits on the resulting subsets of training observations. In the end, the set of training observations is divided over several nodes at the bottom of the classification tree. The prevailing class (prepayment or no prepayment) is assigned as a label to each of these final nodes. This process is visualised for the leftmost node in Figure 1. Suppose a total of 30 training observations end up in this node, of which 27 are prepayments and the remaining 3 are not. The label assigned to this final node by majority vote is therefore “Prepayment”. For a new observation, the splits in the classification tree are followed in order to assign the observation to one of the final nodes. The label of that final node is the class prediction for the new observation.
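The labelling and prediction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the tree is reduced to the single age split from the text, and the label of the right-hand node is an assumption made only to complete the example.

```python
from collections import Counter

def leaf_label(labels):
    """Return the prevailing class among the training labels in a final node."""
    return Counter(labels).most_common(1)[0][0]

# The leftmost node from the text: 27 prepayments and 3 non-prepayments.
leftmost_node = ["Prepayment"] * 27 + ["No prepayment"] * 3
left_label = leaf_label(leftmost_node)

# Assumed label for the other node (not given in the article).
right_label = "No prepayment"

def predict(client_age, threshold=30):
    """Follow a single split: ages below the threshold go left, all others go right."""
    return left_label if client_age < threshold else right_label

print(left_label)
print(predict(25), "/", predict(45))
```

A full classification tree repeats the same logic at every node, so a new observation passes through several such comparisons before it reaches a final node.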
In order to reduce the variance of the model predictions, the Random Forest technique uses an ensemble of classification trees. In practice, this means that several classification trees are trained on slightly different data and each tree casts a vote for the overall prediction. This is done because using a single classification tree results in unstable predictions.
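The voting idea can be illustrated with a heavily simplified sketch. Everything here is an assumption for illustration: the data are made up, each "tree" is a single age split, and the trees differ only because each is trained on a random resample of the data. A real Random Forest grows deeper trees and also considers a random subset of the input variables at each split.

```python
import random

# Made-up toy data: (client_age, prepaid) with prepaid = 1 for a prepayment.
DATA = [(22, 1), (25, 1), (28, 1), (31, 1), (35, 0),
        (40, 0), (45, 1), (52, 0), (60, 0), (67, 0)]

def train_stump(sample):
    """Pick the age threshold that misclassifies the fewest observations.

    The stump predicts a prepayment when the age is below the threshold.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({age for age, _ in sample}):
        errors = sum((age < threshold) != bool(label) for age, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_forest(data, n_trees, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict_proba(forest, client_age):
    """Share of stumps that vote 'prepayment' for this age."""
    votes = sum(client_age < threshold for threshold in forest)
    return votes / len(forest)

forest = train_forest(DATA, n_trees=25)
for age in (20, 33, 70):
    print(age, predict_proba(forest, age))
```

Because the individual thresholds vary from one resample to the next, the share of votes changes gradually with age instead of jumping at a single cut-off, which is the stabilising effect of the ensemble.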
Putting the machine learning model into practice
In this study, the most important risk drivers for prepayments due to relocation and their effects are identified in three steps. Firstly, a machine learning model is developed which accurately predicts the Conditional Prepayment Rate (CPR). Secondly, the most important risk drivers are selected based on the importance of each of the potential risk drivers in the machine learning model. Finally, insights into prepayment behaviour are analysed through the effects of the selected risk drivers on prepayments. These effects are uncovered using “Partial Dependence Plots” and “Shapley Additive Explanations” (SHAP), which will be discussed later.
Step 1 – Machine learning model development
A Random Forest model was trained on past observations in the portfolio. This model is then used to predict the probability of prepayment in the upcoming month for each loan part in the mortgage portfolio. These probabilities are transformed into a prepayment volume prediction. The prepayment volume prediction for each individual loan part is computed by multiplying the probability of prepayment by the outstanding principal. Afterwards, these predicted volumes per loan part are summed in order to obtain a volume prediction for the entire portfolio. In a final step, the volume prediction is transformed into a CPR prediction.
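The aggregation from loan-part probabilities to a portfolio CPR can be sketched as follows. The loan parts are made-up numbers, and the final step is an assumption: the article does not state how the volume is transformed into a CPR, so the sketch uses the common convention of dividing the expected volume by the outstanding principal to obtain a monthly rate and annualising it as 1 - (1 - monthly rate)^12.

```python
def portfolio_cpr(loan_parts):
    """Turn per-loan-part prepayment probabilities into a portfolio CPR.

    loan_parts: list of (prepayment_probability, outstanding_principal).
    Returns (expected_volume, monthly_rate, cpr).
    """
    # Expected prepayment volume per loan part, summed over the portfolio.
    expected_volume = sum(prob * principal for prob, principal in loan_parts)
    total_principal = sum(principal for _, principal in loan_parts)
    # Assumed convention: monthly prepayment rate, then annualised to a CPR.
    monthly_rate = expected_volume / total_principal
    cpr = 1 - (1 - monthly_rate) ** 12
    return expected_volume, monthly_rate, cpr

# Hypothetical loan parts: (probability of prepayment next month, principal in euros).
loan_parts = [(0.010, 200_000), (0.002, 150_000), (0.005, 100_000)]
volume, monthly_rate, cpr = portfolio_cpr(loan_parts)
print(f"volume={volume:.0f} monthly_rate={monthly_rate:.4%} cpr={cpr:.4%}")
```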
In order to assess the quality of the machine learning prediction model, the results are compared with those of traditional benchmark models. The benchmark models are the current prepayment model used by NN Bank and a time series model that relies solely on past CPR observations (the ARIMA model). The prediction accuracy of the models is assessed in terms of the mean absolute error of the one-month-ahead CPR prediction.
The backtest of the models shows that the accuracy of the machine learning model and the benchmark models is comparable. However, there are two big advantages to the machine learning model. Firstly, the machine learning model allows for a better understanding of prepayment behaviour in the portfolio. By using classification trees, the risk drivers that add the most value in predicting prepayment behaviour can be identified. Secondly, the machine learning model is a dynamic model which offers robustness to shocks in the mortgage portfolio. Any shock in the mortgage portfolio is immediately incorporated in the model, as the portfolio data serves as input to the model. The time series benchmark, by contrast, only incorporates changes in the observed CPR, which lags behind shocks in the portfolio.
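For reference, the mean absolute error of a series of one-month-ahead predictions is the average of the absolute differences between the observed and predicted CPR. A minimal sketch, using made-up CPR values purely for illustration:

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical monthly CPR observations and one-month-ahead predictions.
observed_cpr = [0.05, 0.06, 0.04]
predicted_cpr = [0.04, 0.06, 0.06]
mae = mean_absolute_error(observed_cpr, predicted_cpr)
print(f"{mae:.4f}")
```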
Step 2 – Risk driver selection
After obtaining a machine learning prediction model with satisfactory accuracy, the risk drivers for prepayments can be selected. The input variables of the model are ranked according to their “mean decrease Gini” scores. These scores are related to the importance of the input variables in the training of the model. The scores represent the contribution of each input variable to the ability to distinguish prepayment observations from “regular” observations in the data.
After a score has been assigned to each input variable, the variables are sorted accordingly. Figure 2 presents the input variables sorted by variable importance score. Based on this sorted list, the five most important risk drivers have been selected in this study. The identified risk drivers are client age, annual income, mortgage age, loan principal and loan-to-income ratio at origination.
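The quantity behind the “mean decrease Gini” score can be illustrated for a single split. The sketch below computes how much a split reduces the Gini impurity of a node; in a Random Forest, the score of a variable aggregates such decreases over all splits that use the variable and over all trees. The data and the second variable (`noise`) are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_decrease(values, labels, threshold):
    """Impurity of the node minus the size-weighted impurity of its two children."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children

# Made-up toy data: 1 = prepayment observed, 0 = no prepayment.
labels = [1, 1, 1, 0, 0, 0]
client_age = [22, 25, 28, 45, 52, 60]
noise = [3, 1, 2, 3, 1, 2]  # hypothetical variable unrelated to the label

scores = {
    "client_age": gini_decrease(client_age, labels, threshold=30),
    "noise": gini_decrease(noise, labels, threshold=2),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy example the age split separates the two classes perfectly and removes all impurity, while the split on the unrelated variable removes none, so client age ranks first.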
Step 3 – Effects of the selected risk drivers
The effects of the selected risk drivers on prepayment behaviour are uncovered by means of two methods: partial dependence plots and SHAP.
In partial dependence plots, the possible values of an input variable are plotted against the corresponding average prepayment probability prediction. For example, for the input variable “client age”, the plot represents the average prepayment probability predicted by the model if all clients had a particular age. The plot for client age is presented in Figure 3. The curve in Figure 3 indicates the effect of the age of the client on prepayment behaviour caused by a relocation, as found by the prediction model. Clearly, the average probability prediction for a client aged 30 is higher than that for a client aged 60, for example. Younger people are more likely to move house due to a change in the characteristics of the household.
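The computation behind a partial dependence curve can be sketched as follows. The function `toy_model` is a hypothetical stand-in for the trained model and its coefficients are invented; only the averaging procedure in `partial_dependence` reflects the method described above.

```python
import math

def toy_model(client_age, annual_income):
    """Hypothetical stand-in for the trained model: returns a prepayment probability."""
    score = -4.0 - 0.04 * (client_age - 30) + 0.000002 * annual_income
    return 1 / (1 + math.exp(-score))

def partial_dependence(model, rows, feature, grid):
    """For each grid value, set `feature` to that value for every row and average the predictions."""
    curve = []
    for value in grid:
        predictions = [model(**{**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve

# Made-up portfolio rows.
rows = [
    {"client_age": 27, "annual_income": 35_000},
    {"client_age": 41, "annual_income": 60_000},
    {"client_age": 58, "annual_income": 90_000},
]
age_grid = list(range(20, 101, 10))
curve = partial_dependence(toy_model, rows, "client_age", age_grid)
for age, avg in zip(age_grid, curve):
    print(age, round(avg, 5))
```

Each point on the curve answers the question "what would the average prediction be if every client had this age, with all other characteristics left as observed?".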
The effects of risk drivers can also be analysed using SHAP. This concept is closely related to game theory in which the Shapley values indicate the “fair” distribution of the outcome of a game. For this application, the prediction of prepayment probabilities is interpreted as a “game” in which the final predictions are the “outcome” of the game. The input variables used by the model are the “players” over which the outcome of the game must be fairly distributed. Therefore, the Shapley values for the input variables represent the contribution of the input variable to the probability prediction.
Figure 4 presents the global SHAP analysis for the age of the client and shows a pattern similar to that observed in the partial dependence plot. The interpretation is as follows: for a mortgagor aged 20, a Shapley value of 0.1 indicates that this age contributes a deviation of approximately 0.1 from the average probability in the data. The Shapley values are centred on zero, but indicate a positive effect at age 30 and a negative effect at age 60, as expected from the partial dependence plot discussed earlier.
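The “fair distribution” idea can be made concrete with an exact Shapley computation on a toy example. This is a sketch under simplifying assumptions: `toy_model` is hypothetical, and variables that are "absent" from a coalition are set to a single baseline value, whereas SHAP implementations typically average over the data instead. The brute-force enumeration of orderings is only feasible for a handful of variables.

```python
import math
from itertools import permutations

def toy_model(client_age, annual_income, mortgage_age):
    """Hypothetical stand-in for the trained model: returns a prepayment probability."""
    score = (-4.0 - 0.04 * (client_age - 30)
             + 0.000002 * annual_income + 0.005 * mortgage_age)
    return 1 / (1 + math.exp(-score))

def shapley_values(model, instance, baseline):
    """Average each variable's marginal contribution over all orderings of the variables."""
    features = list(instance)
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start with every variable at its baseline
        previous = model(**current)
        for feature in order:
            current[feature] = instance[feature]  # the variable "joins the game"
            value = model(**current)
            totals[feature] += value - previous
            previous = value
    return {feature: total / len(orderings) for feature, total in totals.items()}

# Made-up client and baseline ("average") client.
instance = {"client_age": 25, "annual_income": 40_000, "mortgage_age": 24}
baseline = {"client_age": 45, "annual_income": 55_000, "mortgage_age": 60}
values = shapley_values(toy_model, instance, baseline)
print(values)
```

By construction, the values add up to the difference between the prediction for this client and the prediction for the baseline, which is the sense in which the outcome is distributed over the input variables.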
On top of single-variable analyses, interactions of variables can be analysed through partial dependence plots. Instead of plotting the range of one variable against the average probability prediction, a combination of variables can be analysed. The benefit of such an analysis is an insight into the interaction of the risk drivers. For instance, the effect of the age of the client may interact with the annual income of the household. In practice, this means that the effect of the age of the client differs for different values of annual income.
Figure 5 presents the level plot in which the client age is shown on the horizontal axis and the annual income on the vertical axis. The client age ranges from 20 to 100 years and the annual income of the household is assumed to range from 10,000 to 250,000 euros. The possible combinations of age and income constitute a grid, and the average probability prediction can be computed for each combination in this grid. These average predictions are visualised through the colour of the corresponding area in the level plot.
Figure 5 displays the interaction between the age of the client and the annual income of the household. Horizontal cross-sections of the level plot indicate the effect of client age conditional on a value for annual income. In these cross-sections, the effect of the client age is similar to the overall effect presented in Figure 3. The additional information these cross-sections provide is that the range of the curves diminishes as the annual income is fixed at higher levels. In other words, the strength of the client age effect is negatively related to the annual income of the household: the effect of the client age decreases as the annual income of the household increases.
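A two-variable partial dependence grid follows the same averaging logic as the single-variable curve. In the sketch below, `toy_model` is a hypothetical stand-in in which the age effect has deliberately been made weaker at higher incomes, so that the pattern described for Figure 5 can be reproduced; it says nothing about the actual model or data of the study.

```python
import math

def toy_model(client_age, annual_income, mortgage_age):
    """Hypothetical model: the age effect is built to weaken as income rises."""
    age_effect = -0.04 * (client_age - 30) * (1 - annual_income / 300_000)
    score = -4.0 + age_effect + 0.005 * mortgage_age
    return 1 / (1 + math.exp(-score))

def partial_dependence_2d(model, rows, feat_x, grid_x, feat_y, grid_y):
    """Average prediction for every (x, y) combination; one list per y value."""
    surface = []
    for y in grid_y:
        line = []
        for x in grid_x:
            predictions = [model(**{**row, feat_x: x, feat_y: y}) for row in rows]
            line.append(sum(predictions) / len(predictions))
        surface.append(line)
    return surface

# Made-up portfolio rows; the grid ranges follow the text.
rows = [
    {"client_age": 27, "annual_income": 35_000, "mortgage_age": 12},
    {"client_age": 41, "annual_income": 60_000, "mortgage_age": 60},
    {"client_age": 58, "annual_income": 90_000, "mortgage_age": 120},
]
age_grid = list(range(20, 101, 10))
income_grid = [10_000, 70_000, 130_000, 190_000, 250_000]
surface = partial_dependence_2d(toy_model, rows, "client_age", age_grid,
                                "annual_income", income_grid)

# Range of each horizontal cross-section: the age effect at a fixed income.
ranges = [max(line) - min(line) for line in surface]
for income, spread in zip(income_grid, ranges):
    print(income, round(spread, 5))
```

Each inner list corresponds to one horizontal cross-section of the level plot, and the shrinking range across income levels is the numerical counterpart of the fading colour contrast in the plot.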
These types of multi-variable analyses can be used to gain a better understanding of the prepayment behaviour of clusters of clients. In practice, instead of incorporating the overall effect of client age, the effect can be tailored to the characteristics of several clusters of clients. The mortgage pool can for example be partitioned into clusters corresponding to the annual income of the household. Afterwards, the effect of client age corresponding to the income of clients in the cluster can be incorporated.
How the results can be used to improve modelling
These insights into the risk drivers for prepayment behaviour can be used to improve modelling. Using the identified risk drivers and their effects on prepayment behaviour, the mortgage pool can be partitioned into clusters of clients with similar characteristics. Using this partitioning, more accurate forecasts of the CPR corresponding to the individual clusters can be made. More traditional models can also be improved using the results of this research. In the financial industry, a Logit model is often used to predict probabilities in binary classification problems such as this prepayment application.
The results of this study can be used to incorporate the most important risk drivers in such a Logit model. Furthermore, the uncovered effects and interactions of the risk drivers on prepayment behaviour can be used for feature engineering in order to further improve the accuracy of the CPR prediction model.
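As an illustration of such feature engineering, the sketch below adds an age-income interaction term to the inputs of a Logit model. The coefficients are hypothetical and not estimated from any data; in practice they would be fitted on the portfolio. The scaling of the variables (decades, units of 100,000 euros) is likewise an arbitrary choice for readability.

```python
import math

def engineered_features(client_age, annual_income):
    """Intercept, scaled age, scaled income and their interaction term."""
    age_decades = (client_age - 30) / 10
    income_100k = annual_income / 100_000
    return [1.0, age_decades, income_100k, age_decades * income_100k]

# Hypothetical coefficients: intercept, age, income, age x income.
COEFFICIENTS = [-4.0, -0.40, 0.20, 0.12]

def logit_score(features, coefficients=COEFFICIENTS):
    """Linear predictor (log-odds) of the Logit model."""
    return sum(beta * x for beta, x in zip(coefficients, features))

def logit_probability(features, coefficients=COEFFICIENTS):
    """Logit model: logistic transform of the linear predictor."""
    return 1 / (1 + math.exp(-logit_score(features, coefficients)))

for income in (30_000, 250_000):
    young = logit_probability(engineered_features(30, income))
    old = logit_probability(engineered_features(60, income))
    print(income, round(young, 5), round(old, 5))
```

With a positive interaction coefficient, the negative age effect on the log-odds becomes smaller at higher incomes, which is one way a Logit model could mimic the interaction pattern found in the level plot.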
This article is based on the Master’s thesis of Dion Pijpelink for his study MSc Quantitative Finance and Actuarial Science at Tilburg University. Deloitte would like to thank Monique Gerritsma and Dimitri Hendriks as representatives of Nationale Nederlanden (NN) Bank for facilitating this thesis research collaboration.