Understanding a Random Forest model through Feature Importance


This post discusses, at a high level, how a Random Forest operates and explains how the feature importance plot can improve transparency for this black-box algorithm.

A Random Forest is a tree-based learning algorithm that uses a collection of decision trees to perform classification or regression tasks, for example classifying customers as ‘high risk’ or ‘low risk’. The algorithm is frequently chosen to solve complex prediction tasks. However, the way the model transforms inputs into outputs is, to most non-experts, a black box. The model can be a great predictor, but few people are enthusiastic about black boxes making decisions that affect their lives or their businesses.

 

The mechanics of a random forest

A decision tree consists of multiple nodes. At each node the data is tested against a condition on one feature, and the outcome of this test splits the data into two groups. Both groups continue to the next node, where they are tested and split again, until they reach a terminal node that provides the output. A Random Forest model combines many such decision trees.
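To make the idea of nodes and splits concrete, here is a minimal sketch of a single decision tree written as nested conditions. The feature names (`missed_payments`, `income`), the thresholds and the tree itself are hypothetical and chosen purely for illustration; a real tree would be learned from data.

```python
# Hypothetical, hand-written tree: each internal node tests one feature
# against a threshold; terminal nodes carry the predicted label.
tree = {
    "feature": "missed_payments", "threshold": 2,
    "left": {"label": "low risk"},             # missed_payments <= 2
    "right": {                                 # missed_payments > 2
        "feature": "income", "threshold": 30000,
        "left": {"label": "high risk"},        # income <= 30000
        "right": {"label": "low risk"},        # income > 30000
    },
}


def predict(node, row):
    """Walk from the root to a terminal node and return its label."""
    while "label" not in node:
        goes_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


customer = {"missed_payments": 4, "income": 25000}
print(predict(tree, customer))  # prints: high risk
```

Reading the conditions along the path from root to terminal node is exactly what makes a single tree easy to interpret.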

The advantage of pooling all these decision trees is that it makes the model considerably more robust. However, if every decision tree in the forest were trained on the same data, and the same features were selected at the same nodes, the forest would consist of the same decision tree over and over again. Therefore, each decision tree is trained on a random subset of the data and of the features.

This results in a forest of different decision trees. The principle of the Random Forest is to feed new data to all of the decision trees, which produces a large number of individual predictions. The algorithm then combines these into a single predicted outcome: by majority vote for classification, or by averaging the tree predictions for regression.
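The two ingredients described above, random training subsets and the pooling of tree predictions, can be sketched in a few lines. This is only an illustration using the Python standard library; the function names are hypothetical, and drawing rows with replacement (bootstrapping) is one common way to create the random subsets.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, giving each tree its own data."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_predictions)


rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)  # some rows repeat, some are left out

print(majority_vote(["high risk", "low risk", "high risk"]))  # prints: high risk
print(average([1.0, 2.0, 3.0]))                               # prints: 2.0
```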

An important aspect of decision tree algorithms is how to choose the right features for splitting. At each node a random subset of features is available to make the split; the number of random features available per split can be set as a parameter of the model. Various methods exist for evaluating which feature gives the best possible split, for example choosing the feature and split threshold in such a way that the variance in the child nodes is minimized.

A tree is constructed iteratively by splitting the data until the terminal nodes contain data points that belong to the same class or that are very similar in terms of the response.
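As a rough sketch of this split search, the function below tries every threshold on a random subset of features and keeps the split with the lowest weighted variance in the two child nodes. The function and parameter names (`best_split`, `n_candidate_features`) and the toy data are assumptions made for this example, not part of any particular library.

```python
import random
from statistics import pvariance


def weighted_child_variance(left, right):
    """Variance of the targets in both child nodes, weighted by node size."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def best_split(rows, targets, n_candidate_features, rng):
    """Return (score, feature_index, threshold) of the best split found."""
    candidates = rng.sample(range(len(rows[0])), n_candidate_features)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send data to both children
            score = weighted_child_variance(left, right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


# Toy data: feature 0 separates the targets perfectly, feature 1 does not.
rows = [[1, 7], [2, 3], [3, 9], [4, 1]]
targets = [10.0, 10.0, 20.0, 20.0]
score, feature, threshold = best_split(rows, targets, 2, random.Random(0))
print(feature, threshold, score)  # prints: 0 2 0.0
```

In this toy call both features are made available; in a forest, `n_candidate_features` would typically be smaller than the total number of features so that different trees end up with different splits.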

 
 

Feature importance in a Random Forest

Understanding an entire Random Forest model is challenging, but understanding how decisions are made in a single decision tree is easy. We can simply look at the decision criteria at every split to get a sense of which features are important drivers of the tree's prediction. When many decision trees are combined into a Random Forest model, it becomes hard or even impossible to inspect all trees and gauge which features in the forest are the most important for making the predictions.

The feature importance method for Random Forests shows which features contribute most to the decision making in the model and helps the user to better understand the drivers behind the model.

In general, the importance of a feature is defined as its relative contribution to the decision making of the algorithm. In a feature importance plot, the importance is shown on the horizontal axis, with the features ranked by importance along the vertical axis.

 

Mean decrease Gini

Feature importance can be determined with different methods, most commonly the Gini Importance (Mean Decrease Gini) method and the Mean Decrease Accuracy method. Both are described here for classification; for regression, the analogous measures are based on the decrease in node variance and the increase in prediction error. The Gini method is based on the Gini index, which in decision trees measures how mixed the classes in a node are: a low Gini value means that the observations in the node mostly belong to the same class. Now, if we think about the trees in a Random Forest, every node contains a certain distribution of the response variable. After a split, the child nodes should have a lower Gini value than the parent node, because the goal of a split is to make the class distributions in the child nodes as pure as possible; that is, all observations in a node should be as similar as possible. The variable that was used to make the split has therefore decreased the Gini. If we then evaluate the mean decrease in Gini that each feature has realized across all trees of the forest, we obtain a measure of how much each feature has contributed to the decisions of the model.
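The calculation behind a single Gini decrease can be illustrated with a small sketch. It assumes the usual definition of Gini impurity for a node, one minus the sum of squared class proportions, and weights the child nodes by their share of the observations; the function names and labels are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed classes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children


parent = ["high", "high", "low", "low"]
print(gini(parent))                                            # prints: 0.5
print(gini_decrease(parent, ["high", "high"], ["low", "low"]))  # prints: 0.5
print(gini_decrease(parent, ["high", "low"], ["high", "low"]))  # prints: 0.0
```

A split that separates the classes perfectly removes all impurity, while a split that leaves both children as mixed as the parent achieves no decrease. Averaging such decreases per feature over all splits and trees gives the Mean Decrease Gini.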

 

Mean decrease accuracy

The Mean Decrease Accuracy method uses observations that have not been used to train the model. These observations, whose outcomes are known, are scored by the model and the accuracy is recorded. Next, the values of one feature are randomly permuted across these observations. The model then scores the observations again (now with noise in that one feature) and we calculate how much the accuracy of the model has decreased. When this has been done for all features, we can rank the features by importance.
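The permutation procedure can be sketched as follows. To keep the example self-contained, a simple hand-written rule (`toy_model`) stands in for a trained Random Forest, and the held-out data is made up; both are assumptions for illustration. The shuffle is repeated several times and the accuracy drops are averaged to reduce the effect of chance.

```python
import random


def accuracy(model, rows, labels):
    """Share of observations for which the model predicts the known label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def mean_decrease_accuracy(model, rows, labels, feature, rng, n_repeats=20):
    """Average drop in accuracy after shuffling the values of one feature."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)  # break the link between this feature and the outcome
        permuted = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Stand-in for a trained model: it only ever looks at feature 0."""
    return "high" if row[0] > 2 else "low"


# Made-up held-out observations: [feature 0, feature 1] with known labels.
rows = [[0, 5], [1, 9], [2, 4], [3, 8], [4, 6], [5, 7]]
labels = ["low", "low", "low", "high", "high", "high"]

rng = random.Random(0)
importance_0 = mean_decrease_accuracy(toy_model, rows, labels, 0, rng)
importance_1 = mean_decrease_accuracy(toy_model, rows, labels, 1, rng)
print(importance_0 > importance_1)  # prints: True
```

Because the stand-in model ignores feature 1, shuffling it never changes the predictions and its importance is exactly zero, whereas shuffling feature 0 lowers the accuracy.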

 

Utilizing feature importance

Now that we have the features ranked, we gain more insight into what takes place inside the black box. For internal model development purposes, the feature importance measures are extremely valuable. They show which features contributed, and to what extent, to mapping the input data to an output. With this information people are better able to understand what is important when the Random Forest makes a prediction. A limitation of this method is that the feature importance plot does not provide insight into the interactions between features, nor does it show whether a feature is positively or negatively related to the output.

Overall, a clear view of which features play an important role in the decisions of a Random Forest algorithm makes the model easier to interpret. Understanding the model’s decision-making process improves transparency in its development, validation and implementation. This in turn enables owners and users of the Random Forest model to be more comfortable with its predictions, because they have a sense of the drivers behind the model, which builds trust.

Transparency is one of the main pillars, along with fairness, robustness and others, necessary to build trust in AI models. Understanding the model’s decision making enables model developers and model validators to explain and monitor model decisions more effectively, which earns the trust of the different model stakeholders and ultimately enables better and more efficient servicing of clients.

*) For more information about understanding how black box models work, please do not hesitate to contact Koen Dessens, Roald Waaijer or Bojidar Ignatov via the contact details below or have a look at GlassBox.

 
 