Introduction
Applying labels manually to data can be expensive and time consuming
One of the main problems that data scientists face today is unlabeled data. The algorithms they use need an initial set of labeled data to learn from. There are different routes to get there, but finding the most efficient and accurate one needed further exploration. Deloitte’s Innovation and Platforms Machine Learning research team took on the challenge.
If data is not already labeled, subject matter experts (SMEs) must spend many hours labeling enough data to test the model. This technique is referred to as passive learning, but it’s not perfect. SMEs will often provide more labels in specific data subsets than are needed, while neglecting other subsets, which can heavily affect how the model performs.
Active learning is an alternative to this time-consuming and often biased approach to dealing with unlabeled data. It takes a smaller amount of labeled data, runs it through a semi-supervised model iteratively, and uses these iterations to select the most useful new rows of data to label. The technique offers a faster way to create labeled data while also providing more beneficial data for the model. Why? The approach results in a representative sample of training data with typically very little bias because the machine, rather than a human, chooses what’s most relevant.
Business problem: Insufficient labeled data to adequately train a model
- The problem that prompted research into active learning was a lack of labeled training data reflecting a business's spending.
- The goal of the research was to find a cost-effective and quick method to label data from this unlabeled dataset and then create a model that performed comparably to past models.
- The data was high-level accounting information that specifically covered accounts payable data as well as unstructured invoice text data.
Passive learning
Passive learning is the typical approach to training a machine learning model. It applies a multi-step process to get training data and requires an intense amount of work from SMEs. To identify subsets of data, clusters of transactions are created using the BIRCH clustering method. This approach builds clusters incrementally, adding each transaction to a nearby existing cluster and measuring how distinct each cluster is.
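The incremental grouping described above can be illustrated with a minimal sketch. This is not BIRCH itself, which summarizes clusters in a tree of clustering features, and it is not the research team's code. It is a simplified stand-in that assumes each transaction is a numeric feature vector and that a single distance threshold decides whether a point joins an existing cluster. The function name, the threshold value, and the sample data are all hypothetical.

```python
import math

def incremental_cluster(points, threshold):
    """Simplified, BIRCH-inspired pass over the data: each point joins the
    nearest existing cluster if its centroid is within `threshold`,
    otherwise it starts a new cluster. Returns (labels, centroids)."""
    centroids = []  # running mean of each cluster
    counts = []     # number of points absorbed by each cluster
    labels = []
    for p in points:
        best, best_dist = None, float("inf")
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            # Fold the point into the running mean of the nearest cluster.
            n = counts[best]
            centroids[best] = tuple(
                (ci * n + pi) / (n + 1) for ci, pi in zip(centroids[best], p)
            )
            counts[best] = n + 1
            labels.append(best)
        else:
            # Too far from every existing cluster: open a new one.
            centroids.append(tuple(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids

# Hypothetical transactions described by two scaled numeric features.
transactions = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
labels, centroids = incremental_cluster(transactions, threshold=1.0)
print(labels)
```

In practice a library implementation such as scikit-learn's `sklearn.cluster.Birch` would likely be used instead of hand-written code like this.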
For example, similar transactions can have different classifications depending on the jurisdiction in which they originated. A transaction for computers used for administrative purposes could be tax exempt in one jurisdiction but fully taxed in another.
Active learning
Active learning is a special case of semi-supervised machine learning in which a learning algorithm can interactively query the user (or some other information source) to obtain the desired labels of new data points. In statistics, it is sometimes called optimal experimental design. This interaction between the model and the user or data source often occurs iteratively, with each iteration selecting new data points. The process is not linear; it repeats several times.
In active learning, the model attempts to select the unlabeled data that will be the most informative. The goal is to minimize the number of iterations and the total number of labeled data points while maximizing the accuracy of the predictive model.
In comparison to passive machine learning, where humans typically use existing labeled data to train the model, active learning reduces human involvement in the training data selection process to a semi-supervisory role.
Humans train the model with a small set of initial labeled data, then the model selects additional unlabeled data points that could provide the most information for the next training iteration. Over many iterations, the model will have selected the best set of data from which to train itself.
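The train, select, label, retrain cycle can be sketched as a small pool-based loop. Everything here is an assumption made for illustration: the one-feature nearest-centroid "model", the `oracle` function standing in for an SME, and the sample values are hypothetical and are not taken from the research.

```python
import math

def fit_centroids(labeled):
    """Toy model: the mean feature value of each class, from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    weights = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def least_confident(centroids, pool):
    """Index of the pool item whose top predicted probability is lowest."""
    return min(
        range(len(pool)),
        key=lambda i: max(predict_proba(centroids, pool[i]).values()),
    )

def active_learning(seed, pool, oracle, rounds):
    """Retrain, pick the most informative row, ask for its label, repeat."""
    labeled, pool = list(seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = fit_centroids(labeled)
        x = pool.pop(least_confident(model, pool))
        labeled.append((x, oracle(x)))  # the human supplies the label
    return fit_centroids(labeled), labeled

def oracle(x):
    """Hypothetical SME: values of 5 or more are taxable."""
    return "taxable" if x >= 5 else "exempt"

seed = [(0.0, "exempt"), (10.0, "taxable")]
pool = [1.0, 4.0, 9.0, 5.5, 2.0]
model, labeled = active_learning(seed, pool, oracle, rounds=2)
queried = [x for x, _ in labeled[len(seed):]]
print(queried)
```

In this toy run the loop asks for labels on the two values closest to the boundary between the classes and leaves the easy cases unlabeled, which is the behavior the paragraph above describes.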
Methodology
In active learning, several methods are commonly used to select data points for labeling. Deloitte evaluated three of them.
Uncertainty sampling: Perhaps the simplest and most commonly used framework is uncertainty sampling. In this type of sampling, the model selects the data point for which it has the least confidence in the label it has predicted. This approach is often straightforward for probability-based learning models.
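A minimal sketch of this selection rule, assuming the model already outputs one list of class probabilities per unlabeled row. The function name and the probability values are hypothetical.

```python
def least_confidence_index(probabilities):
    """Return the index of the row whose most likely label has the
    lowest predicted probability."""
    confidences = [max(row) for row in probabilities]
    return min(range(len(confidences)), key=confidences.__getitem__)

# Hypothetical predicted probabilities for three unlabeled transactions.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
]
chosen = least_confidence_index(probs)
print(chosen)
```

Here the second row is selected because its best guess has a probability of only 0.40.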
Margin sampling: In margin sampling, the model selects the instance that has the smallest difference between the first and second most probable labels. It relies on a basic machine learning model that attempts to separate groups of data to identify which transactions to sample, though it doesn’t account for the distribution of the data. This can lead to oversampling a dense subset of the population.
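The margin rule can be sketched as follows, again assuming per-row class probabilities are available from the model. The function name and the values are hypothetical.

```python
def smallest_margin_index(probabilities):
    """Return the index of the row with the smallest gap between its
    two most probable labels."""
    def margin(row):
        top, second = sorted(row, reverse=True)[:2]
        return top - second
    margins = [margin(row) for row in probabilities]
    return min(range(len(margins)), key=margins.__getitem__)

# Hypothetical predicted probabilities for three unlabeled transactions.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
]
chosen = smallest_margin_index(probs)
print(chosen)
```

In this input the third row is selected: its top two labels differ by about 0.01, even though the second row has the lower top probability (0.40).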
Entropy sampling: Entropy is a measure taken from information theory that quantifies the uncertainty, or disagreement, among the likely labels for each data point. The entropy formula is applied to each instance and the data with the largest entropy is chosen as part of the next sample for labeling.
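As a sketch, the standard Shannon entropy of a row's predicted probabilities, H = -sum(p * log(p)), can be computed for every unlabeled row and the largest value picked. The source does not state which logarithm base or batch size was used, so the natural logarithm, the single-row selection, the function names, and the values below are assumptions.

```python
import math

def entropy(row):
    """Shannon entropy (natural log) of one row of class probabilities.
    Zero probabilities contribute nothing and are skipped."""
    return -sum(p * math.log(p) for p in row if p > 0)

def max_entropy_index(probabilities):
    """Return the index of the row with the largest entropy."""
    entropies = [entropy(row) for row in probabilities]
    return max(range(len(entropies)), key=entropies.__getitem__)

# Hypothetical predicted probabilities for three unlabeled transactions.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
]
chosen = max_entropy_index(probs)
print(chosen)
```

Unlike the margin rule, entropy takes every class probability into account, so a row whose probability is spread across all labels scores higher than one split between only two.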
Conclusion
Active learning proved to be an effective method for selecting better training samples from an unlabeled set of data. Our research found that active learning models performed 6 to 12 percent better than passive learning models. The size of the improvement differed for each dataset, implying that the distribution of the data heavily affected the effectiveness of active learning.
While an active learning model isn't appropriate or beneficial in all cases, especially when an adequate set of labeled data already exists, it can quickly identify the transactions that would provide the most information to a model.
The active learning model selects data that is more representative of the whole population, leading to a model that has less bias and is better equipped to generalize information.
In cases where a good training dataset of labeled transactions already exists, active learning doesn't provide a meaningful benefit. The increased time to iteratively add more training data would add further cost to a project without providing a significant amount of new information to the model. However, in some cases the new method may quickly become the norm.