Improve data quality in time series
An advanced analytics approach to imputation in time series
Missing data is a widely occurring data quality problem. Various techniques, ranging from simple to very advanced, can estimate missing data points and thereby improve the completeness and utility of the data.
Imagine yourself in a state-of-the-art industrial facility. Every machine contains a multitude of sensors that track, monitor, and report on key statuses, movements and indicators of the equipment. Video cameras monitor production output continuously. This enables precision manufacturing, advanced process control and predictive maintenance.
Now suppose you are tasked with analysing the lead-up to a crucial event, and it becomes apparent that part of the data is missing for one of the essential signals. What do you do?
Data quality is essential to data-driven processes
The above example is inspired by an industrial environment, but similar issues occur in any environment in which data is generated. In this article we are especially interested in processes that generate data over time; in other words, we consider time series of observations.
As impactful business decisions are made based on data, it is crucial to make sure that data quality of these signals is sufficient for the purposes for which it is used. In order to deliver the expected results, remediation of data quality issues is usually required.
Data quality is commonly measured along six dimensions: completeness, accuracy, consistency, validity, uniqueness and integrity. It is a very broad topic, so in this article we restrict ourselves to the dimension of completeness and focus on missing data in particular.
Missing data: ignore, complete or correct?
Missing values are a very common issue among data quality problems. Generally speaking, there are three strategies to handle this problem when performing data analysis:
- Complete case analysis: remove observations or variables with missing values
- Enhancement: complete the dataset by leveraging supplemental sources
- Imputation: fill missing values with estimates based on available data
When a large proportion of the values for a given variable is missing, and it is unreasonable to assume the dataset contains enough information to base an estimate on, one should consider applying complete case analysis and removing the variable.
Unfortunately, removing observations or variables is not always an option. For example, in financial reporting one is not allowed to omit transactions or balances just because some data is missing. In this case, one should investigate whether the missing data can be found in other sources or, alternatively, whether it makes sense to treat the missingness as a feature of the observation.
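As a minimal sketch of this rule of thumb, the snippet below drops every variable whose share of missing values exceeds a threshold. The function names, the sample signals and the 50% threshold are illustrative assumptions, not a recommendation; the appropriate cut-off depends on the data and the purpose of the analysis.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value


def missing_fraction(values: Column) -> float:
    """Share of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(data: Dict[str, Column], max_missing: float = 0.5) -> Dict[str, Column]:
    """Keep only variables whose missing share does not exceed max_missing.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return {name: col for name, col in data.items()
            if missing_fraction(col) <= max_missing}


# Hypothetical sensor signals for illustration only.
signals = {
    "temperature": [20.1, None, 20.4, 20.6],  # 25% missing: kept
    "vibration": [None, None, None, 0.8],     # 75% missing: dropped
}
kept = drop_sparse_variables(signals)
print(list(kept))  # ['temperature']
```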
However, in situations where enhancement is not feasible or too costly and remediating the missing values is desirable, imputation is certainly a worthwhile strategy.
Imputation is built on the central assumption that the available data contains (enough) information about the missing value, so that an estimate can be inferred based on it.
For a more extensive exposition on when to use imputation and an overview of techniques we refer to this Deloitte article by our colleagues Albert Wibowo and Veronica Cheng.
In his Posterior Analytics, Aristotle (384-322 BC) wrote that “we may assume superiority, other things being equal, of the demonstration which derives from fewer postulates or hypotheses.” Leaving non-essential hypotheses or postulates in an argument runs the risk of claiming more than one can prove or, in words more familiar to data scientists, it risks incorporating bias.
Cutting away non-essential arguments from explanations has come to be known as applying Occam’s razor, after the English theologian William of Ockham (c. 1287-1347), who used parsimony as a guiding principle in his reasoning.
In data science, applying Occam’s razor translates to finding the simplest model that is fit for purpose, balancing accuracy against complexity. In imputation in particular, it means one should employ the least complicated method that adequately estimates the missing value. To illustrate this principle, we give examples of the range of methods available.
Basic methods for simple situations
The simplest methods for filling a missing value are to copy the last available value before it (forward-fill) or the first available value after it (back-fill). These methods can be valid for discrete signals, such as status indicators or on/off signals that only report updates when changes occur, provided you can be reasonably confident of the integrity of the signal.
Slightly more advanced, but still rather down-to-earth, is to impute the mean of the variable taken over the entire dataset, or to interpolate by imputing the mean of the previous and next values. These techniques could be applied to slow-moving variables.
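Forward-fill and back-fill can be sketched in a few lines of plain Python. The function names and the sample status signal are illustrative assumptions; data analysis libraries such as pandas offer equivalent built-in operations.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def forward_fill(values: List[Optional[T]]) -> List[Optional[T]]:
    """Replace each missing value (None) with the last observed value before it."""
    filled = []
    last = None  # nothing observed yet, so leading gaps stay None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def back_fill(values: List[Optional[T]]) -> List[Optional[T]]:
    """Replace each missing value with the first observed value after it."""
    return forward_fill(values[::-1])[::-1]


# Hypothetical on/off signal that only reports when the status changes.
status = [None, "on", None, None, "off", None]
print(forward_fill(status))  # [None, 'on', 'on', 'on', 'off', 'off']
print(back_fill(status))     # ['on', 'on', 'off', 'off', 'off', None]
```

Note that forward-fill cannot fill a gap at the start of the series and back-fill cannot fill a gap at the end, since there is no value to copy.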
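Both variants can be illustrated with a short sketch. The function names and sample readings are assumptions made for illustration. For gaps longer than one observation, this sketch uses the nearest observed value on each side, which is one possible reading of "previous and next value".

```python
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Fill every gap with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_neighbour_mean(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each gap with the mean of the nearest observed value on either side.

    Gaps at the start or end of the series have only one neighbour and are left as None.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev_obs = next((x for x in reversed(values[:i]) if x is not None), None)
        next_obs = next((x for x in values[i + 1:] if x is not None), None)
        if prev_obs is not None and next_obs is not None:
            filled[i] = (prev_obs + next_obs) / 2
    return filled


# Hypothetical slow-moving readings for illustration only.
readings = [10.0, None, 14.0, 16.0, None, None, 22.0]
print(impute_mean(readings))            # [10.0, 15.5, 14.0, 16.0, 15.5, 15.5, 22.0]
print(impute_neighbour_mean(readings))  # [10.0, 12.0, 14.0, 16.0, 19.0, 19.0, 22.0]
```

The output shows the difference: the overall mean ignores where in time the gap occurs, whereas the neighbour mean follows the local level of the signal.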
State space models for smarter extrapolation
A commonly used method for handling missing values in noisy time series is Kalman filtering. This technique treats the combined set of variables as describing the state of a process at regular time intervals. A state space model specifies how this state changes from one point in time to the next, i.e. how it moves through its state space, and the Kalman filter uses that model to estimate the state from the noisy observations.
Imputation is performed by first fitting a state space model to the available data and then estimating missing values in the time series by applying a Kalman filter to the state space representation of the time series.
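To make the idea concrete, here is a deliberately simplified sketch for a single signal. It assumes the simplest possible state space model (a "local level" model, in which the state is a slowly drifting level observed with noise) and it assumes the two noise variances are known rather than fitted to the data, so it skips the fitting step described above. The function name, parameter names and default values are illustrative assumptions.

```python
from typing import List, Optional


def kalman_impute(values: List[Optional[float]],
                  process_var: float = 0.01,
                  obs_var: float = 1.0) -> List[float]:
    """Impute gaps with a Kalman filter for a local level model.

    Assumed model (not fitted here):
        level_t = level_{t-1} + process noise   (variance process_var)
        value_t = level_t     + observation noise (variance obs_var)
    """
    # Rough initialisation: start from the first observed value.
    level = next(v for v in values if v is not None)
    var = obs_var
    filled = []
    for v in values:
        # Predict step: the level is expected to persist, uncertainty grows.
        var += process_var
        if v is None:
            # No observation: skip the update and impute the predicted level.
            filled.append(level)
        else:
            # Update step: move the level towards the observation.
            gain = var / (var + obs_var)
            level += gain * (v - level)
            var *= 1 - gain
            filled.append(v)
    return filled


# Hypothetical noisy readings for illustration only.
readings = [10.2, 9.8, None, None, 10.1, 10.4]
filled = kalman_impute(readings)
print(filled)
```

Because this sketch only filters forward in time, each imputed value depends solely on earlier observations. A smoothing pass, which also uses later observations, would typically give better estimates for gaps in the middle of a series.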
Neural networks can handle more complicated patterns
In situations where even state space models are too down-to-earth, several other options exist. Artificial neural networks have proved their usefulness in tackling a wide variety of problems, and imputation is no exception. Especially when the variables in the time series are related in an unknown or presumably very complicated way, neural networks may achieve better results than state space models or regression-based methods.
The approach is to train neural networks on the available data to predict values of the time series and then use this predictive capability to estimate missing values. On the most challenging data sets, neural networks have been reported to outperform state space methods.
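The following toy sketch shows the mechanics of this approach on a single signal, using only the Python standard library. A small network with one hidden layer is trained to predict a value from the three preceding values, and is then used to fill the gaps one by one. Everything here is an illustrative assumption: the network size, learning rate, number of lags and the synthetic sine-wave data. The sketch also assumes the first few values of the series are observed. In practice one would use a dedicated machine learning library and validate the imputed values against held-out data.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple


def train_mlp(samples: List[Tuple[Sequence[float], float]], n_hidden: int = 4,
              epochs: int = 200, lr: float = 0.05, seed: int = 0) -> Callable:
    """Train a one-hidden-layer network (tanh units) with plain SGD."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0

    def hidden(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(w1, b1)]

    for _ in range(epochs):
        for x, y in samples:
            h = hidden(x)
            err = sum(w * hj for w, hj in zip(w2, h)) + b2 - y
            for j in range(n_hidden):
                grad_h = err * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * grad_h
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
            b2 -= lr * err

    def predict(x):
        return sum(w * hj for w, hj in zip(w2, hidden(x))) + b2

    return predict


def nn_impute(values: List[Optional[float]], lags: int = 3) -> List[float]:
    """Fill gaps with predictions from the preceding `lags` values.

    Assumes the first `lags` values are observed.
    """
    # Train only on windows in which all inputs and the target are observed.
    samples = [(values[t - lags:t], values[t]) for t in range(lags, len(values))
               if all(v is not None for v in values[t - lags:t + 1])]
    predict = train_mlp(samples)
    filled = list(values)
    for t in range(lags, len(filled)):
        if filled[t] is None:
            # Earlier imputed values may serve as inputs for later gaps.
            filled[t] = predict(filled[t - lags:t])
    return filled


# Synthetic sine wave with three artificial gaps, for illustration only.
series = [math.sin(0.3 * t) for t in range(60)]
for t in (20, 21, 40):
    series[t] = None
filled = nn_impute(series)
print([round(filled[t], 3) for t in (20, 21, 40)])
```

For a signal this simple a neural network is overkill, which is exactly the point of the parsimony principle: the added complexity only pays off when simpler methods fail to capture the pattern.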
When you can reasonably estimate missing values from available data, imputation can be very useful to improve data quality and leverage all available data for further analysis. Depending on the type of missing data and the complexity of the underlying interdependence within the data set, a range of methods can be applied from simple to very advanced. One should strive to use the simplest method that adequately estimates the missing value. Sound knowledge of the data generating process helps to identify the best method to apply.