Mending the fabric of data with closed-loop neural networks
It is a common mantra in data science that the majority of time is spent on cleaning/munging/wrangling data rather than on modelling. Producing clean and structured datasets is essential to train useful machine learning models. Inevitably, real-world datasets tend to be noisy and incomplete, especially when the data come from physical-digital interfaces subject to noise, such as industrial sensors and IoT networks. Imputation is one step of the data cleaning pipeline that turns raw data into “machine-friendly” datasets.
The objective of imputation is to fill in missing values in features of the dataset: cells where a measurement was expected but not obtained, or where the delivered value is a suspected error or outlier.
When the data come in the form of a numerical time series, or are organized in some fixed structure with proximity and order (e.g. spatial sensor nets), methods of interpolation can be useful to fill in the gaps and correct for errors. However, for complex nonlinear systems the results of interpolation can be misleading. An example of such a system is highway traffic, where the interplay of vehicle flow/speed and road features creates complex dynamics with abrupt variations and phase transitions that cannot be easily interpolated.
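A toy numpy sketch (the values are invented) shows both the usefulness and the limitation: linear interpolation fills the gaps plausibly on smooth stretches, but it smooths straight through an abrupt speed drop, exactly the kind of phase transition seen in traffic data.

```python
import numpy as np

# Toy lane-speed series (km/h) with missing measurements marked as NaN.
# The jump from ~98 to ~40 mimics an abrupt traffic phase transition.
speeds = np.array([100.0, 98.0, np.nan, np.nan, 40.0, 38.0, np.nan, 95.0])

# Linear interpolation over the known points only.
idx = np.arange(len(speeds))
known = ~np.isnan(speeds)
filled = speeds.copy()
filled[~known] = np.interp(idx[~known], idx[known], speeds[known])

print(filled)
# The two gaps before the drop are filled with a gradual ramp
# (~78.7, ~59.3 km/h), even if the real transition was sudden.
```

The filled values are plausible numbers, but they assume a smooth transition that a congestion shock wave does not obey; this is the failure mode that motivates a learned, nonlinear imputer.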
To perform imputation on these datasets, deep learning algorithms that learn and generate nonlinear patterns can be very useful. One such method that we have developed at Numerico is the closed-loop predictive network, in which a deep neural network normally used for prediction is adapted for imputation. The adaptation consists of creating a closed loop in the network graph: predicted values are fed back into the network input at the positions where the original values are missing. Training with back-propagation improves the predictive accuracy of the model, and with it the accuracy of the estimated missing values, as can be shown by an analysis of the modified closed-loop objective function (we call this “second-order learning”). The process is shown schematically below.
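The feedback idea can be illustrated with a minimal numpy sketch, in which a single linear weight stands in for the deep network and the series, missingness rate, and learning rate are all invented for the example. Where an observation exists it is used as the next input and as a training target; where it is missing, the model's own prediction is fed back as input, closing the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy AR(1)-like series with true coefficient 0.9, ~20% missing (NaN).
T = 200
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.normal(scale=0.1)
obs = x.copy()
obs[rng.random(T) < 0.2] = np.nan

# Closed-loop linear predictor: x_hat[t] = w * u[t-1], where u[t-1] is
# the observation when present, else the previous prediction fed back
# into the input (the closed loop).
w = 0.0
lr = 0.05
for epoch in range(50):
    u_prev = obs[0] if not np.isnan(obs[0]) else 0.0
    for t in range(1, T):
        pred = w * u_prev
        if not np.isnan(obs[t]):
            # Gradient step on squared error, only where a target exists.
            w -= lr * (pred - obs[t]) * u_prev
            u_prev = obs[t]
        else:
            u_prev = pred  # feed the prediction back as input

print(w)  # approaches the true autoregressive coefficient
```

In the real method the scalar weight is replaced by a deep network and the gradient step by back-propagation through the unrolled loop, but the mechanism is the same: the fed-back predictions let the model keep predicting across gaps, and training on the observed cells tightens those fed-back estimates.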
The algorithm is trained for prediction accuracy, and imputation accuracy is tested by artificially masking some of the existing values. Since the actual missing measurements will never be known, the goal is to fill in the missing cells of the database with values that preserve as much as possible the correlations and patterns of the existing data. In that way the imputed dataset can be used directly in further applications, without introducing too much spurious information that would reduce the effectiveness of the data.
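This masking-based evaluation can be sketched as follows; the data are synthetic and a simple linear interpolator stands in for the imputation model, which would be swapped for the closed-loop network in practice. The key point is that the error is scored only on cells whose true value is known but was deliberately hidden.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth values that are actually present in the dataset.
data = np.sin(np.linspace(0, 6, 300)) + rng.normal(scale=0.05, size=300)

# Artificially mask 10% of the known values to score the imputer.
mask = rng.random(data.size) < 0.10
held_out = data.copy()
held_out[mask] = np.nan

# Stand-in imputer (linear interpolation); the closed-loop network
# would be plugged in here instead.
idx = np.arange(held_out.size)
known = ~np.isnan(held_out)
imputed = held_out.copy()
imputed[~known] = np.interp(idx[~known], idx[known], held_out[known])

# Mean absolute error computed on the artificially masked cells only.
mae = np.abs(imputed[mask] - data[mask]).mean()
print(f"MAE on masked cells: {mae:.3f}")
```

Reporting the error on masked cells gives an honest estimate of how well the genuinely missing cells are being filled, under the assumption that the masked cells are statistically similar to the truly missing ones.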
Coming back to the example of vehicle traffic, the highway system of the Netherlands is a prime example of a dense and extensive sensor network. With sensors – magnetic induction loops, or simply “loops” – that measure vehicle flow and speed per lane roughly every 500 metres, it has been producing some of the most comprehensive datasets of highway traffic for many years. To impute missing measurements, we use a closed-loop predictive regression network that fills in the missing data minute by minute. A small slice of the resulting completed dataset is shown below, with imputed values in orange. The average error in estimating missing values is ~5%.
It is possible to apply this approach to any ordered dataset, but depending on the specific needs a hybrid model is usually adopted for the end result. For the highway data, for example, imputing a vehicle speed does not make sense when the vehicle flow is zero, and we can impose that constraint at the output of our imputation.
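A constraint of this kind is cheap to impose as a post-processing rule on the network output. The sketch below uses invented flow/speed values; the rule simply blanks any imputed speed in minutes where the imputed flow is zero, since there is no vehicle whose speed could be measured.

```python
import numpy as np

# Hypothetical imputed lane measurements for five minutes:
# vehicle flow (veh/min) and raw network speed output (km/h).
flow = np.array([12.0, 0.0, 7.0, 0.0, 15.0])
speed = np.array([92.0, 61.0, 88.0, 54.0, 97.0])

# Hybrid rule imposed at the output: with zero flow there is no
# vehicle to measure, so the imputed speed is not meaningful.
speed_out = np.where(flow > 0, speed, np.nan)

print(speed_out)
```

The same pattern extends to other domain constraints (non-negative flows, lane capacity limits), keeping the learned model generic while the hybrid output layer encodes the physics of the measurement.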
The closed-loop neural networks that form the basis of the method belong to the wider family of generative models that includes Generative Adversarial Networks, in which the objective function of one network contains the output of another (here the second network is structurally the same, but its parameters are taken from a previous training step). These models are becoming central in problems of constrained optimization that mimic real-world datasets, of which imputation is an example often needed in practice.