Detecting and Correcting Irregularly Spaced Time Series Data With EvalML
Many modern machine learning models leverage time series data to predict future outcomes. Rather than relying only on static data to predict a target variable, time series data adds the element of chronology. This can be leveraged for any time-based task, such as predicting tomorrow’s temperature or forecasting sales growth by season. Popular models such as ARIMA and Facebook’s Prophet are designed specifically with time series problems in mind. However, models like Prophet require perfect time series information.
What do we mean by “perfect”? Plenty of small problems can keep your data from being perfect. Say you have a series of measurements you took at hourly intervals for a week… except for 2pm on Tuesday. For whatever reason, that one timestamp is missing. Or maybe you accidentally measured all of Wednesday twice!
Let’s consider an example:
Here we have a very simple time series problem where the only feature is the date, and the target follows a very clear sinusoidal path. However, you can tell there are a few things wrong with this data. January 11th has been removed by setting it to None, January 13th and February 9th have been dropped entirely, and January 22nd was accidentally recorded as the 21st. On a dataset this size, it isn’t too hard to go in and fix these problems manually. If we had a series of 500 dates instead of 50, however, this would be a lot harder. These sorts of mistakes are very frequent in the real world, and they make the data hard to model. This is where Woodwork and EvalML’s new tools come in.
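A dataset with these defects can be sketched in plain Python. This is only an illustration: the year (2022), the January 1st start, the sine period, and the name build_messy_series are assumptions, not details taken from the original example.

```python
import math
from datetime import date, timedelta


def build_messy_series():
    """Return (date, target) rows for 50 daily readings with four defects."""
    start = date(2022, 1, 1)  # assumed start date, for illustration only
    rows = [
        (start + timedelta(days=i), math.sin(i / 5))  # clean sinusoidal target
        for i in range(50)
    ]
    # January 22nd is mis-recorded as the 21st, creating a duplicate date.
    rows[21] = (date(2022, 1, 21), rows[21][1])
    # January 11th is blanked out rather than removed.
    rows[10] = (None, rows[10][1])
    # January 13th and February 9th are dropped entirely.
    dropped = (date(2022, 1, 13), date(2022, 2, 9))
    return [row for row in rows if row[0] not in dropped]


rows = build_messy_series()
```

Because January 1st sits at index 0, the blanked-out January 11th lands at index 10.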
Woodwork recently released a new utility function, infer_frequency. If your column of DateTime values has perfect time intervals and no mistakes, it returns the inferred frequency without issue. If your DateTime column does have errors, however, it will pass back a handy-dandy debug object. This debug object tells you not only where your errors occur, but also what kind of errors they are. In the case of our toy example, we get:
Notice here that the debug object highlights the duplicate value on January 21st and the NaN value at index 10, which translates to the missing January 11th. This is reflected in the “missing values” section as well, where the NaN January 11th is listed alongside the other dates that were dropped from the original data.
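To make the three categories concrete, here is a rough plain-Python sketch of this kind of diagnosis. It is not Woodwork's algorithm or its output format: the function name diagnose, the dictionary keys, and the "most common gap" heuristic are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def diagnose(timestamps):
    """Report the dominant step plus null, duplicate, and missing timestamps."""
    nan_positions = [i for i, t in enumerate(timestamps) if t is None]
    counts = Counter(t for t in timestamps if t is not None)
    duplicates = sorted(t for t, n in counts.items() if n > 1)

    unique = sorted(counts)
    # Assumption: the most common gap between neighbours is the true frequency.
    gaps = Counter(b - a for a, b in zip(unique, unique[1:]))
    step = gaps.most_common(1)[0][0]

    # Walk the ideal grid and collect every slot that has no observation.
    missing = []
    current = unique[0]
    while current <= unique[-1]:
        if current not in counts:
            missing.append(current)
        current += step

    return {
        "step": step,
        "nan_positions": nan_positions,
        "duplicates": duplicates,
        "missing": missing,
    }


# Rebuild the toy date column (the year 2022 is assumed for illustration).
timestamps = [date(2022, 1, 1) + timedelta(days=i) for i in range(50)]
timestamps[21] = date(2022, 1, 21)  # the 22nd recorded as the 21st
timestamps[10] = None               # the 11th set to None
timestamps = [
    t for t in timestamps
    if t not in (date(2022, 1, 13), date(2022, 2, 9))  # dropped rows
]

report = diagnose(timestamps)
```

In this sketch the mis-recorded January 22nd also appears among the missing dates, since no row carries that date any more.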
EvalML leverages this new Woodwork function in the DateTimeFormatDataCheck, one of the default data checks we provide. This will give you different error messages based on what’s wrong with your data, and will suggest using a few new components in your pipelines in order to clean your data automatically.
The first step in creating a dataset that is properly ordered is figuring out what the proper ordering should be. This is what the TimeSeriesRegularizer component is designed to do. Using the inferred frequency given to us by Woodwork, it creates an ideal DataFrame with a column that perfectly fits the frequency. Then, all rows with completely correct times are ported over into this new DataFrame. If any rows share a timestamp with another row (like the January 22nd reading that was recorded as the 21st in our example), the later ones are dropped; the first occurrence of a date or time keeps priority. If any values don’t quite line up with the inferred frequency, they are shifted to a nearby missing time, or dropped if there is none close by.
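The core of that procedure can be sketched in plain Python. This is a simplified illustration, not EvalML's implementation: the name regularize and the fixed daily step are assumptions, and the sketch skips the step that shifts slightly misaligned values to a nearby empty slot.

```python
from datetime import date, timedelta


def regularize(rows, step=timedelta(days=1)):
    """Place (timestamp, target) rows on an evenly spaced grid."""
    first_seen = {}
    for timestamp, value in rows:
        # Rows without a timestamp cannot be placed; repeated timestamps lose
        # to the first occurrence.
        if timestamp is not None and timestamp not in first_seen:
            first_seen[timestamp] = value

    grid = []
    current, end = min(first_seen), max(first_seen)
    while current <= end:
        # Slots with no observation keep a None target for later imputation.
        grid.append((current, first_seen.get(current)))
        current += step
    return grid


messy = [
    (date(2022, 1, 1), 0.0),
    (date(2022, 1, 2), 0.2),
    (date(2022, 1, 2), 0.4),  # duplicate timestamp, dropped
    (None, 0.6),              # null timestamp, dropped
    (date(2022, 1, 5), 0.8),
]
regular = regularize(messy)
```

The result has exactly one row per day from the first to the last observed date, with None targets wherever a date had to be added.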
If we run this on our example dataset:
Our dataset now looks like this:
There are a number of significant gaps in this data. However, the lack of NaN values in the date column plus the inferable frequency is a great step forward. We now have one DateTime value at every step of the assumed interval. When we added the missing dates, we left the target that corresponds to each of them blank. We could pass this off to EvalML’s TargetImputer with no issue. This can fill in the missing target values with the mean, median, or most frequent target value. Let’s see what happens when we do that:
Well, that doesn’t look great. The problem here is that mean, median, and mode values are all static. They don’t account for the passing of time, or for the fact that neighboring data points may have more predictive power than a simple average. If we were working with an example that had other features besides the date, the same idea would hold for those features and EvalML’s Imputer component. This is where we can leverage the new TimeSeriesImputer component, which employs time series-specific imputation strategies on both features and target data. Let’s try that instead:
Suddenly, things are looking much better than they were before. This is a more believable way for data to behave over time, because time was taken into account. From a messy dataset with missing dates and odd targets, we have created a dataset with perfect intervals and sensible target values. This will provide much more predictive power than anything we had previously.
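The difference between the two approaches is easy to reproduce in plain Python. The sketch below compares a mean fill with linear interpolation on a sinusoidal target. Linear interpolation stands in here as one representative time-aware strategy; the function names, the sine period, and the gap positions are illustrative assumptions rather than EvalML's implementation.

```python
import math
from statistics import mean


def impute_mean(values):
    """Fill every gap with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_linear(values):
    """Fill interior gaps by interpolating between the nearest known points.

    Gaps before the first or after the last known value are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


truth = [math.sin(i / 5) for i in range(50)]
observed = list(truth)
for gap in (10, 12, 21, 39):  # positions whose targets were left blank
    observed[gap] = None

mean_error = max(abs(a - b) for a, b in zip(impute_mean(observed), truth))
linear_error = max(abs(a - b) for a, b in zip(impute_linear(observed), truth))
print(f"largest error with mean fill: {mean_error:.3f}")
print(f"largest error with linear interpolation: {linear_error:.3f}")
```

The mean fill ignores where each gap sits on the curve, so its worst error is far larger than that of interpolation, which only uses the neighbors of each gap.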
Tying it all together
While these components make it much easier to remedy irregularly spaced time series data on their own, they are also integrated into our AutoML process. The DateTimeFormatDataCheck discussed above outputs a suggested action of including the TimeSeriesRegularizer and TimeSeriesImputer in any AutoML pipelines, which is exactly what our function make_pipeline_from_data_check_output will do.
Passing the input X and y values through this pipeline will result in the same corrected dataset we arrived at by building it piece by piece. From here, this data can be passed into any EvalML model for successful training and prediction.