One of the biggest challenges data scientists need to address is the creation of prediction problems from large data sets.
A typical data science endeavor begins when domain experts decide to solve a particular problem by marshaling the data repositories available to them.
Once this problem is defined, data scientists enter the picture and work through a sequence of steps toward a solution. These steps include working with the data collection and maintenance teams to understand how the data is organized and what it contains, and then returning to the domain experts to propose potential predictive formulations and ask whether particular types of predictive models would be helpful, given some acceptable level of accuracy.
This process is typically challenging, since both domain experts and data scientists want to learn about the data and create a prediction problem at the same time. Because of that, the process is iterative, and it often takes a team several attempts before they can take even the first steps toward building a predictive model.
At the end of this process, data scientists settle on a formulation for the prediction problem. But because this happens before any model is actually built, no one involved can know in advance whether a good predictive model can be generated.
Most of the prediction problem formulations data scientists come up with share many common elements. For example, they include picking a time-varying column, selecting a window, and applying a limited set of operations to create the outcome.
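To make those three elements concrete, here is a minimal sketch of how a single formulation might compute its outcome. The event log, the column it carries (a purchase amount), the names `EVENTS`, `OPERATIONS`, and `label`, and the choice of operations are all illustrative assumptions, not part of any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (customer_id, timestamp, purchase_amount).
# purchase_amount plays the role of the time-varying column.
EVENTS = [
    ("a", datetime(2023, 1, 3), 20.0),
    ("a", datetime(2023, 1, 20), 35.0),
    ("a", datetime(2023, 2, 10), 50.0),
    ("b", datetime(2023, 1, 15), 10.0),
]

# A deliberately small set of operations (an assumption for this sketch).
OPERATIONS = {
    "sum": sum,
    "count": len,
    "max": lambda values: max(values, default=None),
}


def label(events, entity, cutoff, window, operation):
    """Apply `operation` to the column values that fall inside
    the window [cutoff, cutoff + window) for one entity."""
    values = [
        amount
        for entity_id, timestamp, amount in events
        if entity_id == entity and cutoff <= timestamp < cutoff + window
    ]
    return OPERATIONS[operation](values)


# Outcome for customer "a": total spend in the 30 days after January 1.
outcome = label(EVENTS, "a", datetime(2023, 1, 1), timedelta(days=30), "sum")
```

Changing the column, the window, or the operation yields a different prediction problem over the same data, which is why these formulations tend to look alike.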
This challenging process raises a critical question:
Can we automatically list the prediction problems we would want to consider for a given data set, and can we do it without manually manipulating the data or knowing a lot about that data beforehand?
The answer to that burning question among data scientists is, “Yes.”
To get to that “yes,” however, data scientists and those who ask questions of data need a common language to describe prediction problems – a language that’s descriptive enough to be general-purpose, but limited enough to be enumerable. This is the path to automatically formulating hundreds, if not thousands, of prediction problems for a single data set.
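As a rough illustration of what "enumerable" means here, the sketch below treats a prediction problem as a combination of a column, a window, and an operation, and lists every combination. The column names, window lengths, operations, and the `enumerate_problems` helper are hypothetical choices made for this example; a real language for prediction problems would likely be richer.

```python
from itertools import product

# Hypothetical vocabulary for the sketch.
COLUMNS = ["purchase_amount", "session_length"]
WINDOWS_DAYS = [7, 30, 90]
OPERATIONS = ["sum", "count", "max", "mean"]


def enumerate_problems(columns, windows_days, operations):
    """Yield one candidate prediction problem per
    (column, window, operation) combination."""
    for column, window, operation in product(columns, windows_days, operations):
        yield {
            "column": column,
            "window_days": window,
            "operation": operation,
            "description": (
                f"Predict the {operation} of {column} "
                f"over the next {window} days"
            ),
        }


problems = list(enumerate_problems(COLUMNS, WINDOWS_DAYS, OPERATIONS))
# 2 columns x 3 windows x 4 operations = 24 candidate problems
```

Because the vocabulary is finite, the number of candidates is just the product of the choices, and it grows quickly as columns, windows, and operations are added.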
Organizations can already use automated feature engineering and machine learning tools to solve these prediction problems automatically and present the results to a domain expert or data scientist.
Among the many questions they put to their data, data scientists also ask:
“How can we collapse the iterative process to automatically generate prediction problems?”
And with the above question answered through automated feature engineering, they also ask:
“With all this time freed up through automation, and the ability to increase the exploration of prediction problems a thousand-fold, how can data science advance the business in ways never before possible?”