Before starting Feature Labs, I researched data science automation in the Data to AI Lab at MIT. Unlike most data scientists, who work in a single domain, our group had sponsors from a wide range of industries. This gave us a rare opportunity to develop solutions across a diverse set of problems.
Yet regardless of the problem, we realized that our biggest challenge was reducing the time it took to reach a solution. To speed up the process, we used databases to store and query data, open-source machine learning libraries to train our AI systems, and scalable cloud infrastructure to run our platform.
Although we used every tool at our disposal, it still took three months on average to develop a single end-to-end solution. As engineers, we were certain of one thing: there had to be a better way.
The Feature Engineering Bottleneck
The data we received was not only extremely detailed but often distributed across multiple files or database tables. While we could easily describe its potential, we still had to manually prepare the data for the machine learning algorithms. These algorithms need the data in a single table, with training examples in the rows and the explanatory variables (also known as features) in the columns.
This data representation for machine learning is called the “feature matrix,” and “feature engineering” is the process of identifying and extracting predictive features from the complex data that enterprises typically work with.
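To make the idea of a feature matrix concrete, here is a minimal sketch of the kind of manual preparation described above, written with pandas. The tables, column names, and values are all hypothetical and chosen only for illustration; they are not taken from any of the projects mentioned here. The sketch collapses a transactions table into one row per customer and joins the result to the customers table, so that each row is a training example and each column is a feature.

```python
import pandas as pd

# Hypothetical raw data spread across two tables (names and values are illustrative only).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_year": [2015, 2016, 2016],
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13, 14],
    "customer_id": [1, 1, 2, 3, 3],
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0],
})


def build_feature_matrix(customers, transactions):
    """Return one row per customer with hand-written aggregate features."""
    # Summarize the many transactions of each customer into a few numbers.
    per_customer = (
        transactions.groupby("customer_id")["amount"]
        .agg(total_amount="sum", mean_amount="mean", num_transactions="count")
        .reset_index()
    )
    # A left join keeps every customer; one with no transactions would get NaN values.
    feature_matrix = customers.merge(per_customer, on="customer_id", how="left")
    return feature_matrix.set_index("customer_id")


feature_matrix = build_feature_matrix(customers, transactions)
print(feature_matrix)
```

Every feature here (total, mean, and count of transaction amounts) had to be thought of and coded by hand. With more tables and more candidate features, this manual step is where the time goes.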
Feature engineering is challenging because it depends on human intuition to interpret the implicit signals in a dataset that machine learning algorithms can use. Consequently, it is often the determining factor in whether a data science project succeeds. As Stanford professor Andrew Ng put it, “…applied machine learning is basically feature engineering.”
The importance of feature engineering is clear. Unfortunately, it is frequently the bottleneck in the data science process because it requires both domain expertise to brainstorm ideas and the technical expertise to implement them.
This means that, in most enterprise organizations today, only a select group of people can develop the best predictive models.
Inventing Feature Engineering Automation
Renowned machine learning professor Pedro Domingos says, “one of the holy grails of machine learning is to automate more and more of the feature engineering process.” In 2014, we made it our goal to create an approach to automate feature engineering for real-world datasets.
Our work at MIT became the “Data Science Machine,” which we entered in key data science competitions. The results at the time showed the potential of what we were building: our system outperformed 615 of the 906 human teams that we competed against.
We accomplished this with an approach that took hours, instead of weeks, to run. While the system used many advanced techniques, the innovation integral to our success was Deep Feature Synthesis, our algorithm for automated feature engineering. We demonstrated that our system could reach human-level performance. But our goal was never to replace human data scientists; rather, we sought to augment their work.
According to Gartner, even if an “organization already has a data science team … it may need to be enhanced with even more specialized data science skills specific to machine learning, such as feature engineering and feature extraction.” With Deep Feature Synthesis, we could make it easier for people to not only learn data science, but apply it too.
Feature Labs: Making Machine Learning More Accessible
Companies today have a growing number of machine learning needs, but they face a shortage of data scientists who can solve them. Paradoxically, more people than ever before are interested in learning the data science process, but they don’t have access to the tools that make it easy or even feasible to learn.
Feature Labs is committed to dramatically increasing access to machine learning by making automated feature engineering a core part of our product. In the last few months, we have:
- Released Featuretools, a free Python library for automated feature engineering
- Shared how Spanish bank BBVA used our automated feature engineering technology to improve their fraud detection models
- Announced that Featuretools will be taught to over 800 students in a new class offered by MIT for professionals
Through these initiatives, we want to lower the barrier to entry for those who want to help companies address their machine learning difficulties and become the data scientists the business world so desperately needs. With Feature Labs, building high-performance machine learning models that generate value for businesses is accessible to everyone.