Feature Engineering for Automated Machine Learning

“One of the holy grails of machine learning is to automate more and more of the feature engineering process.” (Pedro Domingos, CACM 2012)

One of the biggest challenges in machine learning workflows is identifying which inputs in your data will provide the best signals for training predictive models. For image data and other unstructured formats, deep learning models are showing large improvements over prior approaches, but for data already in structured formats, the benefits are less obvious.

At Zynga, I’ve been exploring feature generation methods for shallow learning problems, where our data is already in a structured format, and the challenge is to translate thousands of records per user into single records that summarize user activity. Once you have the ability to translate raw tracking events into user summaries, you can apply a variety of supervised and unsupervised learning methods to your application.
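
To make the idea of collapsing many event records into one record per user concrete, here is a minimal sketch using pandas. The column names (`user_id`, `event`, `value`, `ts`) and the sample rows are assumptions for illustration, not the actual tracking schema.

```python
import pandas as pd

# Hypothetical raw tracking events: one row per event, many rows per user.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event": ["login", "purchase", "login", "login", "level_up"],
    "value": [0.0, 4.99, 0.0, 0.0, 0.0],
    "ts": pd.to_datetime([
        "2019-01-01", "2019-01-03", "2019-01-08", "2019-01-02", "2019-01-02",
    ]),
})

# Collapse the event log into a single summary row per user.
summary = events.groupby("user_id").agg(
    event_count=("event", "size"),         # how many events the user generated
    distinct_events=("event", "nunique"),  # how varied the activity was
    total_value=("value", "sum"),          # e.g. total spend
    first_seen=("ts", "min"),
    last_seen=("ts", "max"),
)

# Derived feature: number of days between first and last recorded activity.
summary["active_span_days"] = (summary["last_seen"] - summary["first_seen"]).dt.days

print(summary)
```

Each row of `summary` is a fixed-length feature vector, which is the shape that most supervised and unsupervised methods expect as input.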

I’ve been leveraging the Featuretools library to significantly reduce my time spent building predictive models, and it’s unlocked a new class of problems that data scientists can address. Instead of building predictive models for single games and specific responses, we’re now building machine learning pipelines that can be applied to a broad set of problems. I’ll be providing an overview of our approach at the AI Expo in Santa Clara.

I gave a deep dive into feature engineering in my AutoModel talk at the Spark Summit. Since then, we’ve found a variety of use cases for automated feature engineering. The key takeaway is that if you can translate raw event data into summaries of user behavior, then you can apply machine learning models to a variety of problems.

Propensity Models

Predicting which users are likely to perform an action is useful for personalizing gameplay experiences. I used the Featuretools library to automate feature engineering for a propensity modeling project, but scaling it to massive data sets required the newer Pandas UDF functionality in PySpark.
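
One way to picture the scaling pattern is to write feature generation as a function that takes a pandas DataFrame for one slice of users and returns one row per user. The sketch below runs that function locally over hypothetical buckets; the bucketing rule, column names, and the simple aggregations (standing in for a Featuretools run) are all assumptions, not the pipeline described in the talk.

```python
import pandas as pd

def summarize_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    """Build one feature row per user for a single partition of events."""
    return pdf.groupby("user_id").agg(
        event_count=("event", "size"),
        total_value=("value", "sum"),
    ).reset_index()

# Hypothetical event log.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event": ["login", "purchase", "login", "login", "level_up", "purchase"],
    "value": [0.0, 4.99, 0.0, 0.0, 0.0, 1.99],
})

# Hypothetical bucketing rule: all events for a user land in the same bucket,
# so each bucket can be processed independently.
events["bucket"] = events["user_id"] % 2

# Local stand-in for a cluster: process each bucket on its own, then combine.
features = pd.concat(
    [summarize_partition(part) for _, part in events.groupby("bucket")],
    ignore_index=True,
)

print(features)
```

In PySpark 3.x, a function with this shape can be handed to `groupBy("bucket").applyInPandas(summarize_partition, schema=...)` on a Spark DataFrame, so that each bucket is processed on a worker node. The schema string would need to match the returned columns.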

Recommendations

Collaborative filtering is a valuable tool for providing personalized content for users. Instead of using past purchases as a feature vector for collaborative filtering, I’ve been exploring a number of proxy variables to suggest items.
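
As a rough illustration of collaborative filtering over proxy signals, the sketch below scores unseen items for a user by weighting other users' activity by cosine similarity. The matrix values are invented, and treating engagement counts as the proxy variable is an assumption; the article does not say which proxies were used.

```python
import numpy as np

# Hypothetical user-by-item matrix of proxy signals (for example, engagement
# counts per item) rather than purchases. Rows are users, columns are items.
interactions = np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 4, 5],
    [0, 1, 5, 4],
], dtype=float)

def recommend(interactions, user, top_n=1):
    """Return up to top_n item indices the user has not interacted with yet."""
    # Normalize rows so that dot products become cosine similarities.
    norms = np.linalg.norm(interactions, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = interactions / norms

    sims = unit @ unit[user]  # similarity of every user to the target user
    sims[user] = 0.0          # ignore self-similarity

    # Score each item by the similarity-weighted activity of other users.
    scores = sims @ interactions
    scores[interactions[user] > 0] = -np.inf  # mask items already seen

    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked if np.isfinite(scores[i])][:top_n]

print(recommend(interactions, user=1))
```

User 1 looks most like user 0, so the item that user 0 engaged with and user 1 has not seen (item 1) ranks first.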

Archetypes

Segmentation is one of the key outputs that an analytics team can provide to a product organization. If you can understand the behaviors of different groups of users within your product, you can provide personalized treatments to improve the engagement of your user base.
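
A minimal sketch of segmenting users from their summary features is shown below, using a small k-means implementation in NumPy. The two features (sessions per week and total spend), the sample values, and the choice of k-means itself are assumptions for illustration; the article does not specify a clustering method.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical per-user features: [sessions per week, total spend].
user_features = np.array([
    [2.0, 0.0], [3.0, 1.0], [1.0, 0.0],        # lighter activity
    [20.0, 40.0], [22.0, 55.0], [18.0, 45.0],  # heavier activity
])

# Standardize so that no single feature dominates the distance metric.
scaled = (user_features - user_features.mean(axis=0)) / user_features.std(axis=0)

labels, centers = kmeans(scaled, k=2)
print(labels)
```

The cluster numbering is arbitrary, so downstream code should describe each segment by its center rather than by its label.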

Anomaly Detection

There are bad actors in any online environment. We’ve been exploring deep learning for this problem, and applying autoencoders to our generated feature sets has provided a powerful tool for flagging problematic users.
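
The core idea behind autoencoder-based flagging is reconstruction error: compress each user's features through a bottleneck, reconstruct them, and flag the users who reconstruct poorly. The sketch below is not a deep autoencoder. It swaps in a linear bottleneck computed with SVD (equivalent to PCA) as a stand-in, and the feature values and the cutoff are hypothetical.

```python
import numpy as np

# Hypothetical feature rows for typical users: the second feature is
# roughly twice the first.
normal = np.array([
    [1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1],
])

# Users to score: the typical users plus one who breaks the usual pattern.
candidates = np.vstack([normal, [[3.0, 0.5]]])

# "Train" the linear bottleneck on typical users only.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]  # keep one direction: a 1-D bottleneck

def reconstruction_error(X):
    centered = X - mean
    encoded = centered @ components.T  # encode into the bottleneck
    decoded = encoded @ components     # decode back to feature space
    return np.linalg.norm(centered - decoded, axis=1)

errors = reconstruction_error(candidates)

# Hypothetical cutoff: three times the worst error seen on typical users.
threshold = 3 * reconstruction_error(normal).max()
flagged = np.where(errors > threshold)[0]

print(flagged)
```

A trained neural autoencoder would replace the SVD step with learned nonlinear encode and decode functions, but the scoring and thresholding logic would keep the same shape.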

Conclusion

Automating the feature generation step in machine learning workflows unlocks new problems for data science teams to tackle. Instead of focusing on specific games, our data scientists can now build solutions that scale to a portfolio of titles.

Ben Weber is a distinguished data scientist at Zynga. We are hiring!
