One of the biggest challenges in machine learning workflows is identifying which inputs in your data will provide the best signals for training predictive models. For image data and other unstructured formats, deep learning models are showing large improvements over prior approaches, but for data already in structured formats, the benefits are less obvious.
At Zynga, I’ve been exploring feature generation methods for shallow learning problems, where our data is already in a structured format, and the challenge is to translate thousands of records per user into single records that summarize user activity. Once you have the ability to translate raw tracking events into user summaries, you can apply a variety of supervised and unsupervised learning methods to your application.
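To make the translation step concrete, here is a minimal pandas sketch of collapsing raw events into per-user summary records. The column names and aggregations are hypothetical stand-ins for a real event taxonomy:

```python
import pandas as pd

# Raw tracking events: one row per event. Column names are hypothetical.
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "event":     ["login", "purchase", "login", "login", "login"],
    "amount":    [0.0, 4.99, 0.0, 0.0, 0.0],
    "timestamp": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-05",
        "2019-01-01", "2019-01-03",
    ]),
})

# Collapse thousands of events per user into one summary record that a
# shallow learning model can consume.
summaries = events.groupby("user_id").agg(
    total_events=("event", "count"),
    total_spend=("amount", "sum"),
    first_seen=("timestamp", "min"),
    last_seen=("timestamp", "max"),
).reset_index()

print(summaries)
```

Tools like Featuretools automate the discovery of these aggregations, but the output has the same shape: one feature-rich row per user.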
I’ve been leveraging the Featuretools library to significantly reduce my time spent building predictive models, and it’s unlocked a new class of problems that data scientists can address. Instead of building predictive models for single games and specific responses, we’re now building machine learning pipelines that can be applied to a broad set of problems. I’ll be providing an overview of our approach at the AI Expo in Santa Clara.
I gave a deep dive into feature engineering in my AutoModel talk at Spark Summit. Since then, we’ve found a variety of different use cases for automated feature engineering. The key takeaway is that if you can translate raw event data into summaries of user behavior, then you can apply machine learning models to a variety of problems.
Predicting which users are likely to perform an action is useful for personalizing gameplay experiences. I used the Featuretools library to automate feature engineering for this project, but scaling to massive data sets required Spark’s newer Pandas UDF functionality.
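A sketch of the pattern: write the per-user summarization as an ordinary pandas function, which Spark can then distribute with its grouped-map Pandas UDF API (`df.groupBy(...).applyInPandas(...)` in PySpark 3.x). The function body and column names here are hypothetical; the example runs locally with a plain pandas groupby standing in for the cluster:

```python
import pandas as pd

def summarize_user(pdf: pd.DataFrame) -> pd.DataFrame:
    """Collapse one user's events into a single summary row.

    On Spark, a function with this shape can be distributed with the
    grouped-map Pandas UDF API, e.g.
    df.groupBy("user_id").applyInPandas(summarize_user, schema),
    so the same per-user logic scales out to massive event tables.
    """
    return pd.DataFrame({
        "user_id":     [pdf["user_id"].iloc[0]],
        "event_count": [len(pdf)],
        "total_spend": [pdf["amount"].sum()],
    })

# Local stand-in for the distributed version: apply the same function
# per user with a pandas groupby (columns are hypothetical).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount":  [4.99, 0.0, 9.99],
})
features = (
    events.groupby("user_id", group_keys=False)
          .apply(summarize_user)
          .reset_index(drop=True)
)
print(features)
```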
Collaborative filtering is a valuable tool for providing personalized content to users. Instead of using past purchases as the feature vector for collaborative filtering, I’ve been exploring a number of proxy variables for suggesting items.
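One standard way to act on such a proxy signal (not necessarily the approach used in production) is low-rank matrix factorization of a user-by-item interaction matrix. A minimal sketch with truncated SVD, where the interaction values are illustrative:

```python
import numpy as np

# User-by-item matrix built from a proxy signal (e.g. interaction
# counts rather than purchases). Values are illustrative.
interactions = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Low-rank factorization via truncated SVD: score every user-item
# pair from the rank-k reconstruction.
k = 2
U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Suggest the highest-scoring item the user has not yet interacted with.
user = 0
unseen = interactions[user] == 0
suggestion = int(np.argmax(np.where(unseen, scores[user], -np.inf)))
print(suggestion)
```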
Segmentation is one of the key outputs that an analytics team can provide to a product organization. If you can understand the behaviors of different groups of users within your product, you can provide personalized treatments to improve the engagement of your user base.
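Once users are summarized into single records, segmentation can be as simple as clustering those feature vectors. A minimal k-means sketch, with hypothetical features and illustrative data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Per-user summary features (hypothetical: sessions per week, total
# spend). Two obvious groups: casual players and high spenders.
features = np.array([
    [1.0,  0.0],
    [2.0,  0.0],
    [1.5,  0.5],
    [20.0, 50.0],
    [22.0, 45.0],
    [21.0, 55.0],
])

# Standardize so each feature contributes comparably, then cluster
# users into segments.
X = StandardScaler().fit_transform(features)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)
```

Each segment label can then drive a personalized treatment for that group of users.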
There are bad actors in any online environment. We’ve been exploring deep learning for this problem, and applying autoencoders to our generated feature sets has provided a powerful tool for flagging problematic users.
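The core idea is that an autoencoder trained on typical users reconstructs them well, so users with large reconstruction error stand out. As a dependency-light sketch, here is the linear analogue of that idea: project feature vectors onto the top principal components (PCA via SVD) and flag users the projection reconstructs poorly. The data, bottleneck size, and threshold are illustrative assumptions, not the production setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature vectors: most users follow a common pattern
# (feature 1 tracks feature 0); the last row is an injected bad actor.
normal = rng.normal(0.0, 1.0, size=(100, 5))
normal[:, 1] = 2.0 * normal[:, 0] + rng.normal(0.0, 0.1, size=100)
outlier = np.array([[0.0, -10.0, 0.0, 0.0, 0.0]])
X = np.vstack([normal, outlier])

# "Encode" by projecting onto the top-k components fit on the bulk of
# users, "decode" by projecting back, and measure how poorly each
# user is reconstructed.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
k = 2  # bottleneck size
proj = Vt[:k]
recon = (X - mean) @ proj.T @ proj + mean
errors = np.square(X - recon).sum(axis=1)

# Flag users whose reconstruction error is far above the rest.
threshold = errors.mean() + 3.0 * errors.std()
flagged = np.where(errors > threshold)[0]
print(flagged)
```

A deep autoencoder replaces the linear projection with a learned nonlinear encoder/decoder, but the flagging logic (large reconstruction error means anomalous) is the same.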
Automating the feature generation step in machine learning workflows unlocks new problems for data science teams to tackle. Instead of focusing on specific games, our data scientists can now build solutions that scale to a portfolio of titles.
Ben Weber is a distinguished data scientist at Zynga. We are hiring!