Lyft Designs the Machine Learning Software Engineering Interview

Iterations on revealing recurring patterns of thought, feeling, and behavior

Lyft’s mission is to improve people’s lives with the world’s best transportation and it’ll be a slow slog to get there with dispatchers manually matching riders with drivers. We need automated decision making, and we need to scale it in a way that optimizes both the user experience and the market efficiency. Complementing our Science roles, an engineer with a knack for practical machine learning and an eye for business impact can help independently build and productionize models that power product experiences that make for an enjoyable commute.

A year and a half ago, when we began scouting for this type of machine learning-savvy engineer (something we now call the Machine Learning Software Engineer, or ML SWE), it wasn’t something we knew much about. We looked at other companies’ equivalent roles but they weren’t exactly contextualized to Lyft’s business setting. This need motivated an entirely new role that we set up and started hiring for.

Most companies are open about the expectations for the role being interviewed for, the interview process, and preparation tips. Most have recruiters give verbal tips, some provide written guides, and fancy ones hire entire studios to produce feature films. What you probably won’t find, however, are notes about how they designed those interviews in the first place. Admittedly, we’re no exception.

Typical interview funnel.

As a candidate, it’s easy to get a foot in the door and be evaluated by an interviewer. As an interviewer, it’s easy to get thrown in a room and be asked to evaluate a candidate. What’s not easy is for an organization to design and scale an interview process while ensuring its consistency and reliability for the candidates, the interviewers, and the hiring managers making the hard choices.

Overview

In a break from convention, this post lays out the motivations for our interview design and a set of principles that guide our iterative approach to it. To illustrate our approach, we focus on the ML SWE “interview loop” and dive deeper into how we applied these principles to the modeling onsite interview. We address the key components of our interview loop design in each of the following sections:

Defining problems,
Harmonizing scale,
Uncovering talent, and
Tracking progress.

Defining problems

Before diving into the actual design principles for an interview loop, we need to understand the motivation for the loop. The motivation comes from what we want out of the role, which in turn helps us define what we should look for in a candidate. The following are three questions that we need to address:

(1) What are Lyft’s challenges (and can a specific role help)?
(2) What should the role be with respect to the organization’s goals?
(3) What are the desired skills, knowledge, and talents given the expectations for the role?

We’ve motivated Lyft’s challenges and introduced the role of the ML SWE above. What’s left is to define the necessary ingredients for what a successful hire looks like vis-a-vis (3) in Lyft’s context:

Skills acquired through practice,
Knowledge learned through study and personal experience, and
Talents that make each candidate unique.

The desired skills and knowledge are simple to define and test for; e.g., skills needed to prototype a model and knowledge of ML theory, concepts, and libraries. You can gain them from study, practice, and experience. We won’t belabor them here but let’s be more specific about what we mean by the talents we care about.

Our desired talents are recurring patterns of thought, feeling, and behavior that can be productively applied in the context of Lyft’s ML SWE role. What we’re looking for here is a bit more complicated than simply work done in the past by a candidate. Faced with the same stimuli, people react and behave differently. When we look for role and values fit, we do mean just that. Beyond skills and knowledge, will a candidate’s unique way of responding to the problems thrown up in Lyft’s business context help that candidate succeed? So while conventional wisdom might suggest it, we’re not always looking for the Michael Jordans of machine learning (be it I. or J.). The narrow sort of talents associated with celebrated excellence can be important but in most cases the interviewers are listening for predictive clues of how a candidate will react when posed Lyft-specific problems on the job. (We found Gallup’s First, Break All the Rules useful in explaining distinctions between skills, knowledge, and talents through their data-driven approach.)

Since Lyft’s business context changes (sometimes rapidly when going public), our desired talents change and so, too, must our interviews. In yesteryear’s context, we may have valued a candidate’s comfort and excitement with ambiguous business problems because there was plenty of low-hanging fruit to pick. Tomorrow, we may value clarity of mind in organized problem solving more because, as our product matures, the biggest business opportunities lie in ensuring smooth interactions between product features. Iterating on the interviews is an important part of recognizing change and ensuring that the role stays relevant.

The section title is thus an extended pun on the different types of definition problems we face in our interview design:

Lyft’s high-level problems that stem from its business context,
The perceived problems that an ML SWE role should readily tackle, and
The practical problems that candidates respond to in an interview to help interviewers learn about them.

Once we can comfortably address these problems, we think about scaling the interview. In particular, how can different interviewers and hiring managers consistently and reliably assess role and level fit across diverse candidates?

Harmonizing scale

Most of this post focuses on what goes on between the interviewer and the candidate. That said, there are a couple of layers of the interview process that we need to peel away to set context for how we actually design the problems and evaluations that help us look for talent. At a high level, a good interview design harmonizes all of these components.

First, candidates on the ML SWE loop go through Lyft’s hiring review. The review is a regularly scheduled session for a committee to study candidates with an unbiased perspective and decide whether to hire them. Working alongside the review committee is a separate panel of interviewers that provides technical feedback. This feedback is designed to help the committee decide if there’s a fit and, if so, the candidate’s technical level. At first glance, this review process may seem cumbersome. Examining the checks and balances more carefully, however, we notice that they are intentionally introduced to put friction on the hiring process. Having a consistent review committee unifies standards and eliminates bias.

Second, the loop consists of a technical phone screen followed by a series of onsite interviews in no particular order. The phone screen determines if the candidate’s working knowledge of fundamental ML concepts and basic coding skills would allow them to succeed on site. The onsites are a set of interviews to further test the depth of the candidate’s skills and knowledge in practical job requirements for the role. Additionally, the onsites also provide opportunities for candidates to reveal their talents when faced with various technical challenges.

The various interviews of a typical ML SWE interview loop. We mix and match from this set of interviews depending on the role and candidate.

These two layers of the interview process are dynamic. When I joined Lyft, the hiring review didn’t exist and we simply had debriefs conducted with the interview panel and the hiring manager. It wasn’t efficient. When we first rolled out the ML SWE loop, we simply tacked on a couple of modeling interviews to the standard SWE loop. As our understanding of the ML SWE’s role at Lyft evolved, we rapidly learned that the loop needed to be better contextualized to the role. We introduced a committee to revamp the loop. And being part of the committee requires grasping how these pieces intertwine with the questions that interviewers ask and candidates answer.

Uncovering talent

Pairing a candidate with an interviewer lets us test for the candidate’s skills, knowledge, and talent. Skills are taught and mastered. We even offer internal “mastery courses” for machine learning. Knowledge is picked up through experience and personal learnings. Both are important and part of the bar and leveling criteria we set for any hire. To decide if there is a fit, however, simply validating a candidate’s skills and knowledge is insufficient. We need to know if her recurring patterns of thought, feeling, or behavior match the role.

As much as we hate to admit it, interviews that last barely an hour are blunt tools for teasing out the talents that are relevant to the role. It’s tricky to design questions and ask them in a way that reveals consistent behavior across multiple interviews and interviewers. It’s even trickier if you consider the confounding effect of having practiced for such interviews: is it practiced interview behavior or natural behavior? At Lyft, we regularly discover problems that no one has ever solved before. As much as we appreciate the time and effort put into preparing for the interview, interview skills aren’t equivalent to on-the-job performance.

In scientific parlance, interviews have low statistical power and a high sampling cost. Nevertheless, here are some design principles we found useful in sharpening our tools. To illustrate, we discuss how these principles apply to the design of the ML modeling onsite interview.

Pick calibrated open-ended problems

The goal of the interview is to predict how candidates naturally perform when placed in Lyft’s business context. To that end, our interviewers design and draw from a pool of diverse problems amenable to a variety of potential solutions. This way, candidates aren’t forced to study any specific topic to piece together a good response. Further, it reduces the likelihood that candidates can practice for them.

In the context of the modeling onsite, we ask open-ended problems with sufficient business and problem context such that the candidate can clearly identify an ML-based approach to solve it. After all, we make no attempt to hide the fact that we’re hiring an ML SWE. A key ingredient to a good interview is therefore introducing the implicit problem constraints by way of context early without biasing the candidate to any one right answer. Interviewers polish their problems and calibrate them against trusted peers before using them in formal interviews.

Create room to evaluate expectations

In the modeling interview, we want to test how the candidate turns a vague business problem described in prose into a well-defined ML-amenable problem; e.g., regression or classification. The interviewer’s focus is on the candidate’s spontaneous response. So while the interviewers provide business context and clarifications, they deflect leading questions and stress that the candidate should drive the conversation. What’s important is how the candidate interprets and approaches the business problem. This way of deferring to the candidate to drive the conversation applies to design questions and experience interviews, too.

Of course, the candidate’s response is confounded with experience. If the candidate isn’t comfortable with driving the conversation even when pushed, the interviewer offers guidance while noting what was offered. Again, there’s a fine line to draw between giving ample context without giving too much away. Part of designing the loop involves creating an interviewer onboarding process that aligns the interviewers on the specifics we should look out for. When we design the interview loop, we strive to be scientific without forgetting that a lot of it is an art that the interviewers need to learn and practice.

Encourage creativity and provide validation

To sufficiently evaluate candidates, the modeling problem must be ambiguous enough that there can be multiple problem definitions and different good ML approaches. In fact, the candidate should be inspired by the ambiguity to get creative. It is the interviewer’s job to assess whether a creative approach is valid or simply “too creative” and try to steer the candidate in a relevant direction. For instance, I once had to push a candidate to try a more obvious heuristics-backed ML approach for the immediate problem instead of formulating a generalized version of that problem as, say, a non-convex constrained optimization problem. The gravitation towards a more generalized, complicated solution is part of the talent we want to learn about but it’s also important to stay on track so that the talents are assessed in the right context.

One way to ensure that the discussion stays on track is to know when to provide validation and steer the conversation. Even when we have an open-ended challenge, interviewers have to let the candidate know when to dive deeper and when to move on. Decomposing the broader problem into big, rough chunks of subproblems allows candidates to respond on their own terms while staying within the bounds of what we want to assess. In practical terms, we’re simulating various stages of working with an ML model at Lyft. We may want to think about problem formulation at the prototyping phase. We may also want to think about model evaluation at the productionization phase. These are broad enough categories that candidates can offer clues to their recurring thought patterns while making sure all interviewers adhere to a consistent set of standards around skills, knowledge, and talents to uncover.

Beyond mirroring real-world conditions, the interview is a good avenue to provide a feel of the types of problems we tackle at Lyft and establish brand. With that in mind, our questions are generally Lyft-inspired problems that candidates won’t find elsewhere. For consistency, each question is peer-reviewed before being published to an internal questions repository. Interviewers conduct informal interviews with colleagues and get them to sit in actual interviews as “shadowers” for feedback.

The overarching goal is to maximize the discerning power of our interviews and minimize their predictive variance. Of course, interviewers will have objective and subjective feedback. It’s important that team members interview a candidate and think about whether the candidate and the broader Lyft organization match each other. The challenge is that our candidates are diverse and it’s impossible to maintain an exhaustive interviewer’s checklist of how they can fit in at Lyft. We rely on our interviewers, their training, and our internal calibration to help discern that.

Tracking progress

Once we have the key components of a loop, it’s crucial to maintain a solid feedback loop to measure the health of our interview funnel over time. As difficult as it is to obtain enough samples for statistical metrics, we strive to use data to decide if something is unhealthy. For instance, suppose that the healthy pass-through rates were 50% from phone screen to onsite and 25% from onsite to offer. If the observed rates were instead 70% from phone screen to onsite and 10% from onsite to offer, something is likely broken in our phone interview process: it is letting through too many candidates who then don’t succeed on site. In this case, we look closer at the phone interviews, confer with the interviewers, and examine anomalous cases to establish broader trends. Any subsequent changes are monitored closely to ensure that our rates stabilize.
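To make the comparison above concrete, here is a minimal sketch of how such a funnel check could look. It is not our actual tooling: the `Stage` class, the `check` function, the candidate counts, and the choice of a 95% Wilson score interval to account for small samples are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple


@dataclass
class Stage:
    """One funnel stage (hypothetical structure, for illustration only)."""
    name: str
    entered: int      # candidates who reached this stage
    passed: int       # candidates who advanced to the next stage
    baseline: float   # assumed healthy pass-through rate for this stage


def wilson_interval(passed: int, entered: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed pass-through rate."""
    if entered == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = passed / entered
    denom = 1 + z * z / entered
    center = (p + z * z / (2 * entered)) / denom
    half = z * sqrt(p * (1 - p) / entered + z * z / (4 * entered * entered)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def check(stage: Stage) -> str:
    """Compare the assumed baseline against the interval around the observed rate."""
    low, high = wilson_interval(stage.passed, stage.entered)
    if stage.baseline < low:
        return "above baseline"
    if stage.baseline > high:
        return "below baseline"
    return "consistent with baseline"


if __name__ == "__main__":
    # Made-up counts that reproduce the 70% and 10% rates from the example.
    funnel = [
        Stage("phone screen -> onsite", entered=200, passed=140, baseline=0.50),
        Stage("onsite -> offer", entered=140, passed=14, baseline=0.25),
    ]
    for stage in funnel:
        print(f"{stage.name}: {stage.passed / stage.entered:.0%} is {check(stage)}")
```

With only a handful of candidates (say, 7 of 10 passing the phone screen), the same 70% rate falls within the interval around a 50% baseline, which is one way to express why we hesitate to react before enough samples accumulate.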

Loops for different roles and even the same loop at different times will yield slightly different baselines. For instance, a pipeline with fresh grads from a recruiting event probably has higher offer acceptance rates than one with senior candidates we’re sourcing. The recruiting team is represented on the loop design committee to offer guidance on what reasonable numbers are. Once the loop is “live,” the recruiting team regularly pulls these numbers and calls out potential trends. Beyond the interview funnel, it may be of interest to track longer-term trends in the hired candidates’ performance.

YMMV: healthy numbers for an interview loop for the phone screen to onsite, to offer, and to offer acceptance are, for example, 50%, 25%, and 70%.
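As a rough worked example of what those illustrative rates imply, the sketch below multiplies the stage-to-stage rates to get an end-to-end yield and the number of phone screens needed per accepted offer. The function names are hypothetical, and the rates are read as successive stage conversions, which is an assumption about how the example numbers are meant.

```python
from math import prod

# Stage-to-stage pass-through rates from the example above (illustrative, not real data).
EXAMPLE_RATES = {
    "phone screen -> onsite": 0.50,
    "onsite -> offer": 0.25,
    "offer -> acceptance": 0.70,
}


def end_to_end_yield(rates) -> float:
    """Fraction of phone-screened candidates who end up accepting an offer."""
    return prod(rates)


def screens_per_hire(rates) -> float:
    """Expected number of phone screens needed for one accepted offer."""
    return 1.0 / end_to_end_yield(rates)


if __name__ == "__main__":
    rates = list(EXAMPLE_RATES.values())
    print(f"end-to-end yield: {end_to_end_yield(rates):.2%}")
    print(f"phone screens per accepted offer: {screens_per_hire(rates):.1f}")
```

Under these example numbers, roughly 9% of phone-screened candidates would accept an offer, or about 11 phone screens per hire, which gives a sense of why even small shifts in a single stage matter.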

Sharing information

There are good reasons to be cagey about the interview design process. What if the information helps candidates game the interview? What if the things we’re looking for in our interview change over time? What if the design process is actually bad? Worse, what if folks realize that their LeetCode premium subscriptions won’t help for the ML SWE loop?

Despite the what-ifs, being transparent about how we design interviews can improve our interviews. Call it enlightened self-interest: candidates invest time to talk to us and we mutually benefit from learning if there is a good fit. Even if there isn’t an immediate fit, positive experiences build brand and improve candidate sourcing. Maybe the candidate can reapply when the timing is better. Practically, hiring an engineer easily costs tens of thousands of dollars. By showing how we iterate on our interviews, we reveal what we truly care about and how we try to probe for it, hopefully adding to the virtuous cycle for the hiring pipeline.

This post focuses on the ML SWE loop because I’m familiar with it. I’ve conducted over a hundred variants of it with candidates of all backgrounds since the inception of the loop. Having worked with the committee that defined the role and the recruiting team to introduce the loop, I’ve had a hand in designing and tweaking the interview goals, formats, and question banks for the set of specific skills, knowledge, and talents we want. More recently I motivated and onboarded more team members onto the loop to help expand our pool of interviewers. The other interview loops I’m on and the various interviews I’ve had with other companies are also good references.

Ultimately, we want good candidates that match our role and values to succeed, and we wanted them yesterday.