Quizlet has a fast-growing community of 25 million active users who log hundreds of millions of events daily. This massive influx of data allows us to do a lot of cool things, from gaining a better understanding of how students and teachers use Quizlet to improving Quizlet's efficacy as a study tool.
However, handling large amounts of data comes with a number of infrastructure challenges, and extracting useful information from this data often involves complex data processing pipelines, or data workflows as they’re often called. As Quizlet has become more data-driven, we found ourselves in need of a framework for orchestrating these pipelines. This led us to search for a workflow management system (WMS) that could carry out a robust set of operations and scale with Quizlet's future growth.
Choosing and deploying a great WMS is a complex project, and we wanted to share our experiences searching for a WMS and eventually deploying Apache Airflow. The process is so multifaceted that we found it difficult to fit it all into a single blog post! Thus, we introduce a four-part series of blog posts (hosted on Medium) aimed at sharing our insights. The series is organized as follows.
- Part I of the series introduces WMSs and motivates the need for them in modern business using an example data processing problem similar to the ones we often encounter here at Quizlet. We’ll refer back to this example throughout the series, as we believe that an end-to-end demonstration is helpful for explaining key concepts.
- In Part II, we present a wish list of features that we at Quizlet believe are essential for a WMS to meet our data processing needs. We then describe how we used this wish list to guide us through the landscape of available WMS projects, leading us to adopt Apache Airflow.
- Part III gives a detailed technical background on Airflow, including its key concepts and architecture, as we work through the example workflow introduced in Part I.
- In Part IV, the final post, we describe the initial Airflow deployment used here at Quizlet and share some key lessons we learned along the way. We then wrap up with Quizlet’s future plans for Airflow and data workflows in general.
We hope that the series provides useful material for a wide range of readers, from those just beginning their research into WMS projects to those who want an in-depth understanding of Airflow’s operation.
Many high fives go out to all the members of the Quizlet team who helped research and evaluate multiple workflow management systems, deployed Airflow, and provided thoughtful comments on this series of posts. I’m looking at you, Shane Mooney, Karen Sun, Amanda Baker, Miguel Flores, Tim Miller, Laura Oppenheimer, Amalia Nelson, and Andrew Sutherland.