At Quizlet, we want to learn how to make our users' learning experience even better, and a lot of that comes from understanding what our users are doing on Quizlet. Over the past few months, we've constructed our own data pipeline to better inform us about what's happening on Quizlet - no small feat with 287 million events per day (~3,320 per second)!
In the past, we've relied on New Relic Insights to learn about our users, but going forward we knew we wanted to create our own data pipeline and warehouse to enable everyone who works at Quizlet to learn about our users most effectively.
Our analysts want to query across all the data we have (such as event logs and database rows) in a standard, reliable, and quick way without having to learn non-standard query languages. Product managers want to run experiments and track user behavior across different segments and platforms. Developers want to instrument A/B tests without sacrificing development time.
To meet all these diverse needs, we arrived at the following high-level goals:
- Own our data - We care a lot about protecting our students' data. Controlling and owning it allows us to guarantee that, especially as student data privacy laws evolve.
- Have confidence in our data - We want accurate data to inform decisions, but we also want it fast.
- Keep our data long-term - We want to be able to look at data year over year, for years to come, to see improvements.
- Analyze and draw conclusions, quickly - If getting data is slow, no one is going to use it. We should be able to pose a question and run queries to answer it quickly, as well as query over data in near real-time.
- Query production MySQL data against event data - Event data about what our users are doing on Quizlet is orders of magnitude richer when analyzed along with our data about the content they're studying on Quizlet.
- Avoid significant latency in our page load times - Our web application, written in PHP, is single-threaded and does not support async operations. Users shouldn't feel any extra latency when they're trying to study.
- Move quickly with limited engineering resources - Building a data pipeline could be a 20-engineer project (and therefore become over-engineered). But with Quizlet's 20-person engineering team, I was the entire data-pipeline engineering team! Cool, right?
Now that we have our goals, let the planning begin!
First, we laid out our options. We could set up and manage an entire Hadoop cluster, use on-premise data warehouse software like Teradata, or use a hosted solution such as Redshift or BigQuery. In the end, we decided to go with BigQuery as our data warehouse. The fact that we'd just moved our infrastructure to Google Cloud Platform made BigQuery a great option, and it also met all of our goals above:
- We’d own our data.
- Functions like EXACT_COUNT_DISTINCT tell us the truth. Goodbye estimation for mission critical metrics!
- BigQuery has built-in export to Google Cloud Storage where we can store data cheaply for redundancy.
- BigQuery is ~fast~. Just try some sample queries over the large publicly available datasets and you’ll see what I mean. Also, streaming data means new events will show up in seconds.
- BigQuery is managed and easy to use. All we had to do was shovel data into it and forget about it, which would allow us to move quickly.
Sending our MySQL data to BigQuery is stretching its use-case, since BigQuery tables are append-only. That means we're missing the "UD" in CRUD. The way we get around that is by dumping our entire MySQL tables once every couple of days. We lose real-time analysis of our MySQL data with this approach, but it turns out to work just fine in practice.
We couldn’t send requests to BigQuery directly from our web application, since the latency was too high. We also didn’t want to tie all of our clients (backend web, front-end web, Android, and iOS) directly to BigQuery to allow for decoupling if we wanted to make changes to the pipeline. That would put more work on our engineering teams trying to build features for Quizlet users. Additionally, BigQuery imposes a rigid, typed table schema that we didn’t want the clients to worry about. That just meant even more engineering effort spread across the team.
Instead, we decided to write a pretty lightweight service we call the Eventlogger (creative, huh?) that runs on our web machines as well as on dedicated machines, serving as a middleman between our event creators (i.e. desktop clients, mobile clients, and backend web application) and BigQuery. Here's a high-level overview of how the pieces would fit together:
Sitting on many machines inside our own infrastructure, the Eventlogger provides an HTTP interface for our clients that accepts event logging requests to queue for streaming into BigQuery. It also gives us the flexibility to change what happens behind the scenes transparently. As mentioned, the interface had to be low-hassle for our clients and abstract away the rigidity of BigQuery's table schemas. We arrived at a simple POST endpoint that expects a JSON event document, along with a batch endpoint that expects an array of such documents.
We define metadata fields that are required for each event, the most important being `type`, which determines the table the event will be inserted into. Each Eventlogger instance keeps a local definition of the BigQuery table schemas to compare the `event` sub-document against. The `event` sub-document contains the fields of the event, which can be arbitrary. Fields that correspond to the table schema are removed from the event document, validated as the appropriate type, and inserted into the target table. Any fields not listed in the schema remain in the sub-document, which is then serialized into an `other` (STRING) column in the target table. That means fields not listed in the schema are still accessible as a JSON document in the event row's `other` column, via BigQuery's JSON functions.
Here's a practical example. We send an event to the Transaction table, which has a simple schema covering only a few fields. The event document is inserted into BigQuery as a row in that table, with schema-backed fields in their own columns and everything else serialized into `other`. This `other` system works out really well in practice. We could have let the Eventlogger patch BigQuery tables to create a column for every single key that comes its way, but then you end up with an unwieldy schema cluttered with that one piece of information you tracked for 5 minutes. With our system, we don't lose any data sent to the warehouse; it's always in the `other` column for analysis. If a piece of data turns out to be interesting, we can promote it to a first-class column.
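A rough sketch of this column-splitting in Go (the Transaction schema and field names here are invented for illustration, not Quizlet's actual schema, and type validation is elided):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical schema for the Transaction table: column -> BigQuery type.
var transactionSchema = map[string]string{
	"userId": "INTEGER",
	"amount": "FLOAT",
}

// splitEvent moves schema-backed fields into their own columns and
// serializes everything left over into the catch-all "other" column.
func splitEvent(event map[string]interface{}) (map[string]interface{}, error) {
	row := make(map[string]interface{})
	for column := range transactionSchema {
		if value, ok := event[column]; ok {
			row[column] = value
			delete(event, column)
		}
	}
	leftover, err := json.Marshal(event) // fields the schema doesn't know about
	if err != nil {
		return nil, err
	}
	row["other"] = string(leftover)
	return row, nil
}

func main() {
	row, err := splitEvent(map[string]interface{}{
		"userId":   42,
		"amount":   9.99,
		"campaign": "back_to_school", // not in the schema -> ends up in "other"
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(row["userId"], row["amount"], row["other"])
	// prints: 42 9.99 {"campaign":"back_to_school"}
}
```

Promoting a field later just means adding it to the schema map; older rows keep it queryable inside `other`.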
At a high level, the Eventlogger is divided into two parts—the server and the streamer. The server is responsible for accepting event logging requests over the HTTP interface, validating and transforming the events against the schema, and responding to clients once the event is deemed ready to send to BigQuery. After that, the event is enqueued via table-specific channels to be processed by the streamer, which is responsible for batching the events to be sent to BigQuery.
We decided to write the service in Go. Though we hadn't previously run it in production, we wanted to start moving away from infrastructure utilities written in Python and Bash. While we wanted to move away from both due to the lack of strong types, Bash in particular has caused numerous headaches due to its odd edge-cases. Overall, we're glad we picked Go. It's an extremely easy language to pick up, in large part due to its excellent documentation, and just as easy to deploy as a single compiled binary. That said, we hit a few gotchas and pain points, and learned some useful things along the way:
Every package imported in Go is expected to live in a global `GOPATH`. Furthermore, Go doesn't provide a built-in way to pin versions of your program's dependencies. While we didn't yet have other Go projects running to produce conflicting dependencies, we've seen that problem before. Luckily, Go 1.5 came with an experimental flag, `GO15VENDOREXPERIMENT`. If the flag is enabled, the compiler will look for imported packages in the local directory's `vendor` folder before `GOPATH`. We use glide to pin our dependencies inside of the `vendor` folder, so we have reproducible builds over time. There's an interesting discussion regarding dependency management on the Go Forum.
The `GOMAXPROCS` variable limits the number of operating system threads that your program can use simultaneously. Pre-Go 1.5, this value defaulted to `1`, which meant we weren't taking advantage of all of our machines' CPUs. Now, it defaults to the number of CPUs visible to the program at startup.
While we're able to run our binaries for long periods of time thanks to Go's managed memory, we still want to be able to deploy new versions of the service or update the service's knowledge of the table schemas. Taking an Eventlogger process down for even a split second would cause us to lose event data, since every single request we serve produces at least one event. We rely on grace to handle handing off the port and letting the old process finish outstanding requests. In our case, we have to do extra work after the last HTTP request is responded to, so we have to be mindful of handling the `SIGTERM` grace sends from the newly spun-up process to bring down the old version.
Results and Iteration
After about a month and a half of design, development, and stress testing, we launched our pipeline into production. As mentioned above, the pipeline (of which the Eventlogger is the center) serves 287 million events per day. The vast majority of those events come by way of PHP, where we are most sensitive to latency due to the lack of support for async operations. The big tradeoff is between added latency and events dropped due to timeouts.
At first, we had our connect and wait timeouts set to 5ms each, but we were dropping events much more frequently than we wanted, as seen in server logs. Unfortunately, we weren't yet recording metrics on dropped events, and my memory isn't perfect, so I don't have numbers. We enlisted ApacheBench to do some more stress testing; the latency distribution skewed higher as the number of concurrent requests was increased.
In response, we decided to batch events from the server to reduce the number of requests sent from PHP to the Eventlogger. It seemed like the change worked, but when comparing our Nginx logs to the number of rows stored in BigQuery, we noticed an unnervingly large discrepancy. At this point, we decided to add some telemetry to record the number of dropped events. For every request we recorded (a) did the request complete or timeout? and (b) how many events were sent?
The numbers were shocking! At peak traffic about 1/3 of events were dropped. At minimum traffic, nearly all events were being dropped. This means the number of requests in flight did not greatly impact how many events were dropped.
We also noticed that the number of successful requests was much larger than the number of requests dropped, which suggests that we were timing out requests with many events in the batch and successfully completing requests with few (e.g. one or two) events. In response, we stopped batching events and sent one event per request. The improvement was quite drastic:
For the rest of the month of October, we dropped ~0.02% of events. In the month of November, that number increased to ~0.05%. What happened? There were 4 abnormal events where we overloaded one of the Eventloggers with a rapid influx of events:
During those events we see a huge increase in connection timeouts. Each machine we run PHP on has a local instance of the Eventlogger running in order to reduce the roundtrip time to the service. One of these machines is responsible for processing batch jobs, such as sending newsletter emails to users, which results in a single event per email sent. Go's HTTP server starts a new goroutine for every incoming request. We believe the goroutine that accepts new TCP connections isn't scheduled often enough to accept all of the incoming requests, since it has to contend with the thousands of others handling already-accepted requests.
The batch processing is done in PHP, but doesn't actually run in-line with user-facing requests. To mitigate the connection timeouts, we decided to make two changes. The first was increasing the timeout length, giving the goroutine more time to accept incoming connections. The second was pointing that instance of PHP to a pool of machines dedicated to running the Eventlogger, load-balanced by Nginx, to reduce the number of goroutines started on any given instance. In our most recent email blast, we sent out 12 million emails about some new Quizlet features without a single timeout!
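The dedicated pool is plain Nginx load balancing in front of the Eventlogger instances; a minimal sketch with made-up internal hostnames and ports:

```nginx
upstream eventloggers {
    # hypothetical dedicated Eventlogger machines
    server eventlogger-1.internal:8080;
    server eventlogger-2.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://eventloggers;
    }
}
```

With round-robin distribution (Nginx's default), no single Eventlogger instance absorbs the full goroutine load of a batch job.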
Now, our high-level event architecture looks like the following:
In the last week, our timeouts have settled around 0.01% though that is likely impacted by the lower traffic we saw over Thanksgiving break. Of course, we're going to continue tweaking the pipeline to get both the latency impact and dropped events as low as possible. A couple of projects to try next are:
- Pointing all of our web machines to the pool of Eventloggers as well. There would be an increase in best-case latency due to the extra hops, but we think we'll achieve more scalable and predictable behavior.
- Moving our PHP to Hack. This is a larger project that would benefit us in a number of ways, but in this context would allow us to perform async requests.
- Adding durability. Currently, if an Eventlogger process crashes, the events it was holding in memory will be lost. This isn't a huge concern due to the lack of crashes so far and the non-vital nature of event data, though we could add an intermediary step between the server and streamer sections that persists events to disk.
We also took away some broader lessons:
- Start gathering metrics from the beginning. You'll never know if you made a difference if you have nothing to compare against.
- Make your high-impact knobs easily configurable through an admin interface instead of hard-coding constants. It's no fun to go through a whole build and deployment cycle to tweak a timeout value or a remote host. Even worse, it slows down the process of fine-tuning the system.
- Use more production-realistic data when benchmarking. ApacheBench in our initial load testing didn't surface the same characteristics in response to batching events that we saw in production.
All in all, we're happy with the way the data pipeline project turned out. Thanks to thoughtful planning, we avoided the need for large re-architecting and were able to start learning about our users in a relatively short amount of time. For me personally, it was great to take ownership of such an impactful project, one that now empowers the entire company to make the best decisions possible. Of course, there's always more to learn and further improvements to be made, but until then we'll keep on making the best learning tools we can!
(cross-posted on Medium)