We build software with the intention of powering the core work of our customers. If we want Streak to be at the center of your business, the very first feature we need to offer is reliability. For us to be successful, we need you to be successful, and availability is where that begins.
We recently had a service interruption from 5:30 AM to 8:45 AM PST on Wednesday, October 2nd. During that time, Streak was unreliable or unusable for most customers. This didn’t meet our standards, and we know it didn’t meet yours. We apologize for the interruption to your business, and we want to give you context on what happened and what we’re doing to make sure it doesn’t happen again.
What happened
We’ve been making significant changes to our infrastructure so we can better support collaboration features in Streak. Some of those features have already launched (such as adding an email to more than one pipeline) and more are coming in the near future. In support of these upcoming upgrades, we ran a data migration on the evening of Tuesday, October 1. This migration added a new layer of permissions that gives users finer control over how their emails are shared, and it involved creating a permissions record for every email added to a box.
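To make the shape of that change concrete, here is a minimal sketch of what a backfill like this does, written in Python with hypothetical Box, Email, and PermissionRecord types that stand in for our real data model (it is not our actual migration code): for every email already in a box, it writes a new permission record that mirrors the box’s existing sharing settings.

```python
# Minimal, hypothetical sketch of the backfill described above; not Streak's
# actual migration code. Box, Email, and PermissionRecord are stand-in types.
from dataclasses import dataclass, field

@dataclass
class Email:
    email_id: str

@dataclass
class Box:
    box_id: str
    collaborators: list[str]
    emails: list[Email] = field(default_factory=list)

@dataclass(frozen=True)
class PermissionRecord:
    box_id: str
    email_id: str
    shared_with: tuple[str, ...]

def backfill_email_permissions(boxes: list[Box]) -> list[PermissionRecord]:
    """Create one permission record per email already added to a box."""
    records = []
    for box in boxes:
        for email in box.emails:
            # New records mirror the box's existing sharing settings, so the
            # migration changes how sharing is represented, not who can see what.
            records.append(PermissionRecord(box.box_id, email.email_id,
                                            tuple(box.collaborators)))
    return records
```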
The permissions code in question had been tested in our continuous integration environment and against a small number of example accounts in production. The data migration ran without issue, and Streak was working as expected Tuesday evening.
The incident started at 5:30 AM PST Wednesday morning. Our automated monitoring system correctly detected a service degradation, but the specific alert that fired had been noisy recently due to another migration we had run the week before, and it did not successfully page our on-call engineer.
At 6:45 AM PST, our support team came online and manually paged the on-call engineer at 6:55 AM, starting our engineering response. The support team then replied to every user who had emailed support@streak.com or reached us through live chat, while we worked to identify the engineering root cause.
While the on-call engineering team could diagnose the immediate symptom (user-visible errors and timeouts on API requests), they ran into multiple issues that delayed finding the root cause of the incident:
- Our system that logs changes in production is primarily focused on deployments of new code, so the data migration had not been automatically logged.
- The code changes for the migration had not been checked in by the time it was run, since there was some delay between writing the changes and actually running the migration.
- The engineer who had run the migration had reviewed it with the engineer leading the permissions effort, but neither of those engineers was part of the early morning on-call response.
- The overloaded servers were largely unresponsive to debugging requests, which further slowed the investigation.
At 7:45 AM, the support team followed up with users who had reached out to us on Twitter, and posted an update to updates.streak.com. At the same time, the on-call engineer successfully inspected a server that was having issues. We observed that unoptimized code was overwhelming the server as it tried to simultaneously load multiple users who each had many boxed emails. Each server handles many users at the same time, so this affected many other teams as well. And because Streak clients retry their requests when a server is overloaded, that extra load spread, leaving many servers overloaded in the same manner.
After verifying the root cause with an engineer familiar with the permissions changes, we deployed an updated version that fixed the unoptimized code. The deployment itself was delayed because the servers were still overwhelmed. Service returned to normal around 8:45 AM PST.
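For readers curious about this class of bug, the sketch below (again hypothetical Python, not our server code) shows the general pattern: loading a user’s boxed emails with one permission lookup per email does work proportional to users × emails on each server. The batched version is shown as one illustrative way to remove that per-email cost; it is an assumption for this sketch, not a description of the exact change we deployed.

```python
# Simplified, hypothetical illustration of the unoptimized pattern and one
# possible fix; this is not Streak's actual server code, and the batched
# version is an assumption for illustration, not the exact change shipped.

def load_boxed_emails_slow(user, fetch_permission):
    """Unoptimized: one permission lookup per boxed email.

    With many users per server, each with many boxed emails, these lookups
    multiply into enough work to make the whole server unresponsive; clients
    then retry against other servers, which overload in the same way.
    """
    visible = []
    for email in user.boxed_emails:
        permission = fetch_permission(email.email_id)  # one read per email
        if user.user_id in permission.shared_with:
            visible.append(email)
    return visible

def load_boxed_emails_batched(user, fetch_permissions_batch):
    """Illustrative fix: fetch all permission records for the user at once."""
    permissions = fetch_permissions_batch([e.email_id for e in user.boxed_emails])
    return [e for e in user.boxed_emails
            if user.user_id in permissions[e.email_id].shared_with]
```

The important difference is not the exact API but the access pattern: per-request work should stay roughly constant per user rather than growing with every boxed email, especially when client retries can multiply the load.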
Learnings
Our automated monitoring didn’t perform as it should have. We are going to tune our automated alerts, add more engineers to the on-call rotation, and establish a regular cadence of monitoring reviews so that noisy alerts are prioritized for improvement.
Context around the production migration was not adequately shared. We’ve updated our process around events that affect production to ensure that on-call engineers have the context they need to debug issues.
Monitoring and debugging shortfalls delayed our response to the incident. Over the coming quarter, we’re going to invest in our deployment and monitoring stack to ensure that it works as expected during incidents.
We focused too heavily on replying 1:1 to users through our support@streak.com help channel. In response, we’ve already deployed a status page that shows any current incidents as well as past ones. In addition, we’re creating a communication playbook for future incidents to make sure we provide continual updates to our customers.
Conclusion
We want to apologize again for this outage. Looking ahead, we can’t wait to show you the major improvements we’ve been working on. We’re confident the new features will save time and energy for your team. As we roll them out, we’ll make sure the new changes don’t affect what you originally came for: a rock-solid CRM, directly inside Gmail.