Moving Calm to Microservices
Transitioning business logic from one service to another, in production, without downtime, sometimes feels like changing an airplane’s jet engine mid-flight.
Moving a large backend application from a monolith into microservices is a hugely complex undertaking. It’s been done many times by many teams, but it’s never done quite the same way twice, and the details are always super interesting. I’m excited to tell you about this journey we’re on at Calm. We hope you find some of our decisions and learnings valuable in building your own distributed software applications, and that you’re able to see a bit more “behind the curtain” of the magic that makes Calm’s products tick.
Some Brief Background
Calm is the #1 Health and Fitness app, with over 60 million downloads and millions of daily users. Our backend services receive thousands of inbound requests per second. Our backend is responsible for a broad array of different tasks — content and asset management, intelligent dynamic content display, ML-driven recommendations, A/B test enrollment, user activity & stats, internationalization, data anonymization, our teacher & team programs, subscriptions and payments, and at least a dozen more big meaty features.
The move to microservices
So why did we decide to move to microservices? Three primary reasons drove this decision for us late last year when we began the big transition:
- Risk reduction — Having a single backend service and a single database is convenient when you’re small and iteration speed is close to your only concern. But it starts to present extremely high risk as a single point of failure once your scale grows and downtime becomes even more costly.
- Engineering autonomy as we scale — Microservices encourage teams to be small and autonomous in their decisions, with ownership over code standards, logging, choice of database etc. We focus on sharing tools and libraries between services to avoid duplicate work and unnecessary fracturing, but the unlocked autonomy of a service architecture is critical as we grow the team.
It’s an exciting time on the engineering team, because we’re right in the thick of things on extracting this functionality out into separate services. There’s a lot of architectural and platform design work being done right now, because we’re planning for this next phase of our system to last us a really long time. When it comes to platform architecture, we’re taking the time to get these things right, rather than rushing into expedient decisions that will come back to haunt us later.
We have high standards when it comes to creating a new service. Microservices solve a lot of problems, but they also introduce massive new challenges and complexity. In order to succeed as a team, there needs to be a strong strategy for managing the chaos that can creep in if the boundaries between services are not well-defined. Business and product requirements will always encourage rapid release of new features, so in order to maintain uptime and stability and avoid team burnout, strong principles that are enforced collectively are critical to success. The speed vs. stability tension is healthy and ever-present, and a world-class microservices ecosystem lets developers make judgement calls on this tradeoff for their own service without sacrificing the stability of the whole. These are a few of the standards that help us strike that balance at Calm:
Logging, alerting and monitoring: Service owners are responsible for anticipating failure or reliability issues with their service, logging at the appropriate severity level, and setting up dashboards and alerts on service health. We log into the ELK stack, and we love statsd for stats aggregation. We also leverage DataDog’s APM features extensively, and we use PagerDuty to escalate alerts to the proper on-call engineer.
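As a small illustration of the stats side of this, the sketch below formats metrics in the plain-text statsd line format (`<metric>:<value>|<type>`) in Go. The metric names and helper functions are hypothetical, chosen only for the example, and the lines are written to stdout so the sketch runs without a statsd daemon.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// counterLine formats a counter increment in the statsd text format:
// <metric>:<value>|c
func counterLine(metric string, delta int) string {
	return fmt.Sprintf("%s:%d|c", metric, delta)
}

// timingLine formats a timing in milliseconds: <metric>:<value>|ms
func timingLine(metric string, ms int) string {
	return fmt.Sprintf("%s:%d|ms", metric, ms)
}

// emit writes one metric line to w. In a running service, w would
// typically be a UDP connection to the statsd daemon; any io.Writer works.
func emit(w io.Writer, line string) error {
	_, err := io.WriteString(w, line+"\n")
	return err
}

func main() {
	// Hypothetical metric names, used only for illustration.
	_ = emit(os.Stdout, counterLine("hermes.events.forwarded", 1))
	_ = emit(os.Stdout, timingLine("hermes.forward.latency", 42))
}
```

Keeping the formatting separate from the transport makes the metric lines easy to unit test, and the writer can be swapped for a real connection later.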
Documentation: Well-structured documentation is critical in a microservices application. It lets any engineer help resolve downtime and understand who to escalate to if necessary.
Data Ownership: Every new service we spin up that requires a datastore should be the sole owner of that data. This is a principle that is central to our architectural pattern, because it allows us to simplify a lot of the reasoning about how the whole will behave as a function of the sub-parts. Multiple codebases touching the same database gets extremely messy extremely quickly — it makes rigorous end-to-end testing quite difficult, makes debugging substantially harder, and it makes reasoning about the system much more challenging.
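One way to picture the data ownership principle is as a narrow contract in front of a private store. The Go sketch below is only an illustration: `SubscriptionService`, `PlanReader` and `greeting` are hypothetical names, and the in-memory map stands in for a database that no other codebase connects to.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when the service has no record for a user.
var ErrNotFound = errors.New("subscription not found")

// PlanReader is the hypothetical contract other services code against.
// They never see the underlying tables.
type PlanReader interface {
	PlanFor(userID string) (string, error)
}

// SubscriptionService is the sole owner of its datastore. The map stands
// in for a private database.
type SubscriptionService struct {
	plans map[string]string
}

func NewSubscriptionService() *SubscriptionService {
	return &SubscriptionService{plans: make(map[string]string)}
}

// SetPlan is the only write path to the data.
func (s *SubscriptionService) SetPlan(userID, plan string) {
	s.plans[userID] = plan
}

// PlanFor is the only read path to the data.
func (s *SubscriptionService) PlanFor(userID string) (string, error) {
	plan, ok := s.plans[userID]
	if !ok {
		return "", ErrNotFound
	}
	return plan, nil
}

// greeting represents a caller in another service: it depends on the
// PlanReader contract, not on how or where subscriptions are stored.
func greeting(r PlanReader, userID string) string {
	plan, err := r.PlanFor(userID)
	if err != nil {
		return "Welcome!"
	}
	return fmt.Sprintf("Welcome back, %s member!", plan)
}

func main() {
	svc := NewSubscriptionService()
	svc.SetPlan("user-1", "annual")
	fmt.Println(greeting(svc, "user-1"))
	fmt.Println(greeting(svc, "user-2"))
}
```

Because callers only see the contract, the owning service can change its schema or datastore without coordinating a change in every other codebase.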
We’ve also kept things fun and creative with a unique naming scheme for our services. Any figure from a classical pantheon (god or goddess, Greek, Roman, Hindu or Norse) is a candidate to have a Calm service named in their honor:
- Kali is our data anonymization and privacy service, named in honor of the Hindu goddess of destruction
- Hermes is our event messaging service, responsible for forwarding user-behavioral events to 3rd party analytics and attribution partners, named in honor of the Greek messenger god
- Vulcan is our media asset processing service, named in honor of the Roman god of the forge
Blamelessness: This is a cultural value more than a technical one, but it’s just as critical to our success. When an engineer identifies an issue, the most important thing at that time is to keep records of everything they’re seeing, to alert the proper stakeholders, and to be creative and pragmatic in their solutions. When they’re worried about who caused the problem, especially when they know it may have been them, their mind is not in the right place to accomplish these more important goals. Engineers who are afraid their job might be on the line or their reputation at stake may be focused on covering their tracks or may avoid raising awareness in hopes they can fix things “before anyone finds out.” And they are likely to be way more stressed out than they should be, which leads to bad decision-making and tunnel vision.
Plus, there’s tons of evidence that people are way happier, better retained, and more likely to refer their friends to work with them when they work in a blameless, positive and super fun environment where they know their work has impact. Psychological safety matters, especially for a mental healthcare company like Calm.
Exciting Challenges Ahead
We’ve started to lay a solid foundation, but we have a massive amount of work ahead of us. We still build a lot of our core functionality inside the Node monolith in order to continue to move rapidly on product features, but we’re simultaneously extracting logical pieces of the application step by step. We’re also not always perfect at upholding our standards, and we’re working out as a team how to make some of those tough decisions (“Did I write enough tests to feel confident yet?”, “Is this documentation useful if it may be stale in two months?”, etc.).
We’re also mindful of the fact that sometimes external deadlines and product roadmaps require us to cut some corners. Even though we generally try to avoid technical debt, we understand that, just like financial debt, it’s often a good tradeoff to take on some sustainable level of debt in order to learn about the product more quickly. We have a strong MVP and testing culture, so figuring out how to balance those instincts against our goals around technical excellence is an exciting challenge as we continue this transition.
Kafka and Events
The team is currently hard at work on a highly available, high-throughput events pipeline built with Kafka and Go. This service is abstract enough to support arbitrary consumers. It persists all raw event data into Amazon S3 by default and supports replay mechanisms in the case of downstream failure. We’re discussing how to achieve idempotency, anonymization and obfuscation to increase the robustness and compliance of the system.
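To make the idempotency question concrete, here is a minimal Go sketch of one common approach: deduplicating on an event ID so that a replay does not apply the same event twice. This is an assumption for illustration, not a description of our pipeline. The `Event` type, its `ID` field and the `Deduper` are hypothetical, and the in-memory map stands in for what would need to be a durable store.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical event shape. ID is assumed to be assigned once
// by the producer and to stay stable across replays.
type Event struct {
	ID      string
	Payload string
}

// Deduper remembers which event IDs have already been handled so that
// replaying the same events does not apply them twice.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Handle calls fn for an event only the first time its ID is seen and
// reports whether fn was invoked. The ID is recorded before fn runs, so
// this sketch does not retry an event whose handler fails.
func (d *Deduper) Handle(e Event, fn func(Event)) bool {
	d.mu.Lock()
	if _, dup := d.seen[e.ID]; dup {
		d.mu.Unlock()
		return false
	}
	d.seen[e.ID] = struct{}{}
	d.mu.Unlock()
	fn(e)
	return true
}

func main() {
	d := NewDeduper()
	applied := 0
	events := []Event{{ID: "a", Payload: "play"}, {ID: "b", Payload: "pause"}}
	// Deliver the batch twice to simulate a replay after a downstream failure.
	for i := 0; i < 2; i++ {
		for _, e := range events {
			d.Handle(e, func(Event) { applied++ })
		}
	}
	fmt.Println("applied:", applied) // prints "applied: 2"
}
```

A production consumer would more likely record the ID together with the side effect (for example in the same database transaction), so that a failed handler can be retried safely.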
The first major consumer of the events firehose is Hermes, mentioned above. Once things are rocking and rolling, we’re planning to build new application services to sit on top of the events pipeline (e.g. the subscriptions service or the user activities service). We see big advantages in this event-based application architecture over a simpler CRUD model, so we’re making sure the pipeline is designed with scale and flexibility in mind so that it can handle all of these different use-cases.
In 2020, a huge challenge we plan to tackle is taking our distributed application multi-region. There are lots of reasons to go down this path, from reliability and fault-tolerance to compliance to improved latency and performance for our millions of users all around the world. But the engineering challenges this presents are substantial. As we design our services and the infrastructure we use to deploy them, we’re making sure to keep our multi-region future top of mind. We’re avoiding application design patterns that make subtle and hard-to-remove assumptions about the locality of the data. We’re making sure our network topology, auto-scaling systems and AWS account design are all setting us up for these big meaty problems we know are coming.
Another major class of exciting challenges has to do with maintaining our data infrastructure during the microservices transition. Our data warehouse aggregates our data from many sources: the backend database snapshots, Kafka event stream data in S3, and 3rd party direct integrations. We use this warehouse to run real-time machine learning models, and to make extremely important decisions every day across the business. So uptime in the data services is just as critical as uptime in our core backend services themselves. To decouple these systems, we’ve built out a cascading set of abstraction tables on top of the raw backend database schemas, so the models and dashboards rely on these higher-level tables rather than the raw data. When we extract core portions of the monolith database out into a new schema for a microservice, this abstraction layer reduces the scope of changes needed to keep the higher-level view of the world consistent. But this is an ongoing challenge, and we’re exploring new ways to make this abstraction barrier even stronger. The goal of these efforts is to let the data and backend teams operate quickly and independently according to their own requirements, and to reduce the total downtime across the whole system.
Our vision for Calm is to make the world happier and healthier. Our meditation, sleep and relaxation content is already deeply impacting the lives of millions of people, and we continue to expand our product and content offerings massively every month. With this new growth will come additional challenges in our backend application. We feel really confident that we’ve set ourselves up on a great foundation to balance all of the competing requirements: to iterate rapidly, scale up our traffic quickly, minimize downtime, grow the engineering team without getting bogged down in stifling process or technical debt, and give everyone the autonomy they need to be creative and fulfilled.
But of course, no solution is perfect, and there are big strides we want to take in our architecture going forward. The Calm engineering team is eagerly running head-first into these new and exciting technical problems, and we can’t wait to tackle them!