By: Ray Kao
How we build for scale can be summed up into four points:
- Build twelve-factor applications in a cloud-native world
- Make constant and iterative improvements to our automation pipeline (DevOps) process
- Leverage managed cloud services as our underlying substrate
  - Platform as a Service (PaaS) first
  - Functions as a Service (FaaS) second
  - Infrastructure as a Service (IaaS) last, or when the strongest security controls are required
- Leverage multiple regions/data centres
We start by building our software with the twelve-factor-app approach. This is a well-written methodology that does not call out explicit implementation details on how to build scalable applications. Instead, this higher-level methodology allows us to objectively evaluate our processes and whether a technology ultimately enhances or accelerates our work, or reduces operational overhead. This in turn allows us to onboard new paradigms as they emerge (e.g. containers and container orchestrators) and to adapt/adopt them into our process as it makes pragmatic sense to do so.
While we do not strictly adhere to all points of this methodology, it has helped us to determine, reason, and think about scale when creating a Software as a Service business. The key points in the methodology that we keep top of mind are:
- Build, release, run (factor 5)
- Dev/prod parity (factor 10)
The last of these (dev/prod parity) is critical to us: it gives us higher-quality testing and code assurance because our dev and production environments are nearly identical in most respects (with the exception of a few things such as compute/instance scale and data sources).
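One common way to keep dev and production nearly identical is the twelve-factor pattern of reading the few values that legitimately differ (scale, data sources) from the environment, so the same build artifact runs unchanged in both. A minimal sketch of that idea follows; the variable names, defaults, and connection strings here are illustrative, not our actual configuration.

```python
import os

def load_settings(env=os.environ):
    """Read deployment-specific values from the environment so the same
    build artifact runs unchanged in dev and production."""
    return {
        # Only the values that legitimately differ between environments
        # (compute/instance scale and data sources) come from outside.
        "instance_count": int(env.get("APP_INSTANCE_COUNT", "1")),
        "database_url": env.get("APP_DB_URL", "sqlite:///dev.db"),
    }

# Dev: defaults apply. Prod: the platform injects real values.
dev = load_settings(env={})
prod = load_settings(env={"APP_INSTANCE_COUNT": "8",
                          "APP_DB_URL": "postgres://prod-host/app"})
```

Because everything else about the two environments is identical, parity is preserved and configuration drift is confined to a handful of explicit, auditable variables.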
DevOps can mean something different to everyone. For us, DevOps is about quality assurance and removing as much human error from our release cycles as possible. We take an approach that allows us to learn and update our operational knowledge as quickly as it makes sense to, share what we have learned with the broader engineering team, and, when it can be identified as a repeatable task, integrate that into our DevOps process.
While we want to eliminate unnecessary human error caused by manual processes as much as possible, we are not attempting to eliminate all human interaction. For example, we believe there should be a gated process before a new version or feature can be promoted into production. For these gates, a battery of automated pipelines augments a manual human check/validation, and we add or remove checks based on a constant feedback cycle from what we learn.
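The gate logic described above can be sketched in a few lines; the check names and the explicit approval flag are hypothetical, not our real pipeline, but they show the shape of "automation augments the human, it does not replace them".

```python
def can_promote(check_results, human_approved):
    """A release is promoted only if every automated check passed AND a
    human has explicitly signed off - automation augments the gate,
    it does not replace the person."""
    return all(check_results.values()) and human_approved

# Illustrative battery of automated checks; entries are added or removed
# as the feedback cycle teaches us which tasks are repeatable.
checks = {"unit_tests": True, "integration_tests": True, "security_scan": True}
```

With all checks green but no sign-off, `can_promote(checks, human_approved=False)` still refuses the promotion; only the combination of passing automation and explicit approval lets a release through.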
Cloud Services First
We are not in the business of building, running, and maintaining a data centre. There simply is not any value in doing so. Operationally our engineers are primarily software engineers who have had experience working in a traditional on-premise environment. As such, we have favoured taking advantage of software-defined environments and being able to deploy our underlying infrastructure in much the same way we deploy our applications – taking the twelve-factor approach as guidance and using DevOps (aka GitOps/InfraOps in this context) to be able to redeploy our underlying infrastructure/services idempotently. In other words, we write our infrastructure as code.
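The key property of infrastructure as code mentioned above is idempotence: applying the same declared state twice leaves the environment unchanged and reports no further work. A toy illustration of that converge-to-desired-state loop, with made-up resource names:

```python
def apply(desired, current):
    """Converge `current` (name -> config) toward `desired`, returning
    the new state and the list of resources that were changed."""
    changes = []
    new_state = dict(current)
    # Create or update anything that differs from the declaration.
    for name, config in desired.items():
        if new_state.get(name) != config:
            changes.append(name)
            new_state[name] = config
    # Remove anything no longer declared.
    for name in list(new_state):
        if name not in desired:
            changes.append(name)
            del new_state[name]
    return new_state, changes

desired = {"web-app": {"sku": "P1v2", "instances": 3}}
state, first_run = apply(desired, current={})      # creates the resource
state, second_run = apply(desired, current=state)  # re-run: no drift, no changes
```

Real tools (ARM/Terraform-style declarative deployments) implement this same contract at the cloud-resource level, which is what lets us redeploy our underlying infrastructure from source control without fear of the second run doing something the first did not.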
Taking this approach, we can gracefully decide between, and fall back across, the cloud services we consume from the big three cloud providers (Microsoft, Amazon, and Google) as our underlying substrate. Our current preference in order is:
- Platform as a Service (PaaS)
- Serverless, aka Functions as a Service (FaaS)
- And, lastly, the more traditional Infrastructure as a Service (IaaS)
Platform as a Service
In short, PaaS provides us a better operational model out of the box on top of the fundamentals of traditional IaaS. There is a multitude of out-of-the-box services to consume, leveraging technology stacks that we do not have to own, operate, or maintain (e.g. deploying dockerized container applications into Azure Web App for Containers). This reduces our operational overhead and configuration complexity – scaling up (vertically) or scaling out (horizontally) is typically as simple as running a command line interface (CLI) operation or, if needed, a UI-driven scaling toggle from the cloud provider.
The CLI path is ideal as it again aligns with our previous considerations (twelve-factor apps and DevOps). It also provides an evolving platform from a security perspective as more traditional security practices are adopted into a PaaS platform, which again we do not need to own, operate, and maintain – it's baked into the platform.
Functions as a Service
FaaS services allow us to take advantage of event-driven solutions that enhance, supplement, or augment our existing services. In the future, as FaaS offerings mature and evolve, we predict further enhancements that will provide a better balance of security features vs. scale. Currently, FaaS is primarily driven by public communication over HTTPS as the security model. This works in most cases, but for most of our needs, both internally and for external customers, we must add further security measures.
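One example of such a further measure, beyond transport-level HTTPS, is verifying an HMAC signature on each event payload so that only callers holding a shared secret can trigger the function. The sketch below is illustrative – the secret, payload shape, and handler name are assumptions, not our production design.

```python
import hashlib
import hmac

# Illustrative shared secret; in practice this would come from a secret store.
SHARED_SECRET = b"illustrative-secret"

def verify_signature(payload: bytes, signature_hex: str) -> bool:
    """Check the caller's HMAC-SHA256 signature against the payload."""
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature_hex)

def handle_event(payload: bytes, signature_hex: str):
    """Minimal function-style handler: reject unsigned or forged events."""
    if not verify_signature(payload, signature_hex):
        return {"status": 401}
    return {"status": 200}
```

A forged or missing signature is rejected before any business logic runs, which is the kind of application-level check we layer on top of the platform's HTTPS endpoint.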
Infrastructure as a Service
Finally, IaaS provides us with the most control and security over our environments, but also the most operational complexity/overhead. While this is advantageous for hyper-optimized or specifically tuned workloads for a small subset of our needs, it in many cases runs against our objectives from our previous considerations (twelve-factor apps and DevOps). We fall back to an IaaS model for services that do not have a PaaS offering in our given cloud provider, or where the security requirements are not met by the PaaS offering.
Multiple Data Centres
Arguably this last point is more about disaster recovery/business continuity and high availability than it is about scale. However, given that our customers are in different geographic locations and the distribution of our customers across these geographies differs, we take this as an added benefit when thinking about scale.
Designing our application with initially three geographies in mind (and theoretically infinitely more) allows us to automatically react based on the three other dimensions mentioned above, as well as increased demand from our customers' end clients (workers and employers). As a result, we are able to increase the number of app instances in a specific location that is closest to the end-user and trigger scale-up and scale-down events accordingly.
The added side effect of this, of course, is that if a given region goes down because our cloud provider has an outage, or if there is a catastrophic natural event, we're able to rebalance or redirect traffic to a specific region automatically. No one region is a bottleneck or a single point of failure.
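The routing behaviour described above – prefer the closest region, fail over automatically when it is unhealthy – can be sketched as a simple selection rule. The region names and latency figures here are made up for illustration.

```python
# Hypothetical per-region latency from a given end-user, in milliseconds.
REGION_LATENCY_MS = {"canada-east": 12, "europe-west": 85, "asia-east": 160}

def pick_region(healthy):
    """Return the lowest-latency region that is currently healthy."""
    candidates = [r for r in REGION_LATENCY_MS if r in healthy]
    if not candidates:
        raise RuntimeError("no healthy regions")
    return min(candidates, key=REGION_LATENCY_MS.get)

# Normally the nearest region wins; if it suffers an outage, traffic is
# redirected to the next-best region with no single point of failure.
nearest = pick_region({"canada-east", "europe-west", "asia-east"})
failover = pick_region({"europe-west", "asia-east"})
```

In production this decision is made by the cloud provider's traffic-management layer rather than application code, but the policy it enforces is the same.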