Our small team acts big with the help of Serverless and NoOps
Our team is small but our application is large. To offset that constraint, we have harnessed managed services and an event-driven architecture to focus our engineering cycles on core business logic. Initially, that means investing in infrastructure that will support new features and increasing scale. Here I will discuss some of the key decisions that have helped and hindered our progress.
Bottlenecks
As a small engineering team, our tightest bottleneck is human cycles. Functionally, we face challenges like handling bursts of events from the sensors in our network and needing to display raw and aggregated data across a variety of interfaces (web, mobile, BI tools, and stations). Given that spread of GUIs, we prefer tools that can be reused across interfaces. Keeping the maintenance of each interface minimal is also critical; once a feature is released we might not be able to spend time on it again for weeks or months.
Solutions
Development
We use Node and run it on Lambda for nearly all of our services. Splitting our cloud-based services into separate functions ensures they can scale independently according to their load. This infrastructure choice raised two primary concerns:
- Cold starts. We care about latency on our API endpoints, so we created a few New Relic Synthetics monitors to keep those endpoints warm by hitting them every few minutes. For mature endpoints, organic traffic now keeps latency down on its own, and the rest of our architecture can tolerate the ~40ms it might take to start a container. Problem averted.
- Connection flooding. This was the most significant problem, since we get bursts of traffic from our stations which we want to record in the database. At worst, a few hundred events can arrive at once, well above the default connection limit on RDS, a default AWS probably set for a good reason. Fortunately, re:Invent 2017 gave us a new feature: per-function concurrent execution limits. The limit defaults to 100 for every function, which is murder for RDS, so we dropped all non-API functions down to between 2 and 5. Problem solved; a minimal sketch of how one of these capped functions talks to RDS follows this list.
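Even with the cap in place, each container should hold as few database connections as possible. The sketch below shows the general shape of one of our recorder functions, assuming a Postgres RDS instance and the node-postgres (pg) library; the table and field names are hypothetical stand-ins rather than our real schema.

// Pool created outside the handler so warm containers reuse their connection.
// With max: 1 and a concurrency limit of 5, this service can never hold more
// than 5 connections against RDS.
const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  max: 1,
});

exports.handler = async (event) => {
  // SNS typically delivers one record per invocation, but iterate defensively.
  for (const record of event.Records) {
    const message = JSON.parse(record.Sns.Message);
    await pool.query(
      'INSERT INTO station_events (station_id, payload) VALUES ($1, $2)',
      [message.stationId, message]
    );
  }
  return { recorded: event.Records.length };
};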
We use SNS to send messages between services. Data is not so much passed around as composed: each Lambda consumes a message and, if the result is interesting, re-emits it under a new topic name. The main challenge this poses is dropped messages: SNS is not a queue, so if a Lambda exits unsuccessfully the message is gone. Lambda does have a built-in retry mechanism, and you can also attach a Dead Letter Queue to collect all failed requests. We also have a service which archives all critical messages to S3, from which we can replay them once the issue in the offending service has been resolved.
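To make the compose-and-re-emit pattern concrete, here is a rough sketch of one of these consumers using the Node AWS SDK; the topic environment variable and message fields are hypothetical examples, not our real event shapes.

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

exports.handler = async (event) => {
  for (const record of event.Records) {
    const reading = JSON.parse(record.Sns.Message);

    // Compose the raw reading into something other services care about.
    const summary = {
      stationId: reading.stationId,
      kwhDelivered: reading.samples.reduce((sum, s) => sum + s.kwh, 0),
    };

    // Re-emit the interesting result under a new topic. Throwing here lets
    // Lambda's retries, and eventually the Dead Letter Queue, take over.
    await sns.publish({
      TopicArn: process.env.SESSION_SUMMARY_TOPIC_ARN,
      Message: JSON.stringify(summary),
    }).promise();
  }
};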
AWS has an awesome tool called SAM Local, similar to the Serverless Framework's offline plugin, which allows us to run a specific service locally as though it were in the cloud. It is still my least favorite part of the development stack: it does not make it easy to test logic that flows through multiple services, and at the time of this writing it lacks support for a few AWS features like Step Functions. Even so, SAM Local makes it easy to generate events and pipe them into a service locally. Add the command to a yarn script and you've got a dead simple way of validating PRs during review:
sam local generate-event sns -m "message body" | sam local invoke MyService
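For reference, the corresponding entry in package.json might look something like this (the service name here is just a placeholder), after which yarn validate:local does the rest:

{
  "scripts": {
    "validate:local": "sam local generate-event sns -m \"message body\" | sam local invoke MyService"
  }
}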
Deployment
One of the core values of our infrastructure is that deployments should be easy. As a result, onboarding new engineers and rolling deployments back should be straightforward as well.
CodePipeline watches a GitHub repository and CodeBuild packages our deployments. Setting them up is less trivial than connecting Travis CI to a repository, but the cost difference is significant (pricing is based on usage rather than concurrent jobs), and they integrate well with other services like CloudFormation.
CloudFormation is where the magic happens. We decided to write our infrastructure as code, like any reasonable person who doesn't want to futz around in the AWS Console, and then we took it a step further: we also defined our deployment infrastructure as code. That means when we decided to use Yarn in our CodeBuild environments, we modified the one buildspec.yaml which all of the Node services use, then merged to master. Similarly, if we need to add a test phase to a specific pipeline, we add it to the template.yaml stored in the same repository as the code the pipeline deploys. Like magic, that merge to master modifies the pipeline. The one caveat: when deployment infrastructure changes, the first deployment does not run the code through that change, so a new test phase would not actually run until the subsequent deployment. For that reason, we try to separate infrastructure and business-logic pull requests, or remember to manually rerun the pipeline after any deployment-infrastructure change.
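For context, a stripped-down version of that shared buildspec.yaml might look like the sketch below; the exact phases and commands in ours differ, and the artifact bucket variable is a placeholder.

version: 0.2
phases:
  install:
    commands:
      - yarn install --frozen-lockfile
  build:
    commands:
      - yarn test
      - >-
        aws cloudformation package --template-file template.yaml
        --s3-bucket $ARTIFACT_BUCKET --output-template-file packaged.yaml
artifacts:
  files:
    - packaged.yaml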
By co-locating code, infrastructure, and deployment infrastructure we ensure each stack is self-sufficient and self-explanatory. Each stack also relies on a "global" stack which exports a few common pieces: roles for CodeBuild, a build image, common SNS topics, and so on. This way a new engineer can understand a service, its infrastructure, and its deployment all from one repository and a few of its global dependencies.
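As a rough illustration of how that sharing works (the names here are hypothetical), the global stack exports a value under a well-known name:

Outputs:
  StationEventsTopicArn:
    Value: !Ref StationEventsTopic
    Export:
      Name: global-station-events-topic

A service's template.yaml then pulls that value in with !ImportValue global-station-events-topic, for example to set an environment variable on a function, without hard-coding the ARN anywhere.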
Monitoring
Dashbird is our primary monitoring tool. Unlike CloudWatch, it breaks each Lambda invocation out into its own thread, which makes debugging and tracking down issues much easier. Dashbird also links directly to the relevant X-Ray traces, another tool we've been enjoying.
X-Ray has been most helpful for identifying slow database calls. There are a lot of features we haven't explored yet, mostly because of the amount of configuration required (even instrumenting our database properly required an additional library; what the heck, AWS?), but the ability to follow a request from API Gateway all the way to the database already seems very powerful.
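For the Node runtime, that additional library is the aws-xray-sdk package. The sketch below shows roughly what wiring it up looks like, assuming node-postgres and the standard PG* environment variables for the connection; it is not a drop-in copy of our handlers.

const AWSXRay = require('aws-xray-sdk');

// Patch the Postgres client so every query shows up as a subsegment on the
// trace that starts at API Gateway.
const { Pool } = AWSXRay.capturePostgres(require('pg'));

const pool = new Pool({ max: 1 }); // connection details come from PG* env vars

exports.handler = async () => {
  // Each query now appears in X-Ray with its duration, which is how we spot
  // the slow database calls mentioned above.
  const { rows } = await pool.query('SELECT now()');
  return { serverTime: rows[0].now };
};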
CloudWatch is also somewhat useful, but it gets less attention than the others. What we have found useful is building a dashboard of cross-service metrics (invocation rates, database metrics, errors across all services, and so on) and then using Dashbird to drill into the details. It's nice to see Lambda, RDS, and SNS metrics right next to each other, but it isn't as actionable on a day-to-day basis.
Reflections
Serverless servers
Not everything wants to be serverless, especially when working with obscure protocols like OCPP or legacy requirements like an FTP server.
OCPP is only used by a handful of companies since charging hardware is still very niche; that means less open-source software supporting it and even fewer contributors experimenting with it. Yes, we did manage to launch OCPP on a pair of Lambdas to handle client- and server-initiated requests; however, the time it took to adapt the SOAP-based protocol to our serverless architecture made it very inefficient in terms of engineering cycles. In retrospect, buying a managed OCPP service off the shelf would have been more cost-effective than deploying such obscure functionality on bleeding-edge infrastructure. If only a decent one existed.
When the FTP requirement came up next, the OCPP lesson made it clear that we should weigh engineering cycles more carefully, and we ultimately decided to deploy our first EC2 instance to support it.
Monitoring
Though tools like Dashbird provide a sane way to monitor a Lambda-heavy environment, visibility into services like API Gateway still poses challenges for debugging issues and identifying trends. Ideally, API Gateway will emit more interesting information to CloudWatch in the coming months. Better yet, if it supported middleware the way Express-style servers do, the community would take care of any feature gaps on its own. However it evolves, the current lack of visibility into parts of API Gateway (its cache, for example) makes it a difficult service to live with.
Next Steps
On the whole, we have invested well in areas that significantly improve our efficiency despite a small team, but we have also been caught without necessary features in cutting-edge services. We will need to keep investing in ways to make services like API Gateway a more transparent part of our infrastructure and SAM Local a more useful testing and debugging tool.
Could we be doing something better? Let us know in the comments; we are constantly evolving our infrastructure and are also probably hiring. Are we doing something well? Hold down the clapping hands and show your appreciation!