Lessons learnt from going serverless

Xavier Caron

Senior Software Engineer - Microservices

October 10, 2018

Versent recently started working with a large telco to help build and deploy APIs and microservices.

Our customer was very clear about what the new microservices needed to deliver: asynchronous and decoupled business logic, the ability to scale well (handling peak-hour traffic) and fail-over recovery.

We also needed to have solid metrics and alarms in place for monitoring and rapid investigation.

Most of Versent’s mandated APIs are used for orchestration calls between new and legacy systems. For instance, our ‘create-mailbox’ API will generate a new user identity in the new identity management back-end solution, then propagate this creation event to some other customer legacy systems. Pretty cool stuff.

So, how do you build such a system? And how do you do it while meeting the client's expectations and hitting tight deadlines?

The answer for us was going serverless.

Going serverless cuts some of the time needed on the infrastructure part (auto-scaling management, logging & alarm systems…) by relying on out-of-the-box solutions provided by AWS and its Lambda ecosystem. This lets us focus on the business value for our customer: working on the business logic and quickly deploying early versions of the system into non-prod environments.

This blog briefly explains the architecture Versent chose, covers what worked well and what did not, shares a few lessons learnt from this design, and offers some guidance on how and when to choose serverless.

Architecture

Based on our customer’s requirements, we use the following architecture for our serverless APIs:

– API Gateway

– Lambda (synchronous)

o Gets triggered by API Gateway

o Validates payload & auth token

o Fetches / saves data from / into internal system

o Encrypts data

o Pushes data to queueing system

o Returns success message to calling system

– Queueing System

o Stores the information with a given TTL

o Allows for failure recovery (replay message)

– Lambda (asynchronous)

o Gets triggered by queueing system

o Decrypts data

o Transforms data

o Calls legacy systems
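As a rough illustration of the flow listed above, the two Lambdas could be sketched as follows. This is a minimal sketch under stated assumptions, not Versent's implementation: the in-memory QUEUE stands in for the queueing system, encrypt / decrypt are placeholders that only reverse the string, and the "mailbox" field, the handler names and the 202 response are hypothetical choices made for the example.

```python
import json
import time
import uuid
from collections import deque

# Hypothetical in-memory stand-in for the queueing system.
QUEUE = deque()
TTL_SECONDS = 24 * 60 * 60


def encrypt(text):
    # Placeholder only: reverses the string so the sketch runs without a crypto library.
    return text[::-1]


def decrypt(text):
    # Inverse of the placeholder above.
    return text[::-1]


def error(status, message):
    return {"statusCode": status, "body": json.dumps({"error": message})}


def sync_handler(event, context=None):
    """Synchronous Lambda: validate, encrypt, enqueue, acknowledge."""
    headers = event.get("headers") or {}
    if not headers.get("Authorization"):
        return error(401, "missing auth token")
    try:
        payload = json.loads(event.get("body") or "")
    except ValueError:
        return error(400, "invalid JSON")
    if not isinstance(payload, dict) or "mailbox" not in payload:
        return error(400, "missing mailbox")
    message = {
        "id": str(uuid.uuid4()),
        "expires_at": time.time() + TTL_SECONDS,  # TTL kept by the queueing system
        "data": encrypt(json.dumps(payload)),
    }
    QUEUE.append(message)
    # The caller gets an answer before any legacy system is contacted.
    return {"statusCode": 202, "body": json.dumps({"id": message["id"]})}


def async_handler(records, call_legacy):
    """Asynchronous Lambda: decrypt, transform, call the legacy system."""
    for record in records:
        payload = json.loads(decrypt(record["data"]))
        call_legacy({"MAILBOX_NAME": payload["mailbox"].upper()})
```

The point of the split is that the calling system only waits for validation and the queue write; everything involving legacy systems happens later and can be replayed from the queue.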

Choosing the queuing system

With this design in mind, two main options emerge for the queueing system: SQS or DynamoDB with a Stream.

DynamoDB + Stream

The DynamoDB table stores the data temporarily (which can also be used for logging / audit) and emits a Stream record on each data change.

With DynamoDB Stream, you get the following:

– Direct integration with Lambda.

– No need to poll / delete messages manually.

– Easy scale up / down (out-of-the-box offer by AWS).

– Auto recovery in case of errors (the stream does not move forward until the current batch of messages succeeds).

– 24h data retention (non-configurable).

– Order is preserved.

SQS

At the time of the design choice, SQS was not supported as a Lambda event source, which had the following implications:

– You need an EC2 instance and a long-lived service (e.g. an Express server app).

– Your service needs to poll SQS regularly to fetch new messages.

Note that since June this year, Lambda supports SQS as an event source.

With SQS, you also need to consider the following rules:

– The messages stay available in the queue until marked as processed or TTL expires (configurable from 1 min to 14 days).

– You need to remove the message from the queue manually once successfully processed.

– Order might not be preserved.

– Limited to a maximum of 10 messages per read.
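To illustrate the rules above, one polling pass of a long-lived service could look like the sketch below. The receive_message and delete_message calls use the parameter names of the boto3 SQS client; the function name drain_once, the process callback and the 20-second long-poll wait are assumptions made for the example.

```python
def drain_once(sqs, queue_url, process):
    """Poll SQS once, process each message and delete the ones that succeed.

    `sqs` is expected to behave like a boto3 SQS client.
    Returns the number of messages processed and deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per read
        WaitTimeSeconds=20,      # long polling, to avoid a tight empty loop
    )
    handled = 0
    for message in response.get("Messages", []):
        try:
            process(message["Body"])
        except Exception:
            # Not deleted: the message becomes visible again after the
            # visibility timeout and will be picked up by a later poll.
            continue
        # Messages must be removed explicitly once processed.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled
```

With boto3 installed, this would be called as drain_once(boto3.client("sqs"), queue_url, process) inside the service's main loop.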

Pros vs Cons

What worked for us:

– Asynchronous & decoupled business logic

– Scalability

– Metrics + Logging

– Alarms based on log filters + Metrics

– Canary deployments

– Fail-over recovery

– “Cheaper” than EC2

Going serverless allows us to meet all the requirements that our customer mandated, while saving a lot of time on the infrastructure part.

For instance, most of the monitoring and scalability comes for free with AWS Lambda configuration. We were therefore able to focus quickly on implementing the business logic and to build production-ready services fast.

What did not work so well:

– Poison message management

– Stateless implementation

– Lambda in VPC

– Artefact size limit (50MB)

– Lambda cold starts

Despite all the benefits, we were still faced with some limitations with serverless.

While we managed to work around some limitations, for example by improving the code (reducing the artefact size) or by using a ping rule to keep the Lambda warm (limiting cold starts), other issues cannot be fully fixed; we will get to those in a moment.

Crucial Lessons Learnt

‘Poison’ Messages

When using a stream (Kinesis, DynamoDB stream…), you need to be aware that when one message in a batch fails, the entire batch fails. This means that the batch will be retried (so your system must tolerate the same message being processed multiple times) and that processing of the stream is blocked until the batch finally succeeds.

This design is great for recovery purposes: If a downstream system is down, the stream will retry the failing batch until the downstream system gets back up and the message processing will move forward. This is also great if the order of the messages is important for your design (as the stream will enforce the order).

But, in case of an actual poison message, the stream will be blocked until the TTL expires. The data retention being 24h, your system might be stuck for up to one day because of an invalid message in your stream. When this happens, there is no recovery possible (deleting your record in DynamoDB still won’t remove it from the existing stream).

If you want to go ahead with streams, then your code needs to be able to detect a poison message and reject it: either upstream (before pushing it into the stream) or downstream, in the processing Lambda (riskier and costlier). For the latter, your code needs to retry the failing message and, if it fails again, decide that it might be a poison one, then mark it as processed so the Lambda can report the batch as successful and let the stream move forward.
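The downstream option could be sketched roughly as below. This is only an illustration of the retry-then-reject idea, not the code we run: process_batch, the park callback (for example, writing the record somewhere for later inspection) and the limit of 3 attempts are all assumptions.

```python
MAX_ATTEMPTS = 3  # hypothetical limit before a record is treated as poison


def process_batch(records, process, park, max_attempts=MAX_ATTEMPTS):
    """Process a stream batch without letting one bad record block it.

    Each record is retried in code; a record that still fails after
    `max_attempts` tries is handed to `park` and counted as handled, so the
    function returns normally and the stream can move forward.
    Returns the number of parked records.
    """
    parked = 0
    for record in records:
        last_error = None
        for _ in range(max_attempts):
            try:
                process(record)
                last_error = None
                break
            except Exception as error:
                last_error = error
        if last_error is not None:
            # Suspected poison message: set it aside instead of raising.
            park(record, last_error)
            parked += 1
    return parked
```

The risk mentioned above is visible here: if a downstream system is simply down, every record fails its retries and gets parked, so the check that decides "poison or outage" deserves more care than this sketch gives it.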

Stateless Implementation

One of the main issues found when using this architecture is that the Lambdas are by definition Stateless.

In our current solution, the secrets configuration of the micro-services is hosted in CredStash (certificates, encryption, passwords…) and loaded on each call to Lambda.

CredStash is a service that allows you to store / retrieve secrets and handles their versioning. Under the hood, the secrets are encrypted using KMS and the storage is done via DynamoDB.

This starts to be a bottleneck as the configuration grows bigger, and it adds an extra network call each time the service tries to process a message. In this case, it could make sense to live in a stateful environment.

If you end up with a significant overhead caused by the statelessness of the design, it might indicate that you need a long-lived solution to handle the implementation of your micro-services (therefore moving away from a serverless architecture).

Lambdas in a VPC & Cold start

One of the requirements of our project was to have the system living in a VPC.

Unfortunately, Lambdas have not been designed to perform well in such an environment, and we have witnessed issues when deleting a deployed Lambda: the deletion takes up to 1h, getting stuck on the Network Interface removal.

When this happens, the following message is displayed in the AWS console:

CloudFormation is waiting for NetworkInterfaces associated with the Lambda Function to be cleaned up.

Also note that using a private VPC with Lambda adds a huge overhead on your cold start, as stated here:

Stay as far away from VPCs as you can! VPC access requires Lambda to create ENIs (elastic network interface) to the target VPC and that easily adds 10s (yeah, you’re reading it right) to your cold start.

While using a ping rule to keep your Lambdas warm usually works, it does not cover peak-hour traffic, because the extra concurrent instances started under load still go through a cold start. This is a known issue and should be carefully thought about before going serverless.

For instance, if you are using a serverless architecture to expose a public API, the response time can be impacted by the Lambda cold start. This means that some users might see much slower response times than others, degrading the overall user experience.

Check out more about Lambda cold starts.
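For reference, the ping rule mentioned earlier can be handled with a few lines in the Lambda itself. The sketch below assumes the ping comes from a scheduled CloudWatch Events rule, whose events carry "source": "aws.events"; the handler and handle_request names and the response shapes are hypothetical.

```python
def is_warmup_ping(event):
    # Events sent by a scheduled CloudWatch Events rule have this source.
    return isinstance(event, dict) and event.get("source") == "aws.events"


def handle_request(event):
    # Placeholder for the real business logic.
    return {"statusCode": 200, "body": "processed"}


def handler(event, context=None):
    if is_warmup_ping(event):
        # Return immediately: the only goal is to keep this instance warm.
        return {"warmed": True}
    return handle_request(event)
```

A single scheduled ping keeps roughly one instance warm, which is why this approach does not help with the additional instances created during peak hours.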

Decision

You now have to decide whether to go serverless (using Lambda) or to use a more classic, stateful environment (using containers, e.g. on EC2).

Serverless (Lambda)

The serverless architecture approach lets you leverage some out-of-the-box AWS goodness (scalability, metrics, canary deployment…).

This allows us to focus on the core business logic rather than spending time on the infrastructure itself.

However, nothing is perfect, and we have already described some of the main drawbacks of this design, so you will need to think it through properly before choosing to go serverless.

Containers (EC2 & ECS)

On the other hand, with containers, you are the one managing the infrastructure. This means that you have more control over the platform and can debug more easily, and it is a great fit for stateful applications.

However, to meet the above requirements, you will need to take care of the following yourself:

– Provide your own scalability capability

– Configure an Elastic Load Balancer

– Configure an Auto-scaling group

– Create your own Logging system

– Create your own Metrics system

– Manage your own Alarms system

– Create your own AMI

– Pay for the running instances, which can be expensive (depends on the number of instances and their size)

So, should you choose serverless or not?

Based on the above analysis, going serverless for your APIs / micro-services definitely represents a viable solution to be able to get up-and-running quickly.

However, this implementation is not without flaws and some of them are actually hard to fix, as they are part of the design itself (cold start, package size limit…)

Before starting your journey, you will need to decide what kind of architecture can help you deliver value quickly while meeting production-ready requirements. At the moment, this involves choosing between going serverless or a more traditional container platform implementation.

Decision making process:

Which solution to go for will depend on:

– Your budget

– Your deadline (a container solution usually takes longer to implement than a serverless one, mainly because of the infrastructure setup cost)

– Your application load & complexity:

o how many requests per day?

o what is the average response time?

o how much computing power is needed?

– Your stack (e.g. node.js is a lot more lightweight than Java)

– Your DevOps capability (containers need a lot of infrastructure know-how to be a viable production-ready solution)

I found this article comparing both architectures really useful.

Costs & predictions:

Regarding cost, here are some websites that can be used to derive estimates:

– https://dashbird.io/lambda-cost-calculator

– https://servers.lol

– https://calculator.s3.amazonaws.com/index.html

To compare like with like, we use the following equivalence as a baseline:

Serverless = 3 * m3.large EC2 instances

And we apply the following formula:

Serverless Cost < EC2 Cost + 40%

If you can make these numbers work, then going serverless should be a good solution for your project.
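The rule of thumb above can be turned into a small calculation. The sketch below is one possible reading of it, with loud assumptions: "+ 40%" is interpreted as 1.4 times the EC2 cost, a month is taken as 730 hours, free tiers and data transfer are ignored, and every price is a parameter you need to fill in from the current AWS pricing pages or one of the calculators listed above.

```python
def lambda_monthly_cost(requests, avg_duration_s, memory_gb,
                        price_per_million_requests, price_per_gb_second):
    """Estimate a monthly Lambda bill from request count and compute time."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


def ec2_monthly_cost(instances, hourly_price, hours=730):
    """Estimate a monthly EC2 bill for always-on instances."""
    return instances * hourly_price * hours


def serverless_is_viable(serverless_cost, ec2_cost, margin=0.40):
    """Apply the rule: Serverless Cost < EC2 Cost + 40%."""
    return serverless_cost < ec2_cost * (1 + margin)
```

For the baseline used here, ec2_monthly_cost would be called with instances=3 and the hourly price of an m3.large in your region, and the result compared with lambda_monthly_cost for your expected traffic through serverless_is_viable.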

Xavier is a Senior Software Engineer – API / Microservices for Versent. He is currently focused on developing Java / NodeJS microservices. Xavier has a passion for travel and playing tennis. He is also an avid runner and has run the City2Surf three times and completed two marathons. When he is not running or traveling, he enjoys drinking stout.