We assume that if you have started reading an article about operating 50 Lambdas, you already know the advantages of serverless AWS Lambda, such as:

  • Paying for What You Use
  • Fully managed Infrastructure
  • Tight integration with the other AWS services
  • Being an interlink between AWS services
  • Automatic scaling
  • Event-driven triggers

We have rarely seen projects with more than 5 AWS Lambdas used for critical parts of the production application. So we decided to share our experience operating 50 Lambda functions in crucial production workflows.

We have an event-driven system with hundreds of millions of events per day. Our use case is bulk real-time data processing. Data comes in different formats (CSV/JSON) from different sources (API endpoints, Excel spreadsheets, etc.). We apply multiple transformations to the events along the way:

  • aggregate at different granularities: by minute, hourly, daily
  • enrich event data with metadata about the video or place of publication
  • and many more
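
To make the first two transformations concrete, here is a minimal sketch of bucketing events by granularity and enriching them with metadata. The field names `ts` and `video_id`, the bucket formats, and the metadata lookup are assumptions made for illustration, not our production schema.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical granularities: each format string truncates a timestamp to its bucket.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hourly": "%Y-%m-%dT%H",
    "daily": "%Y-%m-%d",
}


def bucket_key(epoch_seconds, granularity):
    """Render the UTC bucket that a Unix timestamp falls into."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime(BUCKET_FORMATS[granularity])


def aggregate(events, granularity):
    """Count events per (bucket, video_id) pair."""
    counts = Counter()
    for event in events:
        counts[(bucket_key(event["ts"], granularity), event["video_id"])] += 1
    return counts


def enrich(event, metadata_by_video):
    """Return a copy of the event merged with the metadata of its video, if any."""
    return {**event, **metadata_by_video.get(event["video_id"], {})}
```

The same events can be aggregated several times with different `granularity` values, which is why the bucket logic is kept separate from the counting.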

Data has to be delivered to multiple data stores: file storage, an analytical database, and a billing engine. We do all of this with 50 Lambda functions.

How did we actually get there?


Choosing the solution
We went from our own solution to Terraform with apex, with a couple of stops along the way. In the early days of the company, while finding product-market fit, we had a dedicated team of 5 DevOps Engineers who were creating the next standard tool for managing AWS infrastructure. Once we focused all our efforts on the product, there was no time left to maintain a universal infrastructure tool.

After months without support and maintenance of the tool, we weren’t able to deploy CloudFormation Stacks and Lambdas. At that time, we were using version 0.0.7x, while the latest one was 0.0.42x. We hadn’t updated for such a long time because we never had time for the new configuration and changes: minor releases were actually major ones. We spent quite some time extinguishing the fire. After an investigation, we discovered that a package deep inside the dependency tree had been deleted from PyPI.

We learned that there are no shortcuts: you will invest time in maintenance and infrastructure one way or another. After this tricky situation, we started looking for a new solution for our Lambdas with these main criteria:

  • Open-source: no more self-developed tools
  • Light-weight: it should not do too much, we should be able to replace it if needed

The first option was serverless.com. It seemed like a powerful solution, but it was built on CloudFormation and did a lot of things implicitly, like creating buckets and CF stacks without any notification, request, or explanation. Shortly before that, we had introduced Terraform in place of CloudFormation and our internal tooling, as we weren’t happy with CloudFormation (the troposphere code was too complex). We liked the level of abstraction Terraform brought, and that the Terraform AWS modules were well defined and supported by the community.
Usually, we wouldn’t just deploy a standalone Lambda function; we would also create AWS infrastructure such as SQS queues or subscribe Lambdas to triggers. We explored GitHub and found the utility apex (which is no longer maintained, so we forked it and made adjustments), which did what we wanted: it deploys Lambda packages using Terraform under the hood, with Lambda ARNs as parameters. As you can see, if we don’t like a tool, we want a level of abstraction above it. Apex looked like the thing we had always been looking for.

Local development

For comfortable local development and correct delivery to production, we needed the following two things:

  1. same environment for running code and building packages
  2. same versions of packages

The first challenge was a piece of cake after dockerizing our development. The second one was a living hell. After years with Ruby, the main surprise was that Python does not have a decent default package manager. At first, we tried to use pip-tools, but we weren’t able to set up the correct dependencies to run it. Then we found a hidden gem: poetry. Although it was in active development and far from perfect, it was a breath of fresh air in the dependency hell.

CI/CD

Since we had a lot of Lambda functions, we needed a powerful CI to manage them all. For historical reasons, our solution was Jenkins with the Job DSL plugin. A mono repository turned out to be the best way to manage the source code, with each Lambda in its own directory. We decided not to write a separate Jenkins job for each Lambda function, but rather to have a configurable template for all jobs and create them dynamically with a seed job. A JSON config describes all Lambdas and their locations in the repo. Based on this configuration, we generate multiple Jenkins jobs using the Job DSL plugin and can run all of them one by one using a Jenkins pipeline job. The Lambda build job template code is stored in the same repository, which gives us the flexibility to modify all deployment scripts in one place. To have the same environment as in the AWS cloud, we run all our build jobs in Docker, passing the code to the container via volumes.
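
The real seed job is written with the Job DSL plugin, so the following Python sketch is only an illustration of the idea: one JSON config is expanded into one job definition per Lambda from a shared template. The config keys (`name`, `path`), the job naming scheme, and the template name are hypothetical.

```python
import json

# Hypothetical config format: one entry per Lambda with its directory in the mono repo.
CONFIG = """
[
  {"name": "aggregate-hourly", "path": "lambdas/aggregate_hourly"},
  {"name": "enrich-events", "path": "lambdas/enrich_events"}
]
"""


def build_jobs(config_text, template="build-and-deploy"):
    """Expand the shared template into one job definition per configured Lambda."""
    jobs = []
    for entry in json.loads(config_text):
        jobs.append({
            "job_name": f"lambda-{entry['name']}",
            "workdir": entry["path"],
            "template": template,
        })
    return jobs


def pipeline_order(jobs):
    """Return the job names in the order a pipeline job would run them, one by one."""
    return [job["job_name"] for job in jobs]
```

Adding a new Lambda then means adding one entry to the config rather than writing a new job by hand.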

Monitoring

We use Grafana for monitoring and alarms because of its flexibility and of course because it’s free.
You can monitor the efficiency of your Lambda code with these key metrics:

Name | Description | Metric type
Duration | The elapsed time for a function’s execution, in milliseconds | Work: Performance
Billed duration | Execution time billed in 100 ms blocks, rounded up to the next full block | Work: Performance
Memory size | The amount of memory that is allocated to a function, in megabytes | Resource: Utilization
Max memory used | The maximum amount of memory used for an invocation, in megabytes | Resource: Utilization
Errors | The number of failed invocations due to function errors | Work: Error

We have set alarms for the "Errors" metric. The only problem with this metric is that it counts only the invocations that failed due to function errors, meaning issues with your code or a function timeout. It doesn’t include invocation errors or internal service errors from other AWS services, for example when a service doesn’t have the appropriate permissions to invoke the function or when you hit the concurrent execution limit for your account.
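
As an illustration of where the duration and memory metrics come from, here is a minimal sketch that parses the `REPORT` line Lambda writes to its logs after each invocation. The exact line layout used by the regular expression is an assumption based on the commonly seen format, and the derived `memory_utilization` field is our own hypothetical addition.

```python
import re

# Assumed layout of the REPORT log line; fields are separated by tabs or spaces.
REPORT_RE = re.compile(
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_mb>\d+) MB"
)


def parse_report(line):
    """Extract the metrics from a REPORT line, or return None for other log lines."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    metrics = {name: float(value) for name, value in match.groupdict().items()}
    # Share of the allocated memory that the invocation actually used.
    metrics["memory_utilization"] = metrics["max_memory_mb"] / metrics["memory_mb"]
    return metrics
```

A utilization that stays far below 1.0 suggests the function is over-provisioned, while values close to 1.0 are a warning sign worth an alarm.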

We haven’t set alarms for the other metrics, so there is room for improvement here. You can find more about Lambda monitoring in a Datadog article.
Using Lambdas for Big Data flows gives you another perspective on function monitoring. You can monitor how quickly the input event queue of a Lambda function drains, and the creation timestamp of the last event that reached the step after the Lambda function. This lets you assess not only the system status and availability but also gives you metrics for tuning performance.
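
These two signals can be sketched as follows. The function names, the five-minute threshold, and the health rule are hypothetical; in practice the inputs would come from queue depth metrics and from the data store of the next step.

```python
from datetime import datetime, timedelta, timezone


def drain_rate(depth_before, depth_after, interval_seconds):
    """Messages removed from the queue per second (negative if the queue grows)."""
    return (depth_before - depth_after) / interval_seconds


def freshness_lag(last_event_created_at, now):
    """How far the next step is behind: now minus the newest event's creation time."""
    return now - last_event_created_at


def is_healthy(rate, lag, max_lag=timedelta(minutes=5)):
    """Hypothetical rule: the queue must not grow and the lag must stay bounded."""
    return rate >= 0 and lag <= max_lag
```

The drain rate points to throughput problems in the function itself, while the freshness lag shows the effect on everything downstream.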

Reprocessing

We have a rule of thumb: if there is more than one step in a data flow or more than one data source, we need a reprocessing job for each step and source. This simple rule gives us reliability and the certainty that we can lose data only at the entry point. After that, every transformation can be restored or redone with new business logic.
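
A reprocessing job for a single step can be as small as re-running its transformation over the stored input for a time range. This is a minimal sketch under assumed names: the `ts` and `views` fields and the `to_thousands` transformation are made up for illustration.

```python
def reprocess(raw_events, transform, start, end):
    """Re-run one step's transformation over stored raw events with start <= ts < end."""
    selected = [event for event in raw_events if start <= event["ts"] < end]
    return [transform(event) for event in selected]


def to_thousands(event):
    """Hypothetical new business logic: report views in thousands."""
    return {**event, "views": event["views"] / 1000}
```

Because the transformation returns new records and leaves the stored input untouched, the same range can be reprocessed again whenever the business logic changes.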

What does the team need to look like for such a scale, and what’s our opinion on such an approach?


Team

Initially, the data platform was developed by a team of 7 people for more than a year. After the pivot, the company’s needs in data solutions became more specific and required far less development, so the data platform could be operated by one or two Engineers. Obviously, there had to be a transition period. The data platform was transformed with flying colors by 3 Engineers. There was even a period when only one Engineer operated the data platform, so we needed our solution to be as agile and efficient as possible. After our quite broad and deep experience with AWS Lambda, we came to the following conclusions:

Advantages and Limitations

AWS Lambda is a great solution with the following benefits:

  • Fast development. Or rather, fast extension of the system after the proper initial setup.
  • Resilient code. AWS specifies that Lambda functions should be stateless, and with some careful architecting you can also make each call idempotent.
  • No infrastructure to manage with continuous scaling. For the most part, there is no need to worry about scaling instances: AWS handles all this for you, although it does request that you inform them of any massive changes in scale.
  • Easy debugging. Have you seen the meme about fixing a bug in production? AWS has done a tremendous job by adding the console editor, so you can change code and run it right on prod. Yeah, as you may have guessed, it happened a couple of dozen times.
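
To illustrate the point about idempotent calls, here is a minimal sketch of a handler that processes each event ID at most once. The event shape (`id`, `values`) is hypothetical, and the `store` argument stands in for a durable external store: since the function itself is stateless, a plain dictionary is only suitable for a demonstration.

```python
def make_handler(store):
    """Build a handler that remembers results per event ID in the given store."""

    def handler(event, context=None):
        event_id = event["id"]
        # A repeated delivery returns the stored result instead of redoing the work.
        if event_id in store:
            return store[event_id]
        result = {"id": event_id, "total": sum(event["values"])}
        store[event_id] = result
        return result

    return handler
```

With this shape, retries and duplicate deliveries from a queue are harmless, which is what makes reprocessing safe.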

The benefits come with the price of the following limitations:

  • Code package size. The zipped Lambda code package should not exceed 50MB in size, and the uncompressed version shouldn’t be larger than 250MB. Once, a new version of the Snowflake connector grew in size and didn’t fit the limits anymore, so we had to pin the previous version until they fixed it.
  • Execution time and Complex Call Patterns. A Lambda function will time out after running for 15 minutes. There is no way to change this limit, even if you have special contract terms, premium support, and have even read the 170-page book about Amazon Leadership Principles. This means spending more time orchestrating and organizing functions so that they can work on the data in a distributed fashion.
  • Not always cost-effective. Running high-load web endpoints will be expensive: approximately 10 times more than a setup with a Flask app running in an ECS cluster. 10 times, Carl. 10 times, Jeff.
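
One common way to live with the execution time limit is to stop before the deadline and hand the unfinished work to the next invocation. The sketch below relies on `get_remaining_time_in_millis()`, which the Lambda context object provides in the Python runtime; the function name, the safety margin, and the re-enqueue strategy are assumptions.

```python
def process_until_deadline(items, process, context, safety_margin_ms=10_000):
    """Process items in order and stop early when the time budget runs low.

    Returns the items that were not processed so the caller can re-enqueue them.
    """
    for index, item in enumerate(items):
        # Leave a margin so there is still time to re-enqueue the remainder.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return items[index:]
        process(item)
    return []
```

The caller would then send the returned items back to the input queue, so a large batch is spread over several invocations instead of timing out.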

Could we operate more Lambdas?


The short answer is that we would extend our Lambda function zoo for ETL or a simple API. Lambda is good at moving bytes from one storage to another. The longer answer is very simple, yet essential: there is no good or bad tool as long as it serves its purpose at the needed scale. AWS Lambda is good for ETL purposes when you use it with the precautions of reprocessing, monitoring, and alarms.

We might consider using Lambdas for a simple API, because we are fine with a routing workaround for Lambdas (e.g. the OSS tool Chalice).
We definitely do not recommend using Amazon API Gateway with Lambdas. Even slight customization is hard. We hope we will never operate their templates again.
For all the purposes above, we learned how to operate enough Lambdas. AWS Lambda gives you fast development and autoscalable computational power. Of course, in order to use many Lambdas you need to invest in development infrastructure: CI/CD, monitoring, and alarming. We hope this article has helped you understand how many AWS Lambdas you can operate with your team and when it makes sense.

 

By Alex Tonkonozhenko and Maksym Voitko
