I’m far from the world’s biggest advocate of AWS Lambda (or serverless computing more broadly), but I have used it for a few production applications. This post will hopefully be a living document of all the times choosing a serverless computing environment has caused me great pain.[1]

Caching outside /tmp

Many libraries download assets at runtime. It may not be obvious that this is the case, however, until you run into the problem of said library downloading files to a path outside of /tmp on a Lambda function, where /tmp is (by default) the only writable location. Worse, some libraries don’t expose this path in a way that lets you change it. If you’re lucky, you’ll just need to set an environment variable or override some setting. If unlucky, you’re looking at forking the repo.
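As a minimal sketch of the usual workaround: point caches at /tmp via environment variables before the library is imported. Which variable (if any) a given library respects is entirely library-specific; the ones below are common examples, not a guarantee that yours honors any of them.

```python
import os

# Lambda's filesystem is read-only apart from /tmp, so redirect cache and
# download directories there *before* importing the offending library.
os.environ.setdefault("XDG_CACHE_HOME", "/tmp/.cache")    # libraries following the XDG spec
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")  # matplotlib's config/cache directory
os.environ.setdefault("HF_HOME", "/tmp/huggingface")      # Hugging Face Hub downloads

# ...only now import the library that downloads assets, otherwise it may
# already have resolved its cache path against a read-only location.
```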

SQS Flow Control

When using a queueing system (such as Amazon’s Simple Queue Service), one generally inherits a set of intuitions about how one can use that queueing system to solve certain problems. These intuitions can break down when using SQS with AWS Lambda.

As a specific example, imagine you have some lambda function which needs to talk to some other service. Unfortunately, the function’s load is (for whatever reason) far from evenly distributed, resulting in large spikes of messages followed by relative lulls without any messages. This seems like a great use case for AWS Lambda. Well, until a burst in load spawns dozens of lambda functions, clobbering your other service.

One intuition might be: “what if I restrict the number of concurrent lambda function invocations?”[2] Perplexingly, the SQS-to-Lambda integration (well, at least last I checked) is, for lack of a better word, odd. Based on observed behavior, there seems to be a “queue reader” service which receives a message, checks for a “warm” lambda, and routes the message to that lambda.[3] If there are no warm lambdas, this service instead spins up a new one. Such a setup works well for many workloads, but functions poorly if one restricts the number of concurrent lambda function invocations: the “queue reader” has already read a message and, upon attempting to spin up a new function, is refused by the concurrency limit. This service, however, cannot simply “release” the message and make it visible again, as it would just be snatched up in the next poll. Instead, the message remains invisible until its visibility timeout expires and it circles back around.
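For what it’s worth, one of the “methods for handling this” alluded to in footnote 2 is (if I understand the feature correctly) setting a maximum concurrency on the SQS event source itself, rather than reserving concurrency on the function, so the poller stops pulling messages it has no capacity to hand off. A rough sketch with boto3 follows; the function name and event source UUID are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder identifiers -- substitute your own.
FUNCTION_NAME = "spiky-consumer"
EVENT_SOURCE_UUID = "00000000-0000-0000-0000-000000000000"

# Option A: cap concurrency at the SQS event source (minimum value is 2).
# The poller should then avoid dequeuing messages it cannot hand off.
lambda_client.update_event_source_mapping(
    UUID=EVENT_SOURCE_UUID,
    ScalingConfig={"MaximumConcurrency": 5},
)

# Option B: the older, blunter knob of reserved function concurrency.
# This is the setting that interacts badly with the "queue reader"
# behavior described above, since throttled invocations still burn a
# receive on the message.
lambda_client.put_function_concurrency(
    FunctionName=FUNCTION_NAME,
    ReservedConcurrentExecutions=5,
)
```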

At this point you might be tempted to work around this. If you rely on dead-letter queues, however, things get awkward: as far as I can tell, each of those failed hand-offs still counts against a message’s receive count, so tuning dead-letter settings to balance genuine retries against resource utilization becomes fiddly. If you need some sort of handling of bad messages, the best solution likely involves moving that handling from the queue level into the application level. Once again, this weakens the benefits of using SQS, but it’s a solution if you’ve already committed to this path.
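Here is one shape that application-level handling could take, assuming the event source mapping has partial batch responses (ReportBatchItemFailures) enabled so individual records can be retried; the bad-message queue URL and retry threshold are made up for illustration.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical destination for messages we give up on, managed by the
# application rather than by an SQS redrive policy.
BAD_MESSAGE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/bad-messages"
MAX_APPLICATION_RETRIES = 5


def process(message):
    """Placeholder for the real work the function does."""


def handler(event, context):
    failures = []
    for record in event["Records"]:
        receives = int(record["attributes"]["ApproximateReceiveCount"])
        try:
            process(json.loads(record["body"]))
        except Exception:
            if receives >= MAX_APPLICATION_RETRIES:
                # Give up: park the message ourselves instead of letting an
                # aggressively tuned redrive policy decide for us.
                sqs.send_message(QueueUrl=BAD_MESSAGE_QUEUE_URL, MessageBody=record["body"])
            else:
                # Report only this record as failed so it becomes visible
                # again, rather than failing (and re-driving) the whole batch.
                failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```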

In my case, the problem was (ultimately) a poorly optimized API. Once that was fixed, the issue faded into the background. At some future scale, it would likely become a problem again.[4]


  1. At least one typically learns something.

  2. It’s worth noting that revisiting this a few years down the line clarified that there are indeed supported ways of handling this.

  3. A generally useful exercise is to ask the question “why would somebody design this system in this way?”. One advantage might be that the dependency graph for such a service is relatively shallow; other options (such as relying on SQS metrics) would mean depending on more services. Is this the reason? No idea. Still worth thinking about, however.

  4. In practice, the service of interest would likely have encountered other bottlenecks first. In favor of trying to solve the “problems you actually have”, we left it as is.