At Lost My Name I’m responsible for the reliability and performance of the platform. This platform is the core of the business as it enables the creatives to publish their work. This work is what makes Lost My Name’s products so personal, magical and popular.
As of writing, my team is preparing for another holiday season. This will be my second time going through this period at Lost My Name, after having been through it twice at Boomf. The holiday season at a seasonal company is unlike anything else. We do more sales in 3 months than we do in the rest of the year combined.
Increased sales mean more prints to deliver to our printhouses. Every single one of these prints goes through the platform to generate the personalised content at high resolution. Increased sales also mean that the platform needs to render more personalised previews for people visiting the website.
In this blog post, I’ll talk about a specific piece of one of our books, The Incredible Intergalactic Journey Home.
The starting situation
For one of the spreads in the book (a spread is in essence 2 pages next to each other), we request your address. We then use your address’s coordinates to personalise the book by adding images from the surrounding area.
To do this, we need to compose several high resolution images which come from external APIs. This is the same for both print and preview as we can only retrieve one size of the image through the API.
This means that for every user that requests a book preview, we need to download several images, stitch them together and resize the high resolution image afterwards. All this work is quite CPU intensive and can’t be cached (every request is unique due to the specific coordinates).
Illustrated below is the starting state of the platform (simplified).
With the above architecture, we are able to serve roughly 2 books per second consistently without errors on AWS EC2 t2.medium boxes. Crank up the concurrency and we see availability drop. High availability is crucial to us as we’ve seen that our preview drives conversion rates. 2 concurrent requests is not the traffic we want to serve. We want more, a lot more.
As mentioned before, we’re preparing for Q4 where our sales increase exponentially. To scale this, we would need a lot of boxes.
As you can see in the diagram above, this is also blocking our main Render Service. The Render Service is also used for other pages in the book that are fast (~120ms). We’ve noticed that when these requests are combined, the other pages suffer as well.
The solution to this comes in 2 parts. First, we need a way to easily scale these Image Services when demand is high. Second, we need to unblock the main Render Service so it’s not sitting idle waiting for external work to be done whilst it could be doing other, more useful work.
From monitoring, we know that the Image Service hits its maximum CPU. We also know this is purely because of decoding the images and encoding the end result again. Doing requests to external endpoints isn’t resource intensive, just time consuming.
Problem 1: scaling
This is where Lambda comes into play. Instead of doing the heavy work in our service on a single box, we set up an AWS API Gateway which delegates its requests to an AWS Lambda function. This Lambda function does all the heavy lifting, which means that the Image Service doesn’t need that much power anymore.
There is one caveat though. Since we can’t send binary data through the API Gateway, we can’t serve the image directly from the Lambda function. To bypass this, we upload the image to AWS S3 and return that URL, which we then download again in the Image Service.
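As a rough illustration of that workaround, here is a minimal sketch of a Lambda-style handler that stores the generated image and returns only a URL in its JSON response. The event shape (query string parameters named lat and lng), and the render_image and upload callables, are assumptions made for this example and not our actual code. In a real function, upload would wrap an S3 client call.

```python
import hashlib
import json


def make_handler(render_image, upload):
    """Build a Lambda-style handler from two injected callables.

    render_image(lat, lng) -> bytes  (hypothetical: the CPU-heavy compositing)
    upload(key, data) -> str         (hypothetical: stores bytes, returns a URL)
    """
    def handler(event, context):
        # Assumed event shape: coordinates arrive as query string parameters.
        params = event["queryStringParameters"]
        lat, lng = float(params["lat"]), float(params["lng"])

        image = render_image(lat, lng)

        # Derive the object key from the image content.
        key = hashlib.sha256(image).hexdigest() + ".png"
        url = upload(key, image)

        # Only JSON goes back through the gateway, never the binary image.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"url": url}),
        }

    return handler
```

Injecting the upload function keeps the handler easy to test locally with an in-memory fake instead of a real bucket.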
Issue one is solved! We accidentally managed to throttle our API Gateway by requesting 80 generated images per second. But the idea worked and we could scale it without having to worry about it.
Problem 2: unblocking
The solution to this is actually relatively simple. Since we send a “Definition File” to our Render Service to say how the book is laid out, we could adjust this file before we actually send it off. This is an additional reason we didn’t put the binary data in the JSON response as a return value from the Lambda functions, which is a known workaround for not being able to serve binary data from the API Gateway.
By putting it on S3, we have a URL available. Instead of having the Render Service request that URL from the Image Service, we should do this beforehand and modify our definition file with it.
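To make the idea concrete, here is a minimal sketch of rewriting a definition before it is sent off. The structure of the definition (spreads, layers, a generated_image type with coordinates) is entirely hypothetical, as is the fetch_url callable that stands in for the call to the image generation endpoint.

```python
import copy


def resolve_images(definition, fetch_url):
    """Return a copy of the definition with generated images replaced by URLs.

    fetch_url(coordinates) -> str is a hypothetical callable that triggers
    image generation and returns the URL of the stored result.
    """
    resolved = copy.deepcopy(definition)  # leave the caller's definition untouched
    for spread in resolved.get("spreads", []):
        for layer in spread.get("layers", []):
            if layer.get("type") == "generated_image":
                # Swap the placeholder for a plain image that points at a URL,
                # so the renderer never has to wait on image generation.
                layer["url"] = fetch_url(layer.pop("coordinates"))
                layer["type"] = "image"
    return resolved
```

With this in place, the renderer only ever sees ordinary image layers with a URL attached.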
By moving this dependency out of the Render Service, we’ve ensured that the Render Service can do more work. Instead of having to wait 3 seconds like before, it can now serve its request within 200ms. This means it can do 15 requests in the time it previously took to do one.
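The arithmetic behind that figure, as a small sketch. It assumes a worker handles one request at a time, which is a simplification made for this example.

```python
def throughput_gain(old_latency_ms, new_latency_ms):
    """How many requests now fit in the time one request used to take."""
    return old_latency_ms / new_latency_ms


def requests_per_second(latency_ms):
    """Requests a single worker can serve per second at a given latency."""
    return 1000 / latency_ms


print(throughput_gain(3000, 200))   # 15.0
print(requests_per_second(200))     # 5.0
```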
This is all fine, but how do we know this works? We didn’t want to push this into production without knowing what the impact was. We wanted to see if we had any performance issues (after all, we need to do an upload and download to S3 now) and if error rates improved. To do this, we’ve hooked this up with my Experiment package and monitored its behaviour closely.
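For readers unfamiliar with this kind of setup, here is a generic sketch of the control/candidate pattern: run the existing path and the new path side by side, record timings and errors for both, and always hand the existing path's result back to the user. This is not the API of the Experiment package. All names here (run_experiment, publish, the report fields) are made up for the illustration.

```python
import time


def run_experiment(control, candidate, publish, *args):
    """Run both code paths, publish a comparison, return the control result."""
    def observe(fn):
        start = time.perf_counter()
        try:
            value, error = fn(*args), None
        except Exception as exc:  # a failing path must not break the comparison
            value, error = None, exc
        return {"value": value, "error": error,
                "seconds": time.perf_counter() - start}

    control_obs = observe(control)
    candidate_obs = observe(candidate)

    publish({
        "match": (control_obs["error"] is None
                  and candidate_obs["error"] is None
                  and control_obs["value"] == candidate_obs["value"]),
        "control": control_obs,
        "candidate": candidate_obs,
    })

    # The user only ever sees the outcome of the existing (control) path.
    if control_obs["error"] is not None:
        raise control_obs["error"]
    return control_obs["value"]
```

The published reports are what you would feed into monitoring to compare response times and error rates between the two paths.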
We saw our response time go up slightly, but this was well worth it for the increased availability (+12.5%, meeting our internal SLA) and scalability.