Cloudthread Y Combinator
October 20, 2022

CloudCostClip: AWS architecture for highest savings with Spot Instances

Insights from Cristian at CloudUtil on AWS Spot Instances: how to get the most out of this discount mechanism, and how Cristian’s tool AutoSpotting can help.

Transcript:

Daniele Packard:

Very excited to be chatting with you today about spot. Maybe you could start off by giving a brief introduction, and then we'll dive into the nuances of spot and the information you can share based on your deep expertise with the AWS service.

Cristian Măgherușan-Stanciu:

Yeah, thank you, Daniele. My name is Cristian, and I've been ex-AWS for about a month and a half now. At AWS I was part of the extended Spot service team. And before AWS, I created this open source automation for Spot Instances called AutoSpotting. So I have quite a bit of background in this space. And at AWS, I was involved with a lot of customers who were trying to optimise by using spot, and I have quite a few experiences from there.

Daniele Packard:

It's hard to imagine someone with more depth of expertise in spot, given that you developed the open source tool before even going to AWS and then worked directly at AWS on the service and the professional services related to it. So thank you so much for joining.

Cristian Măgherușan-Stanciu:

Happy to be here. Thanks.

Daniele Packard:

As a starting point, for someone who has only ever used on-demand EC2 instances, what would be your two-sentence description of what Spot Instances are?

Cristian Măgherușan-Stanciu:

Yeah, I mean, if you look at it, spot is just spare capacity that is not used at the moment by on-demand customers. AWS has thousands and thousands of servers in each region, and not all of them are in use at all times. But when somebody needs an instance, it has to be there, so they always have to keep this spare capacity. And what happens is, you can get these at a steep discount. It can be up to 90%; typically now it's about 60 to 70% on average, some more, some less. So you get these at high discounts, but if somebody needs them, AWS will take them away from you with about two minutes' notice. And the key to keeping capacity in these conditions is essentially diversification over multiple capacity pools, as they call them, which are essentially instance types in Availability Zones. So the more instance types and Availability Zones you can use in a particular configuration, the more resilient you are to these interruptions, and in particular to losing the capacity you have. The application then needs to be able to switch between these instance types pretty much at any time, and be flexible over them. So that's pretty much it.
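
To make the capacity pool idea concrete, here is a minimal Python sketch that counts pools as pairs of instance type and Availability Zone. The function name and the instance types and zones listed are illustrative assumptions, not a recommendation from the interview.

```python
from itertools import product


def capacity_pools(instance_types, availability_zones):
    """Return every (instance type, Availability Zone) pair, one per capacity pool."""
    return list(product(instance_types, availability_zones))


# Hypothetical configurations, chosen only to show how the pool count grows.
narrow = capacity_pools(["m5.large"], ["us-east-1a"])
wide = capacity_pools(
    ["m5.large", "m5a.large", "m4.large", "c5.xlarge"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)

print(len(narrow))  # 1
print(len(wide))    # 12
```

A configuration that can draw from 12 pools has many more fallbacks when one pool is reclaimed than a configuration pinned to a single pool.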

Daniele Packard:

One of the general perceptions of Spot Instances is they're only appropriate for fault tolerant applications and workloads. And this is somewhat related to what you were just talking about. Can you expand on when and where you think Spot Instances are appropriate?

Cristian Măgherușan-Stanciu:

Right, so it's exactly like that. You have to be fault tolerant, to sustain these interruptions without your customers or your users noticing anything. And then you have to be flexible with multiple instance types, so that you can diversify over a wide range of them. Many applications, such as containerized applications and big data, things like EMR, are a good fit for spot. But mostly nowadays it's about containers and big data. If you have Auto Scaling groups, that's also where I started with AutoSpotting, looking into Auto Scaling groups. And you have to have applications that start relatively fast and that can sustain individual instance failures, essentially.

Daniele Packard:

Right. That is a perfect segue into you explaining a bit more about AutoSpotting and the mechanics of how it works.

Cristian Măgherușan-Stanciu:

Right. So when I started, it was all about replacing existing on-demand instances with Spot Instances. And it was doing it on a cron basis, so every five minutes it would do some sort of replacement. Over time, I evolved it into a more event-driven approach. The way it works right now is that it looks at your groups whenever a new instance is launched in the Auto Scaling group. As long as the group is selected to be used by AutoSpotting (you just have to tag the group and AutoSpotting will take it over), AutoSpotting will notice these instances being launched. And for each new instance being launched, it clones it with an identical spot instance, swaps the original out of the group and swaps in the spot instance. So essentially, you're replacing the entire capacity with spot without having to do any configuration changes. And that's kind of the main thing that people don't realise about AutoSpotting. It's optimised for these environments that you may have been running for a while, where you don't want to touch the configuration, except for having the instances replaced. Maybe not even all of them, just a percentage of them. And that's kind of the main use case: adopting spot on things that you don't really want to disrupt too much by changing to a different setup.
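
The event-driven flow described above can be modelled in a few lines of Python. This is a simplified in-memory simulation, not AutoSpotting's actual code: the `Instance` and `Group` classes, the `on_instance_launched` handler and the `spot-enabled` tag key are all assumptions made for illustration, and no AWS API is called.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional

ENABLED_TAG = "spot-enabled"  # assumed tag key, used here only for illustration


@dataclass(frozen=True)
class Instance:
    instance_id: str
    instance_type: str
    lifecycle: str  # "on-demand" or "spot"


@dataclass
class Group:
    name: str
    tags: Dict[str, str]
    instances: List[Instance] = field(default_factory=list)


def on_instance_launched(group: Group, launched: Instance) -> Optional[Instance]:
    """Handle a launch event: swap a new on-demand instance for a spot clone."""
    if group.tags.get(ENABLED_TAG) != "true":
        return None  # group has not opted in, leave it untouched
    if launched.lifecycle != "on-demand":
        return None  # already spot, nothing to replace
    # Clone the instance with the same attributes, but as spot.
    clone = replace(launched, instance_id=launched.instance_id + "-spot", lifecycle="spot")
    group.instances.remove(launched)  # swap the on-demand instance out of the group
    group.instances.append(clone)     # swap the spot clone in
    return clone


group = Group("web", {ENABLED_TAG: "true"})
new_instance = Instance("i-001", "m5.large", "on-demand")
group.instances.append(new_instance)
on_instance_launched(group, new_instance)
print([i.lifecycle for i in group.instances])  # ['spot']
```

The group's own configuration is never modified in this model; only its membership changes, which mirrors the point about not having to touch the existing setup.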

Daniele Packard:

And are you automatically doing the diversification that you talked about that makes it more resilient?

Cristian Măgherușan-Stanciu:

Right. So whenever there is a new instance launch event, I compute a range of instance types automatically based on the initial instance, and I fire an API call that says: give me any of these. I actually give them ordered by price, and ask for whichever is available right now. The API then gives me the first one that's available, and I run with that.
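
A rough Python sketch of that selection step follows. The catalog, its prices and the compatibility rule (at least as many vCPUs and as much memory as the original instance) are assumptions made for illustration; the interview does not describe the exact rule AutoSpotting uses, and `first_available` is a stand-in for the real launch API.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Set


class Spec(NamedTuple):
    vcpus: int
    memory_gib: float
    spot_price: float  # USD per hour, made-up values for the example


# Hypothetical catalog; real prices vary by region, zone and time.
CATALOG: Dict[str, Spec] = {
    "m5.large": Spec(2, 8, 0.035),
    "m5a.large": Spec(2, 8, 0.031),
    "c5.large": Spec(2, 4, 0.030),
    "r5.large": Spec(2, 16, 0.040),
    "m5.xlarge": Spec(4, 16, 0.070),
}


def compatible_types(base: str, catalog: Dict[str, Spec] = CATALOG) -> List[str]:
    """Types with at least the base type's vCPUs and memory, cheapest first."""
    need = catalog[base]
    matches = [
        name
        for name, spec in catalog.items()
        if spec.vcpus >= need.vcpus and spec.memory_gib >= need.memory_gib
    ]
    return sorted(matches, key=lambda name: catalog[name].spot_price)


def first_available(candidates: Iterable[str], available: Set[str]) -> Optional[str]:
    """Stand-in for the launch API: return the first candidate that has capacity."""
    return next((t for t in candidates if t in available), None)


candidates = compatible_types("m5.large")
print(candidates)  # ['m5a.large', 'm5.large', 'r5.large', 'm5.xlarge']
print(first_available(candidates, {"m5.large", "r5.large"}))  # m5.large
```

Here `c5.large` is the cheapest entry but is excluded because it has less memory than the original instance, and the cheapest compatible type is skipped when it has no capacity.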

Daniele Packard:

Amazing. You know, broadly speaking, what's the most that you've seen a company save by intelligently adopting Spot Instances?

Cristian Măgherușan-Stanciu:

Yeah, I mean, I cannot really talk numbers; I have signed a bunch of documents. But I can say the figures were like eight figures. So yeah, it's significant money. But it depends a lot on the scale of the company, and you have to be able to roll it out quite widely across the company. So that's kind of the main requirement to see such figures: you have to have a certain scale.

Daniele Packard:

How do you see the process of adoption? Say a company decides they want to look at spot as a way to start claiming savings on their cloud bill. Do you see companies rolling it out one by one to fault-tolerant workloads, or taking batch jobs that need to be done and starting to experiment with spot? This would be a more team-based or more isolated initial approach. Or do you see companies adopting it holistically, right away? How do you see companies that are successful adopting it?

Cristian Măgherușan-Stanciu:

I mean, most of the companies outside of AutoSpotting are doing it in this slower model. You have initiatives driven at scale within the larger organisations, and then they try to adopt spot bit by bit, on a team-by-team basis. And yeah, that's fine. It reduces the risks. It takes a bit longer, but eventually you'll get there. And it's a good approach: if you're a Fortune 500, you probably don't want to rock the boat so badly. But smaller companies can afford to be a bit more aggressive. With things like AutoSpotting, you could enable it in an opt-out model and configure a percentage for the groups, so that all the ASGs at a given company would adopt spot for, like, 40% of the fleet. Back at my previous employer before AWS, HERE Technologies, we actually did something similar, but we went all in and did it for all their R&D accounts. They had in the hundreds of development accounts running EC2 in the organisation, and we just deployed AutoSpotting, literally overnight, replacing everything with spot, and that was a big move. There were a few issues with that. Turns out there were people running things like Cassandra or in-memory databases on their ASGs. And if you do it like that, without proper automation to replace those instances, there were a few cases where they lost a bit of data. But these were development accounts, so we could recover and there was no impact. But yeah, normally I would not do that anymore. I would do a percentage around 40 to 50%, and you can still recover from that without anybody noticing anything.
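
As a worked example of the percentage-based rollout, the short Python sketch below computes how many instances per group would be replaced with spot at 40%. The `spot_target` function, the rounding-down choice and the group sizes are assumptions made for illustration, not AutoSpotting's actual configuration logic.

```python
def spot_target(group_size: int, spot_percent: int) -> int:
    """Number of instances to run as spot, rounded down so on-demand keeps the remainder."""
    if not 0 <= spot_percent <= 100:
        raise ValueError("spot_percent must be between 0 and 100")
    return (group_size * spot_percent) // 100


# Hypothetical Auto Scaling group sizes.
fleet = {"web": 10, "api": 7, "batch": 3}

plan = {name: spot_target(size, 40) for name, size in fleet.items()}
print(plan)  # {'web': 4, 'api': 2, 'batch': 1}
```

Rounding down is the conservative choice here: each group keeps at least 60% of its capacity on demand, so an interruption of every spot instance still leaves most of the group running.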

Daniele Packard:

So, Spot started with EC2 and at this point extends to Fargate. How do you see spot as a pricing mechanism evolving at AWS?

Cristian Măgherușan-Stanciu:

Yeah, I mean, I'm not able to talk about what they are doing under the hood or other services they're building, but what you can see from the outside is pretty close to what was going on. You have these pricing models: on-demand, spot, Savings Plans. And essentially, you can adopt them from various services. Besides Fargate, spot is also quite big in the container space, where it integrates nicely with EKS and ECS. They also recently released something called Karpenter, which makes it easier for EKS environments to adopt spot in a nicely diversified way. There's also EMR, which is pretty big on spot. A lot of the customers who use EMR should be looking into spot; it's quite a good fit as well.

Daniele Packard:

Got it. So less new services specifically adopting Spot as a pricing mechanism and more services that can work with spot in intelligent ways.

Cristian Măgherușan-Stanciu:

That's pretty much it.

Daniele Packard:

Awesome. We're almost at time, so I'm going to wrap up, but thank you so much, Cristian, for spending a brief session talking about spot. I'm sure people will be really excited to understand more about spot generally and more about AutoSpotting specifically.