
How Cloudflare Tests in Production for Unbreakable Performance

July 28, 2025
0:31:43
Ole Lensmar
CTO, Testkube

Sachin Fernandes
Engineering Manager, Cloudflare

Transcript

Welcome to the Cloud Native Testing podcast. I'm Ole Lensmar, your host, and I've spent years building testing tools like SoapUI and Swagger, and now I'm the CTO of Testkube, where we're redefining testing for the cloud native landscape. In this podcast we invite people in the space to discuss the challenges, solutions, and best practices for testing in Kubernetes and beyond, covering automation, CI/CD, GitOps, and more.

This podcast is proudly sponsored by Testkube. If you're interested in how to test cloud native applications and infrastructure, subscribe and stay tuned.

Now let's dive into the episode. Thank you for joining.

---

Ole Lensmar: Hello, everyone. Welcome to this episode of the Cloud Native Testing podcast. I'm super happy and thrilled to be joined by Sachin from Cloudflare. Sachin, how are you?

Sachin: I am great. I am hanging out on my balcony in San Francisco, so I'm pretty excited to be here. Thank you for inviting me and talking about cool testing stuff.

Ole Lensmar: Well, thank you for joining. Just before we dive in, tell us a little bit about yourself and what you're doing at Cloudflare, and anything else you want to add, maybe about testing specifically. Go for it.

Sachin: Yeah, absolutely. A few years ago, Cloudflare realized that we're pretty critical infrastructure for the internet, and we could no longer just YOLO things into the ether and expect them to work. Incidents became a critical focus for how we wanted to become resilient and keep things running for our customers. So we came up with this idea of a testing at scale team, and back then we didn't really know what that meant. I love breaking things, I love testing things, so I put my hand up to volunteer to be one of the first people on that team and help spin it up. And that's what I do today: we have a few folks on our testing at scale team, we provide critical testing infrastructure to a lot of our internal teams at Cloudflare, and it's quite cool to see that growth, from not having a testing at scale team to having one. So I help manage and grow that part of Cloudflare.

Ole Lensmar: OK, awesome. Thank you. I'm sure people listening, and myself obviously, when we hear Cloudflare, we think scale. So I'm curious: how can you test at the scale you must be operating at, and how do you break that down into tests that give a realistic measurement of what your capacity is? We don't have to go into numbers. I'm just curious about the process for that.

Sachin: Yeah. We have this internal thought process at Cloudflare where we really leverage Cloudflare's superpower, which is its network. We have interconnectivity across the globe, all of these colocation sites, and so on. It's fairly interesting for engineers, because they'll write software, deploy it, and suddenly it becomes the most available, most resilient thing they've ever deployed, and that works on every level. For the testing at scale piece, we had a similar realization: we have all of this overpowered infrastructure at the edge, so we should figure out how to leverage it to run a lot of our testing infrastructure.

We started out pretty small. One of my engineers just wrote a blog post on some of the things we did to unlock new capacity and how we planned to scale, because the internal teams themselves started pushing us on scale. They were saying, hey, my default traffic is this much and your testing traffic isn't keeping up. I want to add more tests, add more probes, blanket my service with a bunch of good requests or bad requests or whatever, and your infrastructure isn't supporting it. That blog post describes going from a daily volume of hundreds of thousands of tests to tens of millions of tests per day. We were able to make that jump in six months, and we did it with more headroom than we had before.

So to answer your question in a somewhat roundabout way: we really leverage Cloudflare's existing network and infrastructure. I can tell you a little bit about the tool my team provides. The service we run is called Flamingo, and it runs on every Cloudflare edge metal, so everywhere we have a presence, there will be a Flamingo. Currently only internal Cloudflare teams, through my team's services, can talk to these Flamingos and issue a request to run a test. We say: can you run a test that does this type of request, or this type of traffic, for this team? The Flamingo figures out where it's running, services that request, emits metrics around whether it passed, blew up, or failed and how the network reacted, and then phones home with those results. That's basically how we leverage our existing infrastructure to run these tests at scale.

One of the challenges that showed up, which was quite interesting, is that we have a single control plane in our core infrastructure that everything phones home to. When you realize how much more powerful Cloudflare's edge infrastructure is than your tiny little service, responding to all those requests basically becomes a self-DDoS situation: you issue a bunch of tests and suddenly millions of these Flamingos are trying to phone home and say, I ran that test, I did a thing. Our control plane became the single point of failure and started really struggling to keep up. Those were some of the improvements my engineer blogged about, to scale that piece on our end. It was quite fascinating: the Flamingo part, the edge part, was pretty easy because of the infrastructure that already existed, and then we had to really think hard about how we handle the capacity that Cloudflare provides on our end. It was quite fun, quite cool.
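To make the shape of that request/run/report loop a bit more concrete, here is a minimal sketch of an edge test runner that probes a target and produces a result to phone home. This is not Cloudflare's actual Flamingo API; the types, field names, and the single-result flow below are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// TestRequest is a hypothetical shape for what a team might send to an
// edge runner: which endpoint to probe and what outcome to expect.
type TestRequest struct {
	Team      string
	TargetURL string
	Expect    int // expected HTTP status code
}

// TestResult is what the runner "phones home" with after executing a test.
type TestResult struct {
	Team    string
	Colo    string // where this runner happens to be deployed
	Passed  bool
	Latency time.Duration
	Err     string
}

// runTest executes one probe from the edge node it is running on and
// returns a result destined for the control plane.
func runTest(req TestRequest, colo string) TestResult {
	start := time.Now()
	resp, err := http.Get(req.TargetURL)
	res := TestResult{Team: req.Team, Colo: colo, Latency: time.Since(start)}
	if err != nil {
		res.Err = err.Error()
		return res
	}
	defer resp.Body.Close()
	res.Passed = resp.StatusCode == req.Expect
	return res
}

func main() {
	// A single illustrative run; at real scale, results would be batched or
	// queued before reporting, precisely to avoid the self-DDoS of the
	// control plane described above.
	result := runTest(TestRequest{Team: "dns", TargetURL: "https://example.com", Expect: 200}, "sfo01")
	fmt.Printf("%+v\n", result)
}
```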

Ole Lensmar: It's super fascinating. It seems to me like there's a meta level here: running the tests is a test in itself. Or rather, running the tests at the scale you're running them becomes a test of the system itself. Is your control plane, with all those Flamingos, able to run this many tests at the same time and aggregate all those results? It becomes a self-testing system. That sounds super fascinating, and to your point, it must be really interesting to see, just because of the scale your infrastructure operates at.

Sachin: Yeah. We also adhere to pretty strict SLOs and SLIs for ourselves, because we consider ourselves critical observability tooling. Whenever something is horrendously broken, or something isn't working the way teams expect, they're under a lot of load and stress, and that's exactly when they rely on their tools the most. For us specifically, it is most important to work when other teams' services are not working. So we take our SLOs and SLIs very seriously, including around our dependencies. We also try to push our dependencies to declare their own SLOs and SLIs, because you're only as available as your Postgres database or your Prometheus scrapers or whatever else you rely on implicitly; without realizing it, those get baked into your SLO and SLI, because the moment they break, you don't work. We're also working on degraded availability: if core systems at Cloudflare blow up, how can we still provide value to teams and answer, did it pass or not? That may be all they care about right now; everything else can wait. Being able to provide that critical data in those situations has been really helpful.

And we measure some interesting things; we've put a lot of thought into our SLOs and SLIs. The simple ones are uptime and availability: you calculate your 500s over your total requests, and that's how available you are. Those are the basic ones, and maybe you add API latency and so on. But we found that the most interesting SLIs for us are around data. One of them is a heartbeat job we run for ourselves. It's one of the meta tests we run at vast scale, and all it does is a hello world across the Flamingos: we issued a test, it came back, hello. We monitor that for data loss. We say, every however many seconds or minutes, this should produce this many data points. If we expect it to run every second, you should have at least 60 data points per minute. Maybe they passed, maybe they failed, whatever, but you should have 60 data points. If you have 55, you're losing data. That's how we measure our data SLOs, and it's been really interesting because it reveals nuances and weird data issues, like some minutes having one extra data point, and you're left asking, what's happening here, what are we miscalculating? So you can also calculate drift.

The other availability SLO and SLI we have is around scheduling. Teams can schedule things on a ticker, which lets them blanket their service with certain tests at all times of day. Some teams run every minute, some every hour, and some have very big tests where they say, I don't do that many deployments, I just want to run it every day and make sure that at this point every day, things are fine. One of the SLOs we put in place was that every test has a start latency of less than five seconds. So if, for whatever reason, you scheduled your plan to start at 6 p.m. and it starts at 6 p.m. and ten seconds, that counts as a failure against our SLO. We closed in on that number, and right now I think we start almost exactly on time; everything happens exactly on time. But it reveals interesting things, because a lot of this is based on queuing and capacity, and the more you scale and the more stuff you have in the queue, the more your ability to service things on time falls away. One second feels like a lot of time, but at scale it's not a lot of time to hit everything, especially when timing is critical and you really want to get it right. So yeah, that's been really fun to think about.
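As a rough illustration of the data-completeness SLI described here, the sketch below counts the heartbeat data points observed in a window against the number expected (one per second, so 60 per minute) and flags loss or drift. The function name and thresholds are invented for illustration; they are not Cloudflare's actual definitions.

```go
package main

import "fmt"

// completenessSLI compares how many heartbeat data points arrived in a
// window against how many were expected. Fewer than expected means data
// loss; more than expected suggests drift or a timing miscalculation.
func completenessSLI(observed, expected int) (ratio float64, verdict string) {
	ratio = float64(observed) / float64(expected)
	switch {
	case observed < expected:
		verdict = "data loss"
	case observed > expected:
		verdict = "drift (extra data point, likely a timing miscalculation)"
	default:
		verdict = "complete"
	}
	return ratio, verdict
}

func main() {
	// Expecting one heartbeat per second over a one-minute window.
	for _, got := range []int{60, 55, 61} {
		r, v := completenessSLI(got, 60)
		fmt.Printf("observed=%d ratio=%.3f -> %s\n", got, r, v)
	}
}
```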

Ole Lensmar: How did you solve that, though? Was it by introducing new infrastructure components? I'm curious to learn what that involved.

Sachin: Well, one of the answers is that you throw more hardware at it, right? That's one of the easy answers, so we did a little bit of that. We added more CPU capacity, more memory, more deployments, more bots, whatever, because you do need some of that: Cloudflare's edge is extremely overpowered, and you're not going to keep up with it with a limited amount of hardware. You need some amount of hardware to deal with the vast amount of hardware on the other side.

But one of the things we had there was a single worker queue that was picking up jobs, running them on schedule, and working through the queue. It was actually working fine; it didn't show up as a problem and no one was complaining, until we started measuring things. That's when we started measuring the data pieces and the plan-latency pieces. We actually started backwards: from a human perspective it felt like everything was on time, everything worked, and no one was complaining. But the moment you get accurate with metrics, it looks different. It's the same engineer who wrote the blog post, Wyatt. He takes this approach of data-driven development, and he's obsessed with graphs and metrics. Before he starts a project, he spins up a bunch of graphs about everything. Some of his first tickets are just: add metrics for this, add metrics for this, add metrics for this. He'll lay out everything he wants to measure, and it's funny, some of his graphs will start at 0% and he'll say, I want to make this 100%. It becomes a very concrete thing. If he makes a change and it makes no difference to the 0%, he says, okay, this doesn't count. The moment it pushes it to 5%, that's progress; that's how he works through a problem.

Some of the things we measured revealed where we were choking, things like goroutines: we were using too many goroutines per unit of CPU capacity, and that was affecting throughput over time. He came up with a cool way to schedule each test by hashing it to a particular job worker, so we did a lot of intelligent queuing and essentially built out a queuing system. But the key unlock was measuring things first, then working from the data and writing out in words what we were trying to improve: we want plans to start on time; what does on time mean; well, we're okay with five seconds. The moment we measured it, we were at zero percent, because starts were coming in at six or seven seconds. That seems fine to a human, but the directionality was wrong.
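The actual scheduler changes are covered in the blog post Sachin mentions; as a loose sketch of the general idea of hashing each test to a worker, so the same test always lands on the same queue and load spreads evenly, something like the following could work. The hash function, worker count, and test IDs here are arbitrary choices for illustration, not what Cloudflare actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// workerFor deterministically assigns a test ID to one of n worker queues.
// Using a stable hash means the same test is always serviced by the same
// worker, while different tests spread across the pool instead of piling
// onto a single queue.
func workerFor(testID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(testID))
	return h.Sum32() % n
}

func main() {
	const workers = 8
	for _, id := range []string{"dns-heartbeat", "cache-purge-probe", "tls-handshake-check"} {
		fmt.Printf("%-22s -> worker %d\n", id, workerFor(id, workers))
	}
}
```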

Ole Lensmar: So are you then continuously monitoring all your tests to detect if things slow down, or if you're introducing latency in one of the subtests? And if so, how have you automated all of that? It feels like there could be a lot going on at the same time. How do you make sure you're catching the anomalies that are important versus, you know, noise?

Sachin: This is a great question, and it doesn't just apply to testing at scale; most teams have this problem. Humans are just bad at time. Humans are good at availability: you can tell when a thing is working or not, and it's very easy to complain about. You make an API request, it doesn't work, and you say, fine, it's broken. We pick that up pretty easily. But when was the last time you opened Google and thought, oh, this is slightly slower than my last query? Never, right? You're never thinking about it; you just wait for the spinner and assume it's your network. The problem is compounded when you have asynchronous things running at such large scale, beyond your purview. Teams are scheduling this stuff at such vast scale that it's almost impossible to verify by hand: did this actually run at the exact millisecond we scheduled it, across millions of tests? It's kind of hard to measure.

So we have an SLO team at Cloudflare, and we worked very closely with them to tie these pieces together. Any time we discover something we missed during an incident, we immediately convert it into an SLO. If something broke and it didn't show up on a graph, that becomes an SLO, because it means it's a space we haven't been able to observe, so we convert it into a solid number we can start monitoring. The moment you have that graph, you can understand its gradient. If it's flat, you're great. If it's going down, something's wrong. And once you have the gradient, based on the angle, you can choose whether to alert on it. Across millions of tests, if we schedule one test a second off of the time it was supposed to run, is that a problem? I don't know; I don't want to get paged for that. It's fine, we did it correctly the next second. But if we scheduled 500,000 tests wrong out of a million, that's a pretty big misstep, and that is something I do want to get paged on.

The SLO team came up with the idea of burn rates. Burn rates are basically the calculation of that gradient: how much of your error budget you're burning per unit of time. If it's slowly decreasing and you have maybe a week until you hit 99.98 instead of 99.99, maybe you send a chat notification: hey, you're burning your error budget, FYI, this isn't a problem yet, but it will be in a week. But if something is burning your error budget at a percent per minute, it pages you immediately and says, this is broken.

I'll give you an interesting example of how we face this all the time. We had a bottleneck that was preventing us from adding more capacity to our control plane. We fixed it and thought, cool, this is great: the person who wanted to schedule some number of tests per minute could now do it. Then we noticed that everything seemed fine, no one was complaining, but our response time had gotten about 50 milliseconds worse. It was very interesting, because nobody noticed it, but we had made things a little bit worse while solving that person's problem. It turned out we had forgotten to bump another number that needed to move in step with the one we bumped. So yes, we could now use a bunch of CPU capacity, but memory was thrashing and blocking other things from happening. That only affected latency a little bit, and it only showed up on this one graph and nowhere else. When we saw it, we realized we had made things slightly worse for everyone, including ourselves, while unlocking this use case for one person, and we would never have discovered it if it hadn't been plotted on that graph. We bumped the other number, everything flatlined back to where it was, and we had done both things: kept our latency where it was supposed to be and ended up more available as well.

I think teams suffer from this fairly often: we make things a little bit worse, no one notices, but we fix the bug we wanted to fix, or we add that one extra check or one extra database query that does something else. Application performance monitoring and similar tools are good, but they don't generally represent the end-user experience and the whole journey your customer is going through. Especially internally, technical customers have a high threshold for nonsense. They will generally work around this stuff and not complain; they have empathy for other engineers, so they don't behave like standard users who will tell you, this is slow for me, it doesn't work, your thing is broken. Most internal customers won't do that, so you can't rely on them for that feedback. So the way to monitor has been: graph and metric everything, and then, instead of deciding whether a graph is right or wrong, use burn-rate alerting and error budgeting to understand how much time we have to fix a thing and how serious it is. Should we page on this yet? Is it something that will eventually resolve because of another incident happening over there, and not something we can fix right now? That's actually been really fun, and I've only ever done this type of development on this team, because of the level of scale and because we're core observability for everyone else.
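To make the burn-rate arithmetic concrete, here is a small sketch: given an SLO target and the error rate observed over a window, the burn rate is how many times faster than "allowed" you are consuming the error budget, which maps naturally onto "notify" versus "page" decisions. The specific alert thresholds below are illustrative assumptions, not Cloudflare's values.

```go
package main

import "fmt"

// burnRate returns how many times faster than budgeted the error budget is
// being consumed. With a 99.9% SLO the budget is 0.1% errors, so observing
// a 1% error rate over the window means a burn rate of 10x.
func burnRate(observedErrorRate, sloTarget float64) float64 {
	budget := 1.0 - sloTarget
	if budget == 0 {
		return 0
	}
	return observedErrorRate / budget
}

func main() {
	const slo = 0.999 // 99.9% availability target

	for _, obs := range []float64{0.0005, 0.002, 0.05} {
		br := burnRate(obs, slo)
		switch {
		case br >= 10:
			fmt.Printf("error rate %.4f: burn rate %.1fx -> page immediately\n", obs, br)
		case br >= 2:
			fmt.Printf("error rate %.4f: burn rate %.1fx -> chat notification, budget at risk\n", obs, br)
		default:
			fmt.Printf("error rate %.4f: burn rate %.1fx -> within budget\n", obs, br)
		}
	}
}
```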

Ole Lensmar: It's fascinating. And I love this concept of error budgeting, where you can say, this is our budget for errors, and then work against that. I'm trying to think through how that could play out for different types of tests and different types of errors. It's a super interesting concept, definitely something I'm going to spend more time on. I wanted to ask you something that we come back to in almost all of the podcast episodes, which is testing in production. I'm curious: the tests that you run, are they in a pre-deployment scenario, or are they running in production, or both? And in either case, what's your stance on testing in production? Is that something you do at Cloudflare, or don't do, or have discussed?

Sachin: Yeah, definitely. My team only does testing in production, and the reason is pretty simple: where do your customers experience issues and challenges? In production. With the complexity and scale of Cloudflare, it is extremely hard to reproduce your production environment in some safe space. You could call it staging, but are you really going to duplicate your entire production infrastructure in staging? No, that doesn't make sense; you're not going to spend the same amount of money or provision the same capacity. I've tried to make that a reality for a lot of environments, but staging almost always drifts away from production in some way. Staging has value too: you can do things like, did the service boot up? Sure, run that test in staging; you definitely don't want to deploy something that completely explodes in production. You can do those basic, minimal tests in staging. But the moment you get to extremely complicated systems, with dozens of teams deploying just one part of the edge stack at Cloudflare, how do you coordinate everyone's releases in staging so that the stack of software you're running the test on is valid and completely represents what's happening in production? It never will. It might be close, but it won't really get there. So the only way to completely guarantee that what you're testing represents the user experience is to do it in production.

A lot of people are scared of this concept: my god, we have to test this thing in production. But you shouldn't be; it's your software, and you're deploying it to production anyway. I think the fear comes from the possibility that we might discover something in production, and that's actually a good thing. You've discovered something in production; don't shy away from it, lean into why it's behaving strangely there. That's one piece. The other piece is, very obviously, scale. We can't achieve the scale that production provides in any other environment; we would have to build another Cloudflare to do that, and that's not something we can do.

In terms of how we build this out, we have this concept of edge slivers. Cloudflare partitions how we do deployments, so they progressively roll out to each piece of Cloudflare's infrastructure, and some of those pieces, even though they're production, are designated to behave differently. A large section of them serves the real internet, but only Cloudflare employees hit that colo. That lets us dogfood against the actual internet before changes hit the actual internet, to make sure the stack we spun up is working. Teams can target that infrastructure with their tests. It is production, it's still serving the real internet, but it's only served to Cloudflare employees. Teams can target their tests against any of these infrastructure slivers, so they can choose where to run and validate their tests. That's one way we give teams more control over what, quote unquote, staging might be. It's still production, it's a real-world thing, not something treated separately from production itself.

So my team is a complete proponent of testing in production. The goal is always for customers to have the best experience, so we don't recommend trying to break your service via load testing or similar in production, because that's not a good experience for customers. If you want to load test in a production-like environment, that's something we can help with. But you should be running your tests there. Like I said, all of the Flamingos run on every edge infrastructure node we have, and those are all production nodes. They are all serving live traffic to live customers, and that is the exact stack a customer will be hitting at that exact time. So if something fails, we know it's failing for our customers; it's not some made-up failure in staging. That means we can very confidently alert on those SLOs and SLIs on other teams' behalf and say: if we're seeing this failure at our scale, your customers are definitely seeing it too. And even within that complexity, it might fail in a particular region, on a particular transit, in some random situation you had no idea existed: maybe an ISP is doing something weird, maybe we changed one metal to behave differently. We've been able to surface all of these strange things in production that you really wouldn't get in staging, because staging is such a contrived environment that we spin up to stage changes to production, and it rarely ever maps one-to-one.
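As a very rough sketch of the "hit the dogfood sliver first" flow described above, the snippet below runs the same check against progressively wider slices of production and stops at the first failure. The sliver names, ordering, and the check itself are invented for illustration only; they are not Cloudflare's actual designations or tooling.

```go
package main

import "fmt"

// sliver names a slice of production infrastructure, ordered here from the
// employee-only dogfood colos out to the full global edge.
type sliver string

var rollout = []sliver{"dogfood-employees", "canary", "global-edge"}

// checkSliver stands in for running a team's test suite against one sliver.
// Here it always passes; a real check would issue probes and inspect the
// results, returning an error when the sliver looks unhealthy.
func checkSliver(s sliver) error {
	fmt.Println("running checks against", s)
	return nil
}

func main() {
	// Widen the blast radius only after the previous sliver looks healthy.
	for _, s := range rollout {
		if err := checkSliver(s); err != nil {
			fmt.Println("stopping rollout at", s, "because:", err)
			return
		}
	}
	fmt.Println("all slivers healthy, rollout complete")
}
```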

Ole Lensmar: Okay, Sachin, this is fascinating and I'd love to continue, but unfortunately we're out of time. Thank you so much for sharing everything you've shared with us. I have so many things running around in my head now that I need to make sure I capture afterwards, but I'll just re-listen. Super inspiring. Thank you so much, Sachin, for joining, and thank you everyone else for listening. Thank you, producer Katie, for producing, and have a great weekend. Bye bye.

Sachin: Yeah, thanks for the invite. This was awesome.