Your test just failed again. Same code. Same deployment. You re-run the job, and it passes.
Sound familiar?
Whether your tests run in a CI/CD pipeline or are triggered manually, you have likely been caught in this frustrating loop before. Welcome to the world of flaky tests. But here is the thing: although flakiness often stems from the tests themselves, the culprit can also be found elsewhere.
Traditional flaky test debugging follows a predictable pattern. You examine your test logic, hunt for timing issues, and refactor problematic assertions to make them more resilient to unforeseen but still mostly valid behavior in your application or service under test. Sometimes this works. Other times, you are left scratching your head because the test logic looks solid. And sometimes, the test just needs to be muted.
If you are running tests in Kubernetes, you might be dealing with something entirely different: workflow-level flakiness caused by underlying infrastructure. Your pods are getting evicted. Your tests are hitting cluster-wide timeouts. The Kubernetes scheduler is making decisions that have nothing to do with your carefully crafted test code.
This shift toward container orchestration has introduced a new category of failures that many teams do not recognize. They are still debugging their test code when the real problem lies in how their cluster is configured or being utilized while their tests are running.
There is a related structural issue here: tests that pass on your laptop but fail in CI follow the same environment-mismatch pattern at a different layer. Read: Why your tests pass locally but fail in CI →
What are flaky tests?
A flaky test is one that passes or fails inconsistently without any code changes. These failures usually stem from problems in the test itself.
Common causes include:
- Race conditions.
- Poorly scoped setup/teardown logic.
- Timing issues or asynchronous behavior.
- Shared or leaking state between tests.
These problems are especially common in integration, end-to-end, and UI tests.
Real-world examples:
- Playwright or Cypress end-to-end test: Fails intermittently in GitHub Actions because the DOM takes a bit longer to load, so cy.get() cannot find the element.
- JUnit database test: Occasionally fails because previous test data was not cleaned up properly.
- Postman test suite: Runs fine locally but fails in CI due to missing environment variables or hitting an API rate limit.
- Selenium UI test: Fails only when run after another test that does not fully reset the UI state, causing false positives or negatives.
Debugging and fixing these problems is time-consuming. You might insert sleep() statements, isolate tests, or rewrite them entirely, but fixing them is not always as straightforward as you hope. How can you ensure that no other tests or processes are messing with the state of your application? How can you hedge against unforeseen timeouts without compromising the test itself? What if your test logic is solid and something else is breaking it?
The critical distinction: test-level vs workflow-level flakiness
What is the difference between test-level and workflow-level flakiness? Test-level flakiness is caused by problems in the test code itself. Workflow-level flakiness is caused by problems in the infrastructure executing the tests. The test code can be perfect and still report as flaky if the cluster running it is unstable. Understanding this distinction is critical for debugging and preventing flakiness in modern pipelines running on Kubernetes.
Flakiness has evolved: workflow-level failures
What is workflow-level flakiness? It is a category of test failure caused by infrastructure rather than test code. As more teams move test execution into Kubernetes-based pipelines, a new pattern has emerged: tests reported as flaky when the real issue is pod eviction, scheduling delay, a network policy change, or resource pressure. The test code is fine. The cluster running it is not.
This is not about the test code failing. It is about the infrastructure interfering with your tests.
Common scenarios:
- Kubernetes evicts your test pod due to resource limits or cluster pressure.
- Your test exceeds a cluster-wide timeout and is terminated prematurely.
- Pods are stuck in scheduling queues during high load.
- The Kubernetes scheduler throttles test jobs due to misconfigured priorities or other services requesting resources.
Another variant shows up with heterogeneous test sets, or when a complex hierarchy of tests is orchestrated into a comprehensive end-to-end suite. When the output of one test is the input for another, a single test's flakiness causes ripple effects across your entire testing suite.
These failures often look like flaky tests, but the cause is structural, not code-level. They are flaky because of how Kubernetes manages workloads, not because of how your tests are written.
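To make this concrete, consider a test suite running as a plain Kubernetes Job. The manifest below is a minimal, hypothetical sketch (the name, image, and command are placeholders): the activeDeadlineSeconds deadline and the tight memory limit are the two settings that most often terminate otherwise healthy tests.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: e2e-suite                  # hypothetical name
spec:
  # Job-level deadline: Kubernetes kills the pod once this elapses,
  # even if the tests themselves are still making progress.
  activeDeadlineSeconds: 600
  backoffLimit: 0                  # no automatic retries, so failures stay visible
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: tests
          image: registry.example.com/e2e-runner:latest  # placeholder image
          command: ["npm", "test"]                       # placeholder command
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1"
              # If the suite peaks above this, the kubelet OOMKills the
              # container and the run reports as a "flaky" failure.
              memory: "512Mi"
```

A suite that usually finishes in eight minutes will pass most runs, then get killed at the deadline on a slow, oversubscribed node. That intermittent pattern is exactly what teams misread as test flakiness.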
Why Kubernetes makes flakiness more common
Why does Kubernetes increase the risk of flaky tests? Because it adds complexity that did not exist in traditional CI runners.
Kubernetes is a powerful platform for managing scalable, containerized workloads (including tests).
It enables:
- Massive parallelism for fast test execution.
- Cost efficiency through autoscaling and shared infrastructure.
- Environment parity with production setups.
- Namespace isolation for ephemeral environments.
The tradeoff: when you offload your test execution to Kubernetes, you gain power and scalability but also complexity. If the cluster is not configured correctly, it can introduce flakiness that has nothing to do with your test logic.
Testing challenges in Kubernetes
What specific Kubernetes behaviors cause workflow-level flakiness? Three categories: resource competition (memory pressure causing OOMKills, CPU oversubscription causing timeouts), networking and state issues (DNS propagation delays, network policies blocking traffic, shared volumes leaking state), and environment drift (different node configurations, runtime versions, and cluster-level changes between runs).
While Kubernetes offers powerful orchestration capabilities, it introduces unique challenges that can turn reliable tests into unpredictable failures.
Resource competition and timing issues
In shared clusters, tests compete for resources and face scheduling unpredictability. Memory-intensive tests get OOMKilled during high load. CPU-bound tests timeout on oversubscribed nodes. Pod startup times vary dramatically based on node resources and image pulls. Tests may start before dependent services are ready, or get delayed by cluster autoscaling decisions.
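When a pod dies this way, the evidence lives in the pod's status rather than in your test logs. A kubectl get pod <name> -o yaml excerpt after a memory kill typically looks like the fragment below (the container name is a placeholder):

```yaml
status:
  containerStatuses:
    - name: tests              # placeholder container name
      ready: false
      lastState:
        terminated:
          # The kubelet killed the container for exceeding its memory limit.
          reason: OOMKilled
          # 137 = 128 + SIGKILL(9); the telltale sign of a forced kill.
          exitCode: 137
```

If you see OOMKilled or exit code 137 here, no amount of test refactoring will help; the fix is resource allocation, not assertions.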
Networking and state management
Kubernetes networking adds complexity that breaks tests in unexpected ways. Service discovery fails when DNS propagation is slow. Network policies block traffic that worked in development. Shared volumes cause test isolation issues, and database state bleeds between test runs in different pods.
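As an example of the network-policy case: a default-deny policy in the test namespace silently drops traffic that worked fine in development. A hypothetical policy that re-allows test pods to reach a Postgres service might look like this sketch (namespace, labels, and port are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tests-to-db        # hypothetical name
  namespace: test-env            # placeholder namespace
spec:
  # Applies to the database pods receiving the traffic.
  podSelector:
    matchLabels:
      app: postgres              # placeholder label on database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: test-runner  # placeholder label on test pods
      ports:
        - protocol: TCP
          port: 5432
```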
Environment drift and observability gaps
Dynamic Kubernetes environments change between test runs, introducing subtle inconsistencies. Different node configurations, varying container runtime versions, and cluster-level changes affect test behavior. Traditional testing tools struggle with distributed environments, logs are scattered across pods, and there is no clear way to distinguish between test logic failures and infrastructure problems.
These challenges explain why teams often struggle with "flaky" tests after moving test execution to Kubernetes. The tests themselves might be fine, but the execution environment has become more complex and unpredictable.
Solving workflow-level flakiness
Traditional testing tools treat Kubernetes as a black box. When your pod gets OOMKilled, preempted, or stuck in scheduling queues, they simply report "failure" without context. This leaves teams chasing ghosts, investigating test failures that are not actually code problems.
Solving workflow-level flakiness requires a testing platform that understands both your tests and the Kubernetes environment running them. This is where Testkube makes the difference.
Testkube is a cloud-native continuous testing platform that runs tests directly in your cluster as Kubernetes jobs. By operating within your infrastructure rather than outside it, Testkube provides comprehensive visibility into both test outcomes and the underlying system health that affects them.
Infrastructure-aware test execution
Testkube eliminates the guesswork around test failures by monitoring the complete execution context. When tests run as native Kubernetes jobs within your cluster, they have direct access to services and databases without network-related flakiness from external runners. More importantly, Testkube tracks the health of test pods themselves, detecting when failures stem from resource constraints, evictions, or scheduling issues rather than actual code problems.
Instead of generic "failed" status reports, Testkube provides clear classification: Passed (successful completion), Failed (genuine test failure), Cancelled (user-stopped), or Aborted (infrastructure issues prevented completion). This distinction alone saves teams countless hours of misdirected debugging.
Advanced workflow intelligence
Testkube's Workflow Health System goes beyond individual test results to identify patterns of instability. The platform automatically detects flaky workflows by measuring how often they oscillate between passing and failing states, helping you address reliability issues before they undermine team confidence.
Visual health indicators provide instant visibility into test suite reliability, while workflow health scoring quantifies the consistency of each test workflow by analyzing both pass rates and flakiness patterns. This data-driven approach helps you prioritize which tests need attention most urgently.
Scalable multi-cluster orchestration
Modern development teams need testing that scales with their infrastructure complexity. Testkube supports namespace-scoped execution and ephemeral environments, eliminating shared-state issues that create false positives. Multi-cluster scaling with time zone support enables global testing coordination, while Helm chart deployment ensures consistent automation across all environments.
Comprehensive test orchestration
Complex applications require sophisticated testing workflows. Testkube handles hierarchical and heterogeneous test workflows where one test's output becomes another's input, with built-in retry logic and execution branching to prevent cascade failures. Custom workflow views let you filter and group tests by repository, branch, team, or custom attributes, providing clear pathways to release decisions without information overload.
Centralized observability automatically collects logs, artifacts, and results in one place, eliminating the scattered data hunt that plagues traditional testing setups.
The bottom line
With Testkube, you are no longer guessing why tests fail. You get clear visibility into what broke in your workflows and whether the issue requires code changes or infrastructure attention. This clarity transforms testing from a source of friction into a reliable foundation for continuous delivery.
Best practices for preventing workflow-level flakiness
Whether you are using Testkube or not, configuring your Kubernetes cluster properly is essential for stable test execution.
Cluster-side recommendations:
- Set accurate CPU and memory requests and limits for both your workloads and your tests.
- Use pod disruption budgets to avoid random evictions (see the sketch after this list).
- Monitor resource usage of long-running test pods that trigger timeouts.
- Apply affinity and anti-affinity rules to balance test jobs.
- Configure autoscaling policies to handle testing surges.
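As one concrete example of the second item, a PodDisruptionBudget keeps voluntary disruptions (node drains, rolling node upgrades) from evicting test runners mid-run. This is a minimal sketch; the label selector is a placeholder for however your test pods are labeled:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-runners-pdb       # hypothetical name
spec:
  # Keep at least one test-runner pod running through voluntary
  # disruptions such as node drains and cluster upgrades.
  minAvailable: 1
  selector:
    matchLabels:
      role: test-runner        # placeholder label on test pods
```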
Testkube-specific strategies:
- Use test workflows for retries, conditional logic, and sequencing (a minimal sketch follows this list).
- Scope execution to ephemeral namespaces to avoid shared-state issues.
- Track test outcomes by root cause, not just pass/fail.
- Monitor resource usage of flaky tests to ensure adequate resource allocation.
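To illustrate the first point, Testkube defines test workflows as Kubernetes custom resources. The sketch below shows the general shape with a placeholder name, image, and command; treat the exact fields, especially the retry block, as assumptions to verify against the TestWorkflow reference for your Testkube version:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: e2e-suite              # hypothetical name
spec:
  container:
    image: node:20             # placeholder runner image
  steps:
    - name: Run end-to-end tests
      shell: npm test          # placeholder command
      # Assumed retry syntax: re-run the step on failure before
      # marking the workflow failed; check the Testkube docs for
      # the exact schema in your version.
      retry:
        count: 2
```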
Flaky infrastructure ≠ flaky tests
When your test fails, it is natural to assume the test is the problem. But in modern cloud-native environments, that is not always true. Your tests might be solid. The failure might stem from a misconfigured autoscaler, an overloaded node, or a timeout defined at the infrastructure level.
Recognizing that difference is the first step toward building reliable pipelines.
As teams adopt Kubernetes for test execution, the scope of what causes flakiness has expanded. It is not just about fixing flaky test code anymore. It is about gaining visibility into how your environment affects test outcomes. Kubernetes brings scalability and speed, but it also introduces new failure modes that you have to manage.
Key takeaways
- Two categories of flakiness exist. Test-level flakiness comes from race conditions, timing, and shared state in your test code. Workflow-level flakiness comes from pod evictions, scheduling delays, OOMKills, and other infrastructure issues. They require different fixes.
- Traditional CI tools cannot distinguish between them. They report "failure" with no context about whether the pod completed, was evicted, or never started. Teams waste hours debugging code when the actual problem is cluster configuration.
- Kubernetes amplifies flakiness when misconfigured. Resource competition, scheduling unpredictability, networking complexity, and environment drift all add variables that did not exist in traditional CI runners.
- Pod-level visibility tells the truth. Tracking pod events alongside test results (status classification like Passed, Failed, Cancelled, Aborted) eliminates the guesswork in distinguishing code failures from infrastructure failures.
- Cluster configuration is half the fix. Accurate resource requests and limits, pod disruption budgets, affinity rules, and autoscaling policies prevent most workflow-level flakiness regardless of which testing tool you use.
Frequently asked questions
What is a flaky test?
A flaky test passes or fails inconsistently without any code changes. Traditional flaky tests stem from problems in the test itself: race conditions, poorly scoped setup or teardown, timing issues, or shared state between tests. They are especially common in integration, end-to-end, and UI tests where ordering and external dependencies matter.
What is the difference between test-level and workflow-level flakiness?
Test-level flakiness is caused by problems in the test code itself (race conditions, timing issues, shared state). Workflow-level flakiness is caused by problems in the infrastructure executing the tests (pod evictions, scheduling delays, OOMKills, network policies). The test code can be perfect and still report as flaky if the cluster running it is unstable.
Why does Kubernetes make tests flakier?
Kubernetes adds resource competition, scheduling unpredictability, networking complexity, and environment drift between runs. Pods can be evicted, OOMKilled, or stuck in scheduling queues. DNS propagation, network policies, and shared volumes can break tests that worked in isolation. The infrastructure becomes part of the failure surface in ways traditional CI runners did not have.
How do I tell if a failure is test code or infrastructure?
Check whether the pod completed or was terminated. If the pod ran to completion and the assertion failed, the test code is responsible. If the pod was OOMKilled, evicted, or never scheduled, the infrastructure is responsible. Testkube classifies these explicitly as Passed, Failed, Cancelled, or Aborted, which removes the guesswork that traditional CI tools leave you with.
How does Testkube detect workflow-level flakiness?
Testkube runs tests as native Kubernetes jobs and tracks pod health alongside test results. The Workflow Health System measures how often workflows oscillate between passing and failing, identifies flaky workflows automatically, and provides visual health indicators with workflow health scoring. You see whether failures come from code or infrastructure without manually correlating logs.
What cluster configurations help prevent flaky test failures?
Set accurate CPU and memory requests and limits for both your application workloads and your tests. Use pod disruption budgets to prevent random evictions. Apply affinity and anti-affinity rules to balance test jobs across nodes. Configure autoscaling policies to handle testing surges. Monitor resource usage of long-running test pods that hit timeouts.
Should I retry flaky tests or fix them?
Retrying hides the failure without fixing it. For test-level flakiness, fix the test (timing issues, shared state, ordering dependencies). For workflow-level flakiness, fix the cluster configuration. Blanket retries erode trust in the suite over time: teams eventually stop investigating failures because they assume everything is noise.


About Testkube
Testkube is the open testing platform for AI-driven engineering teams. It runs tests directly in your Kubernetes clusters, works with any CI/CD system, and supports every testing tool your team uses. By removing CI/CD bottlenecks, Testkube helps teams ship faster with confidence.
Get Started with a trial to see Testkube in action.