AI Writes the Code. Who Tests It? Notes From the AI Summit


Andy Pemberton

President

Testkube


Executive Summary

In my AI Summit talk I argued that AI now produces a large share of production code while developer trust in that code keeps dropping. Adding more tests does not close that gap by itself. Quality comes from three jobs done together: creating tests, running them in real environments, and analyzing the results. A human still defines what correct means and decides whether a result can be trusted. This recap walks through the data, the cautionary cases, and why testing earns more attention as AI writes more code.

A year ago, AI wrote less of your code than it does today, and your developers trusted it more than they do now. Those two lines moving in opposite directions were the reason I gave this talk at the AI Summit in London. I called it "AI writes the code. Who tests it?" because that is the question I keep landing on in customer conversations. The more of the keyboard we hand to AI, the more the verification step decides whether speed is real or borrowed against future outages. The full talk is embedded below, and this recap pulls out what is worth keeping. You can also read more in our breakdown of testing AI-generated code.

Watch the full talk here:

The code is increasingly written by machines

Microsoft has said software now writes somewhere between 20 and 30 percent of its code. Google has put the figure higher, saying 75 percent of new code is generated by AI and then approved by an engineer. The teams I talk to are not at Google scale, but they are on the same road, and the percentage climbs every quarter.

The number that should stop a room is not the volume. It is the trust. In Stack Overflow's 2025 Developer Survey, 84 percent of developers said they use or plan to use AI tools, while only 29 percent said they trust the accuracy of what those tools produce, down from 40 percent the year before. Usage went up. Belief went down. With most tools, confidence rises as people get comfortable. Here the two lines are pulling apart, and that tells you something about what developers see when they read the output.

The biggest engineering teams are feeling it too

It would be convenient to call this a small-company problem, something the giants have already solved. They have not. In March 2026, Amazon reportedly pulled engineers into a mandatory review after a string of outages, one of which took its retail site down for roughly six hours. An internal briefing described a pattern of incidents with a high blast radius tied to changes made with generative AI, and the company responded by requiring senior engineers to sign off on AI-assisted code before it shipped.

I do not read that as proof that Amazon, Google, and Microsoft write poor software. They run some of the strongest engineering organizations on the planet. I read it the other way. If teams with that much talent and budget are adding controls, a company with fewer specialists and a tighter budget will feel the same pressure, only sooner. The lesson is that the practice has to change, not only the amount of code produced.

"Can AI just write the tests?"

If AI writes the code, surely AI can write the tests, and the problem takes care of itself. My honest answer is that this is half right and not enough. AI is good at generating tests. It probably writes an even larger share of test code than production code, because developers have spent years avoiding writing tests by hand.

Generating tests is one piece of a much larger job. Testing runs across a wide range. At one end is a unit test on a laptop. At the other is the kind of resilience work Netflix made famous, where the team deliberately switches off live infrastructure to confirm the system holds up when something real breaks. The same principle holds wherever the cost of a failure is high: confirming a server responds is not the same as confirming the transaction completes, and you only learn the difference by running the test where the software actually runs. For the technical version of why surface checks miss the failures that matter, read system-level testing for AI-generated code.
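The gap between "the server responds" and "the transaction completes" can be shown with a small sketch. Everything here is an illustrative assumption: `FakeCheckoutService` is a hypothetical in-memory stand-in for a deployed service, and the status codes and order fields are invented for the example. In practice the same two checks would run against the environment where the software actually runs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCheckoutService:
    """Hypothetical in-memory stand-in for a deployed checkout service."""

    ledger_online: bool = True
    orders: dict = field(default_factory=dict)

    def health(self) -> int:
        # The web tier answers even when a downstream dependency is broken.
        return 200

    def place_order(self, order_id: str, amount_cents: int) -> int:
        # The request is accepted either way, but the order is only
        # recorded when the ledger dependency is reachable.
        if self.ledger_online:
            self.orders[order_id] = amount_cents
        return 202


def surface_check(service: FakeCheckoutService) -> bool:
    """Confirms only that the service responds."""
    return service.health() == 200


def transaction_check(service: FakeCheckoutService) -> bool:
    """Confirms the order was accepted and actually recorded."""
    status = service.place_order("order-1", 1999)
    return status == 202 and service.orders.get("order-1") == 1999


if __name__ == "__main__":
    broken = FakeCheckoutService(ledger_online=False)
    print("surface check passes:", surface_check(broken))
    print("transaction check passes:", transaction_check(broken))
```

With the ledger offline, the surface check still passes while the transaction check fails, which is the kind of failure a health probe alone would miss.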

Shipping code straight from prompts? The verification step is the one most teams skip. Read: What is vibe testing →

Three jobs, not one

The point I most wanted the room to remember is that testing AI-generated code breaks into three connected jobs. Quality is the result of all three. Skip any one and the other two stop protecting you.

The job	What I told the room
Test creation	AI is strong here and the tooling improves weekly. A human still has to define what correct means.
Test execution	Harder than it appears. Reproducibility and real environments are what make a result worth trusting.
Test analysis	The step teams quietly drop. A failing test that nobody reads is a smoke alarm with a dead battery.

This is the point where the talk turns toward what we build. We covered each job, with a live demo, in AI test automation: fixing the three new testing bottlenecks. The short version is that Testkube's execution engine runs any testing tool as native Kubernetes jobs inside your own clusters, and its AI agents take on the analysis, telling a real failure apart from a flaky one.
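To make the real-versus-flaky distinction concrete, here is a minimal sketch of one common heuristic: rerun the test and look at whether the outcome is consistent. This is not a description of how Testkube's agents work. The `classify` function, its labels, and the default of five reruns are assumptions chosen for illustration, and real analysis would also look at logs, timing, and environment state.

```python
from typing import Callable


def classify(test: Callable[[], bool], reruns: int = 5) -> str:
    """Rerun a test and label it by how consistent its outcomes are.

    `test` is any zero-argument callable that returns True on pass.
    """
    if reruns < 2:
        raise ValueError("need at least two runs to judge consistency")

    outcomes = [test() for _ in range(reruns)]
    passes = sum(outcomes)

    if passes == reruns:
        return "passing"
    if passes == 0:
        # Fails every time: likely a genuine defect worth a person's attention.
        return "real failure"
    # Mixed outcomes on identical input: the test or its environment is unstable.
    return "flaky"


if __name__ == "__main__":
    # A hypothetical test that alternates between failing and passing.
    results = iter([False, True, False, True, True])
    print(classify(lambda: next(results)))
```

A consistent failure and an intermittent one call for different responses, so the label is only useful if someone, or something, actually reads it.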

Why a person stays in the loop

For now, regulated software needs a human somewhere in the chain. Banks, hospitals, insurers, and government systems carry consequences that make hands-off automation a bad trade. A person has to set the definition of correct behavior and make the final call on whether a result holds.

I also raised a stranger problem: testing the AI itself. When money moves in a banking app, you do not want a non-deterministic agent moving it, because it can do the job correctly nine times and hallucinate on the tenth. That changes the shape of the test. Instead of one pass-or-fail assertion, you measure a confidence score across many runs and ask whether the agent is reliable enough to trust. This is one reason continuous testing now has to cover both the deterministic software and the non-deterministic systems writing and running inside it.
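A rough sketch of what "a confidence score across many runs" could look like in code follows. The talk did not prescribe a method, so the choices here are assumptions: the agent is modeled as a callable that returns True when it did the job correctly, the score is the Wilson lower bound on the pass rate at roughly 95 percent confidence, and the run count and threshold are placeholders.

```python
import math
from typing import Callable, Tuple


def wilson_lower_bound(successes: int, runs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denominator = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denominator


def reliability(agent_run: Callable[[], bool], runs: int = 200) -> Tuple[float, float]:
    """Run the agent repeatedly; return (observed pass rate, lower bound)."""
    successes = sum(1 for _ in range(runs) if agent_run())
    return successes / runs, wilson_lower_bound(successes, runs)


def is_trustworthy(agent_run: Callable[[], bool], runs: int = 200,
                   required: float = 0.99) -> bool:
    """Trust the agent only if the lower bound clears the required rate."""
    _, lower = reliability(agent_run, runs)
    return lower >= required


if __name__ == "__main__":
    rate, lower = reliability(lambda: True, runs=200)
    print(f"observed {rate:.3f}, lower bound {lower:.3f}")
```

One consequence worth noting: 200 clean runs out of 200 give a lower bound of about 0.981, so a spotless sample of that size still does not establish 99 percent reliability. The number of runs has to scale with how much trust the task demands.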

Build testing into the workflow your team already has

My closing note was that AI tooling has to settle into the workflow a team already runs, rather than asking the team to adopt a new one. When a test fails, the result belongs where people already work: a message in the Slack channel they watch, or a comment on the pull request under review. Pushing engineers into a separate console to find out what broke adds friction at the exact moment they need speed.
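As a small illustration of delivering a failure where people already work, the sketch below builds a Slack incoming-webhook request from a test failure. Slack incoming webhooks accept a JSON body with a `text` field. The `FailureReport` fields, the message wording, and the webhook URL are hypothetical placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class FailureReport:
    """Hypothetical summary of one failed test run."""

    test_name: str
    environment: str
    summary: str
    pull_request: int


def format_message(failure: FailureReport) -> str:
    """One short message that can be reused for Slack or a PR comment."""
    return (
        f"Test failed: {failure.test_name} in {failure.environment}\n"
        f"PR #{failure.pull_request}: {failure.summary}"
    )


def slack_request(webhook_url: str, failure: FailureReport) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": format_message(failure)}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    report = FailureReport("checkout-e2e", "staging", "order never recorded", 123)
    # Placeholder URL: substitute the webhook for the channel your team watches.
    request = slack_request("https://hooks.slack.example/placeholder", report)
    print(request.data.decode("utf-8"))
    # To deliver it: urllib.request.urlopen(request, timeout=10)
```

Keeping the message format separate from the transport means the same text can go to a chat channel and to the pull request under review without being written twice.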

The payoff shows up in the numbers. As an example, DocNetwork automated its testing and visibility with Testkube and saved 30 DevOps hours per week, in part by retiring the weekly deployment meeting that existed only to make sense of scattered results. Once the three jobs connect and the output arrives inside the existing workflow, the manual coordination cost falls away.

Key takeaways

Adoption climbs while trust falls. Stack Overflow's 2025 survey shows 84 percent of developers using or planning to use AI tools and only 29 percent trusting the accuracy of the output, which moves the burden of proof onto testing.
Elite teams are adding controls. Amazon reportedly now requires senior sign-off on AI-assisted code after a run of high-blast-radius incidents, a sign that practice has to change with volume.
Writing tests is only the start. Quality comes from creation, execution, and analysis working together. A test that never runs, or whose failure goes unread, protects nothing.
Keep a person in the loop and test the AI itself. People define correct behavior and judge results, and non-deterministic agents need confidence scores rather than a single pass or fail.
Fit beats force. Tooling that reports into Slack and pull requests gets used. DocNetwork saved 30 DevOps hours per week once results were centralized.

Want to go deeper than the talk? Walk through test execution and AI failure analysis with our team.

Book a demo →

About Testkube

Testkube is the open testing platform for AI-driven engineering teams. It runs tests directly in your Kubernetes clusters, works with any CI/CD system, and supports every testing tool your team uses. By removing CI/CD bottlenecks, Testkube helps teams ship faster with confidence.
Get Started with a trial to see Testkube in action.


Related Content

AI Test Automation: Fixing the Three New Testing Bottlenecks

Testkube AI: More Use Cases, a Smarter UX, and Now available for Open Source

Testing Built for the Way Engineering Actually Works Now
