

AI Tests — But Who Tests the AI?

Written by Anja Kribernegg, June 2026



Ever since generative AI tools made their way into software development, a new narrative has become popular: AI writes code, AI generates test cases, AI finds bugs. Humans set the goal — the model does the rest. Productivity gains that used to take quarters now happen in hours.

This development is real. And it is good.
But it raises a question that often gets lost in the euphoria: if AI tests — who tests the AI?

Anja Kribernegg

As an experienced test manager, project manager and business analyst, Anja Kribernegg has successfully managed complex IT projects in the banking, insurance and public sectors over the past 20+ years. With her expertise in agile methodologies (Scrum, Kanban, SAFe), test automation, end-to-end testing and requirements engineering, she has not only implemented technical solutions but also guided teams through digital transformations.
Her focus lies on combining process optimisation, quality assurance and compliance – always with the aim of creating robust and user-centred systems. She applies this experience and expertise specifically to projects that require innovative testing approaches and sustainable IT solutions. Her open-mindedness supports her in this endeavour.

Published: 06.2026

The New Test Object

In classical software development, the test object is clearly defined: a system with specified requirements, deterministic behavior, and traceable logic. A well-formulated test case has an unambiguous expectation. Either the system delivers the correct result — or it doesn’t.

AI systems play by different rules.

A large language model may give a different answer today than it did yesterday for an identical input. An image classification model makes the right decision in 98% of cases, but which 2% does it get wrong, and why? A recommendation system optimizes for click-through rate even though it is really supposed to maximize customer satisfaction. These systems aren’t incorrectly programmed. They are built that way. And that is precisely what makes testing them fundamentally different.

The quality community faces a challenge for which the classical methods are only partly equipped: how do you test a system whose behavior is probabilistic, context-dependent, and shaped by training data — rather than by explicit logic?

Where Classical Testing Methods Reach Their Limits

Let’s look at four core principles of professional testing and what happens to them when the test object is an AI system:

Reproducibility: A classical test always delivers the same result under the same conditions. With generative AI systems, this is not guaranteed. Factors such as temperature parameters, sampling strategies, and model updates can cause a test that was green yesterday to be red today, without the code having changed. Flaky tests (tests whose result varies from run to run) take on a new dimension.
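
One way to deal with this is to assert on a pass rate over many runs instead of on a single answer. The following is a minimal sketch under stated assumptions: `fake_model` is a hypothetical stand-in for a sampled model call, and the 90% hit probability and the 0.8 threshold are invented for illustration, not recommended values.

```python
import random

def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a sampled model call: usually right, sometimes not."""
    return "4" if rng.random() < 0.9 else "5"

def pass_rate(prompt: str, expected: str, runs: int, seed: int) -> float:
    """Send the same prompt repeatedly and return the share of matching answers."""
    rng = random.Random(seed)  # a fixed seed keeps the test run itself repeatable
    hits = sum(fake_model(prompt, rng) == expected for _ in range(runs))
    return hits / runs

rate = pass_rate("What is 2 + 2?", expected="4", runs=200, seed=42)

# The assertion targets an agreed threshold, not an individual response.
assert rate >= 0.8, f"pass rate {rate:.2f} is below the agreed threshold"
```

The threshold then becomes part of the test basis: it has to be agreed with the business side, just like any other acceptance criterion.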

Expected values: Every test case needs an oracle — a defined expectation. With AI systems, the oracle is often unclear. What is the “right” answer to a complex customer inquiry? What is a “fair” credit decision? These questions are not technical — they are ethical and domain-specific. Test design no longer begins in the test team, but in the business domain, in the ethics committee, or in legal review.

Quality characteristics: AI systems also set entirely new priorities for the quality characteristics that must be considered. Ethics, fairness, and freedom from bias, for example, are not even included in the ISO/IEC 25010 standard. Measuring the quality of these systems additionally requires statistical methods that play little role in classical quality engineering. The F1 score, for instance, is a central metric in machine learning for evaluating the quality of a classification model. It is calculated as the harmonic mean of precision and recall.
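
To make the F1 score tangible, here is a small sketch that computes it from the counts of a confusion matrix. The function name and the example counts are made up for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall is undefined; report the worst score
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: 80 hits, 20 false alarms, 40 misses.
# Precision is 0.8, recall is about 0.667, F1 is about 0.727.
score = f1_score(true_positives=80, false_positives=20, false_negatives=40)
```

Because it is a harmonic mean, the F1 score stays low as soon as either precision or recall is low, which is why it is often preferred over plain accuracy for imbalanced classes.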

Coverage: Statement coverage, branch coverage, path coverage — all of these metrics presuppose that there is a defined code flow. For a neural network with millions of weights, “coverage” is not a meaningful concept in the classical sense. New metrics are needed: coverage of the input space, robustness against adversarial inputs, and behavior at distribution boundaries.

What AI Testing Means — Concrete Approaches

The good news: the craft of testing is not obsolete. It has to be expanded.

Here are approaches that are already applicable today:

Metamorphic Testing: When no unambiguous oracle exists, relations between test cases can be checked. If a translation system correctly translates the sentence “The cat is sitting on the mat,” then the translation of “The cat is not sitting on the mat” should contain a consistent negation. It is not the absolute answer that is tested, but the consistency of the behavior under defined transformations of the input.
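
The cat-and-mat example can be sketched as a metamorphic check. Everything here is an assumption for illustration: `translate` is a hypothetical stub standing in for the system under test, the German outputs are hard-coded, and the word list for detecting negation is deliberately crude.

```python
def translate(sentence: str) -> str:
    """Hypothetical stand-in for the translation system under test (English to German)."""
    table = {
        "The cat is sitting on the mat": "Die Katze sitzt auf der Matte",
        "The cat is not sitting on the mat": "Die Katze sitzt nicht auf der Matte",
    }
    return table[sentence]

# Crude negation detector for German output; a real check would need more care.
NEGATION_MARKERS = {"nicht", "kein", "keine"}

def contains_negation(german: str) -> bool:
    return any(word in NEGATION_MARKERS for word in german.lower().split())

def negation_relation_holds(source: str, negated_source: str) -> bool:
    """Metamorphic relation: negating the input must add a negation to the output."""
    plain = translate(source)
    negated = translate(negated_source)
    return not contains_negation(plain) and contains_negation(negated)

relation_ok = negation_relation_holds(
    "The cat is sitting on the mat",
    "The cat is not sitting on the mat",
)
```

Note that the check never states what the correct translation is. It only states how two outputs must relate to each other.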

Property-Based Testing: Instead of individual test cases, properties are defined that the system should always satisfy — regardless of the specific input. A credit decision system should arrive at the same results for identical financial data, independent of the applicant’s ethnic origin. This property can be checked automatically with thousands of generated test cases.
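
A minimal sketch of such a property check, using only the standard library. The model `credit_decision`, the group labels, the value ranges, and the 0.4 debt-to-income rule are all hypothetical; a real project would more likely use a dedicated property-based testing library.

```python
import random
from typing import Callable, Optional, Tuple

GROUPS = ["A", "B", "C"]  # hypothetical labels for the protected attribute

def credit_decision(income: float, debt: float, group: str) -> bool:
    """Hypothetical model under test: approves when the debt-to-income ratio is low."""
    return debt / income < 0.4

def find_counterexample(
    model: Callable[[float, float, str], bool], trials: int, seed: int
) -> Optional[Tuple[float, float]]:
    """Generate random applicants and return the first one whose decision
    changes when only the protected attribute is varied."""
    rng = random.Random(seed)
    for _ in range(trials):
        income = rng.uniform(10_000, 200_000)
        debt = rng.uniform(0, 150_000)
        decisions = {model(income, debt, group) for group in GROUPS}
        if len(decisions) > 1:  # same finances, different outcomes
            return (income, debt)
    return None

# The property holds if no counterexample is found.
counterexample = find_counterexample(credit_decision, trials=1000, seed=0)
```

A returned counterexample is directly usable as a defect report: it names concrete financial data for which the decision depends on the protected attribute.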

Adversarial Testing: AI systems can be induced to produce wrong outputs through targeted manipulation of the input — so-called adversarial examples. An image that clearly shows a cat to the human eye can, through minimal pixel manipulation, cause an AI system to classify it as “dog.” Safety-critical AI systems must be tested for this kind of robustness.
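
The idea can be shown on a toy scale. The sketch below assumes a hypothetical linear scorer over four "pixels" and shifts every pixel by a small epsilon in the direction that hurts the current label most, in the spirit of sign-based attacks such as the fast gradient sign method. Real image models are far more complex; the weights and pixel values are invented.

```python
from typing import List

def classify(pixels: List[float], weights: List[float], bias: float) -> str:
    """Hypothetical linear classifier: non-negative score means 'cat'."""
    score = sum(p * w for p, w in zip(pixels, weights)) + bias
    return "cat" if score >= 0 else "dog"

def sign(x: float) -> int:
    return (x > 0) - (x < 0)

def perturb(pixels: List[float], weights: List[float], epsilon: float,
            lower_score: bool) -> List[float]:
    """Shift each pixel by epsilon against (or with) its weight, clipped to [0, 1]."""
    direction = -1 if lower_score else 1
    return [min(1.0, max(0.0, p + direction * epsilon * sign(w)))
            for p, w in zip(pixels, weights)]

def is_robust(pixels: List[float], weights: List[float], bias: float,
              epsilon: float) -> bool:
    """True if the worst-case shift of size epsilon does not change the label."""
    original = classify(pixels, weights, bias)
    attacked = perturb(pixels, weights, epsilon, lower_score=(original == "cat"))
    return classify(attacked, weights, bias) == original

weights = [1.0, -1.0, 0.5, -0.5]
image = [0.6, 0.5, 0.5, 0.5]  # classified as "cat" with a small margin

robust_small = is_robust(image, weights, bias=0.0, epsilon=0.01)
robust_large = is_robust(image, weights, bias=0.0, epsilon=0.05)
```

In this toy setup the label survives a shift of 0.01 per pixel but flips at 0.05. A robustness test would fix the tolerated epsilon as a requirement and fail whenever an input inside that tolerance changes the label.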

Bias and Fairness Testing: Training data reflects historical realities — and thus historical inequalities. A model trained on applicant data from the last 20 years may structurally disadvantage certain groups. Bias testing means: systematically checking whether the model decides consistently and fairly across different demographic groups. Tools such as IBM AI Fairness 360 or Facets offer initial methodological support here.
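
A simple starting point is to compare approval rates across groups. The sketch below computes the selection rate per group and the ratio between the lowest and the highest rate, often called the disparate impact ratio. The data is invented, and the 0.8 threshold follows the widely cited four-fifths rule of thumb; whether that threshold fits a given domain is a business and legal decision, not a technical one.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def selection_rates(decisions: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of positive decisions per group, from (group, approved) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    approved: Dict[str, int] = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}

def disparate_impact_ratio(rates: Dict[str, float]) -> float:
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody is approved in any group, so the rates are equal
    return min(rates.values()) / highest

# Invented data: group A has 60 of 100 approved, group B has 30 of 100.
decisions = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

rates = selection_rates(decisions)
ratio = disparate_impact_ratio(rates)
flagged = ratio < 0.8  # four-fifths rule of thumb, used here as an assumption
```

A flagged ratio is a signal to investigate, not a verdict. Different fairness metrics can contradict each other, which is one more reason the expected value has to come from the business domain.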

Monitoring as Continuous Testing: AI systems also change in the field — through new training data, model updates, and changed usage contexts. A one-time test before go-live is not sufficient. Production monitoring that responds to statistical deviations in model behavior (data drift, concept drift) is a form of continuous testing — and must be planned and operated as such.
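
As one possible form of such monitoring, the following sketch compares the distribution of model scores in production with a reference distribution using the Population Stability Index (PSI). The bin edges, the sample data, and the alert threshold of 0.1 are assumptions; common rules of thumb exist for PSI, but suitable thresholds depend on the system.

```python
import math
from typing import List, Sequence

def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin; the last bin includes its upper edge.
    Values outside the edges are not counted in any bin."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    floor = 1e-6  # keeps empty bins from causing log(0) or division by zero
    return [max(count / len(values), floor) for count in counts]

def population_stability_index(reference: Sequence[float], live: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI: sum over bins of (live - reference) * ln(live / reference)."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_scores = [0.1, 0.3, 0.6, 0.9] * 25  # evenly spread at release time
live_scores = [0.1] * 10 + [0.3] * 20 + [0.6] * 30 + [0.9] * 40  # shifted upward

psi = population_stability_index(reference_scores, live_scores, edges)
drift_alert = psi > 0.1  # assumed threshold for illustration
```

Run on a schedule against live data, a check like this behaves like a regression test that never ends: the alert takes the place of the red build.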

Human Responsibility Remains

AI can help with testing — generating test cases, analyzing logs, detecting anomalies. That is valuable. But AI cannot decide what a fair outcome is. AI cannot weigh which risk is acceptable. AI cannot take responsibility for a system that affects people.

That responsibility lies with humans. And it lies specifically with those who practice testing professionally.

The role of the quality engineer is changing: less manual test execution, more test design for non-deterministic systems. Less script maintenance, more risk assessment and ethical review. Less writing bug reports, more quality responsibility within interdisciplinary teams — together with data scientists, business domains, and the legal team.

This requires new competencies. And it requires the quality community to actively help shape this development — instead of waiting for others to provide the answers.

An Open Question to the Industry

AI systems today make decisions that grant loans, diagnose illnesses, screen out job applications, and steer autonomous vehicles. The question of how these systems are tested (systematically, traceably, responsibly) is not an academic one. It is a societal one.

The quality community has the tools, the intellect, and the experience to take a leading role here. The question is whether we will also claim that role.

AI tests. But who tests the AI? That should be us.
