This experiment was designed to evaluate how effectively AI agents generate software test cases, and whether providing a reference document containing basic QA concepts meaningfully improves the quality of the output.
Two Claude AI chat agents were used: one with no prior context or documentation (Agent 1), and one provided with a reference document explaining two fundamental QA techniques — negative testing and boundary value analysis (Agent 2). Both agents were given the same prompt and their outputs were evaluated against a predefined set of control test cases and a point-based scoring rubric.
The feature being tested was intentionally simple: a single input field that accepts whole numbers between 1 and 10, plus a submit button. This feature was selected because it naturally lends itself to both boundary value analysis and negative testing, making it a good candidate for evaluating whether an AI applies these techniques.
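To make that boundary structure concrete, here is a minimal sketch of how such a field might validate its input; the function name, messages, and strictness about formatting are assumptions for illustration, not details taken from the experiment.

```python
def validate_whole_number_input(raw: str) -> tuple[bool, str]:
    """Hypothetical validator for the field under test: accept the whole
    numbers 1 through 10 and reject everything else."""
    # Strict format check: only plain digit strings count as whole numbers,
    # so letters, decimals, whitespace, and empty input are all rejected.
    if not raw.isdigit():
        return False, "Input must be a whole number"
    value = int(raw)
    # Range check around the 1-10 boundary: 0 and 11 fall just outside,
    # 1 and 10 sit on the edges, 2 and 9 sit just inside.
    if value < 1 or value > 10:
        return False, "Input must be between 1 and 10"
    return True, "Accepted"
```

The six values called out in the range check (0, 1, 2, 9, 10, 11) are exactly the boundary value analysis candidates for this field.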
The AI agent provided with a reference document (Agent 2) will outperform the untrained agent (Agent 1) in all categories: BVA coverage, negative test coverage, and I/O structure.
Two agents were used in this experiment. Each run used fresh agents with no prior context. One agent was given only the prompt, while the other was first given a reference document explaining the concepts of negative testing and boundary value analysis, followed by the same prompt.
The prompt was:
“Generate functional test cases for a simple input field that accepts whole numbers between 1 and 10 and has a submit button.”
The word “functional” was included to exclude non-functional test cases such as security tests (e.g. SQL injection), which earlier test runs had produced. The phrase “whole numbers” was included to remove ambiguity around decimal inputs.
No additional context was provided — such as error handling expectations or what happens after a successful submission — in order to present both agents with the same level of ambiguity that a real-world tester might encounter.
After each run, the agent was asked to provide its test cases in .csv format, which were then imported into a spreadsheet for analysis and scoring.
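As a rough illustration of that step, the sketch below reads one run’s exported .csv; the column names (test_id, input, expected_result) are assumed for illustration, since the agents were only asked for .csv output rather than a fixed schema.

```python
import csv

def load_agent_test_cases(path: str) -> list[dict]:
    """Load one agent run's exported test cases from a .csv file.
    The column names used here are hypothetical, not a fixed schema."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {
                "test_id": (row.get("test_id") or "").strip(),
                "input": (row.get("input") or "").strip(),
                "expected_result": (row.get("expected_result") or "").strip(),
            }
            for row in csv.DictReader(f)
        ]

# e.g. cases = load_agent_test_cases("agent1_run3.csv")
```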
The test was run five times per agent, for a total of 10 test runs.
Controlled Variables: the prompt wording, the AI model, the use of fresh agents with no prior context for each run, and the request for .csv output.
Independent Variable: whether the agent was given the reference document on negative testing and boundary value analysis before the prompt.
Dependent Variables: BVA coverage, negative test coverage, and I/O structure, measured against the control test cases and the point-based scoring rubric.
Results were evaluated against a control set of 16 test cases, divided into three categories:
Happy Path (1 test case)
BVA Test Cases (6 test cases): 0, 1, 2, 9, 10, 11
Negative Test Cases (9 test cases)
A test case was considered a match if the input value fell within the accepted inputs defined in the control, regardless of how the agent categorized or labeled it (a code sketch of this matching rule appears after the rubric below).
Point Scoring Rubric (maximum 8 points):
BVA Coverage (0–3 points)
Negative Test Coverage (0–3 points)
I/O Structure (0–2 points)
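A minimal sketch of the matching step follows. The six BVA values (0, 1, 2, 9, 10, 11) follow directly from the 1–10 range; the happy-path value and the nine negative inputs shown are placeholders, since the full control set is not reproduced here.

```python
# Control set sketch: each control test case is a set of inputs that count
# as a match for it. Only the BVA values are taken from the experiment; the
# happy-path value and negative inputs below are illustrative placeholders.
CONTROL = {
    "bva": [{"0"}, {"1"}, {"2"}, {"9"}, {"10"}, {"11"}],
    "happy_path": [{"5"}],
    "negative": [{"abc"}, {""}, {"-1"}, {"3.5"}, {"!@#"},
                 {"one"}, {" "}, {"1e2"}, {"999"}],
}

def match_rate(agent_inputs: list[str], category: str) -> tuple[int, int]:
    """Count how many control cases in a category are matched by any input
    the agent produced, regardless of how the agent labeled that input."""
    controls = CONTROL[category]
    matched = sum(
        any(inp in accepted for inp in agent_inputs) for accepted in controls
    )
    return matched, len(controls)
```

Counting matches per category per run is what fills the “Matched” columns in the tables below.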
The hypothesis was not supported. Agent 2 did not outperform Agent 1 in all categories. While Agent 2 demonstrated superior BVA coverage, it underperformed significantly on negative test coverage. Both agents scored identically on I/O structure and achieved the same average total point score.
BVA Coverage: Agent 2 achieved perfect BVA coverage in all five runs (6/6 every run, 100% average). Agent 1 achieved full BVA coverage in only two of five runs (80% average), consistently missing the “just inside” boundary values (2 and 9) in the other three runs. The reference document clearly and consistently improved BVA performance.
Negative Test Coverage: Agent 1 significantly outperformed Agent 2 on negative test coverage. Agent 1 averaged 8.0/9 matched negative test cases per run (89% average), while Agent 2 averaged only 4.6/9 (51% average). The negative test cases Agent 2 missed most often were those that fell outside the scope of the reference document’s examples, a pattern discussed further below.
I/O Structure: Both agents performed equally well on I/O structure. All runs from both agents received a 2/2, indicating that AI agents naturally produce well-structured test cases with clear inputs and expected outputs without requiring explicit instruction.
The Reference Document Created a Bias: The most significant and unexpected finding of this experiment is that the reference document appeared to constrain Agent 2’s thinking rather than expand it. Agent 2’s outputs were closely aligned with the types of test cases described in the reference document, suggesting the agent treated the document’s examples as a checklist rather than as a framework for broader thinking. This likely explains why Agent 2 consistently missed negative test cases that fell outside the scope of the document’s examples.
Agent 1 Was More Exploratory: Agent 1 included test cases not present in the control set, demonstrating broader exploratory thinking. Examples included inputs with leading/trailing spaces, decimal values formatted as whole numbers (e.g. 3.0), scientific notation, and combinations of letters and numbers. This suggests that without a reference document anchoring its approach, Agent 1 explored the problem space more freely.
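Expressed as test data, those exploratory inputs would look roughly like the parametrized sketch below; whether whitespace-padded values should be accepted is one of the ambiguities the prompt left open, so the strict reading used here is an assumption.

```python
import pytest

# Exploratory inputs of the kind Agent 1 produced beyond the control set.
# Under a strict reading of "whole numbers between 1 and 10", all of them
# should be rejected by the field.
EXPLORATORY_INPUTS = [
    " 5 ",   # leading/trailing spaces around an otherwise valid value
    "3.0",   # decimal formatted to look like a whole number
    "1e1",   # scientific notation
    "4a",    # mixed letters and digits
]

@pytest.mark.parametrize("raw", EXPLORATORY_INPUTS)
def test_exploratory_input_is_rejected(raw):
    # Same strict rule as the validator sketched earlier: only plain digit
    # strings in the range 1-10 are accepted.
    accepted = raw.isdigit() and 1 <= int(raw) <= 10
    assert not accepted
```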
Redundant Test Cases: Redundant test cases were observed in several Agent 1 runs — for example, testing 0 twice under different descriptions. While this would need to be cleaned up in a real test suite, it is a minor issue and does not significantly impact coverage quality.
Match Rate Scores (out of 16 control test cases):
| Run | BVA Matched | % BVA | Negative Matched | % Negative | Happy Path | Total Matched | % Total |
|---|---|---|---|---|---|---|---|
| A1R1 | 4/6 | 67% | 9/9 | 100% | 1/1 | 14/16 | 88% |
| A1R2 | 4/6 | 67% | 8/9 | 89% | 1/1 | 13/16 | 81% |
| A1R3 | 6/6 | 100% | 8/9 | 89% | 1/1 | 15/16 | 94% |
| A1R4 | 6/6 | 100% | 7/9 | 78% | 1/1 | 14/16 | 88% |
| A1R5 | 4/6 | 67% | 8/9 | 89% | 1/1 | 13/16 | 81% |
| A1 Average | 4.8/6 | 80% | 8.0/9 | 89% | 1/1 | 13.8/16 | 86% |
| A2R1 | 6/6 | 100% | 4/9 | 44% | 1/1 | 11/16 | 69% |
| A2R2 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R3 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R4 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R5 | 6/6 | 100% | 4/9 | 44% | 1/1 | 11/16 | 69% |
| A2 Average | 6.0/6 | 100% | 4.6/9 | 51% | 1/1 | 11.6/16 | 73% |
Weighted Match Score (BVA 50%, Negative 35%, Total 15%):
| Metric | Agent 1 | Agent 2 |
|---|---|---|
| BVA Average | 80% | 100% |
| Negative Average | 89% | 51% |
| Total Average | 86% | 73% |
| Weighted Score | 85.2% | 80.1% |
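The weighted score combines the three category averages using the weights stated above. A minimal sketch of that calculation follows, assuming the weights are applied directly to the category averages from the match-rate table; the exact rounding behind the reported figures is not specified.

```python
def weighted_match_score(bva_pct: float, negative_pct: float, total_pct: float) -> float:
    """Combine category match rates using the stated weights:
    BVA 50%, negative coverage 35%, overall match rate 15%."""
    return 0.50 * bva_pct + 0.35 * negative_pct + 0.15 * total_pct
```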
Point Scores (out of 8):
| Run | BVA (0-3) | Negative (0-3) | I/O (0-2) | Total (0-8) |
|---|---|---|---|---|
| A1R1 | 2 | 3 | 2 | 7 |
| A1R2 | 2 | 2 | 2 | 6 |
| A1R3 | 3 | 2 | 2 | 7 |
| A1R4 | 3 | 2 | 2 | 7 |
| A1R5 | 2 | 2 | 2 | 6 |
| A1 Average | 2.4 | 2.2 | 2.0 | 6.6 |
| A2R1 | 3 | 1 | 2 | 6 |
| A2R2 | 3 | 2 | 2 | 7 |
| A2R3 | 3 | 2 | 2 | 7 |
| A2R4 | 3 | 2 | 2 | 7 |
| A2R5 | 3 | 1 | 2 | 6 |
| A2 Average | 3.0 | 1.6 | 2.0 | 6.6 |
Both agents averaged 6.6/8 on the point scoring rubric, confirming that the reference document did not produce an overall improvement in score. Agent 2’s perfect BVA performance was directly offset by its weaker negative test coverage, resulting in an identical average total score.
This experiment raised several questions that could be explored in future experiments:
1. How should a reference document be structured to be most effective? The reference document used in this experiment included specific examples that closely mirrored the test prompt. This appeared to create a bias in the trained agent’s output. A future experiment could compare different document formats — abstract conceptual explanations vs. concrete examples vs. a combination of both — to determine which produces the best results.
2. What happens when the prompt is more complex? The feature tested in this experiment was intentionally simple. A more complex feature — such as a multi-field form with conditional logic, or a feature with unclear or ambiguous requirements — may produce significantly different results. It is worth investigating whether AI agents perform as well relative to human testers as the complexity of the feature increases.
3. What would happen if the AI was given an actual application to test? This experiment provided only a text description of a feature. A future experiment could explore what happens when an AI is given access to a working application and asked to generate test cases based on direct observation rather than a written prompt.
4. What happens if the reference document is given after the prompt? In this experiment, the reference document was always provided before the prompt. It is worth investigating whether the order of context matters — would an agent that generates test cases first and then reviews them against a reference document produce better coverage than one that reads the document first?
5. Can an AI evaluate and improve its own test cases? A follow-up experiment could ask the AI to review its own output after generation and identify any gaps or redundancies. This could reveal whether AI agents are capable of self-correction and whether that improves overall test coverage.
6. What happens with a more abstract reference document? The reference document used specific examples that closely mirrored the prompt. Would a more conceptual document that teaches the thinking behind the techniques — rather than demonstrating them with specific examples — produce less bias and better overall coverage?
7. How does AI perform on negative testing for more subjective features? Negative testing a numeric input field is relatively straightforward — the boundaries are clearly defined. It is less clear how AI agents would perform on features where the wrong inputs are less obvious, such as a free-text search field, a rich text editor, or a feature with user-defined inputs.
8. Does the AI model matter? This experiment was conducted using a specific version of Claude Sonnet. It is unclear whether the findings would hold across different AI models or future versions of the same model. A comparative experiment across multiple models using the same prompt and rubric could reveal whether the bias effect is model-specific or more universal.
9. How does AI-generated test coverage compare to human-generated test coverage? This experiment only compared AI to AI. A natural next step would be to compare the output of both agents against test cases written by an experienced human QA tester, using the same prompt and rubric, to better understand where AI adds value and where it falls short.
10. Would a larger sample size change the conclusion? With only five runs per agent, some variability is expected. A larger sample size — 20 or 50 runs per agent — would produce more statistically reliable results and could either reinforce or contradict the findings of this experiment.
The results of this experiment suggest that AI agents have a reasonable baseline understanding of software testing concepts, even without any reference material. The untrained agent consistently produced well-structured test cases that covered both boundary values and negative scenarios — arguably well enough to be useful in a real-world setting, at least for simple features.
The more interesting finding is what happened when the agent was given a reference document. Rather than using the document as a springboard for broader thinking, the agent appeared to use it as a boundary — producing test cases that closely mirrored the examples provided while missing scenarios that fell outside of them. This is worth keeping in mind when considering how to use AI in a testing workflow. Providing an AI with documentation or examples may improve consistency and structure, but it could come at the cost of coverage breadth.
This mirrors a well-known challenge in human testing as well. Testers who follow a rigid test plan sometimes miss bugs that fall outside of it, while exploratory testers — those given freedom to investigate without a strict script — often catch things nobody thought to document. The untrained agent in this experiment behaved more like an exploratory tester, while the trained agent behaved more like one following a plan.
It is also worth noting that neither agent produced a perfect set of test cases. Both missed edge cases, included redundant tests, and made assumptions about expected behavior where the requirements were ambiguous. This suggests that AI-generated test cases should be treated as a starting point rather than a final deliverable — useful for generating coverage quickly, but still requiring human review and judgment before being used in a real test suite.