This experiment was designed to evaluate how effectively AI agents generate software test cases, and whether providing a reference document containing basic QA concepts meaningfully improves the quality of the output.
Two Claude AI chat agents were used: one with no prior context or documentation (Agent 1), and one provided with a reference document explaining two fundamental QA techniques — negative testing and boundary value analysis (Agent 2). Both agents were given the same prompt and their outputs were evaluated against a predefined set of control test cases and a point-based scoring rubric.
The feature being tested was intentionally simple: a single input field that accepts whole numbers between 1 and 10, plus a submit button. This feature was selected because it naturally lends itself to both boundary value analysis and negative testing, making it a good candidate for evaluating whether an AI applies these techniques.
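To make that boundary structure concrete, here is a minimal sketch of how such a field might validate its input; the function name, messages, and strictness about formatting are assumptions for illustration, not details taken from the experiment.

```python
def validate_whole_number_input(raw: str) -> tuple[bool, str]:
    """Hypothetical validator for the field under test: accept the whole
    numbers 1 through 10 and reject everything else."""
    # Strict format check: only plain digit strings count as whole numbers,
    # so letters, decimals, whitespace, and empty input are all rejected.
    if not raw.isdigit():
        return False, "Input must be a whole number"
    value = int(raw)
    # Range check around the 1-10 boundary: 0 and 11 fall just outside,
    # 1 and 10 sit on the edges, 2 and 9 sit just inside.
    if value < 1 or value > 10:
        return False, "Input must be between 1 and 10"
    return True, "Accepted"
```

The six values called out in the range check (0, 1, 2, 9, 10, 11) are exactly the boundary value analysis candidates for this field.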
The AI agent provided with a reference document (Agent 2) will outperform the untrained agent (Agent 1) in all categories: BVA coverage, negative test coverage, and I/O structure.
Two agents were used in this experiment. Each run used fresh agents with no prior context. One agent was given only the prompt, while the other was first given a reference document explaining the concepts of negative testing and boundary value analysis, followed by the same prompt.
The prompt was:
“Generate functional test cases for a simple input field that accepts whole numbers between 1 and 10 and has a submit button.”
The word “functional” was included to exclude non-functional test cases such as security tests (e.g. SQL injection), which earlier test runs had produced. The phrase “whole numbers” was included to remove ambiguity around decimal inputs.
No additional context was provided — such as error handling expectations or what happens after a successful submission — in order to present both agents with the same level of ambiguity that a real-world tester might encounter.
After each run, the agent was asked to provide its test cases in .csv format, which were then imported into a spreadsheet for analysis and scoring.
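As a rough illustration of that step, the sketch below reads one run’s exported .csv; the column names (test_id, input, expected_result) are assumed for illustration, since the agents were only asked for .csv output rather than a fixed schema.

```python
import csv

def load_agent_test_cases(path: str) -> list[dict]:
    """Load one agent run's exported test cases from a .csv file.
    The column names used here are hypothetical, not a fixed schema."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {
                "test_id": (row.get("test_id") or "").strip(),
                "input": (row.get("input") or "").strip(),
                "expected_result": (row.get("expected_result") or "").strip(),
            }
            for row in csv.DictReader(f)
        ]

# e.g. cases = load_agent_test_cases("agent1_run3.csv")
```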
The test was run five times per agent, for a total of 10 test runs.
Controlled Variables: the prompt wording, the AI model, the use of fresh agents with no prior context for each run, and the request for .csv output.
Independent Variable: whether the agent was given the reference document on negative testing and boundary value analysis before the prompt.
Dependent Variables: BVA coverage, negative test coverage, and I/O structure, measured against the control test cases and the point-based scoring rubric.
Results were evaluated against a control set of 16 test cases, divided into three categories:
Happy Path (1 test case)
BVA Test Cases (6 test cases): 0, 1, 2, 9, 10, 11
Negative Test Cases (9 test cases)
A test case was considered a match if the input value fell within the accepted inputs defined in the control, regardless of how the agent categorized or labeled it (a code sketch of this matching rule appears after the rubric below).
Point Scoring Rubric (maximum 8 points):
BVA Coverage (0–3 points)
Negative Test Coverage (0–3 points)
I/O Structure (0–2 points)
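A minimal sketch of the matching step follows. The six BVA values (0, 1, 2, 9, 10, 11) follow directly from the 1–10 range; the happy-path value and the nine negative inputs shown are placeholders, since the full control set is not reproduced here.

```python
# Control set sketch: each control test case is a set of inputs that count
# as a match for it. Only the BVA values are taken from the experiment; the
# happy-path value and negative inputs below are illustrative placeholders.
CONTROL = {
    "bva": [{"0"}, {"1"}, {"2"}, {"9"}, {"10"}, {"11"}],
    "happy_path": [{"5"}],
    "negative": [{"abc"}, {""}, {"-1"}, {"3.5"}, {"!@#"},
                 {"one"}, {" "}, {"1e2"}, {"999"}],
}

def match_rate(agent_inputs: list[str], category: str) -> tuple[int, int]:
    """Count how many control cases in a category are matched by any input
    the agent produced, regardless of how the agent labeled that input."""
    controls = CONTROL[category]
    matched = sum(
        any(inp in accepted for inp in agent_inputs) for accepted in controls
    )
    return matched, len(controls)
```

Counting matches per category per run is what fills the “Matched” columns in the tables below.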
The hypothesis was not supported. Agent 2 did not outperform Agent 1 in all categories. While Agent 2 demonstrated superior BVA coverage, it underperformed significantly on negative test coverage. Both agents scored identically on I/O structure and achieved the same average total point score.
BVA Coverage: Agent 2 achieved perfect BVA coverage in all five runs (6/6 every run, 100% average). Agent 1 achieved full BVA coverage in only two of five runs (80% average), consistently missing the “just inside” boundary values (2 and 9) in the other three runs. The reference document clearly and consistently improved BVA performance.
Negative Test Coverage: Agent 1 significantly outperformed Agent 2 on negative test coverage. Agent 1 averaged 8.0/9 matched negative test cases per run (89% average), while Agent 2 averaged only 4.6/9 (51% average). The negative test cases Agent 2 missed most often were those that fell outside the scope of the reference document’s examples, a pattern discussed further below.
I/O Structure: Both agents performed equally well on I/O structure. All runs from both agents received a 2/2, indicating that AI agents naturally produce well-structured test cases with clear inputs and expected outputs without requiring explicit instruction.
The Reference Document Created a Bias: The most significant and unexpected finding of this experiment is that the reference document appeared to constrain Agent 2’s thinking rather than expand it. Agent 2’s outputs were closely aligned with the types of test cases described in the reference document, suggesting the agent treated the document’s examples as a checklist rather than as a framework for broader thinking. This likely explains why Agent 2 consistently missed negative test cases that fell outside the scope of the document’s examples.
Agent 1 Was More Exploratory: Agent 1 included test cases not present in the control set, demonstrating broader exploratory thinking. Examples included inputs with leading/trailing spaces, decimal values formatted as whole numbers (e.g. 3.0), scientific notation, and combinations of letters and numbers. This suggests that without a reference document anchoring its approach, Agent 1 explored the problem space more freely.
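Expressed as test data, those exploratory inputs would look roughly like the parametrized sketch below; whether whitespace-padded values should be accepted is one of the ambiguities the prompt left open, so the strict reading used here is an assumption.

```python
import pytest

# Exploratory inputs of the kind Agent 1 produced beyond the control set.
# Under a strict reading of "whole numbers between 1 and 10", all of them
# should be rejected by the field.
EXPLORATORY_INPUTS = [
    " 5 ",   # leading/trailing spaces around an otherwise valid value
    "3.0",   # decimal formatted to look like a whole number
    "1e1",   # scientific notation
    "4a",    # mixed letters and digits
]

@pytest.mark.parametrize("raw", EXPLORATORY_INPUTS)
def test_exploratory_input_is_rejected(raw):
    # Same strict rule as the validator sketched earlier: only plain digit
    # strings in the range 1-10 are accepted.
    accepted = raw.isdigit() and 1 <= int(raw) <= 10
    assert not accepted
```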
Redundant Test Cases: Redundant test cases were observed in several Agent 1 runs — for example, testing 0 twice under different descriptions. While this would need to be cleaned up in a real test suite, it is a minor issue and does not significantly impact coverage quality.
Match Rate Scores (out of 16 control test cases):
| Run | BVA Matched | % BVA | Negative Matched | % Negative | Happy Path | Total Matched | % Total |
|---|---|---|---|---|---|---|---|
| A1R1 | 4/6 | 67% | 9/9 | 100% | 1/1 | 14/16 | 88% |
| A1R2 | 4/6 | 67% | 8/9 | 89% | 1/1 | 13/16 | 81% |
| A1R3 | 6/6 | 100% | 8/9 | 89% | 1/1 | 15/16 | 94% |
| A1R4 | 6/6 | 100% | 7/9 | 78% | 1/1 | 14/16 | 88% |
| A1R5 | 4/6 | 67% | 8/9 | 89% | 1/1 | 13/16 | 81% |
| A1 Average | 4.8/6 | 80% | 8.0/9 | 89% | 1/1 | 13.8/16 | 86% |
| A2R1 | 6/6 | 100% | 4/9 | 44% | 1/1 | 11/16 | 69% |
| A2R2 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R3 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R4 | 6/6 | 100% | 5/9 | 56% | 1/1 | 12/16 | 75% |
| A2R5 | 6/6 | 100% | 4/9 | 44% | 1/1 | 11/16 | 69% |
| A2 Average | 6.0/6 | 100% | 4.6/9 | 51% | 1/1 | 11.6/16 | 73% |
Weighted Match Score (BVA 50%, Negative 35%, Total 15%):
| Metric | Agent 1 | Agent 2 |
|---|---|---|
| BVA Average | 80% | 100% |
| Negative Average | 89% | 51% |
| Total Average | 86% | 73% |
| Weighted Score | 85.2% | 80.1% |
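The weighted score combines the three category averages using the weights stated above. A minimal sketch of that calculation follows, assuming the weights are applied directly to the category averages from the match-rate table; the exact rounding behind the reported figures is not specified.

```python
def weighted_match_score(bva_pct: float, negative_pct: float, total_pct: float) -> float:
    """Combine category match rates using the stated weights:
    BVA 50%, negative coverage 35%, overall match rate 15%."""
    return 0.50 * bva_pct + 0.35 * negative_pct + 0.15 * total_pct
```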
Point Scores (out of 8):
| Run | BVA (0-3) | Negative (0-3) | I/O (0-2) | Total (0-8) |
|---|---|---|---|---|
| A1R1 | 2 | 3 | 2 | 7 |
| A1R2 | 2 | 2 | 2 | 6 |
| A1R3 | 3 | 2 | 2 | 7 |
| A1R4 | 3 | 2 | 2 | 7 |
| A1R5 | 2 | 2 | 2 | 6 |
| A1 Average | 2.4 | 2.2 | 2.0 | 6.6 |
| A2R1 | 3 | 1 | 2 | 6 |
| A2R2 | 3 | 2 | 2 | 7 |
| A2R3 | 3 | 2 | 2 | 7 |
| A2R4 | 3 | 2 | 2 | 7 |
| A2R5 | 3 | 1 | 2 | 6 |
| A2 Average | 3.0 | 1.6 | 2.0 | 6.6 |
Both agents averaged 6.6/8 on the point scoring rubric, confirming that the reference document did not produce an overall improvement in score. Agent 2’s perfect BVA performance was directly offset by its weaker negative test coverage, resulting in an identical average total score.
This experiment raised several questions that could be explored in future experiments:
1. How should a reference document be structured to be most effective? The reference document used in this experiment included specific examples that closely mirrored the test prompt. This appeared to create a bias in the trained agent’s output. A future experiment could compare different document formats — abstract conceptual explanations vs. concrete examples vs. a combination of both — to determine which produces the best results.
2. What happens when the prompt is more complex? The feature tested in this experiment was intentionally simple. A more complex feature — such as a multi-field form with conditional logic, or a feature with unclear or ambiguous requirements — may produce significantly different results. It is worth investigating whether AI agents perform as well relative to human testers as the complexity of the feature increases.
3. What would happen if the AI was given an actual application to test? This experiment provided only a text description of a feature. A future experiment could explore what happens when an AI is given access to a working application and asked to generate test cases based on direct observation rather than a written prompt.
4. What happens if the reference document is given after the prompt? In this experiment, the reference document was always provided before the prompt. It is worth investigating whether the order of context matters — would an agent that generates test cases first and then reviews them against a reference document produce better coverage than one that reads the document first?
5. Can an AI evaluate and improve its own test cases? A follow-up experiment could ask the AI to review its own output after generation and identify any gaps or redundancies. This could reveal whether AI agents are capable of self-correction and whether that improves overall test coverage.
6. What happens with a more abstract reference document? The reference document used specific examples that closely mirrored the prompt. Would a more conceptual document that teaches the thinking behind the techniques — rather than demonstrating them with specific examples — produce less bias and better overall coverage?
7. How does AI perform on negative testing for more subjective features? Negative testing a numeric input field is relatively straightforward — the boundaries are clearly defined. It is less clear how AI agents would perform on features where the wrong inputs are less obvious, such as a free-text search field, a rich text editor, or a feature with user-defined inputs.
8. Does the AI model matter? This experiment was conducted using a specific version of Claude Sonnet. It is unclear whether the findings would hold across different AI models or future versions of the same model. A comparative experiment across multiple models using the same prompt and rubric could reveal whether the bias effect is model-specific or more universal.
9. How does AI-generated test coverage compare to human-generated test coverage? This experiment only compared AI to AI. A natural next step would be to compare the output of both agents against test cases written by an experienced human QA tester, using the same prompt and rubric, to better understand where AI adds value and where it falls short.
10. Would a larger sample size change the conclusion? With only five runs per agent, some variability is expected. A larger sample size — 20 or 50 runs per agent — would produce more statistically reliable results and could either reinforce or contradict the findings of this experiment.
The results of this experiment suggest that AI agents have a reasonable baseline understanding of software testing concepts, even without any reference material. The untrained agent consistently produced well-structured test cases that covered both boundary values and negative scenarios — arguably well enough to be useful in a real-world setting, at least for simple features.
The more interesting finding is what happened when the agent was given a reference document. Rather than using the document as a springboard for broader thinking, the agent appeared to use it as a boundary — producing test cases that closely mirrored the examples provided while missing scenarios that fell outside of them. This is worth keeping in mind when considering how to use AI in a testing workflow. Providing an AI with documentation or examples may improve consistency and structure, but it could come at the cost of coverage breadth.
This mirrors a well-known challenge in human testing as well. Testers who follow a rigid test plan sometimes miss bugs that fall outside of it, while exploratory testers — those given freedom to investigate without a strict script — often catch things nobody thought to document. The untrained agent in this experiment behaved more like an exploratory tester, while the trained agent behaved more like one following a plan.
It is also worth noting that neither agent produced a perfect set of test cases. Both missed edge cases, included redundant tests, and made assumptions about expected behavior where the requirements were ambiguous. This suggests that AI-generated test cases should be treated as a starting point rather than a final deliverable — useful for generating coverage quickly, but still requiring human review and judgment before being used in a real test suite.