The word “experiment” has narrowed in many product teams.
It often means one thing: an A/B test. Two variants, split traffic, a primary metric, a winner.
That technique has been useful. It moved a generation of teams away from opinion-led decision-making and towards decisions grounded in observable behaviour. The world is better for that shift.
Mature teams know its limits. They know when to use discovery research, when to prototype, when to segment, when to hold back a control, when to run a staged rollout, and when a result is too local to generalise.
The deeper problem begins after that discipline is in place. Experiments are often treated as instruments for choosing between options, when their more important role is to help teams understand what can be learned, claimed, trusted, and scaled.
A test can show that one variant outperformed another for a defined group, on a defined metric, over a defined period. That is useful evidence. It does not settle why it worked, whether it will persist, whether it will transfer, whether it created value, or whether the behaviour it produced is one the organisation should be comfortable scaling.
The experiment may be well run. The learning may still be too narrow.
If we treat experimentation as a way of learning under uncertainty, rather than a way of selecting which variant wins, the picture changes. Different uncertainties need different experiments. The shape of the right experiment depends on the shape of the question.
Results have boundaries
An A/B test can show that a controlled change affected a defined behaviour for the users included in the test. That boundary is what makes the result useful. It is also what stops the result from answering every adjacent question the organisation may want settled.
A variant can increase completion while leaving the quality of the experience unresolved. It can improve conversion while weakening comprehension. It can perform well for the users in the sample while creating problems for users at the edge. It can produce a short-term lift while persistence, transferability, and unintended consequences remain unknown.
Those are reasonable follow-on questions. They are not automatically findings.
This matters because the gap between “the variant won” and “we have learned what we wanted to learn” is often where overclaiming happens. The mechanism is under-explained. The persistence is unknown. The transferability is untested. The unintended consequences are unmeasured. The team has a clear result and a much less clear understanding of what the result means.
Those are the boundaries of what the instrument is built to see. Outside those boundaries, other instruments are needed.
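To make that boundary concrete, here is a minimal sketch of the inference a simple two-variant test actually licenses. The metric, counts, and population are invented for illustration; the point is that even a clean, significant result is a bounded claim, not a general truth.

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, z, p_value

# Hypothetical test: new-user signups from one channel over two weeks.
lift, z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"lift = {lift:+.2%}, z = {z:.2f}, p = {p:.4f}")
# A significant result here supports only this: variant B outperformed A on this
# metric, for this population, during this period. It is silent on mechanism,
# persistence, transfer, and side effects.
```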
A wider picture of experimentation
If experimentation is the design of learning under uncertainty, the toolkit becomes wider than A/B testing.
The methods are best understood as different forms of protection against different failures of inference.
A prototype protects against building something people do not understand or want. A usability test protects against assuming the journey works because the concept makes sense. A concierge or Wizard of Oz test protects against committing to infrastructure before the value proposition is established.
A pilot protects against mistaking desirability for operational viability: can the team actually run this, does the experience hold up, what breaks first?
An A/B test protects against relying on preference, opinion, or hierarchy when behaviour can be observed under controlled conditions. A quasi-experimental comparison protects against assuming that change over time was caused by the intervention when randomisation is unavailable or inappropriate. A formal randomised evaluation protects against harder forms of selection and comparison bias when the stakes justify stronger control, pre-specification, independent scrutiny, and longer-term outcome measurement.
These methods are not interchangeable. Each defends against a different kind of wrongness. Choosing well means knowing which failure the moment cannot afford.
The deeper skill is diagnostic judgement: recognising the kind of learning the moment calls for, then choosing the method that protects the work from the failure modes that matter most.
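As one example of how these protections differ mechanically, a difference-in-differences comparison, sketched below with invented numbers, subtracts the background trend that a naive before/after reading would fold into the “effect”:

```python
# Illustrative difference-in-differences with invented averages.
treated_before, treated_after = 0.40, 0.52        # e.g. completion rate
comparison_before, comparison_after = 0.38, 0.45  # similar group, no intervention

naive_effect = treated_after - treated_before            # what before/after alone sees
background_trend = comparison_after - comparison_before  # what changed anyway
did_estimate = naive_effect - background_trend           # effect net of trend

print(f"naive before/after: {naive_effect:+.0%}")
print(f"difference-in-differences: {did_estimate:+.0%}")
# Over half of the naive "effect" was trend, not intervention.
```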
Designers shape the experiment, not just the interface
Product designers are part of the experimental apparatus, whether or not they think of themselves that way.
The design of the variant determines what the experiment is actually measuring. A confusingly designed treatment can make a sound idea look ineffective. A treatment that includes too many simultaneous changes makes attribution impossible, no matter how clean the analysis.
The experimental signal is downstream of the design.
The choice of population and entry point determines who is being studied. A test that runs only on highly engaged users tells you something about highly engaged users. It does not automatically tell you about new users, excluded users, reluctant users, or the people for whom the service may matter most. These choices are often made through design decisions before they appear as analytical decisions.
The framing of success determines what the experiment can find. A success metric defined narrowly, clicks rather than completed and understood journeys, say, will produce findings that look definitive but generalise poorly. A success metric defined broadly may be more honest but harder to act on. Both are design problems before they are measurement problems.
The reporting layer matters too. The dashboards that translate raw experimental data into organisational understanding are themselves a design output. The decisions teams make based on experiments are mediated by how those experiments are presented. A poorly designed reporting layer can make a sound experiment lead to a wrong conclusion.
This is one of the places where design leadership is undervalued. Designers do not need to run the statistical analysis to shape the quality of the experiment. The work of asking what we are trying to learn, who we are trying to learn it from, and how we will know when we have learned it is design work in the broader sense.
It brings questions of meaning, audience, behaviour, context, and consequence to the apparatus of learning itself.
Hypotheses and claims are different things
Product teams are comfortable with hypotheses.
We believe this change will improve activation. We believe this guidance will reduce support requests. We believe this new flow will increase completion.
Hypotheses are useful because they make assumptions explicit. But a hypothesis is only one part of the work.
A hypothesis says what you expect to happen. A claim says what the evidence will entitle you to say.
This is a hypothesis:
We believe the new onboarding flow will improve activation.
This is closer to a claim:
The new onboarding flow caused a measurable increase in activation among new users from this acquisition channel during the test period.
The second sentence is less comfortable, but more useful. It names the intervention, the outcome, the population, and the context; it uses explicit causal language; and it marks the boundary of the finding.
It also makes the limits visible. The claim may not apply to all users. It may not hold beyond the test period. It may not explain why activation increased. It may not tell us whether users received more value after activating. It may not tell us whether the increase is worth the trade-offs.
A good experiment does not eliminate all uncertainty. It tells us which uncertainty has been reduced.
That matters because experimental results accumulate. A roadmap is a sequence of decisions, each informed by the experiments that preceded it. Each step compounds whatever overclaiming was done at the previous step. A team that is mildly loose about what each experiment proves may, after a year, have built a roadmap on a series of mildly loose claims, none of which would survive serious scrutiny if examined individually.
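The arithmetic of that compounding is worth spelling out. Purely as an illustration, assume each claim is independently sound nine times out of ten:

```python
# Illustrative only: assumes each claim is independently sound with probability 0.9.
per_claim_soundness = 0.9
for steps in (1, 4, 8, 12):
    chain = per_claim_soundness ** steps
    print(f"{steps:>2} chained decisions: {chain:.0%} chance the whole chain holds")
# 1 -> 90%, 4 -> 66%, 8 -> 43%, 12 -> 28%. Mildly loose claims, seriously loose roadmap.
```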
The remedy is to define the claim before the experiment runs.
Not just the hypothesis. The claim.
What statement will we make based on what result? How strongly will we make it? What conditions will we attach to it? What will we explicitly avoid claiming, even if the result is positive?
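One lightweight way to hold that discipline is to write the claim down as a structured artefact before launch. The sketch below is illustrative rather than a standard; every field name is an assumption about what a team might choose to record.

```python
from dataclasses import dataclass, field

@dataclass
class PreSpecifiedClaim:
    """The claim a team commits to before the experiment runs."""
    intervention: str                  # what we are changing
    outcome: str                       # the metric the claim is about
    population: str                    # who the claim covers
    context: str                       # channel, platform, test window
    claim_if_positive: str             # the exact sentence a positive result earns
    conditions: list[str] = field(default_factory=list)   # caveats we will attach
    not_claimed: list[str] = field(default_factory=list)  # what we refuse to infer

onboarding = PreSpecifiedClaim(
    intervention="new onboarding flow",
    outcome="activation rate",
    population="new users from the paid-search channel",
    context="web signup, four-week test window",
    claim_if_positive="The new flow caused a measurable increase in activation "
                      "among new users from this channel during the test period.",
    conditions=["tested channel only", "persistence unmeasured"],
    not_claimed=["mechanism", "post-activation value", "transfer to other channels"],
)
```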
Ethics is a claim problem too
Every experiment carries an implied claim about acceptable risk.
In social impact work, that claim is usually visible. Participants may be vulnerable. The intervention may affect support, opportunity, health, confidence, safety, or access. Decisions about comparison groups, consent, data collection, and evaluation design shape people’s real conditions.
Product experiments carry ethical claims as well. A pricing experiment, recommendation change, onboarding flow, ranking system, or AI-assisted workflow can alter what people see, understand, choose, disclose, pay for, trust, or avoid.
The experiment is claiming that the uncertainty is worth resolving, that the method is proportionate to what is being learned, that the people affected are not being used merely as a means to organisational improvement, and that the result will be interpreted within its limits.
Experiments can improve conversion by reducing comprehension, increase engagement by increasing anxiety, or optimise one group’s experience while quietly excluding another. The fact that consequences are diffuse rather than concentrated does not make them absent.
A mature experimentation culture asks: what kind of behaviour did we produce, for whom, at what cost, and would we be comfortable scaling it?
Responsible experimentation lets teams move quickly without becoming careless about what they are learning.
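Some of that care can be made mechanical. Below is a sketch of a guardrail check, with invented metric names and thresholds, that refuses to call a winner when the primary metric improved at a guardrail’s expense:

```python
def safe_to_scale(primary_lift: float, guardrail_changes: dict[str, float],
                  worst_tolerable_drop: float = -0.01) -> bool:
    """A primary-metric winner is scalable only if no guardrail metric
    (comprehension, complaint rate, opt-outs, ...) degraded past its threshold."""
    if primary_lift <= 0:
        return False
    breached = {m: d for m, d in guardrail_changes.items() if d < worst_tolerable_drop}
    if breached:
        print(f"Primary lift {primary_lift:+.1%}, but guardrails breached: {breached}")
        return False
    return True

# Conversion up 4% while comprehension fell 6%: a lift we should not be scaling.
safe_to_scale(0.04, {"comprehension": -0.06, "complaint_rate": 0.00})
```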
From testing variants to designing learning
A mature experimentation culture is defined by the quality of its learning.
Some organisations run many tests and learn very little because the tests are disconnected from meaningful decisions. Others run fewer experiments and learn more because each one is tied to a clear uncertainty, a credible claim, and a decision that will actually change.
The useful shift is from asking “what should we test next?” to asking:
- What decision are we trying to make?
- What do we already know?
- What remains uncertain?
- What claim would we like to make?
- What evidence would make that claim credible?
- What is the lightest responsible way to get that evidence?
- What will we do differently once we have it?
Those questions travel across contexts: commercial products, AI-enabled workflows, public services, and social interventions.
Experimentation, at its best, is the design of evidence under uncertainty. It matches the form of learning to the question, the decision, and the consequences of being wrong.
The real skill is diagnostic judgement: recognising what kind of experiment the moment calls for, then choosing it over the default A/B test.