Misunderstood? p-values in CRO

Keywords: Experimentation, A/B testing, CRO, Ecommerce, p-values

Prof Dan Franks

Monday, May 12, 2025

If you use any form of statistics for decision-making or “optimization,” you need to understand what those statistics actually mean. p-values are no exception. They’re one of the most commonly used metrics in CRO - especially in A/B testing - yet they’re probably also among the most widely misunderstood.

p-values don’t measure uplift. They don’t quantify risk. They are for hypothesis testing, but even then they don’t provide certainty about a hypothesis (e.g. “B is better than A”). And they don’t answer the one thing your client actually cares about: should we act on this result? What you really want to know is: “Is the effect of our change large enough to matter - and are we confident enough to act given the data we have?” That is not what p-values give you.

It’s not just CRO. p-values are often misunderstood in some scientific disciplines too. Despite intense efforts to correct this, the message has often fallen on deaf ears, and some statistically minded scientists have become fed up enough to call for p-values to be banned outright. Understandable, but we’d prefer to a) not ban p-values, but demote them appropriately, and b) understand and apply them properly (if we choose to use them at all).

At Optifi, we don’t use p-values; we’re fully Bayesian. But given that they’re probably still the main statistical tool used for CRO in ecommerce, we thought we’d help explain what they actually mean, and how you can demote their importance and use them alongside other statistics to make better decisions.

What a p-value Does NOT Tell You

  • Whether your hypothesis is true
  • The probability the null is true
  • The probability A and B differ
  • Whether the effect is large, valuable, or meaningful
  • Whether you should take action

p-values Aren’t Built for Business Decisions

Your clients care about things like:

  • How big is the uplift?
  • How confident can we be?
  • Is this result worth acting on?
  • What’s the cost/benefit tradeoff?

p-values answer none of that. Instead, a p-value answers this:

Assuming there’s no difference between A and B (i.e. the narrow null hypothesis), how likely is it to observe data as extreme as - or more extreme than - what you actually saw?
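
To make that concrete, here’s a minimal sketch of how that number is typically computed for a conversion A/B test, using a two-proportion z-test (the visitor and conversion counts are made up purely for illustration):

```python
import numpy as np
from scipy.stats import norm

# Illustrative (made-up) data: visitors and conversions per variant
n_a, conv_a = 10_000, 1_000   # variant A: 10.0% conversion rate
n_b, conv_b = 10_000, 1_100   # variant B: 11.0% conversion rate

p_a, p_b = conv_a / n_a, conv_b / n_b

# Under the null hypothesis both variants share a single conversion rate,
# so pool the data to estimate it and the standard error of the difference.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))  # two-sided: P(data at least this extreme | null)

print(f"observed lift: {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Note the conditional in that last comment: everything here is computed under the assumption that the null is true.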

So p-values only ever tell you something about the null hypothesis, and even then it’s not the probability that the null is true.

But really, you want to know something about your hypothesis.

A p-value is the probability of data at least as extreme as yours, given the null hypothesis - P(data | null). But in practice, what you want is the reverse: the probability of your hypothesis being true given the data - P(hypothesis | data).

That is, you want to know whether the data provide evidence for your hypothesis (or which hypothesis they support), and how confident you can be based on the data you have. If that’s what you’re after, then you’re not looking for a p-value - you’re talking Bayesian. This is not just a technicality.
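
For contrast, here’s a minimal Bayesian sketch on the same illustrative counts as above. With a conjugate Beta-Binomial model (flat Beta(1, 1) priors, chosen purely for simplicity), you can compute the quantity you actually care about: the probability that B beats A, given the data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same illustrative (made-up) data as before
n_a, conv_a = 10_000, 1_000
n_b, conv_b = 10_000, 1_100

# Beta-Binomial conjugacy: with a Beta(1, 1) prior, each variant's posterior
# conversion rate is Beta(1 + conversions, 1 + non-conversions).
draws = 200_000
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)

# P(hypothesis | data): the share of posterior draws in which B beats A
print(f"P(B > A | data) = {np.mean(post_b > post_a):.3f}")
```

This is a direct probability statement about the hypothesis itself - exactly the quantity a p-value cannot give you.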

p-values Don’t Measure Evidence

Evidence involves comparing competing explanations. p-values don’t do that. They essentially tell you how weird your data would be if the null were true. They say nothing about how well any alternative explains the data.

p-values don’t behave like valid evidence metrics. A p-value of 0.01 isn’t “twice as much evidence” as one of 0.02. Evidence should be coherent, additive, and interpretable; p-values are none of these things.

The Null Is Rarely the Right Question for Decision Making

In CRO - and in decision-making generally - the goal is usually: “Is the effect large enough to matter, and are we confident enough to act?”

Standard hypothesis testing focuses on rejecting a sharp null (zero effect), but that’s rarely the real concern. In real-world systems, some difference almost always exists. The key question is whether that difference is large enough and certain enough to justify action.

What Should We Do?

Some teams supplement p-values with effect sizes and confidence intervals. That helps and should be applauded. It gets you some of the way there.

It’s still indirect, though. A confidence interval doesn’t give you a full probability distribution, so you can’t properly compute the probability that the effect exceeds a meaningful business threshold. And confidence intervals don’t support decision-making under uncertainty in a principled, decision-theoretic way.

Ideally, we don’t just ask, “Is B better than A?”

We ask, “By how much, how confident are we, and is it worth acting on?”

A 0.3% lift might be statistically significant with enough data - but commercially irrelevant. You need methods that unify statistical inference with business value.
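
To see how easily this happens, here’s a quick check with illustrative numbers: a 10% baseline conversion rate, a 0.3% relative lift, and twenty million users per arm.

```python
import numpy as np
from scipy.stats import norm

# Illustrative numbers: a tiny relative lift observed at enormous scale
n = 20_000_000                 # users per arm
conv_a = 2_000_000             # A converts at 10.00%
conv_b = 2_006_000             # B converts at 10.03% (a 0.3% relative lift)

p_a, p_b = conv_a / n, conv_b / n
p_pool = (conv_a + conv_b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))

print(f"relative lift: {(p_b - p_a) / p_a:.1%}, p = {p_value:.4f}")
# p comes out well below 0.05, yet a 0.3% lift may be worth far less than
# the cost of building, shipping, and maintaining the change.
```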

Bayesian methods provide a full probability distribution over the effect size, combining magnitude and uncertainty. Unlike pairing p-values with effect sizes, there is no separation between “is it real?” and “is it useful?” - both are integrated.

With this approach, you can calculate something actionable:

What’s the probability the effect exceeds a minimum threshold that justifies rollout?

For example: P(effect > δ) > 95%, where δ is a business-defined minimum viable lift.
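
As a sketch of how that looks in practice (reusing the illustrative counts from earlier, with a hypothetical δ of a 2% relative lift), it’s a one-line computation on the posterior samples:

```python
import numpy as np

rng = np.random.default_rng(42)

n_a, conv_a = 10_000, 1_000   # illustrative data, as before
n_b, conv_b = 10_000, 1_100

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# Posterior over the *relative* lift of B vs A
lift = (post_b - post_a) / post_a

delta = 0.02  # hypothetical business threshold: 2% minimum viable lift
print(f"P(lift > {delta:.0%} | data) = {np.mean(lift > delta):.3f}")
print(f"posterior mean lift = {lift.mean():.1%}")
```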

This isn’t just more interpretable, it connects naturally to decision theory, where choices are based on expected outcomes. You evaluate trade-offs under uncertainty, align analysis with goals, and choose actions based on expected value.
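
As a minimal decision-theoretic sketch (the future traffic, value per conversion, and rollout cost below are all hypothetical numbers you’d replace with your own), you can weigh the expected value of shipping B against the cost of acting:

```python
import numpy as np

rng = np.random.default_rng(42)

n_a, conv_a = 10_000, 1_000   # illustrative data, as before
n_b, conv_b = 10_000, 1_100

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

# Hypothetical business inputs
future_visitors = 1_000_000   # traffic the change would apply to
value_per_conv = 40.0         # revenue per conversion
rollout_cost = 50_000.0       # engineering + rollout cost of shipping B

# Expected value of shipping B: average the incremental revenue over the
# full posterior, so uncertainty is priced into the decision automatically.
incremental = (post_b - post_a) * future_visitors * value_per_conv
expected_net_gain = incremental.mean() - rollout_cost

print(f"expected net gain of shipping B: {expected_net_gain:,.0f}")
print(f"P(shipping B loses money) = {np.mean(incremental < rollout_cost):.3f}")
```

The decision rule is then simple: ship when the expected net gain is positive (or clears whatever margin the business requires).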

The Bottom Line

p-values are widely used in CRO and A/B testing, but they’re often misunderstood and frequently misused - especially when teams peek at results and stop tests as soon as a “significant” result appears (blog on this tomorrow). While p-values can highlight surprising data under a no-effect assumption, they don’t tell you whether an effect is meaningful, reliable, or worth acting on. In practice, too much weight is placed on them. At a minimum, they should be paired with effect sizes, uncertainty intervals, and clear decision thresholds. CRO benefits from inference that supports decisions. Ideally, teams move toward approaches like Bayesian inference, which provide a fuller view of possible outcomes and better support decision-making under uncertainty. Either way, it’s time to demote p-values from being the final word in experimentation and statistical analysis.
