p values: How Misuse Could Undermine CRO

Keywords: Experimentation, A/B testing, CRO, ecommerce, p-values, p-hacking, optional stopping

Dan Franks

Monday, May 12, 2025

Following on from our breakdown of what p-values actually are - and what they aren’t - this post focuses on how they can be abused when used incorrectly in CRO. These aren’t minor technicalities. They break the assumptions behind p-values, invalidate your inference, and lead to decisions that sound confident but rest on invalid inference.

There are many good CRO agencies out there doing all of this correctly - but some self-serve testing apps default to bad behaviour.

Early Stopping? Just Stop It.

One of the most widespread misuses of p-values in A/B testing is optional stopping: checking results repeatedly and stopping the moment the p-value drops below 0.05. Some Shopify A/B testing apps even default to this behaviour.

If your experiment ends when the p-value becomes “significant,” your inference is not valid. You’re not running a statistical test; you’re running a significance-triggered lottery. You’d be better off guessing than drawing conclusions from a process that systematically inflates false positives.

Frequentist p-values require a pre-specified power calculation. You must determine your required sample size before running the test and commit to that stopping point. If you ignore this and stop early based on significance, your false positive rate climbs well beyond the intended 5%. The more often you look, and the more willing you are to stop when p < 0.05, the more likely you are to declare a false win.
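
A quick simulation makes this concrete. The sketch below (the conversion rate, number of looks, and sample sizes are illustrative assumptions, not figures from any real test or app) runs A/A experiments where the two variants are identical, peeks at a standard two-proportion test twenty times, and stops at the first p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeked_test(n_total=10_000, n_looks=20, p_conv=0.05, alpha=0.05):
    """Simulate one A/A test (no real difference) with repeated peeking.

    Returns True if any interim look shows p < alpha, i.e. a false 'win'.
    """
    a = rng.random(n_total) < p_conv  # conversions for control
    b = rng.random(n_total) < p_conv  # conversions for variant (same true rate)
    look_points = np.linspace(n_total // n_looks, n_total, n_looks, dtype=int)
    for n in look_points:
        # Two-proportion comparison via a chi-squared test on a 2x2 table.
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, p, _, _ = stats.chi2_contingency(table)
        if p < alpha:
            return True  # stopped early and declared a "winner"
    return False

n_sims = 2_000
false_wins = sum(peeked_test() for _ in range(n_sims))
print(f"False positive rate with 20 peeks: {false_wins / n_sims:.1%}")
# There is never a real difference, yet the rate lands well above the nominal 5%.
```

Run the same simulation with a single look at the pre-planned sample size and the false positive rate falls back to roughly 5% - the peeking is what breaks it.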

Data Dredging and p-Hacking: Manufacturing Significance

Another massive problem is data dredging: slicing and reslicing your results until something looks significant. In CRO, this could take the form of:

  • Testing dozens of metrics and reporting only those with p < 0.05
  • Testing a load of random changes rather than pre-specified hypotheses
  • Segmenting endlessly (device, geography, behavior) and declaring subgroup wins
  • Running many variants and highlighting the one that “worked”
  • Retesting small changes until a significant result appears
  • Collecting more data than the pre-calculated sample size because a significant result has not appeared yet

Whether intentional or not, all of this invalidates the p-value. The test is no longer pre-specified, and the more choices you make after seeing the data, the less real meaning any “significant” result carries. With twenty independent metrics and no true effect anywhere, the chance that at least one comes out with p < 0.05 by luck alone is about 64% (1 - 0.95^20).

It’s not that you can’t run multiple tests or explore subgroups - you can. But you must adjust for multiple comparisons using appropriate corrections (e.g., Bonferroni, Holm, false discovery rate), or use modeling strategies that account for selection. In CRO this is rarely done, and the result is a steady stream of unreliable “wins.” And when it is done properly, tests need considerably more data and take far longer to run.
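
As a rough illustration of what adjusting actually looks like, here is a minimal sketch using statsmodels; the raw p-values are hypothetical stand-ins for ten metrics or segments tested within one experiment:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from testing ten metrics/segments in one experiment.
raw_p = [0.003, 0.021, 0.032, 0.048, 0.062, 0.11, 0.24, 0.41, 0.57, 0.83]

naive_wins = sum(p < 0.05 for p in raw_p)  # what an uncorrected dashboard reports

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(f"{method:>10}: {sum(reject)} significant after correction "
          f"(vs {naive_wins} before)")
```

Most of the naive “wins” disappear once the number of comparisons is accounted for.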

Summary

p-values are abused when used outside their assumptions. Peeking at the results and stopping early inflates false positives by turning a fixed test into a significance-triggered stopping rule. Data dredging and p-hacking introduce bias by selectively reporting only “successful” slices of the data. Both practices invalidate the statistical meaning of the p-value.

You can run multiple comparisons. You can explore segments. But if you don’t adjust your inference accordingly, you’re not doing valid testing - you’re capitalizing on randomness.

If you’re going to use p-values, you need to pre-specify your hypotheses, commit to fixed pre-calculated sample sizes, and correct for multiple comparisons. If you can’t or won’t do that, use methods that are robust to adaptive workflows and continuous learning, especially Bayesian approaches that align better with the needs of CRO.
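
If you take the fixed-sample route, the pre-calculated sample size can be sketched like this (the baseline conversion rate, minimum detectable lift, and power target below are hypothetical inputs you would choose before the test starts):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.030   # hypothetical current conversion rate
target = 0.033     # smallest lift worth detecting (a 10% relative uplift)

effect = proportion_effectsize(target, baseline)  # Cohen's h for two proportions
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # the false positive rate you commit to
    power=0.80,            # chance of detecting the lift if it is real
    alternative="two-sided",
)
print(f"Required visitors per variant: {n_per_arm:,.0f}")
# Commit to this number up front and evaluate the p-value once, when you reach it.
```

For inputs like these the answer runs into tens of thousands of visitors per variant, which is exactly why properly corrected frequentist testing can feel slow in CRO.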

Maximize Profitability with Optifi Autopilot

Talk with our team about how Optifi can help you

Talk to us