How to Do a Power Analysis
A practical guide to power analysis for A/B tests — what it is, how to calculate sample sizes, how to choose your MDE, and the common mistakes that invalidate your experiment before it starts.
If you've ever been asked "how long should we run this experiment?" in an interview or at work, the answer is a power analysis. And yet most data scientists either skip it, butcher it, or cargo-cult a formula they found on Stack Overflow without understanding what it's actually telling them.
This post is a practical guide. We'll cover what power analysis is, why it matters, how to actually do one, and — just as importantly — the common ways people get it wrong.
What a Power Analysis Actually Does
An A/B test is a hypothesis test. You're comparing a control group to a treatment group and asking: is the difference I observe real, or could it be noise?
A power analysis answers a prerequisite question: given the size of the effect I want to detect, how many observations do I need to have a good chance of detecting it?
That's it. It's a sample size calculation. But it requires you to think carefully about four quantities that are all connected to each other.
The Four Parameters
Every power analysis involves four things. Fix any three, and the fourth is determined.
1. Significance level (α)
The probability of rejecting the null hypothesis when it's actually true — a false positive. Convention is α = 0.05, meaning you accept a 5% chance of declaring a winner when there's no real difference.
This is the threshold for your p-value. If your p-value falls below α, you reject the null.
2. Statistical power (1 − β)
The probability of rejecting the null hypothesis when the alternative is true — correctly detecting a real effect. Convention is 0.80, meaning you want an 80% chance of detecting the effect if it exists.
β is the false negative rate. At 80% power, you have a 20% chance of missing a real effect. Some teams use 90% power for high-stakes decisions. Higher power requires larger samples.
3. Effect size
The magnitude of the difference you want to be able to detect. This is usually expressed as the minimum detectable effect (MDE) — the smallest difference that would be practically meaningful to your business.
This is where most of the thinking should happen. Your MDE isn't a statistical concept — it's a business decision. If your conversion rate is 5%, is a 0.1 percentage point lift worth detecting? What about 0.5 points? 1 point? The answer depends on your traffic, your revenue per conversion, and the cost of running the experiment longer.
4. Sample size (n)
The number of observations per group needed to achieve your desired power at your chosen significance level for your specified effect size.
This is usually what you're solving for.
The Relationship
Here's the key intuition: smaller effects require larger samples to detect. This is not a limitation of your analysis — it's a property of the universe. A tiny signal buried in noise requires more data to distinguish from zero.
The relationships work like this:
- Smaller MDE → larger n needed
- Higher power (1 − β) → larger n needed
- Smaller α (stricter significance threshold) → larger n needed
- Higher variance in your metric → larger n needed
These all point in the same direction. Detecting subtle effects with high confidence in noisy data requires a lot of observations.
Doing the Calculation
For comparing two proportions (e.g., conversion rates)
This is the most common case in product A/B testing. You have a control conversion rate p_c and you want to detect an absolute lift of δ, meaning the treatment rate is p_t = p_c + δ.
The approximate sample size per group for a two-sided test is:
n = (Z_{α/2} + Z_β)² × (p_c(1 − p_c) + p_t(1 − p_t)) / δ²
Where:
- Z_{α/2} is the critical value for your significance level (1.96 for α = 0.05 two-sided)
- Z_β is the critical value for your power (0.84 for 80% power, 1.28 for 90% power)
- p_c is the control proportion
- p_t is the treatment proportion (p_c + δ)
- δ is the absolute difference (p_t − p_c)
Example: Your signup page converts at 10%. You want to detect a 1 percentage point absolute lift (to 11%) with 80% power at α = 0.05.
n = (1.96 + 0.84)² × (0.10 × 0.90 + 0.11 × 0.89) / (0.01)²
n = (2.80)² × (0.090 + 0.0979) / 0.0001
n = 7.84 × 0.1879 / 0.0001
n ≈ 14,731 per group
So roughly 15,000 users per group, or about 30,000 total.
A quick sanity check: if you're trying to detect a 1pp lift on a 10% base rate, that's a 10% relative change. Needing ~15K per group feels right. If you only wanted to detect a 2pp lift, you'd need roughly a quarter of that (sample size scales with the inverse square of the effect size).
For comparing two means (e.g., revenue per user, time spent)
When your metric is continuous, you need to know (or estimate) the variance. The formula for a two-sample test with equal group sizes is:
n = (Z_{α/2} + Z_β)² × 2σ² / δ²
Where:
- σ² is the variance of the metric (assumed equal in both groups)
- δ is the difference in means you want to detect
- The factor of 2 accounts for having two groups
Example: Average revenue per user is $12.00 with a standard deviation of $25.00 (revenue data is notoriously skewed). You want to detect a $1.00 lift with 80% power at α = 0.05.
n = (1.96 + 0.84)² × 2 × (25)² / (1.00)²
n = 7.84 × 1250 / 1
n = 9,800 per group
Notice how the high variance ($25 standard deviation on a $12 mean) inflates the required sample. Revenue metrics are hard to experiment on precisely because of this variance. We'll talk about variance reduction later.
Using software
In practice, you don't need to compute this by hand. Python's statsmodels has functions for this:
```python
from statsmodels.stats.power import TTestIndPower

# For proportions -- direct calculation
z_alpha = 1.96   # two-sided, alpha = 0.05
z_beta = 0.84    # 80% power
p_c, p_t = 0.10, 0.11
delta = p_t - p_c
n = (z_alpha + z_beta)**2 * (p_c*(1-p_c) + p_t*(1-p_t)) / delta**2
# n ≈ 14,731 per group

# For means (using Cohen's d)
cohens_d = 1.00 / 25.00  # delta / sigma = 0.04
analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=cohens_d,
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
# n ≈ 9,800 per group
```
A note on tooling: statsmodels.stats.proportion.proportion_effectsize uses Cohen's h (an arcsine transformation) rather than the raw difference in proportions. This is a valid approach but will give a different sample size than the formula above. For proportions close together and not near 0 or 1, both approaches are reasonable — just be consistent.
R users can use the pwr package, which also uses Cohen's h for proportions.
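If you want to see what those tools are computing, Cohen's h is easy to work out by hand. A stdlib-only sketch for this post's 10% vs. 11% example (the per-group formula with the factor of 2 follows from the arcsine transform having variance ≈ 1/n):

```python
import math
from statistics import NormalDist

# Cohen's h for the post's example (10% vs 11%): the arcsine effect size
# that proportion_effectsize and R's pwr package use under the hood.
p_c, p_t = 0.10, 0.11
h = 2 * math.asin(math.sqrt(p_t)) - 2 * math.asin(math.sqrt(p_c))

z_a = NormalDist().inv_cdf(0.975)  # alpha = 0.05, two-sided
z_b = NormalDist().inv_cdf(0.80)   # 80% power
n_h = 2 * (z_a + z_b) ** 2 / h ** 2  # per group, two-sample arcsine test
print(f"h = {h:.4f}, n per group via Cohen's h ≈ {n_h:,.0f}")
# Lands within about 1% of the raw-difference formula's ~14.7K in this regime
```

For this example the two approaches agree to within a fraction of a percent; the gap widens as the proportions move toward 0 or 1, which is exactly when being consistent about which definition you used starts to matter.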
Choosing Your MDE
This is the hardest part of a power analysis, and it's not a statistical question. It's a business question.
Your MDE should be the smallest effect that would justify shipping the change. Think about it this way: if the experiment showed exactly this effect size and it was statistically significant, would you actually ship it?
Some frameworks for choosing an MDE:
Revenue-based: If your feature costs X to build and maintain, what lift in revenue would pay for it within a reasonable timeframe? Work backward from that to an effect size on your metric.
Opportunity cost: Every experiment that runs is a slot that another experiment can't use. If you're going to commit your entire user base for 4 weeks, the expected value of the lift should exceed the expected value of whatever else you could have tested.
Practical significance: A 0.01% conversion lift may be statistically detectable with enough traffic, but it's not going to move the business. Don't waste time trying to detect effects that don't matter even if they're real.
A common anti-pattern is setting the MDE to whatever effect size gives you a "reasonable" sample size. This is backwards. The MDE should come from business reasoning, and then the sample size follows — even if the answer is "we'd need to run this for 6 months, which means we shouldn't run it."
Common Mistakes
1. Peeking at results before the experiment is done
This is the most common and most dangerous mistake. If you calculated that you need 15,000 users per group and you check the results at 5,000, you've inflated your false positive rate. Substantially.
The math: a single test at α = 0.05 has a 5% false positive rate. But if you check results 5 times during the experiment and stop when you see significance, your effective false positive rate can exceed 14%, even under the null hypothesis. This is called the "peeking problem" or "optional stopping."
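You can verify the inflation with a short simulation. This stdlib-only sketch runs repeated A/A tests (no real effect) with five interim looks; the look schedule, per-look sample size, and trial count are illustrative assumptions:

```python
import random
from statistics import NormalDist

# Monte Carlo sketch of the peeking problem on an A/A test (no real effect).
random.seed(42)
Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96 for a two-sided test at alpha = 0.05

def peeking_aa_test(n_per_look=200, looks=5):
    """Return True if any interim look is 'significant' under the null."""
    sum_a = sum_b = 0.0
    n = 0
    for _ in range(looks):
        for _ in range(n_per_look):
            sum_a += random.gauss(0, 1)  # control, known sigma = 1
            sum_b += random.gauss(0, 1)  # treatment, identical distribution
        n += n_per_look
        z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5  # two-sample z-statistic
        if abs(z) > Z_CRIT:
            return True  # would have stopped and declared a winner here
    return False

trials = 2000
rate = sum(peeking_aa_test() for _ in range(trials)) / trials
print(f"False positive rate with 5 peeks: {rate:.1%}")  # well above the nominal 5%
```

With this setup the empirical rate comes out around 14%, matching the figure above, even though there is never any real effect to find.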
If you need to monitor results as they come in, use a sequential testing framework (like group sequential designs or always-valid p-values) that explicitly accounts for repeated looks at the data.
2. Ignoring the variance of your metric
For continuous metrics, the variance of your outcome drives the required sample size. Revenue, session duration, and engagement metrics often have extremely high variance (heavy right tails, lots of zeros). If you use a point estimate of the mean without thinking about the spread, your power analysis will be wrong.
Always look at the distribution of your metric before running a power analysis. Consider whether you should winsorize outliers, use a log transformation, or switch to a less noisy proxy metric.
3. Forgetting that n is per group
The formulas above give you n per group. If you have a 50/50 split, your total sample is 2n. If you have an 80/20 split (which some teams use to limit downside risk), the total sample is larger than 2n because the smaller group is the bottleneck. Specifically, with an 80/20 split, you need about 56% more total traffic compared to 50/50 to achieve the same power.
For unequal group sizes, the effective sample size depends on the harmonic mean of the two group sizes, not the arithmetic mean. This is why 50/50 is optimal for power — it maximizes the harmonic mean for a given total sample.
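The traffic penalty for an unequal split follows directly from that: the variance of a difference in means scales with 1/n1 + 1/n2, so for a fixed total N split f : (1 − f), required traffic scales with 1/(f(1 − f)). A quick sketch:

```python
# Total-traffic cost of unequal splits at equal power. The variance of the
# difference scales with 1/n1 + 1/n2, i.e. with 1/(f*(1-f)) for split f.
def traffic_multiplier(split):
    """Total-traffic multiplier vs. a 50/50 split, at equal power."""
    return (0.5 * 0.5) / (split * (1 - split))

for f in (0.5, 0.7, 0.8, 0.9):
    print(f"{int(f*100)}/{100-int(f*100)} split needs {traffic_multiplier(f):.2f}x total traffic")
# The 80/20 split comes out to 1.56x, the ~56% figure quoted above
```

Note how gentle the penalty is near 50/50 (a 70/30 split costs only ~19% more) and how steep it gets at the extremes (90/10 costs 2.78x).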
4. Post-hoc power analysis
Running a power analysis after the experiment to "check" whether you had enough power is meaningless. Post-hoc power is a direct mathematical function of the p-value — it tells you nothing you didn't already know from the p-value itself. If the p-value was non-significant and you compute post-hoc power, you'll always get low power. This is circular.
Power analysis is a planning tool, not a retrospective diagnostic.
5. Using the wrong unit of randomization
If you randomize at the user level but your metric is measured at the session level, your effective sample size is not the number of sessions — it's the number of users. Sessions within a user are correlated. Using the session count in your power analysis will overestimate your power and underestimate the required sample size.
Always match the unit of analysis to the unit of randomization, or account for the clustering in your variance estimate.
Variance Reduction: Getting More Power Without More Data
Sometimes the power analysis tells you that you need more traffic than you have. Before giving up, consider techniques that reduce the variance of your metric, which is equivalent to increasing your effective sample size.
CUPED (Controlled-experiment Using Pre-Experiment Data): If you have pre-experiment data on the same metric, you can use it as a covariate to reduce variance. The idea is simple: if a user's post-experiment revenue is correlated with their pre-experiment revenue, you can partial out the pre-experiment signal and reduce the noise in your estimate. Variance reductions of 20-50% are common, which can cut required runtime by a similar fraction.
The adjusted metric is:
Y_adjusted = Y_post − θ × Y_pre
Where θ is the coefficient from regressing Y_post on Y_pre (typically estimated as Cov(Y_post, Y_pre) / Var(Y_pre)). The key insight is that this adjustment doesn't introduce bias because treatment assignment is independent of pre-experiment data.
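Here is a minimal CUPED sketch on synthetic data. The pre/post correlation (0.7) and sample size are illustrative assumptions, not estimates from real traffic:

```python
import random
from statistics import mean, variance

# CUPED on synthetic data: post-period metric correlated with pre-period metric.
random.seed(0)
rho = 0.7  # assumed pre/post correlation
pre = [random.gauss(0, 1) for _ in range(20_000)]
post = [rho * x + random.gauss(0, (1 - rho**2) ** 0.5) for x in pre]

# theta = Cov(Y_post, Y_pre) / Var(Y_pre), estimated from the data
m_pre, m_post = mean(pre), mean(post)
cov = sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post)) / (len(pre) - 1)
theta = cov / variance(pre)

adjusted = [y - theta * x for x, y in zip(pre, post)]
reduction = 1 - variance(adjusted) / variance(post)
print(f"theta ≈ {theta:.2f}, variance reduction ≈ {reduction:.0%}")
# CUPED removes roughly a rho**2 share of the variance (about 49% here)
```

The rho² relationship is the useful planning heuristic: a pre-period metric correlated at 0.7 with the outcome cuts variance nearly in half, while one correlated at 0.3 buys you less than 10%.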
Stratified randomization: If you know that certain user segments have very different baseline metrics (e.g., power users vs. casual users), you can stratify your randomization to ensure balance across these segments, then analyze within strata. This reduces the between-strata variance and can meaningfully improve power.
Metric choice: Sometimes the most effective variance reduction is choosing a different metric. Click-through rate has lower variance than revenue per user. A binary "did the user convert" has lower variance than "how much did they spend." If your primary metric is too noisy, consider whether a closely related but less variable metric can serve as a reliable proxy.
Multiple Comparisons
If you're testing multiple variants (A vs. B vs. C) or looking at multiple metrics, you need to account for the multiplicity.
With k independent comparisons at α = 0.05 each, the probability of at least one false positive is 1 − (1 − α)^k. For 3 comparisons, that's about 14%. For 10, it's about 40%.
The simplest correction is Bonferroni: divide α by the number of comparisons. If you're doing 3 tests, use α = 0.05/3 for each. This is conservative — it controls the family-wise error rate but reduces power for each individual test. Your power analysis needs to use the adjusted α, which means you'll need larger samples.
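To make the sample-size cost concrete, this sketch computes the family-wise error rate and then reruns the signup example from earlier (10% to 11%) with a Bonferroni-adjusted α:

```python
from statistics import NormalDist

def fwer(alpha, k):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def n_per_group(alpha, power, p_c, p_t):
    """Per-group n for two proportions (same formula as above, exact z values)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    delta = p_t - p_c
    return (z_a + z_b) ** 2 * (p_c * (1 - p_c) + p_t * (1 - p_t)) / delta ** 2

print(f"FWER for 3 tests: {fwer(0.05, 3):.1%}")    # 14.3%
print(f"FWER for 10 tests: {fwer(0.05, 10):.1%}")  # 40.1%

base = n_per_group(0.05, 0.80, 0.10, 0.11)
bonf = n_per_group(0.05 / 3, 0.80, 0.10, 0.11)
print(f"n per group: {base:,.0f} unadjusted vs {bonf:,.0f} with Bonferroni")
```

Correcting for three comparisons pushes the required sample from roughly 14.7K to roughly 19.7K per group, about a third more traffic for the same power.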
Less conservative alternatives exist. Benjamini-Hochberg controls the false discovery rate (the expected proportion of false positives among rejected hypotheses) rather than the family-wise error rate, and is more appropriate when you're running many comparisons and can tolerate some false positives as long as the overall discovery rate is reliable.
For multi-variant experiments (A/B/C/D), the power calculation is more involved because you're typically interested in pairwise comparisons against a control. Dunnett's test is designed for exactly this scenario and is less conservative than Bonferroni for the specific case of "each treatment vs. one shared control."
The practical implication: more variants means you need more total traffic. If you're constrained on traffic, test fewer things at once with higher power rather than many things with insufficient power.
What This Looks Like in an Interview
If you get an experiment design question in a data science interview, the power analysis portion is where you demonstrate rigor. (For a deeper look at how to approach the full experiment evaluation, see our guide on how to nail the experiment evaluation case study.) Here's the structure:
State your hypotheses. Null: no difference. Alternative: there is a difference (usually two-sided unless you have strong prior reason to expect a specific direction).
Define your primary metric. Pick one. Having a single pre-registered primary metric is what separates a well-designed experiment from a fishing expedition.
Choose α and power. State them and briefly justify. "I'll use α = 0.05 and 80% power, which are standard conventions" is fine. If you have a reason to deviate (e.g., "this is a high-risk change so I want 90% power"), say so.
Determine the MDE. This is where you show business thinking. "Given that our conversion rate is 10% and each conversion is worth roughly $50 in LTV, a 0.5pp lift would represent $X in annual revenue, which justifies the engineering cost. So I'd set the MDE at 0.5 percentage points."
Calculate the sample size and translate to runtime. "At 100,000 daily active users in a 50/50 split, we'd need about 30,000 per group, so 60,000 total; that's roughly 14 hours of traffic, call it a full day with a buffer." If the runtime is unreasonably long, discuss tradeoffs: can you increase the MDE? Use CUPED? Accept lower power?
Mention what you'd monitor. Sample ratio mismatch (are the groups actually balanced?), guardrail metrics, any pre-registered stopping rules.
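The runtime translation in that walkthrough is mechanical enough to script. A tiny helper, assuming every daily active user is eligible and traffic splits cleanly (both simplifications real experiments rarely satisfy):

```python
# Translate a per-group sample size into runtime in days.
def runtime_days(n_per_group, daily_users, treatment_share=0.5):
    """Days needed to fill both groups; the smaller arm is the bottleneck."""
    bottleneck = min(treatment_share, 1 - treatment_share)
    return (n_per_group / bottleneck) / daily_users

# The interview example: 30,000 per group at 100,000 DAU, 50/50 split
print(f"{runtime_days(30_000, 100_000):.1f} days")  # 0.6 days
```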
This kind of structured thinking is what separates candidates who understand experimentation from those who just know how to run a t-test.
The Bottom Line
A power analysis is not a formality. It's the single most important step in experiment design because it forces you to confront the limits of what your data can tell you before you collect it.
Most failed experiments don't fail because the analysis was wrong. They fail because nobody thought carefully about whether the experiment had a realistic chance of detecting the effect in the first place.
Do the math upfront. You'll run fewer experiments, but the ones you run will actually tell you something.
Want to practice experiment design with real data? Rabbit Hole has case studies that test your ability to evaluate A/B tests, diagnose metric shifts, and build recommendations — the way you'll actually be tested in interviews.