How to Nail the Experiment Evaluation Case Study
A/B test results look great on the surface — but the interviewer wants to know if you can find what's hiding underneath. Here's how to systematically evaluate experiments and avoid the traps that catch most candidates.
The Head of Product shows you a dashboard: "We ran an A/B test on the new recommendation algorithm. Time spent is up 9% in the treatment group. Should we ship it?"
That's the experiment evaluation case study. It's the second most common format in data science interviews after the metric investigation, and it's deceptively dangerous. The results are sitting right there. The number looks good. The team is excited. Your job is to figure out whether the excitement is justified — and more often than not, it isn't.
If you've read our general framework for case study interviews, you already know the high-level approach. This post is about how to apply it specifically to experiment evaluation — the checks that matter, the traps interviewers plant, and the thinking that separates strong candidates from everyone else.
What makes experiment evaluations different
In an investigation, you start with a broken metric and work backward to find the cause. In an experiment evaluation, you start with a result and work forward to decide if it's real, meaningful, and safe to act on.
That sounds easier. It's not. Investigations have a clear failure mode — you don't find the root cause. Experiment evaluations have a subtler failure mode: you accept a flawed result and recommend shipping something harmful. The interviewer is testing whether you have the skepticism and rigor to push back on a good-looking number.
The core skill being evaluated is multi-dimensional thinking. Anyone can confirm that the headline metric went up. The question is whether you can check the other dimensions — retention, quality, fairness, sustainability — and synthesize them into a judgment call.
Start with the basics: is the test valid?
Before you evaluate results, make sure the experiment itself is sound. Spend two minutes on this. It's fast, it's impressive, and it occasionally catches a fatal flaw that changes everything.
Sample ratio mismatch (SRM). Check that the treatment and control groups are the size you expect. If the test was supposed to be 50/50 and you're seeing 52/48 across tens of thousands of users, that's a red flag. It could mean the randomization is broken, or that the treatment itself changes which users trigger the logging event — either way, the comparison is biased.
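The check itself is a one-liner in spirit — a two-sided z-test on the observed split against the intended ratio. A stdlib-only sketch, with illustrative counts:

```python
import math

def srm_z_test(n_treatment, n_control, expected_ratio=0.5):
    """Two-sided z-test for sample ratio mismatch against the intended split."""
    n = n_treatment + n_control
    observed = n_treatment / n
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    # normal-approximation two-sided p-value via the error function
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Illustrative: a 52/48 split on 100,000 users is not plausible under true 50/50
z, p = srm_z_test(52_000, 48_000)
print(f"z = {z:.1f}, significant SRM: {p < 0.001}")
```

The intuition: the bigger the sample, the less wiggle room the split has. 52/48 on 200 users is normal noise; on 100,000 users it is essentially impossible under correct randomization.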
Baseline balance. Are the groups similar on pre-experiment metrics? If the treatment group happened to get more power users, any positive result could be an artifact of the assignment, not the treatment.
Duration and power. Did the test run long enough to detect the effect size you're seeing? A 2% lift measured over 3 days with 500 users per group is noise. A 2% lift measured over 4 weeks with 50,000 users per group is probably real. Ask about the power calculation if it's not provided.
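If no power calculation is provided, you can do a back-of-the-envelope one with the standard two-sample formula. The baseline mean and standard deviation below are assumptions for illustration, not numbers from the case:

```python
import math

def n_per_group(baseline_mean, baseline_sd, rel_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per group (two-sided test, alpha=0.05, power=0.80)."""
    delta = baseline_mean * rel_lift  # absolute effect size to detect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (baseline_sd / delta) ** 2)

# Assumed baseline: 30 min/day time spent with sd of 25 min; detect a 2% lift
n = n_per_group(30, 25, 0.02)
print(n)  # on the order of 27,000 users per group
```

With these (assumed) numbers, 500 users per group is hopelessly underpowered for a 2% lift, while 50,000 comfortably clears the bar — which is exactly the contrast above.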
Novelty and primacy effects. New features often get an initial bump (novelty) or an initial dip (learning curve) that fades. If the test ran for only 7 days, the result might not represent steady-state behavior. Look at the metric by day — is the effect stable, growing, or decaying?
Most candidates skip this entirely and jump straight to the results. Don't be most candidates.
Confirm the headline, then go wider
Yes, confirm the primary metric first. If the claim is "+9% time spent," verify it. But spend no more than one query on this. The interviewer already knows it's +9%. What they want to see is what you do next.
Check retention. This is the single most important second check. Time spent per session can go up while retention goes down — users spend more per visit but visit less often. That's a net loss. Day-7 and day-14 retention are the signals that tell you whether the change is sustainable.
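A toy calculation shows the mechanism — the numbers are made up, but the trap is real:

```python
# Assumed: per-session time is up ~9%, but users visit less often.
control_sessions_per_week = 10
control_min_per_session = 12.0
treatment_sessions_per_week = 9     # the frequency/retention dip
treatment_min_per_session = 13.1    # the "+9% per session" headline

control_total = control_sessions_per_week * control_min_per_session        # 120.0
treatment_total = treatment_sessions_per_week * treatment_min_per_session  # ~117.9
net_change = treatment_total / control_total - 1  # about -1.8%: a net loss
```

A per-session metric can look great while total time per user quietly falls. Retention is what tells you which story you're in.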
Check satisfaction signals. Look for explicit negative signals — "not interested" taps, report actions, unsubscribes, support tickets. A feature that increases engagement but also increases the rate at which users actively reject content is not a healthy feature.
Check quality metrics. What's the nature of the engagement? If time spent went up because users are watching more diverse content, that's healthy. If it went up because they're stuck in an outrage loop, that's not. Content type distribution, creator diversity, and interaction quality all matter.
Check for harm. Is there a segment that's being hurt? A treatment that's +9% overall but -3% for users over 35 is not generalizable. An algorithm that improves recommendations for power users but degrades them for new users has a fairness problem. Always segment the primary metric by key dimensions.
This is where you demonstrate senior-level thinking. Junior analysts confirm the number. Senior analysts interrogate it.
Segment aggressively
The most common trap in experiment evaluations is a result that looks great in aggregate but falls apart when you segment it. Interviewers love this pattern because it directly tests whether you have the instinct to look beneath the surface.
By demographics. Age, country, platform. If the effect only exists for one age group, it's not a general improvement — it's a feature that works for a specific audience.
By user tenure. New users vs. established users often respond very differently to changes. A lift concentrated in established users — the ones who actually notice that something changed — is a classic novelty signature. A lift only among new users may mean better onboarding rather than broad, lasting value. Either way, an effect confined to one tenure group is not a general improvement.
By engagement level. Power users, casual users, and dormant users each have different baselines. A change that re-activates dormant users is very different from one that squeezes more engagement out of already-active users.
By time. Plot the treatment effect by day. Is it stable? Growing? Decaying? A decaying effect suggests novelty. A growing effect might suggest a genuine behavioral shift — or it might suggest an addictive loop that will eventually cause churn.
When you find a segment where the effect is dramatically different from the aggregate, say it out loud. "The +9% overall is driven entirely by users 18-24. Users over 25 show no effect." That sentence alone signals that you're thinking about this the right way.
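Mechanically, segmentation is just grouping by segment and comparing the two arms. A stdlib-only sketch with hypothetical per-user data:

```python
from collections import defaultdict

def lift_by_segment(rows):
    """rows: (segment, arm, metric_value) tuples. Returns relative lift per segment."""
    totals = defaultdict(lambda: [0.0, 0])  # (segment, arm) -> [sum, count]
    for segment, arm, value in rows:
        totals[(segment, arm)][0] += value
        totals[(segment, arm)][1] += 1
    lifts = {}
    for segment in {seg for seg, _ in totals}:
        t_sum, t_n = totals[(segment, "treatment")]
        c_sum, c_n = totals[(segment, "control")]
        lifts[segment] = (t_sum / t_n) / (c_sum / c_n) - 1
    return lifts

# Hypothetical time spent (minutes/day): the lift lives entirely in 18-24
rows = [
    ("18-24", "treatment", 40), ("18-24", "treatment", 44),
    ("18-24", "control", 30),   ("18-24", "control", 32),
    ("25+", "treatment", 20),   ("25+", "treatment", 22),
    ("25+", "control", 21),     ("25+", "control", 21),
]
print(lift_by_segment(rows))
```

In practice you'd do this in SQL or pandas, but the shape is the same: one lift per segment, read side by side against the aggregate.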
Look for confounds
This is the advanced move that separates top candidates from good ones. A confound is something that changed alongside the treatment that could explain the result — something other than the feature itself.
Notifications. Did the treatment group receive push notifications that the control group didn't? This is extremely common in real experiments and frequently planted in interview scenarios. If the feature includes a reminder or nudge, the notification might be driving the engagement lift, not the feature.
UI changes. If the treatment includes a visual change (a new badge, a different layout, a prominent CTA), is the engagement lift from the underlying feature or from the UI novelty?
Coupled features. Sometimes the treatment bundles multiple changes. A new recommendation algorithm + a new UI + a notification. Which one is driving the result? If you can't isolate the components, you can't attribute the effect.
When you find a confound, quantify it if you can. "Treatment users who received the notification show 79% retention. Treatment users who didn't show 71%. Control shows 68%. The notification explains 8 of the 11 percentage points." That's the kind of analysis that makes interviewers write "strong hire."
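The arithmetic behind a sentence like that is a simple decomposition. Using the same illustrative numbers, and assuming nearly all treatment users received the notification:

```python
control_retention = 0.68
treatment_notified = 0.79     # treatment users who got the push
treatment_unnotified = 0.71   # treatment users who did not

feature_effect = treatment_unnotified - control_retention        # ~3pp from the feature alone
notification_effect = treatment_notified - treatment_unnotified  # ~8pp from the nudge
```

If the unnotified treatment arm looks almost like control, the feature itself is doing little — the nudge is the treatment. (Caveat: who receives a notification isn't randomized within the treatment group, so treat this split as suggestive, not causal.)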
Build a composite view
Here's where most candidates stop: they report the primary metric, note a concern or two, and give a lukewarm recommendation. Strong candidates go further — they build a multi-dimensional view of the experiment's impact.
Combine the metrics. If time spent is +9%, retention is -2%, and content diversity is -25%, what's the net effect? You don't need a formal composite metric (though proposing one is impressive). You need to show that you can weigh multiple signals against each other and arrive at a judgment.
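One way to make the weighing explicit is a toy weighted score. The weights below are assumptions you'd justify out loud to the interviewer, not a standard:

```python
# Relative deltas from the hypothetical experiment above
deltas = {"time_spent": 0.09, "d7_retention": -0.02, "content_diversity": -0.25}
# Assumed weights reflecting how much each signal matters long-term
weights = {"time_spent": 0.3, "d7_retention": 0.5, "content_diversity": 0.2}

score = sum(weights[m] * deltas[m] for m in deltas)
print(round(score, 3))  # negative: retention and diversity losses outweigh the lift
```

The exact number matters less than the exercise: it forces you to state how much a point of retention is worth relative to a point of engagement.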
Frame the trade-off. "This feature trades long-term retention for short-term engagement. That might be acceptable if we're optimizing for a quarterly target, but it's a losing trade over 12 months." Framing the trade-off shows business maturity.
State your recommendation clearly. Don't hedge endlessly. After your analysis, say one of three things: "Ship it," "Don't ship it," or "Ship it with modifications." Then explain why in two sentences.
Close with next steps
Even if your recommendation is "don't ship," you should propose a path forward.
If you recommend shipping: Identify what to monitor post-launch. Which metrics should you watch? What would trigger a rollback? Is there a segment you'd want to exclude?
If you recommend not shipping: Explain what to fix and how to re-test. Should the notification be tested independently? Should the feature be redesigned to incentivize different behavior? Should the test run longer or with different success metrics?
If it's ambiguous: Propose a follow-up experiment that would resolve the ambiguity. A longer test, a different metric, a factorial design that separates the components.
This is the part that demonstrates ownership. Anyone can say "the data looks concerning." Recommending what to do about it is what makes you sound like someone who ships products, not just analyzes them.
Common mistakes to avoid
Stopping at the headline metric. "+9% time spent, ship it." This is the most common mistake and the one interviewers are specifically testing for. Always go deeper.
Not checking retention. Engagement and retention can move in opposite directions. If you don't check retention, you're only seeing half the picture.
Ignoring the timeline. A 7-day test with a decaying treatment effect is very different from a 28-day test with a stable effect. Look at the metric by day.
Accepting the framing uncritically. The prompt says "the results look great." That's the product team's interpretation. Your job is to form your own.
Running out of time without a recommendation. Budget your time. In a 45-minute case, you should start synthesizing by minute 25-30. A clear recommendation based on partial analysis beats a thorough analysis with no conclusion.
Being afraid to say "don't ship." If the data says don't ship, say don't ship. Interviewers respect analytical courage. They do not respect people who twist ambiguous data into a positive recommendation because they think that's what the team wants to hear.
Practice with realistic scenarios
The experiment evaluation pattern is consistent: confirm, segment, check retention, look for confounds, synthesize, recommend. What varies is the specific scenario — the feature, the metric, the hidden trap, the confound.
The only way to make this automatic is practice. Read a prompt, open the data, set a 45-minute timer, and work through it end to end. After a few reps, you'll develop the instinct to check the right things in the right order. After ten reps, the interview will feel like a conversation, not a test.
(Rabbit Hole has experiment evaluation cases modeled on real interviews at TikTok, Duolingo, and Spotify — a good place to start if you want realistic practice with live data.)
Ready to practice?
Apply these concepts on realistic case studies with real datasets.
Browse Case Studies