A/B Tests Pick Flavors, Not Lift

October 17, 2025      Kevin Schulman, Founder, DonorVoice and DVCanvass

Thought experiment: turn fundraising off for a year – zip, nada, nothing. What happens?

Revenue doesn’t hit zero; it decays. Long-time donors still give, some monthly gifts keep running, bequests arrive. That “baseline” money would come in even if you did nothing for a while. Yet, right now, it’s being credited to fundraising performance.

Flip it: increase spend 10x next year. Will revenue jump 10x? Of course not; there’ll be some increase, but nowhere near linear lockstep.

Our reality lives between those two poles, and that’s where the real question sits:

Is this next thing—this campaign, extra email, or new channel—actually adding money, or just taking credit for money that would have arrived anyway?

Most A/B tests can’t answer that; they’re taste tests. Imagine testing two pills for the same illness. Patients say Pill A tastes a bit better than Pill B, so you switch everyone to Pill A. But neither pill treats the condition.

That’s what most A/B tests are doing in disguise: optimizing for taste, not effect. They pick the more-liked version, not the one that actually moves the outcome you care about.

Why preference ≠ performance

  • Attribution is generous to the activity you’re staring at. When everything moves together, the closest tactic gets the credit.
  • Small wins compound into big illusions. Ten 3% “wins” can still net out to 0% incremental growth if each is just siphoning from something else.
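A toy arithmetic sketch of that siphoning effect, using hypothetical channel numbers (not from this article): email racks up ten compounding 3% “wins,” but every dollar comes out of direct mail, so the program’s total never moves.

```python
import math

# Hypothetical two-channel program: ten 3% email "wins" that each just
# siphon revenue from direct mail create no new money overall.
channels = {"email": 100_000.0, "direct_mail": 100_000.0}
total_before = sum(channels.values())

for _ in range(10):                      # ten consecutive 3% "wins" for email
    shifted = 0.03 * channels["email"]   # each win pulled from direct mail
    channels["email"] += shifted
    channels["direct_mail"] -= shifted

total_after = sum(channels.values())

print("email lift:", round(channels["email"] / 100_000 - 1, 4))        # ≈ 0.3439
print("program is flat:", math.isclose(total_after, total_before))     # True
```

Email’s dashboard shows roughly 34% growth; the program’s top line shows none. That gap is exactly what a preference test can’t see.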

The Ladder of Evidence: From Vibes to Truth

  1. A/B preference test
    Good for: crafting and micro-optimizing after you’ve proven the channel/campaign matters. Not good for answering “does this create new money?”
  2. Randomized holdout
    Withhold a statistically valid slice from the activity. Compare total giving over an appropriate window. This is the workhorse.
  3. Turn-off test
    Pause the channel or dial it down materially. If revenue barely moves, you’ve been reallocating. If revenue drops beyond expected variance, you have causal signal.
  4. Geo-lift test
    Run activity in matched regions and compare outcomes. Great when randomization at the person level isn’t practical.
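A minimal sketch of the holdout math (rung 2), using simulated 12-month giving totals in place of real donor records; all numbers, group sizes, and the small true lift baked into the treated group are hypothetical:

```python
import random
import statistics

random.seed(0)

# Simulated per-donor giving over the measurement window. Donors were
# randomly split before the campaign; the holdout received nothing.
holdout = [max(0, random.gauss(50, 30)) for _ in range(5_000)]
treated = [max(0, random.gauss(55, 30)) for _ in range(5_000)]

# Incremental revenue per donor: difference in average total giving.
lift_per_donor = statistics.mean(treated) - statistics.mean(holdout)

# Rough two-sample standard error, to judge whether the lift clears noise.
se = (statistics.variance(treated) / len(treated)
      + statistics.variance(holdout) / len(holdout)) ** 0.5

print(f"incremental revenue per donor: {lift_per_donor:.2f} ± {1.96 * se:.2f}")
```

If the interval comfortably excludes zero, the activity is creating new money; if it straddles zero, you’re likely reallocating. The same comparison, run on geography totals instead of donors, is the geo-lift version.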

Use the highest rung you can operationalize. Then use A/B within the proven rungs to tune the details.

When to use what:

  • New channel or big budget shift? Start with a holdout or geo-lift. Prove there’s a there there.
  • Mature channel you believe in? Periodically turn it off in a controlled way. Keep yourself honest.
  • Creative tweaks, subject lines, landing pages? A/B away, but only inside a channel that’s already proven incremental.

If you wouldn’t choose a medicine on taste, don’t choose your fundraising program that way either.

Kevin