You’ll never look at your A/B test the same…
Back in my salad days of working for a nonprofit (which was as of three weeks ago; time moves pretty fast nowadays), I was reviewing the test results of a campaign we ran with our agency partner.
We did 15 panels of 20,000 pieces each, every panel built from the same RFM segments in equal proportions. Think of it as an A/B test, except that this was an A/B/C/D/E/F/G/H/I/J/K/L/M/N/O test. I’d generated the list request myself based on the previous year, when we’d tested five different things crossed with three different other things at the same time – hence 15 test panels.
Here were the (only slightly) changed results:
What are your conclusions from this? There are a few things that jump out at me:
- E2 is a strong winner largely because of the spike in average gift – could make 14% more than if we had rolled out with A2.
- C as a concept had a relatively strong showing – three of the top seven panels. It would be worth looking into what was done there to see if it can be replicated.
- A, on the other hand, had three of the six worst showings and the two lowest response rates. Don’t do what you did in A again.
Here’s the trick. All of these panels were the same audience and received the same piece.
Yes, this 15-panel test was all an accident. It was the first time I’d submitted a data request to our database vendor, so I left the panels the same as they had been for the piece the previous year, when we actually did run a five-by-three test.
And yet there were significant swings in response rates and average gift, just like you’d see in a regular test.
In fact, if someone told you that A2 was the control and E2 was the test, you’d be hard-pressed not to call E2 the winner and roll out with it:
|              | Response rate | Average gift | Revenue per piece |
|--------------|---------------|--------------|-------------------|
| Test (E2)    | 3.62%         | $27.43       | $0.99             |
| Control (A2) | 3.33%         | $25.93       | $0.86             |
That’s what we had done the previous year. We had run a 15-panel test, picked a “winner” and that “winner” was the new control. And the results varied by about as much as they did when everyone got the same piece in the mail.
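For the statistically inclined, here’s a quick sketch (not the original analysis) of how you might check whether that response-rate gap even clears the bar. It uses a standard two-proportion z-test and assumes roughly 20,000 pieces per panel, as described above; the response counts are back-calculated from the rounded percentages in the table.

```python
# Hedged sketch: two-proportion z-test on the E2 vs A2 response rates above.
# Panel sizes come from the story; response counts are back-calculated
# from the rounded percentages, so treat the output as approximate.
from math import sqrt
from statistics import NormalDist

n_test = n_control = 20_000
resp_test = round(0.0362 * n_test)        # ~724 "E2" responses
resp_control = round(0.0333 * n_control)  # ~666 "A2" responses

p_test, p_control = resp_test / n_test, resp_control / n_control
p_pool = (resp_test + resp_control) / (n_test + n_control)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_test + 1 / n_control))
z = (p_test - p_control) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")  # roughly z = 1.58, p = 0.11
```

A p-value around 0.11 is weak evidence at best, and that’s before accounting for the fact that this was the best of 15 panels rather than a single pre-planned comparison.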
What lessons can you take from this?
First, there is a lot of what statisticians call “noise” in any set of test results. What does this mean in practice?
You have undoubtedly deemed test panels winners that were not and, on the flip side, deemed test panels “losers” that were not. And it’s not just you; I know from looking at this that I have as well. The vast majority of test results have no winners or losers, just noise that we mislabel as signal.
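If you want to see how easily pure noise manufactures winners and losers, here’s a minimal simulation sketch. The response rate, average gift, and gift spread below are made-up stand-ins, not the campaign’s real numbers; every panel is generated from exactly the same assumptions.

```python
# Minimal simulation: 15 identical panels of 20,000, all with the SAME true
# response rate and average gift. The only differences in the output are chance.
import random

random.seed(7)
TRUE_RESPONSE_RATE = 0.035   # assumed true rate for every panel
TRUE_AVG_GIFT = 26.50        # assumed true average gift
GIFT_SD = 20.00              # assumed spread in individual gift sizes
PANEL_SIZE = 20_000

results = []
for panel in "ABCDEFGHIJKLMNO":  # 15 panels
    responders = sum(random.random() < TRUE_RESPONSE_RATE for _ in range(PANEL_SIZE))
    gifts = [max(5.0, random.gauss(TRUE_AVG_GIFT, GIFT_SD)) for _ in range(responders)]
    avg_gift = sum(gifts) / responders
    results.append((panel, responders / PANEL_SIZE, avg_gift, sum(gifts) / PANEL_SIZE))

# Rank panels by revenue per piece, the way a results table usually gets read
for panel, rr, avg_gift, rpp in sorted(results, key=lambda r: r[3], reverse=True):
    print(f"Panel {panel}: {rr:.2%} response, ${avg_gift:.2f} avg gift, ${rpp:.2f}/piece")
```

Run it a few times: the gap between the “best” and “worst” panel routinely looks as convincing as the E2-versus-A2 table above, even though nothing differs but chance.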
How to avoid this?
Set up your list selects correctly. Take it from someone who had to aggregate 15 test panels to get the actual results he was looking for.
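As a rough illustration of what “correctly” can look like, here is one way to deal a file out to panels so that every panel carries the same RFM mix. The column names and the pandas approach are assumptions for the sketch, not a prescription for your database vendor’s request format.

```python
# Sketch: shuffle the file once, then deal records out round-robin within each
# RFM segment so every panel gets the same segment mix.
# Column names ("donor_id", "rfm_segment") are illustrative assumptions.
import pandas as pd

def assign_panels(donors: pd.DataFrame, n_panels: int, seed: int = 42) -> pd.DataFrame:
    out = donors.sample(frac=1, random_state=seed).copy()      # randomize order
    out["panel"] = out.groupby("rfm_segment").cumcount() % n_panels
    return out

# Toy usage: a 2-panel (A/B) split of a 12-record file
donors = pd.DataFrame({
    "donor_id": range(12),
    "rfm_segment": ["high", "mid", "low"] * 4,
})
split = assign_panels(donors, n_panels=2)
print(split.groupby(["panel", "rfm_segment"]).size())  # equal counts per cell
```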
But, more importantly, this can teach us about why we test. It’s easy to do an A/B test of red envelope versus blue envelope. But it won’t get you to a deeper understanding of your donors’ behaviors; as this story shows, you might have gotten the same results if everyone had received the red envelope by mistake.
Testing random things to get random results keeps you focused on maximizing the value of a communication rather than the value of a donor. It’s when you are thinking about how to treat donors and what causes them to do what they do that you can engage with strategy, build donor loyalty, and maybe, just maybe, make your donors happier with their experiences.
Failing that, however, there is still a strong case to be made for retesting previous tests, backtesting (looking at your hypothesis against retrospective data), much larger test quantities, and a general sense of skepticism about how your test results could be wrong. After all, even a result at p = .04 doesn’t mean there’s only a 4% chance it was random; it means that if there were no real difference, you’d still see a result at least that extreme about 4% of the time. And when you test many panels over the course of a year, far more randomness, chance, and noise gets mixed in with your wins and losses.
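To put a rough number on that, here’s a back-of-the-envelope sketch of the family-wise error rate: the chance that at least one of several tests of a treatment with no real effect still comes back “significant” at p < .05, assuming the tests are independent.

```python
# Chance of at least one false positive across n independent tests at alpha = .05
alpha = 0.05
for n_tests in (1, 5, 15, 30):
    chance = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>2} tests -> {chance:.0%} chance of at least one false 'winner'")
```

By the time you’ve run 15 comparisons in a year, a phantom “winner” is more likely than not.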
This also carries a lesson about having a strong hypothesis. Each idea you test should come with a reason to believe it before you start. You want a hypothesis that says “I believe this test will [increase response rate/increase average gift/increase donor lifetime value/some combination] because of X,” so that you know what you are measuring against, instead of looking at a table like mine and picking “winners” wherever a bigger (and false, but alluring) number appears.
It also helps you create the test correctly. If you have a strong hypothesis, you can make sure you get the details right to support or reject it.
What you want, ideally, is a robust testing tool like our pre-testing platform, so you can roll out with a winner before you ever mail a piece. We’d love for you to learn more here.
And you can always learn more with our free newsletter below.