TESTING: When A/B Tests Attack (your results)
In yesterday’s post, we talked about cases where an A/B test shows a result, but all that’s really there is noise.
Today, we’ll flip that on its head: sometimes an A/B test shows no result, but there’s an important finding just below the surface.
An example of this is a great study by Karlan and Wood looking at whether emotion or education was the more potent factor in mail appeals. The study is here.
To oversimplify, the control group got an emotional appeal and a personal story about a participant in the nonprofit’s program; the test group received the same letter, plus an additional paragraph talking about the “rigorous scientific methodologies” on which the nonprofit’s program was based.
The study found that the information on program effectiveness had no impact on either the likelihood of giving or the amount given. So, case closed: it doesn’t matter whether you talk about your program’s effectiveness or not.
But wait.
The researchers found an interesting split in the data: effectiveness data significantly harmed response among smaller (under $100) donors (a 0.6-percentage-point lower response rate) and helped response among larger ($100+, but you probably guessed that) donors (a one-percentage-point higher response rate). With controls in place for things like household income, previous gifts, and so on, the researchers were able to reject the idea that larger and smaller donors behave the same.
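If you want to run this kind of subgroup check on your own file, here is a minimal sketch of the idea: fit a model with a treatment-by-segment interaction and see whether the interaction term is significant. All numbers below are invented for illustration; they are not Karlan and Wood’s data, and the column names are hypothetical.

```python
# Hedged sketch: does an "effectiveness" paragraph work differently for
# small vs. large donors? Simulated data, invented effect sizes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),       # 1 = got the effectiveness paragraph
    "large_donor": rng.integers(0, 2, n),   # 1 = prior gifts of $100+
})

# Simulate a crossover effect: the paragraph hurts small donors, helps large ones.
base = 0.06
p = base - 0.006 * df.treated * (1 - df.large_donor) + 0.010 * df.treated * df.large_donor
df["gave"] = rng.binomial(1, p)

# The interaction term asks whether the treatment effect differs by segment.
model = smf.logit("gave ~ treated * large_donor", data=df).fit(disp=False)
print(model.summary().tables[1])
# A significant `treated:large_donor` coefficient is evidence that large and
# small donors do not respond to the appeal the same way.
```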
So what looked like “no result” actually exposed an important difference between donors. One might say (and I do) that this highlights a dichotomy in how people give: smaller gifts are heart gifts; larger gifts are head gifts. Or you can go all Kahneman System 1 and System 2 on me, if you prefer.
This is almost certainly happening in your file as well. We’ve talked about cat people versus dog people as a simplified example of different donor identities with different wishes in the same donor file.
Let’s say you chose to test an all-dog-all-the-time mail piece on a full donor file of half cat people and half dog people. The cat people would reject the piece, the dog people would embrace it, and the result would look like no improvement as long as the rejection and the embrace were roughly equal in size.
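Here is a back-of-the-envelope sketch of how that cancellation works. The response rates are invented purely for illustration.

```python
# Invented response rates to show how opposite segment reactions can net out
# to "no result" in the aggregate.
control_rate = 0.050   # both cats and dogs respond at 5% to the usual piece

dog_rate = 0.060       # dog people embrace the all-dog piece (+1 point)
cat_rate = 0.040       # cat people reject it (-1 point)

# File is half dog people, half cat people.
test_rate = 0.5 * dog_rate + 0.5 * cat_rate
print(f"Control: {control_rate:.1%}  Test: {test_rate:.1%}")  # both print 5.0%
# The blended test result looks identical to control, even though every
# donor in the file reacted to the piece one way or the other.
```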
Identity and average gift aren’t the only fault lines along which A/B tests can hide real results. We did a test of six additional cultivation pieces to new donors for an international nonprofit. No result… until we looked at more committed donors versus less committed donors.
Those donors who were highly committed to the organization had their retention go down by nine points when they received six additional communications versus none. They said things like, “Stop convincing me; I’m already convinced.”
When we looked at low-commitment donors, the six additional communications corresponded to a 12-point increase in retention. They said things like, “I believe you do important work, but I actually don’t know you well.” The study is discussed in more detail here.
Commitment level can even impact the age-old debate over happy versus sad faces in imagery. Xiaoxia Cao and Lei Jia tested what types of faces worked best in charity ads. They found that people who were highly psychologically involved with a nonprofit wanted to see happy faces, whereas those who weren’t as involved donated more if they saw sad faces.
This is not to say you should cut your data into tiny chunks to try to get a significant result. That’s p-hacking and it’s intellectual malpractice.
That said, if your segmentation is meaningful, your segments will behave differently, usually along lines of commitment, identity, or both.
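One way to keep a segment look honest, sketched below with placeholder numbers, is to pre-specify a handful of segments before you peek at the data and then correct the p-values for the number of looks you took. The segment names and p-values here are hypothetical.

```python
# Hedged sketch: a guardrail against p-hacking when checking pre-specified
# segments. The p-values below are placeholders, not real test results.
from statsmodels.stats.multitest import multipletests

# Suppose you decided *in advance* to check these three splits.
segment_pvalues = {
    "high vs. low commitment": 0.004,
    "small vs. large prior gift": 0.030,
    "cat people vs. dog people": 0.200,
}

reject, adjusted, _, _ = multipletests(list(segment_pvalues.values()),
                                       alpha=0.05, method="holm")
for (name, raw), adj, sig in zip(segment_pvalues.items(), adjusted, reject):
    print(f"{name}: raw p={raw:.3f}, adjusted p={adj:.3f}, significant={sig}")
# Only splits that survive the correction deserve the "meaningful segmentation"
# label; everything else is a candidate for a fresh, confirmatory test.
```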
If you look for these types of fissures in your data, you can gain insights that a one-size-fits-all testing outlook would miss.
Nick
Nick, This is great data to share. I am very appreciative of all the research that you have been putting out there for us to consider.
And I always wonder at what point does one study become more than one study? Perhaps you could add a bit more on how we might evaluate the applicability of single research projects or tests. I’m assuming a recommendation would be to test those for your own organization to see how they work. Though it would seem a more comprehensive understanding of the other conditions (like time of year, segment selected, ranking in high vs. low loyalty, etc.) might also be helpful.
Thanks – glad it’s helpful. When to react to these new bits of research is always a good question. A few rules of thumb I use:
– How strong is the evidence for it? Peer-reviewed pieces, while not perfect, certainly hold more interest to me than those that aren’t because they’ve gone through an extra vetting step and are usually by people who know how to craft a study. That said, within peer-reviewed pieces, I default to tests with actual fundraising results when possible. So I’d take Karlan and Wood’s study over Cao and Jia’s because the former actually mailed out mail pieces and got donations (the latter measured donation intent). For non-peer-reviewed pieces, I look at robustness of results. For example, the difference between high and low commitment donors discussed above was tested over the period of a year with strictly controlled randomized groups, so I feel good about the strength of the findings.
– How applicable are the results to you? If you are in international relief, I’d get very excited about Karlan and Wood’s results, because they tested it with Freedom From Hunger – there’s probably a lot of cross-applicability within that sector. Or if you see results from an end-of-year mailing, you may feel most comfortable testing it in your end-of-year mailing.
– How much of an effect could it have for you? Let’s say you do stat-heavy appeals everywhere. You look at Karlan and Wood and the myriad studies on the identifiable victim effect and see that a successful test could change the way you communicate with donors. That should likely be your priority over, say, positioning your donation amount against a hedonic good (which, while it has an effect, won’t transform your communications).
– How easy is it to test? Can I run it on my online donation page? Is there a test already running there? If the answers are yes and no respectively, then let’s put it up to test! If it will take my legal counsel signing off and an act of Congress, then the potential impact has to be pretty big to look at it.
– Does it excite you? This is so subjective that I hesitate to put it in, but the best tests I’ve seen are the ones where people can’t wait to see the results. There are hundreds of studies you could draw from right now, and it can be a glut. So, all other things being equal, you might as well test the ones that get you up in the morning.
Thanks, Nick.