Testing, Testing, A/B/C

August 7, 2019      Kevin Schulman, Founder, DonorVoice and DVCanvass

We’ve received a few testing questions here at Agitator | DonorVoice HQ:

  • How can I test inexpensively?
  • What level of statistical significance is necessary to call a winner (and how do I get it)?
  • What is most important to test?

We aim to help!  I’d first recommend Roger and Kevin’s Curse of Testing Illiteracy post, which talks through how to plan an effective testing regime.  Now on to your questions:

How can I test inexpensively?

Mail testing is necessary and expensive.  You can use other venues to test concepts so you know you are putting the best of the best head-to-head in the mail:

  • For acquisition audiences, Facebook ad testing can give you a good read on concepts. We wanted to see, for example, what invoking a medical professional identity would do for Make-a-Wish on a small budget and a short timeline.  So we set up a Facebook test.  A few days and less than $200 later, we were able to generate a 42% lift for them.  There’s often not a lot of difference between online and offline donors, assuming they share similar identities, so a Facebook win gives you reason to believe the concept will work in the mail.  If this is of interest, we are doing a free webinar on using Facebook advertising for testing and lead generation on September 18th.  We’d love to see you there, where we’ll be sharing a number of other case studies on tests you can try and strategies you might employ.
  • For donor audiences, email is an excellent proxy for mail. Similar sorts of things work to get someone to open an email (subject and subhead) as work to get them to open a mail piece (teaser).  In fact, one of my favorites – “Look what you’ve done” on messages aimed at showing a person their impact – has increased both email and mail open and response rates.
  • Many of the same mechanics work for donation pages and reply devices (especially ask strings).  And story tests in email will usually translate to mail.  There are some differences – mail often has greater permission to be longer – but the general themes should work for each.
  • For either a donor or acquisition audience, you can do testing online that will translate to both online and offline results with DonorVoice’s Pre-Test Tool. No Kid Hungry, Concord Direct, and we are hosting a free webinar on September 4th called Beyond A/B: Running Thousands of Tests at Once.  There’s also more detail on this process here.

What level of significance is necessary to call a winner (and how do I get it)?

It depends on your goal.  I’m legally required to say that 95% confidence is necessary to call a winner, as that’s the level that has been handed down on stone tablets from the gods of science.  If you are looking to publish your results, you would want this level of significance or better.  And it’s not a bad rule of thumb for our work either.  (I like the M+R chi-squared tool and Neil Patel’s A/B test calculator to help me determine the confidence level on response rate differences.)
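Under the hood, those calculators are doing something like a two-proportion z-test (for a simple two-panel test, this gives the same answer as a 2x2 chi-squared test). Here’s a minimal sketch in Python, using made-up panel numbers – 10,000 pieces per panel is an assumption for illustration, not a recommendation:

```python
from math import sqrt, erf

def confidence_level(resp_a, n_a, resp_b, n_b):
    """Two-sided confidence that two response rates really differ,
    via a pooled two-proportion z-test (equivalent to a 2x2
    chi-squared test)."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    # Normal CDF computed from the error function (stdlib only)
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return 1 - p_value

# Hypothetical panels: 10,000 pieces each, 120 vs. 150 responses
conf = confidence_level(120, 10_000, 150, 10_000)
print(f"{conf:.1%}")  # roughly 93% -- close, but short of the 95% bar
```

A 1.2% vs. 1.5% response rate on panels that size lands at about 93% confidence – which is exactly the gray zone the next paragraph is about.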

That said, it’s not a magic line.  If you get 95.1% confidence on one test and 94.9% confidence on another, are you going to feel like you can take the first to the bank, whereas the second one is worthless?  Hopefully not.

So confidence is more of a spectrum than a yes/no.  If your alternative to using test results is to flip a coin, then any test results are better than trusting to fate and the president/sovereign’s face of your choice.  On the flip side, if you are deciding what will be your control acquisition mail piece that you will print literally millions of, you want to be ironclad sure.
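How big a panel buys you “ironclad sure”? A rough way to size a test before rolling out to millions is the standard two-proportion sample-size formula. A sketch with assumed numbers (1% baseline response, hoping to detect a 20% relative lift at ~95% confidence and ~80% power – all of those inputs are illustrative):

```python
from math import ceil, sqrt

def panel_size(p_base, lift, z_alpha=1.96, z_beta=0.84):
    """Approximate pieces needed per panel to detect a relative
    lift in response rate, at ~95% confidence (z_alpha) and
    ~80% power (z_beta)."""
    p_test = p_base * (1 + lift)
    p_bar = (p_base + p_test) / 2          # average rate across panels
    diff = p_test - p_base                 # absolute difference to detect
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / diff ** 2
    return ceil(n)

# 1% baseline, trying to detect a 20% lift (1.0% -> 1.2%)
print(panel_size(0.01, 0.20))  # ~43,000 pieces per panel
```

Note how fast this shrinks for bigger effects: the same formula at a 40% lift needs roughly a quarter of the pieces – one reason the “take big swings” advice below pays off.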

I once did a 15-panel mail test on a membership mailing (details here) where we got significant results – a scientific “winner.”  The trick was that every panel got the same mail piece – the test panels only existed because I messed up the instructions to the mail house.  Even if something has p = .04, there’s still a 4% chance you’d see a difference that big from pure chance, even if nothing real were going on.  And when you test many panels over the course of a year, you are going to have much more randomness, chance, and noise mixed in with your wins and losses.

So the critical question is “how important is this to get right?”.  Based on that, you can assess whether you’d be satisfied with results short of statistical significance, need more than 95% confidence, or are comfortable with the standard threshold.  Always set this criterion ahead of time, lest your resolve weaken and you accept inferior results after the fact.  Again, there’s a great template for this planning in the Curse of Testing Illiteracy post.

As for how to get greater predictive power in your test results, some tips:

  • Take big swings. Don’t test whispers. Greater effects mean greater significance.  That means package tests, not teasers or envelope colors or the like.  This can also mean testing thematically – stories versus stats, for example – across multiple communications.
  • Repeat your test. We have a client with a list of 20K, which is still on the smaller side for getting significance.  A couple of times we’ve sent audience X package A and audience Y package B one month, then reversed it the next month: audience X gets B; audience Y gets A.  That way you get double the predictive power.  Even larger organizations can benefit from this tactic when testing with a smaller audience (e.g., what do our sustainers respond to?).
  • Use other people’s results. If you are testing something that comes from the scientific literature or from a case study from another nonprofit, you have a somewhat lower burden of proof if your results line up with theirs.  Similarly, if you have results from the testing in the previous section – Facebook ads, email, pre-test tool – you may be willing to accept a lower burden of proof because you already have evidence to support it (like any good Bayesian, remember your priors!).
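The swap design in the second tip works because pooling both months roughly doubles your sample, which shrinks the standard error by about √2. A self-contained sketch with hypothetical monthly counts (10K per panel, same response rates both months):

```python
from math import sqrt, erf

def confidence(resp_a, n_a, resp_b, n_b):
    """Two-sided confidence that two response rates differ
    (pooled two-proportion z-test)."""
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(resp_a / n_a - resp_b / n_b) / se
    return erf(z / sqrt(2))  # equals 1 minus the two-sided p-value

# Hypothetical month 1: 10K per panel, A gets 110 responses, B gets 135
one_month = confidence(110, 10_000, 135, 10_000)
# Month 2 swaps audiences; pool both months before testing
two_months = confidence(220, 20_000, 270, 20_000)
print(f"{one_month:.1%} -> {two_months:.1%}")
# The same response rates move from ~89% confidence (inconclusive)
# to ~98% (a callable winner) once both waves are pooled
```

The swap also cancels out any audience effect: each audience sees each package once, so a “better” audience can’t masquerade as a better package.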

What is most important to test?

One interesting aspect of the Pre-Test Tool is it not only tells you what wins, but what is more important.  For some organizations, it’s most important to get the voice of the message right; others have identity statements as the most vital.  You can see this dynamic at work with our testing with the DMA (now ANA) Nonprofit Federation.

That said, you can see patterns across organizations.  Looking at a number of nonprofits’ results, the aspects of communications most important to get right are most often around:

  • Who the donor is (why we talk constantly about donor identity)
  • What the organization does
  • What your gift does and why give it

Simple, I know.  You probably already knew this.  But think about all of the testing around what isn’t here: tagline, imagery, trust indicators, brands and logos, who signs the communication, etc., etc.  These are all items that are less important on average, but often seem to use a disproportionate amount of time and money.

Finally, a question I wish we were asked more often: when is a normal A/B test not enough?

When it has ripple effects.  So often, we will declare a subject line a winner but forget to check if it suppressed click/donation rates (e.g., you can get a pretty good open rate with the subject line “Your IRS audit”, but probably not a lot of donations).

We will declare an acquisition strategy a win and forget to check whether the donors were worth a darn thereafter.  A premium package, for example, could have a lower cost-to-acquire, but lower lifetime value as well if people are giving to get the premium rather than out of altruism or personal connection.

We will declare more communications as better for fundraising because they win head-to-head and forget to check whether donors had better retention or full-year giving (knowing as we do that there are interactions among communications, where some gifts aren’t generated but taken from the pieces around them).

It’s important to look at the second-order impacts of items that have ripple effects, revisiting our previous tests to see what they’ve truly wrought over the longer term.

Nick

P.S. If you want more testing info, we’d love to have you as a part of TNPA’s Test of the Month webinars!

2 responses to “Testing, Testing, A/B/C”

  1. Jay Love says:

    Excellent information Nick!

    Hopefully, every nonprofit will try to make some form of communication testing happen. Even just taking the time to think through a few tests will often help in the long run just like any form of metric focus…

  2. Cindy Courtier says:

    BRAVO!
    I hope those who read this will pay special attention to “When is normal A/B testing not enough?”
    Yes, the key is results…but ALL results, not just test scores.