The Curse of Testing Illiteracy
Spurred on by my post The Curse of Fundraising Innumeracy, reader Mikaela King over at the National Geographic Society decided to “dog pile” on with what she termed “another illiteracy” in our sector — testing illiteracy.
Mikaela noted, “A lack of discipline in conducting accurate A/B split testing, truly ensuring randomized segments, making sure your test segments are large enough to ensure statistical significance, only testing one element at a time (unless you’re intentionally testing a completely different offer), holding all other factors constant, only calling a test after it’s achieved statistical significance and only extrapolating the conclusions that were proven by the test. We have to “bootstrap” a lot in our industry to meet our budgets and grow our programs, but if I hear one more time about some amazing test results, only to see later that the test was fatally flawed and the results are unreliable…”
Of course she’s right. So I might as well tackle this “curse” next.
Let’s face it. Everyone talks about ‘testing’. Few fundraisers and their consultants really understand what true testing means and how to conduct it.
As I noted in an earlier post, The Idiocy of Testing, one of the great barriers to growth in our sector is that despite the countless thousands of hours and millions of dollars spent on so-called ‘testing’, the result is navel-gazing at best and months or years of time wasted at worst. Years that could have/should have resulted in breakthroughs and growth.
Instead, most of the testing I’ve seen in the nonprofit sector is worthless, yielding little or zero by way of insights and producing zilch when it comes to sustained change. Oh sure, there are lots of one-off ‘winners’, but it is a temporary, fleeting success. One winner for every 4, 5…20 losers; one step forward, two back, at best. The act of treading water. There is no sustained impact on net, head count or growth of any sort.
In recent years The Agitator has done a two-part series on Direct Mail Testing for Acquisition (Part 1 here and Part 2 here.)
Two years ago, in a post titled Direct Mail Testing to Nowhere, we once more warned that while the logic of the simple A/B test is sound, it is incredibly inefficient and unproductive. Given the usual manner in which this type of testing is conducted, it’s slow, painstaking and amounts to little more than a nudge forward. The affliction of massive incrementalism.
So, let’s try again.
Assuming you’re testing with growth and breakthroughs as your goal and not just going through the motions, what is the proper way to test?
Even more to the point, how do we break the all-too-common pattern of timid, take-little-risk testing that infects both agencies and nonprofits? An infection that ends up testing only the marginal and incremental. Orange vs. blue envelopes … this letter signer vs. that letter signer … $25 vs. $45 … and sizes of envelopes. Incrementalism to nowhere.
QUESTION: How do we conduct testing that is truly strategic and purposeful rather than habitual?
ANSWER: With discipline in the form of a proper plan and by meticulously following proper guidelines and methodologies for each test.
I’ve taken the Testing Plan and Protocol used by our sister company DonorVoice as a real-life example of proper testing. Feel free to copy it. More importantly, please use it. (In fact you might ask your consultant or agency to show you the process they use and compare notes.)
First, here’s an illustration of the Worksheet DonorVoice uses for putting together each and every test they’re involved with.
This Testing Worksheet/Planning tool is used by DonorVoice in conjunction with the following 10-point framework or protocol. Kevin Schulman, CEO of DonorVoice says: “This testing protocol will lead to far fewer and more meaningful tests (a big plus), and more definitive decision-making regarding outcomes (another big plus).”
1) Allocate 25% of your acquisition and house file budget to testing.
2) Of that 25%, put 10 points into incremental tests and 15 points into big ideas.
An important corollary here: some of this money should go into researching ideas, or into paying others to do it. You can even pre-vet ideas online with small, quick tests to gather data.
3) Set guidelines for expected improvement.
Any idea for incremental testing must deliver a 5% (or better) improvement in house results, and 10% in acquisition (we’ll see why the difference in a minute). Any idea considered “breakthrough” must deliver a 20% increase (or better).
4) Any idea – incremental or breakthrough – must have a ‘reason to believe’ case made that relies on theory of how people make decisions, publicly available experimental test results, or past client test results.
The ‘reason to believe’ must include whether the idea is designed to improve response or average gift or both – this will be the metric(s) on which performance is evaluated.
A major part of this protocol is guided by the view that far more time should be spent generating test ideas, and therefore on creating the ‘rules’ and incentives needed to produce that outcome.
This may very well result in 3 to 5 tests per year. If they are well conceived and vetted, that is a great outcome.
5) Determine test volume with math, not arbitrary, ‘best practice’ test panels of 25,000 (or whatever).
Use one of many web-based calculators (and underlying, simple statistical formulas). Here is one DonorVoice likes, but there are plenty – all free.
An acquisition example: if our control response rate is 1% and we want to be able to flag a 5% improvement – i.e., a response rate greater than 1.05% – as real, the test size would need to be 626,231 (at 80% power, 95% confidence and a 2-tailed test). That 626,231 is not a typo.
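For the curious, here’s a rough sketch in Python of the standard two-proportion sample-size formula that sits behind calculators like the one linked above (the function name and the use of scipy are my own choices for illustration, not anything DonorVoice prescribes). Different calculators implement slightly different variants of the formula, so the exact figure wanders a bit, but it lands in the same neighborhood as the 626,231 above:

```python
from scipy.stats import norm

def panel_size(p_control, relative_lift, alpha=0.05, power=0.80):
    """Approximate names needed per panel for a two-tailed, two-proportion test."""
    p_test = p_control * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    p_bar = (p_control + p_test) / 2
    top = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p_control * (1 - p_control)
                       + p_test * (1 - p_test)) ** 0.5) ** 2
    return top / (p_control - p_test) ** 2

# Acquisition example: 1% control response, flag a 5% relative lift (i.e., 1.05%)
print(round(panel_size(0.01, 0.05)))   # roughly 637,000 names per panel
```

Either way, the point stands: flagging a 5% lift on a 1% response rate takes a six-figure panel, not a 25,000-name hunch.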
How many acquisition test panels have been used in the history of nonprofit DM that are producing meaningless results because of all the statistical noise? A sizeable majority, at least.
6) Do not create a ‘random nth’ control panel that matches the test cell size for comparison.
I don’t know how many nonprofits and agencies employ this approach, but it can lead to drawing exactly the wrong conclusion about whether the test won or lost.
The problem with a ‘random nth’ control panel of equal size to the test – e.g. two panels drawn with random nth at 25,000 each – is that it creates a point of comparison with its own statistical noise, far more than the main control carrying all the volume. There are a few retorts or excuses that have surfaced in defense of this practice, but they are simply off-base.
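To make that noise concrete, here’s a minimal sketch; the 25,000 and 500,000 panel sizes and the 1% response rate are illustrative assumptions, not anyone’s real volumes:

```python
# How much noise does a small 'random nth' control panel add?
p = 0.01  # assumed true control response rate

def margin(p, n, z=1.96):
    """95% margin of error on an observed response rate from a panel of n names."""
    return z * (p * (1 - p) / n) ** 0.5

print(f"25,000-name random-nth control:   +/- {margin(p, 25_000):.3%}")
print(f"500,000-name full-volume control: +/- {margin(p, 500_000):.3%}")
# The small panel swings by roughly +/-0.12 points on a 1% rate -- bigger than the
# 5-10% lifts being tested -- so a comparison against it can flip on noise alone.
```

Measuring the test against the full-volume control keeps that second source of noise out of the comparison.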
7) Determine winners and losers with math, not eyeballing it.
Use one of many web-based calculators to input test and control performance and statistically declare a winner or loser. Again, here’s DonorVoice’s free choice.
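By way of illustration only – the gift and mail counts below are invented, and statsmodels is simply one library that does the same arithmetic as the web calculators – calling the result with math might look like this:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: full-volume control vs. a test panel
control_gifts, control_mailed = 5_000, 500_000   # 1.00% response
test_gifts, test_mailed = 720, 65_000            # ~1.11% response

z_stat, p_value = proportions_ztest(
    count=[test_gifts, control_gifts],
    nobs=[test_mailed, control_mailed],
    alternative="two-sided",
)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
print("statistical winner" if p_value < 0.05 else "no call yet -- don't eyeball it")
```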
8) Declare a test a winner or loser.
Add results to the ‘reason to believe’ document; maintain a searchable archive.
9) All winners go to full-volume rollout.
10) Losers can be resurfaced and changed with a revised ‘reason to believe’ case.
Denny Hatch, one of the best copywriters and direct mail veterans in the business and editor of Business Common Sense, reminds us of the late Ed Mayer’s admonition: “Don’t test whispers.” Meaning, small, incremental changes (‘whispers’) produce only incremental results not worth whispering about, let alone shouting.
Whether up or down, tiny changes hardly matter and they cost lots of time and money. So, put the DonorVoice testing discipline, or one as rigorous, to work for your future.
What’s your experience with testing?
Roger
P.S. In the Agitator Toolkit you’ll find the description of a highly accurate, fast and inexpensive technology for testing literally hundreds or even thousands of variables at a time. You might want to explore this. It’s why we labeled it “18 Months’ Worth of Testing in a Day.”
Here’s a short video describing how the process works.
Love the overall summary and points! But I think your example of 626,231 might be misleading. I think the evanmiller.org sample-size calculator assumes the control panel is the same size as the test? You rightly point out that the control should be larger. And when it is, you can see significant results with somewhat smaller test panels.