What’s Wrong With Your Testing?
Let’s face it. Everyone talks about ‘testing’. Few fundraisers and their consultants really understand what true testing means or how to conduct it.
As I noted in an earlier post, The Idiocy of Testing, one of the great barriers to growth in our sector is that despite the countless thousands of hours and millions of dollars spent on so-called ‘testing’, the result is navel-gazing at best and months or years of time wasted at worst. Years that could have/should have resulted in breakthroughs and growth.
Instead, most of the testing I’ve seen in the nonprofit sector is worthless, yielding little or zero by way of insights and producing zilch when it comes to sustained change. Oh sure, there are lots of one-off ‘winners’, but it is temporary, fleeting success. One winner for every 4, 5…20 losers; one step forward, two back or at best treading water. There is no sustained impact on net, head count or growth of any sort.
In recent years The Agitator has done a two-part series on Direct Mail Testing for Acquisition (Part 1 here and Part 2 here).
Then last May, in a post titled Direct Mail Testing to Nowhere, we warned once more that while the logic of the simple A/B test is sound, the way it’s usually conducted makes it incredibly inefficient and unproductive: slow, painstaking and amounting to little more than a nudge forward. The affliction of massive incrementalism.
So, let’s try again.
Assuming you’re testing with growth and breakthroughs as your goal and not just going through the motions, what is the proper way to test?
Even more to the point, how do we break the all-too-common pattern of timid, take-little-risk testing that infects both agencies and nonprofits? An infection that ends up testing only the marginal and incremental. Orange vs. blue envelopes … this letter signer vs. that letter signer … $25 vs. $45 … this envelope size vs. that one. Incrementalism to nowhere.
To put the question another way: How do we conduct testing that is truly strategic and purposeful rather than habitual?
With discipline in the form of a proper plan and by meticulously following proper guidelines and methodologies for each test.
I’ve taken the Testing Plan and Protocol used by our sister company DonorVoice as a real-life example of proper testing. Feel free to copy it. More importantly, please use it. (In fact you might ask your consultant or agency to show you the process they use and compare notes.)
Let’s start with an illustration of the Worksheet DonorVoice uses to put together each and every test they’re involved with.
This Testing Worksheet/Planning tool is used by DonorVoice in conjunction with the following 10-point framework or protocol. Kevin Schulman, CEO of DonorVoice says: “this testing protocol will lead to far fewer and more meaningful tests (a big plus), and more definitive decision-making regarding outcomes (another big plus).”
1) Allocate 25% of your acquisition and house file budget to testing.
2) Of the 25%, put 10% into incremental and 15% into big ideas.
An important corollary here: some of this money should go into researching ideas, or paying others to do it. You can even use the online environment to pre-vet ideas with small, quick tests that gather data.
3) Set guidelines for expected improvement.
Any incremental idea must deliver a 5% (or better) improvement in house results, and 10% in acquisition (we’ll see why the difference in a minute). Any breakthrough idea must deliver a 20% (or better) improvement.
4) Any idea – incremental or breakthrough – must have a ‘reason to believe’ case made that relies on theory of how people make decisions, publicly available experimental test results, or past client test results.
The ‘reason to believe’ must include whether the idea is designed to improve response or average gift or both – this will be the metric(s) on which performance is evaluated.
A major part of this protocol is guided by the view that far more time should be spent on generating test ideas – and, therefore, on creating the necessary ‘rules’ and incentives to produce that outcome.
This may very well result in 3 to 5 tests per year. If they are well conceived and vetted, that is a great outcome.
5) Determine test volume with math, not arbitrary, ‘best practice’ test panels of 25,000 (or whatever).
Use one of many web-based calculators (and the simple statistical formulas underlying them). Here is one DonorVoice likes, but there are plenty – all free.
An acquisition example: if our control response rate is 1% and we want to be able to flag a 5% improvement – i.e. a response rate of 1.05% or better – as real, the test size would need to be 626,231 (at 80% power, 95% confidence and a 2-tail test). That 626,231 is not a typo.
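For those who would rather run the numbers themselves, here is a minimal sketch of the same power calculation in Python using the statsmodels library instead of a web calculator. The inputs mirror the example above; the exact figure depends on the approximation each calculator or library uses, so expect a result in the same ballpark as 626,231 rather than a perfect match.

```python
# A minimal sketch of the sample-size calculation, assuming statsmodels.
# Exact figures vary with the formula each calculator uses, so the result
# should land near -- not exactly on -- the 626,231 quoted above.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

control_rate = 0.01                  # 1% control response rate
test_rate = control_rate * 1.05      # a 5% relative lift, i.e. 1.05%

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(test_rate, control_rate)

# Names needed per panel at 95% confidence, 80% power, two-tailed
n_per_panel = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Required names per panel: {n_per_panel:,.0f}")
```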
How many acquisition test panels have been used in the history of nonprofit DM that are producing meaningless results because of all the statistical noise? A sizeable majority, at least.
6) Do not create a ‘random nth’ control panel that matches the test cell size for comparison.
We don’t know how many nonprofits and agencies employ this approach, but it can lead to drawing exactly the wrong conclusion about whether the test won or lost.
The problem with a ‘random nth’ control panel of equal size to the test – e.g. two panels drawn with random nth at 25,000 each – is that it creates a point of comparison with its own statistical noise, and far more of it than the main control with all the remaining volume on it. A few retorts or excuses have surfaced in defense of this practice, but they are simply off-base.
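To make the point concrete, here is a quick simulation sketch. The 1% response rate and the panel sizes are assumptions chosen purely for illustration; the point is how much more often chance alone produces an apparent ‘win’ or ‘loss’ when the comparison is an equal-size random panel instead of the full remaining control.

```python
# A quick simulation sketch: test and control truly perform identically, yet
# noise alone produces big apparent swings far more often when the comparison
# point is a 25,000-name panel instead of the full control file.
# The 1% response rate and panel sizes are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.01          # both panels truly respond at 1%
test_panel = 25_000
small_control = 25_000    # equal-size 'random nth' control panel
full_control = 500_000    # the rest of the mailing, used as the control

swings_small = swings_full = 0
trials = 2_000
for _ in range(trials):
    test_rr = rng.binomial(test_panel, true_rate) / test_panel
    small_rr = rng.binomial(small_control, true_rate) / small_control
    full_rr = rng.binomial(full_control, true_rate) / full_control
    # how often does noise alone create an apparent 10%+ lift or drop?
    swings_small += abs(test_rr - small_rr) / true_rate >= 0.10
    swings_full += abs(test_rr - full_rr) / true_rate >= 0.10

print(f"Phantom 10%+ swings vs. 25k control panel:  {swings_small / trials:.0%}")
print(f"Phantom 10%+ swings vs. full control file:  {swings_full / trials:.0%}")
```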
7) Determine winners and losers with math, not eyeballing it.
Use one of many web-based calculators to input test and control performance and statistically declare a winner or loser. Again, here’s DonorVoice’s free choice.
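If you’d rather script it than use a web form, here is a minimal sketch of the kind of check those calculators perform – a two-proportion z-test via statsmodels. The response counts below are invented purely for illustration.

```python
# A minimal sketch of declaring a winner with statistics rather than
# eyeballing. The response counts below are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

test_responses, test_mailed = 320, 25_000            # 1.28% response
control_responses, control_mailed = 2_150, 200_000   # 1.075% response

stat, p_value = proportions_ztest(
    count=[test_responses, control_responses],
    nobs=[test_mailed, control_mailed],
    alternative="two-sided",
)
if p_value < 0.05:
    print(f"Real difference -- declare a winner or loser (p = {p_value:.4f})")
else:
    print(f"Could easily be noise -- no verdict yet (p = {p_value:.4f})")
```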
8) Declare a test a winner or loser.
Add results to the ‘reason to believe’ document; maintain a searchable archive.
9) All winners go to full-volume rollout.
10) Losers can be resurfaced and changed with a revised ‘reason to believe’ case.
Denny Hatch, one of the best copywriters and direct mail veterans in the business and editor of Business Common Sense, reminds us of the late Ed Mayer’s admonition: “Don’t test whispers.” Meaning, small, incremental changes (‘whispers’) produce only incremental results not worth whispering about, let alone shouting.
Whether up or down, tiny changes hardly matter and they cost lots of time and money. So, put the DonorVoice testing discipline, or one as rigorous, to work for your future.
What’s your experience with testing?
Roger
Testing Question: a sample size — even on acquisition — of 626K is impossible for many of your readers. What’s the fallback for smaller orgs?
Thanks Lisa. Was just going to ask the same thing.
Lisa and Pam,
Hi, hope all is well and thanks for asking about smaller groups. The 626k was partly for dramatic effect and to make the point, indirectly perhaps, that too often testing is done and “winners” declared because one number is higher than the other in the Excel document showing results. This difference is often nothing more than “noise” and random fluctuations. This is especially true in acquisition where the control and test response rate is so low and by extension, the sample size (those responding) is tiny.
So, a couple of options for smaller groups. Be satisfied with only looking for big changes (e.g. being able to flag a 30% or greater improvement as statistically significant). This means being even more diligent and focused on well-supported hypotheses (i.e. the reason to believe) to dictate which tests go in-market. Do not let the testing become habitual and rote; make it really important to research and identify worthy tests. This will result in far less testing, and that is a good outcome. It is an equally good outcome for most large charities, whose testing is a net loss (particularly when factoring in time, opportunity cost, etc., but even the hard costs alone usually equal a red number).
It also probably means focusing more on house file appeals than acquisition, because the response rates – and therefore the responder counts – are so much higher. For example, with a 10% response rate on a house appeal (control) and a desire to flag 30% improvements (e.g. 13% or higher) as significant, the sample size needed (per variation) is only 867.
Roger. This is so, so right, but it is a message that is very hard to get across to managers who have been through formal direct marketing training where the core message is “Test, test, test.”
Really interesting, Kevin — hope all is well with you too and thank you for the detailed explanation! Much appreciated.
We have a list of about 450, have 124 active donors, and a retention rate of about 60%. I keep track of retention #’s, giving history, who’s new, who’s back, who left, average gifts list-wide, who is giving more and who is giving less, when they give, and whether these are trends or one-time events. We tried a different style of letter last year and brought back over 30 folks who hadn’t given in years, but lost another 15 or so who had given in the last year.
I understand testing as something like this: I was a varsity field hockey player in high school. Once in gym class, feeling my oats, I got the ball and ran down the field toward the goal, leaving the pack behind. At first there were shouts and cheers, and then they stopped. I kept going, watching my feet, my stick and the ball, dribbling adeptly until I knew – because I saw the 25-yard line pass under my feet – that I was in range of the goal. I looked up just as I hit the ball, only to see it cross the end line about 30 feet to the left of the goal. I missed the opportunity to score for my team because I was focused on the ball, my stick, the cheers, my strong, fast legs, and what I thought I knew for certain – not on where the goal cage was and how to calibrate my actions to get there.
So yes and thank you to all the suggestions above. And it might even be easier, as so many things are, to figure out what to test and what not to bother with, in a small nonprofit.
Roger:
Do you have any articles, posts or chapters in books you can refer us to on point 6 in this post? This is so contrary to what I have been taught (and believed) all of my life that I am asking for something more on it. The commentary also does not seem to say what, exactly, the correct method of selecting test names would be. Have I missed something?
David Krear
David,
Hi, hope all is well. To your question, here is a clarification on point 6. If you have a donor file of 100,000 donors and want to test with 10,000 of them, you can take a random nth sample of the 100k by first sorting the 100k in random order (e.g. using a random number generator) and then taking every 10th record.
This leaves 90,000 records for the control. The point of point 6 – which may have been unclear – is that it makes no logical or statistical sense to take another random nth of 10,000 for the control and use those results as the comparison with the test group. Use the 90k.
If the random sample for the test is pulled properly it is, by definition, representative of the 90k left for the control. The 90k universe and the corresponding results create a much more reliable estimate of the control performance to compare with the test and give you more statistical power to flag the difference as meaningful or not.
You simply don’t need sample sizes to be equal (if this is the logic for a 10k control group) for statistical significance testing to determine if a test idea wins. In fact, doing so creates lots of room for false negatives – i.e. flagging tests as losers that are winners.
The only other argument we’ve heard is that we need an “apples to apples” comparison. Agreed, but apples to apples doesn’t require pulling two random samples – only one.
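Here is a minimal sketch of that procedure in Python with pandas and statsmodels: shuffle the file, keep every 10th record as the test panel, leave the rest as the control, and compare results with a two-proportion test. The donor file, its column names and the simulated response flag are assumptions made purely so the example runs end to end.

```python
# A sketch of the random-nth split described above. The donor file, its
# column names and the simulated 'responded' flag are assumptions so the
# example runs end to end.
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
donor_file = pd.DataFrame({
    "donor_id": np.arange(100_000),
    "responded": rng.binomial(1, 0.01, size=100_000),  # stand-in for real results
})

# Shuffle once, take every 10th record as the ~10,000-name test panel,
# and keep everything else -- the full ~90,000 -- as the control.
shuffled = donor_file.sample(frac=1, random_state=42).reset_index(drop=True)
test = shuffled.iloc[::10]
control = shuffled.drop(test.index)

# Compare the test panel against the WHOLE remaining control, not against
# a second 10,000-record panel drawn just to match the test size.
stat, p_value = proportions_ztest(
    count=[test["responded"].sum(), control["responded"].sum()],
    nobs=[len(test), len(control)],
)
print(f"test n={len(test):,}  control n={len(control):,}  p = {p_value:.3f}")
```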
As an important sidenote, depending on the size or composition of the file, a simple random sample may in fact not yield an apples-to-apples comparison. Sometimes ‘simple’ is too simple. By way of quick example, national polling is not done using a simple random sample (i.e. every nth). The reason is that there are ‘strata’ or segments that are nowhere near proportional, and simple random sampling can miss this. A national sample might be pulled by first grouping large counties and small counties separately and then doing a simple random sample from each. Otherwise, small counties (and the people who live there) are at high risk of being under-represented in a simple random sample.
The same can hold for a large donor file with segments that vary greatly in proportionate size.
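For completeness, here is a small sketch of that stratified approach – sampling within each segment so the test panel mirrors the file’s composition. The segment labels, their proportions and the 10% sampling fraction are all assumptions for illustration.

```python
# A sketch of proportional stratified sampling: draw the test panel within
# each segment so small strata (e.g. major donors) can't be under-represented
# by chance. Segment labels and proportions are assumptions for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
donor_file = pd.DataFrame({
    "donor_id": np.arange(100_000),
    "segment": rng.choice(
        ["major", "mid-level", "small-dollar"], size=100_000, p=[0.02, 0.18, 0.80]
    ),
})

# 10% from EACH segment, so the test panel mirrors the file's composition
test = donor_file.groupby("segment").sample(frac=0.10, random_state=42)
control = donor_file.drop(test.index)

print(donor_file["segment"].value_counts(normalize=True).round(3))
print(test["segment"].value_counts(normalize=True).round(3))
```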
Going into much more detail in a blog post comment is probably overburdening the medium (I probably already have). Happy to discuss further, so let me know if you want to chat.