TESTING: BASKIN-ROBBINS CURES CANCER!

January 2, 2018 | Admin

(aka The Myth of Statistical Significance)

Get the Nobel Prize ready. I know for a fact that Baskin-Robbins cures cancer in lab tests, despite having no medical training and having run no tests.

How can I be so certain?

Because they have 31 flavors.

If I ran a test of people who eat each flavor of Baskin-Robbins, it’s very likely that at least one group would have statistically significant results at the .05 level that it cures cancer. All a “.05 level” means is that there’s a 5% or less chance that a result happened by random chance alone. Given 31 shots at a one-out-of-twenty proposition, one or more is likely to have significant positive results.

(Before you rush out to buy a franchise, know that there’s also likely a flavor or two that causes cancer at the .05 level.)
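To put a number on that: if you treat each of the 31 flavors as an independent test at the .05 level (a simplifying assumption on my part), the chance that at least one comes up “significant” by luck alone is 1 − .95^31, or roughly 80%. A minimal sketch of the arithmetic in Python:

```python
# Chance that at least one of 31 independent tests clears the .05 bar
# by random chance alone: the Baskin-Robbins problem in two lines.
n_flavors = 31
alpha = 0.05

p_at_least_one = 1 - (1 - alpha) ** n_flavors
print(f"P(at least one false positive): {p_at_least_one:.0%}")  # ~80%
```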

Why do I tell this story, delicious as it may be?

Because I have lived this story. Not with ice cream, but with direct mail.

I set up a 15-panel test, with the (only slightly modified) results below:

You might say:

  • E2 is a strong winner
  • C had a strong showing as a concept
  • Don’t do what you did in A again – it had three of the worst six showings

Here’s the trick. All these panels were the same audience receiving the same piece. There was no test. It was my first time doing a list pull, so I used the same instructions as the previous year, when there was a 15-panel test.

The actual test had about the same differences in results as the fake one.

What does this mean for your testing regime?

First, let’s recognize that a .05 level of statistical significance is as artificial as, say, only finding guys attractive who are six-foot-plus. (Why, yes, I am 5’11”. Why do you ask?) You should have barely a modicum more certainty in a .049 significance than in a .051 one.

It also means that most tests don’t have winners or losers. They’re largely noise from which it is difficult to extract a meaningful signal. One reason is that most direct marketing tests don’t have sample sizes large enough to reliably detect even fairly large effects.

I’d recommend playing around with the sample size calculator at http://www.evanmiller.org/ab-testing/sample-size.html to see the sample sizes necessary for adequate statistical power.
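For a feel for the numbers, here’s a minimal sketch of that kind of power calculation in Python using the statsmodels library. The 3.65% and 3.24% response rates come from the discussion of my fake test in the comments below; the 80% power target and .05 alpha are conventional assumptions of mine, not anything prescribed by the calculator:

```python
# Rough two-proportion power calculation: pieces needed per panel to
# reliably detect a response-rate gap like the fake test's best-vs-worst
# spread (3.65% vs. 3.24%). Power and alpha targets are assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_best, p_worst = 0.0365, 0.0324
effect = proportion_effectsize(p_best, p_worst)  # Cohen's h

n_per_panel = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Pieces needed per panel: {n_per_panel:,.0f}")  # ~15,500 with these inputs
```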

This is also why it’s important to start with a strong hypothesis where testing is concerned. You want a hypothesis like this: “I believe this test should [increase response rate/increase average gift/increase donor lifetime value/some combination] because of X.” That way, you’ll know what you are going to measure against, instead of looking at a table like mine above and picking ‘winners’ based on wherever a bigger (and false, but alluringly so) number appears.

Your hypothesis should also be something that will have impacts beyond one communication. If you test a red envelope versus a white envelope, you might get a statistically significant result at the end of the test. But your results may be no different than if you had mailed the red envelope twice by mistake.

Moreover, you’ve only taken steps toward maximizing the value of that single communication, rather than the value of a donor. It’s when you focus on how to treat donors and what causes them to do what they do that you enter the realm of strategy, build donor loyalty, and maybe, just maybe, make your donors happier with their experiences.

So ask yourself – am I testing a cancer drug that has a chance of working because I have a theory about how it will metabolize?

Or am I testing ice cream flavors hoping one comes up statistically significant?

The latter can be a very rocky road.

Nick

6 responses to “TESTING: BASKIN-ROBBINS CURES CANCER!”

  1. Terry Pious Pereira says:

    GREAT AND THANKS!

  2. Claire says:

    I could be misunderstanding, but it seems to me you’re saying “strategy eats data for breakfast.” This makes sense, but what does this mean for the lion’s share of nonprofits (75–80%) with budgets under $1 million? I would think it would mean they should forget about testing entirely and focus all their resources on the donor experience — despite the fact that we’re always telling people not to rely on other people’s numbers and to “test things for yourself.” Would that be correct?

  3. Jay Love says:

    Claire, you hit the nail right on the head! For the nearly 90% of registered nonprofits with less than $1.5 million in annual revenue, list sizes are basically too small to be statistically valid for all but simple A/B tests.

    So yes, the concept of donor experience focus makes a ton of sense.

    Nick, any advice for those hundreds of thousands of NPOs with list sizes below 10,000 names?

  4. Nick Ellinger says:

    I would make a friendly amendment to “forget about testing entirely” and say that smaller nonprofits should largely forget about testing *small* things. Google famously tested 41 different shades of blue; most of us don’t have the volume to test even two shades, because the effect sizes will be so small.

    A small nonprofit can, though, still take on those tests that have the potential to be very significant to their program. Take, for example, a smaller nonprofit of my acquaintance that ran two matching gift programs: four pieces of their 8-to-12-piece mail program carried the messaging, along with related emails. Given what you’ve seen in The Agitator, you might want to test other types of matching or lead gift strategies, since doing so would transform the way a third of their communications are done.

    Here, you would want to A/B test the audience, rather than the communications. You could construct this any number of ways, but a simple one would be that 50% get only communications about lead gifts and 50% get only matching gift ones (see the sketch at the end of this comment). Over the course of the year – four mail pieces and six-ish email missives – you would be able to build a sample, and results, that make it a worthy test.

    This is a simple example and skims the surface, as matching gift is a technique rather than something core to the organization’s messaging. Those with smaller lists will have to be wily: getting hypotheses from the literature or their own donors’ expressed desires, selecting only those tests to run with greatest potential impact, and continuing tests across time, communications, and media. That said, as we’ll explore this week, these are also good steps for anyone to take, no matter the size.
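    Here’s a minimal sketch of that 50/50 audience split in Python. The donor IDs and the seed are hypothetical, invented for the example; fixing the seed just makes the assignment reproducible:

    ```python
    # Hypothetical 50/50 audience split for a year-long lead-gift vs.
    # matching-gift test. The donor file is a stand-in for illustration.
    import random

    donor_ids = [f"D{i:05d}" for i in range(1, 10_001)]  # stand-in donor file

    rng = random.Random(2018)  # fixed seed so the split is repeatable
    shuffled = donor_ids[:]
    rng.shuffle(shuffled)

    half = len(shuffled) // 2
    lead_gift_panel = set(shuffled[:half])   # gets only lead-gift appeals
    matching_panel = set(shuffled[half:])    # gets only matching-gift appeals
    ```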

  5. Ben Miller says:

    Nick, I do not want to distract from the larger point here, as I think I am in agreement with you: nonprofits need to have a better understanding of what they are trying to achieve with a test and how to read results. I do have to take exception to a few minor things in your post today, however. First, there are some great examples of spurious correlations (http://www.tylervigen.com/spurious-correlations) that shed light on the extent to which statistics are THE ANSWER versus just another piece of evidence.
    I wanted to point out that the table of results you provided from the 15-panel test appears to show no significant difference in response rate from best to worst. The best, at 3.65%, is only 0.41 percentage points better than the worst, at 3.24%. At the 95% confidence level, each panel would need a sample size of over 200,000 to prove that these two packages were statistically different. My hunch is that these panels were not that large, and therefore we could not reject the null hypothesis that they would yield the same response rate.
    The other point I wanted to make is that .05 is not arbitrary; it was chosen long ago because it is about two standard deviations away from the mean. It is, however, as you point out, not the be-all and end-all, and you can and are supposed to change this value depending on the test and the ramifications of being wrong. For example, the FDA will routinely use stricter p-values to accept clinical trial data where lives are on the line. As it relates to nonprofits, my opinion is that you can loosen up this confidence interval. You just need to ask yourself: if you are 80% confident (for example) that the average response is better, are you comfortable rolling out with the results? You can and should then calculate what sample size is appropriate.
    This goes to Claire and Jay’s points on the vast majority of nonprofits being of smaller size. I think they can get meaningful results even with the smallest of sample sizes. Think about FDA studies, which are routinely done with just hundreds of participants. You can also run the test across multiple mailings to see if the results are repeatable; that will also provide more evidence.
    I do want to be clear that I agree with your main point of starting with a strong hypothesis and making sure your testing is helping to advance the overall strategy. I just wanted to make these minor clarifications so that no one thinks that, just because there is an 87% correlation between the age of Miss America and murders by steam, hot vapours, and hot objects, there is no use for the correlation calculation.

  6. Nick Ellinger says:

    I own the spurious correlations book – it’s a great one and highlights our point about beginning with a strong hypothesis so you can distinguish signal from noise.

    For statistical significance, this is an excellent point. Most statistical significance calculators (e.g., https://www.kissmetrics.com/growth-tools/ab-significance-test/, http://www.mrss.com/toolshed/chi-squared-test-2/, and https://neilpatel.com/ab-testing-calculator/) use chi-squared, under which this test would be 98%+ significant (see the sketch at the end of this comment). That was what I (and my agency) were using at the time – they actually came to me and said we needed to roll out with E2, not knowing the packages were the same. The one calculator I linked to (http://www.evanmiller.org/ab-testing/sample-size.html) uses confidence intervals and is more rigorous.

    And good point that smaller nonprofits probably don’t need to go to this level of significance – I wish I’d thought to say that in my earlier comment.

    As to correlations, they are a very useful tool and, like all statistics, should be interrogated as to why. One of my old posts, at https://directtodonor.com/2015/11/13/the-dirty-dirty-data-tricks-that-dirty-dirty-people-will-use-to-try-to-get-their-way/, talks about correlation versus causation by way of Matt Damon, Neil deGrasse Tyson, and fantasy football. Like most of my thoughts on higher-level math, it’s probably mostly good for a laugh…
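    To show what those chi-squared calculators compute under the hood, here’s a minimal sketch in Python using scipy. The panel size and response counts are hypothetical, invented only so the rates echo the 3.65% and 3.24% discussed above:

    ```python
    # Hypothetical chi-squared test on two panels. Counts are invented;
    # only the response rates (3.65% vs. 3.24%) echo the thread above.
    from scipy.stats import chi2_contingency

    n = 20_000                             # hypothetical pieces per panel
    responders_a, responders_b = 730, 648  # 3.65% and 3.24% of 20,000

    table = [
        [responders_a, n - responders_a],
        [responders_b, n - responders_b],
    ]
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"chi-squared p-value: {p_value:.3f}")
    # A p-value under .05 here can look like a "winner" even when, as in
    # the fake test above, every panel received the identical package.
    ```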