I love running A/B and multivariate (‘MVT’) tests. These are experiments designed to evaluate different design or copy variants, based on actual performance data. Instead of comparing or deciding on product options based on gut feel, an A/B or multivariate test allows you to compare alternatives based on objective data and predefined success criteria.
However, running these kinds of tests can be quite tough. These are some of the reasons why:
- Insufficient traffic – Time and traffic are two important prerequisites if you want to be able to draw meaningful conclusions from your experiments. However, what do you do if you don’t have a large user base yet or when your traffic starts faltering?
- Not sure which metric(s) to focus on – One of the things that I’ve learned the hard way is the importance of being clear upfront about the exact goal of the test and ensuring that you’ve selected the relevant metric(s) to focus on.
- Determining the required sample size – Working out the sample size you need in order to reach a “point of statistical significance” with your test can be tricky. Luckily, most A/B testing tools have an automated function for calculating this.
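To get a feel for what such a sample size calculator does under the hood, I put together a minimal sketch. It assumes a classic fixed horizon, two-sided test on two conversion rates, using the standard normal approximation. The function name `sample_size_per_variant`, its parameters and the example numbers are my own illustration, not taken from any particular tool, and a commercial calculator may well use a different statistical model and return different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (hypothetical helper).

    baseline_rate: conversion rate of the control, e.g. 0.05 for 5%
    relative_lift: smallest relative improvement worth detecting, e.g. 0.20 for +20%
    alpha:         accepted false positive rate (0.05 matches 95% significance)
    power:         chance of detecting the lift if it really exists
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)

    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Textbook normal-approximation formula for comparing two proportions.
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Example with made-up numbers: 5% baseline, looking for a 20% relative lift.
needed = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.20)
print(f"Visitors needed per variant: {needed}")
```

What this makes visible is why low traffic hurts so much: halving the lift you want to detect roughly quadruples the number of visitors you need.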
The other day I came across a great post by Optimizely titled “Stats with Cats: 21 Terms Experimenters Need to Know”. Reading through this piece really helped me understand more about how to best design an experiment and tackle some of the common issues that I outlined above.
These are the main things that I learned from Stats with Cats: 21 Terms Experimenters Need to Know:
- Statistical significance – Significance is a statistical term that tells how sure you are that a difference or relationship exists. For example, if you want to be able to confidently tell whether there’s a difference between version A and B, you need a threshold (e.g. 95%, which leaves a 5% chance of error) to describe the level of error you’re comfortable with in a given A/B test. Significant differences can be large or small, depending on your sample size.
- Confidence interval – This is a computed range used to describe the certainty of an estimate of some underlying parameter. In the case of A/B testing, these underlying parameters are conversion rates or improvement rates.
- Bayesian – This is a statistical method that takes a bottom-up approach to data analytics when calculating statistical significance. It encodes the past knowledge of similar, previous experiments into a prior, which is a statistical device. You can use this prior in combination with current experiment data to make a conclusion on a currently running experiment.
- Effect size – The effect size (also known as “improvement” or “lift”) is the amount of difference between the original version (‘control’ version) and a variant. This could be an increase in conversion rate (a positive improvement) or a decrease in conversion (a negative improvement). The effect size is a common input into many sample size calculators. For example, Optimizely’s A/B Test Sample Size Calculator lets you enter a baseline conversion rate for your control version together with the minimum detectable effect you want the test to be able to pick up.
- Error rate – The error rate stands for the chance of finding a conclusive difference between a control version and a variation in an A/B test where there isn’t one, or not finding a difference where there is one. This encompasses “type 1” and “type 2” errors. A type 1 error occurs when a conclusive outcome (winner or loser) is declared even though the test is actually inconclusive, i.e. there is no real difference between the versions. This is often referred to as a “false positive”. With type 2 errors, no conclusive result (winner or loser) is declared, failing to discover a conclusive difference between a control and a variation when there was one. This is also referred to as a “false negative” (see Fig. 1 below).
- Hypothesis test – Sometimes called a “t-test” (strictly speaking, a t-test is one specific type of hypothesis test), a hypothesis test is a statistical inference methodology used to determine if an experiment result was likely due to chance alone. Hypothesis tests try to disprove a null hypothesis, i.e. the assumption that two variations are the same. In the context of A/B testing, hypothesis tests will help determine the probability of seeing a difference at least as large as the one you measured, supposing the variations were actually the same.
- Fixed horizon hypothesis test – The key thing with a fixed horizon test is that it’s designed to come to a decision about version A or B at a set moment in time, typically once a sample size that was decided upfront has been reached, and not before.
- Sequential hypothesis test – A sequential hypothesis test is the opposite of a fixed horizon hypothesis test, as the underlying principle of this test is that the experimenter can make a decision on the test at any point in time.
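To see how effect size, statistical significance and confidence intervals fit together, I found it useful to work through a small example. The sketch below assumes a fixed horizon, two-sided z-test on two conversion rates, which is one common way to run a hypothesis test on A/B data but by no means the only one (a Bayesian or sequential approach would look different). The function name `compare_variants` and the visitor and conversion counts are made up for illustration.

```python
import math
from statistics import NormalDist


def compare_variants(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Compare control (A) and variant (B); hypothetical helper.

    conv_a, conv_b: number of conversions in each version
    n_a, n_b:       number of visitors in each version
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a

    # Effect size expressed as relative lift over the control.
    lift = diff / rate_a

    # Hypothesis test: under the null hypothesis both versions share one
    # conversion rate, so the standard error uses the pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval for the absolute difference in conversion rate.
    se_diff = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    interval = (diff - z_crit * se_diff, diff + z_crit * se_diff)

    return {
        "lift": lift,
        "p_value": p_value,
        "confidence_interval": interval,
        "significant": p_value < (1 - confidence),
    }


# Made-up data: 500 of 10,000 visitors convert on A, 600 of 10,000 on B.
result = compare_variants(500, 10_000, 600, 10_000)
print(result)
```

With a 95% threshold, the result counts as significant when the p-value drops below 0.05. For a clearly significant result like the one in this example, the confidence interval for the difference also excludes zero.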
Main learning point: Even though I’m not a statistician or a data analyst, I found it really helpful to learn more about some of the terms that experimenters need to know about. Especially given some of the challenges with respect to running successful experiments, I believe it’s important to think through things such as a null hypothesis or desired effect size before you design and run your experiment.
Fig. 1 – Possible outcomes of A/B experiments – Taken from: http://blog.optimizely.com/2015/02/04/stats-with-cats-21-terms-experimenters-need-to-know/#type-i
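To make the “false positive” outcome in Fig. 1 a bit more concrete, here is a small simulation sketch. It runs a series of A/A tests, where both versions have exactly the same true conversion rate, and counts how often a two-sided z-test at a 95% threshold still declares a winner or loser. All names (`p_value`, `false_positive_rate`) and numbers are my own assumptions for illustration. In theory the share of false positives should land somewhere around 5%, though the exact figure depends on the random seed.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test (hypothetical helper)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.10, visitors=500, experiments=500,
                        alpha=0.05, seed=42):
    """Share of A/A experiments that wrongly report a significant difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(experiments):
        # Both versions convert at the same true rate: any "winner" is a type 1 error.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / experiments


rate = false_positive_rate()
print(f"Share of A/A tests declared significant: {rate:.1%}")
```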
Related links for further learning: