Are you sure that your results are statistically valid? A lot of optimizers fail to recognize when their results are inconclusive and are shocked when the conversion rate drops after implementing a change.
Here are 5 common reasons why your results are inconclusive:
Reason #1: Confidence
One reason why your results are inconclusive is the confidence level. This level determines the risk you are willing to take with your results. Israel (1992) describes the confidence level in the following way:
“This means that, if a 95% confidence level is selected, 95 out of 100 samples will have the true population value” (1).
This is usually a good thing because you are minimizing the risk of getting false results. But sometimes it might be better to accept a little more risk in order to reach a significant result. That way you can still conclude your test.
An example of calculating this without being a statistical mastermind is using the CXL AB test calculator:
In the calculator, you can see that lowering the confidence level decreases the required sample size per variant. Be aware, though, that you are increasing the risk of accepting a false positive. But the large reduction in required sample size that you get from lowering your confidence level might be a calculated risk worth taking.
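To see how the confidence level drives the required sample size, here is a minimal sketch of the standard two-proportion sample-size formula in Python. The 5% baseline and 6% target conversion rates are hypothetical values, not numbers from the calculator:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, confidence=0.95, power=0.80):
    """Approximate sample size per variant for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled conversion rate
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Lowering the confidence level shrinks the required sample per variant:
print(sample_size_per_variant(0.05, 0.06, confidence=0.95))  # larger
print(sample_size_per_variant(0.05, 0.06, confidence=0.90))  # smaller
```

This is the same trade-off the calculator makes visible: dropping from 95% to 90% confidence cuts the sample requirement substantially, at the cost of a higher false-positive risk.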
But the confidence level is not the only thing that determines whether your sample size is big enough. Israel (1992) explains that several factors decide the required sample size:
“Perhaps the most frequently asked question concerning sampling is, “What size sample do I need?” The answer to this question is influenced by a number of factors, including the purpose of the study, population size, the risk of selecting a “bad” sample, and the allowable sampling error” (1).
The calculator can help you determine that, but understanding the different metrics is important if you want to use it. We covered the first one, but what about the other two factors?
Reason #2: Level of precision
Another reason why your results are inconclusive is the level of precision of your sample. Israel (1992) defines the level of precision in the following way:
“The level of precision, sometimes called sampling error, is the range in which the true value of the population is estimated to be” (1).
Israel (1992) describes an example study that helps elaborate on this definition. You might have results indicating that 15% of tests will not reach statistical significance, with a level of precision of 5%. This means the results conclude that between 10% and 20% of the tests will not reach statistical significance.
But how does that affect the validity of your testing results? To understand that, you need to know what a confidence interval is. According to McLeod (2019), a confidence interval is:
“A range of values that’s likely to include a population value with a certain degree of confidence. It is often expressed as a % whereby a population mean lies between an upper and lower interval”
The CXL AB test calculator can show the confidence interval of a test at the bottom of the calculator when clicking on “Z Test”.
Researchhubs (2015) describes the level of precision as the width of a confidence interval, indicated by the grey bars in the image above. Looking at the grey bar of variant 1, for example, you can see that the lower interval is 5.4% and the upper interval is 6%, so the level of precision is 0.6%. A smaller gap equals a higher level of precision.
Researchers look at confidence intervals to find out whether a change was significant and therefore conclusive. When the lower interval of variant 1 (5.4%) is higher than the upper interval of the control (5.3%), we can say that the result is significantly better.
If the level of precision were lower, this result would be inconclusive, because the lower interval of the variant would no longer be higher than the upper interval of the control.
As you can see in the figure above, the grey bars intersect each other, which signals an inconclusive result.
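The interval comparison above can be sketched in Python using the normal approximation. The visitor and conversion counts below are hypothetical numbers chosen to roughly reproduce the intervals discussed (control upper bound near 5.3%, variant lower bound near 5.4%):

```python
from statistics import NormalDist

def proportion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * (p * (1 - p) / visitors) ** 0.5
    return p - margin, p + margin

control_low, control_high = proportion_ci(1000, 20000)  # ~5.0% conversion
variant_low, variant_high = proportion_ci(1140, 20000)  # ~5.7% conversion

# The result is conclusive only if the intervals do not overlap:
print(variant_low > control_high)
```

With these numbers the variant's lower bound just clears the control's upper bound; shrink the sample (widen the intervals) and the same lift becomes inconclusive.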
Remember to use the calculator to check whether this is the case and whether the results you are looking at might be inconclusive. You can then choose to take a bigger risk by lowering your confidence level, collect more data, or move on to your next test.
Reason #3: Degree of variability
Variability refers to the spread of results around their average. A great way to explain variability is shown in these two quiz results from LibGuides:
The average results of both quizzes are the same, but the variability is different because the answers in quiz 1 are spread across more of the data points. This means that the variability of quiz 1 is higher than that of quiz 2, and that quiz 1 will need more data than quiz 2 to reach a point where you can draw valid conclusions. Israel (1992) describes this in the following way:
“The more heterogeneous a population, the larger the sample size required to obtain a given level of precision. The less variable (more homogeneous) a population, the smaller the sample size.” (2).
Looking at the quizzes above, you can understand that you need access to the raw data of your tests to measure the variability of your variants. According to LibGuides, there are several ways of measuring the degree of variability:
- interquartile range
- standard deviation
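Both measures are available in Python's standard library. As a sketch, the two score lists below are made-up examples of a spread-out quiz and a clustered quiz, not LibGuides' actual data:

```python
from statistics import pstdev, quantiles

quiz1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # scores spread widely
quiz2 = [5, 5, 5, 5, 5, 6, 6, 6, 6, 6]   # scores clustered together

def iqr(scores):
    """Interquartile range: spread of the middle 50% of the data."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1

# Same average, but quiz 1 is far more variable than quiz 2:
print(pstdev(quiz1), iqr(quiz1))
print(pstdev(quiz2), iqr(quiz2))
```

Both lists average 5.5, yet every variability measure is much larger for quiz 1, which is why it would need the bigger sample.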
You will then have to use that data to find out how much of an increase you need in your sample size before your results are conclusive. This topic goes beyond what I can cover in one blog post, so I recommend the statistics course for A/B testing by Georgi Georgiev at CXL. It covers everything you need to know about validating your A/B test results.
Reason #4: A wrong or absent hypothesis
A basic mistake that optimizers make when they start testing is blindly copying competitors to come up with changes for their website. They design, develop, and test the change and get a significant result. But when they are asked what they have learned from their test, they remain speechless.
Testing without a hypothesis will not teach you anything about creating a better user experience. You were probably just lucky that the change gave you a good result, yet you have no idea why. Not having a hypothesis will therefore make your results inconclusive.
McCombes (2019, April 23) states the following:
“A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data)”.
That’s why a wrong hypothesis will also invalidate your results: a hypothesis has to be based on existing theories and knowledge, preferably research that has been executed on your own target group or users. McCombes (2019) explains that a hypothesis is an answer to your research question that still has to be tested.
I like to use this hypothesis template inspired by Craig Sullivan:
“We believe that doing [A] for people [B] will make the outcome [C] happen.”
Writing a good hypothesis is one of the first and most important steps of the conversion optimization process and should be treated as such.
Reason #5: Length pollution
The length of your tests can give you inconclusive results if you are not careful. Julien Le Nestour (2015) describes that it is important to always run your test in full business cycles to prevent sample pollution.
This is because the type of visitor can vary a lot across the days of the week. Business cycles are usually defined as weeks, as seen in the CXL pre-test calculator:
The founder of CXL states the following on test length:
“If you don’t test a full week at a time, you’re skewing your results” (Peep Laja, 2019).
The next time you find that your results are significant within 3 days, you should not stop your test, because the results are still inconclusive due to length pollution. To prevent this, I usually only share and analyze the current results of my tests at the end of a business cycle.
This means you will need to test for at least 7 days and it’s recommended to test longer when you think you can reach a significant result.
The goal of this article was to point out that many factors can invalidate your testing results, and that you should be aware of all of them to avoid implementing a negative change. Educate yourself on these factors and use calculators to effortlessly flag inconclusive results for you.
Israel, G. D. (1992). Determining the sample size.
Le Nestour, J. (2015, January 13). How long should you run your A/B test for? 3 principles to follow. https://julienlenestour.com/long-run-ab-test/
Laja, P. (2019, April 21). 12 A/B testing mistakes I see all the time. CXL. https://cxl.com/blog/12-ab-split-testing-mistakes-i-see-businesses-make-all-the-time/#weeks
McCombes, S. (2019, April 23). How to Write a Strong Hypothesis | Steps and Examples. Scribbr. https://www.scribbr.com/research-process/hypotheses/
Yartsev, A. (2016). Precision of findings: p-value and the confidence interval. Deranged Physiology. https://derangedphysiology.com/main/required-reading/statistics-and-interpretation-evidence/Chapter%201.5.0/precision-findings-p-value-and-confidence-interval#:~:text=The%20precision%20of%20the%20findings,own%2Dtruth%20result%20is%20found.
Accuracy vs. Precision of confidence intervals. (2015). Researchhubs.com. http://researchhubs.com/post/ai/data-analysis-and-statistical-inference/accuracy-vs-precision.html
McLeod, S. (2019, June 10). Z-score: What are confidence intervals in statistics? Simply Psychology. https://www.simplypsychology.org/confidence-interval.html
LibGuides: Maths: Measures of variability. (2020). Libguides.com. https://latrobe.libguides.com/maths/measures-variability