A few weeks ago I wrote about how I was working on an A/B test simulator and asked if anyone was interested in working on it. A few of you reached out (thank you!) but the discussions quickly stalled because I realized that I didn’t have a good plan for where to take it from there.
Rather than let it linger on my MacBook forevermore, I made a push to ship v1, and I’m happy to say you can check it out on GitHub here:
How it works
Here’s the idea:
Let’s say you’re running a big test on your homepage, which has a conversion rate of 10%, and you think your test will either do really well (+20%) or fail terribly (-20%). You configure this in the script:
ORIGINAL_CONVERSION_RATE = 0.1
VARIATION_OUTCOMES = [-0.2, 0.2]
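To make the configuration concrete, here’s a hypothetical sketch (not the repo’s actual code) of how a variation’s conversion rate could be derived from the baseline and one of the configured outcomes, and how a single simulated visitor would then convert:

```ruby
ORIGINAL_CONVERSION_RATE = 0.1
VARIATION_OUTCOMES = [-0.2, 0.2]

# Pick one outcome at random for this simulated A/B test.
outcome = VARIATION_OUTCOMES.sample

# The variation's true conversion rate: either 0.08 (-20%) or 0.12 (+20%).
variation_rate = ORIGINAL_CONVERSION_RATE * (1 + outcome)

# A simulated visitor converts with that probability (a Bernoulli trial).
converted = rand < variation_rate
```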
Also, you want to run your A/B test until you’ve either had more than 10,000 participants or until the test has reached 99% significance. You configure this in the script as well:
return "pass" if total_participants >= 10000
if gaussian?(participants_a, conversions_a) && gaussian?(participants_b, conversions_b)
if p_value(participants_a, conversions_a, participants_b, conversions_b) >= 0.99
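For readers who want to see what a significance check like this might look like under the hood, here’s a hedged sketch using a standard two-proportion z-test with a normal approximation. The repo’s `gaussian?` and `p_value` helpers may be implemented differently; `z_score` and `confidence` are illustrative names:

```ruby
# Standard two-proportion z-test: how many standard errors apart
# are the two observed conversion rates, under a pooled estimate?
def z_score(n_a, c_a, n_b, c_b)
  p_a = c_a.to_f / n_a
  p_b = c_b.to_f / n_b
  p_pool = (c_a + c_b).to_f / (n_a + n_b)
  se = Math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_a + 1.0 / n_b))
  (p_b - p_a) / se
end

# Normal CDF via the error function: confidence that B beats A.
# A value >= 0.99 would correspond to the 99% threshold above.
def confidence(n_a, c_a, n_b, c_b)
  0.5 * (1 + Math.erf(z_score(n_a, c_a, n_b, c_b) / Math.sqrt(2)))
end
```

For example, with 1,000 participants per arm, 100 conversions on A versus 140 on B clears the 99% bar, while 100 versus 105 does not.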
When you run the script (ruby abtest-simulator.rb), it simulates 1,000 A/B tests. For each test we assign each visitor to one of the variations and continue until we declare a winner, or record a pass if a winner is never decided on:
Summary:
Passes: 74
Correct: 908
Incorrect: 18
Correct Decisions: 908/926: 98.06%
908 times out of 1,000 our criteria made the “correct” decision: we chose the winning +20% variation or didn’t choose the -20% variation. In 18 tests we incorrectly chose the -20% variation or failed to choose the +20% one. And in 74 tests out of 1,000 we never reached significance.
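The overall simulation loop described above can be sketched like this. Note this is a deliberately simplified, hypothetical version (smaller run counts, and it always decides at the participant cap instead of checking significance or recording passes), just to show the shape of the accounting:

```ruby
ORIGINAL_RATE = 0.1
OUTCOMES = [-0.2, 0.2]
RUNS = 200          # simplified; the real script runs 1,000 tests
PER_ARM = 5_000     # 10,000 total participants, split evenly

# Run one simulated A/B test and report whether the final
# decision matched the variation's true underlying outcome.
def run_one_test
  outcome = OUTCOMES.sample
  rate_b = ORIGINAL_RATE * (1 + outcome)
  conv_a = conv_b = 0
  PER_ARM.times do
    conv_a += 1 if rand < ORIGINAL_RATE  # control arm
    conv_b += 1 if rand < rate_b         # variation arm
  end
  # Correct if we pick B exactly when B's true rate was higher.
  (outcome > 0) == (conv_b > conv_a)
end

results = Array.new(RUNS) { run_one_test }
puts "Correct: #{results.count(true)}/#{RUNS}"
```

With effects this large (plus or minus 20% on a 10% baseline) and 5,000 visitors per arm, nearly every run decides correctly, which matches the intuition behind the summary numbers above.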
The idea with this project is that you can play around with the numbers to see what impact they have. For example:
- What is the impact of 99% significance vs 95% significance?
- What if you just wait until there are 50 conversions and pick the best performer?
- What if you don’t expect the test to result in a big change, but only smaller ones? (Hint: A/B testing small changes is a mess.)
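As a starting point for the second question above, an alternative stopping rule is easy to express. This is a hypothetical helper, not something in the repo: it ends the test as soon as either variation reaches 50 conversions and picks whichever is ahead:

```ruby
# Hypothetical "first to 50 conversions" stopping rule.
# Returns the winning variation name, or nil if the test
# should keep running.
def fifty_conversion_rule(conversions_a, conversions_b)
  return nil unless conversions_a >= 50 || conversions_b >= 50
  conversions_b > conversions_a ? "b" : "a"
end
```

Swapping a rule like this in for the significance check is exactly the kind of experiment the simulator is meant to make cheap.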
If anyone is interested in helping on this project, now’s a good time to get involved.
Specifically, I’d love for folks to go through the script and verify that I haven’t made any logical mistakes. I don’t think I have, but I also wouldn’t bet my house on it. That’s also why I’m not including any general “lessons learned” from this simulator just yet – I don’t want to report on results until others have verified that all is well with the script. I also wouldn’t rule out someone saying “Matt, the way you’ve done this doesn’t make any sense”. If I uncover any mistakes, on my own or through others, I’ll write posts about them so others can learn as well.
If you can’t find any bugs, just play around with it. How does the original conversion rate impact the results? How does the distribution of results impact it? How do the criteria for ending the test impact it? Eventually we can publish our findings – the more people that contribute, the better the writeup will be.