A/B Test Simulator v1

A few weeks ago I wrote about how I was working on an A/B test simulator and asked if anyone was interested in working on it. A few of you reached out (thank you!), but the discussions quickly stalled because I realized I didn’t have a good plan for where to take it from there.

Rather than let it linger on my MacBook forevermore, I made a push to ship v1, and I’m happy to say you can check it out on GitHub here:

https://github.com/mattm/abtest-simulator

How it works

Here’s the idea:

Let’s say you’re running a big test on your homepage, which has a 10% conversion rate, and you think your test will either do really well (+20%) or fail terribly (-20%). You configure this in the script:
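For example, the configuration might look something like this (a minimal sketch; the constant names are placeholders and the actual script may structure things differently):

    # Baseline conversion rate for the control
    ORIGINAL_CONVERSION_RATE = 0.10

    # The variation either wins big (+20%) or loses big (-20%)
    VARIATIONS = [
      { name: "+20%", conversion_rate: ORIGINAL_CONVERSION_RATE * 1.2 },
      { name: "-20%", conversion_rate: ORIGINAL_CONVERSION_RATE * 0.8 }
    ]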

Also, you want to run your A/B test until you’ve either had more than 10,000 participants or until the test has reached 99% significance. You configure this in the evaluate method:
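A sketch of what that stopping rule could look like (the significance helper and the participant/conversion attributes here are assumptions about the script’s internals):

    # Stop once we reach 99% significance or exceed 10,000 participants.
    def evaluate(control, variation)
      if significance(control, variation) >= 0.99
        variation.conversion_rate > control.conversion_rate ? :variation : :control
      elsif control.participants + variation.participants > 10_000
        :no_winner
      else
        :continue
      end
    end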

When you run the script (ruby abtest-simulator.rb), it simulates 1,000 A/B tests, where for each test we assign each visitor to one of the variations and continue until we either declare a winner or record a pass if a winner is never decided on:

Summary:
Passes: 74
Correct: 908
Incorrect: 18
Correct Decisions: 908/926: 98.06%

In 908 tests out of 1,000, our criteria made the “correct” decision: we chose the winning +20% variation or didn’t choose the -20% variation. In 18 tests we incorrectly chose the -20% variation or didn’t choose the +20% variation. And in 74 tests out of 1,000 we never reached significance.

The idea with this project is that you can play around with the numbers to see what impact they have. For example:

  • What is the impact of 99% significance vs 95% significance?
  • What if you just wait until there are 50 conversions and pick the best performer?
  • What if you don’t expect the test to result in a big change, but only smaller ones? (Hint: A/B testing small changes is a mess.)

Next steps

If anyone is interested in helping on this project, now’s a good time to get involved.

Specifically, I’d love for folks to go through the script and verify that I haven’t made any logical mistakes. I don’t think I have, but I also wouldn’t bet my house on it. That’s also why I’m not including any general “lessons learned” from this simulator just yet – I don’t want to report on results until others have verified that all is well with the script. I also wouldn’t rule out someone saying “Matt, the way you’ve done this doesn’t make any sense”. If I discover any mistakes, on my own or through others’ feedback, I’ll write posts about them so others can learn as well.

If you can’t find any bugs, just play around with it. How does the original conversion rate impact the results? How does the distribution of results impact it? How do the criteria for ending the test impact it? Eventually we can publish our findings – the more people that contribute, the better the writeup will be.

Analyzing an A/B Test’s Impact Using Funnel Segmentation

If you decide to roll your own in-house A/B testing solution, you’re going to need a way to measure how each variation in each test influences user behavior.

In my experience the best way to do this is to take advantage of a third party analytics tool and piggyback on its funnel segmentation features. This post is about how to do that.

Funnel Segmentation 101

Consider this funnel from Lean Domain Search:

  1. A user performs a search
  2. Then clicks on a search result
  3. Then clicks on a registration link

In Mixpanel, the funnel looks like this:

[Screenshot: Mixpanel funnel report for the three-step Lean Domain Search funnel]

Of the 35K people who performed a search, 9K (26%) clicked on a search result, and 900 of those (10% of those who clicked, 2.5% overall) clicked a registration link.

We can then use Mixpanel’s segmentation feature to segment on various properties to see how they impact the funnel. For example, here’s what segmenting on Browser looks like:

[Screenshot: the same Mixpanel funnel segmented by Browser]

We can see that 27% of Chrome searchers click on a search result compared to only 18% of iOS Mobile visitors. We could also segment on other properties that Mixpanel’s tracking client automatically collects such as the visitor’s country, which search engine he or she came from, and most importantly for our purposes here, custom event properties.

Passing Variations as Custom Event Properties

Segmenting on a property like the visitor’s country is very similar conceptually to segmenting on which A/B test variation a user sees. In both cases we’re breaking down the funnel to see what impact the property value (each country or each variation) has on the rest of the funnel.

Consider a toy A/B test where we measure the impact of the homepage’s background color on sign ups.

When the visitor lands on the homepage, we fire a Visited Homepage event with an abtest_variation property set to the name of the variation the user sees:
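With Mixpanel’s JavaScript client, that tracking call might look something like this (the white value is hardcoded purely for illustration):

    mixpanel.track( 'Visited Homepage', {
      abtest_variation: 'white'
    } );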

With this in place, you can then set up a funnel such as:

  1. Visited Homepage
  2. Signed Up

Then segment on abtest_variation to see what impact each variation has on the rest of the funnel.

In the real world, you’re not going to have white hardcoded like it is in the code snippet above. You’ll want to make sure that whatever A/B test variation the user is assigned to gets passed as the variation property’s value on that tracking event.

Further improvements

The setup above should work fine for a v1, but there are several ways you can improve it for long-term testing.

Pass the test name as an event property

I recommend also passing an abtest_name property on the event:
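Again as a sketch, with the values hardcoded for illustration:

    mixpanel.track( 'Visited Homepage', {
      abtest_name: 'homepage test 3',
      abtest_variation: 'white'
    } );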

The advantage of this is that if you’re running back-to-back tests, you’ll be able to set up your funnel to ensure you’re only looking at the results of a specific test, without worrying that identically-named variations from earlier tests are impacting the results (which would happen if you started a test the same day a previous test ended). The funnel would look like this:

  1. Visited Homepage where abtest_name = homepage test 3
  2. Signed Up

Then segment on abtest_variation like before to see just the results of this A/B test.

Generalize the event name

In the examples above, we’re passing the A/B test details as properties on the Visited Homepage event. If we’re running multiple tests on the site, we’d have to pass the A/B test properties on all of the relevant events.

A better way is to fire a generic A/B test event with those properties instead:
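For example, something like this (using the event and property names from this post, with values hardcoded for illustration):

    mixpanel.track( 'Assigned Variation', {
      abtest_name: 'homepage test 3',
      abtest_variation: 'white'
    } );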

Now the funnel would look like this:

  1. Assigned Variation where abtest_name = homepage test 3
  2. Signed Up

Then segment on abtest_variation again.

To see this in action, check out Calypso’s A/B test module (more on that module in this post). When a user is assigned an A/B test variation, we fire a calypso_abtest_start event with the name and variation:
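Roughly along these lines (a sketch; the exact call in Calypso may differ):

    // name and variation are the test's name and the user's assigned variation
    analytics.tracks.recordEvent( 'calypso_abtest_start', {
      abtest_name: name,
      abtest_variation: variation
    } );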

We can then analyze the test’s impact on other events using Tracks, our internal analytics platform.

Benefits

The nice thing about using an analytics tool to analyze an A/B test is that you can measure the test’s impact on any event, even after the test has finished. For example, at first you might decide you want to measure the test’s impact on sign ups, but later decide you also want to measure the test’s impact on users visiting your support page. Doing that is as easy as setting up a new funnel. You can even measure your test’s impact on multiple steps of your funnel because that’s just another funnel.

Also, you don’t have to litter your code with lots of conversion events specific to your A/B test (like how A/Bingo does it) because you’ll probably already have analytics events set up for the core parts of your funnel.

Lastly, if your analytics provider offers an API like Mixpanel does, you can pull the results of your A/B tests into an internal report where you can also add significance results and other details about the test.

If you have any questions about any of this, don’t hesitate to drop me a note.

Anyone interested in collaborating on an A/B test simulator?

I recently started working on a small side project to build an A/B test simulator. My goal is to measure the long term conversion rate impact of different approaches such as:

  • What impact do various significance levels (90%, 95%, 99%) have?
  • Is it better to run more, shorter tests with less statistical significance or fewer, longer tests with more statistical significance?
  • What’s the best strategy if your site does not receive a lot of traffic?

I have a preliminary script, but there’s still a lot of work to be done and I’d love to collaborate with somebody on it going forward. I think the results will run contrary to a lot of the current A/B test best practices and they could have a big impact on how people run tests in the future.

If you have experience running A/B tests, a stats background, or familiarity with Ruby, those skills will come in handy but they’re not required.

Drop me a note if you’re interested in working together on it.

How would an evolving website work?

As a web developer who is also interested in A/B testing and evolution, it occurred to me that it would be fascinating to try to build a website that optimizes itself. I’ve been kicking around this idea for a while and wanted to share a few thoughts on it here to get feedback from anyone else who finds the idea promising.

Here’s the idea:

In traditional A/B testing, you specify multiple variations of some aspect of your site, measure which variation performs the best, then make the winner the default, and repeat. The types of things you test can range from simple headline and button color tests to complex tests that affect the user’s entire experience with your site.

In all of these cases though you have to figure out what variations you want to test. If you’re testing the headline, you need to specify which headlines you want to test. If you’re testing button color, you need to specify which colors to test, etc.

In the natural world, we see stunningly complex and optimized life forms that evolved little by little over billions of years. Evolution is similar to A/B testing in a sense, except that in evolution the variations happen by accident through genetic copying errors. Most of those mutations decrease the organism’s odds of reproducing, but occasionally they confer a benefit that gives the organism a better chance of surviving and reproducing. When that happens, the mutation gets passed on and may eventually spread to the rest of the species. That “simple” process is what has led to all of the variety of life on earth.

Optimizing a web page through evolution poses many issues though.

How do you algorithmically mutate something on the page? Could you write a function that generates variations of a headline? Maybe. Would those variations be any good? Would you trust it enough to deploy into production?

I bet by analyzing tens of thousands of webpages, you could algorithmically identify common headline wording structures. Then maybe you could write code that intelligently customizes those headlines to your service.

You might be able to do the same with design. If you tried to categorize hundreds of homepage layouts, I expect you’d probably wind up with 20-30 common layouts that cover 90%+ of websites. Could you then write an algorithm that automatically tweaks your site’s layout to test these different layouts on your visitors? Could you do the same with color schemes? Maybe.

There’s also the problem of statistical significance. Running a simple two variation A/B test can take a long time if you don’t get a lot of traffic. Trying to get significant results for lots of algorithmically generated headlines might take forever. But maybe there are ways around that like the multi-armed bandit algorithm.

To sum it up, I think you’d need the following: a way to intelligently generate mutations on headlines, buttons, layout, etc + a ton of traffic or some novel method for picking the best variations + an organization that is willing to try such an experiment. It would not be easy, but I think it’s possible.

Imagine if it did work. Maybe your homepage converts users into sign ups at 10% now. You flip a switch, and in 6 months it increases your conversion rate by 30% without any intervention on your part.

It would be revolutionary… if it worked.

How Calypso’s A/B Test Module Works

Automattic recently open sourced Calypso, a JavaScript and REST-API powered interface that runs WordPress.com. I was fortunate to get to work on a few pieces of it, mainly its Analytics and A/B Test modules. In this post I’ll walk through how the A/B test module works because it might give you a few things to consider if you find yourself rolling your own A/B testing solution like we’ve done at Automattic.

Bucketing and Reporting

When it comes to A/B testing, there are two tools you need: one that buckets users and one that reports on the results of your tests. The bucketing tool lets you say “Show 50% of users a green button, show the other 50% the red button”. The analysis tool then lets you measure the impact of the green vs red button on other actions like signing up for an account, publishing a post, upgrading, etc.

The Calypso A/B Test module that I’ll be discussing in this post is our bucketing tool. We also have a separate internal tool for analyzing the results of the A/B tests, but that’s a topic for another post.

A/B Testing in Calypso

The A/B Test module’s README provides detailed instructions for how it works. You can also check out the module itself if you’re interested. I’ll give an overview here and elaborate on some of the decisions that went into it.

We have a file called active-tests.js that contains configuration information for all of the active tests we’re running in Calypso. For example, here’s one of the tests:
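The entry looks roughly like this (a sketch reconstructed from the description below; the exact property names and date format in active-tests.js may differ):

    module.exports = {
      businessPluginsNudge: {
        datestamp: '20151119',
        variations: {
          drake: 50,
          nudge: 50
        },
        defaultVariation: 'drake'
      }
    };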

What this says is that we have a test called businessPluginsNudge that started on November 19th and has two variations, drake and nudge, each of which is shown 50% of the time. Also, users who are ineligible to participate in the test should be shown the drake variation (more on what ineligible means below).

To assign a user to a test, there’s a function the A/B test module exports called abtest. It’s used like so:
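For example (a sketch; the import path and the branching are just for illustration):

    import { abtest } from 'lib/abtest';

    const variation = abtest( 'businessPluginsNudge' );

    if ( variation === 'nudge' ) {
      // render the nudge treatment
    } else {
      // render the drake treatment
    }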

The abtest function assigns the user to a variation and returns that variation. For this particular test, if the user is eligible then 50% of the time the function will return drake and the other 50% of the time it will return nudge. We can then use the variation to determine what the user sees.

The abtest function also sends the test name and the user’s variation back to us via an API endpoint so that we can record it and later use it to measure the impact on other events using our internal reporting tool.

Eligibility

Consider an A/B test that tests the wording of a button. If the new wording isn’t properly translated but a large percentage of the users don’t speak English, it can throw off the results of the test. For example, if the new wording underperforms, was it because the new wording was truly inferior or was it because a lot of non-English users saw the English wording and simply couldn’t read it?

To account for that and similar issues, we have this idea of eligibility. In certain situations we don’t want users to count towards the test. We need to show them something, of course, but we don’t want to track it. That’s what the defaultVariation property is for in the test configuration. Ineligible users are shown that variation, but we don’t send the information about the test and the user’s variation back to our servers. By default in the A/B test module, only English-language users are eligible for tests, so non-English users will always be shown the variation specified by defaultVariation.

We also only want to include users that have local storage enabled because that’s where we save the user’s variation. We save the variation locally because we always want to show the user the same variation that they originally saw. Saving it to local storage keeps things fast. We could fetch it from the server, but we don’t want to slow down the UI while we wait for the response so we read it from local storage instead. A side effect of this is that we don’t handle situations where users change browsers, switch devices, or clear their local storage. That’s only a fraction of users though so it doesn’t impact the results very much.

One last point on eligibility: imagine you’re testing the wording on a particular button. You run one test where you show “Upgrade Now” and another “Upgrade Today” (this is a silly test, but just to give you an idea). Let’s say “Upgrade Today” wins and you make that the default. Then you run another test comparing “Upgrade Today” to “Turbocharge Your Site”. If a user participated in the original test and saw the “Upgrade Now” variation, it could impact their behavior on this new test. To account for that, if a user has participated in a previous test with the same name as a new test, then he or she won’t be eligible for the new test. We only want to include users who are participating in the test for the first time because it will result in numbers that better represent the impact of each variation.

Multiple active tests

In A/B testing parlance, there’s a concept known as a multivariate test. The idea is that if you have a test running on one page (green button, red button) and another test running on another page (“Upgrade Now”, “Upgrade Today”), the combination of the variations from those tests might be important. For example, what if green button + “Upgrade Today” leads to a higher conversion rate than the other combinations? There is that possibility, but we generally don’t worry about that to keep the analysis simpler.

Dealing with A/B tests that span multiple pages

There’s one final situation I want to note:

Imagine you have two pricing pages, one for your Silver Plan and one for your Gold Plan. On the Silver Plan‘s pricing page, you assign the user a variation and use that to adjust the page:
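For example (the discounted variation name here is hypothetical):

    const variation = abtest( 'silverPlan' );

    if ( variation === 'discounted' ) {
      // show the discounted Silver Plan pricing
    }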

So far so good. Now imagine that you want to adjust the payment form on a different page if the user saw a particular variation for the Silver Plan.

If you call abtest( 'silverPlan' ) to grab the variation on the payment page, it will also assign the user to a variation for that test. Many of the users viewing the payment page, though, will be purchasing the Gold Plan and will never have seen the Silver Plan’s pricing page. Assigning those users to a variation would distort the results of the test. To account for that, the A/B test module also exports a getABTestVariation function that just returns a user’s variation without assigning him or her to one:
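For example (again with a hypothetical variation name):

    // Reads the user's existing variation for the test without bucketing them into it
    if ( getABTestVariation( 'silverPlan' ) === 'discounted' ) {
      // adjust the payment form for users who saw the discounted Silver Plan page
    }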

This doesn’t come up in simple tests, but for complex tests that affect multiple parts of the user’s experience, it’s essential to be able to determine if a user is part of a variation without assigning him or her to one.

Wrapping Up

As you can see, there are a lot of subtle issues that can impact the results of your tests. Hopefully this gives you an idea of a few of the things to watch out for if you do roll your own A/B testing tools.

If you have any questions, suggestions on how to improve it, or just want to chat about A/B testing tools, don’t hesitate to drop me a note.

Visualizing the Sampling Distribution of a Proportion with R

In yesterday’s post, we showed that a binomial distribution can be approximated by a normal distribution and walked through some of the math behind it.

Today we’ll take it a step further, showing how those results can help us understand the distribution of a sample proportion.

Consider the following example:

Out of the last 250 visitors to your website, 40 signed up for an account.

The conversion rate for that group is 16%, but it’s possible (and likely!) that the true conversion rate differs from this. Statistics can help us determine a range (a confidence interval) for what the true conversion rate actually is.

Recall that in the last post we said that the binomial distribution can be approximated by a normal distribution with a mean and standard deviation given by:

\mu = np

\sigma = \sqrt{npq}

For a proportion, we want to figure out the mean and standard deviation on a per-trial basis so we divide each formula by n, the number of trials:

\mu = \frac{np}{n} = p

\sigma = \frac{\sqrt{npq}}{n} = \sqrt{\frac{npq}{n^2}} = \sqrt{\frac{pq}{n}}

With the mean and standard deviation of the sample proportion in hand, we can plot the distribution for this example:
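Here’s a minimal sketch of what that plotting code could look like (the original script may differ):

    # 40 conversions out of 250 visitors
    n <- 250
    p <- 40 / n                    # sample proportion: 0.16
    se <- sqrt(p * (1 - p) / n)    # standard deviation of the sample proportion

    # Plot the normal approximation to the sampling distribution
    x <- seq(p - 4 * se, p + 4 * se, length.out = 200)
    plot(x, dnorm(x, mean = p, sd = se), type = "l", col = "blue", lwd = 2,
         xlab = "Conversion rate", ylab = "Density",
         main = "Sampling distribution of the sample proportion")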

[Plot: normal approximation to the sampling distribution of the 16% sample proportion]

As you can see, the most likely conversion rate is 16% (which is no surprise), but the true conversion rate can fall anywhere under that curve with it being less and less likely as you move farther away.

Where it gets really interesting is when you want to compare multiple proportions.

Let’s say we’re running an A/B test and the original had 40 conversions out of 250 like the example above and the experimental version had 60 conversions out of 270 participants. We can plot the distribution of both sample proportions with this code:
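A sketch of that comparison (the original code likely differs in the details):

    # Overlay the approximate sampling distributions of both variations
    add_curve <- function(conversions, n, color) {
      p <- conversions / n
      se <- sqrt(p * (1 - p) / n)
      x <- seq(0.05, 0.35, length.out = 300)
      lines(x, dnorm(x, mean = p, sd = se), col = color, lwd = 2)
    }

    plot(NULL, xlim = c(0.05, 0.35), ylim = c(0, 20),
         xlab = "Conversion rate", ylab = "Density",
         main = "Original (blue) vs experiment (red)")
    add_curve(40, 250, "blue")   # original: 40 / 250 = 16%
    add_curve(60, 270, "red")    # experiment: 60 / 270 = ~22%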

Here’s the result:

[Plot: the sampling distributions of both sample proportions, original vs experimental version]

What can we determine from this? Is the experimental version better than the original? What if the true proportion for the original is 20% (towards the upper end of its distribution) and the true proportion for the experimental version is 16% (towards the lower end of its distribution)?

We’ll save the answer for a future post :)

Visualizing How a Normal Distribution Approximates a Binomial Distribution with R

My handy Elementary Statistics textbook, which I’m using to get smart on the math behind A/B testing, states the following:

Normal Distribution as Approximation to Binomial Distribution

If np \geq 5 and nq \geq 5, then the binomial random variable has a probability distribution that can be approximated by a normal distribution with the mean and standard deviation given as:

\mu = np

\sigma = \sqrt{npq}

In easier-to-understand terms, take the following example:

Each visitor to your website has a 30% chance of signing up for an account. Over the next 250 visitors, how many can you expect to sign up for an account?

The first formula lets us figure out the mean by simply multiplying the number of visitors by the probability of a successful conversion:

\mu = np = 250 * 0.30 = 75

Simple enough and fairly easy to understand.

The second formula, the one to figure out the standard deviation, is less intuitive:

\sigma = \sqrt{npq} = \sqrt{250 * 0.30 * (1 - 0.30)} = 7.25

Why are we taking the square root of the product of these three values? The textbook doesn’t explain, noting that “the formal justification that allows us to use the normal distribution as an approximation to the binomial distribution results from more advanced mathematics”.

Because this standard deviation formula plays a big role in calculating the confidence intervals for sample proportions, I decided to simulate the scenario above to prove to myself that the standard deviation formula is accurate.

The R script below simulates 250 visitors coming to a website, each with a 30% chance of signing up. After each group of 250 visitors we track how many of them wound up converting. After all of the runs (the default is 1,000, though the higher the number the more accurate the distribution will be) we plot the probability distribution of the results in blue and a curve representing what we’d expect the distribution to look like if the standard deviation formula above is correct in red.
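Here’s a sketch of that simulation (the original script may differ, but the idea is the same):

    runs <- 1000   # number of groups of 250 visitors to simulate
    n <- 250       # visitors per group
    p <- 0.30      # each visitor's chance of signing up

    # For each run, count how many of the 250 visitors converted
    conversions <- replicate(runs, sum(runif(n) < p))

    # Observed distribution (blue) vs the normal approximation
    # with mean np and standard deviation sqrt(npq) (red)
    hist(conversions, breaks = 30, probability = TRUE, col = "blue",
         xlab = "Conversions per 250 visitors",
         main = "Simulated conversions vs normal approximation")
    curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
          col = "red", lwd = 2, add = TRUE)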

The distribution of results from this experiment paints a telling picture:

[Plot: simulated distribution of conversions (blue) with the expected normal curve overlaid (red)]

Not only is the mean what we expect (around 75), but the standard deviation observed in this experiment (7.25) matches what the formula predicted (7.25). Go figure :)

As we’ll see, we can use the fact that the normal distribution approximates the binomial distribution to figure out the distribution of a sample proportion, which we can then compare to other sample proportion distributions to draw conclusions about whether they differ and by how much (i.e., how to analyze the results of an A/B test).