How Calypso’s A/B Test Module Works

Automattic recently open sourced Calypso, a JavaScript and REST-API powered interface that runs I was fortunate to get to work on a few pieces of it, mainly its Analytics and A/B Test modules. In this post I’ll walk through how the A/B test module works because it might give you a few things to consider if you find yourself rolling your own A/B testing solution like we’ve done at Automattic.

Bucketing and Reporting

When it comes to A/B testing, there are two tools you need: one that buckets users and one that reports on the results of your tests. The bucketing tool lets you say “Show 50% of users a green button, show the other 50% the red button”. The analysis tool then lets you measure the impact of the green vs red button on other actions like signing up for an account, publishing a post, upgrading, etc.

The Calypso A/B Test module that I’ll be discussing in this post is our bucketing tool. We also have a separate internal tool for analyzing the results of the A/B tests, but that’s a topic for another post.

A/B Testing in Calypso

The A/B Test module’s README provides detailed instructions for how it works. You can also check out the module itself if you’re interested. I’ll give an overview here and elaborate on some of the decisions that went into it.

We have a file called active-tests.js that contains configuration information for all of the active tests we’re running in Calypso. For example, here’s one of the tests:

What this says is that we have a test called businessPluginsNudge that started on November 19th that has two variations: drake and nudge each of which are shown 50% of the time. Also, users that are inelligble to participate in the test should be shown the drake variation (more on what inelligible means below).

To assign a user to a test, there’s a function the A/B test module exports called abtest. It’s used like so:

The abtest function assigns the user to a variation and returns that variation. For this particular test, if the user is eligible then 50% of the time the function will return drake and the other 50% of the time it will return nudge. We can then use the variation to determine what the user sees.

The abtest function also sends the test name and the user’s variation back to us via an API endpoint so that we can record it and later use it to measure the impact on other events using our internal reporting tool.


Consider an A/B test that tests the wording of a button. If the new wording isn’t properly translated but a large percentage of the users don’t speak English, it can throw off the results of the test. For example, if the new wording underperforms, was it because the new wording was truly inferior or was it because a lot of non-English users saw the English wording and simply couldn’t read it?

To account for that and similar issues, we have this idea of eligibility. In certain situations we don’t want users to count towards the test. We need to show them something of course, but we don’t want to track it. That’s what the defaultVariation property is for in the test configuration. Inelligible users are shown that variation, but we don’t send the information about the test and the user’s variation back to our servers. By default in the A/B test module, only English language users are eligible for the tests so they will always be shown the variation specified by defaultVariation.

We also only want to include users that have local storage enabled because that’s where we save the user’s variation. We save the variation locally because we always want to show the user the same variation that they originally saw. Saving it to local storage keeps things fast. We could fetch it from the server, but we don’t want to slow down the UI while we wait for the response so we read it from local storage instead. A side effect of this is that we don’t handle situations where users change browsers, switch devices, or clear their local storage. That’s only a fraction of users though so it doesn’t impact the results very much.

One last point on eligibility: imagine you’re testing the wording on a particular button. You run one test where you show “Upgrade Now” and another “Upgrade Today” (this is a silly test, but just to give you an idea). Lets say “Upgrade Today” wins and you make that the default. Then you run another test comparing “Upgrade Today” to “Turbocharge Your Site”. If a user participated in the original test and saw the “Upgrade Now” variation, it could impact their behavior on this new test. To account for that, if a user has participated in a previous test with the same name as a new test, then he or she won’t be eligible for the new test. We only want to include users who are participating in the test for the first time because it will result numbers that better represent the impact of each variation.

Multiple active tests

In A/B testing parlance, there’s a concept known as multivariate tests. The idea is that if you have a test running on one page (green button, red button) and another test running on another page (“Upgrade Now”, “Upgrade Today”), the combination of the variations from those tests might be important. For example, what if green button + “Upgrade Today” leads to a higher conversion rate than the other combinations? There is that possibility, but we generally don’t worry about that to keep the analysis simpler.

Dealing with A/B tests that span multiple pages

There’s one final situation I want to note:

Imagine you have two pricing pages, one for your Silver Plan and one for your Gold Plan. On the Silver Plan‘s pricing page, you assign the user a variation and use that to adjust the page:

So far so good. Now imagine that you want to adjust the payment form on a different page if the user saw a particular variation for the Silver Plan.

If you call abtest( 'silverPlan' ) to grab the variation on the payment page, it will also assign the user to a variation for that test. Many of the users viewing the payment page though will be purchasing the Gold Plan and never have even seen the plan page for your Silver Plan. Assigning those users to a variation will distort the results of the test. To account for that, the A/B test module also exports a getABTestVariation function that just returns a user’s variation without assigning him to one:

This doesn’t come up in simple tests, but for complex tests that affect multiple parts of the user’s experience, it’s essential to be able to determine if a user is part of a variation without assigning him or her to one.

Wrapping Up

As you can see, there are a lot of subtle issues that can impact the results of your tests. Hopefully this gives you an idea of a few of the things to watch out for if you do roll your own A/B testing tools.

If you have any questions, suggestions on how to improve it, or just want to chat about A/B testing tools, don’t hesitate to drop me a note.

One thought on “How Calypso’s A/B Test Module Works

  1. Analyzing an A/B Test’s Impact Using Funnel Segmentation – Matt Mazur

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s