Removing Gaps from Stacked Area Charts in R

April 19, 2017Mazur 1 Comment

Creating a stacked area chart in R is fairly painless, unless your data has gaps. For example, consider the following CSV data showing the number of plan signups per week:

	+————+———-+———+
	\| week \| plan \| signups \|
	+————+———-+———+
	\| 2017-01-26 \| Bronze \| 10 \|
	\| 2017-01-26 \| Gold \| 55 \|
	\| 2017-01-26 \| Standard \| 108 \|
	\| 2017-02-05 \| Bronze \| 6 \|
	\| 2017-02-05 \| Iron \| 1 \|
	\| 2017-02-05 \| Gold \| 37 \|
	\| 2017-02-05 \| Standard \| 142 \|
	\| 2017-02-12 \| Bronze \| 17 \|
	\| 2017-02-12 \| Iron \| 2 \|
	\| 2017-02-12 \| Gold \| 42 \|
	\| 2017-02-12 \| Standard \| 119 \|
	\| 2017-02-19 \| Bronze \| 11 \|
	\| 2017-02-19 \| Gold \| 26 \|
	\| 2017-02-19 \| Silver \| 4 \|
	\| 2017-02-19 \| Platinum \| 1 \|
	\| 2017-02-19 \| Standard \| 70 \|
	\| 2017-02-26 \| Bronze \| 13 \|
	\| 2017-02-26 \| Silver \| 5 \|
	\| 2017-02-26 \| Standard \| 52 \|
	+————+———-+———+

view raw

signups-by-week.txt

hosted with ❤ by GitHub

Plotting this highlights the problem:

	library(ggplot2)

	data <- read.csv("dummy-data.csv", sep = "\t")

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	geom_area()

	print(g)

view raw

stacked-area-chart-gaps.R

hosted with ❤ by GitHub

The reason the gaps exist is that not all plans have data points every week. Consider Gold, for example: during the first four weeks there are 55, 37, 42, and 26 signups, but during the last week there isn’t a data point at all. That’s why the chart shows the gap: it’s not that the data indicates Gold went to zero signups the final week; it indicates no data at all.

To remedy this, we need to ensure that every week contains a data point for every plan. That means for weeks where there isn’t a data point for a plan, we need to fill it in with 0 so that R knows that the signups are in fact 0 for that week.

I asked Charles Bordet, an R expert who I hired through Upwork to help me level up my R skills, how he would go about filling in the data.

He provided two solutions:

1. Using expand.grid and full_join

	data <- read.csv("data.csv", sep = "\t")

	weeks <- unique(data$week)
	plans <- unique(data$plan)
	combinations <- expand.grid(week = weeks, plan = plans)

	data <- full_join(data, combinations, by = c("week" = "week", "plan" = "plan")) %>%
	mutate(signups = ifelse(is.na(signups), 0, signups)) %>%
	arrange(week, plan)

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	geom_area(position = "stack")

	print(g)

view raw

stacke-area-chart-without-gaps-v1.R

hosted with ❤ by GitHub

Here’s how it works:

expand.grid creates “a data frame from all combinations of the supplied vectors or factors”. By passing it in the weeks and plans, it generates the following data frame called combinations:

	week plan
	1 2017-01-26 Bronze
	2 2017-02-05 Bronze
	3 2017-02-12 Bronze
	4 2017-02-19 Bronze
	5 2017-02-26 Bronze
	6 2017-01-26 Gold
	7 2017-02-05 Gold
	8 2017-02-12 Gold
	9 2017-02-19 Gold
	10 2017-02-26 Gold
	11 2017-01-26 Standard
	12 2017-02-05 Standard
	13 2017-02-12 Standard
	14 2017-02-19 Standard
	15 2017-02-26 Standard
	16 2017-01-26 Iron
	17 2017-02-05 Iron
	18 2017-02-12 Iron
	19 2017-02-19 Iron
	20 2017-02-26 Iron
	21 2017-01-26 Silver
	22 2017-02-05 Silver
	23 2017-02-12 Silver
	24 2017-02-19 Silver
	25 2017-02-26 Silver
	26 2017-01-26 Platinum
	27 2017-02-05 Platinum
	28 2017-02-12 Platinum
	29 2017-02-19 Platinum
	30 2017-02-26 Platinum

view raw

expand-grid-weeks-plans.R

hosted with ❤ by GitHub

The full_join then takes all of the rows from data and combines them with combinations based on week and plan. When there aren’t any matches (which will happen when a week doesn’t have a value for a plan), signups gets set to NA:

	week plan signups
	1 2017-01-26 Bronze 10
	2 2017-01-26 Gold 55
	3 2017-01-26 Standard 108
	4 2017-02-05 Bronze 6
	5 2017-02-05 Iron 1
	6 2017-02-05 Gold 37
	7 2017-02-05 Standard 142
	8 2017-02-12 Bronze 17
	9 2017-02-12 Iron 2
	10 2017-02-12 Gold 42
	11 2017-02-12 Standard 119
	12 2017-02-19 Bronze 11
	13 2017-02-19 Gold 26
	14 2017-02-19 Silver 4
	15 2017-02-19 Platinum 1
	16 2017-02-19 Standard 70
	17 2017-02-26 Bronze 13
	18 2017-02-26 Silver 5
	19 2017-02-26 Standard 52
	20 2017-02-26 Gold NA
	21 2017-01-26 Iron NA
	22 2017-02-19 Iron NA
	23 2017-02-26 Iron NA
	24 2017-01-26 Silver NA
	25 2017-02-05 Silver NA
	26 2017-02-12 Silver NA
	27 2017-01-26 Platinum NA
	28 2017-02-05 Platinum NA
	29 2017-02-12 Platinum NA
	30 2017-02-26 Platinum NA

view raw

full-join-example.txt

hosted with ❤ by GitHub

Then we just use dplyr’s mutate to replace all of the NA values with zero, and voila:

	week plan signups
	1 2017-01-26 Bronze 10
	2 2017-01-26 Gold 55
	3 2017-01-26 Iron 0
	4 2017-01-26 Platinum 0
	5 2017-01-26 Silver 0
	6 2017-01-26 Standard 108
	7 2017-02-05 Bronze 6
	8 2017-02-05 Gold 37
	9 2017-02-05 Iron 1
	10 2017-02-05 Platinum 0
	11 2017-02-05 Silver 0
	12 2017-02-05 Standard 142
	13 2017-02-12 Bronze 17
	14 2017-02-12 Gold 42
	15 2017-02-12 Iron 2
	16 2017-02-12 Platinum 0
	17 2017-02-12 Silver 0
	18 2017-02-12 Standard 119
	19 2017-02-19 Bronze 11
	20 2017-02-19 Gold 26
	21 2017-02-19 Iron 0
	22 2017-02-19 Platinum 1
	23 2017-02-19 Silver 4
	24 2017-02-19 Standard 70
	25 2017-02-26 Bronze 13
	26 2017-02-26 Gold 0
	27 2017-02-26 Iron 0
	28 2017-02-26 Platinum 0
	29 2017-02-26 Silver 5
	30 2017-02-26 Standard 52

view raw

stacked-area-chart-data-without-gaps.txt

hosted with ❤ by GitHub

2. Using spread and gather

The second method Charles provided uses the tidyr package’s spread and gather functions:

	data <- read.csv("data.csv", sep = "\t")

	data <- data %>%
	tidyr::spread(key = plan, value = signups, fill = 0) %>%
	tidyr::gather(key = plan, value = signups, – week) %>%
	arrange(week, plan)

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	geom_area(position = "stack")

	print(g)

view raw

stacked-area-chart-tidyr-solution.R

hosted with ❤ by GitHub

The spread function takes the key-value pairs (week and plan in this case) and spreads it across multiple columns, making the “long” data “wider”, and filling in the missing values with 0:

	week Bronze Gold Iron Platinum Silver Standard
	1 2017-01-26 10 55 0 0 0 108
	2 2017-02-05 6 37 1 0 0 142
	3 2017-02-12 17 42 2 0 0 119
	4 2017-02-19 11 26 0 1 4 70
	5 2017-02-26 13 0 0 0 5 52

view raw

spread-data.txt

hosted with ❤ by GitHub

Then we take the wide data and convert it back to long data using gather The - week means to exclude the week column when gathering the data that spread produced:

	week plan signups
	1 2017-01-26 Bronze 10
	2 2017-01-26 Gold 55
	3 2017-01-26 Iron 0
	4 2017-01-26 Platinum 0
	5 2017-01-26 Silver 0
	6 2017-01-26 Standard 108
	7 2017-02-05 Bronze 6
	8 2017-02-05 Gold 37
	9 2017-02-05 Iron 1
	10 2017-02-05 Platinum 0
	11 2017-02-05 Silver 0
	12 2017-02-05 Standard 142
	13 2017-02-12 Bronze 17
	14 2017-02-12 Gold 42
	15 2017-02-12 Iron 2
	16 2017-02-12 Platinum 0
	17 2017-02-12 Silver 0
	18 2017-02-12 Standard 119
	19 2017-02-19 Bronze 11
	20 2017-02-19 Gold 26
	21 2017-02-19 Iron 0
	22 2017-02-19 Platinum 1
	23 2017-02-19 Silver 4
	24 2017-02-19 Standard 70
	25 2017-02-26 Bronze 13
	26 2017-02-26 Gold 0
	27 2017-02-26 Iron 0
	28 2017-02-26 Platinum 0
	29 2017-02-26 Silver 5
	30 2017-02-26 Standard 52

view raw

spread-and-gather.txt

hosted with ❤ by GitHub

Using either methods, we get a stacked area chart without the gaps ⚡️:

An impractical guide to doubling your conversion rates

July 19, 2016Mazur 4 Comments

Let’s imagine a standard web app with three main steps:

Viewed Homepage
Signed Up
Purchased

Furthermore, lets say 20% of the visitor to the hompage sign up, and 5% of the users that sign up complete a purchase giving you a 1% overall conversion rate.

Your boss comes to you and says “You need to double the overall conversion rate from 1% to 2%. Do whatever it takes.”

As a thought experiment, consider at a high level what the numbers have to be for this to work out. Your first thought might be to double the homepage conversion rate from 20% to 40% (giving you a 40% * 5% = 2% overall conversion rate) or double the purchase conversion rate (20% * 10% = 2% overall conversion rate). You could also increase both by some smaller amount to get a similar result: 30% * 6.67% = 2%.

In the real world, this winds up being really hard. If you increase the percentage of people signing up, it’s probably going to decrease the percentage of people who then purchase. Why? Consider what would happen to a site that has a pricing section on its homepage and then removes it completely. More people will sign up, but once they do and see the pricing, many of those extra people you got to sign up (because they thought your service was free) will leave (because they realized after signing up that your service actually isn’t free). So if you increase the sign up rate to 40% somehow, your sign up to purchase conversion rate might drop in half from 5% to say 2.5% giving you that same 40% * 2.5% = 1% overall conversion rate. If you’re lucky some of those extra users will convert and maybe you’ll get 40% * like 2.8% = 1.12% overall conversion rate. Getting closer to that 2%, but still a long way away.

The trick to getting to 2% is to improve the quality of the traffic at each step.

Consider that homepage conversion rate of 20%. How could you increase that without making any changes to your website?

That 20% conversion rate is actually a composite of different segments. For example, imagine your website has three traffic sources:

Direct traffic (50% of your traffic) converts at 30%
Search traffic (40% of your traffic) converts at 10%
Social traffic (10% of your traffic) converts at 10%

50% * 30% + 40% * 10% + 10% * 10% = 20% homepage conversion rate.

What would happen to your conversion rate if you delisted your site from search engines completely? The numbers now become:

Direct traffic (83.3% of your traffic) converts at 30%
Social traffic (16.7% of your traffic) converts at 10%

83.3% * 30% + 16.7% * 10% = 26.7% homepage conversion rate.

Without making any changes to your site itself, you increased the homepage conversion rate by 26.7%/20% – 1 = +34%. Because the quality of your traffic has improved, the sign up to purchase conversion rate will likely increase as well. Instead of 5% upgrading, it might wind up being 8% (+60%). What’s your overall conversion rate now? 26.7% * 8% = 2.15%!

So in order to double your conversion rate in this made up example all you had to do was delist your site from search engines. Mission accomplished 🍻. As an added bonus many of your other metrics will increase as well. Because your site’s traffic is of a higher quality, your retention rates will go up, your churn will go down, your average revenue per user will go up, your refunds will go down, and more.

You could also do similar hacks where you make your website non-mobile-friendly so Google decreases its mobile rankings, which (assuming mobile converts more poorly than desktop users) would increase your conversion rates. You could block certain poorly converting browsers from visiting your site (I’m looking at you, IE). You could block users from poorly converting countries, or even non-English language users if your site isn’t translated.

Of course you should never do any of these things.

In the process of improving your metrics by hacking off a large portion of your traffic, you’ll also wind up decreasing the number of people who make it all the way through the funnel (purchasing in this case).

It’s tempting to want to focus on a single metric like conversion rate, but it’s also important to remember that individual metrics can almost always be artificially boosted:

You can increase your conversion rates by preventing low converting traffic from reaching your site (at the expense of revenue)
You can increase your revenue by increasing paid advertising (at the expense of profit)
You can increase your profit by laying off a bunch of employees (possibly at the expense of your long term growth and profitability).

Instead, try to identify the set of metrics that are most important to your company and pay attention to the group as a whole. More often than not when one metric goes up, another will go down but with solid execution and a little luck, the overall impact of your changes will be a net win for your website or company.

Promoting best sellers in Shopify

June 16, 2016Mazur Leave a comment

My friend Tom Davies just launched a new Shopify app called Best Seller Insights that enables shop owners to effortlessly promote and track trends for their best-selling products

He’s running a promotion as part of the launch that offers new customers 20% off all the plans through June 25th. If you run a Shopify store, be sure to check it out.

My @shopify app, Best Seller Insights, is launched! All plans 20% off through June 25 https://t.co/Xd1MjdaxHK pic.twitter.com/JSoTDeaQle

— Burst Commerce (@burstcommerce) June 14, 2016

Hacker Bits Deployment Story

March 31, 2016Mazur Leave a comment

Hacker Bits, an awesome service that curates the best HackerNews stories each month into a stylish easy-to-read magazine, included my deployment story in their April 2016 edition (!). You can check it out here:

	+————+———-+———+
	\| week \| plan \| signups \|
	+————+———-+———+
	\| 2017-01-26 \| Bronze \| 10 \|
	\| 2017-01-26 \| Gold \| 55 \|
	\| 2017-01-26 \| Standard \| 108 \|
	\| 2017-02-05 \| Bronze \| 6 \|
	\| 2017-02-05 \| Iron \| 1 \|
	\| 2017-02-05 \| Gold \| 37 \|
	\| 2017-02-05 \| Standard \| 142 \|
	\| 2017-02-12 \| Bronze \| 17 \|
	\| 2017-02-12 \| Iron \| 2 \|
	\| 2017-02-12 \| Gold \| 42 \|
	\| 2017-02-12 \| Standard \| 119 \|
	\| 2017-02-19 \| Bronze \| 11 \|
	\| 2017-02-19 \| Gold \| 26 \|
	\| 2017-02-19 \| Silver \| 4 \|
	\| 2017-02-19 \| Platinum \| 1 \|
	\| 2017-02-19 \| Standard \| 70 \|
	\| 2017-02-26 \| Bronze \| 13 \|
	\| 2017-02-26 \| Silver \| 5 \|
	\| 2017-02-26 \| Standard \| 52 \|
	+————+———-+———+

Matt Mazur

Category Uncategorized