April | 2017 | Matt Mazur

A Simple CROSS JOIN Example

99% of the queries I write to join tables wind up using JOIN (aka INNER JOIN) or LEFT JOIN, so whenever there’s an opportunity to use one of the other types, I get pretty excited 🙂. Today, that wound up being a CROSS JOIN.

Consider the following table containing charges:

	DROP TABLE IF EXISTS charges;

	CREATE TABLE charges (id INT PRIMARY KEY AUTO_INCREMENT, amount DECIMAL(10,2));
	INSERT INTO charges (amount) VALUES (18);
	INSERT INTO charges (amount) VALUES (15);
	INSERT INTO charges (amount) VALUES (27);

	+----+--------+
	| id | amount |
	+----+--------+
	|  1 |  18.00 |
	|  2 |  15.00 |
	|  3 |  27.00 |
	+----+--------+

How would you add a column showing how much each charge represents as a percentage of the total charges?

Option 1: Using a subquery

One way to solve this is to use a subquery:

	SELECT *, ROUND(amount / (SELECT SUM(amount) FROM charges) * 100) AS percent
	FROM charges;

	+----+--------+---------+
	| id | amount | percent |
	+----+--------+---------+
	|  1 |  18.00 |      30 |
	|  2 |  15.00 |      25 |
	|  3 |  27.00 |      45 |
	+----+--------+---------+

For each record, we divide the amount by the sum of all the amounts to get the percentage.

Option 2: Using a variable

Similar to the solution above, except here we save the sum of the amounts in a variable and then use that variable in the query:

	SET @total := (SELECT SUM(amount) FROM charges);

	SELECT *, ROUND(amount / @total * 100) AS percent
	FROM charges;

	+----+--------+---------+
	| id | amount | percent |
	+----+--------+---------+
	|  1 |  18.00 |      30 |
	|  2 |  15.00 |      25 |
	|  3 |  27.00 |      45 |
	+----+--------+---------+

Option 3: Using CROSS JOIN

A cross join takes every row from the first table and joins it to every row in the second table; w3resource.com covers it in more detail.

In this solution, we create a result set with one value (the sum of the amounts) and then cross join the charges table on it. That will add the total to each record, which we can then divide the amount by to get the percentage:

	SELECT *, ROUND(amount / total * 100) AS percent
	FROM charges
	CROSS JOIN (SELECT SUM(amount) AS total FROM charges) t;

	+----+--------+-------+---------+
	| id | amount | total | percent |
	+----+--------+-------+---------+
	|  1 |  18.00 | 60.00 |      30 |
	|  2 |  15.00 | 60.00 |      25 |
	|  3 |  27.00 | 60.00 |      45 |
	+----+--------+-------+---------+

If we didn’t want the total column in the result, we could simply exclude it:

	SELECT id, amount, ROUND(amount / total * 100) AS percent
	FROM charges
	CROSS JOIN (SELECT SUM(amount) AS total FROM charges) t;

	+----+--------+---------+
	| id | amount | percent |
	+----+--------+---------+
	|  1 |  18.00 |      30 |
	|  2 |  15.00 |      25 |
	|  3 |  27.00 |      45 |
	+----+--------+---------+

In this case there shouldn’t be any performance gains using the CROSS JOIN vs one of the other methods, but I find it more elegant than the subquery or variable solutions.
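
If you want to try the CROSS JOIN without a MySQL server handy, here is a minimal sketch that runs it from Python against an in-memory SQLite database. The SQLite setup and the REAL column type are assumptions made to keep the example self-contained (SQLite has no fixed-point DECIMAL); the query itself is the one from this post.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table from the post.
# Assumption: amount is REAL because SQLite has no fixed-point DECIMAL type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY AUTOINCREMENT, amount REAL)")
conn.executemany("INSERT INTO charges (amount) VALUES (?)", [(18,), (15,), (27,)])

# The derived table t has exactly one row (the total), so the cross join
# returns one row per charge with the total attached to it.
query = """
    SELECT id, amount, total, ROUND(amount / total * 100) AS percent
    FROM charges
    CROSS JOIN (SELECT SUM(amount) AS total FROM charges) t
    ORDER BY id
"""
rows = conn.execute(query).fetchall()

for charge_id, amount, total, percent in rows:
    print(charge_id, amount, total, int(percent))
```

Because the derived table only ever has one row, the cross join can't multiply the number of charges; it just tacks the total onto each one.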

CROSS JOIN vs INNER JOIN

Note that in MySQL, CROSS JOIN and INNER JOIN are syntactically equivalent. Because we’re not joining on a specific column, the convention is to use CROSS JOIN. For example, this produces the same result as the last CROSS JOIN example:

	SELECT id, amount, ROUND(amount / total * 100) AS percent
	FROM charges
	INNER JOIN (SELECT SUM(amount) AS total FROM charges) t;

	+----+--------+---------+
	| id | amount | percent |
	+----+--------+---------+
	|  1 |  18.00 |      30 |
	|  2 |  15.00 |      25 |
	|  3 |  27.00 |      45 |
	+----+--------+---------+

And so does this:

	SELECT id, amount, ROUND(amount / total * 100) AS percent
	FROM charges, (SELECT SUM(amount) AS total FROM charges) t;

	+----+--------+---------+
	| id | amount | percent |
	+----+--------+---------+
	|  1 |  18.00 |      30 |
	|  2 |  15.00 |      25 |
	|  3 |  27.00 |      45 |
	+----+--------+---------+

So why use CROSS JOIN at all? Per a Stack Overflow thread:

Using CROSS JOIN vs (INNER) JOIN vs comma

The common convention is:

* Use CROSS JOIN when and only when you don’t compare columns between tables. That suggests that the lack of comparisons was intentional.
* Use (INNER) JOIN with ON when and only when you have comparisons between tables (plus possibly other comparisons).
* Don’t use comma.
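
To check that the three spellings really are interchangeable when there is no join condition, here is a rough sketch using Python and an in-memory SQLite database. That is an assumption on my part, not the MySQL setup from this post: SQLite, like MySQL, accepts INNER JOIN without an ON clause, but stricter databases may reject it.

```python
import sqlite3

# Assumption: SQLite with a REAL amount column stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO charges (amount) VALUES (?)", [(18,), (15,), (27,)])

select = "SELECT id, amount, ROUND(amount / total * 100) AS percent FROM charges"
total = "(SELECT SUM(amount) AS total FROM charges) t"

# The three spellings discussed above; none of them has an ON clause.
spellings = {
    "CROSS JOIN": f"{select} CROSS JOIN {total} ORDER BY id",
    "INNER JOIN": f"{select} INNER JOIN {total} ORDER BY id",
    "comma": f"{select}, {total} ORDER BY id",
}

results = {name: conn.execute(sql).fetchall() for name, sql in spellings.items()}
for name, rows in results.items():
    print(name, rows)
```

All three queries return the same rows, which is why the choice between them is about signalling intent rather than behavior.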

Props to this Stack Overflow question for the tip about using CROSS JOIN to solve this type of problem.

Removing Gaps from Stacked Area Charts in R

Creating a stacked area chart in R is fairly painless, unless your data has gaps. For example, consider the following CSV data showing the number of plan signups per week:

	+------------+----------+---------+
	| week       | plan     | signups |
	+------------+----------+---------+
	| 2017-01-26 | Bronze   |      10 |
	| 2017-01-26 | Gold     |      55 |
	| 2017-01-26 | Standard |     108 |
	| 2017-02-05 | Bronze   |       6 |
	| 2017-02-05 | Iron     |       1 |
	| 2017-02-05 | Gold     |      37 |
	| 2017-02-05 | Standard |     142 |
	| 2017-02-12 | Bronze   |      17 |
	| 2017-02-12 | Iron     |       2 |
	| 2017-02-12 | Gold     |      42 |
	| 2017-02-12 | Standard |     119 |
	| 2017-02-19 | Bronze   |      11 |
	| 2017-02-19 | Gold     |      26 |
	| 2017-02-19 | Silver   |       4 |
	| 2017-02-19 | Platinum |       1 |
	| 2017-02-19 | Standard |      70 |
	| 2017-02-26 | Bronze   |      13 |
	| 2017-02-26 | Silver   |       5 |
	| 2017-02-26 | Standard |      52 |
	+------------+----------+---------+

Plotting this highlights the problem:

	library(ggplot2)

	data <- read.csv("dummy-data.csv", sep = "\t")

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	  geom_area()

	print(g)

The reason the gaps exist is that not all plans have data points every week. Consider Gold, for example: during the first four weeks there are 55, 37, 42, and 26 signups, but during the last week there isn’t a data point at all. That’s why the chart shows the gap: it’s not that the data indicates Gold went to zero signups the final week; it indicates no data at all.

To remedy this, we need to ensure that every week contains a data point for every plan. That means for weeks where there isn’t a data point for a plan, we need to fill it in with 0 so that R knows that the signups are in fact 0 for that week.
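
To make the idea concrete before getting to the R solutions, here is a small sketch in plain Python (not the R code that follows). It uses three rows taken from the data above, builds every (week, plan) pair, and defaults the pairs with no data to 0. The variable names are my own.

```python
from itertools import product

# A few rows from the signups data above, as (week, plan, signups) tuples.
rows = [
    ("2017-02-19", "Gold", 26),
    ("2017-02-19", "Standard", 70),
    ("2017-02-26", "Standard", 52),
]

weeks = sorted({week for week, _, _ in rows})
plans = sorted({plan for _, plan, _ in rows})
signups = {(week, plan): count for week, plan, count in rows}

# Every (week, plan) pair gets a row; pairs with no data default to 0.
filled = [
    (week, plan, signups.get((week, plan), 0))
    for week, plan in product(weeks, plans)
]

for row in filled:
    print(row)
```

The input has no row for Gold in the week of 2017-02-26, so the filled list gains one for it with 0 signups.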

I asked Charles Bordet, an R expert who I hired through Upwork to help me level up my R skills, how he would go about filling in the data.

He provided two solutions:

1. Using expand.grid and full_join

	library(dplyr)
	library(ggplot2)

	data <- read.csv("data.csv", sep = "\t")

	# Build every possible (week, plan) pair
	weeks <- unique(data$week)
	plans <- unique(data$plan)
	combinations <- expand.grid(week = weeks, plan = plans)

	# Add the missing pairs to the data, then turn their NA signups into 0
	data <- full_join(data, combinations, by = c("week" = "week", "plan" = "plan")) %>%
	  mutate(signups = ifelse(is.na(signups), 0, signups)) %>%
	  arrange(week, plan)

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	  geom_area(position = "stack")

	print(g)

Here’s how it works:

expand.grid creates “a data frame from all combinations of the supplied vectors or factors”. By passing it the weeks and plans, we get the following data frame called combinations:

	         week     plan
	 1 2017-01-26   Bronze
	 2 2017-02-05   Bronze
	 3 2017-02-12   Bronze
	 4 2017-02-19   Bronze
	 5 2017-02-26   Bronze
	 6 2017-01-26     Gold
	 7 2017-02-05     Gold
	 8 2017-02-12     Gold
	 9 2017-02-19     Gold
	10 2017-02-26     Gold
	11 2017-01-26 Standard
	12 2017-02-05 Standard
	13 2017-02-12 Standard
	14 2017-02-19 Standard
	15 2017-02-26 Standard
	16 2017-01-26     Iron
	17 2017-02-05     Iron
	18 2017-02-12     Iron
	19 2017-02-19     Iron
	20 2017-02-26     Iron
	21 2017-01-26   Silver
	22 2017-02-05   Silver
	23 2017-02-12   Silver
	24 2017-02-19   Silver
	25 2017-02-26   Silver
	26 2017-01-26 Platinum
	27 2017-02-05 Platinum
	28 2017-02-12 Platinum
	29 2017-02-19 Platinum
	30 2017-02-26 Platinum

The full_join then takes all of the rows from data and combines them with combinations based on week and plan. When there aren’t any matches (which will happen when a week doesn’t have a value for a plan), signups gets set to NA:

	         week     plan signups
	 1 2017-01-26   Bronze      10
	 2 2017-01-26     Gold      55
	 3 2017-01-26 Standard     108
	 4 2017-02-05   Bronze       6
	 5 2017-02-05     Iron       1
	 6 2017-02-05     Gold      37
	 7 2017-02-05 Standard     142
	 8 2017-02-12   Bronze      17
	 9 2017-02-12     Iron       2
	10 2017-02-12     Gold      42
	11 2017-02-12 Standard     119
	12 2017-02-19   Bronze      11
	13 2017-02-19     Gold      26
	14 2017-02-19   Silver       4
	15 2017-02-19 Platinum       1
	16 2017-02-19 Standard      70
	17 2017-02-26   Bronze      13
	18 2017-02-26   Silver       5
	19 2017-02-26 Standard      52
	20 2017-02-26     Gold      NA
	21 2017-01-26     Iron      NA
	22 2017-02-19     Iron      NA
	23 2017-02-26     Iron      NA
	24 2017-01-26   Silver      NA
	25 2017-02-05   Silver      NA
	26 2017-02-12   Silver      NA
	27 2017-01-26 Platinum      NA
	28 2017-02-05 Platinum      NA
	29 2017-02-12 Platinum      NA
	30 2017-02-26 Platinum      NA

Then we just use dplyr’s mutate to replace all of the NA values with zero, and voila:

	         week     plan signups
	 1 2017-01-26   Bronze      10
	 2 2017-01-26     Gold      55
	 3 2017-01-26     Iron       0
	 4 2017-01-26 Platinum       0
	 5 2017-01-26   Silver       0
	 6 2017-01-26 Standard     108
	 7 2017-02-05   Bronze       6
	 8 2017-02-05     Gold      37
	 9 2017-02-05     Iron       1
	10 2017-02-05 Platinum       0
	11 2017-02-05   Silver       0
	12 2017-02-05 Standard     142
	13 2017-02-12   Bronze      17
	14 2017-02-12     Gold      42
	15 2017-02-12     Iron       2
	16 2017-02-12 Platinum       0
	17 2017-02-12   Silver       0
	18 2017-02-12 Standard     119
	19 2017-02-19   Bronze      11
	20 2017-02-19     Gold      26
	21 2017-02-19     Iron       0
	22 2017-02-19 Platinum       1
	23 2017-02-19   Silver       4
	24 2017-02-19 Standard      70
	25 2017-02-26   Bronze      13
	26 2017-02-26     Gold       0
	27 2017-02-26     Iron       0
	28 2017-02-26 Platinum       0
	29 2017-02-26   Silver       5
	30 2017-02-26 Standard      52

2. Using spread and gather

The second method Charles provided uses the tidyr package’s spread and gather functions:

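	library(dplyr)
	library(ggplot2)

	data <- read.csv("data.csv", sep = "\t")

	data <- data %>%
	  # Long to wide: one column per plan, with missing combinations filled with 0
	  tidyr::spread(key = plan, value = signups, fill = 0) %>%
	  # Wide back to long: gather every column except week
	  tidyr::gather(key = plan, value = signups, -week) %>%
	  arrange(week, plan)

	g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
	  geom_area(position = "stack")

	print(g)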

The spread function takes the key-value pairs (plan and signups in this case) and spreads them across multiple columns, making the “long” data “wider” and filling in the missing values with 0:

	        week Bronze Gold Iron Platinum Silver Standard
	1 2017-01-26     10   55    0        0      0      108
	2 2017-02-05      6   37    1        0      0      142
	3 2017-02-12     17   42    2        0      0      119
	4 2017-02-19     11   26    0        1      4       70
	5 2017-02-26     13    0    0        0      5       52

Then we take the wide data and convert it back to long data using gather. The -week means to exclude the week column when gathering the data that spread produced:

	         week     plan signups
	 1 2017-01-26   Bronze      10
	 2 2017-01-26     Gold      55
	 3 2017-01-26     Iron       0
	 4 2017-01-26 Platinum       0
	 5 2017-01-26   Silver       0
	 6 2017-01-26 Standard     108
	 7 2017-02-05   Bronze       6
	 8 2017-02-05     Gold      37
	 9 2017-02-05     Iron       1
	10 2017-02-05 Platinum       0
	11 2017-02-05   Silver       0
	12 2017-02-05 Standard     142
	13 2017-02-12   Bronze      17
	14 2017-02-12     Gold      42
	15 2017-02-12     Iron       2
	16 2017-02-12 Platinum       0
	17 2017-02-12   Silver       0
	18 2017-02-12 Standard     119
	19 2017-02-19   Bronze      11
	20 2017-02-19     Gold      26
	21 2017-02-19     Iron       0
	22 2017-02-19 Platinum       1
	23 2017-02-19   Silver       4
	24 2017-02-19 Standard      70
	25 2017-02-26   Bronze      13
	26 2017-02-26     Gold       0
	27 2017-02-26     Iron       0
	28 2017-02-26 Platinum       0
	29 2017-02-26   Silver       5
	30 2017-02-26 Standard      52
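
The spread-then-gather round trip can be mimicked in plain Python to show why it fills the gaps. This is only a sketch of the idea, not how tidyr implements it; it uses three rows from the signups data, and the variable names are my own.

```python
# A few rows from the signups data, as (week, plan, signups) tuples.
long_rows = [
    ("2017-02-19", "Gold", 26),
    ("2017-02-19", "Silver", 4),
    ("2017-02-26", "Silver", 5),
]

plans = sorted({plan for _, plan, _ in long_rows})

# "Spread": one dict per week with a column per plan, pre-filled with 0.
wide = {}
for week, plan, count in long_rows:
    wide.setdefault(week, dict.fromkeys(plans, 0))[plan] = count

# "Gather": turn the plan columns back into one row per (week, plan).
long_again = [
    (week, plan, count)
    for week, columns in sorted(wide.items())
    for plan, count in columns.items()
]

for row in long_again:
    print(row)
```

Going wide forces every week to have a cell for every plan, so by the time the data is long again the missing Gold row for 2017-02-26 exists with a 0 in it.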

Using either method, we get a stacked area chart without the gaps ⚡️.

A Simple CROSS JOIN Example

Option 1: Using a subquery

Option 2: Using a variable

Option 3: Using CROSS JOIN

CROSS JOIN vs INNER JOIN

Removing Gaps from Stacked Area Charts in R

1. Using expand.grid and full_join

2. Using spread and gather