# A Simple CROSS JOIN Example

99% of the queries I write to join tables wind up using JOIN (aka INNER JOIN) or LEFT JOIN so whenever there’s an opportunity to use one the other types, I get pretty excited 🙂. Today, that wound up being a CROSS JOIN.

Consider the following table containing charges:

How would you add add a column showing how much each charge represents as a percentage of the total charges?

## Option 1: Using a subquery

One way to solve this is to use a subquery:

For each record, we divide the amount by the sum of all the amounts to get the percentage.

## Option 2: Using a variable

Similar to the solution above, except here we save the sum of the amounts in a variable and then use that variable in the query:

## Option 3: Using CROSS JOIN

A cross join takes every row from the first table and joins it on every row in the second table. From w3resource.com: In this solution, we create a result set with one value (the sum of the amounts) and then cross join the charges table on it. That will add the `total` to each record, which we can then divide the `amount` by to get the percentage:

If we didn’t want the `total` column in the result, we could simply exclude it:

In this case there shouldn’t be any performance gains using the CROSS JOIN vs one of the other methods, but I find it more elegant than the subquery or variable solutions.

## CROSS JOIN vs INNER JOIN

Note that CROSS JOIN and INNER JOIN do the same thing, it’s just that because we’re not joining on a specific column, the convention is to use CROSS JOIN. For example, this produces the same result as the last CROSS JOIN example:

And so does this:

So why use CROSS JOIN at all? Per a Stack Overflow thread:

Using CROSS JOIN vs (INNER) JOIN vs comma

The common convention is:

* Use CROSS JOIN when and only when you don’t compare columns between tables. That suggests that the lack of comparisons was intentional.
* Use (INNER) JOIN with ON when and only when you have comparisons between tables (plus possibly other comparisons).
* Don’t use comma.

Props this Stack Overflow question for the tip about using CROSS JOIN to solve this type of problem.

# Removing Gaps from Stacked Area Charts in R

Creating a stacked area chart in R is fairly painless, unless your data has gaps. For example, consider the following CSV data showing the number of plan signups per week:

Plotting this highlights the problem: The reason the gaps exist is that not all plans have data points every week. Consider Gold, for example: during the first four weeks there are 55, 37, 42, and 26 signups, but during the last week there isn’t a data point at all. That’s why the chart shows the gap: it’s not that the data indicates Gold went to zero signups the final week; it indicates no data at all.

To remedy this, we need to ensure that every week contains a data point for every plan. That means for weeks where there isn’t a data point for a plan, we need to fill it in with 0 so that R knows that the signups are in fact 0 for that week.

I asked Charles Bordet, an R expert who I hired through Upwork to help me level up my R skills, how he would go about filling in the data.

He provided two solutions:

## 1. Using expand.grid and full_join

Here’s how it works:

`expand.grid` creates “a data frame from all combinations of the supplied vectors or factors”. By passing it in the weeks and plans, it generates the following data frame called `combinations`:

The `full_join` then takes all of the rows from `data` and combines them with `combinations` based on `week` and `plan`. When there aren’t any matches (which will happen when a week doesn’t have a value for a plan), `signups` gets set to `NA`:

Then we just use dplyr’s `mutate` to replace all of the `NA` values with zero, and voila:

## 2. Using spread and gather

The second method Charles provided uses the tidyr package’s `spread` and `gather` functions:

The `spread` function takes the key-value pairs (week and plan in this case) and spreads it across multiple columns, making the “long” data “wider”, and filling in the missing values with 0:

Then we take the wide data and convert it back to long data using `gather` The `- week` means to exclude the `week` column when gathering the data that `spread` produced:

Using either methods, we get a stacked area chart without the gaps ⚡️: 