Removing Gaps from Stacked Area Charts in R

Creating a stacked area chart in R is fairly painless, unless your data has gaps. For example, consider the following CSV data showing the number of plan signups per week:


+————+———-+———+
| week | plan | signups |
+————+———-+———+
| 2017-01-26 | Bronze | 10 |
| 2017-01-26 | Gold | 55 |
| 2017-01-26 | Standard | 108 |
| 2017-02-05 | Bronze | 6 |
| 2017-02-05 | Iron | 1 |
| 2017-02-05 | Gold | 37 |
| 2017-02-05 | Standard | 142 |
| 2017-02-12 | Bronze | 17 |
| 2017-02-12 | Iron | 2 |
| 2017-02-12 | Gold | 42 |
| 2017-02-12 | Standard | 119 |
| 2017-02-19 | Bronze | 11 |
| 2017-02-19 | Gold | 26 |
| 2017-02-19 | Silver | 4 |
| 2017-02-19 | Platinum | 1 |
| 2017-02-19 | Standard | 70 |
| 2017-02-26 | Bronze | 13 |
| 2017-02-26 | Silver | 5 |
| 2017-02-26 | Standard | 52 |
+————+———-+———+

Plotting this highlights the problem:


library(ggplot2)
data <- read.csv("dummy-data.csv", sep = "\t")
g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
geom_area()
print(g)

chart.png

The reason the gaps exist is that not all plans have data points every week. Consider Gold, for example: during the first four weeks there are 55, 37, 42, and 26 signups, but during the last week there isn’t a data point at all. That’s why the chart shows the gap: it’s not that the data indicates Gold went to zero signups the final week; it indicates no data at all.

To remedy this, we need to ensure that every week contains a data point for every plan. That means for weeks where there isn’t a data point for a plan, we need to fill it in with 0 so that R knows that the signups are in fact 0 for that week.

I asked Charles Bordet, an R expert who I hired through Upwork to help me level up my R skills, how he would go about filling in the data.

He provided two solutions:

1. Using expand.grid and full_join


data <- read.csv("data.csv", sep = "\t")
weeks <- unique(data$week)
plans <- unique(data$plan)
combinations <- expand.grid(week = weeks, plan = plans)
data <- full_join(data, combinations, by = c("week" = "week", "plan" = "plan")) %>%
mutate(signups = ifelse(is.na(signups), 0, signups)) %>%
arrange(week, plan)
g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
geom_area(position = "stack")
print(g)

Here’s how it works:

expand.grid creates “a data frame from all combinations of the supplied vectors or factors”. By passing it in the weeks and plans, it generates the following data frame called combinations:


week plan
1 20170126 Bronze
2 20170205 Bronze
3 20170212 Bronze
4 20170219 Bronze
5 20170226 Bronze
6 20170126 Gold
7 20170205 Gold
8 20170212 Gold
9 20170219 Gold
10 20170226 Gold
11 20170126 Standard
12 20170205 Standard
13 20170212 Standard
14 20170219 Standard
15 20170226 Standard
16 20170126 Iron
17 20170205 Iron
18 20170212 Iron
19 20170219 Iron
20 20170226 Iron
21 20170126 Silver
22 20170205 Silver
23 20170212 Silver
24 20170219 Silver
25 20170226 Silver
26 20170126 Platinum
27 20170205 Platinum
28 20170212 Platinum
29 20170219 Platinum
30 20170226 Platinum

The full_join then takes all of the rows from data and combines them with combinations based on week and plan. When there aren’t any matches (which will happen when a week doesn’t have a value for a plan), signups gets set to NA:


week plan signups
1 2017-01-26 Bronze 10
2 2017-01-26 Gold 55
3 2017-01-26 Standard 108
4 2017-02-05 Bronze 6
5 2017-02-05 Iron 1
6 2017-02-05 Gold 37
7 2017-02-05 Standard 142
8 2017-02-12 Bronze 17
9 2017-02-12 Iron 2
10 2017-02-12 Gold 42
11 2017-02-12 Standard 119
12 2017-02-19 Bronze 11
13 2017-02-19 Gold 26
14 2017-02-19 Silver 4
15 2017-02-19 Platinum 1
16 2017-02-19 Standard 70
17 2017-02-26 Bronze 13
18 2017-02-26 Silver 5
19 2017-02-26 Standard 52
20 2017-02-26 Gold NA
21 2017-01-26 Iron NA
22 2017-02-19 Iron NA
23 2017-02-26 Iron NA
24 2017-01-26 Silver NA
25 2017-02-05 Silver NA
26 2017-02-12 Silver NA
27 2017-01-26 Platinum NA
28 2017-02-05 Platinum NA
29 2017-02-12 Platinum NA
30 2017-02-26 Platinum NA

Then we just use dplyr’s mutate to replace all of the NA values with zero, and voila:


week plan signups
1 2017-01-26 Bronze 10
2 2017-01-26 Gold 55
3 2017-01-26 Iron 0
4 2017-01-26 Platinum 0
5 2017-01-26 Silver 0
6 2017-01-26 Standard 108
7 2017-02-05 Bronze 6
8 2017-02-05 Gold 37
9 2017-02-05 Iron 1
10 2017-02-05 Platinum 0
11 2017-02-05 Silver 0
12 2017-02-05 Standard 142
13 2017-02-12 Bronze 17
14 2017-02-12 Gold 42
15 2017-02-12 Iron 2
16 2017-02-12 Platinum 0
17 2017-02-12 Silver 0
18 2017-02-12 Standard 119
19 2017-02-19 Bronze 11
20 2017-02-19 Gold 26
21 2017-02-19 Iron 0
22 2017-02-19 Platinum 1
23 2017-02-19 Silver 4
24 2017-02-19 Standard 70
25 2017-02-26 Bronze 13
26 2017-02-26 Gold 0
27 2017-02-26 Iron 0
28 2017-02-26 Platinum 0
29 2017-02-26 Silver 5
30 2017-02-26 Standard 52

2. Using spread and gather

The second method Charles provided uses the tidyr package’s spread and gather functions:


data <- read.csv("data.csv", sep = "\t")
data <- data %>%
tidyr::spread(key = plan, value = signups, fill = 0) %>%
tidyr::gather(key = plan, value = signups, week) %>%
arrange(week, plan)
g <- ggplot(data, aes(x = week, y = signups, group = plan, fill = plan)) +
geom_area(position = "stack")
print(g)

The spread function takes the key-value pairs (week and plan in this case) and spreads it across multiple columns, making the “long” data “wider”, and filling in the missing values with 0:


week Bronze Gold Iron Platinum Silver Standard
1 2017-01-26 10 55 0 0 0 108
2 2017-02-05 6 37 1 0 0 142
3 2017-02-12 17 42 2 0 0 119
4 2017-02-19 11 26 0 1 4 70
5 2017-02-26 13 0 0 0 5 52

view raw

spread-data.txt

hosted with ❤ by GitHub

Then we take the wide data and convert it back to long data using gather The - week means to exclude the week column when gathering the data that spread produced:


week plan signups
1 2017-01-26 Bronze 10
2 2017-01-26 Gold 55
3 2017-01-26 Iron 0
4 2017-01-26 Platinum 0
5 2017-01-26 Silver 0
6 2017-01-26 Standard 108
7 2017-02-05 Bronze 6
8 2017-02-05 Gold 37
9 2017-02-05 Iron 1
10 2017-02-05 Platinum 0
11 2017-02-05 Silver 0
12 2017-02-05 Standard 142
13 2017-02-12 Bronze 17
14 2017-02-12 Gold 42
15 2017-02-12 Iron 2
16 2017-02-12 Platinum 0
17 2017-02-12 Silver 0
18 2017-02-12 Standard 119
19 2017-02-19 Bronze 11
20 2017-02-19 Gold 26
21 2017-02-19 Iron 0
22 2017-02-19 Platinum 1
23 2017-02-19 Silver 4
24 2017-02-19 Standard 70
25 2017-02-26 Bronze 13
26 2017-02-26 Gold 0
27 2017-02-26 Iron 0
28 2017-02-26 Platinum 0
29 2017-02-26 Silver 5
30 2017-02-26 Standard 52

Using either methods, we get a stacked area chart without the gaps ⚡️:

chart.png

One thought on “Removing Gaps from Stacked Area Charts in R

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s