Counting in MySQL When Joins are Involved

There’s a MySQL pattern that I use fairly frequently that I want to share, both to help anyone else who might find it useful and also to find out if there’s a better way.

Here’s an example: you have one table for users, another for posts, another for pages:


DROP TABLE IF EXISTS users;
CREATE TABLE users (user_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(20));
INSERT INTO users (name) VALUES ('Matt');
INSERT INTO users (name) VALUES ('Simon');
INSERT INTO users (name) VALUES ('Jen');
DROP TABLE IF EXISTS posts;
CREATE TABLE posts (post_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (2);
INSERT INTO posts (user_id) VALUES (2);
DROP TABLE IF EXISTS pages;
CREATE TABLE pages (page_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT);
INSERT INTO pages (user_id) VALUES (2);
INSERT INTO pages (user_id) VALUES (2);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
+---------+-------+
| user_id | name  |
+---------+-------+
|       1 | Matt  |
|       2 | Simon |
|       3 | Jen   |
+---------+-------+
+---------+---------+
| post_id | user_id |
+---------+---------+
|       1 |       1 |
|       2 |       1 |
|       3 |       1 |
|       4 |       2 |
|       5 |       2 |
+---------+---------+
+---------+---------+
| page_id | user_id |
+---------+---------+
|       1 |       2 |
|       2 |       2 |
|       3 |       3 |
|       4 |       3 |
|       5 |       3 |
|       6 |       3 |
|       7 |       3 |
+---------+---------+


And we want to know how many posts and pages each user has:


+---------+------------+------------+
| user_id | post_count | page_count |
+---------+------------+------------+
|       1 |          3 |          0 |
|       2 |          2 |          2 |
|       3 |          0 |          5 |
+---------+------------+------------+


Attempt 1: COUNT with JOIN

For the moment, let’s focus on just getting the post count by user.

We might be tempted to try JOINing the two tables and using COUNT:


SELECT users.user_id, COUNT(*) AS post_count
FROM users
JOIN posts ON posts.user_id = users.user_id
GROUP BY 1;
+---------+------------+
| user_id | post_count |
+---------+------------+
|       1 |          3 |
|       2 |          2 |
+---------+------------+


The problem is that when we JOIN the two tables, the results will only include users who have posts. In this case, there’s no result for Jen (user id 3) because she doesn’t have any records in the posts table.

Attempt 2: COUNT with LEFT JOIN

Ok, so if JOIN doesn’t work, let’s try LEFT JOIN:


SELECT users.user_id, COUNT(*) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1;
+---------+------------+
| user_id | post_count |
+---------+------------+
|       1 |          3 |
|       2 |          2 |
|       3 |          1 |
+---------+------------+


Now Jen has a count, but it reflects 1 post, not 0 like we expect.

We can see why it’s broken by looking at the LEFT JOIN results without aggregating them:


SELECT *
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id;
+---------+-------+---------+---------+
| user_id | name  | post_id | user_id |
+---------+-------+---------+---------+
|       1 | Matt  |       1 |       1 |
|       1 | Matt  |       2 |       1 |
|       1 | Matt  |       3 |       1 |
|       2 | Simon |       4 |       2 |
|       2 | Simon |       5 |       2 |
|       3 | Jen   |    NULL |    NULL |
+---------+-------+---------+---------+

Jen doesn’t have any posts, but because we’re LEFT JOINing, her users record is still included. When we then aggregate the results with GROUP BY and COUNT, MySQL sees one record in her group and returns a count of 1.
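
The root issue is the difference between COUNT(*), which counts rows, and COUNT(expression), which skips NULLs. Here’s a minimal illustration against the same LEFT JOIN, restricted to Jen:

SELECT COUNT(*) AS all_rows, COUNT(post_id) AS non_null_post_ids
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
WHERE users.user_id = 3;
+----------+-------------------+
| all_rows | non_null_post_ids |
+----------+-------------------+
|        1 |                 0 |
+----------+-------------------+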

A better way to do this (as suggested by Tom Davies) is instead of counting all records, only count post ids:


SELECT users.user_id, COUNT(post_id) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1;
+---------+------------+
| user_id | post_count |
+---------+------------+
|       1 |          3 |
|       2 |          2 |
|       3 |          0 |
+---------+------------+

Attempt 3: SUM/IF and LEFT JOIN

Another way we can achieve what we want (for a single table join) is to use SUM/IF:


SELECT users.user_id, SUM(IF(post_id IS NULL, 0, 1)) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1;
+---------+------------+
| user_id | post_count |
+---------+------------+
|       1 |          3 |
|       2 |          2 |
|       3 |          0 |
+---------+------------+

Here we’re saying “When the post id is NULL, count it as 0, otherwise as 1, then sum the results,” which gives us the correct count. This is actually a trick I learned from the real Jen while at Automattic :).
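
As an aside, because MySQL treats boolean expressions as 1 or 0, the IF can be dropped entirely; this shorter form should produce the same counts:

SELECT users.user_id, SUM(post_id IS NOT NULL) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1;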

The problem arises when we need to join multiple tables:


SELECT
  users.user_id,
  SUM(IF(post_id IS NULL, 0, 1)) AS post_count,
  SUM(IF(page_id IS NULL, 0, 1)) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1;

# or

SELECT
  users.user_id,
  COUNT(post_id) AS post_count,
  COUNT(page_id) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1;
+---------+------------+------------+
| user_id | post_count | page_count |
+---------+------------+------------+
|       1 |          3 |          0 |
|       2 |          4 |          4 |
|       3 |          0 |          5 |
+---------+------------+------------+

Now instead of Simon (user id 2) having 2 posts, he has 4 – what happened? Again, we can look at the un-aggregated results:


SELECT *
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id;
+---------+-------+---------+---------+---------+---------+
| user_id | name  | post_id | user_id | page_id | user_id |
+---------+-------+---------+---------+---------+---------+
|       2 | Simon |       4 |       2 |       1 |       2 |
|       2 | Simon |       5 |       2 |       1 |       2 |
|       2 | Simon |       4 |       2 |       2 |       2 |
|       2 | Simon |       5 |       2 |       2 |       2 |
|       1 | Matt  |       1 |       1 |    NULL |    NULL |
|       1 | Matt  |       2 |       1 |    NULL |    NULL |
|       1 | Matt  |       3 |       1 |    NULL |    NULL |
|       3 | Jen   |    NULL |    NULL |       3 |       3 |
|       3 | Jen   |    NULL |    NULL |       4 |       3 |
|       3 | Jen   |    NULL |    NULL |       5 |       3 |
|       3 | Jen   |    NULL |    NULL |       6 |       3 |
|       3 | Jen   |    NULL |    NULL |       7 |       3 |
+---------+-------+---------+---------+---------+---------+

The problem is that each post gets joined with each page. Simon has 2 posts, each of which gets joined with his 2 pages, so when we use COUNT it sees 4 rows and returns that count.

The solution: Subqueries and COALESCE

Here’s how I would solve it:


SELECT
  users.user_id,
  COALESCE(post_count, 0) AS post_count,
  COALESCE(page_count, 0) AS page_count
FROM users
LEFT JOIN (
  SELECT user_id, COUNT(*) AS post_count
  FROM posts
  GROUP BY user_id
) post_counts ON post_counts.user_id = users.user_id
LEFT JOIN (
  SELECT user_id, COUNT(*) AS page_count
  FROM pages
  GROUP BY user_id
) page_counts ON page_counts.user_id = users.user_id;
+---------+------------+------------+
| user_id | post_count | page_count |
+---------+------------+------------+
|       1 |          3 |          0 |
|       2 |          2 |          2 |
|       3 |          0 |          5 |
+---------+------------+------------+


To understand how it works, let’s focus on counting posts. The first subquery counts how many posts each user has, but only for users who have at least one post:


SELECT user_id, COUNT(*) AS post_count
FROM posts
GROUP BY user_id;
+---------+------------+
| user_id | post_count |
+---------+------------+
|       1 |          3 |
|       2 |          2 |
+---------+------------+


We can then LEFT JOIN this derived table onto users:


SELECT *
FROM users
LEFT JOIN (
  SELECT user_id, COUNT(*) AS post_count
  FROM posts
  GROUP BY user_id
) post_counts ON post_counts.user_id = users.user_id;
+---------+-------+---------+------------+
| user_id | name  | user_id | post_count |
+---------+-------+---------+------------+
|       1 | Matt  |       1 |          3 |
|       2 | Simon |       2 |          2 |
|       3 | Jen   |    NULL |       NULL |
+---------+-------+---------+------------+

For the users with posts, the result has the post count. For the users without posts, the result is NULL. This is where COALESCE comes into play: COALESCE takes any number of arguments and returns the first non-NULL one:


SELECT COALESCE(NULL, 0);
+-------------------+
| COALESCE(NULL, 0) |
+-------------------+
|                 0 |
+-------------------+

So we’re saying “If the user has a post count, use that, otherwise use 0”.
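
MySQL also has IFNULL, a two-argument equivalent that would work just as well here:

SELECT IFNULL(NULL, 0);
+-----------------+
| IFNULL(NULL, 0) |
+-----------------+
|               0 |
+-----------------+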

We can then LEFT JOIN again on the pages table and do the same thing with the page count to get the post and page counts per user. Because each subquery returns at most one row per user, we don’t run into the issue we did earlier where posts get joined with pages and inflate the counts.
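
You can verify that by looking at both derived tables joined onto users without any aggregation; each user appears exactly once:

SELECT *
FROM users
LEFT JOIN (
  SELECT user_id, COUNT(*) AS post_count
  FROM posts
  GROUP BY user_id
) post_counts ON post_counts.user_id = users.user_id
LEFT JOIN (
  SELECT user_id, COUNT(*) AS page_count
  FROM pages
  GROUP BY user_id
) page_counts ON page_counts.user_id = users.user_id;
+---------+-------+---------+------------+---------+------------+
| user_id | name  | user_id | post_count | user_id | page_count |
+---------+-------+---------+------------+---------+------------+
|       1 | Matt  |       1 |          3 |    NULL |       NULL |
|       2 | Simon |       2 |          2 |       2 |          2 |
|       3 | Jen   |    NULL |       NULL |       3 |          5 |
+---------+-------+---------+------------+---------+------------+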

Somewhat complicated, but the only way I know how to do it. If you know of a better way, please drop a comment below. Thanks!

Edit:

This simpler method also works:


SELECT
  users.user_id,
  COUNT(DISTINCT post_id) AS post_count,
  COUNT(DISTINCT page_id) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1;
+---------+------------+------------+
| user_id | post_count | page_count |
+---------+------------+------------+
|       1 |          3 |          0 |
|       2 |          2 |          2 |
|       3 |          0 |          5 |
+---------+------------+------------+

By counting distinct post ids and page ids, we avoid counting NULLs and also avoid counting the duplicate rows created by joining on both posts and pages. This works because post_id and page_id are primary keys, so each underlying row is counted exactly once.

Using Data to Forecast the Impact of a Pricing Change

Back in April of this year, Help Scout announced we would be raising prices for customers on some of our old legacy plans in six months’ time (in October). I recently helped with an analysis to estimate what the impact would be on our monthly recurring revenue (MRR). We performed a similar analysis prior to the announcement, but several months had passed, so it was time for a fresh forecast.

At a high level, we performed the analysis as follows:

1. Identify the variables that impact the forecast

For us, this meant variables such as:

  • Of the customers we announced the price increase for in April, how many have churned between then and now? How many can we expect to churn between now and October? And how many can we expect to churn after the price increase?
  • How many can we expect to upgrade or downgrade after the price increase?
  • How many can we expect to change from monthly payments to discounted annual payments?
  • Because customers pay per user, what % of users can we expect to lose after the price increase?
  • And so on.

2. Create a spreadsheet that lets you adjust the variables to see what impact they have on the forecast

For example (and simplifying a lot), if we had W customers originally, X have churned between the announcement and now, we expect another Y to churn between now and the price increase, and we expect a fraction Z of the remainder to churn after the price increase, with the survivors paying $N/month on average, we’ll wind up with a future MRR of (W - X - Y) * (1 - Z) * N.
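
To make that concrete with made-up numbers (purely illustrative, not Help Scout’s actuals): if W = 1,000 customers, X = 50 have churned since the announcement, we expect Y = 30 more to churn before October, we expect Z = 10% of the rest to churn after the increase, and the survivors pay N = $25/month, the forecast is (1,000 - 50 - 30) * (1 - 0.10) * $25 = 920 * 0.9 * $25 = $20,700 in MRR.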

Our actual spreadsheet wound up becoming quite complex in order to account for all of the variables, but in the end we had something that captured all of the key factors that impact the outcome.

3. Come up with a range of estimates for each variable

Using historic trends and educated guesses, we estimated the range for each of the key variables we identified. With those in hand, we created several forecasts (pessimistic, realistic, optimistic) showing what the outcome looks like in each of those situations:

[Screenshot: spreadsheet showing the pessimistic, realistic, and optimistic forecasts]

My original instinct was to come back with a single number (“The forecast is $X”), but my lead wisely suggested calculating several outcomes to account for the range of possibilities.

This was a fascinating exercise because it forced us to understand on a deep level what the inputs are (churn rate, etc.) and what impact they have on our bottom line (MRR).

If you’re interested in trying this for your business, try to create a simple spreadsheet model that takes into account various metrics (number of unique visitors, trial sign-up rate, trial-to-paid rate, etc.) and comes close to predicting your historic results, then see how well it does going forward. You’ll likely learn a lot in the process about which metrics you need to change, and by how much, to achieve your growth goals.
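
For example, with purely hypothetical numbers: 10,000 unique visitors/month * a 2% trial sign-up rate * a 40% trial-to-paid rate * $25/month in average revenue per customer predicts 10,000 * 0.02 * 0.40 * $25 = $2,000 in new MRR each month. If your actual results come in well above or below that, one of the input estimates is off, and figuring out which one is where the learning happens.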

How to Schedule Cloning your Heroku Postgres Database Locally

In my last post, I wrote about how to set up a Bash alias to pull your Heroku data into a local Postgres database. This post takes it a step further by showing how to automate it so that the database is automatically updated on a regular basis.

Background

We left off the last post with this Bash alias:

alias prdb="dropdb preceden_production_copy; PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_MAGENTA preceden_production_copy --app sushi"

This lets me run prdb which first deletes my local database copy, then pulls the production data into a new one.

When attempting to schedule this as a cronjob, I ran into several issues.

First, dropdb won’t work if the database is being used by any applications. You’ll get an error message like:

dropdb: database removal failed: ERROR:  database "preceden_production_copy" is being accessed by other users

This is a problem because I usually leave Metabase and SQLPro for Postgres open, which causes dropdb to throw an error, which in turn causes pg:pull to fail because the database already exists. I then have to shut down the apps and try again, and even then there’s often some hidden connection that requires a complete reboot.

Second, it usually takes about half an hour for pg:pull to load all of the data into a database. If I’m eager to dive into the latest data, it’s frustrating to have to wait that long to do it.

Lastly, you can’t use aliases in a cronjob, meaning you’d have to duplicate the content of the alias in the cronjob.

As we’ll see, we can work through each of these issues.

Solution

There’s a way with Postgres to drop existing connections. By executing this query through the terminal before we attempt to drop the database, we ensure the dropdb command will work. Here’s what it looks like:

echo "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='preceden_production_copy';" | psql -U postgres

If all goes well, you’ll see something like:

pg_terminate_backend 
----------------------
 t
 t
(2 rows)
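
One optional tweak: if the session running this query could itself be connected to the target database, it’s common to exclude your own backend so you don’t terminate yourself. That shouldn’t apply here (psql -U postgres connects to the default postgres database), but it’s a cheap safeguard:

echo "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='preceden_production_copy' AND pid <> pg_backend_pid();" | psql -U postgres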

Next, in order to avoid duplicating the alias command and the cronjob command, we’ll combine all of the commands into a Bash script which we’ll execute from both places:

#!/bin/bash

current_date_time=$(date "+%Y-%m-%d %H:%M:%S")
echo "Running at: $current_date_time"

# Drop any open connections so dropdb doesn't fail
echo "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='preceden_production_copy';" | psql -U postgres

# Recreate the local copy from the production database
dropdb preceden_production_copy
PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_MAGENTA preceden_production_copy --app sushi

This will output the date and time when the script is being run, then drop any open database connections, remove the database, then pull the production Postgres data into a new database.
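
One detail to take care of before either the alias or cron can run the script: it needs to be executable:

chmod +x /Users/matt/Projects/Preceden/prdb.sh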

Next, we can update the alias to point to this script:

alias prdb="/Users/matt/Projects/Preceden/prdb.sh"

And we can point to it from a cronjob:

0 12 * * * /Users/matt/Projects/Preceden/prdb.sh >> /Users/matt/Projects/Preceden/prdb.log 2>&1

This says “At noon every day, run prdb.sh and append the output to prdb.log”.
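
To install it, edit your crontab and paste that line in:

crontab -e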

And voila! The data should get updated automatically once a day thanks to the cronjob and you still have the ability to run it manually using the alias.

Exploring your Heroku Rails app’s database using SQLPro for Postgres

In the past when I’ve wanted to explore production data for a Heroku-hosted Ruby on Rails app, I’ve primarily used heroku console and rake tasks. Each method has limitations though: heroku console makes it easy to answer simple questions about your data, but makes it difficult to perform complicated analyses that take more than a few lines of code. Rake tasks let you perform complex analyses, but make it difficult to explore data because each time you tweak your task to do something new, you need to commit, push to production, run the task, and wait for it to execute. Neither option makes it easy to quickly explore the data.

Wouldn’t it be nice if you could quickly query your database and explore the results?

Fortunately there is a way using a combination of Heroku’s pg:pull feature and a Mac app called SQLPro for Postgres. Here’s how it works:

Step 1: Pull your production data into a local Postgres database

Heroku makes this fairly easy using the pg:pull command:

$ heroku pg:pull HEROKU_POSTGRESQL_MAGENTA mylocaldb --app sushi

Where mylocaldb is the name of a local Postgres database, sushi is the name of your Heroku app, and HEROKU_POSTGRESQL_MAGENTA is the name of your database, which you can obtain by running:

$ heroku pg:info -a sushi

If your local Postgres instance requires a user name and password, you can provide them via the command line as well:

$ PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_MAGENTA mylocaldb --app sushi

In order for this command to work, mylocaldb can’t already exist. To delete it beforehand, you can run:

$ dropdb mylocaldb

For my own workflow, I combine them and use a Bash alias to make it easier to run:

alias prdb="dropdb preceden_production_copy; PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_MAGENTA preceden_production_copy --app sushi"

Then I can just run prdb (my shorthand for “Preceden Database”) from the command line to drop the old copy and grab the latest production data:

$ prdb
heroku-cli: Pulling postgresql-infinite-32999 ---> preceden_production_copy
pg_dump: last built-in OID is 16383
pg_dump: reading extensions
...

Step 2: Explore the data using SQLPro for Postgres

SQLPro for Postgres is a fantastic Mac app for exploring Postgres databases. You can also query the data other ways but for quickly exploring, querying, and exporting the data, SQLPro for Postgres is hard to beat.

Here’s what the UI looks like along with an example query to display the first 10 people to sign up:

[Screenshot: the SQLPro for Postgres query editor and results grid]
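
The query itself would look something like this (assuming a Rails-style users table with a created_at column; your actual schema may differ):

SELECT id, email, created_at
FROM users
ORDER BY created_at ASC
LIMIT 10;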

In future posts we’ll see how to query Postgres with R to analyze the data and gain insights about how people use our products.

If you’re interested in learning more, sign up for my new Data Science for Product Analytics newsletter to get notified when there are new posts.

Update: check out the follow up post, How to Schedule Cloning your Heroku Postgres Database Locally.