Counting in MySQL When Joins are Involved

There’s a MySQL pattern that I use fairly frequently that I want to share, both to help anyone else who might find it useful and also to find out if there’s a beter way.

Here’s an example: you have one table for users, another for posts, another for pages:


DROP TABLE IF EXISTS users;
CREATE TABLE users (user_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(20));
INSERT INTO users (name) VALUES ('Matt');
INSERT INTO users (name) VALUES ('Simon');
INSERT INTO users (name) VALUES ('Jen');
DROP TABLE IF EXISTS posts;
CREATE TABLE posts (post_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (1);
INSERT INTO posts (user_id) VALUES (2);
INSERT INTO posts (user_id) VALUES (2);
DROP TABLE IF EXISTS pages;
CREATE TABLE pages (page_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT);
INSERT INTO pages (user_id) VALUES (2);
INSERT INTO pages (user_id) VALUES (2);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
INSERT INTO pages (user_id) VALUES (3);
+——-+——-+
| user_id | name |
+——-+——-+
| 1 | Matt |
| 2 | Simon |
| 3 | Jen |
+——-+——-+
+——-+———+
| post_id | user_id |
+——-+———+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
+——-+———+
+——-+———+
| page_id | user_id |
+——-+———+
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
| 5 | 3 |
| 6 | 3 |
| 7 | 3 |
+——-+———+

view raw

tables.sql

hosted with ❤ by GitHub

And we want to know how many posts and pages each user has:


+——-+————+————+
| user_id | post_count | page_count |
+——-+————+————+
| 1 | 3 | 0 |
| 2 | 2 | 2 |
| 3 | 0 | 5 |
+——-+————+————+

view raw

goal.sql

hosted with ❤ by GitHub

Attempt 1: COUNT with JOIN

For the moment, lets focus on just getting the post count by user.

We might be tempted to try JOINing the two tables and using COUNT:


SELECT users.user_id, COUNT(*) AS post_count
FROM users
JOIN posts ON posts.user_id = users.user_id
GROUP BY 1
+——-+————+
| user_id | post_count |
+——-+————+
| 1 | 3 |
| 2 | 2 |
+——-+————+

view raw

join.sql

hosted with ❤ by GitHub

The problem is that when we JOIN the two tables, the results will only include users who have posts. In this case, there’s no result for Jen (user id 3) because she doesn’t have any records in the posts table.

Attempt 2: COUNT with LEFT JOIN

Ok, so if JOIN doesn’t work, lets try LEFT JOIN:


SELECT users.user_id, COUNT(*) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1
+——-+————+
| user_id | post_count |
+——-+————+
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
+——-+————+

view raw

left-join.sql

hosted with ❤ by GitHub

Now Jen has a count, but it’s reflecting 1 post, not 0 like we expect.

We can see why it’s broken by looking at the LEFT JOIN results without aggregating them:


SELECT *
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
+——-+——-+———+———+
| user_id | name | post_id | user_id |
+——-+——-+———+———+
| 1 | Matt | 1 | 1 |
| 1 | Matt | 2 | 1 |
| 1 | Matt | 3 | 1 |
| 2 | Simon | 4 | 2 |
| 2 | Simon | 5 | 2 |
| 3 | Jen | NULL | NULL |
+——-+——-+———+———+

Jen doesn’t have any posts, but because we’re LEFT JOINing, her users record is still included. When we then aggregate the results with GROUP BY and COUNT, MySQL sees that the results have one record so returns a count of 1.

A better way to do this (as suggested by Tom Davies) is instead of counting all records, only count post ids:


SELECT users.user_id, COUNT(post_id) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1
+——-+————+
| user_id | post_count |
+——-+————+
| 1 | 3 |
| 2 | 2 |
| 3 | 0 |
+——-+————+

Attempt 3: SUM/IF, and LEFT JOIN

Another way we can achieve what we want (for a single table join) is to use SUM/IF:


SELECT users.user_id, SUM(IF(post_id IS NULL, 0, 1)) AS post_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
GROUP BY 1
+——-+————+
| user_id | post_count |
+——-+————+
| 1 | 3 |
| 2 | 2 |
| 3 | 0 |
+——-+————+

Here we’re saying “When the post id is null, give it a 0, otherwise a 1, then sum the results” which gives us the correct count. This is actually I trick I learned from the real Jen while at Automattic :).

The problem arises when we need to join multiple tables:


SELECT
users.user_id,
SUM(IF(post_id IS NULL, 0, 1)) AS post_count,
SUM(IF(page_id IS NULL, 0, 1)) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1
# or
SELECT
users.user_id,
COUNT(post_id) AS post_count,
COUNT(page_id) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1
+——-+————+————+
| user_id | post_count | page_count |
+——-+————+————+
| 1 | 3 | 0 |
| 2 | 4 | 4 |
| 3 | 0 | 5 |
+——-+————+————+

Now instead of Simon (user id 2) having 2 posts, he has 4 – what happened? Again, we can look at the un-aggregated results:


SELECT *
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
+——-+——-+———+———+———+———+
| user_id | name | post_id | user_id | page_id | user_id |
+——-+——-+———+———+———+———+
| 2 | Simon | 4 | 2 | 1 | 2 |
| 2 | Simon | 5 | 2 | 1 | 2 |
| 2 | Simon | 4 | 2 | 2 | 2 |
| 2 | Simon | 5 | 2 | 2 | 2 |
| 1 | Matt | 1 | 1 | NULL | NULL |
| 1 | Matt | 2 | 1 | NULL | NULL |
| 1 | Matt | 3 | 1 | NULL | NULL |
| 3 | Jen | NULL | NULL | 3 | 3 |
| 3 | Jen | NULL | NULL | 4 | 3 |
| 3 | Jen | NULL | NULL | 5 | 3 |
| 3 | Jen | NULL | NULL | 6 | 3 |
| 3 | Jen | NULL | NULL | 7 | 3 |
+——-+——-+———+———+———+———+

The problem is that each post gets joined with each page result. Simon has 2 posts, each of which gets joined with the 2 pages, so when we use COUNT it sees 4 results and returns that amount.

The solution: Subqueries and COALESCE

Here’s how I would solve it:


SELECT
users.user_id,
COALESCE(post_count, 0) AS post_count,
COALESCE(page_count, 0) AS page_count
FROM users
LEFT JOIN (
SELECT user_id, COUNT(*) AS post_count
FROM posts
GROUP BY user_id
) post_counts ON post_counts.user_id = users.user_id
LEFT JOIN (
SELECT user_id, COUNT(*) AS page_count
FROM pages
GROUP BY user_id
) page_counts ON page_counts.user_id = users.user_id
+——-+————+————+
| user_id | post_count | page_count |
+——-+————+————+
| 1 | 3 | 0 |
| 2 | 2 | 2 |
| 3 | 0 | 5 |
+——-+————+————+

view raw

solution.sql

hosted with ❤ by GitHub

To understand how it works, lets focus on counting posts. The first subquery counts how many posts each user has if they have any posts:


SELECT user_id, COUNT(*) AS post_count
FROM posts
GROUP BY user_id
+——-+————+
| user_id | post_count |
+——-+————+
| 1 | 3 |
| 2 | 2 |
+——-+————+

view raw

posts-count.sql

hosted with ❤ by GitHub

We can then LEFT JOIN users on this derived table:


SELECT *
FROM users
LEFT JOIN (
SELECT user_id, COUNT(*) AS post_count
FROM posts
GROUP BY user_id
) post_counts ON post_counts.user_id = users.user_id
+——-+——-+———+————+
| user_id | name | user_id | post_count |
+——-+——-+———+————+
| 1 | Matt | 1 | 3 |
| 2 | Simon | 2 | 2 |
| 3 | Jen | NULL | NULL |
+——-+——-+———+————+

For the users with posts, the result has the post count. For the users without posts, the result is NULL. This is where COALESCE comes into play. COALESCE takes any number of arguments and returns the first non-NULL result:


SELECT COALESCE(NULL, 0)
+—————–+
| COALESCE(NULL, 0) |
+—————–+
| 0 |
+—————–+

So we’re saying “If the user has a posts count, use that, otherwise use 0”.

We can then LEFT JOIN again on the pages table and do the same thing with the pages count to get the posts and pages count per user. Because each subquery only returns a max of one result per user, we don’t run into the issue we did earlier where posts get joined with pages to return the incorrect count.

Somewhat complicated, but the only way I know how to do it. If you know of a better way, please drop a comment below. Thanks!

Edit:

This simpler method also works:


SELECT
users.user_id,
COUNT(DISTINCT post_id) AS post_count,
COUNT(DISTINCT page_id) AS page_count
FROM users
LEFT JOIN posts ON posts.user_id = users.user_id
LEFT JOIN pages ON pages.user_id = users.user_id
GROUP BY 1
+——-+————+————+
| user_id | post_count | page_count |
+——-+————+————+
| 1 | 3 | 0 |
| 2 | 2 | 2 |
| 3 | 0 | 5 |
+——-+————+————+

By counting distinct post ids and page ids, we avoid counting NULLs and also avoid counting duplicates due to joining on posts and pages.

18 thoughts on “Counting in MySQL When Joins are Involved

  1. My solution involves the use of dependent subqueries.

    select user_id,
    (select count(*) from posts where posts.user_id=users.user_id) as post_count,
    (select count(*) from pages where pages.user_id=users.user_id) as page_count
    from users;

    To test performance differences, I loaded the tables with 16,000 posts and nearly 25,000 pages. Limited testing showed nearly identical performance with this query to your query using left join to select subqueries. Your updated simpler method took over 2000 times as long (nearly 3 minutes compared to .02 seconds) to process the same data.

    Using EXPLAIN with each of the queries shows that both of your approaches involves a filesort which is avoided with my query. Adding a key to the user_id on the posts and pages tables avoids the file sort and sped up the slow query to only take 18 seconds. That is still significantly slower then the other two queries.

    I do believe my approach is a bit easier to follow. I ran across this while trying to perform a similar task with a query containing about a dozen columns. More columns also required adding to the GROUP BY portion of the query.

    • WOW! This looks so simple but it solved such a huge thing for me.
      If it helps anyone else, I needed to find a SUM of columns from unique records from JOIN, and if this returned NULL, I needed the default value to be the result of another custom function from another table’s JOIN! It worked out for me because this third joined table (blogs) has one-to-one relationship with users.

      I finally did something like:
      select user_id,
      coalesce((select sum(reads) from posts where posts.user_id=users.user_id), defaultReads(blog.settings)) as post_reads,
      (select count(*) from pages where pages.user_id=users.user_id) as page_count
      from users join blogs on blogs.user_id=users.user_id;

      So now, the post_reads don’t undergo unwanted multiplication and I can still use a default value which depended on a join table.

  2. Hey, I stumbled upon this problem today. Here is what I found.

    I came across 3 ways to solve this “puzzle”, all of which have been mentioned:
    1. left join (select count group by)
    2. select (select count where)
    3. select count(distinct) group by

    Then I benchmarked the solutions against each other (over a 50k dataset), and there is a clear winner: left join (select count group by)
    (0.1s for 1, 0.5s for 2 and 5min for 3)

    It also has a small benefit: you may add other computations, such as sum and avg, and it’s cleaner than having multiple subqueries (select count), (select sum), etc.

    TLDR: choose subqueries and COALESCE

  3. Hi, how about, for example the post have dates corresponding to every post, and for example i wanted to filter date range how can i do that? thanks.

  4. Thank you very much for this. This is exactly what i was looking for. In my case the simple version of the query left out NULL results from the other table so i wasn’t getting the full data I needed. The option with COALESCE with the sub queries worked perfectly for me.

  5. Excellent article thanks. Turns out I was grouping by wrong column (the related one instead of the main table’s). Also didn’t know about GROUP BY 1, cool feature.

  6. Thank you Matt,
    This gives me the ideas to consider given the situation, esp when I am starting. Can I ask a silly question, as a beginner ?

    Why do we use JOIN, when we can use conditional statements like WHERE, WHEN / THEN and IF and LIKE features to play with

    • I’d recommend picking up a SQL book or searching online for a SQL course. DataCamp should have some good ones. You’ll get exposure to many SQL keywords this way.

      • Thank you Mazur, I will take this advice. I enrolled and looks good, not hours of classes which distracts me.. it seems to be result oriented

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s