Counting in MySQL When Joins are Involved

There’s a MySQL pattern that I use fairly frequently that I want to share, both to help anyone else who might find it useful and also to find out if there’s a beter way.

Here’s an example: you have one table for users, another for posts, another for pages:

And we want to know how many posts and pages each user has:

Attempt 1: COUNT with JOIN

For the moment, lets focus on just getting the post count by user.

We might be tempted to try JOINing the two tables and using COUNT:

The problem is that when we JOIN the two tables, the results will only include users who have posts. In this case, there’s no result for Jen (user id 3) because she doesn’t have any records in the posts table.

Attempt 2: COUNT with LEFT JOIN

Ok, so if JOIN doesn’t work, lets try LEFT JOIN:

Now Jen has a count, but it’s reflecting 1 post, not 0 like we expect.

We can see why it’s broken by looking at the LEFT JOIN results without aggregating them:

Jen doesn’t have any posts, but because we’re LEFT JOINing, her users record is still included. When we then aggregate the results with GROUP BY and COUNT, MySQL sees that the results have one record so returns a count of 1.

A better way to do this (as suggested by Tom Davies) is instead of counting all records, only count post ids:

Attempt 3: SUM/IF, and LEFT JOIN

Another way we can achieve what we want (for a single table join) is to use SUM/IF:

Here we’re saying “When the post id is null, give it a 0, otherwise a 1, then sum the results” which gives us the correct count. This is actually I trick I learned from the real Jen while at Automattic :).

The problem arises when we need to join multiple tables:

Now instead of Simon (user id 2) having 2 posts, he has 4 – what happened? Again, we can look at the un-aggregated results:

The problem is that each post gets joined with each page result. Simon has 2 posts, each of which gets joined with the 2 pages, so when we use COUNT it sees 4 results and returns that amount.

The solution: Subqueries and COALESCE

Here’s how I would solve it:

To understand how it works, lets focus on counting posts. The first subquery counts how many posts each user has if they have any posts:

We can then LEFT JOIN users on this derived table:

For the users with posts, the result has the post count. For the users without posts, the result is NULL. This is where COALESCE comes into play. COALESCE takes any number of arguments and returns the first non-NULL result:

So we’re saying “If the user has a posts count, use that, otherwise use 0”.

We can then LEFT JOIN again on the pages table and do the same thing with the pages count to get the posts and pages count per user. Because each subquery only returns a max of one result per user, we don’t run into the issue we did earlier where posts get joined with pages to return the incorrect count.

Somewhat complicated, but the only way I know how to do it. If you know of a better way, please drop a comment below. Thanks!

Edit:

This simpler method also works:

By counting distinct post ids and page ids, we avoid counting NULLs and also avoid counting duplicates due to joining on posts and pages.

Experimenting with waking up early to work on side projects

In the past when I’ve worked on side projects it’s been primarily on nights and weekends. Between school and work, I never had time for them during the day and because I’m not much of a morning person, nights and weekends always worked better.

When my son was born about three years ago, a lot of the time that I would have put into side projects went towards trying to be a good dad to him and now also to my daughter. This led to less time during the week because after I would get done with work, I would spend time with them and after they would go to sleep, I wanted to spend time with my wife and relax. I also can’t stay up too late because I needed to get up with the kids in the morning. Similar story on weekends.

This has led to my side projects time dropping from maybe 10-15 hours/week to only 3-4 hours/week for the last several years.

Inspired by my friend and coworker Dave Martin, I’ve started experimenting with waking up early to work on side projects. Not 4am-5am like him (!), but just at 6am. That gives me 1 to 1.5 hours to focus on side projects most days. I take Saturday off and also wind up skipping some other days, but it’s probably doubled the amount of time I put into side projects to around 7-8 hours per week.

To make this work, I’ve also had to adjust my evening routine. Instead of going to bed around 10:30pm, I try to be in bed by around 9:30pm to make sure I get in 8-9 hours. I wind up watching less Netflix, but don’t feel like I’m missing anything important.

And as a side effect, I’ve found that I’m sleeping a lot better now that I’m no longer jumping straight from coding into bed a lot of nights. All too often I’d just lie there for hours waiting for problem solving mode to turn off. Now I read and it makes it much easier to fall asleep.

If you typically work on side projects at night like I did in the past, I encourage you to experiment with waking up and going to bed early for a few weeks to see how it works for you.

HelpU, a new customer service education platform by Help Scout, is now live! 🚀

Screen Shot 2017-10-13 at 2.00.12 PM.png

My teammates at Help Scout have been hard at work on a new resource called HelpU to help you and your team learn how to deliver worldclass customer service. Whether you’re a member of a large support team (hi Automattic friends!) or handle support by yourself for a small side project (hi Microconf friends!), you’ll find a ton of practical guidance in HelpU to make you even better at wowing your customers.

You’ll be able to learn how to write effective knowledge base articles, how architecture your knowledge base, how to leverage data to make you and your customers more successful, how to save time with saved replieshow to foster a customer-focused company culture, and a whole lot more.

And even if you’re not involved in customer service at all, you should visit anyway just to marvel at its beautiful design 😍.

Check it out: HelpU: Customer Service and Education by Help Scout

Building vs Buying Software

At Automattic and now at Help Scout I’ve been involved in several discussions about the merits of building something internally vs paying for a third party SaaS tool or plugin. Examples include the internal analytics platform Automattic (which we wound up building ourselves) and a business intelligence tool at Help Scout (which we wound up buying).

If you’re lucky it will be an easy decision, but more often than not there will be pros and cons to both approaches that you’ll have to weigh. Some of the factors include:

Do you have the engineering resources and know-how to build it? If not, you’ll obviously have to buy.

How much time will it take to build? As anyone who has built software knows, estimating how much time it will take to “complete” is often incredibly difficult. A one month project could easily turn into four after all is said and done. The more complex the project and the longer it will take to build, the more you should lean towards buying.

How much does it cost to buy? The cheaper it is the more it makes sense to buy.

How much will it cost to build? At first this might seem like $0, but engineers are expensive. A developer making $120K/year might be $180K/year fully loaded (with benefits, taxes, etc) or $15K/month. Several developers spending a month on a project can easily cost tens of thousands of dollars. And that’s not even counting PMs, designers, and anyone else involved in the project. It’s tempting to think “we have to pay them anyway so it doesn’t really cost us that”, but it’s not so simple once you consider what other revenue-generating projects they could be working on instead (and you don’t have to employ them anyway). There’s a real opportunity cost to building something internally.

Does the thing you’ll buy meet all of your requirements? If it’s critical that the software do something and the SaaS you’re considering doesn’t do it, you’ll probably need to build. However, you might be better off paying for a tool that is 90% of what you think you need vs spending weeks or months building something that might be 100%.

What other capabilities does the tool you’ll buy provide? While you might be able to build something that meets your minumum requirements, a tool you buy is likely going to do a lot more than the minimum. You may not think you need those extra features now, but there’s a good chance that you will later. At that point you might have little choice but to keep iterating on what you’ve built which can easily extend the duration of a project.

Is it important to own the data? Third party tools will likely collect user data which may be a concern both from privacy and analysis perspectives.

How likely is it that the third party solution will exist in 5 years? And are there alternatives you could switch to if that hot-new-startup you pay for shuts their doors in six months?

Are you building this as a long term investment? When we were considering whether to buy or build Automattic’s internal analytics platform one of the factors was that we planned on Automattic existing in 10, 20+ years so it made sense to invest time building something that we could use for the long haul. Be skeptical with this one though: while we might hope that whatever we’re working on now will exist in 20 years, the reality is that it probably won’t.

I usually default to buy unless there is a compelling reason to build. In my experience software usually takes much longer to build than initially expected so it winds up making more sense to pay for a 90% solution that you can have immediately and devote those engineering resources to the core business. But again, it’s often not so simple in practice and there can be reasons to build instead (full control over the features and design, ownership of the data, cost, etc).

I’d love to hear about your experience buying vs building. Was there ever a time you built something that you later regretted or vice versa?

Using Data to Forecast the Impact of a Pricing Change

Back in April of this year Help Scout announced we would be raising prices for customers on some of our old legacy plans in six months time (in October). I recently helped with an analysis to estimate what the impact would be on our monthly recurring revenue (MRR). We performed a similar analysis prior to the announcement, but severals months had passed so it was time for fresh forecast.

At a high level, we performed the analysis as follows:

1. Identify the variables that impact the forecast

For us, this meant variables such as:

  • Of of the customers who we announced the price increase for in April, how many have churned between then and now? And how many can we expect to churn between now and October? And how many can we expect to churn after the price increase?
  • How many can we expect to upgrade or downgrade after the price increase?
  • How many can we expect to change from monthly payments to discounted annual payments?
  • Because customers pay per user, what % of users can we expect to lose after the price increase?
  • And so on.

2. Create a spreadsheet that lets you adjust the variables to see what impact they have on the forecast

For example (and simplifying a lot), if we had W customers originally and X have churned between the announcement and now, and we expect another Y to churn between now and the price increase, and we expect Z% to churn after the price increase, paying $N/month on average, we’ll wind up with a future MRR of (W – X – Y) * (1 – Z) * N.

Our actual spreadsheet wound up become quite complex to take into account all of the variables, but in the end we had something that took into account all of the key factors that impact will the outcome.

3. Come up with a range of estimates for each of variable

Using historic trends and educated guesses, we estimated the range for each of the key variables we identified. With those in hand, we create several forecasts (pessmistic, realistic, optimistic) showing what the outcome looks like for each of those situations:

Screen Shot 2017-08-04 at 3.40.52 PM.png

My original instinct was to come back with a single number: “The forecast is $X” but my lead wisely suggested calculating several outcomes to account for the range of possibilities.

This was a fascinating exercise because it forced us understand on a deep level what the inputs are (churn rate, etc) and what impact they have on our bottom line (MRR).

If you’re interested in trying this for your business, try to create a simple spreadsheet model that takes into account various metrics (number of unique visitors, trial sign up rate, trial to paid rate, etc) that comes close to predicting your historic results, then see how it well it does going forward. You’ll likely learn a lot in the process about which metrics you need to change and by how much to achieve your growth goals.

What I’ve Been Up To

This morning I was catching up with my friend and coworker Dave Martin and we got to talking about blogging and how we both miss casual blogging: things like writing about what we’ve working on, what issues we’re running into, what we’re learning etc.

The problem for me boils down to this really high bar I have set in my head for what’s worthy of a blog post. I like writing long technical posts and it’s difficult getting used to the idea that short nontechnical posts are just fine too.

So, in an effort to get back into blogging, here’s a quick update on what I’ve been up to:

I’ve been at Help Scout now for about 4½ months and am really enjoying it. My coworkers are awesome, I love the product, and I’m really getting to level up my data science and analytics skills which I plan to write more about in the future.

My only side project these days is my 7-year old timeline maker service, Preceden. Besides about an hour of customer support each week (via Help Scout, of course), it’s almost entirely passive, though I try to put a few hours into product development and marketing each month to keep improving it.

On the home front, my two year old son and one year old daughter are doing great. I feel incredibly lucky to work remotely which lets me spend more time with them and my wife each day.

I’ve been trying to focus more on my health lately, not because of any major issues, just in an effort to feel more energetic and less stressed each day. Things like sleeping and exercising more, avoiding coffee, meditating, not checking the news so often, etc. I’ve had mixed success maintaining these efforts long term though… it’s a work in progress :).

If we haven’t chatted in a while, I’d love to catch up. Drop me a note anytime by email at matthew.h.mazur@gmail.com or on Twitter/Telegram @mhmazur. Cheers!