Exploring ranking techniques in MySQL

While trying to wrangle a MySQL query to rank a set of results, I got sucked into a few Stack Overflow threads outlining various approaches.

In order to understand the pros and cons of each technique, I created some test data and reimplemented the solutions, all of which I’ve shared below.

If there are any other methods worth including here, please drop a note in the comments.

We’ll be working with this table:

DROP TABLE IF EXISTS users;

CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(20),
    start_date DATE,
    team_id INT
);
INSERT INTO users (name, start_date, team_id) VALUES
    ('Matt', '2017-01-01', 1),
    ('John', '2017-01-02', 2),
    ('Sara', '2017-01-02', 2),
    ('Tim', '2017-01-02', 3),
    ('Bob', '2017-01-03', 3),
    ('Bill', '2017-01-04', 3),
    ('Kathy', '2017-01-04', 3),
    ('Anne', '2017-01-05', 3);
+---------+-------+------------+---------+
| user_id | name  | start_date | team_id |
+---------+-------+------------+---------+
|       1 | Matt  | 2017-01-01 |       1 |
|       2 | John  | 2017-01-02 |       2 |
|       3 | Sara  | 2017-01-02 |       2 |
|       4 | Tim   | 2017-01-02 |       3 |
|       5 | Bob   | 2017-01-03 |       3 |
|       6 | Bill  | 2017-01-04 |       3 |
|       7 | Kathy | 2017-01-04 |       3 |
|       8 | Anne  | 2017-01-05 |       3 |
+---------+-------+------------+---------+

Ranked by start date

SELECT *
FROM users
ORDER BY start_date ASC

Ranked by start date with ties broken by user id

SELECT *
FROM users
ORDER BY start_date ASC, user_id ASC

First employee by start date with ties broken by user id

SELECT *
FROM users
ORDER BY start_date ASC, user_id ASC
LIMIT 1

First employee by start date with ties

SELECT *
FROM users
WHERE start_date = (SELECT MIN(start_date) FROM users);

Second employee by start date with ties broken by user id

SELECT *
FROM users
ORDER BY start_date ASC, user_id ASC
LIMIT 1
OFFSET 1

Second employee by start date with ties

SELECT *
FROM users
WHERE start_date = (
    SELECT DISTINCT start_date
    FROM users
    ORDER BY start_date ASC
    LIMIT 1
    OFFSET 1
)
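
If you're on MySQL 8.0 or later, window functions handle this kind of query directly. Here's a minimal sketch using DENSE_RANK, assuming MySQL 8.0+:

SELECT user_id, name, start_date, team_id
FROM (
    SELECT
        *,
        -- Rows sharing a start_date get the same rank, with no gaps
        DENSE_RANK() OVER (ORDER BY start_date ASC) AS start_rank
    FROM users
) ranked
WHERE start_rank = 2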

Ranked by start date using a variable

SET @rank := 0;
SELECT
    *,
    @rank := @rank + 1 AS rank
FROM users
ORDER BY start_date ASC
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    3 |
|       4 | Tim   | 2017-01-02 |       3 |    4 |
|       5 | Bob   | 2017-01-03 |       3 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |    7 |
|       8 | Anne  | 2017-01-05 |       3 |    8 |
+---------+-------+------------+---------+------+

Ranked by start date using a variable initialized in the query

Based on this Stack Overflow comment:

SELECT
    *,
    @rank := @rank + 1 AS rank
FROM users, (SELECT @rank := 0) r
ORDER BY start_date ASC
+---------+-------+------------+---------+------------+------+
| user_id | name  | start_date | team_id | @rank := 0 | rank |
+---------+-------+------------+---------+------------+------+
|       1 | Matt  | 2017-01-01 |       1 |          0 |    1 |
|       2 | John  | 2017-01-02 |       2 |          0 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |          0 |    3 |
|       4 | Tim   | 2017-01-02 |       3 |          0 |    4 |
|       5 | Bob   | 2017-01-03 |       3 |          0 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |          0 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |          0 |    7 |
|       8 | Anne  | 2017-01-05 |       3 |          0 |    8 |
+---------+-------+------------+---------+------------+------+

Note that the @rank := 0 column from the derived table winds up in the output because of the SELECT *; listing the desired columns explicitly would hide it.

First employee by start date by setting a variable

SET @rank := 0;
SELECT *
FROM (
    SELECT
        *,
        @rank := @rank + 1 AS rank
    FROM users
    ORDER BY start_date ASC
) ranked
WHERE rank = 1
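
A caveat on all of the variable-based approaches: MySQL's documentation warns that the order of evaluation for user variables within a SELECT is undefined, and starting with 5.7 the optimizer may ignore an ORDER BY inside a derived table altogether. One common (if inelegant) workaround is adding a large LIMIT to the inner query to force the sort; a sketch:

SET @rank := 0;
SELECT *
FROM (
    SELECT
        *,
        @rank := @rank + 1 AS rank
    FROM users
    ORDER BY start_date ASC
    -- The arbitrary large LIMIT keeps the optimizer
    -- from discarding the derived table's ORDER BY
    LIMIT 1000000
) ranked
WHERE rank = 1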

Ranked by start date with ties

Based on this Stack Overflow comment:

SET @prev_start_date := NULL;
SET @rank := 0;
SELECT
    *,
    CASE
        WHEN @prev_start_date = start_date THEN @rank
        -- The assignment below evaluates to the assigned date,
        -- so this branch is always true for non-NULL start dates
        WHEN @prev_start_date := start_date THEN @rank := @rank + 1
    END AS rank
FROM users
ORDER BY start_date ASC
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    2 |
|       5 | Bob   | 2017-01-03 |       3 |    3 |
|       6 | Bill  | 2017-01-04 |       3 |    4 |
|       7 | Kathy | 2017-01-04 |       3 |    4 |
|       8 | Anne  | 2017-01-05 |       3 |    5 |
+---------+-------+------------+---------+------+

Ranked by user id within each team

SELECT
    a.*,
    COUNT(*) AS rank
FROM users a
INNER JOIN users b
ON a.team_id = b.team_id AND a.user_id >= b.user_id
GROUP BY a.team_id, a.user_id

or, based on this Stack Overflow comment:

SELECT
    a.*,
    (
        SELECT COUNT(*)
        FROM users b
        WHERE a.team_id = b.team_id AND a.user_id >= b.user_id
    ) AS ranked
FROM users a
+---------+-------+------------+---------+--------+
| user_id | name  | start_date | team_id | ranked |
+---------+-------+------------+---------+--------+
|       1 | Matt  | 2017-01-01 |       1 |      1 |
|       2 | John  | 2017-01-02 |       2 |      1 |
|       3 | Sara  | 2017-01-02 |       2 |      2 |
|       4 | Tim   | 2017-01-02 |       3 |      1 |
|       5 | Bob   | 2017-01-03 |       3 |      2 |
|       6 | Bill  | 2017-01-04 |       3 |      3 |
|       7 | Kathy | 2017-01-04 |       3 |      4 |
|       8 | Anne  | 2017-01-05 |       3 |      5 |
+---------+-------+------------+---------+--------+

Note that both of these techniques require a column without duplicates that we can rank on within the partition. For example, we can't use start_date due to the duplicates within team 2 (2017-01-02) and team 3 (2017-01-04):

SELECT
    a.*,
    COUNT(*) AS rank
FROM users a
INNER JOIN users b
ON a.team_id = b.team_id AND a.start_date >= b.start_date
GROUP BY a.team_id, a.user_id
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    1 |
|       5 | Bob   | 2017-01-03 |       3 |    2 |
|       6 | Bill  | 2017-01-04 |       3 |    4 |
|       7 | Kathy | 2017-01-04 |       3 |    4 |
|       8 | Anne  | 2017-01-05 |       3 |    5 |
+---------+-------+------------+---------+------+
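
If you're on MySQL 8.0+, window functions sidestep this limitation, since DENSE_RANK with PARTITION BY handles duplicate values within a partition. A sketch:

SELECT
    *,
    -- Ties within a team share a rank; no unique column required
    DENSE_RANK() OVER (PARTITION BY team_id ORDER BY start_date ASC) AS team_rank
FROM users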

Return the last person to join within each team based on user id

Based on this Stack Overflow comment:

SELECT a.*
FROM users a
LEFT JOIN users b
ON a.team_id = b.team_id AND a.user_id < b.user_id
WHERE b.team_id IS NULL

or

SELECT a.*
FROM users a
WHERE user_id IN (
    SELECT MAX(user_id)
    FROM users
    GROUP BY team_id
)
+---------+------+------------+---------+
| user_id | name | start_date | team_id |
+---------+------+------------+---------+
|       1 | Matt | 2017-01-01 |       1 |
|       3 | Sara | 2017-01-02 |       2 |
|       8 | Anne | 2017-01-05 |       3 |
+---------+------+------------+---------+

Return the last people to join within each team based on start date

SELECT a.*
FROM users a
LEFT JOIN users b
ON a.team_id = b.team_id AND a.start_date < b.start_date
WHERE b.team_id IS NULL

or, based on this groupwise max post:

SELECT a.*
FROM users a
INNER JOIN (
    SELECT team_id, MAX(start_date) AS max_start_date
    FROM users b
    GROUP BY team_id
) max_start_dates
ON a.team_id = max_start_dates.team_id AND a.start_date = max_start_dates.max_start_date
+---------+------+------------+---------+
| user_id | name | start_date | team_id |
+---------+------+------------+---------+
|       1 | Matt | 2017-01-01 |       1 |
|       2 | John | 2017-01-02 |       2 |
|       3 | Sara | 2017-01-02 |       2 |
|       8 | Anne | 2017-01-05 |       3 |
+---------+------+------------+---------+
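
On MySQL 8.0+, the same groupwise max can be expressed with a window function while still keeping ties. A sketch:

SELECT user_id, name, start_date, team_id
FROM (
    SELECT
        *,
        -- RANK() = 1 marks the latest start date(s) within each team, ties included
        RANK() OVER (PARTITION BY team_id ORDER BY start_date DESC) AS recency_rank
    FROM users
) ranked
WHERE recency_rank = 1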

Ranked with gaps

Based on this Stack Overflow comment:

SELECT
    user_id,
    name,
    start_date,
    team_id,
    rank
FROM (
    SELECT
        *,
        IF(start_date = @_last_start_date, @cur_rank := @cur_rank, @cur_rank := @_sequence) AS rank,
        @_sequence := @_sequence + 1,
        @_last_start_date := start_date
    FROM users, (SELECT @cur_rank := 1, @_sequence := 1, @_last_start_date := NULL) r
    ORDER BY start_date
) ranked

Notice that after the three tied for second earliest start date, the next one jumps to 5 (not 3):

+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    2 |
|       5 | Bob   | 2017-01-03 |       3 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |    6 |
|       8 | Anne  | 2017-01-05 |       3 |    8 |
+---------+-------+------------+---------+------+

Technically this is known as the rank, whereas the gap-free numbering in the earlier examples is known as the dense rank.
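
For reference, MySQL 8.0's window functions map directly onto these concepts. A sketch comparing the three (note the backticks, since RANK and DENSE_RANK are reserved words in 8.0):

SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY start_date) AS row_num,      -- unique, ties broken arbitrarily
    RANK() OVER (ORDER BY start_date) AS `rank`,             -- ties share a rank, with gaps
    DENSE_RANK() OVER (ORDER BY start_date) AS `dense_rank`  -- ties share a rank, no gaps
FROM users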

Happy ranking!

Implementing the Global Revenue chart from Good Charts in R with ggplot2 📊

In order to level up my data visualization skills, I'm going to be attempting to implement, on my own, well-designed charts that others have created, using R and eventually other tools. I'll be open sourcing the code and sharing any lessons learned along the way.

For the first chart, I grabbed a seemingly simple chart from Scott Berinato's excellent book, Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations.

Here's the chart that I set out to reproduce in R (he created it in Illustrator):

[Image: goal.png]

Here's my attempt using R and ggplot2, whose code you can find in my new data viz repo on GitHub:

[Image: chart.png]

There were several tricky aspects of the original chart that presented challenges (er, learning opportunities) while implementing it in R:

  • In order to get the chart title, y-axis title, and source aligned with the left side of the chart, I wound up having to use R's graphical objects, aka "grobs"; hat-tip to this SO thread for the idea. It was possible to get pretty close by adjusting the hjust value for the chart title and subtitle, but the grob approach seemed less hacky.
  • Speaking of hacky, I wasn't able to figure out an elegant way to center the x-axis values between the ticks. I resorted to tinkering with the hjust value, but that just shifts them all to the right by a fixed amount. The problem is that some text like "2010" is wider than other text like "2011", which makes it impossible to get them all centered by shifting them by the same fixed amount. If anyone has a solution to this, I'd love to learn a better way to do it.
  • Removing the tick for the bottom ($0) y-axis value was "fun": I had to make it transparent because I couldn't figure out how to remove it completely. Again, Stack Overflow is awesome.
  • From a coding best practices perspective, there were several appearance-related values that had to be used in multiple locations in the code. To avoid duplication, I stored them in variables at the top of the script. It seems strange, though, having some appearance values stored in variables but others directly in the ggplot2 code. Then again, storing all of the appearance values in variables, even the ones only used once, seems like bad practice too.
  • I couldn't figure out how to get the ends of the x-axis line to stop at the ticks, so they wind up extending beyond them slightly in my version.
  • I'm not sure what fonts Scott used in Illustrator to create this chart, but even if I were 100% sure, I wouldn't want to shell out money to buy custom fonts for this little project, so I used similar ones that were already available on my MacBook (PT Sans Narrow for the title, Proxima Nova for everything else).
  • Towards the end there were diminishing returns in trying to get everything pixel perfect, so I settled on a 90% good enough solution.

As I recreate more charts there will undoubtedly be "a-ha!" moments that cause me to go back and clean up the code for my old charts. I'll try to share as many of those lessons learned here as well.

A New Adventure

Just a note for those of you who follow me online that after almost four years at Automattic, I’m moving on to try something new.

Automattic is an amazing company and I'm incredibly grateful for having had the chance to work there with such a talented group of people.

As for what's next: later this month I'll be joining Help Scout to work as a data scientist on their growth team. I'm really excited by the opportunity and hope to share a lot of what I learn on this blog. More to follow!

Learning to Read

[Image: book.jpg]

Over the last few years I've read maybe 2-3 books per year, and a good chunk of those were audiobooks, so I'm not sure they even qualify as reading. Part of that is that reading hasn't been a priority for me: given a choice between working on side projects, spending time with my family, watching Game of Thrones and other shows, or reading, I normally don't choose to read. Another part of it is that I've found it harder and harder to sit and read for more than a few minutes; years of Facebook and Twitter have taken their toll on my attention span.

That all changed recently when I made a few modifications to my reading habits. Many of these changes were inspired by discussions on Seeking Wisdom, a fantastic new podcast by David Cancel and Dave Gerhardt about a wide variety of topics including building great products, marketing, personal growth, work-life balance, and more.

Here are the key changes I made:

  • Reading physical books, not ebooks. Many people love the convenience of ebooks, but I've always preferred physical books. Rather than try to force myself to read on a Kindle or a tablet, I've switched back to reading physical books. There's just something about the texture and smell of physical books that I really enjoy. Also, by not reading on the Kindle app on my iPad, I can avoid notifications and other distractions like checking my email, Facebook, etc.
  • Reading multiple books at once. In the past I've always focused on reading one book at a time, but by having multiple books to choose from I can select one depending on the mood I'm in, which makes it more likely that I'll actually read. If I'm not in the mood to read a business book, for example, I have plenty of others to choose from. I no longer have to tell myself "I'll read that book after I finish this one." It does take longer to finish a book, but that's fine; the goal of reading isn't to finish books.
  • Allowing myself to stop reading books that don't interest me. In the past I would try to force myself to get through any book I started. The problem is that I'd inevitably wind up starting a book that didn't grab me, then I'd just quit reading altogether because I hadn't finished that book.
  • Skipping parts that don't hold my attention. Years ago I started reading Poor Charlie's Almanack, a treasure trove of wisdom from Berkshire Hathaway's Charlie Munger. The first chapter is about Charlie's life, and for whatever reason I struggled to get through it, so I stopped reading the book completely. This time around, I skipped that chapter and dove straight into the later chapters, which I've found a lot more engaging.
  • Making it a habit. Every night before I go to sleep, I try to read for 20 to 30 minutes. I do wind up watching less Netflix/HBO to make time for it, but that's a tradeoff I'm happy to make. As an added benefit, I've found it much easier to fall asleep after reading than after working on a side project or watching TV.

A lot of these seem obvious in retrospect, but it took a while for their importance to really click for me. I'd encourage anyone who wants to get back into reading to experiment and see what works for you.

If you have any other suggestions on how to read more effectively, don’t hesitate to drop me an email or leave a comment below.

My R Cheat Sheet, now available on GitHub

Despite working on and off with R for about two years now, I can never seem to remember how to do basic things when I return to it after a few weeks away.

I recently started keeping detailed notes for myself to minimize how much time I spend figuring things out that I already learned about in the past.

You can check out my cheat sheet on GitHub here:

https://github.com/mattm/r-cheat-sheet

It covers everything from data frames to working with dates and times to using ggplot and a lot more. I’ll update it periodically as I add new notes.

If you spot any mistakes or have any suggestions for how to improve it, don’t hesitate to shoot me an email.

Automattic is hiring a Marketing Data Analyst

We’re now accepting applications for a new Marketing Data Analyst position at Automattic that might interest some of you:

https://automattic.com/work-with-us/marketing-data-analyst/

In this role you'd be helping us use data and analytics to guide the direction of our marketing efforts on WordPress.com.

Here’s the official description:

Automattic is looking for a data analyst to join the marketing team. You will distill data into actionable insights to guide our customer marketing and retention strategy as well as inform product development efforts.

Primary responsibilities include:

  • Build and maintain standardized reporting on key metrics across the customer lifecycle.
  • Develop customer segmentation models to inform tailored, multi-channel marketing strategies.
  • Conduct ad hoc analyses to better understand customer behavior, needs, and individual test results.
  • Partner with other analysts and developers to increase data accessibility across the organization.
  • Design a process for prioritizing and communicating data requests and priorities.

You:

  • Are proficient in SQL and Excel.
  • Have experience with web analytics platforms such as Google Analytics, KISSmetrics, or Mixpanel.
  • Have experience working with marketing teams to deliver analyses and answer business questions.
  • Are able to communicate data in a way that is easy to understand and presents clear recommendations.
  • Are highly collaborative and experienced in working with business owners, executives, developers and creatives to discuss data, strategy and tests.
  • Have excellent prioritization and communication skills.
  • Ideally, have web development experience (though not required).

Like all positions at Automattic, you’ll work remotely, and can be based wherever you live. It’s not a requirement that you live in or relocate to the Bay Area to do this job.

If this sounds interesting to you (and how could it not?!?) there are instructions at the bottom of the job description about how to apply.

And if you have any questions about Automattic or this data analyst position, feel free to drop me an email: mhmazur@automattic.com.


Visualizing Your SaaS Plan Cancellation Curves

If you work on a SaaS product, you probably have a good idea of what its cancellation rates are, but chances are you don't know how they change over time. For example, what % of users cancel after 1 day? How about after 7 days, 30 days, etc.?

I worked on a project at Automattic this week to help us understand the cancellation curves for WordPress.com's plans, and am open sourcing the R script so anyone can do the same for their service.

Here's an example of how the cancellation curves might look for a service with a Gold, Silver, and Bronze plan:

[Image: example.png]

We can see that most users who cancel do so pretty quickly and that long term about 30% of Gold plan, 20% of Silver plan, and 10% of Bronze plan subscriptions wind up cancelled.

To generate this data for your own product, you'll just need three data points for each subscription: when it was purchased, when it was cancelled (if it was), and the name of the subscription. The script will take care of analyzing the data and generating the visualization.
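
For example, if your billing data lives in a subscriptions table (hypothetical table and column names), the export could be as simple as:

SELECT
    plan_name,     -- e.g. 'Gold', 'Silver', 'Bronze'
    purchased_at,  -- when the subscription was purchased
    cancelled_at   -- NULL if still active
FROM subscriptions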

You can check out the script and additional details on GitHub here: Cancellation Curve Visualizer.

If you have any questions or run into any issues, don’t hesitate to drop me a note.