Exploring your Heroku Rails app’s database using SQLPro for Postgres

In the past when I’ve wanted to explore production data for a Heroku-hosted Ruby on Rails app, I’ve primarily used heroku console and rake tasks. Each method has limitations though: heroku console makes easy to answer simple questions about your data, but makes it difficult to perform complicated analyses that take more than a few lines of code. Rake tasks let you perform complex analyses, but make it difficult to explore data because each time you tweak your task to do something new, you need to commit, push to production, run the task, and wait for it to execute. Neither option makes it easy to quickly explore the data.

Wouldn’t it be nice if you could quickly query your database and explore the results?

Fortunately there is a way using a combination of Heroku’s pg:pull feature and a Mac app called SQLPro for Postgres. Here’s how it works:

Step 1: Pull your production data into a local Postgres database

Heroku makes this fairly easy using the pg:pull command:

$ heroku pg:pull HEROKU_POSTGRESQL_MAGENTA mylocaldb --app sushi

Where mylocaldb is the name of a local Postgres database, sushi is the name of your Heroku app, and HEROKU_POSTGRESQL_MAGENT is the name of your database which you can obtain by running:

$ heroku pg:info -a sushi

If your local Postgres instance requires a user name and password, you can provide them via the command line as well:

$ PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_MAGENTA mylocaldb --app sushi

In order for this command to work, mylocaldb can’t exist when you run this command. To delete it beforehand, you can run:

$ dropdb mylocaldb

For my own workflow I terminate any existing Postgres connections (to ensure dropdb works), then execute dropdb, and then finally pg:pull using a Bash alias for it:

alias prdb="echo \"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='preceden_production_copy';\" | psql -U postgres; dropdb preceden_production_copy; PGUSER=postgres PGPASSWORD=password heroku pg:pull HEROKU_POSTGRESQL_AQUA_URL preceden_production_copy --app precedenv2"

Then I can just run prdb (my short hand for “Preceden Database”) from the command line to drop the old copy and grab the latest production data:

$ prdb
heroku-cli: Pulling postgresql-infinite-32999 ---> preceden_production_copy
pg_dump: last built-in OID is 16383
pg_dump: reading extensions
...

Step 2: Explore the data using SQLPro for Postgres

SQLPro for Postgres is a fantastic Mac app for exploring Postgres databases. You can also query the data other ways but for quickly exploring, querying, and exporting the data, SQLPro for Postgres is hard to beat.

Here’s what the UI looks like along with an example query to display the first 10 people to sign up:

sqlpro-for-postgres.jpg

In future posts we’ll see how to query Postgres with R to analyze the data and gain insights about how people use our products.

If you’re interested in learning more, sign up for my new Data Science for Product Analytics newsletter to get notified when there are new posts.

A Simple CROSS JOIN Example

99% of the queries I write to join tables wind up using JOIN (aka INNER JOIN) or LEFT JOIN so whenever there’s an opportunity to use one the other types, I get pretty excited 🙂. Today, that wound up being a CROSS JOIN.

Consider the following table containing charges:

How would you add add a column showing how much each charge represents as a percentage of the total charges?

Option 1: Using a subquery

One way to solve this is to use a subquery:

For each record, we divide the amount by the sum of all the amounts to get the percentage.

Option 2: Using a variable

Similar to the solution above, except here we save the sum of the amounts in a variable and then use that variable in the query:

Option 3: Using CROSS JOIN

A cross join takes every row from the first table and joins it on every row in the second table. From w3resource.com:

cross-join-round.png

In this solution, we create a result set with one value (the sum of the amounts) and then cross join the charges table on it. That will add the total to each record, which we can then divide the amount by to get the percentage:

If we didn’t want the total column in the result, we could simply exclude it:

In this case there shouldn’t be any performance gains using the CROSS JOIN vs one of the other methods, but I find it more elegant than the subquery or variable solutions.

CROSS JOIN vs INNER JOIN

Note that CROSS JOIN and INNER JOIN do the same thing, it’s just that because we’re not joining on a specific column, the convention is to use CROSS JOIN. For example, this produces the same result as the last CROSS JOIN example:

And so does this:

So why use CROSS JOIN at all? Per a Stack Overflow thread:

Using CROSS JOIN vs (INNER) JOIN vs comma

The common convention is:

* Use CROSS JOIN when and only when you don’t compare columns between tables. That suggests that the lack of comparisons was intentional.
* Use (INNER) JOIN with ON when and only when you have comparisons between tables (plus possibly other comparisons).
* Don’t use comma.

Props this Stack Overflow question for the tip about using CROSS JOIN to solve this type of problem.

Removing Gaps from Stacked Area Charts in R

Creating a stacked area chart in R is fairly painless, unless your data has gaps. For example, consider the following CSV data showing the number of plan signups per week:

Plotting this highlights the problem:

chart.png

The reason the gaps exist is that not all plans have data points every week. Consider Gold, for example: during the first four weeks there are 55, 37, 42, and 26 signups, but during the last week there isn’t a data point at all. That’s why the chart shows the gap: it’s not that the data indicates Gold went to zero signups the final week; it indicates no data at all.

To remedy this, we need to ensure that every week contains a data point for every plan. That means for weeks where there isn’t a data point for a plan, we need to fill it in with 0 so that R knows that the signups are in fact 0 for that week.

I asked Charles Bordet, an R expert who I hired through Upwork to help me level up my R skills, how he would go about filling in the data.

He provided two solutions:

1. Using expand.grid and full_join

Here’s how it works:

expand.grid creates “a data frame from all combinations of the supplied vectors or factors”. By passing it in the weeks and plans, it generates the following data frame called combinations:

The full_join then takes all of the rows from data and combines them with combinations based on week and plan. When there aren’t any matches (which will happen when a week doesn’t have a value for a plan), signups gets set to NA:

Then we just use dplyr’s mutate to replace all of the NA values with zero, and voila:

2. Using spread and gather

The second method Charles provided uses the tidyr package’s spread and gather functions:

The spread function takes the key-value pairs (week and plan in this case) and spreads it across multiple columns, making the “long” data “wider”, and filling in the missing values with 0:

Then we take the wide data and convert it back to long data using gather The - week means to exclude the week column when gathering the data that spread produced:

Using either methods, we get a stacked area chart without the gaps ⚡️:

chart.png

Exploring ranking techniques in MySQL

While trying to wrangle a MySQL query to rank a set of results, I got sucked into a few Stack Overflow threads outlining various approaches.

In order to understand the pros and cons of each technique, I created some test data and reimplemented the solutions, all of which I’ve shared below.

If there are any other methods worth including here, please drop a note in the comments.

We’ll be working with this table:

DROP TABLE IF EXISTS users;

CREATE TABLE users (user_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(20), start_date DATE, team_id INT);
INSERT INTO users (name, start_date, team_id) VALUES ('Matt', '2017-01-01', 1);
INSERT INTO users (name, start_date, team_id) VALUES ('John', '2017-01-02', 2);
INSERT INTO users (name, start_date, team_id) VALUES ('Sara', '2017-01-02', 2);
INSERT INTO users (name, start_date, team_id) VALUES ('Tim', '2017-01-02', 3);
INSERT INTO users (name, start_date, team_id) VALUES ('Bob', '2017-01-03', 3);
INSERT INTO users (name, start_date, team_id) VALUES ('Bill', '2017-01-04', 3);
INSERT INTO users (name, start_date, team_id) VALUES ('Kathy', '2017-01-04', 3);
INSERT INTO users (name, start_date, team_id) VALUES ('Anne', '2017-01-05', 3);
+---------+-------+------------+---------+
| user_id | name  | start_date | team_id |
+---------+-------+------------+---------+
|       1 | Matt  | 2017-01-01 |       1 |
|       2 | John  | 2017-01-02 |       2 |
|       3 | Sara  | 2017-01-02 |       2 |
|       4 | Tim   | 2017-01-02 |       3 |
|       5 | Bob   | 2017-01-03 |       3 |
|       6 | Bill  | 2017-01-04 |       3 |
|       7 | Kathy | 2017-01-04 |       3 |
|       8 | Anne  | 2017-01-05 |       3 |
+---------+-------+------------+---------+

Ranked by start date

SELECT *
FROM users
ORDER by start_date ASC

Ranked by start date with ties broken by user id

SELECT *
FROM users
ORDER by start_date ASC, user_id ASC

First employee by start date with ties broken by user id

SELECT *
FROM users
ORDER by start_date ASC, user_id ASC
LIMIT 1

First employee by start date with ties

SELECT *
FROM users
WHERE start_date = (SELECT MIN(start_date) FROM users);

Second employee by start date with ties broken by user id

SELECT *
FROM users
ORDER by start_date ASC, user_id ASC
LIMIT 1
OFFSET 1

Second employee by start date with ties

SELECT *
FROM users
WHERE start_date = (
    SELECT DISTINCT start_date
    FROM users
    ORDER BY start_date ASC
    LIMIT 1
    OFFSET 1
)

Ranked by start date using variable

SET @rank := 0;
SELECT
    *,
    @rank := @rank + 1 AS rank
FROM users
ORDER BY start_date ASC
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    3 |
|       4 | Tim   | 2017-01-02 |       3 |    4 |
|       5 | Bob   | 2017-01-03 |       3 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |    7 |
|       8 | Anne  | 2017-01-05 |       3 |    8 |
+---------+-------+------------+---------+------+

Ranked by start date using a variable

Based on this Stack Overflow comment:

SELECT
    *,
    @rank := @rank + 1 AS rank
FROM users, (SELECT @rank := 0) r
ORDER BY start_date ASC
+---------+-------+------------+---------+------------+------+
| user_id | name  | start_date | team_id | @rank := 0 | rank |
+---------+-------+------------+---------+------------+------+
|       1 | Matt  | 2017-01-01 |       1 |          0 |    1 |
|       2 | John  | 2017-01-02 |       2 |          0 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |          0 |    3 |
|       4 | Tim   | 2017-01-02 |       3 |          0 |    4 |
|       5 | Bob   | 2017-01-03 |       3 |          0 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |          0 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |          0 |    7 |
|       8 | Anne  | 2017-01-05 |       3 |          0 |    8 |
+---------+-------+------------+---------+------------+------+

First employee by start date using by setting a variable

SET @rank := 0;
SELECT *
FROM (
    SELECT
        *,
        @rank := @rank + 1 AS rank
    FROM users
    ORDER BY start_date ASC
) ranked
WHERE rank = 1

Ranked by start date with ties

Based on this Stack Overflow comment:

SET @prev_start_date = NULL;
SET @rank := 0;
SELECT
    *,
    CASE
        WHEN @prev_start_date = start_date THEN @rank
        -- Note that the assignment here will always be true
        WHEN @prev_start_date := start_date THEN @rank := @rank + 1
    END AS rank
FROM users
ORDER BY start_date ASC
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    2 |
|       5 | Bob   | 2017-01-03 |       3 |    3 |
|       6 | Bill  | 2017-01-04 |       3 |    4 |
|       7 | Kathy | 2017-01-04 |       3 |    4 |
|       8 | Anne  | 2017-01-05 |       3 |    5 |
+---------+-------+------------+---------+------+

Ranked by user id within each team

SELECT
    a.*,
    COUNT(*) AS rank
FROM users a
INNER JOIN users b
ON a.team_id = b.team_id AND a.user_id >= b.user_id
GROUP BY a.team_id, a.user_id

or, based on this Stack Overflow comment:

SELECT
    a.*,
    (
        SELECT COUNT(*)
        FROM users b
        WHERE a.team_id = b.team_id AND a.user_id >= b.user_id
    ) AS ranked
FROM users a
+---------+-------+------------+---------+--------+
| user_id | name  | start_date | team_id | ranked |
+---------+-------+------------+---------+--------+
|       1 | Matt  | 2017-01-01 |       1 |      1 |
|       2 | John  | 2017-01-02 |       2 |      1 |
|       3 | Sara  | 2017-01-02 |       2 |      2 |
|       4 | Tim   | 2017-01-02 |       3 |      1 |
|       5 | Bob   | 2017-01-03 |       3 |      2 |
|       6 | Bill  | 2017-01-04 |       3 |      3 |
|       7 | Kathy | 2017-01-04 |       3 |      4 |
|       8 | Anne  | 2017-01-05 |       3 |      5 |
+---------+-------+------------+---------+--------+

Note that both of these techniques require that there be a column without duplicates that we can rank on within the partition. For example, we can’t use start_date due to the duplicates within team 2 (2017-01-02) and team 3 (2017-01-14):

SELECT
    a.*,
    COUNT(*) AS rank
FROM users a
INNER JOIN users b
ON a.team_id = b.team_id AND a.start_date >= b.start_date
GROUP BY a.team_id, a.user_id
+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    1 |
|       5 | Bob   | 2017-01-03 |       3 |    2 |
|       6 | Bill  | 2017-01-04 |       3 |    4 |
|       7 | Kathy | 2017-01-04 |       3 |    4 |
|       8 | Anne  | 2017-01-05 |       3 |    5 |
+---------+-------+------------+---------+------+

Return the last person to join within each team based on user id

Based on this Stack Overflow comment:

SELECT a.*
FROM users a
LEFT JOIN users b
ON a.team_id = b.team_id AND a.user_id < b.user_id
WHERE b.team_id IS NULL

or

SELECT a.*
FROM users a
WHERE user_id IN (
    SELECT MAX(user_id)
    FROM users
    GROUP BY team_id
)
+---------+------+------------+---------+
| user_id | name | start_date | team_id |
+---------+------+------------+---------+
|       1 | Matt | 2017-01-01 |       1 |
|       3 | Sara | 2017-01-02 |       2 |
|       8 | Anne | 2017-01-05 |       3 |
+---------+------+------------+---------+

Return the last people to join within each team based on start date

SELECT a.*
FROM users a
LEFT JOIN users b
ON a.team_id = b.team_id AND a.start_date < b.start_date
WHERE b.team_id IS NULL

or, based on this groupwise max post:

SELECT a.*
FROM users a
INNER JOIN (
    SELECT team_id, MAX(start_date) AS max_start_date
    FROM users b
    GROUP BY team_id
) max_start_dates
ON a.team_id = max_start_dates.team_id AND a.start_date = max_start_dates.max_start_date
+---------+------+------------+---------+
| user_id | name | start_date | team_id |
+---------+------+------------+---------+
|       1 | Matt | 2017-01-01 |       1 |
|       2 | John | 2017-01-02 |       2 |
|       3 | Sara | 2017-01-02 |       2 |
|       8 | Anne | 2017-01-05 |       3 |
+---------+------+------------+---------+

Ranked with gaps

Based on this Stack Overflow comment:

SELECT
    user_id,
    name,
    start_date,
    team_id,
    rank
FROM (
    SELECT
        *,
        IF(start_date = @_last_start_date, @cur_rank := @cur_rank, @cur_rank := @_sequence) AS rank,
        @_sequence := @_sequence + 1,
        @_last_start_date := start_date
    FROM users, (SELECT @cur_rank := 1, @_sequence := 1, @_last_start_date := NULL) r
    ORDER BY start_date
) ranked

Notice that after the three tied for second earliest start date, the next one jumps to 5 (not 3):

+---------+-------+------------+---------+------+
| user_id | name  | start_date | team_id | rank |
+---------+-------+------------+---------+------+
|       1 | Matt  | 2017-01-01 |       1 |    1 |
|       2 | John  | 2017-01-02 |       2 |    2 |
|       3 | Sara  | 2017-01-02 |       2 |    2 |
|       4 | Tim   | 2017-01-02 |       3 |    2 |
|       5 | Bob   | 2017-01-03 |       3 |    5 |
|       6 | Bill  | 2017-01-04 |       3 |    6 |
|       7 | Kathy | 2017-01-04 |       3 |    6 |
|       8 | Anne  | 2017-01-05 |       3 |    8 |
+---------+-------+------------+---------+------+

Technically this is known as the rank whereas the other examples, which didn’t include gaps, is the dense rank.

Happy ranking!