Changing my Mind on When to Include Table Names in SQL Queries

I haven’t made too many changes to how I write SQL recently, but I did adopt a new convention recently that I really like so wanted to share.

In the past, I would have written the following query like so:

select
  email,
  sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

Note that this does not prefix email or amount with the table name where they came from.

Claire Caroll of dbt fame recently pinged me to suggest a change: whenever there’s a join involved, you should include the table name to make it clear where the column originated. The query above would look like this:

select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

When there’s no join involved, it’s fine to leave it out because there’s no room for confusion:

select
  id,
  name
from companies

I’ve been following this convention for a few weeks and really like it because there’s zero ambiguity when seeing this around where each column originated. It’s more verbose obviously, but I think the extra clarity outweighs that downside.

I’ve updated the style guide to reflect this guide.

I’d love to hear your thoughts – do you always include the name name, only when necessary, or follow a convention like this and only include it when joins are involved?

An Interview About My Work as a Data Analyst

I recently had the pleasure of being interviewed by Simon Ouderkirk about my journey to becoming a data analyst and lessons learned along the way. You can check out a recording of the interview here:

This came about because Simon volunteered to start this interview series for the Locally Optimistic data community. And because Simon and I have worked closely together for a number of years (first when I was full time at Automattic and now as a consultant), he pinged me to see if I wanted to be a guinea pig for this first interview :).

If data and analytics interest you, I highly recommend checking out the Locally Optimistic blog and Slack community linked to above; you’ll walk away thinking a lot more deeply about data, metrics, and running an effective analytics organization. You can also follow Simon on Twitter to learn about future interviews in this series.

Timeglider Acquisition

preceden-timeglider.png

I’ve got some big news to share today: Preceden has acquired Timeglider, one of the other big players in the online timeline maker space.

Here’s the announcement: Preceden Acquires Timeglider.

This was my first meaningful acquisition: back in December 2017 I did buy and redirect the domain for Timerime, another timeline maker tool, but the site had shut down several months prior so it was merely a matter of buying the domain. Timeglider on the other hand is an active business with paying customers, recurring revenue, IP, etc – which made this quite a bit more complicated and required lawyers and paperwork and whatnot. I learned a lot, but am also very happy it’s behind me.

Now comes excuting on it and recouping the cost in a hopefully reasonable amount of time. I’ll probably write more about the data and analytics piece of this in the future, but for now, just wanted to share the news!

Mazur’s SQL Style Guide

sql-style-guide.png

Even though I work primarily in Looker these days, I do spend a fair amount of time writing and reviewing SQL queries. Over time, I’ve experimented with and adopted various ways of styling my queries. Things like whether to use CTEs or subqueries (CTEs), single vs double quotes (single), how to format joins, and a lot more.

I’ve attempted to distill my preferences into a guide which you can check out on GitHub: Mazur’s SQL Style Guide.

Like with any style guide, you’ll have to consider which conventions you want to adopt. For example, you might write queries that use both single and double quotes and not give it a second thought, but feel strongly about columns always using snake_case and never camelCase. In the end, the key is readability, but opinions will differ on what’s readable and what’s not :).

For those of you writing a lot of SQL, hopefully you walk away from this guide with some things think about.

And like I mention in the guide, if you find yourself disagreeing strongly with any of my preferences, I’d love to chat. Please shoot me an email if do.

Last but not least, I have a Matt on Analytics newsletter where I’ll occasionally (very occasionally at the moment) share new content like this. If you’re interested in this kind of thing, you can sign up here.

JOIN Talk: Fostering a Data Culture with a Daily Metrics Email Report

A few months back I wrote about how Help Scout generates a Daily Metrics Report using Looker. I wound up giving a talk about this topic at JOIN, Looker’s annual user conference, and wanted to share the recording for anyone who wants to learn more.

Here’s the description:

How do you keep everyone in your organization informed about the performance of its key metrics when not everyone uses Looker regularly? In this session, we’ll talk about how to leverage Fivetran and Looker to build a daily metrics email report that will help educate end engage your entire team.

Huge thanks to Fivetran for inviting me to speak during their sponsor session and Looker for putting on the conference.

A Few Thoughts on Image Upload Usage at Preceden

One of Preceden’s most popular feature requests over the years has been the ability to upload images to Preceden and have those images appear on timelines.

A lot of competitors offer that functionality, but I procrastinated for almost 9 years for two reasons:

  1. It’s complex to implement, both in terms of actually handling the uploads and having them appear on the timelines.
  2. Most of the people that requested it were using Preceden for school timelines and that segment of users tend not to upgrade at a high rate. People using Preceden for work-related project planning timelines didn’t request it much. Therefore, it never was much of a priority because it likely wouldn’t move the needle on the business.

That said, since I’ve had more time to work on Preceden recently, I decided to finally do it. For handling uploads, I wound up using Filestack.com which simplified the implementation a lot. And updating Preceden’s rendering logic took time too, but in the end it all worked out.

I recently checked on the usage stats and – not surprisingly – it’s used most heavily by people using Preceden for education:

Screen Shot 2018-11-15 at 2.27.44 PM.png

For users that have signed up since this launched:

  • Teaching: 29% uploaded an image
  • School: 26%
  • Personal Use: 16%
  • Work: 12%

In other words, it’s used very heavily (which is great!) but not with the segment of users with the highest propensity to pay.

This dilemma comes up fairly often: do you build Feature A that will be used heavily by mostly-free users, or Feature B that will be used heavily by mostly-paying customers?

For better or worse, I never wound up focusing on one market or use case with Preceden: it’s a general purpose timeline maker that can be used for any type of timeline. As a result though, I often get into these situations. If I was just building Preceden for project planners, I’d never implement image uploads. If I was just building it for students creating timelines for school, I’d probably have implemented it years ago.

It also comes down to goals: if my main goal is growing revenue, I probably shouldn’t work on features like this. But if I want Preceden to be the best general purpose timeline maker then it does, but there’s an opportunity cost because I’m not building features for the folks who will actually pay.

I operate in the middle for product development: work mostly on features that will make money, but also spend some percentage of my time on features like this that will make it a better general purpose tool.

If I were to start something from scratch today, I’d probably pick a narrow niche and try to nail it. No general-purpose tools. I’d recommend that to others too.

Going broad is fun in a way too though, it just has it’s challenges :).

 

Analyzing 89 Responses to a SQL Screener Question for a Senior Data Analyst Position

At Help Scout, we recently went through the process of hiring a new senior data analyst. In order to apply for the position, we asked anyone interested to answer a few short screener questions including one to help evaluate their SQL skills.

Here’s the SQL screener question we asked. If you’re an analyst, you’ll get the most value out of this post if you work through it before reading on:

This question – designed to be answered in 10-15 minutes – proved incredibly valuable because it served as an easy way to evaluate whether applicants had the minimum technical skills to be effective in this role. This position requires working a lot with SQL, so if an applicant struggled with an intermediate SQL challenge, they would likely struggle in the role as well.

What surprised me is that almost none of the answers were identical, except for a few towards the end because someone commented on the Gist with a (slightly buggy) solution :).

For reference, here’s how I would answer the question:

There are a lot of variations on this though that will still result in the correct answer – and many which won’t. For example, no points lost for using uppercase SQL instead of lowercase. But if the query doesn’t union the tables together at some point, it probably wouldn’t result in the correct answer.

If you’re interested in data analysis and analytics, you can subscribe to my Matt on Analytics newsletter to get notified about future posts like this one.

Analyzing the Differences

It would be impossible to list every difference – as you’ll see at the end of this post in the anonymized responses, there are and endless number of ways to format the query.

That said, there are a lot of common differences, some substantial, some not.

SQL Casing

Does the query use uppercase or lowercase SQL or some combination of the two?

Note that in these examples and all that follow, the answers aren’t necessarily correct. They’re just chosen to highlight different ways of approaching the query.

Common Table Expressions (CTEs) vs Subqueries

Common Table Expressions go a long way towards making your query easy to read and debug. Not all SQL dialects support CTEs (MySQL doesn’t, for example), but using them in the query was almost always an indicator of an experienced analyst.

Meaningful CTE Names

CTEs benefit a lot from meaningful names that make it easy for you and other analysts to interpret.

INNER JOIN vs LEFT JOIN

Either works, but INNER JOIN performs better and is more intuitive here.

Implicit vs Explicit INNER

“INNER” is implied if you just write “JOIN”, so it’s not required, but can make the query easier to read. Either way is fine.

Filtering in the WHERE clause vs JOIN condition

The standard way to filter the conversations is to use a WHERE clause to filter the results to only include those that have a beacon-interest tag. However, because we’re using an INNER JOIN, it’s also possible to add it as a join condition and get the same result. In terms of performance, it doesn’t make a difference which approach you take.

I lean towards a WHERE clauses because I think it’s clearer, but including it in the JOIN condition is completely viable as well.

Converting Milliseconds

HubSpot stores the form submission timestamp in milliseconds. Queries that didn’t account that would not result in the correct result.

UNION DISTINCT vs UNION ALL

There are two types of UNIONs: UNION DISTINCT and UNION ALL. The former – which is the default when you just write UNION – only returns the unique records in the combined results.

Both result in the correct answer here, but UNION ALL performs better because with UNION DISTINCT the database has to sort the results and remove the duplicate rows.

Accounting for Multiple Interest

Many applicants didn’t take into account that people could have expressed interest by filling out the HubSpot form and contacting support. Neglecting to account for this could result in multiple rows for individual email addresses.

GROUP BY Column Name vs Number

I lean towards using column numbers, but either is fine, and using the column name can have benefits. When there are 5+ columns (which is not an issue with this question), lean towards using column numbers which will be a lot more sane than typing out all the names (hat-tip Ray Buhr for this tip).

Single vs Multiple Lines When Listing Multiple Columns

Another common style difference is whether people put multiple columns on the same line or not. Either is fine, but I lean towards one column per line because I think it’s easier to read.

Comma First vs Comma Last

While not that common in the responses, it’s perfectly valid when listing columns on multiple lines to put the commas before the column name. The benefit is that you don’t have to add a comma to the previous line when adding a new column which also simplifies query diffs that you might see in your version control tool.

Comma-first folks tended to have software development backgrounds.

PS: For anyone interested in SQL coding conventions, I highly recommend checking out dbt’s coding conventions which have influenced my preferences here.

All Responses

We were fortunate to receive over 100 applications, most of which included an answer to the SQL question. I suspect if the application didn’t include this question, we would have had twice the number of applcants, but the presence of this question led some underqualified folks not to apply.

You can check out all 89 responses on GitHub.

If you have any suggestions on how to improve my query or feedback on any of this analysis, please drop a comment below. Happy querying!