At Help Scout, we recently went through the process of hiring a new senior data analyst. In order to apply for the position, we asked anyone interested to answer a few short screener questions including one to help evaluate their SQL skills.
Here’s the SQL screener question we asked. If you’re an analyst, you’ll get the most value out of this post if you work through it before reading on:
This question – designed to be answered in 10-15 minutes – proved incredibly valuable because it served as an easy way to evaluate whether applicants had the minimum technical skills to be effective in this role. This position requires working a lot with SQL, so if an applicant struggled with an intermediate SQL challenge, they would likely struggle in the role as well.
What surprised me is that almost none of the answers were identical, except for a few towards the end because someone commented on the Gist with a (slightly buggy) solution :).
For reference, here’s how I would answer the question:
There are a lot of variations on this though that will still result in the correct answer – and many which won’t. For example, no points lost for using uppercase SQL instead of lowercase. But if the query doesn’t union the tables together at some point, it probably wouldn’t result in the correct answer.
If you’re interested in data analysis and analytics, you can subscribe to my Matt on Analytics
newsletter to get notified about future posts like this one.
Analyzing the Differences
It would be impossible to list every difference – as you’ll see at the end of this post in the anonymized responses, there are and endless number of ways to format the query.
That said, there are a lot of common differences, some substantial, some not.
Does the query use uppercase or lowercase SQL or some combination of the two?
Note that in these examples and all that follow, the answers aren’t necessarily correct. They’re just chosen to highlight different ways of approaching the query.
Common Table Expressions (CTEs) vs Subqueries
Common Table Expressions go a long way towards making your query easy to read and debug. Not all SQL dialects support CTEs (MySQL doesn’t, for example), but using them in the query was almost always an indicator of an experienced analyst.
Meaningful CTE Names
CTEs benefit a lot from meaningful names that make it easy for you and other analysts to interpret.
INNER JOIN vs LEFT JOIN
Either works, but INNER JOIN performs better and is more intuitive here.
Implicit vs Explicit INNER
“INNER” is implied if you just write “JOIN”, so it’s not required, but can make the query easier to read. Either way is fine.
Filtering in the WHERE clause vs JOIN condition
The standard way to filter the conversations is to use a WHERE clause to filter the results to only include those that have a beacon-interest tag. However, because we’re using an INNER JOIN, it’s also possible to add it as a join condition and get the same result. In terms of performance, it doesn’t make a difference which approach you take.
I lean towards a WHERE clauses because I think it’s clearer, but including it in the JOIN condition is completely viable as well.
HubSpot stores the form submission timestamp in milliseconds. Queries that didn’t account that would not result in the correct result.
UNION DISTINCT vs UNION ALL
There are two types of UNIONs: UNION DISTINCT and UNION ALL. The former – which is the default when you just write UNION – only returns the unique records in the combined results.
Both result in the correct answer here, but UNION ALL performs better because with UNION DISTINCT the database has to sort the results and remove the duplicate rows.
Accounting for Multiple Interest
Many applicants didn’t take into account that people could have expressed interest by filling out the HubSpot form and contacting support. Neglecting to account for this could result in multiple rows for individual email addresses.
GROUP BY Column Name vs Number
I lean towards using column numbers, but either is fine, and using the column name can have benefits. When there are 5+ columns (which is not an issue with this question), lean towards using column numbers which will be a lot more sane than typing out all the names (hat-tip Ray Buhr for this tip).
Single vs Multiple Lines When Listing Multiple Columns
Another common style difference is whether people put multiple columns on the same line or not. Either is fine, but I lean towards one column per line because I think it’s easier to read.
Comma First vs Comma Last
While not that common in the responses, it’s perfectly valid when listing columns on multiple lines to put the commas before the column name. The benefit is that you don’t have to add a comma to the previous line when adding a new column which also simplifies query diffs that you might see in your version control tool.
Comma-first folks tended to have software development backgrounds.
PS: For anyone interested in SQL coding conventions, I highly recommend checking out dbt’s coding conventions which have influenced my preferences here.
We were fortunate to receive over 100 applications, most of which included an answer to the SQL question. I suspect if the application didn’t include this question, we would have had twice the number of applcants, but the presence of this question led some underqualified folks not to apply.
You can check out all 89 responses on GitHub.
If you have any suggestions on how to improve my query or feedback on any of this analysis, please drop a comment below. Happy querying!