Analyzing 89 Responses to a SQL Screener Question for a Senior Data Analyst Position

At Help Scout, we recently went through the process of hiring a new senior data analyst. In order to apply for the position, we asked anyone interested to answer a few short screener questions including one to help evaluate their SQL skills.

Here’s the SQL screener question we asked. If you’re an analyst, you’ll get the most value out of this post if you work through it before reading on:

We’re currently in the process of rolling out Beacon, our live chat tool, to existing customers who have expressed interest trying it.

Customers could have expressed interest in two ways: either by filling out an interest form or mentioning to our support team that they want to try it.

For the interest form, there is one table, hubspot.contact, with two relevant fields:

  • email – The user’s email address
  • property_beacon_interest – A Unix timestamp in milliseconds representing when they filled out the form or null if they have not expressed interest
+------------------+--------------------------+
|      email       | property_beacon_interest |
+------------------+--------------------------+
| matt@example.com |            1534101377000 |
| eli@example.com  |                          |
+------------------+--------------------------+

When a customer expresses interest in a support conversation, our support team tags the conversation with a beacon-interest tag. There are two relevant tables:

  1. helpscout.conversation with three relevant fields:
  • id – The id of the conversation
  • email – The email of the person who reached out to support
  • created_at – A timestamp with the date/time the conversation was created
+----+-------------------+--------------------------+
| id |       email       |        created_at        |
+----+-------------------+--------------------------+
|  1 | matt@example.com  | 2018-08-14 14:02:10 UTC  |
|  2 | eli@example.com   | 2018-08-14 14:06:30 UTC  |
|  3 | matt@example.com  | 2018-08-14 14:07:33 UTC  |
|  4 | katia@example.com | 2018-08-14 14:11:30 UTC  |
|  5 | jen@example.com   | 2018-08-13 14:11:30 UTC  |
+----+-------------------+--------------------------+
  1. There’s also a helpscout.conversation_tag table with two relevant fields:
  • conversation_id – The id of the conversation that was tagged. A conversation can have zero or more tags.
  • tag – The name of the tag
+-----------------+-----------------+
| conversation_id |       tag       |
+-----------------+-----------------+
|               1 | new-trial       |
|               1 | bug-report      |
|               2 | beacon-interest |
|               4 | beacon-interest |
+-----------------+-----------------+

Your challenge:

Write a SQL query (any dialect is fine) that combines data from these two sources that lists everyone who has expressed interest in trying Beacon and when they first expressed that interest.

The end result using the example tables above should be a functioning SQL query that returns the following:

+-------------------+-------------------------+
|      email        |  expressed_interest_at  |
+-------------------+-------------------------+
| matt@example.com  | 2018-08-12 19:16:17 UTC |
| eli@example.com   | 2018-08-14 14:06:30 UTC |
| katia@example.com | 2018-08-14 14:11:30 UTC |
+-------------------+-------------------------+

You should include your query in the response field to this question in the online application.

view raw

screen.md

hosted with ❤ by GitHub

This question – designed to be answered in 10-15 minutes – proved incredibly valuable because it served as an easy way to evaluate whether applicants had the minimum technical skills to be effective in this role. This position requires working a lot with SQL, so if an applicant struggled with an intermediate SQL challenge, they would likely struggle in the role as well.

What surprised me is that almost none of the answers were identical, except for a few towards the end because someone commented on the Gist with a (slightly buggy) solution :).

For reference, here’s how I would answer the question:


with hubspot_interest as (
select
email,
timestamp_millis(property_beacon_interest) as expressed_interest_at
from hubspot.contact
where property_beacon_interest is not null
),
support_interest as (
select
email,
created_at as expressed_interest_at
from helpscout.conversation
join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
where tag = "beacon-interest"
),
combined_interest as (
select * from hubspot_interest
union all
select * from support_interest
),
final as (
select
email,
min(expressed_interest_at) as expressed_interest_at
from combined_interest
group by 1
)
select * from final

view raw

sql-answer.sql

hosted with ❤ by GitHub

There are a lot of variations on this though that will still result in the correct answer – and many which won’t. For example, no points lost for using uppercase SQL instead of lowercase. But if the query doesn’t union the tables together at some point, it probably wouldn’t result in the correct answer.

If you’re interested in data analysis and analytics, you can subscribe to my Matt on Analytics newsletter to get notified about future posts like this one.

Analyzing the Differences

It would be impossible to list every difference – as you’ll see at the end of this post in the anonymized responses, there are and endless number of ways to format the query.

That said, there are a lot of common differences, some substantial, some not.

SQL Casing

Does the query use uppercase or lowercase SQL or some combination of the two?

Note that in these examples and all that follow, the answers aren’t necessarily correct. They’re just chosen to highlight different ways of approaching the query.


# Good: All uppercase
SELECT
email,
datetime(property_beacon_interest/1000, 'unixepoch') AS expressed_interest_at
FROM
hubspot.contact
WHERE
property_beacon_interest IS NOT NULL
UNION
SELECT
c.email,
c.created_at AS expressed_interest_at
FROM
helpscout.conversation c
INNER JOIN helpscout.conversation_tag ct
ON c.id = ct.conversation_id AND ct.tag = 'beacon-interest'
# Good: All lowercase
select
email,
first_interest = min(first_interest)
from (
interest forms
select
email,
first_interest = dateadd(S, property_beacon_interest/1000, '1970-01-01')
from
hubspot.contact
where
property_beacon_interest is not null
support team tags
union all
select
email,
first_interest = created_at
from
helpscout.conversation c join
helpscout.conversation_tag ct on c.id = ct.conversation_id and ct.tag = 'beacon-interest'
) combined
group by
email
# Okay: Mixed uppercase and lowercase
SELECT helpscout.conversation.email, helpscout.conversation.created_at as expressed_interest_at
FROM helpscout.conversation
INNER JOIN helpscout.conversation_tag ON helpscout.conversation.id=helpscout.conversation_tag.conversation_id
WHERE tag="beacon-interest"
UNION
select hubspot.contact.email, DATETIME(hubspot.contact.property_beacon_interest/1000, 'unixepoch') || ' UTC' as expressed_interest_at
FROM hubspot.contact
WHERE property_beacon_interest != ''
ORDER BY expressed_interest_at;
# Bad: Other variations
Select
email
From
helpscout.conversation
Where
created_at
And
conversation_id
Order By
email

view raw

casing.sql

hosted with ❤ by GitHub

Common Table Expressions (CTEs) vs Subqueries

Common Table Expressions go a long way towards making your query easy to read and debug. Not all SQL dialects support CTEs (MySQL doesn’t, for example), but using them in the query was almost always an indicator of an experienced analyst.


# GOOD: CTEs
# I'd put each CTE on a new line though to make it easier to comment and copy/paste
with form as (
select
email,
to_timestamp(property_beacon_interest) as expressed_interest_at
from hubspot.contact
), support as (
select
c.email,
c.created_at as expressed_interest_at
from hubspot.conversation c
left join hubspot.conversation_tag t on c.id = t.conversation_id
where t.tag = 'beacon-interest'
), unioned as (
select * from form
union all
select * from support
), final as (
select
email,
min(expressed_interest_at) as expressed_interest_at
from unioned
group by 1
order by 2
)
select * from final
# OKAY: Subqueries
SELECT email, MIN(created_at) AS expressed_interest_at
FROM
(
SELECT a.email, MIN(created_at) AS created_at
FROM conversation a
INNER JOIN conversation_tag b
ON a.id=b.conversation_id
WHERE b.tag="beacon-interest"
GROUP BY id
UNION
SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
FROM hubspot.contact
WHERE time_created IS NOT NULL
)
GROUP BY email ORDER BY expressed_interest_at

view raw

ctes.sql

hosted with ❤ by GitHub

Meaningful CTE Names

CTEs benefit a lot from meaningful names that make it easy for you and other analysts to interpret.


# Good: Descriptive names
WITH beacon_interest_contact as (
SELECT
email,
FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
FROM
hubspot.contact
WHERE
property_beacon_interest IS NOT NULL),
beacon_interest_conversation as (
SELECT
email,
created_at as expressed_interest_at
FROM
helpscout.conversation
INNER JOIN
helpscout.conversation_tag
ON
conversation.id = conversation_tag.conversation_id
WHERE
conversation_tag.tag = 'beacon-interest'),
beacon_interest_union as (
SELECT
*
FROM
beacon_interest_contact
UNION
SELECT
*
FROM
beacon_interest_conversation)
SELECT
email,
MIN(expressed_interest_at) as expressed_interest_at
FROM
beacon_interest_union
GROUP BY
email;
# Bad: Non-descriptive names
;with x0 as
(
Select hcv.email, hcv.created_at expressed_interest_at
From helpscout.conversation hcv
Inner join helpscout.conversation_tag hct on hcv.id = hct.conversation_id
Where hct.tag = ‘beaconinterest’
Union
Select email, dateadd(S,property_beacon_interest, '1970-01-01') expressed_interest_at
From hubspot.contact
Where property_beacon_interest is not null
),
X1 as (
Select email , expressed_interest_at, row_number() over (partition by email order by expressed_interest_at) as RowNo
From x0
Order by expressed_interest_at
)
Select email, expressed_interest_at
From x1
Where RowNo =1

view raw

cte-names.sql

hosted with ❤ by GitHub

INNER JOIN vs LEFT JOIN

Either works, but INNER JOIN performs better and is more intuitive here.


# Good: INNER JOIN
SELECT a.email AS email, a.created_at AS expressed_interest_at
FROM helpscout.conversation AS a
INNER JOIN helpscout.conversation AS b ON a.id = b.conversation_id
WHERE b.conversation_id = 'beacon-interest'
# Okay: LEFT JOIN
select
c.email,
c.created_at as expressed_interest_at
from hubspot.conversation c
left join hubspot.conversation_tag t on c.id = t.conversation_id
where t.tag = 'beacon-interest'

view raw

join.sql

hosted with ❤ by GitHub

Implicit vs Explicit INNER

“INNER” is implied if you just write “JOIN”, so it’s not required, but can make the query easier to read. Either way is fine.


# Good: Specifying INNER
select a.email
, a.created_at as expressed_interest_at
from helpscout.conversation a
INNER JOIN
helpscout.conversation_tag b
ON a.id = b.conversation_id
where b.tag = ‘beaconinterest’
# Good: Omitting INNER
SELECT HC.`email`, DATE_FORMAT(HC.`created_at`, '%Y-%m-%e %T UTC') AS expressed_interest_at
FROM `helpscout.conversation` AS HC
JOIN `helpscout.conversation_tag` AS HCT ON HC.`id`=HCT.`conversation_id` WHERE HCT.`tag`='beacon-interest'
ORDER BY expressed_interest_at

view raw

inner.sql

hosted with ❤ by GitHub

Filtering in the WHERE clause vs JOIN condition

The standard way to filter the conversations is to use a WHERE clause to filter the results to only include those that have a beacon-interest tag. However, because we’re using an INNER JOIN, it’s also possible to add it as a join condition and get the same result. In terms of performance, it doesn’t make a difference which approach you take.

I lean towards a WHERE clauses because I think it’s clearer, but including it in the JOIN condition is completely viable as well.


# Good: Filtering in the WHERE clause
SELECT cnv.email,
cnv.created_at AS expressed_interest_at
FROM helpscout.conversation AS cnv
JOIN helpscout.conversation_tag AS ctg
ON cnv.id = ctg.conversation_id
WHERE ctg.tag = 'beacon-interest'
# Good: Including it in the join condition
select conv.email, conv.created_at as expressed_interest_at
from helpscout.conversation conv
join helpscout.conversation_tag tag
on conv.id = tag.conversation_id and tag.tag = 'beacon-interest'

Converting Milliseconds

HubSpot stores the form submission timestamp in milliseconds. Queries that didn’t account that would not result in the correct result.


# Good: Accounting for the timestamp being in millseconds
select email, (timestamp '1970-01-01 00:00:00 UTC' +
numtodsinterval(property_beacon_interest/1000, 'SECOND'))
at time zone 'utc') as created_at
from hubspot.contact
# Bad: Not realizing millseconds need to be handled differently
SELECT
EMAIL
,FROM_UNIXTIME(PROPERTY_BEACON_INTEREST) AS EXPRESSED_INTEREST_AT
FROM HUBSPOT.CONTACT
WHERE
PROPERTY_BEACON_INTEREST IS NOT NULL

UNION DISTINCT vs UNION ALL

There are two types of UNIONs: UNION DISTINCT and UNION ALL. The former – which is the default when you just write UNION – only returns the unique records in the combined results.

Both result in the correct answer here, but UNION ALL performs better because with UNION DISTINCT the database has to sort the results and remove the duplicate rows.


# Good: UNION ALL
SELECT
C.email,
C.created_at AS expressed_interest_at
FROM helpscout.conversation C
INNER JOIN helpscout.conversation_tag CT ON CT.conversation_id = C.id
WHERE CT.tag = 'beacon-interest'
UNION ALL
SELECT
email,
DATEADD(S, CONVERT(INT,LEFT(property_beacon_interest, 10)), '1970-01-01') AS expressed_interest_at
FROM hubspot.contact
# Okay: UNION DISTINCT
SELECT email, MIN(created_at) AS expressed_interest_at
FROM
(SELECT a.email, MIN(created_at) AS created_at
FROM conversation a
INNER JOIN conversation_tag b
ON a.id=b.conversation_id
WHERE b.tag="beacon-interest"
GROUP BY id
UNION
SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
FROM hubspot.contact
WHERE time_created IS NOT NULL)
GROUP BY email ORDER BY expressed_interest_at

view raw

union.sql

hosted with ❤ by GitHub

Accounting for Multiple Interest

Many applicants didn’t take into account that people could have expressed interest by filling out the HubSpot form and contacting support. Neglecting to account for this could result in multiple rows for individual email addresses.


# Good: Taking the earliest timestamp for each email
SELECT email, MIN(created_at) AS expressed_interest_at
FROM
(SELECT c.email, MIN(created_at) AS created_at
FROM helpscout.conversation c, helpscout.conversation_tag ct
WHERE c.id =ct.conversation_id
AND ct.tag="beacon-interest"
GROUP BY id
UNION
SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
FROM hubspot.contact
WHERE time_created IS NOT NULL)
GROUP BY email ORDER BY expressed_interest_at;
# Bad: Missing it
select cont.email, cont.property_beacon_interest as expressed_interest_at
from hubspot.contact cont
where cont.property_beacon_interest is not null
union
select conv.email, conv.created_at as expressed_interest_at
from helpscout.conversation conv
join helpscout.conversation_tag tag
on conv.id = tag.conversation_id and tag.tag = 'beacon-interest'
order by expressed_interest_at

GROUP BY Column Name vs Number

I lean towards using column numbers, but either is fine, and using the column name can have benefits. When there are 5+ columns (which is not an issue with this question), lean towards using column numbers which will be a lot more sane than typing out all the names (hat-tip Ray Buhr for this tip).


# Good: Grouping by column number
select
email,
min(expressed_interest_at) as expressed_interest_at
from unioned
group by 1
order by 2
# Good: Grouping by column name
SELECT a.email, min(b.created_at) AS expressed_interest_at
FROM hubspot.contact AS a
JOIN helpscout.conversation AS b ON a.email = b.email
GROUP BY a.email

view raw

grouping.sql

hosted with ❤ by GitHub

Single vs Multiple Lines When Listing Multiple Columns

Another common style difference is whether people put multiple columns on the same line or not. Either is fine, but I lean towards one column per line because I think it’s easier to read.


# Good: Multiple columns per line
SELECT email, created_at AS expressed_interest_at
FROM helpscout.conversation c
# Good: One column per line
SELECT
email,
FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
FROM
hubspot.contact

view raw

lines.sql

hosted with ❤ by GitHub

Comma First vs Comma Last

While not that common in the responses, it’s perfectly valid when listing columns on multiple lines to put the commas before the column name. The benefit is that you don’t have to add a comma to the previous line when adding a new column which also simplifies query diffs that you might see in your version control tool.


# Good: Comma last
SELECT
email,
FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
FROM
hubspot.contact
# Good: Comma first
select
email
,dateadd(s, convert(int,left(property_beacon_interest,10)), '1970-01-01')
from hubspot.contact

view raw

comma-first.sql

hosted with ❤ by GitHub

Comma-first folks tended to have software development backgrounds.

PS: For anyone interested in SQL coding conventions, I highly recommend checking out dbt’s coding conventions which have influenced my preferences here.

All Responses

We were fortunate to receive over 100 applications, most of which included an answer to the SQL question. I suspect if the application didn’t include this question, we would have had twice the number of applcants, but the presence of this question led some underqualified folks not to apply.

You can check out all 89 responses on GitHub.

If you have any suggestions on how to improve my query or feedback on any of this analysis, please drop a comment below. Happy querying!

3 thoughts on “Analyzing 89 Responses to a SQL Screener Question for a Senior Data Analyst Position

  1. Hi Matt. I’m transitioning from Sr. Analyst to Analytics Engineer and am currently working to develop expertise in SQL to support the move. Your article was VERY helpful to me. My solution was close to yours so that was reassuring. Your commentary throughout the article/analysis helped me clean up a bit of my own understanding and habits/practices so thank you for that 🙌

    Take care.
    -Jason

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s