A Few Thoughts on Image Upload Usage at Preceden

November 15, 2018 by Mazur

One of Preceden’s most popular feature requests over the years has been the ability to upload images to Preceden and have those images appear on timelines.

A lot of competitors offer that functionality, but I procrastinated for almost 9 years for two reasons:

It’s complex to implement, both in terms of actually handling the uploads and having them appear on the timelines.
Most of the people that requested it were using Preceden for school timelines and that segment of users tends not to upgrade at a high rate. People using Preceden for work-related project planning timelines didn’t request it much. Therefore, it never was much of a priority because it likely wouldn’t move the needle on the business.

That said, since I’ve had more time to work on Preceden recently, I decided to finally do it. For handling uploads, I wound up using Filestack.com which simplified the implementation a lot. And updating Preceden’s rendering logic took time too, but in the end it all worked out.

The most common feature request I get for Preceden is to make it possible to add images to timelines. With today's update, folks can now do just that 🎉. Here's the before and after of a student's timeline about the best movies. pic.twitter.com/QDyNaO89NY

— Matt Mazur (@mhmazur) October 4, 2018

I recently checked on the usage stats and – not surprisingly – it’s used most heavily by people using Preceden for education:

For users that have signed up since this launched:

Teaching: 29% uploaded an image
School: 26%
Personal Use: 16%
Work: 12%

In other words, it’s used very heavily (which is great!) but not with the segment of users with the highest propensity to pay.

This dilemma comes up fairly often: do you build Feature A that will be used heavily by mostly-free users, or Feature B that will be used heavily by mostly-paying customers?

For better or worse, I never wound up focusing on one market or use case with Preceden: it’s a general purpose timeline maker that can be used for any type of timeline. As a result though, I often get into these situations. If I was just building Preceden for project planners, I’d never implement image uploads. If I was just building it for students creating timelines for school, I’d probably have implemented it years ago.

It also comes down to goals: if my main goal is growing revenue, I probably shouldn’t work on features like this. But if I want Preceden to be the best general purpose timeline maker, then I should, even though there’s an opportunity cost because I’m not building features for the folks who will actually pay.

I operate in the middle for product development: work mostly on features that will make money, but also spend some percentage of my time on features like this that will make it a better general purpose tool.

If I were to start something from scratch today, I’d probably pick a narrow niche and try to nail it. No general-purpose tools. I’d recommend that to others too.

Going broad is fun in a way too though, it just has its challenges :).

Analyzing 89 Responses to a SQL Screener Question for a Senior Data Analyst Position

November 12, 2018 by Mazur

At Help Scout, we recently went through the process of hiring a new senior data analyst. In order to apply for the position, we asked anyone interested to answer a few short screener questions including one to help evaluate their SQL skills.

Here’s the SQL screener question we asked. If you’re an analyst, you’ll get the most value out of this post if you work through it before reading on:

We’re currently in the process of rolling out Beacon, our live chat tool, to existing customers who have expressed interest in trying it.

Customers could have expressed interest in two ways: either by filling out an interest form or mentioning to our support team that they want to try it.

For the interest form, there is one table, hubspot.contact, with two relevant fields:

email – The user’s email address
property_beacon_interest – A Unix timestamp in milliseconds representing when they filled out the form or null if they have not expressed interest

+------------------+--------------------------+
|      email       | property_beacon_interest |
+------------------+--------------------------+
| matt@example.com |            1534101377000 |
| eli@example.com  |                          |
+------------------+--------------------------+

When a customer expresses interest in a support conversation, our support team tags the conversation with a beacon-interest tag. There are two relevant tables:

helpscout.conversation with three relevant fields:

id – The id of the conversation
email – The email of the person who reached out to support
created_at – A timestamp with the date/time the conversation was created

+----+-------------------+--------------------------+
| id |       email       |        created_at        |
+----+-------------------+--------------------------+
|  1 | matt@example.com  | 2018-08-14 14:02:10 UTC  |
|  2 | eli@example.com   | 2018-08-14 14:06:30 UTC  |
|  3 | matt@example.com  | 2018-08-14 14:07:33 UTC  |
|  4 | katia@example.com | 2018-08-14 14:11:30 UTC  |
|  5 | jen@example.com   | 2018-08-13 14:11:30 UTC  |
+----+-------------------+--------------------------+

There’s also a helpscout.conversation_tag table with two relevant fields:

conversation_id – The id of the conversation that was tagged. A conversation can have zero or more tags.
tag – The name of the tag

+-----------------+-----------------+
| conversation_id |       tag       |
+-----------------+-----------------+
|               1 | new-trial       |
|               1 | bug-report      |
|               2 | beacon-interest |
|               4 | beacon-interest |
+-----------------+-----------------+

Your challenge:

Write a SQL query (any dialect is fine) that combines data from these two sources that lists everyone who has expressed interest in trying Beacon and when they first expressed that interest.

The end result using the example tables above should be a functioning SQL query that returns the following:

+-------------------+-------------------------+
|      email        |  expressed_interest_at  |
+-------------------+-------------------------+
| matt@example.com  | 2018-08-12 19:16:17 UTC |
| eli@example.com   | 2018-08-14 14:06:30 UTC |
| katia@example.com | 2018-08-14 14:11:30 UTC |
+-------------------+-------------------------+

You should include your query in the response field to this question in the online application.

This question – designed to be answered in 10-15 minutes – proved incredibly valuable because it served as an easy way to evaluate whether applicants had the minimum technical skills to be effective in this role. This position requires working a lot with SQL, so if an applicant struggled with an intermediate SQL challenge, they would likely struggle in the role as well.

What surprised me is that almost none of the answers were identical, except for a few towards the end because someone commented on the Gist with a (slightly buggy) solution :).

For reference, here’s how I would answer the question:

	with hubspot_interest as (

	    select
	        email,
	        timestamp_millis(property_beacon_interest) as expressed_interest_at
	    from hubspot.contact
	    where property_beacon_interest is not null

	),

	support_interest as (

	    select
	        email,
	        created_at as expressed_interest_at
	    from helpscout.conversation
	    join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
	    where tag = "beacon-interest"

	),

	combined_interest as (

	    select * from hubspot_interest
	    union all
	    select * from support_interest

	),

	final as (

	    select
	        email,
	        min(expressed_interest_at) as expressed_interest_at
	    from combined_interest
	    group by 1

	)

	select * from final

There are a lot of variations on this though that will still result in the correct answer – and many which won’t. For example, no points lost for using uppercase SQL instead of lowercase. But if the query doesn’t union the tables together at some point, it probably wouldn’t result in the correct answer.
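
If you want to check a candidate query against the example tables, here is a minimal sketch using Python's built-in sqlite3 module. It is not the setup used for the screener, and several things are assumptions made for illustration: SQLite stands in for the warehouse, the two attached in-memory databases exist only so the hubspot.* and helpscout.* names resolve, timestamps are stored as text without the UTC suffix, and SQLite's datetime(..., 'unixepoch') takes the place of timestamp_millis.

```python
import sqlite3

# In-memory SQLite stand-in for the warehouse (an assumption for this sketch).
# Attaching two extra in-memory databases keeps the schema-qualified names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    attach database ':memory:' as hubspot;
    attach database ':memory:' as helpscout;

    create table hubspot.contact (email text, property_beacon_interest integer);
    insert into hubspot.contact values
        ('matt@example.com', 1534101377000),
        ('eli@example.com', null);

    create table helpscout.conversation (id integer, email text, created_at text);
    insert into helpscout.conversation values
        (1, 'matt@example.com',  '2018-08-14 14:02:10'),
        (2, 'eli@example.com',   '2018-08-14 14:06:30'),
        (3, 'matt@example.com',  '2018-08-14 14:07:33'),
        (4, 'katia@example.com', '2018-08-14 14:11:30'),
        (5, 'jen@example.com',   '2018-08-13 14:11:30');

    create table helpscout.conversation_tag (conversation_id integer, tag text);
    insert into helpscout.conversation_tag values
        (1, 'new-trial'),
        (1, 'bug-report'),
        (2, 'beacon-interest'),
        (4, 'beacon-interest');
""")

# Same shape as the reference answer, adapted to SQLite's date functions.
QUERY = """
with hubspot_interest as (
    select
        email,
        datetime(property_beacon_interest / 1000, 'unixepoch') as expressed_interest_at
    from hubspot.contact
    where property_beacon_interest is not null
),
support_interest as (
    select
        c.email,
        c.created_at as expressed_interest_at
    from helpscout.conversation c
    join helpscout.conversation_tag t on c.id = t.conversation_id
    where t.tag = 'beacon-interest'
),
combined_interest as (
    select * from hubspot_interest
    union all
    select * from support_interest
)
select
    email,
    min(expressed_interest_at) as expressed_interest_at
from combined_interest
group by email
order by expressed_interest_at
"""

rows = conn.execute(QUERY).fetchall()
for email, expressed_interest_at in rows:
    print(email, expressed_interest_at, "UTC")
```

Swapping a different query into QUERY is a quick way to see whether it returns the three expected rows.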

If you’re interested in data analysis and analytics, you can subscribe to my Matt on Analytics newsletter to get notified about future posts like this one.

Analyzing the Differences

It would be impossible to list every difference – as you’ll see at the end of this post in the anonymized responses, there are an endless number of ways to format the query.

That said, there are a lot of common differences, some substantial, some not.

SQL Casing

Does the query use uppercase or lowercase SQL or some combination of the two?

Note that in these examples and all that follow, the answers aren’t necessarily correct. They’re just chosen to highlight different ways of approaching the query.

	# Good: All uppercase

	SELECT
	email,
	datetime(property_beacon_interest/1000, 'unixepoch') AS expressed_interest_at
	FROM
	hubspot.contact
	WHERE
	property_beacon_interest IS NOT NULL
	UNION
	SELECT
	c.email,
	c.created_at AS expressed_interest_at
	FROM
	helpscout.conversation c
	INNER JOIN helpscout.conversation_tag ct
	ON c.id = ct.conversation_id AND ct.tag = 'beacon-interest'


	# Good: All lowercase

	select
	email,
	first_interest = min(first_interest)
	from (
	-- interest forms
	select
	email,
	first_interest = dateadd(S, property_beacon_interest/1000, '1970-01-01')
	from
	hubspot.contact
	where
	property_beacon_interest is not null

	-- support team tags
	union all
	select
	email,
	first_interest = created_at
	from
	helpscout.conversation c join
	helpscout.conversation_tag ct on c.id = ct.conversation_id and ct.tag = 'beacon-interest'
	) combined
	group by
	email

	# Okay: Mixed uppercase and lowercase

	SELECT helpscout.conversation.email, helpscout.conversation.created_at as expressed_interest_at
	FROM helpscout.conversation
	INNER JOIN helpscout.conversation_tag ON helpscout.conversation.id=helpscout.conversation_tag.conversation_id
	WHERE tag="beacon-interest"
	UNION
	select hubspot.contact.email, DATETIME(hubspot.contact.property_beacon_interest/1000, 'unixepoch') || ' UTC' as expressed_interest_at
	FROM hubspot.contact
	WHERE property_beacon_interest != ''
	ORDER BY expressed_interest_at;

	# Bad: Other variations

	Select
	email
	From
	helpscout.conversation
	Where
	created_at
	And
	conversation_id
	Order By
	email

Common Table Expressions (CTEs) vs Subqueries

Common Table Expressions go a long way towards making your query easy to read and debug. Not all SQL dialects support CTEs (MySQL only added them in version 8.0, for example), but using them in the query was almost always an indicator of an experienced analyst.

	# GOOD: CTEs
	# I'd put each CTE on a new line though to make it easier to comment and copy/paste

	with form as (

	select
	email,
	to_timestamp(property_beacon_interest) as expressed_interest_at
	from hubspot.contact

	), support as (

	select
	c.email,
	c.created_at as expressed_interest_at
	from hubspot.conversation c
	left join hubspot.conversation_tag t on c.id = t.conversation_id
	where t.tag = 'beacon-interest'

	), unioned as (

	select * from form
	union all
	select * from support

	), final as (

	select
	email,
	min(expressed_interest_at) as expressed_interest_at
	from unioned
	group by 1
	order by 2

	)

	select * from final

	# OKAY: Subqueries

	SELECT email, MIN(created_at) AS expressed_interest_at
	FROM
	(
	SELECT a.email, MIN(created_at) AS created_at
	FROM conversation a
	INNER JOIN conversation_tag b
	ON a.id=b.conversation_id
	WHERE b.tag="beacon-interest"
	GROUP BY id
	UNION
	SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
	FROM hubspot.contact
	WHERE time_created IS NOT NULL
	)
	GROUP BY email ORDER BY expressed_interest_at

Meaningful CTE Names

CTEs benefit a lot from meaningful names that make it easy for you and other analysts to interpret.

	# Good: Descriptive names

	WITH beacon_interest_contact as (
	SELECT
	email,
	FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
	FROM
	hubspot.contact
	WHERE
	property_beacon_interest IS NOT NULL),
	beacon_interest_conversation as (
	SELECT
	email,
	created_at as expressed_interest_at
	FROM
	helpscout.conversation
	INNER JOIN
	helpscout.conversation_tag
	ON
	conversation.id = conversation_tag.conversation_id
	WHERE
	conversation_tag.tag = 'beacon-interest'),
	beacon_interest_union as (
	SELECT
	*
	FROM
	beacon_interest_contact
	UNION
	SELECT
	*
	FROM
	beacon_interest_conversation)
	SELECT
	email,
	MIN(expressed_interest_at) as expressed_interest_at
	FROM
	beacon_interest_union
	GROUP BY
	email;

	# Bad: Non-descriptive names

	;with x0 as
	(
	Select hcv.email, hcv.created_at expressed_interest_at
	From helpscout.conversation hcv
	Inner join helpscout.conversation_tag hct on hcv.id = hct.conversation_id
	Where hct.tag = ‘beacon-interest’
	Union
	Select email, dateadd(S,property_beacon_interest, '1970-01-01') expressed_interest_at
	From hubspot.contact
	Where property_beacon_interest is not null

	),

	X1 as (
	Select email , expressed_interest_at, row_number() over (partition by email order by expressed_interest_at) as RowNo
	From x0
	Order by expressed_interest_at
	)

	Select email, expressed_interest_at
	From x1
	Where RowNo =1

INNER JOIN vs LEFT JOIN

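Either works here, since the WHERE filter on the tag drops the untagged rows a LEFT JOIN would otherwise keep, but INNER JOIN is more intuitive and typically performs at least as well.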

	# Good: INNER JOIN

	…
	SELECT a.email AS email, a.created_at AS expressed_interest_at
	FROM helpscout.conversation AS a
	INNER JOIN helpscout.conversation AS b ON a.id = b.conversation_id
	WHERE b.conversation_id = 'beacon-interest'
	…

	# Okay: LEFT JOIN

	…
	select
	c.email,
	c.created_at as expressed_interest_at
	from hubspot.conversation c
	left join hubspot.conversation_tag t on c.id = t.conversation_id
	where t.tag = 'beacon-interest'
	…

Implicit vs Explicit INNER

“INNER” is implied if you just write “JOIN”, so it’s not required, but can make the query easier to read. Either way is fine.

	# Good: Specifying INNER

	…
	select a.email
	, a.created_at as expressed_interest_at
	from helpscout.conversation a
	INNER JOIN
	helpscout.conversation_tag b
	ON a.id = b.conversation_id
	where b.tag = ‘beacon-interest’
	…

	# Good: Omitting INNER

	…
	SELECT HC.`email`, DATE_FORMAT(HC.`created_at`, '%Y-%m-%e %T UTC') AS expressed_interest_at
	FROM `helpscout.conversation` AS HC
	JOIN `helpscout.conversation_tag` AS HCT ON HC.`id`=HCT.`conversation_id` WHERE HCT.`tag`='beacon-interest'
	ORDER BY expressed_interest_at
	…

Filtering in the WHERE clause vs JOIN condition

The standard way to filter the conversations is to use a WHERE clause to filter the results to only include those that have a beacon-interest tag. However, because we’re using an INNER JOIN, it’s also possible to add it as a join condition and get the same result. In terms of performance, it doesn’t make a difference which approach you take.

I lean towards a WHERE clause because I think it’s clearer, but including it in the JOIN condition is completely viable as well.
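
To see the equivalence concretely, here is a small sketch using Python's sqlite3 module with a cut-down copy of the example tables (the unqualified table names and the in-memory database are assumptions for illustration). It also shows why the equivalence depends on the join being an INNER JOIN: with a LEFT JOIN, moving the filter into the join condition keeps the untagged conversation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table conversation (id integer, email text);
    insert into conversation values
        (1, 'matt@example.com'),
        (2, 'eli@example.com'),
        (4, 'katia@example.com');

    create table conversation_tag (conversation_id integer, tag text);
    insert into conversation_tag values
        (1, 'new-trial'),
        (1, 'bug-report'),
        (2, 'beacon-interest'),
        (4, 'beacon-interest');
""")


def conversation_ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Filter in the WHERE clause.
filter_in_where = conversation_ids("""
    select c.id
    from conversation c
    join conversation_tag t on c.id = t.conversation_id
    where t.tag = 'beacon-interest'
    order by c.id
""")

# Same filter as part of the INNER JOIN condition.
filter_in_join = conversation_ids("""
    select c.id
    from conversation c
    join conversation_tag t
        on c.id = t.conversation_id and t.tag = 'beacon-interest'
    order by c.id
""")

# With a LEFT JOIN the filter in the join condition no longer removes
# conversation 1; it comes back once with a NULL tag instead.
left_join_filter_in_join = conversation_ids("""
    select c.id
    from conversation c
    left join conversation_tag t
        on c.id = t.conversation_id and t.tag = 'beacon-interest'
    order by c.id
""")

print(filter_in_where, filter_in_join, left_join_filter_in_join)
```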

	# Good: Filtering in the WHERE clause

	…
	SELECT cnv.email,
	cnv.created_at AS expressed_interest_at
	FROM helpscout.conversation AS cnv
	JOIN helpscout.conversation_tag AS ctg
	ON cnv.id = ctg.conversation_id
	WHERE ctg.tag = 'beacon-interest'
	…

	# Good: Including it in the join condition

	…
	select conv.email, conv.created_at as expressed_interest_at
	from helpscout.conversation conv
	join helpscout.conversation_tag tag
	on conv.id = tag.conversation_id and tag.tag = 'beacon-interest'
	…

Converting Milliseconds

HubSpot stores the form submission timestamp in milliseconds. Queries that didn’t account for that would not produce the correct result.
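
The size of the error is easy to underestimate, so here is a quick sketch in Python using the value from the example table. The function names are made up for this illustration. Treating the millisecond value as seconds lands tens of thousands of years in the future, which Python's datetime cannot even represent; how a given database reacts to the same mistake varies by dialect.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_unix_millis(ms):
    """Interpret the value as milliseconds since the Unix epoch."""
    return EPOCH + timedelta(milliseconds=ms)


def from_unix_seconds(seconds):
    """Interpret the value as seconds since the Unix epoch."""
    return EPOCH + timedelta(seconds=seconds)


property_beacon_interest = 1534101377000  # value from the example table

correct = from_unix_millis(property_beacon_interest)
print(correct.isoformat())

# The same number read as seconds is far beyond year 9999.
try:
    from_unix_seconds(property_beacon_interest)
    mistaken_result = "no error"
except OverflowError:
    mistaken_result = "out of range"
print(mistaken_result)
```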

	# Good: Accounting for the timestamp being in milliseconds

	…
	select email, (timestamp '1970-01-01 00:00:00 UTC' +
	numtodsinterval(property_beacon_interest/1000, 'SECOND'))
	at time zone 'utc') as created_at
	from hubspot.contact
	…

	# Bad: Not realizing milliseconds need to be handled differently

	…
	SELECT
	EMAIL
	,FROM_UNIXTIME(PROPERTY_BEACON_INTEREST) AS EXPRESSED_INTEREST_AT
	FROM HUBSPOT.CONTACT
	WHERE
	PROPERTY_BEACON_INTEREST IS NOT NULL
	…

UNION DISTINCT vs UNION ALL

There are two types of UNIONs: UNION DISTINCT and UNION ALL. The former – which is the default when you just write UNION – only returns the unique records in the combined results.

Both result in the correct answer here, but UNION ALL performs better because with UNION DISTINCT the database has to sort the results and remove the duplicate rows.
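
A small sketch of why both give the same answer, using Python's sqlite3 module. The two tables and their rows are invented for illustration: one person appears in both sources with an identical timestamp, so UNION collapses the duplicate while UNION ALL keeps it, and the MIN per email comes out the same either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table form_interest (email text, expressed_interest_at text);
    create table support_interest (email text, expressed_interest_at text);

    insert into form_interest values
        ('eli@example.com', '2018-08-14 14:06:30');
    insert into support_interest values
        ('eli@example.com', '2018-08-14 14:06:30'),
        ('katia@example.com', '2018-08-14 14:11:30');
""")


def combined_rows(union_keyword):
    """Rows from both sources, combined with either 'union' or 'union all'."""
    return conn.execute(f"""
        select email, expressed_interest_at from form_interest
        {union_keyword}
        select email, expressed_interest_at from support_interest
    """).fetchall()


def earliest_per_email(union_keyword):
    """Earliest timestamp per email on top of the combined rows."""
    return conn.execute(f"""
        select email, min(expressed_interest_at)
        from (
            select email, expressed_interest_at from form_interest
            {union_keyword}
            select email, expressed_interest_at from support_interest
        )
        group by email
        order by email
    """).fetchall()


print(len(combined_rows("union all")), len(combined_rows("union")))
print(earliest_per_email("union all") == earliest_per_email("union"))
```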

	# Good: UNION ALL

	SELECT
	C.email,
	C.created_at AS expressed_interest_at
	FROM helpscout.conversation C
	INNER JOIN helpscout.conversation_tag CT ON CT.conversation_id = C.id
	WHERE CT.tag = 'beacon-interest'
	UNION ALL
	SELECT
	email,
	DATEADD(S, CONVERT(INT,LEFT(property_beacon_interest, 10)), '1970-01-01') AS expressed_interest_at
	FROM hubspot.contact

	# Okay: UNION DISTINCT

	SELECT email, MIN(created_at) AS expressed_interest_at
	FROM
	(SELECT a.email, MIN(created_at) AS created_at
	FROM conversation a
	INNER JOIN conversation_tag b
	ON a.id=b.conversation_id
	WHERE b.tag="beacon-interest"
	GROUP BY id
	UNION
	SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
	FROM hubspot.contact
	WHERE time_created IS NOT NULL)
	GROUP BY email ORDER BY expressed_interest_at

Accounting for Multiple Interest

Many applicants didn’t take into account that people could have expressed interest more than once, for example by both filling out the HubSpot form and contacting support. Neglecting to account for this could result in multiple rows for individual email addresses.
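
The logic the query needs is just "keep the earliest timestamp per email". Here is a rough Python sketch of that idea, outside of SQL. The function name is made up, and the second row for matt@example.com is a hypothetical tagged conversation added for illustration; it is not in the example tables.

```python
def earliest_interest(events):
    """Return {email: earliest timestamp} from (email, timestamp) pairs."""
    first_seen = {}
    for email, expressed_at in events:
        # Keep the timestamp only if it is the first or an earlier one.
        if email not in first_seen or expressed_at < first_seen[email]:
            first_seen[email] = expressed_at
    return first_seen


events = [
    ("matt@example.com", "2018-08-12 19:16:17"),   # interest form
    ("matt@example.com", "2018-08-14 14:07:33"),   # hypothetical tagged conversation
    ("katia@example.com", "2018-08-14 14:11:30"),  # tagged conversation
]

# Three events, but only two people.
print(len(events), earliest_interest(events))
```

In SQL, MIN combined with GROUP BY on the email does the same job.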

	# Good: Taking the earliest timestamp for each email

	SELECT email, MIN(created_at) AS expressed_interest_at
	FROM
	(SELECT c.email, MIN(created_at) AS created_at
	FROM helpscout.conversation c, helpscout.conversation_tag ct
	WHERE c.id =ct.conversation_id
	AND ct.tag="beacon-interest"
	GROUP BY id
	UNION
	SELECT email,DATETIME(time_created/1000,'unixepoch')|| ' UTC' AS created_at
	FROM hubspot.contact
	WHERE time_created IS NOT NULL)
	GROUP BY email ORDER BY expressed_interest_at;

	# Bad: Missing it

	select cont.email, cont.property_beacon_interest as expressed_interest_at
	from hubspot.contact cont
	where cont.property_beacon_interest is not null
	union
	select conv.email, conv.created_at as expressed_interest_at
	from helpscout.conversation conv
	join helpscout.conversation_tag tag
	on conv.id = tag.conversation_id and tag.tag = 'beacon-interest'
	order by expressed_interest_at

GROUP BY Column Name vs Number

I lean towards using column numbers, but either is fine, and using the column name can have benefits. When there are 5+ columns (which is not an issue with this question), lean towards using column numbers which will be a lot more sane than typing out all the names (hat-tip Ray Buhr for this tip).

	# Good: Grouping by column number

	…
	select
	email,
	min(expressed_interest_at) as expressed_interest_at
	from unioned
	group by 1
	order by 2
	…

	# Good: Grouping by column name

	…
	SELECT a.email, min(b.created_at) AS expressed_interest_at
	FROM hubspot.contact AS a
	JOIN helpscout.conversation AS b ON a.email = b.email
	GROUP BY a.email
	…

Single vs Multiple Lines When Listing Multiple Columns

Another common style difference is whether people put multiple columns on the same line or not. Either is fine, but I lean towards one column per line because I think it’s easier to read.

	# Good: Multiple columns per line

	…
	SELECT email, created_at AS expressed_interest_at
	FROM helpscout.conversation c
	…

	# Good: One column per line

	…
	SELECT
	email,
	FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
	FROM
	hubspot.contact
	…

Comma First vs Comma Last

While not that common in the responses, it’s perfectly valid when listing columns on multiple lines to put the commas before the column name. The benefit is that you don’t have to add a comma to the previous line when adding a new column which also simplifies query diffs that you might see in your version control tool.

	# Good: Comma last

	…
	SELECT
	email,
	FROM_UNIXTIME(property_beacon_interest/1000) as expressed_interest_at
	FROM
	hubspot.contact
	…

	# Good: Comma first
	…
	select
	email
	,dateadd(s, convert(int,left(property_beacon_interest,10)), '1970-01-01')
	from hubspot.contact
	…

Comma-first folks tended to have software development backgrounds.

PS: For anyone interested in SQL coding conventions, I highly recommend checking out dbt’s coding conventions which have influenced my preferences here.

All Responses

We were fortunate to receive over 100 applications, most of which included an answer to the SQL question. I suspect if the application didn’t include this question, we would have had twice the number of applicants, but the presence of this question led some underqualified folks not to apply.

You can check out all 89 responses on GitHub.

If you have any suggestions on how to improve my query or feedback on any of this analysis, please drop a comment below. Happy querying!

I’m speaking at JOIN next week!

October 3, 2018 by Mazur

Next week I’m excited to be speaking at JOIN, Looker’s annual user conference in San Francisco.

My talk (at 11:15am on Wednesday the 10th) is about how we use Fivetran, Looker, and email to get more people at Help Scout interested in and engaged with our metrics, a topic which I’ve written about previously on this blog. Huge thanks to Fivetran for sponsoring this session.

I’ll also be at the dbt meetup on Tuesday the 9th.

If you happen to be attending either the meetup or the conference, drop me a note – I’d love to say hey 👋.

Automating Facebook Ad Insights Reporting Using Fivetran and Looker

October 1, 2018 by Mazur

For one of my recent consulting projects, I worked with a client to automate their Facebook Ad Insights reporting in Looker. There are some nuances with the data that made it a little tricky to figure out initially, but in the end we wound up with a pretty elegant model that is going to let them report on all of their key metrics directly from Looker. This post is about how you can do the same.

If you’d like to get notified when I release future tutorials like this, make sure to sign up for my Matt on Analytics newsletter.

The Objective

By the end of this tutorial, you’ll be able to analyze your Facebook Ad Insights data in Looker using the following 15 fields: Account ID, Account Name, Ad ID, Adset ID, Campaign ID, Campaign Name, Country, Report Date, CPM, CTR, ROAS, Total Conversion Value, Total Impressions, Total Link Clicks, and Total Spend.

Setting up the Connection in Fivetran

Fivetran makes it incredibly easy to centralize all of your data in a data warehouse for analysis in a BI tool like Looker. This tutorial assumes you’re using Fivetran, but if you’re using another ETL tool like Stitch to grab your Ad Insights data, the modeling should be fairly similar.

There are a lot of ways to slice and dice your Facebook data, but for the key metrics mentioned above, here’s what your setup should look like:

Breakdown should be country – this means all of the reporting data will be segmented by country. You could segment it in additional ways like by age, gender, etc depending on your needs – just make sure to adjust the model accordingly if you do.
Action Breakdowns should be action_type.
Fields should be account_name, action_values, actions, campaign_id, campaign_name, impressions, inline_link_clicks, and spend.
Click Attribution Window for us is 28 days and View Attribution Window is 1 day.

Once connected, Fivetran will pull all of the relevant data from Facebook using the Facebook Ad Insights API and throw it into your data warehouse.

There are two key tables:

ad_insights – This table has data related to the spend: campaign_id, country, date, account_id, account_name, ad_id, adset_id, campaign_name, impressions, inline_link_clicks, and spend.
ad_insights_action_values – This table has data related to how much revenue was earned as a result of that spend: campaign_id, country, date, _1_d_view, _28_d_click, action_type, and value.

For example, to see spend metrics by campaign for a given day, we can run a query like this:

	select
	    campaign_id,
	    sum(spend) as total_spend,
	    sum(impressions) as total_impressions,
	    sum(spend) / sum(impressions) * 1000 as cpm,
	    sum(inline_link_clicks) as link_clicks,
	    sum(inline_link_clicks) / sum(impressions) as ctr
	from facebook_ad_insights.ad_insights
	where date = '2018-10-01'
	group by 1
	order by 1

And to see conversions by campaign on a given date:

	select
	    campaign_id,
	    sum(value) as total_conversion_value
	from facebook_ad_insights.ad_insights_action_values
	where
	    action_type = 'offsite_conversion.fb_pixel_purchase' and
	    date = '2018-10-01'
	group by 1
	order by 1

One key note about the conversion data that will come into play later: there may be several different values for action_type, but the only one that matters for measuring total conversion value is offsite_conversion.fb_pixel_purchase; everything else can be ignored.

Another important point: conversion data is cohorted by the day of the spend, not the day the conversion happened. That matters because it means there will never be conversions on days without spend. Put another way: every row in the conversion data has a corresponding row in the spend data. As we’ll see, that means we can join the spend data to the conversion data and we’ll capture everything we need.
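
If you want to confirm that assumption on your own data before relying on the join, a check along these lines can help. This is a rough Python sketch with made-up rows and a hypothetical function name; in practice you would run the equivalent anti-join in your warehouse.

```python
def row_key(row):
    """The date/campaign/country combination shared by both tables."""
    return (row["date"], row["campaign_id"], row["country"])


def orphan_conversion_keys(spend_rows, conversion_rows):
    """Keys that appear in the conversion data but not in the spend data."""
    spend_keys = {row_key(r) for r in spend_rows}
    conversion_keys = {row_key(r) for r in conversion_rows}
    return sorted(conversion_keys - spend_keys)


# Made-up sample rows for illustration.
spend_rows = [
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US", "spend": 100.0},
    {"date": "2018-10-01", "campaign_id": "c1", "country": "CA", "spend": 50.0},
]
conversion_rows = [
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US", "value": 300.0},
]

# An empty list means a left join from spend to conversions loses nothing.
print(orphan_conversion_keys(spend_rows, conversion_rows))
```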

Modeling the Data in Looker

Identifying the primary keys

Spend data in the ad_insights table can be uniquely identified by the combination of the date, campaign id, and country. We can set up a primary key dimension like so:

	dimension: primary_key {
	  type: string
	  sql: concat(cast(${date_date} as string), ${campaign_id}, ${country}) ;;
	  hidden: yes
	  primary_key: yes
	}

For the conversion data, this comes close, but there can also be many action_type records for each date/campaign/country combination so we can’t just use that as the primary key.

That said, because we only care about the action_type of offsite_conversion.fb_pixel_purchase, it simplifies the modeling to create a derived table that consists of only actions of this type. That way we can use date/campaign/country as the primary key.

You can model this in dbt or simply create a derived table in Looker by filtering fb_ad_insights_action_values accordingly (we’ll wind up calling this fb_conversions below).

	select *
	from fivetran.fb_ad_insights_action_values
	where action_type = "offsite_conversion.fb_pixel_purchase"

By only working with this derived table, there will be a one-to-one relationship between the spend data and the conversion data.
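
Here is a small Python sketch of why the filter matters for the key. The rows are invented, and "some_other_action_type" is a placeholder rather than a real Facebook action type. Before filtering, the date/campaign/country combination repeats once per action type; after filtering to purchases it is unique.

```python
from collections import Counter

PURCHASE = "offsite_conversion.fb_pixel_purchase"


def duplicate_keys(rows):
    """Date/campaign/country combinations that occur more than once."""
    counts = Counter((r["date"], r["campaign_id"], r["country"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up rows; "some_other_action_type" is a placeholder value.
action_values = [
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US",
     "action_type": PURCHASE, "value": 120.0},
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US",
     "action_type": "some_other_action_type", "value": 300.0},
    {"date": "2018-10-01", "campaign_id": "c1", "country": "CA",
     "action_type": PURCHASE, "value": 40.0},
]

# Equivalent of the derived table: keep only the purchase action type.
purchases = [r for r in action_values if r["action_type"] == PURCHASE]

print(duplicate_keys(action_values))  # the US key repeats
print(duplicate_keys(purchases))      # no repeats after filtering
```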

Creating the Model

Here’s what the model winds up looking like:

	explore: fb_ad_insights {
	  label: "Facebook"
	  view_label: "Facebook"

	  join: fb_conversions {
	    view_label: "Facebook"
	    type: left_outer
	    relationship: one_to_one
	    sql_on: ${fb_ad_insights.primary_key} = ${fb_conversions.primary_key} ;;
	  }
	}

We’re left joining the spend data to the derived conversion table and because the conversion data is already filtered to only include the fb_pixel_purchase action_type, there’s a one-to-one relationship.

Creating the Spend View

Here’s what it looks like:

	view: fb_ad_insights {
	sql_table_name: fivetran.fb_ad_insights ;;

	dimension: primary_key {
	type: string
	sql: concat(cast(${date_date} as string), ${campaign_id}, ${country}) ;;
	hidden: yes
	primary_key: yes
	}

	dimension: account_id {
	type: string
	sql: ${TABLE}.account_id ;;
	}

	dimension: account_name {
	type: string
	sql: ${TABLE}.account_name ;;
	}

	dimension: country {
	type: string
	sql: ${TABLE}.country ;;
	}

	dimension: ad_id {
	type: string
	sql: ${TABLE}.ad_id ;;
	}

	dimension: adset_id {
	type: string
	sql: ${TABLE}.adset_id ;;
	}

	dimension: campaign_id {
	type: string
	sql: ${TABLE}.campaign_id ;;
	}

	dimension: campaign_name {
	type: string
	sql: ${TABLE}.campaign_name ;;
	}

	dimension_group: date {
	label: "Report"
	type: time
	timeframes: [raw, date, week, month, quarter, year]
	sql: ${TABLE}.`date` ;;
	}

	dimension: impressions {
	type: number
	sql: ${TABLE}.impressions ;;
	hidden: yes
	}

	dimension: inline_link_clicks {
	type: number
	sql: ${TABLE}.inline_link_clicks ;;
	hidden: yes
	}

	dimension: spend {
	type: number
	sql: ${TABLE}.spend ;;
	hidden: yes
	}

	measure: total_spend {
	label: "Total Spend"
	description: "The estimated amount of money we've spent on these ads"
	type: sum
	sql: ${spend} ;;
	value_format_name: usd
	}

	measure: total_impressions {
	label: "Total Impressions"
	description: "The number of times our ads were on screen."
	type: sum
	sql: ${impressions} ;;
	value_format_name: decimal_0
	}

	measure: total_link_clicks {
	label: "Total Link Clicks"
	description: "The number of clicks on links within the ad that led to destinations or experiences, on or off Facebook"
	type: sum
	sql: ${inline_link_clicks} ;;
	value_format_name: decimal_0
	}

	measure: cpm_ {
	label: "CPM"
	description: "The average cost for 1,000 impressions"
	type: number
	sql: ${total_spend} / nullif(${total_impressions}, 0) * 1000 ;;
	value_format_name: usd
	}

	measure: ctr_amount {
	label: "CTR"
	description: "The percentage of times people saw your ad and performed a link click"
	type: number
	sql: ${total_link_clicks} / nullif(${total_impressions}, 0) ;;
	value_format_name: percent_2
	}
	}

All pretty straightforward.
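
The one detail worth a second look is the nullif(..., 0) in the CPM and CTR measures, which avoids a divide-by-zero error when a slice has no impressions. Here is the same arithmetic as a quick Python sketch; the function names and sample numbers are made up for illustration.

```python
def safe_divide(numerator, denominator):
    """Return None when the denominator is zero, like dividing by nullif(x, 0)."""
    if not denominator:
        return None
    return numerator / denominator


def cpm(total_spend, total_impressions):
    """Average cost per 1,000 impressions."""
    ratio = safe_divide(total_spend, total_impressions)
    return None if ratio is None else ratio * 1000


def ctr(total_link_clicks, total_impressions):
    """Share of impressions that led to a link click."""
    return safe_divide(total_link_clicks, total_impressions)


# Made-up totals: $50 of spend, 20,000 impressions, 300 link clicks.
print(cpm(50.0, 20000), ctr(300, 20000))

# No impressions: both metrics come back empty instead of raising an error.
print(cpm(10.0, 0), ctr(0, 0))
```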

Creating the Conversions View

	view: fb_conversions {
	derived_table: {
	sql:
	select *
	from fivetran.fb_ad_insights_action_values
	where action_type = "offsite_conversion.fb_pixel_purchase" ;;
	}

	dimension: primary_key {
	type: string
	sql: concat(cast(${date_date} as string), ${campaign_id}, ${country}) ;;
	hidden: yes
	primary_key: yes
	}

	dimension: _1_d_view {
	type: number
	sql: ${TABLE}._1_d_view ;;
	hidden: yes
	}

	dimension: _28_d_click {
	type: number
	sql: ${TABLE}._28_d_click ;;
	hidden: yes
	}

	dimension: _fivetran_synced {
	type: number
	sql: ${TABLE}._fivetran_synced ;;
	hidden: yes
	}

	dimension: country {
	type: string
	sql: ${TABLE}.country ;;
	hidden: yes
	}

	dimension: campaign_id {
	type: string
	sql: ${TABLE}.campaign_id ;;
	hidden: yes
	}

	dimension_group: date {
	label: "Conversion"
	type: time
	sql: ${TABLE}.`date` ;;
	timeframes: [raw, date, week, month, quarter, year]
	hidden: yes
	}

	dimension: index {
	type: number
	sql: ${TABLE}.index ;;
	hidden: yes
	}

	dimension: value {
	description: "The total value of all conversions attributed to your ads"
	type: number
	sql: ${TABLE}.value ;;
	hidden: yes
	}

	measure: total_conversion_value {
	description: "The total value of all conversions attributed to your ads"
	type: sum
	sql: ${value} ;;
	value_format_name: usd
	}

	measure: roas {
	label: "ROAS"
	description: "Conversion Value / Spend"
	type: number
	sql: ${total_conversion_value} / nullif(${fb_ad_insights.total_spend}, 0) ;;
	value_format_name: percent_0
	}
	}

At the top you’ll see that this is a derived table from the original fb_ad_insights_action_values provided by Fivetran.

The only noteworthy metric here is the ROAS measure which takes the total conversion value measure and divides it by the total spend measure from the spend view.
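
To make the calculation concrete, here is a rough Python sketch of what the join plus the ROAS measure work out to: spend rows are left-joined to purchase conversion rows on date/campaign/country, then total conversion value is divided by total spend. The rows and function names are invented for illustration. One simplification to be aware of: a spend row without a matching conversion contributes 0 here, whereas in SQL the unmatched value would be NULL.

```python
def row_key(row):
    """The date/campaign/country combination used as the join key."""
    return (row["date"], row["campaign_id"], row["country"])


def roas(spend_rows, conversion_rows):
    """Total conversion value divided by total spend, or None without spend."""
    value_by_key = {row_key(r): r["value"] for r in conversion_rows}
    total_spend = 0.0
    total_value = 0.0
    for row in spend_rows:
        total_spend += row["spend"]
        # Left join: spend rows without conversions still count towards spend.
        total_value += value_by_key.get(row_key(row), 0.0)
    return total_value / total_spend if total_spend else None


# Made-up sample rows.
spend_rows = [
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US", "spend": 100.0},
    {"date": "2018-10-01", "campaign_id": "c1", "country": "CA", "spend": 50.0},
]
conversion_rows = [
    {"date": "2018-10-01", "campaign_id": "c1", "country": "US", "value": 300.0},
]

print(roas(spend_rows, conversion_rows))
```

With the sample rows, $300 of conversion value on $150 of spend gives a ROAS of 2.0, which the percent_0 format would display as 200%.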

And… drum roll… that’s it. You can now explore your Facebook Ad Insights data in Looker to your heart’s content and create dashboards that your leadership and teammates can consume without logging into Facebook or relying on its limited reporting capabilities.

Feel free to reach out if you run into any issues with any of the above – I’m happy to help.

Lastly, if you found this helpful, I encourage you to join the newsletter as there will be several more posts like this in the coming weeks including how to model AdWords and DCM data as well as how to combine all of these into a single report.