A Frequent Communication Mistake I’ve Made as a Data Analyst

June 22, 2018Mazur 1 Comment

Looking back at my time so far as a data analyst, some of the biggest mistakes I’ve made were not technical in nature but around how I communicated within the organization.

Two real-world examples to illustrate:

A few years ago at Automattic our ad revenue was way down from what our marketing team expected it to be. For example (and I’m making up numbers here), for every $100 we were spending on ads, we had been making $150 historically, but in recent months we were making $25. Either the performance had gone way down or there was some issue with the tracking and reporting.

I was on the data team at the time and volunteered to work with the marketing team to investigate. As it turned out, there was indeed an issue: there was a problem with the way AdWords was appending UTM parameters to our URLs which was breaking our tracking. For example, a visitor would click an ad and land on wordpress.com/business&utm_source=adwords – note that there’s an amperstand after the URL path instead of a question mark, so the correct UTM source wouldn’t get tracked and the customer wouldn’t get attributed to AdWords.

Fortunately, we had some event tracking set up on these pages (Tracks for the win) that recorded the full URL, so I was able to go back and determine which customers came from ads and calculate what our actual return on ad spend was. After figuring out the issue and determining how much unattributed revenue we had, I wrote up a lengthy post about what happened and published it on our internal marketing blog without informing the marketing team about it first.

Second example: a few months ago at Help Scout, we had an ambitious revenue target for Q1. With a few days left in the quarter, we were still projecting to come in short of the target and no one realistically expected us to reach it. Something about the projection seemed off to me so I dove in and realized there was a mistake in one of the calculations (it was my fault – in the projection we weren’t counting revenue that we earned that month from customers that were delinquent who then became paying again). As a result, our projection was too low and we likely were going to hit our target (and eventually did!). I wrote up a lengthy message about what happened and published it in one of our company Slack channels without informing any of the leadership about it first.

To understand the problem, it’s important to note that as a data analyst, I haven’t typically been responsible for the performance of our metrics. I help set up tracking and reporting and help ensure accuracy, but someone else in the organization is responsible for how well those metrics were doing.

In both of the cases above, I wasn’t intentionally bypassing people. At the time, it was more like “oh, hey, there’s a bug, now it’s fixed, better let everyone know about it” – and probably an element of wanting credit for figuring out the issue too.

However, not consulting with those responsible for the metrics before reporting it was a mistake for several reasons:

They didn’t have an opportunity to help me improve how the issue and impact were communicated with the rest of the company and its leadership.
I missed an opportunity to have them doublecheck the revised calculations, which could have been wrong.
Even though we were doing better than we had been reporting in both cases, it may have indirectly made people look bad because they had been reporting performance based on inaccurate data. They should not be finding out about the issue at the same time as the rest of the company.

In neither case was there any big drama about how I went about it, but it was a mistake on my part nonetheless.

Here’s what I’d recommend for anyone in a similar role: if someone else in your organization is responsible for the performance of a metric and you as a data analyst discover some issue with the accuracy of that metric, always discuss it with them first and collaborate with them on how it is communicated to the rest of the company.

It sounds obvious in retrospect, but it’s bitten me a few times so I wanted to share it with the hope that it helps other analysts out there avoid similar issues. Soft skills like this are incredibly important and worth developing in parallel with your technical skills.

If you’ve made any similar mistakes or have any related lessons learned, I’d love to hear about them in the comments or by email. Cheers!

Tracking Daily Unique Visitors to Recently Published Blog Posts with Looker, Fivetran, Mixpanel, and BigQuery

May 11, 2018May 11, 2018Mazur 4 Comments

If you work at a company that publishes a lot of content, it’s important to understand how well that content is performing. Not just in terms of page views and unique visitors, but whether it converts visitors into trials, etc.

At Help Scout we have a Looker dashboard to help us track all of these things and more. In this post, I’ll walk you through how we track the daily unique visitors to our recently published blog posts. For example, on May 2nd we published April’s Release Notes – how many people viewed that post on the day it was published? How about the day after? And how does that compare to our other recently published content?

Overview

Big picture, we fire a Viewed Page Mixpanel event on every marketing page. We then use Fivetran to get that event data into BigQuery, where we analyze it in Looker. You can read more about the setup here: Tracking What Pages Your Visitors View Prior to Signing Up Using Mixpanel, Fivetran, BigQuery, and Looker.

Querying for Recently Published Posts

With this data in hand, we need to figure out a way to determine what the recent blog posts were so that we can limit our analysis to them.

Here’s the query we use:

	SELECT
	current_url,
	MIN(TIMESTAMP_TRUNC(time, DAY)) AS published_at
	FROM mp.event
	WHERE
	current_url like "https://www.helpscout.net/blog/%"
	GROUP BY 1
	HAVING
	COUNT(*) > 200 AND
	MIN(time) >= "2018-04-04"
	ORDER BY 2 DESC

view raw

recent-blog-posts.sql

hosted with ❤ by GitHub

For Help Scout blog URLs (ie, URLs that begin with https://www.helpscout.net/blog/), we need to determine when the post was published. That’s the same as the first day it was viewed. However, because we launched Mixpanel page view tracking on April 3rd, this would make it look like every one of our posts was published on April 3rd or sometime after that. That’s why we limit the results to April 4th or later. Also, we want to limit it to posts that received at least a certain number of visitors that first day, otherwise the results will wind up including low traffic posts that were first viewed after April 4th.

This query gets us a list of those recently published posts:

	+———————————————————————-+——————————–+
	\| current_url \| published_at \|
	+———————————————————————-+——————————–+
	\| https://www.helpscout.net/blog/customer-intimacy/ \| 2018-05-03 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/april-2018-release-notes/ \| 2018-05-02 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/data-security-breach/ \| 2018-04-26 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/beacon-preview-ui/ \| 2018-04-19 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/customer-service-employee-engagement/ \| 2018-04-05 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/march-2018-release-notes/ \| 2018-04-04 00:00:00.000000 UTC \|
	+———————————————————————-+——————————–+

view raw

recent-posts.txt

hosted with ❤ by GitHub

Modeling the Data in Looker

Over in Looker, we’re going to create a derived table with these results so that we can determine whether a given Mixpanel event URL is a recently published blog post:

	view: mp_new_blog_content {
	derived_table: {
	sql:
	SELECT
	current_url,
	MIN(TIMESTAMP_TRUNC(time, DAY)) AS published_at
	FROM mp.event
	WHERE
	current_url like "https://www.helpscout.net/blog/%"
	GROUP BY 1
	HAVING
	COUNT(*) > 200 AND
	MIN(time) >= "2018-04-04"
	ORDER BY 2 DESC ;;
	sql_trigger_value: FORMAT_DATETIME("%Y-%m-%d-%H", CURRENT_DATETIME()) ;;
	}

	dimension: current_url {
	type: string
	sql: ${TABLE}.current_url ;;
	hidden: yes
	}

	dimension: is_new_blog_content {
	label: "Is New Blog Content?"
	type: yesno
	sql: ${blog_post_publish_date} IS NOT NULL ;;
	}

	dimension_group: blog_post_publish {
	type: time
	timeframes: [
	date,
	week,
	month,
	quarter,
	year
	]
	sql: ${TABLE}.published_at ;;
	}
	}

view raw

new-blog-content.lookml

hosted with ❤ by GitHub

The reason we have the is_new_blog_content dimension here is because we’re going to LEFT JOIN all Mixpanel events on this derived table by the URL. Not all URLs will have a match in this table, so this dimension will let us limit the analysis to just events that were a recently published blog post.

Here’s how we model the relationship between our main Mixpanel events model and this derived table:

	explore: mp_events {
	view_label: "Mixpanel"
	label: "Mixpanel"

	join: mp_new_blog_content {
	view_label: "Mixpanel"
	type: left_outer
	relationship: many_to_one
	sql_on: ${mp_new_blog_content.current_url} = ${mp_events.current_url} ;;
	}
	}

view raw

recent-posts-explore.lookml

hosted with ❤ by GitHub

One other key piece of this is that we model how to calculate unique visitors in the main Mixpanel events view:

	view: mp_events {
	sql_table_name: mp.event ;;

	# …

	measure: unique_visitors {
	type: count_distinct
	sql: ${distinct_id} ;;
	}
	}

view raw

unique-visitors.lookml

hosted with ❤ by GitHub

Creating the Chart in Looker

With these foundations in place, we can then create the chart we set out to.

We want to use that Is New Blog Content dimension to limit the results to recently published posts, then pivot the daily unique visitor count on the URL:

Then it’s just a matter of setting up the chart in Looker and voila, there we have it:

Going forward, without anyone having to log into Google Analyics, we’ll be able to track the popularity of our new blog posts and track the trends over time.

By the way, that spike on April 19th is from our CEO’s Beacon 2.0 Preview: The User Interface post, all about Help Scout’s soon-to-be-released live chat & improved self service tool. If you’re interested in getting notified when it launches, you can sign up here.

Happy querying!

Analyzing a Conversion Funnel in BigQuery Using Fivetran Powered Mixpanel Data

April 12, 2018Mazur Leave a comment

In a recent post I outlined how to use Fivetran to sync Mixpanel data to BigQuery for analysis in Looker. Today we’ll walk through how to write a SQL query to analyze a funnel using the Mixpanel data in BigQuery.

For this analysis, we’re going to create a three-step funnel showing how many visitors who start on Help Scout’s pricing page click through to the sign up page and then sign up.

One Step

To begin, lets just look at visitors who viewed the pricing page:

	SELECT
	distinct_id,
	time AS viewed_pricing_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://www.helpscout.net/pricing/%"

	+—————————————————-+——————————–+
	\| distinct_id \| viewed_pricing_at \|
	+—————————————————-+——————————–+
	\| 1628dfc9f7e0-0184dcc4e-71615c5a-49a10-1628dfc9f800 \| 2018-04-03 17:09:27.000000 UTC \|
	\| 16291bd96456-01abc2a0f-87a6c72-2c880-16291bd9648ae \| 2018-04-04 10:41:32.000000 UTC \|
	\| 162936a4e6f37-01526b61-6b4b327d-25800-162936a4e738 \| 2018-04-04 18:27:46.000000 UTC \|
	\| 162a0381d481-0b3137c58-60031d3c-3d10d-162a0381d496 \| 2018-04-07 06:10:35.000000 UTC \|
	+—————————————————-+——————————–+

view raw

pricing-page.sql

hosted with ❤ by GitHub

You might wonder why we need the % at the end of the URL; that’s simply to make sure the results include pages with URL parameters such as https://www.helpscout.net/pricing/?utm_source=adwords.

Two Steps

Next, we’ll join these results on sign up page data, making sure that the sign up page views occurred after the pricing page views:

	SELECT
	pricing.distinct_id,
	viewed_pricing_at,
	viewed_sign_up_at
	FROM (
	SELECT
	distinct_id,
	time AS viewed_pricing_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://www.helpscout.net/pricing/%"
	) pricing
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS viewed_sign_up_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://secure.helpscout.net/members/register/%"
	) sign_up_page ON sign_up_page.distinct_id = pricing.distinct_id AND sign_up_page.viewed_sign_up_at > viewed_pricing_at

	+———————————————————–+——————————–+——————————–+
	\| distinct_id \| viewed_pricing_at \| viewed_sign_up_at \|
	+———————————————————–+——————————–+——————————–+
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:31:26.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-05 12:48:37.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:28:42.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:27:49.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:35:51.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \|
	\| 162914788c5fb-09337b4c437dc4-b34356b-144000-162914788c628 \| 2018-04-04 17:30:18.000000 UTC \| \|
	+———————————————————–+——————————–+——————————–+

view raw

two-step-funnel.sql

hosted with ❤ by GitHub

The reason we LEFT JOIN is because not all visitors will make it to the next step of the funnel and we want the data to reflect that.

Note that we join on both the distinct_id (to ensure each result is for a single visitor) and on the time the events occurred.

Three Steps

Extending this to the third and final step of the funnel, the Signed Up event, we get:

	SELECT
	pricing.distinct_id,
	viewed_pricing_at,
	viewed_sign_up_at,
	signed_up_at
	FROM (
	SELECT
	distinct_id,
	time AS viewed_pricing_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://www.helpscout.net/pricing/%"
	) pricing
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS viewed_sign_up_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://secure.helpscout.net/members/register/%"
	) sign_up_page ON sign_up_page.distinct_id = pricing.distinct_id AND sign_up_page.viewed_sign_up_at > viewed_pricing_at
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS signed_up_at
	FROM mp.event
	WHERE
	name = "Signed Up"
	) signed_up ON signed_up.distinct_id = pricing.distinct_id AND signed_up.signed_up_at > viewed_sign_up_at

	+———————————————————–+——————————–+——————————–+——————————–+
	\| distinct_id \| viewed_pricing_at \| viewed_sign_up_at \| signed_up_at \|
	+———————————————————–+——————————–+——————————–+——————————–+
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:31:26.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \| \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-05 12:48:37.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \| \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:28:42.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \| \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:27:49.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \| \|
	\| 1628be3f8ab208-0678714f9f0ec-b34356b-1fa400-1628be3f8b825 \| 2018-04-03 07:35:51.000000 UTC \| 2018-04-09 06:08:08.000000 UTC \| \|
	\| 162914788c5fb-09337b4c437dc4-b34356b-144000-162914788c628 \| 2018-04-04 17:30:18.000000 UTC \| 2018-04-04 17:31:13.000000 UTC \| 2018-04-04 17:34:51.000000 UTC \|
	\| 162914788c5fb-09337b4c437dc4-b34356b-144000-162914788c628 \| 2018-04-04 08:31:16.000000 UTC \| 2018-04-04 17:24:52.000000 UTC \| 2018-04-04 17:34:51.000000 UTC \|
	+———————————————————–+——————————–+——————————–+——————————–+

view raw

three-steps.sql

hosted with ❤ by GitHub

Determining the First Event Occurrence for Each Step

The query above will return every combination of pricing page views, sign up page views, and sign up events for each visitor. For our funnel though, we don’t care whether they loaded the pricing page or sign up page multiple times, we only care that they did it at all. So we modify the query to only return the first instance of each event for each visitor:

	SELECT
	pricing.distinct_id,
	MIN(viewed_pricing_at) AS viewed_pricing_at,
	MIN(viewed_sign_up_at) AS viewed_sign_up_at,
	MIN(signed_up_at) AS signed_up_at
	FROM (
	SELECT
	distinct_id,
	time AS viewed_pricing_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://www.helpscout.net/pricing/%"
	) pricing
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS viewed_sign_up_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://secure.helpscout.net/members/register/%"
	) sign_up_page ON sign_up_page.distinct_id = pricing.distinct_id AND sign_up_page.viewed_sign_up_at > viewed_pricing_at
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS signed_up_at
	FROM mp.event
	WHERE
	name = "Signed Up"
	) signed_up ON signed_up.distinct_id = pricing.distinct_id AND signed_up.signed_up_at > viewed_sign_up_at
	GROUP BY 1

	+————————————————————–+——————————–+——————————–+——————————–+
	\| distinct_id \| viewed_pricing_at \| viewed_sign_up_at \| signed_up_at \|
	+————————————————————–+——————————–+——————————–+——————————–+
	\| 162931933729b2-0b3ba365e6d865-336c7b05-384000-1629319337333b \| 2018-04-04 16:58:45.000000 UTC \| 2018-04-11 20:50:13.000000 UTC \| 2018-04-11 20:51:46.000000 UTC \|
	\| 162b770945247a-0145ce1cbf076d8-495860-1fa400-162b77094541286 \| 2018-04-11 19:56:18.000000 UTC \| 2018-04-11 20:02:54.000000 UTC \| \|
	\| 162899a5ec5895-0fc657f6c6ebde-33697b07-384000-162899a5ec68c6 \| 2018-04-09 19:20:01.000000 UTC \| 2018-04-11 19:38:59.000000 UTC \| \|
	\| 162b73858df1f8-0625fc7a476d418-76313118-3d10d-162b73858e2237 \| 2018-04-11 17:19:53.000000 UTC \| 2018-04-11 17:23:22.000000 UTC \| \|
	\| 162b6d32dd11a5-0bbf34346ce939-3553629-231015-162b6d32dd2646 \| 2018-04-11 15:30:09.000000 UTC \| 2018-04-11 16:40:08.000000 UTC \| \|
	\| 162875b83f63e2-0870a441baadb6-33697b07-384000-162875b83f714f \| 2018-04-02 10:15:44.000000 UTC \| 2018-04-11 14:56:37.000000 UTC \| 2018-04-11 14:57:27.000000 UTC \|
	+————————————————————–+——————————–+——————————–+——————————–+

view raw

funnel-first-events.sql

hosted with ❤ by GitHub

Measuring the Funnel

Finally, we count how many visitors made it to each step to get our funnel:

	SELECT
	COUNT(viewed_pricing_at) AS viewed_pricing,
	COUNT(viewed_sign_up_at) AS viewed_sign_up,
	COUNT(signed_up_at) AS signed_up
	FROM (
	SELECT
	pricing.distinct_id,
	MIN(viewed_pricing_at) AS viewed_pricing_at,
	MIN(viewed_sign_up_at) AS viewed_sign_up_at,
	MIN(signed_up_at) AS signed_up_at
	FROM (
	SELECT
	distinct_id,
	time AS viewed_pricing_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://www.helpscout.net/pricing/%"
	) pricing
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS viewed_sign_up_at
	FROM mp.event
	WHERE
	name = "Viewed Page" AND
	current_url LIKE "https://secure.helpscout.net/members/register/%"
	) sign_up_page ON sign_up_page.distinct_id = pricing.distinct_id AND sign_up_page.viewed_sign_up_at > viewed_pricing_at
	LEFT JOIN (
	SELECT
	distinct_id,
	time AS signed_up_at
	FROM mp.event
	WHERE
	name = "Signed Up"
	) signed_up ON signed_up.distinct_id = pricing.distinct_id AND signed_up.signed_up_at > viewed_sign_up_at
	GROUP BY 1
	) visitor_data

	+—————-+—————-+———–+
	\| viewed_pricing \| viewed_sign_up \| signed_up \|
	+—————-+—————-+———–+
	\| 1234 \| 567 \| 89 \|
	+—————-+—————-+———–+

view raw

three-step-funnel-counts.sql

hosted with ❤ by GitHub

Voila!

From here, you can take it a step further and segment the results on a property in the first step of the funnel or add a funnel duration to limit how long visitors have to complete the funnel, etc.

If you have any questions about this, don’t hesitate to reach out.

Tracking What Pages Your Visitors View Prior to Signing Up Using Mixpanel, Fivetran, BigQuery, and Looker

April 6, 2018Mazur 2 Comments

One of the things we’ve been analyzing at Help Scout recently is what paths individual companies take before signing up for a trial. For example, looking at a company that signed up last month, did they start their journey with Help Scout on our homepage? Or did they find us via our blog? Or one of our marketing landing pages? And once we know that, what other pages did they visit before signing up? Did that company wind up becoming a customer? How much did we make from them?

This post is about how we’ve wrangled the data using Mixpanel, Fivetran, BigQuery, and Looker to help us answer these questions.

Big picture, we’re tracking page view and sign up events in Mixpanel, syncing that data to BigQuery using Fivetran, then tying it to our internal company data in Looker for easy analysis.

Step 1: Tracking Page Views and Sign Ups in Mixpanel

If you’ve used an event-based analytics service, this step will be pretty straightforward. We load the Mixpanel script on every page, then fire a Viewed Page event:

	<script>
	mixpanel.track("Viewed Page");
	</script>

view raw

viewed-page.js

hosted with ❤ by GitHub

And then if the company signs up for a trial, we fire a Signed Up event with that company’s id as a property:

	mixpanel.track("Signed Up", {
	com_id: comId
	});

view raw

signed-up.js

hosted with ❤ by GitHub

These are the only two Mixpanel events we track. If we wanted to track actions in-app, we could also fire custom events for those, but we don’t at Help Scout because we tie this Mixpanel data together with our internal data about what companies have done in-app, eliminating the need for additional Mixpanel events.

Here’s what the Viewed Page event looks like in Mixpanel’s Live View:

Screen Shot 2018-04-06 at 1.42.07 PM.png

Also as we’ll see later, Mixpanel automatically tracks a lot of details about the visitor: things like the browser, country, URL, referrer, OS, etc, all of which we can use use in our analyses.

Step 2: Syncing Mixpanel data to BigQuery with Fivetran

Fivetran is this amazing service that specializes in helping you centralize all of your data in a data warehouse. For example, we have Fivetran connectors set up for MySQL (which we use internally at Help Scout), Salesforce, HubSpot, Google Sheets, and now Mixpanel:

Screen Shot 2018-04-06 at 1.46.42 PM.png

Screen Shot 2018-04-06 at 1.51.58 PM.png

Taking Mixpanel as an example, we provide Fivetran out Mixpanel API credentials, then Fivetran queries Mixpanel’s API periodically, cleans up the results, and throws it all in BigQuery:

Screen Shot 2018-04-06 at 1.50.21 PM.png

This lets us analyze our Mixpanel data just like we would other SQL data:

SELECT COUNT(*) FROM mp.event WHERE city = "Tokyo" AND DATE(time) = "2018-04-03"

view raw

tokyo.sql

hosted with ❤ by GitHub

We can also access the custom event properties that we’re tracking for the Signed Up event using Standard SQL’s json_extract_scalar function:

	SELECT
	properties,
	JSON_EXTRACT_SCALAR(properties, "$.com_id") AS com_id
	FROM mp.event
	WHERE name = "Signed Up"
	LIMIT 1

	+—————–+——–+
	\| properties \| com_id \|
	+—————–+——–+
	\| {"com_id":1234} \| 1234 \|
	+—————–+——–+

view raw

com-id.sql

hosted with ❤ by GitHub

Once we have the company id, it’s just a matter of querying for a specific company id to view their event history:

	SELECT
	time,
	name,
	current_url
	FROM (
	SELECT
	distinct_id,
	CAST(JSON_EXTRACT_SCALAR(properties, "$.com_id") AS INT64) AS com_id
	FROM mp.event
	WHERE name = "Signed Up"
	) signed_up
	JOIN mp.event ON event.distinct_id = signed_up.distinct_id
	WHERE signed_up.com_id = 1234
	ORDER BY time ASC

	+——————————–+————-+————————————————————–+
	\| time \| name \| current_url \|
	+——————————–+————-+————————————————————–+
	\| 2018-04-05 15:38:38.000000 UTC \| Viewed Page \| https://www.helpscout.net/ \|
	\| 2018-04-05 15:38:40.000000 UTC \| Viewed Page \| https://secure.helpscout.net/members/login/ \|
	\| 2018-04-05 15:38:46.000000 UTC \| Viewed Page \| https://www.helpscout.net/pricing/ \|
	\| 2018-04-05 15:38:52.000000 UTC \| Viewed Page \| https://secure.helpscout.net/members/register/19/ \|
	\| 2018-04-05 15:39:43.000000 UTC \| Signed Up \| https://secure.helpscout.net/welcome/segmentation-questions/ \|
	+——————————–+————-+————————————————————–+

view raw

visitor-flow.sql

hosted with ❤ by GitHub

Step 3: Modeling the data in Looker

If you have access to a Business Intelligence tool like Looker, you can model this Mixpanel data and how to join it with your other data.

First, create a view to model the Mixpanel event data:

	view: mp_events {
	sql_table_name: mp.event ;;

	dimension: event_id {
	primary_key: yes
	type: number
	sql: ${TABLE}.event_id ;;
	}

	dimension: browser {
	type: string
	sql: ${TABLE}.browser ;;
	}

	dimension: browser_version {
	type: number
	sql: ${TABLE}.browser_version ;;
	}

	dimension: city {
	type: string
	sql: ${TABLE}.city ;;
	}

	dimension: full_url {
	type: string
	sql: ${TABLE}.current_url ;;
	}

	dimension: has_url_parameters {
	type: yesno
	hidden: yes
	sql: STRPOS(${full_url}, "?") > 0 ;;
	}

	dimension: current_url_without_parameters {
	hidden: yes
	label: "Current URL without Parameters"
	type: string
	sql:
	CASE
	WHEN ${has_url_parameters} THEN SUBSTR(${full_url}, 0, STRPOS(${full_url}, "?") – 1)
	ELSE ${full_url}
	END ;;
	}

	dimension: has_url_hash {
	type: yesno
	hidden: yes
	sql: STRPOS(${current_url_without_parameters}, "#") > 0 ;;
	}

	dimension: current_url {
	type: string
	sql:
	CASE
	WHEN ${has_url_hash} THEN SUBSTR(${current_url_without_parameters}, 0, STRPOS(${current_url_without_parameters}, "#") – 1)
	ELSE ${current_url_without_parameters}
	END ;;
	}

	dimension: current_url_path {
	type: string
	sql: SUBSTR(${current_url}, STRPOS(${current_url}, "helpscout.net") + LENGTH("helpscout.net")) ;;
	}

	dimension: device {
	type: string
	sql: ${TABLE}.device ;;
	}

	dimension: distinct_id {
	type: string
	sql: ${TABLE}.distinct_id ;;
	}

	dimension: initial_referrer {
	type: string
	sql: ${TABLE}.initial_referrer ;;
	}

	dimension: initial_referring_domain {
	type: string
	sql: ${TABLE}.initial_referring_domain ;;
	}

	dimension: lib_version {
	type: string
	sql: ${TABLE}.lib_version ;;
	}

	dimension: mp_country_code {
	type: string
	sql: ${TABLE}.mp_country_code ;;
	}

	dimension: mp_lib {
	type: string
	sql: ${TABLE}.mp_lib ;;
	}

	dimension: name {
	type: string
	sql: ${TABLE}.name ;;
	}

	dimension: os {
	type: string
	sql: ${TABLE}.os ;;
	}

	dimension: properties {
	type: string
	sql: ${TABLE}.properties ;;
	}

	dimension: referrer {
	type: string
	sql: ${TABLE}.referrer ;;
	}

	dimension: referring_domain {
	type: string
	sql: ${TABLE}.referring_domain ;;
	}

	dimension: region {
	type: string
	sql: ${TABLE}.region ;;
	}

	dimension: screen_height {
	type: number
	sql: ${TABLE}.screen_height ;;
	}

	dimension: screen_width {
	type: number
	sql: ${TABLE}.screen_width ;;
	}

	dimension: search_engine {
	type: string
	sql: ${TABLE}.search_engine ;;
	}

	dimension_group: time {
	label: "Event"
	type: time
	timeframes: [
	raw,
	time,
	date,
	week,
	month,
	quarter,
	year
	]
	sql: ${TABLE}.time ;;
	}

	dimension: utm_source {
	label: "UTM Source"
	type: string
	sql: JSON_EXTRACT_SCALAR(${properties}, "$.utm_source") ;;
	}

	dimension: utm_campaign {
	label: "UTM Campaign"
	type: string
	sql: JSON_EXTRACT_SCALAR(${properties}, "$.utm_campaign") ;;
	}

	dimension: utm_term {
	label: "UTM Term"
	type: string
	sql: JSON_EXTRACT_SCALAR(${properties}, "$.utm_term") ;;
	}

	dimension: utm_medium {
	label: "UTM Medium"
	type: string
	sql: JSON_EXTRACT_SCALAR(${properties}, "$.utm_medium") ;;
	}

	dimension: utm_content {
	label: "UTM Content"
	type: string
	sql: JSON_EXTRACT_SCALAR(${properties}, "$.utm_content") ;;
	}

	measure: count {
	label: "Total Events"
	type: count
	drill_fields: [event_id, name]
	}

	measure: unique_visitors {
	type: count_distinct
	sql: ${distinct_id} ;;
	}
	}

view raw

mp_events.lookml

hosted with ❤ by GitHub

Then create a view with a derived table to model the relationship between a distinct_id in Mixpanel with a company id:

	view: mp_companies {
	derived_table: {
	sql:
	SELECT
	distinct_id,
	CAST(JSON_EXTRACT_SCALAR(properties, "$.com_id") AS INT64) AS com_id
	FROM mp.event
	WHERE name = "Signed Up" ;;
	}

	dimension: distinct_id {
	type: number
	sql: ${TABLE}.distinct_id ;;
	hidden: yes
	}

	dimension: com_id {
	label: "Company ID"
	type: number
	sql: ${TABLE}.com_id ;;
	}
	}

view raw

mp-companies.lookml

hosted with ❤ by GitHub

Finally, connect those two views and any others you want to be able to analyze together:

	connection: "bigquery"

	include: "*.view.lkml"

	explore: mp_events {
	view_label: "Mixpanel"
	label: "Mixpanel"

	sql_always_where: STRPOS(${full_url}, "https://www.helpscout.net") = 1 OR STRPOS(${full_url}, "https://secure.helpscout.net") = 1 ;;

	join: mp_companies {
	view_label: "Mixpanel"
	type: left_outer
	relationship: many_to_one
	sql_on: ${mp_companies.distinct_id} = ${mp_events.distinct_id} ;;
	}

	join: helpscout_companies {
	view_label: "Company"
	relationship: one_to_one
	sql_on: ${helpscout_companies.com_id} = ${mp_companies.com_id} ;;
	}
	}

view raw

mp.lookml

hosted with ❤ by GitHub

Step 4: Analyzing the data in Looker

Once you’ve modeled the data, you can analyze the event history for specific companies, track trends like page views and unique visitors, and more:

Screen Shot 2018-04-06 at 2.28.57 PM.png

In future posts I’ll walk through some of the other interesting analyses you can perform with this data.

If you have any questions about this setup, don’t hesitate to reach out.

	+———————————————————————-+——————————–+
	\| current_url \| published_at \|
	+———————————————————————-+——————————–+
	\| https://www.helpscout.net/blog/customer-intimacy/ \| 2018-05-03 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/april-2018-release-notes/ \| 2018-05-02 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/data-security-breach/ \| 2018-04-26 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/beacon-preview-ui/ \| 2018-04-19 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/customer-service-employee-engagement/ \| 2018-04-05 00:00:00.000000 UTC \|
	\| https://www.helpscout.net/blog/march-2018-release-notes/ \| 2018-04-04 00:00:00.000000 UTC \|
	+———————————————————————-+——————————–+

Matt Mazur

Category Help Scout