A Frequent Communication Mistake I’ve Made as a Data Analyst

Looking back at my time so far as a data analyst, some of the biggest mistakes I’ve made were not technical in nature but around how I communicated within the organization.

Two real-world examples to illustrate:

A few years ago at Automattic our ad revenue was way down from what our marketing team expected it to be. For example (and I’m making up numbers here), for every $100 we were spending on ads, we had been making $150 historically, but in recent months we were making $25. Either the performance had gone way down or there was some issue with the tracking and reporting.

I was on the data team at the time and volunteered to work with the marketing team to investigate. As it turned out, there was indeed an issue: the way AdWords was appending UTM parameters to our URLs was breaking our tracking. For example, a visitor would click an ad and land on wordpress.com/business&utm_source=adwords – note that there’s an ampersand after the URL path instead of a question mark, so the correct UTM source wouldn’t get tracked and the customer wouldn’t get attributed to AdWords.
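To make the failure mode concrete, here is a minimal Python sketch (not the code we actually used) of why a standard URL parser misses the UTM source in that URL, plus a lenient fallback that recovers it. The function names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def utm_source(url):
    """Return utm_source from a well-formed query string, or None."""
    return parse_qs(urlsplit(url).query).get("utm_source", [None])[0]


def utm_source_lenient(url):
    """Like utm_source, but also handles '&' used where '?' belonged."""
    found = utm_source(url)
    if found is not None:
        return found
    # Without a '?', the parameters are parsed as part of the path,
    # so look for anything glued onto the path with '&'.
    _, sep, rest = urlsplit(url).path.partition("&")
    return parse_qs(rest).get("utm_source", [None])[0] if sep else None


broken = "https://wordpress.com/business&utm_source=adwords"
print(utm_source(broken))          # None
print(utm_source_lenient(broken))  # adwords
```

The strict version mirrors what most tracking code does, which is why these visitors were never attributed to AdWords.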

Fortunately, we had some event tracking set up on these pages (Tracks for the win) that recorded the full URL, so I was able to go back and determine which customers came from ads and calculate what our actual return on ad spend was. After figuring out the issue and determining how much unattributed revenue we had, I wrote up a lengthy post about what happened and published it on our internal marketing blog without informing the marketing team about it first.

Second example: a few months ago at Help Scout, we had an ambitious revenue target for Q1. With a few days left in the quarter, we were still projecting to come in short of the target and no one realistically expected us to reach it. Something about the projection seemed off to me so I dove in and realized there was a mistake in one of the calculations (it was my fault – in the projection we weren’t counting revenue that we earned that month from customers that were delinquent who then became paying again). As a result, our projection was too low and we likely were going to hit our target (and eventually did!). I wrote up a lengthy message about what happened and published it in one of our company Slack channels without informing any of the leadership about it first.

To understand the problem, it’s important to note that as a data analyst, I haven’t typically been responsible for the performance of our metrics. I help set up tracking and reporting and help ensure accuracy, but someone else in the organization is responsible for how well those metrics are doing.

In both of the cases above, I wasn’t intentionally bypassing people. At the time, it was more like “oh, hey, there’s a bug, now it’s fixed, better let everyone know about it” – and probably an element of wanting credit for figuring out the issue too.

However, not consulting with those responsible for the metrics before reporting it was a mistake for several reasons:

  • They didn’t have an opportunity to help me improve how the issue and impact were communicated with the rest of the company and its leadership.
  • I missed an opportunity to have them double-check the revised calculations, which could have been wrong.
  • Even though we were doing better than we had been reporting in both cases, it may have indirectly made people look bad because they had been reporting performance based on inaccurate data. They should not be finding out about the issue at the same time as the rest of the company.

In neither case was there any big drama about how I went about it, but it was a mistake on my part nonetheless.

Here’s what I’d recommend for anyone in a similar role: if someone else in your organization is responsible for the performance of a metric and you as a data analyst discover some issue with the accuracy of that metric, always discuss it with them first and collaborate with them on how it is communicated to the rest of the company.

It sounds obvious in retrospect, but it’s bitten me a few times so I wanted to share it with the hope that it helps other analysts out there avoid similar issues. Soft skills like this are incredibly important and worth developing in parallel with your technical skills.

If you’ve made any similar mistakes or have any related lessons learned, I’d love to hear about them in the comments or by email. Cheers!

A Simple MRR Spreadsheet Model


A buddy of mine asked for help creating a simple spreadsheet to model his startup’s Monthly Recurring Revenue (MRR) over time given various assumptions. He suggested I share it so others can check it out as well.

You can find the Google Sheet here: Simple MRR Spreadsheet Model. It lets you tinker with a few variables to estimate what your MRR will look like over the next two years. You’ll want to make a copy (File > Make a copy…) to edit it.

The variables include:

  • Monthly visitors
  • Monthly visitor growth %
  • Visitor to trial %
  • Trial to paid %
  • Average monthly revenue per customer
  • Monthly churn %

A big asterisk is that this is a very, very simple model and how your MRR actually plays out will depend on a lot of things like how long your trial period is, how your churn rate differs for new customers vs long-time ones, conversion rates and churn rates by plan, how your conversion rates change as the quality of your traffic changes, and a lot more. That said, this should get you in the right ballpark.

One important thing to note for anyone getting into recurring revenue is the impact your growth rate and churn rate have on your bottom line. For example, if you’re not growing, churn will eventually cause your MRR to plateau.

For example, here’s a model with 0% monthly growth:

Screen Shot 2018-06-15 at 2.24.58 PM.png

And here’s the exact same model with 10% monthly growth:

Screen Shot 2018-06-15 at 2.26.56 PM.png

You can play around with it to get an idea of how different numbers impact your long term growth.
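If you prefer code to spreadsheets, the same idea can be sketched in a few lines of Python. This is my assumption of how such a model fits together (each month's new customers come from that month's visitors, and churn applies to the existing customer base), not a copy of the spreadsheet's formulas. The function and parameter names are made up.

```python
def project_mrr(monthly_visitors, visitor_growth, visitor_to_trial,
                trial_to_paid, revenue_per_customer, monthly_churn,
                months=24):
    """Return a list of projected MRR values, one per month.

    Rates are fractions (0.05 means 5%).
    """
    visitors = float(monthly_visitors)
    customers = 0.0
    mrr_by_month = []
    for _ in range(months):
        new_customers = visitors * visitor_to_trial * trial_to_paid
        # Existing customers churn, then this month's new ones are added.
        customers = customers * (1 - monthly_churn) + new_customers
        mrr_by_month.append(customers * revenue_per_customer)
        visitors *= 1 + visitor_growth
    return mrr_by_month


flat = project_mrr(1000, 0.00, 0.10, 0.20, 50, 0.05)
growing = project_mrr(1000, 0.10, 0.10, 0.20, 50, 0.05)
print(round(flat[-1]), round(growing[-1]))
```

With 0% growth, MRR in this sketch levels off near new customers per month divided by the churn rate, times revenue per customer, which is the plateau described above.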

If you have any suggestions for improving it just drop a comment below or shoot me an email – thanks!

Wrangling dbt Database Permissions

One of the more time-consuming aspects of getting dbt working for me has been figuring out how to set the correct database permissions. I’m comfortable analyzing data once it’s in a data warehouse, but haven’t had a ton of experience actually setting one up. This post is for anyone else in a similar spot looking to set up dbt.

For some context going into this, I’m using Amazon RDS for Postgres with Heroku data made available by Stitch.

The big picture, to summarize Martin Melin, who gave me some tips on this in the dbt Slack, is for each person to have a user with full access to their own dedicated schema and read-only access to the source schemas.

For a quick key on the different roles in my setup:

  • mhmazur: The superuser
  • mazur: The dbt user
  • stitch: For Stitch to use
  • mode: For Mode to use

1) Create a user for dbt to use to connect to your data warehouse

With the exception of step 5, all of these need to be run from a superuser account:

-- As mhmazur
create role mazur with login password '...';

These are the credentials you’ll set in your profile.yml file for dbt to use to connect to your data warehouse.

2) Create an analytics schema

-- As mhmazur
create schema analytics;

This is where the production views and tables that dbt creates will exist. Your BI tools and analyses will use this data.

You could also set one up for development (and adjust the commands below accordingly) by changing the schema name to something like “dbt_mazur”.

If you tried to run dbt at this point, you’d get an error like “permission denied for schema analytics” because the dbt user doesn’t have access to it yet.

3) Give the dbt user full access to the analytics schema

-- As mhmazur
grant all on schema analytics to mazur;

Almost there! If you tried to run dbt now, you would get a different error: “permission denied for schema preceden_heroku” because the user doesn’t have access to the source tables yet.

4) Make the dbt user the owner of the schema

-- As mhmazur
alter schema analytics owner to mazur;

This is necessary because when you created the schema, your superuser role became its owner. We need our dbt user to be the owner so that it can grant usage permissions from dbt hooks (see down below).

5) Finally, give the dbt user read-only access to the source tables

In order to transform the data, dbt needs to be able to query the source tables. In my case, the source tables live in a “preceden_heroku” schema owned by a “stitch” user. Therefore, to grant read-only access to the dbt user, I have to log in as the “stitch” user and run the following:

-- As stitch
grant usage on schema preceden_heroku to mazur;
grant select on all tables in schema preceden_heroku to mazur;

The last command is necessary even if you don’t materialize any models as tables because it also grants select permissions on views (even though it just says “tables” in the command). Per the docs: “note that ALL TABLES is considered to include views”.
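If you need to repeat steps 2 through 5 for another user or schema, one option is a small Python helper that prints the same statements with different names. This is only a sketch: the function name and the "run as" labels are made up, and step 1 is left out on purpose so that no password ends up in a script.

```python
import re

_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def _check(name):
    """Only allow plain lowercase identifiers, since they go into SQL."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return name


def setup_statements(dbt_user, analytics_schema, source_schema):
    """Return (run_as, sql) pairs mirroring steps 2 to 5 above."""
    user = _check(dbt_user)
    analytics = _check(analytics_schema)
    source = _check(source_schema)
    return [
        ("superuser", f"create schema {analytics};"),
        ("superuser", f"grant all on schema {analytics} to {user};"),
        ("superuser", f"alter schema {analytics} owner to {user};"),
        ("source owner", f"grant usage on schema {source} to {user};"),
        ("source owner",
         f"grant select on all tables in schema {source} to {user};"),
    ]


for run_as, sql in setup_statements("mazur", "analytics", "preceden_heroku"):
    print(f"-- As {run_as}")
    print(sql)
```

Passing "dbt_mazur" as the analytics schema would give you the statements for a development schema instead.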

Undoing these changes

You may run into issues and need to reverse these changes. Here’s how:

From the “stitch” user, revoke the read-only rights you gave to the user:

-- As stitch
revoke usage on schema preceden_heroku from mazur;
revoke all privileges on all tables in schema preceden_heroku from mazur;

Then from the dbt user (“mazur” in my case) change the owner back to your superuser:

-- As mazur
alter schema analytics owner to mhmazur;

Then back from your superuser account (“mhmazur” for me), revoke permissions to the analytics schema:

-- As mhmazur
revoke all on schema analytics from mazur;

And then remove the analytics schema:

-- As mhmazur
drop schema analytics cascade;

And finally remove the user:

-- As mhmazur 
drop role mazur;

Granting read-only access to the analytics schema

One last thing: you’ll want to grant read-only access to the analytics schema so that your BI tool can execute queries. As the dbt user, run:

-- As mazur
grant usage on schema analytics to mode;
grant select on all tables in schema analytics to mode;

You’ll want to set this up as a hook so that whenever dbt changes the views and tables, your BI tool keeps the correct permissions:

 - 'grant usage on schema "{{ target.schema }}" to mode'
 - 'grant select on all tables in schema "{{ target.schema }}" to mode'

If you don’t do this, you’ll wind up getting a “permission denied for schema ______” or “permission denied for relation ______” error in your BI tool.

Happy querying!

Using Macros in dbt to Select Different Table Names in Development and Production

I’ve been experimenting a lot with dbt, a command line tool for transforming data within a data warehouse. I’ll have more to write about dbt later, but want to share a few lessons learned as I work through it for anyone else Googling for solutions in the future.

Here’s a problem I ran into and how to solve it:

In my development environment, I have an exact copy of my production Postgres database. For example, I can run a query like:

select count(*) from users;

In my data warehouse, the tables are created by Stitch which automatically adds the tables to a schema. To query for the number of users in the data warehouse, I have to prepend the schema to the table name:

select count(*) from preceden_heroku.users;

This presents a slight challenge for configuring dbt. It’s a best practice to use variables to set table names. That way, if a table name changes in the future, you can just adjust the variable and all of the models that depend on it will continue to work correctly.

However, with my setup, the table names differ between development and production so I can’t use this setup without some adjustment.

Macros to the rescue

I asked in the dbt Slack group how to work around this and Drew Banin, one of dbt’s creators, recommended using a macro. Macros are snippets of code that you can use in your dbt models like functions.

We can create a macro that accepts a table name and adjusts it based on whether dbt is being run in development or production:

In development, if we pass in an argument like “users”, the macro will simply return “users”. In production, it will prepend the table name with the schema name to return “preceden_heroku.users”.
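The macro itself is written in dbt’s templating language, but the logic is simple enough to illustrate in plain Python. In this sketch the function name and the “prod” target name are assumptions for illustration; use whatever your own targets are called.

```python
def source_table(table_name, target_name, schema="preceden_heroku"):
    """Return the table name to query for the given dbt target.

    Assumes the production target is named "prod"; any other target
    is treated as development, where tables have no schema prefix.
    """
    if target_name == "prod":
        return f"{schema}.{table_name}"
    return table_name


print(source_table("users", "dev"))   # users
print(source_table("users", "prod"))  # preceden_heroku.users
```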

Then we can use this macro in the model by passing in the table name (via the variable) to have dbt use table names without a schema in development and table names with a schema in production:

Easy peasy :)

Generating High Quality Available .com Domain Names for a Specific Industry

In my last post I detailed how to extract all of the available .com domain names from the .com zone file. In this post I’m going to show you how to do something very useful with the result: finding a great available domain name for a business in a specific industry.

For example, we’re going to find great business names that can fill in the blanks for the industry of your choosing:

  • ____________Marketing.com
  • ____________Consulting.com
  • ____________SEO.com
  • ____________Data.com
  • ____________Media.com
  • ____________Systems.com
  • ____________Law.com

The big idea: Check for keywords that are registered for other industries, but not registered for yours

Consider this: what if we looked at all of the registered domains that end with advertising.com, figured out the keyword, and then checked whether the corresponding marketing.com domain is available? For example, imagine we check and see that the domain HightowerAdvertising.com is registered (we’ll refer to Hightower as the keyword here). We can then check to see if HightowerMarketing.com is registered. Because someone already registered the keyword for the advertising industry, there’s a good chance that the keyword is meaningful and worth checking for the marketing industry as well.

We can take this a step further by checking for common keywords in multiple industries. For example, we check all the domains that end in advertising.com, all that end in media.com, see which keywords they have in common, then check which of those are not registered for marketing.com domains.

The fewer industries we check for common keywords, the more results we’ll have, but the lower the quality. The more industries we check, the fewer the results, but the higher the quality.
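Before getting to the command line version, here is the same idea as a small Python sketch using sets. The function names and the sample domains are made up for illustration, and it assumes the domains are uppercase with no .com suffix, as in domains.txt.

```python
def keywords_for(domains, industry):
    """Keywords of domains ending in the industry word (HIGHTOWER...)."""
    suffix = industry.upper()
    return {
        d[: -len(suffix)]
        for d in domains
        # Skip the bare industry word itself, which has no keyword.
        if d.endswith(suffix) and len(d) > len(suffix)
    }


def candidate_keywords(domains, industry, similar_industries):
    """Keywords registered in every similar industry but not in ours.

    Needs at least one similar industry.
    """
    common = set.intersection(
        *(keywords_for(domains, s) for s in similar_industries)
    )
    return sorted(common - keywords_for(domains, industry))


sample = [
    "HIGHTOWERADVERTISING", "HIGHTOWERMEDIA", "HIGHTOWERMARKETING",
    "ADSPACEADVERTISING", "ADSPACEMEDIA", "SOLOADVERTISING",
]
print(candidate_keywords(sample, "marketing", ["advertising", "media"]))
# ['ADSPACE']
```

Adding more similar industries shrinks the intersection, which is the quality versus quantity trade-off described above.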

Getting your command line on

If you went through my last post, you should have wound up with a domains.txt file that has about 108M registered .com domain names:

$ wc -l domains.txt 
 108894538 domains.txt

With a little bit of command line magic, we can extract all of the domains that end in ADVERTISING (like HIGHTOWERADVERTISING), then remove the trailing ADVERTISING word to get just HIGHTOWER, then sort those results and save it to a list:

$ LC_ALL=C grep ADVERTISING$ domains.txt | sed 's/.\{11\}$//' | sort -u > tmp/advertising.txt

Which will generate a list such as:

Then we do the same for MARKETING domains:

$ LC_ALL=C grep MARKETING$ domains.txt | sed 's/.\{9\}$//' | sort -u > tmp/marketing.txt

And finally, we figure out which domains are in the advertising list but not in the marketing list:

$ comm -23 tmp/advertising.txt tmp/marketing.txt > results/marketing.txt

If we want to find common keywords registered in multiple industries, we need to add an extra step to generate that list of common keywords before figuring out which ones are available in ours:

$ comm -12 tmp/advertising.txt tmp/media.txt | comm -12 - tmp/design.txt | sort -u > tmp/common.txt
$ comm -23 tmp/common.txt tmp/marketing.txt > results/marketing.txt

The resulting marketing.txt list will have the common keywords in the other industries that are likely not registered in yours:

The way to interpret this is that for a keyword like Adspace, those domains are registered in the other industries (AdspaceAdvertising.com, AdspaceMedia.com), but not registered for ours (AdspaceMarketing.com). Again, the more similar industries you check for common keywords, the higher the quality of results you’ll have. We could add three or four more industries to get a short, very high quality list.

By the way, the reason I say likely not registered is that once a domain loses its name servers – for example, if it’s way past its expiration date – it will drop out of the zone file even though the name isn’t available to register yet. Therefore some of the results might actually be registered, but a quick WHOIS check will confirm whether a given domain is:

$ whois blueheronmarketing.com

No match for domain "BLUEHERONMARKETING.COM".

Or you could just use this Ruby script

Because it’s a pain to run all of these commands while searching for available domains in an industry, I put together this small Ruby script to help:


There are instructions in the README explaining how to set the industry and similar industries in the script. If all goes well, it will run all of the necessary commands to generate the list of results:

$ ruby generator.rb 
Finding available domains for marketing...
Generating industry name lists...
Searching for domains that end with 'advertising'...
  LC_ALL=C grep ADVERTISING$ domains.txt | sed 's/.\{11\}$//' | sort -u > tmp/advertising.txt
Searching for domains that end with 'media'...
  LC_ALL=C grep MEDIA$ domains.txt | sed 's/.\{5\}$//' | sort -u > tmp/media.txt
Searching for domains that end with 'design'...
  LC_ALL=C grep DESIGN$ domains.txt | sed 's/.\{6\}$//' | sort -u > tmp/design.txt
Searching for domains that end with 'marketing'...
  LC_ALL=C grep MARKETING$ domains.txt | sed 's/.\{9\}$//' | sort -u > tmp/marketing.txt
Finding common names in industries...
  comm -12 tmp/advertising.txt tmp/media.txt | comm -12 - tmp/design.txt | sort -u > tmp/common.txt
Finding names not registered for marketing...
  comm -23 tmp/common.txt tmp/marketing.txt > results/marketing.txt
Done, results available in results/marketing.txt

And with a little luck, you’ll find a great domain in the list to use for your new business.

Extracting a List of All Registered .com Domains from the Verisign Zone File

Back in the day when I worked on Lean Domain Search I got a lot of experience working with Verisign’s .com zone file because that’s what Lean Domain Search uses behind the scenes to check whether a given domain is available to register or not.

I still get a lot of emails asking for details about how it worked so over a series of posts, I’m going to walk through how to work with the zone file and eventually explain exactly how Lean Domain Search works.

What’s a zone file?

A zone file lists all registered domains for a given Top Level Domain (like .com, .net, etc) and the name servers associated with the domain. For example, because this blog is hosted on WordPress.com, the zone file lists the WordPress.com name servers for it:


How do I get access to the zone file?

Anyone can fill out a form, apply, and get access. There are details on this page. I detailed in this old post on Lean Domain Search how I filled out the form, though it has changed since then so you’ll need to make some adjustments.

What happens after I apply for access?

Verisign will provide you details to log into the FTP to download the zone file:

Screen Shot 2018-05-18 at 1.07.14 PM.png

The zone file is the 2.91 GB com.zone.gz, which is currently 11.47 GB unzipped.

What’s in the zone file?

It begins with some administrative details, then lists domains and their associated name servers. Note that registered domains without a name server (such as ones that are close to expiring) are not included in this list.

How can I extract a list of just the domains?

Glad you asked! It takes a little bit of command line fu.

If you’d like to follow along, here are the first 1,000 lines of the zone file. You can download this and use the terminal commands below just like you would if you were working with the entire 317,338,073 line zone file.

1) First, we’ll grab a list of just the domains:

$ awk '{print $1}' com.zone > domains-only.txt

For a line like this:


This command will return just KITCHENEROKTOBERFEST.

This will also return some non-domains from the administrative section at the top of the zone file, but we’ll filter those out later.

Here’s what domains-only.txt should look like.

2) Next, we’ll sort the results and remove duplicates:

$ sort -u domains-only.txt --output domains-unique.txt

This is necessary because most domains will have multiple name servers, but we don’t want the domain to appear multiple times in our final list of domains.

Here’s what domains-unique.txt should look like.

3) Last but not least, we’ll ensure the results include only domains:

$ LC_ALL=C grep '^[A-Z0-9\-]*$' domains-unique.txt > domains.txt

There are a few things to note here.

First, make sure to use GNU grep, which is not the default on Macs. GNU grep is fast.

The LC_ALL=C forces grep to use the locale C, which tells grep this is an ASCII file, not a UTF-8 file. More details here. While not important for this 1,000-line file, it significantly reduces how much time grep takes on the full 300M+ line zone file.

The ^[A-Z0-9\-]*$ regular expression here looks for lines that are made up of letters, numbers, and dashes. The reason we use a * (0 or more characters) vs + (1 or more characters) is simply because grep’s default basic regular expressions treat an unescaped + as a literal character; to use +, you’d need to run grep -E instead.

Technically this regex will match strings that are longer than domains can actually be (the max is 63 characters) as well as strings that start or end with a dash (which isn’t valid for a domain), but there aren’t any of those in the zone file so it’s not a big deal and grep will run faster this way. If you really wanted to get fancy, you could match proper domains with grep -E (the pattern relies on extended regular expression syntax), but it will take longer to run: ^[A-Z0-9]([A-Z0-9\-]{0,61}[A-Z0-9])?$
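If you’d rather prototype in a scripting language, the three steps above can be sketched in Python. This is an illustration rather than what Lean Domain Search runs. The function name and the sample lines are made up, and it holds everything in memory, so it suits the 1,000-line sample much better than the full zone file.

```python
import re

# Loose check: letters, numbers, and dashes (at least one character).
LOOSE = re.compile(r"[A-Z0-9-]+")
# Strict check: 1 to 63 characters, no leading or trailing dash.
STRICT = re.compile(r"[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?")


def extract_domains(lines, strict=False):
    """Take the first field of each line, dedupe, filter, and sort."""
    pattern = STRICT if strict else LOOSE
    first_fields = {line.split()[0] for line in lines if line.strip()}
    return sorted(name for name in first_fields if pattern.fullmatch(name))


# Made-up lines in the general shape of a zone file.
sample = [
    "; made-up comment line",
    "$TTL 900",
    "EXAMPLE NS NS1.EXAMPLE-DNS.NET.",
    "EXAMPLE NS NS2.EXAMPLE-DNS.NET.",
    "MY-SITE NS NS1.EXAMPLE-DNS.NET.",
    "-BAD NS NS1.EXAMPLE-DNS.NET.",
]
print(extract_domains(sample))               # ['-BAD', 'EXAMPLE', 'MY-SITE']
print(extract_domains(sample, strict=True))  # ['EXAMPLE', 'MY-SITE']
```

Unlike the grep command, the loose pattern here uses + and so drops empty strings as well.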

Here’s what domains.txt should look like.

Note that this does include some domain-like strings from the administrative section like 1526140941 which isn’t actually a domain. Depending on what you’re using the zone file for you could remove these lines, but it’s never been a big deal for my use case. Because Lean Domain Search is limited to letters-only domains, it actually just uses ^[A-Z]* for the regex.

Here’s some actual code from Lean Domain Search with these steps above:

Screen Shot 2018-05-18 at 1.43.18 PM.png

If you run into any trouble or have suggestions on how to improve any of these commands, don’t hesitate to reach out. Cheers!

Removing Query Parameters and Fragments from URLs with SQL

If you’re working with Mixpanel data, it’s important to keep in mind that the current_url property represents the full URL including query parameters and fragments. Here are a few examples, screenshots courtesy of the Chrome Mixpanel Debugger Extension:


Screen Shot 2018-05-15 at 9.04.24 AM.png


Screen Shot 2018-05-15 at 9.04.59 AM.png


Screen Shot 2018-05-15 at 9.05.38 AM.png

You get the idea: The current_url property is the exact URL as displayed in the browser, query parameters and all.

This is an issue because if we’re analyzing the data to determine the number of unique visitors to a page, the query parameters and fragments shouldn’t have an impact. These should all be treated as the same page:

Without adjusting the query to take into account that these are all the same page, we’d wind up undercounting the number of visitors to it because we’d only wind up counting the version without query parameters or fragments.

Take a look at just a fraction of the URLs our visitors had when viewing a single blog post (this Mixpanel data is in BigQuery thanks to Fivetran’s Mixpanel connector):

Screen Shot 2018-05-15 at 9.23.46 AM.png

Thanks to our marketing efforts, we actually had more visitors to the post with query parameters than there were to the same post without URL parameters.

Fortunately, removing the URL parameters and fragment is relatively straightforward using Standard SQL’s STRPOS and SUBSTR:

With that adjustment made, you can easily count the number of page views or unique visitors to a given page, regardless of URL parameters or fragments.
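The same find-and-cut logic can be sketched outside of SQL too. Here is a minimal Python version, handy for sanity-checking a few URLs by hand. The function name and example URLs are made up for illustration.

```python
def strip_query_and_fragment(url):
    """Cut the URL at the first '?' or '#', whichever comes first."""
    cut = len(url)
    for marker in ("?", "#"):
        position = url.find(marker)
        if position != -1:
            cut = min(cut, position)
    return url[:cut]


print(strip_query_and_fragment("https://example.com/blog/post?utm_source=x#top"))
# https://example.com/blog/post
```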