Removing Query Parameters and Fragments from URLs with SQL

If you’re working with Mixpanel data, it’s important to keep in mind that current_url property represents the the full URL including query parameters and fragements. Here are a few examples, screenshots courtesy of the Chrome Mixpanel Debugger Extension:

https://www.helpscout.net/blog/

Screen Shot 2018-05-15 at 9.04.24 AM.png

https://www.helpscout.net/blog/?utm_source=example

Screen Shot 2018-05-15 at 9.04.59 AM.png

https://www.helpscout.net/blog/?utm_source=example#fragment

Screen Shot 2018-05-15 at 9.05.38 AM.png

You get the idea: The current_url property is the exact URL as displayed in the browser, query parameters and all.

This is an issue because if we’re analyzing the data to determine the number of unique visitors to a page, the query paramters and fragments shouldn’t have an impact. These should all be treated as the same page:

Without adjusting the query to take into account that these are all the same page, we’d wind up undercounting the number of visitors to it because we’d only wind up counting the version without query paramters or fragements.

Take a look at just a fraction of the URLs our visitors had when viewing a single blog post (this Mixpanel data is in BigQuery thanks to Fivetran’s Mixpanel connector):

Screen Shot 2018-05-15 at 9.23.46 AM.png

Thanks to our marketing efforts, we actually had more visitors to the post with query parameters than there were to the same post without URL parameters.

Fortunately, removing the URL parameters and fragment is relatively straightforward using Standard SQL’s STRPOS AND SUBSTR:


SELECT
current_url,
CASE
WHEN STRPOS(current_url, "?") > 0 THEN SUBSTR(current_url, 0, STRPOS(current_url, "?") 1)
WHEN STRPOS(current_url, "#") > 0 THEN SUBSTR(current_url, 0, STRPOS(current_url, "#") 1)
ELSE current_url
END AS url
FROM (
SELECT 'https://www.helpscout.net/blog/beacon-preview-ui/' AS current_url
UNION ALL SELECT 'https://www.helpscout.net/blog/beacon-preview-ui/?utm_content=1234&utm_medium=social&utm_source=twitter' AS current_url
UNION ALL SELECT 'https://www.helpscout.net/blog/beacon-preview-ui/#example' AS current_url
);
+——————————————————————————————————-+—————————————————+
| current_url | url |
+——————————————————————————————————-+—————————————————+
| https://www.helpscout.net/blog/beaconpreviewui/ | https://www.helpscout.net/blog/beaconpreviewui/ |
| https://www.helpscout.net/blog/beaconpreviewui/?utm_content=1234&utm_medium=social&utm_source=twitter | https://www.helpscout.net/blog/beaconpreviewui/ |
| https://www.helpscout.net/blog/beaconpreviewui/#example | https://www.helpscout.net/blog/beacon-preview-ui/ |
+——————————————————————————————————-+—————————————————+

With that adjustment made, you can easily count the number of page views or unique visitors to a given page, regardless of URL parameters or fragements.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s