Learning Data Science: 3 Months In

At the end of April I decided to take a break from Preceden and start using that time to level up my data science skills. I’m about 3 months into that journey now and wanted to share how I’m going about it in case it’s helpful to anyone.

Data Science

Data science is very broad and depending on who you ask it can mean a lot of different things. Some folks would consider analyzing data in SQL or Excel as data science, but to me that’s never felt quite right. I prefer a definition that leans more on writing code that makes use of statistics, machine learning, natural language processing, and similar fields to analyze data.

Python

Going into this I had a lot of programming and data analysis experience, but hadn’t done much with Python and barely knew what regression meant.

I considered continuing to learn R which I already have some experience in, but I’m not a huge fan of R so decided to start fresh and learn Python instead. Having used Ruby extensively for Preceden and other projects has made learning Python pretty easy though.

DataCamp

DataCamp is an online learning platform to help people learn data science. They have hundreds of interactive courses and tracks for learning R, Python, Excel, SQL, etc. If you have the interest and time, the $300/year they charge for access to all of their courses is nothing compared to the value they provide.

I’ve been making my way through their Machine Learning for Everyone career track which starts off with a basic introduction to Python and quickly dives into statistics, supervised learning, natural language processing, and a lot more.


Each course is a combination of video lectures and interactive coding exercises.

The courses are really well done and I feel like they’re giving me exposure to a broad range of machine learning topics. I wouldn’t say the courses go deep on any particular topic, but they provide great introductions which you can build on outside of DataCamp.

So far I’ve completed 10 out of 37 courses in this career track, plus 2 additional Python courses that are not in the track but were recommended prerequisites for some of the courses that are.

If you pushed through a course it might take 4 hours to complete, but I’m probably spending 10-15 hours on each course (so about 1 course/week). This is because I spend a lot of extra time during and after the course writing documentation for myself and trying to apply the material to real-world data to learn it better.

Documentation

Every time I stumble across a new function or technique I spend some extra time researching it and documenting it in a public Python Cheat Sheet GitHub repository.

At first I was writing notes in Markdown files, but have since gotten a little savvier and am writing them in IPython Notebook files now. Here’s a recent example of documentation I wrote about analyzing time series.


I usually try to come up with some super simple example demonstrating how each function works which helps me learn it better and serves as an easy reference guide when I need to brush up on it when applying it down the road.

Real World Projects

For each course, I also try to apply the material to some real world data that I have access to, whether it be for Help Scout or Preceden.

For example, after DataCamp’s supervised learning course I spent some time trying to use Help Scout trial data to predict which trials would convert into customers.
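To give a feel for what that kind of project involves, here is a minimal sketch of a conversion classifier. It is not the actual Help Scout analysis: the data is synthetic, the two features (logins and teammates invited) are hypothetical, and the model is a small hand-rolled logistic regression so the example runs with only the standard library. In practice a library such as scikit-learn would be the more usual choice.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_trials(n, rng):
    """Generate synthetic trials: ((logins, invites) scaled to 0..1, converted)."""
    rows = []
    for _ in range(n):
        logins = rng.randint(0, 10)   # hypothetical feature
        invites = rng.randint(0, 5)   # hypothetical feature
        # Assumed ground truth: more activity means a higher chance of converting.
        p = sigmoid(0.5 * logins + 0.8 * invites - 4.0)
        converted = 1 if rng.random() < p else 0
        rows.append(((logins / 10, invites / 5), converted))
    return rows


def fit_logistic(rows, lr=1.0, steps=1000):
    """Fit two weights and a bias with full-batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in rows:
            err = sigmoid(w[0] * x0 + w[1] * x1 + b) - y
            g0 += err * x0
            g1 += err * x1
            gb += err
        w[0] -= lr * g0 / n
        w[1] -= lr * g1 / n
        b -= lr * gb / n
    return w, b


def accuracy(w, b, rows):
    """Share of rows where the predicted class (threshold 0.5) matches the label."""
    hits = 0
    for (x0, x1), y in rows:
        predicted = 1 if sigmoid(w[0] * x0 + w[1] * x1 + b) >= 0.5 else 0
        hits += predicted == y
    return hits / len(rows)


rng = random.Random(42)
data = make_trials(600, rng)
train, test = data[:450], data[450:]  # hold out the last 150 trials

w, b = fit_logistic(train)
print("held-out accuracy:", round(accuracy(w, b, test), 3))
```

Scoring on trials the model never saw during fitting is the important part: accuracy on the training rows alone would overstate how well the model predicts new trials.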

For any projects involving Help Scout data, I usually share a short writeup afterwards in our metrics Slack channel as a way to help educate people on data science terms and techniques.

Books

I’ve also picked up a few books which I’ve found to be excellent resources for learning material in more depth.

YouTube

You can search YouTube for almost any data science topic and find dozens of videos about it. The quality varies, but I’ve found that watching a few on any topic is usually enough to fill in any major gaps in my understanding.

For example, last week I was working through DataCamp’s course on time series analysis and having trouble with a few concepts. A quick search on YouTube for videos on autoregressive models turned up a video that cleared things up for me.
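For readers who haven't met autoregressive models before, here is a minimal sketch of the simplest case, an AR(1) process, where each value is a fraction of the previous value plus random noise. The function names and the coefficient 0.7 are just assumptions for illustration; the course and the video may present the idea differently.

```python
import random


def simulate_ar1(phi, n, rng, noise_sd=1.0):
    """Simulate x[t] = phi * x[t-1] + noise, starting from zero."""
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, noise_sd))
    return series


def estimate_phi(series):
    """Least-squares estimate of phi from regressing x[t] on x[t-1] (no intercept)."""
    prev, curr = series[:-1], series[1:]
    numerator = sum(p * c for p, c in zip(prev, curr))
    denominator = sum(p * p for p in prev)
    return numerator / denominator


rng = random.Random(0)
series = simulate_ar1(phi=0.7, n=5000, rng=rng)
phi_hat = estimate_phi(series)
print("estimated phi:", round(phi_hat, 2))
```

With a few thousand points the estimate should land close to the true coefficient, which is the core idea behind fitting an autoregressive model: today's value is predicted from yesterday's.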

Kaggle

After DataCamp’s course on supervised learning I spent a lot of time trying to apply it to Kaggle’s Titanic Survival data competition.


Breaking 80% accuracy is super hard 😬

The public notebooks that other people have shared are fantastic learning resources and in the future I want to spend a lot more time trying these competitions and learning from the work others have done.

What’s Next

At the rate I’m going I should be through DataCamp’s machine learning track before the end of the year which will be a nice milestone in this journey. Along the way I’ll continue trying to apply the material to real world problems and hopefully wind up somewhat competent with these techniques when all is said and done. We shall see!

What’s Up

It’s been a while since I’ve written on this blog, so wanted to take a break from my normal routine to say hi to you long-term readers and share a few updates about what’s going on in my world.

Work-wise, I’m very fortunate to not have been heavily impacted by Covid so far. I’m still consulting with Help Scout where I oversee their analytics and business intelligence efforts. I had been consulting with Automattic as well, but left earlier this year to focus more on growing Preceden, my long-running timeline maker tool. After a few months though I had knocked out most of my big todo list items for Preceden, so started looking for something new to work on and decided I wanted to learn machine learning. My goal at the moment is to find some valuable ways to apply machine learning to help grow Preceden and Help Scout.

These days, my mornings are mostly spent getting better at machine learning through a combination of courses on DataCamp, books, and projects. Preceden is on the backburner, though I do spend some time each week working on support and fixing occasional bugs. My afternoons are spent with Help Scout where I spend a lot of time using dbt and Looker to help the team gain insights through data.

Family-wise, we moved from Florida to North Carolina last summer and we’ve been very happy with the move. My kids are 5, 4, and 2 now and keep my wife and me very busy.

Health-wise, I’ve been experimenting with high-intensity interval training (HIIT) workouts on YouTube, which I enjoy because they’re short but also get you sweating a lot. Most benefits I get from those are negated by a suboptimal diet though (Chick-fil-A and Dunkin’ Donuts are so good…).

I recently finished Ozark on Netflix and highly recommend it, especially if you enjoyed shows like Breaking Bad or Narcos.

I also usually play one, sometimes two online poker tournaments with friends each week – if you’re interested in joining shoot me an email.

I’ll try not to let a year go between blog posts in the future, but no promises 😁.

Hope everything is going as well with you all.

Changing my Mind on When to Include Table Names in SQL Queries

I haven’t made too many changes to how I write SQL lately, but I did adopt a new convention recently that I really like, so I wanted to share it.

In the past, I would have written the following query like so:

select
  email,
  sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id
group by email

Note that this does not prefix email or amount with the table name where they came from.

Claire Carroll of dbt fame recently pinged me to suggest a change: whenever there’s a join involved, you should include the table name to make it clear where the column originated. The query above would look like this:

select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id
group by users.email

When there’s no join involved, it’s fine to leave it out because there’s no room for confusion:

select
  id,
  name
from companies

I’ve been following this convention for a few weeks and really like it because there’s zero ambiguity about where each column originated. It’s more verbose obviously, but I think the extra clarity outweighs that downside.
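If you want to try the two styles side by side, here is a small sketch using Python's built-in sqlite3 module. The table contents are made up, the schema is only assumed from the queries above, and a `group by` is included so the sums come out per user. It also shows the one case where the database itself forces the issue: a column name such as `id` that exists in both tables.

```python
import sqlite3

# In-memory database with made-up rows; the schema is assumed from the queries above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table users (id integer primary key, email text);
    create table charges (id integer primary key, user_id integer, amount integer);
    insert into users (id, email) values (1, 'a@example.com'), (2, 'b@example.com');
    insert into charges (id, user_id, amount) values (1, 1, 10), (2, 1, 15), (3, 2, 20);
""")

unqualified = """
    select
      email,
      sum(amount) as total_revenue
    from users
    inner join charges on users.id = charges.user_id
    group by email
    order by email
"""

qualified = """
    select
      users.email,
      sum(charges.amount) as total_revenue
    from users
    inner join charges on users.id = charges.user_id
    group by users.email
    order by users.email
"""

# Both styles return the same rows; only the readability differs.
rows_unqualified = conn.execute(unqualified).fetchall()
rows_qualified = conn.execute(qualified).fetchall()
print(rows_qualified)

# `id` exists in both tables, so leaving off the table name is an error here.
try:
    conn.execute(
        "select id from users inner join charges on users.id = charges.user_id"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print(ambiguous_error)
```

The database only complains when a name is genuinely ambiguous, so the convention is about helping the human reader rather than satisfying the query planner.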

I’ve updated the style guide to reflect this convention.

I’d love to hear your thoughts – do you always include the table name, only when necessary, or follow a convention like this and only include it when joins are involved?

An Interview About My Work as a Data Analyst

I recently had the pleasure of being interviewed by Simon Ouderkirk about my journey to becoming a data analyst and lessons learned along the way. You can check out a recording of the interview here:

This came about because Simon volunteered to start this interview series for the Locally Optimistic data community. And because Simon and I have worked closely together for a number of years (first when I was full time at Automattic and now as a consultant), he pinged me to see if I wanted to be a guinea pig for this first interview :).

If data and analytics interest you, I highly recommend checking out the Locally Optimistic blog and Slack community linked to above; you’ll walk away thinking a lot more deeply about data, metrics, and running an effective analytics organization. You can also follow Simon on Twitter to learn about future interviews in this series.