Learning Data Science: 3 Months In

At the end of April I decided to take a break from Preceden and start using that time to level up my data science skills. I’m about 3 months into that journey now and wanted to share how I’m going about it in case it’s helpful to anyone.

Data Science

Data science is very broad and, depending on who you ask, it can mean a lot of different things. Some folks consider analyzing data in SQL or Excel to be data science, but to me that’s never felt quite right. I prefer a definition that leans more toward writing code that makes use of statistics, machine learning, natural language processing, and similar fields to analyze data.

Python

Going into this I had a lot of programming and data analysis experience, but hadn’t done much with Python and barely knew what regression meant.

I considered continuing to learn R, which I already have some experience in, but I’m not a huge fan of R, so I decided to start fresh and learn Python instead. Having used Ruby extensively for Preceden and other projects has made picking up Python pretty easy though.

DataCamp

DataCamp is an online platform for learning data science. They have hundreds of interactive courses and tracks covering R, Python, Excel, SQL, and more. If you have the interest and time, the $300/year they charge for access to all of their courses is nothing compared to the value they provide.

I’ve been making my way through their Machine Learning for Everyone career track which starts off with a basic introduction to Python and quickly dives into statistics, supervised learning, natural language processing, and a lot more.

Each course is a combination of video lectures and interactive coding exercises.

The courses are really well done and I feel like they’re giving me exposure to a broad range of machine learning topics. I wouldn’t say the courses go deep on any particular topic, but they provide great introductions which you can build on outside of DataCamp.

So far I’ve completed 10 of the 37 courses in this career track, plus 2 additional Python courses that weren’t in the track but were recommended prerequisites for some of the courses that are.

If you pushed straight through, a course might take 4 hours to complete, but I’m probably spending 10-15 hours on each one (so about 1 course/week). That’s because I spend a lot of extra time during and after each course writing documentation for myself and trying to apply the material to real-world data to learn it better.

Documentation

Every time I stumble across a new function or technique I spend some extra time researching it and documenting it in a public Python Cheat Sheet GitHub repository.

At first I was writing notes in markdown files, but I’ve since gotten a little savvier and am doing them in IPython Notebook files now. Here’s a recent example of documentation I wrote about analyzing time series.

I usually try to come up with a super simple example demonstrating how each function works, which helps me learn it better and serves as an easy reference when I need to brush up on it down the road.
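
For instance, a note on pandas’ resample() might boil down to something like this (made-up data, just a sketch):

import pandas as pd

# Hypothetical daily visit counts
visits = pd.Series(
    [5, 3, 8, 2],
    index=pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-08", "2020-07-09"]),
)

# resample() groups a time series into fixed periods; here, weekly totals
print(visits.resample("W").sum())  # week ending 2020-07-05: 8, week ending 2020-07-12: 10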

Real World Projects

For each course, I also try to apply the material to some real world data that I have access to, whether it be for Help Scout or Preceden.

For example, after DataCamp’s supervised learning course I spent some time trying to use Help Scout trial data to predict which trials would convert into paying customers.
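
The gist of that kind of project, with made-up file and column names (this is a sketch, not the actual Help Scout data or model):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical export of trial accounts with a "converted" label
trials = pd.read_csv("trials.csv")
features = ["num_users", "emails_sent", "docs_created"]  # made-up column names

X_train, X_test, y_train, y_test = train_test_split(
    trials[features], trials["converted"], test_size=0.2, random_state=0
)

# Fit a logistic regression and score how well it ranks trials by conversion likelihood
model = LogisticRegression()
model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))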

For any projects involving Help Scout data, I usually share a short writeup afterwards in our metrics Slack channel as a way to help educate people on data science terms and techniques.

Books

I’ve also picked up a few books, which I’ve found to be excellent resources for learning material in more depth.

YouTube

You can search YouTube for almost any data science topic and find dozens of videos about it. The quality varies, but I’ve found that watching a few on any topic is usually enough to fill in any major gaps in my understanding.

For example, last week I was working through DataCamp’s course on time series analysis and having trouble with a few concepts. A quick YouTube search for autoregressive models turned up this video, which cleared things up for me.
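
The toy example that made it click for me: simulate an AR(1) series and recover its coefficient with statsmodels (made-up data, just a sketch):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Simulate an AR(1) process: each value is 0.8 times the previous value plus noise
rng = np.random.default_rng(42)
y = np.zeros(500)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Fit an AR(1) model; the estimated lag-1 coefficient should come out near 0.8
results = AutoReg(y, lags=1).fit()
print(results.params)  # [intercept, lag-1 coefficient]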

Kaggle

After DataCamp’s course on supervised learning, I spent a lot of time trying to apply it to Kaggle’s Titanic Survival data competition.

Breaking 80% accuracy is super hard 😬
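
A bare-bones first pass looks something like this (a sketch, not a competitive entry):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # Kaggle's Titanic training data

# Minimal feature prep: encode sex as 0/1 and fill missing ages with the median
train["Sex"] = (train["Sex"] == "female").astype(int)
train["Age"] = train["Age"].fillna(train["Age"].median())

X = train[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]]
y = train["Survived"]

# 5-fold cross-validated accuracy; a setup like this tends to land right around 0.80
model = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())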

The public notebooks that other people have shared are fantastic learning resources and in the future I want to spend a lot more time trying these competitions and learning from the work others have done.

What’s Next

At the rate I’m going I should be through DataCamp’s machine learning track before the end of the year which will be a nice milestone in this journey. Along the way I’ll continue trying to apply the material to real world problems and hopefully wind up somewhat competent with these techniques when all is said and done. We shall see!

My R Cheat Sheet, now available on GitHub

Despite working on and off with R for about two years now, I can never seem to remember how to do basic things when I return to it after a few weeks away.

I recently started keeping detailed notes for myself to minimize how much time I spend figuring things out that I already learned about in the past.

You can check out my cheat sheet on GitHub here:

https://github.com/mattm/r-cheat-sheet

It covers everything from data frames to working with dates and times to using ggplot and a lot more. I’ll update it periodically as I add new notes.

If you spot any mistakes or have any suggestions for how to improve it, don’t hesitate to shoot me an email.

Chronos: An R Script to Analyze the Distribution of Time Between Two Events

If someone asked you about your site’s conversion rates, you could probably tell them what they are (right?). But what if someone asked what percentage of users convert within an hour, a day, or a week?

We’ve been looking at this at Automattic, and I wound up putting together an R script to help with the analysis. Because everything needs a fancy name, I dubbed it Chronos, and you can check it out on GitHub.

All you need to do to use it is generate a CSV containing two columns: one with the Unix timestamp of the first event and another with the Unix timestamp of the second event:

1350268044,1408676495
1307322538,1350061315
1307676110,1340667657
1307661905,1337311786
1307758702,1428877904
...

The script will then show you the distribution of time between the two events, as well as the percent that occur within a few fixed durations (30 minutes, 1 hour, etc.):

Distribution:
5% within 2 minutes 
10% within 5 minutes 
15% within 1 hour 21 minutes 
20% within 1 day 38 minutes 
25% within 3 days 2 hours 58 minutes 
30% within 6 days 9 hours 20 minutes 
33.33333% within 11 days 
35% within 14 days 
40% within 23 days 
45% within 42 days 
50% within 67 days 
55% within 95 days 
60% within 148 days 
65% within 210 days 
66.66667% within 232 days 
70% within 288 days 
75% within 390 days 
80% within 550 days 
85% within 677 days 
90% within 920 days 
95% within 1288 days 
100% within 1715 days 

Percentage by certain durations:
13% within 30 minutes
14% within 1 hour
17% within 5 hours
20% within 1 day
30% within 7 days
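
Under the hood the computation is straightforward. Here’s a rough Python sketch of the same idea, assuming the CSV is saved as events.csv (the actual script is written in R):

import numpy as np
import pandas as pd

# Two columns of Unix timestamps: first event, second event
events = pd.read_csv("events.csv", header=None, names=["first", "second"])
elapsed = events["second"] - events["first"]  # seconds between the two events

# Distribution: the elapsed time by which 5%, 10%, ... of pairs have completed
for pct in range(5, 101, 5):
    seconds = np.percentile(elapsed, pct)
    print(f"{pct}% within {seconds / 86400:.1f} days")

# Percentage completing within a few fixed durations
for label, limit in [("30 minutes", 1800), ("1 hour", 3600), ("1 day", 86400), ("7 days", 604800)]:
    print(f"{(elapsed <= limit).mean():.0%} within {label}")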

In addition to analyzing conversion rates, you can use this to measure things like retention. The data above, for example, looks at how long users went between logging their first and last beers in Adam Week’s handy beer tracking app, BrewskiMe (thank you again, Adam, for providing the data).

If you run into any issues or have any suggestions for how to improve it, just let me know.

The impact of a $15 minimum wage on a McDonald’s

There was a really interesting thread on Reddit earlier this week in the Explain Like I’m 5 (ELI5) subreddit titled How would a $15 minimum wage ACTUALLY affect a franchised business like McDonalds?

In an effort to make sure I understand the math, I’m going to try to summarize the top response. Here we go:

The Cost of Labor (COL) is the sum of your employees’ wages, benefits, and payroll taxes. On a business’s operational report, COL is usually also expressed as a percentage of net sales. Net sales is gross sales minus returns and discounts, which for a franchise like McDonald’s probably just means subtracting the value of coupons.

For the franchise the commenter is considering in his analysis (which may or may not be an actual McDonald’s), the COL is currently 28% of its net sales. So for every $1 they sell, $0.28 goes towards labor. If you buy a $15 meal, it costs $4.20 in wages on average to produce it.

(Some commenters point out that 28% is high; where they worked, the goal was 15%, and a manager who ran above 20% for a week would get fired. Those figures are for higher-end restaurants, though.)

For restaurants, there’s also Cost of Sales (COS), aka Cost of Goods, which is basically the cost of the ingredients. For this franchise, it’s also 28% of net sales. So for a $15 meal, 28% COL + 28% COS = 56%, or $8.40, towards the wages and ingredients to make it.

Then there are franchise fees (aka royalty fees, which corporate charges each franchise for running a store under their brand), which are ~10% of net sales.

COL + COS + the franchise fee make up the majority of operating costs.

For the franchise he’s looking at, the numbers for a particular week work out like this: $27,321 in net sales, minus 28% COL ($7,702), 28% COS ($7,908), and the 10% franchise fee ($2,732), leaves $8,979 remaining. Here, COL + COS are ~56% of net sales. The remaining amount has to cover the manager, assistant manager, rent/mortgage, garbage, utilities, maintenance, advertising, administrative overhead, etc.

At this restaurant, employees make $9.25/hour on average. Increasing that to $15/hour would be a 62% increase in COL (to keep it simple, we assume everyone would make $15/hour). With the same $27,321 in net sales, that would bump COL to $12,477, reducing the remaining amount to $4,204, which isn’t enough to cover all of the other costs. Now COL + COS are ~74% of net sales.

For fast food restaurants, a general rule is that you want COL + COS under 60% and need it under 65% to be profitable. Another commenter said a good goal is 50% for COL + COS. It varies by the type of restaurant; fast food is extremely competitive, so the margins are thin.

Increasing the COL by 62% would cause major issues. Raising the hourly wage to $15 increases COL by $12,477 - $7,702 = $4,775/week. If you wanted the same $8,979 remaining, you’d have to increase net sales by that $4,775/week, to $32,096, an increase of about 17%. That would probably come from higher menu prices, assuming customers were willing to pay them.
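
To sanity-check the arithmetic (all figures are from the thread):

net_sales = 27321
col = 7702   # cost of labor, ~28% of net sales
cos = 7908   # cost of sales, ~28% of net sales
fee = 2732   # 10% franchise fee

print(net_sales - col - cos - fee)  # $8,979 remaining at $9.25/hour

# Bump the average wage from $9.25 to $15/hour, a ~62% increase in labor cost
new_col = col * (15 / 9.25)             # ~$12,490; the thread cites $12,477
print(net_sales - new_col - cos - fee)  # ~$4,191 remaining (thread: $4,204)

# Net sales needed to preserve the same $8,979 using the thread's COL figure
extra = 12477 - col                          # $4,775/week
print(net_sales + extra, extra / net_sales)  # $32,096, a ~17.5% increase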

This other response and the comments on it are worth a read as well.


I’ll end by saying that I do believe the current US minimum wage is too low and think we should raise it, but… it’s complicated. If the national minimum wage were raised to $15/hour, that would also lead to a higher COS for McDonald’s, because it would cost companies more to produce the ingredients, correct? But it would also mean that people who were making less than $15 would have more money to spend, so a hypothetical 10-20% increase in menu prices might not be that bad. But if the price of everything increases, doesn’t that erode the value of those extra wages? While the Reddit discussion is interesting, it made me appreciate that there are professional economists out there who can take into account the full impact of a change like this.