In order to level up my data visualization skills, I’m going to attempt to recreate well-designed charts that others have made, using R and eventually other tools. I’ll be open sourcing the code and sharing any lessons learned along the way.
For the first chart, I grabbed a seemingly simple chart from Scott Berinato’s excellent book, Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations.
Here’s the chart that I set out to reproduce in R (he created it in Illustrator):
Here’s my attempt using R and ggplot2 whose code you can find in my new data viz repo on GitHub:
There were several tricky aspects of the original chart that presented ~~challenges~~ learning opportunities while implementing it in R:
- In order to get the chart title, y-axis title, and source aligned with the left side of the chart, I wound up having to use R’s Graphical User Objects aka “grobs”; hat-tip to this SO thread for the idea. It was possible to get pretty close by adjusting the hjust value for the chart title and subtitle, but the grob approach seemed less hacky.
- Speaking of hacky, I wasn’t able to figure out an elegant way to center the x-axis values between the ticks. I resorted to tinkering with the hjust value, but that just shifts them all to the right by a fixed amount. The problem is that some text like “2010” is wider than other text like “2011” which makes it impossible to get them all centered by shifting them by the same fixed amount. If anyone has a solution to this, I’d love to learn a better way to do it.
- Removing the tick for the bottom ($0) y-axis value was “fun”: I had to make it transparent because I couldn’t figure out how to remove it completely. Again, Stack Overflow is awesome.
- From a coding best practices perspective, there were several appearance-related values that had to be used in multiple locations in the code. To avoid duplication, I stored them in variables at the top of the script. It seems strange though having some appearance values stored in variables but others directly in the ggplot2 code. Storing all of the appearance values in variables – even the ones only used once – seems like bad practice too though.
- I couldn’t figure out how to get the ends of the x-axis line to stop at the ticks so they wind up extending beyond them slightly in my version.
- I’m not sure what fonts Scott used in Illustrator to create this chart, but even if I were 100% sure, I didn’t want to shell out money for custom fonts for this little project, so I used similar ones that were already available on my MacBook (PT Sans Narrow for the title, Proxima Nova for everything else).
- Towards the end there were diminishing returns trying to get everything pixel perfect so I settled on a 90% good enough solution.
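A couple of the tricks above can be sketched in code. This is not my actual chart code (that’s in the repo) — just a minimal, hypothetical example on a built-in dataset showing the appearance variables at the top of the script and the grob-based left alignment:

```r
library(ggplot2)
library(grid)

# Appearance values that get reused live in variables at the top of the script
title_color <- "#333333"
point_color <- "#5f9ea0"

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = point_color) +
  labs(title = "Car weight vs. fuel economy",
       subtitle = "Motor Trend data, 1974") +
  theme(plot.title = element_text(color = title_color))

# Left-align the title and subtitle with the edge of the chart by editing
# the underlying grid objects ("grobs") rather than fiddling with hjust
g <- ggplotGrob(p)
g$layout$l[g$layout$name %in% c("title", "subtitle")] <- 1
grid.newpage()
grid.draw(g)
```

The gist of the grob approach: `ggplotGrob` converts the plot into a `gtable`, whose layout table says which column each piece occupies, so moving the title and subtitle to column 1 snaps them to the far left.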
As I recreate more charts, there will undoubtedly be “a-ha!” moments that cause me to go back and clean up the code for my old charts. I’ll try to share as many of those lessons learned here as well.
Just a note for those of you who follow me online that after almost four years at Automattic, I’m moving on to try something new.
Automattic is an amazing company and I’m incredibly grateful for having had the chance to work there with such a talented group of people.
As for what’s next: later this month I’ll be joining Help Scout to work as a data scientist on their growth team. I’m really excited by the opportunity and hope to share a lot of what I learn on this blog. More to follow!
Over the last few years I’ve read maybe 2-3 books per year and a good chunk of those were audiobooks so I’m not sure they even qualify as reading. Part of that is that reading hasn’t been a priority for me: given a choice between working on side projects, spending time with my family, watching Game of Thrones and other shows, or reading, I normally don’t choose to read. Another part of it is that I’ve found it harder and harder to sit and read for more than a few minutes; years of Facebook and Twitter have taken their toll on my attention span.
That all changed recently when I made a few modifications to my reading habits. Many of these changes were inspired by discussions on Seeking Wisdom, a fantastic new podcast by David Cancel and Dave Gerhardt about a wide variety of topics including building great products, marketing, personal growth, work-life balance, and more.
Here are the key changes I made:
- Reading physical books, not ebooks. Many people love the convenience of ebooks, but I’ve always preferred physical books. Rather than try to force myself to read on a Kindle or a tablet, I’ve switched back to reading physical books. There’s just something about the texture and smell of physical books that I really enjoy. Also, by not reading on the Kindle app on my iPad I can avoid notifications and other distractions like checking my email, Facebook, etc.
- Reading multiple books at once. In the past, I’ve always focused on reading one book at a time but by having multiple books to choose from I can select one depending on the mood I’m in which makes it more likely that I’ll actually do it. If I’m not in the mood to read a business book, for example, I have plenty of others to choose from. I no longer have to tell myself “I’ll read that book after I finish this one.” It does take longer to finish a book, but that’s fine; the goal of reading isn’t to finish books.
- Allowing myself to stop reading books that don’t interest me. In the past I would try to force myself to get through any book I started. The problem is that I’d inevitably wind up starting a book that didn’t grab me, then I’d quit reading altogether because I hadn’t finished it.
- Skipping parts that don’t hold my attention. Years ago I started reading Poor Charlie’s Almanack, a treasure trove of wisdom from Berkshire Hathaway’s Charlie Munger. The first chapter is about Charlie’s life and, for whatever reason, I struggled to get through it, so I stopped reading the book completely. This time around, I skipped that chapter and dove straight into the later chapters, which I’ve found a lot more engaging.
- Making it a habit. Every night before I go to sleep, I try to read for 20 to 30 minutes. I do wind up watching less Netflix/HBO to make time for it, but that’s a tradeoff that I’m happy to make. As an added benefit, I’ve been finding it much easier to fall asleep after reading vs working on a side project or watching TV.
A lot of these seem obvious in retrospect, but it took a while for their importance to really click for me. I’d encourage anyone who wants to get back into reading to experiment and see what works for you.
If you have any other suggestions on how to read more effectively, don’t hesitate to drop me an email or leave a comment below.
Despite working on and off with R for about two years now, I can never seem to remember how to do basic things when I return to it after a few weeks away.
I recently started keeping detailed notes for myself to minimize how much time I spend figuring things out that I already learned about in the past.
You can check out my cheat sheet on GitHub here:
It covers everything from data frames to working with dates and times to using ggplot and a lot more. I’ll update it periodically as I add new notes.
If you spot any mistakes or have any suggestions for how to improve it, don’t hesitate to shoot me an email.
We’re now accepting applications for a new Marketing Data Analyst position at Automattic that might interest some of you:
In this role you’d be helping us use data and analytics to guide the direction of our marketing efforts on WordPress.com.
Here’s the official description:
Automattic is looking for a data analyst to join the marketing team. You will distill data into actionable insights to guide our customer marketing and retention strategy as well as inform product development efforts.
Primary responsibilities include:
- Build and maintain standardized reporting on key metrics across the customer lifecycle.
- Develop customer segmentation models to inform tailored, multi-channel marketing strategies.
- Conduct ad hoc analyses to better understand customer behavior, needs, and individual test results.
- Partner with other analysts and developers to increase data accessibility across the organization.
- Design a process for prioritizing and communicating data requests.
Requirements:
- Proficiency in SQL and Excel.
- Experience with web analytics platforms such as Google Analytics, KISSmetrics, or Mixpanel.
- Experience working with marketing teams to deliver analyses and answer business questions.
- The ability to communicate data in a way that is easy to understand and presents clear recommendations.
- A highly collaborative approach and experience working with business owners, executives, developers, and creatives to discuss data, strategy, and tests.
- Excellent prioritization and communication skills.
- Ideally, web development experience (though it’s not required).
Like all positions at Automattic, you’ll work remotely, and can be based wherever you live. It’s not a requirement that you live in or relocate to the Bay Area to do this job.
If this sounds interesting to you (and how could it not?!?) there are instructions at the bottom of the job description about how to apply.
And if you have any questions about Automattic or this data analyst position, feel free to drop me an email: firstname.lastname@example.org.
If you work on a SaaS product, you probably have a good idea of what its cancellation rates are, but chances are you don’t know how those rates change over time. For example, what % of users cancel after 1 day? How about after 7 days, 30 days, etc.?
I worked on a project at Automattic this week to help us understand the cancellation curves for WordPress.com’s plans and am open sourcing the R script so anyone can do the same for their service.
Here’s an example of how the cancellation curves might look for a service with a Gold, Silver, and Bronze plan:
We can see that most users who cancel do so pretty quickly and that long term about 30% of Gold plan, 20% of Silver plan, and 10% of Bronze plan subscriptions wind up cancelled.
To generate this data for your own product, you’ll just need three data points for each subscription: when it was purchased, when it was cancelled (if it was), and the name of the subscription. The script will take care of analyzing the data and generating the visualization.
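The actual logic lives in the script, but the core computation is simple enough to sketch. Here’s a minimal, hypothetical version — the column names and made-up data below are my own, not from the script — which, for simplicity, ignores the fact that recently purchased subscriptions haven’t had much time to cancel yet:

```r
library(ggplot2)

# Hypothetical input: one row per subscription with its plan name,
# purchase date, and cancellation date (NA if still active)
subs <- data.frame(
  plan          = c("Gold", "Gold", "Silver", "Bronze"),
  purchase_date = as.Date(c("2016-01-01", "2016-01-05", "2016-01-03", "2016-01-10")),
  cancel_date   = as.Date(c("2016-01-08", NA, "2016-02-01", NA))
)

days <- 0:60
curves <- do.call(rbind, lapply(split(subs, subs$plan), function(df) {
  days_to_cancel <- as.numeric(df$cancel_date - df$purchase_date)
  data.frame(
    plan = df$plan[1],
    day  = days,
    # % of this plan's subscriptions cancelled within `day` days of purchase
    pct_cancelled = sapply(days, function(d) {
      mean(!is.na(days_to_cancel) & days_to_cancel <= d) * 100
    })
  )
}))

ggplot(curves, aes(x = day, y = pct_cancelled, color = plan)) +
  geom_line() +
  labs(x = "Days since purchase", y = "% of subscriptions cancelled")
```

Each curve is a running cumulative percentage, which is why the curves rise quickly at first and then flatten out at each plan’s long-term cancellation rate.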
You can check out the script and additional details on GitHub here: Cancellation Curve Visualizer.
If you have any questions or run into any issues, don’t hesitate to drop me a note.
Andrew Allemann has a great post on Domain Name Wire where he tries to estimate the impact of GoDaddy’s five-year default domain name registration option.
GoDaddy’s shopping cart defaults to a five-year registration period when you place a domain name in your cart. Most people switch this back to just one year, but some don’t. Whether they merely overlook this or decide it makes sense to register the domain name for five years, about 3.5% of new .com registrations at GoDaddy each month are for five years.
Here’s a summary of the math:
- In June, the .com registry showed 26,750 five-year registrations from GoDaddy, which accounted for 3.48% of all of its new .com registrations, so GoDaddy registered 26,750 / 0.0348 ≈ 769K domains in total.
- On average across all registrars, only 1.66% of new .com registrations are for 5 years.
- Had GoDaddy matched that average, it would have registered only 769K × 1.66% ≈ 12,750 five-year registrations.
- That works out to a difference of 26,750 − 12,750 ≈ 14K five-year registrations, or 14K × 5 = 70K years of registrations.
- Assuming those users would have purchased one-year registrations if five years weren’t the default, that works out to 70K − 14K = 56K extra years of registrations per month thanks to the five-year default.
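The arithmetic above is easy to sanity-check in a few lines of R (the inputs are the figures from the post; the ~769K, ~12,750, etc. are rounded):

```r
five_year_regs  <- 26750   # GoDaddy's five-year .com registrations in June
five_year_share <- 0.0348  # share of all of GoDaddy's new .com registrations
avg_share       <- 0.0166  # average five-year share across all registrars

total_regs  <- five_year_regs / five_year_share  # ~769K total registrations
at_avg_rate <- total_regs * avg_share            # ~12,750 five-year registrations
extra_regs  <- five_year_regs - at_avg_rate      # ~14K extra five-year registrations

# Each extra five-year registration replaces an assumed one-year registration,
# so the net gain is 4 extra years per registration: ~56K registration-years
extra_years <- extra_regs * 5 - extra_regs

round(c(total_regs, at_avg_rate, extra_regs, extra_years))
```

The last step is the one that’s easy to miss: the 70K gross registration-years have to be reduced by the 14K one-year registrations those customers presumably would have bought anyway.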
Regardless of how you feel about GoDaddy, you’ve got to admit that they’re really effective at upselling.
Hat-tip to my coworker Wendy for sharing the post.