Exploring ChatGPT’s Knowledge Cutoff

A recurring topic of discussion on the OpenAI forums, on Reddit, and on Twitter is what ChatGPT’s knowledge cutoff date actually is. It seems like it should be straightforward to figure out (just ask it), but it can be confusing due to ChatGPT’s inconsistent answers about its cutoff month, differences from the official documentation, and varying capabilities between the API and the playground.

ChatGPT’s knowledge cutoff is of interest to me because I run Preceden, an AI timeline maker, and when users ask it to generate a timeline about a recent event, I need to know whether the ChatGPT API will respond with reliable information about that topic.

In addition to the knowledge cutoff month, I was also curious what GPT-4 Turbo’s knowledge cutoff of ‘April 2023’ actually means: does it refer to the beginning of that month, the end, or somewhere in between? Does the quality of its knowledge decline as we approach that date? And does it sometimes have knowledge about events after that date?

Seeking unambiguous, easily verifiable knowledge

How can we evaluate the accuracy of ChatGPT’s knowledge (or any LLM for that matter) over time?

After brainstorming a few options like news headlines and election results, I arrived at the same conclusion that some forum commenters did: ask ChatGPT about celebrity deaths and, specifically, celebrity death dates.

While a bit morbid (similar to the well-known Titanic survival prediction challenge), celebrity death dates work well for our purpose because they are widely reported (increasing the odds that the LLM was exposed to them as part of its training data) and the LLM’s response is easy to verify (if the person died on Feb 10, 2023 but the LLM says they are alive or that they died in March, then it’s clearly incorrect).

How we’ll go about this

USA Today has pages that list celebrity deaths each year, including an ongoing list for 2023.

We’ll scrape that data to get a list of 168 celebrities who died between Jan 1, 2022 and August 1, 2023 (several months after GPT-4 Turbo’s official knowledge cutoff, so we can test whether it knows anything about more recent celebrity deaths).
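Here’s a rough sketch of that scraping step using requests and BeautifulSoup. The page structure is an assumption for illustration only; the real parsing logic lives in the scraper repo linked below.

# Sketch of the scraping step. The selectors here are assumptions for
# illustration; see the celebrity-deaths-scraper repo for the real parsing logic.
import requests
from bs4 import BeautifulSoup

def fetch_death_entries(url):
    """Fetch a celebrity-deaths page and return the text of its paragraphs."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # In practice you'd target the specific elements containing each entry;
    # here we simply grab every paragraph and filter them downstream.
    return [p.get_text(strip=True) for p in soup.find_all("p")]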

We’ll then repeatedly ask the ChatGPT API about the death dates of the celebrities in that list. We’ll do this at multiple temperatures to minimize the impact temperature might have on the results (temperature is a parameter that controls the “creativity”, or randomness, of the response).
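Here’s roughly what that query loop looks like with the openai Python package (v1+). This is a sketch rather than the exact notebook code, and it assumes an API key is set in the environment:

# Sketch of the query loop: ask gpt-4-1106-preview about the same batch of
# celebrities repeatedly at several temperatures. Not the exact notebook code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "..."  # the full death-date prompt shown below
temperatures = [0, 0.25, 0.5, 0.75, 1]
runs_per_temperature = 10

def ask_about_deaths(temperature):
    """Send the death-date prompt once and return the raw text of the reply."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# One raw response per (temperature, run) pair: 5 temperatures x 10 runs each
results = [
    (temperature, run, ask_about_deaths(temperature))
    for temperature in temperatures
    for run in range(runs_per_temperature)
]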

Here’s what the prompt looks like (note that I’m including the celebrity’s birth date to avoid any ambiguity when there are multiple celebrities with the same name):

Given a list of celebrity names and their birth dates, generate a JSON object containing their death dates.

For example, given 'Michael Jackson (1958-08-28)', you should output: { "Michael Jackson": "2009-06-25" }

If the person is still alive, return a blank string for their death date.

Now, return a JSON object containing every celebrity in this list:

1. Dan Reeves (1944-01-19)

2. Sidney Poitier (1927-02-20)

3. Peter Bogdanovich (1939-07-30)

4. Bob Saget (1956-05-17)

We’ll then compare the death dates ChatGPT returns to the actual death dates to determine how accurate its responses are. Occasionally GPT gives dates that are a day or two off, so I made a judgment call: I’m counting a date as accurate as long as it is close to the actual death date. A near miss still indicates GPT is at least aware that the celebrity died, even if there is slight ambiguity about the exact date, possibly stemming from when the death occurred vs. when it was reported.
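In code, that comparison is just a date-difference check with a small tolerance. A minimal sketch (the exact tolerance used in the notebook may differ):

# Treat a predicted death date as accurate if it falls within a couple of days
# of the actual date. The exact tolerance in the notebook may differ.
from datetime import date

def is_accurate(predicted, actual, tolerance_days=2):
    """predicted is GPT's date string; an empty string means GPT said the person is alive."""
    if not predicted:
        return False
    return abs((date.fromisoformat(predicted) - actual).days) <= tolerance_days

is_accurate("2023-03-12", date(2023, 3, 11))  # True: off by one day still counts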

Finally, we’ll analyze the results to see how often ChatGPT returned an accurate response.
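With each response scored, the per-month accuracy is a simple groupby over the month the celebrity actually died in. A minimal pandas sketch, assuming one row per API response with an accurate boolean column:

# One row per API response: the month the celebrity died in and whether GPT's
# answer was judged accurate. These rows are illustrative, not real results.
import pandas as pd

df = pd.DataFrame({
    "death_month": ["2022-01", "2022-01", "2023-03", "2023-03"],
    "accurate":    [True,      True,      True,      False],
})

accuracy_by_month = df.groupby("death_month")["accurate"].mean() * 100
print(accuracy_by_month)  # accuracy % for each month of death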

You can find the source code for the USA Today scraper in this celebrity-deaths-scraper repo and the full source code for the Jupyter Notebook used in this analysis in this llm-knowledge-cutoff repo. While this analysis focuses on GPT-4 Turbo, the notebook is easy enough to customize for other ChatGPT models or other LLMs for anyone interested in continuing this work.

Things to keep in mind going into this

  • I don’t have a deep understanding of how LLMs work, which may lead me to interpret some of these results incorrectly. If you spot anything I’ve overlooked in this analysis, please share in a comment, drop me an email, or let me know on X and I’ll update this post accordingly.
  • There could be subtle issues with things like the wording I used in the prompt, the order of the celebrities, etc. Similar to the last point, if you notice anything like that, please let me know.
  • It’s likely that OpenAI tweaks these ChatGPT models frequently, which may cause differences between the results below and what you see if you run this notebook in the future.

What does gpt-4-1106-preview know?

Overall GPT-4 Turbo does fairly well returning accurate celebrity death dates prior to 2023, but as we approach its knowledge cutoff, the probability that it returns a correct response declines significantly.

For example, there were 13 celebrities who passed away in January 2022 and we asked ChatGPT about each of them 50 times (10 times for each of the 5 temperatures), resulting in 650 responses. Out of those 650 responses, 596 were accurate, so the accuracy % for that month is 91.69%.

However, when we look at January 2023, which also saw 13 deaths, only 501 responses were correct, for an accuracy rate of 77%. By March 2023, only 210 of the 450 responses (roughly 47%) were correct.

This chart also answers the question of what the April 2023 knowledge cutoff represents: it appears GPT-4 Turbo was trained on data prior to April 2023, not including that month itself. I can’t say for certain that’s true of 100% of its knowledge, but at least in terms of celebrity deaths, it’s not aware of anything from April 1st onward.

You may wonder as well what impact temperature had on the results. Not much:

Across each of the 5 temperatures evaluated (0, 0.25, 0.5, 0.75, and 1), the accuracy of GPT-4 Turbo’s responses over these 168 celebrities was very similar.

Here’s how it does for each celebrity:

As we can see from the top half of that chart, it returns the correct date 100% of the time for many of the celebrities.

For others, ChatGPT will sometimes say the person is still alive and other times that he or she passed away. For example, Bud Grant, a former Minnesota Vikings coach, died on March 11, 2023, so his death should in theory be part of GPT-4 Turbo’s knowledge. However, the API only sometimes returns a correct death date for him in the script, and the playground is just as inconsistent: sometimes it responds that he’s still alive, and other times that he has indeed passed away.

And for two celebrities, ChatGPT never returned a correct date for their death:

  1. Ryuichi Sakamoto, a Japanese composer who died on March 28, 2023. His death doesn’t seem to have been reported until April 2, after the April 1 knowledge cutoff, so it makes sense that ChatGPT wouldn’t know about it.
  2. Lola Chantrelle Mitchell, a former member of Three 6 Mafia, who died on January 1, 2023. I asked about her on X and someone received a correct response, but he hasn’t been able to reproduce it and I haven’t seen it in my testing. I’m stumped on why it’s having such trouble with her, given that her death was widely reported at the beginning of 2023.

Again, the probabilistic nature of LLMs means there is inherent randomness in their responses. But why does it return 100% correct responses for some celebrities, 5-10% for others, and 0% for yet others? What about its training makes that happen? I asked GPT-4 Turbo and here is its response, which I suspect answers it well:

This highlights the need to consider not only the possibility of hallucinations in ChatGPT’s responses, but also the possibility of ignorance, especially for information about recent events. This will be less of an issue if you’re using GPT-4 with browsing enabled, but it will still impact older models as well as usage of GPT-4 via the API or playground.

Again, if anyone spots anything I’ve overlooked in this analysis, please drop me a note.

Thanks for reading 👋.

Learning Data Science: 3 Months In

At the end of April I decided to take a break from Preceden and start using that time to level up my data science skills. I’m about 3 months into that journey now and wanted to share how I’m going about it in case it’s helpful to anyone.

Data Science

Data science is very broad and depending on who you ask it can mean a lot of different things. Some folks would consider analyzing data in SQL or Excel to be data science, but to me that’s never felt quite right. I prefer a definition that leans more on writing code that makes use of statistics, machine learning, natural language processing, and similar fields to analyze data.

Python

Going into this I had a lot of programming and data analysis experience, but hadn’t done much with Python and barely knew what regression meant.

I considered continuing to learn R which I already have some experience in, but I’m not a huge fan of R so decided to start fresh and learn Python instead. Having used Ruby extensively for Preceden and other projects has made learning Python pretty easy though.

DataCamp

DataCamp is an online learning platform to help people learn data science. They have hundreds of interactive courses and tracks for learning R, Python, Excel, SQL, etc. If you have the interest and time, the $300/year they charge for access to all of their courses is nothing compared to the value they provide.

I’ve been making my way through their Machine Learning for Everyone career track which starts off with a basic introduction to Python and quickly dives into statistics, supervised learning, natural language processing, and a lot more.


Each course is a combination of video lectures and interactive coding exercises.

The courses are really well done and I feel like they’re giving me exposure to a broad range of machine learning topics. I wouldn’t say the courses go deep on any particular topic, but they provide great introductions which you can build on outside of DataCamp.

So far I’ve completed 10 of the 37 courses in this career track, plus 2 additional Python courses that were not in the track but were recommended prerequisites for some of the courses that are.

If you pushed through a course it might take 4 hours to complete, but I’m probably spending 10-15 hours on each course (so about 1 course/week). This is because I spend a lot of extra time during and after the course writing documentation for myself and trying to apply the material to real-world data to learn it better.

Documentation

Every time I stumble across a new function or technique I spend some extra time researching it and documenting it in a public Python Cheat Sheet GitHub repository.

At first I was writing notes in markdown files, but I’ve since gotten a little savvier and am doing them in IPython Notebook files now. Here’s a recent example of documentation I wrote about analyzing time series.


I usually try to come up with a super simple example demonstrating how each function works, which helps me learn it better and serves as an easy reference guide when I need to brush up on it down the road.
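For instance, an entry in the spirit of those time series notes might be nothing more than a few lines showing what a function like pandas’ resample does (this is a made-up toy example, not an excerpt from the actual cheat sheet):

# Toy cheat-sheet-style entry: downsampling a daily series to monthly means.
import pandas as pd

daily = pd.Series(
    range(90),
    index=pd.date_range("2020-01-01", periods=90, freq="D"),
)
monthly_means = daily.resample("M").mean()
print(monthly_means)  # one mean value per month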

Real World Projects

For each course, I also try to apply the material to some real world data that I have access to, whether it be for Help Scout or Preceden.

For example, after DataCamp’s supervised learning course I spent some time trying to use Help Scout trial data to predict which trials would convert into paying customers.
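At its core that project is a binary classification problem: given attributes of a trial, predict whether it converts. Here’s a hedged sketch of that kind of model in scikit-learn; the file and column names are made up for illustration since the real data is internal to Help Scout:

# Sketch of a trial-conversion classifier. The CSV and column names here are
# hypothetical stand-ins for the internal Help Scout data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

trials = pd.read_csv("trials.csv")
features = trials[["team_size", "emails_sent_in_trial", "integrations_enabled"]]
converted = trials["converted"]

X_train, X_test, y_train, y_test = train_test_split(
    features, converted, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))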

For any projects involving Help Scout data, I usually share a short writeup afterwards in our metrics Slack channel as a way to help educate people on data science terms and techniques.

Books

I’ve also picked up a few books which I’ve found to be excellent resources for learning material in more depth.

YouTube

You can search YouTube for almost any data science topic and find dozens of videos about it. The quality varies, but I’ve found that watching a few on any topic is usually enough to fill in any major gaps in my understanding.

For example, last week I was working through DataCamp’s course on time series analysis and having trouble with a few concepts. A quick search on YouTube for videos on autoregressive models turned up this video, which cleared things up for me.
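If you want to experiment with autoregressive models yourself, statsmodels makes fitting one straightforward. A minimal sketch on synthetic data:

# Fit an AR(1) model to simulated data: x_t = 0.8 * x_{t-1} + noise.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + rng.normal()

model = AutoReg(x, lags=1).fit()
print(model.params)  # the lag-1 coefficient should come out close to 0.8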

Kaggle

After DataCamp’s course on supervised learning I spent a lot of time trying to apply it to Kaggle’s Titanic survival data competition.


Breaking 80% accuracy is super hard 😬

The public notebooks that other people have shared are fantastic learning resources and in the future I want to spend a lot more time trying these competitions and learning from the work others have done.

What’s Next

At the rate I’m going I should be through DataCamp’s machine learning track before the end of the year which will be a nice milestone in this journey. Along the way I’ll continue trying to apply the material to real world problems and hopefully wind up somewhat competent with these techniques when all is said and done. We shall see!

My R Cheat Sheet, now available on GitHub

Despite working on and off with R for about two years now, I can never seem to remember how to do basic things when I return to it after a few weeks away.

I recently started keeping detailed notes for myself to minimize how much time I spend figuring things out that I already learned about in the past.

You can check out my cheat sheet on GitHub here:

https://github.com/mattm/r-cheat-sheet

It covers everything from data frames to working with dates and times to using ggplot and a lot more. I’ll update it periodically as I add new notes.

If you spot any mistakes or have any suggestions for how to improve it, don’t hesitate to shoot me an email.

Chronos: An R Script to Analyze the Distribution of Time Between Two Events

If someone asked you about your site’s conversion rates, you could probably tell them what the conversion rates are (right?). But what if someone asked you what % convert within an hour, a day, or a week?

We’ve been looking at this at Automattic and I wound up putting together an R script to help with the analysis. Because everything needs a fancy name, I dubbed it Chronos and you can check it out on GitHub.

All you need to do to use it is generate a CSV containing two columns: one with the unix timestamp of the first event and another with the unix timestamp of the second event:

1350268044,1408676495
1307322538,1350061315
1307676110,1340667657
1307661905,1337311786
1307758702,1428877904
...

The script will then show you the distribution of time between the two events as well as the percent that occur prior to a few fixed points (30 minutes, 1 hour, etc):

Distribution:
5% within 2 minutes 
10% within 5 minutes 
15% within 1 hour 21 minutes 
20% within 1 day 38 minutes 
25% within 3 days 2 hours 58 minutes 
30% within 6 days 9 hours 20 minutes 
33.33333% within 11 days 
35% within 14 days 
40% within 23 days 
45% within 42 days 
50% within 67 days 
55% within 95 days 
60% within 148 days 
65% within 210 days 
66.66667% within 232 days 
70% within 288 days 
75% within 390 days 
80% within 550 days 
85% within 677 days 
90% within 920 days 
95% within 1288 days 
100% within 1715 days 

Percentage by certain durations:
13% within 30 minutes
14% within 1 hour
17% within 5 hours
20% within 1 day
30% within 7 days

In addition to analyzing conversion rates, you can use this to measure things like retention rates. The data above, for example, looks at the time between when users logged their first beer and their last beer in Adam Week‘s handy beer tracking app, BrewskiMe (thank you again Adam for providing the data).
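The underlying computation is just percentiles of the elapsed time between the two events, so it’s easy to reproduce in other languages too. Here’s a rough Python equivalent for illustration (not the actual R script; the filename is hypothetical):

# Rough Python equivalent of what Chronos computes: percentiles of the elapsed
# time between two events. Illustration only; not the actual R script.
import csv
from datetime import timedelta

import numpy as np

with open("events.csv") as f:  # two columns of unix timestamps per row
    deltas = [int(second) - int(first) for first, second in csv.reader(f)]

print("Distribution:")
for pct in [5, 10, 25, 50, 75, 90, 95, 100]:
    seconds = int(np.percentile(deltas, pct))
    print(f"{pct}% within {timedelta(seconds=seconds)}")

print("Percentage by certain durations:")
for seconds_limit, label in [(1800, "30 minutes"), (3600, "1 hour"), (86400, "1 day"), (604800, "7 days")]:
    share = 100 * np.mean([d <= seconds_limit for d in deltas])
    print(f"{share:.0f}% within {label}")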

If you run into any issues or have any suggestions for how to improve it just let me know.