Extracting a List of All Registered .com Domains from the Verisign Zone File

Back in the day when I worked on Lean Domain Search, I got a lot of experience working with Verisign’s .com zone file, because that’s what Lean Domain Search uses behind the scenes to check whether a given domain is available to register.

I still get a lot of emails asking for details about how it worked, so over a series of posts I’m going to walk through how to work with the zone file and eventually explain exactly how Lean Domain Search works.

What’s a zone file?

A zone file lists all registered domains for a given Top Level Domain (like .com, .net, etc.) and the name servers associated with each domain. For example, because this blog is hosted on WordPress.com, the zone file lists the WordPress.com name servers for it:

MATTMAZUR NS NS1.WORDPRESS
MATTMAZUR NS NS2.WORDPRESS
MATTMAZUR NS NS3.WORDPRESS
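To make the record layout concrete, here is a minimal sketch that writes the three lines above to a small file and prints the name servers for one domain. The file name lookup-sample.zone and the variable names are made up for illustration.

```shell
# Write the three records shown above to a small sample file (hypothetical name).
cat > lookup-sample.zone <<'EOF'
MATTMAZUR NS NS1.WORDPRESS
MATTMAZUR NS NS2.WORDPRESS
MATTMAZUR NS NS3.WORDPRESS
EOF

# Field 1 is the domain, field 2 the record type, field 3 the name server.
domain=MATTMAZUR
nameservers=$(awk -v d="$domain" '$1 == d && $2 == "NS" {print $3}' lookup-sample.zone)
echo "$nameservers"
```

This prints the three WordPress name servers, one per line.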

How do I get access to the zone file?

Anyone can fill out a form, apply, and get access. There are details on this page. I detailed in this old post on Lean Domain Search how I filled out the form, though it has changed since then so you’ll need to make some adjustments.

What happens after I apply for access?

Verisign will send you the details you need to log in to its FTP server and download the zone file:

Screen Shot 2018-05-18 at 1.07.14 PM.png

The zone file is that 2.91 GB com.zone.gz, which is currently 11.47 GB unzipped.
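If disk space is tight, a gzipped file can be read as a stream rather than unzipped first. Here is a minimal sketch, assuming gzip is installed. The file gz-sample.zone.gz is a tiny stand-in created here so the commands can be tried without the real download.

```shell
# Build a tiny stand-in for com.zone.gz (made-up file names, records from this post).
printf 'MATTMAZUR NS NS1.WORDPRESS\nMATTMAZUR NS NS2.WORDPRESS\n' > gz-sample.zone
gzip -c gz-sample.zone > gz-sample.zone.gz

# Decompress to stdout and count lines without writing the unzipped file to disk.
line_count=$(gzip -dc gz-sample.zone.gz | wc -l | tr -d ' ')
echo "lines: $line_count"

# Or unzip to a separate file while keeping the original .gz in place.
gzip -dc gz-sample.zone.gz > gz-sample-unzipped.zone
```

The same gzip -dc approach should work on the real com.zone.gz, though it will take a while on a file that size.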

What’s in the zone file?

It starts with some administrative details, then lists domains and their associated name servers. Note that registered domains without a name server (such as ones that are close to expiring) are not included in this list.

How can I extract a list of just the domains?

Glad you asked! It takes a little bit of command line fu.

If you’d like to follow along, here are the first 1,000 lines of the zone file. You can download this and use the terminal commands below just like you would if you were working with the entire 317,338,073-line zone file.
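If you already have your own copy of the zone file, you can cut a similar sample with head. A minimal sketch follows. The 2,500 EXAMPLE records are generated stand-in data, not real zone file content, and the file names are made up.

```shell
# Stand-in for the unzipped zone file: 2,500 made-up records.
awk 'BEGIN { for (i = 1; i <= 2500; i++) print "EXAMPLE" i " NS NS1.EXAMPLE" }' > head-demo.zone

# Keep only the first 1,000 lines as a small working sample.
head -n 1000 head-demo.zone > head-demo-first-1000.zone
wc -l < head-demo-first-1000.zone
```

With the real file, you would replace head-demo.zone with com.zone.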

1) First, we’ll grab a list of just the domains:

$ awk '{print $1}' com.zone > domains-only.txt

For a line like this:

KITCHENEROKTOBERFEST NS NS1.UNIREGISTRYMARKET.LINK.

This command will return just KITCHENEROKTOBERFEST.

This will also return some non-domains from the administrative section at the top of the zone file, but we’ll filter those out later.
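Here is a small self-contained run of this step. The two header-style lines ($ORIGIN and $TTL are standard zone file directives) are stand-ins for the administrative section, so the real header will look different. The file names are made up.

```shell
# A tiny sample: two header-style lines followed by records from this post.
cat > step1-sample.zone <<'EOF'
$ORIGIN COM.
$TTL 900
KITCHENEROKTOBERFEST NS NS1.UNIREGISTRYMARKET.LINK.
MATTMAZUR NS NS1.WORDPRESS
MATTMAZUR NS NS2.WORDPRESS
EOF

# Same command as above: keep only the first whitespace-separated field.
awk '{print $1}' step1-sample.zone > step1-domains-only.txt
cat step1-domains-only.txt
```

The output has five lines: the two directives ($ORIGIN and $TTL), KITCHENEROKTOBERFEST once, and MATTMAZUR twice.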

Here’s what domains-only.txt should look like.

2) Next, we’ll sort the results and remove duplicates:

$ sort -u domains-only.txt --output domains-unique.txt

This is necessary because most domains will have multiple name servers, but we don’t want the domain to appear multiple times in our final list of domains.
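A quick way to see the deduplication on a few lines is sketched below. It puts the output option before the input file and adds LC_ALL=C, which makes sort compare raw bytes. That is usually faster on large ASCII files, but treat the speedup as an assumption to measure on your own machine. The file names are made up.

```shell
# Sample input with a domain repeated once per name server.
printf 'MATTMAZUR\nKITCHENEROKTOBERFEST\nMATTMAZUR\nMATTMAZUR\n' > step2-domains-only.txt

# Sort in byte order and drop duplicate lines.
LC_ALL=C sort -u -o step2-domains-unique.txt step2-domains-only.txt
cat step2-domains-unique.txt
```

Each domain now appears exactly once: KITCHENEROKTOBERFEST, then MATTMAZUR.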

Here’s what domains-unique.txt should look like.

3) Last but not least, we’ll ensure the results include only domains:

$ LC_ALL=C grep '^[A-Z0-9-]*$' domains-unique.txt > domains.txt

There are a few things to note here.

First, make sure to use GNU grep, which is not the default on Macs. GNU grep is fast.

The LC_ALL=C forces grep to use the locale C, which tells grep this is an ASCII file, not a UTF-8 file. More details here. While not important for this 1,000-line file, it significantly reduces how much time grep takes on the full 300M+ line zone file.

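The regular expression in this command matches lines made up only of uppercase letters, numbers, and dashes. It uses * (0 or more characters) rather than + (1 or more characters) because grep’s default basic regular expression syntax treats a bare + as a literal plus sign. To use +, you would run grep -E (extended syntax) or, with GNU grep, write \+. The practical difference is that * also matches empty lines.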
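With the -E flag, grep uses extended regular expressions, where + is available. The sketch below compares the two quantifiers on a made-up four-line file that includes one empty line. The dash is written without a backslash, since a dash at the end of a bracket expression is already literal.

```shell
# Sample input: two domain-like lines, one empty line, one line containing a dot.
printf 'MATTMAZUR\n\nABC-123\nNS1.WORDPRESS\n' > regex-sample.txt

# Basic regex with *: also matches the empty line (zero characters).
star_count=$(LC_ALL=C grep -c '^[A-Z0-9-]*$' regex-sample.txt)

# Extended regex with +: requires at least one character.
plus_count=$(LC_ALL=C grep -E -c '^[A-Z0-9-]+$' regex-sample.txt)

echo "star: $star_count, plus: $plus_count"
```

This prints "star: 3, plus: 2", because only the * version counts the empty line.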

Technically this regex will match strings that are longer than domains can actually be (the max is 63 characters) as well as strings that start or end with a dash (which isn’t valid for a domain), but there aren’t any of those in the zone file, so it’s not a big deal and grep will run faster this way. If you really wanted to get fancy, you could match proper domains, but it will take longer to run and needs grep -E, because the grouping, the ? and the {0,61} are extended syntax: ^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$
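Here is a sketch of the stricter pattern in action. It is run with grep -E because the parentheses, ? and {0,61} are extended regular expression syntax. The test strings and file names are made up for illustration.

```shell
# Sample input: valid names plus a leading dash, a trailing dash, and an over-long name.
cat > strict-sample.txt <<'EOF'
MATTMAZUR
A
-LEADINGDASH
TRAILINGDASH-
ABC-123
EOF
# Append a line of 64 A's, one character over the 63-character limit.
awk 'BEGIN { for (i = 0; i < 64; i++) printf "A"; print "" }' >> strict-sample.txt

# Keep only names of 1 to 63 characters that start and end with a letter or number.
LC_ALL=C grep -E '^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$' strict-sample.txt > strict-valid.txt
cat strict-valid.txt
```

Only MATTMAZUR, A, and ABC-123 survive. The two dash cases and the 64-character line are dropped.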

Here’s what domains.txt should look like.

Note that this does include some domain-like strings from the administrative section, like 1526140941, which isn’t actually a domain. Depending on what you’re using the zone file for, you could remove these lines, but it’s never been a big deal for my use case. Because Lean Domain Search is limited to letters-only domains, it actually just uses ^[A-Z]* for the regex.
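The three steps can also be chained into a single pipeline that reads the gzipped file directly and skips the intermediate files. This is a sketch under assumptions, not the Lean Domain Search code. The sample file, its two header-style lines, and the file names are made up. It also uses grep -E with +, so empty lines are not matched.

```shell
# Made-up stand-in for com.zone.gz so the pipeline can be tried end to end.
cat > pipeline-sample.zone <<'EOF'
$ORIGIN COM.
$TTL 900
MATTMAZUR NS NS1.WORDPRESS
MATTMAZUR NS NS2.WORDPRESS
KITCHENEROKTOBERFEST NS NS1.UNIREGISTRYMARKET.LINK.
EOF
gzip -c pipeline-sample.zone > pipeline-sample.zone.gz

# Steps 1 to 3 in one pass: first field, sort and dedupe, keep domain-like lines.
gzip -dc pipeline-sample.zone.gz \
  | awk '{print $1}' \
  | LC_ALL=C sort -u \
  | LC_ALL=C grep -E '^[A-Z0-9-]+$' > pipeline-domains.txt

cat pipeline-domains.txt
```

The result is KITCHENEROKTOBERFEST and MATTMAZUR. The two directive lines are filtered out because they contain a $ character.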

Here’s some actual code from Lean Domain Search with these steps above:

Screen Shot 2018-05-18 at 1.43.18 PM.png

If you run into any trouble or have suggestions on how to improve any of these commands, don’t hesitate to reach out. Cheers!

3 thoughts on “Extracting a List of All Registered .com Domains from the Verisign Zone File”

  1. Generating High Quality Available .com Domain Names for a Specific Industry – Matt Mazur

    • Hey – it’s not that fast – it can take a few hours due to the size of the zone file.

      It will also depend on how beefy the machine you run it on is.
