So, I’m sitting at my kitchen table the other night thinking about startup type things when an idea pops into my head: Create an index for Hacker News.
Now, this isn’t the first time this occurred to me. A few weeks ago I emailed Paul Graham asking whether I could create a searchable database of Hacker News. He said he’d rather I not, plus I found out later about searchyc.com, which does exactly that.
But an index… that would have a different purpose. You could do all sorts of interesting analysis on it… top posts, top contributors, posting frequency, etc. No, I wouldn’t save the content, just the relevant information for the submissions (no comments): title, URL, points, # comments, and date.
The software wasn’t hard to write. The submissions are sequentially numbered from 1 to about 270K and it’s easy to differentiate between submissions and comments by searching the HTML. After about an hour of work and a little testing, I set off my small VB program to crawl the site.
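The crawler described above can be sketched in a few lines. This is a hedged reconstruction in Python (the original was a VB program, and the HTML markers below are illustrative guesses, not HN’s actual markup):

```python
import re
import urllib.request

# Hypothetical item URL pattern; HN items are numbered sequentially.
BASE = "https://news.ycombinator.com/item?id={}"

def fetch_item(item_id):
    """Fetch one item's HTML. The real page may differ from what
    parse_submission() below expects."""
    with urllib.request.urlopen(BASE.format(item_id)) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_submission(html):
    """Return (title, points, comment_count) if the page looks like a
    story submission, else None. The patterns here are illustrative;
    the actual markup would need to be inspected to pick reliable ones."""
    title = re.search(r'class="title"[^>]*>.*?<a[^>]*>([^<]+)</a>', html)
    if not title:
        return None  # no title row: probably a comment, not a submission
    points = re.search(r'(\d+)\s+points?', html)
    comments = re.search(r'(\d+)\s+comments?', html)
    return (title.group(1),
            int(points.group(1)) if points else 0,
            int(comments.group(1)) if comments else 0)
```

Looping this over IDs 1 through ~270K is exactly the part that, as the rest of the post shows, needs to be rate-limited.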
This was Tuesday night. I went to sleep, eager to analyze the results the next day.
Wednesday morning I woke up and checked its status. 30% or something low like that. I couldn’t do any analysis then anyway — so off to work. I got home that evening and it was still chugging along. 55%. Getting there…
That night, around 9, I checked the status. 73.64%. Stupid slow connection. I came back half an hour later. 73.65%. Man, my connection is really terrible, I thought to myself. I loaded up Amazon to see if it would load. No problem. I restarted my computer, thinking it was some connection problem. When it rebooted, I checked YC again … it took about 20 seconds and finally loaded. Hmm. Then it hit me. Wait a minute. Oh no. No no no no. What if the indexing caused HackerNews to go down?
This is not good. Not good at all.
So I shut the program down and went to bed. The next morning, Thursday, I checked my email before heading out, half expecting to see some sort of email. Nothing. Phew. YC was still somewhat slow at that point, but was improving.
I checked HackerNews throughout the day at work. It seemed to be just about back to normal. Sometime in the afternoon I checked GMail. I had an email from Paul Graham titled “please stop”. It said:
Would you please not do that to the server again?
“Shit,” I said. My coworker shot a puzzled look at me. “Nothing,” I told him, “it’s a long story.”
I wrote a response, apologizing profusely. Unfortunately, I realized later that night that the response didn’t go through… only a blank email. So, I rewrote the email and sent it off.
I’d like to take this opportunity again to say sorry to Paul and any other members of the HackerNews community who were affected by this. I didn’t think through what effect the indexing would have, and would never have done it if I’d realized it would unintentionally result in a denial-of-service attack on my favorite news site. I don’t know how much time it took to fix, and I apologize for any time YC lost correcting it.
If you’re considering doing something like this, you should rethink your plans. It’s not exactly the best way to make an impression.
You idiot. LOL
By the way, it’s “Self Made Minds”, not “Seld …”. :/
well at least you didn’t accidentally create an internet worm.
Well, you could have just had a delay every few pages. Sure, it could take a lot longer, but better than nothing! :)
Self* Made Minds – got it, thank you. It’s amazing how you can overlook something so many times when you write it yourself.
staticshock on YC suggested using Google cache to do it. I’m finishing off the remaining 70K and then I’ll post some analysis.
cue the script kiddie trouble makers
Everybody gets one.
You got yourself to page 1 of HN. And now you got me reading your post (for the first time). I think you are doing OK. Hopefully you will let us see what your index can do for HN. Peace.
As Ashley Williams said, just insert a call to sleep() with a random value between 5-10 seconds.
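The throttling the commenters are describing really is only a few lines. A minimal sketch in Python, assuming the 5–10 second range suggested above (the function names here are mine, not from the post):

```python
import random
import time

def polite_crawl(item_ids, fetch):
    """Crawl the given item IDs with a random pause between requests,
    so the target server isn't hammered. `fetch` is whatever function
    actually retrieves a page (e.g. the crawler's HTTP call)."""
    results = {}
    for item_id in item_ids:
        results[item_id] = fetch(item_id)
        # Sleep a random 5-10 seconds between requests, as suggested.
        time.sleep(random.uniform(5, 10))
    return results
```

At ~270K items this would stretch the crawl to weeks, which is why a shorter delay, or crawling a cache (as suggested below), is the more practical compromise.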
Doesn’t it indicate something horribly wrong with the way the site is built if a single computer, on a residential ISP, takes it down with a basic crawler?
Why don’t search bots do the same to it?
Search robots crawl much slower.
But it IS quite a sign of bad administration if a single person/computer can kill a site simply by crawling it. :o
Take it easy on the kid; everybody kills HN on their first trip.
YC readers deserve the title “hackers” indeed. Because, you know, even if you’re a crappy programmer, you are still elite if you follow that which spews forth from lord PG’s mouth.
Alex, what are you talking about? That doesn’t even make any sense.
I lol’ed when I first read this…I’m surprised nobody did it earlier actually.
As if he took the server down. He was throttled, plain and simple.
You found a good community to f**k up with. YC loves this sort of stuff.
Keep at it.