Ill-behaved web-crawlers

By joe

May 27, 2007 - 2 minutes read - 411 words

This is not about HPC. I look at our logs every now and then to see if we have problems that aren’t normally covered in monitoring scenarios. Looking over the web logs, I see the usual usage, and bots. Some bots have been poorly behaved; some are quite intelligent. Google’s are pretty good.

So are many of the others. A group of them, though, are very poor web denizens that seem incapable of understanding the links they see, and blindly follow them. Some of the worst include a number of Java-based utilities that some people used to snapshot our site. As with all lame tools, they did dumb things. For example, we used to have a company calendar up, with links so our customers could see our schedule (marketing, shows, etc.). The lame and generally brain-dead web crawlers would hit the calendar, see about 30 links, and follow them, including the next day, next week, next month, and next year. This resulted in a combinatorial explosion of hits, almost to the point of being a denial-of-service attack.

The old group of misbehaving web crawlers was mostly Java based. Unfortunately there is a new group, and it is using old, historical cached data. Based in large part on the inability of a few web crawlers to deal with an online calendar properly, we took the calendar down. It doesn’t show up. The only way to get there is if you know the URL, say from a historical cache, one that is now out of date and incorrect.

This new group doesn’t seem to notice that its historical cache is out of date and incorrect. We see it searching for items that have been changed or removed for years. And it happily follows the calendar links, even though the calendar is not there.

Notice that I haven’t indicated which web crawler is being so ill-behaved. I will do that in a moment. I could speculate about why it is behaving thusly: its masters want to outdo the market leader. Unfortunately they do not seem to have given this crawler anything approaching the smarts of the market leader. We have no trouble with Google’s bot. Somehow, though I haven’t tried it, I am betting that it would ignore a robots.txt file. Many bots do these days or, going further, use it to look for the “juicy bits”.

Well, I am sorry to keep you in suspense over the troublemaker. It is live.com. It is Microsoft’s search site.
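
For anyone wondering what a well-behaved crawler is supposed to do with a robots exclusion file (conventionally robots.txt), here is a rough sketch using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration; none of this is our actual configuration or any real crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only: ask every crawler to stay
# out of an online calendar, which generates an endless supply of links.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawlable(user_agent, urls):
    """Keep only the URLs a polite crawler would actually request."""
    return [u for u in urls if allowed(user_agent, u)]


if __name__ == "__main__":
    # Hypothetical URLs; a polite crawler drops the calendar link.
    links = [
        "http://www.example.com/about.html",
        "http://www.example.com/calendar/2007/05/28",
    ]
    print(crawlable("ExampleBot", links))
```

Of course, this only helps with crawlers that bother to ask. Compliance is voluntary, which is exactly the problem.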