Friday, December 5, 2014

Web server requirements for being crawled



I'm trying to index a web site that is hosted on a separate server in the domain, but every time the crawler runs it produces a pile of warnings and errors on the first pass, and then the server becomes unavailable for an hour or so. I've tried scaling the request interval back to 10 seconds, but that only marginally improves things. My suspicion is that the crawler is eating up the server's pool of concurrent connections and not closing them properly when it's done, but I don't have any evidence to back this up and convince the server owner to increase the number of connections available.
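To illustrate what I mean by pacing requests and bounding connections, here's a rough sketch of the behavior I'd expect from a well-behaved crawler. The actual crawler is a black box to me, so this is just Python with the requests library, and the pool size, delay, and URLs are placeholder values, not its real settings:

import time
import requests
from requests.adapters import HTTPAdapter

# Placeholder values -- the real crawler's settings are unknown to me.
MAX_POOL_SIZE = 4        # cap on concurrent connections to the host
REQUEST_DELAY = 10.0     # seconds to wait between requests

session = requests.Session()
# Bound the connection pool so the crawler can never hold more than
# MAX_POOL_SIZE sockets open against the target server at once.
adapter = HTTPAdapter(pool_connections=1, pool_maxsize=MAX_POOL_SIZE)
session.mount("http://", adapter)
session.mount("https://", adapter)

urls = ["http://example.com/page1", "http://example.com/page2"]  # stand-in URLs

for url in urls:
    resp = session.get(url, timeout=30)
    resp.close()               # release the connection back to the pool promptly
    print(url, resp.status_code)
    time.sleep(REQUEST_DELAY)  # pace requests instead of hammering the server

session.close()                # tear down every pooled connection when done

If the crawler skipped the close/teardown steps, I'd expect exactly the symptom I'm seeing: connections pile up until the server stops accepting new ones.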


So my question is: is there a general guideline or set of requirements for a web server that is going to be crawled? Does it need a specific number of available connections or minimum performance characteristics to keep from being killed by a "normal" crawl? This wouldn't be the first time I've seen a crawl bring a server to its knees, so I'd really like to know ahead of time whether that's going to happen instead of risking an outage.
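In the meantime, to gather some evidence for the concurrency theory before asking the server owner for changes, I'm thinking of something like the probe below. Again a Python sketch with placeholder host and worker count; it simply opens N requests in parallel and reports how many fail or time out:

import concurrent.futures
import time
import requests

TARGET = "http://example.com/"   # placeholder -- substitute the real site
WORKERS = 20                     # number of simultaneous requests to attempt

def fetch(i):
    """Fetch TARGET once and report the outcome and elapsed time."""
    start = time.time()
    try:
        resp = requests.get(TARGET, timeout=15)
        resp.close()
        return resp.status_code, time.time() - start
    except requests.RequestException as exc:
        return type(exc).__name__, time.time() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(fetch, range(WORKERS)))

# Summarise how the server coped with WORKERS concurrent connections.
for outcome, elapsed in results:
    print(f"{outcome}\t{elapsed:.2f}s")

If errors or timeouts start appearing well below the concurrency a normal crawl would generate, that at least gives me a number to take to the server owner.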
