Writing a pseudo-crawler for web statistics


I have been tasked with writing a web pseudo-crawler to gather some statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE against the number of HTML files that do not, and to compare this statistic between sites on different topics. The idea is to pick a set of topic words (such as "automobile", "stock exchange", "liposuction", ...) and, for each one, request the first 300 or more search result pages.
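
As a rough sketch of the measurement itself, the per-page check could look like this in Perl (the subroutine name and the decision to skip leading whitespace are my assumptions):

    use strict;
    use warnings;

    # Return 1 if the HTML source begins with a DOCTYPE declaration,
    # 0 otherwise. Leading whitespace is skipped and the match is
    # case-insensitive, so "<!doctype html>" also counts.
    sub starts_with_doctype {
        my ($html) = @_;
        $html =~ s/^\s+//;
        return $html =~ /^<!DOCTYPE/i ? 1 : 0;
    }

Counting the pages for which this returns 1 versus 0, per topic word, gives exactly the percentage described above.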

I want the process to be fast, yet I do not want to get banned by Google. I would also like to keep development time to a minimum, even if that means a few stupid Perl scripts.

Is there a ready-made solution I could reuse? The Google API does not give me the HTML content of the pages themselves, which is what I want to measure, so on its own it is not the proper solution.

A scriptable HTTP client will let you do just about everything, including limiting your request rate.
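
As a minimal sketch of such a rate-limited fetch loop (the answer names no particular tool; LWP::UserAgent, the two-second delay, and reading URLs from standard input are my assumptions):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Polite sequential fetcher: one request at a time, a fixed pause
    # between requests, and an honest User-Agent string.
    my $ua = LWP::UserAgent->new(
        agent   => 'doctype-stats/0.1',
        timeout => 15,
    );

    my ($with, $without) = (0, 0);
    while (my $url = <STDIN>) {        # one URL per line on stdin
        chomp $url;
        my $res = $ua->get($url);
        next unless $res->is_success;
        my $html = $res->decoded_content // '';
        $html =~ s/^\s+//;
        $html =~ /^<!DOCTYPE/i ? $with++ : $without++;
        sleep 2;                       # crude rate limit: one request every 2s
    }

    my $total = $with + $without;
    printf "with DOCTYPE: %d of %d (%.1f%%)\n",
        $with, $total, $total ? 100 * $with / $total : 0;

Lengthening the sleep (or randomizing it) is the simplest way to stay under whatever request rate the target site tolerates.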

