c# - Compare the textual content of websites


I am doing a little experiment with text comparison / plagiarism detection, and I want to try it on a website-to-website basis. However, I am struggling to find a proper way to process the text.

How would you process the content of two websites and compare them for plagiarism?

I'm thinking of something like this pseudocode:

  // Extract the text from the websites
  foreach website
      crawl the website and store its pages
      foreach page in website
          extract the text blocks from the page and store them in website.textlist

  // Compare
  foreach text in website1.textlist
      compare text with every text in website2.textlist
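The comparison half of the pseudocode above could be sketched in C# roughly as follows. This is only an illustration under assumptions: the class and method names (`BlockComparer`, `Shingles`, `Jaccard`, `FindMatches`) and the 0.5 threshold are made up for the example, and word-shingle Jaccard similarity is used purely as a stand-in, since the comparison algorithm has not been chosen yet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class BlockComparer
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Lowercases the text and returns the set of overlapping n-word sequences ("shingles").
    public static HashSet<string> Shingles(string text, int n = 3)
    {
        string[] words = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        var set = new HashSet<string>();
        for (int i = 0; i + n <= words.Length; i++)
            set.Add(string.Join(" ", words, i, n));
        return set;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 when either set is empty.
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 || b.Count == 0) return 0.0;
        int common = a.Count(s => b.Contains(s));
        return (double)common / (a.Count + b.Count - common);
    }

    // Compares every block of site 1 with every block of site 2 (quadratic, as in the pseudocode).
    public static IEnumerable<(string Left, string Right, double Score)> FindMatches(
        IEnumerable<string> site1Blocks, IEnumerable<string> site2Blocks, double threshold = 0.5)
    {
        // Shingle the second site's blocks once instead of once per comparison.
        var right = site2Blocks.Select(t => (Text: t, Set: Shingles(t))).ToList();
        foreach (string left in site1Blocks)
        {
            HashSet<string> leftSet = Shingles(left);
            foreach (var r in right)
            {
                double score = Jaccard(leftSet, r.Set);
                if (score >= threshold) yield return (left, r.Text, score);
            }
        }
    }
}
```

Because every block is compared with every other block, the cost grows with the product of the two list sizes, which is why this only stays practical for small sites.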

I know that this solution could accumulate a lot of data, so it may only be feasible with very small websites.

I have not yet decided on the actual text comparison algorithm; for now I am more interested in getting the overall process working first.

I'm thinking it would be a good idea to extract the text as separate pieces (paragraphs, tables, headers and so on), since text can move around on a page.

I am coding this in C# (possibly ASP.NET).
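As a rough illustration of splitting a page into separate text pieces, here is a minimal C# sketch. The names (`BlockExtractor`, `ExtractBlocks`) are hypothetical, and the regular expressions are a deliberately crude assumption: they handle simple, well-formed markup but not nested or broken HTML, for which a real HTML parser would be the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class BlockExtractor
{
    // Matches a few block-level elements: paragraphs, headers, list items, table cells.
    private static readonly Regex BlockPattern = new Regex(
        @"<(p|h[1-6]|li|td|th)\b[^>]*>(.*?)</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns the text of each matched element as its own entry, with inline tags removed.
    public static List<string> ExtractBlocks(string html)
    {
        var blocks = new List<string>();
        foreach (Match m in BlockPattern.Matches(html))
        {
            string inner = TagPattern.Replace(m.Groups[2].Value, " ");
            string text = Whitespace.Replace(WebUtility.HtmlDecode(inner), " ").Trim();
            if (text.Length > 0) blocks.Add(text);
        }
        return blocks;
    }
}
```

Keeping each piece as its own entry means a paragraph that moves elsewhere on the page, or to another page, can still be matched on its own.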

I am very interested in any input or advice, so please shoot! :)

My approach to this problem would be to search for the specific blocks of text whose copyright you are trying to protect.

If you want to build your own solution, here are some comments:

  • Respect robots.txt. If they have marked the site as do-not-crawl, chances are they are not trying to profit from your content anyway.
  • You will need to refresh the stored site structure from time to time, as websites change.
  • You must properly separate the text from the HTML tags and JavaScript.
  • You are essentially doing a full-text search over the entire text of the page (after removing the tags / script) for the text you want to protect. There are good, published algorithms for this.
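Three of the points above (robots.txt, separating text from markup, and searching for the protected text) could be sketched in C# as follows. Everything here is an assumption made for illustration: the names (`PageScan`, `IsPathAllowed`, `VisibleText`, `ContainsProtectedText`) are hypothetical, the robots.txt check only understands `User-agent: *` groups with plain `Disallow` prefixes (no `Allow` rules or wildcards), and the search is a naive substring match rather than one of the published algorithms.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PageScan
{
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex Tag = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Simplified robots.txt check: false if a Disallow prefix in a "User-agent: *" group matches.
    public static bool IsPathAllowed(string robotsTxt, string path)
    {
        bool applies = false, lastWasAgent = false;
        foreach (string raw in robotsTxt.Split('\n'))
        {
            string line = raw.Split('#')[0].Trim();   // drop comments and stray '\r'
            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();
            if (field == "user-agent")
            {
                // Consecutive User-agent lines belong to the same group.
                applies = (lastWasAgent && applies) || value == "*";
                lastWasAgent = true;
                continue;
            }
            lastWasAgent = false;
            if (field == "disallow" && applies && value.Length > 0
                && path.StartsWith(value, StringComparison.Ordinal))
                return false;
        }
        return true;
    }

    // Removes script/style elements and all tags, decodes entities, collapses whitespace.
    public static string VisibleText(string html)
    {
        string noScript = ScriptOrStyle.Replace(html, " ");
        string noTags = Tag.Replace(noScript, " ");
        return Whitespace.Replace(WebUtility.HtmlDecode(noTags), " ").Trim();
    }

    // Naive case-insensitive substring search for the protected text in the page's visible text.
    public static bool ContainsProtectedText(string html, string protectedText)
    {
        string needle = Whitespace.Replace(protectedText, " ").Trim();
        if (needle.Length == 0) return false;
        return VisibleText(html).IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

An exact substring match only catches verbatim copies; lightly edited copies would need a fuzzier comparison on top of the extracted text.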
