c# - Compare the textual content of websites

- January 15, 2015

I am doing a little experiment with text comparison / plagiarism detection, and I want to try it on a website-to-website basis. However, I am struggling to find a proper way to process the text.

How would you process the content of two websites and compare them for plagiarism?

I'm thinking of something like this pseudo code:

  // extract the text from the websites
  foreach website
    crawl the website
    store the pages, so each page is only scanned once
    extract the text blocks from the pages and store them in a list

  // compare
  foreach text in website1.textlist
    compare with all the text in website2.textlist
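The compare step of the pseudo code could be sketched in C# roughly as follows. Since no comparison algorithm has been chosen yet, this swaps in Jaccard similarity over three-word shingles purely as a placeholder. `SiteComparer`, `Shingles`, `Similarity`, `FindMatches` and the 0.5 threshold are all hypothetical names and values, not anything from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SiteComparer
{
    // Break a text block into overlapping word n-grams ("shingles").
    public static HashSet<string> Shingles(string text, int size = 3)
    {
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var result = new HashSet<string>();
        for (int i = 0; i + size <= words.Length; i++)
            result.Add(string.Join(" ", words, i, size));
        return result;
    }

    // Placeholder metric (an assumption): Jaccard similarity of the shingle sets,
    // |A intersect B| / |A union B|, a value between 0.0 and 1.0.
    public static double Similarity(string a, string b)
    {
        var sa = Shingles(a);
        var sb = Shingles(b);
        if (sa.Count == 0 || sb.Count == 0) return 0.0;
        int shared = sa.Count(s => sb.Contains(s));
        return (double)shared / (sa.Count + sb.Count - shared);
    }

    // Compare every block of site 1 with every block of site 2 and
    // return the pairs whose score reaches the threshold.
    public static IEnumerable<(string First, string Second, double Score)> FindMatches(
        IEnumerable<string> site1Blocks, IEnumerable<string> site2Blocks, double threshold = 0.5)
    {
        var second = site2Blocks.ToList();
        foreach (var a in site1Blocks)
        {
            foreach (var b in second)
            {
                double score = Similarity(a, b);
                if (score >= threshold)
                    yield return (a, b, score);
            }
        }
    }
}
```

The nested loop does one comparison per pair of blocks, so the work grows with the product of the two list sizes.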

I know that this solution can collect a lot of data, so it is probably only feasible with very small websites.

I have not yet decided on the actual text comparison algorithm, but I am more interested in getting the overall process working first.

I'm thinking that it would be a good idea to split all the text into separate text pieces (paragraphs, tables, headers and so on), as text can move around on pages.
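Splitting a page into text pieces might look something like the sketch below. It uses a naive regular expression over a handful of block-level tags, which is an assumption made to keep the example self-contained; a real HTML parser would be more robust, and nested elements of the same type are not handled. `TextBlockExtractor` and `ExtractBlocks` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class TextBlockExtractor
{
    // Matches the content of some common block-level elements
    // (paragraphs, headers, list items, table cells). Naive: no nesting support.
    private static readonly Regex BlockPattern = new Regex(
        @"<(p|h[1-6]|li|td|th)\b[^>]*>(.*?)</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Returns the text of each matched block, with inline tags removed,
    // HTML entities decoded and whitespace collapsed.
    public static List<string> ExtractBlocks(string html)
    {
        var blocks = new List<string>();
        foreach (Match m in BlockPattern.Matches(html))
        {
            string inner = TagPattern.Replace(m.Groups[2].Value, " ");
            string text = WhitespacePattern.Replace(WebUtility.HtmlDecode(inner), " ").Trim();
            if (text.Length > 0)
                blocks.Add(text);
        }
        return blocks;
    }
}
```

Each returned string can then be stored in the per-website text list and compared independently of where it sits on the page.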

I am implementing this in C# (maybe ASP.NET).

I am very interested in any input or advice, so please shoot! :)

My approach to this problem would be to search for specific blocks of text whose copyright you are trying to protect.

If you want to build your own solution, here are some comments:

Respect robots.txt. If they have marked the site as do-not-crawl, then it is likely that they are not trying to profit from your content anyway.
You will need to refresh the stored site structure from time to time, as websites change.
You must properly separate the text from HTML tags and JavaScript.
You will need to do a full text search in the entire text of the page (with the tags / script stripped) for the text you want to protect. There are good, published algorithms for this.
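The last two comments could be sketched roughly like this: strip scripts, styles, comments and tags from a page, then look for the protected passage in what is left. The regex-based stripping and the plain `IndexOf` search are simplifying assumptions, not one of the published algorithms mentioned above, and `PageTextSearch`, `ToPlainText` and `ContainsPassage` are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PageTextSearch
{
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);
    private static readonly Regex Tag = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Remove script/style blocks, comments and tags, decode entities,
    // and collapse runs of whitespace into single spaces.
    public static string ToPlainText(string html)
    {
        string text = ScriptOrStyle.Replace(html, " ");
        text = Comment.Replace(text, " ");
        text = Tag.Replace(text, " ");
        text = WebUtility.HtmlDecode(text);
        return Whitespace.Replace(text, " ").Trim();
    }

    // Case-insensitive exact search for the passage in the visible page text.
    public static bool ContainsPassage(string html, string passage)
    {
        string page = ToPlainText(html);
        string needle = Whitespace.Replace(passage, " ").Trim();
        return needle.Length > 0 &&
               page.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

An exact substring search only catches verbatim copies; lightly reworded text would need a fuzzier comparison.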
