c# - Compare the textual content of websites
I am toying with a little experiment in text comparison / plagiarism detection, comparing one website against another. However, I am searching for a proper way to process the text.
How would you process and compare the content of two websites for plagiarism?
I'm thinking of something like this pseudocode:
    // retrieve the text from the websites
    foreach website
        crawl the website and store its pages (only pages on the site itself are scanned)
        extract the text blocks from the pages and store them in a list

    // compare
    foreach text in website1.TextList
        compare it with all the text in website2.TextList
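A minimal C# sketch of that loop could look like the following. `CrawlPages` and `ExtractTextBlocks` here are placeholder helpers of my own (a real crawler would fetch pages over HTTP and follow same-site links), and exact string equality stands in for whatever comparison algorithm you end up choosing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SiteCompare
{
    // Placeholder crawler: a real one would fetch each page over HTTP
    // and follow links within the same site. Here it returns canned pages.
    public static IEnumerable<string> CrawlPages(Dictionary<string, string> site)
        => site.Values;

    // Placeholder extractor: a real one would strip HTML and split the page
    // into paragraphs, headers, table cells etc. Here: split on blank lines.
    public static IEnumerable<string> ExtractTextBlocks(string page)
        => page.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries)
               .Select(b => b.Trim())
               .Where(b => b.Length > 0);

    // Compare every text block of site 1 against all blocks of site 2.
    // Exact equality stands in for whatever comparison algorithm you pick.
    public static List<string> FindSharedBlocks(
        Dictionary<string, string> site1, Dictionary<string, string> site2)
    {
        var list1 = CrawlPages(site1).SelectMany(ExtractTextBlocks);
        var list2 = new HashSet<string>(
            CrawlPages(site2).SelectMany(ExtractTextBlocks));
        return list1.Where(list2.Contains).ToList();
    }

    static void Main()
    {
        var site1 = new Dictionary<string, string>
            { ["/a"] = "Original paragraph.\n\nShared paragraph." };
        var site2 = new Dictionary<string, string>
            { ["/x"] = "Shared paragraph.\n\nOther text." };
        foreach (var b in FindSharedBlocks(site1, site2))
            Console.WriteLine("Possible match: " + b);
        // prints "Possible match: Shared paragraph."
    }
}
```

Using a `HashSet` for the second site's blocks keeps the comparison from being a full O(n·m) scan, at least for the exact-match case.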
I know this solution can collect a lot of data, so it is only feasible with very small websites.
I have not decided on the actual text comparison algorithm yet, but right now I am more interested in getting the overall process working first.
I'm thinking it would be a good idea to split all the text into separate pieces (paragraphs, tables, headers and so on), since text can move around between pages.
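For that splitting step, here is a rough regex-based sketch; note this is just an illustration — for real pages an HTML parser (for example the HtmlAgilityPack library) would be far more robust than regexes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class BlockExtractor
{
    // Split raw HTML into text blocks: remove scripts/styles, treat
    // block-level tags as boundaries, then drop any remaining tags.
    public static List<string> ExtractTextBlocks(string html)
    {
        string s = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", " ",
                                 RegexOptions.Singleline | RegexOptions.IgnoreCase);
        s = Regex.Replace(s, @"</?(p|div|h[1-6]|li|td|th|table|tr)[^>]*>", "\n",
                          RegexOptions.IgnoreCase);
        s = Regex.Replace(s, "<[^>]+>", " ");   // strip the remaining tags
        return s.Split('\n')
                .Select(b => Regex.Replace(b, @"\s+", " ").Trim())
                .Where(b => b.Length > 0)
                .ToList();
    }

    static void Main()
    {
        var blocks = ExtractTextBlocks(
            "<html><script>var x = 1;</script>" +
            "<h1>Title</h1><p>First  paragraph.</p><p>Second one.</p></html>");
        foreach (var b in blocks)
            Console.WriteLine(b);
        // prints "Title", "First paragraph.", "Second one."
    }
}
```

Normalizing whitespace inside each block also helps later, since two copies of the same paragraph often differ only in formatting.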
I am doing this in C# (ASP.NET, actually).
I am very interested in any input or advice, so please shoot! :)
My approach to this problem would be to search for the specific blocks of text whose copyright you are trying to protect.
If you want to build your own solution, here are some comments:
- Respect robots.txt. If a site is marked as not to be crawled, its owners are likely not trying to profit from your content anyway.
- You will need to refresh the stored site structure from time to time, as websites change.
- You will need to properly separate the text from the HTML tags and JavaScript.
- You will essentially need to do a full-text search of each page's entire text (with the tags and scripts stripped out) for the text you want to protect. There are good, published algorithms for this.
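On that last point, Rabin-Karp is one of the classic published string-search algorithms: it hashes a sliding window over the page text so each candidate position is checked in constant time, which extends naturally to searching for many protected blocks at once. A minimal single-needle sketch:

```csharp
using System;

class RabinKarp
{
    // Index of `needle` in `haystack`, or -1. The rolling hash makes each
    // window check O(1); a hash hit is verified to rule out collisions.
    public static int Search(string haystack, string needle)
    {
        int n = haystack.Length, m = needle.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        const ulong B = 256, Q = 1_000_000_007;   // base and modulus
        ulong pow = 1;                            // B^(m-1) mod Q
        for (int i = 0; i < m - 1; i++) pow = pow * B % Q;

        ulong hNeedle = 0, hWindow = 0;
        for (int i = 0; i < m; i++)
        {
            hNeedle = (hNeedle * B + needle[i]) % Q;
            hWindow = (hWindow * B + haystack[i]) % Q;
        }

        for (int i = 0; ; i++)
        {
            if (hWindow == hNeedle &&
                string.CompareOrdinal(haystack, i, needle, 0, m) == 0)
                return i;
            if (i + m >= n) return -1;
            // Roll the hash: drop haystack[i], append haystack[i + m].
            hWindow = ((hWindow + Q - haystack[i] * pow % Q) % Q * B
                       + haystack[i + m]) % Q;
        }
    }

    static void Main()
    {
        Console.WriteLine(Search("the quick brown fox", "brown"));  // 10
        Console.WriteLine(Search("the quick brown fox", "wolf"));   // -1
    }
}
```

To search for many protected blocks in one pass, you would hash every block (all of the same window length, e.g. fixed-size shingles) into a hash set and test each window's hash against it.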