1. Analyze and compare the contents of the article
First, consider the problem from the search engine's perspective.
The methods users typically use to create pseudo-original content are:
(1) Delete some of the content.
(2) Add some content, e.g. insert a couple of sentences into the copied article, or merge several articles together.
(3) Reorder the content, e.g. turn the original sequence 1. A, 2. B, 3. C, 4. D, 5. E into 1. C, 2. B, 3. E, 4. A, 5. D.
After word segmentation, the engine compares the following signals (a rough code sketch follows the list):
(1) Word count
(2) The frequency of several keywords
(3) A random sample of sentences from the text
(4) Outbound links
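As a rough illustration of how such signals might be collected, here is a minimal Python sketch. It assumes simple whitespace tokenization in place of real word segmentation, and all function and field names are hypothetical, not any actual search-engine API:

```python
import random
import re
from collections import Counter

def extract_signals(text, links, num_keywords=5, num_sentences=5):
    """Collect the four comparison signals described above (hypothetical sketch)."""
    words = text.split()                      # stand-in for real word segmentation
    word_count = len(words)
    # treat the N most frequent words as the "several keywords"
    keyword_freq = dict(Counter(words).most_common(num_keywords))
    # sample a few sentences at random for verbatim comparison
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    sampled = random.sample(sentences, min(num_sentences, len(sentences)))
    return {
        "word_count": word_count,
        "keyword_freq": keyword_freq,
        "sampled_sentences": sampled,
        "links": set(links),
    }
```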
The program's decision logic is roughly:
If (the word count is the same) and (the frequencies of the main keywords are the same) and (the randomly sampled sentences are identical) and (the links point to an article more than 90% similar to this one),
or if more than 5 sentences in the text (of varying length, roughly 5-30 words each) match verbatim,
then the page is judged to be plagiarized or pseudo-original.
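Continuing the hypothetical sketch above, that decision rule might look roughly like this in Python. The 90%-similarity check on linked articles is omitted, since it would require access to the whole index:

```python
def count_matching_sentences(sentences_a, sentences_b):
    """Count sentences of 5-30 words that appear verbatim in both texts."""
    eligible = {s for s in sentences_a if 5 <= len(s.split()) <= 30}
    return sum(1 for s in sentences_b if s in eligible)

def is_pseudo_original(sig_a, sig_b, sentences_a, sentences_b):
    """Hypothetical judgment rule from the article; thresholds are illustrative."""
    same_fingerprint = (
        sig_a["word_count"] == sig_b["word_count"]
        and sig_a["keyword_freq"] == sig_b["keyword_freq"]
        # the randomly sampled sentences all reappear in the other text
        and all(s in sentences_b for s in sig_a["sampled_sentences"])
        # the article also checks links pointing to a >90%-similar article;
        # that needs the full index, so it is left out of this sketch
    )
    return same_fingerprint or count_matching_sentences(sentences_a, sentences_b) > 5
```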
From this decision logic it is clear that simply deleting content, adding or merging some content, or swapping the order of sentences and paragraphs will not make a search engine treat a page as original. Why? Because such simple tricks leave the overall fingerprint visible: the word count, the keyword frequencies, and the links are all easy to adjust, but the verbatim comparison of randomly sampled sentences is not easy to defeat.
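A quick demonstration of that point: shuffling the sentence order leaves the word count and word frequencies untouched, and every sentence still matches verbatim (the example text is invented):

```python
from collections import Counter

original  = "Renting a car is easy. Compare prices first. Check the insurance terms."
reordered = "Check the insurance terms. Renting a car is easy. Compare prices first."

assert len(original.split()) == len(reordered.split())          # same word count
assert Counter(original.split()) == Counter(reordered.split())  # same word frequencies
# every sentence also survives verbatim, so the sentence comparison flags the copy
```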
2. Title
If the title is identical, the page was most likely copied. But the title can be changed: for example, "Common Sense of Car Rental in Chengdu" becomes "Teach You How to Rent a Car in Chengdu"; the meaning is unchanged while the text differs. So the title alone cannot tell us whether a page is original. However, the following analysis can be made:
Because a search engine holds far too much data to compare every page against every other page, word-segmentation technology is again used to narrow the comparison:
(1) If the page visited by the search engine spider is new, the spider first collects the page's content and stores it in a database (or other store); other programs then check whether the content is original or valuable. At this stage the content cannot yet be found through search.
(2) Analyze the content. Word segmentation is again applied to the title, body, and so on to extract the main topic of the page. An article such as "Jay Chou's 2010 Album" is then compared only against articles sharing keywords like "Jay Chou", "2010", and "Album", rather than against every page on the web. If the page is judged original and valuable, it is indexed and given a higher weight; if it is judged a copy or plagiarism, it is not indexed, or is given a very low weight.

Incidentally, whether a page can rank well depends not only on the weight of its own content but also on the weight of the whole site. For example, if this page's weight is 3 and the site's weight is 3, the total is 6. If another website reprints the article and the reprinted page's own weight is only 1 but that site's weight is 7, the total is 8. Since 6 < 8, the reprint will still rank ahead of the original.
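The weight arithmetic in that example can be written out explicitly. A toy sketch, assuming, as the article does, that the effective score is simply page weight plus site weight:

```python
def ranking_score(page_weight, site_weight):
    """Toy model from the article: effective score = page weight + site weight."""
    return page_weight + site_weight

original_score = ranking_score(page_weight=3, site_weight=3)  # 3 + 3 = 6
reprint_score  = ranking_score(page_weight=1, site_weight=7)  # 1 + 7 = 8

# 6 < 8: the reprint on the higher-weight site outranks the original
assert reprint_score > original_score
```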
Source: Shangpin China (cluster website construction).