Explaining the working principles of search engine spiders
Source: Shangpin China |
Type: website encyclopedia |
Date: June 22, 2012
Generally speaking, when a web spider updates a website's content, it does not need to re-crawl every page. For most pages, it only needs to read the page's properties (mainly the date) and compare them with the properties recorded during the previous crawl; if they are the same, the page does not need to be updated.

Search engines have clearly made a great contribution to the Internet, even though their history is not long. They have changed the world and changed users' habits, which makes me confident about the future of the Internet. The earliest search engines did not do a good job: the first ones did not even analyse the body copy of pages, and there was no ranking standard. The drive to tap their commercial potential is what pushed search engines to develop gradually into more advanced systems. The first relatively large commercial search engine came out of Stanford University in the United States; in 2001 it spent about US $6.5 billion on the @Home purchase. At the beginning of its promotion its biggest competitors were directory websites, mainly because many search results at that time were spam and people were not yet used to using search engines.

Meta tags were a tool to help search engines rank pages: the keywords in the meta tags told the engine what a page's content was about, so as soon as you searched for a keyword the engine could use the tags to return relevant results quickly. However, as some companies gained marketing experience, it became easy to manipulate keyword rankings this way. The practice is usually called keyword stuffing; at the time it was popular to stack keywords such as "loan, loan, loan", so search engines were flooded with spam, which made many users distrust them. The important search engines of that period included EINet Galaxy, WebCrawler, Lycos, Infoseek, Inktomi, Ask and AllTheWeb.

Each search engine has three main parts.

1. Spider
The spider's job is to find new pages, collect snapshots of those pages and then analyse them. Spiders support both deep retrieval and fast retrieval. In deep retrieval, the spider finds and scans all the content of a web page; in fast retrieval, the spider does not follow the rules of deep retrieval and only scans the important keyword parts without checking all of the page's content. As everyone knows, the snapshot time of a website matters most: the more often spiders crawl and re-collect a website's pages, the more important the search engine considers that website to be. For example, spiders crawl Xinhuanet and People's Daily Online more than four times per hour, while some websites may not be crawled even once a month. How often snapshots are captured depends on the popularity of the website's content, its update speed, and the age of the domain name. If many external links point to a website, that tells the engine the website is important, so the spider crawls it more frequently. Of course, search engines do this to save resources: if they crawled all websites at the same frequency, it would take far more time and money to obtain equally comprehensive search results.
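One simple way to implement the "compare the page's date with the last crawl" idea from the start of this article is an HTTP conditional request. The sketch below is a minimal illustration using only Python's standard library; the URL and the stored timestamp are made up for the example, and real spiders combine this with other signals.

import urllib.request
import urllib.error
from email.utils import formatdate

def needs_recrawl(url, last_crawl_timestamp):
    """Return True if the server says the page changed since the last crawl."""
    request = urllib.request.Request(url, method="HEAD")
    # Ask the server to answer "304 Not Modified" if nothing changed.
    request.add_header("If-Modified-Since",
                       formatdate(last_crawl_timestamp, usegmt=True))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status != 304
    except urllib.error.HTTPError as err:
        # urllib raises 304 as an HTTPError; treat it as "unchanged".
        return err.code != 304

if __name__ == "__main__":
    # Hypothetical example: page last crawled at Unix time 1340323200.
    print(needs_recrawl("https://example.com/", 1340323200))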
2. Index
While the spider crawls, the content of each page is checked for duplication: the engine checks whether the website's content is copied from other websites, so that original content is kept in the index, and pages that mostly copy content generally end up ranked lower. When you search, the search engine does not search the live web; it selects results from its index. The number of results a search returns therefore does not represent the whole web, only the pages that the spider has scanned and saved in the background; for example, a Google results page might report "Results 1-10 of about 160,500", with the result count shown alongside the results. Each search engine has also set up data centres in its home country or around the world, and the ranking of results in each region is controlled by the engine's algorithms and its index (or part of it). Because these data centres are synchronised at different times, the same keywords can return different results in different regions.

3. Web interface
The interface you see when you use a search engine (for example, google.com or baidu.com) calls results from the index, and what is displayed depends on complex algorithms. Results can only be shown in the foreground after query processing and analysis, so these algorithms take a long time to build, and Google leads in this technical field. One common feature, especially in English search, is the handling of stop words. Generally speaking, results are more accurate when the search engine ignores stop words: for example, when searching for "cat, dog", the engine will drop the "and" in "cat and dog" and only search for "cat, dog".

Keyword density measures how often a keyword appears on a page. If a search engine sees that a page's keyword density exceeds a reasonable range, it will analyse whether the page is cheating. Search engines can now handle word relevance in any region of a page, so in general keywords should be scattered throughout the page, although the title or a key paragraph should stay relatively stable.
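Keyword density, as described above, is simply the share of a page's words taken up by one keyword. The sketch below shows the basic calculation in Python; the 8% threshold is an arbitrary illustration, not a figure any search engine has published.

import re

def keyword_density(page_text, keyword):
    """Fraction of the page's words that match the keyword."""
    words = re.findall(r"[\w']+", page_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return hits / len(words)

def looks_stuffed(page_text, keyword, threshold=0.08):
    """Flag a page whose keyword density exceeds the (assumed) threshold."""
    return keyword_density(page_text, keyword) > threshold

if __name__ == "__main__":
    sample = "loan loan loan apply for a loan today loan"
    print(round(keyword_density(sample, "loan"), 2))  # 0.56
    print(looks_stuffed(sample, "loan"))              # True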
Besides page ranking and ordinary link counting, search engines have a core analysis technology: link relevance analysis. Google also values anchor-text links; their weight depends mainly on the age and position of the link and on whether the linking page belongs to an authoritative website. Search engines care a great deal about this, and links are the biggest indicator of website quality, because genuine reciprocal links are hard to obtain, are badly needed, and carry very little spam. For example, university websites have a high weight in Google because universities attract many high-quality external links. Since everyone knows how important external links are, many websites have begun to buy and sell links, which is now a headache for search engines. Ask, however, now makes a website's ranking depend more on the quality of the website itself. All search engines hope to get user feedback: the query itself, the interval between queries, semantic relationships and so on help the engine understand the user's intent, and user clicks are tracked. If a user clicks a result and then immediately returns to the results page, the search engine assumes the "purchase" was unsuccessful and removes the item from its tracking list.

In fact, this practice comes close to e-commerce, and it shows that search engines have begun to focus on user experience. Making users feel that their effort pays off may become a standard in the search engine industry, and the future direction may well be personalised search.

The working principle of a search engine can be roughly divided into three steps.

Collecting information: much like "spreading the word" in daily life, the information collection of a search engine is basically automatic. The engine uses spiders (robots) that follow the hyperlinks on every web page. Because hyperlinks lead from one page to others, a robot that starts from a few pages and follows all of their links can, in theory, traverse the great majority of web pages, provided the pages are properly hyperlinked, and store them in the database.

Sorting information: the process by which a search engine organises the collected information is called "indexing". A search engine does not simply keep what it has collected; it arranges the information according to certain rules so that it can quickly find what is needed without scanning everything it has stored. Imagine if the information were piled into the search engine's database at random, without any rules: every lookup would have to scan the entire database, and no matter how fast the computer system was, it would be useless.

Accepting queries: users send queries to the search engine, which receives queries from a huge number of users almost simultaneously at every moment. It checks the index according to each user's request, finds the data the user needs in a very short time, and returns it. At present, search engines mainly return results as lists of web page links; through these links users can reach the pages containing the information they need. Under each link the engine usually provides a short summary taken from the page to help the user judge whether it contains what they need.

The principle of the web spider
"Web spider" (WebSpider) is a very vivid name: if the Internet is thought of as a spider web, then the spider is a program crawling back and forth on that web. A web spider finds web pages through their link addresses. It starts from one page of a website (usually the home page), reads the content of that page, finds the other link addresses on the page, and then uses those addresses to find the next pages. This cycle continues until all pages of the website have been crawled. If the whole Internet is regarded as one website, web spiders can use this principle to grab all the pages on the Internet.

In practice, however, it is almost impossible to crawl every page on the Internet. Judging from the data published so far, the largest-capacity search engine only captures about 40% of the total number of pages. One reason is the bottleneck of crawling technology: it is impossible to traverse all pages, since many pages cannot be reached from the links of other pages. Another reason is storage and processing capacity. If the average page size is 20 KB, then 10 billion pages (including pictures) amount to roughly 100 x 2,000 GB, or about 200 TB; even if that could be stored, downloading it is a problem (at a download rate of 20 KB per second per machine, it would take about 340 machines a full year of non-stop downloading to fetch all the pages). At the same time, because the amount of data is so large, search efficiency would also suffer. Therefore, the web spiders of many search engines only crawl the important pages, and the main criterion they use when crawling is the link depth of a page.
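The back-of-the-envelope numbers above can be checked with a few lines of Python, using the article's own assumptions (20 KB per page, 10 billion pages, one machine downloading at 20 KB per second):

PAGE_SIZE_KB = 20
PAGE_COUNT = 10_000_000_000
DOWNLOAD_RATE_KB_PER_S = 20

total_kb = PAGE_SIZE_KB * PAGE_COUNT
total_gb = total_kb / 1024 / 1024
print(f"Total size: about {total_gb:,.0f} GB")  # ~190,735 GB, i.e. roughly 100 x 2,000 GB

seconds_for_one_machine = total_kb / DOWNLOAD_RATE_KB_PER_S
machines_for_one_year = seconds_for_one_machine / (365 * 24 * 3600)
print(f"Machines needed to finish in one year: about {machines_for_one_year:.0f}")  # ~317, the same order as the 340 quoted above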
Web spiders generally follow one of two strategies: breadth-first or depth-first (as shown in the figure below). Breadth-first means that the spider first crawls all the pages linked from the starting page, then picks one of those linked pages and crawls all the pages linked from it, and so on. This is the most commonly used method, because it lets web spiders work in parallel and improves crawl speed. Depth-first means that the spider starts from the start page and follows one chain of links all the way down; only after finishing that line does it return to the next start page and follow its links. One advantage of this method is that the spider is easier to design. The difference between the two strategies is clearer in the figure.

Because it is impossible to crawl all pages, some web spiders set a limit on the number of link layers they will visit for websites that are not very important. For example, in the figure above, A is the starting page and belongs to layer 0; B, C, D, E and F belong to layer 1; G and H belong to layer 2; and I belongs to layer 3. If the spider's access limit is set to 2 layers, page I will never be visited. This is also why, for some websites, part of the pages can be found through a search engine while another part cannot. For website designers, a flat website structure helps search engines capture more pages.

Web spiders also often run into encrypted data and page permissions: some pages can only be accessed with member privileges. Of course, the owner of a website can use a protocol to keep spiders away (this is introduced in the next section), but some websites that sell their content (for example, paid reports) want search engines to be able to find that content without letting searchers view it completely free of charge. In that case the site needs to give the web spider a corresponding user name and password. The spider can then crawl those pages with the given permissions and make them searchable, while a searcher who clicks through to the page must still pass the same permission check.

The website and the web spider
Unlike ordinary visitors, web spiders crawl a great many pages, and if this is not controlled well the website's server can be overloaded. In April this year, Taobao's servers became unstable because the web spiders of Yahoo's search engine were grabbing its data. So can websites communicate with web spiders? In fact there are many ways for them to do so: on the one hand, to let the website administrator know where the spiders come from and what they have done; on the other hand, to tell the spiders which pages should not be crawled and which pages should be updated.

When a web spider crawls a page it identifies itself to the website: each spider has its own name, and the request it sends contains a User-agent field that identifies it. If the website keeps access logs, the administrator can see which search engine spiders have visited, when they came and how much data they read. If the administrator finds a problem with a particular spider, he or she can contact its owner using this identification.
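Below is a minimal sketch, using only Python's standard library, of a breadth-first crawl with a layer limit like the A / B-F / G-H / I example above; it also identifies itself with a User-agent string, as described in the last paragraph. The start URL, spider name and depth limit are made up, and a real spider would additionally respect Robots.txt (next section), politeness delays and error handling.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_breadth_first(start_url, max_depth=2, user_agent="ExampleSpider/0.1"):
    seen = {start_url}
    queue = deque([(start_url, 0)])              # (url, layer)
    while queue:
        url, depth = queue.popleft()
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                             # skip dead links and fetch errors
        print(f"layer {depth}: {url}")
        if depth >= max_depth:
            continue                             # do not follow links any deeper
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

# crawl_breadth_first("https://example.com/")   # hypothetical start page

With max_depth set to 2, a page three links away from the start page (page I in the example) is never enqueued, which is exactly the layer-limit behaviour described above.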
When a web spider enters a website, it generally first accesses a special text file, Robots.txt, which is usually placed in the root directory of the website server. The website administrator can use Robots.txt to define which directories must not be accessed by any web spider, or which directories must not be accessed by certain spiders. For example, if the executable file directory and the temporary file directory of a website should not be searchable, the administrator can define them as denied directories. The Robots.txt syntax is simple; for example, placing no restrictions on any directory can be expressed in the following two lines:

User-agent: *
Disallow:

Of course, Robots.txt is only a convention. If the designer of a web spider does not follow it, the webmaster cannot prevent the spider from accessing certain pages; but mainstream web spiders do follow it, and the webmaster can also refuse spiders access to certain pages in other ways.

When a web spider downloads a page it can recognise the page's HTML code, and part of that code may contain META tags. Through these tags the page can tell the spider whether it needs to be crawled, and whether the links on the page should be followed. For example, a tag such as <meta name="robots" content="noindex,follow"> means that the page itself does not need to be indexed, but the links on the page should still be followed. Interested readers can consult reference [4] for details of the Robots.txt syntax and the META tag syntax.

Nowadays most websites want search engines to capture their pages as comprehensively as possible, because this brings more visitors to the site through search engines. To help spiders crawl the site comprehensively, the webmaster can build a website map, or SiteMap. Many web spiders treat the sitemap.htm file as an entry point for crawling a website's pages; the administrator can put links to all pages of the site in this file, so that the spider can easily crawl the whole site, avoid missing pages, and reduce the load on the server.
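As a small illustration of honouring Robots.txt in code, Python's standard library includes a robots-file parser. The sketch below checks whether a hypothetical spider may fetch a hypothetical URL; the site and spider name are invented.

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()                                   # download and parse Robots.txt

if parser.can_fetch("ExampleSpider", "https://example.com/private/report.html"):
    print("Allowed to crawl this page")
else:
    print("Robots.txt asks spiders to stay away from this page")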
Content extraction
For a web spider, the object it ultimately processes when the search engine builds its index is a text file. The pages it captures come in many formats: HTML, pictures, DOC, PDF, multimedia, dynamic pages and so on. Once these files have been captured, the text information in them must be extracted. Extracting this information accurately matters for the precision of the search engine on the one hand, and on the other hand affects how correctly the spider can follow the further links in the document. For documents such as DOC and PDF, which are produced by software from professional vendors, the vendors usually provide text-extraction interfaces, and the spider only needs to call these plug-in interfaces to extract the text and other relevant information from the file. HTML documents are different: HTML has its own syntax, in which different tags express fonts, colours, positions and other layout features, and these tags must be filtered out when the text is extracted.

Filtering the tags themselves is not difficult, because they follow fixed rules and the corresponding information only needs to be read out according to each tag. While recognising them, however, a lot of layout information should be recorded at the same time, such as the font size of the text, whether it is a title, whether it is displayed in bold, whether it is one of the page's keywords, and so on. This information helps to calculate the importance of each word within the page. At the same time, an HTML page contains, besides the title and the body, many advertising links and links to common site-wide channels. These links have nothing to do with the body text and also need to be filtered out when the page content is extracted. For example, suppose a website has a "product introduction" channel whose link appears in the navigation bar on every page of the site. If the navigation-bar links are not filtered, a search for "product introduction" would match every page of the website and return a great deal of garbage. Filtering such invalid links requires collecting statistics on a large number of page-structure patterns, extracting what they have in common, and filtering them uniformly; a few important websites with special layouts need to be handled individually. This requires the web spider to be designed with a certain degree of extensibility.

For multimedia files, pictures and other such files, their content is generally judged from the anchor text of the link (that is, the link text) and from related file comments. For example, if the link text is "Maggie Cheung photo" and the link points to a picture in BMP format, the web spider knows that the content of the picture is a photo of Maggie Cheung, so that when someone searches for "Maggie Cheung" and "photo", the search engine can find this picture. In addition, many multimedia files carry file attributes, which also help in understanding their content.

Compared with static pages, dynamic pages have always been a harder problem for web spiders. So-called dynamic pages are pages generated automatically by a program. Their advantage is that the style of the pages can be changed quickly and uniformly and they take up less space on the server, but they also bring trouble to web spiders. As development languages keep multiplying, there are more and more types of dynamic pages, such as ASP, JSP and PHP. Pages of these types are still relatively easy for a spider to handle; pages generated by script languages such as VBScript and JavaScript are much harder, and a spider that wants to handle them properly needs its own script interpreter. For the many websites whose data sits in a database, visitors have to query the site's own database to obtain the information, which makes crawling very difficult. For such sites, if the designer wants the data to be searchable, the site needs to provide a way to traverse the entire database content.

Content extraction has always been an important technology in web spiders. The whole system is generally built in the form of plug-ins: a plug-in management service manages them, and pages of different formats are processed by different plug-ins. The advantage of this approach is good extensibility: whenever a new file type is encountered in the future, its handler can be implemented as a new plug-in and added to the plug-in management service.
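The tag-filtering and layout-recording step described earlier in this section can be sketched with Python's standard HTML parser: markup is stripped while a few simple layout hints (title text, bold text) are kept so they can later be used to weight words. The sample page is invented, and real extractors also have to drop navigation and advertising blocks, which this sketch does not attempt.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip HTML tags while recording title text and bold text."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_bold = False
        self.title = []
        self.bold_text = []
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("b", "strong"):
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("b", "strong"):
            self.in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title.append(text)
        else:
            self.body_text.append(text)
            if self.in_bold:
                self.bold_text.append(text)

extractor = TextExtractor()
extractor.feed("<html><title>Product introduction</title>"
               "<body><p>Our <b>new</b> product ...</p></body></html>")
print(extractor.title)      # ['Product introduction']
print(extractor.bold_text)  # ['new']
print(extractor.body_text)  # ['Our', 'new', 'product ...']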
Update cycle
Because the content of websites changes constantly, web spiders also need to keep updating the content of the pages they have crawled. This requires the spider to re-scan websites on a certain cycle to see which pages need to be updated, which pages are new, and which pages have become expired dead links. The update cycle of a search engine has a great influence on its recall: if the cycle is too long, there will always be a portion of newly generated pages that cannot be found; if it is too short, the technical implementation becomes difficult and bandwidth and server resources are wasted. The web spiders of a search engine therefore do not update all websites on the same cycle. For important websites with a large volume of updates, the cycle is short; for example, some news websites are updated every few hours. Conversely, for some unimportant websites the cycle is long, perhaps once every month or two.
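As a rough illustration of per-site update cycles, the sketch below keeps a priority queue of sites keyed by when each is next due for a re-crawl. The intervals and site names are invented for the example and do not reflect any search engine's real policy.

import heapq

# (re-crawl interval in seconds, site) -- assumed values only
SITES = [
    (4 * 3600, "https://news.example.com/"),          # news site: every few hours
    (7 * 24 * 3600, "https://blog.example.com/"),     # ordinary site: weekly
    (60 * 24 * 3600, "https://static.example.com/"),  # rarely updated: about two months
]

def build_schedule(sites, now=0):
    """Return a heap of (next_visit_time, interval, url) entries."""
    schedule = [(now + interval, interval, url) for interval, url in sites]
    heapq.heapify(schedule)
    return schedule

def pop_next_due(schedule):
    """Take the next site due for a re-crawl and queue its following visit."""
    due_at, interval, url = heapq.heappop(schedule)
    heapq.heappush(schedule, (due_at + interval, interval, url))
    return due_at, url

schedule = build_schedule(SITES)
for _ in range(4):
    print(pop_next_due(schedule))   # the news site comes up four times before the others are due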