Search engine inclusion principle
Collecting page data from the Internet is the most basic work of a search engine. A search engine's data collection capability directly determines how much information it can provide and how much of the Internet it covers, and therefore its overall quality. For this reason, search engines constantly try to improve their data collection capabilities.
1. Page inclusion process
On the Internet, a URL is the entrance address of a page, and the search engine spider program fetches pages through URLs. The spider starts from a URL list, fetching and storing each page; at the same time, it extracts the URL resources from each fetched page and adds them to the URL list. By repeating this cycle, it can obtain enough pages from the Internet.
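The fetch-extract-enqueue cycle described above can be sketched as a simple loop over a URL list. The link graph, seed URLs, and the idea of fetching a page as a list of its links below are simplified stand-ins for illustration, not a real spider:

```python
from collections import deque

# A toy "Internet": each URL maps to the list of URLs found on that page.
# (Hypothetical example data, not real pages.)
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": [],
}

def crawl(seed_urls):
    """Repeatedly take a URL from the list, store its page,
    and add any newly discovered URLs back to the list."""
    url_list = deque(seed_urls)
    stored = {}               # URL -> original page (here, just its links)
    seen = set(seed_urls)
    while url_list:
        url = url_list.popleft()
        page_links = PAGES.get(url, [])  # stand-in for fetching the page
        stored[url] = page_links         # store the original page
        for link in page_links:          # extract URL resources from it
            if link not in seen:
                seen.add(link)
                url_list.append(link)    # grow the URL list
    return stored

pages = crawl(["http://example.com/"])
```

Starting from a single seed URL, the loop discovers and stores all four pages of the toy graph, mirroring the "continuous cycle" in the text.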
The URL is the entrance to a page, and the domain name is the entrance to a website. The search engine spider enters a website through its domain name and from there starts crawling the website's pages. In other words, to crawl pages on the Internet, a search engine must first build a sufficiently large list of domain names, then enter each corresponding website through its domain name and crawl the pages inside it.
For a website, the first condition for being included in search engines is to join their domain name lists. Here are two common ways to do so.
First, use the website submission portal provided by the search engine to submit the website's domain name. For example, Google's submission address is //www.google.com/intl/zh-CN/add_url.html , where you can submit your own domain name. The search engine only updates the submitted domain name list periodically, so this practice is relatively passive, and it can take a long time from domain name submission to website inclusion.
Second, establish link relationships with external websites, so that search engines can discover our website through those external sites and include it. The initiative here is in our own hands (as long as we have enough high-quality links), and the speed of inclusion is much faster than active submission: depending on the quantity, quality, and relevance of the external links, a site is generally included within about 2-7 days.
2. Page recording principle
The above explains how a website gets included by search engines. But how can we increase the number of a website's pages that are included? To answer this, we first need to understand the working principle by which search engines include pages.
If the collection of website pages is regarded as a directed graph, the crawler starts from a specified page and, following the links in each page, traverses the website's pages according to a specific strategy. It continually removes visited URLs from the URL list, stores the original pages, and extracts URL information from them; the extracted URLs are then divided into domain names and page URLs and checked against the resources already crawled. Through this work, the search engine builds a huge list of domain names and page URLs and stores enough original pages.
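Splitting each extracted URL into a domain name (the website entrance) and a page URL (the page entrance) can be illustrated with the standard library. This is a minimal sketch; real engines also normalize, canonicalize, and deduplicate URLs far more carefully:

```python
from urllib.parse import urlparse

def split_urls(urls):
    """Separate extracted URLs into a domain-name list and a page-URL list."""
    domains = set()
    page_urls = set()
    for url in urls:
        parsed = urlparse(url)
        domains.add(parsed.netloc)  # website entrance: the domain name
        page_urls.add(url)          # page entrance: the full URL
    return domains, page_urls

# Hypothetical extracted URLs for illustration.
domains, page_urls = split_urls([
    "http://example.com/products/1",
    "http://example.com/about",
    "http://other.org/index",
])
```

New domain names feed the domain name list (new websites to enter), while page URLs feed the URL list of pages still to be crawled.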
3. Page recording method
The process and principle by which search engines include pages were introduced above. But among the hundreds of millions of pages on the Internet, how do search engines capture the relatively important ones? This involves the search engine's page inclusion methods.
A page inclusion method refers to the crawling strategy a search engine uses to filter out the relatively important information on the Internet. The choice of method depends on the search engine's understanding of website structure. If a suitable crawling strategy is used, the search engine can capture more page resources within a website in the same amount of time and will stay on the site longer, so naturally more pages will be included. Therefore, a deeper understanding of search engine page inclusion methods helps in building a search-friendly website structure and increasing the number of pages included.
>>Breadth first
If the whole website is regarded as a tree, the home page is the root and each page is a leaf. Breadth first is a horizontal fetching method that crawls from the shallower layers of the tree, finishing all pages in one layer before entering the next. Therefore, when optimizing a website, relatively important information should be displayed on shallow pages (for example, recommending popular products or content on the home page); conversely, through breadth-first crawling, the search engine can crawl the relatively important pages of the website first.
First, the search engine starts from the website home page, fetches the pages pointed to by all links on the home page to form page set A, and parses the links in all pages of set A; it then follows those links to fetch the next layer of pages, forming page set B. In this way, links are recursively parsed from the shallow-layer pages so that ever deeper pages are crawled, and the process stops when some preset condition is met.
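The layer-by-layer process above (home page → set A → set B → …) is a breadth-first traversal. Below is a sketch over a toy link graph; the `LINKS` map and the depth limit standing in for the "preset condition" are illustrative assumptions:

```python
from collections import deque

# Toy site tree: home page links to categories, categories to items.
LINKS = {
    "home": ["cat1", "cat2"],   # shallow, relatively important pages
    "cat1": ["item1", "item2"],
    "cat2": ["item3"],
    "item1": [], "item2": [], "item3": [],
}

def bfs_crawl(root, max_depth=2):
    """Crawl one layer at a time; stop expanding past the depth limit."""
    order = []
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:              # the condition that stops crawling
            continue
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

order = bfs_crawl("home")
```

Note that both categories (layer 1) are crawled before any item (layer 2), so the shallow, important pages are reached first.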
>>Depth first
In contrast to breadth first, depth first follows one link from a shallow page and crawls progressively deeper pages; after reaching the deepest page, it returns to the shallow page, follows another link, and continues crawling into the depths. In this way, search engines can capture the deeper, more specific pages of a website, which can meet the needs of more users.
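Depth first can be sketched with the same toy link graph used above (an illustrative assumption, not real crawler code): each branch is followed to its deepest page before the crawler backtracks and takes the next link.

```python
# Toy site tree, same shape as in the breadth-first sketch.
LINKS = {
    "home": ["cat1", "cat2"],
    "cat1": ["item1", "item2"],
    "cat2": ["item3"],
    "item1": [], "item2": [], "item3": [],
}

def dfs_crawl(url, seen=None, order=None):
    """Follow one link chain to the deepest page, then backtrack
    and continue with the next unvisited link."""
    if seen is None:
        seen, order = set(), []
    seen.add(url)
    order.append(url)
    for link in LINKS.get(url, []):
        if link not in seen:
            dfs_crawl(link, seen, order)   # go deeper before moving sideways
    return order

order = dfs_crawl("home")
```

Here the crawler exhausts the `cat1` branch (down to its items) before it ever touches `cat2`, the opposite of the layer-by-layer breadth-first order.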