Search engines support the nofollow and noarchive directives.
Methods for prohibiting search engine inclusion
1. What is the robots.txt file? Search engines use spider programs to automatically visit web pages on the Internet and collect their content. When a spider visits a website, it first checks whether a plain text file called robots.txt exists under the root of the site's domain. This file specifies the scope of the spider's crawling on your site. You can create a robots.txt file for your website and declare in it the parts of the site that you do not want search engines to include, or specify that search engines may include only certain parts.
Please note that you need a robots.txt file only if your site contains content that you do not want search engines to include. If you want search engines to include everything on your site, do not create a robots.txt file.
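To make this concrete, here is a minimal sketch of how a crawler can consult robots.txt before fetching a page, using Python's standard urllib.robotparser module (www.example.com is a hypothetical site):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # hypothetical site
rp.read()  # fetch and parse the file

# Ask whether a given robot may fetch a given URL.
print(rp.can_fetch("Baiduspider", "http://www.example.com/private/page.html"))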
2. Where is the robots.txt file located? The robots.txt file must be placed in the root directory of the website. When a spider visits a site (for example, //www.abc.com), it first checks whether //www.abc.com/robots.txt exists. If the spider finds the file, it determines the scope of its access permissions from the file's contents.
Website URL            Corresponding robots.txt URL
//www.w3.org/          //www.w3.org/robots.txt
//www.w3.org:80/       //www.w3.org:80/robots.txt
//www.w3.org:1234/     //www.w3.org:1234/robots.txt
//w3.org/              //w3.org/robots.txt
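The mapping in the table above is mechanical: keep the URL's scheme, host, and port, and replace the path with /robots.txt. A small Python sketch of the derivation:

from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    parts = urlsplit(site_url)
    # Keep scheme and network location (host plus any explicit port);
    # replace the path with /robots.txt and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org:1234/"))  # http://www.w3.org:1234/robots.txt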
3. I have set robots.txt to prohibit search engines from including my website's content. Why do my pages still appear in search results? If other websites link to pages that your robots.txt file bans from inclusion, those pages may still appear in search results, but their content will not be crawled, indexed, or displayed. The search results show only how other websites describe those pages.
4. Prohibit search engines from following a page's links while still indexing the page. If you do not want search engines to follow the links on a page, or to pass link weight through them, place this meta tag in the <head> section of the page:
<meta name="robots" content="nofollow">
If you do not want search engines to follow one specific link, some search engines also support finer-grained control: place the rel="nofollow" attribute directly on the link, for example:
<a rel="nofollow" href="signin.php">sign in</a>
To allow other search engines to follow the page's links, but prevent only Baidu from following them, place this meta tag in the <head> section of the page:
<meta name="Baiduspider" content="nofollow">
5. Prohibit search engines from displaying a page snapshot in search results while still indexing the page. To prevent all search engines from displaying a snapshot of your website, place this meta tag in the <head> section of the page:
<meta name="robots" content="noarchive">
To allow other search engines to display snapshots, but prevent only Baidu from displaying them, use the following tag:
<meta name="Baiduspider" content="noarchive">
Note: this tag only prohibits the search engine from displaying the page's snapshot. The search engine will continue to index the page and display a summary of it in the search results.
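For illustration, here is a sketch of how a crawler might read these meta tags from a fetched page, using Python's standard html.parser; the HTML snippet is a made-up example:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects directives from "robots" or "Baiduspider" meta tags.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() in ("robots", "baiduspider"):
            content = attrs.get("content", "")
            self.directives.update(d.strip().lower() for d in content.split(","))

page = '<html><head><meta name="robots" content="noarchive"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("noarchive" in parser.directives)  # True
print("nofollow" in parser.directives)   # False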
6. I want to prohibit Baidu Image Search from including some of my images. How do I set this up? To prevent Baiduspider from grabbing all images on your site, or to prohibit or allow Baiduspider to grab images of a specific format, set it in robots.txt. See examples 10, 11, and 12 under "Examples of robots.txt file usage" below.
7. The format of the robots.txt file. The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the form "<field>:<optional space><value><optional space>".
You can use # for comments in this file, in the same way as in UNIX conventions. Records in this file usually start with one or more User-agent lines, followed by several Disallow and Allow lines. The details are as follows:
User-agent:
The value of this field names the search engine robot. If the "robots.txt" file contains multiple User-agent records, multiple robots are restricted by the file; the file must contain at least one User-agent record. If the value is set to *, the record applies to every robot, and there may be only one "User-agent: *" record in the file. If "User-agent: SomeBot" together with several Disallow and Allow lines is added to the file, the robot named "SomeBot" is restricted only by the Disallow and Allow lines that follow "User-agent: SomeBot".
Disallow:
The value of this field describes a set of URLs that should not be accessed. The value can be a complete path or a non-empty path prefix; any URL that begins with the value of the Disallow field will not be visited by the robot. For example, "Disallow: /help" prevents the robot from accessing /help.html, /helpabc.html, and /help/index.html, while "Disallow: /help/" lets the robot access /help.html and /helpabc.html but not /help/index.html. "Disallow:" with an empty value means the robot may access all URLs of the website. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" does not exist or is an empty file, the website is open to all search engine robots.
Allow:
The value of this field describes a set of URLs that the robot is allowed to access. As with Disallow, the value can be a complete path or a path prefix; any URL that begins with the value of the Allow field may be accessed by the robot. For example, "Allow: /hibaidu" permits the robot to access /hibaidu.htm, /hibaiducom.html, and /hibaidu/com.html. All URLs of a website are allowed by default, so Allow is usually used together with Disallow to permit access to some pages while prohibiting access to all others. A minimal parsing sketch follows.
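The sketch below (an illustration under the rules above, not any engine's actual implementation) groups User-agent, Disallow, and Allow lines into records and shows that a robot named in its own User-agent line is governed only by that record, not by the "*" record:

def parse_robots(text):
    groups = []                # list of (user_agents, rules) records
    agents, rules = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # "#" starts a comment
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                          # a new record begins
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("disallow", "allow"):
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def rules_for(groups, robot):
    # Prefer the record that names the robot; fall back to the "*" record.
    for agents, rules in groups:
        if robot in agents:
            return rules
    for agents, rules in groups:
        if "*" in agents:
            return rules
    return []

text = "User-agent: SomeBot\nDisallow: /private/\n\nUser-agent: *\nDisallow: /tmp/\n"
print(rules_for(parse_robots(text), "SomeBot"))  # [('disallow', '/private/')]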
Use "*" and "$":
Baiduspider supports the wildcards "*" and "$" for fuzzy matching of URLs:
"$" matches the end of the URL.
"*" matches zero or more arbitrary characters.
8. URL matching examples
Allow or Disallow value    URL            Match result
/tmp                       /tmp           yes
/tmp                       /tmp.html      yes
/tmp                       /tmp/a.html    yes
/tmp/                      /tmp           no
/tmp/                      /tmphoho       no
/Hello*                    /Hello.html    yes
/He*lo                     /Hello,lolo    yes
/Heap*lo                   /Hello,lolo    no
html$                      /tmpa.html     yes
/a.html$                   /a.html        yes
htm$                       /a.html        no
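The matching behavior in the table can be reproduced with a short Python sketch: "*" becomes ".*" and "$" becomes an end-of-string anchor in a regular expression. Values beginning with "/" are matched as prefixes of the URL; the "html$" row suggests that values without a leading "/" may match anywhere in the URL, and this sketch follows that reading (an assumption, not a documented rule):

import re

def rule_matches(value, url):
    regex = "".join(".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
                    for ch in value)
    if value.startswith("/"):
        return re.match(regex, url) is not None   # prefix match from the start
    return re.search(regex, url) is not None      # assumed: match anywhere

# Rows from the table above.
assert rule_matches("/tmp", "/tmp.html")
assert not rule_matches("/tmp/", "/tmphoho")
assert rule_matches("/He*lo", "/Hello,lolo")
assert not rule_matches("/Heap*lo", "/Hello,lolo")
assert rule_matches("html$", "/tmpa.html")
assert not rule_matches("htm$", "/a.html")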
9. Examples of robots.txt file usage
Example 1. Prohibit all search engines from accessing any part of the website
User-agent: *
Disallow: /
Example 2. Allow all robots to access
(Or you can create an empty "/robots.txt" file.)
User-agent: *
Allow: /
Example 3. Prohibit only Baiduspider from visiting your website
User-agent: Baiduspider
Disallow: /
Example 4. Allow only Baiduspider to visit your website
User-agent: Baiduspider
Allow: /
User-agent: *
Disallow: /
Example 5. Allow only Baiduspider and Googlebot to access
User-agent: Baiduspider
Allow: /
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Example 6. Forbid spiders from accessing specific directories
In this example, the website has three directories that restrict search engine access; the robot will not visit these three directories. Note that each directory must be declared separately, not written as "Disallow: /cgi-bin/ /tmp/".
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 7. Allow access to some URLs in a specific directory
User-agent: *
Allow: /cgi-bin/see
Allow: /tmp/hi
Allow: /~joe/look
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 8. Use "*" to restrict access to URLs
Access to any URL under the /cgi-bin/ directory (including subdirectories) with the suffix ".htm" is prohibited.
User-agent: *
Disallow: /cgi-bin/*.htm
Example 9. Use "$" to restrict access to URLs
Only URLs with the suffix ".htm" may be accessed.
User-agent: *
Allow: /*.htm$
Disallow: /
Example 10. Prohibit access to all dynamic pages on the website
User-agent: *
Disallow: /*?*
Example 11. Prohibit Baiduspider from grabbing all images on the website
Only web pages may be crawled; no images may be crawled.
User-agent: Baiduspider
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
Example 12. Allow Baiduspider to grab only web pages and .gif images (images in other formats are prohibited)
User-agent: Baiduspider
Allow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.bmp$
Example 13. Prohibit only Baiduspider from grabbing .jpg images
User-agent: Baiduspider
Disallow: /*.jpg$