Every website has a robots.txt file. Robots.txt contains very important instructions about how the pages on your website should be crawled. Major search engines like Google, Bing, and Yahoo follow the instructions in the robots.txt file. It lets you exclude certain pages of the site from crawling, for example pages you do not want showing up in Google Search Console or in search results.
Where to find your robots.txt file?
You can find your robots.txt file by opening your own domain in a browser and adding /robots.txt to the address.
Type exampledomain.com/robots.txt into the address bar.
Your robots.txt file will open.
Every program that is active on the web has a unique assigned name, called a user agent.
User-agent: *
This means the instructions that follow apply to every bot.
User-agent: Googlebot
This means the instructions that follow apply only to Googlebot.
User-agent: Bingbot
This means the instructions that follow apply only to Bingbot.
User-agent: Slurp
This means the instructions that follow apply only to Yahoo's bot (Slurp).
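For instance, these groups can be combined in a single robots.txt file, with each User-agent line followed by its own rules (the Disallow command is explained below). This is only a sketch, and the blocked path is a placeholder:
User-agent: Googlebot
Disallow: /example-private/
User-agent: *
Disallow: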
The ‘Disallow’ command tells the bots matched by the User-agent field not to access the particular webpage or set of webpages that come after the command.
Disallow:
An empty Disallow tells bots that they can browse all the pages of the site, i.e. the entire website, because nothing is disallowed.
Disallow: /
The single slash stands for the root of your site, so this rule blocks the homepage and every page within the site.
This means a single slash can remove a whole website from crawling by any bot, including Google, Yahoo, and Bing.
Disallow: /directory/
This command means that no page contained within the particular ‘directory’ should be crawled.
Disallow: /*.pdf$
This command prevents bots from crawling any file with a .pdf extension (the * is a wildcard and the $ marks the end of the URL). It is mainly used when a website has documents or data that should not be crawled or surface in search results, though robots.txt alone is not a true security control.
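Putting the patterns above together, here is a sketch of one group of rules for all bots; the directory names are placeholders, not real paths:
User-agent: *
Disallow: /admin/
Disallow: /private-directory/
Disallow: /*.pdf$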
Allow: Just as one might expect, the “Allow” command tells bots they are allowed to access a certain webpage or directory. Not all search engines recognize this command.
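A common pattern is to disallow a whole directory but allow one page inside it back in. This is only a sketch with placeholder paths, and as noted, not every search engine honors Allow:
User-agent: *
Disallow: /photos/
Allow: /photos/press-kit.html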
Crawl-delay: This command allows administrators to specify how long a bot should wait between each request, in seconds. For example, a Crawl-delay command asking bots to wait 8 seconds between requests:
Crawl-delay: 8
Google does not recognize this command, although other search engines do.
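Because support varies, the delay is usually placed inside the group for a bot that honors it. A sketch with Bingbot, reusing the same example value of 8 seconds:
User-agent: Bingbot
Crawl-delay: 8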
The Sitemaps protocol helps bots know what to include when crawling a website. A sitemap is a machine-readable list of all the pages on a website.
The format is: “Sitemap:” followed by the full web address of the XML file.
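For example, using the placeholder domain from earlier (the sitemap URL here is only an assumption about where the XML file lives):
Sitemap: https://exampledomain.com/sitemap.xml
The Sitemap line can appear anywhere in the file and applies to all bots, regardless of the User-agent groups.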
I hope you enjoyed reading this article. Thank you.