Web crawling is the process of systematically accessing data on the internet. Search engines commonly use it to discover and retrieve content, which is what makes it possible for people to find their way to your website.
Generative AI tools rely on the same technology to locate material, including images, that may end up in their training datasets. Using robots.txt, you can deter a web crawler from accessing your information. Robots.txt is a plain text file placed at the root of your website; it contains instructions on which pages and/or files a web crawler may access.
Structure of robots.txt
A robots.txt file consists of one or more groups, each made up of two parts: a user-agent line and a set of directives. The user-agent identifies the web crawler the rules apply to, and the directives are the instructions that crawler should follow.
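For example, a minimal file with two groups might look like the sketch below (the /drafts/ path is a placeholder, and Bingbot is used only as an example of a named crawler):
User-agent: *
Disallow: /drafts/

User-agent: Bingbot
Disallow: /
Here the first group tells every crawler to stay out of the hypothetical /drafts/ directory, while the second group blocks Bingbot from the entire site.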
Creating and Saving your robots.txt:
1) Create your robots.txt file in Notepad or any plain text editor.
2) Add the directives corresponding to the type of block you want to create.
3) Save the file as “robots.txt” in the root directory of your domain (it must sit at the top level of your site to be found).
4) Test your robots.txt by visiting:
https://yourwebsite.com/robots.txt
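If you prefer to verify programmatically, Python's standard-library robotparser can read your live robots.txt and report whether a given crawler may fetch a given path. This is a minimal sketch; the site URL, user-agent strings, and test path are placeholders for your own values.
from urllib import robotparser

# Load the live robots.txt file (replace with your own domain).
parser = robotparser.RobotFileParser()
parser.set_url("https://yourwebsite.com/robots.txt")
parser.read()

# True means the named crawler is allowed to fetch that URL;
# False means your directives block it.
print(parser.can_fetch("GPTBot", "https://yourwebsite.com/private-page/"))
print(parser.can_fetch("*", "https://yourwebsite.com/"))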
Types of directives:
- Full allow - all content can be crawled
- Full disallow - no content can be crawled
- Conditional allow - some content can be crawled
Examples of directives:
1) Full allow
To place no restrictions on what bots can access on your website, use the following lines:
User-agent: *
Disallow:
2) Full disallow
You may restrict all bots from crawling your website by adding the following lines in your robots.txt file:
User-agent: *
Disallow: /
3) Search Engine disallow
You may block a specific search engine's bot from crawling your content by adding the following lines to your robots.txt:
User-agent: {search engine crawler name}
Disallow: /
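For example, to block Google's main search crawler, whose published user-agent token is Googlebot, you would use the lines below (keep in mind this will eventually remove your pages from Google's search results):
User-agent: Googlebot
Disallow: /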
4) Specific URL disallow
You may restrict a specific page by adding the following lines in your robots.txt:
User-agent: *
Disallow: {specific URL}
For multiple URLs use:
User-agent: *
Disallow: {specific URL 1}
Disallow: {specific URL 2}
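For example, to block one hypothetical page and one hypothetical folder (both paths are placeholders):
User-agent: *
Disallow: /private-page.html
Disallow: /drafts/
Note that Disallow values are paths relative to your domain's root, not full URLs.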
5) Specific File Type disallow
You may restrict specific file types from being accessed with the following code:
User-agent: *
Disallow: /*.html
For image files
User-agent: *
Disallow: /*.jpg
For an image directory
User-agent: *
Disallow: /{directory (or) folder name}/
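As a concrete sketch, assuming a hypothetical folder named images, and using the $ wildcard, which Google and most major crawlers treat as an end-of-URL anchor so that only URLs ending in .jpg are matched:
User-agent: *
Disallow: /images/
Disallow: /*.jpg$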
6) AI disallow
You may restrict an AI bot from crawling your site by adding the following lines:
User-agent: {AI Crawler Name}
Disallow: /
OpenAI disallow
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Google AI disallow (Google-Extended)
User-agent: Google-Extended
Disallow: /
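These groups can sit alongside a full allow for everyone else, so regular search engines keep crawling while the AI crawlers listed above are blocked. A minimal sketch combining the two:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: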
CMS Guides to robots.txt
For additional information on using robots.txt in popular content management systems, please refer to their guides below:
It is important to identify which content you are willing to have accessed by web crawlers. Crawlers play an integral role in the searchability of your website and/or content; therefore, a full disallow is not recommended. If you are worried about your content being accessed or used by AI, you can use robots.txt directives to disallow specific AI crawlers while leaving the rest of your site crawlable.