Robots.txt is an increasingly important file found on websites that determines whether you permit a web crawler to index your pages for search engine optimization. Since web scraping is broadly legal in the US, this is the wild west of scraping, and so I want to keep my brain and information safe from being scraped.
Fun Fact: Google open-sourced its robots.txt parser in 2019 (google/robotstxt on GitHub) if you want to see how a production search engine parses the robots.txt file for indexing.
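If you want to check a site's robots.txt the same way a well-behaved crawler does, Python's standard-library urllib.robotparser handles it in a few lines. A minimal sketch, with https://example.com standing in as a placeholder site:

from urllib import robotparser

# Point the parser at a site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# Ask whether a given user agent is allowed to fetch a given URL.
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))
print(rp.can_fetch("*", "https://example.com/"))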
Resources:
- Robots.txt file examples
- Robots.txt generator tool
- Another robots.txt file sample
Example:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
User-agent: AdsBot-Google
Disallow: /
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Facebot
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: baiduspider
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: sosobot
Disallow: /
User-agent: exabot
Disallow: /
User-agent: seznambot
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: DuckDuckBot
Disallow: /
User-agent: Twitterbot
Disallow: /
User-agent: LinkedInBot
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: Relcybot
Disallow: /
User-agent: Feedly
Disallow: /
User-agent: Netvibes
Disallow: /
User-agent: Pingdom
Disallow: /
User-agent: PGBot
Disallow: /
User-agent: Laserlikebot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: JamesBOT
Disallow: /
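Since every group above uses Disallow: /, the file amounts to a blanket block for both the named bots and anything caught by the wildcard. You can sanity-check that by feeding the contents straight into Python's urllib.robotparser; a minimal sketch, abridged to two of the groups (SomeUnknownBot is a made-up user agent for illustration):

from urllib import robotparser

# Paste the example file into a string (abridged to two groups here).
robots_txt = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse lines directly instead of fetching a URL

# Every bot is refused, whether listed explicitly or matched by the wildcard.
print(rp.can_fetch("Googlebot", "/index.html"))       # False
print(rp.can_fetch("SomeUnknownBot", "/index.html"))  # False (falls back to *)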