Robots.txt Files

Robots.txt is an increasingly important file found on websites that determines whether you permit a web crawler to index your pages for search engines. It is purely advisory: well-behaved bots honor it, but nothing enforces it. Since web scraping of publicly accessible data is broadly legal in the US, this is the wild west of scraping, and thus I want to keep my brain and information safe from it.
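
Since the file is advisory, a polite scraper is expected to check it before fetching anything. Here is a minimal sketch using Python's standard-library urllib.robotparser; the site URL and user agents are placeholders, not anything specific to this note:

```python
from urllib import robotparser

# Hypothetical site; swap in whatever domain you want to check.
SITE = "https://example.com"

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live robots.txt

# can_fetch() returns True only if the named user agent is
# allowed to crawl the given URL under the parsed rules.
for agent in ("Googlebot", "MyScraper"):
    print(agent, rp.can_fetch(agent, f"{SITE}/some/page"))
```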

Fun Fact: Google open-sourced its robots.txt parser in 2019 (github.com/google/robotstxt) if you want to see exactly how a major search engine interprets the file when indexing.

Resources:

Example:

```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: AdsBot-Google
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: msnbot
Disallow: /

User-agent: Slurp
Disallow: /

User-agent: Facebot
Disallow: /

User-agent: facebookexternalhit
Disallow: /

User-agent: baiduspider
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: sosobot
Disallow: /

User-agent: exabot
Disallow: /

User-agent: seznambot
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: ScoutJet
Disallow: /

User-agent: DuckDuckBot
Disallow: /

User-agent: Twitterbot
Disallow: /

User-agent: LinkedInBot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: Relcybot
Disallow: /

User-agent: Feedly
Disallow: /

User-agent: Netvibes
Disallow: /

User-agent: Pingdom
Disallow: /

User-agent: PGBot
Disallow: /

User-agent: Laserlikebot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: JamesBOT
Disallow: /
```
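
If you want to sanity-check a block list like the one above before deploying it, urllib.robotparser can also parse rules straight from a string instead of fetching them over the network. A quick sketch; the agents tested are just examples:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes an iterable of lines

# Every agent should be denied, since the wildcard disallows all paths.
for agent in ("DuckDuckBot", "Twitterbot", "SomeRandomBot"):
    assert not rp.can_fetch(agent, "https://example.com/"), agent
print("All agents correctly disallowed.")
```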