Robots.txt is an increasingly important file found on websites that determines whether you permit a web crawler to index your pages for search engine optimization. Since web scraping is broadly legal in the US, this is the wild west of scraping, and so I want to keep my brain and information safe from being scraped.
Fun Fact: Google open-sourced its robots.txt parser in 2019 (google/robotstxt on GitHub) if you want to see how a production search engine parses the robots.txt file for indexing.
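If you want to check a site's robots.txt the same way a well-behaved crawler does, Python's standard-library urllib.robotparser handles it in a few lines. A minimal sketch, with https://example.com standing in as a placeholder site:

from urllib import robotparser

# Point the parser at a site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# Ask whether a given user agent is allowed to fetch a given URL.
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))
print(rp.can_fetch("*", "https://example.com/"))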
Resources:
- Robots.txt file examples
- Robots.txt generator tool
- Another robots.txt file sample
Example:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
User-agent: AdsBot-Google
Disallow: /
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Facebot
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: baiduspider
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: sosobot
Disallow: /
User-agent: exabot
Disallow: /
User-agent: seznambot
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: DuckDuckBot
Disallow: /
User-agent: Twitterbot
Disallow: /
User-agent: LinkedInBot
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: Relcybot
Disallow: /
User-agent: Feedly
Disallow: /
User-agent: Netvibes
Disallow: /
User-agent: Pingdom
Disallow: /
User-agent: PGBot
Disallow: /
User-agent: Laserlikebot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: JamesBOT
Disallow: /
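Since every group above uses Disallow: /, the file amounts to a blanket block for both the named bots and anything caught by the wildcard. You can sanity-check that by feeding the contents straight into Python's urllib.robotparser; a minimal sketch, abridged to two of the groups (SomeUnknownBot is a made-up user agent for illustration):

from urllib import robotparser

# Paste the example file into a string (abridged to two groups here).
robots_txt = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse lines directly instead of fetching a URL

# Every bot is refused, whether listed explicitly or matched by the wildcard.
print(rp.can_fetch("Googlebot", "/index.html"))       # False
print(rp.can_fetch("SomeUnknownBot", "/index.html"))  # False (falls back to *)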