Robots.txt Files

Robots.txt is an increasingly important file found on websites: it declares whether you permit web crawlers to crawl and index your pages for search engines. Since web scraping is broadly legal in the US, scraping is something of a wild west, and I want to keep my brain and my information safe from it.
Fun fact: Google [open-sourced](https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html) their [robots.txt parser](https://github.com/google/robotstxt) in 2019, so you can see exactly how a major search engine interprets robots.txt when deciding what to index.
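To check a robots.txt policy from code, Python's standard library ships a small parser (a different implementation from Google's open-sourced one). A minimal sketch, assuming the domain and page URL below are placeholders:
```python
# Check which crawlers a site's robots.txt allows to fetch a given URL.
# Uses Python's built-in parser, not Google's C++ robotstxt library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()  # download and parse the live robots.txt

for bot in ("Googlebot", "bingbot", "DuckDuckBot"):
    allowed = rp.can_fetch(bot, "https://example.com/some-page")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```
`can_fetch` returns `True` when no matching `Disallow` rule applies to that user agent, which is the same decision a well-behaved crawler makes before requesting the page.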
*Resources*:
- [Robots.txt file examples](https://blog.hubspot.com/marketing/robots-txt-file)
- Robots.txt [generator tool](https://www.internetmarketingninjas.com/tools/robots-txt-generator/)
- Another sample [robots.txt](https://www.cutercounter.com/robots.txt) file
Example:
```
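# Deny the whole site to all crawlers; well-known bots are also named explicitly below so each matches its own group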
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
User-agent: AdsBot-Google
Disallow: /
User-agent: bingbot
Disallow: /
User-agent: msnbot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Facebot
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: baiduspider
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: sosobot
Disallow: /
User-agent: exabot
Disallow: /
User-agent: seznambot
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: DuckDuckBot
Disallow: /
User-agent: Twitterbot
Disallow: /
User-agent: LinkedInBot
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: Relcybot
Disallow: /
User-agent: Feedly
Disallow: /
User-agent: Netvibes
Disallow: /
User-agent: Pingdom
Disallow: /
User-agent: PGBot
Disallow: /
User-agent: Laserlikebot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: JamesBOT
Disallow: /
```
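A blanket-deny file like the one above is mostly boilerplate, so it is easy to generate rather than hand-maintain. A minimal sketch of what a generator tool produces; the bot list here is illustrative, not an authoritative block list:
```python
# Generate a robots.txt that disallows the whole site for each listed user agent.
# The names below are examples only; extend the list to match the bots you care about.
BOTS = ["*", "Googlebot", "bingbot", "DuckDuckBot", "Yandex"]

def deny_all(bots):
    """Return robots.txt text with a 'Disallow: /' block per user agent."""
    blocks = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(blocks) + "\n"

with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(deny_all(BOTS))
```
Drop the resulting file at the web root (e.g. `https://example.com/robots.txt`); crawlers only look for it at that exact path.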