Robots.txt is an increasingly important file on a website that tells web crawlers whether they are permitted to crawl its pages, which in turn affects how those pages are indexed by search engines. Web scraping of public pages is largely legal in the US, which makes this the wild west of scraping, so I want to keep my brain and information safe from it.

Fun Fact: Google [open-sourced](https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html) their [robots.txt parser](https://github.com/google/robotstxt) in 2019, if you want to see how a production crawler parses and matches robots.txt rules for search indexing.

*Resources*:
- [Robots.txt file examples](https://blog.hubspot.com/marketing/robots-txt-file)
- Robots.txt [generator tool](https://www.internetmarketingninjas.com/tools/robots-txt-generator/)
- Another [robots.txt](https://www.cutercounter.com/robots.txt) file sample

Example:
```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: AdsBot-Google
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: msnbot
Disallow: /

User-agent: Slurp
Disallow: /

User-agent: Facebot
Disallow: /

User-agent: facebookexternalhit
Disallow: /

User-agent: baiduspider
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: sosobot
Disallow: /

User-agent: exabot
Disallow: /

User-agent: seznambot
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: ScoutJet
Disallow: /

User-agent: DuckDuckBot
Disallow: /

User-agent: Twitterbot
Disallow: /

User-agent: LinkedInBot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: Relcybot
Disallow: /

User-agent: Feedly
Disallow: /

User-agent: Netvibes
Disallow: /

User-agent: Pingdom
Disallow: /

User-agent: PGBot
Disallow: /

User-agent: Laserlikebot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: JamesBOT
Disallow: /

User-agent: *
Disallow: /
```
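
To sanity-check rules like the ones above before deploying them, one option is Python's standard-library `urllib.robotparser`. The sketch below is only an illustration: the `is_allowed` helper, the shortened `ROBOTS_TXT` string, the `SomeOtherBot` name, and the `example.com` URL are all assumptions made for the example, not part of this site's setup.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the example rules; the full file repeats the same
# "Disallow: /" group for each named crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: bingbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so it can be run offline.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # SomeOtherBot is a made-up crawler name with no group of its own,
    # so it falls back to the "User-agent: *" group.
    for agent in ("Googlebot", "bingbot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/notes/")
        print(f"{agent}: {allowed}")
```

All three agents print `False` here, including the one with no group of its own, because the `User-agent: *` group already disallows everything. The per-crawler groups in the example make the intent explicit but do not change the result for crawlers that honor the file.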