Compare commits

2 Commits: 422a0f2f94 ... 4e5db77998

| Author | SHA1 | Date |
|---|---|---|
| Shwetha Jayaraj | 4e5db77998 | |
| Shwetha Jayaraj | 2d69da9a74 | |
@@ -1,12 +1,12 @@
 {
   "recentFiles": [
     {
-      "basename": "Webscraping",
-      "path": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md"
+      "basename": "Robots.txt Files",
+      "path": "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md"
     },
     {
-      "basename": "Robots.txt Files",
-      "path": "Robots.txt Files.md"
+      "basename": "Webscraping",
+      "path": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md"
     },
     {
       "basename": "Potentiometers & Analog SerialReader",
@@ -25,7 +25,7 @@
       "state": {
         "type": "markdown",
         "state": {
-          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md",
+          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md",
           "mode": "source",
           "source": false
         }
@@ -107,7 +107,7 @@
       "state": {
         "type": "backlink",
         "state": {
-          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md",
+          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md",
           "collapseAll": false,
           "extraContext": false,
           "sortOrder": "alphabetical",
@@ -124,7 +124,7 @@
       "state": {
         "type": "outgoing-link",
         "state": {
-          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md",
+          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md",
           "linksCollapsed": false,
           "unlinkedCollapsed": true
         }
@@ -147,7 +147,7 @@
       "state": {
         "type": "outline",
         "state": {
-          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md"
+          "file": "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md"
         }
       }
     }
@@ -174,9 +174,10 @@
       "obsidian-excalidraw-plugin:Create new drawing": false
     }
   },
-  "active": "dbad7b010371d947",
+  "active": "0a0de85a51848b9d",
   "lastOpenFiles": [
+    "Robots.txt Files.md",
     "Coding Tips (Classical)/Terminal Tips/GUIs/Tools/Webscraping.md",
     "Coding Tips (Classical)/Terminal Tips/GUIs/Internet/Websites/Robots.txt Files.md",
     "Excalidraw/Drawing 2023-10-16 12.13.42.excalidraw.md",
     "Machine Tips (Quantum)/Physics/Hardware/Potentiometers & Analog SerialReader.md",
     "Excalidraw",
@@ -206,7 +207,6 @@
     "Untitled.canvas",
     "Coding Tips (Classical)/Project Vault/Current Occupations/Manhattan Youth",
     "Coding Tips (Classical)/Project Vault/Current Occupations/Website Projects/My Domain Names.md",
     "Coding Tips (Classical)/Project Vault/Current Occupations/Potential and Future/Career Tips.md",
     "Coding Tips (Classical)/Project Vault/About Obsidian/imgFiles/Pasted image 20231011091043.png",
     "Coding Tips (Classical)/Project Vault/About Obsidian/Slides & Tools/export/Slides/plugin/chalkboard/_style.css",
-    "Coding Tips (Classical)/Project Vault/About Obsidian/Slides & Tools/export/Slides/plugin/chalkboard/img/blackboard.png",
@@ -0,0 +1,9 @@
+
+Robots.txt is an increasingly important file found on websites that determines whether you permit a website crawler to index your page for search engine optimization. As web-scraping is entirely legal in the US, this is the wild west of scraping, and thus I want to keep my brain and information safe from scraping.
+
+Fun Fact: Google [open-sourced](https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html) their [robots.txt parser](https://github.com/google/robotstxt) in 2019 if you want to see an example of reverse engineering the robots.txt file for search indexing.
+
+*Resources*:
+- [Robots.txt file examples](https://blog.hubspot.com/marketing/robots-txt-file)
+- Robots.txt [generator tool](https://www.internetmarketingninjas.com/tools/robots-txt-generator/)
+
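The robots.txt policy described in the new note can also be checked programmatically. A minimal sketch using Python's standard-library `urllib.robotparser`; the policy string below is a hypothetical example, not taken from any site in this vault:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt policy: block crawlers from /private/,
# allow everything else.
policy = """User-agent: *
Disallow: /private/
Allow: /
"""

# Parse the policy directly from its lines (no network request needed;
# set_url() + read() would fetch a live site's robots.txt instead).
parser = RobotFileParser()
parser.parse(policy.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page.html"))    # True
print(parser.can_fetch("*", "https://example.com/private/secret.html")) # False
```

A well-behaved crawler calls `can_fetch()` before requesting each URL; robots.txt is advisory, so this only stops crawlers that choose to honor it.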
@@ -1,9 +1,10 @@
-# Webscraping
+# Web-scraping
 
 
-Webscraping is a common task in the CS world that makes it easy and efficient to extract large amounts of data. It is part of a larger topic of data mining which allows for the human understandable analysis of all the data that is out there.
+Web-scraping is a common task in the CS world that makes it easy and efficient to extract large amounts of data. It is part of the larger topic of data mining, which allows for human-understandable analysis of all the data that is out there.
 
-You will often use requests and beautifulsoup libraries. To prevent webscraping on your own sites, refer to the rob
+You will often use the `requests` and `beautifulsoup` libraries.
+To prevent web-scraping on your own sites, refer to the [robots.txt](obsidian://open?vault=enter&file=Robots.txt%20Files) information.
 
 ---
 
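The `requests`-plus-`beautifulsoup` workflow the note mentions can be sketched as follows, assuming `beautifulsoup4` is installed; a static HTML snippet stands in for the page body that `requests.get(url).text` would normally return, so the sketch runs without a network connection:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Static stand-in for the HTML a requests.get(url).text call would return.
html = """
<html><body>
  <h1>Example page</h1>
  <a href="/docs">Docs</a>
  <a href="/blog">Blog</a>
</body></html>
"""

# Parse with the stdlib "html.parser" backend (no extra dependency).
soup = BeautifulSoup(html, "html.parser")

# Extract the page heading and every link target.
title = soup.h1.get_text()                       # "Example page"
links = [a["href"] for a in soup.find_all("a")]  # ["/docs", "/blog"]
print(title, links)
```

In a real scraper you would fetch the page with `requests.get(url)` first, and check the site's robots.txt policy before crawling.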
@@ -1,6 +0,0 @@
-
-Robots.txt is an increasingly important file found on websites that determine whether you permit a website crawler to index your page for search engine optimization. As webscraping is entirely legal in the US, this is the wild west of scraping and thus I want to keep mu brain and information safe from scraping.
-
-
-*Resources*:
-- [Robots.txt file examples](https://blog.hubspot.com/marketing/robots-txt-file)