2023-10-16 16:30:06 +00:00
# Web-scraping
Web scraping is a common programming task: it automates extracting large amounts of data from web pages. It is part of the broader field of data mining, which aims to turn the data that is out there into analysis people can understand.

You will often use the `requests` and `beautifulsoup4` (imported as `bs4`) libraries.
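As a rough illustration of how the two libraries mentioned above usually fit together, here is a minimal sketch: `requests` downloads the HTML and BeautifulSoup parses it. The function names `fetch_html` and `extract_links` and the `SAMPLE_HTML` snippet are made up for this example; the demo parses the inline sample so it runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page with requests and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]


# Hypothetical inline sample so the parsing step can be tried offline.
SAMPLE_HTML = """
<ul>
  <li><a href="/first">First</a></li>
  <li><a href="/second"> Second </a></li>
  <li><a>No link</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML)
print(links)  # [('First', '/first'), ('Second', '/second')]

# For a live page, the two steps chain together:
# links = extract_links(fetch_html('https://your.url/here'))
```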
To discourage scraping of your own sites, refer to the [robots.txt](obsidian://open?vault=enter&file=Robots.txt%20Files) information. Keep in mind that `robots.txt` is advisory: well-behaved crawlers honor it, but it does not technically block access.

---
#### Comparing web scraping libraries:
![[Pasted image 20220730121832.png]]
## Sample scraper
```python
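import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+ takes the driver path via Service (older releases used executable_path= directly).
driver = webdriver.Chrome(service=Service(executable_path='/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source  # HTML after the browser has rendered the page
driver.quit()

results = []
other_results = []
soup = BeautifulSoup(content, 'html.parser')

# Collect unique link texts from elements with class "class".
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect span texts from elements with class "otherclass".
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Building the frame from Series lets the two columns differ in length (gaps become NaN).
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')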
```
You can also use asyncio or multithreading to make web scraping even [faster](https://oxylabs.io/blog/how-to-make-web-scraping-faster).
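A minimal sketch of the multithreading approach, assuming the scraper is dominated by waiting on the network: a thread pool from the standard library runs several downloads at once. The helper names (`fetch`, `fetch_all`, `fake_fetch`) and the example URLs are hypothetical; the demo swaps in `fake_fetch`, which only sleeps, so it runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text (blocking network call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


def fetch_all(urls, fetch_fn=fetch, max_workers=8):
    """Run fetch_fn over urls in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))


def fake_fetch(url):
    """Stand-in for a slow request so the demo runs offline."""
    time.sleep(0.2)
    return f'<html>{url}</html>'


urls = [f'https://example.com/page/{i}' for i in range(8)]  # hypothetical URLs
start = time.perf_counter()
pages = fetch_all(urls, fetch_fn=fake_fetch)
elapsed = time.perf_counter() - start

# A sequential loop would sleep 8 x 0.2s; the pool overlaps the waits.
print(f'{len(pages)} pages in {elapsed:.2f}s')
```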
To see the requests a page makes while it loads, open the browser's developer tools: right click > Inspect > Network.

---
##### More helpful tutorials
- [How To Scraper Yelp Review For Free - No Coding Required](https://medium.com/prowebscraper/how-to-scraper-yelp-reviews-899b7480eb8d)
- [How to Build a Web Scraper With Python - Step-by-Step Guide](https://hackernoon.com/how-to-build-a-web-scraper-with-python-step-by-step-guide-jxkp3yum)
- [Python Web Scraping Tutorial: Step-By-Step - 2022 Guide](https://oxylabs.io/blog/python-web-scraping)
- [Intro to Yelp Scraping using Python](https://towardsdatascience.com/intro-to-yelp-web-scraping-using-python-78252318d832)
- [Webscraping LinkedIn](https://federicohaag.medium.com/linkedin-scraping-with-python-d8d14519602d) with Python
  - [GitHub repo](https://github.com/federicohaag/LinkedInScraping) with the code
---
## Alternative tools:
- [Octoparse](https://developer.chrome.com/docs/devtools/workspaces/?utm_source=devtools) is a good option that is free for 14 days.
- [ParseHub](https://parsehub.com/docs/ref/api/v2/) is another great scraping tool I've tried out.