# Web-scraping

Web scraping is the automated extraction of data from web pages, and it makes collecting large amounts of data easy and efficient. It is part of the larger field of data mining, which turns raw data into analysis that humans can understand.

You will often use the `requests` and `beautifulsoup4` (imported as `bs4`) libraries.
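
For static pages that do not need a browser, the two libraries can be combined roughly as follows. This is a minimal sketch rather than a definitive implementation: the `fetch_html` and `extract_links` helper names and the sample HTML are assumptions made up for illustration. On a real page you would call `extract_links(fetch_html(url))`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list[str]:
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Made-up HTML so the parsing step can be tried without a network call.
sample_html = '<p><a href="/one">One</a> <a>no link</a> <a href="/two">Two</a></p>'
links = extract_links(sample_html)
print(links)
```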
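
To discourage scraping of your own sites, refer to the [robots.txt](obsidian://open?vault=enter&file=Robots.txt%20Files) information. Note that robots.txt is advisory: well-behaved crawlers honor it, but it does not technically block anyone.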
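
To see how such rules are interpreted, Python's standard-library `urllib.robotparser` can evaluate a robots.txt file. The following is a minimal sketch, assuming made-up rules and a made-up `my-scraper` user agent; a real crawler would load the site's actual file with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of fetching a real file.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "my-scraper" is a made-up user agent name for illustration.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```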

---

#### Comparing web scraping libraries:

![[Pasted image 20220730121832.png]]

## Sample scraper

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 passes the driver path through a Service object
# instead of the old executable_path keyword.
driver = webdriver.Chrome(service=Service(executable_path='/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')

results = []
other_results = []

# Grab the rendered HTML, then close the browser.
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, 'html.parser')

# Collect the text of the first link inside each element with class "class".
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect the text of the first span inside each element with class "otherclass".
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# The two lists may differ in length; pandas pads the shorter column with NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')
```

Scraping is mostly I/O-bound, so you can also use `asyncio` or multithreading to overlap requests and make web scraping even [faster](https://oxylabs.io/blog/how-to-make-web-scraping-faster).
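
Because a scraper spends most of its time waiting on the network, overlapping requests is where the speedup comes from. The sketch below uses a thread pool from the standard library. The names `fetch`, `scrape_all`, and `fake_fetch` and the example URLs are hypothetical, and the fetch function is passed in as a parameter so the pattern can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Download one page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch_fn=fetch, max_workers=8):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))


# Stand-in fetcher so the example runs offline; swap in `fetch` for real pages.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = scrape_all(["https://example.com/a", "https://example.com/b"], fetch_fn=fake_fetch)
print(pages)
```

Keep `max_workers` modest so that the target site is not overloaded.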

To see the requests a page makes, open the browser's developer tools: Right click > Inspect > Network.

---

##### More helpful tutorials

- [How To Scraper Yelp Review For Free - No Coding Required](https://medium.com/prowebscraper/how-to-scraper-yelp-reviews-899b7480eb8d)
- [How to Build a Web Scraper With Python - Step-by-Step Guide](https://hackernoon.com/how-to-build-a-web-scraper-with-python-step-by-step-guide-jxkp3yum)
- [Python Web Scraping Tutorial: Step-By-Step - 2022 Guide](https://oxylabs.io/blog/python-web-scraping)
- [Intro to Yelp Scraping using Python](https://towardsdatascience.com/intro-to-yelp-web-scraping-using-python-78252318d832)
- [Webscraping LinkedIn](https://federicohaag.medium.com/linkedin-scraping-with-python-d8d14519602d) with Python
    - [GitHub repo](https://github.com/federicohaag/LinkedInScraping) for code

---

## Alternative tools:

- [Octoparse](https://developer.chrome.com/docs/devtools/workspaces/?utm_source=devtools) is a good option with a 14-day free trial.
- [ParseHub](https://parsehub.com/docs/ref/api/v2/) is another great scraping tool I've tried out.