# Webscraping
Webscraping is a common task in the CS world: it automates the extraction of large amounts of data from websites. It is part of the larger topic of data mining, which turns raw data into analysis that humans can understand.
You will often use the `requests` and Beautiful Soup (`bs4`) libraries. To prevent webscraping on your own sites, refer to the rob
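
As a rough sketch of how `requests` and Beautiful Soup fit together on a static page: `requests` downloads the HTML and Beautiful Soup parses it. The helper names `extract_links` and `scrape_links`, the 10-second timeout, and the choice to collect link text are illustrative assumptions, not something taken from these notes.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


def scrape_links(url):
    """Download a page and extract its links (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_links(response.text)

# Example call with a placeholder URL: scrape_links("https://your.url/here")
```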
---
#### Comparing web scraping libraries:
![[Pasted image 20220730121832.png]]
## Sample scraper
```python
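import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
# (the executable_path keyword was deprecated and later removed).
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source
driver.quit()

results = []
other_results = []
soup = BeautifulSoup(content, 'html.parser')

# Collect the link text from every element with the first class, skipping duplicates.
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect the span text from every element with the second class.
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Series of different lengths are aligned by index; missing cells become NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')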
```
You can also use asyncio or multithreading to make web scraping even [faster](https://oxylabs.io/blog/how-to-make-web-scraping-faster).
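
Because most scraping time is spent waiting on the network, a thread pool is one way to apply the multithreading idea. The sketch below uses only the standard library; `fetch`, `fetch_all`, and the `max_workers=8` default are hypothetical names and values chosen for illustration. Passing a different `fetch_one` (for example one built on `requests`) swaps the downloader without touching the pool logic.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its body as text (stdlib only)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL on a thread pool; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one concurrently but yields results in input order.
        return list(pool.map(fetch_one, urls))
```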
To see which requests a page makes while it loads, open the browser's developer tools: Right click > Inspect > Network.

---
##### More helpful tutorials
- [How To Scraper Yelp Review For Free - No Coding Required](https://medium.com/prowebscraper/how-to-scraper-yelp-reviews-899b7480eb8d)
- [How to Build a Web Scraper With Python - Step-by-Step Guide](https://hackernoon.com/how-to-build-a-web-scraper-with-python-step-by-step-guide-jxkp3yum)
- [Python Web Scraping Tutorial: Step-By-Step - 2022 Guide](https://oxylabs.io/blog/python-web-scraping)
- [Intro to Yelp Scraping using Python](https://towardsdatascience.com/intro-to-yelp-web-scraping-using-python-78252318d832)
- [Webscraping LinkedIn](https://federicohaag.medium.com/linkedin-scraping-with-python-d8d14519602d) with Python
    - [GitHub repo](https://github.com/federicohaag/LinkedInScraping) for the code
---
## Alternative tools:
- [Octoparse](https://developer.chrome.com/docs/devtools/workspaces/?utm_source=devtools) is a good option with a 14-day free trial.
- [ParseHub](https://parsehub.com/docs/ref/api/v2/) is another great scraping tool I've tried out.