# Web-scraping 


Web scraping is a common task in the CS world: it makes extracting large amounts of data from websites easy and efficient. It is part of the larger topic of data mining, which turns all the data that is out there into analysis humans can understand.

You will often use the `requests` and `beautifulsoup` libraries.
To discourage web scraping on your own sites, refer to the [robots.txt](obsidian://open?vault=enter&file=Robots.txt%20Files) information. Note that robots.txt is advisory, so only well-behaved crawlers honor it.
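
As a rough illustration of how a well-behaved scraper reads those rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the `my-scraper` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for this example; a real scraper would load the target site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/public/page"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

With these made-up rules, the first URL is allowed and the second is not, because only paths under `/private/` are disallowed.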

---

#### Comparing web scraping libraries: 
![[Pasted image 20220730121832.png]]

## Sample scraper
```python
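import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# newer releases no longer accept executable_path on webdriver.Chrome.
service = Service(executable_path='/nix/path/to/webdriver/executable')
driver = webdriver.Chrome(service=service)
try:
    driver.get('https://your.url/here?yes=brilliant')
    content = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails

soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly
results = []
other_results = []

# Link text inside every element with class "class", skipping duplicates
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Span text inside every element with class "otherclass"
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Series of different lengths are aligned by index; gaps become NaN
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')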
```

You can also use `asyncio` or multithreading to make web scraping even [faster](https://oxylabs.io/blog/how-to-make-web-scraping-faster).
To see the requests a page makes, open the browser's developer tools: right click > Inspect > Network.
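
The multithreading idea mentioned above can be sketched with the standard library's `concurrent.futures`. This is only a minimal sketch, not the approach from the linked article: `fetch_all` and `fake_fetch` are hypothetical helpers, the `example.com` URLs are placeholders, and `fake_fetch` stands in for a real download (for example a `requests.get` call) so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(urls: List[str], fetch: Callable[[str], str],
              max_workers: int = 8) -> Dict[str, str]:
    """Run fetch(url) for every URL on a thread pool and return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        pages = list(pool.map(fetch, urls))
    return dict(zip(urls, pages))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical, no network access)."""
    time.sleep(0.1)  # simulate network latency
    return f"<html>{url}</html>"


urls = [f"https://example.com/page/{i}" for i in range(5)]
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```

Threads help here because a scraper spends most of its time waiting on the network. Keep `max_workers` modest so the target site is not flooded with requests.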

---

##### More helpful tutorials 
- [How To Scraper Yelp Review For Free - No Coding Required](https://medium.com/prowebscraper/how-to-scraper-yelp-reviews-899b7480eb8d)
- [How to Build a Web Scraper With Python - Step-by-Step Guide](https://hackernoon.com/how-to-build-a-web-scraper-with-python-step-by-step-guide-jxkp3yum)
- [Python Web Scraping Tutorial: Step-By-Step - 2022 Guide](https://oxylabs.io/blog/python-web-scraping)
- [Intro to Yelp Scraping using Python](https://towardsdatascience.com/intro-to-yelp-web-scraping-using-python-78252318d832)
- [Webscraping LinkedIn](https://federicohaag.medium.com/linkedin-scraping-with-python-d8d14519602d) with Python 
	- [github repo](https://github.com/federicohaag/LinkedInScraping) for code


---

## Alternative tools: 
- [Octoparse](https://developer.chrome.com/docs/devtools/workspaces/?utm_source=devtools) is a good option that is free for 14 days.
- [ParseHub](https://parsehub.com/docs/ref/api/v2/) is another great scraping tool I've tried out.