Webscraping

Webscraping is a common task in the CS world: it extracts large amounts of data from web pages quickly and automatically. It is part of the larger topic of data mining, which turns the raw data that is out there into analysis humans can understand.

You will often use the requests and Beautiful Soup (bs4) libraries. To prevent webscraping on your own sites, refer to the rob
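As a minimal sketch of how those two libraries usually fit together: requests downloads the page and Beautiful Soup parses it. The helper names `extract_links` and `scrape_links` and the sample HTML are made up for illustration; they are not from any particular tutorial.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


def scrape_links(url):
    """Download a page and extract its links (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_links(response.text)


# Parsing works on any HTML string, so it can be tried without a network call.
sample_html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
links = extract_links(sample_html)
print(links)
```

Keeping the parsing step separate from the download makes it easy to test selectors against saved HTML before hitting a live site.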


Comparing web scraping libraries:

!Pasted image 20220730121832.png

Sample scraper

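import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
# (the older executable_path argument was deprecated).
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source
driver.quit()  # close the browser once the HTML is captured

results = []
other_results = []
soup = BeautifulSoup(content, 'html.parser')

# Collect unique link texts from elements with the first class.
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect span texts from elements with the second class.
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Series of unequal length are aligned by index; gaps become NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')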

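You can also use asyncio or multithreading to make web scraping even faster, since most of the time is spent waiting on network responses. To see which requests a page makes, right-click the page and choose Inspect > Network.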
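A rough sketch of the multithreading option, using only the standard library. The names `fetch` and `fetch_all` and the default of 8 workers are assumptions for illustration; tune the worker count to what the target site tolerates.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one in worker threads and yields results in input order.
        return list(pool.map(fetch_one, urls))
```

Calling `fetch_all(list_of_urls)` returns the page bodies in the same order as the input list. Passing a different `fetch_one` (for example a function built on requests) swaps the download step without touching the threading logic.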


More helpful tutorials

Alternative tools:

  • Octoparse is a good one, and it is free for 14 days.
  • ParseHub is another great scraping tool I've tried out.