Webscraping

Web scraping is a common task in computer science: it automates the extraction of large amounts of data from web pages. It is part of the broader field of data mining, which turns the data that is out there into analysis that people can understand.

You will often use the requests and BeautifulSoup libraries.
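As a rough sketch of how these two libraries usually fit together: requests downloads the page and BeautifulSoup parses the HTML. The helper names `fetch_html` and `extract_links` are made up for this illustration, and the choice to collect links is only an assumption about what you might want to extract.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        (a.get_text(strip=True), a['href'])
        for a in soup.find_all('a', href=True)
    ]
```

Calling `extract_links(fetch_html(url))` chains the two steps. Keeping the parsing function separate from the download makes it easy to try out on a saved HTML string without touching the network.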

Comparing web scraping libraries:

Sample scraper

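import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source  # HTML after the browser has rendered the page
driver.quit()

results = []
other_results = []
soup = BeautifulSoup(content, 'html.parser')
# Collect unique link texts from elements with class "class"
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)
# Collect span texts from elements with class "otherclass"
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)
# The two lists may differ in length; pandas pads the shorter column with NaN
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')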

You can also use asyncio or multithreading to make web scraping even faster. To see which requests a page makes, right-click the page and choose Inspect > Network.
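The multithreading idea can be sketched with a thread pool from the standard library. This is only one possible approach: it uses `urllib` rather than requests to stay dependency-free, and the names `fetch`, `fetch_all`, `fetch_one`, and `max_workers=8` are illustrative assumptions, not values taken from these notes.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one in worker threads but preserves input order
        return list(pool.map(fetch_one, urls))
```

For example, `fetch_all(list_of_urls)` returns one page body per URL. Threads help here because most of the time is spent waiting on the network; keep `max_workers` modest so you do not overload the site you are scraping.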

Alternative tools:

Octoparse is a good option; it is free for 14 days.
ParseHub is another great scraping tool I've tried out.
