Web-scraping

Web scraping is a common task in the CS world: it automates the extraction of large amounts of data from web pages. It is part of the larger topic of data mining, which turns the data that is out there into analysis a human can understand.

You will often use the requests and BeautifulSoup libraries. To discourage scraping of your own sites, publish a robots.txt file. It is advisory: well-behaved crawlers honor it, but it does not block requests by itself.
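From the scraper's side, the same robots.txt rules can be checked before fetching a page. The sketch below uses the standard library's urllib.robotparser; the rules text, the user agent name my-scraper and the example.com URLs are made up for illustration, and a real scraper would download the site's own robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/data"):
        print(url, is_allowed(ROBOTS_TXT, "my-scraper", url))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().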

Comparing web scraping libraries:

Sample scraper

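import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
driver.quit()  # the HTML is captured, so the browser can be closed
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    # skip elements without a link, and skip duplicate names
    if name is not None and name.text not in results:
        results.append(name.text)
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)
# Series of different lengths are aligned by index; gaps become NaN
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')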

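You can also use asyncio or multithreading to make web scraping even faster, because most of the time is spent waiting on the network. To see which requests a page makes, open the browser's developer tools: Right click > Inspect > Network.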
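A minimal sketch of the multithreading approach, using only the standard library (concurrent.futures and urllib.request) rather than requests. The helper names fetch and fetch_all, the pool size of 8 and the 10 second timeout are assumptions for illustration, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one page body (blocking call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch several URLs concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one in worker threads and yields results
        # in the same order as the input, whichever download finishes first
        return list(pool.map(fetch_one, urls))


# Usage (performs real network requests):
# pages = fetch_all(['https://your.url/here?yes=brilliant'])
```

Threads suit this job because each worker mostly waits on the network. Passing fetch_one as a parameter makes it possible to swap in a requests-based downloader or a stub for testing. Keep the pool small so the target site is not flooded with requests.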

Alternative tools:

Octoparse is a good one, and it is free for 14 days.
ParseHub is another great scraping tool I've tried out.
