Webscraping
Webscraping is a common task in the CS world: it automates the extraction of large amounts of data from web pages. It is part of the larger topic of data mining, which turns that raw data into analysis a human can understand.
You will often use the requests and BeautifulSoup libraries. To prevent webscraping on your own sites, refer to the rob
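As a minimal sketch of how requests and BeautifulSoup fit together: requests downloads the page and BeautifulSoup parses it. The helper names fetch_html and extract_links are made up for this illustration, and the choice to collect links is only an assumption about what you might want from a page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

Calling extract_links(fetch_html(url)) on a page you are allowed to scrape returns its links. Keeping the download and the parsing in separate functions lets you test the parsing on a saved HTML string without touching the network.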
Comparing web scraping libraries:
Sample scraper
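import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object.
service = Service(executable_path='/nix/path/to/webdriver/executable')
driver = webdriver.Chrome(service=service)
driver.get('https://your.url/here?yes=brilliant')

results = []
other_results = []

# Grab the rendered HTML, then close the browser before parsing.
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, 'html.parser')

# Collect the link text of each element with class 'class', skipping duplicates.
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect the span text of each element with class 'otherclass'.
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Build a two-column table (the shorter column is padded with NaN) and save it.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')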
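You can also use asyncio or multithreading to make web scraping even faster, since most of the time is spent waiting on the network. To see the requests a page makes, open the browser's developer tools: Right click > Inspect > Network.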
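For the multithreading option, here is a minimal sketch using only the standard library. The names fetch and fetch_all and the default of 8 workers are assumptions made for illustration. Any download function can be passed in as fetch_one, for example one built on requests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL on a thread pool; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one concurrently and yields results in input order.
        return list(pool.map(fetch_one, urls))
```

Threads help here because each download mostly waits on the network. Keep max_workers modest so the target site is not flooded with requests.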
More helpful tutorials
- How To Scraper Yelp Review For Free - No Coding Required
- How to Build a Web Scraper With Python - Step-by-Step Guide
- Python Web Scraping Tutorial: Step-By-Step - 2022 Guide
- Intro to Yelp Scraping using Python
- Webscraping LinkedIn with Python
- GitHub repo for code