Web-scraping
Web-scraping is a common task in the CS world: it automates the extraction of large amounts of data from web pages. It is part of the larger topic of data mining, which turns the data that is out there into analysis that people can understand.
You will often use the requests and beautifulsoup libraries.
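As a minimal sketch of how these two libraries fit together: requests downloads the page and BeautifulSoup parses it. The helper names `extract_links` and `fetch_links` and the sample HTML are made up for illustration; only the parsing step is run here so the example works offline.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every anchor that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


def fetch_links(url):
    """Download a page and extract its links (needs network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_links(response.text)


# Hypothetical HTML snippet standing in for a downloaded page
sample_html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
links = extract_links(sample_html)
print(links)
```

For a live page, `fetch_links('https://your.url/here')` would do the download and the parsing in one call.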
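To discourage web-scraping on your own sites, refer to the robots.txt information. Keep in mind that robots.txt is advisory: well-behaved crawlers follow it, but it does not technically block anyone.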
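From the scraper's side, robots.txt rules can be checked before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `my-scraper` user agent, and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# can_fetch() reports whether the given user agent may request the URL
allowed_public = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Against a real site you would instead call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()`.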
Comparing web scraping libraries:
Sample scraper
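import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source
driver.quit()  # close the browser once the HTML has been captured

results = []
other_results = []
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(content, 'html.parser')

# Collect the unique link texts from elements with the first class
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect the span texts from elements with the second class
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# Building the frame from Series pads the shorter column with NaN
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')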
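You can also use asyncio or multithreading to make web scraping even faster, since most of the time is spent waiting on the network. To see which requests a page makes, open the browser's developer tools: Right click > Inspect > Network.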
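A minimal multithreading sketch, assuming the pages can be fetched independently of each other. The `fetch` function is a stand-in that sleeps instead of downloading, and the example.com URLs are hypothetical; in a real scraper it would wrap something like `requests.get(url).text`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real download; sleeps to simulate network latency."""
    time.sleep(0.1)
    return f"<html>{url}</html>"


# Hypothetical list of pages to scrape
urls = [f"https://example.com/page/{i}" for i in range(5)]

# The threads wait on the (simulated) network in parallel;
# map() returns results in the same order as the input URLs
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))
```

Threads suit this workload because each one spends most of its time waiting on I/O. Keep `max_workers` modest so the target site is not flooded with requests.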
More helpful tutorials
- How To Scraper Yelp Review For Free - No Coding Required
- How to Build a Web Scraper With Python - Step-by-Step Guide
- Python Web Scraping Tutorial: Step-By-Step - 2022 Guide
- Intro to Yelp Scraping using Python
- Webscraping LinkedIn with Python
- github repo for code