# Web-scraping

Web scraping is a common programming task: it automates the extraction of large amounts of data from web pages. It is part of the larger field of data mining, which turns raw data into human-understandable analysis. In Python, you will often use the `requests` and `beautifulsoup` (imported as `bs4`) libraries.

To tell crawlers which parts of your own sites they should not scrape, refer to the [robots.txt](obsidian://open?vault=enter&file=Robots.txt%20Files) information.

---

#### Comparing web scraping libraries:

![[Pasted image 20220730121832.png]]

## Sample scraper

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
# (the older executable_path argument is no longer accepted).
driver = webdriver.Chrome(service=Service('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')
content = driver.page_source
driver.quit()

results = []
other_results = []
soup = BeautifulSoup(content, 'html.parser')

# Collect the unique link texts inside elements with class "class".
for a in soup.find_all(attrs={'class': 'class'}):
    name = a.find('a')
    if name is not None and name.text not in results:
        results.append(name.text)

# Collect the span texts inside elements with class "otherclass".
for b in soup.find_all(attrs={'class': 'otherclass'}):
    name2 = b.find('span')
    if name2 is not None:
        other_results.append(name2.text)

# The two lists may differ in length; pandas pads the shorter column with NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')
```

You can also use asyncio or multithreading to make web scraping even [faster](https://oxylabs.io/blog/how-to-make-web-scraping-faster).
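
As a rough illustration of the multithreading idea, here is a minimal sketch that downloads several pages in parallel with a thread pool. The helper names `fetch_page` and `fetch_all` are made up for this note, and it uses the standard library's `urllib` instead of `requests` only to stay dependency-free; treat it as a starting point, not a tuned implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Fetch every URL concurrently; results keep the order of `urls`.

    `fetch` is injectable so a different downloader (or a stub for
    testing) can be swapped in without touching the pooling logic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch() in worker threads and yields results
        # in input order; an exception in any worker is re-raised here.
        return list(pool.map(fetch, urls))
```

Calling `fetch_all([...])` with a list of page URLs returns the page bodies in the same order, ready to hand to `BeautifulSoup`. Threads help here because most of the time is spent waiting on the network; keep `max_workers` modest so you do not hammer the target site.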
To see the requests a page makes, open the browser's developer tools: Right click > Inspect > Network.

---

##### More helpful tutorials

- [How To Scraper Yelp Review For Free - No Coding Required](https://medium.com/prowebscraper/how-to-scraper-yelp-reviews-899b7480eb8d)
- [How to Build a Web Scraper With Python - Step-by-Step Guide](https://hackernoon.com/how-to-build-a-web-scraper-with-python-step-by-step-guide-jxkp3yum)
- [Python Web Scraping Tutorial: Step-By-Step - 2022 Guide](https://oxylabs.io/blog/python-web-scraping)
- [Intro to Yelp Scraping using Python](https://towardsdatascience.com/intro-to-yelp-web-scraping-using-python-78252318d832)
- [Webscraping LinkedIn](https://federicohaag.medium.com/linkedin-scraping-with-python-d8d14519602d) with Python
    - [github repo](https://github.com/federicohaag/LinkedInScraping) for code

---

## Alternative tools:

- [Octoparse](https://developer.chrome.com/docs/devtools/workspaces/?utm_source=devtools) is a good one, free for 14 days.
- [ParseHub](https://parsehub.com/docs/ref/api/v2/) is another great scraping tool I've tried out.