
Beautiful Soup

Beautiful Soup is a popular Python library commonly used for web scraping, the automated process of extracting large amounts of public data from target websites in seconds.

It is often used alongside requests. Beautiful Soup is a parser that extracts data from HTML and can turn even invalid markup into a parse tree. It cannot request data itself and is designed only for parsing.
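
As a small illustration of the point about invalid markup, the sketch below parses a made-up fragment in which three tags are never closed. The fragment and variable names are assumptions for the example only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div>, <p> and <b> are all left unclosed.
broken_html = "<div><p>Hello <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The open tags are closed when the input ends,
# so the tree can be searched as usual.
bold_text = soup.b.get_text()
full_text = soup.get_text()

print(bold_text)
print(full_text)
print(soup)
```

Other parsers (lxml, html5lib) may repair the same fragment differently, so the resulting tree can depend on the parser passed as the second argument.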

**Part 1: Get HTML using Requests**

import requests
url = 'https://oxylabs.io/blog'
response = requests.get(url)
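
A slightly more defensive version of the request step might look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are assumptions for illustration, not part of the original note.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors."""
    # A timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://oxylabs.io/blog')` would then return the page source as a string, ready to be handed to Beautiful Soup.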

**Part 2: Find Element**

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# soup.title would return the <title> tag; soup.h1 returns the first <h1>
print(soup.h1)

The output will be something like:

<h1 class="blog-header">Oxylabs Blog</h1>
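
Once an element has been found, its name, attributes and text can be read directly from the tag object. The sketch below parses an inline copy of the heading markup quoted above, so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Inline copy of the heading markup, used instead of a live page.
html = '<h1 class="blog-header">Oxylabs Blog</h1>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1            # first <h1> in the document
print(heading.name)          # tag name
print(heading["class"])      # class is multi-valued, so a list is returned
print(heading.get_text())    # text inside the tag
print(soup.title)            # None here: the fragment has no <title> element
```

Note that `soup.title` looks up the `<title>` element specifically, while `soup.h1` returns the first `<h1>`.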

Because navigating, searching and modifying the parse tree is simple, Beautiful Soup suits beginners and can save developers hours of work. For example, to print all the blog titles from this page, use the find_all() method (findAll() is the legacy Beautiful Soup 3 name for the same method). On this page, all the blog titles are in h2 elements whose class attribute is set to blog-card__content-title. This information can be passed to find_all() as follows:

blog_titles = soup.find_all('h2', attrs={"class": "blog-card__content-title"})
for title in blog_titles:
    print(title.text)
# Output:
# Prints all blog titles on the page
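
The same search can be tried offline. The sketch below uses made-up markup that imitates the structure described above. The post titles and the extra sidebar heading are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the blog listing.
html = """
<div class="blog-card">
  <h2 class="blog-card__content-title"> First post </h2>
</div>
<div class="blog-card">
  <h2 class="blog-card__content-title">Second post</h2>
</div>
<h2 class="sidebar-title">Not a blog title</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because `class` is a reserved word in Python.
titles = [
    h2.get_text(strip=True)
    for h2 in soup.find_all("h2", class_="blog-card__content-title")
]
print(titles)

# limit stops the search after the given number of matches.
first_only = soup.find_all("h2", limit=1)
print(len(first_only))
```

Passing `class_=` is equivalent to the `attrs={"class": ...}` form, and `get_text(strip=True)` trims the surrounding whitespace that real pages often contain.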

Beautiful Soup also supports CSS selectors through the select() method, so the same search can be written more compactly:

blog_titles = soup.select('h2.blog-card__content-title')
for title in blog_titles:
    print(title.text)
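
CSS selectors are also handy when the target is nested inside the matched element. The sketch below, again on made-up markup with assumed link paths, collects the link of each blog title and shows `select_one()` for a single match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each title wraps a link to the post.
html = """
<article class="blog-card">
  <h2 class="blog-card__content-title"><a href="/blog/first">First post</a></h2>
</article>
<article class="blog-card">
  <h2 class="blog-card__content-title"><a href="/blog/second">Second post</a></h2>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> inside a matching <h2>.
links = [a["href"] for a in soup.select("h2.blog-card__content-title a")]
print(links)

# select_one() returns the first match, or None if nothing matches.
first_title = soup.select_one("h2.blog-card__content-title")
missing = soup.select_one("h3.no-such-class")
print(first_title.get_text())
print(missing)
```

Checking for `None` before using the result of `select_one()` avoids an AttributeError when a page's layout changes.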