Beautiful Soup
Beautiful Soup is a popular Python library commonly used for web scraping, the automated process of extracting large amounts of public data from target websites in seconds.
It is often used alongside Requests: Beautiful Soup extracts data from HTML and can turn even invalid markup into a parse tree, but it cannot request pages itself and is designed only for parsing.
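As a minimal sketch of that parsing-only role, the snippet below hands Beautiful Soup a hard-coded string instead of a downloaded page. The markup is made up for illustration and deliberately leaves its p and b tags unclosed; the built-in html.parser backend is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately invalid markup: <p> and <b> are never closed.
broken_html = "<p>Unclosed paragraph <b>bold text"

# No network access is involved; Beautiful Soup only parses the string it is given.
soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree is still navigable despite the missing closing tags.
paragraph_text = soup.p.get_text()
bold_text = soup.b.get_text()

print(paragraph_text)  # Unclosed paragraph bold text
print(bold_text)       # bold text
```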
Part 1: Get HTML using Requests

    import requests

    url = 'https://oxylabs.io/blog'
    response = requests.get(url)
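In practice a request can time out or return an error page, so it may be worth checking the response before parsing it. The helper below is a hedged sketch: fetch_html and its 10-second default timeout are hypothetical choices made for this example, not part of Requests or of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML body of `url`, raising on network or HTTP errors."""
    # Without a timeout, requests.get() can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://oxylabs.io/blog') would then return the HTML string that is handed to Beautiful Soup in the next part.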
Part 2: Find Element

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.h1 is the first <h1> element; soup.title would return the <title> element instead
    print(soup.h1)

The output will be:

    <h1 class="blog-header">Oxylabs Blog</h1>
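Tag names can be used as attributes of the soup object, and each one returns the first element with that name. The sketch below uses a made-up page (not the real Oxylabs markup) to show that soup.title refers to the title element in the document head, while soup.h1 refers to the first heading in the body.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """
<html>
  <head><title>Example Blog | Example Site</title></head>
  <body><h1 class="blog-header">Example Blog</h1></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title   # the <title> element from <head>
heading_tag = soup.h1    # the first <h1> element in the document

print(title_tag)                # <title>Example Blog | Example Site</title>
print(heading_tag)              # <h1 class="blog-header">Example Blog</h1>
print(heading_tag.get_text())   # Example Blog
print(heading_tag["class"])     # ['blog-header']
```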
Thanks to its simple ways of navigating, searching, and modifying the parse tree, Beautiful Soup is suitable even for beginners and usually saves developers hours of work. For example, to print all the blog titles from this page, the find_all() method (findAll() is its older alias) can be used. On this page, all the blog titles are in h2 elements with the class attribute set to blog-card__content-title. This information can be supplied to the method as follows:

    blog_titles = soup.find_all('h2', attrs={"class": "blog-card__content-title"})
    for title in blog_titles:
        print(title.text)
    # Output:
    # Prints all blog titles on the page
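To try the same lookup without depending on the live page, the sketch below runs it against a small made-up document that mimics the structure described above. It also uses the class_ keyword argument, which Beautiful Soup accepts as shorthand for filtering on the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the blog listing; not copied from the real page.
html = """
<div class="blog-card">
  <h2 class="blog-card__content-title">First post</h2>
  <h2 class="blog-card__content-title">  Second post  </h2>
  <h2 class="sidebar-title">Not a blog title</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute,
# much like attrs={"class": ...}.
blog_titles = soup.find_all("h2", class_="blog-card__content-title")

# get_text(strip=True) trims the whitespace around each title.
titles = [title.get_text(strip=True) for title in blog_titles]
print(titles)  # ['First post', 'Second post']
```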
Beautiful Soup also works with CSS selectors through the select() method, so the same titles can be retrieved without find_all() (findAll() in older code):

    blog_titles = soup.select('h2.blog-card__content-title')
    for title in blog_titles:
        print(title.text)
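CSS selectors become more useful once the query spans several levels of the tree. The sketch below, again on made-up markup with hypothetical link paths, uses a descendant selector to collect the links inside each title and select_one() to fetch only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the hrefs are placeholders for illustration.
html = """
<article class="blog-card">
  <h2 class="blog-card__content-title"><a href="/blog/first">First post</a></h2>
</article>
<article class="blog-card">
  <h2 class="blog-card__content-title"><a href="/blog/second">Second post</a></h2>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> nested inside a matching <h2>.
links = soup.select("h2.blog-card__content-title a")
hrefs = [link["href"] for link in links]
print(hrefs)  # ['/blog/first', '/blog/second']

# select_one() returns the first match, or None when nothing matches.
first_title = soup.select_one("h2.blog-card__content-title").get_text()
missing = soup.select_one("h3.no-such-class")
print(first_title)  # First post
print(missing)      # None
```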