This repository has been archived on 2023-07-05. You can view files and clone it, but cannot push or open issues/pull-requests.

# Beautiful Soup
Beautiful Soup is a popular Python library commonly used for web scraping: the automated extraction of large amounts of public data from target websites in seconds.
Often used alongside [requests](obsidian://open?vault=Coding%20Tips&file=Requests), it is a parser that extracts data from HTML and can turn even invalid markup into a parse tree. It cannot request data itself; it is designed only for parsing.
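A quick sketch of the "invalid markup" point, using a made-up HTML fragment (the string and variable names are only for illustration). Neither tag is closed, yet the built-in `html.parser` still produces a usable tree:

```python
from bs4 import BeautifulSoup

# Hypothetical broken fragment: neither <p> nor <b> is ever closed.
broken_html = "<p>Unclosed paragraph <b>bold text"

soup = BeautifulSoup(broken_html, "html.parser")

# The parser closes the open tags at the end of the input,
# so <b> ends up nested inside <p> in the parse tree.
bold = soup.find("b")
print(bold.get_text())   # bold text
print(bold.parent.name)  # p
```

Other parsers (for example `lxml` or `html5lib`, if installed) may repair broken markup differently, so the exact tree can vary by parser.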
**Part 1: Get HTML using Requests**
```python
import requests

url = 'https://oxylabs.io/blog'
response = requests.get(url)
```
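In practice the request can fail or hang, so it may help to wrap the fetch in a small helper. This is only a sketch: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text.

    Hypothetical helper: the name and default timeout are illustrative.
    """
    # Without a timeout, requests.get() can wait indefinitely.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://oxylabs.io/blog')` would then return the text that gets handed to Beautiful Soup in Part 2.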
**Part 2: Find Element**
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)
```
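`soup.title` prints the page's `<title>` element. The main heading is an `h1`, which `print(soup.h1)` would show, assuming it is the first `h1` on the page:
```html
<h1 class="blog-header">Oxylabs Blog</h1>
```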
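To see what tag-name attributes such as `soup.title` and `soup.h1` return without hitting the network, here is a small self-contained check. The HTML is made up for illustration; only the `h1` line is taken from the note.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <title> text is invented, the <h1> matches the note.
html = """
<html>
  <head><title>Example Blog Page</title></head>
  <body><h1 class="blog-header">Oxylabs Blog</h1></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title  # first <title> element in the document
main_heading = soup.h1   # first <h1> element in the document

print(page_title)    # <title>Example Blog Page</title>
print(main_heading)  # <h1 class="blog-header">Oxylabs Blog</h1>
```

Accessing a tag name as an attribute returns the first matching element, or `None` if the page has no such tag.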
Because navigating, searching and modifying the parse tree is simple, Beautiful Soup suits beginners and usually saves developers hours of work. For example, to print all the blog titles from this page, use the **find_all()** method (**findAll()** is the older alias for the same method). On this page, all the blog titles are in `h2` elements with the class attribute set to `blog-card__content-title`. This information can be passed to the method as follows:
```python
# find_all() is the current name; findAll() is the older alias for the same method.
blog_titles = soup.find_all('h2', attrs={"class": "blog-card__content-title"})
for title in blog_titles:
    print(title.text)
# Output:
# Prints all blog titles on the page
```
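The same search can be tried offline on a small, made-up snippet. This sketch uses the `class_` keyword argument, which is an alternative to passing `attrs`; the sample HTML and post titles are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the blog-card structure described above.
html = """
<div class="blog-card">
  <h2 class="blog-card__content-title">First post</h2>
</div>
<div class="blog-card">
  <h2 class="blog-card__content-title featured">Second post</h2>
</div>
<h2 class="sidebar-title">Not a blog title</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has this class among its classes,
# so the second heading (with two classes) is included as well.
headings = soup.find_all("h2", class_="blog-card__content-title")
found_titles = [h.get_text(strip=True) for h in headings]

print(found_titles)  # ['First post', 'Second post']
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python.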
Beautiful Soup also works with CSS selectors through `select()`, so `findAll()` is not even needed:
```python
blog_titles = soup.select('h2.blog-card__content-title')
for title in blog_titles:
    print(title.text)
```
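A short offline sketch of selector-based lookup, again on invented HTML. It shows `select()` returning every match and `select_one()` returning only the first; the child selector `div.blog-card > h2` is just an illustrative choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name comes from the note.
html = """
<div class="blog-card">
  <h2 class="blog-card__content-title">First post</h2>
</div>
<div class="blog-card">
  <h2 class="blog-card__content-title">Second post</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
all_titles = [h.get_text() for h in soup.select("h2.blog-card__content-title")]

# select_one() returns only the first match, or None if nothing matches.
first_title = soup.select_one("div.blog-card > h2")
missing = soup.select_one("h3.no-such-class")

print(all_titles)              # ['First post', 'Second post']
print(first_title.get_text())  # First post
print(missing)                 # None
```

Checking for `None` before calling `.get_text()` avoids an `AttributeError` when a selector matches nothing.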