Exploring BeautifulSoup Methods

In this tutorial, we will look at several ways to access HTML tags using the BeautifulSoup module. For a basic introduction to BeautifulSoup, start with the previous tutorial.

### **BeautifulSoup: Accessing HTML Tags**

The methods covered in this section traverse HTML tags by treating the HTML document as a tree: each tag is a node, and the tags nested inside it are its children.
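To make the tree idea concrete, here is a minimal sketch that parses a small, made-up HTML string and moves down and back up the tree by using tag names as attributes. The snippet and the variable names `div` and `paragraph` are assumptions chosen for illustration only.

```python
import bs4

# A tiny, made-up document used only for this illustration.
html = "<html><body><div id='box'><p>Hello</p></div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first matching tag.
div = soup.body.div
paragraph = div.p

print(div.name)               # name of the tag
print(paragraph.string)       # text inside the <p> tag
print(paragraph.parent.name)  # moving back up the tree
```

Each step down (`soup.body.div.p`) picks the first tag with that name inside the current one, while `parent` moves one level back up.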

Create a file named `sample_webpage.html` and copy the following HTML code into it:

```html
<!DOCTYPE html>
<html>

    <head>
        <title> Sample HTML Page</title>
        <style>
            * {
                margin: 0;
                padding: 0;
            }

            div {
                width: 95%;
                height: 75px;
                margin: 10px 2.5%;
                border: 1px dotted grey;
                text-align: center;
            }

            p {
                font-family: sans-serif;
                font-size: 18px;
                color: #000;
                line-height: 75px;
            }

            a {
                position: relative;
                top: 25px;
            }
        </style>
    </head>

    <body>
        <div id="first-div">
            <p class="first">First Paragraph</p>
        </div>

        <div id="second-div">
            <p class="second">Second Paragraph</p>
        </div>

        <div id="third-div">
            <a href="https://www.studytonight.com">Studytonight</a>
            <p class="third">Third Paragraph</p>
        </div>

        <div id="fourth-div">
            <p class="fourth">Fourth Paragraph</p>
        </div>

        <div id="fifth-div">
            <p class="fifth">Fifth Paragraph</p>
        </div>
    </body>
</html>
```

To read the contents of this HTML file into a variable, use the following Python code:

```python
# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
```


Now we will use different methods of the BeautifulSoup module and see how they work.

As a warm-up, let's start with the `prettify()` method, which returns the parsed document as a neatly indented string.

```python
import bs4

# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

# creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

# call prettify() to get the indented markup as a string
print(soup.prettify())
```


### **BeautifulSoup: Accessing HTML Tag Attributes**

We can retrieve the attributes of any HTML tag using the following syntax:

```python
TagName["AttributeName"]
```

Let's extract the href attribute from the anchor tag in our HTML code.

```python
import bs4

# reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

# creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

# getting anchor tag
link = soup.a

# printing the 'href' attribute of anchor tag
print(link["href"])
```
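Square-bracket access is not the only way to read attributes. The sketch below, which uses a made-up anchor tag rather than the sample page, shows the `attrs` dictionary, the `get()` method for attributes that may be missing, and how a multi-valued attribute such as `class` is returned. The `class` values and the variable names `title` and `classes` are assumptions for illustration.

```python
import bs4

# A made-up anchor tag used only for this illustration.
html = '<a href="https://www.studytonight.com" class="nav external">Studytonight</a>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# All attributes of the tag as a dictionary.
print(link.attrs)

# get() returns None when the attribute is absent,
# whereas link["title"] would raise a KeyError.
title = link.get("title")
print(title)

# Multi-valued attributes such as class come back as a list.
classes = link["class"]
print(classes)
```

Using `get()` is usually the safer choice when scraping pages where an attribute is not guaranteed to be present on every tag.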


### **BeautifulSoup:** `contents` **attribute**

`contents` is an attribute (accessed without parentheses) that holds a list of all the direct children of a tag, including any text between them. Let's list the children of the **body** tag using `contents`.

```python
body = soup.body

# getting all the children of 'body' using 'contents'
content_list = body.contents

# printing all the children using a for loop
for tag in content_list:
    # skip the whitespace-only text nodes between tags
    if str(tag).strip():
        print(tag)
        print("\n")
```
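Because `contents` is a plain Python list, it can be measured and indexed like any other list. The following sketch uses a made-up one-line snippet (so that no whitespace nodes appear) to show this, and to show that text between tags is part of the list as well. The variable name `kinds` is illustrative only.

```python
import bs4

# A made-up snippet, written on one line so there are no whitespace nodes.
html = "<body><div id='a'><p>One</p></div>plain text<div id='b'></div></body>"
soup = bs4.BeautifulSoup(html, "html.parser")

content_list = soup.body.contents

print(type(content_list))      # a regular Python list
print(len(content_list))       # two <div> tags and one text node
print(content_list[0]["id"])   # index into it like any other list

# Text nodes are NavigableString objects, tags are Tag objects.
kinds = [type(node).__name__ for node in content_list]
print(kinds)
```

Note that the `<p>` tag inside the first `<div>` is not in the list: `contents` stops at the direct children.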


### **BeautifulSoup:** `children` **attribute**

The `children` attribute is similar to `contents`, but it returns an **iterator** over the direct children, while `contents` returns a **list**. Let's see an example:

```python
body = soup.body

# an iterator can also be converted into a list using list(iterator)
for tag in body.children:
    # skip the whitespace-only text nodes between tags
    if str(tag).strip():
        print(tag)
        print("\n")
```
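The practical difference between an iterator and a list is that an iterator can be consumed only once. Here is a small sketch on a made-up snippet that illustrates this. The names `first_pass` and `second_pass` are assumptions for illustration.

```python
import bs4

# A made-up snippet used only for this illustration.
html = "<body><div id='a'></div><div id='b'></div></body>"
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

children = body.children

# The first pass over the iterator yields every direct child.
first_pass = [tag["id"] for tag in children]
# The same iterator is now exhausted, so a second pass yields nothing.
second_pass = [tag["id"] for tag in children]

print(first_pass)
print(second_pass)

# Asking for body.children again gives a fresh iterator.
print(list(body.children) == body.contents)
```

If you need to loop over the children more than once, or index into them, `contents` is the more convenient choice.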


### **BeautifulSoup:** `descendants` **attribute**

The `descendants` attribute also retrieves the tags nested inside a parent tag, but unlike `contents` and `children`, which stop at the direct children, it walks through every level of nesting. For example, if we use it on the **body** tag, it yields the first **div** tag, then that **div**'s children, then their children (including text), and only then moves on to the next **div** tag, and so on.

`descendants` returns a **generator**. Let's see an example:

```python
body = soup.body

# getting child tags of 'body' tag using 'descendants'
for tag in body.descendants:
    # skip the whitespace-only text nodes between tags
    if str(tag).strip():
        print(tag)
        print("\n")
```
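To see the traversal order more clearly, the sketch below runs `descendants` on a made-up one-line snippet and records only the tag names and text instead of printing whole tags. The variable name `order` is illustrative only.

```python
import bs4

# A made-up snippet used only for this illustration.
html = "<body><div id='a'><p>One</p></div><div id='b'><p>Two</p></div></body>"
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only: the two <div> tags.
print(len(body.contents))

# Descendants in document order: each tag is followed by everything inside it.
order = [node.name if isinstance(node, bs4.element.Tag) else str(node)
         for node in body.descendants]
print(order)
```

The first `<div>`, its `<p>`, and the text inside that `<p>` all appear before the second `<div>`, which is the depth-first order described above.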


You are now familiar with the basic ways of navigating HTML tags with BeautifulSoup. In the next tutorial, we will learn how to find a specific tag among a group of similar tags.