HTML Parsing With Beautiful Soup

Beautiful Soup is an HTML/XML parser written in Python. It excels as an easy-to-use parser that requires no knowledge of parsing theory or techniques. Thanks to its excellent documentation and many code examples, working code can be put together very quickly.

On Debian, Beautiful Soup can be installed via apt-get or aptitude:
aptitude install python-beautifulsoup

The example below extracts the hit counter from this very page. It is perhaps not the best example in the world (the only parsed value used is the “footer” section), but it does show how easily data can be extracted from an HTML page with the Beautiful Soup parser.

#!/usr/bin/env python
# coding=utf-8
 
from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2                                   # URL tools
import re                                        # Regular expressions
 
def FindHits(proxyUrl):
    # URL to HTML parse
    url = 'http://monzool.net/blog/index.php'
 
    if len(proxyUrl) > 0:
        # Proxy set up
        proxy = urllib2.ProxyHandler( {'http': proxyUrl} )
 
        # Create a URL opener utilizing the proxy
        opener = urllib2.build_opener( proxy )
        urllib2.install_opener( opener )
 
        # Acquire data from URL
        request = urllib2.Request( url )
        response = urllib2.urlopen( request )
    else:
        # Acquire data from URL
        response = urllib2.urlopen( url )
 
    # Extract data as HTML data
    html = response.read()
 
    # Parse HTML data
    soup = BeautifulSoup( html )
 
    # Search requested page for <div> section with id="footer"
    # (The result is returned in unicode)
    footer = soup.findAll( 'div', id="footer" )
 
    # Hint: on this site, it is known that only a single "footer" section
    # exists, and that the hit counter resides in that same section
 
    # Search for the phrase "Hits=<some number>"
    pattern = re.compile( r'Hits=.*[0-9]' )
    items = re.findall( pattern, str(footer[0]) )
 
    # Print result
    print items[0]        # -> "Hits=<count>"
 
 
if __name__ == "__main__":
    print "Processing..."
    FindHits("")          # Supply proxy if required. 
                          # FindHits("http://<proxyname>:<port>")

Explanation: When connecting to the internet through a proxy, some additional setup of urllib2 is required. urllib2 does provide some automatic detection of the proxy configuration, but here the configuration is made explicitly.
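The proxy handling can be sketched on its own. The snippet below is a minimal sketch that assumes Python 3, where the urllib2 functions live in urllib.request; the helper name build_http_opener and the proxy address proxy.example:8080 are made up for illustration. Unlike the listing, it returns the opener instead of installing it globally.

```python
import urllib.request


def build_http_opener(proxy_url=""):
    """Return an opener that sends HTTP requests through proxy_url, if given."""
    if proxy_url:
        # Explicit proxy, mirroring the urllib2.ProxyHandler call in the listing
        proxy = urllib.request.ProxyHandler({"http": proxy_url})
        return urllib.request.build_opener(proxy)
    # No explicit proxy: the default opener relies on automatic detection
    return urllib.request.build_opener()


if __name__ == "__main__":
    # proxy.example:8080 is a placeholder, not a real proxy
    opener = build_http_opener("http://proxy.example:8080")
    for handler in opener.handlers:
        if isinstance(handler, urllib.request.ProxyHandler):
            print(handler.proxies)
```

Calling opener.open(url) on the returned object would then fetch a page through the proxy without affecting other code that uses urllib.request.urlopen.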

Once the URL is opened, the HTML is fed to the Beautiful Soup parser. The findAll method is then used to find the HTML div section identified as “footer” (<div id="footer">). As noted, no further parsing is done, as this page contains only one footer section; otherwise Beautiful Soup provides functions such as findAllNext and findNextSiblings for iterating through the parse tree. Beautiful Soup is Unicode aware, but that is not used in this example, so the found section is converted to a plain string with str() before it is passed to re.findall.
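For readers on a newer setup, the same lookup might look roughly like this. This is only a sketch under a few assumptions: it uses Python 3 and the later bs4 package (where findAll is spelled find_all), neither of which appears in the listing above, and it parses a made-up SAMPLE_HTML string instead of downloading this page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
SAMPLE_HTML = """
<html><body>
  <div id="content"><p>Some article text</p></div>
  <div id="footer"><p>Powered by something</p><p>Hits=12345</p></div>
</body></html>
"""


def find_hits(html):
    """Return the hit count found inside <div id="footer">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching tag, or None if there is no match
    footer = soup.find("div", id="footer")
    if footer is None:
        return None
    # get_text() yields the text content of the section without any markup
    match = re.search(r"Hits=(\d+)", footer.get_text())
    return int(match.group(1)) if match else None


print(find_hits(SAMPLE_HTML))
```

Since only one footer is expected, find is used rather than find_all, which saves the indexing step footer[0] from the listing.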

The output of the search is the hit counter extracted from this page.
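One detail worth trying in isolation is the search pattern. The pattern in the listing, Hits=.*[0-9], is greedy and matches up to the last digit in the text, which is fine as long as nothing with digits follows the counter. The sketch below (Python 3, with an invented SAMPLE footer string and a hypothetical helper extract_hits) compares it with a stricter pattern that captures only the digits.

```python
import re

# Made-up footer text with extra digits after the counter
SAMPLE = '<div id="footer">Visitors today: 7 | Hits=4821 | Theme v2</div>'

# Stricter alternative: capture only the digits that follow "Hits="
HITS_PATTERN = re.compile(r"Hits=(\d+)")


def extract_hits(text):
    """Return the hit count as an int, or None if no counter is present."""
    match = HITS_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Pattern from the listing: runs on to the last digit in the text
print(re.findall(r"Hits=.*[0-9]", SAMPLE))
# Stricter pattern: stops at the end of the number
print(extract_hits(SAMPLE))
```

With this sample the original pattern also returns the trailing " | Theme v2", while extract_hits returns just the number.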

2 comments to this article

  1. Internet Explorer 6.0 Windows XP

    thanks for the great information…

  2. Opera 9.80 Linux

    Thanks for sharing. That info helped me to get another script working: https://bbs.archlinux.org/viewtopic.php?id=88779

    I don’t know about python or any other programming language, but this was easy :)

    By the way, there’s a little typo in your last comment: # FindHist

    Have fun :)
