Beautiful Soup is an HTML/XML parser written in Python. It excels as an easy-to-use parser that requires no knowledge of actual parsing theory and techniques. And thanks to the excellent documentation with many code examples, it is easy to put together working code very quickly.
On Debian, Beautiful Soup can be installed via apt-get / aptitude:
aptitude install python-beautifulsoup
The example below extracts the hit counter from this very page. Note that this is perhaps not the best example in the world (the only parsed value used is the “footer” section), but it does exemplify how easily data can be extracted from an HTML page using the Beautiful Soup parser.
#!/usr/bin/env python
# coding=utf-8

from BeautifulSoup import BeautifulSoup  # For processing HTML
import urllib2                           # URL tools
import re                                # Regular expressions


def FindHits(proxyUrl):
    # URL to HTML parse
    url = 'https://monzool.net/blog/index.php'

    if len(proxyUrl) > 0:
        # Proxy set up
        proxy = urllib2.ProxyHandler({'http': proxyUrl})
        # Create an URL opener utilizing the proxy
        opener = urllib2.build_opener(proxy)
        urllib2.install_opener(opener)
        # Acquire data from URL
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
    else:
        # Acquire data from URL
        response = urllib2.urlopen(url)

    # Extract data as HTML data
    html = response.read()

    # Parse HTML data
    soup = BeautifulSoup(html)

    # Search requested page for section with id="footer"
    # (The result is returned in unicode)
    footer = soup.findAll('div', id="footer")

    # Hint: on this site, it is known that only a single "footer" section
    # exists, and that the hit counter resides in that same section

    # Search for the phrase "Hits="
    pattern = re.compile(r'Hits=.*[0-9]')
    items = re.findall(pattern, str(footer[0]))

    # Print result
    print items[0]  # -> "Hits=<count>"


if __name__ == "__main__":
    print "Processing..."
    FindHits("")  # Supply proxy if required.
    # FindHits("http://:")
Explanation: if connecting to the internet through a proxy, some additional setup must be done for urllib2. urllib2 does provide some automatic proxy configuration detection, but here the configuration is made explicitly.
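The automatic detection mentioned above picks up standard environment variables such as http_proxy. A minimal sketch of inspecting those settings (shown with Python 3's urllib.request, where urllib2's functionality now lives; the proxy address is made up for illustration):

```python
import os
import urllib.request

# Pretend a proxy is configured in the environment (made-up address)
os.environ['http_proxy'] = 'http://proxy.example.com:3128'

# getproxies() collects proxy settings from the environment
proxies = urllib.request.getproxies()
print(proxies.get('http'))  # -> "http://proxy.example.com:3128"
```

When no opener is installed explicitly, urlopen consults these detected settings on its own.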
When the URL is opened, the HTML is fed to the Beautiful Soup parser. Thereafter the member function findAll is used to find the HTML div section identified as “footer”.
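To make the final extraction step concrete, here is a self-contained sketch of the regular-expression part alone, run against a hypothetical footer snippet (the div content is invented for illustration):

```python
import re

# Hypothetical footer markup, standing in for the real findAll result
footer_html = '<div id="footer">Powered by WordPress. Hits=12345</div>'

# Same pattern as in the full example above
pattern = re.compile(r'Hits=.*[0-9]')
items = re.findall(pattern, footer_html)

print(items[0])  # -> "Hits=12345"
```

The greedy `.*` followed by `[0-9]` makes the match run to the last digit, so the full counter value is captured after "Hits=".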