Why Lambda?

I HAVE BEEN reading up on Python programming lately (more on that in a later post). I’ve now been introduced to anonymous functions. In Python, anonymous functions are available using the lambda keyword. Anonymous functions are great, but I think the Lua syntax for anonymous functions is superior to the syntax adopted in Python.

A normal function, in Python, is defined using the def keyword along with a function name.

>>> def f1(x, y):
...     return x + y
... 
>>> f1(1, 2)
3

In Python, anonymous functions are created with a lambda expression.

>>> f2 = lambda x, y: x + y
>>> f2(1, 2)
3

Similar to anonymous functions, normal Python functions are first-class objects and can be assigned to other variables.

>>> f = f1
>>> f(1, 2)
3

However, direct assignment of a function declaration is not possible.

>>> f = def f3(x, y):
  File "<stdin>", line 1
    f = def f3(x, y):
           ^
SyntaxError: invalid syntax

This last example resembles the anonymous function syntax chosen in Lua. First, a look at how a normal function is defined in Lua. It’s not that different from the Python version.

> function f1(x, y)
>>   return x + y
>> end
> print( f1(1, 2) )
3

As in Python, functions are first-class objects in Lua, and thus aliasing a function is also supported.

> f = f1
> print( f(1, 2) )
3

The syntax for an anonymous function in Lua does not differ much from how a normal function is defined. The function name is omitted (hence anonymous), and the function definition is wrapped in parentheses.

> f2 = (function(x, y)
>>   return x + y 
>> end)
> print( f2(1, 2) )
3
> -- Or as a one-liner if preferred
> f2 = (function(x, y) return x + y end)
> print( f2(1, 2) )
3

In Lua a function is a function and is defined as such, anonymous or not. I think this approach is more elegant than using a dedicated lambda keyword.
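
It is worth noting that the dedicated lambda keyword also limits the function body to a single expression; statements are not allowed, whereas a Lua anonymous function, being an ordinary function, has no such restriction. A quick illustration in the Python shell (error output abbreviated, the name f3 is just for the example):

>>> f3 = lambda x, y: return x + y
SyntaxError: invalid syntax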

Help On Python Regular Expressions

REGULAR EXPRESSIONS ARE a powerful friend, but the friendship doesn’t come easy. Regular expressions can be somewhat baffling to get a grasp of, but once finally understood, the possibilities are almost endless.

When developing the search expression used in HTML Parsing With Beautiful Soup, I realized that my regular expression knowledge had gotten a bit rusty. Fortunately I had doubled up on luck: 1) it was a Python program, hence the Python shell was available, and 2) I found David Mertz’s book Text Processing in Python.

The Python shell makes it easy to experiment and tweak any regular expression on the fly, but the downside is that it’s not easy to visually evaluate the outcome of your current expression. David’s book helped twofold. It has extensive theory on Python regular expression syntax, but most superhero-like is the small function it provides, which makes it possible to see the outcome of an evaluated expression.

# Credits: David Mertz
# Print s with every match of pat wrapped in '{' and '}'
def re_show(pat, s):
    print re.compile( pat, re.M ).sub( "{\g<0>}", s.rstrip() ), '\n'

Using regular expressions in Python requires importing the regular expression library.

import re

If using the Python shell, just enter the same at the shell prompt. The function by David Mertz can also be typed directly into the shell:

>>> import re
>>> def re_show(pat, s):
...    print re.compile( pat, re.M ).sub( "{\g<0>}", s.rstrip() ), '\n'
... 
>>>

The re_show wrapper displays the source string and emphasizes the result of the expression as the contents between a ‘{‘ and ‘}’ pair.

Next is the creation of some example text on which to experiment.

>>> s = 'if (Hulk.color != "green"): print "Grey Hulk"'

Now the experiments can begin. The following searches for everything from the first ‘(‘ to the last ‘)’.

>>> re_show(r'\(.*\)', s)

Result:

if {(Hulk.color != "green")}: print "Grey Hulk"

Another example could be a case-insensitive match on the colors of the Hulk.

>>> re_show(r'(?i)green|grey', s)

Result:

if (Hulk.color != "{green}"): print "{Grey} Hulk"
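
A third experiment could be to match each quoted string. This also shows how the braces mark multiple matches:

>>> re_show(r'"[^"]*"', s)

Result:

if (Hulk.color != {"green"}): print {"Grey Hulk"}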

This is just a minuscule introduction to the powers of regular expressions. If you’re into regular expressions in Python, I highly recommend buying the book, or donating and reading it online.

HTML Parsing With Beautiful Soup

BEAUTIFUL SOUP IS an HTML/XML parser written in Python. Beautiful Soup excels as an easy-to-use parser that requires no knowledge of actual parsing theory and techniques. And thanks to the excellent documentation with many code examples, it is easy to fabricate some working code very quickly.

On Debian, Beautiful Soup can be installed via apt-get / aptitude:
aptitude install python-beautifulsoup

The example below extracts the hit counter from this very page. Note that this is perhaps not the best example in the world (the only parsed value is the “footer” section), but it does exemplify how easily data can be extracted from an HTML page when utilizing the Beautiful Soup parser.

#!/usr/bin/env python
# coding=utf-8
 
from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2                                   # URL tools
import re                                        # Regular expressions
 
def FindHits(proxyUrl):
    # URL of the HTML page to parse
    url = 'http://monzool.net/blog/index.php'
 
    if len(proxyUrl) > 0:
        # Proxy set up
        proxy = urllib2.ProxyHandler( {'http': proxyUrl} )
 
        # Create a URL opener utilizing the proxy
        opener = urllib2.build_opener( proxy )
        urllib2.install_opener( opener )
 
        # Acquire data from URL
        request = urllib2.Request( url )
        response = urllib2.urlopen( request )
    else:
        # Acquire data from URL
        response = urllib2.urlopen( url )
 
    # Extract data as HTML data
    html = response.read()
 
    # Parse HTML data
    soup = BeautifulSoup( html )
 
    # Search requested page for <div> section with id="footer"
    # (The result is returned in unicode)
    footer = soup.findAll( 'div', id="footer" )
 
    # Hint: on this site, it is known that only a single "footer" section
    # exists, and that the hit counter resides in that same section
 
    # Search for the phrase "Hits=<some number>"
    pattern = re.compile( r'Hits=.*[0-9]' )
    items = re.findall( pattern, str(footer[0]) )
 
    # Print result
    print items[0]        # -> "Hits=<count>"
 
 
if __name__ == "__main__":
    print "Processing..."
    FindHits("")          # Supply proxy if required.
                          # FindHits("http://<proxyname>:<port>")

Explanation: If connecting to the internet through a proxy, some additional setup of urllib2 must be done. urllib2 does provide some automatic proxy configuration detection, but here the configuration is made explicitly.

When the URL is opened, the HTML is fed to the Beautiful Soup parser. After that, the member function findAll is used to find the HTML div section identified as “footer” (<div id="footer">). As noted, no further parsing is done, since this page contains only one footer section, but Beautiful Soup otherwise provides functions like findAllNext and findNextSiblings to iterate through the parse tree. (Beautiful Soup is unicode aware, but unicode is not needed in this example, so the found section is converted to a plain string before being passed to findall.)
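
As a minimal sketch of that tree navigation (the HTML snippet and tag ids below are made up for illustration, not taken from the actual page):

from BeautifulSoup import BeautifulSoup

# Hypothetical HTML used only to illustrate sibling navigation
html = '<div id="header">Top</div><div id="content">Body</div><div id="footer">Hits=42</div>'
soup = BeautifulSoup( html )

# Find the first <div> and walk through its following siblings
header = soup.find( 'div', id="header" )
for sibling in header.findNextSiblings( 'div' ):
    print sibling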

The resulting output of the search is the hit counter extracted from this page.
