Chris Lamb

No citations please, we're British

import re
import urllib

from lxml import etree
from BeautifulSoup import BeautifulSoup

tree = etree.parse(urllib.urlopen(''))

for item in tree.xpath('//*[name()="content:encoded"]'):
    soup = BeautifulSoup(item.text)

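    for link in soup.findAll('a'):
        # Iterate over a copy, since replaceWith() mutates link.contents
        for x in list(link.contents):
            # Only text nodes can carry a trailing " [N]" marker
            if not isinstance(x, basestring):
                continue

            # Kids and their damn hypertext
            x.replaceWith(re.sub(r' \[\d+\]$', '', x))

    item.text = unicode(soup)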

print etree.tostring(tree)
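
The interesting part of the script is the regular expression, which only removes a " [N]" marker when it sits at the very end of a link's text. A minimal standalone sketch of that behaviour, using only the standard library; the helper name strip_citation and the sample strings are made up for illustration and are not part of the script:

```python
import re

# Same pattern as in the script: a space, a bracketed number, end of string.
CITATION_SUFFIX = re.compile(r' \[\d+\]$')


def strip_citation(text):
    """Hypothetical helper: drop a trailing ' [N]' marker from link text."""
    return CITATION_SUFFIX.sub('', text)


# Trailing marker is removed.
print(strip_citation('Debian [3]'))
# Marker in the middle of the text is left alone.
print(strip_citation('see [3] for details'))
# No leading space before the bracket, so nothing matches.
print(strip_citation('Debian[3]'))
```

Anchoring the pattern with `$` is what keeps bracketed numbers elsewhere in the text intact.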

Chris Lamb is a freelance software developer and the current Debian Project Leader. You can read other posts by me, see software I have written or read more about me. You can also follow me @lolamby.


Tuesday 11th August 2009
