Bed, Bath & Beyond Screen Scraper
Language: Python
Date: August 2004
When developing the website for our wedding, we wanted to combine our various registries on one page, to make buying a gift as simple as possible for our guests. To achieve this we needed "screen-scraper" scripts that extract the relevant data from the retailers' websites. That would be a simple enough task if those retailers used well-structured, semantically rich markup on their pages, but it is a tricky undertaking when faced with the "tag soup" of Bed, Bath & Beyond's site.
This script consists of two sections. The first extends the SGMLParser class from Python's sgmllib module to parse the relevant BB&B page and return it as a nested dictionary. The second converts that dictionary into a simple, structured XML format. This separation allows for re-use of code should BB&B restructure their site, and lets the XML output stage be swapped for a module that produces, say, RSS or Atom feeds for use in a newsreader.
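To illustrate the kind of swap this separation allows, here is a minimal sketch of an alternative output stage that turns the same nested dictionary into a bare-bones RSS 2.0 feed. It is written for Python 3 and is not part of the original script: the function name `items_to_rss`, the channel title, link and description, and the sample rows are all assumptions made for illustration. The column indices follow those documented in the parser's comments.

```python
from xml.dom import minidom


def items_to_rss(items, channel_title="Wedding registry",
                 channel_link="http://example.com/wedding-list"):
    """Build a small RSS 2.0 document from the parser's nested dictionary.

    `items` maps row numbers to dictionaries keyed by column index:
    3 = href, 4 = product name, 6 = UPC, 8 = price.
    The channel title and link are hypothetical placeholders.
    """
    doc = minidom.parseString('<rss version="2.0"><channel/></rss>')
    channel = doc.getElementsByTagName("channel")[0]

    def add_text_element(parent, name, text):
        element = doc.createElement(name)
        element.appendChild(doc.createTextNode(text))
        parent.appendChild(element)

    add_text_element(channel, "title", channel_title)
    add_text_element(channel, "link", channel_link)
    add_text_element(channel, "description", "Items on the registry")

    for row_no in sorted(items):
        row = items[row_no]
        # Skip rows with no UPC, and heading rows whose UPC cell reads 'Upc'
        if 6 not in row or row[6] == "Upc":
            continue
        entry = doc.createElement("item")
        add_text_element(entry, "title", row.get(4, ""))
        if 3 in row:
            add_text_element(entry, "link", row[3])
        add_text_element(entry, "description", "Price: " + row.get(8, ""))
        channel.appendChild(entry)
    return doc.toxml()


# Invented sample data in the shape the parser returns
sample = {
    1: {4: "Description", 6: "Upc"},
    2: {3: "http://example.com/towel", 4: "Bath Towel", 6: "12345", 8: "$9.99"},
}
feed = items_to_rss(sample)
```

Because the function only reads the dictionary, the parser itself would not need to change to support a second output format.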
The server is set up to run this script four times an hour, rather than on every page view, so as to limit the load from the HTTP request it makes, and to place the resulting XML in a file on the server. Whenever anyone requests the 'wedding list' page on the website, this file is loaded, transformed using a simple XSLT file and delivered to the user.
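Since the page can be requested while the scheduled job is rewriting the file, one way to avoid serving a half-written file is to write to a temporary file and rename it into place. The following is a minimal Python 3 sketch under that assumption; the function name `write_atomically` is hypothetical, and the original script simply prints its XML to standard output.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write `text` to `path` so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so that the rename
    # stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # atomic replacement on POSIX systems
    except BaseException:
        # Remove the temporary file if anything went wrong, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A scheduler entry such as cron's `*/15 * * * *` would give the four runs an hour described above; how the original server was scheduled is not stated.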
In the near future I hope to extend the script to retry automatically should the HTTP request fail, and to email an administrator should any process fail repeatedly. I have also emailed Bed, Bath & Beyond to tell them about this script and to ask that they both improve the quality of their markup and develop facilities for web developers seeking to integrate with their site.
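The retry behaviour described here might look something like the following Python 3 sketch. It is not part of the script: `fetch_with_retries`, its parameters and the `notify` callback are hypothetical names, and the notification is left as a plain callable rather than a real email sender.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=5.0, notify=None, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `attempts` tries have failed.

    `fetch` is any zero-argument callable that returns the page or raises
    OSError (urllib.error.URLError is a subclass of OSError in Python 3).
    `notify`, if given, is called once with a message after the final failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                sleep(delay * attempt)
    if notify is not None:
        notify("Registry fetch failed %d times: %s" % (attempts, last_error))
    raise last_error
```

In use, `fetch` could wrap the HTTP request and `notify` could wrap a function that emails the administrator, for example one built on the standard `smtplib` module.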
The Code:
#!/usr/bin/python
# Bed, Bath & Beyond registry screen-scraper
# By James Stewart
# v2 - 26th August 2004
# The only configuration variable we need as input is the registry id.
# note that this is not the registry number provided by BB&B, and is alphanumeric.
registry = "-884877278"
from xml.dom import minidom
from sgmllib import SGMLParser
import re, urllib
class BBBParser(SGMLParser):
    # This class takes the HTML from a Bed, Bath & Beyond wedding registry printable
    # page and outputs it as a dictionary of dictionaries. The useful entries are:
    #
    # [x][3]: href, [x][4]: product name, [x][6]: upc, [x][8]: price, [x][10]: requested
    # [x][12]: purchased

    def reset(self):
        SGMLParser.reset(self)
        # SGMLParser.__init__ calls reset(), so the parser state is set up here. Keeping
        # it on the instance stops separate parsers from sharing one items dictionary.
        self.startparse = 0   # 1 once 'Description' is seen, 2 once a further row starts
        self.startrow = 0     # 1 while waiting for the first cell of a new row
        self.parserow = 0     # 1 while the current row holds data we want
        self.column_no = 0
        self.row_no = 0
        self.items = {}

    def parse(self, html):
        self.feed(html)
        self.close()  # flush any data still buffered by the parser
        return self.items

    def start_tr(self, attrs):
        self.row_no = self.row_no + 1
        self.items[self.row_no] = {}
        # We need to dispense with the first row of our table
        # (it provides headings). The second one is what we want
        if self.startparse == 2:
            self.startrow = 1
        if self.startparse == 1:
            self.startparse = 2

    def start_a(self, attrs):
        # Column number 4 is the one that contains the link and the description.
        # We don't want to have to choose, so let's use index 3 to hold the link
        if self.column_no == 4:
            # Extract any hrefs from the list of (name, value) tuples; anchors
            # without an href are ignored
            hrefs = [value for name, value in attrs if name == 'href']
            if hrefs:
                self.items[self.row_no][3] = hrefs[0]

    def end_tr(self):
        self.parserow = 0
        self.column_no = 0

    def start_td(self, attrs):
        self.column_no = self.column_no + 1
        if self.startrow == 1:
            self.startrow = 0
            # If the first cell of the row has colspan="15" we want to skip that row,
            # so self.parserow = 0. Otherwise we may want it, so let's continue
            if ('colspan', '15') in attrs:
                self.parserow = 0
            else:
                self.parserow = 1

    def handle_data(self, text):
        # We'd better hope they don't start using 'Description' elsewhere in the page
        if text == 'Description':
            self.startparse = 1
        if self.parserow > 0:
            # Only the even-numbered columns are stored
            if not self.column_no % 2:
                row = self.items[self.row_no]
                if self.column_no in row:
                    # Text can arrive in several pieces, so append to what we have
                    row[self.column_no] = row[self.column_no] + text
                else:
                    row[self.column_no] = text

sock = urllib.urlopen("http://www.bedbathandbeyond.com/regGiftRegistry.asp?order_num=-1&WRN="+registry+"&st=D&smode=prt&show_images=N")
html = sock.read()
sock.close()
# The ASCII parser chokes on (registered mark), so let's strip out all numeric
# character references (&#NNNN;) rather than try to decode them
html = re.sub(r"&#\d{1,4};", "", html)
# Run our parser
parser = BBBParser()
items = parser.parse(html)
# The next step is to convert this into our XML file
xml = minidom.parseString('<bbbitems></bbbitems>')
# Walk the rows in page order; dictionary order alone is not guaranteed
for item in sorted(items):
    # Only rows with a UPC are gifts; rows whose UPC cell reads 'Upc' are headings
    if 6 in items[item]:
        if items[item][6] != 'Upc':
            gift = xml.createElement('item')
            if 3 in items[item]:
                link = xml.createElement('link')
                link.appendChild(xml.createTextNode(items[item][3]))
                gift.appendChild(link)
            title = xml.createElement('title')
            titletext = items[item][4]
            title.appendChild(xml.createTextNode(titletext))
            gift.appendChild(title)
            want = xml.createElement('want')
            want.appendChild(xml.createTextNode(items[item][10]))
            gift.appendChild(want)
            have = xml.createElement('have')
            have.appendChild(xml.createTextNode(items[item][12]))
            gift.appendChild(have)
            price = xml.createElement('price')
            price.appendChild(xml.createTextNode(items[item][8]))
            gift.appendChild(price)
            xml.getElementsByTagName('bbbitems')[0].appendChild(gift)

print xml.toprettyxml()
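
The `sgmllib` module used above was removed in Python 3. As a rough illustration of how the row-and-column bookkeeping might carry over, here is a simplified sketch built on `html.parser.HTMLParser`. It is not a port of the original: the class name `RegistryTableParser` and the sample HTML are invented for the example, and it leaves out the heading detection and `colspan` filtering.

```python
from html.parser import HTMLParser


class RegistryTableParser(HTMLParser):
    """Collect even-numbered table cells into {row_no: {column_no: text}}."""

    def __init__(self):
        super().__init__()
        self.row_no = 0
        self.column_no = 0
        self.in_cell = False
        self.items = {}

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_no += 1
            self.column_no = 0
            self.items[self.row_no] = {}
        elif tag == "td" and self.row_no:
            self.column_no += 1
            self.in_cell = True
        elif tag == "a" and self.in_cell and self.column_no == 4:
            # As in the original, the link from column 4 is stored at index 3
            href = dict(attrs).get("href")
            if href:
                self.items[self.row_no][3] = href

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        # Only even-numbered columns carry data; text may arrive in pieces
        if self.in_cell and self.column_no % 2 == 0:
            row = self.items[self.row_no]
            row[self.column_no] = row.get(self.column_no, "") + data


# Invented sample markup with empty odd-numbered spacer cells
sample_html = (
    "<table>"
    "<tr><td></td><td></td><td></td><td>Description</td><td></td><td>Upc</td></tr>"
    '<tr><td></td><td></td><td></td><td><a href="/towel">Bath &amp; Hand Towel</a></td>'
    "<td></td><td>12345</td><td></td><td>$9.99</td></tr>"
    "</table>"
)

parser = RegistryTableParser()
parser.feed(sample_html)
parser.close()
rows = parser.items
```

The resulting dictionary has the same shape as the one the XML stage consumes, so a conversion loop along the lines of the one above could be reused with few changes.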
