matt laine dot com / blog


Using Python and Shell to Manipulate Large Data

Posted on Apr. 03, 2009, under python, shell scripting, web development

I’ve been working on a project where we are dealing with a large amount of data. So large, in fact, that it will crash most programs that try to open it (try parsing a 60MB CSV file with Excel). So I’m using the Mac OS X Terminal to read and sort through the data with UNIX commands.

Often, as a web developer, you are asked to step outside of your normal routine and do some investigating. In my case, I was asked to audit a website with over 30,000 pages. Simply figuring out how many pages were on the site was a large task on its own.

Getting a “complete” set of URLs

I put “complete” in quotes because for some sites it’s just not feasible to find 100% of the pages. The good thing is that a really large site will have a lot of stuff that the CEO or whoever just doesn’t care about. Still, I wanted to capture as much data as I could as my starting point.

First, I used Xenu’s Link Sleuth to crawl the site for links. This is not the best way to crawl a site, but it was a quick way to get my set of URLs. One reason this method isn’t the best is that it fails to find orphaned pages (pages that are not linked to by other pages). However, it does record some interesting tidbits about each page, such as the content-type, the folder level relative to the site root, and whether the link was broken (404) or forbidden (403).

Bash that data into submission

After exporting this data, my file was quite large. Since the data was in tab-delimited format, I knew I could extract the URLs by taking the first field of each line, that is, everything before the first tab character. I used the awk program to do this.

$ awk -F"\t" '{print $1}' url.txt >> url_results.txt
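For comparison, the same first-column extraction can be sketched in Python 3. This is a minimal sketch, not what I actually ran: the function name `first_fields` and the sample rows are made up for illustration, and it assumes the export is tab-delimited with the URL in the first column. Unlike awk, it skips empty lines.

```python
import io


def first_fields(lines, delimiter="\t"):
    """Yield the first delimiter-separated field of each non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line.split(delimiter)[0]


# Hypothetical sample rows standing in for the crawler export.
sample = io.StringIO(
    "http://example.com/a.html\ttext/html\t1\n"
    "http://example.com/b.php\ttext/html\t2\n"
)
urls = list(first_fields(sample))
print(urls)
```

Because the function takes any iterable of lines, an open file object can be passed in directly, so a large export is read one line at a time rather than loaded whole.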

Next, I wanted to filter out any duplicates. I did this by using sort with the -u and -f flags. This tells sort to treat uppercase and lowercase letters as equal when comparing, and to keep only one copy of each line. The result is saved in a separate text file.

$ sort -u -f url_results.txt > url_results_uniq.txt
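The de-duplication step can be sketched in Python 3 as well. Again, this is only an approximation under stated assumptions: `unique_ignoring_case` is a hypothetical helper, it keeps the first spelling it sees for each URL, and its ordering is a plain uppercase comparison, whereas the real sort command's ordering depends on the locale.

```python
def unique_ignoring_case(lines):
    """Return lines sorted case-insensitively, one line per case-folded value."""
    seen = {}
    for line in lines:
        line = line.rstrip("\n")
        key = line.upper()
        if key not in seen:  # keep the first spelling encountered
            seen[key] = line
    return [seen[key] for key in sorted(seen)]


# Hypothetical sample: the last two lines differ only in case.
sample = [
    "http://example.com/B.html\n",
    "http://example.com/a.html\n",
    "http://EXAMPLE.com/a.html\n",
]
print(unique_ignoring_case(sample))
```

Note that treating URLs as case-insensitive is a simplification: the path part of a URL can be case-sensitive on some servers, so this may merge pages that are actually different.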

Enter the Python

Now my URL list is more manageable and I can play around with it. One thing I wanted to do was look at the HTML source of each page, so I could check which CSS files it links to, and so on. Enter Python, a scripting language that also comes with an interactive interpreter. It’s fast at processing large files, and it pushes you to handle errors like 404s or page timeouts explicitly. Using regular expressions (again) to match the CSS file reference, my Python script goes a little like this:

# Python 2 script: urllib2 became urllib.request / urllib.error in Python 3
import re, urllib2, socket
socket.setdefaulttimeout(10) ##sets a 10-second timeout for every request

def analyzeData(url, htmlcode): ##this function does the regex matching
    ##placeholder patterns; Python regexes take no /.../ delimiters
    patterns = dict(firststyle=r'pattern1\.css', secondstyle=r'pattern2\.css')
    for template in patterns:
        regex = patterns[template]
        if re.search(regex, htmlcode):
            ##open the file named after the template and append the url to it
            fh = open(template+".txt", "a")
            fh.write(url+"\n")
            fh.close()
            print url+", "+template

f = open('url_results_uniq.txt', 'r') ##opens the list of unique urls
httperror = open('httperror.txt', 'w') ##open/create a file to store HTTP errors
timeouts = open('timeouts.txt', 'w') ##open/create a file to store time outs
index = 0
for line in f:
    ##checks for .html or .php in url path (excludes images, css, etc.)
    urlregex = re.match(r'(^http://.*\.html|^http://.*\.php)', line)
    if urlregex:
        url = urlregex.group(0)
        try:
            req = urllib2.Request(url)
            page = urllib2.urlopen(req)
            htmlcode = page.read()
            analyzeData(url, htmlcode)
        except urllib2.HTTPError: ##handle exceptions
            httperror.write(url+"\n")
            print str(index)+' - http error for '+url
        except urllib2.URLError:
            timeouts.write(url+"\n")
            print str(index)+' - url error (timeout) for '+url
        index = index + 1
f.close()
httperror.close()
timeouts.close()

The first part of the script imports the libraries it needs. For this example, I’m using the regular expression library (re), urllib2 to send an HTTP request and return the HTML source, and socket to set the timeout for each page manually. The second part is the function that actually does the regex matching on the HTML source. I found out that the function had to be defined before it was called, which is different from what I’m used to in PHP. Lastly, I’m opening, reading, and looping through my list of URLs and calling ‘analyzeData’ for each URL. If the function finds a match in the HTML source, it writes the URL to a file named either ‘firststyle.txt’ or ‘secondstyle.txt’.
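For anyone on a newer Python, the same three pieces (URL filter, fetch with error handling, pattern matching) can be sketched in Python 3, where urllib2 has been split into urllib.request and urllib.error. This is a rough sketch, not a drop-in replacement for my script: the helper names `is_page_url`, `matching_templates`, and `fetch` are invented here, the patterns are the same placeholders, and decoding every page as UTF-8 is an assumption.

```python
import re
import urllib.error
import urllib.request

# Placeholder patterns modeled on the script above; swap in real stylesheet names.
PATTERNS = {
    "firststyle": re.compile(r"pattern1\.css"),
    "secondstyle": re.compile(r"pattern2\.css"),
}
PAGE_URL = re.compile(r"^http://.*\.(?:html|php)")


def is_page_url(line):
    """True if the line starts with an http:// URL containing .html or .php."""
    return PAGE_URL.match(line) is not None


def matching_templates(htmlcode, patterns=PATTERNS):
    """Return the sorted names of the patterns found in the HTML source."""
    return sorted(name for name, rx in patterns.items() if rx.search(htmlcode))


def fetch(url, timeout=10):
    """Return (html, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as page:
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            return page.read().decode("utf-8", errors="replace"), None
    except urllib.error.HTTPError as exc:  # checked first: subclass of URLError
        return None, "http error %d" % exc.code
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return None, "url error (%s)" % exc


# Offline demonstration on a hand-written snippet (no network access needed).
html = '<link rel="stylesheet" href="/css/pattern1.css">'
print(is_page_url("http://example.com/about.html\n"))
print(is_page_url("http://example.com/logo.png\n"))
print(matching_templates(html))
```

Keeping the matching separate from the fetching makes it possible to try the patterns against saved HTML before pointing the script at 30,000 live pages.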

My first impressions on Python

This was the first time I used Python. I’d previously tried to do the same thing using PHP and cURL, but the script kept timing out or stalling, so I changed technology. I also looked to see if I could leverage a Python framework, such as Django, for the task, but for what I needed I ended up using straight-up Python. Python felt extremely fast for this job, and I like the combination of Python and the shell. I also found it useful to print out something for each URL that I’m checking. That way I know the script is still running, even if no match was found or an error occurred.

Probably the hardest thing to get used to in Python is indenting the code. You have to indent properly or Python will not be able to interpret your code. I imagine this is an attempt to get rid of curly braces, a la C or PHP, and to its credit, it does force you to write code consistently. The downside to using Python is that it has a pretty steep learning curve. I’m the type of person who won’t dive into something like Python without a project or a reason (other than curiosity).
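To make the indentation point concrete, here is a tiny Python 3 sketch. The helper name `compiles` and the two snippets are just for illustration. It asks Python to compile the same loop twice, once with the body indented and once without.

```python
good = "for n in range(3):\n    total = n\n"
bad = "for n in range(3):\ntotal = n\n"  # loop body is not indented


def compiles(source):
    """Return True if Python accepts the source, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(good), compiles(bad))
```

The only difference between the two snippets is the four spaces before the loop body, which is exactly the job the curly braces do in C or PHP.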
