Python modules you should know: Scrapy

April 22, 2012 at 10:50 AM | categories: Python, PyMYSK, Howto

Next in our series of Python modules you should know is Scrapy. Do you want to be the next Google? Well, read on.

Home page: http://scrapy.org


Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

You can use Scrapy to extract any kind of data from a web page, in HTML, XML, CSV and other formats. I recently used it to automate the extraction of domains and email addresses from the ISPA Spam Hall of Shame list, for use in a DNSBL.


pip install scrapy


Scrapy is a very extensive package, so it is not possible to describe its full usage in a single blog post. There is a tutorial on the Scrapy website, as well as extensive documentation.

For this post, I will describe how I used it to extract the listed domains from the ISPA Hall of Shame website.

Looking at the page source, you find that the domains are displayed in list items, with the bold text "Domains: " before the comma-separated list of domains (placeholder values stand in for the actual entries below):

    <li><strong>Domains: </strong>domain1, domain2, ...
    <li><strong>Addresses: </strong>address1, address2, ...

The XPath expression to extract this is:

'//li/strong[text()="Domains: "]/following-sibling::text()'

For more information about XPath, see the XPath reference.
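Before wiring the expression into a spider, you can sanity-check it against a small snippet of HTML. The quick test below is my own illustration (it uses lxml and a made-up sample; neither appears in the original post):

# Sanity-check the XPath expression against a made-up HTML sample
from lxml import html

SAMPLE = """
<ul>
<li><strong>Domains: </strong>spammer-one.example, spammer-two.example
<li><strong>Addresses: </strong>192.0.2.1, 192.0.2.2
</ul>
"""

tree = html.fromstring(SAMPLE)
# Select the text node that follows the bold "Domains: " label,
# then split the comma-separated entries
for text in tree.xpath('//li/strong[text()="Domains: "]/following-sibling::text()'):
    print [d.strip() for d in text.split(',') if d.strip()]

Running this prints ['spammer-one.example', 'spammer-two.example'], confirming that only the text after the "Domains: " label is selected.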

With the XPath expression in hand, we can now write a spider to download the web page and extract the data we want.

Create a Python file, say ispa_spider.py (any name will do), with the following contents:

# -*- coding: utf-8 -*-
# Copyright (C) 2012  Andrew Colin Kissa <>
# vim: ai ts=4 sts=4 et sw=4

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ISPASpider(BaseSpider):
    name = "ispa-domains"
    # The domain and start URL below are assumed (the ISPA Hall of Shame
    # lives under ispa.org.za); adjust if the page has moved.
    allowed_domains = ["ispa.org.za"]
    start_urls = ["http://ispa.org.za/spam/hall-of-shame/"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Grab the text nodes that follow the bold "Domains: " label
        lines = hxs.select(
            '//li/strong[text()="Domains: "]/following-sibling::text()'
        ).extract()
        for line in lines:
            domains = line.split(',')
            domains = [domain.strip() for domain in domains if domain.strip()]
            for domain in domains:
                print domain

You can then run the spider from the command line, and it should print the list of extracted domains:

scrapy runspider ispa_spider.py --nolog

And there is more

This post just touches the tip of what Scrapy can do; use the documentation for details on what can be done with this package.
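For example, instead of printing the domains, the spider could yield Scrapy items, letting the built-in feed exports write the results to a file. The sketch below is an extension of my own, not part of the original spider (the URL is assumed, as above):

# -*- coding: utf-8 -*-
# A sketch: the same spider reworked to yield items instead of printing
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DomainItem(Item):
    """Holds a single listed domain."""
    domain = Field()


class ISPADomainsSpider(BaseSpider):
    name = "ispa-domains-items"
    # Assumed location of the Hall of Shame page, as in the spider above
    allowed_domains = ["ispa.org.za"]
    start_urls = ["http://ispa.org.za/spam/hall-of-shame/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        lines = hxs.select(
            '//li/strong[text()="Domains: "]/following-sibling::text()'
        ).extract()
        for line in lines:
            # Yield one item per non-empty comma-separated entry
            for domain in (d.strip() for d in line.split(',')):
                if domain:
                    yield DomainItem(domain=domain)

With items in play, a command such as

scrapy runspider ispa_spider.py -o domains.csv -t csv

saves the extracted domains to a CSV file without any extra code.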
