Python modules you should know: Scrapy

April 22, 2012 at 10:50 AM | categories: Python, PyMYSK, Howto

Next in our series of Python modules you should know is Scrapy. Do you want to be the next Google? Well, read on.

Home page: http://scrapy.org/

Use

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

You can use Scrapy to extract any kind of data from a web page, in HTML, XML, CSV and other formats. I recently used it to automate the extraction of domains and email addresses from the ISPA Spam Hall of Shame list, for use in a DNSBL.

Installation

pip install scrapy
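
If you would rather not install it system-wide, installing into a virtualenv works just as well (the environment name here is only an example):

virtualenv scrapy-env
. scrapy-env/bin/activate
pip install scrapy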

Usage

Scrapy is a very extensive package; it is not possible to describe its full usage in a single blog post. There is a tutorial on the Scrapy website as well as extensive documentation.

For this post I will describe how I used it to extract listed domains from the ISPA Hall of Shame website.

The page is http://ispa.org.za/spam/hall-of-shame/ and, looking at the page source, you will find that the domains are displayed in lists, with the bold text "Domains: " preceding the actual list of domains:

<ul>
    <li><strong>Domains: </strong>
    dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za,
    lifeinsuranceza.co.za, portablebreathalyzer.co.za
    </li>
    <li><strong>Addresses: </strong>bounce@dfemail.co.za, bounce@extremedeals.co.za,
    bounce@hospitalcoverza.co.za, bounce@lifeinsuranceza.co.za,
    bounce@portablebreathalyzer.co.za, info@dfemail.co.za, info@extremedeals.co.za,
    info@gmarketing.co.za, info@hospitalcoverza.co.za, info@lifeinsuranceza.co.za,
    sales@portablebreathalyzer.co.za
    </li>
</ul>

The XPath expression to extract this is:

'//li/strong[text()="Domains: "]/following-sibling::text()'

For more information about XPath see the XPath reference.
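
Before writing a full spider, you can sanity-check the expression against the HTML fragment above. Here is a minimal sketch using the lxml library (not part of the original workflow; the snippet variable is just a cut-down copy of the sample markup):

#!/usr/bin/python
# test-xpath.py - quick, illustrative check of the XPath expression
from lxml import html

snippet = """
<ul>
    <li><strong>Domains: </strong>
    dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za
    </li>
</ul>
"""

tree = html.fromstring(snippet)
# following-sibling::text() selects the text nodes after the <strong> tag
for text in tree.xpath('//li/strong[text()="Domains: "]/following-sibling::text()'):
    print text.strip()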

With the XPath expression we can now write a spider to download the web page and extract the data we want.

Create a Python file crawl-ispa-domains.py with the following contents:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# crawl-ispa-domains.py
# Copyright (C) 2012  Andrew Colin Kissa <andrew@topdog.za.net>
# vim: ai ts=4 sts=4 et sw=4

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ISPASpider(BaseSpider):
    name = "ispa-domains"
    allowed_domains = ["ispa.org.za"]
    start_urls = [
        "http://ispa.org.za/spam/hall-of-shame/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select the text nodes that follow the <strong>Domains: </strong> tag
        lines = hxs.select('//li/strong[text()="Domains: "]/following-sibling::text()').extract()
        for line in lines:
            # Each matched text node is a comma separated list of domains
            domains = line.split(',')
            domains = [domain.strip() for domain in domains if domain.strip()]
            for domain in domains:
                print domain

You can then run the spider from the command line, and it should print the list of extracted domains.

scrapy runspider --nolog crawl-ispa-domains.py
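
Because the spider prints one domain per line, you can redirect the output straight into a file for further processing into your DNSBL (the filename is only an example):

scrapy runspider --nolog crawl-ispa-domains.py > listed-domains.txt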

And there is more

This post only scratches the surface of what Scrapy can do; consult the documentation for details on everything this package offers.
