Next in our series of Python modules you should know is Scrapy. Do you want to be the next Google? Well, read on.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You can use Scrapy to extract any kind of data from a web page, in HTML, XML, CSV and other formats. I recently used it to automate the extraction of domains and emails on the ISPA Spam Hall of Shame list, for use in a DNSBL.
pip install scrapy
For this post I will describe how I used it to extract the listed domains from the ISPA Hall of Shame website.
The page is http://ispa.org.za/spam/hall-of-shame/ and, looking at the page source, you find that the domains are displayed in lists, with the bold text "Domains: " before the actual list of domains:
<ul>
  <li><strong>Domains: </strong> dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za, lifeinsuranceza.co.za, portablebreathalyzer.co.za </li>
  <li><strong>Addresses: </strong>email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com </li>
</ul>
The XPath expression to extract this will be:
//li/strong[text()="Domains: "]/following-sibling::text()
For more information about XPath see the XPath reference.
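To see what the expression //li/strong[text()="Domains: "]/following-sibling::text() picks out, here is a minimal sketch that uses only the Python standard library. It is an illustration, not what Scrapy does internally: xml.etree.ElementTree supports only a subset of XPath, so the element's tail attribute stands in for following-sibling::text(), which is equivalent for this markup because only text follows the strong tag. The SAMPLE string and the extract_domains helper are made up for this example, and the sketch assumes the snippet is well-formed XML, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the list markup from the page (sample data only).
SAMPLE = """
<ul>
  <li><strong>Domains: </strong> dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za </li>
  <li><strong>Addresses: </strong> email@example.com </li>
</ul>
"""


def extract_domains(markup):
    """Return the domains listed after each <strong>Domains: </strong> label."""
    root = ET.fromstring(markup.strip())
    domains = []
    for strong in root.findall(".//li/strong"):
        # Keep only the label we care about; skip "Addresses: " and others.
        if strong.text != "Domains: ":
            continue
        # .tail is the text right after the element's closing tag, which is
        # what following-sibling::text() selects for this markup.
        tail = strong.tail or ""
        domains.extend(part.strip() for part in tail.split(",") if part.strip())
    return domains


print(extract_domains(SAMPLE))
```

The split on commas and the strip of each piece mirror what the spider does with every matched line.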
With the XPath expression we can now write a spider to download the web page and extract the data we want.
Create a Python file crawl-ispa-domains.py with the following contents:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# crawl-ispa-domains.py
# Copyright (C) 2012 Andrew Colin Kissa <firstname.lastname@example.org>
# vim: ai ts=4 sts=4 et sw=4
# Note: Python 2 syntax and the Scrapy API as of 2012.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ISPASpider(BaseSpider):
    name = "ispa-domains"
    allowed_domains = ["ispa.org.za"]
    start_urls = [
        "http://ispa.org.za/spam/hall-of-shame/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select the text that follows each bold "Domains: " label
        lines = hxs.select('//li/strong[text()="Domains: "]/following-sibling::text()').extract()
        for line in lines:
            # Each match is one comma-separated list of domains
            domains = line.split(',')
            domains = [domain.strip() for domain in domains if domain.strip()]
            for domain in domains:
                print domain
You can then run the spider from the command line, and it should provide you with the list of extracted domains.
scrapy runspider --nolog crawl-ispa-domains.py
And there is more
This post only touches the tip of what Scrapy can do; see the documentation for details on what else can be done with this package.