Trying out Scrapy

March 1, 2022 / March 12, 2022

What is Scrapy

Scrapy is a crawler framework written in Python.
It makes it easy to write programs that extract information from websites.

Sample code

Let's start by trying some sample code.

Save the sample code below to a file named quotes_spider.py and run it.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

$ scrapy runspider quotes_spider.py -o quotes.jl
$ cat quotes.jl
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...

What happens here: running scrapy runspider quotes_spider.py makes Scrapy look for a Spider definition in the file and run it through its built-in crawler engine.

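With this Spider definition, Scrapy sends requests to the URLs in start_urls and passes each response to the parse method, which is the default callback.
The parse method loops over the elements matched by a CSS selector and yields a dict with the author and text of each quote. After the loop, it looks up the link to the next page and, if there is one, requests it and handles that response with parse again.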
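The \u201c and \u201d sequences in the quotes.jl output above are ordinary JSON escapes for curly quotation marks, not corrupted data. A minimal sketch using only the standard library, with one line copied from that output, shows that they decode back to the original characters. (As far as I know, setting FEED_EXPORT_ENCODING to utf-8 in the Scrapy settings makes the JSON output keep such characters as-is.)

```python
import json

# One line copied from the quotes.jl output shown above.
# The raw string keeps the \u201c / \u201d escapes exactly as they appear in the file.
line = r'{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}'

# Each line of a JSON Lines file is a standalone JSON document.
item = json.loads(line)

print(item["author"])
print(item["text"])  # the escapes are decoded to real curly quotes
```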

Tutorial

Let's create a Scrapy project by following the tutorial.
First, create the tutorial project.

$ scrapy startproject tutorial

The generated tutorial directory has the following structure.

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Spiders are defined under tutorial/spiders.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

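The parts of this Spider definition are explained below.

name: Sets the name of the Spider. It must be unique within the project.
start_requests(): Must return an iterable of Requests that the Spider starts crawling from. This can be a list of Requests or a generator function. Subsequent Requests are generated successively from these initial Requests.
parse(): Called to handle the Response for each Request. The response argument is an instance of TextResponse, which holds the page content and provides helpful methods for processing it.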
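The parse method above derives the file name from the URL with response.url.split("/")[-2]. A small sketch of just that step, using a hypothetical helper named page_filename that is not part of Scrapy, shows why the index is -2 and that the trick depends on the URL ending with a slash.

```python
def page_filename(url: str) -> str:
    """Hypothetical helper that mirrors the file naming done in parse()."""
    # "http://quotes.toscrape.com/page/1/" splits into
    # ['http:', '', 'quotes.toscrape.com', 'page', '1', ''].
    # The trailing slash produces an empty last element,
    # so the page number is the second-to-last element.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html

# Without the trailing slash the second-to-last element is "page",
# so this approach only works for URLs shaped like the ones above.
print(page_filename("http://quotes.toscrape.com/page/1"))   # quotes-page.html
```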

Now run this Spider.
The command below runs the Spider named quotes.

$ scrapy crawl quotes

The Spider ran and created two files, quotes-1.html and quotes-2.html.

$ ls
quotes-1.html  quotes-2.html  scrapy.cfg  tutorial/

Scrapy shell

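Implementing the parse method takes some trial and error to work out how to extract the page content.
Scrapy shell is useful for this.
The command below lets you try out various selectors against the page content.
Note: the URL must be enclosed in quotes; otherwise URLs containing characters such as & will not work. On Windows, use double quotes instead of single quotes.

$ scrapy shell 'http://quotes.toscrape.com/page/1/'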

Try out a few selectors.
XPath can be used in addition to CSS selectors.
Type exit to leave the shell.

In [1]: response.css('title')
Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

In [2]: response.xpath('//title')
Out[2]: [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

In [3]: exit

Once you have worked out in Scrapy shell how to extract the data, implement the parse method.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Saving data

Feed exports make it easy to save the data in formats such as CSV, XML, and JSON.

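The command below saves the data to quotes.json in JSON format.
-O (uppercase) overwrites an existing file, while -o (lowercase) appends to it. Appending to a JSON file leaves invalid JSON behind, so a format such as JSON Lines is a safer fit for -o.

$ scrapy crawl quotes -O quotes.json
$ scrapy crawl quotes -o quotes.json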
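The difference between overwriting and appending matters more for JSON than for JSON Lines. The sketch below is a simplified stand-in that uses only the standard library and does not call Scrapy: it assumes each run writes its items as one complete JSON array, and the two items are made up for illustration.

```python
import json

# Made-up items standing in for the output of two separate crawls.
run1 = [{"author": "A", "text": "one"}]
run2 = [{"author": "B", "text": "two"}]

# JSON: two complete arrays written back to back are not one valid JSON document.
appended_json = json.dumps(run1) + "\n" + json.dumps(run2)
try:
    json.loads(appended_json)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON Lines: every line is its own document, so appended runs still parse line by line.
appended_jl = "".join(json.dumps(item) + "\n" for item in run1 + run2)
items = [json.loads(line) for line in appended_jl.splitlines()]

print(json_is_valid)  # False
print(len(items))     # 2
```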

Following links

To follow links and crawl other pages, do the following.
yield response.follow(next_page, callback=self.parse) sends a request to next_page and handles the response with self.parse.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
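The href taken from the next-page link is a relative URL such as /page/2/, and response.follow accepts it as-is because it is resolved against the URL of the current page. The resolution rule itself can be sketched with urljoin from the standard library. This is only an illustration of how relative links resolve, not Scrapy's own code, and the href values are examples.

```python
from urllib.parse import urljoin

# URL of the page that is currently being parsed.
current_page = "http://quotes.toscrape.com/page/1/"

# A root-relative href replaces the whole path.
root_relative = urljoin(current_page, "/page/2/")

# An href without a leading slash is resolved under the current path.
path_relative = urljoin(current_page, "page/2/")

# An absolute URL is returned unchanged.
absolute = urljoin(current_page, "http://quotes.toscrape.com/tag/humor/")

print(root_relative)  # http://quotes.toscrape.com/page/2/
print(path_relative)  # http://quotes.toscrape.com/page/1/page/2/
print(absolute)       # http://quotes.toscrape.com/tag/humor/
```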

For <a> elements, response.follow uses the href attribute automatically, so the href extraction can be omitted.

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

In addition, response.follow_all accepts an iterable of links and creates a request for each of them.

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Multiple page types

When crawling pages with different layouts, define a separate callback for each layout, as with parse and parse_author below.

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Spider arguments

Arguments can be passed to a Spider with the -a option.

$ scrapy crawl quotes -O quotes-humor.json -a tag=humor

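By default, these arguments are passed to the Spider's __init__ method and become attributes of the Spider, so they can be read with getattr.
In this example humor is passed as the tag argument, so the Spider accesses the URL http://quotes.toscrape.com/tag/humor.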

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
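The URL selection in start_requests can be checked without running a crawl. The sketch below uses a hypothetical FakeSpider class, which is not a Scrapy class, that stores keyword arguments as attributes in the same spirit as a Spider does with -a arguments, and then repeats the getattr logic.

```python
BASE_URL = "http://quotes.toscrape.com/"


class FakeSpider:
    """Hypothetical stand-in for a Spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        # Same logic as start_requests() above, returning the URL instead of a Request.
        url = BASE_URL
        tag = getattr(self, "tag", None)  # None when no tag argument was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


with_tag = FakeSpider(tag="humor").start_url()
without_tag = FakeSpider().start_url()

print(with_tag)     # http://quotes.toscrape.com/tag/humor
print(without_tag)  # http://quotes.toscrape.com/
```

Values given with -a arrive as strings, so an argument that should be a number has to be converted inside the Spider.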