Playing with scraping using Beautiful Soup

What is Beautiful Soup?

Beautiful Soup is a Python library.
Its features are specialized for "scraping".

What is "scraping"?

It refers to the act or technique of "extracting" the information you want from HTML that has already been fetched.

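How is it different from "crawling"?

Crawling refers to the act or technique of fetching HTML or other information from a given site.
I would say that "scraping" goes beyond simply fetching and also covers the extraction step.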
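
To make the distinction concrete, here is a minimal sketch that keeps the two steps apart, assuming Python 3. The function names `crawl` and `scrape_title` and the sample markup are made up for illustration; only `urlopen` and `BeautifulSoup` are real library calls.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def crawl(url):
    """Crawling step: fetch the raw HTML of a page and return it as bytes."""
    with urlopen(url) as response:
        return response.read()


def scrape_title(html):
    """Scraping step: extract one piece of information (the <title> text)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


# Hypothetical page content, used here instead of a live request.
sample_html = "<html><head><title>Example page</title></head><body></body></html>"
print(scrape_title(sample_html))
```

In real use the output of `crawl(url)` would be passed to `scrape_title`; the sample string just keeps the sketch runnable offline.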

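Why Beautiful Soup?

In other languages such as VBA, for example, the usual way to reproduce something like scraping is with regular expressions.
The patterns become quite complex, and on top of that every match is a search over the full text.
With Beautiful Soup, the code is very simple.
The parsed document is handed over as an object and narrowed down step by step, so each later lookup works on a smaller part of the document and the processing stays light.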
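
As a rough comparison of the two styles, the sketch below pulls the same link targets out of a small snippet, once with a regular expression and once with Beautiful Soup. The markup is invented for illustration, and the regular expression is only one of many ways such a pattern could be written.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<ul class="topics">
  <li><a href="/a">First <span class="icNew">new</span></a></li>
  <li><a href="/b">Second</a></li>
</ul>
"""

# Regular-expression approach: the pattern has to describe the markup itself
# and is matched against the whole text.
hrefs_by_regex = re.findall(r'<a\s+href="([^"]*)"', html)

# Beautiful Soup approach: parse once, then narrow down from an object.
soup = BeautifulSoup(html, "html.parser")
topics = soup.find("ul", {"class": "topics"})
hrefs_by_soup = [a["href"] for a in topics.find_all("a")]

print(hrefs_by_regex)
print(hrefs_by_soup)
```

Both lists hold the same values here, but the regular expression would need to change if, say, another attribute appeared before `href`, while the Beautiful Soup lookup would not.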

In an environment where Python is already installed, it can be set up as follows.


  easy_install pip
  pip install beautifulsoup4

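That is all.
easy_install, which comes with many Python installations, is used to install pip,
which is something like a package management tool for Python add-ons.
After that, Beautiful Soup is installed, and the setup is done.
Recent Python versions already include pip, in which case the first command can be skipped.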
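
A quick way to check that the installation worked is to import the package and parse a tiny document. This is only a sanity check under the assumption that the `beautifulsoup4` package installed above is the one being imported; the one-line HTML string is a placeholder.

```python
import bs4
from bs4 import BeautifulSoup

# Show which version of the package was installed.
print(bs4.__version__)

# Parse a one-line document with the parser bundled with Python.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```

If the import fails, the package was most likely installed for a different Python interpreter than the one running the script.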

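What can you actually do with it?

As a sample, this time I fetched the text at the very top of the top stories on Yahoo! News. Take a look at Yahoo! News.
The target is the text at the top of the list under "主要" ("Main").

Next, locate where that text sits in the HTML source.
Roughly speaking, the source looks like this.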


  <div id="main">
    <div id="editorsPick">
      <noscript>
        <!-- some script -->
      </noscript>
      <div id="epTabTop" class="epContents current hasBigImg top">
        <ul class="topics">
          <li class="topTpi">
            <div>
              <h1 class="ttl"><a href="{{ article URI }}">{{ article title }}<span class="icNew">new</span></a></h1>
            </div>
          </li>
          .......
        </ul>
      </div>
    </div>
  </div>

Put into a sentence, the target is
"the text of the first li in the ul list inside editorsPick inside main".
It hardly reads as natural language, though.
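
That sentence maps fairly directly onto a CSS selector, which Beautiful Soup accepts through `select_one`. The sketch below is an alternative to chained `find` calls, not the approach used in this article; the markup is a simplified stand-in for the structure described above, and the URIs and headlines are placeholders.

```python
from bs4 import BeautifulSoup

# Simplified markup modelled on the structure described above;
# the URIs and headlines are placeholders.
html = """
<div id="main">
  <div id="editorsPick">
    <div id="epTabTop" class="epContents current hasBigImg top">
      <ul class="topics">
        <li class="topTpi">
          <div>
            <h1 class="ttl"><a href="/article-1">First headline<span class="icNew">new</span></a></h1>
          </div>
        </li>
        <li><a href="/article-2">Second headline</a></li>
      </ul>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#main #editorsPick ul li" mirrors the sentence: inside "main", inside
# "editorsPick", inside the ul, an li. select_one returns the first match.
first_item = soup.select_one("#main #editorsPick ul li")
print(first_item.a["href"])
```

Because `select_one` returns only the first match in document order, no explicit indexing is needed to get the top item.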

To give the answer first, the script looks like this.


import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# fetch the HTML source of the top page
html = urlopen("http://news.yahoo.co.jp/")

# parse the whole HTML source first
soup = BeautifulSoup(html, "html.parser")

# extract the main body: the div whose id is "main"
main_body = soup.find("div", {"id": "main"})

# extract the list in main_body whose class name is "topics"
topics = main_body.find("ul", {"class": "topics"})

# join the text of each child node of the first list item's link
out_str = ""
for node in topics.find("li").a.contents:
    # .string is None for a tag with several children, so fall back to ""
    out_str += node.string or ""

# remove the trailing "new" label that comes from the icNew span
out_str = re.sub(r'new$', '', out_str)

# output
print(out_str)

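The overall flow is:
1. Fetch the HTML and pass it to BeautifulSoup to get a soup object
2. Get the div tag whose id matches main
3. From inside it, get the ul
4. Get the first list item (li)
5. The text may contain tags such as span, so take the contents of the link and loop over them
6. Append each piece to out_str (the output string buffer)
7. Strip the "new" label and print the result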
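
The steps above can be sketched as a small function that runs against a local string, which makes it easy to try without hitting the live site. The name `first_topic_title` and the sample markup are hypothetical. Unlike the script in this article, this version skips the span while looping instead of removing "new" with a regular expression afterwards, so a headline that itself contains "new" is left intact.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def first_topic_title(html):
    """Return the text of the first topic link, without the "new" label."""
    soup = BeautifulSoup(html, "html.parser")            # step 1
    main_body = soup.find("div", {"id": "main"})         # step 2
    if main_body is None:
        return None
    topics = main_body.find("ul", {"class": "topics"})   # step 3
    if topics is None:
        return None
    first_item = topics.find("li")                       # step 4
    if first_item is None or first_item.a is None:
        return None
    parts = []
    for node in first_item.a.contents:                   # step 5
        # Skip the <span> label rather than stripping "new" later.
        if isinstance(node, Tag) and node.name == "span":
            continue
        parts.append(node.string or "")                  # step 6
    return "".join(parts)


# Hypothetical markup standing in for the fetched page.
sample = (
    '<div id="main"><ul class="topics">'
    '<li><h1 class="ttl"><a href="/x">A new bridge opens'
    '<span class="icNew">new</span></a></h1></li>'
    '</ul></div>'
)
print(first_topic_title(sample))                         # step 7
```

Returning `None` when an element is missing is a design choice for this sketch; the page layout may change at any time, and a quiet `None` is easier to handle than an `AttributeError` deep inside the chain.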

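Other uses

I am not that familiar with Beautiful Soup myself, so honestly not many real-world examples come to mind. Possibilities include "taking the titles and sites of the top 10 results for a keyword search and graphing how they change over time"
or "following the links on a listing of archives such as .zip files and building a list of them"; on the whole, automation seems to be the most common use.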
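
The archive-listing idea could look roughly like the sketch below. The base URL and the index markup are invented for illustration; a real page would be fetched first and would likely need its own filtering rules.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical index page and base URL, for illustration only.
base_url = "http://example.com/files/"
html = """
<ul>
  <li><a href="a.zip">a.zip</a></li>
  <li><a href="notes.txt">notes.txt</a></li>
  <li><a href="/archive/b.ZIP">b</a></li>
  <li><a>no href</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that have an href ending in ".zip" (case-insensitive),
# and resolve relative paths against the base URL.
zip_links = [
    urljoin(base_url, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".zip")
]
print(zip_links)
```

Passing `href=True` to `find_all` skips anchors without an `href` attribute, so the list comprehension never hits a missing key.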

Combined with Chrome plugins, it may also expand into areas such as in-site search and data mining in the future.
