BeautifulSoup

$ pip3 install beautifulsoup4

Scrape a web page

import requests
from bs4 import BeautifulSoup

url = "https://example.com/"
html = requests.get(url)
soup = BeautifulSoup(html.content, "html.parser")

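Use requests to fetch the web page, then use BeautifulSoup to parse the returned HTML so that data can be extracted from it. Passing html.content (bytes) rather than html.text lets BeautifulSoup detect the character encoding itself.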
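
The fetch-then-parse step can be wrapped in a small helper. This is a minimal sketch, not part of either library: the function name fetch_soup and the 10-second timeout are assumptions chosen for illustration. It adds a timeout and an HTTP status check, which the short snippet above omits.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0):
    """Fetch url and return the parsed document (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    # Pass bytes so BeautifulSoup can work out the encoding itself.
    return BeautifulSoup(response.content, "html.parser")


# Usage (performs a real network request):
# soup = fetch_soup("https://example.com/")
# print(soup.title.text)
```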

Search by tag

tag = soup.tag_name
tag = soup.find("tag_name")
tag = soup.select_one("tag_name")
tags = soup.find_all("tag_name")
tags = soup.select("tag_name")

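soup.tag_name and soup.find("tag_name") return the first matching tag, or None if there is no match. To get every matching tag, use soup.find_all("tag_name"). Its return value is a bs4.element.ResultSet object, which can be handled like a list. soup.select_one and soup.select do the same thing using CSS selectors.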
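
The difference between the single-result and multi-result calls is easier to see on a small document. The following sketch uses an inline HTML string invented for illustration, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Sample markup made up for this example.
html = """
<ul>
  <li>apple</li>
  <li>banana</li>
  <li>cherry</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")            # first <li> only
first_css = soup.select_one("li")  # same element, found via a CSS selector
all_items = soup.find_all("li")    # ResultSet holding every <li>
missing = soup.find("table")       # None, because no <table> exists

# A ResultSet supports len(), indexing, and iteration like a list.
names = [li.text for li in all_items]
print(names)
```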

Search by attribute

tag = soup.find("tag_name", attrs={"attribute_name": "value"})
tags = soup.select('tag_name[attribute_name="value"]')
tag = soup.find("a", href="URL_to_find")
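
As a concrete illustration of the attribute filters, the sketch below runs them against a small inline document. The markup, the URLs, and the data-kind attribute are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup made up for this example.
html = """
<a href="https://example.com/a" id="first">A</a>
<a href="https://example.com/b" data-kind="ext">B</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword argument: works for attribute names that are valid Python identifiers.
by_href = soup.find("a", href="https://example.com/a")

# attrs dict: needed for names such as data-kind that cannot be keywords.
by_data = soup.find("a", attrs={"data-kind": "ext"})

# CSS attribute selector; select() always returns a list-like result.
by_css = soup.select('a[href="https://example.com/b"]')

# href=True matches every <a> that has an href attribute, whatever its value.
with_href = soup.find_all("a", href=True)
```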

Search by class name

tag = soup.find("tag_name", class_="class_name")
tags = soup.select("CSS_selector")  # e.g. "tag_name.class_name"
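
Class matching has one detail worth seeing in action: class_ matches an element if any one of its classes equals the given name. The sketch below uses invented markup and class names to show this next to the CSS selector form.

```python
from bs4 import BeautifulSoup

# Sample markup made up for this example.
html = """
<p class="note important">one</p>
<p class="note">two</p>
<p>three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore, since class is a Python keyword)
# matches when any of the element's classes equals "note".
first_note = soup.find("p", class_="note")
all_notes = soup.find_all("p", class_="note")

# A CSS selector can require several classes at once.
important = soup.select("p.note.important")

# The class attribute is returned as a list because it can hold several values.
classes = first_note["class"]
```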

Get the title

soup.title       # <title>Article title</title> (a Tag object)
soup.title.text  # "Article title"
soup.title.name  # "title" (the tag name)
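
The three expressions can be checked on an inline document. This sketch also covers a page with no title element, where soup.title is None and calling .text on it would raise AttributeError. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Article title</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title        # a Tag object, not a string
title_markup = str(soup.title)  # the tag rendered back to markup
title_text = soup.title.text
tag_name = soup.title.name

# A document without <title>: guard against None before reading .text.
untitled = BeautifulSoup("<p>no title here</p>", "html.parser")
safe_title = untitled.title.text if untitled.title is not None else ""
```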

Get link URLs

soup.a.get("href")
soup.find("a").get("href")
[tag.get("href") for tag in soup.find_all("a")]

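This retrieves the value of the href attribute from an a tag such as <a href="URL" class="...">. To collect the URLs of several a tags, use a list comprehension. tag.get("href") returns None for an a tag that has no href attribute.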
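
In practice, href values are often relative and some a tags have no href at all. The sketch below is one possible way to handle both cases, assuming the standard library's urllib.parse.urljoin is acceptable for resolving relative links. The base URL and markup are invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address, used to resolve relative links.
base_url = "https://example.com/docs/"

# Sample markup made up for this example.
html = """
<a href="/about">About</a>
<a href="page.html">Page</a>
<a name="anchor">no href</a>
<a href="https://other.example/">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Raw values: None appears where the href attribute is missing.
raw = [tag.get("href") for tag in soup.find_all("a")]

# href=True skips tags without href; urljoin turns relative links into absolute ones.
links = [urljoin(base_url, tag["href"]) for tag in soup.find_all("a", href=True)]
```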

Check the version

import bs4
bs4.__version__

Reference

BeautifulSoup4 Documentation