Let's install Beautiful Soup 4

Install beautifulsoup4 with pip.

[vagrant@localhost python]$ pip install beautifulsoup4
Collecting beautifulsoup4
Downloading https://files.pythonhosted.org/packages/a6/29/bcbd41a916ad3faf517780a0af7d0254e8d6722ff6414723eedba4334531/beautifulsoup4-4.6.0-py2-none-any.whl (86kB)
100% |################################| 92kB 186kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Ooh, looks like it installed.
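
A quick way to double-check that a package is importable from the interpreter you are actually running. This is only a minimal sketch; the helper name is_installed is made up for this example and is not part of pip or bs4.

```python
import importlib


def is_installed(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# bs4 is the import name of the beautifulsoup4 package
print(is_installed("bs4"))
```

If this prints False right after a successful pip install, pip and python probably point at different Python installations.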

Hmm, looks like I wrote this in the Python 3 style: it imports requests, which isn't installed in this Python 2 environment.
python app.py
Traceback (most recent call last):
File "app.py", line 3, in <module>
import requests, bs4
ImportError: No module named requests
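
One way to avoid depending on requests at all is to use the standard library's urlopen. A minimal sketch, assuming the script might run on either Python 2 or Python 3: try the Python 2 location first and fall back to the Python 3 one.

```python
try:
    # Python 2: urlopen lives in urllib2
    from urllib2 import urlopen
except ImportError:
    # Python 3: the same function moved to urllib.request
    from urllib.request import urlopen

# Shows which module the function actually came from
print(urlopen.__module__)
```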

Let me redo it.

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("https://www.monex.co.jp/")
soup = BeautifulSoup(html)

[vagrant@localhost python]$ python app.py
/home/linuxbrew/.linuxbrew/Cellar/python@2/2.7.14_4/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 7 of the file app.py. To get rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "html.parser")

markup_type=markup_type))

Looks like I need to specify html.parser explicitly.

soup = BeautifulSoup(html, "html.parser")

That does it.
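
To see the effect without hitting a real site, here is a minimal sketch that parses a small inline HTML string (made up for this example) and records any warnings raised while doing so. It assumes bs4 is installed; with the parser named explicitly, the "No parser was explicitly specified" warning should not show up.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
html = "<html><head><title>Example page</title></head><body></body></html>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    soup = BeautifulSoup(html, "html.parser")

# Keep only the warnings that complain about a missing parser argument
parser_warnings = [
    w for w in caught if "No parser was explicitly specified" in str(w.message)
]
print(len(parser_warnings))
print(soup.title.string)
```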

Now let's take a look at Monex (https://www.monex.co.jp/), which the boss loves.

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("https://www.monex.co.jp/")
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("title")
print(tag)

Whoa, the scraping actually works!

[vagrant@localhost python]$ python app.py
<title>マネックス証券 | ネット証券(株・アメリカ株・投資信託)</title>
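
If only the text inside the tag is wanted, without the markup, get_text() seems to be the way to go. A minimal sketch, assuming Python 3 with bs4 installed and using an inline HTML string (made up for this example) in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
html = "<html><head><title>Example page</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("title")
title_markup = str(tag)       # the whole element, tags included
title_text = tag.get_text()   # only the text inside the element

print(title_markup)  # <title>Example page</title>
print(title_text)    # Example page
```

find() returns None when nothing matches, so a real script would probably want to check for that before calling get_text().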