Before class, please install beautifulsoup4 and selenium with pip:
pip install beautifulsoup4 selenium
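We will use these two third-party packages shortly 🚀
Huh?
For example, the Google Calendar API lets you read, create, update, and delete calendar data from a program.
You can then build all kinds of services on top of it.
Let's use the Reqres website to see how this works.
The Request panel shows the URL being called; the Response panel shows what you get back (usually in JSON format).
What does the ?page=2 at the end of /api/users?page=2 mean?
For a GET request, you can append '?' to the end of the URL and follow it with parameters. Here, GET /api/users?page=2 means "get the second page of users". If the service holds 1,000,000 user records, returning all of them at once would be too much, so a page parameter is commonly used to let clients fetch the users in batches.
GET /api/users?page=2&gender=male&age=20 would mean: get the second page of users who are male and aged 20.
(Reqres itself does not offer this kind of advanced search, though.)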
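As a minimal sketch of how such a query string is put together, the snippet below builds the URL from a set of parameters and then splits it back apart, using only the standard library. The base URL https://reqres.in/api/users and the helper name build_url are assumptions made for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL for the Reqres demo endpoint (illustrative only).
BASE_URL = "https://reqres.in/api/users"


def build_url(base, **params):
    """Append params to base as a query string: ?key=value&key=value."""
    return f"{base}?{urlencode(params)}"


url = build_url(BASE_URL, page=2, gender="male", age=20)
print(url)

# Going the other way: split the query string back into its parameters.
query = parse_qs(urlsplit(url).query)
print(query)
```

With the requests package, passing params={'page': 2} to requests.get builds the same kind of query string for you.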
Take 10 minutes to try out the Reqres website above.
First, get an API key.
Find Books API -> GET /lists/best-sellers/history.json
It provides a convenient test environment: enter the key you just obtained in the API key field on the left to see the result.
import json
import requests
api_key = 'the api-key you just obtained'
url = 'https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json?api-key=' + api_key
r = requests.get(url)
r.encoding = 'utf-8'
# print(r.text)
data = json.loads(r.text)
print(data['num_results'])
31425
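[Note]
Some (in fact most) APIs require you to register an account and obtain an API key, either to stop people from flooding the API and slowing the system down, or because the API is commercial or needs to verify who you are. You then have to send this API key along with every request in order to get data back. Such APIs usually also limit the number of calls you can make per hour.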
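Since these APIs cap the number of calls per hour, it can help to count calls on your own side before sending them. The following is only a rough sketch of that idea: the class name HourlyQuota, its parameters, and the injected fake clock are all hypothetical and not part of any real API client.

```python
import time


class HourlyQuota:
    """Client-side counter that refuses calls once a per-window limit is hit."""

    def __init__(self, limit, window_seconds=3600, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.calls = []  # timestamps of the calls made inside the current window

    def try_acquire(self):
        """Return True and record the call if we are still under the limit."""
        now = self.clock()
        # Forget calls that have fallen out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True


# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
quota = HourlyQuota(limit=2, clock=lambda: fake_now[0])
results = [quota.try_acquire() for _ in range(3)]
print(results)  # the third call is refused

fake_now[0] = 3600.0  # one hour later the window has passed
after_window = quota.try_acquire()
print(after_window)
```

In a real script you would call try_acquire() before each request and wait or stop when it returns False.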
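Read this Weather API documentation first, then write a program that prints today's (7/18) weather in Taipei.
You can first look up Taipei's woeid, then use the weather lookup API described after it.
Note: make sure you understand the data returned by GET /api/location/(woeid)/. Inspect the response first, then try to print the weather.
Once you finish this exercise, you will know how to call a basic API 🙌
Next, you will learn a few things.
A crawler is a program that automatically browses the web and downloads data; it can be used to compile a web index for building a search engine.
Huh??
Remember the Request/Response flow from this morning? You can also get a page's content (HTML) from the response.
For example, suppose you use the method above to fetch a page from Wikipedia.
The HTML you get usually contains many hyperlinks. When a crawler sees them, it can save them and, once it has finished crawling (downloading) the current page, fetch the pages those links point to.
After finishing one page it jumps to the next, and repeats... until it has crawled all of Wikipedia
or your computer blows up 💥
That is why it is called a crawler.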
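The loop described above (download a page, save its links, then visit them one by one) can be sketched roughly as follows. To keep it runnable offline, a hypothetical three-page "site" stored in a dict stands in for real HTTP responses; FAKE_WEB, LinkCollector, and crawl are illustrative names, and max_pages is there so the crawl stops before your computer blows up.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site standing in for real HTTP responses.
FAKE_WEB = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/a">A</a> <a href="/c">C</a>',
    "/c": "<p>no links here</p>",
}


def crawl(start, fetch, max_pages=100):
    """Breadth-first crawl: visit each page once, queueing newly seen links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        order.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/a", FAKE_WEB.__getitem__)
print(visited)
```

For a real site, fetch would be a function that downloads the URL, for example with requests.get(url).text.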
Of course, some sites do not want their content crawled, so they place a robots.txt file at the root of the site (some_url/robots.txt).
For example, Zhihu's robots.txt is at https://www.zhihu.com/robots.txt
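Python's standard library can read these rules for you. The sketch below parses a small, made-up robots.txt (it is not Zhihu's actual file) and asks whether a crawler may fetch two example URLs; the user agent name my-crawler and the example.com URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether our (hypothetical) crawler may fetch each URL.
can_fetch_private = parser.can_fetch("my-crawler", "https://example.com/private/data")
can_fetch_public = parser.can_fetch("my-crawler", "https://example.com/quotes")
print(can_fetch_private, can_fetch_public)
```

For a live site you would call parser.set_url(...) with the robots.txt address and then parser.read() instead of parse().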
pip install beautifulsoup4
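Step 1: Open the developer tools and inspect the tags
Check whether the content shown by "View page source" matches the content shown in the developer tools
Step 2: Import the packages and get the HTML string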
import requests
from bs4 import BeautifulSoup
url = 'http://quotes.toscrape.com/'
response = requests.get(url)
print(response.status_code)
200
# response.encoding = 'utf-8'
print(response.text)
# more output follows
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Quotes to Scrape</title> <link rel="stylesheet" href="/static/bootstrap.min.css"> <link rel="stylesheet" href="/static/main.css"> </head> <body> <div class="container"> <div class="row header-box"> <div class="col-md-8"> <h1> <a href="/" style="text-decoration: none">Quotes to Scrape</a> </h1> </div> <div class="col-md-4"> <p> <a href="/login">Login</a> </p> </div> </div> <div class="row"> <div class="col-md-8"> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span> <span>by <small class="author" itemprop="author">J.K. Rowling</small> <a href="/author/J-K-Rowling">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="abilities,choices" / > <a class="tag" href="/tag/abilities/page/1/">abilities</a> <a class="tag" href="/tag/choices/page/1/">choices</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. 
The other is as though everything is a miracle.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" / > <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/live/page/1/">live</a> <a class="tag" href="/tag/miracle/page/1/">miracle</a> <a class="tag" href="/tag/miracles/page/1/">miracles</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span> <span>by <small class="author" itemprop="author">Jane Austen</small> <a href="/author/Jane-Austen">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a> <a class="tag" href="/tag/books/page/1/">books</a> <a class="tag" href="/tag/classic/page/1/">classic</a> <a class="tag" href="/tag/humor/page/1/">humor</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span> <span>by <small class="author" itemprop="author">Marilyn Monroe</small> <a href="/author/Marilyn-Monroe">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" / > <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Try not to 
become a man of success. Rather become a man of value.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="adulthood,success,value" / > <a class="tag" href="/tag/adulthood/page/1/">adulthood</a> <a class="tag" href="/tag/success/page/1/">success</a> <a class="tag" href="/tag/value/page/1/">value</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span> <span>by <small class="author" itemprop="author">André Gide</small> <a href="/author/Andre-Gide">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="life,love" / > <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/love/page/1/">love</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span> <span>by <small class="author" itemprop="author">Thomas A. 
Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / > <a class="tag" href="/tag/edison/page/1/">edison</a> <a class="tag" href="/tag/failure/page/1/">failure</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span> <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small> <a href="/author/Eleanor-Roosevelt">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / > <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span> <span>by <small class="author" itemprop="author">Steve Martin</small> <a href="/author/Steve-Martin">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > <a class="tag" href="/tag/humor/page/1/">humor</a> <a class="tag" href="/tag/obvious/page/1/">obvious</a> <a class="tag" href="/tag/simile/page/1/">simile</a> </div> </div> <nav> <ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">→</span></a> </li> </ul> </nav> </div> <div class="col-md-4 tags-box"> <h2>Top Ten tags</h2> <span class="tag-item"> <a class="tag" style="font-size: 28px" href="/tag/love/">love</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a> </span> <span class="tag-item"> <a 
class="tag" style="font-size: 26px" href="/tag/life/">life</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 22px" href="/tag/books/">books</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a> </span> </div> </div> </div> <footer class="footer"> <div class="container"> <p class="text-muted"> Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a> </p> <p class="copyright"> Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a> </p> </div> </footer> </body> </html>
First, take a look at the BeautifulSoup documentation
find_all seems to fit our needs
Check its parameters (Parameters)
Parameters
Step 3: Convert the HTML string into a bs4 object and parse it
# response.text comes from the content fetched earlier
soup = BeautifulSoup(response.text, 'html.parser')  # feed the HTML string to BeautifulSoup to get a soup object
first_tag = soup.find(name='span', class_='text')  # find the first span whose class is text
print(first_tag)
print(first_tag.name)
print(first_tag.attrs)  # print the tag's attributes
print(first_tag['class'])  # print the tag's class value
print(first_tag['itemprop'])
print(first_tag.text)
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
span
{'class': ['text'], 'itemprop': 'text'}
['text']
text
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
tags = soup.find_all(name='span', class_='text')
for tag in tags:
    print(tag.text)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
# Find li tags whose class is some_class
soup.find_all('li', 'some_class')
# Find the li whose id is some_id
soup.find('li', id='some_id')
# First find the div whose class is some_class, then find every a inside it whose class is link
soup.find('div', class_='some_class').find_all('a', 'link')
# Find every tag that has an itemprop attribute with the value text
soup.find_all(itemprop='text')
# Find every tag whose itemprop is text and whose class is also text
soup.find_all(attrs={'itemprop': 'text', 'class': 'text'})
# Find every tag whose class is tags, take the third one (index 2, counting from 0), then find every a inside it
soup.find_all(class_='tags')[2].find_all('a')
# You can also search with the regex you learned a few days ago
import re
# check email pattern
pattern = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"  # <- matches the email format
soup.find_all(string=re.compile(pattern))
[]
Task: print the hyperlinks of all the tags of the first quote
The output should be
/tag/change/page/1/
/tag/deep-thoughts/page/1/
/tag/thinking/page/1/
/tag/world/page/1/
# Sample Answer
tags = soup.find(class_='tags').find_all(class_='tag')
for tag in tags:
    print(tag['href'])
/tag/change/page/1/
/tag/deep-thoughts/page/1/
/tag/thinking/page/1/
/tag/world/page/1/
First, check the documentation
Usage explained
# Find tags with class text that are inside a p
soup.select('p .text')  # <p><span class='text'>sample</span></p>
# Find p tags whose own class is text
soup.select('p.text')  # <p class='text'>sample</p>
# Find p tags whose class includes both text and link
soup.select('p.text.link')  # <p class='text link'>sample</p>
# Sample Answer
link_set = set()
tags = soup.select('.quote span a')
for tag in tags:
    link_set.add(tag['href'])
print(link_set)
{'/author/Marilyn-Monroe', '/author/Steve-Martin', '/author/Albert-Einstein', '/author/Eleanor-Roosevelt', '/author/Thomas-A-Edison', '/author/Jane-Austen', '/author/J-K-Rowling', '/author/Andre-Gide'}
You "might" need:
.parent
.find_previous_siblings or .find_next_siblings
Challenge:
A: Bradley Whitford, Mandy Moore, Gwendoline Christie, Amandla Stenberg
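# Get the titles of the movies released on IMDb from Aug 1 up to (but not including) Aug 10. Reference answer:
import requests
from bs4 import BeautifulSoup
def get_body_by_url(url):
    # Fetch the page and return it as a parsed soup; raise if the status is not 200
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(response.status_code)
    print('ok')
    return BeautifulSoup(response.text, 'html.parser')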
url = 'https://www.imdb.com/movies-coming-soon/2018-08/'
soup = get_body_by_url(url)
ok
# sample answer
list_items = soup.select('a[name="Aug10"]')[0].parent.find_previous_siblings(class_='list_item')
for item in list_items:
    title = item.select('h4 a')[0].text
    print(title)
Nico, 1988 (2017)
Singwa hamkke: Ingwa yeon (2018)
Never Goin' Back (2018)
The Miseducation of Cameron Post (2018)
Searching (2018)
The Spy Who Dumped Me (2018)
The Darkest Minds (2018)
Christopher Robin (2018)
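Special sites, for example:
Note: the vast majority of data can be obtained through an API or with the basic scraping methods covered so far. You can wait to learn the material below until you actually run into one of these sites.
You are here
Sometimes this is caused by one of two things
A form lets you enter content. When you click the Submit button below it, a JavaScript program usually sends the form's content to the remote server in a request (POST). After the server returns the new content, the JavaScript modifies the page in place, so the URL does not change.
Other sites also send a request when you click the button, but then redirect you to a different URL (like the train timetable).
In other words, we watch the "Network" tab in the developer tools to see what request the page sends, then send the same request ourselves. That usually gets us the data we want.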
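To make "send the same request" concrete, here is a small sketch that packs form fields into a POST request the way a browser form does, using only the standard library. The URL https://example.com/search and the field names are placeholders, not taken from any real site, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, form_data):
    """Encode form_data as a form body and wrap it in a POST request."""
    body = urlencode(form_data).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers)


# Placeholder URL and field names, copied "by eye" from a pretend Network tab.
req = build_form_post(
    "https://example.com/search",
    {"from": "Tainan", "to": "Taipei", "time": "22:00"},
)
print(req.get_method())  # a Request that carries data is sent as POST
print(req.data)
```

Passing req to urllib.request.urlopen would actually send it; requests.post(url, data=form_data) does the encoding and sending in one step.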
Use POST to get the train numbers of the high-speed rail trains from Tainan to Taipei departing at 22:00 on 7/20
Open the Network tab of the developer tools and see what request is sent after you click "立即查詢" (Search now)
import requests
from bs4 import BeautifulSoup
form_data = {
    'startStation': '9c5ac6ca-ec89-48f8-aab0-41b738cb1814',
    'endStation': '977abb69-413a-4ccf-a109-0272c24fd490',
    'theDay': '2018/07/20',
    'timeSelect': '22:00',
    'waySelect': 'DepartureInMandarin',
}
url = 'https://m.thsrc.com.tw/tw/TimeTable/SearchResult'
response = requests.post(url, data=form_data)
print(response.status_code)
200
soup = BeautifulSoup(response.text, 'html.parser')
# print(response.text)
tags = soup.find_all('a','ui-block-a')
print(tags)
[<a class="ui-block-a" data-ajax="false" href="/tw/TimeTable/TrainInfo/0696"><div>0696</div></a>, <a class="ui-block-a" data-ajax="false" href="/tw/TimeTable/TrainInfo/0294"><div>0294</div></a>]
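Think about it first. . .
hint: it is also a Request
The simplest way is, just as before, to observe the request that is sent when you log in, then send the same content from your program.
Once the login succeeds, the server usually gives you a cookie (used to store data temporarily). Save that cookie and send it along the next time you GET a page.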
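What "save the cookie and send it back" means at the header level can be sketched like this. The Set-Cookie value below is invented for illustration, and cookie_header_from_set_cookie is a hypothetical helper name.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie header value into the Cookie header to send back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only name=value pairs go back to the server; attributes such as
    # Path or HttpOnly are instructions for the client and are not echoed.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Invented example of what a server might send after a successful login.
set_cookie = "session=abc123; HttpOnly; Path=/"
header = cookie_header_from_set_cookie(set_cookie)
print(header)
```

In practice requests.Session() keeps track of cookies for you, so you rarely need to build this header by hand.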
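Simulate a login to get data, and confirm that the HTML received after logging in contains the word Logout
Note that most sites do not want developers logging in with bots, which is why CAPTCHAs and reCAPTCHA ("I'm not a robot") exist.
So check first whether /robots.txt restricts access. This site does not have one, though.
form_data = {
    'username': 'Neo',
    'password': 'quote*2018*some_static_word'
}
url = 'http://quotes.toscrape.com/login'
login_response = requests.post(url, data=form_data)
print(login_response.status_code)
print(login_response.cookies.get_dict())
# Alternative: let a Session keep the cookies for you
# session = requests.Session()
# session.post(url, form_data)
# r = session.get('http://quotes.toscrape.com/')
# r.text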
200
{'session': 'eyJ1c2VybmFtZSI6Ik5lbyJ9.Dio9Ng.UDw5cSFXPlLVbuVNhCF4IqhUaro'}
url = 'http://quotes.toscrape.com/'
response = requests.get(url,cookies=login_response.cookies)
print(response.text)
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Quotes to Scrape</title> <link rel="stylesheet" href="/static/bootstrap.min.css"> <link rel="stylesheet" href="/static/main.css"> </head> <body> <div class="container"> <div class="row header-box"> <div class="col-md-8"> <h1> <a href="/" style="text-decoration: none">Quotes to Scrape</a> </h1> </div> <div class="col-md-4"> <p> <a href="/logout">Logout</a> </p> </div> </div> <div class="row"> <div class="col-md-8"> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> - <a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > <a class="tag" href="/tag/change/page/1/">change</a> <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> <a class="tag" href="/tag/thinking/page/1/">thinking</a> <a class="tag" href="/tag/world/page/1/">world</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span> <span>by <small class="author" itemprop="author">J.K. 
Rowling</small> <a href="/author/J-K-Rowling">(about)</a> - <a href="http://goodreads.com/author/show/1077326.J_K_Rowling">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="abilities,choices" / > <a class="tag" href="/tag/abilities/page/1/">abilities</a> <a class="tag" href="/tag/choices/page/1/">choices</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> - <a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" / > <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/live/page/1/">live</a> <a class="tag" href="/tag/miracle/page/1/">miracle</a> <a class="tag" href="/tag/miracles/page/1/">miracles</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span> <span>by <small class="author" itemprop="author">Jane Austen</small> <a href="/author/Jane-Austen">(about)</a> - <a href="http://goodreads.com/author/show/1265.Jane_Austen">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a> <a class="tag" href="/tag/books/page/1/">books</a> <a class="tag" href="/tag/classic/page/1/">classic</a> <a class="tag" 
href="/tag/humor/page/1/">humor</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span> <span>by <small class="author" itemprop="author">Marilyn Monroe</small> <a href="/author/Marilyn-Monroe">(about)</a> - <a href="http://goodreads.com/author/show/82952.Marilyn_Monroe">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" / > <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> - <a href="http://goodreads.com/author/show/9810.Albert_Einstein">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="adulthood,success,value" / > <a class="tag" href="/tag/adulthood/page/1/">adulthood</a> <a class="tag" href="/tag/success/page/1/">success</a> <a class="tag" href="/tag/value/page/1/">value</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span> <span>by <small class="author" itemprop="author">André Gide</small> <a href="/author/Andre-Gide">(about)</a> - <a href="http://goodreads.com/author/show/7617.Andr_Gide">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="life,love" / > <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" 
href="/tag/love/page/1/">love</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> - <a href="http://goodreads.com/author/show/3091287.Thomas_A_Edison">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / > <a class="tag" href="/tag/edison/page/1/">edison</a> <a class="tag" href="/tag/failure/page/1/">failure</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span> <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small> <a href="/author/Eleanor-Roosevelt">(about)</a> - <a href="http://goodreads.com/author/show/44566.Eleanor_Roosevelt">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / > <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a> </div> </div> <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span> <span>by <small class="author" itemprop="author">Steve Martin</small> <a href="/author/Steve-Martin">(about)</a> - <a href="http://goodreads.com/author/show/7103.Steve_Martin">(Goodreads page)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > <a class="tag" href="/tag/humor/page/1/">humor</a> <a 
class="tag" href="/tag/obvious/page/1/">obvious</a> <a class="tag" href="/tag/simile/page/1/">simile</a> </div> </div> <nav> <ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">→</span></a> </li> </ul> </nav> </div> <div class="col-md-4 tags-box"> <h2>Top Ten tags</h2> <span class="tag-item"> <a class="tag" style="font-size: 28px" href="/tag/love/">love</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 26px" href="/tag/life/">life</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 22px" href="/tag/books/">books</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a> </span> <span class="tag-item"> <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a> </span> </div> </div> </div> <footer class="footer"> <div class="container"> <p class="text-muted"> Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a> </p> <p class="copyright"> Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a> </p> </div> </footer> </body> </html>
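Think about it first. . . . . . .
Because for these pages, after the browser receives the HTML, JavaScript still has to run before you get the final version you see in the developer tools (for example, sites like Facebook, where new content keeps appearing as long as you keep scrolling down).
Think about it first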
cp /Downloads/chromedriver /usr/local/bin
chmod +x /usr/local/bin/chromedriver
Get the content of the Description for 2/27 from the Data Science and Artificial Intelligence Practice course website
(Course Introduction and Basics of...)
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://netdbncku.github.io/dsai/2018/'
driver = webdriver.Chrome()  # open the page with Chrome; to use Safari you need Safari's driver instead
try:
    driver.get(url)
    content = driver.page_source
finally:
    print('ok')
    # driver.close()  # closes a single tab
    driver.quit()  # closes the whole browser
ok
soup = BeautifulSoup(content, 'html.parser')
tag = soup.find(attrs={'data-title': 'Description'})
print(tag)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-dedc37234e2f> in <module>()
----> 1 soup = BeautifulSoup(content, 'html.parser')
      2 tag = soup.find(attrs={'data-title': 'Description'})
      3 print(tag)

NameError: name 'BeautifulSoup' is not defined
Tired of the browser window opening every time? (It is slow, too.)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = 'https://netdbncku.github.io/dsai/2018/'
chrome_options = Options()
chrome_options.add_argument('--headless')  # run in headless mode
chrome_options.add_argument('--window-size=375,812')  # set the window size (width,height)
# Tell the site you are browsing with Safari (even though you are not)
chrome_options.add_argument(
    '--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1')
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    content = driver.page_source
finally:
    print('ok')
    # driver.close()
    driver.quit()
ok
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://shopee.tw/search/?keyword=%E6%A9%9F%E8%BB%8A'
driver = webdriver.Chrome()  # open the page with Chrome
try:
    driver.get(url)
    content = driver.page_source
finally:
    print('ok')
    # driver.close()
    driver.quit()
ok
soup = BeautifulSoup(content, 'html.parser')
tag = soup.find(class_='shopee-item-card__text-name').text
print(tag)
APM赤鬼封體蠟180g 超強撥水鍍膜蠟 汽機車蠟 消光烤漆可用 前擋風玻璃撥水適用 可濕上濕下無油影 抗刮耐磨防靜電