[X-Village] Lesson 09 - Crawler (& API)¶

By 呂伯駿 (Po Chun)¶

上課前可以用pip裝一下 requests-html

pip install requests-html

待會會用到這個外部套件 🚀
python 要 3.6以上才能用

今天的 Roadmap：¶

待會你會先學到三件事：¶

API 是什麼?
為什麼需要 API?
API 怎麼用? (計分練習 ex1.1)

1.1 API 是什麼？¶

聽過 API 的話請舉手～¶

應用程式介面¶

Application Programming Interface¶

蛤？

先別管應用程式介面 (API) 了，你聽過介面 (Interface) 嗎？¶

先回憶一下 "介面" (Interface)¶

同理，User Interface 意思是...¶

以 Google 為例，UI的部分是¶

因此，在Google首頁上你看得到的就是 UI，那個輸入框是UI，那個按鈕也是...

UI其實是讓使用者能跟系統溝通，操作系統的東西

廣義而言，電梯按鈕或飲水機按鈕或飯店櫃臺都是UI

那應用程式介面 (API) 是什麼？¶

讓應用程式能跟系統溝通，操作系統的東西

1.2 那為什麼需要API?¶

生活情境

開放資料 (政府天氣資料)
開放服務 (Google 日曆)

開發情境

作業系統 API
軟硬體 API

1.2.1 政府有一堆資料，政府能拿它們做什麼？¶

開放開源資料、開發 API !¶

開放應用程式介面之後，就能做很多事!¶

1.2.2 視使用情境，API 不見得只能取得資料，也能修改，刪除，建立新資料。¶

例如 Google 日曆 API 可以讓你用程式取得，建立，修改，刪除行事曆資料。
然後你就可以以此開發各種服務

間接用途¶

假設老闆給你個任務

其實，你可以寫一隻程式，透過google的API來操作google，也就是說，你可以寫一支程式來搜尋100個關鍵字

API定義了一個讓程式間能相互溝通存取的介面，能讓別人幫你的服務加值，也可以將工作自動化以減少不必要的人力。¶

消化一下¶

現在知道API是什麼請舉手～ ✋¶

1.3 那怎麼用 API？¶

一些常見的 API 可以讓你用剛才教的 GET, POST, PUT, DELETE 等 Request 操作¶

這種 API 的格式叫 RESTful API，有興趣的可以看這篇回答

我們先用 Reqres 這個網站來看一下效果
Request 那裏的是要呼叫的網址，Response 那裏則是你會收到的內容(通常是JSON格式)

/api/users?page=2 後面的 ?page=2 是什麼意思？

如果是 GET request，在網址最末端加上 '?' 後，可以繼續附帶參數(parameter)，以這裡來說，GET /api/users?page=2 就是取得第二頁的使用者的意思，如果他的使用者資料有1000000筆，為避免一次回傳過多資料，通常我們會用page來讓別人能分批取得使用者資料。

若看到 GET /api/users?page=2&gender=male&age=20 意思就會是：取得第二頁且性別為男性且年齡為20的使用者。
(不過Reqres這個網站沒提供這種進階搜尋的功能就是了)

給大家 5~10 分鐘，大家可以操作一下上面的 Reqres 這個網站

Example 1.1¶

使用NY Times API 取得暢銷書歷史紀錄

先取得 API key

找Book API -> GET /lists/best-sellers/history.json

他有提供很方便的測試環境，只要在左側 API key 那裏輸入剛才拿到的 key，即可取得結果。

In [19]:

import json
import requests
api_key = '你剛拿到的 api-key'
url = 'https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json?api-key=' + api_key
r = requests.get(url)
r.encoding = 'utf-8' 
# print(r.text)

In [20]:

data = json.loads(r.text)
print(data['num_results'])

[補充]
為了避免有人瘋狂發API拖慢系統，有些(多數) API 因為要商用或有驗證身份的需求，會要求你先註冊一個帳號來取得一個 API Key，這時你就需要在發送 request 時一併將這個 API Key 發送給對方才能得到資料。這類 API 通常會限制每小時的使用次數。

Exercise 1.1 (計分)¶

請先閱讀這份 Weather API ，再編寫出一支程式 print 出台北今天(7/18)的天氣為何。
可以先找到台北的 woeid，再使用後面查詢天氣的API。

注意：請了解 GET /api/location/(woeid)/ 後回傳的資料的意義，可以先觀察他回傳的資料，再嘗試將該天氣print出來。

練完你就會串基本的API了 🙌

複習跟解惑昨天下午的 😨¶

sli.do 連結 #5669

1. Get, 跟 POST 到底是什麼鬼？¶

首先，瀏覽器要取得資料必須要向伺服器發送請求 (request)，發完後可能得到html或json或各種格式的內容

至於 GET 跟 POST 差在哪，可以先想像 GET 是把 "給我這個" 這段話用明信片寄給伺服器
POST 則是把"給我這個"包在信封內裏寄給伺服器即可。前者較方便，後者較安全可傳敏感資料。

所以下面 requests 程式碼的意義是這樣，他使用名為 "requests" 的"python套件"來發送 GET request

另外你不能用瀏覽器上面的網址列按 Enter 來發POST，只能用來發 GET

因為這是 reqres.in 這個網站的 API，所以只有reqres這個網站能用

其他網站的 API 不見得是這樣設計的，即使是這樣他們通常也會因為你沒權限把你擋下來

In [86]:

## 另外 從字典取值，好像很多人會轉不過來，這裏做個補充
first_dict = {
    'dog_name': 'Incredible boy',
    'cat_name': 'Dory'
}
print(first_dict['dog_name'])

another_dict = {
    'room1': {
        'dog_name': 'Cerberus',
        'cat_name': 'Diego'
    }
}
print(another_dict['room1']['dog_name'])

Incredible boy
Cerberus

In [87]:

# 再複雜一點
other_dict = {'star_lab': [
    {
        'room1': {
            'dog_name': 'Crash',
            'cat_name': 'Eddie'
        }
    },
    {
        'room2': {
            'dog_name': 'Indominus rex',
            'cat_name': 'Indoraptor'
        }
    }
]}
print(other_dict['star_lab'][0]['room1']['dog_name'])

Crash

接下來你會學到幾件事

爬蟲是什麼？它在爬什麼？
怎麼寫一個可以爬一般網站的爬蟲 (計分練習ex2.2-2 -> 不計分了)
[延伸] 怎麼寫一個可以爬特殊網站的爬蟲

看 Roadmap 的話，你現在會在這裡¶

2.1 爬蟲是什麼？¶

知道爬蟲是什麼的請舉手？¶

一隻程式，透過自動瀏覽網際網路並下載資料，可用於編纂網路索引來建立搜尋引擎。

蛤？？

記得早上的 Request Response 流程嗎？你也可以從response中取得網頁的內容 (html)

比方說你嘗試用上面的方法取得 Wiki 百科某一頁
你得到的 html 裡面通常會有很多超連結，而當爬蟲程式看到超連結時，也可以選擇順手把這些超連結存下來，等爬(載)完這一頁之後，再取得其他超連結的資料。

爬完這一頁後再跳到下一頁，然後再重複... 直到爬完整個 Wiki
或你電腦爆掉 💥
所以才叫爬蟲。

其實網路上超過 50% 的流量都不是人類的流量 [Ref](https://www.incapsula.com/blog/bot-traffic-report-2016.html)。

回來看爬蟲是什麼？¶

一隻程式，透過自動瀏覽網際網路並下載資料，而當你把網路上多數網站爬完時(要有錢買設備)，你也可用於編纂網路索引，進而建立搜尋引擎。
一般一點的用途也可以拿來找最便宜的飯店，或是取得某個網站所有的電影評論。

很多情況我們只會想爬某個網站，或是某個頁面。

爬什麼？¶

網頁中的內容(html)，後續可再擷取出有效資訊(例如文字)。

消化一下¶

現在知道爬蟲是什麼的請舉手 ✋¶

爬蟲禮儀 1¶

當然，有些網站不希望別人去爬取他的內容，因此他會在網頁最上層(some_url/robots.txt)放個 robots.txt
例如知乎的 robots.txt 放在https://www.zhihu.com/robots.txt

Disallow 後面的東西表示他不希望你爬的頁面。(通常會是登入/註冊等頁面)
但這東西基本上對程式沒有約束力，你還是可以爬不會被擋下來，但如果你太高調或是拿去商業用，對方又剛好有錢有閒，可能會被他吉。

爬蟲禮儀 2¶

除了上面提到的 robots.txt 之外，一般爬蟲也不建議"太過頻繁"的爬取特定網頁的內容，這樣的行為可能會使對方網站阻塞。
通常爬蟲發送request的頻率建議控制在 1秒 1次，以不超過人類的操作速度為佳。

2.2 那怎麼寫一個可以爬一般網站的爬蟲?¶

你可以用 request_html ¶

它能像"requests 套件"一樣發送 http request(其實他把requests套件包在裡面), 也能解析 response 回傳 html的字串，取得想要的資訊

[補充]
除了 request_html 之外，類似功用的套件還有 beautifulsoup，不過 beautifulsoup 只有解析 html 字串的功能，發送 request 還是要靠 "requests 套件"

安裝¶

pip install requests_html

爬蟲基礎網站¶

連結：Quotes

alt text

Example 2.2-1¶

任務： Print 出 "首頁" 所有的 Quotes

alt text

步驟 1: 開啟開發人員工具，查看 Element 資訊

開啟方式
- Windows / Linux: "Ctrl + Shift + I" or F12
- Mac: "Command(⌘) + Option + I"
選取方式
1. 右鍵 > 檢查元素 or 檢查
2. 使用 "Select"

alt text

確認 "滑鼠右鍵 > 檢查網頁原始碼"內的內容跟 "瀏覽器開發人員工具"的內容是否一致。
若不做處理，一般來說爬蟲只會取得 "滑鼠右鍵 > 檢查網頁原始碼" 內的 html資訊，有時這資訊會與瀏覽器開發人員工具"的內容不同

步驟 2: 引用Package，取得 html string

In [67]:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://quotes.toscrape.com/')
print(response) # status_code, example: 200, 404...
print(response.html.text) # 印出網頁上所有的文字
# print(response.html.html) # 印出網頁的 html

<Response [200]>
Quotes to Scrape
Quotes to Scrape
Login
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein (about)
Tags:
change deep-thoughts thinking world
“It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling (about)
Tags:
abilities choices
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein (about)
Tags:
inspirational life live miracle miracles
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by Jane Austen (about)
Tags:
aliteracy books classic humor
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” by Marilyn Monroe (about)
Tags:
be-yourself inspirational
“Try not to become a man of success. Rather become a man of value.” by Albert Einstein (about)
Tags:
adulthood success value
“It is better to be hated for what you are than to be loved for what you are not.” by André Gide (about)
Tags:
life love
“I have not failed. I've just found 10,000 ways that won't work.” by Thomas A. Edison (about)
Tags:
edison failure inspirational paraphrased
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” by Eleanor Roosevelt (about)
Tags:
misattributed-eleanor-roosevelt
“A day without sunshine is like, you know, night.” by Steve Martin (about)
Tags:
humor obvious simile
Next →
Top Ten tags
love inspirational life humor books reading friendship friends truth simile
Quotes by: GoodReads.com
Made with ❤ by Scrapinghub

那怎麼從html中取得我要的 Element？¶

依照 Selector 語法 (也稱為 CSS Selector, 更多語法可參考CSS Selector)

In [71]:

# 找 所有 的 p element,  像<p>oxz</p>
select_string = 'p'

# 找 所有 id 為 'some_id' 的 element, 像 <span id='some_id'>texttext</span>
select_string = '#some_id'
# 找 所有 class 為 some_class 的 element，像 <span class='some_class'>texttext</span>
select_string = '.some_class'
# 找 所有 屬性有 some_attr，且值為 some_value element，像 <span some_attr=some_value>texttext</span>
select_string = '[some_attr=some_value]'

# 找 所有 <p> element,  且 id 為 'some_id', 像 <p id='some_id'>texttext</p>
select_string = 'p#some_id'
# 找 所有 <p> element,  且 屬性有 some_attr，且值為 some_value，像 <p some_attr=some_value>texttext</p>
select_string = 'p[some_attr=some_value]'
# 找 所有 <p> element,  且 class 是 text，像 <p class='text'>sample</p>
select_string = 'p.text'

# 找 所有 <p> element，同時 p 的 class 有 text 也有 link，像 <p class='text link'>sample</p>
select_string = 'p.text.link'

element = response.html.find(select_string) # 這一行使用上述 select_string 來取得符合指定 pattern 的element

也可以多層檢索¶

In [ ]:

## 多層搜尋

# 找在 p 這個 element 裡面，class 為 text 的 element，像 <p><span class='text'>sample</span></p>
element = response.html.find('p .text') 

# 找在 "class 為 some_class 的 div" 裡的 "p element" 裡面，且 class 為 text 的 element
# 像 <div class='some_class'><p><span class='text'>sample</span></p></div>
element = response.html.find('div.some_class p .text') 


# 先找 class 為 some_class 的第一個 div，再從中找所有 class 為 link 的 <a ...> element
element = response.html.find('div.some_class')[0].find('a.link')

回來這個題目¶

element = response.html.find('div span')[0]

進一步取得 element 內的資訊，如文字或屬性¶

In [68]:

# 若要找第一個作者
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://quotes.toscrape.com/')
# 取得 class 為 quote 的 element 裡面 的第一個 a element
element = response.html.find('.quote a', first=True)# first=True 意思是 只回傳"第一個"符合這格式的 element
print(element)
print(element.text)   # 印出該 element 內包含的文字
print(element.attrs)  # 印出該 element 內包含的屬性
print(element.attrs['href'])
print(element.absolute_links)

<Element 'a' href='/author/Albert-Einstein'>
(about)
{'href': '/author/Albert-Einstein'}
/author/Albert-Einstein
{'http://quotes.toscrape.com/author/Albert-Einstein'}

如何一次印出全部的 Quotes?¶

In [69]:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://quotes.toscrape.com/')
elements = response.html.find('[itemprop=text]')
for element in elements:
    print(element.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”

Exercise 2.2-1 (10min) (不計分)¶

任務：印出第一個 Quote 所有"Tag"的超連結

alt text

結果應該要印出
/tag/change/page/1/
/tag/deep-thoughts/page/1/
/tag/thinking/page/1/
/tag/world/page/1/

hint:

先找到第一個 quote 再往下找 "tag 的element" 所屬 a element
print(element.attrs['xxxx'])

In [70]:

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://quotes.toscrape.com/')
elements = response.html.find('.quote')[0].find('.tag')
for element in elements:
    print(element.attrs['href'])

/tag/change/page/1/
/tag/deep-thoughts/page/1/
/tag/thinking/page/1/
/tag/world/page/1/

Example 2.2-2¶

任務： Print 出所有Ｑuotes 的作者的 About頁面的超連結，且不能重複

hint: Set

alt text

In [62]:

# Sample Answer
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('http://quotes.toscrape.com/')
link_set = set()
elements = response.html.find('.quote span a')
for element in elements:
    link_set.add(element.attrs['href'])
print(link_set)

{'/author/Jane-Austen', '/author/Andre-Gide', '/author/Eleanor-Roosevelt', '/author/Steve-Martin', '/author/Albert-Einstein', '/author/Thomas-A-Edison', '/author/Marilyn-Monroe', '/author/J-K-Rowling'}

挑戰題，思考一下，要怎麼一次把第2，第3...頁的作者內容也爬下來？¶

Exercise 2.2-2 (15min) (不計分了)¶

任務：從 imdb 取得 The Darkest Minds (2018) 這部電影的導演

A: Jennifer Yuh Nelson

若有需要，請參考request_html文檔取得需要的 method 資訊。 alt text

In [119]:

# Sample Answer
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.imdb.com/movies-coming-soon/2018-08/')
element = response.html.find('[itemprop=director] a[itemprop=url]')[1]
print(element.text)

Jennifer Yuh Nelson

挑戰題：

請取得 The Darkest Minds (2018) 這部電影的所有演員

A: Bradley Whitford, Mandy Moore, Gwendoline Christie, Amandla Stenberg

請取得 IMDB 8月1~8/10 (8/10前) 上映的電影標題

3 怎麼寫一個可以爬特殊網站的爬蟲？¶

特殊的網站像是：

點擊網頁中的某個按鈕後:
- 即使網頁內容變了，網址還是一樣的網站。
- 網頁內容跟網址都變了，但你根據那網址沒辦法直接得到相同資料的網站。 (例如火車時刻表列車查詢結果)
需要登入才能看到內容的網站
"右鍵 > 檢查原始碼" 內的東西跟瀏覽器開發人員工具看到的不一樣。

註：絕大多數的資料可以透過 API 或基礎爬蟲學到的方法取得。

現在在這裡

3.1. 為何點擊某個按鈕後，即使網頁內容變了，網址還是一樣？¶

有時是兩種東西造成的

form (表單)
javascript (可以用修改html)（未來web課程才會細講）

Form (表單) 可以讓你輸入內容，而當你點擊下方的 Submit按鈕，通常會有另一隻Javascript程式將表單中的內容透過 Request(POST)發送給遠端伺服器，而遠端伺服器將新的內容回傳之後，Javascript 再將網頁中的內容直接修改，因此網址不會有變化。

有些則是當你點擊按鈕後，他一樣會發個 request，但他會將你導向到別的網址(像火車時刻表)

也就是說我們透過觀察開發人員工具中的"Network"來查看他發送了什麼request，再依樣畫葫蘆發一樣的request給對方，通常我們就能得到我們要的資料。

Example 3.1¶

使用 Post 來取得 7/20 22:00 出發，台南到台北的高鐵車號

備註：高鐵有 [API](https://ptx.transportdata.tw/PTX)，以後有需要請直接使用API不要這樣直接爬，這裏只是拿來做 post 的練習。

打開開發人員工具的Network頁籤，觀察點擊"立即查詢"後會發佈什麼 Request

In [41]:

# Sample Answer
from requests_html import HTMLSession
session = HTMLSession()
form_data = {
    'startStation': '9c5ac6ca-ec89-48f8-aab0-41b738cb1814',
    'endStation': '977abb69-413a-4ccf-a109-0272c24fd490',
    'theDay': '2018/07/20',
    'timeSelect': '22:00',
    'waySelect': 'DepartureInMandarin',
}
url = 'https://m.thsrc.com.tw/tw/TimeTable/SearchResult'
response = session.post(url, data=form_data) #### 這裏改用 POST 了
elements = response.html.find('a.ui-block-a')
for element in elements:
    print(element.text)

0696
0294

3.2. 那怎麼寫一個可以抓登入後的頁面資料的爬蟲？¶

先想想. . .
hint: 也是Request

概念也是跟剛才一樣觀察登入時會發的Request，再用程式發一樣的內容。
一般來說，等登入成功後，通常對方會給你個cookie(用來暫存資料)，你只要將cookie存下來，下次 GET 網頁時一併將這個cookie給對方即可。

不過這點 request_html 也幫你做好了!!¶

Example 3.2¶

模擬登入來取得資料，確認登入後收到的html資料裡有 Logout 字樣

注意，多數網站不會想讓開發者能用機器人來登入，所以才會有驗證碼或什麼Receptra(我不是機器人)。
因此請先確認 /robots.txt 中有無限制使用權限，不過這個網站沒有放。

In [63]:

# Sample Answer
from requests_html import HTMLSession
session = HTMLSession()

form_data = {
    'username': 'Neo',
    'password': 'quote*2018*some_static_word'
}
url = 'http://quotes.toscrape.com/login'

login_response = session.post(url, data=form_data) # 先發登入用的 POST request!
print(login_response)

url = 'http://quotes.toscrape.com/'
second_response = session.get(url, data=form_data)   # 登入成功後再 GET 一次!
print(second_response.html.find('.col-md-4 p a', first=True).text)

<Response [200]>
Logout

3.3 為何 "檢查原始碼內的東西跟開發人員工具看到的不一樣" ？¶

先想想. . . . . . .

因為這些網頁在瀏覽器收到 html後，還要用Javascript跑一下才會得到你在開發人員工具看到的最終版本 (例如像臉書那樣，只要滑鼠不斷向下捲動就會不斷跑出新內容的網站)

那要怎麼爬 "html內的東西都是後來程式(JavaScript)產生的，因此光用request爬不到什麼東西的網站。"¶

先想想

想辦法模擬瀏覽器收到html後做的動作 💪¶

以往我們會需要一個開發用的瀏覽器驅動程式 (Chrome Driver)，還要一個可以操作這個驅動程式的python套件 (Selenium)...

但 request_html 也都處理好了! 🙌¶

Example 3.3¶

請取得 Data Science and Artificial Intelligence Practice 課程網站中，2/27的Description的內容
(Course Introduction and Basics of...)

In [64]:

from requests_html import HTMLSession
session = HTMLSession()
url = 'https://sirius207.github.io/course-template/2018/'
response = session.get(url)
response.html.render() # 加上這一行即可，第一次跑他會花幾分鐘的時間下載 Chromium，用來跑 Javascript
elements = response.html.find('td[data-title=Description]')
print(elements[0].text)

Course Introduction and Basics of Supervised/Unsupervised Learning
Slides: Course introduction
Slides: Supervised unsupervised Learning

Exercise 3.3 (不計分)¶

取得蝦皮拍賣中，"機車" 搜尋結果的第一個商品名稱

In [ ]:

# 註，這個似乎 requests_html 的render 會 render 不完全，如果想知道怎麼取的話可以參考下面這份舊版簡報的內容
# https://nbviewer.jupyter.org/format/slides/github/x-village/python-course/blob/master/Lesson08-Web%20Crawler/Lesson08-Crawler-old.ipynb#/8/18

延伸學習¶

反爬蟲
- 你操作(GET)太快太頻繁了
- 擋你IP
- 輸入驗證碼
- "我不是機器人"
- class 或 id 隨機產生
- ...
反反爬蟲
- GET 中間加時間間隔
- 換IP (Proxy)
- 圖像辨識驗證碼 (OCR)
- ...應該暫時無解
想加速爬蟲運行
- 分布式
- Asyncio

[X-Village] Lesson 09 - Crawler (& API)¶

By 呂伯駿 (Po Chun)¶

今天的 Roadmap：¶

待會你會先學到三件事：¶

1.1 API 是什麼？¶

聽過 API 的話請舉手～¶

應用程式介面¶

Application Programming Interface¶

先別管應用程式介面 (API) 了，你聽過介面 (Interface) 嗎？¶

先回憶一下 "介面" (Interface)¶

同理，User Interface 意思是...¶

以 Google 為例，UI的部分是¶

因此，在Google首頁上你看得到的就是 UI，那個輸入框是UI，那個按鈕也是...

UI其實是讓使用者能跟系統溝通，操作系統的東西

廣義而言，電梯按鈕或飲水機按鈕或飯店櫃臺都是UI

那應用程式介面 (API) 是什麼？¶

讓 應用程式 能跟 系統 溝通，操作系統的東西

1.2 那為什麼需要API?¶

1.2.1 政府有一堆資料，政府能拿它們做什麼？¶

開放開源資料、開發 API !¶

開放應用程式介面之後，就能做很多事!¶

1.2.2 視使用情境，API 不見得只能取得資料，也能修改，刪除，建立新資料。¶

間接用途¶

假設老闆給你個任務

其實，你可以寫一隻程式，透過google的API來操作google，也就是說，你可以寫一支程式來搜尋100個關鍵字

API定義了一個讓程式間能相互溝通存取的介面，能讓別人幫你的服務加值，也可以將工作自動化以減少不必要的人力。¶

消化一下¶

現在知道API是什麼請舉手～ ✋¶

1.3 那怎麼用 API？¶

一些常見的 API 可以讓你用剛才教的 GET, POST, PUT, DELETE 等 Request 操作¶

Example 1.1¶

Exercise 1.1 (計分)¶

複習跟解惑昨天下午的 😨¶

1. Get, 跟 POST 到底是什麼鬼？¶

看 Roadmap 的話，你現在會在這裡¶

2.1 爬蟲是什麼？¶

知道爬蟲是什麼的請舉手？¶

回來看爬蟲是什麼？¶

爬什麼？¶

消化一下¶

現在知道爬蟲是什麼的請舉手 ✋¶

爬蟲禮儀 1¶

爬蟲禮儀 2¶

2.2 那怎麼寫一個可以爬一般網站的爬蟲?¶

你可以用 request_html¶

安裝¶

爬蟲基礎網站¶

Example 2.2-1¶

那怎麼從html中取得我要的 Element？¶

也可以多層檢索¶

回來這個題目¶

進一步取得 element 內的資訊，如文字或屬性¶

如何一次印出全部的 Quotes?¶

Exercise 2.2-1 (10min) (不計分)¶

Example 2.2-2¶

挑戰題，思考一下，要怎麼一次把第2，第3...頁的作者內容也爬下來？¶

Exercise 2.2-2 (15min) (不計分了)¶

3 怎麼寫一個可以爬特殊網站的爬蟲？¶

3.1. 為何點擊某個按鈕後，即使網頁內容變了，網址還是一樣？¶

Example 3.1¶

3.2. 那怎麼寫一個可以抓登入後的頁面資料的爬蟲？¶

不過這點 request_html 也幫你做好了!!¶

Example 3.2¶

3.3 為何 "檢查原始碼內的東西跟開發人員工具看到的不一樣" ？¶

那要怎麼爬 "html內的東西都是後來程式(JavaScript)產生的，因此光用request爬不到什麼東西的網站。"¶

想辦法模擬瀏覽器收到html後做的動作 💪¶

但 request_html 也都處理好了! 🙌¶

Example 3.3¶

Exercise 3.3 (不計分)¶

延伸學習¶

讓應用程式能跟系統溝通，操作系統的東西

你可以用 request_html ¶