Data Scraping:¶

An Introduction to Requests, Beautifulsoup, and Xpath¶

Wang Chengjun (王成军)

wangchengjun@nju.edu.cn

Computational Communication (计算传播网) http://computational-communication.com

In [12]:

?display_html

In [11]:

# Basic principles of web crawlers
from IPython.display import display_html, HTML

HTML(url="http://www.cnblogs.com/zhaof/p/6898138.html")

Out[11]:

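Python Crawlers from Beginner to Giving Up (Part 2): How Crawlers Work - python修行路 - cnblogs (博客园)

Python Crawlers from Beginner to Giving Up (Part 2): How Crawlers Work

As the previous post put it, a crawler is an automated program that requests web pages and extracts data from them. Requesting, extracting, and automating are the three key ideas. The basic workflow of a crawler is analyzed below.

The basic crawler workflow

Send the request
Use an HTTP library to send a Request to the target site. The request may carry extra information such as headers. Then wait for the server to respond.

Get the response
If the server responds normally, you get a Response whose content is the page you asked for. The content may be HTML, a JSON string, or binary data such as an image or video.

Parse the content
HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.

Save the data
The data can be saved in many forms: as plain text, in a database, or as a file in a specific format.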
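The parse and save steps of this workflow can be sketched without any network access. This is only an illustration under stated assumptions: a hard-coded string stands in for the request and response steps, and the standard library's html.parser.HTMLParser is used in place of the parsing libraries introduced later in this notebook. The names LinkCollector and parse_links are made up for the example.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (the 'parse' step)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Stand-in for steps 1 and 2 (request + response): a hard-coded page
# fragment modeled on the test.html example used later in this notebook.
page = (
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p>"
)

links = parse_links(page)             # step 3: parse
saved = json.dumps({"links": links})  # step 4: serialize for saving
print(saved)
```

In a real crawler the string in page would come from an HTTP library such as requests, and saved would be written to a file or a database.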

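What are Request and Response?

The browser sends a message to the server that hosts the URL. This step is called the HTTP Request.

After receiving the message, the server processes it according to its content and sends a message back to the browser. This step is the HTTP Response.

After receiving the server's Response, the browser processes the information and displays it.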

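What does a Request contain?

Request method

GET and POST are the two most commonly used methods; the others include HEAD, PUT, DELETE, and OPTIONS.
The main difference between GET and POST is where the request data goes: with GET it is carried in the URL, while with POST it is carried in the request body.

GET: asks for a representation of the specified resource. GET should only be used to read data and should not be used for operations with side effects, for example in a web application. One reason is that GET URLs may be visited arbitrarily by web spiders and other crawlers.

POST: submits data to the specified resource and asks the server to process it (for example, submitting a form or uploading a file). The data is included in the request body. The request may create a new resource, modify an existing one, or both.

HEAD: like GET, asks the server for the specified resource, except that the server does not return the body. The benefit is that you can obtain information about the resource (its metadata) without transferring the whole content.

PUT: uploads the latest content to the specified resource location.

OPTIONS: asks the server to return all HTTP request methods that the resource supports. Sending an OPTIONS request with '*' in place of the resource name can be used to test whether the server is working properly.

DELETE: asks the server to delete the resource identified by the Request-URI.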
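To make the GET/POST difference concrete, the sketch below builds both kinds of request with the standard library's urllib.request.Request and inspects them without sending anything. The endpoint http://example.com/search and the parameter names are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "http://example.com/search"   # hypothetical endpoint
params = {"k": "haze", "page": 1}    # hypothetical parameters

# GET: the data is appended to the URL as a query string; there is no body.
get_req = Request(base + "?" + urlencode(params))

# POST: the same data is encoded as bytes and carried in the request body.
post_req = Request(base, data=urlencode(params).encode("utf-8"))

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With requests the same distinction is expressed through the params argument of requests.get and the data argument of requests.post.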

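Request URL

A URL (Uniform Resource Locator) is what we usually call a web address. It is a concise representation of the location of a resource available on the Internet and of the method for accessing it, and it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which indicates where the file is and how the browser should handle it.

A URL has three parts:
The first part is the protocol (also called the scheme).
The second part is the host name or IP address of the machine that holds the resource (sometimes with a port number).
The third part is the specific address of the resource on the host, such as the directory and file name.

A crawler needs a target URL before it can fetch any data, so the URL is the basic starting point for data collection.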
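The three parts can be read off a URL with the standard library's urllib.parse.urlsplit. The sketch below uses the test page URL that appears later in this notebook.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://computational-class.github.io/bigdata/data/test.html")

print(parts.scheme)    # part 1: the protocol
print(parts.hostname)  # part 2: the host
print(parts.port)      # None when the URL gives no explicit port
print(parts.path)      # part 3: the location of the resource on the host
```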

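Request headers

The header information sent with the request, such as User-Agent, Host, and Cookies. (Figure in the original post: all request headers sent when requesting Baidu.)

Request body
The data carried by the request, such as the form data when a form is submitted (POST).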
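As a small illustration of request headers, the sketch below attaches a User-Agent header to a request object and reads it back, again without sending anything. The header value my-crawler/0.1 is a made-up example. Note that urllib normalizes header names with str.capitalize(), so the stored key is User-agent.

```python
from urllib.request import Request

url = "http://computational-class.github.io/bigdata/data/test.html"
headers = {"User-Agent": "my-crawler/0.1"}  # hypothetical value

req = Request(url, headers=headers)

# header_items() lists the headers that have been set on the request
print(req.header_items())
print(req.get_header("User-agent"))
# No data was passed, so this request has no body
print(req.data)
```

With requests, the same dictionary would be passed as requests.get(url, headers=headers).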

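What does a Response contain?

The first line of every HTTP response is the status line. It consists of the HTTP version, a three-digit status code, and a short phrase describing the status, separated by spaces.

Response status

There are many response statuses, for example: 200 means success, 301 is a redirect, 404 means the page was not found, and 502 is a server error.

1xx Informational: the request has been received by the server and processing continues
2xx Success: the request was successfully received, understood, and accepted by the server
3xx Redirection: further action is needed to complete the request
4xx Client error: the request contains a syntax error or cannot be fulfilled
5xx Server error: the server failed while processing a valid request

Common codes:
200 OK: the request succeeded
301 Moved Permanently: the target has moved permanently
302 Found: the target has moved temporarily
400 Bad Request: the client request has a syntax error and the server cannot understand it
401 Unauthorized: the request is not authorized; this status code must be used together with the WWW-Authenticate header field
403 Forbidden: the server received the request but refuses to serve it
404 Not Found: the requested resource does not exist, e.g. a wrong URL was entered
500 Internal Server Error: the server encountered an unexpected error
503 Service Unavailable: the server cannot currently handle the request and may recover after some time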
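The status classes can be looked up programmatically. The sketch below uses the standard library's http.HTTPStatus for the reason phrases; the CLASSES table and the describe helper are illustrative names, not part of any library.

```python
from http import HTTPStatus

# First digit of the status code -> class of response
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 301, 404, 503):
    print(describe(code))
```

With requests, the code itself is available as the status_code attribute of the Response object.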

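Response headers

For example the content type, the content length, server information, and Set-Cookie. (Figure in the original post: an example of response headers.)

Response body

The most important part. It contains the content of the requested resource, such as the page HTML, images, or other binary data.

What kinds of data can be crawled

Web page text: HTML documents, JSON-formatted text, and so on
Images: the data received is binary and is saved in an image format
Video: also binary data
Other: anything that can be requested can be retrieved

How to parse the data

Direct processing
JSON parsing
Regular expressions
BeautifulSoup
PyQuery
XPath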
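Two of the listed approaches, JSON parsing and regular expressions, need only the standard library. The sketch below applies them to short hand-written strings; the JSON text is a made-up fragment, and the HTML line is taken from the test page used later in this notebook.

```python
import json
import re

# JSON parsing: convert the text into Python objects, then index into them
json_text = '{"title": "Three Billboards", "rating": {"average": "8.7"}}'
data = json.loads(json_text)
average = float(data["rating"]["average"])

# Regular expressions: pull every href value out of a piece of HTML
html_text = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
hrefs = re.findall(r'href="([^"]+)"', html_text)

print(average, hrefs)
```

Regular expressions work for small, regular fragments like this one; for whole pages a parser such as BeautifulSoup is usually less fragile.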

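Why the fetched page differs from what the browser shows

This happens because many sites load their data dynamically with JavaScript and Ajax, so the page returned by a plain GET request differs from the page rendered in the browser.

How to deal with JavaScript rendering?

Analyze the Ajax requests
Selenium/webdriver
Splash
PyV8, Ghost.py

How to save the data

Text: plain text, JSON, XML, and so on

Relational databases: structured databases such as MySQL, Oracle, and SQL Server

Non-relational databases: stores such as MongoDB (documents) and Redis (key-value)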
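The text and relational options can be sketched with the standard library alone. In this illustration an in-memory SQLite database stands in for a server such as MySQL, and the records list and the links table are made-up examples.

```python
import json
import sqlite3

records = [
    {"name": "Elsie", "href": "http://example.com/elsie"},
    {"name": "Lacie", "href": "http://example.com/lacie"},
]

# Text form: one JSON document that could be written to a file
as_json = json.dumps(records, ensure_ascii=False)

# Relational form: an in-memory SQLite table (a stand-in for MySQL etc.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (name TEXT, href TEXT)")
conn.executemany("INSERT INTO links VALUES (:name, :href)", records)
conn.commit()

rows = conn.execute("SELECT name, href FROM links ORDER BY name").fetchall()
print(as_json)
print(rows)
```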

posted @ 2017-05-24 11:44 by python修行路

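Problems to solve¶

Parsing pages
Retrieving source data hidden behind JavaScript
Turning pages automatically
Logging in automatically
Connecting to API endpoints

For ordinary scraping tasks, requests together with beautifulsoup is enough.
In particular, when a site's URL changes in a regular way from page to page, you only need to generate the URLs according to that rule.
A simple example is scraping the posts about a given keyword on the Tianya forum.
- On the Tianya forum, the first page of posts about 雾霾 (haze) is: http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=雾霾
- The second page is: http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=雾霾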
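The two Tianya URLs differ only in nextid, so the page URLs can be generated in a loop. The sketch below assumes that nextid keeps increasing by one on later pages, which the two examples suggest but do not prove; page_url is an illustrative helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://bbs.tianya.cn/list.jsp"


def page_url(keyword, page_index):
    """Build the list URL for one page; page_index 0 is the first page."""
    query = {"item": "free", "nextid": page_index, "order": 8, "k": keyword}
    # urlencode percent-encodes the non-ASCII keyword as UTF-8
    return BASE + "?" + urlencode(query)


urls = [page_url("雾霾", i) for i in range(3)]
for u in urls:
    print(u)

# Decoding the query string again recovers the original parameters
print(parse_qs(urlsplit(urls[1]).query))
```

Each generated URL can then be passed to requests.get in the same loop.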

The First Crawler¶

Beautifulsoup Quick Start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://computational-class.github.io/bigdata/data/test.html

In [3]:

import requests
from bs4 import BeautifulSoup

In [53]:

help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

In [5]:

url = 'http://computational-class.github.io/bigdata/data/test.html'
content = requests.get(url)
help(content)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  close(self)
 |      Releases the connection back to the pool. Once this method has been
 |      called the underlying ``raw`` object must not be accessed again.
 |      
 |      *Note: Should not normally need to be called explicitly.*
 |  
 |  iter_content(self, chunk_size=1, decode_unicode=False)
 |      Iterates over the response data.  When stream=True is set on the
 |      request, this avoids reading the content at once into memory for
 |      large responses.  The chunk size is the number of bytes it should
 |      read into memory.  This is not necessarily the length of each item
 |      returned as decoding can take place.
 |      
 |      chunk_size must be of type int or None. A value of None will
 |      function differently depending on the value of `stream`.
 |      stream=True will read data as it arrives in whatever size the
 |      chunks are received. If stream=False, data is returned as
 |      a single chunk.
 |      
 |      If decode_unicode is True, content will be decoded using the best
 |      available encoding based on the response.
 |  
 |  iter_lines(self, chunk_size=512, decode_unicode=None, delimiter=None)
 |      Iterates over the response data, one line at a time.  When
 |      stream=True is set on the request, this avoids reading the
 |      content at once into memory for large responses.
 |      
 |      .. note:: This method is not reentrant safe.
 |  
 |  json(self, **kwargs)
 |      Returns the json-encoded content of a response, if any.
 |      
 |      :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
 |      :raises ValueError: If the response body does not contain valid json.
 |  
 |  raise_for_status(self)
 |      Raises stored :class:`HTTPError`, if one occurred.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  apparent_encoding
 |      The apparent encoding, provided by the chardet library
 |  
 |  content
 |      Content of the response, in bytes.
 |  
 |  is_permanent_redirect
 |      True if this Response one of the permanent versions of redirect
 |  
 |  is_redirect
 |      True if this Response is a well-formed HTTP redirect that could have
 |      been processed automatically (by :meth:`Session.resolve_redirects`).
 |  
 |  links
 |      Returns the parsed header links of the response, if any.
 |  
 |  ok
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  text
 |      Content of the response, in unicode.
 |      
 |      If Response.encoding is None, encoding will be guessed using
 |      ``chardet``.
 |      
 |      The encoding of the response content is determined based solely on HTTP
 |      headers, following RFC 2616 to the letter. If you can take advantage of
 |      non-HTTP knowledge to make a better guess at the encoding, you should
 |      set ``r.encoding`` appropriately before accessing this property.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __attrs__ = ['_content', 'status_code', 'headers', 'url', 'history', '...

In [6]:

print(content.text)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

In [7]:

content.encoding

Out[7]:

'utf-8'
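The encoding matters because a response arrives as bytes (Response.content) and only becomes text (Response.text) after decoding. The sketch below imitates that step on a hand-made byte string, with no request involved.

```python
# Bytes as they would travel over the network, here UTF-8 encoded
raw = "雾霾".encode("utf-8")

decoded_ok = raw.decode("utf-8")      # decoding with the right encoding
decoded_bad = raw.decode("latin-1")   # wrong encoding: no error, but garbled text

print(raw)
print(decoded_ok)
print(decoded_bad)
```

When a page comes back garbled, setting the encoding attribute of the Response object before reading text is the usual remedy, as the help text above notes.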

Beautiful Soup¶

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

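Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.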
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib.

Install beautifulsoup4¶

open your terminal/cmd¶

~~$ pip install beautifulsoup4~~

html.parser¶

Beautiful Soup supports the html.parser included in Python’s standard library

lxml¶

Beautiful Soup also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

html5lib¶

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

In [9]:

url = 'http://computational-class.github.io/bigdata/data/test.html'
content = requests.get(url)
content = content.text
soup = BeautifulSoup(content, 'html.parser') 
soup

Out[9]:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>

In [10]:

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

html
- head
  - title
- body
  - p (class = 'title', 'story' )
    - a (class = 'sister')
      - href/id

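The select method¶

A tag name is written with no prefix
A class name is prefixed with a dot (.)
An id is prefixed with #

These are CSS selector rules, and soup.select() uses them to filter elements. It returns a list.

Three steps for using select¶

Inspect
Copy
- Copy Selector

Select the title "The Dormouse's story" with the mouse, right-click, and choose Inspect
Move the mouse to the highlighted source code
Right-click and choose Copy --> Copy Selector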

body > p.title > b

In [14]:

soup.select('body > p.title > b')#[0].text

Out[14]:

[<b>The Dormouse's story</b>]

The select method: find by tag name¶

In [68]:

soup.select('title')

Out[68]:

[<title>The Dormouse's story</title>]

In [65]:

soup.select('a')

Out[65]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [66]:

soup.select('b')

Out[66]:

[<b>The Dormouse's story</b>]

The select method: find by class name¶

In [69]:

soup.select('.title')

Out[69]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [26]:

soup.select('.sister')

Out[26]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [27]:

soup.select('.story')

Out[27]:

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

The select method: find by id¶

In [15]:

soup.select('#link1')

Out[15]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [16]:

soup.select('#link1')[0]['href']

Out[16]:

'http://example.com/elsie'

The select method: combined lookups¶

Tag names, class names, and ids can be combined.

For example, find the element whose id is link1 inside a p tag

In [70]:

soup.select('p #link1')

Out[70]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

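The select method: find by attribute¶

Attributes can also be used in a selector.

An attribute is written in square brackets, for example a[id="link1"].
The attribute and the tag belong to the same node, so there must be no space between them.
The In [17] and In [72] cells use a different feature: the child combinator >, which matches direct children only.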
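Attribute selectors of the kind this section describes can be sketched as follows. The HTML is a shortened, hand-written copy of the test page so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Tag and attribute describe one node: no space between a and [...]
exact = soup.select('a[href="http://example.com/elsie"]')

# [attr] on its own matches any element that has the attribute
with_id = soup.select("a[id]")

# A space separates different nodes: an a with id link2 somewhere inside a p
nested = soup.select('p a[id="link2"]')

print([a.text for a in exact], len(with_id), [a["href"] for a in nested])
```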

In [17]:

soup.select("head > title")

Out[17]:

[<title>The Dormouse's story</title>]

In [72]:

soup.select("body > p")

Out[72]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

The find_all method¶

In [30]:

soup('p')

Out[30]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [31]:

soup.find_all('p')

Out[31]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [32]:

[i.text for i in soup('p')]

Out[32]:

["The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.',
 '...']

In [34]:

for i in soup('p'):
    print(i.text)

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

In [35]:

for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p

In [36]:

soup('head') # or soup.head

Out[36]:

[<head><title>The Dormouse's story</title></head>]

In [37]:

soup('body') # or soup.body

Out[37]:

[<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body>]

In [38]:

soup('title')  # or  soup.title

Out[38]:

[<title>The Dormouse's story</title>]

In [39]:

soup('p')

Out[39]:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [40]:

soup.p

Out[40]:

<p class="title"><b>The Dormouse's story</b></p>

In [41]:

soup.title.name

Out[41]:

'title'

In [42]:

soup.title.string

Out[42]:

"The Dormouse's story"

In [43]:

soup.title.text
# using the .text attribute is recommended

Out[43]:

"The Dormouse's story"

In [44]:

soup.title.parent.name

Out[44]:

'head'

In [45]:

soup.p

Out[45]:

<p class="title"><b>The Dormouse's story</b></p>

In [46]:

soup.p['class']

Out[46]:

['title']

In [47]:

soup.find_all('p', {'class': 'title'})

Out[47]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [19]:

soup.find_all('p', class_= 'title')

Out[19]:

[<p class="title"><b>The Dormouse's story</b></p>]

In [49]:

soup.find_all('p', {'class': 'story'})

Out[49]:

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

In [34]:

soup.find_all('p', {'class': 'story'})[0].find_all('a')

Out[34]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [51]:

soup.a

Out[51]:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [52]:

soup('a')

Out[52]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [53]:

soup.find(id="link3")

Out[53]:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [54]:

soup.find_all('a')

Out[54]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [55]:

soup.find_all('a', {'class': 'sister'}) # compare with soup.find_all('a')

Out[55]:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [56]:

soup.find_all('a', {'class': 'sister'})[0]

Out[56]:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [57]:

soup.find_all('a', {'class': 'sister'})[0].text

Out[57]:

'Elsie'

In [58]:

soup.find_all('a', {'class': 'sister'})[0]['href']

Out[58]:

'http://example.com/elsie'

In [59]:

soup.find_all('a', {'class': 'sister'})[0]['id']

Out[59]:

'link1'

In [71]:

soup.find_all(["a", "b"])

Out[71]:

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [38]:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
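The individual lookups above (text, href, id) can be combined into one pass that turns every link into a record. This is a sketch on a hand-written copy of the test page; the records and missing names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# One dictionary per <a class="sister"> tag
records = [
    {"name": a.get_text(strip=True), "href": a.get("href"), "id": a.get("id")}
    for a in soup.find_all("a", class_="sister")
]

# find() returns a single Tag or None, so check before using the result
missing = soup.find("a", id="link4")

print(records)
print(missing is None)
```

Using Tag.get instead of square brackets avoids a KeyError when an attribute is absent.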

Data Scraping:¶

Scraping the Content of a WeChat Official Account Article¶

In [16]:

from IPython.display import display_html, HTML
HTML(url = 'http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd')
# the webpage we would like to crawl

Out[16]:

南大新传 | 微议题：地震中民族自豪—“中国人先撤”

南大新传院微议题排行榜

点击上方“微议题排行榜”可以订阅哦！

导读

2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。

我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。

热词图现

本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。

微议题演化趋势

总文章数

总阅读数

从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。

议题分类

如图，我们将此议题分为6大类。

尼泊尔地震

这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。

国内地震报道

主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。

中国救援回应

地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。

发改委与地震

这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。

地震常识介绍

该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。

地震中的故事

讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。

国内外地震关注差异大

关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。

阅读与点赞分析

爱国新闻容易激起点赞狂潮

整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。

人的关注小于国民尊严的保护

另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。

数据附表

微文章排行TOP50：

公众号排行TOP20：

作者：晏雪菲

出品单位：南京大学计算传播学实验中心

技术支持：南京大学谷尼舆情监测分析实验室

题图鸣谢：谷尼舆情新微榜、图悦词云

View the page source (Inspect)¶

In [36]:

url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = requests.get(url).text  # get the HTML text of the page
soup = BeautifulSoup(content, 'html.parser')

In [37]:

title = soup.select("#activity-name")
title[0].text.strip()

Out[37]:

'南大新传 | 微议题：地震中民族自豪—“中国人先撤”'

In [40]:

soup.find('h2', {'class': 'rich_media_title'}).text.strip()

Out[40]:

'南大新传 | 微议题：地震中民族自豪—“中国人先撤”'

In [185]:

print(soup.find('div', {'class': 'rich_media_meta_list'}))

<div class="rich_media_meta_list" id="meta_content">
<em class="rich_media_meta rich_media_meta_text" id="post-date">2015-05-04</em>
<em class="rich_media_meta rich_media_meta_text">南大新传院</em>
<a class="rich_media_meta rich_media_meta_link rich_media_meta_nickname" href="##" id="post-user">微议题排行榜</a>
<span class="rich_media_meta rich_media_meta_text rich_media_meta_nickname">微议题排行榜</span>
<div class="profile_container" id="js_profile_qrcode" style="display:none;">
<div class="profile_inner">
<strong class="profile_nickname">微议题排行榜</strong>
<img alt="" class="profile_avatar" id="js_profile_qrcode_img" src="">
<p class="profile_meta">
<label class="profile_meta_label">微信号</label>
<span class="profile_meta_value">IssuesRank</span>
</p>
<p class="profile_meta">
<label class="profile_meta_label">功能介绍</label>
<span class="profile_meta_value">感谢关注《微议题排行榜》。我们是南京大学新闻传播学院，计算传播学实验中心，致力于研究社会化媒体时代的公共议程，发布新媒体平台的议题排行榜。</span>
</p>
</img></div>
<span class="profile_arrow_wrp" id="js_profile_arrow_wrp">
<i class="profile_arrow arrow_out"></i>
<i class="profile_arrow arrow_in"></i>
</span>
</div>
</div>

In [42]:

soup.find('em').text

Out[42]:

'2015-05-04'

In [43]:

article = soup.find('div', {'class': 'rich_media_content'}).text
print(article)

点击上方“微议题排行榜”可以订阅哦！导读2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。  议题分类 如图，我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。 人的关注小于国民尊严的保护另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。  数据附表 微文章排行TOP50：公众号排行TOP20：作者：晏雪菲出品单位：南京大学计算传播学实验中心技术支持：南京大学谷尼舆情监测分析实验室题图鸣谢：谷尼舆情新微榜、图悦词云

In [44]:

rmml = soup.find('div', {'class': 'rich_media_meta_list'})
date = rmml.find(id = 'post-date').text
rmc = soup.find('div', {'class': 'rich_media_content'})
content = rmc.get_text()
print(title[0].text.strip())
print(date)
print(content)

南大新传 | 微议题：地震中民族自豪—“中国人先撤”
2015-05-04

点击上方“微议题排行榜”可以订阅哦！导读2015年4月25日，尼泊尔发生8.1级地震，造成至少7000多人死亡，中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词，选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计，我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震，深受人们的关注。面对国外灾难性事件，微媒体的重心却转向“油价”、“发改委”、“祖国先撤”，致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日，有关“地震”议题出现三个峰值，分别是在4月15日内蒙古地震，20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少，而对尼泊尔地震却给予了极大的关注，无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小，关注少，议程时间也比较短，一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差，但规模大，且衍生话题性较强，其讨论热度持续了一周以上。  议题分类 如图，我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道，包括现场视频，地震强度、规模，损失程度、遇难人员介绍等。更进一步的，有对尼泊尔地震原因探析，认为其处在板块交界处，灾难是必然的。因尼泊尔是佛教圣地，也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震，以及20日台湾地震的报道。偏重于对硬新闻的呈现，介绍地震范围、级数、伤亡情况，少数几篇是对甘肃地震的辟谣，称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关，并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”，来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似，纯粹是对发改委的调侃。称其“预测”地震非常准确，只要一上调油价，便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地，地震逃生注意事项，“专家传受活命三角”，如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间，回忆汶川地震中的故事，传递“：地震无情，人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点，尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%，国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差，而且在衍生话题方面也相差甚大。尼泊尔地震中，除了硬新闻报道外，还有对其原因分析、中国救援情况等，而国内地震只是集中于硬新闻。地震常识介绍只占9%，地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说，网民对地震议题关注度较高，自然灾害类话题一旦爆发，很容易引起人们情感共鸣，掀起热潮。但从点赞数来看，“中国救援回应”类的总点赞与平均点赞都是最高的，网民对地震的关注点并非地震本身，而是与之相关的“政府行动”。尼泊尔地震后，祖国派出救援机接国人回家，这一“先撤”行为被大量报道，上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪，产生民族优越感，激起点赞狂潮。 人的关注小于国民尊严的保护另一方面，国内地震的关注度却很少，不仅体现在政府救援的报道量小，网民的兴趣点与评价也较低。我们对“地震”中人的关注，远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当，灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高，网民对发改委和地震的调侃，反映出的是对油价上涨的不满，这种“怨气”也容易产生共鸣。一面是民族优越感，一面是对政策不满，两种情绪虽矛盾，但同时体现了网民心理趋同。  数据附表 微文章排行TOP50：公众号排行TOP20：作者：晏雪菲出品单位：南京大学计算传播学实验中心技术支持：南京大学谷尼舆情监测分析实验室题图鸣谢：谷尼舆情新微榜、图悦词云
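The extraction steps above can be collected into one function that returns None for any piece it cannot find instead of raising an error. This is a sketch, not the notebook's own code: parse_article is an illustrative name, and the sample string is a trimmed, hand-written stand-in that reuses the ids and class names seen in the output above, not the real page.

```python
from bs4 import BeautifulSoup


def parse_article(html_text):
    """Return title, date, and body text; None for anything missing."""
    soup = BeautifulSoup(html_text, "html.parser")

    def text_of(tag):
        return tag.get_text(strip=True) if tag is not None else None

    return {
        "title": text_of(soup.select_one("#activity-name")),
        "date": text_of(soup.find(id="post-date")),
        "content": text_of(soup.find("div", {"class": "rich_media_content"})),
    }


# Hand-written stand-in for the article page (not the real HTML)
sample = """
<h2 class="rich_media_title" id="activity-name"> Sample title </h2>
<div class="rich_media_meta_list" id="meta_content">
<em class="rich_media_meta rich_media_meta_text" id="post-date">2015-05-04</em>
</div>
<div class="rich_media_content">Body text.</div>
"""

print(parse_article(sample))
print(parse_article("<html></html>"))
```

Guarding against None matters here because page layouts change over time, and a selector that worked once may later match nothing.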

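requests + Xpath: Douban Movie as an example¶

Xpath is the XML Path Language, a language for addressing parts of an XML document.

Xpath is based on the tree structure of XML and provides a way to locate nodes in that tree. It was originally proposed as a common syntax model shared by Xpointer and XSL, but developers quickly adopted it as a small query language.

Get an element's Xpath and then its text: the element's Xpath has to be obtained by hand, as follows:

Locate the target element
On the page, right-click > Inspect
Copy XPath
Append '/text()' to the Xpath

Reference: https://mp.weixin.qq.com/s/zx3_eflBCrrfOqFEWjAUJw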

In [46]:

import requests
from lxml import etree

url = 'https://movie.douban.com/subject/26611804/'
data = requests.get(url).text
s = etree.HTML(data)

If the Xpath of the movie title on Douban is xpath_info, then the title is obtained with:

title = s.xpath('xpath_info/text()')

where xpath_info is:

//*[@id="content"]/h1/span[1]

In [47]:

title = s.xpath('//*[@id="content"]/h1/span[1]/text()')[0]
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
actors = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
type1 = s.xpath('//*[@id="info"]/span[5]/text()')
type2 = s.xpath('//*[@id="info"]/span[6]/text()')
type3 = s.xpath('//*[@id="info"]/span[7]/text()')
time = s.xpath('//*[@id="info"]/span[11]/text()')
length = s.xpath('//*[@id="info"]/span[13]/text()')
score = s.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]

In [48]:

print(title, director, actors, type1, type2, type3, time, length, score)

三块广告牌 Three Billboards Outside Ebbing, Missouri ['马丁·麦克唐纳'] ['弗兰西斯·麦克多蒙德', '伍迪·哈里森', '山姆·洛克威尔', '艾比·考尼什', '卢卡斯·赫奇斯', '彼特·丁拉基', '约翰·浩克斯', '卡赖伯·兰德里·琼斯', '凯瑟琳·纽顿', '凯瑞·康顿', '泽利科·伊万内克', '萨玛拉·维文', '克拉克·彼得斯', '尼克·西塞', '阿曼达·沃伦', '玛拉雅·瑞沃拉·德鲁 ', '布兰登·萨克斯顿', '迈克尔·艾伦·米利甘'] ['剧情'] ['犯罪'] ['官方网站:'] ['2018-03-02(中国大陆)'] ['2017-12-01(美国)'] 8.7
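In the printed result above, type3 holds '官方网站:' and length holds a release date, which suggests that positional paths such as span[7] or span[13] shift when the page layout changes. An xpath() call always returns a list, possibly empty, so a small helper can make that explicit. The sketch below is an illustration under stated assumptions: first_text is a made-up helper, and the HTML is a hand-written stand-in for the title part of the page, not the real Douban markup.

```python
from lxml import etree

# Hand-written stand-in for the relevant part of the page (not the real HTML)
html_text = """
<div id="content">
  <h1><span>Three Billboards Outside Ebbing, Missouri</span><span>(2017)</span></h1>
</div>
"""
tree = etree.HTML(html_text)


def first_text(tree, path):
    """Return the first text node matched by path, stripped, or None."""
    matches = tree.xpath(path + "/text()")
    return matches[0].strip() if matches else None


title = first_text(tree, '//*[@id="content"]/h1/span[1]')
missing = first_text(tree, '//*[@id="info"]/span[11]')

print(title)
print(missing)
```

Checking each extracted value against the page, as the print in the cell above does, is a simple way to notice when a copied Xpath no longer points at the intended element.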

Douban API¶

https://developers.douban.com/wiki/?title=guide

In [92]:

import requests
url = 'https://api.douban.com/v2/movie/26611804'
#url = 'https://api.douban.com/v2/user/1000001/'
jsonm = requests.get(url).json()

In [93]:

jsonm

Out[93]:

{'alt': 'https://movie.douban.com/movie/26611804',
 'alt_title': '三块广告牌 / 意外(台)',
 'attrs': {'cast': ['弗兰西斯·麦克多蒙德 Frances McDormand',
   '伍迪·哈里森 Woody Harrelson',
   '山姆·洛克威尔 Sam Rockwell',
   '艾比·考尼什 Abbie Cornish',
   '卢卡斯·赫奇斯 Lucas Hedges',
   '彼特·丁克拉奇 Peter Dinklage',
   '约翰·浩克斯 John Hawkes',
   '卡赖伯·兰德里·琼斯 Caleb Landry Jones',
   '凯瑟琳·牛顿 Kathryn Newton',
   '凯瑞·康顿 Kerry Condon',
   '泽利科·伊万内克 Zeljko Ivanek',
   '萨玛拉·维文 Samara Weaving',
   '克拉克·彼得斯 Clarke Peters',
   '尼克·西塞 Nick Searcy',
   '阿曼达·沃伦 Amanda Warren',
   '玛拉雅·瑞沃拉·德鲁  Malaya Rivera Drew',
   '布兰登·萨克斯顿 Brendan Sexton III',
   '迈克尔·艾伦·米利甘 Michael Aaron Milligan'],
  'country': ['美国', '英国'],
  'director': ['马丁·麦克唐纳 Martin McDonagh'],
  'language': ['英语'],
  'movie_duration': ['115分钟'],
  'movie_type': ['剧情', '犯罪'],
  'pubdate': ['2017-09-04(威尼斯电影节)', '2017-12-01(美国)', '2018-03-02(中国大陆)'],
  'title': ['Three Billboards Outside Ebbing, Missouri'],
  'website': ['www.foxsearchlight.com/threebillboardsoutsideebbingmissouri'],
  'writer': ['马丁·麦克唐纳 Martin McDonagh'],
  'year': ['2017']},
 'author': [{'name': '马丁·麦克唐纳 Martin McDonagh'}],
 'id': 'https://api.douban.com/movie/26611804',
 'image': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2510081688.jpg',
 'mobile_link': 'https://m.douban.com/movie/subject/26611804/',
 'rating': {'average': '8.7', 'max': 10, 'min': 0, 'numRaters': 401598},
 'summary': '米尔德雷德（弗兰西斯·麦克多蒙德 Frances McDormand 饰）的女儿在外出时惨遭奸杀，米尔德雷德和丈夫查理（约翰·哈克斯 John Hawkes 饰）之间的婚姻因此走到了尽头，如今，她同儿子罗比（卢卡斯·赫奇斯 Lucas Hedges饰）过着相依为命的生活。一晃眼几个月过去了，案件仍然没有告破预兆，而警方似乎早已经将注意力从案子上转移了开来。\n被绝望和痛苦缠绕的米尔德雷德租下了高速公路边上的三块巨型广告牌，在上面控诉警方办案无能，并将矛头直接对准了警察局局长威洛比（伍迪·哈里森 Woody Harrelson 饰）。实际上，威洛比一直隐瞒着自己身患绝症命不久矣的事实。因为这三块广告牌，米尔德雷德和威洛比的生活发生了翻天覆地的变化。',
 'tags': [{'count': 46821, 'name': '人性'},
  {'count': 38006, 'name': '剧情'},
  {'count': 37581, 'name': '美国'},
  {'count': 34477, 'name': '犯罪'},
  {'count': 30513, 'name': '黑色幽默'},
  {'count': 23182, 'name': '女性'},
  {'count': 22567, 'name': '奥斯卡'},
  {'count': 15784, 'name': '2017'}],
 'title': 'Three Billboards Outside Ebbing, Missouri'}

In [89]:

#jsonm.values()
jsonm.keys(), jsonm['rating']

Out[89]:

(dict_keys(['alt_title', 'mobile_link', 'attrs', 'author', 'summary', 'rating', 'alt', 'id', 'image', 'title', 'tags']),
 {'average': '8.7', 'max': 10, 'min': 0, 'numRaters': 401598})

In [84]:

jsonm['alt']

Out[84]:

'https://movie.douban.com/movie/26611804'

In [87]:

jsonm['attrs']['director']

Out[87]:

['马丁·麦克唐纳 Martin McDonagh']

In [89]:

jsonm['attrs']['movie_type']

Out[89]:

['剧情', '犯罪']

In [88]:

jsonm['attrs']['cast']

Out[88]:

['弗兰西斯·麦克多蒙德 Frances McDormand',
 '伍迪·哈里森 Woody Harrelson',
 '山姆·洛克威尔 Sam Rockwell',
 '艾比·考尼什 Abbie Cornish',
 '卢卡斯·赫奇斯 Lucas Hedges',
 '彼特·丁克拉奇 Peter Dinklage',
 '约翰·浩克斯 John Hawkes',
 '卡赖伯·兰德里·琼斯 Caleb Landry Jones',
 '凯瑟琳·牛顿 Kathryn Newton',
 '凯瑞·康顿 Kerry Condon',
 '泽利科·伊万内克 Zeljko Ivanek',
 '萨玛拉·维文 Samara Weaving',
 '克拉克·彼得斯 Clarke Peters',
 '尼克·西塞 Nick Searcy',
 '阿曼达·沃伦 Amanda Warren',
 '玛拉雅·瑞沃拉·德鲁  Malaya Rivera Drew',
 '布兰登·萨克斯顿 Brendan Sexton III',
 '迈克尔·艾伦·米利甘 Michael Aaron Milligan']
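
The individual lookups above can be gathered into one helper. The following is a minimal sketch, not part of the original notebook: the function name `summarize_movie` and the choice of fields are assumptions, and `sample` is an abbreviated copy of the response shown earlier so that the code runs without network access.

```python
def summarize_movie(payload):
    """Flatten the nested movie JSON into a small, flat dict."""
    attrs = payload.get('attrs', {})
    rating = payload.get('rating', {})
    return {
        'title': payload.get('title'),
        'url': payload.get('alt'),
        # 'average' arrives as a string such as '8.7'; convert it for numeric work.
        'rating': float(rating['average']) if 'average' in rating else None,
        'num_raters': rating.get('numRaters'),
        'directors': attrs.get('director', []),
        'genres': attrs.get('movie_type', []),
        # Keep only the names of the first three tags.
        'top_tags': [tag['name'] for tag in payload.get('tags', [])[:3]],
    }


# Abbreviated copy of the response shown above (not fetched live).
sample = {
    'alt': 'https://movie.douban.com/movie/26611804',
    'title': 'Three Billboards Outside Ebbing, Missouri',
    'attrs': {'director': ['马丁·麦克唐纳 Martin McDonagh'],
              'movie_type': ['剧情', '犯罪']},
    'rating': {'average': '8.7', 'max': 10, 'min': 0, 'numRaters': 401598},
    'tags': [{'count': 46821, 'name': '人性'},
             {'count': 38006, 'name': '剧情'},
             {'count': 37581, 'name': '美国'},
             {'count': 34477, 'name': '犯罪'}],
}

movie = summarize_movie(sample)
print(movie['title'], movie['rating'], movie['top_tags'])
```

Using `.get()` with defaults means a response that lacks a field yields `None` or an empty list instead of raising `KeyError`.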

In [129]:

headers = {
    'Host': 'api.douban.com',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6',
    'Cookie': 'gr_user_id=54559934-955b-4798-9df1-ed12a97b61b1; ue="wangchj04@126.com"; _ga=GA1.2.1584253277.1448983887; _vwo_uuid_v2=7CD7A27EE46C68D5713E8870DCBB0C50|a39513af0c4457f727aeb9dcb79c7867; douban-profile-remind=1; douban-fav-remind=1; bid=-n0SJDzOCOU; ll="118159"; __gads=ID=223032a1f45c3c9d:T=1541658183:S=ALNI_MY65rcbNHf8eTpIzbr9MTNv1lhuSg; push_doumail_num=0; UM_distinctid=167a02c6a9b273-064d1e75f342f3-35677603-fa000-167a02c6a9c203; __utmv=30149280.155; ct=y; push_noty_num=0; __utmc=30149280; __utmz=30149280.1548213414.56.6.utmcsr=book.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/subject/2893874/comments/; viewed="1536615"; dbcl2="1558440:4omV9m7YBqg"; ck=AdQI; ap_v=0,6.0; __utma=30149280.1584253277.1448983887.1548653281.1548735278.59; __utmt=1; __utmb=30149280.17.5.1548735287585'
}

cookies={}
raw_cookies = headers['Cookie']
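for line in raw_cookies.split(';'):
    # strip() drops the space that follows each ';'. Without it, every key after the
    # first keeps a leading space, as in the output recorded below.
    key, value = line.strip().split('=', 1)  # maxsplit=1: split on the first '=' only
    cookies[key] = value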
cookies

Out[129]:

{' UM_distinctid': '167a02c6a9b273-064d1e75f342f3-35677603-fa000-167a02c6a9c203',
 ' __gads': 'ID=223032a1f45c3c9d:T=1541658183:S=ALNI_MY65rcbNHf8eTpIzbr9MTNv1lhuSg',
 ' __utma': '30149280.1584253277.1448983887.1548653281.1548735278.59',
 ' __utmb': '30149280.17.5.1548735287585',
 ' __utmc': '30149280',
 ' __utmt': '1',
 ' __utmv': '30149280.155',
 ' __utmz': '30149280.1548213414.56.6.utmcsr=book.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/subject/2893874/comments/',
 ' _ga': 'GA1.2.1584253277.1448983887',
 ' _vwo_uuid_v2': '7CD7A27EE46C68D5713E8870DCBB0C50|a39513af0c4457f727aeb9dcb79c7867',
 ' ap_v': '0,6.0',
 ' bid': '-n0SJDzOCOU',
 ' ck': 'AdQI',
 ' ct': 'y',
 ' dbcl2': '"1558440:4omV9m7YBqg"',
 ' douban-fav-remind': '1',
 ' douban-profile-remind': '1',
 ' ll': '"118159"',
 ' push_doumail_num': '0',
 ' push_noty_num': '0',
 ' ue': '"wangchj04@126.com"',
 ' viewed': '"1536615"',
 'gr_user_id': '54559934-955b-4798-9df1-ed12a97b61b1'}
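
Most keys in the output above begin with a space (for example `' bid'`), because splitting on `';'` leaves the space that follows each semicolon in the header. The sketch below shows one way to avoid that. The function name `parse_cookie_header` and the cookie values in `demo` are made up for illustration.

```python
def parse_cookie_header(raw):
    """Turn a Cookie header such as 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()  # remove the space that follows each ';'
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split once: values may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up cookie string; the third value itself contains '=' characters.
demo = parse_cookie_header('bid=abc123; ll="118159"; __gads=ID=1:T=2')
print(demo)
```

The resulting dict can be passed to `requests.get(url, cookies=...)` in the same way as the `cookies` dict built above.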

In [135]:

import requests
url = 'https://api.douban.com/v2/user/1000001/'
jsonm = requests.get(url,  cookies = cookies)#.json()

In [136]:

jsonm

Out[136]:

<Response [404]>
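
A 404 response like the one above has no useful JSON body, so it helps to check the status code before decoding. The following stdlib-only sketch illustrates the idea. The helper name `decode_json_response` is an assumption; with `requests` it would be called as `decode_json_response(r.status_code, r.text)`.

```python
import json
from http import HTTPStatus


def decode_json_response(status_code, body_text):
    """Return the parsed JSON for a 2xx response; otherwise raise a readable error."""
    if not 200 <= status_code < 300:
        try:
            phrase = HTTPStatus(status_code).phrase  # e.g. 404 -> 'Not Found'
        except ValueError:
            phrase = 'Unknown Status'  # code not defined in http.HTTPStatus
        raise RuntimeError(f'request failed: {status_code} {phrase}')
    return json.loads(body_text)


print(decode_json_response(200, '{"id": "1000001"}'))
try:
    decode_json_response(404, '')
except RuntimeError as exc:
    print(exc)
```

`requests` also provides `Response.raise_for_status()`, which raises an exception for 4xx and 5xx responses and serves the same purpose.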

In [134]:

jsonm.request.headers

Out[134]:

{'Connection': 'keep-alive', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.14.2', 'Cookie': ' __utmc=30149280;  ap_v=0,6.0;  __gads=ID=223032a1f45c3c9d:T=1541658183:S=ALNI_MY65rcbNHf8eTpIzbr9MTNv1lhuSg;  douban-fav-remind=1;  __utmt=1;  _ga=GA1.2.1584253277.1448983887;  push_noty_num=0;  _vwo_uuid_v2=7CD7A27EE46C68D5713E8870DCBB0C50|a39513af0c4457f727aeb9dcb79c7867;  ue="wangchj04@126.com";  ll="118159"; gr_user_id=54559934-955b-4798-9df1-ed12a97b61b1;  ct=y;  __utmv=30149280.155;  ck=AdQI;  __utmb=30149280.17.5.1548735287585;  push_doumail_num=0;  bid=-n0SJDzOCOU;  UM_distinctid=167a02c6a9b273-064d1e75f342f3-35677603-fa000-167a02c6a9c203;  viewed="1536615";  __utma=30149280.1584253277.1448983887.1548653281.1548735278.59;  douban-profile-remind=1;  __utmz=30149280.1548213414.56.6.utmcsr=book.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/subject/2893874/comments/;  dbcl2="1558440:4omV9m7YBqg"'}

Simulating a Douban login with requests.post (including fetching the captcha)¶

https://blog.csdn.net/zhuzuwei/article/details/80875538

Assignment: Scrape the Douban Movie Top 250¶

In [59]:

import requests
from bs4 import BeautifulSoup
from lxml import etree

url0 = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url0).text
s = etree.HTML(data)

In [222]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[222]:

'肖申克的救赎'

In [225]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[2]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[225]:

'霸王别姬'

In [227]:

s.xpath('//*[@id="content"]/div/div[1]/ol/li[3]/div/div[2]/div[1]/a/span[1]/text()')[0]

Out[227]:

'这个杀手不太冷'
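
The three cells above differ only in the index `li[1]`, `li[2]`, `li[3]`. Instead of editing the index by hand, one can select every `li` and run a relative query on each. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a small XPath subset, on a made-up and heavily simplified stand-in for the page. `sample_html` is an assumption, not the real Douban markup. lxml elements also expose `.xpath()`, so the same loop could be written against `s`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Top 250 list structure.
sample_html = """
<div id="content">
  <ol>
    <li><div class="info"><a href="/subject/1292052/">
      <span class="title">肖申克的救赎</span>
      <span class="title">The Shawshank Redemption</span></a></div></li>
    <li><div class="info"><a href="/subject/1291546/">
      <span class="title">霸王别姬</span></a></div></li>
    <li><div class="info"><a href="/subject/1295644/">
      <span class="title">这个杀手不太冷</span>
      <span class="title">Léon</span></a></div></li>
  </ol>
</div>
""".strip()

root = ET.fromstring(sample_html)

titles = []
for li in root.findall('.//ol/li'):
    # find() returns the first match, which plays the role of span[1] in the XPath above.
    first_title = li.find(".//span[@class='title']")
    titles.append(first_title.text)

print(titles)
```

Selecting by class rather than by a long positional path also tends to survive small layout changes better.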

In [60]:

import requests
from bs4 import BeautifulSoup

url0 = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url0).text
soup = BeautifulSoup(data, 'lxml')

In [61]:

# pass attrs as a dict (key: value), not a set literal
movies = soup.find_all('div', {'class': 'info'})

In [62]:

len(movies)

Out[62]:

25

In [63]:

movies[0].a['href']

Out[63]:

'https://movie.douban.com/subject/1292052/'

In [39]:

movies[0].find('span', {'class': 'title'}).text

Out[39]:

'肖申克的救赎'

In [26]:

movies[0].find('div', {'class': 'star'})

Out[26]:

<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span content="10.0" property="v:best"></span>
<span>1004428人评价</span>
</div>

In [28]:

movies[0].find('span', {'class': 'rating_num'}).text

Out[28]:

'9.6'

In [90]:

people_num = movies[0].find('div', {'class': 'star'}).find_all('span')[-1]
people_num.text.split('人评价')[0]

Out[90]:

'1004428'
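
Splitting on `'人评价'` returns a string. A regular expression can extract the digits, convert them to an integer, and fail loudly when the label does not look as expected. This is a minimal sketch, and the function name `parse_rating_count` is an assumption.

```python
import re


def parse_rating_count(label):
    """Extract the number of raters from a label such as '1004428人评价'."""
    match = re.search(r'(\d+)\s*人评价', label)
    if match is None:
        raise ValueError(f'no rating count found in {label!r}')
    return int(match.group(1))


print(parse_rating_count('1004428人评价'))
```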

In [64]:

for i in movies:
    url = i.a['href']
    # attrs are passed as dicts (key: value), not set literals
    title = i.find('span', {'class': 'title'}).text
    des = i.find('div', {'class': 'star'})
    rating = des.find('span', {'class': 'rating_num'}).text
    rating_num = des.find_all('span')[-1].text.split('人评价')[0]
    print(url, title, rating, rating_num)

https://movie.douban.com/subject/1292052/ 肖申克的救赎 9.6 1021383
https://movie.douban.com/subject/1291546/ 霸王别姬 9.5 742984
https://movie.douban.com/subject/1295644/ 这个杀手不太冷 9.4 957578
https://movie.douban.com/subject/1292720/ 阿甘正传 9.4 814634
https://movie.douban.com/subject/1292063/ 美丽人生 9.5 475813
https://movie.douban.com/subject/1291561/ 千与千寻 9.3 762619
https://movie.douban.com/subject/1292722/ 泰坦尼克号 9.3 754309
https://movie.douban.com/subject/1295124/ 辛德勒的名单 9.4 433191
https://movie.douban.com/subject/3541415/ 盗梦空间 9.3 853620
https://movie.douban.com/subject/2131459/ 机器人总动员 9.3 559729
https://movie.douban.com/subject/1292001/ 海上钢琴师 9.2 657670
https://movie.douban.com/subject/3793023/ 三傻大闹宝莱坞 9.2 767473
https://movie.douban.com/subject/3011091/ 忠犬八公的故事 9.2 529473
https://movie.douban.com/subject/1291549/ 放牛班的春天 9.2 513071
https://movie.douban.com/subject/1292213/ 大话西游之大圣娶亲 9.2 561091
https://movie.douban.com/subject/1292064/ 楚门的世界 9.1 533017
https://movie.douban.com/subject/1291560/ 龙猫 9.1 473631
https://movie.douban.com/subject/1291841/ 教父 9.2 385130
https://movie.douban.com/subject/5912992/ 熔炉 9.2 309138
https://movie.douban.com/subject/1889243/ 星际穿越 9.2 560855
https://movie.douban.com/subject/1300267/ 乱世佳人 9.2 299301
https://movie.douban.com/subject/6786002/ 触不可及 9.1 416073
https://movie.douban.com/subject/1307914/ 无间道 9.0 458107
https://movie.douban.com/subject/1849031/ 当幸福来敲门 8.9 606767
https://movie.douban.com/subject/1291828/ 天堂电影院 9.1 337952

In [51]:

for i in range(0, 250, 25):
    print('https://movie.douban.com/top250?start=%d&filter='% i)

https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
https://movie.douban.com/top250?start=75&filter=
https://movie.douban.com/top250?start=100&filter=
https://movie.douban.com/top250?start=125&filter=
https://movie.douban.com/top250?start=150&filter=
https://movie.douban.com/top250?start=175&filter=
https://movie.douban.com/top250?start=200&filter=
https://movie.douban.com/top250?start=225&filter=
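
The same list of page URLs can be built with `urllib.parse.urlencode`, which keeps the query string correct if more parameters are added later. This is a sketch, and the names `BASE_URL` and `top250_page_urls` are assumptions.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def top250_page_urls(total=250, page_size=25):
    """Return one URL per result page; 'start' is the offset of the first film."""
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({'start': start, 'filter': ''})
        urls.append(f'{BASE_URL}?{query}')
    return urls


page_urls = top250_page_urls()
print(len(page_urls), page_urls[0], page_urls[-1])
```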

In [65]:

import requests
from bs4 import BeautifulSoup
dat = []
for j in range(0, 250, 25):
    urli = 'https://movie.douban.com/top250?start=%d&filter='% j
    data = requests.get(urli).text
    soup = BeautifulSoup(data, 'lxml')
    # attrs are passed as dicts (key: value), not set literals
    movies = soup.find_all('div', {'class': 'info'})
    for i in movies:
        url = i.a['href']
        title = i.find('span', {'class': 'title'}).text
        des = i.find('div', {'class': 'star'})
        rating = des.find('span', {'class': 'rating_num'}).text
        rating_num = des.find_all('span')[-1].text.split('人评价')[0]
        listi = [url, title, rating, rating_num]
        dat.append(listi)
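
The loop above stops at the first network error and sends its requests back to back. The sketch below shows one possible way to separate fetching from parsing, pause between requests, and record failures instead of aborting. `crawl_pages`, `fake_fetch`, and `fake_parse` are hypothetical names; the fake functions stand in for `requests.get(...).text` and the BeautifulSoup parsing so that the example runs offline.

```python
import time


def crawl_pages(urls, fetch, parse, delay_seconds=1.0):
    """Fetch each URL in turn, parse it, and collect rows; remember failures."""
    rows, failed = [], []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
        try:
            html = fetch(url)
        except Exception as exc:  # keep going, but record what went wrong
            failed.append((url, repr(exc)))
            continue
        rows.extend(parse(html))
    return rows, failed


# Offline stand-ins for illustration only.
def fake_fetch(url):
    if url.endswith('bad'):
        raise IOError('simulated network error')
    return f'<html>{url}</html>'


def fake_parse(html):
    return [[html, len(html)]]


rows, failed = crawl_pages(['page1', 'bad', 'page3'], fake_fetch, fake_parse,
                           delay_seconds=0)
print(len(rows), failed)
```

With `requests`, `fetch` could be a small function that calls `requests.get(url, headers={'User-Agent': ...}, timeout=10)` and returns `.text`. Some sites reject requests that lack a browser-like `User-Agent`, so setting one may be necessary.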

In [66]:

import pandas as pd
df = pd.DataFrame(dat, columns = ['url', 'title', 'rating', 'rating_num'])
df['rating'] = df.rating.astype(float)
df['rating_num'] = df.rating_num.astype(int)
df.head()

Out[66]:

	url	title	rating	rating_num
0	https://movie.douban.com/subject/1292052/	肖申克的救赎	9.6	1021383
1	https://movie.douban.com/subject/1291546/	霸王别姬	9.5	742984
2	https://movie.douban.com/subject/1295644/	这个杀手不太冷	9.4	957578
3	https://movie.douban.com/subject/1292720/	阿甘正传	9.4	814634
4	https://movie.douban.com/subject/1292063/	美丽人生	9.5	475813

In [3]:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df.rating_num)
plt.show()

In [19]:

plt.hist(df.rating)
plt.show()

In [11]:

fig = plt.figure(figsize=(16, 16),facecolor='white')

plt.plot(df.rating_num, df.rating, 'bo')
for i in df.index:
    plt.text(df.rating_num[i], df.rating[i], df.title[i], 
             fontsize = df.rating[i], 
             color = 'red', rotation = 45)
plt.show()

In [123]:

df[df.rating > 9.4]

Out[123]:

	url	title	rating	rating_num
0	https://movie.douban.com/subject/1292052/	肖申克的救赎	9.6	1004428
1	https://movie.douban.com/subject/1291546/	霸王别姬	9.5	730274
4	https://movie.douban.com/subject/1292063/	美丽人生	9.5	469332
41	https://movie.douban.com/subject/1296141/	控方证人	9.6	108598

In [69]:

alist = []
for i in df.index:
    alist.append( [df.rating_num[i], df.rating[i], df.title[i] ])

blist =[[df.rating_num[i], df.rating[i], df.title[i] ] for i in df.index] 

alist

Out[69]:

[[1021383, 9.5999999999999996, '肖申克的救赎'],
 [742984, 9.5, '霸王别姬'],
 [957578, 9.4000000000000004, '这个杀手不太冷'],
 [814634, 9.4000000000000004, '阿甘正传'],
 [475813, 9.5, '美丽人生'],
 [762619, 9.3000000000000007, '千与千寻'],
 [754309, 9.3000000000000007, '泰坦尼克号'],
 [433191, 9.4000000000000004, '辛德勒的名单'],
 [853620, 9.3000000000000007, '盗梦空间'],
 [559729, 9.3000000000000007, '机器人总动员'],
 [657670, 9.1999999999999993, '海上钢琴师'],
 [767473, 9.1999999999999993, '三傻大闹宝莱坞'],
 [529473, 9.1999999999999993, '忠犬八公的故事'],
 [513071, 9.1999999999999993, '放牛班的春天'],
 [561091, 9.1999999999999993, '大话西游之大圣娶亲'],
 [533017, 9.0999999999999996, '楚门的世界'],
 [473631, 9.0999999999999996, '龙猫'],
 [385130, 9.1999999999999993, '教父'],
 [309138, 9.1999999999999993, '熔炉'],
 [560855, 9.1999999999999993, '星际穿越'],
 [299301, 9.1999999999999993, '乱世佳人'],
 [416073, 9.0999999999999996, '触不可及'],
 [458107, 9.0, '无间道'],
 [606767, 8.9000000000000004, '当幸福来敲门'],
 [337952, 9.0999999999999996, '天堂电影院'],
 [633995, 8.9000000000000004, '怦然心动'],
 [190977, 9.4000000000000004, '十二怒汉'],
 [434420, 9.0, '搏击俱乐部'],
 [640800, 9.0, '少年派的奇幻漂流'],
 [260089, 9.1999999999999993, '鬼子来了'],
 [367866, 9.0999999999999996, '蝙蝠侠：黑暗骑士'],
 [314885, 9.0999999999999996, '指环王3：王者无敌'],
 [306344, 9.0999999999999996, '活着'],
 [369956, 9.0, '天空之城'],
 [585740, 9.1999999999999993, '疯狂动物城'],
 [426150, 8.9000000000000004, '罗马假日'],
 [451703, 8.9000000000000004, '大话西游之月光宝盒'],
 [554642, 8.9000000000000004, '飞屋环游记'],
 [249586, 9.0999999999999996, '窃听风暴'],
 [296760, 9.0999999999999996, '两杆大烟枪'],
 [111737, 9.5999999999999996, '控方证人'],
 [301329, 9.0, '飞越疯人院'],
 [358755, 8.9000000000000004, '闻香识女人'],
 [393556, 8.9000000000000004, '哈尔的移动城堡'],
 [196094, 9.3000000000000007, '海豚湾'],
 [464601, 8.8000000000000007, 'V字仇杀队'],
 [237421, 9.0999999999999996, '辩护人'],
 [309071, 9.0, '死亡诗社'],
 [207619, 9.0999999999999996, '教父2'],
 [333942, 8.9000000000000004, '美丽心灵'],
 [296196, 9.0, '指环王2：双塔奇兵'],
 [331529, 8.9000000000000004, '指环王1：魔戒再现'],
 [411534, 8.8000000000000007, '情书'],
 [223469, 9.0999999999999996, '饮食男女'],
 [517803, 9.0999999999999996, '摔跤吧！爸爸'],
 [191667, 9.0999999999999996, '美国往事'],
 [309325, 8.9000000000000004, '狮子王'],
 [220420, 9.0, '钢琴家'],
 [520325, 8.6999999999999993, '天使爱美丽'],
 [205704, 9.0999999999999996, '素媛'],
 [469032, 8.6999999999999993, '七宗罪'],
 [153673, 9.1999999999999993, '小鞋子'],
 [320506, 8.9000000000000004, '被嫌弃的松子的一生'],
 [375951, 8.8000000000000007, '致命魔术'],
 [378652, 8.8000000000000007, '看不见的客人'],
 [251308, 8.9000000000000004, '音乐之声'],
 [315215, 8.8000000000000007, '勇敢的心'],
 [523686, 8.6999999999999993, '剪刀手爱德华'],
 [425844, 8.8000000000000007, '本杰明·巴顿奇事'],
 [365086, 8.8000000000000007, '低俗小说'],
 [385562, 8.6999999999999993, '西西里的美丽传说'],
 [307307, 8.8000000000000007, '黑客帝国'],
 [262404, 8.9000000000000004, '拯救大兵瑞恩'],
 [383825, 8.6999999999999993, '沉默的羔羊'],
 [338488, 8.8000000000000007, '入殓师'],
 [414361, 8.6999999999999993, '蝴蝶效应'],
 [677352, 8.6999999999999993, '让子弹飞'],
 [270494, 8.8000000000000007, '春光乍泄'],
 [244643, 8.9000000000000004, '玛丽和马克思'],
 [111733, 9.1999999999999993, '大闹天宫'],
 [295606, 8.8000000000000007, '心灵捕手'],
 [189568, 8.9000000000000004, '末代皇帝'],
 [292721, 8.8000000000000007, '阳光灿烂的日子'],
 [254400, 8.8000000000000007, '幽灵公主'],
 [252833, 8.8000000000000007, '第六感'],
 [359281, 8.6999999999999993, '重庆森林'],
 [389844, 8.6999999999999993, '禁闭岛'],
 [345885, 8.8000000000000007, '布达佩斯大饭店'],
 [271656, 8.6999999999999993, '大鱼'],
 [142601, 9.0, '狩猎'],
 [284871, 8.6999999999999993, '哈利·波特与魔法石'],
 [296911, 8.6999999999999993, '射雕英雄传之东成西就'],
 [344355, 8.5999999999999996, '致命ID'],
 [248165, 8.8000000000000007, '甜蜜蜜'],
 [344588, 8.5999999999999996, '断背山'],
 [251749, 8.6999999999999993, '猫鼠游戏'],
 [166973, 8.9000000000000004, '一一'],
 [367791, 8.6999999999999993, '告白'],
 [289385, 8.8000000000000007, '阳光姐妹淘'],
 [373118, 8.5999999999999996, '加勒比海盗'],
 [166903, 8.9000000000000004, '上帝之城'],
 [97659, 9.1999999999999993, '摩登时代'],
 [162190, 8.9000000000000004, '穿条纹睡衣的男孩'],
 [565530, 8.5999999999999996, '阿凡达'],
 [237864, 8.6999999999999993, '爱在黎明破晓前'],
 [385266, 8.6999999999999993, '消失的爱人'],
 [188690, 8.8000000000000007, '风之谷'],
 [212467, 8.6999999999999993, '爱在日落黄昏时'],
 [181917, 8.8000000000000007, '侧耳倾听'],
 [275127, 8.5999999999999996, '倩女幽魂'],
 [146507, 8.9000000000000004, '红辣椒'],
 [241887, 8.6999999999999993, '恐怖直播'],
 [185888, 8.8000000000000007, '超脱'],
 [217398, 8.6999999999999993, '萤火虫之墓'],
 [304866, 8.6999999999999993, '驯龙高手'],
 [239308, 8.5999999999999996, '幸福终点站'],
 [195650, 8.6999999999999993, '菊次郎的夏天'],
 [144405, 8.9000000000000004, '小森林 夏秋篇'],
 [341432, 8.5, '喜剧之王'],
 [323425, 8.5999999999999996, '岁月神偷'],
 [232077, 8.6999999999999993, '借东西的小人阿莉埃蒂'],
 [82623, 9.1999999999999993, '七武士'],
 [405200, 8.5, '神偷奶爸'],
 [222549, 8.6999999999999993, '杀人回忆'],
 [102681, 9.0, '海洋'],
 [332455, 8.5, '真爱至上'],
 [210611, 8.6999999999999993, '电锯惊魂'],
 [415291, 8.5, '贫民窟的百万富翁'],
 [191225, 8.6999999999999993, '谍影重重3'],
 [149579, 8.8000000000000007, '喜宴'],
 [266681, 8.5999999999999996, '东邪西毒'],
 [295660, 8.5, '记忆碎片'],
 [220414, 8.5999999999999996, '雨人'],
 [257769, 8.5999999999999996, '怪兽电力公司'],
 [440539, 8.5, '黑天鹅'],
 [391224, 8.6999999999999993, '疯狂原始人'],
 [179698, 8.6999999999999993, '英雄本色'],
 [154659, 8.6999999999999993, '燃情岁月'],
 [127219, 8.8000000000000007, '卢旺达饭店'],
 [112345, 8.9000000000000004, '虎口脱险'],
 [189074, 8.6999999999999993, '7号房的礼物'],
 [300454, 8.5, '恋恋笔记本'],
 [125724, 8.9000000000000004, '小森林 冬春篇'],
 [320997, 8.5, '傲慢与偏见'],
 [208380, 8.5999999999999996, '海边的曼彻斯特'],
 [290089, 8.6999999999999993, '哈利·波特与死亡圣器(下)'],
 [168987, 8.6999999999999993, '萤火之森'],
 [138798, 8.8000000000000007, '教父3'],
 [86319, 9.0, '完美的世界'],
 [156471, 8.6999999999999993, '纵横四海'],
 [151799, 8.8000000000000007, '荒蛮故事'],
 [105774, 8.8000000000000007, '二十二'],
 [135526, 8.8000000000000007, '魂断蓝桥'],
 [259388, 8.5, '猜火车'],
 [194663, 8.5999999999999996, '穿越时空的少女'],
 [201714, 8.8000000000000007, '玩具总动员3'],
 [260957, 8.5, '花样年华'],
 [97486, 9.0, '雨中曲'],
 [183786, 8.5999999999999996, '心迷宫'],
 [214531, 8.5999999999999996, '时空恋旅人'],
 [351836, 8.4000000000000004, '唐伯虎点秋香'],
 [392857, 8.5999999999999996, '超能陆战队'],
 [110358, 8.8000000000000007, '我是山姆'],
 [309924, 8.5999999999999996, '蝙蝠侠：黑暗骑士崛起'],
 [199924, 8.5999999999999996, '人工智能'],
 [139242, 8.6999999999999993, '浪潮'],
 [285601, 8.4000000000000004, '冰川时代'],
 [289504, 8.4000000000000004, '香水'],
 [288650, 8.5, '朗读者'],
 [132226, 8.6999999999999993, '罗生门'],
 [174301, 8.8000000000000007, '请以你的名字呼唤我'],
 [251364, 8.5999999999999996, '爆裂鼓手'],
 [85770, 8.9000000000000004, '追随'],
 [138571, 8.6999999999999993, '一次别离'],
 [104317, 8.8000000000000007, '未麻的部屋'],
 [181166, 8.5999999999999996, '撞车'],
 [334741, 8.6999999999999993, '血战钢锯岭'],
 [135259, 8.6999999999999993, '可可西里'],
 [182221, 8.5, '战争之王'],
 [343703, 8.3000000000000007, '恐怖游轮'],
 [89868, 8.8000000000000007, '地球上的星星'],
 [116667, 8.6999999999999993, '梦之安魂曲'],
 [176988, 8.6999999999999993, '达拉斯买家俱乐部'],
 [270993, 8.5999999999999996, '被解救的姜戈'],
 [192717, 8.5, '阿飞正传'],
 [112326, 8.6999999999999993, '牯岭街少年杀人事件'],
 [200329, 8.5, '谍影重重'],
 [166328, 8.5, '谍影重重2'],
 [204653, 8.5, '魔女宅急便'],
 [240090, 8.6999999999999993, '头脑特工队'],
 [164479, 8.8000000000000007, '房间'],
 [63374, 9.0, '忠犬八公物语'],
 [87474, 8.9000000000000004, '惊魂记'],
 [110499, 8.6999999999999993, '碧海蓝天'],
 [179269, 8.5, '再次出发之纽约遇见你'],
 [231647, 8.4000000000000004, '青蛇'],
 [157071, 8.5999999999999996, '小萝莉的猴神大叔'],
 [53476, 9.1999999999999993, '东京物语'],
 [312322, 8.3000000000000007, '秒速5厘米'],
 [84575, 8.9000000000000004, '哪吒闹海'],
 [109454, 8.6999999999999993, '末路狂花'],
 [169778, 8.5999999999999996, '海盗电台'],
 [111040, 8.6999999999999993, '绿里奇迹'],
 [147035, 8.5999999999999996, '终结者2：审判日'],
 [424177, 8.3000000000000007, '源代码'],
 [267159, 8.5999999999999996, '模仿游戏'],
 [192005, 8.5, '新龙门客栈'],
 [162903, 8.5, '黑客帝国3：矩阵革命'],
 [147043, 8.5, '勇闯夺命岛'],
 [189831, 8.5, '这个男人来自地球'],
 [125973, 8.6999999999999993, '一个叫欧维的男人决定去死'],
 [129304, 8.5999999999999996, '卡萨布兰卡'],
 [494602, 8.4000000000000004, '你的名字。'],
 [46323, 9.1999999999999993, '城市之光'],
 [221714, 8.4000000000000004, '变脸'],
 [132083, 8.5999999999999996, '荒野生存'],
 [53099, 9.0999999999999996, '迁徙的鸟'],
 [159426, 8.5, 'E.T. 外星人'],
 [192409, 8.4000000000000004, '发条橙'],
 [231469, 8.4000000000000004, '无耻混蛋'],
 [479894, 8.3000000000000007, '初恋这件小事'],
 [53709, 9.0999999999999996, '黄金三镖客'],
 [191992, 8.4000000000000004, '美国丽人'],
 [121427, 8.8000000000000007, '爱在午夜降临前'],
 [178607, 8.4000000000000004, '英国病人'],
 [60049, 9.0, '无人知晓'],
 [110300, 8.5999999999999996, '燕尾蝶'],
 [120585, 8.5999999999999996, '非常嫌疑犯'],
 [328162, 8.3000000000000007, '疯狂的石头'],
 [112286, 8.5999999999999996, '叫我第一名'],
 [90201, 8.9000000000000004, '勇士'],
 [242926, 8.3000000000000007, '穆赫兰道'],
 [190730, 8.5999999999999996, '无敌破坏王'],
 [352129, 8.3000000000000007, '国王的演讲'],
 [77399, 8.8000000000000007, '步履不停'],
 [137843, 8.5, '血钻'],
 [99101, 8.5999999999999996, '上帝也疯狂'],
 [186988, 8.4000000000000004, '彗星来的那一夜'],
 [103282, 8.5999999999999996, '枪火'],
 [278772, 8.3000000000000007, '蓝色大门'],
 [97025, 8.5999999999999996, '大卫·戈尔的一生'],
 [134046, 8.5, '遗愿清单'],
 [59825, 9.0, '我爱你'],
 [89377, 8.6999999999999993, '千钧一发'],
 [139223, 8.5, '荒岛余生'],
 [48744, 9.0, '爱·回家'],
 [119390, 8.5, '黑鹰坠落'],
 [131277, 8.8000000000000007, '聚焦'],
 [131618, 8.5, '麦兜故事'],
 [148685, 8.4000000000000004, '暖暖内含光']]

In [70]:

    
from IPython.display import display_html, HTML
HTML('<iframe src=http://nbviewer.jupyter.org/github/computational-class/bigdata/blob/gh-pages/vis/douban250bubble.html \
     width=1000 height=500></iframe>')

Out[70]:

Assignment:¶

Scrape the content of the latest issue published by the Fudan New Media (复旦新媒体) WeChat official account.

Scraping ten years of proposals from the Jiangsu Provincial CPPCC (江苏省政协)¶

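Open http://www.jszx.gov.cn/zxta/2019ta/

The proposal list does not appear in the page's HTML source, so the data must be loaded by JavaScript.

Inspect the requests in the browser's Network panel and find proposalList.jsp.

Check its headers and find the form data (year, pagenum, pagesize).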

http://www.jszx.gov.cn/zxta/2019ta/

In [4]:

import requests
from bs4 import BeautifulSoup

In [5]:

form_data = {'year':2019,
        'pagenum':1,
        'pagesize':20
}
url = 'http://www.jszx.gov.cn/wcm/zxweb/proposalList.jsp'
content = requests.get(url, params=form_data)  # sent as query-string parameters
content.encoding = 'utf-8'
js = content.json()

In [6]:

js['data']['totalcount']

Out[6]:

'424'

In [7]:

dat = js['data']['list']
pagenum = js['data']['pagecount']

Collecting the links to all proposals¶

In [147]:

for i in range(2, pagenum+1):
    print(i)
    form_data['pagenum'] = i
    content = requests.get(url, params=form_data)  # sent as query-string parameters
    content.encoding = 'utf-8'
    js = content.json()
    for j in js['data']['list']:
        dat.append(j)
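
The pagination logic above can be wrapped in a function that reads the page count from the first response and then walks the remaining pages. This is a sketch under assumptions: `collect_all_pages` and `fake_fetch_page` are hypothetical names, and the fake payload only mimics the `data.list` / `data.pagecount` layout used above. Since `totalcount` came back as the string `'424'`, the sketch calls `int()` on `pagecount` in case it is a string as well.

```python
def collect_all_pages(fetch_page):
    """fetch_page(pagenum) must return the decoded JSON dict for that page."""
    first = fetch_page(1)
    records = list(first['data']['list'])  # copy, so the response is not mutated
    page_count = int(first['data']['pagecount'])  # accepts '3' as well as 3
    for pagenum in range(2, page_count + 1):
        records.extend(fetch_page(pagenum)['data']['list'])
    return records


# Offline stand-in: 5 records served 2 per page, i.e. 3 pages.
def fake_fetch_page(pagenum, pagesize=2, total=5):
    start = (pagenum - 1) * pagesize
    last = min(start + pagesize, total)
    items = [{'rownum': n} for n in range(start + 1, last + 1)]
    return {'data': {'list': items, 'pagecount': '3', 'totalcount': str(total)}}


records = collect_all_pages(fake_fetch_page)
print([r['rownum'] for r in records])
```

Against the real endpoint, `fetch_page` would wrap the `requests.get(url, form_data)` call shown above with `pagenum` set to the requested page.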

In [149]:

len(dat)

Out[149]:

In [150]:

dat[0]

Out[150]:

{'personnel_name': '邹正',
 'pkid': '18b1b347f9e34badb8934c2acec80e9e',
 'proposal_number': '0001',
 'publish_time': '2019-01-12 16:04:48',
 'reason': '关于完善城市环卫公厕指引系统的建议',
 'rownum': 1,
 'type': '城乡建设',
 'year': '2019'}

In [155]:

import pandas as pd

df = pd.DataFrame(dat)
df.head()

Out[155]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year
0	邹正	18b1b347f9e34badb8934c2acec80e9e	0001	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	0002	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019
2	韩鸣明	9d9b03f2e78345faa265eb99ce49e97e	0003	2019-01-12 16:24:23	关于加快建立省民营经济发展推进机制的建议	3	经济发展	2019
3	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	0004	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	4	工业商贸	2019
4	段绪强	ce60d71296764cfe997d62bb2c0990af	0005	2019-01-12 16:21:46	深化落实金融政策举措 ,促进民营企业高质量发展	5	财税金融	2019

In [158]:

df.groupby('type').size()

Out[158]:

type
农林水利     4
医卫体育    45
城乡建设    25
工业商贸    34
政治建设    18
教育事业    58
文化宣传    34
法制建设    24
社会事业    77
科学技术    25
经济发展    52
统战综合     4
财税金融    12
资源环境    24
dtype: int64

Scraping the proposal content¶

http://www.jszx.gov.cn/zxta/2019ta/index_61.html?pkid=18b1b347f9e34badb8934c2acec80e9e

http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=18b1b347f9e34badb8934c2acec80e9e

In [163]:

url_base = 'http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid='
urls = [url_base + i  for i in df['pkid']]

In [176]:

import sys
def flushPrint(www):
    sys.stdout.write('\r')
    sys.stdout.write('%s' % www)
    sys.stdout.flush()
    
text = []
for k, i in enumerate(urls):
    flushPrint(k)
    content = requests.get(i)
    content.encoding = 'utf-8'
    js = content.json()
    js = js['data']['binfo']['_content']
    soup = BeautifulSoup(js, 'html.parser') 
    text.append(soup.text)
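
The `_content` field holds an HTML fragment, and `soup.text` keeps only its text nodes. To show what that step does, here is a standard-library sketch of the same idea. `TextExtractor` and `html_to_text` are hypothetical names, and the sample fragment is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(html_to_text('<p>调研情况：</p><p>近期<b>调研</b>发现</p>'))
```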

In [177]:

len(text)

Out[177]:

In [178]:

df['content'] = text

In [179]:

df.head()

Out[179]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year	content
0	邹正	18b1b347f9e34badb8934c2acec80e9e	0001	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019	调研情况： 2015 年 4 月 1 日，习近平总书记首次提出要坚持不懈地推进“厕所革...
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	0002	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019	调研情况：近期，省政协学习委员会组织部分委员对我省信息通信业发展情况进行考察调研，总的感到，...
2	韩鸣明	9d9b03f2e78345faa265eb99ce49e97e	0003	2019-01-12 16:24:23	关于加快建立省民营经济发展推进机制的建议	3	经济发展	2019	调研情况：习近平总书记在全国民营企业座谈会上指出，要把支持民营企业发展作为一项重要任...
3	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	0004	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	4	工业商贸	2019	调研情况：早在2012年，美国国会的一份报告就声称，华为和中兴通讯可能涉嫌从事威胁美国...
4	段绪强	ce60d71296764cfe997d62bb2c0990af	0005	2019-01-12 16:21:46	深化落实金融政策举措 ,促进民营企业高质量发展	5	财税金融	2019	调研情况：2018年，国家支持民营企业融资所出台的政策众多、且力度空前。这在一定程度上提振了...

In [181]:

df.to_csv('../data/jszx2019.csv', index = False)

In [182]:

dd = pd.read_csv('../data/jszx2019.csv')
dd.head()

Out[182]:

	personnel_name	pkid	proposal_number	publish_time	reason	rownum	type	year	content
0	邹正	18b1b347f9e34badb8934c2acec80e9e	1	2019-01-12 16:04:48	关于完善城市环卫公厕指引系统的建议	1	城乡建设	2019	调研情况： 2015 年 4 月 1 日，习近平总书记首次提出要坚持不懈地推进“厕所革...
1	省政协学习委员会	da43aae2378244faa961dd1224d1343e	2	2019-01-12 16:04:48	关于加强老小区光纤化改造的建议	2	城乡建设	2019	调研情况：近期，省政协学习委员会组织部分委员对我省信息通信业发展情况进行考察调研，总的感到，...
2	韩鸣明	9d9b03f2e78345faa265eb99ce49e97e	3	2019-01-12 16:24:23	关于加快建立省民营经济发展推进机制的建议	3	经济发展	2019	调研情况：习近平总书记在全国民营企业座谈会上指出，要把支持民营企业发展作为一项重要任...
3	许文前	c0a1626a1bb744ebb0852cf25b21fb0a	4	2019-01-12 15:42:19	加强科技创新，推动制造业转型升级	4	工业商贸	2019	调研情况：早在2012年，美国国会的一份报告就声称，华为和中兴通讯可能涉嫌从事威胁美国...
4	段绪强	ce60d71296764cfe997d62bb2c0990af	5	2019-01-12 16:21:46	深化落实金融政策举措 ,促进民营企业高质量发展	5	财税金融	2019	调研情况：2018年，国家支持民营企业融资所出台的政策众多、且力度空前。这在一定程度上提振了...
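
In the table above, `proposal_number` has changed from `0001` to `1`: `read_csv` infers an integer column and the leading zeros are lost. Passing `dtype` for that column keeps it as text. The sketch below reproduces the effect on a tiny in-memory CSV. The two rows are copied from the table, and the use of `io.StringIO` instead of a file is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

csv_text = (
    'proposal_number,reason\n'
    '0001,关于完善城市环卫公厕指引系统的建议\n'
    '0002,关于加强老小区光纤化改造的建议\n'
)

# Default type inference: the column becomes integers and the zeros disappear.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str preserves the identifiers exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'proposal_number': str})

print(inferred['proposal_number'].tolist())
print(as_text['proposal_number'].tolist())
```

The same `dtype` argument would apply when reading `'../data/jszx2019.csv'`.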
