Scrapy Crawler Notes 2

Python

Published: 2019-05-30

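CSS selectors: BeautifulSoup4

lxml only traverses the part of the document it needs, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the full DOM tree. Its time and memory overhead is therefore much higher, and its performance is lower than lxml's.

Beautiful Soup makes parsing HTML fairly simple. Its API is very friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.

Beautiful Soup 3 is no longer developed, so new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4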

Tool	Speed	Ease of use	Installation
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into 4 kinds:

Tag
NavigableString
BeautifulSoup
Comment
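To make the four kinds concrete, here is a minimal sketch. The one-line snippet is a made-up document (not the sample used later in these notes), and "html.parser" is used only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document, chosen so that every node type appears once.
snippet = '<p class="demo">plain text<!-- a comment --></p>'
soup = BeautifulSoup(snippet, "html.parser")

tag = soup.p                 # Tag: the <p> element itself
text_node = tag.contents[0]  # NavigableString: the text inside the tag
comment = tag.contents[1]    # Comment: a NavigableString subclass for <!-- ... -->

for node in (soup, tag, text_node, comment):
    print(type(node).__name__)
```

Because Comment is a subclass of NavigableString, code that only wants visible text usually has to check for Comment explicitly.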

1. Tag

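Put simply, a Tag is one of the tags in the HTML: title, head, a, p and so on. The tag together with the content it encloses is a Tag. Let's use Beautiful Soup to get some Tags:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" id="link1">Elsie</a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>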

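Writing soup followed by a tag name is an easy way to get that tag. The objects returned are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Finding all matching tags is covered later.

A Tag has two important attributes: name and attrs.

A practical example:

print(soup.name)
# [document]  # the soup object itself is special: its name is [document]

print(soup.head.name)
# head  # for any other tag, the value is the name of the tag itself

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints every attribute of the p tag; the result is a dict.

print(soup.p['class'])  # same as soup.p.get('class')
# ['title']  # get() returns None for a missing attribute, while [] raises KeyError

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
###################### Output
[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>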

Now that we have the tag itself, how do we get the text inside it? Simple: use .string.
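A short sketch of how .string behaves, using a made-up snippet and the bundled "html.parser". It also shows get_text(), which is the usual fallback when a tag contains more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the sample document from these notes.
markup = "<p><b>The Dormouse's story</b></p><div>a<i>b</i></div>"
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.b.string      # the single string inside <b>
nested_text = soup.p.string     # <p> has exactly one child, so .string looks through it
ambiguous = soup.div.string     # <div> has two children, so .string is None
all_text = soup.div.get_text()  # get_text() joins every string below the tag

print(title_text, nested_text, ambiguous, all_text)
```

The value returned by .string is a NavigableString; wrap it in str() if a plain Python string is needed.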

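Traversing the document tree

1. Direct children: the `.contents` and `.children` attributes

A tag's .contents attribute returns the tag's child nodes as a list, so a list index can be used to pick out any one of them.

.children does not return a list, but all child nodes can be obtained by iterating over it.

Printing .children shows that it is a list iterator object:

print(soup.head.children)
# <list_iterator object at 0x...>

for child in soup.body.children:
    print(child)

###################### Output



<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p>Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p>...</p>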

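2. All descendants: the `.descendants` attribute

.contents and .children only cover a tag's direct children. .descendants walks recursively through every node below the tag. As with .children, the result has to be iterated over to get at the content.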
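The difference between the three attributes is easiest to see side by side. The sketch below uses a small made-up fragment without whitespace between tags, so that the counts are easy to follow.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: a <div> with two <p> children, one of which nests a <b>.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct = div.contents  # a real list holding the two <p> tags
direct_names = [c.name for c in div.children if isinstance(c, Tag)]
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]
descendant_count = len(list(div.descendants))  # tags and text nodes together

print(direct_names)      # direct children only
print(descendant_names)  # every tag at any depth, in document order
print(descendant_count)
```

Text nodes show up in .descendants as well, which is why the isinstance check is needed before reading .name.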

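Searching the document tree

1.`find_all(name, attrs, recursive, text, **kwargs)`

The name argument finds every tag whose name is name. String objects (text nodes) are ignored automatically.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches it exactly. The following finds every <b> tag in the document: soup.find_all('b')

B. Passing a regular expression

If a regular expression is passed, Beautiful Soup filters tag names with the pattern's search() method. The following example finds every tag whose name starts with b, so both <body> and <b> are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

C. Passing a list

If a list is passed, Beautiful Soup returns everything that matches any element of the list. The following finds every <a> tag and every <b> tag in the document: soup.find_all(["a", "b"])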
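The three filter types can be compared on one small document. This is a sketch on a made-up fragment; only the tag names matter here.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment containing <body>, <p>, <b> and <a> tags.
doc = "<html><body><p><b>bold</b> and <a id='x'>link</a></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

by_string = [t.name for t in soup.find_all("b")]             # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any name in the list

print(by_string, by_regex, by_list)
```

Results always come back in document order, regardless of the order of the names in the list filter.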

2) Keyword arguments (roughly the same as filtering by id or another attribute in a CSS selector)

For example, find every tag whose id is link2: soup.find_all(id='link2')
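A few more keyword-argument forms, sketched on a made-up fragment (the third link and its "brother" class are invented for contrast). Since class is a reserved word in Python, Beautiful Soup spells that keyword class_.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the "brother" link exists only to show a non-match.
doc = """
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="brother" id="link3">Tom</a>
"""
soup = BeautifulSoup(doc, "html.parser")

by_id = soup.find_all(id="link2")                # filter on the id attribute
by_class = soup.find_all("a", class_="sister")   # class_ avoids the Python keyword
by_attrs = soup.find_all(attrs={"id": "link3"})  # same idea, passed as a dict
first = soup.find(id="link2")                    # find() returns one Tag or None

print([t["id"] for t in by_class])
```

The attrs dict form is useful for attribute names that are not valid Python identifiers, such as data-* attributes.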

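3) The text argument

The text argument searches the string content of the document. Like name, it accepts a string, a regular expression or a list. (Since Beautiful Soup 4.4.0 this argument is called string; text is the older name.)

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]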

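CSS selectors

This is another way of searching that achieves much the same result as find_all.

When writing CSS, a tag name is written bare, a class name is prefixed with . and an id is prefixed with #.
Elements can be filtered here in the same way using soup.select(), which returns a list.

(1) Search by tag name

For example, find every title tag: print(soup.select('title'))

(2) Search by class name

For example, find every tag whose class is sister: print(soup.select('.sister'))

(3) Search by id

For example, find the tag whose id is link1: print(soup.select('#link1'))

(4) Combined search

Combined search works just as it does in a CSS file, where tag names, class names and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two with a space:

print(soup.select('p #link1'))

To match direct children only, separate them with >: print(soup.select("head > title"))

(5) Search by attribute

Attributes can also be part of the selector. They go in square brackets. Because the attribute and the tag refer to the same node, there must be no space between them, otherwise nothing will match.

For example, find the a tags whose class is sister: print(soup.select('a[class="sister"]'))

Attribute selectors can likewise be combined with the forms above: parts that refer to different nodes are separated by a space, parts that refer to the same node are not.

(6) Getting the content

Every select call above returns a list, so iterate over it (or index into it) and call get_text() to get the content: print(soup.select('title')[0].get_text())

For lack of time, no worked case study is included here.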
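As a small stand-in for a case study, the sketch below runs each selector form from this section against a made-up document. The markup is an assumption modelled on the sister links used earlier; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical document with a title and two links inside a <p>.
doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p>
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
</p></body></html>
"""
soup = BeautifulSoup(doc, "html.parser")

titles = soup.select("title")               # (1) by tag name
sisters = soup.select(".sister")            # (2) by class
link1 = soup.select("#link1")               # (3) by id
nested = soup.select("p #link1")            # (4) descendant combination
child = soup.select("head > title")         # (4) direct child
by_attr = soup.select('a[class="sister"]')  # (5) attribute on the same node
title_text = titles[0].get_text()           # (6) text of the first match
first_sister = soup.select_one(".sister")   # first match, or None if nothing matches

print(title_text, [a["id"] for a in sisters])
```

Note that select() always returns a list, even for an id selector, so the result has to be indexed or iterated before calling get_text().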

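Data extraction with JSON and JsonPath

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for people to read and write, and easy for machines to parse and generate. It suits any scenario where data is exchanged, for example between a website's front end and back end.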

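The json module provides four functions, dumps, dump, loads and load, for converting between strings and Python data types.

1. json.loads() decodes a JSON-formatted string into a Python object

2. json.dumps() encodes a Python object as a JSON string and returns a str

# Note: by default json.dumps() escapes every non-ASCII character as \uXXXX
# Pass ensure_ascii=False to keep the characters as they are
# chardet.detect() takes bytes and returns a dict; confidence is how sure the detection is

import json
import chardet

dictStr = {"city": "北京", "name": "大刘"}

json.dumps(dictStr)
# '{"city": "\u5317\u4eac", "name": "\u5927\u5218"}'

chardet.detect(json.dumps(dictStr).encode("utf-8"))
# {'confidence': 1.0, 'encoding': 'ascii'}

print(json.dumps(dictStr, ensure_ascii=False))
# {"city": "北京", "name": "大刘"}

chardet.detect(json.dumps(dictStr, ensure_ascii=False).encode("utf-8"))
# {'confidence': 0.99, 'encoding': 'utf-8'}

3. json.dump() serialises a Python built-in type to JSON and writes it to a file

listStr = [{"city": "北京"}, {"name": "大刘"}]
with open("listStr.json", "w", encoding="utf-8") as f:
    json.dump(listStr, f, ensure_ascii=False)

4. json.load() reads the JSON text in a file and converts it to a Python type:

with open("listStr.json", encoding="utf-8") as f:
    strList = json.load(f)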
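The four functions can be checked with a round trip. This sketch uses only the standard library and writes to a temporary directory, so the file name is an assumption chosen for the example.

```python
import json
import os
import tempfile

record = {"city": "北京", "name": "大刘"}

escaped = json.dumps(record)                       # non-ASCII escaped as \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # characters kept as they are
restored = json.loads(escaped)                     # both forms decode to the same dict

# dump/load do the same through a file object; the path is a throwaway temp file.
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(record, fh, ensure_ascii=False)
with open(path, encoding="utf-8") as fh:
    from_file = json.load(fh)

print(escaped)
print(readable)
```

Opening the file with an explicit encoding matters when ensure_ascii=False is used, because the platform default encoding may not be UTF-8.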

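JsonPath

JsonPath is a library for information extraction: a tool for pulling specific pieces of information out of a JSON document. Implementations exist for several languages, including JavaScript, Python, PHP and Java.

JSON has a clear structure, is highly readable and has low complexity, so it is very easy to match against. JsonPath's syntax corresponds closely to XPath's.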
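To illustrate the XPath parallel without depending on a particular JsonPath package, here is a standard-library sketch of what the recursive-descent expression `$..name` does (the XPath analogue would be `//name`). The helper find_all_keys and the sample data are hypothetical; this is not the API of any JsonPath library.

```python
import json


def find_all_keys(node, key):
    """Collect every value stored under `key` at any depth, like JsonPath `$..key`."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all_keys(v, key))  # keep descending into the value
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all_keys(item, key))
    return found


# Hypothetical document with the key "name" at two different depths.
doc = json.loads('{"store": {"book": [{"name": "A"}, {"name": "B"}], "name": "shop"}}')
names = find_all_keys(doc, "name")
print(names)
```

A real JsonPath implementation adds wildcards, array slices and filter expressions on top of this basic traversal.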

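Notes:

json.loads() decodes a JSON-formatted string into a Python object. If json.loads fails, check the encoding of the JSON data being decoded.

In Python 3, json.loads() accepts a str, or bytes encoded as UTF-8, UTF-16 or UTF-32. The encoding parameter that Python 2 offered was removed in Python 3.9, so data in any other encoding has to be decoded to str first.

dataDict = json.loads(jsonStrGBK)

Here jsonStrGBK is JSON data. If it is encoded as GBK rather than UTF-8, the call above fails. Decode it explicitly instead:

  dataDict = json.loads(jsonStrGBK.decode("GBK"))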
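A minimal sketch of the situation described above, under the assumption that the data arrives as GBK-encoded bytes. The payload is invented for the example.

```python
import json

# Hypothetical payload, encoded as GBK the way a legacy server might send it.
payload = {"city": "北京"}
gbk_bytes = json.dumps(payload, ensure_ascii=False).encode("gbk")

# Decode to str with the real encoding first, then parse.
data = json.loads(gbk_bytes.decode("gbk"))

# Passing the raw bytes makes json assume a UTF encoding, which fails for GBK.
try:
    json.loads(gbk_bytes)
    parsed_raw_bytes = True
except ValueError:  # UnicodeDecodeError and JSONDecodeError both derive from ValueError
    parsed_raw_bytes = False

print(data, parsed_raw_bytes)
```

When the data comes from requests, setting response.encoding before reading response.text achieves the same decoding step.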

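## String encoding conversion

This is the biggest headache for Chinese programmers: garbled text is almost always caused by Chinese characters.
Encoding problems are actually easy to handle once you remember one thing:
#### Any encoding on any platform can be converted to and from Unicode
To convert between UTF-8 and GBK, first decode the UTF-8 data to Unicode, then encode the Unicode text as GBK. The reverse works the same way.
# In Python 3 a str is already Unicode text; encoded data is bytes
utf8Str = "你好地球".encode("UTF-8")  # UTF-8 encoded bytes

# 1. Decode the UTF-8 bytes to Unicode (str)
unicodeStr = utf8Str.decode("UTF-8")

# 2. Encode the Unicode text as GBK
gbkData = unicodeStr.encode("GBK")

# 3. Decode the GBK bytes back to Unicode
unicodeStr = gbkData.decode("gbk")

# 4. Encode the Unicode text as UTF-8 again
utf8Str = unicodeStr.encode("UTF-8")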
</code></pre>
<p><code>decode</code> converts data in some other encoding to Unicode</p>
<p><code>encode</code> converts Unicode to data in some other encoding</p>
<p><code>In one sentence: UTF-8 is one way of encoding the Unicode character set</code></p>
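<p>The decode-then-encode rule can be wrapped in a small helper. This is a sketch for Python 3; <code>transcode</code> is a hypothetical name, not a standard-library function.</p>

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from `source` to `target`, going through str (Unicode)."""
    return data.decode(source).encode(target)


utf8_bytes = "你好地球".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
round_trip = transcode(gbk_bytes, "gbk", "utf-8")

# Decoding with the wrong codec is what produces errors or garbled text.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

print(len(utf8_bytes), len(gbk_bytes), wrong_codec_ok)
```

<p>The byte lengths differ because UTF-8 uses three bytes for each of these characters while GBK uses two.</p>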
<p>Here is a combined example (best treated as a reference only):</p>



<pre class="line-numbers"><code class="language-Python"># qiushibaike.py

import requests
from lxml import etree

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'}

try:
    response = requests.get(url, headers=headers, timeout=10)
    resHtml = response.text

    html = etree.HTML(resHtml)
    # each post sits in a div whose id contains "qiushi_tag"
    result = html.xpath('//div[contains(@id,"qiushi_tag")]')

    for site in result:
        # xpath() returns str in Python 3, so no .encode('utf-8') is needed
        imgUrl = site.xpath('./div/a/img/@src')[0]
        username = site.xpath('./div/a/@title')[0]
        # username = site.xpath('.//h2')[0].text
        content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
        # number of votes
        vote = site.xpath('.//i')[0].text
        # print(site.xpath('.//*[@class="number"]')[0].text)
        # number of comments
        comments = site.xpath('.//i')[1].text

        print(imgUrl, username, content, vote, comments)

except Exception as e:
    print(e)
</code></pre>



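<h2>Python multithreading</h2>
<h3 id="queue（队列对象）">Queue (queue object)</h3>
<p>The queue module is part of the Python standard library and can be used directly with import queue (in Python 2 the module was named Queue). A queue is the most common way to exchange data between threads.</p>
<p>Thoughts on multithreading in Python</p>
<p>Locking is an important step when sharing resources, because compound operations on Python's built-in list, dict and so on are not thread safe. Queue is thread safe, so prefer a queue whenever it fits the use case.</p>
<ol>
<li>
<p>Initialisation: class queue.Queue(maxsize), FIFO (first in, first out)</p>
</li>
<li>
<p>Commonly used methods:</p>
<ul>
<li>
<p>Queue.qsize() returns the size of the queue</p>
</li>
<li>
<p>Queue.empty() returns True if the queue is empty, otherwise False</p>
</li>
<li>
<p>Queue.full() returns True if the queue is full, otherwise False</p>
</li>
<li>
<p>Queue.full() is tied to maxsize: the queue is full once it holds maxsize items</p>
</li>
<li>
<p>Queue.get([block[, timeout]]) removes and returns an item; timeout is how long to wait</p>
</li>
</ul>
</li>
<li>
<p>Create a queue object</p>
<ul>
<li>import queue</li>
<li>myqueue = queue.Queue(maxsize = 10)</li>
</ul>
</li>
<li>
<p>Put a value into the queue</p>
<ul>
<li>myqueue.put(10)</li>
</ul>
</li>
<li>
<p>Take a value out of the queue</p>
<ul>
<li>myqueue.get()</li>
</ul>
</li>
</ol>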
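<p>The methods listed above fit together in a producer/consumer pattern. The sketch below is a minimal Python 3 illustration with no networking; the worker function, the squaring task and the None sentinel are all assumptions made for the example.</p>

```python
import queue
import threading

tasks = queue.Queue(maxsize=10)  # bounded FIFO queue of work items
results = queue.Queue()          # unbounded queue collecting the output


def worker():
    """Square numbers taken from `tasks` until a None sentinel arrives."""
    while True:
        item = tasks.get()       # blocks until an item is available
        if item is None:
            tasks.task_done()
            break
        results.put(item * item)
        tasks.task_done()


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for n in range(5):
    tasks.put(n)
for _ in threads:
    tasks.put(None)              # one sentinel per worker

tasks.join()                     # returns once every item has been marked done
for t in threads:
    t.join()

squares = sorted(results.get() for _ in range(results.qsize()))
print(squares)
```

<p>Using a blocking get() plus a sentinel avoids polling empty() in a loop, which is both wasteful and prone to races between threads.</p>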
<p>Once again, an example explains it best (read the source for the details):</p>



<pre class="line-numbers"><code class="language-Python"># -*- coding:utf-8 -*-
import json
import threading
from queue import Empty, Queue

import requests
from lxml import etree


class thread_crawl(threading.Thread):
    '''
    Crawler thread: downloads pages and puts the HTML on data_queue
    '''

    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print("Starting " + self.threadID)
        self.qiushi_spider()
        print("Exiting ", self.threadID)

    def qiushi_spider(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and get()
                page = self.q.get_nowait()
            except Empty:
                break
            print('qiushi_spider=', self.threadID, ',page=', str(page))
            url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                'Accept-Language': 'zh-CN,zh;q=0.8'}
            # give up after several failed attempts to avoid an endless loop
            retries = 4
            while retries > 0:
                retries -= 1
                try:
                    content = requests.get(url, headers=headers, timeout=10)
                    data_queue.put(content.text)
                    break
                except Exception as e:
                    print('qiushi_spider', e)
            else:
                # the loop ended without break: every attempt failed
                print('timeout', url)


class Thread_Parser(threading.Thread):
    '''
    Parser thread: takes page HTML off the queue and extracts the posts
    '''

    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print('starting ', self.threadID)
        while not exitFlag_Parser:
            try:
                # get() removes and returns the item at the head of the queue.
                # With block=True (the default) it waits until an item is available;
                # with a timeout it raises Empty once the wait expires, so the
                # loop can re-check the exit flag without spinning.
                item = self.queue.get(timeout=0.1)
            except Empty:
                continue
            if item:
                self.parse_data(item)
            self.queue.task_done()
            print('Thread_Parser=', self.threadID, ',total=', total)
        print('Exiting ', self.threadID)

    def parse_data(self, item):
        '''
        Parse one page
        :param item: page HTML
        :return:
        '''
        global total
        try:
            html = etree.HTML(item)
            result = html.xpath('//div[contains(@id,"qiushi_tag")]')
            for site in result:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        vote = site.xpath('.//i')[0].text
                        comments = site.xpath('.//i')[1].text
                    except IndexError:
                        pass
                    record = {
                        'imgUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments,
                    }

                    with self.lock:
                        # one JSON object per line; the file is opened as UTF-8
                        self.f.write(json.dumps(record, ensure_ascii=False) + "\n")

                except Exception as e:
                    print('site in result', e)
        except Exception as e:
            print('parse_data', e)
        with self.lock:
            total += 1

data_queue = Queue()
exitFlag_Parser = False
lock = threading.Lock()
total = 0

def main():
    global exitFlag_Parser
    output = open('qiushibaike.json', 'a', encoding='utf-8')

    # queue the page numbers 1 to 10
    pageQueue = Queue(50)
    for page in range(1, 11):
        pageQueue.put(page)

    # start the crawler threads
    crawlthreads = []
    crawlList = ["crawl-1", "crawl-2", "crawl-3"]

    for threadID in crawlList:
        thread = thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # start the parser threads
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    for threadID in parserList:
        thread = Thread_Parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # wait for the crawler threads; they exit once pageQueue is empty
    for t in crawlthreads:
        t.join()

    # wait until every downloaded page has been parsed
    data_queue.join()

    # tell the parser threads it is time to exit
    exitFlag_Parser = True

    for t in parserthreads:
        t.join()
    print("Exiting Main Thread")
    with lock:
        output.close()


if __name__ == '__main__':
    main()
</code></pre>

小游

https://xiaoyou66.com/archives/1192/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 小游 when reposting!
