🍉Scrapy's three data parsing methods

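Scrapy provides its own data-extraction mechanism, the Selector. Selectors are built on parsel, which uses lxml under the hood, and they support XPath expressions, CSS selectors, and regular expressions through one API. Because the parsing itself is done by lxml, a C-backed library, extraction is fast.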

🌵XPath selectors

Test document

html = '''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

Build a Selector object

from scrapy import Selector

response = Selector(text=html)

Select the a nodes

result = response.xpath('//a')

Select the img nodes inside the a nodes

result.xpath('./img')

Extract the data (extract() returns a list of strings)

result.extract()

Select node text and attributes

response.xpath('//a/text()').extract()
response.xpath('//a/@href').extract()

Get only the first match

response.xpath('//a[@href="image1.html"]/text()').extract_first()

XPath selectors are the parsing method you will probably use most in Scrapy~

🌵CSS selectors

Scrapy's selectors also support CSS selectors: call response.css() with a CSS selector to pick out the matching elements.

The test document is the same one as above.

# Select all a nodes; returns a SelectorList
res = response.css('a')
# Extract the a node whose href is image1.html; returns a list
print(response.css('a[href="image1.html"]').extract())
# Find the img nodes inside that a node; returns a list
print(response.css('a[href="image1.html"] img').extract())
# Extract only the first matching node
print(response.css('a[href="image1.html"] img').extract_first())
# Extract the text
print(response.css('a[href="image1.html"]::text').extract_first())
# Extract an attribute
print(response.css('a[href="image1.html"] img::attr(src)').extract_first())

🌵Regex matching

Scrapy's selectors also support regular expressions. The text in the sample a nodes looks like Name: My image 1; to extract only the part after Name:, use the re() method.

response.xpath('//a/text()').re(r'Name:\s(.*)')

re() is given a regular expression in which the group (.*) is the part to capture. It returns the captured text as a list of strings.

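# With two groups, every capture from every match comes back in one flat list
print(response.xpath('//a/text()').re(r'(.*?):\s(.*)'))

# re() needs a selection to work on; '.' selects the document itself.
# The pattern says <br> because that is how the parsed <br /> is serialized.
print(response.xpath('.').re(r'Name:\s(.*)<br>'))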
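To see what these patterns capture without running Scrapy at all, here is a minimal sketch that imitates re() with Python's standard re module. The helper name selector_re and the hard-coded texts list are assumptions for illustration, not Scrapy API; the list stands in for what //a/text() yields for the first two sample links.

```python
import re

# Stand-in for response.xpath('//a/text()').extract() on the first two links
texts = ['Name: My image 1 ', 'Name: My image 2 ']


def selector_re(pattern, strings):
    """Hypothetical helper: apply pattern to each string, flatten all groups."""
    results = []
    for s in strings:
        for match in re.findall(pattern, s):
            # findall yields a tuple when the pattern has several groups
            results.extend(match if isinstance(match, tuple) else [match])
    return results


one_group = selector_re(r'Name:\s(.*)', texts)
two_groups = selector_re(r'(.*?):\s(.*)', texts)

print(one_group)   # ['My image 1 ', 'My image 2 ']
print(two_groups)  # ['Name', 'My image 1 ', 'Name', 'My image 2 ']
```

The r prefix makes the pattern a raw string, so \s reaches the regex engine unchanged instead of being treated as a Python string escape.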

🍉Demo

Let's do a small exercise to reinforce this~

🌴spider.py

from urllib.parse import urljoin
import scrapy
from ..items import DoubanItem


class DoubSpider(scrapy.Spider):
    name = 'doub'
    # allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie entry sits in a <div class="info"> block
        for info in response.xpath('//div[@class="info"]'):
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanItem()
            title = info.xpath('div[@class="hd"]/a/span/text()').extract()
            movieinfo = info.xpath('div[@class="bd"]/p/text()').extract()
            star = info.xpath('div[@class="bd"]/div[@class="star"]/span/text()').extract_first()
            # Join the pieces, then strip all whitespace, including newlines
            item['title'] = ''.join(''.join(title).split())
            item['movieinfo'] = ''.join(';'.join(movieinfo).split())
            item['star'] = star
            yield item

        # Handle pagination: follow the next-page link if there is one
        next_url = response.xpath('//span[@class="next"]/link/@href').extract_first()
        if next_url:
            yield scrapy.Request(urljoin(response.url, next_url), callback=self.parse)
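Two small steps in parse() are easy to check outside Scrapy: the whitespace clean-up and the pagination URL. Here is a minimal sketch using only the standard library. The helper name squash, the sample strings, and the relative link ?start=25&filter= are all assumed values for illustration, not output captured from Douban.

```python
from urllib.parse import urljoin


def squash(parts, sep=''):
    """Hypothetical helper: join parts with sep, then drop all whitespace."""
    # str.split() with no argument splits on any whitespace run,
    # including newlines and non-breaking spaces (\xa0)
    return ''.join(sep.join(parts).split())


# Assumed fragments, shaped like what the title and info XPaths might return
title = squash(['Movie', '\xa0/\xa0Another Name'])
info = squash(['\n  Director: A  ', '\n  1994 / US\n '], sep=';')

# A relative link that holds only a query string keeps the base path
next_page = urljoin('https://movie.douban.com/top250', '?start=25&filter=')

print(title)      # Movie/AnotherName
print(info)       # Director:A;1994/US
print(next_page)  # https://movie.douban.com/top250?start=25&filter=
```

One side effect is worth knowing: this clean-up also removes the spaces inside names, as Another Name shows. If that matters, strip each fragment individually instead.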

🌴items.py

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    movieinfo = scrapy.Field()
    star = scrapy.Field()

🌴pipeline.py

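from itemadapter import ItemAdapter
import pymongo


class DoubanMogodbPipeline:
    # With no arguments, MongoClient connects to localhost:27017
    client = pymongo.MongoClient()
    db = client['python']  # select the database

    def process_item(self, item, spider):
        # MongoDB stores documents, so convert the item to a plain dict first
        doc = ItemAdapter(item).asdict()
        # insert() was removed in PyMongo 4; insert_one() is its replacement
        self.db['xj'].insert_one(doc)
        # Return the item so that any later pipeline still receives it
        return item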
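A pipeline only runs once it is registered in the project's settings.py. The post does not show the project layout, so the sketch below assumes a project package named douban (hypothetical) with the pipeline in the pipeline.py file shown above; adjust the dotted path to match your own project.

```python
# settings.py (fragment)
# 'douban.pipeline' is an assumed module path, not taken from the post
ITEM_PIPELINES = {
    # The number sets the order: pipelines with lower values run first
    'douban.pipeline.DoubanMogodbPipeline': 300,
}
```

With this in place, running scrapy crawl doub from the project directory should send every yielded item through process_item().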