Crawler Series: Collecting Book Data from Dangdang


🌵 Part of the Scrapy crawler series

🌴 Likes, comments, and discussion are welcome~

🌱 Please follow for more 😘~

Table of Contents

 🍉 Project Introduction

 🍉 Project Walkthrough

🌴 spider.py

🌴 items.py

🌴 pipelines.py

🍉 Wrap-up


 🍉 Project Introduction

   We want to collect book data from Dangdang (title, author, price, comment count, publisher) and store it in a database.
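Everything below assumes a standard Scrapy project layout. If you are following along from scratch, the usual scaffolding commands look like this (the project name dangdang is an assumption inferred from the import path and class names in the code below):

scrapy startproject dangdang
cd dangdang
scrapy genspider dangbook dangdang.com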


 🍉 Project Walkthrough

🌴 spider.py

'''
coding:utf-8
@Software:PyCharm
@Time:2022/6/25 18:21
@Author:小久
'''

import scrapy
from ..items import DangdangItem
import re


class DangbookSpider(scrapy.Spider):
    name = 'dangbook'
    allowed_domains = ['dangdang.com']
    # start_urls = ['http://category.dangdang.com/cp01.03.00.00.00.00.html']

    def start_requests(self):
        # The category listing is paginated as pg1-, pg2-, ..., so build the page URLs directly.
        for i in range(1, 10):
            url = 'http://category.dangdang.com/pg{}-cp01.03.00.00.00.00.html'.format(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        titles = response.xpath('//p[@class="name"]/a/@title').extract()
        prices = response.xpath('//p[@class="price"]/span[@class="search_now_price"]/text()').extract()
        authors = response.xpath('//p[@class="search_book_author"]/span[1]/a[1]/@title').extract()
        comments = response.xpath('//p[@class="search_star_line"]/a/text()').extract()
        publishes = response.xpath('//p[5]/span[3]/a/text()').extract()
        for title, price, author, comment, publish in zip(titles, prices, authors, comments, publishes):
            # Build a fresh item for every book so each yielded item is independent.
            items = DangdangItem()
            items['title'] = title
            # Drop the leading currency symbol, e.g. "¥59.00" -> "59.00".
            items['price'] = price.split('¥')[1]
            items['author'] = author
            # Keep only the number from strings like "1234条评论".
            items['comment'] = re.findall('(.*?)条评论', comment)[0]
            items['publish'] = publish
            yield items






We use XPath to locate the title, author, price, comment count, and publisher. The page links follow an obvious pattern, so we just write one base URL and loop over the page number to handle pagination. The selectors can be checked interactively before running the spider, as in the sketch below.
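The quickest way to sanity-check those XPath expressions is Scrapy's interactive shell. The snippet below is a minimal sketch; the page structure may have changed since this article was written, so treat the expressions as assumptions carried over from the spider above:

# Launch the shell against one category page:
#   scrapy shell "http://category.dangdang.com/pg1-cp01.03.00.00.00.00.html"
# Then, at the prompt, try the same selectors used in parse():
titles = response.xpath('//p[@class="name"]/a/@title').extract()
prices = response.xpath('//p[@class="price"]/span[@class="search_now_price"]/text()').extract()
print(titles[:3], prices[:3])  # print a few sample values to confirm the selectors match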

🌴 items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    picture = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    author = scrapy.Field()
    comment = scrapy.Field()
    publish = scrapy.Field()

 The scraped fields are stored in the Item container, which makes writing them to the database later much easier; an Item supports dict-style access, as the small example below shows.
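For reference, this is all the pipeline in the next section relies on. A minimal sketch with made-up sample values:

item = DangdangItem()
item['title'] = '示例书名'       # hypothetical sample value
item['price'] = '59.00'
print(item['title'], dict(item))  # dict(item) returns a plain dict of the filled fields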

🌴 pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html



# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.utils.project import get_project_settings
import pymysql


class DangdangMysql:

    def open_spider(self, spider):
        # Read the MySQL connection parameters from settings.py when the spider starts.
        setting = get_project_settings()
        self.host = setting['HOST']
        self.user = setting['USER']
        self.password = setting['PASSWORD']
        self.charset = setting['CHARSET']
        self.db = setting['DATABASE']
        self.port = setting['PORT']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(host=self.host, user=self.user,
                                    password=self.password, charset=self.charset,
                                    db=self.db, port=self.port)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized query: quotes inside titles or author names cannot break the SQL.
        # id is passed as 0 so an AUTO_INCREMENT primary key assigns the real value.
        sql = ('insert into dangbooks(id, title, author, price, publish, comment) '
               'values(%s, %s, %s, %s, %s, %s)')
        try:
            self.cursor.execute(sql, (0, item['title'], item['author'],
                                      item['price'], item['publish'], item['comment']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close_process) when the spider finishes.
        self.cursor.close()
        self.conn.close()

 The data is written into the MySQL database. The pipeline reads its connection parameters from the project settings, so those keys, the pipeline registration, and the target table all have to exist; a sketch of each follows.
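The pipeline reads HOST, USER, PASSWORD, CHARSET, DATABASE, and PORT through get_project_settings(), and it also has to be registered in ITEM_PIPELINES. The settings.py fragment below is a minimal sketch: the credentials, database name, and the dangdang package path are placeholders, not the author's original configuration.

# settings.py (fragment) -- keys read by DangdangMysql in pipelines.py
HOST = '127.0.0.1'        # placeholder MySQL server address
PORT = 3306
USER = 'root'             # placeholder credentials
PASSWORD = '123456'
CHARSET = 'utf8mb4'
DATABASE = 'spider'       # database that holds the dangbooks table

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangMysql': 300,   # assumes the project package is named "dangdang"
}

The target table itself can be created once with a short helper script (again a sketch; adjust column names and sizes to taste):

# create_table.py -- one-off helper to create the dangbooks table
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', charset='utf8mb4', db='spider')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dangbooks (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            author VARCHAR(255),
            price VARCHAR(32),
            publish VARCHAR(255),
            comment VARCHAR(32)
        ) DEFAULT CHARSET=utf8mb4
    """)
conn.commit()
conn.close()

With the settings, table, and pipeline registration in place, the crawl is started from the project root with scrapy crawl dangbook.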

🍉 Wrap-up

 That wraps up this installment of the crawler series: collecting book data from Dangdang.

   
