A Python Crawler for Girl Photo Sets (Full Source Code)

葫芦侠五楼 • March 9, 2024, 11:01 PM • Python • 327 views

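This article is for beginners. It walks through a Python crawler that downloads a photo set, as a quick, from-scratch introduction to web scraping with Python. In only about 25 lines of code it covers the network request, data parsing, and data storage, making it a small but complete crawler.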

1. Goal

Given the entry URL of a gallery, download every image in that gallery.

The gallery entry URL is:

https://pic.yesky.com/306/2147478806.shtml

2. Site Analysis

Opening the gallery entry URL shows a page that contains preview thumbnails of every image in the gallery.
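
As a minimal sketch of how those thumbnails can be located, the snippet below runs the same XPath expression used in the full source against a small inline HTML sample. The sample markup, the file names a.jpg and b.jpg, and the helper name extract_thumbnail_urls are assumptions for illustration only. The real page is larger and its structure may differ.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the gallery page markup.
SAMPLE_HTML = """
<html><body>
<ul class="previewPic clearfix">
  <li><a href="#"><img src="https://d-pic-image.yesky.com/180x320/uploadImages/a.jpg"></a></li>
  <li><a href="#"><img src="https://d-pic-image.yesky.com/180x320/uploadImages/b.jpg"></a></li>
</ul>
<img src="https://example.com/unrelated.png">
</body></html>
"""

def extract_thumbnail_urls(page_source):
    """Return the src of every <img> inside the preview list."""
    html = etree.HTML(page_source)
    # Only images nested under the <ul> whose class attribute is exactly
    # 'previewPic clearfix' are selected; other images are ignored.
    return html.xpath("//ul[@class='previewPic clearfix']//img/@src")

thumbnail_urls = extract_thumbnail_urls(SAMPLE_HTML)
print(thumbnail_urls)
```

The predicate compares the whole class attribute, so it would stop matching if the site added or reordered class names on that list.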

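Comparing the URL of a preview thumbnail with the URL of the matching full-size image shows a direct mapping: replacing the thumbnail prefix https://d-pic-image.yesky.com/180x320/ with https://pic-image.yesky.com/ yields the full-size image URL.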
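
The mapping can be illustrated with a short sketch that uses the two URL prefixes from the full source. The path uploadImages/example.jpg and the function name thumbnail_to_full_url are made up for the example and are not taken from the site.

```python
# Prefixes taken from the full source; everything after them is kept unchanged.
THUMBNAIL_PREFIX = "https://d-pic-image.yesky.com/180x320/"
FULL_SIZE_PREFIX = "https://pic-image.yesky.com/"

def thumbnail_to_full_url(thumbnail_url):
    """Swap the thumbnail prefix for the full-size prefix.

    URLs that do not start with the thumbnail prefix are returned unchanged.
    """
    if thumbnail_url.startswith(THUMBNAIL_PREFIX):
        return FULL_SIZE_PREFIX + thumbnail_url[len(THUMBNAIL_PREFIX):]
    return thumbnail_url

# Hypothetical thumbnail URL, for illustration only.
sample_thumbnail = "https://d-pic-image.yesky.com/180x320/uploadImages/example.jpg"
print(thumbnail_to_full_url(sample_thumbnail))
```

Checking the prefix explicitly, rather than calling str.replace on the whole string, is a small design choice here: it makes it obvious when a URL does not follow the expected pattern.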

In summary, the chain is: gallery entry URL → preview thumbnail URLs → image file URLs.

3. Full Source Code

#""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
#
# Copyright (c) 2024 愤怒的it男, All Rights Reserved.
# FileName : code.py
# Date     : 2024.01.15
# Author   : 愤怒的it男
# Version  : 1.0.0
# Note     : Follow the WeChat official account 【愤怒的it男】
#
#""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

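import os

import requests
from lxml import etree

def getPrePicUrlList(picSetPageUrl, headers):
    # Fetch the gallery entry page and collect the src of every preview thumbnail.
    response = requests.get(url=picSetPageUrl, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    html = etree.HTML(response.text)
    prePicUrlList = html.xpath("//ul[@class='previewPic clearfix']//img/@src")
    return prePicUrlList

def getPicUrlList(prePicUrlList):
    # Map each thumbnail URL to its full-size image URL by swapping the URL prefix.
    picUrlList = []
    for url in prePicUrlList:
        picUrl = url.replace("https://d-pic-image.yesky.com/180x320/", "https://pic-image.yesky.com/")
        picUrlList.append(picUrl)
    return picUrlList

def getPic(index, picUrl, headers):
    # Download one image and save it as output/<n>.<suffix>, numbering from 1.
    suffix = picUrl.split(".")[-1]
    response = requests.get(url=picUrl, headers=headers, timeout=10)
    response.raise_for_status()  # avoid saving an error page as an image file
    os.makedirs('output', exist_ok=True)  # open() fails if the folder does not exist
    with open('output/'+str(index+1)+'.'+suffix, 'wb') as file:
        file.write(response.content)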

def main():
    picSetPageUrl = 'https://pic.yesky.com/306/2147478806.shtml'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    prePicUrlList = getPrePicUrlList(picSetPageUrl, headers)
    picUrlList = getPicUrlList(prePicUrlList)
    for index, picUrl in enumerate(picUrlList):
        getPic(index, picUrl, headers)
    
if __name__ == "__main__":
    main()
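
The data storage step names each file by its position in the gallery plus the extension taken from the image URL. The sketch below shows one possible way to build that path so that a query string in the URL does not end up in the file name. The helper name build_output_path, the .jpg fallback, and the sample URLs are assumptions for illustration, not part of the original script.

```python
import os
from urllib.parse import urlparse

def build_output_path(index, pic_url, output_dir="output"):
    """Return a path such as output/1.jpg for the image at position `index`."""
    # Read the extension from the URL path only, ignoring any query string.
    url_path = urlparse(pic_url).path
    suffix = os.path.splitext(url_path)[1]
    if not suffix:
        suffix = ".jpg"  # assumption: fall back to .jpg when the URL has no extension
    # Files are numbered from 1, matching the index+1 used in getPic.
    return os.path.join(output_dir, f"{index + 1}{suffix}")

# Hypothetical image URLs, for illustration only.
print(build_output_path(0, "https://pic-image.yesky.com/uploadImages/example.jpg"))
print(build_output_path(1, "https://pic-image.yesky.com/uploadImages/example.png?v=2"))
```

A helper like this could take the place of the string concatenation inside getPic, although the simple split on "." is enough while the image URLs end in a plain file extension.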