[Web Scraping] Part 2: urllib handlers
2. urllib handlers
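A handler is the second way, after urlopen(), that the urllib library offers for sending browser-like requests to a server.
So why should we learn it?
As our business logic grows more complex, customizing the Request object alone can no longer meet our needs, so we turn to handlers. The proxy support covered in 2.2 is one example.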
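To make the idea of a handler more concrete, the sketch below inspects which handlers an opener carries. It only uses `urllib.request.build_opener()` and the `handlers` list of the returned `OpenerDirector`; the variable names and the choice of `debuglevel=1` are illustrative assumptions, not something the article prescribes.

```python
import urllib.request

# With no arguments, build_opener() returns an OpenerDirector that already
# carries the default handlers (HTTP, redirects, error processing, ...).
default_opener = urllib.request.build_opener()
default_names = [type(h).__name__ for h in default_opener.handlers]

# Passing a handler instance adds it to the chain and replaces the default
# handler of the same class. debuglevel=1 is just an illustrative setting.
debug_handler = urllib.request.HTTPHandler(debuglevel=1)
custom_opener = urllib.request.build_opener(debug_handler)
custom_names = [type(h).__name__ for h in custom_opener.handlers]

print(default_names)
print(custom_names)
```

Both lists should look almost identical, which suggests that an opener is essentially a chain of handlers and that customizing requests means swapping or adding links in that chain.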
2.1 Basic use of a handler
import urllib.request
url = 'https://www.xxx.com'
headers = {
'Referer': 'https://www.xxx.com/link?url=Rg4aCsjouphcJ5OGEv4z8RlR9Wc4ERQipSjI1HqkVfG&wd=&eqid=af45491f000c228e000000036357fbd3',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
'Cookie': 'BIDUPSID=F8062DDC1948E6DAC11A23B9185211BC; PSTM=1644290569; ab_sr=1.0.1_ZWEzMTIzMDUyNmJjOTQ3MmExYTRhNDkwMGY0M2FkM2U0NzE5MzM2OGY0NjFhNTBlZDJjM2FmNDY2NDg0MDlhY2FkMDlmY2IyODdmNTIzMDg2YzU2MThlYTdhODUxYWRiMWRmN2IzYmNjOWY5ZjNkZWM0MjY3OTRkNGZkOGZjMDliYjY4YzFhNDU0NzdjMDYxNGQ0MTNhZDM3ZjZiYmIzMjUxZDNlNTU3NGM0MmUzYzdjZWU2M2FiZDY4MzFlZjM1; BD_HOME=1; H_PS_PSSID=36546_37553_37518_37355_37584_36885_37626_36786_37536_37581_26350; delPer=0; BD_CK_SAM=1; PSINO=1; H_PS_645EC=0e91%2B6teFT2SdG7NZB2Vx8CCNZDdA40fjlteqshvLmXMdu9%2F3Xm9G32mkpQ; baikeVisitId=f0b5771a-9a25-4525-8569-22025278cb44'
}
# Disguise the request as coming from a browser
request = urllib.request.Request(url=url, headers=headers)
# Create the handler object
handler = urllib.request.HTTPHandler()
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# Call the opener; this step is equivalent to urlopen()
response = opener.open(request).read().decode()
print(response)
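Since `opener.open()` plays the same role as `urlopen()`, an opener can also be installed globally so that later `urlopen()` calls go through it. The following is a minimal sketch of that idea using `urllib.request.install_opener()`. The `data:` URL is an assumption made only so the example runs without network access; in a real crawler it would be the target page.

```python
import urllib.request

handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)

# After install_opener(), plain urlopen() calls are routed through this opener.
urllib.request.install_opener(opener)

# A data: URL is used here only so the sketch works offline.
with urllib.request.urlopen('data:text/plain;charset=utf-8,hello') as response:
    body = response.read().decode('utf-8')

print(body)
```

Installing an opener is optional: calling `opener.open(request)` directly, as in the listing above, keeps the customization local to one call site.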
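2.2 Proxy servers
What a proxy is used for:
- Bypassing IP-based access restrictions, for example to reach websites hosted abroad
- Accessing resources inside an organization or company network
- Speeding up access, for instance when the proxy caches frequently requested content
- Hiding the real IP address
- Continuing to crawl through a proxy after your own IP has been banned
The code below shows how to use one.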
import urllib.request
url = 'https://www.xxx.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
'Cookie': 'BIDUPSID=F8062DDC1948E6DAC11A23B9185211BC; PSTM=1644290569; BAIDUID=F8062DDC1948E6DA5E48F5D4AE020258:FG=1; __yjs_duid=1_79801aa46424f5e5bc57241a10490cb21644290773722; BD_UPN=12314753; BDUSS_BFESS=NvckF2cUNuMkhWY25QdUs5ekhQWjczM09RTUY0ekRCV0VNbDJhaG13OVVOMVpqRUFBQUFBJCQAAAAAAAAAAAEAAABujHGhs7~q2NPrzqLCtgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFSqLmNUqi5jcV; BAIDUID_BFESS=F8062DDC1948E6DA5E48F5D4AE020258:FG=1; BA_HECTOR=al2hag850kakak0k0g8k86f11hlfsa11b; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; B64_BOT=1; RT="z=1&dm=baidu.com&si=zuxuc8uuzs&ss=l9oas0nm&sl=2&tt=4h8&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=4cl&ul=ash&hd=ath"; H_PS_645EC=169b7wti5DafF3FtoCuCPgYMW%2BvxNf%2F8byaWwhPXCkfgwRW7AdS%2FgK3ReXA; baikeVisitId=e0c37c9d-7434-4297-808b-9b0fb72d4635'
}
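# Disguise the request as coming from a browser
request = urllib.request.Request(url=url, headers=headers)
# Proxy server settings. Keys are URL schemes, so this 'http' entry is used
# only for http:// URLs; add an 'https' entry to proxy https:// URLs as well.
proxies = {
    # Free or paid proxies are available from providers such as Kuaidaili (快代理)
    'http': '223.96.90.216:8085'
}
# Create the handler object
handler = urllib.request.ProxyHandler(proxies)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# Call the opener; this step is equivalent to urlopen()
response = opener.open(request).read().decode('utf-8')
with open('baidu.html', 'w', encoding='utf-8') as fs:
    fs.write(response)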
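One detail that is easy to miss is that the keys of the mapping passed to `ProxyHandler` are URL schemes. The sketch below checks this without sending any request: according to the urllib documentation, a `ProxyHandler` gets one `<protocol>_open` method per protocol in the mapping. The proxy address is simply reused from the listing above as a placeholder, and the variable names are illustrative.

```python
import urllib.request

# Placeholder address reused from the listing above; it is not assumed to be alive.
proxy_address = '223.96.90.216:8085'

http_only = urllib.request.ProxyHandler({'http': proxy_address})
both_schemes = urllib.request.ProxyHandler({
    'http': proxy_address,
    'https': proxy_address,
})

# ProxyHandler exposes one "<scheme>_open" method per key in the mapping,
# so a missing key means requests with that scheme bypass the proxy.
print(hasattr(http_only, 'http_open'), hasattr(http_only, 'https_open'))
print(hasattr(both_schemes, 'http_open'), hasattr(both_schemes, 'https_open'))
```

If the target URL starts with https://, a mapping that contains only an 'http' key will most likely leave the request unproxied, so it is worth checking that the scheme of the URL matches a key in the mapping.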
2.3 Proxy pool
import urllib.request
import random
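# Simulate a simple proxy pool (keys are URL schemes: these 'http' entries
# apply only to http:// URLs; add 'https' entries for https:// targets)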
proxies_pool = [
{'http': '61.216.185.88:60808'},
{'http': '223.96.90.216:8085'},
{'http': '58.20.184.187:9091'},
{'http': '183.247.202.208:30001'}
]
# Pick one entry from the list at random
proxies = random.choice(proxies_pool)
url = 'https://www.xxx.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
'Referer': 'https://www.xxx.com/link?url=BaNnxSgMcoGHDufIkglmT9y6jMWok_p8P6JD0GdUFue&wd=&eqid=fe3d76e8000024600000000663580923',
'Cookie': 'Hm_lvt_f4f76646cd877e538aa1fbbdf351c548=1666713895; Hm_lpvt_f4f76646cd877e538aa1fbbdf351c548=1666713895'
}
# Disguise the request as coming from a browser
request = urllib.request.Request(url=url, headers=headers)
# Create the handler
handler = urllib.request.ProxyHandler(proxies)
# Build the opener
opener = urllib.request.build_opener(handler)
# Send the request as a browser would and read the response
response = opener.open(request).read().decode('utf-8')
with open('ip.html', 'w', encoding='utf-8') as fs:
    fs.write(response)
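Free proxies fail often, so picking a single random entry may not be enough in practice. As one possible extension of the pool above, the sketch below tries the proxies in random order and returns the first successful response. The helper names `build_proxy_opener` and `fetch_with_pool`, the `opener_factory` parameter, and the 5-second timeout are all hypothetical choices made for this illustration.

```python
import random
import urllib.error
import urllib.request


def build_proxy_opener(proxy):
    """Return an opener that routes requests through one proxy mapping."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy))


def fetch_with_pool(url, pool, headers=None, timeout=5,
                    opener_factory=build_proxy_opener):
    """Try the proxies in random order and return the first decoded body."""
    candidates = list(pool)
    random.shuffle(candidates)
    request = urllib.request.Request(url, headers=headers or {})
    last_error = None
    for proxy in candidates:
        opener = opener_factory(proxy)
        try:
            with opener.open(request, timeout=timeout) as response:
                return response.read().decode('utf-8')
        except OSError as error:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            last_error = error
    raise RuntimeError('every proxy in the pool failed') from last_error
```

A call might look like `fetch_with_pool(url, proxies_pool, headers=headers)`. The `opener_factory` argument exists mainly so the retry logic can be exercised with a stand-in opener instead of real proxies.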
Summary
That is all for today. I hope you find it helpful!