Web Scraping in Practice, Part 1: A Certain Expert's Blog

Environment Setup

pip install requests
pip install beautifulsoup4
# lxml provides the 'lxml' parser used in the examples below
pip install lxml

Requests: HTTP for Humans

We'll use P4tt0n's blog, https://p4tt0n.github.io/, as the example ψ(｀∇´)ψ

First, import the Requests library. Its motto is "HTTP for Humans", and it makes it easy to send requests to a website and talk to it:

# A quick test
import requests

# Add headers; some sites require them, but they are optional
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}
r = requests.get('https://p4tt0n.github.io/', headers=headers)

print(r.status_code)  # HTTP status code of the response
print(r.text)  # page source as text
# r.json() parses a JSON body; this page returns HTML, so calling it here raises an error
# print(r.json())

The data is returned successfully.
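Since r.json() only works when the body really is JSON, a small wrapper can check first. The following is a minimal sketch, not part of the original post: the function names fetch and body_as_json_or_text and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def body_as_json_or_text(resp):
    """Return parsed JSON if the server labels the body as JSON, else the text."""
    content_type = resp.headers.get("Content-Type", "")
    if "application/json" in content_type:
        return resp.json()
    return resp.text


def fetch(url, timeout=10.0):
    """GET a URL, fail loudly on HTTP errors, and return JSON or text."""
    # Without a timeout, requests can wait forever on a stalled server
    resp = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    resp.raise_for_status()
    return body_as_json_or_text(resp)
```

Calling fetch('https://p4tt0n.github.io/') should then return the page source as a string, because the blog serves HTML rather than JSON.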

Requests can also send POST requests:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}

# POST request, for when you need to submit data, e.g. to log in
data = {'users': 'admin', 'password': 'admin'}
r = requests.post('https://p4tt0n.github.io/', data=data, headers=headers)
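To see what such a POST actually sends, you can build the request without sending it. This is only a sketch: the https://example.com/login URL is a made-up placeholder, and the form fields are simply copied from the snippet above.

```python
import requests

# Hypothetical login endpoint, for illustration only
url = "https://example.com/login"
data = {"users": "admin", "password": "admin"}

# Build the request without sending it, so we can inspect what would go out
prepared = requests.Request("POST", url, data=data).prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # users=admin&password=admin
```

A dict passed as data= is form-encoded. To send a JSON body instead, requests accepts a json= argument.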

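Once logged in, you can create a Session object to keep the session alive. It carries the cookies over, so you don't have to log in again for every request: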

# Create a Session object
sess = requests.Session()
# Log in first ('login url' is a placeholder)
sess.post('login url', data=data, headers=headers)
# Then visit other URLs within the same session
sess.get('other urls')
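A session can also hold default headers, so they need not be passed on every call. A rough sketch follows; the User-Agent string and the example.com URL are assumptions for illustration, and nothing is sent over the network.

```python
import requests

sess = requests.Session()
# Headers set on the session are sent with every request made through it
sess.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical UA string

# prepare_request() merges the session settings into a request without sending it
req = requests.Request("GET", "https://example.com/other")  # placeholder URL
prepared = sess.prepare_request(req)

print(prepared.headers["User-Agent"])
```

Cookies stored on the session are merged in the same way when a request is prepared.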

BeautifulSoup: Drink Up This Beautiful Soup

After fetching the full page source with requests, we need BeautifulSoup to process the data further.

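BeautifulSoup is a Python library for extracting data from HTML or XML files. Its name is a nod to "tag soup", the old term for messy, malformed web markup, which the library aims to handle gracefully. The name comes from Chapter 10 of Alice's Adventures in Wonderland, "The Lobster Quadrille":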

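Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!

Beau-ootiful Soo-oop!
Beau-ootiful Soo-oop!
Soo-oop of the e-e-evening,
Beautiful, beautiful Soup!

Beautiful Soup! Who cares for fish,
Game, or any other dish?
Who would not give all else for two
Pennyworth only of beautiful Soup?
Pennyworth only of beautiful Soup?

Beau-ootiful Soo-oop!
Beau-ootiful Soo-oop!
Soo-oop of the e-e-evening,
Beautiful, beauti-FUL SOUP!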

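This is why the official BeautifulSoup documentation uses so much text from Alice's Adventures in Wonderland in its examples. Below we use the official example to show just how beautiful BeautifulSoup is: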

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup object named soup
soup = BeautifulSoup(html, 'lxml')

# Print the document with standard indentation
print(soup.prettify())


# Get the title tag
print(soup.title)

<title>The Dormouse's story</title>

# Get the title text
print(soup.title.text)

The Dormouse's story

# Get all text content
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

# Locate by tag name
print(soup.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Locate by attribute
print(soup.find_all(attrs={'id': 'link1'}))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

# Locate by tag + attribute
print(soup.find_all('a', id='link1'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
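Besides find_all, two other lookups come in handy: find() for a single element and select() for CSS selectors. The sketch below reuses a trimmed copy of the three links from the example and uses Python's built-in html.parser instead of lxml, which is a substitution made here so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

# html.parser ships with Python, so this sketch does not need lxml
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None) instead of a list
first = soup.find("a", id="link1")
print(first["href"])

# select() takes a CSS selector; "a.sister" means <a> tags with class "sister"
links = {a["id"]: a["href"] for a in soup.select("a.sister")}
print(links)

# get_text() skips HTML comments, so the first link has no visible text
texts = [a.get_text() for a in soup.select("a.sister")]
print(texts)
```

Attributes are read like dictionary keys (a["href"]). Using a.get("href") instead returns None when the attribute is missing.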

Scraping the Blog in Practice

Here we try to scrape all the articles on the blog. First, tidy up the returned HTML and look for patterns:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://p4tt0n.github.io/')
soup = BeautifulSoup(r.text, 'lxml')
# Print with standard indentation
print(soup.prettify())

The article titles turn out to be stored like this:


<a class="article-title" href="/2024/10/10/%E6%98%A5%E7%A7%8B%E4%BA%91%E9%95%9CInitial/" title="春秋云镜Initial">
        春秋云镜Initial
       </a>

Select the a tags with class="article-title" and print the title attribute of each, which gives the titles of all the blog posts:


# 'class' is a reserved word in Python, so the keyword argument is spelled class_ (with a trailing underscore)
for link in soup.find_all('a',class_="article-title"):
    print(link.get('title'))

Scraped successfully:

春秋云镜Initial
深入浅出.user.ini
ThinkPHP v5.0.24反序列化代码审计
Nodejs沙箱逃逸
命令执行
深入浅出PHP反序列化
Hello World
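The href values in the page source are relative and percent-encoded, so they need one more step before you can request them. Here is a minimal sketch using the title link shown earlier; the function name article_links and the choice of html.parser are assumptions made for this example.

```python
from urllib.parse import unquote, urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://p4tt0n.github.io/"

# The title link as it appears in the page source shown above
snippet = """
<a class="article-title" href="/2024/10/10/%E6%98%A5%E7%A7%8B%E4%BA%91%E9%95%9CInitial/" title="春秋云镜Initial">
        春秋云镜Initial
       </a>
"""


def article_links(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every a.article-title tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get("title"), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", class_="article-title")
    ]


for title, url in article_links(snippet):
    print(title, url)
    # unquote() turns the percent-encoded path back into readable text
    print(unquote(url))
```

urljoin resolves the relative path against the blog's base URL. The percent-encoded form is the one to pass to requests.get, while unquote is only for display.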

In the same way, scrape the content of all the articles:


# 'class' is a reserved word in Python, so the keyword argument is spelled class_ (with a trailing underscore)
for link in soup.find_all('div',class_="content"):
    print(link.get_text())

Scraped successfully.
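The raw get_text() output usually carries a lot of blank lines and indentation. One way to tidy it up is the separator and strip arguments. The sketch below runs on made-up markup: the sample HTML and the function name article_texts are hypothetical, and the real blog theme may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real theme's structure may differ
sample = """
<div class="content">
  <h2>Hello World</h2>
  <p>Welcome to my blog.</p>
</div>
<div class="sidebar">ignored</div>
"""


def article_texts(html):
    """Return the cleaned text of every div with class "content"."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        # strip=True trims each text fragment and drops the empty ones;
        # separator="\n" puts each remaining fragment on its own line
        div.get_text(separator="\n", strip=True)
        for div in soup.find_all("div", class_="content")
    ]


print(article_texts(sample))
```

The same function could be fed r.text from the request above in place of the sample string.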
