6.4K+ Stars! An AI-powered Python scraping library: just specify the information you want and it handles the scraping automatically

https://github.com/VinciGit00/Scrapegraph-ai

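Project Overview

ScrapeGraphAI is an AI-powered Python web scraping library. It uses large language models (LLMs) and direct graph logic to build scraping pipelines for websites and local documents (XML, HTML, JSON, and so on).

You only need to specify the information you want to extract, and the library does the scraping for you.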
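
To make the idea of a graph-based scraping pipeline more concrete, here is a toy sketch using only the standard library. It is not ScrapeGraphAI's internal code: the node names (fetch, parse, extract), the state dictionary, and the run_graph helper are all hypothetical, and the "extract" step is a simple keyword filter standing in for the LLM call.

```python
# Toy illustration only: this is NOT how ScrapeGraphAI is implemented.
# Node names, the state dict, and run_graph are hypothetical.
import re
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]
Node = Callable[[State], State]


def fetch(state: State) -> State:
    # Stand-in for downloading a page; the HTML is already in memory here.
    state["html"] = state["source"]
    return state


def parse(state: State) -> State:
    # Strip tags with a crude regex; a real pipeline would use an HTML parser.
    state["text"] = re.sub(r"<[^>]+>", " ", state["html"]).split()
    return state


def extract(state: State) -> State:
    # Stand-in for the LLM step: keep only words that appear in the prompt.
    wanted = {word.lower() for word in state["prompt"].split()}
    state["answer"] = [w for w in state["text"] if w.lower() in wanted]
    return state


# Each node has at most one outgoing edge; None marks the end of the graph.
NODES: Dict[str, Node] = {"fetch": fetch, "parse": parse, "extract": extract}
EDGES: Dict[str, Optional[str]] = {"fetch": "parse", "parse": "extract", "extract": None}


def run_graph(entry: str, state: State) -> State:
    # Walk the graph from the entry node, passing the shared state along.
    current: Optional[str] = entry
    while current is not None:
        state = NODES[current](state)
        current = EDGES[current]
    return state


final = run_graph(
    "fetch",
    {"source": "<h1>Projects</h1><p>Rotary Pendulum</p>", "prompt": "projects pendulum"},
)
print(final["answer"])
```

The point of the sketch is only the shape: small nodes that each transform a shared state, connected by edges that decide what runs next.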

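Use Cases

Extract information from a single page, for example with SmartScraperGraph.
Extract information from the top n search engine results, for example with SearchGraph.
Extract information from a website and generate an audio file, for example with SpeechGraph.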

Installation and Usage

Quick Install

Install ScrapeGraphAI with pip:

pip install scrapegraphai

To scrape JavaScript-based sites, you also need to install Playwright:

playwright install

It is best to install the library in a virtual environment to avoid conflicts with other libraries.
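
If you want a script to confirm that it is running inside a virtual environment before it imports anything heavy, a minimal check might look like the following. The in_virtualenv helper is a hypothetical name; the comparison relies on the standard sys.prefix and sys.base_prefix attributes, which differ inside environments created with Python's built-in venv module.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if not in_virtualenv():
    print("Warning: not running inside a virtual environment.")
```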

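Usage

ScrapeGraphAI provides three main scraping pipelines:

SmartScraperGraph: a single-page scraper that only needs a user prompt and an input source.
SearchGraph: a multi-page scraper that extracts information from the top n results of a search engine.
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.

You can use different LLMs through their APIs, such as OpenAI, Groq, Azure, and Gemini, or run local models with Ollama.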
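
Since hosted providers need an API key and local models need a base URL, it can help to build the "llm" part of the configuration in one place. The sketch below is an assumption-heavy convenience, not part of ScrapeGraphAI: build_llm_config is a hypothetical helper, and it only produces a plain dictionary using the keys that appear in this article's examples ("model", "temperature", "api_key", "base_url"). It reads the key from an environment variable so that it never ends up in source control.

```python
import os
from typing import Any, Dict, Optional


def build_llm_config(
    model: str,
    api_key_env: Optional[str] = None,
    base_url: Optional[str] = None,
) -> Dict[str, Any]:
    # Hypothetical helper: returns a dict shaped like the graph_config
    # examples in this article. It does not call ScrapeGraphAI itself.
    llm: Dict[str, Any] = {"model": model, "temperature": 0}
    if api_key_env is not None:
        key = os.environ.get(api_key_env)
        if not key:
            raise RuntimeError(f"Set the {api_key_env} environment variable first")
        llm["api_key"] = key
    if base_url is not None:
        llm["base_url"] = base_url
    return {"llm": llm}


# Local model served by Ollama: no API key, only a base URL.
local_config = build_llm_config("ollama/mistral", base_url="http://localhost:11434")
print(local_config)
```

For a hosted provider you would pass the name of the environment variable instead, for example build_llm_config("groq/gemma-7b-it", api_key_env="GROQ_API_KEY").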

Example Usage

SmartScraperGraph with a local model served by Ollama:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
 "llm": {
  "model": "ollama/mistral",
  "temperature": 0,
  "format": "json",
  "base_url": "http://localhost:11434",
 },
 "embeddings": {
  "model": "ollama/nomic-embed-text",
  "base_url": "http://localhost:11434",
 },
 "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
 prompt="List me all the projects with their descriptions",
 source="https://perinim.github.io/projects",
 config=graph_config
)

result = smart_scraper_graph.run()
print(result)
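
Printing the result is fine for a quick look, but you will often want to keep it. Here is a minimal sketch for saving it to disk, assuming the value returned by run() is JSON-serializable (for example a dict); the article does not confirm the exact return type. The save_result helper, the output file name, and sample_result are placeholders for illustration, not real scraped data.

```python
import json
import tempfile
from pathlib import Path


def save_result(result, path) -> str:
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese or Italian) readable.
    text = json.dumps(result, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
    return text


# Placeholder standing in for the value returned by smart_scraper_graph.run().
sample_result = {
    "projects": [
        {"title": "Example project", "description": "Placeholder description"}
    ]
}

out_path = Path(tempfile.gettempdir()) / "scrape_result.json"
save_result(sample_result, out_path)
print(f"Saved to {out_path}")
```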

SearchGraph with mixed models (Groq for the LLM, Ollama for the embeddings):

from scrapegraphai.graphs import SearchGraph

graph_config = {
 "llm": {
  "model": "groq/gemma-7b-it",
  "api_key": "GROQ_API_KEY",
  "temperature": 0
 },
 "embeddings": {
  "model": "ollama/nomic-embed-text",
  "base_url": "http://localhost:11434",
 },
 "max_results": 5,
}

search_graph = SearchGraph(
 prompt="List me all the traditional recipes from Chioggia",
 config=graph_config
)

result = search_graph.run()
print(result)

SpeechGraph with OpenAI:

from scrapegraphai.graphs import SpeechGraph

graph_config = {
 "llm": {
  "api_key": "OPENAI_API_KEY",
  "model": "gpt-3.5-turbo",
 },
 "tts_model": {
  "api_key": "OPENAI_API_KEY",
  "model": "tts-1",
  "voice": "alloy"
 },
 "output_path": "audio_summary.mp3",
}

speech_graph = SpeechGraph(
 prompt="Make a detailed audio summary of the projects.",
 source="https://perinim.github.io/projects/",
 config=graph_config,
)

result = speech_graph.run()
print(result)

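Documentation and Demos

Documentation and Examples

See the official Scrapegraph-ai documentation to learn more about how to use the library:

Scrapegraph-ai documentation: https://scrapegraph-ai.readthedocs.io/en/latest/
Scrapegraph-ai examples: https://github.com/VinciGit00/Scrapegraph-ai/tree/main/examples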