4K+ Star！Zerox OCR：一个基于AI的文档OCR工具

欢迎关注我，持续获取更多内容，感谢赞&在看~

Zerox OCR 简介

Zerox OCR^[1] 是一个简单易用的文档识别工具，它能够将各种格式的文档（如PDF、Word、图片等）转换为图像，并通过GPT模型将图像内容转换为Markdown格式。

这一过程完全自动化，使得文档的视觉呈现能够被AI轻松“阅读”和处理。

项目特点

主要特点

一键式操作：用户只需上传文件，Zerox OCR会自动处理并返回Markdown格式的内容。
支持多种文件格式：包括PDF、Word文档、图片等。
自动化图像转换：将文档转换为一系列图像，以便模型处理。
灵活的配置选项：用户可以根据需要调整并发处理数量、是否清理临时文件等。
支持多种模型：包括gpt-4o-mini等，用户可以根据需求选择合适的模型。

使用场景

Zerox OCR适用于需要将文档内容转换为可由AI处理的格式的场景，例如：

自动化文档处理：将纸质文档转换为电子格式，便于后续的自动化处理。
数据提取：从复杂的布局、表格、图表中提取数据。
内容迁移：将旧有文档内容迁移到新的系统中，以便于统一管理和检索。

项目使用

安装

Node.js环境: 确保你的系统已经安装了Node.js环境。
安装Zerox: 使用npm或yarn来安装Zerox OCR。

npm install zerox-ocr
# 或者
yarn add zerox-ocr

安装依赖: 需要安装graphicsmagick和ghostscript作为依赖，这些工具用于处理图像。

在Ubuntu上，可以使用以下命令：

sudo apt-get install graphicsmagick
sudo apt-get install ghostscript

在Mac上，可以使用Homebrew：

brew install graphicsmagick
brew install ghostscript

使用示例

基本使用: 下面是一个基本的使用示例，将PDF文件转换为Markdown格式。

const { ZeroxOCR } = require('zerox-ocr');

// 创建ZeroxOCR实例
const zerox = new ZeroxOCR();

// 转换文件
zerox.convert('path/to/your/file.pdf')
  .then(markdown => {
    console.log(markdown); // 输出Markdown格式的文本
  })
  .catch(error => {
    console.error('转换失败:', error);
  });

高级配置: 可以通过传递配置对象来自定义ZeroxOCR的行为。

const config = {
  cleanup: true, // 清理临时文件
  concurrency: 2, // 并发处理数量
  model: 'latest', // 使用最新的视觉模型
  systemPrompt: '自定义系统提示' // 自定义系统提示
};

zerox.convert('path/to/your/file.pdf', config)
  .then(markdown => {
    console.log(markdown);
  })
  .catch(error => {
    console.error('转换失败:', error);
  });

Python SDK: 如果你使用的是Python环境，可以按照类似的步骤安装和使用Zerox OCR的Python SDK。

from pyzerox import zerox
import os

model = "gpt-4o-mini"
os.environ["OPENAI_API_KEY"] = ""  # 你的API密钥

file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf"
output_dir = "./output_test"

result = await zerox(file_path=file_path, model=model, output_dir=output_dir)
print(result)

示例输出

假设你有一个包含文本、表格和图表的PDF文件，使用Zerox OCR转换后，你将得到一个Markdown格式的文本，其中包含了原始文档的内容、格式和结构。例如项目示例：

ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            content='| Type    | Description                          | Wrapper Class |n' +
                    '|---------|--------------------------------------|---------------|n' +
                    '| byte    | 8-bit signed 2s complement integer   | Byte          |n' +
                    '| short   | 16-bit signed 2s complement integer  | Short         |n' +
                    '| int     | 32-bit signed 2s complement integer  | Integer       |n' +
                    '| long    | 64-bit signed 2s complement integer  | Long          |n' +
                    '| float   | 32-bit IEEE 754 floating point number| Float         |n' +
                    '| double  | 64-bit floating point number         | Double        |n' +
                    '| boolean | may be set to true or false          | Boolean       |n' +
                    '| char    | 16-bit Unicode (UTF-16) character    | Character     |nn' +
                    'Table 26.2.: Primitive types in Javann' +
                    '### 26.3.1. Declaration & Assignmentnn' +
                    'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                    'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                    'its identifier. For example:nn' +
                    '```javan' +
                    'int numUnits;n' +
                    'double costPerUnit;n' +
                    'char firstInitial;n' +
                    'boolean isStudent;n' +
                    '```nn' +
                    'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                    'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                    'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                    'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                    'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                    'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                    'value must be assigned before they can be used or the compiler will issue an error.nn' +
                    'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                    'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                    '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                    'we can assign them values:nn' +
                    '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                    'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                    'boolean type, `false` is the default, and the default char value is `\0`, the null-terminating ' +
                    'character (zero in the ASCII table).',
            content_length=2333,
            page=1
        )
    ]
)