6.5K+ Stars! MiniCPM-V: A Series of On-Device Multimodal LLMs for Image-Text Understanding That Take Image and Text Input and Produce High-Quality Text Output

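Project Overview

MiniCPM-V^[1] is a series of on-device multimodal large language models (MLLMs) developed by the OpenBMB organization and designed for vision-language understanding.

The models take images and text as input and produce high-quality text output.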

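Project Features

Use Cases

The MiniCPM-V models suit a range of scenarios, including but not limited to:

Multimodal understanding of and interaction with images and text.
Efficient on-device deployment on mobile devices and personal computers.
Multilingual support for applications with a global audience.
Tasks such as image recognition, scene understanding, and text generation.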

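Main Models

Since February 2024, four versions of the model have been released, all aiming for strong performance and efficient deployment. The most notable models in the MiniCPM-V series are:

MiniCPM-Llama3-V 2.5: The latest and most capable model in the MiniCPM-V series, with 8B parameters. The project reports that it surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3 in overall performance. It also supports multimodal conversation in more than 30 languages, including English, Chinese, French, Spanish, and German.
MiniCPM-V 2.0: The lightest model in the MiniCPM-V series, with 2B parameters. The project reports that it outperforms larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It accepts image input of any aspect ratio up to 1.8 million pixels (for example, 1344×1344), is on par with Gemini Pro in scene-text understanding, and matches GPT-4V in low hallucination rate.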
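
As a quick check on the pixel figure above, the following sketch computes whether an image fits within a pixel budget and, if not, the size that would bring it under the budget while preserving the aspect ratio. The budget of 1344×1344 pixels comes from the text; the helper name `fit_to_pixel_budget` and the downscaling strategy are illustrative assumptions, not the model's actual preprocessing.

```python
import math

# Budget taken from the text: about 1.8 million pixels (1344 x 1344 = 1,806,336).
MAX_PIXELS = 1344 * 1344


def fit_to_pixel_budget(width: int, height: int,
                        max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height), scaled down if needed to fit within max_pixels.

    Hypothetical helper for illustration only; the aspect ratio is preserved
    up to integer rounding.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already within budget, leave unchanged
    # Scaling both sides by sqrt(max_pixels / pixels) scales the area
    # by max_pixels / pixels; flooring keeps the result under the budget.
    scale = math.sqrt(max_pixels / pixels)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
```

For instance, `fit_to_pixel_budget(1344, 1344)` returns the size unchanged, while a 4000×3000 photo would be reduced to a size whose pixel count is no larger than the budget.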

Usage

The project can be tried through an online demo or run locally.

Online Demo

The project provides two online demos on Hugging Face Spaces:

MiniCPM-Llama3-V 2.5^[2]
MiniCPM-V 2.0^[3]

Local Installation

Clone the repository and change into its directory:

git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V

Create a conda environment:

conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V

Install the dependencies:

pip install -r requirements.txt

Running the WebUI

# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda

# For Macs with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
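
The two launch commands differ only in the `--device` value and the MPS fallback environment variable. A minimal sketch of choosing between them from a script follows. `build_demo_command` is a hypothetical helper; only the script name, the `--device` flag, and the `PYTORCH_ENABLE_MPS_FALLBACK` variable are taken from the commands above.

```python
import os
import sys


def build_demo_command(device: str) -> tuple[dict[str, str], list[str]]:
    """Return (env, argv) for launching the web demo on the given device.

    Hypothetical helper: it only mirrors the two shell commands shown above.
    """
    if device not in ("cuda", "mps"):
        raise ValueError(f"unsupported device: {device!r}")
    env = dict(os.environ)  # copy, so the current process is not modified
    if device == "mps":
        # Same variable the shell command sets before the MPS launch.
        env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    argv = [sys.executable, "web_demo_2.5.py", "--device", device]
    return env, argv
```

The result can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root.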

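Inference

Model zoo: Different versions of the MiniCPM-V models are provided to suit different hardware and memory requirements.
Multi-turn conversation: The provided code example can be used for multi-turn conversational inference.
Inference on Mac: An example of running inference on a Mac with MPS is provided.
Deployment on mobile phones: MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 can be deployed on Android phones.
Inference with llama.cpp: MiniCPM-Llama3-V 2.5 supports inference with llama.cpp.
Inference with vLLM: An example and steps are provided for running MiniCPM-V 2.0 inference with vLLM.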

Fine-tuning

Simple fine-tuning of MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 with Hugging Face is supported.
Fine-tuning of the MiniCPM-V series with the SWIFT framework is supported.

Example

For inference, refer to the following example.

Example image

Inference code

from chat import MiniCPMVChat, img2base64
import torch
import json

torch.manual_seed(0)

chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/airplane.jpeg')

# First round chat 
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

Output:

"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
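
The two-round pattern in the inference code above (append the assistant's answer, append the next user question, re-serialize the history) can be wrapped in a small loop. The sketch below assumes a callable with the same shape as `chat_model.chat`, that is, one that accepts a dict with `image` and `question` keys and returns a string. `run_conversation` is a hypothetical helper and not part of the repository.

```python
import json
from typing import Callable


def run_conversation(chat_fn: Callable[[dict], str],
                     image_b64: str,
                     questions: list[str]) -> list[str]:
    """Ask the questions in order, passing the full history on every call.

    Hypothetical helper: chat_fn is assumed to behave like chat_model.chat
    in the example above.
    """
    msgs: list[dict] = []
    answers: list[str] = []
    for question in questions:
        msgs.append({"role": "user", "content": question})
        # The history is serialized to JSON on each round, as in the example.
        inputs = {"image": image_b64, "question": json.dumps(msgs)}
        answer = chat_fn(inputs)
        # Record the reply so the next round sees it as context.
        msgs.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```

With the objects from the example, a call might look like `run_conversation(chat_model.chat, im_64, ["Tell me the model of this aircraft.", "Introduce something about Airbus A380."])`.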