A GPT-4o-Level Multimodal LLM for Single-Image, Multi-Image, and High-FPS Video Understanding on Your Phone

MiniCPM-V 4.5

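MiniCPM-V 4.5 is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It significantly outperforms previous MiniCPM-V and MiniCPM-o models and introduces new practical features. Notable features of MiniCPM-V 4.5 include:

🔥 State-of-the-art vision-language capability. MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation covering 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B, in vision-language capability, making it the best-performing MLLM under 30B parameters.
🎬 Efficient high-FPS and long video understanding. Powered by a new unified 3D-Resampler, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448×448 video frames are jointly compressed into 64 video tokens, where most MLLMs would need 1,536 tokens. The model can therefore perceive significantly more video frames without increasing LLM inference cost. This yields state-of-the-art high-FPS (up to 10 FPS) and long video understanding on benchmarks such as Video-MME, LVBench, MLVU, MotionBench, and FavorBench.
⚙️ Controllable hybrid fast/deep thinking. MiniCPM-V 4.5 supports both a fast thinking mode for efficient, frequent use and a deep thinking mode for more complex problems. To trade off efficiency and performance across user scenarios, the two modes can be switched in a highly controllable way.
💪 Strong OCR, document parsing, and more. Based on the LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio up to 1.8 million pixels (e.g., 1344×1344), using 4x fewer visual tokens than most MLLMs. The model leads on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5. It also achieves state-of-the-art PDF document parsing performance on OmniDocBench. Building on the latest RLAIF-V and VisCPM techniques, it exhibits trustworthy behavior, outperforming GPT-4o-latest on MMHal-Bench, and supports more than 30 languages.
💫 Easy to use. MiniCPM-V 4.5 can be used in several ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) quantized models in int4, GGUF, and AWQ formats in 16 sizes, (3) SGLang and vLLM support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks with Transformers and LLaMA-Factory, (5) a quick local WebUI demo, (6) an optimized local iOS app for iPhone and iPad, and (7) an online web demo hosted on a server. See our Cookbook for full usage details!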
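The 96x video-token compression figure can be reproduced by counting patches. The sketch below is only a back-of-the-envelope check: the 14×14-pixel patch size is an assumption (the text does not state it), while the frame size, group size, and token counts are the numbers quoted in the feature list.

```python
# Back-of-the-envelope check of the video-token compression figures.
# ASSUMPTION: the vision encoder cuts each frame into 14x14-pixel patches.
PATCH_SIZE = 14
FRAME_SIDE = 448          # frames are 448x448 (from the text)
FRAMES_PER_GROUP = 6      # frames compressed jointly (from the text)
TOKENS_PER_GROUP = 64     # video tokens per group (from the text)
TYPICAL_TOKENS = 1536     # tokens most MLLMs need for 6 frames (from the text)

patches_per_frame = (FRAME_SIDE // PATCH_SIZE) ** 2        # 32 * 32 patches
patches_per_group = patches_per_frame * FRAMES_PER_GROUP   # patches in 6 frames

compression_vs_patches = patches_per_group / TOKENS_PER_GROUP
saving_vs_typical = TYPICAL_TOKENS / TOKENS_PER_GROUP

print(f"patches per group: {patches_per_group}")
print(f"compression vs. raw patches: {compression_vs_patches:.0f}x")
print(f"token saving vs. a typical MLLM: {saving_vs_typical:.0f}x")
```

Under that assumption, the 96x rate is measured against raw visual patches, while the saving relative to the 1,536-token baseline is 24x.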

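Key Techniques

Architecture: a unified 3D-Resampler for high-density video compression. MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping up to 6 consecutive video frames and jointly compressing them into just 64 tokens (the same number used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96x compression rate for video tokens. This lets the model process more video frames without additional LLM compute cost, enabling high-FPS and long video understanding. The architecture encodes images, multi-image inputs, and videos in a unified way, ensuring seamless capability and knowledge transfer.
Pre-training: unified learning of OCR and knowledge from documents. Existing MLLMs learn OCR capability and knowledge from documents in separate, isolated training approaches. We observe that the essential difference between the two approaches is the visibility of the text in the image. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively and appropriately between accurate text recognition (when the text is visible) and knowledge reasoning based on multimodal context (when the text is heavily obscured). This removes the reliance on error-prone document parsers when learning knowledge from documents and prevents hallucinations caused by over-augmented OCR data, yielding top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
Post-training: hybrid fast/deep thinking with multimodal RL. MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly improving fast-mode performance without compromising deep-mode capability. Combined with RLPR and RLAIF-V, it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.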
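The pre-training idea of corrupting text regions with varying noise levels can be pictured with a toy example. This is an illustrative sketch only, not the authors' data pipeline: the function name `corrupt_region`, the grayscale list-of-lists image, and the "replace a fraction of pixels with random values" noise model are all assumptions made for the illustration.

```python
import random

def corrupt_region(image, box, noise_level, seed=0):
    """Return a copy of `image` with pixels inside `box` randomly overwritten.

    image:       2D list of grayscale ints in 0..255 (hypothetical toy format)
    box:         (top, left, bottom, right), end-exclusive, marking a text region
    noise_level: fraction of pixels in the box to overwrite, 0.0 to 1.0
    """
    rng = random.Random(seed)
    out = [row[:] for row in image]          # do not mutate the input
    top, left, bottom, right = box
    for y in range(top, bottom):
        for x in range(left, right):
            if rng.random() < noise_level:
                out[y][x] = rng.randrange(256)
    return out

# A 4x6 white "page" with a text region in the middle.
page = [[255] * 6 for _ in range(4)]
text_box = (1, 1, 3, 5)

lightly_corrupted = corrupt_region(page, text_box, noise_level=0.2)
heavily_corrupted = corrupt_region(page, text_box, noise_level=1.0)
untouched = corrupt_region(page, text_box, noise_level=0.0)
```

In this picture the reconstruction target (the original text) stays the same at every noise level; at low noise the task behaves like OCR, and at high noise the text can only be inferred from the surrounding context.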

Evaluation

Inference Efficiency

OpenCompass

Model	Size	Avg Score ↑	Total Inference Time ↓
GLM-4.1V-9B-Thinking	10.3B	76.6	17.5h
MiMo-VL-7B-RL	8.3B	76.4	11h
MiniCPM-V 4.5	8.7B	77.0	7.5h

Video-MME

Model	Size	Avg Score ↑	Total Inference Time ↓	GPU Mem ↓
Qwen2.5-VL-7B-Instruct	8.3B	71.6	3h	60G
GLM-4.1V-9B-Thinking	10.3B	73.6	2.63h	32G
MiniCPM-V 4.5	8.7B	73.5	0.26h	28G

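Both the Video-MME and OpenCompass evaluations ran inference on 8×A100 GPUs. The reported Video-MME inference time covers the full model-side computation and excludes the external cost of video frame extraction (which depends on the specific frame extraction tool), for a fair comparison.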
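To read the two tables as relative cost, the inference times can be normalized to MiniCPM-V 4.5. The snippet below is a small convenience calculation over the reported hours; the helper name `relative_time` is made up for this illustration.

```python
# Total inference time in hours, copied from the two tables above.
opencompass_hours = {
    "GLM-4.1V-9B-Thinking": 17.5,
    "MiMo-VL-7B-RL": 11.0,
    "MiniCPM-V 4.5": 7.5,
}
video_mme_hours = {
    "Qwen2.5-VL-7B-Instruct": 3.0,
    "GLM-4.1V-9B-Thinking": 2.63,
    "MiniCPM-V 4.5": 0.26,
}

def relative_time(hours, reference="MiniCPM-V 4.5"):
    """Express every model's time as a multiple of the reference model's time."""
    ref = hours[reference]
    return {name: round(h / ref, 2) for name, h in hours.items()}

for title, table in [("OpenCompass", opencompass_hours), ("Video-MME", video_mme_hours)]:
    print(title)
    for name, ratio in relative_time(table).items():
        print(f"  {name}: {ratio}x")
```

On these numbers the gap is much wider on Video-MME (roughly an order of magnitude) than on OpenCompass.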

Examples

We deployed MiniCPM-V 4.5 on an iPad M4 and provide an iOS demo. The demo video is a raw, unedited screen recording.

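Framework Support Matrix

Category	Framework	Cookbook Link	Upstream PR	Supported since (branch)	Supported since (release)
Edge (on-device)	Llama.cpp	Llama.cpp Doc	#15575 (2025-08-26)	master (2025-08-26)	b6282
Edge (on-device)	Ollama	Ollama Doc	#12078 (2025-08-26)	Merging	Waiting for official release
Serving (cloud)	vLLM	vLLM Doc	#23586 (2025-08-26)	main (2025-08-27)	Waiting for official release
Serving (cloud)	SGLang	SGLang Doc	#9610 (2025-08-26)	Merging	Waiting for official release
Finetuning	LLaMA-Factory	LLaMA-Factory Doc	#9022 (2025-08-26)	main (2025-08-26)	Waiting for official release
Quantization	GGUF	GGUF Doc	—	—	—
Quantization	BNB	BNB Doc	—	—	—
Quantization	AWQ	AWQ Doc	—	—	—
Demos	Gradio Demo	Gradio Demo Doc	—	—	—

Note: If you would like us to prioritize support for another open-source framework, please let us know via this short form.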

Usage

To enable thinking mode, pass enable_thinking=True to the chat function.

Chat with Image

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking = False  # Set to True to enable thinking mode.
stream = True  # With `stream=True`, `chat` returns an iterator of text chunks (consumed by the loops below).

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

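# Second round chat: pass the history of the multi-turn conversation.
# `answer` is a stream that has already been consumed, so store the collected text instead.
msgs.append({"role": "assistant", "content": [generated_text]})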
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

You will get the following output:

# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
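In streaming mode the reply arrives as a sequence of text chunks, so it has to be reassembled into one string before it is appended to `msgs` as the assistant turn. A minimal sketch of that bookkeeping follows; `collect_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` merely stands in for `model.chat(..., stream=True)` so the example runs without a model.

```python
def collect_stream(chunks, echo=True):
    """Consume an iterator of text chunks and return the full reply as one string."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)

def fake_stream():
    """Stand-in for a streaming chat call; yields the reply piece by piece."""
    yield from ["The landform ", "is karst ", "topography."]

# "<image>" is a placeholder string; the real message holds a PIL image here.
msgs = [{"role": "user", "content": ["<image>", "What is the landform in the picture?"]}]

reply = collect_stream(fake_stream(), echo=False)

# Store the collected text (not the exhausted iterator) as the assistant turn.
msgs.append({"role": "assistant", "content": [reply]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
```

Keeping the collected string in the history means the second-round request carries the actual first-round answer.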

Chat with Video

## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids. 
# To achieve this, you need to organize your video data into two corresponding sequences: 
#   frames: List[Image]
#   temporal_ids: List[List[Int]].

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES=180 # Indicates the maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
MAX_NUM_PACKING=3  # indicates the maximum packing number of video frames. valid range: 1-6
TIME_SCALE = 0.1 

def map_to_nearest_scale(values, scale):
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]

def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]

def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps

    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))

    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]      
    frame_idx =  np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)

    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')

    frames = vr.get_batch(frame_idx).asnumpy()

    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)

    return frames, frame_ts_id_group

video_path="video_test.mp4"
fps = 5 # fps for video
force_packing = None # You can set force_packing to ensure that 3D-Resampler packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration.
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False, # ensure use_image_id=False when video inference
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
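To see what `temporal_ids` looks like without decoding a real video, the sampling and grouping steps can be replayed on plain numbers. This is a simplified, dependency-free sketch of the script above: it rounds each timestamp to the nearest `TIME_SCALE` step instead of using the `cKDTree` lookup, and it does not clamp ids to the last grid point, so results may differ at the very end of a clip. The helper name `temporal_id_groups` is made up for the illustration.

```python
TIME_SCALE = 0.1  # seconds per temporal-id step, same value as in the script above

def uniform_sample(items, n):
    """Pick n items spread evenly over the list (centre of each equal-width bin)."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]

def temporal_id_groups(total_frames, fps, num_samples, packing_nums):
    """Sample frame indices, convert them to temporal ids, and pack them into groups."""
    frame_idx = uniform_sample(list(range(total_frames)), num_samples)
    # Timestamp in seconds, expressed in units of TIME_SCALE and rounded to an integer id.
    ids = [int(round(i / fps / TIME_SCALE)) for i in frame_idx]
    # Each group of up to `packing_nums` ids corresponds to frames compressed together.
    return [ids[i:i + packing_nums] for i in range(0, len(ids), packing_nums)]

# A 2-second clip at 30 fps, sampled down to 6 frames and packed 3 per group.
groups = temporal_id_groups(total_frames=60, fps=30.0, num_samples=6, packing_nums=3)
print(groups)  # [[2, 5, 8], [12, 15, 18]]
```

Each inner list lines up with one packed group of frames, which is the `List[List[Int]]` shape the header comment of the script asks for.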

Chat with Multiple Images

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

Few-Shot Learning

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
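With more than a couple of examples, writing the alternating user/assistant turns by hand gets error-prone. The helper below builds the same message layout from a list of (image, answer) pairs; `build_few_shot_msgs` is a hypothetical convenience function, and the strings stand in for PIL images so the sketch runs on its own.

```python
def build_few_shot_msgs(question, examples, query_image):
    """Build a few-shot message list: one user/assistant pair per example, then the query."""
    msgs = []
    for image, answer in examples:
        msgs.append({"role": "user", "content": [image, question]})
        msgs.append({"role": "assistant", "content": [answer]})
    # The final user turn has no answer; the model is expected to complete it.
    msgs.append({"role": "user", "content": [query_image, question]})
    return msgs

# Placeholder strings are used where the real script passes PIL images.
examples = [("<example1.jpg>", "2023.08.04"), ("<example2.jpg>", "2007.04.24")]
msgs = build_few_shot_msgs("production date", examples, "<test.jpg>")

for m in msgs:
    print(m["role"], m["content"])
```

The resulting list has the same shape as the hand-written `msgs` above and can be passed to `model.chat` in the same way once real images are substituted.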

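License

Model License

The code in this repository is released under the Apache-2.0 license.
Use of the MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
The models and weights of MiniCPM are completely free for academic research. After filling out a "questionnaire" for registration, the MiniCPM-V 4.5 weights are also available for free commercial use.

Statement

As an LMM, MiniCPM-V 4.5 generates content by learning from a large multimodal corpus, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V 4.5 does not represent the views and positions of the model developers.
We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.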

Key Techniques and Other Multimodal Projects

👏 Welcome to explore the key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:

VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our work helpful, please consider citing our paper 📝 and giving this project a like ❤️!

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nat Commun 16, 5509 (2025)},
  year={2025}
}

