One Model for Every Task: Lumina-DiMOO Brings Fully Discrete Diffusion

太阳花

2026-01-07 16:59:24

Generative large models

Multimodal large models

Image generation and editing

Cross-modal fusion

Model optimization

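Article summary

Lumina-DiMOO sets a new benchmark on the technical side with its fully discrete diffusion architecture, and its strong overall performance, high efficiency, and distinctive interactive capabilities open up new possibilities for multimodal AI. It shows that generation and understanding need not be traded against each other, and neither do speed and quality.

While text-to-image generation still struggles with minutes-long waits for a single image, and multimodal models still wrestle with the split between generation and understanding, Lumina-DiMOO breaks both deadlocks. This open-source, fully discrete diffusion large language model achieves 32× the sampling efficiency of a conventional autoregressive model and delivers top-tier results across text-to-image generation, image editing, and image understanding. What technical advances make this possible, and how can an ordinary user get started quickly?

Paper: https://arxiv.org/pdf/2510.06308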

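What is Lumina-DiMOO?

Lumina-DiMOO is a fully discrete diffusion large language model for multimodal generation and understanding. Its core goal is to unify text-to-image generation, image-to-image generation (editing, style transfer, subject-driven generation, and so on), and image understanding in a single end-to-end architecture.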

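Unlike traditional multimodal models, which process each modality separately, Lumina-DiMOO's core innovation is "fully discrete diffusion modeling." All modalities, including text and images, are converted into sequences of discrete tokens, and a diffusion process handles generation and understanding end to end. Specifically, the model first adds noise to the input tokens until they are completely random, then gradually recovers the original information through a reverse denoising process. This unified framework does away with the inefficient token-by-token generation of autoregressive models.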
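To make the noise-then-denoise loop concrete, here is a minimal sketch in Python. It is not the authors' sampler: `toy_denoiser` is a hypothetical stand-in that returns random guesses, and the `MASK` id and the cosine unmasking schedule are assumptions chosen for illustration. It only shows the control flow: start from a fully masked sequence, predict every masked position in parallel, commit the most confident predictions, and repeat for a fixed number of steps.

```python
import math
import random

MASK = -1  # hypothetical id standing for a fully noised (masked) position


def toy_denoiser(tokens, vocab_size, rng):
    """Hypothetical stand-in for the network: one (token, confidence) guess
    per position. A real model would produce logits from a transformer."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def sample(seq_len, steps, vocab_size=16, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len  # start from the fully masked sequence
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many positions may stay masked after this step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        n_reveal = max(1, len(masked) - keep_masked)
        guesses = toy_denoiser(tokens, vocab_size, rng)
        # Commit the most confident guesses in parallel; the rest stay masked.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_reveal]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


result = sample(seq_len=64, steps=8)
```

With 64 positions and 8 steps, the loop calls the denoiser at most 8 times, whereas token-by-token decoding would need 64 calls.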

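Three key components support this architecture. First, the aMUSEd-VQ image tokenizer uses a 16×16 downsampling factor to balance performance and efficiency, and its limited semantic information is compensated for by scaling up understanding data. Second, the model is initialized from the pretrained discrete diffusion LLM LLaDA-Base, which integrates multimodal capabilities without any structural changes and greatly reduces training cost. Third, special tokens such as <end-of-line> let the model recover the two-dimensional structure of an image from a one-dimensional sequence, so arbitrary resolutions are supported natively.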
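The role of the <end-of-line> token can be illustrated with a short Python sketch. It assumes a 16-pixel cell per image token (from the 16×16 downsampling factor mentioned above) and one end-of-line marker per row; the helper names are hypothetical, and the real model may add other special tokens around the image that are not modeled here.

```python
PATCH = 16             # tokenizer downsampling factor per side (from the text)
EOL = "<end-of-line>"  # row terminator token named in the text


def grid_shape(height, width, patch=PATCH):
    """Token grid (rows, cols) for an image whose sides are multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def flatten(grid):
    """Serialize a 2D grid of image tokens row by row, closing each row with EOL."""
    seq = []
    for row in grid:
        seq.extend(row)
        seq.append(EOL)
    return seq


def unflatten(seq):
    """Recover the 2D grid from the 1D sequence using only the EOL markers."""
    grid, row = [], []
    for tok in seq:
        if tok == EOL:
            grid.append(row)
            row = []
        else:
            row.append(tok)
    return grid


rows, cols = grid_shape(768, 1536)  # 48 rows x 96 columns
grid = [[r * cols + c for c in range(cols)] for r in range(rows)]
seq = flatten(grid)
assert unflatten(seq) == grid
print(rows, cols, len(seq))  # 48 96 4656
```

Because the row length is read off the markers rather than fixed in advance, the same parsing works for any height and width, which is the sense in which the design supports arbitrary resolutions.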

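Core advantages of Lumina-DiMOO

1. Fast sampling, a step change in efficiency

Compared with the representative autoregressive model Lumina-mGPT2.0, Lumina-DiMOO is up to 32× faster at inference on text-to-image generation, ending the long waits that large models used to impose for a single image. In addition, its max-logit-based caching method (ML-Cache) raises sampling speed by a further 2× without any extra training, bringing high-quality image generation into the range of seconds.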

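2. An all-rounder that switches tasks seamlessly

Lumina-DiMOO's real strength is its generality:

High-quality text-to-image generation: on several established benchmarks, including GenEval, DPG, and UniGenBench, its overall scores (up to 91%) surpass leading models such as FLUX.1[Dev] and GPT-4o, and its images score well on visual quality, prompt adherence, and reasoning accuracy.

Rich image-to-image capabilities: whether generating an image from a depth map (controllable generation), transferring style, creating from a given subject, or performing complex edits (adding, replacing, or removing objects), Lumina-DiMOO matches or exceeds specialized models.

Top-tier image understanding: on vision-language understanding benchmarks such as POPE, MMBench, SEED, and MMMU, it outperforms many dedicated understanding models and other unified models, showing strong results on OCR, image captioning, mathematical geometry, and table parsing.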

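3. A signature skill: zero-shot interactive retouching

This capability is unique to Lumina-DiMOO. Users mark the region to refine with a precise annotation such as a bounding box; the model regenerates only the masked region and leaves every detail outside it untouched. This kind of understanding-based precise editing is hard to achieve with conventional diffusion or autoregressive models, and it gives users considerable creative flexibility.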
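As a rough illustration of how a user's box could translate into "regenerate only this region," the Python sketch below maps a pixel box to the image tokens it touches and re-masks just those. The function names, the row-major token layout, the 16-pixel cell size, and the mask id are all assumptions for illustration; end-of-line tokens are ignored, and the article does not describe the model's actual interface for this feature.

```python
def box_to_token_indices(box, height, width, patch=16):
    """Map a pixel box (left, top, right, bottom) to the flat indices of the
    image tokens it touches, for a row-major grid of `patch`-pixel cells."""
    left, top, right, bottom = box
    rows, cols = height // patch, width // patch
    c0, r0 = max(0, left // patch), max(0, top // patch)
    c1 = min(cols, -(-right // patch))   # ceiling division
    r1 = min(rows, -(-bottom // patch))
    return [r * cols + c for r in range(r0, r1) for c in range(c0, c1)]


def remask(tokens, indices, mask_id=-1):
    """Return a copy of `tokens` with only the selected positions masked."""
    out = list(tokens)
    for i in indices:
        out[i] = mask_id
    return out


# A 64x64 image gives a 4x4 token grid; the box covers part of the second row.
selected = box_to_token_indices((16, 16, 40, 32), height=64, width=64)
edited = remask(list(range(16)), selected)
print(selected)  # [5, 6]
```

Every token outside `selected` is passed through unchanged, so a sampler that fills only masked positions would leave the unmarked area as it was.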

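4. A unified, closed training loop

Training proceeds in four stages: multimodal pretraining, mid-training on diverse tasks, instruction-following fine-tuning, and finally optimization with a novel self-improving reinforcement learning framework. Together these close the training loop between generation and understanding. The framework jointly optimizes text-to-image generation and multimodal understanding and introduces structured semantic feedback, so that the model's capabilities are comprehensive, balanced, and well aligned with human instructions.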

📌 Quick-start guide:

⚙️ Installation

1. Create a conda environment

git clone https://github.com/Alpha-VLLM/Lumina-DiMOO.git && cd Lumina-DiMOO

conda create -n lumina_dimoo python=3.10 -y

conda activate lumina_dimoo

2. Install dependencies

pip install -r requirements.txt

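🧨 How to fine-tune Lumina-DiMOO

1. Pre-extract the discrete codes of the training images.

For the final format after processing, see the sample JSON files assets/mmu_sample.json and assets/t2i_sample.json.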

bash pre_tokenizer/run_pre_token.sh

2. Train the Lumina-DiMOO model.

bash train/train.sh

🚗 Text-to-image inference

1. Standard sampling

python inference/inference_t2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." \
--height 768 \
--width 1536 \
--timesteps 64 \
--cfg_scale 4.0 \
--seed 65513 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_text_to_image
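The `--cfg_scale` flag in the command above suggests classifier-free guidance. The Python sketch below shows one common formulation applied to per-token logits. The article does not say which variant the repository uses, so both the formula and the function name `apply_cfg` are assumptions.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """One common classifier-free guidance rule (an assumption here):
    move from the unconditional logits toward the conditional ones by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# scale = 1 reproduces the conditional logits; larger values push further
# in the direction the prompt favours.
guided = apply_cfg([2.0, 0.0], [1.0, 0.0], scale=4.0)
print(guided)  # [5.0, 0.0]
```

Under this rule, a higher scale follows the prompt more strongly, and each step needs two forward passes, one with the prompt and one without.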

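2. DDP sampling

To support large-scale sampling and testing, an additional DDP sampling script is provided for parallel sampling across multiple GPUs.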

torchrun --nprocper\
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt_path /path/to/prompts.jsonl \
--height 1024 \
--width 1024 \
--timesteps 64 \
--cfg_scale 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image_ddp \
--output_json output/results_image_to_image_ddp/results.json

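3. Speeding up sampling with the cache

Add --use-cache to accelerate sampling with the max-logit-based cache (ML-Cache). The efficiency-quality trade-off can be tuned with cache_ratio (in (0,1); higher is faster), warmup_ratio (in [0,1); lower is faster), and refresh_interval (in (1, timesteps-int(warmup_ratio*timesteps)-1]; higher is faster).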

python inference/inference_t2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--height 768 \
--width 1536 \
--timesteps 64 \
--cfg_scale 4.0 \
--seed 65513 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_text_to_image_usecache \
--use-cache \
--cache_ratio 0.9 \
--warmup_ratio 0.3 \
--refresh_interval 5
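The allowed ranges for the three cache settings are easy to get wrong, especially the upper bound on the refresh interval, which depends on the other values. The Python helper below checks a combination against the ranges quoted in this section before a run is launched. The function name `check_cache_args` is hypothetical, and the check is independent of the repository's own argument handling.

```python
def check_cache_args(timesteps, cache_ratio, warmup_ratio, refresh_interval):
    """Validate ML-Cache settings against the ranges quoted in this section.
    Returns the largest refresh_interval allowed for these settings."""
    if not 0 < cache_ratio < 1:
        raise ValueError("cache_ratio must be in (0, 1)")
    if not 0 <= warmup_ratio < 1:
        raise ValueError("warmup_ratio must be in [0, 1)")
    upper = timesteps - int(warmup_ratio * timesteps) - 1
    if not 1 < refresh_interval <= upper:
        raise ValueError(f"refresh_interval must be in (1, {upper}]")
    return upper


# Values from the command above: 64 timesteps, warmup_ratio 0.3.
max_interval = check_cache_args(64, cache_ratio=0.9, warmup_ratio=0.3, refresh_interval=5)
print(max_interval)  # 44
```

For the example command, int(0.3 * 64) is 19, so the refresh interval may be anything from 2 to 44, and the chosen value of 5 is valid.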

🌟 Image-to-image inference

1. Controllable generation: HED control, depth control, OpenPose control, and subject-driven generation.

python inference/inference_i2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "A functional wooden printer stand.Nestled next to a brick wall in a bustling city street, it stands firm as pedestrians hustle by, illuminated by the warm glow of vintage street lamps." \
--image_path examples/example_2.jpg \
--edit_type depth_control \
--timesteps 64 \
--cfg_scale 2.5 \
--cfg_img 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image

2. Subject-driven generation.

python inference/inference_i2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "A creamy, rich-flavored dark beverage.Captured in a bustling urban street at twilight, this item is placed on an outdoor café table, as city lights begin to twinkle and passersby create a lively atmosphere." \
--image_path examples/example_3.jpg \
--edit_type subject_driven \
--timesteps 64 \
--cfg_scale 2.5 \
--cfg_img 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image

3. Image editing: add, remove, replace, background, and text-transfer edits.

python inference/inference_i2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "Add a beige shed with brown trim and double doors with a diamond pattern in the center-right, occupying more than a third of the image." \
--image_path examples/example_4.png \
--edit_type edit_add \
--timesteps 64 \
--cfg_scale 2.5 \
--cfg_img 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image

4. Style transfer (using an image as the style reference)

python inference/inference_i2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "Transform the current image into the style of the provided image." \
--image_path examples/example_5.png \
--ref_image_path examples/example_5_style.png \
--edit_type image_ref_transfer \
--timesteps 64 \
--cfg_scale 2.5 \
--cfg_img 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image

5. Dense prediction: Canny edge prediction, HED edge prediction, depth prediction, pose prediction, and Canny edge control.

python inference/inference_i2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "Generate a canny edge map accroding to the image." \
--image_path examples/example_1.png \
--edit_type canny_pred \
--timesteps 64 \
--cfg_scale 2.5 \
--cfg_img 4.0 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_image_to_image

🏃 Inpainting and outpainting inference

1. Inpainting

python inference/inference_t2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "Porsche showroom. Make there be a Porsche logo on the back wall behind the car." \
--painting_mode inpainting \
--painting_image examples/example_8.png \
--mask_h_ratio 0.5 \
--mask_w_ratio 0.5 \
--timesteps 64 \
--cfg_scale 4.0 \
--seed 65513 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_text_to_image

2. Outpainting

python inference/inference_t2i.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "A photograph showcasing a pale gold moon, partially veiled by wispy cirrus clouds, dominating a dramatic twilight sky. The moon's soft glow reflects on the tranquil surface of a lake below, creating a shimmering mirror effect, while a small wooden rowboat gently bobs on the water's edge. Dark silhouettes of tall, ancient pine trees encircle the lake, their branches reaching towards the sky like skeletal fingers, as a gentle mist hangs low, diffusing the moonlight and adding a sense of serene mystery. The scene is bathed in soft, cool lighting, creating an ethereal and captivating atmosphere." \
--painting_mode outpainting \
--painting_image examples/example_7.png \
--mask_h_ratio 1 \
--mask_w_ratio 0.2 \
--timesteps 64 \
--cfg_scale 4.0 \
--seed 65513 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/results_text_to_image

⚡️ Image understanding inference

python inference/inference_mmu.py \
--checkpoint Alpha-VLLM/Lumina-DiMOO \
--prompt "Please describe this image." \
--image_path examples/example_6.jpg \
--steps 128 \
--gen_length 128 \
--block_length 32 \
--vae_ckpt Alpha-VLLM/Lumina-DiMOO \
--output_dir output/outputs_text_understanding
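The `--steps`, `--gen_length`, and `--block_length` flags in the command above resemble the semi-autoregressive block decoding used by LLaDA's reference sampler, where the answer is split into blocks that are denoised left to right and the step budget is divided evenly among them. The Python sketch below computes that schedule. Whether inference_mmu.py divides its steps in exactly this way is an assumption, and `block_schedule` is a hypothetical helper.

```python
def block_schedule(steps, gen_length, block_length):
    """Split `gen_length` answer tokens into blocks and share `steps` evenly.
    Returns a list of (start, end, steps_for_block) tuples, left to right."""
    if gen_length % block_length:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks:
        raise ValueError("steps must be a multiple of the number of blocks")
    per_block = steps // num_blocks
    return [(b * block_length, (b + 1) * block_length, per_block)
            for b in range(num_blocks)]


# Values from the command above.
schedule = block_schedule(steps=128, gen_length=128, block_length=32)
print(schedule)  # [(0, 32, 32), (32, 64, 32), (64, 96, 32), (96, 128, 32)]
```

Under this reading, the example command decodes four blocks of 32 tokens with 32 denoising steps each, so a shorter block length gives more left-to-right structure and a longer one gives more parallelism.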
