Computer Vision

This AI Paper from Alibaba Introduces Lumos-1: A Unified Autoregressive Video Generator Leveraging MM-RoPE and AR-DF for Efficient Spatiotemporal Modeling

Autoregressive video generation is a rapidly evolving research domain. It focuses on synthesizing videos frame by frame using learned patterns […]

GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning

Vision-language models (VLMs) play a crucial role in today’s intelligent systems by enabling a detailed understanding of visual content. The […]

Mirage: Multimodal Reasoning in VLMs Without Rendering Images

While VLMs are strong at understanding both text and images, they often rely solely on text when reasoning, limiting their […]

JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

Bridging the Gap Between Artistic Intent and Technical Execution: Photo retouching is a core aspect of digital photography, enabling users […]

This AI Paper Introduces MMSearch-R1: A Reinforcement Learning Framework for Efficient On-Demand Multimodal Search in LMMs

Large multimodal models (LMMs) enable systems to interpret images, answer visual questions, and retrieve factual information by combining multiple modalities. […]

This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion

Understanding the Link Between Body Movement and Visual Perception: The study of human visual perception through egocentric views is crucial […]

NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video

AI-powered video generation is improving at a breathtaking pace. In a short time, we’ve gone from blurry, incoherent clips to […]

How Radial Attention Cuts Costs in Video Diffusion by 4.4× Without Sacrificing Quality

Introduction to Video Diffusion Models and Computational Challenges: Diffusion models have made impressive progress in generating high-quality, coherent videos, building […]

ByteDance Researchers Introduce VGR: A Novel Reasoning Multimodal Large Language Model (MLLM) with Enhanced Fine-Grained Visual Perception Capabilities

Why Multimodal Reasoning Matters for Vision-Language Tasks: Multimodal reasoning enables models to make informed decisions and answer questions by combining […]

BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI

Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the […]