Autoregressive video generation is a rapidly evolving research domain. It focuses on synthesizing videos frame by frame using learned patterns […]
GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning
Vision-language models (VLMs) play a crucial role in today’s intelligent systems by enabling a detailed understanding of visual content. The […]
Mirage: Multimodal Reasoning in VLMs Without Rendering Images
While VLMs are strong at understanding both text and images, they often rely solely on text when reasoning, limiting their […]
JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing
Bridging the Gap Between Artistic Intent and Technical Execution
Photo retouching is a core aspect of digital photography, enabling users […]
This AI Paper Introduces MMSearch-R1: A Reinforcement Learning Framework for Efficient On-Demand Multimodal Search in LMMs
Large multimodal models (LMMs) enable systems to interpret images, answer visual questions, and retrieve factual information by combining multiple modalities. […]
This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion
Understanding the Link Between Body Movement and Visual Perception
The study of human visual perception through egocentric views is crucial […]
NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video
AI-powered video generation is improving at a breathtaking pace. In a short time, we’ve gone from blurry, incoherent clips to […]
How Radial Attention Cuts Costs in Video Diffusion by 4.4× Without Sacrificing Quality
Introduction to Video Diffusion Models and Computational Challenges
Diffusion models have made impressive progress in generating high-quality, coherent videos, building […]
ByteDance Researchers Introduce VGR: A Novel Reasoning Multimodal Large Language Model (MLLM) with Enhanced Fine-Grained Visual Perception Capabilities
Why Multimodal Reasoning Matters for Vision-Language Tasks
Multimodal reasoning enables models to make informed decisions and answer questions by combining […]
BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI
Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the […]