Category: Computer Vision
EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments
Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. […]
Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control
Key Takeaways: Researchers from Google DeepMind, the University of Michigan, and Brown University have developed "Motion Prompting," a new method […]
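The excerpt is cut short, but the core idea it names, conditioning video generation on point-track "motion prompts," lends itself to a small illustration. The sketch below shows one plausible way to represent such a trajectory; the class name, fields, and helper are hypothetical stand-ins, not DeepMind's actual API:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for a "motion prompt": a point track, i.e. a
# sequence of (x, y) positions over frames plus per-frame visibility.
# Names and layout are illustrative; the paper's representation may differ.
@dataclass
class PointTrack:
    points: np.ndarray   # shape (T, 2): (x, y) per frame
    visible: np.ndarray  # shape (T,): visibility flag per frame

def make_linear_track(start: np.ndarray, end: np.ndarray,
                      num_frames: int) -> PointTrack:
    # A straight-line drag from `start` to `end`, e.g. encoding
    # "move this point to the right" as a conditioning signal.
    points = np.linspace(start, end, num_frames)  # (T, 2)
    return PointTrack(points=points,
                      visible=np.ones(num_frames, dtype=bool))

track = make_linear_track(np.array([10.0, 50.0]),
                          np.array([100.0, 50.0]), num_frames=16)
print(track.points.shape)  # (16, 2)
```

A set of such tracks, sparse for coarse control or dense for fine-grained control, would then be fed to the video model as conditioning alongside the usual text prompt.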
This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks
Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or […]
VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control
Bridging Perception and Action in Robotics: Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms […]
Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image (T2I) Model Quality
Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable […]
ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation
Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses […]
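The teaser breaks off mid-sentence, but the sequential-modeling idea it gestures at can be sketched generically: an image is flattened into a sequence of discrete codebook tokens and sampled one token at a time. The loop below is a minimal illustration of plain autoregressive sampling, not DetailFlow's 1D coarse-to-fine scheme; `dummy_model`, `VOCAB`, and `SEQ_LEN` are all illustrative stand-ins:

```python
import torch

VOCAB = 1024    # hypothetical codebook size (e.g., from a VQ encoder)
SEQ_LEN = 256   # hypothetical 16x16 grid of image tokens

def dummy_model(prefix: torch.Tensor) -> torch.Tensor:
    # Placeholder for a trained transformer: logits over the codebook
    # for the next token position, given the token prefix.
    return torch.randn(prefix.shape[0], VOCAB)

@torch.no_grad()
def sample_image_tokens(batch: int = 1, temperature: float = 1.0) -> torch.Tensor:
    tokens = torch.zeros(batch, 0, dtype=torch.long)  # empty prefix
    for _ in range(SEQ_LEN):
        logits = dummy_model(tokens) / temperature
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)  # one token per step
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens  # a VQ decoder would map these back to pixels

print(sample_image_tokens().shape)  # torch.Size([1, 256])
```

The cost of this baseline is one model call per token, which is exactly the inefficiency that token-efficient schemes like the one in the headline aim to reduce.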
National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation
In recent months, there has been growing interest in applying diffusion models—originally designed for continuous data, such as images—to natural […]
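To make "diffusion over discrete text" concrete: one common discrete-diffusion recipe starts from a fully masked sequence and iteratively unmasks the positions the denoiser is most confident about. The sketch below illustrates that generic confidence-based schedule, not Dimple's actual algorithm; `dummy_denoiser` and all constants are placeholders:

```python
import torch

VOCAB, MASK_ID, SEQ_LEN, STEPS = 32000, 0, 64, 8

def dummy_denoiser(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder for a trained network: logits over the vocabulary
    # at every sequence position.
    return torch.randn(*tokens.shape, VOCAB)

@torch.no_grad()
def generate(batch: int = 1) -> torch.Tensor:
    tokens = torch.full((batch, SEQ_LEN), MASK_ID)  # start fully masked
    for step in range(STEPS):
        logits = dummy_denoiser(tokens)
        pred = logits.argmax(dim=-1)           # best token per position
        conf = logits.max(dim=-1).values       # its confidence
        # Commit the most confident fraction of positions this step.
        k = int(SEQ_LEN * (step + 1) / STEPS)
        keep = conf.topk(k, dim=-1).indices
        for b in range(batch):
            tokens[b, keep[b]] = pred[b, keep[b]]
    return tokens

print(generate().shape)  # torch.Size([1, 64])
```

Because many positions are committed per step, the whole sequence is produced in a handful of denoising passes rather than one pass per token, which is the efficiency argument usually made for diffusion-style text generation.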
This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation
Diffusion models, known for their success in generating high-quality images, are now being explored as a foundation for handling diverse […]
Meta AI Introduces Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-modal Large Language Models
Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, […]