Computer Vision

Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

Diffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate […]

Long-Context Multimodal Understanding No Longer Requires Massive Models: NVIDIA AI Introduces Eagle 2.5, a Generalist Vision-Language Model that Matches GPT-4o on Video Tasks Using Just 8B Parameters

In recent years, vision-language models (VLMs) have advanced significantly in bridging image, video, and textual modalities. Yet, a persistent limitation […]

Stanford Researchers Propose FramePack: A Compression-based AI Framework to Tackle Drifting and Forgetting in Long-Sequence Video Generation Using Efficient Context Management and Sampling

Video generation, a branch of computer vision and machine learning, focuses on creating sequences of images that simulate motion and […]

Meta AI Released the Perception Language Model (PLM): An Open and Reproducible Vision-Language Model to Tackle Challenging Visual Recognition Tasks

Despite rapid advances in vision-language modeling, much of the progress in this field has been shaped by models trained on […]

Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video

The Challenge of Designing General-Purpose Vision Encoders: As AI systems grow increasingly multimodal, the role of visual perception models becomes […]

Do We Still Need Complex Vision-Language Pipelines? Researchers from ByteDance and WHU Introduce Pixel-SAIL—A Single Transformer Model for Pixel-Level Understanding That Outperforms 7B MLLMs

MLLMs have recently advanced in handling fine-grained, pixel-level visual understanding, thereby expanding their applications to tasks such as precise region-based […]

Category: Computer Vision

Advancing Vision-Language Reward Models: Challenges, Benchmarks, and the Role of Process-Supervised Learning

VideoMind: A Role-Based Agent for Temporal-Grounded Video Understanding

Meta Reality Labs Research Introduces Sonata: Advancing Self-Supervised Representation Learning for 3D Point Clouds

TokenBridge: Bridging The Gap Between Continuous and Discrete Token Representations In Visual Generation