Vision models interpret images, enabling tasks such as image analysis, OCR, and screenshot-to-code generation.

Qwen
Qwen3-VL 30B
qwen3-vl-30b
Parameters: 30B (3B active)
Context: 256K tokens
Strengths: Vision-language understanding, GUI interaction, screenshot-to-code generation, spatial understanding, multilingual OCR
OCR Languages: Supports 32 languages
Best for: Image analysis, screenshot-to-code generation, OCR tasks, GUI automation, and vision-text understanding
Configuration repo: tinfoilsh/confidential-qwen3-vl-30b
Multimodal: Processes images with up to 256K context for long documents. See Image Processing Guide for usage examples.
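As a minimal sketch of how an image is paired with a text prompt for a model like this, the helper below builds a chat-completions payload in the common OpenAI-style content-list format, with the image inlined as a base64 data URL. The model name comes from this page; the exact request shape and endpoint are assumptions, so treat the Image Processing Guide as the canonical reference.

```python
import base64

# Model name from this page; the request shape is an assumption based on
# the common OpenAI-style multimodal chat format.
MODEL = "qwen3-vl-30b"

def build_vision_request(image_bytes: bytes, prompt: str,
                         mime: str = "image/png") -> dict:
    """Build a chat-completions payload that pairs a text prompt with an
    inline base64-encoded image (data-URL form)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Example: an OCR-style prompt over a screenshot (bytes shown here are a stub).
payload = build_vision_request(b"\x89PNG...", "Extract all text from this screenshot.")
```

The payload would then be sent to an OpenAI-compatible chat-completions endpoint with your usual client; the 256K context leaves room for long extracted documents alongside the image.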

Moonshot
Kimi K2.5
kimi-k2-5
Parameters: 1T total (32B activated)
Context: 256K tokens
Strengths: Unified vision and text processing, image analysis, generates code from screenshots and mockups, parallel task execution across specialized sub-agents
Best for: Converting designs to code, image analysis, visual reasoning, orchestrating complex workflows with multiple parallel agents
Configuration repo: tinfoilsh/confidential-kimi-k2-5
Vision + Language: Jointly trained on images and text. See Image Processing Guide for usage examples.
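For a design-to-code workflow, a single user turn can carry several mockup images plus instructions. The sketch below builds such a multi-image message; the model name comes from this page, while the content-list format is an assumption based on the common OpenAI-style multimodal request shape (see the Image Processing Guide for the definitive examples).

```python
import base64

# Model name from this page; the multi-image message shape is an assumption
# based on the common OpenAI-style content-list format.
MODEL = "kimi-k2-5"

def mockup_to_code_request(mockup_images: list[bytes], instructions: str) -> dict:
    """Attach one or more design mockups to a single user turn, asking the
    model to generate code (e.g. HTML/CSS) for them."""
    content = [{"type": "text", "text": instructions}]
    for img in mockup_images:
        b64 = base64.b64encode(img).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"model": MODEL, "messages": [{"role": "user", "content": content}]}

# Example: two mockup screens (stub bytes) converted in one request.
request = mockup_to_code_request(
    [b"\x89PNG...", b"\x89PNG..."],
    "Convert these two mockups into a responsive HTML/CSS page.",
)
```

Putting all related mockups in one turn lets the model reason across screens at once, which suits the orchestration-style workflows described above.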