Multimodal Models Only: Image processing requires models with vision capabilities. Currently, Qwen3-VL 30B, Kimi K2.5, Kimi K2.6, and Gemma 4 31B support image inputs. Other models (Llama, GPT-OSS) are text-only and cannot process images.
See the vision models and chat models pages for complete model specifications and multimodal capabilities.
Image processing works through the chat/completions endpoint using base64-encoded images. Images are sent as data URLs in the message content alongside your text prompt.
There are several ways to convert your images to base64 format:
```shell
# Convert image to base64 (macOS syntax; on Linux use `base64 image.jpg > image_base64.txt`)
base64 -i image.jpg -o image_base64.txt

# Or use it directly in your terminal
base64 image.jpg | pbcopy  # Copies to clipboard on macOS
```
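You can also do the encoding and request construction in code. The sketch below builds a chat/completions payload with the base64 image embedded as a data URL, following the message shape described above. The endpoint URL and model name are placeholders, not values from this guide; substitute your own deployment's values.

```python
import base64

# Hypothetical endpoint and model name -- replace with your deployment's values.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "qwen3-vl-30b"

def build_image_payload(image_path: str, prompt: str) -> dict:
    """Build a chat/completions payload with a base64-encoded image as a data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Text prompt and image travel together in one message
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

The resulting dictionary can then be POSTed to the endpoint with any HTTP client, e.g. `requests.post(API_URL, json=build_image_payload("image.jpg", "Describe this image"))`.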