Multimodal Models Only: Image processing requires models with vision capabilities. Currently, Qwen3-VL 30B and Kimi K2.5 support image inputs. Other models (DeepSeek, Llama, Qwen Coder) are text-only and cannot process images.
See the vision models and chat models pages for complete model specifications and multimodal capabilities.
Image processing works through the chat/completions endpoint using base64-encoded images. Images are sent as data URLs in the message content alongside your text prompt.
There are several ways to convert your images to base64 format:
# Convert image to base64base64 -i image.jpg -o image_base64.txt# Or use it directly in your terminalbase64 image.jpg | pbcopy # Copies to clipboard on macOS