RealTime-VLM: Real-Time Vision-Language Model Inference in the Browser
RealTime-VLM runs real-time vision-language model (VLM) inference directly in the browser. It continuously captures webcam frames, encodes them, and sends an image+text prompt to any OpenAI-compatible API endpoint, displaying model responses with sub-second latency. No server-side relay is needed: the browser communicates with the VLM endpoint directly.
Features
- Continuous webcam capture: frames are grabbed at a configurable rate and sent as Base64-encoded images.
- OpenAI-compatible API: works out of the box with hosted models (GPT-4o, Gemini) and with local VLMs served via Ollama, LM Studio, or vLLM.
- Sub-second feedback loop: streaming responses are displayed as they arrive, giving a live “describe what you see” experience.
- No custom backend: the entire pipeline runs in a single HTML+JS file.
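
The configurable capture rate and the feedback loop listed above could be coordinated by a small gate that decides, on each tick, whether a new frame should be sent. The following is a minimal sketch, not the project's actual code: `createFrameGate`, `intervalMs`, and the policy of skipping frames while a request is still in flight are all assumptions made for illustration.

```javascript
// Hypothetical frame gate. The names and the skip-while-busy policy are
// illustrative assumptions, not taken from the RealTime-VLM source.
function createFrameGate(intervalMs) {
  let lastSentAt = -Infinity; // timestamp (ms) of the last frame that was sent
  let inFlight = false;       // true while a request is awaiting its response

  return {
    // Returns true if a frame may be sent at time `now` (in milliseconds).
    tryAcquire(now) {
      if (inFlight) return false;                      // previous request still running
      if (now - lastSentAt < intervalMs) return false; // too soon since the last frame
      lastSentAt = now;
      inFlight = true;
      return true;
    },
    // Call once the response has finished (or failed) to allow the next frame.
    release() {
      inFlight = false;
    },
  };
}
```

In a browser, `gate.tryAcquire(performance.now())` would typically be called from a `setInterval` or `requestAnimationFrame` callback, with `gate.release()` placed in a `finally` block around the request. Skipping frames while a response is pending keeps slow endpoints from building up a backlog of stale images.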
Use Cases
- Real-time scene description for accessibility tools.
- Interactive vision demos and classroom experiments.
- Rapid prototyping of vision-aware chat interfaces.
- Local VLM benchmarking with live visual input.
Technical Details
The app uses the browser’s MediaDevices.getUserMedia API to open the webcam stream, draws each frame to a <canvas> element to convert it to JPEG, and Base64-encodes the result before attaching it to the messages payload. Responses are streamed back as server-sent events (SSE), the standard streaming mode (stream: true) of the OpenAI chat completions endpoint.
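
The steps described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions rather than the app's real implementation: the helper names (`captureJpegDataUrl`, `buildVisionRequest`, `parseSseChunk`) and the default JPEG quality are hypothetical, while the message shape and the `data: ...` / `[DONE]` stream framing follow the OpenAI chat completions format.

```javascript
// Illustrative helpers; function names and defaults are assumptions.

// Browser-only: draw the current video frame onto a canvas and return it
// as a Base64 JPEG data URL ("data:image/jpeg;base64,...").
function captureJpegDataUrl(video, canvas, quality = 0.8) {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d").drawImage(video, 0, 0);
  return canvas.toDataURL("image/jpeg", quality);
}

// Build a chat-completions request body carrying one frame and a text prompt.
function buildVisionRequest(jpegDataUrl, prompt, model) {
  return {
    model,
    stream: true, // ask the endpoint to stream the reply as server-sent events
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: jpegDataUrl } },
        ],
      },
    ],
  };
}

// Extract text deltas from the SSE text received so far.
// Returns the decoded text, any trailing partial line to prepend to the
// next chunk, and whether the end-of-stream marker was seen.
function parseSseChunk(buffer) {
  const lines = buffer.split("\n");
  const rest = lines.pop(); // the last element may be an incomplete line
  let text = "";
  let done = false;
  for (const rawLine of lines) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const data = line.slice(5).trim();
    if (data === "[DONE]") {
      done = true;
      break;
    }
    const delta = JSON.parse(data).choices?.[0]?.delta?.content;
    if (delta) text += delta;
  }
  return { text, rest, done };
}
```

In the browser, the request body would be sent with `fetch` (JSON-encoded, with the endpoint's API key in the `Authorization` header where one is required), and the response body read chunk by chunk, passing `rest` plus each newly decoded chunk back into `parseSseChunk`. Because the browser calls the endpoint directly, the endpoint must permit cross-origin requests from the page.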
