interlocute.ai beta
Out-of-the-box agent

Media Processor

Deploy a media-processing AI agent that handles audio, video, images, and documents out of the box, with no pipeline assembly required.

What you get out of the box

Audio transcription (speech-to-text)

Image upload with OCR and vision analysis

Video ingestion and frame extraction

PDF and text document parsing and summarisation

All media processing is governed and usage-metered

Combine media inputs with conversational AI in one node

How setup works

01 Sign up and create a new node

02 Select the Media Processor profile

03 Upload audio, images, or documents via the dashboard or API

04 Optionally customise the constitution for your media workflow

05 Embed the transcription widget or call the API directly

Try these prompts

Transcribe this meeting recording and list the action items
Extract all text from this scanned invoice image
Summarise this PDF report in three bullet points
What language is being spoken in this audio clip?

Frequently Asked Questions

What media formats does the Media Processor support?
The agent supports common audio formats (MP3, WAV, M4A, WebM) for transcription, image formats (JPEG, PNG, WebP) for vision analysis, and document formats (PDF, plain text, Markdown) for parsing. Video support includes MP4 and WebM for frame extraction. Format support expands as new capabilities are added.
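
The format list above can be turned into a quick pre-upload check. The following is a minimal sketch, not part of the platform: the category names, the `media_categories` helper, and the mapping of "JPEG", "plain text", and "Markdown" to the `.jpg`/`.jpeg`, `.txt`, and `.md` extensions are all assumptions made for illustration.

```python
from pathlib import Path

# Extensions derived from the formats listed above. The category names and the
# exact extension spellings (.jpg/.jpeg, .txt, .md) are illustrative assumptions.
SUPPORTED_FORMATS = {
    "audio": {".mp3", ".wav", ".m4a", ".webm"},
    "image": {".jpg", ".jpeg", ".png", ".webp"},
    "document": {".pdf", ".txt", ".md"},
    "video": {".mp4", ".webm"},
}


def media_categories(filename: str) -> list[str]:
    """Return every category whose format list includes this file's extension."""
    ext = Path(filename).suffix.lower()
    return [name for name, exts in SUPPORTED_FORMATS.items() if ext in exts]
```

Note that WebM appears in both the audio and the video list, so a `.webm` file matches two categories and the caller has to decide which kind of processing to request.
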
How does audio transcription work?
Upload an audio file to the node's transcription endpoint and receive text back. The platform uses speech-to-text models, supports optional language hints, and meters usage by audio duration. Every transcription is logged in your usage ledger.
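
As a rough illustration of the upload step, the sketch below builds (but does not send) a multipart POST request using only the Python standard library. Everything specific in it is hypothetical: the endpoint URL, the `file` and `language` field names, and the bearer-token header are assumptions, not the documented interlocute.ai API, so check the platform's API reference for the real names.

```python
import urllib.request
import uuid


def build_transcription_request(url, api_key, filename, audio_bytes, language=None):
    """Build (but do not send) a multipart/form-data POST carrying an audio file.

    The field names "file" and "language" and the bearer-token scheme are
    hypothetical placeholders, not confirmed platform parameters.
    """
    boundary = uuid.uuid4().hex
    parts = []
    if language is not None:
        # Optional language hint, sent as a plain form field.
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="language"\r\n\r\n'
                f"{language}\r\n"
            ).encode()
        )
    # The audio payload itself.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response format is not described here, so parsing it is left out.
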
Can I combine media processing with conversation?
Yes. The Media Processor has threads and artifact handling enabled by default. Upload a recording, get a transcription, then ask follow-up questions about the content, all in the same thread with full context.
Is the Media Processor suitable for production workloads?
Yes. All media processing is governed by the platform contract, metered per request, and logged in an auditable usage ledger. You can set budget limits and monitor costs per media type.
How is media processing billed?
Audio transcription is billed by audio duration, measured in centiseconds (hundredths of a second). Image, video, and document processing incur LLM token costs for analysis plus a platform premium. All costs are attributed per request in the usage ledger.
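
To make the centisecond unit concrete, here is a small worked sketch of the arithmetic. One centisecond is 0.01 seconds, so a 90.5-second clip is 9,050 centiseconds. The rounding-up rule, the function names, and the rate value are assumptions for illustration only; the actual rounding and pricing are not stated here.

```python
import math


def billable_centiseconds(duration_seconds: float) -> int:
    """Convert a duration in seconds to whole centiseconds (1 cs = 0.01 s).

    Rounding partial centiseconds up is an assumption, not a documented rule.
    The inner round() guards against floating-point noise such as
    0.07 * 100 evaluating to 7.000000000000001.
    """
    return math.ceil(round(duration_seconds * 100, 6))


def estimate_audio_cost(duration_seconds: float, rate_per_centisecond: int) -> int:
    """Estimate cost in whatever integer unit the (hypothetical) rate uses."""
    return billable_centiseconds(duration_seconds) * rate_per_centisecond
```

For example, `estimate_audio_cost(90.5, 2)` multiplies 9,050 centiseconds by a made-up rate of 2 units per centisecond.
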

Ready to deploy?

Create your Media Processor node in seconds and start building.