Media Processor Agent | Interlocute.ai Agents

Built-in media modules

Structured intelligence for video, documents, and images

Video Intelligence

Upload a video and get a structured AI index: speech transcripts, visual scene analysis, entity extraction, sentiment, and AI summaries. Choose only the signals you need.

Speech profile — what was said
Visual profile — what was seen
Insights profile — what it means
Composable profiles


Coming Soon

Document Processing

Upload PDFs, forms, and scanned documents. Three composable profiles extract text, structural layout, and form fields; combine only the ones you need and pay only for those.

PDF Read — text and language
PDF Layout — structure and figures
Form Extraction — fields and barcodes
Composable profiles


Coming Soon

Image Intelligence

Upload an image and get layered AI analysis: a structural fingerprint with instant local metrics, semantic understanding from a multimodal LLM, and full forensic verification with manipulation detection.

Structural fingerprint — instant local analysis
Semantic intelligence — LLM-powered understanding
Forensic verification — adversarial analysis
Object detection & vision features


What you get out of the box

Audio transcription (speech-to-text)

Image upload with OCR and vision analysis

Video ingestion and frame extraction

PDF and text document parsing and summarisation

All media processing governed and usage-metered

Combine media inputs with conversational AI in one node

How setup works

01 Sign up and create a new node

02 Select the Media Processor profile

03 Upload audio, images, or documents via the dashboard or API

04 Optionally customise the constitution for your media workflow

05 Embed the transcription widget or call the API directly

Try these prompts

› Transcribe this meeting recording and list the action items

› Extract all text from this scanned invoice image

› Summarise this PDF report in three bullet points

› What language is being spoken in this audio clip?

Common use cases

HTTP API Reference

Full REST API for chat, threads, and streaming. Build custom integrations with standard HTTP endpoints and predictable contracts.

Frequently Asked Questions

Media Processor

What media formats does the Media Processor support?

The agent supports common audio formats (MP3, WAV, M4A, WebM) for transcription, image formats (JPEG, PNG, WebP) for vision analysis, and document formats (PDF, plain text, Markdown) for parsing. Video support includes MP4 and WebM for frame extraction. Format support expands as new capabilities are added.
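
As an illustration of the format list above, a client could check a file's extension before uploading. This is a minimal sketch, not part of the platform: the `SUPPORTED` table and the `categories_for` helper are hypothetical names, and the exact extensions chosen for each format (for example `.jpg` and `.jpeg` for JPEG, `.txt` and `.md` for plain text and Markdown) are assumptions.

```python
from pathlib import Path

# Hypothetical lookup table built from the formats named in the FAQ answer.
# The extension spellings are assumptions, not a published contract.
SUPPORTED = {
    "audio": {".mp3", ".wav", ".m4a", ".webm"},
    "image": {".jpg", ".jpeg", ".png", ".webp"},
    "document": {".pdf", ".txt", ".md"},
    "video": {".mp4", ".webm"},
}


def categories_for(filename: str) -> list[str]:
    """Return every media category whose extension set matches the file."""
    ext = Path(filename).suffix.lower()
    return sorted(cat for cat, exts in SUPPORTED.items() if ext in exts)


if __name__ == "__main__":
    for name in ("standup.MP3", "clip.webm", "notes.docx"):
        print(name, categories_for(name))
```

Because WebM is listed for both audio and video, the helper returns a list rather than a single category, and an empty list signals a format not named in the answer above.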

How does audio transcription work?

Upload an audio file to the node's transcription endpoint and receive text back. The platform uses speech-to-text models, supports optional language hints, and meters usage by audio duration. Every transcription is logged in your usage ledger.
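
The upload described above might look roughly like the following from a Python client. This is a sketch under stated assumptions: the URL path, the `file` and `language` field names, and bearer-token authentication are all hypothetical placeholders, so check the HTTP API Reference for the real contract. The function only builds the request and never sends it.

```python
import uuid
from urllib.request import Request


def build_transcription_request(base_url, node_id, api_key, filename,
                                audio_bytes, language_hint=None):
    """Build (but do not send) a multipart upload request.

    The endpoint path, field names, and auth scheme are assumptions
    made for illustration only.
    """
    boundary = uuid.uuid4().hex
    parts = []
    if language_hint:
        # Optional language hint, sent as a plain form field.
        parts.append(
            (f"--{boundary}\r\n"
             'Content-Disposition: form-data; name="language"\r\n\r\n'
             f"{language_hint}\r\n").encode()
        )
    # The audio payload itself.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return Request(
        url=f"{base_url}/nodes/{node_id}/transcriptions",  # hypothetical path
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    req = build_transcription_request(
        "https://api.example.invalid", "demo-node", "YOUR_API_KEY",
        "meeting.mp3", b"\x00\x01", language_hint="en",
    )
    print(req.get_method(), req.full_url)
```

Passing the built request to `urllib.request.urlopen` would send it; that step is left out here because the real endpoint is not shown on this page.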

Can I combine media processing with conversation?

Yes. The Media Processor has threads and artifact handling enabled by default. Upload a recording, get a transcription, then ask follow-up questions about the content — all in the same thread with full context.

Is the Media Processor suitable for production workloads?

Yes. All media processing is governed by the platform contract, metered per request, and logged in an auditable usage ledger. You can set budget limits and monitor costs per media type.
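
To make the idea of per-request metering with per-media-type budgets concrete, here is a small client-side model. It is an illustrative sketch only: the `UsageLedger` class, its method names, and the cost figures are hypothetical and do not describe how the platform stores or enforces budgets.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class UsageLedger:
    """Hypothetical in-memory ledger: one entry per request."""
    budgets: dict  # media type -> Decimal spending limit
    entries: list = field(default_factory=list)

    def record(self, request_id: str, media_type: str, cost: str) -> None:
        # Costs are kept as Decimal to avoid float rounding drift.
        self.entries.append((request_id, media_type, Decimal(cost)))

    def totals(self) -> dict:
        """Sum recorded costs per media type."""
        sums = defaultdict(Decimal)
        for _, media_type, cost in self.entries:
            sums[media_type] += cost
        return dict(sums)

    def over_budget(self) -> list:
        """List media types whose total exceeds the configured limit."""
        totals = self.totals()
        return sorted(
            m for m, limit in self.budgets.items()
            if totals.get(m, Decimal(0)) > limit
        )


if __name__ == "__main__":
    ledger = UsageLedger(budgets={"audio": Decimal("1.00"),
                                  "image": Decimal("0.50")})
    ledger.record("req-1", "audio", "0.60")
    ledger.record("req-2", "audio", "0.55")
    ledger.record("req-3", "image", "0.10")
    print(ledger.totals(), ledger.over_budget())
```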

How is media processing billed?

Audio transcription is billed by audio duration, measured in centiseconds (hundredths of a second). Image, video, and document processing incur LLM token costs for analysis plus a platform premium. All costs are attributed per request in the usage ledger.
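
The duration-based arithmetic can be sketched as follows. The conversion to centiseconds follows from the definition (1 second = 100 centiseconds); the rounding mode and the rate passed in are assumptions for illustration, since no prices are stated on this page.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_centiseconds(duration_seconds) -> int:
    """Convert a duration in seconds to whole centiseconds.

    Rounding half up is an assumption; the platform's rounding
    rule is not stated here.
    """
    cs = Decimal(str(duration_seconds)) * 100
    return int(cs.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def estimate_transcription_cost(duration_seconds, rate_per_centisecond) -> Decimal:
    """Estimate cost as billed centiseconds times a per-centisecond rate.

    The rate is a hypothetical placeholder supplied by the caller.
    """
    return to_centiseconds(duration_seconds) * Decimal(str(rate_per_centisecond))


if __name__ == "__main__":
    # A 90.5-second clip at a made-up rate of 0.0001 per centisecond.
    print(to_centiseconds("90.5"))
    print(estimate_transcription_cost("90.5", "0.0001"))
```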

Ready to deploy?

Create your Media Processor node in seconds and start building.
