How it works

  1. Send your song for transcription. The audio travels over an encrypted connection, is transcribed with line timings by AI (WhisperX), then discarded from our servers.
  2. Fix and style the lyrics. Edit any line, split or merge them, nudge the global timing, then pick the font, size, color, readability style and position over a looping background.
  3. Render the MP4 in your browser. The final vertical 1080×1920 video is encoded locally with ffmpeg.wasm, so the render itself uses no AI credits.
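
For the curious, step 3 can be pictured as an ordinary ffmpeg command run inside the browser. The sketch below only builds a plausible argument list: the `buildRenderArgs` helper, the file names and the codec choices are assumptions made for illustration, not AudioKit's actual pipeline, and the lyric burn-in step is left out.

```typescript
// Hypothetical helper: builds an ffmpeg argument list for a vertical render.
// File names, codecs and option order are illustrative assumptions.
interface RenderOptions {
  background: string; // looping background clip, e.g. "bg.mp4"
  audio: string;      // the song, e.g. "song.mp3"
  output: string;     // resulting file, e.g. "lyrics.mp4"
  width: number;      // 1080 for the vertical format described above
  height: number;     // 1920 for the vertical format described above
}

function buildRenderArgs(o: RenderOptions): string[] {
  // Scale the background until it covers the frame, then crop the overflow.
  const fill =
    `scale=${o.width}:${o.height}:force_original_aspect_ratio=increase,` +
    `crop=${o.width}:${o.height}`;
  return [
    "-stream_loop", "-1", "-i", o.background, // loop the background input
    "-i", o.audio,
    "-map", "0:v:0", "-map", "1:a:0",         // video from input 0, audio from input 1
    "-vf", fill,
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac",
    "-shortest",                              // stop when the song ends
    o.output,
  ];
}

const renderArgs = buildRenderArgs({
  background: "bg.mp4",
  audio: "song.mp3",
  output: "lyrics.mp4",
  width: 1080,
  height: 1920,
});
console.log(renderArgs.join(" "));
```

With ffmpeg.wasm, a list like this would be handed to the library once the input files have been written to its in-memory file system, so nothing has to leave the browser.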

Features

  • AI transcription with timings. WhisperX returns each lyric line with its start and end time, in the language it detects automatically.
  • Vocal isolation mode. Dense mix, buried vocals? Re-run the transcription on the isolated voice (the vocals are separated from the instrumental first) for much better accuracy. This mode uses 4 credits instead of 2.
  • A real lyric editor. Fix words, split a line with Enter, add or delete lines, and shift all timings together by up to ±2 seconds.
  • Looping backgrounds and styles. Urban, nature, abstract, artist, music, nightlife and lofi galleries — or your own video (up to 60 s, 50 MB) — with 5 fonts, 4 readability styles and free positioning.
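
The two timing edits in the list above (shifting every line together, and splitting a line) are easy to picture in code. This is a minimal sketch under stated assumptions: the `LyricLine` shape and both function names are hypothetical, and dividing a split line's time span in proportion to text length is an illustrative choice, not necessarily what the editor does.

```typescript
// Hypothetical data shape: one lyric line with its start and end time.
interface LyricLine {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

const MAX_SHIFT_SECONDS = 2; // the ±2 s limit mentioned in the feature list

// Shift every line by the same offset, clamped to ±2 s and never below 0.
function shiftAll(lines: LyricLine[], offset: number): LyricLine[] {
  const d = Math.max(-MAX_SHIFT_SECONDS, Math.min(MAX_SHIFT_SECONDS, offset));
  return lines.map((l) => ({
    text: l.text,
    start: Math.max(0, l.start + d),
    end: Math.max(0, l.end + d),
  }));
}

// Split a line at a character index. The time span is divided in proportion
// to the text length on each side (an assumption made for this sketch).
function splitLine(line: LyricLine, at: number): [LyricLine, LyricLine] {
  const cut = Math.max(0, Math.min(line.text.length, at));
  const ratio = line.text.length > 0 ? cut / line.text.length : 0.5;
  const mid = line.start + (line.end - line.start) * ratio;
  return [
    { text: line.text.slice(0, cut).trim(), start: line.start, end: mid },
    { text: line.text.slice(cut).trim(), start: mid, end: line.end },
  ];
}

const demo: LyricLine[] = [{ text: "hello world", start: 1, end: 3 }];
console.log(shiftAll(demo, 0.5));   // every line moves 0.5 s later
console.log(splitLine(demo[0], 5)); // "hello" and "world" as two lines
```

Clamping the offset inside `shiftAll` keeps the ±2 second limit in one place, whatever the slider or input field sends.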

FAQ

Is the Lyrics Video maker free?

No. It is part of AudioKit Premium (€9.90/month or €99/year), because the AI transcription runs on dedicated servers. AudioKit Premium includes 100 AI credits per month, shared across all AI tools. A transcription costs 2 credits (4 with isolated vocals, which runs an extra AI separation pass). Need more? Credit packs are available from your account area (100 credits for €5.99, 250 for €11.99), and pack credits never expire. The video render itself is free: it runs in your browser and uses no credits.
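
To make the credit arithmetic concrete, here is a small sketch using only the figures quoted above. The function names are hypothetical, and it assumes all 100 monthly credits go to lyric transcription, whereas in practice they are shared with the other AI tools.

```typescript
// Figures taken from the pricing paragraph above.
const MONTHLY_CREDITS = 100;
const COST_STANDARD = 2; // credits per transcription
const COST_ISOLATED = 4; // credits per transcription with vocal isolation

function transcriptionCost(isolated: boolean): number {
  return isolated ? COST_ISOLATED : COST_STANDARD;
}

// How many transcriptions a given credit balance covers.
function maxTranscriptions(credits: number, isolated: boolean): number {
  return Math.floor(credits / transcriptionCost(isolated));
}

// Price of a single credit in a pack, in euros.
function pricePerCredit(priceEur: number, credits: number): number {
  return priceEur / credits;
}

console.log(maxTranscriptions(MONTHLY_CREDITS, false)); // 50
console.log(maxTranscriptions(MONTHLY_CREDITS, true));  // 25
console.log(pricePerCredit(5.99, 100).toFixed(3));      // "0.060"
console.log(pricePerCredit(11.99, 250).toFixed(3));     // "0.048"
```

So a month of Premium covers up to 50 standard transcriptions or 25 with vocal isolation, and the larger pack works out to roughly €0.048 per credit against €0.060 for the smaller one.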

How accurate is the lyric synchronization?

Honestly: very good on clear vocals, imperfect on hard material. WhisperX times each line of sung text, and singing is genuinely difficult to transcribe: dense mixes, heavy effects or buried vocals can produce wrong words or shifted timings. That is the state of the art, not a quirk of AudioKit. You can fix every line in the editor and nudge all timings by up to ±2 seconds, and the vocal isolation mode handles the difficult cases.

What is the vocal isolation option?

For busy mixes: the AI first separates the vocals from the instrumental (Demucs), then transcribes the isolated voice — much more accurate when the vocal is buried. It takes longer (about 3–5 minutes instead of 1–2) and uses 4 credits instead of 2, because two AI operations run on our servers.

What happens to my audio file?

It is sent over an encrypted connection, transcribed, then discarded — it is not kept. Your background video never leaves your browser: only the audio goes out for transcription, and the final MP4 is rendered locally on your machine.

What video do I get at the end?

A vertical 9:16 MP4 at 1080×1920, the native format of Reels, TikTok and Shorts, with your lyrics burned in exactly as previewed. You choose the looping background (seven galleries or your own video), one of five fonts, and the size, color, readability style and position. The render runs in your browser and is free.