Video to stems

Extract audio from video and remove vocals

If your source is a video file, upload it here. The audio gets extracted and split into vocals and accompaniment so you can preview and download what you need.

  • audio from video
  • video to stems
  • MP4 support
  • free browser test

Drop a song here — or tap to try it on your track

Free, in your browser. No signup. MP3, WAV, FLAC, M4A, OGG, or video.


How it works

1. Upload video

Drop your MP4 or other video file into the browser. Most common video formats are supported.

2. Extract and separate

Audio is pulled from the video container and split into vocals and accompaniment. The video itself is not stored.

3. Preview and export

Listen first, then download only what sounds good. Expect the same quality as separating a standalone audio file with the same codec and bitrate.
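To reproduce the extraction in step 2 on your own machine, a minimal sketch with ffmpeg could look like the following. This is not the tool's own implementation, which the page does not describe. It assumes ffmpeg is installed and on your PATH; the function names and file paths are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str) -> list[str]:
    """Build an ffmpeg command that decodes the first audio track to 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", video_path,     # input container (MP4, MOV, WebM, ...)
        "-vn",                # drop the video stream
        "-map", "0:a:0",      # assumption: only the first audio stream is wanted
        "-c:a", "pcm_s16le",  # decode to uncompressed 16-bit PCM
        wav_path,
    ]


def extract_audio(video_path: str) -> Path:
    """Run ffmpeg and return the path of the extracted WAV file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    wav_path = Path(video_path).with_suffix(".wav")
    subprocess.run(build_extract_command(video_path, str(wav_path)), check=True)
    return wav_path
```

Decoding to WAV does not add quality. It only avoids a second lossy encode before separation, so the result is still limited by the codec and bitrate inside the video.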

Supported video formats and audio codecs

The tool reads the audio track from these video containers. Video resolution does not affect results — only the audio codec and bitrate matter.

Container | Common audio codec | Typical source | Separation quality
MP4 (.mp4, .m4v) | AAC 128–256 kbps | Music videos, screen recordings | Good: AAC at 256 kbps is close to CD quality
WebM (.webm) | Opus 128–160 kbps | Browser recordings, web exports | Decent: Opus is efficient but lower bitrate hurts
MOV (.mov) | AAC or PCM | iPhone/iPad recordings, Final Cut exports | Varies: PCM is lossless and excellent; AAC depends on bitrate
AVI (.avi) | MP3 or PCM | Legacy files, older screen recorders | Depends entirely on the audio codec inside
MKV (.mkv) | AAC, FLAC, or Opus | Ripped media, OBS recordings | Good if FLAC; variable otherwise

Is your video source good enough?

Quick checks before uploading.

  • Official music video or lyric video: Best case. Audio is usually mastered quality at 256 kbps AAC or higher.
  • Soundboard recording of a live show: Can work well. The audio was captured directly from the mixing board, not from room mics.
  • Screen recording with system audio: Depends on the recording software settings. Check that audio bitrate is 128 kbps or higher.
  • Phone recording from an audience: Worst case. Room ambience, crowd noise, and phone mic compression make separation much harder.
  • Social media clip (Instagram, TikTok): Heavily compressed audio. Try to find the original source instead if separation quality matters.
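One way to run the bitrate check from the list above is to ask ffprobe for the first audio stream and compare it with the 128 kbps guideline. The sketch below assumes ffprobe (shipped with ffmpeg) is installed; the function names and verdict labels are hypothetical and are not part of the tool.

```python
import json
import subprocess

MIN_KBPS = 128  # guideline taken from the screen-recording check above


def probe_command(path: str) -> list[str]:
    """ffprobe invocation that reports the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,bit_rate,channels",
        "-of", "json",
        path,
    ]


def summarize(ffprobe_json: str) -> dict:
    """Reduce ffprobe's JSON output to codec, kbps, channels, and a rough verdict."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return {"codec": None, "kbps": None, "channels": None,
                "verdict": "no audio track"}
    stream = streams[0]
    # ffprobe reports bit_rate as a string and may omit it for some containers.
    raw = stream.get("bit_rate")
    kbps = round(int(raw) / 1000) if raw is not None and str(raw).isdigit() else None
    if kbps is None:
        verdict = "bitrate unknown"
    elif kbps >= MIN_KBPS:
        verdict = "ok"
    else:
        verdict = "low bitrate"
    return {"codec": stream.get("codec_name"), "kbps": kbps,
            "channels": stream.get("channels"), "verdict": verdict}


def inspect_file(path: str) -> dict:
    """Run ffprobe on a file and summarize its first audio stream."""
    result = subprocess.run(probe_command(path), capture_output=True,
                            text=True, check=True)
    return summarize(result.stdout)
```

A "bitrate unknown" verdict is not a failure. Some containers do not record a per-stream bitrate, so the file may still be fine.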

Working with YouTube, TikTok, and Instagram sources

Not all video audio is created equal. Here's what to expect from each platform and when the video route makes sense.

Before you download — know what you're working with

YouTube typically serves audio at about 128 kbps AAC (MP4) or up to roughly 160 kbps Opus (WebM); TikTok is around 96 kbps AAC mono; Instagram Reels is around 128 kbps mono. Lower bitrate means less detail for the AI, so expect noticeably worse separation than from a high-bitrate streaming or lossless source. If you can find the same track on a music streaming service or as an MP3, use that instead.

Legal-ish download paths for separation work

YouTube's Terms of Service prohibit downloading most videos without permission. The clearer cases are Creative Commons-licensed videos, videos you uploaded yourself, and content whose license explicitly permits downloading; these can be fetched with yt-dlp or youtube-dl. For educational use (language learning, transcription), personal downloads are often tolerated in practice, but whether they count as fair use depends on your jurisdiction. Commercial use requires explicit licensing. When in doubt, link to the original video instead of rehosting.

When the video source is actually your best option

Video-only sources are the right pick for: live concert footage where the audio isn't on Spotify, YouTube uploads with rare or niche content, language immersion clips, music in films or TV soundtracks not released separately, and custom footage where audio is bespoke (interviews, lectures, tutorials). For commercial releases available elsewhere, always prefer the audio-direct source.

Test it on your own track

Upload any song and hear the separated stems in seconds. Free, no account needed.

Tips for better results

If you have the audio file separately, use that instead

A standalone MP3 or WAV avoids the extra step of container extraction and often has higher bitrate audio than video files.

Official music videos have the best audio

Audio in official releases is typically 256 kbps AAC or higher. Screen recordings and social media rips are usually much lower quality.

Using a video file adds little processing time

Extraction from the container is fast. For a track of the same length, the stem separation step takes the same time whether the source was video or audio.

Phone recordings rarely work well

Audio captured by phone microphones includes room reflections, crowd noise, and aggressive compression that degrades every stem.

FAQ

Can I upload video and get stems?

Yes. The browser extracts the audio and splits it automatically.

What video formats are supported?

MP4, WebM, MOV, and other common video formats. The tool extracts the audio track for processing.

Does video quality affect the result?

Video resolution does not matter. The audio track quality inside the video is what determines separation quality.

Can I make karaoke tracks from MP4 files?

Yes. Upload the MP4; the audio is extracted and the vocals are separated automatically. Use the accompaniment stem as your karaoke track.

Will the audio quality be worse than using an MP3 directly?

Not necessarily. If the video contains AAC at 256 kbps or higher, the audio quality is comparable to a good MP3. PCM audio in MOV files is lossless.

Can I extract audio from a YouTube link directly?

No. The browser tool needs the file itself, not a URL. Download the video first (yt-dlp is the standard open-source option), then upload the file. Services like y2mate and clipto.com offer URL-to-MP3 conversion, but they operate against YouTube's ToS. For your own uploads or Creative Commons videos, yt-dlp is generally fine to use, and it can save the original audio stream without an extra re-encode.
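For videos you have the right to download, the download step can be sketched as a small wrapper around the yt-dlp command line. This assumes yt-dlp is installed and on your PATH; the function names, URL, and output folder are hypothetical. It fetches only the audio stream, which the uploader also accepts.

```python
import subprocess


def build_download_command(url: str, out_dir: str = ".") -> list[str]:
    """Build a yt-dlp command that fetches the audio stream without re-encoding."""
    return [
        "yt-dlp",
        # Prefer an M4A (AAC) stream, otherwise fall back to the best audio available.
        "-f", "bestaudio[ext=m4a]/bestaudio",
        "--no-playlist",                       # fetch a single video only
        "-o", f"{out_dir}/%(title)s.%(ext)s",  # yt-dlp output template
        url,
    ]


def download_audio(url: str, out_dir: str = ".") -> None:
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    subprocess.run(build_download_command(url, out_dir), check=True)
```

Skipping conversion flags keeps the stream exactly as the platform served it, so no second lossy encode happens before separation.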

Why does the stem quality vary so much between videos?

Because audio quality inside videos varies dramatically. A 4K music video on YouTube typically carries about 128 kbps AAC or up to roughly 160 kbps Opus, whatever the picture resolution. A 720p TikTok clip might have 96 kbps mono AAC. A phone recording at a concert is typically 64 kbps at best. The AI works with what's inside, and lossy compression at low bitrates destroys the harmonic detail the model needs to separate cleanly.
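To put those bitrates in perspective, the short calculation below compares them with uncompressed CD audio (44.1 kHz, 16-bit, stereo). It is a rough illustration only: bitrate is not a direct measure of perceived quality, and different codecs behave differently at the same bitrate. The function name is hypothetical.

```python
# CD audio: sample rate x bit depth x channels, expressed in kbps (1411.2).
CD_KBPS = 44_100 * 16 * 2 / 1000


def compression_ratio(kbps: float) -> float:
    """How many times smaller than CD-quality PCM a stream at this bitrate is."""
    return CD_KBPS / kbps


for label, kbps in [("256 kbps", 256), ("128 kbps", 128),
                    ("96 kbps", 96), ("64 kbps", 64)]:
    print(f"{label}: about {compression_ratio(kbps):.1f}x smaller than CD audio")
```

The more data the encoder discards, the less there is for the separation model to work with.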

Can I remove the original song from a reaction video and keep the streamer's commentary?

Partially. The AI separates 'vocals' from 'accompaniment' — it doesn't distinguish between the streamer's voice and the original song's vocals. You'll get all vocal content (streamer plus singer) in one stem. For true reaction-video use, you'd need source separation that handles two vocal signals, which requires more advanced models like MVSep's 'karaoke' mode or manual spectral editing in iZotope RX.
