🎥 Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Hierarchical Top-Down Attention

Automatically detect and separate ALL speakers in your video

Simply upload a video and the system will:

🔍 Automatically detect all speakers in the video
🎭 Show you each detected speaker's face
🎬 Generate individual videos for each speaker with their isolated audio

📖 How it works:

1. Upload - Select any video file
2. Process - Click the button to start automatic detection
3. Review - See all detected speakers and their positions
4. Select - Click on any face to watch that speaker's separated video
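
The flow above can be sketched in Python. This is a minimal illustration of the data flow only, not the demo's actual code: `separate_all_speakers`, `SpeakerResult`, and the two callables `detect_faces` and `separate_one` are hypothetical names standing in for the face detector and the separation model. Ordering speakers left to right is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class SpeakerResult:
    speaker_id: int
    face_box: Box        # where the face sits in the first frame
    audio: List[float]   # isolated audio samples for this speaker


def separate_all_speakers(
    first_frame: object,
    mixture: Sequence[float],
    detect_faces: Callable[[object], List[Box]],                  # hypothetical detector
    separate_one: Callable[[Sequence[float], Box], List[float]],  # hypothetical separator
) -> List[SpeakerResult]:
    """Detect every face in the first frame, then isolate one track per face."""
    # Assumption: number speakers by horizontal position, left to right.
    boxes = sorted(detect_faces(first_frame), key=lambda b: b[0])
    if not boxes:
        raise ValueError("no faces detected in the first frame")
    return [
        SpeakerResult(speaker_id=i, face_box=box, audio=separate_one(mixture, box))
        for i, box in enumerate(boxes, start=1)
    ]


# Stand-in callables that only show the data flow (not real models).
def fake_detect(frame: object) -> List[Box]:
    return [(300, 40, 380, 140), (50, 60, 130, 160)]


def fake_separate(mix: Sequence[float], box: Box) -> List[float]:
    return [s * 0.5 for s in mix]


results = separate_all_speakers(None, [0.2, -0.4], fake_detect, fake_separate)
```

Passing the detector and separator in as callables keeps the sketch independent of any particular library API; a real pipeline would also need to read the video, track each face across frames, and write one output video per speaker.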

💡 Tips for best results:

✅ Ensure all speakers' faces are visible in the first frame
✅ Use videos with good lighting and clear face views
✅ Works best with frontal or near-frontal face angles
⏱️ Processing time depends on video length and number of speakers
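
As a rough illustration of why the tips matter, a pre-check on the first-frame detections might look like the sketch below. The function name `usable_faces`, the `(x1, y1, x2, y2, score)` detection layout, and both thresholds are assumptions made for this example; the demo's actual criteria are not documented here.

```python
from typing import List, Tuple

Detection = Tuple[int, int, int, int, float]  # (x1, y1, x2, y2, score)


def usable_faces(
    detections: List[Detection],
    min_score: float = 0.9,  # assumed confidence threshold, not from the demo
    min_side: int = 40,      # assumed minimum face size in pixels
) -> List[Detection]:
    """Keep detections that are confident and large enough to crop."""
    kept = []
    for x1, y1, x2, y2, score in detections:
        if score >= min_score and min(x2 - x1, y2 - y1) >= min_side:
            kept.append((x1, y1, x2, y2, score))
    return kept
```

Poor lighting or a strongly turned head tends to lower a detector's confidence score, and a distant speaker yields a small box, so either condition could cause a speaker to be dropped by a filter of this kind.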

🚀 Powered by:

RetinaFace for face detection
Dolphin model for audio-visual separation
CPU processing