NLP & Speech (live)
Audio Intelligence
Speech → transcription, sentiment, keywords — a full NLP pipeline.
Speech → 4 structured outputs in one pass
The problem
Upload an audio clip and you usually get back only a transcript. Adding sentiment, keywords, and speaking rate to the same upload is mechanically easy, but the four are rarely shipped together. This is the one-pass version.
Who this is for
Anyone evaluating NLP pipeline glue, and candidates for a voice-product role who want a small but complete example.
Architecture
- faster-whisper (tiny)
- Transcription with word-level timestamps. POST /api/transcribe.
- DistilBERT sentiment
- Sentence-level positive / negative scoring over the transcript.
- Keyword extractor
- Salient terms surfaced from the transcript for skim-friendly summaries.
- Speaking-rate calculator
- Tokens / second derived from word timestamps.
- FastAPI + Next.js
- POST /api/analyze runs the full pipeline; UI shows transcript and the three analyses side by side.
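The speaking-rate step is simple enough to sketch with the standard library alone. This is a minimal version, not the project's actual code: the `Word` shape and the `speaking_rate` name are assumptions, and it counts whole words per second (the closest unit that word timestamps give directly).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Word:
    """One transcribed word with start/end times in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


def speaking_rate(words: Sequence[Word]) -> float:
    """Words per second, measured from the first word's start to the last word's end."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    if duration <= 0:
        return 0.0  # guard against empty or degenerate timestamps
    return len(words) / duration


words = [
    Word("hello", 0.0, 0.4),
    Word("there", 0.5, 0.9),
    Word("everyone", 1.0, 2.0),
]
rate = speaking_rate(words)  # 3 words over a 2.0 second span
```

Measuring from the first word's start rather than from time zero keeps leading silence from dragging the rate down.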
Request / data flow
- 01 Audio uploaded → Whisper produces transcript + word timestamps.
- 02 Transcript chunked by sentence → DistilBERT scores each.
- 03 Keyword extractor pulls salient terms; speaking rate computed from timestamps.
- 04 Structured response with all four blocks returned in one shot.
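The four steps above can be sketched as one function with pluggable stages. Everything named here is an assumption for illustration: the `analyze` signature, the response keys, the regex sentence splitter, and the frequency-based keyword stand-in (the source does not say which extractor is used). The transcriber and scorer are passed in as plain callables so the sketch runs without Whisper or DistilBERT installed.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes: a word is (text, start_sec, end_sec);
# a scorer maps one sentence to (label, confidence).
Word = Tuple[str, float, float]
Transcriber = Callable[[bytes], Tuple[str, List[Word]]]
Scorer = Callable[[str], Tuple[str, float]]

STOPWORDS = {"the", "a", "an", "and", "is", "it", "to", "of", "this", "was", "i"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Stand-in extractor: the k most frequent non-stopword terms."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(k)]


def analyze(audio: bytes, transcribe: Transcriber, score: Scorer) -> Dict:
    transcript, words = transcribe(audio)            # 01: transcript + word timestamps
    sentiment = []
    for sentence in split_sentences(transcript):     # 02: score each sentence
        label, confidence = score(sentence)
        sentiment.append({"sentence": sentence, "label": label, "score": confidence})
    keywords = top_keywords(transcript)              # 03: keywords + speaking rate
    duration = words[-1][2] - words[0][1] if words else 0.0
    rate = len(words) / duration if duration > 0 else 0.0
    return {                                         # 04: all four blocks at once
        "transcript": transcript,
        "sentiment": sentiment,
        "keywords": keywords,
        "speaking_rate": rate,
    }


# Stubs standing in for the real models, so the flow can be exercised offline.
def fake_transcribe(audio: bytes) -> Tuple[str, List[Word]]:
    text = "The demo was great. The latency was bad."
    words = [(w, i * 0.5, i * 0.5 + 0.4) for i, w in enumerate(text.split())]
    return text, words


def fake_score(sentence: str) -> Tuple[str, float]:
    return ("negative", 0.9) if "bad" in sentence else ("positive", 0.9)


result = analyze(b"", fake_transcribe, fake_score)
```

Swapping the stubs for real model calls leaves `analyze` unchanged, which is most of what "pipeline glue" means here.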
Key decisions
Whisper tiny instead of larger variants.
Why: Latency on a homelab CPU matters more than the last point of WER for a demo; tiny still produces usable transcripts for sentiment.
One analyze endpoint that returns everything.
Why: Multiple round-trips would make the UI more complex without changing what the user sees.
Stack
faster-whisper, DistilBERT, NLP, Sentiment, FastAPI, Next.js
If I rebuilt it
- Add speaker diarization so sentiment per speaker is meaningful in multi-voice clips.
- Stream the transcript word-by-word as Whisper emits it instead of waiting for full completion.
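A rough sketch of what the streaming change might look like, under a few assumptions. faster-whisper's `transcribe` returns its segments lazily, so words can be forwarded as each segment finishes decoding; that makes this per-segment streaming rather than strictly word-by-word. The `stream_words` and `whisper_segments` names are hypothetical, and the faster-whisper import sits inside the function so the rest runs without the library installed.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Tuple


def stream_words(segments: Iterable) -> Iterator[Tuple[str, float, float]]:
    """Yield (word, start, end) as each segment becomes available."""
    for segment in segments:
        for w in segment.words or []:
            # Whisper word strings usually carry a leading space; strip it.
            yield w.word.strip(), w.start, w.end


def whisper_segments(path: str):
    """Lazily decode a file with faster-whisper (requires the package and a model download)."""
    from faster_whisper import WhisperModel

    model = WhisperModel("tiny", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, word_timestamps=True)
    return segments  # a generator: decoding happens as it is iterated


# Fake segments with the same attribute names, to exercise the generator offline.
fake = [
    SimpleNamespace(words=[SimpleNamespace(word=" hello", start=0.0, end=0.4)]),
    SimpleNamespace(words=[SimpleNamespace(word=" world", start=0.5, end=0.9)]),
]
streamed = list(stream_words(fake))
```

On the server side, the generator could feed a streaming response or a WebSocket; the sentiment and keyword blocks would still have to wait for the full transcript.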