NLP & Speech · live

Audio Intelligence

Speech → transcription, sentiment, keywords, and speaking rate: a full NLP pipeline.

Speech → 4 structured outputs in one pass

The problem

Upload an audio clip and you usually get just a transcript. Adding sentiment, keywords, and speaking rate to the same upload is mechanically easy, but the four are rarely shipped together. This is the one-pass version.

Who this is for

Anyone evaluating NLP pipeline glue, and candidates for a voice-product role who want a small but complete example.

Architecture

faster-whisper (tiny)
Transcription with word-level timestamps. POST /api/transcribe.
DistilBERT sentiment
Sentence-level positive / negative scoring over the transcript.
Keyword extractor
Salient terms surfaced from the transcript for skim-friendly summaries.
Speaking-rate calculator
Word tokens per second, derived from word timestamps.
FastAPI + Next.js
POST /api/analyze runs the full pipeline; UI shows transcript and the three analyses side by side.
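
The speaking-rate calculator is the easiest component to illustrate in isolation. What follows is a minimal sketch, not the project's code: it assumes each word arrives as a record with start and end times in seconds, and the names `Word` and `speaking_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the beginning of the clip
    end: float


def speaking_rate(words: Sequence[Word]) -> float:
    """Return words per second over the spoken span, or 0.0 if undefined."""
    if not words:
        return 0.0
    # Measure from the first word's start to the last word's end so that
    # leading and trailing silence does not dilute the rate.
    duration = words[-1].end - words[0].start
    if duration <= 0:
        return 0.0
    return len(words) / duration


words = [Word("hello", 0.5, 0.9), Word("there", 1.0, 1.4), Word("world", 1.6, 2.5)]
print(speaking_rate(words))  # 3 words over a 2.0 s span
```

Measuring over the spoken span rather than the full clip length is a design choice made here for illustration; dividing by total clip duration would instead penalize long pauses at either end.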

Request / data flow

  1. Audio uploaded → Whisper produces transcript + word timestamps.
  2. Transcript chunked by sentence → DistilBERT scores each.
  3. Keyword extractor pulls salient terms; speaking rate computed from timestamps.
  4. Structured response with all four blocks returned in one shot.
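
The four steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the project's implementation: the transcriber and sentiment scorer are passed in as plain callables (hypothetical stand-ins for faster-whisper and DistilBERT), the sentence splitter is a naive regex, and the keyword extractor is a simple frequency count, since the source does not name the extraction method. The response field names are also assumed.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

WordStamp = Tuple[str, float, float]  # (word, start_s, end_s)
# Hypothetical interfaces; the real pipeline would wrap the actual models.
Transcriber = Callable[[bytes], Tuple[str, List[WordStamp]]]
SentimentScorer = Callable[[str], Tuple[str, float]]

STOPWORDS = {"the", "a", "an", "and", "or", "is", "it", "to", "of", "in", "was"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def top_keywords(text: str, k: int = 5) -> List[str]:
    """Frequency-based keywords; an assumed placeholder for the extractor."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]


def analyze(audio: bytes, transcribe: Transcriber, score: SentimentScorer) -> Dict:
    # Step 1: transcript plus word timestamps.
    transcript, words = transcribe(audio)
    # Step 2: score each sentence separately.
    sentiment = []
    for sentence in split_sentences(transcript):
        label, confidence = score(sentence)
        sentiment.append({"sentence": sentence, "label": label, "score": confidence})
    # Step 3: keywords from the text, speaking rate from the timestamps.
    keywords = top_keywords(transcript)
    duration = words[-1][2] - words[0][1] if words else 0.0
    rate = len(words) / duration if duration > 0 else 0.0
    # Step 4: all four blocks in one response.
    return {
        "transcript": transcript,
        "sentiment": sentiment,
        "keywords": keywords,
        "speaking_rate": rate,
    }


def fake_transcribe(_audio: bytes) -> Tuple[str, List[WordStamp]]:
    text = "The demo was great. The audio was noisy."
    # Pretend every word lasts 0.5 s, back to back.
    return text, [(w, i * 0.5, (i + 1) * 0.5) for i, w in enumerate(text.split())]


def fake_score(sentence: str) -> Tuple[str, float]:
    return ("NEGATIVE", 0.9) if "noisy" in sentence else ("POSITIVE", 0.9)


result = analyze(b"", fake_transcribe, fake_score)
print(result["keywords"], result["speaking_rate"])
```

Injecting the two models as callables keeps the glue testable without loading any weights, which is the part of the pipeline this sketch is meant to show.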

Key decisions

Whisper tiny instead of larger variants.

Why: Latency on a homelab CPU matters more than the last point of word error rate (WER) for a demo; tiny still produces usable transcripts for sentiment.
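
For reference, loading the tiny model on CPU with faster-whisper might look like the sketch below. The `int8` compute type and the helper names `words_from_segments` and `transcribe_tiny` are assumptions for illustration; the source only states that the tiny model runs on a homelab CPU with word-level timestamps.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def words_from_segments(segments: Iterable) -> List[Tuple[str, float, float]]:
    """Flatten segments into (word, start_s, end_s) tuples."""
    out = []
    for seg in segments:
        # seg.words is None when word timestamps were not requested.
        for w in seg.words or []:
            out.append((w.word.strip(), w.start, w.end))
    return out


def transcribe_tiny(path: str):
    # Imported lazily so the helper above works without the dependency.
    from faster_whisper import WhisperModel

    # compute_type="int8" is an assumed CPU-friendly setting, not confirmed
    # by the project.
    model = WhisperModel("tiny", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, word_timestamps=True)
    segments = list(segments)  # lazy generator: decoding happens here
    text = " ".join(seg.text.strip() for seg in segments)
    return text, words_from_segments(segments)


# Stand-in segments shaped like faster-whisper's output, for a dry run.
demo_segments = [
    SimpleNamespace(
        text=" hi there",
        words=[
            SimpleNamespace(word=" hi", start=0.0, end=0.3),
            SimpleNamespace(word=" there", start=0.3, end=0.7),
        ],
    ),
    SimpleNamespace(text=" ok", words=None),
]
print(words_from_segments(demo_segments))
```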

One analyze endpoint that returns everything.

Why: Multiple round-trips would make the UI more complex without changing what the user sees.

Stack

faster-whisper · DistilBERT · NLP · Sentiment · FastAPI · Next.js

If I rebuilt it

  • Add speaker diarization so sentiment per speaker is meaningful in multi-voice clips.
  • Stream the transcript word-by-word as Whisper emits it instead of waiting for full completion.
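
The streaming idea could start from the fact that faster-whisper returns its segments lazily. A rough sketch, assuming newline-delimited JSON as the wire format (the `stream_words` name and the payload fields are hypothetical):

```python
import json
from types import SimpleNamespace
from typing import Iterable, Iterator


def stream_words(segments: Iterable) -> Iterator[str]:
    """Yield one JSON line per word as each segment becomes available."""
    for seg in segments:
        for w in seg.words or []:
            payload = {"word": w.word.strip(), "start": w.start, "end": w.end}
            yield json.dumps(payload) + "\n"


# Stand-in for the lazy segment iterator returned by the transcriber.
def fake_segments():
    yield SimpleNamespace(words=[SimpleNamespace(word=" one", start=0.0, end=0.4)])
    yield SimpleNamespace(words=[SimpleNamespace(word=" two", start=0.5, end=0.9)])


for line in stream_words(fake_segments()):
    print(line, end="")
```

Because decoding proceeds segment by segment, words would arrive in small bursts rather than strictly one at a time. In a FastAPI app, a generator like this could be handed to a `StreamingResponse`, though that wiring is not shown here.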