How on-device AI transcription works on a Mac

Two years ago, “AI scribe” meant exactly one architecture: stream audio to a server farm, run it through models too big for consumer hardware, send text back. The privacy trade-off wasn’t a vendor choice — it was a hardware constraint.

That constraint is gone. Here’s the pipeline CouchNotes runs entirely on a Mac, and why each piece became feasible.

The pipeline

A session recording becomes a progress note in four local stages:

1. Speech-to-text. An open Whisper-family model, quantized and packaged in GGML/GGUF format, transcribes the audio via whisper.cpp, a C/C++ inference engine built specifically to run these models fast on consumer hardware. On Apple Silicon it runs on the integrated GPU through Metal and draws on unified memory, so no discrete GPU is needed. Our target: transcribing a one-hour session in under five minutes on an M-series Mac, which works out to at least 12 times faster than real time.

2. Speaker separation. Therapy notes need to distinguish therapist from client. A compact diarization model (via sherpa-onnx) segments the audio by speaker — a two-speaker problem, which is the easy case of a hard problem — and those segments are merged with the transcript timestamps.

3. Note drafting. A small open LLM (Qwen-family, 4B parameters, 4-bit quantized) runs via llama.cpp. At 4-bit quantization, a 4B model fits in roughly 2.5 GB of memory — comfortable on any Apple Silicon Mac. It reads the diarized transcript and drafts the note in your chosen format, section by section, following per-template instructions. Long sessions get chunked map-reduce style: summarize segments, then compose the note from the summaries.

4. Cleanup. When you finalize the note, the audio is deleted. That’s a policy choice the architecture makes trivial: there’s no server-side copy to chase.
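
To make stages 2 and 3 concrete, here is a minimal Python sketch of the two pieces of glue they describe: attaching speaker labels to transcript spans by timestamp overlap, and grouping the labeled lines into chunks for the "map" step. It is an illustration under stated assumptions, not CouchNotes code. The names (Span, Turn, label_spans, to_lines, chunk_lines), the character-based chunk budget, and the sample data are all hypothetical, and nothing here calls whisper.cpp, sherpa-onnx, or llama.cpp.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A piece of transcript with timestamps in seconds (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarization segment: one speaker talking from start to end."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_spans(spans, turns):
    """Give each transcript span the speaker whose turn overlaps it the most."""
    labeled = []
    for s in spans:
        best = max(turns, key=lambda t: overlap(s.start, s.end, t.start, t.end),
                   default=None)
        if best is None or overlap(s.start, s.end, best.start, best.end) == 0.0:
            labeled.append(("unknown", s.text))  # no turn covers this span
        else:
            labeled.append((best.speaker, s.text))
    return labeled


def to_lines(labeled):
    """Collapse consecutive spans from the same speaker into one line."""
    merged = []
    for speaker, text in labeled:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return [f"{speaker}: {text}" for speaker, text in merged]


def chunk_lines(lines, max_chars):
    """Group lines into chunks of about max_chars, never splitting a line."""
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(current)
    return chunks


# Made-up sample data for illustration only.
spans = [
    Span(0.0, 2.0, "How was your week?"),
    Span(2.2, 5.0, "Better than the last one."),
    Span(5.0, 6.0, "I slept more."),
]
turns = [Turn(0.0, 2.1, "therapist"), Turn(2.1, 6.5, "client")]

lines = to_lines(label_spans(spans, turns))
for chunk in chunk_lines(lines, max_chars=40):
    print(chunk)
```

In a map-reduce arrangement, each chunk would be summarized by the model separately and the note composed from those summaries. Splitting only at line boundaries keeps every speaker turn intact inside a chunk; a real implementation would more likely budget in tokens than in characters.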

Why a 4B model is enough here

The reflex objection: “small local models are worse than GPT-class cloud models.” True in general; mostly irrelevant for this task.

Note drafting is a constrained summarization task, not open-ended reasoning. The model isn’t diagnosing anyone — it’s reorganizing a transcript into Subjective/Objective/Assessment/Plan sections under explicit instructions, and a clinician reviews every word before anything enters the record (“draft until touched” is the product’s core interaction). For structured summarization with the source text in context, current 4B-class models are genuinely good — and they fail visibly (a weak sentence you edit) rather than dangerously (a confident hallucination you don’t catch), because the reviewing therapist was in the room the transcript describes.

Meanwhile, the things that make cloud models attractive — encyclopedic knowledge, tool use, long-horizon reasoning — buy you nothing when the entire task fits in one context window with explicit formatting rules.

What Apple Silicon changed

Three properties of M-series Macs make this practical where 2020-era laptops weren’t:

Unified memory. Model weights don’t need a discrete GPU’s VRAM; the whole memory pool is available to inference. A base-model Mac runs the full pipeline.
The Neural Engine and Metal. whisper.cpp and llama.cpp both ship Metal backends for the Apple Silicon GPU, and whisper.cpp can optionally run its encoder on the Neural Engine through Core ML; this is arguably the best-supported consumer hardware for open-model inference right now.
Efficiency. Inference at this scale doesn’t spin fans for an hour. Transcription is a background job, not an event.
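
As a rough check on the memory side, the arithmetic behind a figure like "roughly 2.5 GB for a 4-bit 4B model" can be sketched in a few lines of Python. One assumption is labeled loudly here: the text does not say which 4-bit scheme is used, so the sketch uses llama.cpp's Q4_0 block layout (32 weights sharing one 16-bit scale) purely as a representative example. The function name weights_gb is hypothetical.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of quantized weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # "4B parameters", as stated in the text

# Pure 4-bit packing with no metadata: the theoretical floor.
floor_gb = weights_gb(N_PARAMS, 4.0)

# Assumed scheme: Q4_0 stores one 16-bit scale per block of 32 weights,
# so the effective cost is (32 * 4 + 16) / 32 = 4.5 bits per weight.
Q4_0_BITS = (32 * 4 + 16) / 32
q4_0_gb = weights_gb(N_PARAMS, Q4_0_BITS)

print(f"4-bit floor: {floor_gb:.2f} GB")
print(f"Q4_0 layout: {q4_0_gb:.2f} GB")
```

Under these assumptions the weights alone come to about 2.25 GB. The remainder of a 2.5 GB budget would go to the KV cache and runtime buffers, which grow with context length, and real model files often keep a few tensors at higher precision, so the exact total depends on the specific file and settings.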

The one-time download cost: plan for roughly 4 GB of model weights, fetched once at setup and stored on disk. After that, the entire system works in airplane mode — which doubles as the easiest privacy audit a non-engineer can run: turn off Wi-Fi and watch it keep working.

The honest limitations

Apple Silicon only. Intel Macs lack the memory bandwidth and acceleration; we’d rather not ship a degraded experience.
Speed scales with your chip. An M4 will transcribe faster than an M1. Both reach usable speeds; the difference is how long you wait.
English and Spanish at launch. Whisper-family models are multilingual, but we ship languages we can verify end-to-end — transcription and note drafting.
No cloud fallback, on purpose. Some local-first tools quietly upload “hard” cases. There is no code path for that here; that’s the product.

Why this matters beyond therapy

Therapy notes are close to the maximally sensitive case: a recording of someone’s most private disclosures, made under a duty of confidentiality. If on-device AI works here — and it does — the “we need your data in our cloud” architecture becomes a business-model choice, not a technical necessity, for a whole class of professional tools.

We wrote about the policy side of that in Why “HIPAA compliant” isn’t enough. The short version: when processing is local, most data-governance questions stop needing good answers and start having no premise.

CouchNotes is in development — on-device session notes for therapists on macOS, one-time license. The waitlist gets beta invites and early-access pricing.