Whisper dictation on Linux

macOS has built-in dictation. Windows has it too. Linux does not — not in any form that works system-wide, without the cloud, and without a painful setup ritual. So I built one.

whisper-dictation is a single Python file (~300 lines). Hold a keyboard shortcut, speak, release — your words appear in whatever app has focus. It runs entirely locally using OpenAI’s Whisper model via faster-whisper. No API keys. No cloud. No subscriptions.

The interesting part was not the speech recognition. That was easy. The interesting part was getting the transcribed text into the active window without breaking things.

The stack

  • faster-whisper — CTranslate2-optimized Whisper with int8 quantization. The “base” model is 140MB and transcribes a few seconds of audio in about a second on a laptop CPU.
  • sounddevice — PortAudio wrapper for mic capture.
  • evdev — reads keyboard events directly from /dev/input/*. On Wayland, there is no equivalent of X11’s XGrabKey, so reading from the input device directly is the only way to get global hotkeys.
  • uv — the script uses PEP 723 inline metadata, so uv run ./dictation.py installs everything automatically. No venv to manage.
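
A minimal sketch of how these pieces could fit together, assuming a PEP 723 header at the top of the script and a record-then-transcribe flow. The dependency list and the names record_while, join_segments, and transcribe are illustrative assumptions, not the actual contents of dictation.py.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "faster-whisper",
#     "sounddevice",
#     "numpy",
#     "evdev",
# ]
# ///
# The PEP 723 block above is what `uv run` reads to install packages.
# The exact dependency list is an assumption.

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def record_while(is_held):
    """Capture mono float32 audio for as long as is_held() returns True."""
    import numpy as np
    import sounddevice as sd

    chunks = []

    def on_audio(indata, frames, time_info, status):
        chunks.append(indata.copy())  # the buffer is reused, so copy it

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="float32", callback=on_audio):
        while is_held():
            sd.sleep(50)  # milliseconds

    if not chunks:
        return np.zeros(0, dtype="float32")
    return np.concatenate(chunks)[:, 0]  # (frames, 1) -> (frames,)


def join_segments(segments):
    """Merge faster-whisper segments into one line of text."""
    return " ".join(s.text.strip() for s in segments).strip()


def transcribe(audio, model=None):
    """Run the int8 'base' model on a 1-D float32 array sampled at 16 kHz."""
    from faster_whisper import WhisperModel

    model = model or WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio)
    return join_segments(segments)
```

The heavy imports sit inside the functions so the file can be loaded and inspected without the audio stack installed. A real script would more likely load the model once at startup and pass it in.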

The space-eating bug

The first version worked: evdev catches the hotkey, sounddevice records while it is held, faster-whisper transcribes on release, ydotool types the result. Had it running in an afternoon.

Then I tried using it with Claude Code.

Every space disappeared. The transcription in the logs was perfect — “Hello, I am testing this dictation tool” — but what appeared in the terminal was “Hello,Iamtestingthisdictationtool”.

The cause: Claude Code uses the space bar as a shortcut to activate voice chat mode. When ydotool type sends text character by character, each space is a separate key event. Claude Code intercepts those events before they reach the input field.

This is the kind of bug that only shows up when the tool meets a specific app’s keyboard handling. Unit tests would never catch it.

The fix is to not type characters at all. Instead:

  1. Copy the transcribed text to the clipboard (wl-copy on Wayland)
  2. Simulate Ctrl+Shift+V to paste it as a block

Simple in theory. In practice, this was a tour of everything broken about input simulation on Linux Wayland.

The Wayland input simulation gauntlet

Attempt 1: wl-copy + ydotool key

First try: wl-copy "text" to set the clipboard, then ydotool key ctrl+shift+v to paste. The script hung. wl-copy does not set the clipboard and exit: a wl-copy process stays alive as a clipboard server to answer paste requests (by default it forks into the background, and that child keeps any inherited pipes open). subprocess.run() blocked forever waiting for it.

Fix: use subprocess.Popen() and kill the previous instance before each new copy.
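
A minimal sketch of that fix, assuming a small helper class (ClipboardOwner is a hypothetical name). It passes the text on stdin and uses wl-copy's --foreground flag so that the process being tracked is the one serving the clipboard. That flag is a choice made for this example, not necessarily what the script does.

```python
import subprocess


class ClipboardOwner:
    """Keeps at most one wl-copy process alive (hypothetical helper)."""

    def __init__(self, popen=subprocess.Popen):
        self._popen = popen  # injectable so the logic can be tested
        self._proc = None

    def copy(self, text: str) -> None:
        # Stop the previous clipboard server if it is still running.
        if self._proc is not None and self._proc.poll() is None:
            self._proc.terminate()
            self._proc.wait()

        # Popen returns immediately; wl-copy keeps serving paste requests.
        self._proc = self._popen(
            ["wl-copy", "--foreground"],
            stdin=subprocess.PIPE,
            stdout=subprocess.DEVNULL,  # no pipes left open to wait on
            stderr=subprocess.DEVNULL,
        )
        self._proc.stdin.write(text.encode("utf-8"))
        self._proc.stdin.close()  # EOF tells wl-copy the text is complete
```

Sending the text on stdin rather than as an argument also avoids surprises when a transcription happens to start with a dash.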

Attempt 2: ydotool syntax mismatch

The system had ydotool v0.1.8 (Ubuntu’s default), and I was using v1.0+ raw keycode syntax (29:1 42:1 47:1...). v0.1.8 uses named syntax (ctrl+shift+v). Different tool, different interface.

Also: the setup script had created a systemd service for ydotoold, the ydotool daemon. That binary is not part of Ubuntu's ydotool 0.1.8 package, so the service crash-looped with status 203/EXEC (systemd could not execute the binary). Without the daemon, ydotool key was unreliable for single key combos.
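
To make the syntax difference concrete, here is a small sketch that builds the arguments for ydotool key in either style. The function name and the legacy flag are hypothetical, and how a script decides which version is installed is left open. The key codes are the standard Linux input event codes.

```python
# Linux input event codes (linux/input-event-codes.h)
KEY_LEFTCTRL, KEY_LEFTSHIFT, KEY_V = 29, 42, 47


def paste_key_args(legacy: bool) -> list[str]:
    """Arguments for `ydotool key` that press Ctrl+Shift+V.

    legacy=True  -> named syntax, as accepted by v0.1.8
    legacy=False -> raw "code:state" pairs, as accepted by v1.0+
    """
    if legacy:
        return ["ctrl+shift+v"]
    codes = [KEY_LEFTCTRL, KEY_LEFTSHIFT, KEY_V]
    presses = [f"{c}:1" for c in codes]              # key down, in order
    releases = [f"{c}:0" for c in reversed(codes)]   # key up, reversed
    return presses + releases
```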

Attempt 3: wtype

wtype is the Wayland-native text input tool. It speaks the virtual keyboard protocol used by wlroots (virtual-keyboard-unstable-v1, listed as wlr-virtual-keyboard-v1 in the table below).

Exit code 1. GNOME/Mutter does not implement that protocol. It is available on wlroots-based compositors (Sway, Hyprland). wtype is useless on GNOME.

Attempt 4: xdotool

xdotool key ctrl+shift+v. Nothing happened. My terminal (Ptyxis, GNOME’s new GTK4 terminal) is a native Wayland app. xdotool only reaches XWayland windows.

What actually works: ydotool with a delay

Back to ydotool. The key insight: ydotool operates at the kernel level via uinput. It creates a virtual input device that the Wayland compositor treats as a real keyboard. This works with every window — Wayland-native, XWayland, GTK, Qt, terminal, browser.

The trick for reliability without the daemon: ydotool key --delay 200 ctrl+shift+v. The 200ms delay gives the compositor time to register the virtual keyboard device before the key events arrive.

Final working pipeline on GNOME Wayland:

wl-copy "text"  →  sleep 100ms  →  ydotool key --delay 200 ctrl+shift+v
(Popen, non-blocking)              (uinput → compositor → active window)
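
The pipeline above can be sketched as one function, assuming the v0.1.8 named key syntax and a non-blocking copy callable supplied by the caller. The name paste_via_clipboard and its injectable run and sleep parameters are illustrative assumptions.

```python
import subprocess
import time


def paste_via_clipboard(text, copy, run=subprocess.run, sleep=time.sleep):
    """Put text on the clipboard, then press Ctrl+Shift+V through ydotool."""
    copy(text)    # must not block, e.g. a Popen-based wl-copy wrapper
    sleep(0.1)    # the 100 ms pause from the pipeline above
    result = run(
        ["ydotool", "key", "--delay", "200", "ctrl+shift+v"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # False lets the caller log or fall back
```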

The compatibility breakdown across tools:

Tool    | Mechanism               | GNOME Wayland | Sway/Hyprland | X11
ydotool | Linux uinput (kernel)   | Yes           | Yes           | Yes
wtype   | wlr-virtual-keyboard-v1 | No            | Yes           | No
xdotool | X11 protocol            | No*           | No            | Yes

*Only reaches XWayland windows

A few other decisions

Hotkey combo, not single key. The initial hotkey was just Right Super (KEY_RIGHTMETA). On GNOME, releasing Super opens the Activities overview. Every dictation ended with the app launcher flashing open. Switched to Right Shift + Right Super as the default.
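
A sketch of how a two-key combo can be tracked from raw key events, assuming a small state machine (ComboTracker and watch are hypothetical names). The tracker is plain Python, and the watch function shows how it might be fed from python-evdev. The device path is something the caller has to supply.

```python
# Linux input event codes (linux/input-event-codes.h)
KEY_RIGHTSHIFT = 54
KEY_RIGHTMETA = 126


class ComboTracker:
    """Reports when every key in the combo is held, and when that ends."""

    def __init__(self, combo=(KEY_RIGHTSHIFT, KEY_RIGHTMETA)):
        self.combo = frozenset(combo)
        self.held = set()
        self.active = False

    def feed(self, code, value):
        """value: 1 = press, 0 = release, 2 = autorepeat. Returns 'start', 'stop' or None."""
        if code not in self.combo or value == 2:
            return None
        if value == 1:
            self.held.add(code)
        else:
            self.held.discard(code)
        if not self.active and self.held == self.combo:
            self.active = True
            return "start"
        if self.active and self.held != self.combo:
            self.active = False
            return "stop"
        return None


def watch(device_path, on_start, on_stop):
    """Feed key events from one input device into the tracker (blocks forever)."""
    import evdev  # imported here so the tracker works without it

    tracker = ComboTracker()
    for event in evdev.InputDevice(device_path).read_loop():
        if event.type != evdev.ecodes.EV_KEY:
            continue
        action = tracker.feed(event.code, event.value)
        if action == "start":
            on_start()
        elif action == "stop":
            on_stop()
```

Recording starts only once both keys are down and stops as soon as either one is released.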

Tool detection at startup. All the shutil.which() checks for wl-copy, ydotool, xclip, xdotool run once when the script boots. During actual use, type_text() just branches on a cached string — no overhead per transcription.

Warnings, not crashes. If the ideal tools are not installed, the script tells you exactly what to install, but still tries to work with whatever is available. It falls back to ydotool type if nothing else is found — which works everywhere except in apps with custom space-bar handling.
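
The two decisions above (detect once, warn instead of crash) could look roughly like this. The backend labels, the priority order, and the warning text are assumptions made for illustration. Only the tool names come from the post.

```python
import shutil
import sys


def _warn(message):
    print(f"warning: {message}", file=sys.stderr)


def detect_backend(which=shutil.which, warn=_warn):
    """Pick an output strategy once at startup; returns a label or None."""
    tools = ("wl-copy", "ydotool", "xclip", "xdotool")
    have = {tool: which(tool) is not None for tool in tools}

    if have["wl-copy"] and have["ydotool"]:
        return "wayland-paste"      # clipboard + Ctrl+Shift+V
    if have["xclip"] and have["xdotool"]:
        return "x11-paste"
    if have["ydotool"]:
        warn("wl-copy not found (package wl-clipboard); falling back to "
             "'ydotool type', which can lose spaces in some apps")
        return "ydotool-type"
    warn("no input tool found; install ydotool and wl-clipboard")
    return None


# Called once when the script boots; type_text() would then branch on it:
# BACKEND = detect_backend()
```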

What I learned

wl-copy is a long-lived process. It does not fire and forget. subprocess.run() will block your event loop forever.

ydotool 0.1.8 and ydotool 1.0+ are different tools. Different syntax, different architecture (daemon vs. direct), different reliability characteristics. Ubuntu ships the old one.

Clipboard paste is more robust than typing. Any app that does custom key handling — vim, tmux, Claude Code — may interpret individual key events differently than raw text. Pasting as a block sidesteps all of that.

Debug logging saves hours. Adding --debug early and logging every subprocess call with its exit code and stderr made each failure diagnosable in one iteration instead of three.
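
A minimal sketch of that kind of logging wrapper, assuming the standard logging module and a hypothetical helper named run_logged. It is meant for short-lived commands such as ydotool. The timeout is there because a command that never exits would otherwise hang the call.

```python
import logging
import shlex
import subprocess

log = logging.getLogger("dictation")


def run_logged(argv, timeout=5.0):
    """Run a helper command and log its exit code and stderr at DEBUG level."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        log.debug("%s -> not installed", argv[0])
        return None
    except subprocess.TimeoutExpired:
        log.debug("%s -> timed out after %.1fs", shlex.join(argv), timeout)
        return None
    log.debug("%s -> exit %d, stderr=%r",
              shlex.join(argv), proc.returncode, proc.stderr.strip())
    return proc.returncode


# With a --debug flag, the script would enable the output once at startup:
# logging.basicConfig(level=logging.DEBUG)
```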

Try it

git clone https://github.com/srih4ri/whisper-dictation
cd whisper-dictation
./setup.sh
# log out, log back in
./dictation.py

Tested on Ubuntu 25.10 with GNOME 48/Mutter and Ptyxis terminal. The setup script detects your display server and installs the right tools. If something does not work on your setup, open an issue.