Voice transcriptions and speech recognition
January 7, 2019 — December 5, 2023
The converse of generating speech from text is generating text from speech. We might do this in real time, to control something or to subtitle, or in batch mode, to turn an audio recording into text. Or some hybrid of both, which is what I typically want in practice when I am taking dictation.
1 Dictation
Speaking as a real-time textual input method. This is a rapidly moving area.
- macOS includes dictation.
- So does Windows.
- So does VS Code Speech, which is good because the macOS one clashes with its input method somehow.
See the following older roundups of dictation apps to start:
- Zapier dictation roundup
- The rather grimmer Linux-specific roundup.
Here are some options culled from those lists and elsewhere of vague relevance to me:
- dictation.io provides a frontend to Google speech recognition.
- A classic is Nuance Dragon dictate.
2 Coding by voice
See Speaking in code: how to program by voice
Coding by voice command requires two kinds of software: a speech-recognition engine and a platform for voice coding. Dragon from Nuance, a speech-recognition software developer in Burlington, Massachusetts, is an advanced engine and is widely used for programming by voice, with Windows and Mac versions available. Windows also has its own built-in speech recognition system. On the platform side, VoiceCode by Ben Meyer and Talon by Ryan Hileman … are popular.
Two other platforms for voice programming are Caster and Aenea, the latter of which runs on Linux. Both are free and open source, and enable voice-programming functionality in Dragonfly, which is an open-source Python framework that links actions with voice commands detected by a speech-recognition engine.
See also: Programming by Voice May Be the Next Frontier in Software Development.
Full disclosure: I am researching this because I have temporarily disabled my hands. For the moment, for my purposes, the easiest option is to use Serenade for Python programming, OS speech recognition for prose typing, and to leave my other activities aside for now. If my arms were to be disabled for a longer period of time I would probably accept the learning curve of using Talon, which seems to solve more problems, at the cost of greater commitment.
One point of friction which I did not anticipate is that most of these tools will, for various reasons, do their best to switch off any music playing, any time I use them. For someone like me who can’t focus for three minutes straight without banging electro in the background this is not ideal. My current workaround is to play music on a different device so I can sneak beats past my unnecessarily diligent speech recognition tools trying to stifle background noise or whatever it is they are doing. This means that I am wearing two headsets, which looks funny, but to be honest it is not the worst fashion sacrifice I have been forced to make in the course of this particular injury.
Contrariwise, if I were to try to do the speech control stuff in an open plan office, coworking space or in the family living room, it would be excruciatingly irritating for anyone else who could hear me. My current workaround, when I am annoying some innocent bystander with my narration, is to accuse them of being ableist if they complain.
2.1 Doing without a mouse
Try stylus or eye tracking systems, in addition to Talon, below.
2.2 GitHub Copilot Voice
Might be interesting.
2.3 Serenade
Simple, low-lift, intuitive voice recognition for coding. Includes deep integration for various languages and also various code editors, including Visual Studio Code and those JetBrains ones. Free. Simple to use.
Supported languages: Python, JavaScript, HTML, Java, C / C++, TypeScript, CSS, Markdown, Dart, Bash, Sass, C#, Go, Ruby, Rust.
The experience is good for plain code. Editor integration is not awesome when using Jupyter, in line with the general rule that Jupyter makes everything more flaky and complicated.
2.4 Talon
tl;dr:
Powerful hands-free input
- Voice Control — talk to your computer
- Noise Control — click with a back-beat
- Eye Tracking — mouse where you look
- Python Scripts — customise everything
Full length:
🤳Talon aims to bring programming, realtime video gaming, command line, and full desktop computer proficiency to people who have limited or no use of their hands, and vastly improve productivity and wow-factor of anyone who can use a computer.
System requirements:
macOS High Sierra (10.13) or newer. Talon is a universal2 build with native Apple Silicon support.
Linux / X11 (Ubuntu 18.04+, and most modern distros), Wayland support is currently limited to XWayland
Windows 8 or newer
Powerful voice control - Talon comes with a free speech recognition engine, and it is also compatible with Dragon with no additional setup.
Multiple algorithms for eye tracking mouse control (depends on a single Tobii 4C, Tobii 5 or equivalent eye tracker)
Noise recognition system (pop and hiss). Many more noises coming soon.
Scriptable with Python 3 (via embedded CPython, no need to install or configure Python on your host system).
Talon is very modular and adaptable - you can use eye tracking without speech recognition, or vice versa.
Worked example: Coding with voice dictation using Talon Voice.
2.5 Cursorless
Advanced coding extension for VS Code.
2.6 Dragonfly
Dragonfly is a speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software. It was written to make it very easy for Python macros, scripts, and applications to interface with speech recognition engines. Its design allows speech commands and grammar objects to be treated as first-class Python objects. Dragonfly can be used for general programming by voice. It is flexible enough to allow programming in any language, not just Python. It can also be used for speech-enabling applications, automating computer activities and dictating prose.
Dragonfly contains its own powerful framework for defining and executing actions. It includes actions for text input and key-stroke simulation. This framework is cross-platform, working on Windows, macOS and Linux (X11 only). See the actions sub-package documentation for more information, including code examples.
This project is a fork of the original t4ngo/dragonfly project.
Dragonfly currently supports the following speech recognition engines:
- Dragon, a product of Nuance. All versions up to 15 (the latest) should be supported. Home, Professional Individual and previous similar editions of Dragon are supported. Other editions may work too
- Windows Speech Recognition (WSR), included with Microsoft Windows Vista, Windows 7+, and freely available for Windows XP
- Kaldi (under development)
- CMU Pocket Sphinx (with caveats)
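To make the idea of linking actions to voice commands concrete, here is a minimal sketch in plain Python. It is not the Dragonfly API: `ToyGrammar`, `compile_spec`, `add_rule` and `dispatch` are names I invented for illustration, and the "actions" only return strings describing what a real tool would do rather than simulating keystrokes. It also assumes the speech engine has already handed us the utterance as text, with numbers as digits.

```python
import re
from typing import Callable, List, Optional, Tuple


def compile_spec(spec: str):
    """Compile a spoken form such as 'go to line <n>' into a regex.

    Words in angle brackets become named capture groups; every other
    word must match literally.
    """
    parts = []
    for word in spec.split():
        if word.startswith("<") and word.endswith(">"):
            parts.append(f"(?P<{word[1:-1]}>.+)")
        else:
            parts.append(re.escape(word))
    return re.compile(r"\s+".join(parts) + r"\Z")


class ToyGrammar:
    """Hypothetical stand-in for a voice grammar; not the Dragonfly API."""

    def __init__(self) -> None:
        self.rules: List[Tuple[object, Callable]] = []

    def add_rule(self, spec: str, action: Callable) -> None:
        self.rules.append((compile_spec(spec), action))

    def dispatch(self, utterance: str) -> Optional[str]:
        """Run the first rule whose spoken form matches the utterance."""
        text = utterance.strip().lower()
        for pattern, action in self.rules:
            match = pattern.match(text)
            if match:
                return action(match)
        return None  # nothing recognised


grammar = ToyGrammar()
grammar.add_rule("save file", lambda m: "key:ctrl-s")
grammar.add_rule("say <text>", lambda m: "type:" + m["text"])
grammar.add_rule("go to line <n>", lambda m: f"goto:{int(m['n'])}")

print(grammar.dispatch("go to line 42"))    # goto:42
print(grammar.dispatch("say hello world"))  # type:hello world
```

The point of the design is that commands are ordinary Python objects (a pattern plus a callable), so a grammar can be built, inspected and extended with normal code.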
2.6.1 Mathematics
2.7 VoiceCode
Your voice is the most efficient way to communicate. VoiceCode is a concise spoken language that controls your computer in real-time. When writing anything from emails to kernel code, to switching applications or navigating Photoshop – VoiceCode does the job faster and easier.
VoiceCode is different from other voice-command solutions in that commands can be chained and nested in any combination, allowing complex actions to be performed by a single spoken phrase.
By taking advantage of your brain’s natural aptitude for language you can control your computer more efficiently and naturally.
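As a rough illustration of what chaining commands in a single phrase might involve, here is a toy parser. It is my own sketch, not how VoiceCode is implemented: the `COMMANDS` vocabulary and the action names are made up, and it handles only chaining (greedy longest match over the words of one utterance), not nesting.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabulary: spoken words -> name of an action to run.
COMMANDS: Dict[Tuple[str, ...], str] = {
    ("select", "line"): "select_line",
    ("copy",): "copy",
    ("next", "window"): "next_window",
    ("paste",): "paste",
}


def parse_chain(utterance: str) -> List[str]:
    """Split one utterance into a sequence of known commands.

    At each position the longest matching command wins, so multi-word
    commands such as 'next window' are not broken up.
    """
    words = utterance.lower().split()
    longest = max(len(key) for key in COMMANDS)
    actions: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(longest, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in COMMANDS:
                actions.append(COMMANDS[key])
                i += n
                break
        else:
            # No command starts at this word: refuse the whole phrase.
            raise ValueError("unrecognised command at: " + " ".join(words[i:]))
    return actions


print(parse_chain("select line copy next window paste"))
# ['select_line', 'copy', 'next_window', 'paste']
```

Rejecting the whole phrase on an unknown word is a deliberately conservative choice; a misheard command that half-executes is worse than one that does nothing.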
3 Transcribing recordings
Handy if you have a recording and you want to make it into a text thing offline.
3.1 Whisper
Whisper (Radford et al. 2022) is the recent speech transcription model casually released by OpenAI:
pip install -U openai-whisper
whisper audio.mp3  # transcribes
whisper audio.mp3 --language Japanese --task translate  # translates to English
Much faster with a GPU, although it also runs (slowly) on CPU; otherwise free. Has now been integrated into lotsa things.
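For the subtitling use case, the Python API's `model.transcribe(...)` returns a dict whose `"segments"` entries each carry `start`, `end` (in seconds) and `text`. Here is a minimal sketch of turning such segments into SRT subtitles using only the standard library. The helper names `srt_timestamp` and `segments_to_srt` are mine, and the example segments are hand-written stand-ins rather than real Whisper output.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Render segments (dicts with 'start', 'end', 'text') as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Hand-written stand-in for result["segments"]; not real Whisper output.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 61.04, "text": " This is a test."},
]
print(segments_to_srt(example_segments))
```

With Whisper installed, the real segments would come from something like `whisper.load_model("base").transcribe("audio.mp3")["segments"]`.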
3.2 Descript
Descript aims to integrate editing with transcription; in particular, it seems to allow editing audio by editing the transcript, via voice-faking technology.
3.3 Misc other
- producthunt transcription options Weaponised social media deep fake here we come. USD 15/month for 10hr/month.
- rev transcription is a human-powered service (USD 1.25/minute)
- Vatis tech is AI-backed? USD 10/hr. Output to video subtitles and identifies different speakers.
- Audioburst offers transcription as part of their podcast service. The price is a mystery.
- The all-manual option: Type it yourself.
- wreally transcribe has built their own in-browser speech recogniser as well as a manual transcription UI. More augmented-manual than automatic. USD 20/year.
4 Phonetic transcription
It has been a long time since I took Phil Rose’s extravagantly weird undergraduate phonetics class, and I have forgotten much. A cheating tool:
I cannot easily see how to automate phonetic transcription, but surely that is around somewhere? Some voice transcription software may well use phonetics as an intermediate representation or even as the final output.