Transcribe audio recording to text.

Trilby · 2020-12-14 17:41:02

I'm looking for a way to convert an audio recording to a text transcript. I've been digging through various options, but very few of those recommended on sites I find from a web search are packaged in the repos or AUR. The rest would seem to require a lot of work to package / build, and it's not really clear whether they'd even meet my needs. Some that are packaged in the AUR have multiple AUR dependencies, some of which fail to build (e.g., sphinxbase). The one recommendation that keeps coming up is Julius, which is in the repos - but I can't find any useful documentation (there is no man page and the -help output is not helpful).

There is a BOOK for Julius which seems targeted more toward those who want to do research on the machine learning models it uses. I just need a tool that takes a wave file and produces text. I don't mind doing some scripting to achieve this, but I need to know how to call the tool to get the end result. I don't intend to develop and train machine learning models (as the speaker in the audio may change from one use to the next). I don't need high quality - just a rough outline of the discussion in the recording will be sufficient.

Does anyone have recommendations for tools that may achieve this goal? And if Julius is a good option, how is it used? I found one example on the Julius github page that addressed exactly my goal: converting a wav file to text. Their example showed this command line:

julius ... -input audio.wav

Unfortunately there wasn't any information on what should be in place of those ellipses, and without something there it doesn't work as intended:

$ julius -input audio.wav
ERROR: m_options: unknown speech input source "audio.wav"
Try `-help' for more information.

Last edited by Trilby (2020-12-14 21:12:08)

Trilby · 2020-12-14 19:27:44

I set aside Julius for the moment and worked again on building one of the AUR options: sphinx. I was able to get it working by ignoring the PKGBUILDs for the sphinx-related software in the AUR that were not building. Instead I started from scratch with upstream's github source, resulting in the following two PKGBUILDs that build cleanly and produce a working tool:

_gitname=sphinxbase
pkgname=${_gitname}-git
pkgver=r1240.cadcfb1
pkgrel=1
pkgdesc='Common library for sphinx speech recognition.'
url='https://github.com/cmusphinx/sphinxbase'
arch=('i686' 'x86_64')
license=('BSD')
makedepends=('git' 'bison' 'swig')
depends=('python' 'lapack' 'libpulse')
provides=(${_gitname})
source=("git+${url}.git")
sha256sums=("SKIP")

pkgver() {
	cd ${_gitname}
	printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
}

prepare() {
	cd "${_gitname}"
	./autogen.sh
}

build() {
	cd "${_gitname}"
	./configure --prefix=/usr
	make
}

package() {
	cd "${_gitname}"
	make DESTDIR="${pkgdir}" install
	install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE"
}

_gitname=pocketsphinx
pkgname=${_gitname}-git
pkgver=r1526.ab6d647
pkgrel=1
pkgdesc="CMU's speaker-independent continuous speech recognition engine"
url='https://github.com/cmusphinx/pocketsphinx'
arch=('i686' 'x86_64')
license=('BSD')
makedepends=('git')
depends=('sphinxbase')
provides=(${_gitname})
source=("git+${url}.git")
sha256sums=("SKIP")

pkgver() {
	cd ${_gitname}
	printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
}

prepare() {
	cd "${_gitname}"
	./autogen.sh
}

build() {
	cd "${_gitname}"
	./configure --prefix=/usr
	make
}

package() {
	cd "${_gitname}"
	make DESTDIR="${pkgdir}" install
	install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE"
}

Pocketsphinx then provides the binary `pocketsphinx_continuous` which does what I want, though it initially gave an error (actually an informative one) about the bits per sample:

...
ERROR: "continuous.c", line 123: Input audio file has [8] bits per sample instead of 16
FATAL: "continuous.c", line 165: Failed to process file '/tmp/audio.wav' due to format mismatch.

So I converted the input with ffmpeg as follows and ran it again:

ffmpeg -i /tmp/new.wav -ar 16000 -ac 1 /tmp/test.wav
pocketsphinx_continuous -infile /tmp/test.wav >/tmp/out.txt
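
The format mismatch reported above can be checked before running the recognizer. Here is a minimal sketch in Python using only the standard-library `wave` module. The helper names `wav_info` and `looks_ok_for_pocketsphinx` are made up for illustration, and the 16-bit / mono / 16 kHz target is an assumption taken from the error message and the ffmpeg options above, not from pocketsphinx documentation:

```python
import wave


def wav_info(path):
    """Return channel count, bits per sample and sample rate of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "bits": wf.getsampwidth() * 8,  # getsampwidth() is in bytes
            "rate": wf.getframerate(),
        }


def looks_ok_for_pocketsphinx(info, rate=16000):
    """Assumed target format: 16-bit, mono, 16 kHz (matches the ffmpeg call above)."""
    return info["bits"] == 16 and info["channels"] == 1 and info["rate"] == rate
```

If the check fails for a given file, a conversion like the ffmpeg line above would be needed first.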

This resulted in a text file with text of the dialog in the test input. It was pretty choppy, but it was recognizable, and the test input was the audio channel from a downloaded movie with A LOT of background noise / music / effects. This is not the intended use, but it was a sample file I had handy to test with. So if pocketsphinx produces some recognizable dialog from a noisy movie, it should do very well with recordings I plan to take of meetings.

I may upload the above PKGBUILDs to the AUR soon, but I need to double check the dependencies in each first. Any other feedback on the PKGBUILDs would of course be welcome.

EDIT: scratch the above - sphinx runs well, but I just tested it on some voice recordings made under ideal conditions, and it gave English word-salad that was not at all representative of the input audio. Perhaps I gave it too much credit with the movie example. So I'm still on the hunt for something that will actually work. I've been finding lots of Julius step-by-step guides where the authors solve one error after another only to get it to run but not produce any output (its "search" fails to match the audio to any English text) - I can replicate this ... so Julius doesn't seem viable.

I'm starting to think it will be a much better use of my time to just manually transcribe my recordings myself than to bother with these over-hyped but not-actually-functional (or documented) speech recognition tools.

EDIT: I've just tested Vosk in a virtualenv and it performs very well on real samples that I've tested. I'll next see if I can package it as it's not available in the repo/AUR. Welp ... it works from the virtualenv, but I'm clueless on how to package it - I've been hitting one dead end after another for the past few hours. I'll either just use it from a virtualenv or just not bother.

Last edited by Trilby (2020-12-15 00:04:21)

GSMiller · 2020-12-16 18:46:59

Trilby, have you looked at speech-to-text software for Linux?
This article lists many: https://www.linuxlinks.com/best-free-li … -software/
I have not tried those.

Alataw · 2022-07-09 08:50:53

Please advise the best program for converting mp3 to txt.

Trilby · 2022-07-09 11:51:13

Alataw, there is no need to bump an old thread just to ask the exact question the thread started with. FWIW, this never was marked SOLVED (but perhaps could be marked GAVE-UP) as I worked through getting many of the available tools to run properly which was quite a bit of work in several cases, only to find that none of them were even remotely sufficient for my needs.

I've gone back to listening to the audio and manually transcribing. But as already noted, if the audio is only from known speakers who participate in training the model, you may have better luck than I did. But for recordings of arbitrary dialog, even with clear recordings, the available tools perform laughably poorly.

seth · 2022-07-09 15:18:49

Upload it to YouTube and have YouTube add subtitles (automatic captioning)…

Alataw · 2022-07-09 18:50:46

Trilby, seth,

ok, thanks!

Maniaxx · 2022-07-09 20:58:28

I would try 'Speech Services by Google' (from Playstore). They needed years to achieve this quality. It should work offline (once all data is installed). Delete several gapps 'app data' afterwards and freeze the app if there are any privacy concerns.
https://www.techspot.com/news/79166-goo … board.html

Morn · 2022-07-10 10:59:18

vosk-api (https://github.com/alphacep/vosk-api) works really well in my experience if you are looking for an Open Source solution.

There is also https://github.com/ideasman42/nerd-dictation which is based on vosk-api and has advanced dictation features such as converting numbers.

I use vosk-api like this to transcribe audio in the terminal:

cd vosk-api/python/example
python test_microphone.py -d 8 -m en-model/

It works great for me with my Blue Yeti microphone, both in English and German! With a built-in microphone, it makes more mistakes. So using a good mic is very important. With the Yeti, recognition rate has been 100% for me…
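
For a recorded file rather than live microphone input, something like the following may work. This is a minimal sketch, assuming the `vosk` Python package is installed (e.g. in a virtualenv), that `en-model/` is an unpacked Vosk model directory as in the command above, and that the input is already a 16-bit mono PCM WAV file. The function names `join_results` and `transcribe_wav` are hypothetical:

```python
import json
import wave


def join_results(json_results):
    """Join the "text" fields of Vosk JSON result strings into one transcript."""
    texts = (json.loads(r).get("text", "") for r in json_results)
    return " ".join(t for t in texts if t)


def transcribe_wav(wav_path, model_dir="en-model/"):
    """Feed a WAV file to Vosk in chunks and return the recognized text."""
    # Imported here so the helper above stays usable without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM WAV input")
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        results = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform() returns True once an utterance is complete.
            if rec.AcceptWaveform(data):
                results.append(rec.Result())
        # Flush whatever is still buffered in the recognizer.
        results.append(rec.FinalResult())
    return join_results(results)
```

Calling `print(transcribe_wav("/tmp/test.wav"))` would then print the transcript as one line of text; the path is only an example.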
