[Speech Recognition] Chinese speech recognition with DeepSpeech

First, prepare a Chinese audio file.

Next, download the DeepSpeech Chinese model files:
deepspeech-0.9.3-models-zh-CN.pbmm
deepspeech-0.9.3-models-zh-CN.scorer

Running it is the same as for English:
$ source deepspeech-venv/bin/activate
$ deepspeech --model deepspeech-0.9.3-models-zh-CN.pbmm --scorer deepspeech-0.9.3-models-zh-CN.scorer --audio audio/zh_test.wav
Loading model from file deepspeech-0.9.3-models-zh-CN.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-09-04 02:47:32.705419: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loaded model in 0.0436s.
Loading scorer from files deepspeech-0.9.3-models-zh-CN.scorer
Loaded scorer in 0.00114s.
Running inference.
同的祖母是一位佛教徒但他从二没有在未向前年国佛经有一天他前我是菜里杨聪认我在我一八年气派来结果动的管领流泪总合了他说乔丽多难受到这一个密绝大这么起就有机
Inference took 12.015s for 25.003s audio file.
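
For reference, the same Chinese model can also be driven from Python. A minimal sketch, assuming the deepspeech package is installed in the venv and the zh-CN .pbmm/.scorer files sit next to the script (zh_test.wav is the sample file used above):

# zh_batch.py -- batch transcription with the zh-CN model (sketch, not an official sample)
import wave

import deepspeech
import numpy as np

model = deepspeech.Model('deepspeech-0.9.3-models-zh-CN.pbmm')
model.enableExternalScorer('deepspeech-0.9.3-models-zh-CN.scorer')

# read the whole 16 kHz mono wav file and hand it to the model as int16 samples
with wave.open('audio/zh_test.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))  # prints the recognized Chinese text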

Huh, I'm starting to feel like this can do anything.
Well, I guess it all depends on the design.

[Speech Recognition] Creating AutoSub subtitles (srt file) for a video with DeepSpeech

- AutoSub is a CLI application to generate subtitle files for any video using DeepSpeech.

### install
$ git clone https://github.com/abhirooptalasila/AutoSub
$ cd AutoSub

### virtual env
$ python3 -m venv sub
$ source sub/bin/activate
$ pip3 install -r requirements.txt
The contents of requirements.txt are as follows.

cycler==0.10.0
numpy
deepspeech==0.9.3
joblib==0.16.0
kiwisolver==1.2.0
pydub==0.23.1
pyparsing==2.4.7
python-dateutil==2.8.1
scikit-learn
scipy==1.4.1
six==1.15.0
tqdm==4.44.1

$ deactivate

### download model & scorer
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
$ mkdir audio output

$ sudo apt-get install ffmpeg
$ ffmpeg -version
ffmpeg version 4.2.4-1ubuntu0.1

This time I'll use a YouTube video.

Convert it to mp4.

$ python3 autosub/main.py --file hello.mp4
Note: main.py locates the model and scorer files on its own, so --model /home/AutoSub/deepspeech-0.9.3-models.pbmm --scorer /home/AutoSub/deepspeech-0.9.3-models.scorer is not needed.

for x in os.listdir():
        if x.endswith(".pbmm"):
            print("Model: ", os.path.join(os.getcwd(), x))
            ds_model = os.path.join(os.getcwd(), x)
        if x.endswith(".scorer"):
            print("Scorer: ", os.path.join(os.getcwd(), x))
            ds_scorer = os.path.join(os.getcwd(), x)

output/hello.srt

1
00:00:06,70 --> 00:00:15,60
a low and low and level how are you have low low and low how are you

2
00:00:16,10 --> 00:00:30,20
i do i am great i wonder for a good i grant it wonder for

3
00:00:32,45 --> 00:00:41,30
now at low halloway hallo hallo hallo how are you

4
00:00:41,90 --> 00:00:43,40
tired

5
00:00:43,55 --> 00:00:50,35
i am angry i'm not so good i'm tired

6
00:00:50,55 --> 00:00:55,95
i'm hungry and not so good

7
00:00:58,10 --> 00:01:07,15
love hollow hollow how are you have to have loved halloo are you

8
00:01:07,30 --> 00:01:16,65
how how low how do how are you allow a love as now how are you

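Since I'll want to post-process these subtitles later, here is a small sketch of my own (not part of AutoSub) that parses an .srt file like output/hello.srt into (start, end, text) tuples using only the standard library:

# parse_srt.py -- tiny reader for the simple SRT blocks shown above
import re

def parse_srt(path):
    entries = []
    with open(path, encoding='utf-8') as f:
        # blocks are separated by blank lines: index line, "start --> end" line, text lines
        for block in re.split(r'\n\s*\n', f.read().strip()):
            lines = block.splitlines()
            if len(lines) < 3:
                continue
            start, end = [t.strip() for t in lines[1].split('-->')]
            entries.append((start, end, ' '.join(lines[2:])))
    return entries

for start, end, text in parse_srt('output/hello.srt'):
    print(start, end, text)
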
I want to do this in Japanese, and with real-time output.

[Speech Recognition] Testing Julius with the dictation-kit (Japanese GMM-HMM model)

First, prepare a .wav audio file.

It contains a woman's voice saying 「お疲れ様でした」 (otsukaresama deshita).

Let's run speech recognition on it with Julius.
For the Japanese model I'll use the dictation-kit.
https://github.com/julius-speech/dictation-kit
-> The dictation-kit on GitHub is about 2 GB in total, which is heavy, so I'll download the zip with wget and unzip it instead.

Note: when you git clone the dictation-kit,
you are told to use git-lfs.
$ sudo yum install git-lfs
$ git lfs clone https://github.com/julius-speech/dictation-kit.git
no space left on device
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 2.0G 0 2.0G 0% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 2.0G 520K 2.0G 1% /run
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda1 25G 25G 0 100% /
tmpfs 395M 0 395M 0% /run/user/1000
vagrant 234G 186G 49G 80% /vagrant
tmpfs 395M 0 395M 0% /run/user/0
With this, the disk fills up in no time; it was completely used up, so I deleted a bunch of framework-related files 😅

$ wget https://osdn.net/dl/julius/dictation-kit-4.5.zip
$ unzip ./dictation-kit-4.5.zip
$ cd dictation-kit-4.5

### Start Julius with the Japanese GMM-HMM model
am-dnn.jconf
L input is set to mic, so change it to file.

-input file

$ ../julius/julius/julius -C main.jconf -C am-gmm.jconf -nostrip -input rawfile
enter filename->test.wav
——
### read waveform input
enter filename->test2.wav
Stat: adin_file: input speechfile: test2.wav
STAT: 53499 samples (3.34 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………..pass1_best: 別れ た 真似 し た 。
pass1_best_wordseq: 別れ+動詞 た+助動詞 真似+名詞 し+動詞 た+助動詞
pass1_best_phonemeseq: silB | w a k a r e | t a | m a n e | sh i | t a | silE
pass1_best_score: -7376.977051
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 7136 generated, 1958 pushed, 182 nodes popped in 332
sentence1: 伴天連 様 でし た 。
wseq1: 伴天連+名詞 様+接尾辞 でし+助動詞 た+助動詞
phseq1: silB | b a t e r e N | s a m a | d e sh i | t a | silE
cmscore1: 0.477 0.083 0.314 0.446 0.411 1.000
score1: -7376.384766

——

Oi oi oi, it came out as 「別れ た 真似 し た 。」.
Give me a break ✊ What is going on here.

Well, DeepSpeech and the like don't have a Japanese model, so I'll build the app with Julius anyway.

[Speech Recognition] Implementing a Transcriber with DeepSpeech

PyAudio has two modes: blocking, where data has to be read from the stream; and non-blocking, where a callback function is passed to PyAudio for feeding the audio data stream.
We use the DeepSpeech streaming API.
To use audio input, PyAudio needs to be installed:

$ sudo apt-get install portaudio19-dev
$ pip3 install pyaudio

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import time

import deepspeech
import numpy as np
import pyaudio

model_file_path = 'deepspeech-0.9.3-models.pbmm'
model = deepspeech.Model(model_file_path)

context = model.createStream()

text_so_far = ''

def process_audio(in_data, frame_count, time_info, status):
	global text_so_far
	# convert the raw bytes from PyAudio into int16 samples and feed them to the stream
	data16 = np.frombuffer(in_data, dtype=np.int16)
	context.feedAudioContent(data16)
	text = context.intermediateDecode()
	if text != text_so_far:
		print('Interim text = {}'.format(text))
		text_so_far = text
	return (in_data, pyaudio.paContinue)

audio = pyaudio.PyAudio()
stream = audio.open(
	format=pyaudio.paInt16,
	channels=1,
	rate=16000,
	input=True,
	frames_per_buffer=1024,
	stream_callback=process_audio
)
print('Please start speaking, when done press Ctrl-C ...')
stream.start_stream()

try:
	while stream.is_active():
		time.sleep(0.1)
except KeyboardInterrupt:
	stream.stop_stream()
	stream.close()
	audio.terminate()
	print('Finished recording.')

	text = context.finishStream()
	print('Final text = {}'.format(text))

$ python3 transcribe.py
Traceback (most recent call last):
  File "transcribe.py", line 28, in <module>
    stream = audio.open(
  File "/home/vagrant/deepspeech-venv/lib/python3.8/site-packages/pyaudio.py", line 750, in open
    stream = Stream(self, *args, **kwargs)
  File "/home/vagrant/deepspeech-venv/lib/python3.8/site-packages/pyaudio.py", line 441, in __init__
    self._stream = pa.open(**arguments)
OSError: [Errno -9996] Invalid input device (no default output device)

Can't really test this on Vagrant, then...

>>> import pyaudio
>>> pa = pyaudio.PyAudio()
>>> pa.get_default_input_device_info()
OSError: No Default Input Device Available
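
To see what PyAudio can actually find on this box, listing all devices is a quick diagnostic (a sketch; on a headless Vagrant VM the list typically contains no input-capable device):

import pyaudio

pa = pyaudio.PyAudio()
# enumerate every device PortAudio knows about and show which ones can capture audio
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    print(i, info['name'], 'input channels:', info['maxInputChannels'])
pa.terminate()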

Looks like I'll have to set up a Raspberry Pi environment after all...
At least I've confirmed that DeepSpeech is quite usable.

[Speech Recognition] Text output from DeepSpeech in Python (batch/stream)

$ python3 --version
Python 3.8.10

### batch API
- read the whole wav file and process it in one go

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import deepspeech
import wave
import numpy as np

model_file_path = 'deepspeech-0.9.3-models.pbmm'
model = deepspeech.Model(model_file_path)

filename = 'audio/8455-210777-0068.wav'
w = wave.open(filename, 'r')
rate = w.getframerate()
frames = w.getnframes()
buffer = w.readframes(frames)

data16 = np.frombuffer(buffer, dtype=np.int16)  # 16-bit PCM samples as a NumPy int16 array
text = model.stt(data16)
print(text)

$ python3 app.py
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-08-28 08:55:38.538633: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
your paret is sufficient i said

### stream API
- process the buffer one chunk at a time

# (top part omitted; same imports and model setup as the batch example)
context = model.createStream()

buffer_len = len(buffer)
offset = 0
batch_size = 16384
text = ''

while offset < buffer_len:
	end_offset = offset + batch_size
	chunk = buffer[offset:end_offset]
	data16 = np.frombuffer(chunk, dtype=np.int16)
	context.feedAudioContent(data16)
	text = context.intermediateDecode()
	print(text)
	offset = end_offset

$ python3 app.py
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-08-28 09:15:50.970216: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

your paret
your paret is suff
your paret is sufficient i said
your paret is sufficient i said
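
One thing the loop above doesn't do is finalize the stream. Once all chunks have been fed, something like this (same context variable as above) returns the final transcript and releases the native stream:

# after the while loop
text = context.finishStream()  # final decode; the stream cannot be reused afterwards
print(text)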

Huh, this is pretty impressive.
Next up is the Transcriber.

[Speech Recognition] Installing Wav2Letter

### Let’s get started
A C++ compiler and CMake are required.
$ sudo apt-get install cmake g++

Flashlight:
In order to build flashlight, we need to install Arrayfire.

$ wget https://arrayfire.s3.amazonaws.com/3.6.1/ArrayFire-no-gl-v3.6.1_Linux_x86_64.sh
$ chmod u+x ArrayFire-no-gl-v3.6.1_Linux_x86_64.sh
$ sudo bash ArrayFire-no-gl-v3.6.1_Linux_x86_64.sh --include-subdir --prefix=/opt
$ sudo echo /opt/arrayfire-no-gl/lib > /etc/ld.so.conf.d/arrayfire.conf
-bash: /etc/ld.so.conf.d/arrayfire.conf: Permission denied
Huh? That didn't work. The redirection is done by my non-root shell, not by sudo, so writing under /etc/ld.so.conf.d is denied; piping through sudo tee does the trick:
$ echo /opt/arrayfire-no-gl/lib | sudo tee /etc/ld.so.conf.d/arrayfire.conf

flashlight
$ git clone https://github.com/flashlight/flashlight.git && cd flashlight
$ mkdir -p build && cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release -DFL_BACKEND=[backend]
CMake Error at CMakeLists.txt:76 (message):
Invalid FL_BACKEND specified

$ cmake .. -DCMAKE_BUILD_TYPE=Release -DFLASHLIGHT_BACKEND=CUDA
— -rdynamic supported.
CMake Error at CMakeLists.txt:76 (message):
$ nvcc -V

Command 'nvcc' not found, but can be installed with:

apt install nvidia-cuda-toolkit
Please ask your administrator.

So CUDA isn't installed. (Also, the [backend] placeholder in the cmake line has to be replaced with an actual backend such as CUDA or CPU, i.e. -DFL_BACKEND=CUDA, which is what the "Invalid FL_BACKEND specified" error was complaining about.)

[Speech Recognition] DeepSpeech2

DeepSpeech2 PyTorch implementation
 L PyTorch is a machine learning library for Python whose development was led by Facebook
Github:

What is Deep Speech 2?
An end-to-end deep learning approach.
Key to the approach is the application of HPC techniques, resulting in a 7x speedup over the previous system.
The input audio is converted to a mel spectrogram, CNN and RNN layers are applied, and finally CTC produces the text output.
  Correcting the CTC output with a language model yields more natural sentences.
  Usage: python3 deepspeech2.py -i input.wav
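
To make that pipeline concrete, here is a rough PyTorch sketch of the mel-spectrogram -> CNN -> RNN -> CTC structure (my own simplification with made-up layer sizes, not the actual deepspeech.pytorch model):

import torch
import torch.nn as nn
import torchaudio

class TinyDeepSpeech2(nn.Module):
    """Very small stand-in for the Deep Speech 2 architecture."""
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.conv = nn.Sequential(                      # 2D convolution over (freq, time)
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_chars)        # per-frame character logits

    def forward(self, waveform):                        # waveform: (batch, samples)
        x = self.melspec(waveform).unsqueeze(1)         # (batch, 1, n_mels, time)
        x = self.conv(x)                                # (batch, 32, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)               # CTC wants log-probabilities

model = TinyDeepSpeech2()
ctc_loss = nn.CTCLoss(blank=0)  # ties per-frame outputs to the target character sequence

The language-model correction mentioned above happens afterwards, at decode time, e.g. with a beam-search decoder such as the ctcdecode package installed below.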

### install
$ git clone http://www.github.com/SeanNaren/deepspeech.pytorch
> In addition, a few libraries need to be installed for training. I'm assuming everything is installed in Anaconda on Ubuntu. If you haven't installed PyTorch yet, install it. Install this fork for the Warp-CTC bindings:
Install PyTorch on Ubuntu:
$ pip3 install torchvision

beam search decoding for PyTorch
$ git clone --recursive https://github.com/parlance/ctcdecode.git
$ cd ctcdecode && pip3 install .

finally install deepspeech.pytorch
$ pip3 install -r requirements.txt
$ pip3 install -e .

### Training
– Datasets
$ cd data
$ python3 an4.py

Manifest CSV file
train.py takes a CSV file called a manifest, which contains the paths to the wav files and their label text files.
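The exact layout depends on the deepspeech.pytorch version, but going by that description a hypothetical two-line manifest would look like this (paths are made up for illustration):

/home/vagrant/data/an4_train/wav/utt001.wav,/home/vagrant/data/an4_train/txt/utt001.txt
/home/vagrant/data/an4_train/wav/utt002.wav,/home/vagrant/data/an4_train/txt/utt002.txt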

$ cd ..
$ python3 train.py +configs=an4
train.py:19: UserWarning:
config_path is not specified in @hydra.main().
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_hydra_main_config_path for more information.
@hydra.main(config_name="config")
/home/vagrant/.local/lib/python3.8/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
Global seed set to 123456
Error executing job with overrides: ['+configs=an4']
Traceback (most recent call last):
  File "train.py", line 21, in hydra_main
    train(cfg=cfg)
  File "/home/vagrant/deepspeech2/deepspeech.pytorch/deepspeech_pytorch/training.py", line 25, in train
    checkpoint_callback = FileCheckpointHandler(
  File "/home/vagrant/deepspeech2/deepspeech.pytorch/deepspeech_pytorch/checkpoint.py", line 16, in __init__
    super().__init__(
TypeError: __init__() got an unexpected keyword argument 'prefix'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Hmm, I feel like the build itself went fine, but something here is tricky... The 'prefix' error looks like a pytorch-lightning version mismatch (newer releases removed the prefix argument from ModelCheckpoint), so pinning an older pytorch-lightning would probably get past it.

[Speech Recognition] Installing Kaldi with more memory

I'll increase the VM's memory from 1 GB to 4 GB and try building Kaldi again.

Vagrantfile
L increase the memory to 4096

  config.vm.provider "virtualbox" do |vb|
    # Display the VirtualBox GUI when booting the machine
    # vb.gui = true
  
    # Customize the amount of memory on the VM:
    vb.customize ["modifyvm", :id, "--memory", 4096]
  end

$ free
total used free shared buff/cache available
Mem: 4030612 134668 3550068 940 345876 3669028
Swap: 0 0 0
$ nproc
2
When running make, pass -j 2 to use both CPUs.
$ sudo make -j 2
g++: fatal error: Killed signal terminated program cc1plus

Still no good, huh. I'll give up on Kaldi and move on to the next one.

[Speech Recognition] Speech recognition with Kaldi, part 1

Github: kaldi
Homepage: KALDI

# What is Kaldi?
Kaldi is a speech recognition toolkit written in C++
It is named after Kaldi, the Ethiopian goatherd said to have discovered the coffee plant
Code level integration with Finite State Transducers
Extensive linear algebra support

# Downloading Kaldi
$ git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
$ cd kaldi
INSTALL

Option 1 (bash + makefile):
  Steps:
    (1)
    go to tools/  and follow INSTALL instructions there.
    (2)
    go to src/ and follow INSTALL instructions there.

$ cd tools
$ extras/check_dependencies.sh
$ sudo apt-get install automake autoconf unzip sox gfortran libtool subversion python2.7
$ make

libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -fno-exceptions -funsigned-char -g -O2 -std=c++11 -MT fst-types.lo -MD -MP -MF .deps/fst-types.Tpo -c fst-types.cc -fPIC -DPIC -o .libs/fst-types.o

It stops here.
Why, though?

It seems MKL wasn't installed, so let's try again:
$ extras/check_dependencies.sh
$ extras/install_mkl.sh
$ make

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[3]: *** [Makefile:460: fst-types.lo] Error 1
make[3]: Leaving directory ‘/home/vagrant/kaldi/kaldi/tools/openfst-1.7.2/src/lib’
make[2]: *** [Makefile:370: install-recursive] Error 1
make[2]: Leaving directory ‘/home/vagrant/kaldi/kaldi/tools/openfst-1.7.2/src’
make[1]: *** [Makefile:426: install-recursive] Error 1
make[1]: Leaving directory ‘/home/vagrant/kaldi/kaldi/tools/openfst-1.7.2’
make: *** [Makefile:64: openfst_compiled] Error 2

This looks like the error you get when there isn't enough memory.

$ nproc
2
I want 4 GB of memory, more swap, and a parallel make across both CPUs.

[Speech Recognition] Let's try DeepSpeech

What is DeepSpeech?
- DeepSpeech is an open-source Speech-to-Text engine, trained by machine learning based on Baidu's Deep Speech research paper and using TensorFlow.

DeepSpeech Document: deepspeech.readthedocs.io.

# create a virtualenv
$ sudo apt install python3-virtualenv
$ virtualenv -p python3 deepspeech-venv
$ source deepspeech-venv/bin/activate

# install DeepSpeech
$ pip3 install deepspeech

# download pre-trained English model
$ curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
$ curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

# Download example audio files
$ curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
$ tar xvf audio-0.9.3.tar.gz

$ deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
Loading model from file deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-08-24 22:27:18.338821: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loaded model in 0.0447s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.00898s.
Running inference.
experience proves this
Inference took 2.371s for 1.975s audio file.

I see, it has some things in common with Julius.
Next I'd like to build real-time speech recognition from the microphone, without creating .wav files.