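Installing the Julius Speech Recognition Engine on macOS Monterey via Homebrew
Hello.
This article summarizes how to install the speech recognition engine Julius via Homebrew. I already manage most of my other packages with Homebrew, so I decided to try it for Julius as well.
Alternatively, you can download the source code from GitHub Releases, or clone the repository, and install from there.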
Environment (as of 2022/04/10)
macOS Monterey 12.2 / Intel Core i7
Homebrew 3.4.5
Julius v4.6
Julius Dictation Kit
You can check whether Julius can be installed on your OS on Homebrew Formulae. Note that Julius by itself does not include an acoustic model or a language model, so you also need to install the Julius Dictation Kit.
Install julius, julius-dictation-kit
julius-dictation-kit is not included in Homebrew itself; it is installed from Homebrew-nlp, a repository dedicated to natural language processing.
First, add Homebrew-nlp.
% brew tap uetchy/nlp
Next, install julius and julius-dictation-kit via Homebrew.
% brew install julius julius-dictation-kit
That completes the installation.
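To confirm that the install step worked, a quick check along the following lines can help. This is only a minimal sketch: `check_command` is a helper name I made up for illustration, and the Homebrew query is skipped when `brew` is not on the PATH.

```shell
# Report whether a command is available on PATH (hypothetical helper).
check_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# Only query Homebrew when it is actually installed.
if check_command brew; then
    # "brew list --versions" prints each installed formula with its version.
    brew list --versions julius julius-dictation-kit || true
fi

# The julius binary itself should now be reachable from the shell.
check_command julius || echo "julius is not on PATH; re-check the brew install step"
```

If both formulae are listed with a version and `julius: found` is printed, the packages are in place.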
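Trying out real-time speech recognition
Let's run speech recognition using the PC's microphone as input.
To run Julius, we point it at the models from the installed julius-dictation-kit.
The path of a package installed with Homebrew can be obtained with "brew --prefix julius-dictation-kit".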
% brew --prefix julius-dictation-kit
/usr/local/opt/julius-dictation-kit
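Since the same prefix is needed for every jconf file, it can be convenient to resolve it once and check that the files exist before starting Julius. The sketch below assumes the layout used in this article (`share/main.jconf` and `share/am-gmm.jconf` under the kit prefix); `kit_jconf` and `KIT_PREFIX` are names I introduced for illustration.

```shell
# Build the path of a jconf file shipped with the kit (hypothetical helper).
#   $1: kit prefix, e.g. the output of `brew --prefix julius-dictation-kit`
#   $2: jconf base name, e.g. "main" or "am-gmm"
kit_jconf() {
    printf '%s/share/%s.jconf\n' "$1" "$2"
}

# Resolve the prefix once; skip the check when brew or the formula is unavailable.
if command -v brew >/dev/null 2>&1; then
    KIT_PREFIX=$(brew --prefix julius-dictation-kit 2>/dev/null) || KIT_PREFIX=""
    if [ -n "$KIT_PREFIX" ]; then
        for name in main am-gmm; do
            path=$(kit_jconf "$KIT_PREFIX" "$name")
            if [ -f "$path" ]; then
                echo "ok: $path"
            else
                echo "missing: $path"
            fi
        done
    fi
fi
```

A "missing" line here usually means the kit is not installed or its layout differs from the one assumed above.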
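This time I used the GMM-based model.
One thing to watch out for when running it is to add the -nostrip option. Without this option, audio input could not be captured and Julius failed to start in my environment. The first run below omits it and fails; the second adds it and starts normally.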
% julius \
-C `brew --prefix julius-dictation-kit`/share/main.jconf \
-C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf
// (omitted) //
STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Error: adin_darwin: cannot set InputUnit's EnableIO(Input)
ERROR: m_adin: failed to ready input device
% julius -nostrip \
-C `brew --prefix julius-dictation-kit`/share/main.jconf \
-C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf
// (omitted) //
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension : LibSndFile
- Compiled by : gcc -g -O2 -fPIC
Library configuration: version 4.6
- Audio input
primary A/D-in driver : coreaudio (MacOSX CoreAudio)
available drivers :
wavefile formats : various formats by libsndfile ver.1
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
SSE AVX FMA
FMA is available maximum on this cpu, use it
- built-in CUDA support: no
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/jnas-tri-3k16-gid.binhmm
hmmmapfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/logicalTri
Language Model:
- LM00 "_default"
vocabulary filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.htkdic
n-gram filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.bingram (binary format)
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = ON
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
initial mean from file = N/A
beginning data weight = 100.00
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 16 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use 3-best of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=n-gram
N-gram info:
spec = 3-gram, backward (right-to-left)
OOV word = <unk>(id=2)
wordset size = 59084
1-gram entries = 59084 ( 0.5 MB)
2-gram entries = 2476660 ( 27.7 MB) (64% are valid contexts)
3-gram entries = 7894442 ( 52.8 MB)
LR 2-gram entries= 2476660 ( 9.7 MB)
pass1 = given additional forward 2-gram
Vocabulary Info:
vocabulary size = 64274 words, 366102 models
average word len = 5.7 models, 17.1 states
maximum state num = 54 nodes per word
transparent words = not exist
words under class = 9444 words
Parameters:
(-silhead)head sil word = 0: "<s> @0.000000 [] silB(silB)"
(-siltail)tail sil word = 1: "</s> @0.000000 [。] silE(silE)"
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 415714
root node num = 632
(148 hi-freq. words are separated from tree lexicon)
leaf node num = 64274
fact. node num = 64274
Inter-word N-gram cache:
root node to be cached = 195 / 631 (isolated only)
word ends to be cached = 59084 (all)
max. allocation size = 46MB
(-lmp) pass1 LM weight = 8.0 ins. penalty = -2.0
(-lmp2) pass2 LM weight = 8.0 ins. penalty = -2.0
(-transp)trans. penalty = +0.0 per word
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 1500
(-bs)score pruning thres= disabled
(-n)search candidate num= 30
(-s) search stack size = 500
(-m) search overflow = after 10000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 100
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 30 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use 3-best of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
factoring score: 1-gram prob. (statically assigned beforehand)
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = off
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = < 800 msec
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
------
### read waveform input
Stat: adin_portaudio: audio cycle buffer length = 256000 bytes
Stat: adin_portaudio: sound capture devices:
Stat: adin_portaudio: use default device
Stat: adin_portaudio: [Core Audio: MacBook Proのマイク]
Stat: adin_portaudio: (you can specify device by "PORTAUDIO_DEV_NUM=number"
Stat: adin_portaudio: try to set default low latency from portaudio: 0 msec
Stat: adin_portaudio: latency was set to 228.500000 msec
STAT: AD-in thread created
<<< please speak >>>
"please speak" is displayed, and Julius is now ready to recognize speech.
As a test, let's have it recognize "こんにちは" ("hello").
pass1_best: こんにちは 。
pass1_best_wordseq: <s> こんにちは+感動詞 </s>
pass1_best_phonemeseq: silB | k o N n i ch i w a | silE
pass1_best_score: -2789.926758
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 11897 generated, 1183 pushed, 197 nodes popped in 121
sentence1: こんにちは 。
wseq1: <s> こんにちは+感動詞 </s>
phseq1: silB | k o N n i ch i w a | silE
cmscore1: 0.709 0.651 1.000
score1: -2804.239258
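In the output above, the final result of the second pass is the line starting with `sentence1:`. If only the recognized text is needed, a small filter like the one below can pull it out. This is a sketch under the assumption that results are printed as `sentenceN: <text>` lines, as in the transcript; `extract_sentences` is a name I chose for illustration.

```shell
# Print only the final recognition results from Julius output read on stdin.
# Lines of interest look like "sentence1: <text>"; all other lines are dropped.
extract_sentences() {
    sed -n 's/^sentence[0-9][0-9]*:[[:space:]]*//p'
}

# Example with captured lines instead of a live microphone:
printf 'pass1_best: hello .\nsentence1: hello .\nscore1: -1.0\n' | extract_sentences
```

With a live microphone, the same filter could be appended to the command used earlier, for example `julius -nostrip -C ... -C ... | extract_sentences`, assuming Julius writes its results to standard output as it does in the transcript. I have not checked how output buffering behaves through a pipe.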
Closing thoughts
I have summarized how to install Julius via Homebrew.
Next, I would like to get recognition working offline on arbitrary audio sources.
That's all.