ASR Edge Computing
Below is a high-level explanation of the ASR (Automatic Speech Recognition) implementation in SigSRF software, which is based on the Kaldi open source speech recognition toolkit. A second implementation, targeting KubeEdge (the edge computing version of Kubernetes), is in progress.
Contents
Overview
ASR Offloading
Demo Capability
Kaldi Interface
Data Flow
Software Architecture
KubeEdge Integration
Kaldi Info
Run-Time Inference
Kaldi Integration
Kaldi Architecture, DNNs
Overview
SigSRF packet + media processing software
SigSRF is deployed by telecoms, LEAs, and analytics customers worldwide
extremely high capacity, robust packet interface, wideband audio codecs, jitter buffer and packet loss handling, stream alignment, etc.
ASR
recently added to SigSRF
based on Kaldi open source speech recognition toolkit
KubeEdge
edge computing version of Kubernetes
open source, owned and maintained by LF Edge (Linux Foundation)
ASR Offloading
Demo Capability
ASR based on Kaldi's mini-librispeech model
subset of the librispeech model, which has a 200k word vocabulary (English)
trained on fewer hours of audio, producing a smaller model that is easier to use for development and testing
demo uses pre-trained x-vectors and i-vectors - no training required
SigSRF packet + media software
voice/audio codecs - AMR, AMR-WB, EVS, G.729, G.726
RFCs - child streams (RFC 8108), DTMF events (RFC 4733), RFC 7198, others
concurrent sessions - 8 (demo subset of 512)
packet handling - jitter buffer, DTX, packet loss mitigation
call groups
Call groups (one or more endpoints)
conferencing, merging, deduplication
ASR is applied to call group output
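As a minimal sketch of the idea behind applying ASR to call group output, assume the simplest case of two endpoints merged by sample-wise mixing with saturation; the actual SigSRF merging and deduplication logic is more involved and is not shown here, and the function name below is invented for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: mix two decoded endpoint streams (16-bit PCM)
// into one call group output stream, with saturation, before handing it to ASR.
std::vector<int16_t> merge_endpoints(const std::vector<int16_t>& a,
                                     const std::vector<int16_t>& b) {
  size_t n = std::max(a.size(), b.size());
  std::vector<int16_t> out(n);
  for (size_t i = 0; i < n; i++) {
    int32_t sum = (i < a.size() ? a[i] : 0) + (i < b.size() ? b[i] : 0);
    out[i] = static_cast<int16_t>(std::clamp<int32_t>(sum, -32768, 32767));  // saturate
  }
  return out;  // call group output, fed to the ASR engine
}
```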
Kaldi Interface
Expects wideband (16 kHz) audio
for accuracy benchmarks
for training augmentation, R&D work, published results
Real-time inference is called "online decoding"
Kaldi run-time inference expects raw 16-bit audio chunks
for a packet interface, audio can be transported in one of two ways
raw audio over TCP/IP (see the sketch at the end of this section)
RTP audio packets received and decoded by GStreamer
GStreamer not suitable for telecom / wideband audio
doesn't support EVS or AMR-WB; limited support for concurrent threads and reliability
lacks advanced handling for packet loss, stream alignment between multiple streams within a call, stream gaps (call waiting, music on hold), etc.
doesn't support RFC 8108 (multiple streams from one endpoint)
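To illustrate the raw-audio-over-TCP/IP option above, here is a minimal sketch of a client that streams 16-bit PCM chunks to a Kaldi online decoding server over TCP (Kaldi includes a TCP decoding binary, online2-tcp-nnet3-decode-faster, that accepts audio this way). The host, port, and chunk size below are assumptions for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

int main() {
  int sock = socket(AF_INET, SOCK_STREAM, 0);
  if (sock < 0) return 1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(5050);                       // assumed decoder port
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);   // assumed decoder host

  if (connect(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) return 1;

  // 160 ms of 16 kHz wideband audio per chunk (2560 samples, 16-bit)
  std::vector<int16_t> chunk(2560, 0);               // silence placeholder; real audio comes from decoded RTP

  for (int i = 0; i < 100; i++) {                    // stream ~16 s of audio
    if (send(sock, chunk.data(), chunk.size() * sizeof(int16_t), 0) < 0) break;
  }

  close(sock);
  return 0;
}
```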
Data Flow
SigSRF replaces GStreamer
minimum REST APIs required - session create/delete/modify (see the sketch at the end of this section)
session create can specify media parameters explicitly, or give only IP:port and let SigSRF auto-detect codec, bitrate, ptime, etc. from the RTP data flow
packet input via UDP; pcap input for R&D and testing purposes
Inferlib
inference library that interfaces to the Kaldi run-time libs
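Below is a sketch of what a session create request might look like, assuming a JSON-over-HTTP REST interface; the endpoint URL and JSON field names are hypothetical, not the actual SigSRF API, and illustrate only the "give IP:port and let SigSRF auto-detect" case.

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (!curl) return 1;

  // Hypothetical session create body: only IP:port given, codec/bitrate/ptime auto-detected
  std::string body = R"({"remote_ip":"192.168.1.50","remote_port":10240,"codec":"auto"})";

  struct curl_slist* hdrs = curl_slist_append(nullptr, "Content-Type: application/json");

  curl_easy_setopt(curl, CURLOPT_URL, "http://sigsrf-node:8080/sessions");  // hypothetical endpoint
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());

  CURLcode rc = curl_easy_perform(curl);
  if (rc != CURLE_OK)
    std::cerr << "session create failed: " << curl_easy_strerror(rc) << "\n";

  curl_slist_free_all(hdrs);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}
```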
Software Architecture
KubeEdge Integration
SigSRF and Kaldi libs inside KubeEdge container
minimum 4 x86 cores, 32 GB mem, 1 TB HDD
Mobile device app
creates ASR sessions with REST APIs
push-to-talk; sends codec output packets via UDP/RTP
possibly we can send copies of in-call codec packets
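A minimal sketch of the push-to-talk upload path, assuming the mobile app wraps each encoded voice frame in an RTP header and sends it over UDP; the destination address, dynamic payload type, and EVS frame size used here are assumptions for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

int main() {
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) return 1;

  sockaddr_in dst{};
  dst.sin_family = AF_INET;
  dst.sin_port = htons(10240);                       // assumed SigSRF session port
  inet_pton(AF_INET, "192.168.1.50", &dst.sin_addr); // assumed SigSRF node address

  constexpr int kPayloadBytes = 61;                  // e.g. one 20 ms EVS 24.4 kbps frame
  std::vector<uint8_t> frame(kPayloadBytes, 0);      // placeholder bytes; real data comes from the encoder

  uint16_t seq = 0;
  uint32_t timestamp = 0, ssrc = 0x12345678;

  for (int i = 0; i < 50; i++) {                     // 50 frames = 1 s at 20 ms ptime
    uint8_t pkt[12 + kPayloadBytes];
    pkt[0] = 0x80;                                   // RTP version 2, no padding/extension/CSRC
    pkt[1] = 96;                                     // dynamic payload type (assumed)
    pkt[2] = seq >> 8;  pkt[3] = seq & 0xff;
    uint32_t ts = htonl(timestamp), s = htonl(ssrc);
    memcpy(pkt + 4, &ts, 4);
    memcpy(pkt + 8, &s, 4);
    memcpy(pkt + 12, frame.data(), frame.size());

    sendto(sock, pkt, sizeof(pkt), 0, reinterpret_cast<sockaddr*>(&dst), sizeof(dst));
    seq++;
    timestamp += 320;                                // 20 ms at the 16 kHz RTP clock used by wideband codecs
  }

  close(sock);
  return 0;
}
```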
Run-Time Inference
One end-to-end thread runs on one Xeon x86 core
input is 16-bit raw audio, output is ASR text (plus logs, stat files, etc.); a single-thread flow is sketched at the end of this section
ARM cores can be used, but support from Kaldi user groups is limited
Kaldi developers are focused on state-of-the-art R&D
they maintain a sweet spot of about 2x RTF (real-time factor) and don't use OpenMP, TBB, or other HPC multicore methods
DNN and HMM architecture, improved training methods, and accuracy are higher priorities than performance
not focused on concurrent streams, high capacity, reliability, etc.
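To make the single-thread flow concrete, here is a hypothetical sketch of the kind of interface inferlib could expose, with stub bodies so it compiles stand-alone; the class name, method signatures, and model path are invented for illustration and are not the actual inferlib or Kaldi API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical interface, invented for illustration (not the actual inferlib API).
// One end-to-end thread: 16-bit raw audio in, ASR text out.
class AsrEngine {
 public:
  explicit AsrEngine(const std::string& model_dir) {
    (void)model_dir;  // stub: a real implementation would load the nnet3 model and decoding graph
  }
  void Feed(const int16_t* samples, size_t count) {
    total_samples_ += count;  // stub: a real implementation would run feature extraction + online decoding
    (void)samples;
  }
  std::string Result(bool finalize) {
    // Stub: a real implementation would return the partial or final best hypothesis
    return finalize ? "final transcript placeholder" : "partial transcript placeholder";
  }
 private:
  size_t total_samples_ = 0;
};

int main() {
  AsrEngine asr("mini_librispeech/");            // assumed model directory

  std::vector<int16_t> chunk(2560, 0);           // 160 ms of 16 kHz audio (silence placeholder)
  for (int i = 0; i < 10; i++) {                 // in SigSRF this would loop over call group output
    asr.Feed(chunk.data(), chunk.size());
    std::cout << asr.Result(false) << "\n";      // partial hypothesis after each chunk
  }
  std::cout << asr.Result(true) << "\n";         // final transcript at end of utterance
  return 0;
}
```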
Kaldi Integration
Kaldi is its own framework
main Kaldi contributors are working on PyTorch support
partially supports TensorFlow, but the main contributors are no longer working on it
no support for Caffe, MXnet, etc.
Integrating Kaldi into production applications takes effort
the developer interface is based on Linux shell scripts, so we traced inference scripts + binaries to find the APIs that inferlib must support
if you ask questions on Kaldi forums about improving performance, reducing model size, concurrent threads, etc., you will get only general advice
Acceleration
GPUs are supported by Nvidia tech personnel on kaldi-asr.org
the same seems to be the case for OpenVINO (Intel)
Kaldi Architecture, DNNs
Architecture
Sliding FFT - converts time domain data (time series) into frequency domain data for the DNN input layers (ILn)
uses "chain" models: DNN1 + xMM2
AM (acoustic model) recognizes phonemes
phonemes vary depending on context, so "tri-phones" are used
LM (language model) recognizes words as tri-phone combinations
DNN frequency domain data
formed by sliding FFT analysis of incoming time series data; each FFT frame output is similar to the response of the cochlea in the human ear
groups of FFT frames form images
successive images are processed by a "TDNN" (time delay DNN), similar to a series of CNNs (convolutional neural networks)
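A minimal sketch of the sliding FFT idea, assuming a 25 ms window and 10 ms shift at 16 kHz (common front-end defaults, not necessarily what the demo model uses) and a naive DFT so the example is self-contained; a real front end would use an FFT plus windowing and mel binning.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustration only: slide a 25 ms window (400 samples at 16 kHz) in 10 ms steps
// (160 samples) and compute a magnitude spectrum per frame with a naive DFT.
// Groups of these frames form the "images" fed to the DNN input layers.
std::vector<std::vector<float>> sliding_spectra(const std::vector<int16_t>& pcm) {
  const int win = 400, shift = 160, nbins = win / 2 + 1;
  const double pi = 3.14159265358979323846;
  std::vector<std::vector<float>> frames;
  for (size_t start = 0; start + win <= pcm.size(); start += shift) {
    std::vector<float> mag(nbins);
    for (int k = 0; k < nbins; k++) {            // naive O(N^2) DFT; real code uses an FFT
      double re = 0.0, im = 0.0;
      for (int n = 0; n < win; n++) {
        double ang = 2.0 * pi * k * n / win;
        re += pcm[start + n] * std::cos(ang);
        im -= pcm[start + n] * std::sin(ang);
      }
      mag[k] = static_cast<float>(std::sqrt(re * re + im * im));
    }
    frames.push_back(mag);                       // one spectral slice, like a cochlea snapshot
  }
  return frames;
}
```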
Training
DNNs saved as "x-vectors" and "i-vectors"
HMMs / GMMs (Gaussian mixture models) saved as FSTs (finite state transducers)