ASR Edge Computing
Below is a high-level explanation of the ASR (Automatic Speech Recognition) implementation in SigSRF software, which is based on the Kaldi open source speech recognition toolkit. A second implementation, targeting KubeEdge (the edge computing version of Kubernetes), is in progress.
Contents
Overview
ASR Offloading
Demo Capability
Kaldi Interface
Data Flow
Software Architecture
KubeEdge Integration
Kaldi Info
Run-Time Inference
Kaldi Integration
Kaldi Architecture, DNNs
Overview
SigSRF packet + media processing software
SigSRF is deployed by telecoms, LEAs, and analytics customers worldwide
extremely high capacity, robust packet interface, wideband audio codecs, jitter buffer and packet loss handling, stream alignment, etc.
ASR
recently added to SigSRF
based on Kaldi open source speech recognition toolkit
KubeEdge
edge computing version of Kubernetes
open source, owned and maintained by LF Edge (Linux Foundation)
ASR Offloading
Demo Capability
ASR based on Kaldi's mini-librispeech model
subset of the librispeech model, which has a 200k word vocabulary (English)
trained on fewer hours of audio, producing a smaller model that is easier to use for development and testing
demo uses pre-trained x-vectors and i-vectors - no training required
SigSRF packet + media software
voice/audio codecs - AMR, AMR-WB, EVS, G.729, G.726
RFCs - child streams (RFC 8108), DTMF events (RFC 4733), RFC 7198, others
concurrent sessions - 8 (demo subset of 512)
packet handling - jitter buffer, DTX, packet loss mitigation
call groups
Call groups (one or more endpoints)
conferencing, merging, deduplication
ASR is applied to call group output
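As a minimal sketch of the idea behind applying ASR to call group output, assume the simplest case of two endpoints merged by sample-wise mixing with saturation; the actual SigSRF merging and deduplication logic is more involved and is not shown here, and the function name below is invented for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: mix two decoded endpoint streams (16-bit PCM)
// into one call group output stream, with saturation, before handing it to ASR.
std::vector<int16_t> merge_endpoints(const std::vector<int16_t>& a,
                                     const std::vector<int16_t>& b) {
  size_t n = std::max(a.size(), b.size());
  std::vector<int16_t> out(n);
  for (size_t i = 0; i < n; i++) {
    int32_t sum = (i < a.size() ? a[i] : 0) + (i < b.size() ? b[i] : 0);
    out[i] = static_cast<int16_t>(std::clamp<int32_t>(sum, -32768, 32767));  // saturate
  }
  return out;  // call group output, fed to the ASR engine
}
```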
Kaldi Interface
Expects wideband (16 kHz) audio
for accuracy benchmarks
for training augmentation, R&D work, published results
Real-time inference is called "online decoding"
Kaldi run-time inference expects raw 16-bit audio chunks
for a packet interface, audio can be transported in one of two ways
raw audio over TCP/IP (see the sketch at the end of this section)
RTP audio packets received and decoded by GStreamer
GStreamer not suitable for telecom / wideband audio
doesn't support EVS or AMR-WB; limited support for concurrent threads and reliability
lacks advanced handling for packet loss, stream alignment between multiple streams within a call, stream gaps (call waiting, music on hold), etc.
doesn't support RFC 8108 (multiple streams from one endpoint)
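To illustrate the raw-audio-over-TCP/IP option above, here is a minimal sketch of a client that streams 16-bit PCM chunks to a Kaldi online decoding server over TCP (Kaldi includes a TCP decoding binary, online2-tcp-nnet3-decode-faster, that accepts audio this way). The host, port, and chunk size below are assumptions for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

int main() {
  int sock = socket(AF_INET, SOCK_STREAM, 0);
  if (sock < 0) return 1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(5050);                       // assumed decoder port
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);   // assumed decoder host

  if (connect(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) return 1;

  // 160 ms of 16 kHz wideband audio per chunk (2560 samples, 16-bit)
  std::vector<int16_t> chunk(2560, 0);               // silence placeholder; real audio comes from decoded RTP

  for (int i = 0; i < 100; i++) {                    // stream ~16 s of audio
    if (send(sock, chunk.data(), chunk.size() * sizeof(int16_t), 0) < 0) break;
  }

  close(sock);
  return 0;
}
```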
Data Flow
SigSRF replaces GStreamer
minimum REST APIs required - session create/delete/modify (see the sketch at the end of this section)
session create can specify media parameters explicitly, or give only IP:port and let SigSRF auto-detect codec, bitrate, ptime, etc. from the RTP data flow
packet input via UDP; pcap input for R&D and testing purposes
Inferlib
inference library that interfaces to the Kaldi run-time libs
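Below is a sketch of what a session create request might look like, assuming a JSON-over-HTTP REST interface; the endpoint URL and JSON field names are hypothetical, not the actual SigSRF API, and illustrate only the "give IP:port and let SigSRF auto-detect" case.

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (!curl) return 1;

  // Hypothetical session create body: only IP:port given, codec/bitrate/ptime auto-detected
  std::string body = R"({"remote_ip":"192.168.1.50","remote_port":10240,"codec":"auto"})";

  struct curl_slist* hdrs = curl_slist_append(nullptr, "Content-Type: application/json");

  curl_easy_setopt(curl, CURLOPT_URL, "http://sigsrf-node:8080/sessions");  // hypothetical endpoint
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());

  CURLcode rc = curl_easy_perform(curl);
  if (rc != CURLE_OK)
    std::cerr << "session create failed: " << curl_easy_strerror(rc) << "\n";

  curl_slist_free_all(hdrs);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}
```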
Software Architecture
KubeEdge Integration
SigSRF and Kaldi libs inside KubeEdge container
minimum 4 x86 cores, 32 GB mem, 1 TB HDD
Mobile device app
creates ASR sessions with REST APIs
push-to-talk; sends codec output packets via UDP/RTP
possibly we can send copies of in-call codec packets
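A minimal sketch of the push-to-talk upload path, assuming the mobile app wraps each encoded voice frame in an RTP header and sends it over UDP; the destination address, dynamic payload type, and EVS frame size used here are assumptions for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

int main() {
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) return 1;

  sockaddr_in dst{};
  dst.sin_family = AF_INET;
  dst.sin_port = htons(10240);                       // assumed SigSRF session port
  inet_pton(AF_INET, "192.168.1.50", &dst.sin_addr); // assumed SigSRF node address

  constexpr int kPayloadBytes = 61;                  // e.g. one 20 ms EVS 24.4 kbps frame
  std::vector<uint8_t> frame(kPayloadBytes, 0);      // placeholder bytes; real data comes from the encoder

  uint16_t seq = 0;
  uint32_t timestamp = 0, ssrc = 0x12345678;

  for (int i = 0; i < 50; i++) {                     // 50 frames = 1 s at 20 ms ptime
    uint8_t pkt[12 + kPayloadBytes];
    pkt[0] = 0x80;                                   // RTP version 2, no padding/extension/CSRC
    pkt[1] = 96;                                     // dynamic payload type (assumed)
    pkt[2] = seq >> 8;  pkt[3] = seq & 0xff;
    uint32_t ts = htonl(timestamp), s = htonl(ssrc);
    memcpy(pkt + 4, &ts, 4);
    memcpy(pkt + 8, &s, 4);
    memcpy(pkt + 12, frame.data(), frame.size());

    sendto(sock, pkt, sizeof(pkt), 0, reinterpret_cast<sockaddr*>(&dst), sizeof(dst));
    seq++;
    timestamp += 320;                                // 20 ms at the 16 kHz RTP clock used by wideband codecs
  }

  close(sock);
  return 0;
}
```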
Run-Time Inference
One end-to-end thread runs on one Xeon x86 core
input is 16-bit raw audio, output is ASR text (plus logs, stat files, etc.); a single-thread flow is sketched at the end of this section
ARM cores can be used, but support from Kaldi user groups is limited
Kaldi developers are focused on state-of-the-art R&D
they maintain a sweet spot of about 2x RTF (real-time factor) and don't use OpenMP, TBB, or other HPC multicore methods
DNN and HMM architecture, improved training methods, and accuracy are higher priorities than performance
not focused on concurrent streams, high capacity, reliability, etc.
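To make the single-thread flow concrete, here is a hypothetical sketch of the kind of interface inferlib could expose, with stub bodies so it compiles stand-alone; the class name, method signatures, and model path are invented for illustration and are not the actual inferlib or Kaldi API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical interface, invented for illustration (not the actual inferlib API).
// One end-to-end thread: 16-bit raw audio in, ASR text out.
class AsrEngine {
 public:
  explicit AsrEngine(const std::string& model_dir) {
    (void)model_dir;  // stub: a real implementation would load the nnet3 model and decoding graph
  }
  void Feed(const int16_t* samples, size_t count) {
    total_samples_ += count;  // stub: a real implementation would run feature extraction + online decoding
    (void)samples;
  }
  std::string Result(bool finalize) {
    // Stub: a real implementation would return the partial or final best hypothesis
    return finalize ? "final transcript placeholder" : "partial transcript placeholder";
  }
 private:
  size_t total_samples_ = 0;
};

int main() {
  AsrEngine asr("mini_librispeech/");            // assumed model directory

  std::vector<int16_t> chunk(2560, 0);           // 160 ms of 16 kHz audio (silence placeholder)
  for (int i = 0; i < 10; i++) {                 // in SigSRF this would loop over call group output
    asr.Feed(chunk.data(), chunk.size());
    std::cout << asr.Result(false) << "\n";      // partial hypothesis after each chunk
  }
  std::cout << asr.Result(true) << "\n";         // final transcript at end of utterance
  return 0;
}
```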
Kaldi Integration
Kaldi is its own framework
main Kaldi contributors are working on PyTorch support
partially supports TensorFlow, but the main contributors are no longer working on it
no support for Caffe, MXnet, etc.
Integrating Kaldi into production applications takes effort
the developer interface is based on Linux shell scripts, so we traced inference scripts + binaries to find the APIs that inferlib must support
if you ask questions on Kaldi forums about improving performance, reducing model size, concurrent threads, etc., you will get only general advice
Acceleration
GPUs are supported by Nvidia tech personnel on kaldi-asr.org
the same seems to be the case for OpenVINO (Intel)
Kaldi Architecture, DNNs
Architecture
Sliding FFT - converts time domain data (time series) into frequency domain data for the DNN input layers (ILn)
uses "chain" models: DNN1 + xMM2
AM (acoustic model) recognizes phonemes
phonemes vary depending on context, so "tri-phones" are used
LM (language model) recognizes words as tri-phone combinations
DNN frequency domain data
formed by sliding FFT analysis of incoming time series data; each FFT frame output is similar to the response of the cochlea in the human ear
groups of FFT frames form images
successive images are processed by a "TDNN" (time delay DNN), similar to a series of CNNs (convolutional neural networks)
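A minimal sketch of the sliding FFT idea, assuming a 25 ms window and 10 ms shift at 16 kHz (common front-end defaults, not necessarily what the demo model uses) and a naive DFT so the example is self-contained; a real front end would use an FFT plus windowing and mel binning.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustration only: slide a 25 ms window (400 samples at 16 kHz) in 10 ms steps
// (160 samples) and compute a magnitude spectrum per frame with a naive DFT.
// Groups of these frames form the "images" fed to the DNN input layers.
std::vector<std::vector<float>> sliding_spectra(const std::vector<int16_t>& pcm) {
  const int win = 400, shift = 160, nbins = win / 2 + 1;
  const double pi = 3.14159265358979323846;
  std::vector<std::vector<float>> frames;
  for (size_t start = 0; start + win <= pcm.size(); start += shift) {
    std::vector<float> mag(nbins);
    for (int k = 0; k < nbins; k++) {            // naive O(N^2) DFT; real code uses an FFT
      double re = 0.0, im = 0.0;
      for (int n = 0; n < win; n++) {
        double ang = 2.0 * pi * k * n / win;
        re += pcm[start + n] * std::cos(ang);
        im -= pcm[start + n] * std::sin(ang);
      }
      mag[k] = static_cast<float>(std::sqrt(re * re + im * im));
    }
    frames.push_back(mag);                       // one spectral slice, like a cochlea snapshot
  }
  return frames;
}
```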
Training
DNNs saved as "x-vectors" and "i-vectors"
HMMs / GMMs (Gaussian mixture models) saved as FSTs (finite state transducers)