
## Overview

Over the past thirty years, speech recognition technology has made significant progress, moving from the lab to the market. It is becoming an important part of people's lives and is found at work, at home, in cars, in medicine, and in other fields. It is one of the top 10 emerging technologies in the world.

As a result of recent years' developments, the main algorithm of speech recognition technology has changed from GMM (Gaussian Mixture Model) and HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) to DNN (Deep Neural Network). A DNN works in a way loosely similar to the human brain; it is a very complicated, computation-heavy model built on huge amounts of data. Thanks to the internet, all we need is a smartphone, and we do not have to care about the huge number of servers in a remote computer room that make it all happen. Without the internet, the speech recognition service on a mobile device is nearly useless: only rarely can it understand what you said and act on it.

Is it possible to move the DNN calculation process from the server to the mobile device itself, such as a phone or tablet? The answer is YES.

With support for the SSSE3 instruction set on Intel CPUs, we can easily run a DNN-based speech recognition application without the internet. In our tests the accuracy is over 80%, which is very close to the result of online-mode tests. Adding direct SSSE3 support creates a good user experience on mobile devices. In this article I will explain what a DNN is and how the Intel® SSSE3 instruction set helps to accelerate the DNN calculation process.

## Introduction

DNN is the abbreviation for Deep Neural Network, a feed-forward network that contains many hidden layers. DNNs have been a hot spot in machine learning in recent years and have produced a wide range of applications. A DNN has a deep structure with tens of millions of parameters to be learned, so training is very time consuming.

Speech recognition is a typical application of DNN. Put simply, a speech recognition application consists of an acoustic model, a language model, and a decoding process. The acoustic model is used to simulate the probability distribution of pronunciation. The language model is used to simulate the relationship between words. The decoding stage uses these two models to translate sound into text. A neural network has the ability to simulate any distribution. A deep neural network has stronger expressive ability than a shallow one: it simulates the deep structure of the brain and can "understand" the characteristics of things more accurately. So, compared with other methods, a deep neural network can simulate the acoustic and language models more accurately.

## Typical DNN Chart

A typical DNN generally consists of multiple alternating linear and non-linear layers stacked on top of one another, as shown below:

In Figure 2, the linear layer is fully connected, and the mapping from input to output can be described by this formula:

Y^{T} = X^{T}W^{T} + B

X^{T} is a row vector that serves as the input to the neural network. In a speech recognition application, we generally process 4 frames of data together, so the input becomes a 4xM matrix. W^{T} and B are the linear transformation matrix and the offset vector of the neural network; the matrix is usually very large and square.
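As a rough illustration of the formula above, the following minimal sketch computes the linear layer for a small batch of frames in plain C. The function name `linear_layer`, the row-major layout, and the use of `float` are assumptions made for this example, not details of the original implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the article): computes Y = X * W^T + B
 * for a batch of `frames` input rows.
 *   x: frames x in_dim   (row-major, one row per speech frame)
 *   w: out_dim x in_dim  (row-major, row j holds the weights of output j)
 *   b: out_dim           (offset vector, added to every output row)
 *   y: frames x out_dim  (row-major result)
 */
void linear_layer(const float *x, const float *w, const float *b,
                  float *y, size_t frames, size_t in_dim, size_t out_dim)
{
    assert(x && w && b && y);
    for (size_t f = 0; f < frames; ++f) {
        for (size_t j = 0; j < out_dim; ++j) {
            float acc = b[j];                 /* start from the offset */
            for (size_t i = 0; i < in_dim; ++i)
                acc += x[f * in_dim + i] * w[j * in_dim + i];
            y[f * out_dim + j] = acc;
        }
    }
}
```

With `frames` set to 4, this corresponds to the 4xM input matrix described above. The innermost loop is a multiply-and-accumulate over one row of weights, which is the kind of operation SIMD instructions can process several elements at a time.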

## Intel® SSSE3 Instruction Set

Supplemental Streaming SIMD Extensions 3, or SSSE3 for short, is Intel's name for an extension of the SSE3 instruction set. The SSSE3 instruction set is part of SIMD technology, which has been integrated into Intel CPUs and helps to improve multimedia processing, coding/decoding, and calculations. Using the SSSE3 instruction set, a single instruction can process multiple data elements at once, which can greatly improve a program's efficiency. It works particularly well for matrix calculations.

To use the SSSE3 instruction set, we should first declare and include the SIMD header files:

```c
#include <mmintrin.h>   //MMX
#include <xmmintrin.h>  //SSE (include mmintrin.h)
#include <emmintrin.h>  //SSE2 (include xmmintrin.h)
#include <pmmintrin.h>  //SSE3 (include emmintrin.h)
#include <tmmintrin.h>  //SSSE3 (include pmmintrin.h)
#include <smmintrin.h>  //SSE4.1 (include tmmintrin.h)
#include <nmmintrin.h>  //SSE4.2 (include smmintrin.h)
#include <wmmintrin.h>  //AES (include nmmintrin.h)
#include <immintrin.h>  //AVX (include wmmintrin.h)
#include <intrin.h>     //(include immintrin.h)
```

The header file "*tmmintrin.h*" is for SSSE3, and the functions defined in this file are below:

```c
/* Add horizontally packed [saturated] words, double words,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=a0+a1, r1=a2+a3, r2=a4+a5, r3=a6+a7, r4=b0+b1, r5=b2+b3, r6=b4+b5, r7=b6+b7
extern __m128i _mm_hadd_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=a0+a1, r1=a2+a3, r2=b0+b1, r3=b2+b3
extern __m128i _mm_hadd_epi32 (__m128i a, __m128i b);
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=SATURATE_16(a0+a1), ..., r3=SATURATE_16(a6+a7),
// r4=SATURATE_16(b0+b1), ..., r7=SATURATE_16(b6+b7)
extern __m128i _mm_hadds_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=a0+a1, r1=a2+a3, r2=b0+b1, r3=b2+b3
extern __m64 _mm_hadd_pi16 (__m64 a, __m64 b);
// a=(a0, a1), b=(b0, b1), then r0=a0+a1, r1=b0+b1
extern __m64 _mm_hadd_pi32 (__m64 a, __m64 b);
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=SATURATE_16(a0+a1), r1=SATURATE_16(a2+a3),
// r2=SATURATE_16(b0+b1), r3=SATURATE_16(b2+b3)
extern __m64 _mm_hadds_pi16 (__m64 a, __m64 b);

/* Subtract horizontally packed [saturated] words, double words,
   {X,}MM2/m{128,64} (b) from {X,}MM1 (a). */
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=a0-a1, r1=a2-a3, r2=a4-a5, r3=a6-a7, r4=b0-b1, r5=b2-b3, r6=b4-b5, r7=b6-b7
extern __m128i _mm_hsub_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=a0-a1, r1=a2-a3, r2=b0-b1, r3=b2-b3
extern __m128i _mm_hsub_epi32 (__m128i a, __m128i b);
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=SATURATE_16(a0-a1), ..., r3=SATURATE_16(a6-a7),
// r4=SATURATE_16(b0-b1), ..., r7=SATURATE_16(b6-b7)
extern __m128i _mm_hsubs_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=a0-a1, r1=a2-a3, r2=b0-b1, r3=b2-b3
extern __m64 _mm_hsub_pi16 (__m64 a, __m64 b);
// a=(a0, a1), b=(b0, b1), then r0=a0-a1, r1=b0-b1
extern __m64 _mm_hsub_pi32 (__m64 a, __m64 b);
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=SATURATE_16(a0-a1), r1=SATURATE_16(a2-a3),
// r2=SATURATE_16(b0-b1), r3=SATURATE_16(b2-b3)
extern __m64 _mm_hsubs_pi16 (__m64 a, __m64 b);

/* Multiply and add packed words,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, ..., a13, a14, a15), b=(b0, b1, b2, ..., b13, b14, b15)
// then r0=SATURATE_16((a0*b0)+(a1*b1)), ..., r7=SATURATE_16((a14*b14)+(a15*b15))
// Parameter a contains unsigned bytes. Parameter b contains signed bytes.
extern __m128i _mm_maddubs_epi16 (__m128i a, __m128i b);
// SATURATE_16(x) is ((x > 32767) ? 32767 : ((x < -32768) ? -32768 : x))
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=SATURATE_16((a0*b0)+(a1*b1)), ..., r3=SATURATE_16((a6*b6)+(a7*b7))
// Parameter a contains unsigned bytes. Parameter b contains signed bytes.
extern __m64 _mm_maddubs_pi16 (__m64 a, __m64 b);

/* Packed multiply high integers with round and scaling,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0=INT16(((a0*b0)+0x4000) >> 15), ..., r7=INT16(((a7*b7)+0x4000) >> 15)
extern __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0=INT16(((a0*b0)+0x4000) >> 15), ..., r3=INT16(((a3*b3)+0x4000) >> 15)
extern __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b);

/* Packed shuffle bytes
   {X,}MM2/m{128,64} (b) by {X,}MM1 (a). */
// SELECT(a, n) extracts the nth 8-bit parameter from a. The 0th 8-bit parameter
// is the least significant 8 bits. b=(b0, b1, b2, ..., b13, b14, b15), b is the mask
// then r0 = (b0 & 0x80) ? 0 : SELECT(a, b0 & 0x0f), ...,
// r15 = (b15 & 0x80) ? 0 : SELECT(a, b15 & 0x0f)
extern __m128i _mm_shuffle_epi8 (__m128i a, __m128i b);
// SELECT(a, n) extracts the nth 8-bit parameter from a. The 0th 8-bit parameter
// is the least significant 8 bits. b=(b0, b1, ..., b7), b is the mask
// then r0 = (b0 & 0x80) ? 0 : SELECT(a, b0 & 0x07), ...,
// r7 = (b7 & 0x80) ? 0 : SELECT(a, b7 & 0x07)
extern __m64 _mm_shuffle_pi8 (__m64 a, __m64 b);

/* Packed byte, word, double word sign,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// a=(a0, a1, a2, ..., a13, a14, a15), b=(b0, b1, b2, ..., b13, b14, b15)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
// r15 = (b15 < 0) ? -a15 : ((b15 == 0) ? 0 : a15)
extern __m128i _mm_sign_epi8 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
// r7 = (b7 < 0) ? -a7 : ((b7 == 0) ? 0 : a7)
extern __m128i _mm_sign_epi16 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
// r3 = (b3 < 0) ? -a3 : ((b3 == 0) ? 0 : a3)
extern __m128i _mm_sign_epi32 (__m128i a, __m128i b);
// a=(a0, a1, a2, a3, a4, a5, a6, a7), b=(b0, b1, b2, b3, b4, b5, b6, b7)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
// r7 = (b7 < 0) ? -a7 : ((b7 == 0) ? 0 : a7)
extern __m64 _mm_sign_pi8 (__m64 a, __m64 b);
// a=(a0, a1, a2, a3), b=(b0, b1, b2, b3)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0), ...,
// r3 = (b3 < 0) ? -a3 : ((b3 == 0) ? 0 : a3)
extern __m64 _mm_sign_pi16 (__m64 a, __m64 b);
// a=(a0, a1), b=(b0, b1)
// then r0 = (b0 < 0) ? -a0 : ((b0 == 0) ? 0 : a0),
// r1 = (b1 < 0) ? -a1 : ((b1 == 0) ? 0 : a1)
extern __m64 _mm_sign_pi32 (__m64 a, __m64 b);

/* Packed align and shift right by n*8 bits,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// n: a constant that specifies how many bytes the interim result will be
// shifted to the right. If n > 32, the result value is zero.
// CONCAT(a, b) is the 256-bit unsigned intermediate value that is a
// concatenation of parameters a and b.
// The result is the low 128 bits of this intermediate value
// after shifting it right by n bytes:
// r = LOW_128(CONCAT(a, b) >> (n * 8))
extern __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n);
// n: an integer constant that specifies how many bytes to shift the interim
// result to the right. If n > 16, the result value is zero.
// CONCAT(a, b) is the 128-bit unsigned intermediate value that is formed by
// concatenating parameters a and b.
// The result is the rightmost 64 bits after shifting this intermediate
// value right by n bytes:
// r = LOW_64(CONCAT(a, b) >> (n * 8))
extern __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n);

/* Packed byte, word, double word absolute value,
   {X,}MM2/m{128,64} (b) to {X,}MM1 (a). */
// a=(a0, a1, a2, ..., a13, a14, a15)
// then r0 = (a0 < 0) ? -a0 : a0, ..., r15 = (a15 < 0) ? -a15 : a15
extern __m128i _mm_abs_epi8 (__m128i a);
// a=(a0, a1, a2, a3, a4, a5, a6, a7)
// then r0 = (a0 < 0) ? -a0 : a0, ..., r7 = (a7 < 0) ? -a7 : a7
extern __m128i _mm_abs_epi16 (__m128i a);
// a=(a0, a1, a2, a3)
// then r0 = (a0 < 0) ? -a0 : a0, ..., r3 = (a3 < 0) ? -a3 : a3
extern __m128i _mm_abs_epi32 (__m128i a);
// a=(a0, a1, a2, a3, a4, a5, a6, a7)
// then r0 = (a0 < 0) ? -a0 : a0, ..., r7 = (a7 < 0) ? -a7 : a7
extern __m64 _mm_abs_pi8 (__m64 a);
// a=(a0, a1, a2, a3)
// then r0 = (a0 < 0) ? -a0 : a0, ..., r3 = (a3 < 0) ? -a3 : a3
extern __m64 _mm_abs_pi16 (__m64 a);
// a=(a0, a1), then r0 = (a0 < 0) ? -a0 : a0, r1 = (a1 < 0) ? -a1 : a1
extern __m64 _mm_abs_pi32 (__m64 a);
```

The data structures `__m64` and `__m128` are defined in the MMX header file "*mmintrin.h*" and the SSE header file "*xmmintrin.h*", respectively.

`__m64`:

```c
typedef union __declspec(intrin_type) _CRT_ALIGN(8) __m64
{
    unsigned __int64    m64_u64;
    float               m64_f32[2];
    __int8              m64_i8[8];
    __int16             m64_i16[4];
    __int32             m64_i32[2];
    __int64             m64_i64;
    unsigned __int8     m64_u8[8];
    unsigned __int16    m64_u16[4];
    unsigned __int32    m64_u32[2];
} __m64;
```

`__m128`:

```c
typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128
{
    float               m128_f32[4];
    unsigned __int64    m128_u64[2];
    __int8              m128_i8[16];
    __int16             m128_i16[8];
    __int32             m128_i32[4];
    __int64             m128_i64[2];
    unsigned __int8     m128_u8[16];
    unsigned __int16    m128_u16[8];
    unsigned __int32    m128_u32[4];
} __m128;
```

## Case study: using SSSE3 functions to accelerate DNN calculation

In this section, we take two functions as examples to describe how SSSE3 is used to accelerate the DNN calculation process.

**__m128i _mm_maddubs_epi16 (__m128i a, __m128i b)** Saturated Accumulation Operation

This function is critical for the matrix calculations in DNN. Parameter a is a 128-bit register that stores 16 unsigned 8-bit integers, and parameter b stores 16 signed 8-bit integers; the result contains 8 signed 16-bit integers. Each result is the saturated sum of two adjacent byte products, which fits the requirements of matrix calculation well. For example:

```
r0 := SATURATE_16((a0*b0) + (a1*b1))
r1 := SATURATE_16((a2*b2) + (a3*b3))
…
r7 := SATURATE_16((a14*b14) + (a15*b15))
```
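To make the saturation behavior concrete, here is a minimal scalar sketch of what the operation above computes. The names `saturate_16` and `maddubs_epi16_ref` are hypothetical helpers introduced for illustration; this is a reference model in plain C, not the intrinsic itself.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range. */
int16_t saturate_16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Scalar reference model of the pairwise multiply-add (hypothetical helper):
 * a holds 16 unsigned bytes, b holds 16 signed bytes, r receives 8 results. */
void maddubs_epi16_ref(const uint8_t a[16], const int8_t b[16], int16_t r[8])
{
    assert(a && b && r);
    for (int k = 0; k < 8; ++k) {
        /* Widen to 32 bits so the intermediate sum cannot overflow. */
        int32_t sum = (int32_t)a[2 * k] * b[2 * k]
                    + (int32_t)a[2 * k + 1] * b[2 * k + 1];
        r[k] = saturate_16(sum);
    }
}
```

Note that a pair such as 255*127 + 255*127 = 64770 exceeds 32767, so the result is clamped rather than wrapped. Code that relies on this operation for exact inner products needs input ranges that keep each pair sum within 16 bits.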

**__m128i _mm_hadd_epi32 (__m128i a, __m128i b)** Adjacent Elements Add Operation

This function can be described as a pairwise add. Parameters a and b are both 128-bit registers, each storing four signed 32-bit integers. Unlike the usual vector add, which adds corresponding elements of two vectors, it adds adjacent elements within each input vector. For example:

```
r0 := a0 + a1
r1 := a2 + a3
r2 := b0 + b1
r3 := b2 + b3
```
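The pairwise add becomes more useful when applied twice: two rounds reduce four 4-element vectors to their four totals. The sketch below models this in plain C; `hadd_epi32_ref` and `reduce4_ref` are hypothetical names for illustration, and the code assumes the sums do not overflow 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference model of the adjacent-element add (hypothetical helper).
 * r must not overlap a or b. */
void hadd_epi32_ref(const int32_t a[4], const int32_t b[4], int32_t r[4])
{
    assert(a && b && r);
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}

/* Two rounds of pairwise adds: out[k] becomes the sum of all four
 * elements of vector k. */
void reduce4_ref(const int32_t v0[4], const int32_t v1[4],
                 const int32_t v2[4], const int32_t v3[4], int32_t out[4])
{
    int32_t t01[4], t23[4];
    hadd_epi32_ref(v0, v1, t01);   /* half sums of v0 and v1 */
    hadd_epi32_ref(v2, v3, t23);   /* half sums of v2 and v3 */
    hadd_epi32_ref(t01, t23, out); /* full sums, one per input vector */
}
```

Three pairwise adds therefore replace twelve scalar additions when four partial-sum vectors have to be collapsed into one result vector.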

Now suppose the DNN process contains the following vector calculation task:

Q: There are five vectors a1, b1, b2, b3, b4. The vector a1 holds 16 unsigned char integers, and b1, b2, b3, b4 each hold 16 signed char integers. We need the inner products a1*b1, a1*b2, a1*b3, a1*b4, each stored as a signed 32-bit integer.

A straightforward implementation in C would look like this:

```c
unsigned char a1[16];
signed char b1[16], b2[16], b3[16], b4[16];
int c[4] = {0}, i;

// Initialize a1, b1, b2, b3, b4 here; c starts at zero
for (i = 0; i < 16; i++) {
    c[0] += (short)a1[i] * (short)b1[i];
    c[1] += (short)a1[i] * (short)b2[i];
    c[2] += (short)a1[i] * (short)b3[i];
    c[3] += (short)a1[i] * (short)b4[i];
}
```

Assuming one multiply-and-add per clock cycle, this code takes 64 clock cycles.

Now we use the SSSE3 instruction set to implement it instead:

```c
register __m128i a1, b1, b2, b3, b4, c, d1, d2, d3, d4;

// Initialize a1, b1, b2, b3, b4 here; c receives the final result

// a1*b1: 8 saturated 16-bit pair sums, widened to four 32-bit partial sums
d1 = _mm_maddubs_epi16(a1, b1);
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
// a1*b2
d2 = _mm_maddubs_epi16(a1, b2);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d3 = _mm_hadd_epi32(d1, d2);

// a1*b3
d1 = _mm_maddubs_epi16(a1, b3);
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
// a1*b4
d2 = _mm_maddubs_epi16(a1, b4);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d4 = _mm_hadd_epi32(d1, d2);

// Final pairwise add leaves the four inner products in c
c = _mm_hadd_epi32(d3, d4);
```

The result is stored in the 128-bit register "c", which packs the four 32-bit integers. Taking pipelining into consideration, this process may cost 12 or 13 clock cycles. So the potential results for this task are:

Implementation | CPU Clock Cycles | Speedup |
---|---|---|
Normal C Coding | 64 | - |
Using SSSE3 Instruction Set | 13 | ~ 500% |
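One way to check that the SSSE3 sequence produces the same numbers as the scalar loop is to wrap both in functions and compare them. The following is a minimal sketch under stated assumptions: the names `dot4`, `dot4_scalar`, and `madd_widen` are hypothetical, the four b vectors are stored back to back in one 64-byte array, and the `__SSSE3__` macro (defined by GCC and Clang when SSSE3 is enabled, for example with `-mssse3`) selects the SIMD path. Without that macro the sketch falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3 intrinsics; also pulls in the SSE2/SSE3 headers */
#endif

/* Plain C baseline: c[k] = sum over i of a[i] * b[16*k + i]. */
void dot4_scalar(const uint8_t *a, const int8_t *b, int32_t c[4])
{
    assert(a && b && c);
    for (int k = 0; k < 4; ++k) {
        int32_t sum = 0;
        for (int i = 0; i < 16; ++i)
            sum += (int32_t)a[i] * b[16 * k + i];
        c[k] = sum;
    }
}

#if defined(__SSSE3__)
/* 16 byte products -> four 32-bit partial sums. */
static __m128i madd_widen(__m128i a, __m128i b)
{
    __m128i d = _mm_maddubs_epi16(a, b);  /* 8 saturated 16-bit pair sums */
    /* Duplicate each 16-bit lane, then shift arithmetically to sign-extend. */
    __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(d, d), 16);
    __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(d, d), 16);
    return _mm_add_epi32(lo, hi);
}
#endif

/* Four inner products of a (16 unsigned bytes) with four signed vectors. */
void dot4(const uint8_t *a, const int8_t *b, int32_t c[4])
{
#if defined(__SSSE3__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i d1 = madd_widen(va, _mm_loadu_si128((const __m128i *)(b + 0)));
    __m128i d2 = madd_widen(va, _mm_loadu_si128((const __m128i *)(b + 16)));
    __m128i d3 = madd_widen(va, _mm_loadu_si128((const __m128i *)(b + 32)));
    __m128i d4 = madd_widen(va, _mm_loadu_si128((const __m128i *)(b + 48)));
    __m128i d12 = _mm_hadd_epi32(d1, d2);
    __m128i d34 = _mm_hadd_epi32(d3, d4);
    _mm_storeu_si128((__m128i *)c, _mm_hadd_epi32(d12, d34));
#else
    dot4_scalar(a, b, c);  /* fallback when SSSE3 is not enabled */
#endif
}
```

The two functions agree only as long as no pair sum inside `_mm_maddubs_epi16` saturates, so a comparison like this is best run with inputs that reflect the value ranges of the real model. The cycle counts in the table are estimates; measuring both functions on the target device is the reliable way to confirm the gain.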

As we know, the DNN process of speech recognition involves many matrix calculations. If we optimize each one in our code like this, it will achieve better performance on the IA platform than ever. We have cooperated with the ISV Unisound, which provides speech recognition services in China. Unisound used the DNN process with an improvement in performance of over 10% on ARM devices.

## Summary

DNN is becoming the main algorithm in speech recognition. It has been selected by Google Now, Baidu Voice, Tencent WeChat, iFlytek Speech Service, Unisound Speech Service, and many others. At the same time, we have the SSSE3 instruction set, which can help to optimize the speech recognition process. If all of these applications begin using it, I believe speech services will give us a better experience and usage of the IA platform will increase.

## About the Author

Li Alven graduated from Huazhong University of Science and Technology in 2007, where he majored in Computer Science and Information Security. He joined Intel in 2013 as a senior application engineer in the Developer Relations Division Mobile Enabling Team. He focuses on differentiation and innovative enabling on the IA platform, speech recognition technology, performance tuning, and related areas.