Performing Speech-to-Text Recognition with OpenVINO™

20 Jul 2022 · CPOL · 3 min read
In this article, I will show you how to easily run inference on speech-to-text recognition models using OpenVINO™ so you can start applying this capability in your own applications.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.


Speech-to-text is rapidly becoming an important part of everyday life. Whether you’re trying to help drivers safely send a message without having to take their hands off the wheel or a business looking to make things more accessible for customers, it’s a crucial capability for AI developers to have.

The most common use cases of speech-to-text today include automatic transcripts of phone calls and conferences. But there’s also an ongoing trend to implement it as part of larger services. For instance, speech-to-text technology can be paired with a machine translation service to automatically create video subtitles in other languages.

In this guide, I will show you how to easily run inference on speech-to-text recognition models using OpenVINO™ so you can start applying this capability in your own applications.

For this demo, we will use the QuartzNet 15x5 model to perform automatic speech recognition. This model is based on Jasper, an end-to-end neural acoustic architecture trained with Connectionist Temporal Classification (CTC) loss.

The first step of the demo imports the required modules and declares the program’s variables. It also sets the model precision (FP16 in this case) and the model name.
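
A minimal sketch of this setup step is shown below, assuming the OpenVINO 2022 Python API (openvino.runtime) and the openvino-dev tools are installed. The directory names are illustrative placeholders; only model_name and precision need to match the model being downloaded.

Python
from pathlib import Path

import numpy as np
import scipy.signal
import librosa
from openvino.runtime import Core

# Folders used by the download and conversion steps (illustrative names)
download_folder = "output"
model_folder = "model"

# Model precision and name from Open Model Zoo
precision = "FP16"
model_name = "quartznet-15x5-en"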

The next section of the code checks to see if a model needs to be downloaded and creates a subdirectory structure for doing so. In this case, the quartznet-15x5-en model being downloaded comes from Open Model Zoo and must be converted to Intermediate Representation (IR).

Python
# Check if model is already downloaded in download directory
path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
downloaded_model_file = list(path_to_model_weights.glob('*.pth'))

if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
    download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
    ! $download_command
    
# Check if model is already converted in model directory
path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')

if not path_to_converted_weights.is_file():
    convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
    ! $convert_command

The conversion process is handled by the Model Converter, omz_converter, a command-line tool from the openvino-dev package. It first exports the pre-trained PyTorch model to ONNX format and then converts the ONNX model to Intel’s OpenVINO format (an Intermediate Representation, or IR, file). Both steps are handled by the same tool.

Once this is done, the demo loads an audio file and defines an alphabet for speech recognition. A wide range of audio formats are supported, including WAV, FLAC, and OGG.
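
As a rough sketch (the file path is a placeholder), loading a clip with librosa at the 16 kHz sample rate the model expects and defining the 29-symbol alphabet might look like this; the "~" character is used here as a stand-in for the CTC blank symbol.

Python
# Load the audio file and resample it to 16 kHz (placeholder path)
audio, sampling_rate = librosa.load(path="data/example.wav", sr=16000)

# 0 = space, 1-26 = "a" to "z", 27 = apostrophe, 28 = CTC blank
alphabet = " abcdefghijklmnopqrstuvwxyz'~"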

Once the model has been converted to OpenVINO format, the loaded audio is pre-processed into a mel spectrogram, which uses the mel scale to represent sound in a way that approximates human hearing. This makes the data easier for the network to process and yields better accuracy. For a full deep dive into mel spectrograms and their various uses in deep learning, read this article.

Note: Audio must be sampled at 16 kHz in order to be converted.

Python
def audio_to_mel(audio, sampling_rate):
    assert sampling_rate == 16000, "Only 16 KHz audio supported"
    preemph = 0.97
    preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])

    # Calculate window length
    win_length = round(sampling_rate * 0.02)

    # Based on the previously calculated window length, run the short-time Fourier transform
    spec = np.abs(librosa.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01),
                               win_length=win_length, center=True,
                               window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))

    # Create mel filter-bank: a transformation matrix that projects the spectrum onto 64 Mel-frequency bins
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
    return mel_basis, spec


def mel_to_input(mel_basis, spec, padding=16):
    # Convert to logarithmic scale
    log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)

    # Normalize output
    normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)

    # Pad the time dimension to a multiple of `padding` and add a batch dimension
    remainder = normalized.shape[1] % padding
    if remainder != 0:
        return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
    return normalized[None]

This block of code performs the necessary audio conversion before the model is loaded. End users can target a CPU, GPU, or MYRIAD device (Neural Compute Stick 2). If the device is set to AUTO, OpenVINO will choose the target device that gives the best performance.

To initialize and load the network, consult the code sample below. By default, the model executes on the CPU, but you have the option of manually running the workload on a CPU, GPU, or MYRIAD device. The print(ie.available_devices) call below lists all of the devices on which the workload can be executed. To change the target device, change the device_name argument (currently set to CPU).

Python
ie = Core()

# List all devices available for inference (e.g., CPU, GPU, MYRIAD)
print(ie.available_devices)

# Read the converted IR model
model = ie.read_model(model=f"{model_folder}/public/{model_name}/{precision}/{model_name}.xml")
model_input_layer = model.input(0)

# Make the time dimension dynamic so clips of any length can be processed
shape = model_input_layer.partial_shape
shape[2] = -1
model.reshape({model_input_layer: shape})

compiled_model = ie.compile_model(model=model, device_name="CPU")

output_layer_ir = compiled_model.output(0)

# Run inference on the pre-processed audio and fetch the per-frame character probabilities
character_probabilities = compiled_model([audio])[output_layer_ir]
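
If you would rather let OpenVINO pick the device, recent releases also accept the AUTO device plugin in place of an explicit device name:

Python
# Let OpenVINO automatically select the best available device
compiled_model = ie.compile_model(model=model, device_name="AUTO")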

The output_layer_ir is a handle to the output node of the network. Once inference is complete, the data must be read and translated into a more human-friendly format.

The raw output is a set of per-frame probabilities for every symbol in the alphabet. These probabilities must be decoded via a Connectionist Temporal Classification (CTC) decoding function. The alphabet is encoded as: 0 = space, 1 to 26 = “a” to “z”, 27 = apostrophe, 28 = CTC blank symbol.

The code snippet below handles the decoding:

Python
def ctc_greedy_decode(predictions):
    # The last symbol in the alphabet is the CTC blank
    previous_letter_id = blank_id = len(alphabet) - 1
    transcription = list()
    for letter_index in predictions:
        # Skip blanks and repeated symbols (CTC collapse rule)
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)
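
To turn the raw network output into text, the per-frame probabilities are first reduced to one symbol index per frame and then passed through the decoder. A minimal sketch, assuming the output has the shape [1, time, symbols] and the alphabet defined earlier:

Python
# Collapse the batch dimension and pick the most likely symbol for each frame
probabilities = np.squeeze(character_probabilities)
letter_indices = np.argmax(probabilities, axis=1)

# Decode the index sequence into readable text
transcription = ctc_greedy_decode(letter_indices)
print(transcription)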

And with that, the recognition process is done!

Speech-to-text and text-to-speech functions are expected to become more common in the coming years as more businesses adopt AI for various customer-facing functions. We hope this article and the accompanying code samples are useful to you in your own explorations of the topic. To learn more about OpenVINO and to boost your AI developer skills, we invite you to take our 30-Day Dev Challenge.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
Ambitious Deep Learning Engineer with 5 years of experience in image processing. Speaker at data science conferences. Working with big data, creating solutions for big companies in Poland. Agile enthusiast, team leader and coder, striving for perfection every time.

Traveler in free time.
