
Android TensorFlow Lite Best Practices and Optimizations

22 Sep 2020 | CPOL | 2 min read
In this article we will consider the ways in which the network could be further optimized.
Here we look at how a pre-trained network can be altered through quantization. We also discuss how a model that isn’t compatible with 8-bit quantization can be converted to use 16-bit floating-point numbers instead. Finally, we take a quick look at network pruning.

In the previous part of this series, we finished building a TensorFlow Lite-based application that recognizes objects using a network model from the ONNX Model Zoo. Let’s consider the ways in which that network could be further optimized.

With neural network models, one of the challenges is striking the right balance between available resources and accuracy. Generally speaking, a more complex model has the potential to be more accurate, but it will consume more storage space, take longer to execute, or use more network bandwidth when downloaded. Note also that not all optimizations will run on all hardware.

Post-Optimizing Pre-Trained Networks

If you are using a network that came from a third party, attempts to increase performance may start with post-training optimization.

For pre-trained networks, the network can be altered through quantization. Quantization reduces the precision of the network’s parameters, which are usually 32-bit floating-point numbers. When quantization is applied to a network, the floating-point operations can be converted to integer or 16-bit floating-point operations. These will run faster but with slightly lower accuracy. In an earlier part of this series, I used Python code to convert a model from TensorFlow to TensorFlow Lite. With a few modifications, the converter will also perform 8-bit integer quantization while converting:

Python
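import numpy as np
import tensorflow as tf

saved_model_dir = '/dev/projects/models'
# Full-integer quantization needs sample inputs to calibrate value ranges.
# ASSUMPTION: one 1x224x224x3 float32 input; replace the random placeholder
# data with real preprocessed images in your model's input shape.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tf_lite_model = converter.convert()
with open('output.tflite', 'wb') as f:
    f.write(tf_lite_model)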

Not all models can be transformed this way. If the model isn’t compatible with 8-bit quantization, the above code will throw an error. One might wonder what values other than DEFAULT could be passed to converter.optimizations. In previous versions of TensorFlow, it was possible to pass values here to optimize the network for size (OPTIMIZE_FOR_SIZE) or latency (OPTIMIZE_FOR_LATENCY). Those values are now deprecated and behave the same as DEFAULT.
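
To make the precision trade-off of 8-bit quantization concrete, here is a minimal sketch of the affine (scale and zero-point) mapping that integer quantization schemes such as the one in TensorFlow Lite are based on. It is plain Python for illustration only, not the converter's implementation; the helper names and the sample weights are made up for this example.

```python
# Illustrative only: affine (scale / zero-point) quantization of floats to uint8.

def quantization_params(values, qmin=0, qmax=255):
    """Pick a scale and zero point so that [min, max] (widened to include 0.0)
    maps onto the integer range [qmin, qmax]."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map each float to the nearest representable integer, clamped to the range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate floats: real = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]

# Hypothetical weights, chosen only to demonstrate the round trip.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quantization_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored:", restored)
print("largest round-trip error:", max_error)
```

Each weight now occupies one byte instead of four, and the round-trip error stays within roughly one quantization step (the scale), which is the source of the small accuracy loss.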

If 32-bit floating point to 8-bit integer quantization is too extreme, the network can be converted to use 16-bit floating-point numbers instead:

Python
import tensorflow as tf

saved_model_dir = '/dev/projects/models'
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store weights as 16-bit floats instead of 32-bit floats.
converter.target_spec.supported_types = [tf.float16]
tf_lite_model = converter.convert()
# Write the converted model, closing the file when done.
with open('output.tflite', 'wb') as f:
    f.write(tf_lite_model)

With the above quantization, the speed advantage is only realized if the model is executed on hardware that supports 16-bit floating-point numbers (such as a GPU). If executed on a CPU, the network weights are expanded to 32-bit floating-point numbers before execution. On a CPU there would therefore be a slight loss of accuracy compared with the equivalent unoptimized network and no performance increase, although the model file itself is still smaller.
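
The storage and precision effects described above can be illustrated without TensorFlow. The sketch below uses Python's struct module, whose 'e' format code is IEEE 754 half precision, to store a few weights as 16-bit floats and expand them again, which is roughly what happens to float16 weights on a CPU. The weight values are arbitrary examples, not taken from any model.

```python
import struct

# Arbitrary example weights (not taken from any real model).
weights = [0.1, 0.333333, 1234.5678, 1e-4]
n = len(weights)

fp32_bytes = struct.pack('<%df' % n, *weights)  # 4 bytes per weight
fp16_bytes = struct.pack('<%de' % n, *weights)  # 2 bytes per weight

# Expanding back to Python floats exposes the rounding introduced by float16.
restored = struct.unpack('<%de' % n, fp16_bytes)
relative_errors = [abs(w - r) / abs(w) for w, r in zip(weights, restored)]

print("float32 bytes:", len(fp32_bytes), "float16 bytes:", len(fp16_bytes))
for w, r, e in zip(weights, restored, relative_errors):
    print(w, "->", r, "relative error", e)
```

The stored size is halved, while each value picks up a small relative rounding error that remains after the weights are expanded back to a wider type.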

Pruning a Network

To optimize a network for size, consider network pruning. In pruning, parts of the network that contribute less to accuracy are removed (their weights are set to zero), and the resulting network compresses more tightly. Pruning requires an additional TensorFlow library:

Shell
pip install -q tensorflow-model-optimization

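The TensorFlow Model Optimization package contains a method called prune_low_magnitude. After building a network model in Python, pass the model to this method to wrap it for pruning; the low-magnitude weights are then progressively zeroed while the wrapped model is trained. Training the wrapped model requires the tfmot.sparsity.keras.UpdatePruningStep callback, and tfmot.sparsity.keras.strip_pruning removes the pruning wrappers before the model is converted.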

Python
import tensorflow_model_optimization as tfmot

# source_model is assumed to be an existing tf.keras model built earlier.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(source_model)
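
As a rough illustration of why a pruned network compresses more tightly, the sketch below zeroes the smallest-magnitude half of a list of random weights and compares zlib-compressed sizes. It uses only the standard library and is not how prune_low_magnitude is implemented; the function names, the 50% sparsity, and the random weights are assumptions made for the example.

```python
import random
import struct
import zlib

def prune_by_magnitude(weights, sparsity):
    """Return a copy with the smallest-magnitude fraction of weights set to 0.0."""
    count = int(len(weights) * sparsity)
    smallest_first = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in smallest_first[:count]:
        pruned[i] = 0.0
    return pruned

def compressed_size(weights):
    """Size in bytes of the weights stored as float32 and compressed with zlib."""
    raw = struct.pack('<%df' % len(weights), *weights)
    return len(zlib.compress(raw))

random.seed(0)  # fixed seed so repeated runs use the same weights
weights = [random.gauss(0.0, 1.0) for _ in range(10000)]
pruned = prune_by_magnitude(weights, sparsity=0.5)

dense_size = compressed_size(weights)
sparse_size = compressed_size(pruned)
print("compressed dense size:", dense_size)
print("compressed pruned size:", sparse_size)
```

In a real workflow the zeros come from training with the pruning wrapper rather than from a one-off cut like this, but the size benefit after compression follows the same principle.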

Next Steps

Now that we’ve learned a bit more about using TensorFlow Lite effectively, we’re ready to dive in deeper. In the next article, we’ll learn how to train our own neural network for use in an Android app.

This article is part of the series 'Mobile Neural Networks On Android with TensorFlow Lite'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United States
I attended Southern Polytechnic State University and earned a Bachelor of Science in Computer Science and later returned to earn a Master of Science in Software Engineering. I've largely developed solutions that are based on a mix of Microsoft technologies with open source technologies mixed in. I've got an interest in astronomy and you'll see that interest overflow into some of my code project articles from time to time.



Twitter:@j2inet

Instagram: j2inet

