Tips and Tricks to Optimize Android Apps on x86

Android on Intel

Nov 17, 2014

CPOL

This article focuses on optimizing NDK-based apps. These may be solely C/C++ code or may include third-party libraries and/or assembly code.

Introduction

Intel has a vested interest in helping developers deliver Android applications that run well (or even best) on Intel architecture. Intel works at the community level (optimizing Dalvik Java, the V8 engine, and Bionic C; contributing to the code base; and providing releases with both 32-bit and 64-bit kernels for IA) and is also creating new tools to help Android developers. Many of these tools focus on improving performance beyond what is available through libhoudini, the default ARM translation layer for x86.

But first, choose the right tools. There are three common methods of creating Android apps:

Java compiled using the Android SDK APIs to run in the Dalvik VM. Note: An article covering ART will be available soon for Android L.
Using the latest SDK takes care of most differences, although you may still want to look at the memory allocated for high-resolution screens. Most notably, testing goes faster if you speed up your Android emulation software with Intel® HAXM (requires Intel® Virtualization Technology and XD, both set to ON).
Web-focused HTML5 and JavaScript. For Open Source Android info, check out the Android-IA site.
NDK created or ported (written in C++). This is the preferred method if you have processor-intensive functions or already have the C++ code. Native C++ code usually (not always) runs faster because it is compiled into a binary before execution, so it speaks "natively" to the hardware and does not need to be interpreted into machine language at run time.

Note: If you don't already have an Android development environment (IDE), the new tool suite Intel® INDE (Intel® Integrated Native Developer Experience) will load a selected Android IDE and also download and install multiple Intel tools to help you make, compile, troubleshoot, and publish Android applications. A series of articles covers registering and installing Intel® INDE and setting it up with an Eclipse* IDE, with links to videos on setting up the NDK and SDK*, Eclipse*, and running on an emulator (including how to speed it up) or on an Intel® architecture based device.

At a high level, NDK development involves the following steps and minimum changes to work with x86 architecture.

Create the Android project and jni folder. Edit Application.mk to show APP_ABI = all (if file size allows ARM* and x86 binaries to be in the same package) or APP_ABI = x86. Note: The APP_ABI setting also affects floating point operations; see below.
Code. Any native (C++) code can be reused. Rewrite any inline assembly code or ARM-specific code. Use javah to create the JNI/native code header file. Be sure to bridge between the standard C++ calling conventions (for example, on Windows) and Java/JNI by using the JNIEXPORT and JNICALL macros.
Compile/build the libraries (the ndk-build call generates the .so libraries and puts them in the appropriate project directories). Use "ndk-build APP_ABI=x86" with a few build flag changes; see below. Also be sure to recompile any third-party libraries.
Call it from Java. Declare the native (C++) function calls in Java and load the shared library using System.loadLibrary().
Debug. ndk-gdb can be used after running ndk-build with the application marked as debuggable in the manifest. Make sure the adb directory is added to the PATH and that only one target is running.
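
As a rough illustration of the "rewrite any inline assembly or ARM-specific code" step, the sketch below keeps architecture-specific choices behind compile-time checks and falls back to plain C that builds for every ABI. It assumes the predefined GCC/Clang macros __i386__, __x86_64__, __arm__, and __aarch64__. The names TARGET_ARCH_NAME and sum_samples are hypothetical and are not part of the NDK.

```c
#include <assert.h>
#include <stddef.h>

/* Compile-time architecture check using macros predefined by GCC and Clang.
   TARGET_ARCH_NAME is a hypothetical name chosen for this sketch. */
#if defined(__i386__) || defined(__x86_64__)
#define TARGET_ARCH_NAME "x86"
#elif defined(__arm__) || defined(__aarch64__)
#define TARGET_ARCH_NAME "arm"
#else
#define TARGET_ARCH_NAME "unknown"
#endif

/* Hypothetical routine standing in for code that was hand-written in ARM
   assembly. This portable C version compiles for any APP_ABI; a tuned
   x86 path could later be added behind the same kind of #if guard. */
static long sum_samples(const short *samples, size_t count)
{
    assert(samples != NULL || count == 0);

    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += samples[i];
    }
    return total;
}
```

Starting from a portable version like this gives a known-good reference to compare against once an architecture-specific path is introduced.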

Beyond basic "porting", there are some optimizations available.

Optimizing Tips

Speed up your software-based Android emulator by using Intel® HAXM for hardware-assisted emulation. Intel® HAXM requires Intel® Virtualization Technology (Intel® VT) and XD, both set to on.
Set APP_ABI = all (creates one apk with all binaries) or = armeabi armeabi-v7a x86, depending upon your file size limitations. (Note that x86 includes hardware floating point, as does, to some degree, armeabi-v7a.)
During compile, use the gcc flag "-malign-double". (This is for memory alignment; see also the tip on explicitly forcing memory alignment.)
During compile, add the CPU flags appropriate to the processor's threading capability:
For Intel® Atom™ processors with Hyper-Threading, try -mtune=atom -mssse3 -mfpmath=sse
For processors without Hyper-Threading (BYT, SLM, Merrifield), use -mtune=slm -msse4.2 -mfpmath=sse
Use -march= to restrict the generated code to the specified CPU (code built with -mtune= runs on more models but is optimized for the type listed).
-mavx is not yet useful on Atom.
Use little Endian (the default with the NDK). ARM* supports both big and little Endian, but Intel® Atom™ supports only little Endian, so check your gcc flags.
Use gcc version 4.8. Watch the two toolchain paths: android-ndk\toolchains\arm-linux-androideabi-4.8 for ARM vs. android-ndk\toolchains\x86-4.8 for x86.
Be sure to use the correct JNIEXPORT method signature to set the entry method into native code (match the header file's function signature to ensure the source code compiles on Windows*):
JNIEXPORT void JNICALL Java_ClassName_MethodName
After compile, check the system log to make sure the target native lib loads successfully at runtime. (This will show in the log as "added shared lib //<path>".)
Explicitly force memory alignment to prevent loading errors and network packet issues. The same structure can be laid out differently on the two architectures: a structure that occupies 24 bytes on ARM, which requires 8-byte alignment for 64-bit variables, can occupy only 16 bytes on x86. So try to ensure 16-byte alignment of data structures. Then use aligned moves (MOVAPS, MOVNTA) when loading from that structure into XMM registers. See Reducing the impact of Misaligned Memory Accesses.
Write data directly to main memory (streaming store instructions MOVNTPS, MOVNTQ), since the Intel® Atom™ processor has no L3 cache. This also saves bandwidth by avoiding the dirty writeback on cache eviction.
Avoid stalls due to a limited L2 cache. Except in specific instances (where the load and store of data are to the same address, for the same size operand, and done from a general-purpose register), a load on the Intel® Atom™ processor will stall for several cycles while the preceding store is written to cache. Additionally, stores of SSE operands (from xmm registers) are never forwarded to subsequent loads.
So for both forwarding and non-forwarding scenarios, try to manipulate the data within the registers, doing sums in the xmm registers. For example, an mp3 decoder has a windowing loop that accumulates sums in a register and then sums across the register.
Storing the register to memory and reading the elements back incurs a blocked store-forward stall between the 16-byte store to the pSum array and the following 4-byte loads from pSum. To avoid this, compute the horizontal sum in the xmm registers, using HADDPS instructions or a series of adds and shuffles. (But beware: the HADDPS sequence is faster on Intel® Atom™ processors but slower on many variants of Intel Core processors.) Take advantage of SSE min and max instructions to perform clipping on samples that exceed the 16-bit range.
Zero out the full XMM register before using instructions that load only part of it (MOVLPS, MOVHPS, PINSRW), since data left over in the other part of the register can cause issues.
Read articles on optimizing SIMD instructions (Intel SSE vs. ARM* NEON*). Consider using the NEONvsSSE_5.h header in place of arm_neon.h. The article also mentions performance penalties when working with constants (don't put initialization in loops; replace with logical/compare operations if possible) and advises avoiding serial implementations (use vectorization).
Replace divide and sqrt operations, which take many cycles, with either a table-lookup operation, a reciprocal approximation (RCPPS instruction), or a Newton-Raphson sequence.
Watch floating point calls. Use float instead of double (since double often uses a SW library routine). Float is faster and saves memory bandwidth on Intel® Atom™ processors. The APP_ABI setting also determines whether SW (armeabi) or HW (x86, armeabi-v7a) floating point is used. You don't always want to execute the full HW FPU calculations with the x86 algorithm. (For example, dividing by powers of 2 in integer code is a fast right-shift operation, but for Android optimization you should multiply by the reciprocal instead: y=x*.5 instead of y=x/2.)
Reduce the overhead of small functions. Use inline functions to avoid call overhead, which includes parameter passing, new stack frame setup, preserving and restoring the caller's stack frame, putting addresses on the stack, return value handling, and the return itself. See also the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
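
The memory alignment tip can be sketched in C11 as follows. This is only an illustration under stated assumptions: the packet structures are hypothetical, and alignas from <stdalign.h> is used here as one portable way to request alignment (a compiler-specific attribute or the -malign-double flag are alternatives). Aligning the 64-bit member explicitly, in addition to the structure itself, keeps the member offsets the same on 32-bit x86 and ARM.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical packet layout. Without explicit alignment, the 64-bit member
   is typically placed on a 4-byte boundary by 32-bit x86 compilers and on an
   8-byte boundary by ARM compilers, so sizeof() can differ between ABIs. */
struct packet_plain {
    int32_t id;
    int64_t timestamp;
    int32_t length;
};

/* Same fields with explicit alignment: the first member forces the whole
   structure onto a 16-byte boundary, and the 64-bit member is pinned to an
   8-byte boundary so its offset does not depend on the ABI. */
struct packet_aligned {
    alignas(16) int32_t id;
    alignas(8) int64_t timestamp;
    int32_t length;
};

/* Checked at compile time: the size is a whole number of 16-byte blocks. */
static_assert(sizeof(struct packet_aligned) % 16 == 0,
              "packet_aligned should fill whole 16-byte blocks");

/* Returns non-zero when the address is safe for aligned 16-byte moves. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

A check such as is_16_byte_aligned() is cheap to leave in debug builds before any code path that relies on aligned loads.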
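
The advice to keep sums inside the xmm registers can be sketched with SSE intrinsics. This is a minimal example, not the mp3 decoder's actual loop: the function names are hypothetical, it uses the adds-and-shuffles variant rather than HADDPS, and it falls back to scalar code when the compiler does not define __SSE__. The accumulator starts from a fully zeroed register, and the four lanes are reduced without storing to memory and reloading narrower values.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Adds the four float lanes of v using shuffles and adds only, so no
   16-byte store followed by 4-byte loads is needed. */
static float horizontal_sum4(__m128 v)
{
    __m128 high  = _mm_movehl_ps(v, v);              /* lanes: v2, v3, v2, v3 */
    __m128 pairs = _mm_add_ps(v, high);              /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(pairs, lane1);         /* lane0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(total);
}

/* Hypothetical accumulation loop: sums count floats, four at a time. */
static float sum_floats(const float *data, size_t count)
{
    assert(data != NULL || count == 0);

    __m128 acc = _mm_setzero_ps();                   /* zero the full register first */
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  /* unaligned-safe load */
    }

    float total = horizontal_sum4(acc);
    for (; i < count; ++i) {                         /* leftover elements */
        total += data[i];
    }
    return total;
}
#else
/* Scalar fallback for builds without SSE (for example, an ARM ABI). */
static float sum_floats(const float *data, size_t count)
{
    assert(data != NULL || count == 0);

    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        total += data[i];
    }
    return total;
}
#endif
```

If the input is known to be 16-byte aligned, _mm_load_ps could replace _mm_loadu_ps; the unaligned form is used here so the sketch is safe for any pointer.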
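
Replacing a divide with a reciprocal approximation plus a Newton-Raphson step might look like the sketch below. It assumes the _mm_rcp_ps intrinsic (which maps to RCPPS) and applies one refinement step, r' = r * (2 - x * r). The function name reciprocal4 is hypothetical, and the non-SSE branch simply divides. The result is an approximation, so whether the accuracy and speed are acceptable should be measured on the target device.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Computes approximate reciprocals of four non-zero floats.
   RCPPS gives a low-precision estimate; one Newton-Raphson step
   roughly doubles the number of correct bits. */
static void reciprocal4(const float in[4], float out[4])
{
    assert(in != 0 && out != 0);

    __m128 x   = _mm_loadu_ps(in);
    __m128 r   = _mm_rcp_ps(x);                       /* initial estimate of 1/x */
    __m128 two = _mm_set1_ps(2.0f);

    /* Newton-Raphson refinement: r = r * (2 - x * r) */
    r = _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(x, r)));

    _mm_storeu_ps(out, r);
}
#else
/* Fallback for builds without SSE: plain division. */
static void reciprocal4(const float in[4], float out[4])
{
    assert(in != 0 && out != 0);

    for (int i = 0; i < 4; ++i) {
        out[i] = 1.0f / in[i];
    }
}
#endif
```

A caller would then compute a quotient as numerator * out[i] instead of numerator / in[i].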