Real-time End-to-End H.265/HEVC Solution for Intel® Architecture-based Platforms

Android on Intel

Sep 18, 2014

CPOL

In this paper, we investigate the characteristics of the HEVC codec and optimize CPU-based software video transcoding, which provides the best video quality and the most flexible programming model.

1. Abstract

2. Introduction

2.1 Video Codec and H.265/HEVC

2.2 HEVC Performance Issues

2.3 The Current Solution of H.265/HEVC Investigation

3. Optimized Real-time Solution on IA-based Platforms

3.1 Real-time HEVC Encoder Solution Based on Intel® Xeon® Processor

3.1.1 Intel SIMD Vectorization Tuning for HEVC Encoding Functions

3.1.2 Thread Concurrency and Core Scalability Tuning

3.1.3 Further Tuning with SMT/HT

3.2 High Performance H.265/HEVC Decoder on Intel® Core™ Processor-based Platforms

3.2.1 Optimization and Performance Analysis of Strongene HEVC Decoder

3.2.2 Comparison of Intel SSE-Optimized Strongene HEVC Decoder with Open Source Alternatives and Future Optimization Opportunities

3.3 Optimizing H.265/HEVC Decoder on Intel® Atom™ Processor-based Platforms

3.3.1 Optimized by YASM & Intel® C++ Compiler

3.3.2 Optimized with Intel® Streaming SIMD Extensions (Intel® SSE) Instructions

3.3.3 Optimized by Intel® Threading Building Blocks (Intel® TBB) Tool

3.3.4 H.265/HEVC Decoder Performance Comparison

4. Summary

5. Other related articles

Reference

1. Abstract

The International Telecommunication Union (ITU) has announced a new video codec standard, High Efficiency Video Coding (HEVC)/H.265, which is claimed to be about 50 percent more efficient than the current H.264/MPEG-4 AVC standard. However, the algorithms and data structures of H.265 are more than 4 times as complex as those of H.264, so an H.265-based codec requires more computing resources and power than its predecessor. In this paper, we investigate the characteristics of the HEVC codec and optimize CPU-based software video transcoding, which provides the best video quality and the most flexible programming model. Our end-to-end solution maximizes the capabilities of Intel® Architecture (IA) platforms for the HEVC codec and achieves real-time performance.

2. Introduction

Video coding standards have evolved primarily through the development of the well-known ITU-T and ISO/IEC standards. The ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video and H.264/MPEG-4 Advanced Video Coding (AVC) standards [1].

H.265/HEVC (High Efficiency Video Coding), introduced in 2013, is the latest video codec standard developed jointly by ISO/IEC and ITU-T. It aims to maximize compression capability and reduce data loss. H.265/HEVC doubles the compression ratio of the previous H.264/AVC standard at the same subjective quality. HEVC helps online video providers deliver high-quality video with less bandwidth, making it the next revolution in video codecs.

2.1 Video Codec and H.265/HEVC

HEVC proposes several new video coding syntax architectures and algorithms to obtain the highly efficient coding standard[1][2]:

a) Random Access and Bitstream Splicing Features

The new design supports special features to enable random access and bitstream splicing. In H.264/MPEG-4 AVC, a bitstream must always start with an IDR access unit, whereas HEVC also supports random access at other picture types, such as clean random access (CRA) pictures.

b) Coding Tree Units Structure

A picture is partitioned into coding tree units (CTUs), each containing a luma coding tree block (CTB) and the corresponding chroma CTBs. A luma CTB covers L×L luma samples, where L may be 16, 32, or 64 as determined by a syntax element in the sequence parameter set (SPS). The CTU contains a quadtree syntax that allows the CTBs to be split into coding blocks (CBs) of a size suited to the signal characteristics of the region covered by the CTB. All previous video coding standards used a fixed array size of 16×16 luma samples, whereas HEVC supports variable-sized CTBs selected according to the encoder's memory and computational requirements.

c) Tree-Structured Partitioning Into Transform Blocks and Units

A CB can be recursively partitioned into transform blocks (TBs). The partitioning is signaled by a residual quadtree. In contrast to previous standards, the HEVC design allows a TB to span across multiple prediction blocks (PBs) for interpicture-predicted coding units (CUs) to maximize the potential coding efficiency benefits of the quadtree-structured TB partitioning.

d) Intrapicture Prediction

Directional prediction with 33 different directional orientations is defined for (square) transform block (TB) sizes from 4×4 up to 32×32. The 33 directions do not cover a full 360°; they span the angles from which previously decoded neighboring samples, above and to the left of the block, are available. HEVC supports various intrapicture predictive coding methods, referred to as Intra_Angular, Intra_Planar, and Intra_DC.

This advanced coding standard demands extremely high processing capabilities from both client devices and backend transcoding servers.
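The quadtree partitioning described in item b) can be sketched in a few lines of Python. This is only an illustration, not code from any HEVC implementation: the `needs_split` callback is a hypothetical stand-in for the encoder's real split decision, and the minimum block size of 8 is an assumed parameter.

```python
def split_ctb(x, y, size, needs_split, min_size=8):
    """Recursively split a square block into coding blocks (CBs).

    needs_split(x, y, size) is a hypothetical decision callback.
    Returns the leaf blocks as a list of (x, y, size) tuples.
    """
    if size > min_size and needs_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants in raster order.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    split_ctb(x + dx, y + dy, half, needs_split, min_size)
                )
        return leaves
    return [(x, y, size)]


# Toy decision rule (an assumption for illustration): keep splitting any
# block that contains the "detailed" sample at position (5, 5).
def contains_detail(x, y, size):
    return x <= 5 < x + size and y <= 5 < y + size


leaves = split_ctb(0, 0, 64, contains_detail)
for leaf in leaves:
    print(leaf)
```

With this toy rule, a 64×64 CTB ends up with small blocks around the detailed sample and large blocks elsewhere, which is the behavior the variable block size is meant to enable.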

2.2 HEVC Performance Issues

The current HEVC Test Model (HM) project [6] implements only the major functionality of the standard; its performance is still far from what production deployment requires. The project has two major drawbacks:

No parallel scheme
Poor vectorization tuning

Figure 1. HM Project Profiling – Thread Concurrency

Figure 2. HM Project Profiling – Hot Code

The HM codec consumes over 100 times more CPU resources than H.264 on the server side, and over 10 times more on the client side.

2.3 The Current Solution of H.265/HEVC Investigation

The H.265/HEVC codec has drawn the interest of many groups and agencies worldwide that aim to optimize its performance and bring it to actual deployment. Open source projects include:

OpenHEVC (HM10.0 compatible, decoder optimization)
https://github.com/OpenHEVC/openHEVC

x265 (compatible with HM, parallel & SIMD optimization)
http://code.google.com/p/x265/
https://bitbucket.org/multicoreware/x265/wiki/Home

We ran a 720p 24 FPS video to evaluate the performance of the x265 encoder on an Intel® Xeon® processor-based platform (E5-2680 @ 2.70GHz, 8*2 physical cores, codenamed Sandy Bridge). The implementers of this codec did a lot of work to optimize the original implementation for both task and data parallelism; however, in our benchmark it used only 6 cores on a system with 32 logical cores (SMT ON). Thus, it does not fully utilize the computing resources of current multi-core platforms.

Figure 3. x265 Project CPU Usage

Figure 4. x265 Project with Intel® SIMD Tuning

In the x265 project, Intel® SSE instructions are used for vectorization tuning, which contributes a speedup of over 70%. With further optimization using the Intel® C++ Compiler, we obtained a 2x speedup¹ on the IA platform. However, the encoder performance still falls well short of real-time encoding, especially for HD 1080p video.

In the PRC, more than 20 multimedia ISVs are pursuing the available HEVC solution and platform to save online video service costs and maintain high quality.

Figure 5. Online Video Market in the PRC

3. Optimized Real-time Solution on IA-based Platforms

Strongene [3] is a Chinese company focusing on core video coding technology. It provides advanced H.265/HEVC encoder and decoder solutions that have been adopted by the Xunlei online video service and integrated with the open source FFMPEG for ISVs to use. We worked with Strongene to optimize the H.265/HEVC encoder and decoder on platforms with Intel® Xeon® processors, Intel® Core™ processors, and Intel® Atom™ processors using new IA-based platform technologies, to achieve a real-time, end-to-end HEVC codec solution.

3.1 Real-time HEVC Encoder Solution Based on Intel® Xeon® Processor

Video encoding is a typical CPU- and memory-intensive workload that places high demands on the server platform, such as core computing efficiency, reliability, and stability. H.265/HEVC is computationally about 4 times more complex than the previous H.264/MPEG-4 AVC, which raises unprecedented processing requirements for the backend server platform. In this section, we introduce the major IA-based technologies that helped the Strongene HEVC encoder reach real-time 1080p encoding.

3.1.1 Intel SIMD Vectorization Tuning for HEVC Encoding Functions

Most of the time-consuming video and image processing functions are block-based, data-intensive computations, which can be optimized using Intel® SIMD (Single Instruction, Multiple Data) vectorization instructions. A SIMD instruction processes multiple data elements at once, which greatly improves data throughput and execution efficiency. Intel SIMD support has evolved from MMX through Intel® SSE and Intel® Advanced Vector Extensions (Intel® AVX) to Intel® Advanced Vector Extensions 2 (Intel® AVX2) across x86 platform generations.

In the Strongene encoder, the profiling data shows that all the major hot functions can be vectorized using Intel SSE instructions, including low-complexity motion-compensated frame interpolation, the transpose-free integer transform, the butterfly Hadamard transform, and the least-memory-redundancy SAD/SSD calculation. We enabled Intel SSE instructions on the Intel Xeon processor-based platform, as shown in Figure 6.

Figure 6. Sample of Enabling Intel® SIMD/SSE Instructions in Strongene Codec

With those Intel SIMD programming models and paradigms, Strongene rewrote all the hot functions in the encoding codec to obtain the maximum performance increase. Figure 7 is our profiling data in a standard 1080p HEVC encoding scenario, which shows 60% of the hot functions are running in Intel SIMD instructions.

Figure 7. Profiling Results of Strongene Encoding Functions

Intel AVX2 instructions, with 256-bit integer operations, can in principle double the throughput of the earlier 128-bit Intel SSE code. Intel AVX2 will be supported on the Intel Xeon processor-based platform codenamed Haswell, due to be launched in 2014. We take a common 64×64 block SAD computation as an example to test the intrinsic performance of Intel AVX2:

Table 1. Intel® SSE and Intel® AVX2 Implementation Results

CPU Cycle	original	Intel® SSE	Intel® AVX2
run 1	98877	977	679
run 2	98463	1092	690
run 3	98152	978	679
run 4	98003	943	679
run 5	98118	954	678
avg.	98322.6	988.8	681
speedup	1.00	99.44	144.38

As shown in Table 1, for this function Intel SSE boosts performance by about 100 times and Intel AVX2 by about 144 times over the original code; the Intel AVX2 code is more than 40% faster than the Intel SSE code². We can expect further performance improvement when upgrading the Intel SSE code to Intel AVX2 on the upcoming Haswell platform.
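For reference, the SAD (sum of absolute differences) kernel measured in Table 1 can be written as a plain scalar loop. The Python sketch below is not Strongene's code; it shows the per-sample computation that a SIMD version would batch, and it recomputes the speedup row of Table 1 from the average cycle counts. The block contents are synthetic values chosen for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


# Two synthetic 64x64 blocks (made-up sample values, illustration only).
N = 64
cur = [[(x + y) % 256 for x in range(N)] for y in range(N)]
ref = [[(x + y + 3) % 256 for x in range(N)] for y in range(N)]
print("SAD:", sad(cur, ref))

# Average cycle counts copied from Table 1.
avg_cycles = {"original": 98322.6, "sse": 988.8, "avx2": 681.0}
speedup = {name: avg_cycles["original"] / c for name, c in avg_cycles.items()}
avx2_over_sse = avg_cycles["sse"] / avg_cycles["avx2"]

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
print(f"AVX2 over SSE: {avx2_over_sse:.2f}x")
```

Every sample difference is independent of the others, which is why this kernel vectorizes so well: assuming 8-bit samples, a 128-bit register holds 16 of them and a 256-bit register holds 32.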

3.1.2 Thread Concurrency and Core Scalability Tuning

As we saw in Section 2.3, most current implementations do not utilize all the cores on multi-core platforms. Having clarified the parallelism dependencies among the CTB-based HEVC algorithms on the latest Intel Xeon multi-core architecture, Strongene proposes replacing the original OWF (Overlapped Wave-Front) and WPP (Wave-front Parallel Processing) methods with an Inter-Frame Wave-front (IFW) parallel framework, together with a three-level thread management scheme that ensures IFW can use all the CPU cores to accelerate HEVC encoding. With this new parallelism framework, on an Ivy Bridge platform (Intel Xeon processor E5-2697 v2 @2.70GHz, 12*2 physical cores, SMT OFF), the Strongene codec can utilize the computing resources of 18-24 physical cores, achieving good thread concurrency.

Figure 8. Thread Concurrency and CPU Utilization in Strongene Encoding Codec

With the new IFW parallelism framework at the task level and fully implemented Intel SIMD instructions at the data level, the Strongene encoder achieved a tremendous speedup on x86 processors for 1080p video sequences and successfully leveraged the computing capability of all cores, as shown in Figure 8.

3.1.3 Further Tuning with SMT/HT

Simultaneous Multithreading (SMT), also called Hyper-Threading (HT) technology, is widely supported on IA-based platforms. It allows the operating system to address two virtual or logical cores for each physical core and to share the resources between them when possible. The main function of Hyper-Threading is to decrease the number of dependent instructions in the pipeline. It offers performance benefits when the CPU cores are fully loaded, but not all applications benefit, for example those that do not utilize all the cores. In these cases SMT introduces task/thread switching overhead. Therefore, we turned off SMT on the Strongene encoding platform and reached real-time HEVC 1080p encoding (25 FPS) on the Ivy Bridge platform (Intel Xeon processor E5-2697 v2), as shown in the 1080p row with SMT OFF in the following table.

Table 2. Strongene HEVC Encoding Performance on Intel® Xeon® Processor-based Platform³

Platform	Resolution	Bitrate (kbps)	FPS	CPU Usage	Encoding mode	SMT
WSM E7-8837 @2.67GHz (8*8c)	720p	800	8.2	15c	ultrafast	OFF
	720p	1600	2.6	18c	ultraslow	OFF
	1080p	1500	3.6	27c	ultrafast	OFF
	1080p	3000	1.4	23c	ultraslow	OFF
	4k	5000	1.2	19c	ultrafast	OFF
	4k	10000	0.5	21c	ultraslow	OFF
IVY E5-2697 v2 @2.70GHz (2*12c)	720p	1000	11	40% 14c	ultraslow	ON
	720p	1000	46	60% 16c	ultrafast	ON
	1080p	1500	21	70% 16c	ultrafast	ON
	1080p	1500	25	80% 18c	ultrafast	OFF
IVB E7-4890 @2.80GHz (4*15c)	1080p	2000	22	19c	ultrafast	ON
	1080p	8000	6.11	15c	ultraslow	ON
	4k	8000	7.02	29c	ultrafast	ON
	4k	8000	3.28	23c	ultraslow	ON

After achieving tremendous performance improvements, we further evaluated the Strongene HEVC encoding codec capability on the Ivy Bridge platform, focusing on the bandwidth and quality issues.

Table 3. H.264 and H.265 Codec Performance Comparison

File: BQTerrance_1920x1080_60.yuv
Resolution: 1920x1080
Size: 1869 MB, 622080 kbps
Platform: E5-2697 v2 @2.70GHz, RAM 64GB DDR3-1867, QPI 8.0 GT/s
OS/SW: Red Hat 6.4, kernel 2.6.32, gcc v4.4.7, ffmpeg v2.0.1, Lentoid HEVC Encoder r2096 linux

Codec	Size (bytes)	Bitrate (kbps)	PSNR Y/U/V (dB)
H.264	12254696	4078.1	32.311/39.369/42.043
H.265	6215615	2064.28	34.016/39.822/42.141

From Table 3, we can see that the H.265/HEVC codec saves about 50% of the bandwidth⁴ while delivering equal or slightly higher PSNR than H.264.
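The saving can be recomputed directly from the figures in Table 3. The short Python check below uses only values copied from the table; the dictionary keys are names chosen for this illustration.

```python
# Values copied from Table 3.
h264 = {"size_bytes": 12254696, "bitrate_kbps": 4078.1, "psnr_y_db": 32.311}
h265 = {"size_bytes": 6215615, "bitrate_kbps": 2064.28, "psnr_y_db": 34.016}

size_saving = 1 - h265["size_bytes"] / h264["size_bytes"]
bitrate_saving = 1 - h265["bitrate_kbps"] / h264["bitrate_kbps"]
psnr_y_gain = h265["psnr_y_db"] - h264["psnr_y_db"]

print(f"size saving:    {size_saving:.1%}")
print(f"bitrate saving: {bitrate_saving:.1%}")
print(f"luma PSNR gain: {psnr_y_gain:.3f} dB")
```

Both the file size and the bitrate come out at a saving of roughly 49%, while the luma PSNR of the H.265 stream is about 1.7 dB higher.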

3.2 High Performance H.265/HEVC Decoder on Intel® Core™ Processor-based Platforms

The Strongene HEVC/H.265 decoder is an optimized H.265 decoder that provides good performance with relatively low computation requirements. Its high efficiency is achieved by a fully parallelized architecture design and a Wavefront Parallel Processing (WPP) implementation. In addition, the Intel SIMD instructions available on Intel Core processor-based platforms, such as Intel SSE2, Intel SSSE3, and Intel SSE4, are used to accelerate various decoding blocks and unleash the power of the underlying Intel architecture. With these features, the Strongene HEVC decoder can achieve real-time 4K decoding on a mainstream CPU and a decoding rate of up to 200 FPS for 1080p video streams.

3.2.1 Optimization and Performance Analysis of Strongene HEVC Decoder

Multithreading optimization in the Strongene HEVC decoder is achieved through WPP and frame layer parallelism. WPP is a feature introduced in HEVC that allows parallel processing by dividing a slice into rows of Coding Tree Units (CTUs) and allocating each row to a thread; a row can be processed as soon as the CTUs it references in the preceding row have been decoded. The frame layer parallelism implemented in the Strongene HEVC decoder exploits the hierarchical structure allowed by the HEVC standard: because B frames can be referenced by other B frames, a hierarchical referencing structure can be built. For example, if the Group of Pictures (GOP) size is 8, the sequence can be encoded as follows:

Figure 9. One of the Possible Encoded Frame Structures (Display Order) to Utilize Frame Layer Parallelism for GOP = 8

In this case, B1 uses 2 P frames as reference in the first stage. In the second stage, the two B2 frames use a P frame and a B1 frame as reference. Therefore, these two B2 frames can be processed in parallel. In the third stage, the four B3 frames use either a P frame and a B2 frame, or a B1 frame and a B2 frame as reference. As a result, the four B3 frames can also be processed in parallel. If a larger GOP is used, the frame layer parallelism can be further improved given that the number of threads in the HEVC decoder is sufficient to support the B frame decoding concurrently. The Strongene HEVC decoder is well-organized to achieve the maximum level of parallelism through multithread decoding and WPP in order to boost the decoding speed.
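Both forms of parallelism described above can be illustrated with a small Python sketch. It is a simplified model, not the Strongene scheduler. For WPP it assumes unit decode time per CTU, one thread per CTU row, and that each CTU waits for its left neighbor and for the above-right CTU in the preceding row. For frame layer parallelism it assumes a dyadic hierarchy between two P frames at positions 0 and `gop`, with `gop` a power of two; since the figure is not reproduced here, that frame layout is an assumption.

```python
def wpp_schedule(rows, cols):
    """Earliest time step at which each CTU can be decoded under WPP."""
    t = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(t[r][c - 1])  # left neighbor in the same row
            if r > 0:
                # above-right CTU (clamped at the last column)
                deps.append(t[r - 1][min(c + 1, cols - 1)])
            t[r][c] = max(deps) + 1 if deps else 0
    return t


def b_frame_stages(gop):
    """Group the B frames between P frames at 0 and gop into stages.

    Frames within one stage reference only P frames and B frames from
    earlier stages, so they can be decoded in parallel.
    """
    stages = []
    step = gop
    while step > 1:
        half = step // 2
        stages.append([pos + half for pos in range(0, gop, step)])
        step = half
    return stages


schedule = wpp_schedule(3, 5)
for row in schedule:
    print(row)
print("steps with WPP:", schedule[-1][-1] + 1, "vs sequential:", 3 * 5)
print("B-frame stages for GOP = 8:", b_frame_stages(8))
```

In this model each CTU row starts two steps after the row above it, and the GOP of 8 yields stages of one, two, and four B frames, matching the B1, B2, and B3 levels described in the text.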

Here is the maximum decoding frame rate of Strongene HEVC decoder before (Table 4) and after (Table 5) Intel SSE optimization on a Sandy Bridge platform⁵, running on 1080p and 4K sequences with different numbers of threads enabled.

Table 4. Decoding Rate of Strongene HEVC Decoder Before Intel SSE Optimization (Lentoid C) on 1080p and 4K Video Streams with Different Numbers of Threads Enabled

	1080p 1.2Mbps		4K 5.6Mbps
	Decoding Rate w/o Rendering (FPS)	Average CPU Utilization	Decoding Rate w/o Rendering (FPS)	Average CPU Utilization
1 thread	25.33	25%	6.85	25%
2 threads	43.03	49%	11.8	47%
4 threads	51.79	93%	14.13	86%
8 threads	53.1	98%	15.03	99%

Table 5. Decoding Rate of Strongene HEVC Decoder After Intel SSE Optimization (v2.0.1.14) on 1080p and 4K Video Streams with Different Numbers of Threads Enabled

	1080p 1.2Mbps		4K 5.6Mbps
	Decoding Rate w/o Rendering (FPS)	Average CPU Utilization	Decoding Rate w/o Rendering (FPS)	Average CPU Utilization
1 thread	75	25%	21	25%
2 threads	120	45%	33	40%
4 threads	140	70%	36	63%
8 threads	154	98%	40	96%

The data above shows that Intel SSE optimization⁵ yields a ~3x performance gain for 1080p streams and ~2.6x for 4K streams on the Sandy Bridge platform. The multi-threading design of the Strongene HEVC decoder also contributes a significant boost over single-thread mode: the decoding frame rate roughly doubles when the number of simultaneous decoding threads increases from 1 to 8. In terms of overall performance, even on the dual-core Sandy Bridge mobile platform, the Strongene HEVC decoder with Intel SSE optimization can decode 4K streams in real time with less than 40% CPU utilization, which makes it one of the best HEVC software decoders available in the industry. For 1080p streams with bit rates from 1Mbps to 3Mbps (the typical bit rate setting for 1080p video streamed over the Internet), real-time decoding can be achieved with less than 20% CPU utilization.
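The ratios quoted above can be reproduced from Tables 4 and 5. The following Python snippet uses only the decoding rates copied from those tables; the variable names are chosen for this illustration.

```python
# Decoding rates (FPS) copied from Table 4 (before SSE) and Table 5
# (after SSE), keyed by number of threads.
before = {"1080p": {1: 25.33, 2: 43.03, 4: 51.79, 8: 53.1},
          "4K":    {1: 6.85, 2: 11.8, 4: 14.13, 8: 15.03}}
after = {"1080p": {1: 75, 2: 120, 4: 140, 8: 154},
         "4K":    {1: 21, 2: 33, 4: 36, 8: 40}}

# Gain from SSE optimization at each thread count.
sse_gain = {res: {n: after[res][n] / before[res][n] for n in after[res]}
            for res in after}

# Gain from going from 1 to 8 threads in the SSE-optimized decoder.
thread_gain = {res: after[res][8] / after[res][1] for res in after}

for res in after:
    gains = ", ".join(f"{n}T: {g:.2f}x" for n, g in sse_gain[res].items())
    print(f"{res} SSE gain -> {gains}; 1->8 threads: {thread_gain[res]:.2f}x")
```

With 8 threads the SSE gain works out to about 2.9x for 1080p and about 2.7x for 4K, and going from 1 to 8 threads gives roughly 2x in both cases.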

3.2.2 Comparison of Intel SSE-Optimized Strongene HEVC Decoder with Open Source Alternatives and Future Optimization Opportunities

The performance of the Strongene HEVC decoder can be further examined through comparison with some well-known open source implementations such as HM and FFMPEG. In the following charts, decoding rates for different HEVC decoders are compared using video streams with various resolutions, frame rates, and bit rates.

HM10.0: HEVC reference decoder HM10.0

FFMPEG: FFMPEG 2.1 HEVC decoder running on single thread

FFMPEG 4 threads: FFMPEG 2.1 HEVC decoder running on 4 threads

Lentoid C: Strongene HEVC decoder before SSE optimization running on single thread

Lentoid SIMD: Strongene HEVC decoder after SSE optimization running on single thread (v2.0.1.16)

Lentoid SIMD 4 threads: Strongene HEVC decoder after SSE optimization running on 4 threads (v2.0.1.16)

Figure 10. H.265 Decoding Frame Rate of 4K Videos for Various Decoders and Configurations

Figure 11. H.265 Decoding Frame Rate of Class A Videos for Various Decoders and Configurations

Figure 12. H.265 Decoding Frame Rate of Class B Videos for Various Decoders and Configurations

Figure 13. H.265 Decoding Frame Rate of Class C Videos for Various Decoders and Configurations

Figure 14. H.265 Decoding Frame Rate of Class E Videos for Various Decoders and Configurations

Figure 15. H.265 Decoding Frame Rate of Class F Videos for Various Decoders and Configurations

Across the different classes of videos, the Strongene HEVC decoder with Intel SSE optimization achieved a ~10x speedup⁶ over the HM10.0 decoder. The gain is even larger for lower bit-rate streams. However, the acceleration ratio of Intel SSE optimization (Lentoid SIMD 4 threads / Lentoid C) decreases as the bit rate increases, because Intel SIMD instructions are more effective on modules that can be parallelized, such as Motion Compensation, than on those that cannot (CABAC, IDCT, and deblocking). The phenomenon can be explained in more detail by looking at the VTune™ Amplifier XE hotspot functions before and after Intel SSE optimization:

Figure 16. Hotspot Functions of the Strongene HEVC Decoder Before Intel SSE Optimization (Lentoid C 8 Threads) Running on 4K 5.6Mbps Workload from the Perspective of the VTune™ Amplifier

Figure 17. Hotspot Functions of Strongene HEVC Decoder After SSE Optimization (Lentoid SIMD 8 Threads) Running on 4K 5.6Mbps Workload from the Perspective of the VTune™ Amplifier

In Figure 16, we found that most of the hotspots in Lentoid C decoder were in the Motion Compensation (MC) module since MC has to be done for each CTB and requires extensive computation resources. However, MC can be parallelized in the CTB level, so it can achieve the highest acceleration ratio after Intel SSE optimization:

$$
\text{MC Acceleration Ratio}_{avg} = \frac{\sum_{i \in \text{CTBs}} \text{MC Acceleration Ratio}_i \times \text{Number of pixels in } i}{\text{Number of luma and chroma pixels in a frame}}
$$

As the bit rate increases, many more computation resources are spent on CABAC, IDCT, and deblocking in order to decode and process video data, which leads to a lower Intel SSE acceleration ratio for these modules. That’s why hotspot functions have been shifted from MC to IDCT and deblocking modules after Intel SSE optimization, as shown in Figure 17.

In addition, the CPU concurrency view in VTune shows that, when running the Intel SSE-optimized decoder with 8 threads on the 4K 5.6Mbps stream, at least 3 logical CPUs are running for 74% of the decoding time, while only 1 or 2 logical CPUs are running for the remaining 26% because of workload imbalance among B frames.

Figure 18. CPU Usage Histogram

For hotspot analysis, all the top five hot functions are actually compute-bound instead of memory-bound, which implies that these functions can be further optimized through Intel AVX and Intel AVX2 instructions.

Figure 19. Top Hotspots

3.3 Optimizing H.265/HEVC Decoder on Intel® Atom™ Processor-based Platforms

Watching video is the top usage for mobile devices. Multimedia processing is compute-intensive and has a big impact on battery life and user experience. Display resolution on mobile devices has improved from 480p to 720p and now to 1080p. End users want to watch high-quality video, but for online video providers such as Youku, iQiyi, and LeTV, purchasing network bandwidth becomes more expensive every year.

In 2013 Intel introduced the new 4th generation Intel Atom processor-based platforms (code-named Bay Trail), powered by 22nm Silvermont Architecture. The details of this architecture are shown below:

Figure 20. Bay Trail Platform Introduction

We used Intel VTune tools to profile the Strongene H.265/HEVC decoder and then optimized it using the toolsets explained in the next three subsections. As a result, we obtained very high decoding speed and low CPU occupancy on Intel Atom processor-based platforms.

3.3.1 Optimized by YASM & Intel® C++ Compiler

Instead of building the optimized assembly code in open source FFMPEG with the default Android* compiler, we used YASM and the Intel® C++ Compiler.

YASM is a complete rewrite of the NASM assembler under the "new" BSD License; it makes it possible to reuse the Intel SIMD-optimized assembly code for x86 platforms. Developers can download and install the YASM assembler from http://yasm.tortall.net. To use it, modify the configure.sh file to enable the YASM and ASM options before compiling FFMPEG, as shown below:

Figure 21. Modify the FFMPEG Configure File

We also encouraged the ISVs to use the Intel C++ Compiler to compile the native code.

3.3.2 Optimized with Intel® Streaming SIMD Extensions (Intel® SSE) Instructions

Profiling with Intel VTune tools, we found that the Strongene codec implemented the YUV2RGB conversion in plain C code only, which made its performance less than optimal.

Intel Atom processor-based platforms support the Intel SIMD instruction sets MMX, MMXEXT, Intel SSE, Intel SSE2, Intel SSE3, Intel SSSE3, and Intel SSE4. Enabling the Intel SSE code in open source FFMPEG can greatly improve YUV2RGB performance.

We enabled the Intel SSSE3 option in FFMPEG, together with the MMXEXT code, as shown in the code snippet below.

Figure 22. Enable Intel® SSE Code in the FFMPEG

The Bay Trail platform can support Intel SSE 4.1 instructions, which were used to optimize the H.265/HEVC decoder for better performance.
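To make clear why the YUV2RGB step benefits so much from SIMD, here is a scalar Python sketch of the per-pixel conversion. It is not the FFMPEG or Strongene code. The coefficients are those of the full-range BT.601 conversion, which is an assumption; the actual code path may use a different matrix or value range.

```python
def clamp(value):
    """Round to the nearest integer and clip to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one full-range BT.601 YUV sample to 8-bit RGB (assumed matrix)."""
    d = u - 128
    e = v - 128
    r = clamp(y + 1.402 * e)
    g = clamp(y - 0.344136 * d - 0.714136 * e)
    b = clamp(y + 1.772 * d)
    return r, g, b


# Convert a tiny made-up row of pixels; each pixel is handled independently.
row = [(0, 128, 128), (128, 128, 128), (255, 128, 128), (76, 85, 255)]
print([yuv_to_rgb(*pixel) for pixel in row])
```

Each output pixel depends only on its own Y, U, and V values, and the same multiply-add pattern is repeated for every pixel, so the loop maps naturally onto SIMD instructions that process many pixels at once.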

3.3.3 Optimized by Intel® Threading Building Blocks (Intel® TBB) Tool

When we ran the VTune tool, we found that the Strongene codec created four threads. However, the fastest thread had to wait for the slowest thread, creating idle cores.

On its own, Intel SSE code speeds up execution on a single core only. Combining Intel TBB with Intel SSE lets the code run on multiple cores, further improving performance.

We restructured the multi-threaded code into multiple tasks and then used Intel TBB to allocate the tasks to idle cores in order to fully utilize all the cores.

Intel TBB can be downloaded from http://threadingbuildingblocks.org/download.

Figure 23. YUV to RGB Comparison Data

Optimization by Intel TBB can get up to 2.6x performance improvement⁷.
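The idea of replacing a few long-running threads with many small tasks can be sketched as follows. This example uses Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in and does not use the Intel TBB API; `convert_strip` is a hypothetical placeholder for the real per-row work. In CPython, threads do not execute Python code in parallel, so the sketch shows only the scheduling structure, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_strip(strip):
    """Hypothetical stand-in for per-row work (e.g. YUV to RGB on one row)."""
    return [min(255, 2 * sample) for sample in strip]


def convert_frame(rows, workers=4):
    # One task per row gives many more tasks than workers, so a worker
    # that finishes early takes the next pending task instead of idling
    # until the slowest worker is done.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_strip, rows))


# A small made-up frame: 32 rows of 16 samples.
frame = [[(x * y) % 256 for x in range(16)] for y in range(32)]
converted = convert_frame(frame)
print(len(converted), "rows converted")
```

The design choice is the task granularity: with four coarse threads, the slowest one determines the finish time, whereas with many fine-grained tasks the remaining work is shared among whichever workers are free.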

3.3.4 H.265/HEVC Decoder Performance Comparison

Through testing, we found that optimization by YASM and the Intel C++ Compiler improved performance up to 1.5x, optimization by Intel SSE improved performance up to 6x compared to C code, and optimization by Intel® TBB improved performance up to 2.6x⁸. OpenGL* was also enabled for rendering.

We used Intel® Graphics Performance Analyzers (Intel® GPA) to test the refresh rate when playing video. When tested with the optimized H.265/HEVC decoder on the Bay Trail tablet, the refresh rate can reach 90 FPS (frames per second) when playing the HEVC 1080p video, while the Clover Trail+ tablet can reach 40 FPS.

Figure 24. Performance Comparison on Clover Trail+/ Bay Trail tablets

If we set the refresh rate to 24 FPS on the Bay Trail tablet, when playing the 1080p video, the CPU workload is less than 25%. So we readily recommend the Strongene HEVC decoder solution to the popular online video providers in the PRC for commercial use.

4. Summary

H.265/HEVC will likely be the most popular video standard in the coming decade. Many media applications and products are currently adding HEVC support. In this paper, we implemented a CPU-based, real-time, end-to-end HEVC solution on Intel platforms using new IA platform technologies. Our Intel processor-based solution has been deployed in Xunlei[4] online video services and products, and should accelerate the production deployment of H.265/HEVC technology.

5. Other related articles

Reference

[1] Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, December 2012.

[2] High Efficiency Video Coding (HEVC) text specification draft 10, JCTVC-L1003_v34

[3] http://www.strongene.com/en/homepage.jsp

[4] http://yasm.tortall.net

[5] http://threadingbuildingblocks.org

[6] http://hevc.hhi.fraunhofer.de/
