Threading and Intel® Integrated Performance Primitives
Abstract
There is no universal threading solution that works for all applications. Likewise, there are multiple ways for applications built with Intel® Integrated Performance Primitives (Intel® IPP) to utilize multi-threading. Threading can be implemented within the Intel IPP library or within the application that employs the Intel IPP library. The following article describes some of the ways in which an application that utilizes the Intel IPP library can safely and successfully take advantage of multiple threads of execution.
This article is part of the larger series, "Intel Guide for Developing Multithreaded Applications," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.
Background
Intel IPP is a collection of highly optimized functions for digital media and data-processing applications. The library includes optimized functions for frequently used fundamental algorithms found in a variety of domains, including signal processing, image, audio, and video encode/decode, data compression, string processing, and encryption. The library takes advantage of the extensive SIMD (single instruction multiple data) and SSE (streaming SIMD extensions) instruction sets and multiple hardware execution threads available in modern Intel® processors. Many of the SSE instructions found in today's processors are modeled after those on DSPs (digital signal processors) and are ideal for optimizing algorithms that operate on arrays and vectors of data.
The Intel IPP library is available for applications built for the Windows*, Linux*, Mac OS* X, QNX*, and VxWorks* operating systems. It is compatible with the Intel® C and Fortran Compilers, the Microsoft Visual Studio* C/C++ compilers, and the gcc compilers included with most Linux distributions. The library has been validated for use with multiple generations of Intel and compatible AMD processors, including the Intel® Core™ and Intel® Atom™ processors. Both 32-bit and 64-bit operating systems and architectures are supported.
Introduction
The Intel IPP library has been constructed to accommodate a variety of approaches to multithreading. The library is thread-safe by design, meaning that the library functions can be safely called from multiple threads within a single application. Additionally, variants of the library are provided with multithreading built in, by way of the Intel OpenMP* library, giving you an immediate performance boost without requiring that your application be rewritten as a multithreaded application.
The Intel IPP primitives (the low-level functions that comprise the base Intel IPP library) are a collection of algorithmic elements designed to operate repetitively on data vectors and arrays, an ideal condition for the implementation of multithreaded applications. The primitives are independent of the underlying operating system; they do not utilize locks, semaphores, or global variables, and they rely only on the standard C library memory allocation routines (malloc/realloc/calloc/free) for temporary and state memory storage. To further reduce dependency on external system functions, you can use the i_malloc interface to substitute your own memory allocation routines for the standard C routines.
In addition to the low-level algorithmic primitives, the Intel IPP library includes a collection of industry-standard, high-level applications and tools that implement image, media and speech codecs (encoders and decoders), data compression libraries, string processing functions, and cryptography tools. Many of these high-level tools use multiple threads to divide the work between two or more hardware threads.
Even within a single-threaded application, the Intel IPP library provides a significant performance boost by providing easy access to SIMD instructions (MMX, SSE, etc.) through a set of functions designed to meet the needs of numerically intensive algorithms.
Figure 1 shows relative average performance improvements measured for the various Intel IPP product domains, as compared to the equivalent functions implemented without the aid of MMX/SSE instructions. Your actual performance improvement will vary.
Figure 1. Relative performance improvements for various Intel® IPP product domains.
System configuration: Intel® Xeon® Quad-Core Processor, 2.8GHz, 2GB using Windows* XP and the Intel IPP 6.0 library
Advice
The simplest and quickest way to take advantage of multiple hardware threads with the Intel IPP library is to use a multi-threaded variant of the library or to incorporate one of the many threaded sample applications provided with the library. This requires no significant code rework and can provide additional performance improvements beyond those that result simply from the use of Intel IPP.
Three fundamental variants of the library are delivered (as of version 6.1): a single-threaded static library; a multithreaded static library; and a multithreaded dynamic library. All three variants of the library are thread-safe. The single-threaded static library should be used in kernel-mode applications or in those cases where the use of the OpenMP library either cannot be tolerated or is not supported (as may be the case with a real-time operating system).
Two Threading Choices: OpenMP Threading and Intel IPP
The low-level primitives within Intel IPP are basic atomic operations, which limits the amount of parallelism that can be exploited internally; approximately 15% of the library functions are threaded. The Intel OpenMP library implements this "behind the scenes" parallelism, which is enabled by default when using one of the multi-threaded variants of the library.
A complete list of the multi-threaded primitives is provided in the ThreadedFunctionsList.txt file located in the Intel IPP documentation directory.
Note: the fact that the Intel IPP library is built with the Intel C compiler and OpenMP does not require that applications also be built using these tools. The Intel IPP library primitives are delivered in a binary format compatible with the C compiler for the relevant operating system (OS) platform and are ready to link with applications. Programmers can build applications that use the Intel IPP library with Intel tools or with other preferred development tools.
Controlling OpenMP Threading in the Intel IPP Primitives
The default number of OpenMP threads used by the threaded Intel IPP primitives is equal to the number of hardware threads in the system, which is determined by the number and type of CPUs present. For example, a quad-core processor that supports Intel® Hyper-Threading Technology (Intel® HT Technology) has eight hardware threads (each of four cores supports two threads). A dual-core CPU that does not include Intel HT Technology has two hardware threads.
Two Intel IPP primitives provide universal control and status regarding OpenMP threading within the multi-threaded variants of the Intel IPP library: ippSetNumThreads() and ippGetNumThreads(). Call ippGetNumThreads to determine the current thread cap and use ippSetNumThreads to change the thread cap. ippSetNumThreads will not allow you to set the thread cap beyond the number of available hardware threads. This thread cap serves as an upper bound on the number of OpenMP software threads that can be used within a multi-threaded primitive. Some Intel IPP functions may use fewer threads than specified by the thread cap in order to achieve their optimum parallel efficiency, but they will never use more than the thread cap allows.
To disable OpenMP within a threaded variant of the Intel IPP library, call ippSetNumThreads(1) near the beginning of the application, or link the application with the Intel IPP single-threaded static library.
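As a minimal sketch of how these controls fit together (assuming the ipp.h master header and one of the multi-threaded library variants; error handling is abbreviated), an application might query and adjust the thread cap as follows:

#include <stdio.h>
#include <ipp.h>  /* Intel IPP master header */

int main(void)
{
    int threadCap = 0;

    /* Query the current upper bound on OpenMP threads used inside the primitives. */
    if (ippGetNumThreads(&threadCap) == ippStsNoErr)
        printf("Current Intel IPP thread cap: %d\n", threadCap);

    /* Lower the cap; requests above the number of hardware threads are not honored. */
    ippSetNumThreads(2);

    /* Setting the cap to 1 disables OpenMP threading inside the primitives. */
    ippSetNumThreads(1);

    return 0;
}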
The OpenMP library references several configuration environment variables. In particular, OMP_NUM_THREADS sets the default number of threads (the thread cap) to be used by the OpenMP library at run time. However, the Intel IPP library will override this setting by limiting the number of OpenMP threads used by an application to be either the number of hardware threads in the system, as described above, or the value specified by a call to ippSetNumThreads, whichever is lower. OpenMP applications that do not use the Intel IPP library may still be affected by the OMP_NUM_THREADS environment variable. Likewise, such OpenMP applications are not affected by a call to the ippSetNumThreads function within Intel IPP applications.
Nested OpenMP
If an Intel IPP application also implements multithreading using OpenMP, the threaded Intel IPP primitives the application calls may execute as single-threaded primitives. This happens if the Intel IPP primitive is called within an OpenMP parallelized section of code and if nested parallelization has been disabled (which is the default case) within the Intel OpenMP library.
Nesting of parallel OpenMP regions risks creating a large number of threads that effectively oversubscribe the number of hardware threads available. Creating a parallel region always incurs overhead, and the overhead associated with the nesting of parallel OpenMP regions may outweigh the benefit. In general, OpenMP threaded applications that use the Intel IPP primitives should disable multi-threading within the Intel IPP library either by calling ippSetNumThreads(1) or by using the single-threaded static Intel IPP library. The sketch below illustrates the recommended pattern; ippsSqrt_32f_I is used simply as a representative in-place primitive, and buffer setup is omitted.
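#include <omp.h>
#include <ipp.h>

/* Each application-level OpenMP thread processes its own independent buffer.
   With nested parallelism disabled (the OpenMP default), a threaded primitive
   called here would run single-threaded anyway; setting the Intel IPP thread
   cap to 1 makes that behavior explicit and avoids competing thread pools. */
void process_buffers(Ipp32f *buffers[], int numBuffers, int len)
{
    ippSetNumThreads(1);  /* keep the library single-threaded inside each application thread */

    #pragma omp parallel for
    for (int i = 0; i < numBuffers; ++i)
        ippsSqrt_32f_I(buffers[i], len);  /* in-place square root, element by element */
}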
Core Affinity
Some of the Intel IPP primitives in the signal-processing domain are designed to execute parallel threads that exploit a merged L2 cache. These functions (single and double precision FFT, Div, Sqrt, etc.) need a shared cache to achieve their maximum multi-threaded performance. In other words, the threads within these primitives should execute on cores located on a single die with a shared cache. To ensure that this condition is met, the following OpenMP environment variable should be set before an application using the Intel IPP library runs:
KMP_AFFINITY=compact
On processors with two or more cores on a single die, this condition is satisfied automatically and the environment variable is superfluous. However, on systems with more than one die (e.g., a Pentium® D processor or a multi-socket motherboard), where the cache serving each die is not shared, failing to set this OpenMP environment variable may result in significant performance degradation for this class of multi-threaded Intel IPP primitives.
Usage Guidelines
Threading Within an Intel IPP Application
Many multithreaded examples of applications that use the Intel IPP primitives are provided as part of the Intel IPP library. Source code is included with all of these samples. Some of the examples implement threading at the application level, and some use the threading built into the Intel IPP library. In most cases, the performance gain due to multithreading is substantial.
When using the primitives in a multithreaded application, disabling the Intel IPP library's built-in threading is recommended, using any of the techniques described in the previous section. Doing so ensures that there is no competition between the built-in threading of the library and the application's threading mechanism, helping to avoid an oversubscription of software threads to the available hardware threads.
Most of the library primitives emphasize operations on arrays of data, because the Intel IPP library takes advantage of the processor's SIMD instructions, which are well suited to vector operations. Threading is natural for operations on multiple data elements that are largely independent. In general, the easiest way to thread with the library is by using data decomposition, or splitting large blocks of data into smaller blocks and working on those smaller blocks with multiple identical parallel threads of execution, as shown in the sketch below. In that sketch, ippsAddC_32f_I (add a constant to each element, in place) stands in for whatever per-block operation the application needs; the library's internal threading stays disabled while the application's threads each process their own block.
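#include <omp.h>
#include <ipp.h>

/* Data decomposition: one large vector is split into roughly equal blocks and
   each block is processed by a separate application thread. */
void offset_signal(Ipp32f *data, int totalLen, Ipp32f offset)
{
    ippSetNumThreads(1);  /* the application, not the library, supplies the threads */

    int numBlocks = omp_get_max_threads();
    int blockLen  = totalLen / numBlocks;

    #pragma omp parallel for
    for (int b = 0; b < numBlocks; ++b) {
        int start = b * blockLen;
        /* The last block absorbs any remainder elements. */
        int len = (b == numBlocks - 1) ? (totalLen - start) : blockLen;
        ippsAddC_32f_I(offset, data + start, len);
    }
}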
Memory and Cache Alignment
When working with large blocks of data, improperly aligned data will typically reduce throughput. The library includes a set of memory allocation and alignment functions to address this issue. Additionally, most compilers can be configured to pad structures to ensure bus-efficient alignment of data.
Cache alignment and the spacing of data relative to cache lines is very important when implementing parallel threads. This is especially true for parallel threads that contain constructs of looping Intel IPP primitives. If the operations of multiple parallel threads frequently utilize coincident or shared data structures, the write operations of one thread may invalidate the cache lines associated with the data structures of a "neighboring" thread.
When building parallel threads of identical Intel IPP operations (data decomposition), be sure to consider the relative spacing of the decomposed data blocks being operated on by the parallel threads and the spacing of any control data structures used by the primitives within those threads. Take special care when the control structures hold state information that is updated on each iteration of the Intel IPP functions. If these control structures share a cache line, an update to one control structure may invalidate a neighboring structure.
The simplest solution is to allocate these control data structures so they occupy a multiple of the processor's cache line size (typically 64 bytes). Developers can also use the compiler's align operators to ensure that these structures, and arrays of such structures, always align with cache line boundaries. The few bytes wasted padding the control structures are a small price compared to the bus cycles lost refreshing a cache line on each iteration of the primitives.
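The sketch below illustrates both ideas under the assumptions stated in the comments: a hypothetical per-thread control structure is padded to a multiple of a 64-byte cache line, and the Intel IPP allocation functions (ippMalloc, ippsMalloc_32f) supply aligned storage for the structures and for per-thread working buffers.

#include <ipp.h>

/* Hypothetical per-thread control structure whose state is updated on every
   iteration.  Padding it to a multiple of 64 bytes (a typical cache-line size)
   keeps neighboring threads' structures out of the same cache line. */
typedef struct {
    int    iteration;
    Ipp32f gain;
    char   pad[64 - sizeof(int) - sizeof(Ipp32f)];
} ThreadState;

void allocate_thread_resources(int numThreads, int samplesPerThread)
{
    /* ippMalloc returns aligned memory, so each padded ThreadState element
       begins on its own cache line. */
    ThreadState *state = (ThreadState *)ippMalloc(numThreads * (int)sizeof(ThreadState));

    /* ippsMalloc_32f provides an aligned working buffer; each thread uses its
       own non-overlapping slice. */
    Ipp32f *work = ippsMalloc_32f(numThreads * samplesPerThread);

    /* ... launch threads, each using state[t] and work + t * samplesPerThread ... */

    ippsFree(work);
    ippFree(state);
}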
Pipelined Processing with DMIP
In an ideal world, applications would adapt at run-time to optimize their use of the SIMD instructions available, the number of hardware threads present, and the size of the high-speed cache. Optimum use of these three key resources might achieve near perfect parallel operation of the application, which is the essential aim behind the DMIP library that is part of Intel IPP.
The DMIP approach to parallelization, building parallel sequences of Intel IPP primitives that execute on cache-optimally sized data blocks, can yield performance gains several times greater than those of applications that apply each function sequentially across the entire data set.
For example, rather than operate over an entire image, break the image into cacheable segments and perform multiple operations on each segment, while it remains in the cache. The sequence of operations is a calculation pipeline and is applied to each tile until the entire data set is processed. Multiple pipelines running in parallel can then be built to amplify the performance.
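The following hand-written sketch captures the tiling idea (it is not the DMIP API itself): a short pipeline of two placeholder operations, multiply by a constant and then add a constant, is applied to one cache-sized block at a time so that intermediate results stay in cache. DMIP automates this composition and can additionally run multiple such pipelines in parallel.

#include <ipp.h>

void pipeline_over_tiles(Ipp32f *data, int totalLen, int tileLen,
                         Ipp32f scale, Ipp32f bias)
{
    for (int start = 0; start < totalLen; start += tileLen) {
        int len = (start + tileLen <= totalLen) ? tileLen : (totalLen - start);

        /* Both stages operate on the same tile while it remains in cache. */
        ippsMulC_32f_I(scale, data + start, len);
        ippsAddC_32f_I(bias,  data + start, len);
    }
}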
To find out more detail about this technique, see "A Landmark in Image Processing: DMIP."
Threaded Performance Results
Additional high-level threaded library tools included with Intel IPP offer significant performance improvements when used in multicore environments. For example, the Intel IPP data compression library provides drop-in compatibility for the popular ZLIB, BZIP2, GZIP and LZO lossless data-compression libraries. The Intel IPP versions of the BZIP2 and GZIP libraries take advantage of a multithreading environment by using native threads to divide large files into multiple blocks that are compressed in parallel, or by processing multiple files in separate threads. Using this technique, the GZIP library is able to achieve as much as a 10x performance gain on a quad-core processor, when compared to an equivalent single-threaded implementation without Intel IPP.
In the area of multi-media (e.g., video and image processing), the Intel IPP versions of the H.264 and VC-1 decoders are able to achieve near-theoretical maximum scaling to the available hardware threads by using multiple native threads to parallelize decoding, reconstruction, and de-blocking operations on the video frames. Executing this Intel IPP enhanced H.264 decoder on a quad-core processor results in performance gains of between 3x and 4x for a high-definition bit stream.
Final Remarks
There is no single approach that is guaranteed to provide the best performance for all situations. The performance improvements achieved through the use of threading and the Intel IPP library depend on the nature of the application (e.g., how easily it can be threaded), the specific mix of Intel IPP primitives it can use (threaded versus non-threaded primitives, frequency of use, etc.), and the hardware platform on which it runs (number of cores, memory bandwidth, cache size and type, etc.).
Additional Resources
Parallel Programming Community
Intel IPP Product website
Taylor, Stewart. Optimizing Applications for Multi-Core Processors: Using the Intel® Integrated Performance Primitives, Second Edition. Intel Press, 2007.
User's Guide, Intel® Integrated Performance Primitives for Windows* OS on IA-32 Architecture, Document Number 318254-007US, March 2009. (pdf)
Reference Manual, Intel® Integrated Performance Primitives for Intel® Architecture: Deferred Mode Image Processing Library, Document Number: 319226-002US, January 2009.
Intel IPP i_malloc sample code, located in the advanced-usage samples (linkage section)
Wikipedia article on OpenMP*
OpenMP.org
A Landmark in Image Processing: DMIP
Optimization Notice
The Intel® Integrated Performance Primitives (Intel® IPP) library contains functions that are more highly optimized for Intel microprocessors than for other microprocessors. While the functions in the Intel® IPP library offer optimizations for both Intel and Intel-compatible microprocessors, depending on your code and other factors, you will likely get extra performance on Intel microprocessors. While the paragraph above describes the basic optimization approach for the Intel® IPP library as a whole, the library may or may not be optimized to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Intel recommends that you evaluate other library products to determine which best meets your requirements.