Nvidia Compute Unified Device Architecture (CUDA)


General-Purpose Processing on GPU: History and Context

The reign of the classical Central Processing Unit (CPU) is no longer hegemonic and the computing world is now heterogeneous. Graphics Processing Units (GPUs) have been candidates as CPU co-processors for more than a decade now. Other architectures were also developed, like the Intel Larrabee [Seiler et al. 2008], which never really reached the market as a GPU and was finally released as a co-processor under the name Xeon Phi by the end of 2012, and the IBM and Sony Cell [Hofstee 2005], which was used in the Sony PlayStation 3. However, although many researchers tried to map efficient algorithms onto its complex architecture, the Cell was discontinued. This failure resulted from its difficult programming and memory models, especially as alternatives emerged in the industry: the GPU manufacturers entered the general-purpose computing market.
Dedicated graphics hardware units offer, generally via their drivers, access to a standard Application Programming Interface (API) such as OpenGL [Khronos OpenGL Working Group 1994, Khronos OpenGL Working Group 2012] and DirectX [Microsoft 1995, Microsoft 2012]. These APIs are specific to graphics processing, the main application domain for this kind of hardware. Graphics processing makes heavy use of vector operations, and GPUs can multiply a vector by a scalar in one operation. This capability has been hijacked from graphics processing and redirected toward general-purpose computations.
This chapter first presents in Section 2.1 the history of general-purpose computing using GPUs, then Section 2.2 gives insights into the evolution of the programming models and the different initiatives taken to pave the way for General-Purpose Processing on Graphics Processing Units (GPGPU). The OpenCL standard is introduced in more detail in Section 2.3. Contemporary GPU architectures are presented in Section 2.4. Finally, I list the many programming challenges these architectures present to programmers and compiler designers.

History

The use of graphics hardware for general-purpose computing has been a research domain for more than twenty years. Harris et al. [Harris et al. 2002] proposed a history starting with machines like the Ikonas [England 1978], the Pixel Machine [Potmesil & Hoffert 1989], and Pixel-Planes 5 [Rhoades et al. 1992]. In 2000, Trendall and Stewart [Trendall & Stewart 2000] gave an overview of the past experiments with graphics hardware. Lengyel et al. [Lengyel et al. 1990] performed real-time robot motion planning using the rasterizing capabilities of graphics hardware. Bohn [Bohn 1998] interpreted a rectangle of pixels as a four-dimensional vector function to do computation on a Kohonen feature map. Hoff et al. [Hoff et al. 1999] described how to compute Voronoi diagrams using z-buffers. Kedem et al. [Kedem & Ishihara 1999] used the PixelFlow SIMD graphics computer [Eyles et al. 1997] to decrypt Unix passwords. Finally, some ray tracing was performed on GPUs in [Carr et al. 2002] and [Purcell et al. 2002]. A survey of GPGPU computation can be found in [Owens et al. 2007].
Until 2007, GPUs exposed a graphics pipeline through the OpenGL API. All the elegance of this research lay in the mapping of general mathematical computations onto this pipeline [Trendall & Stewart 2000]. A key limitation was that, at that time, GPU hardware offered only single-precision floating-point units, although double-precision floating point is often required for engineering and most scientific simulations.
GPUs have spread during the last decades, with an excellent cost/performance ratio that led to a trend in experimental research to use these specialized pieces of hardware. This trend was mirrored first in the evolution of the programming interfaces. Both OpenGL and DirectX introduced shaders (see Section 2.2.2) in 2001, adding programmability and flexibility to the graphics pipeline. However, using one of the graphics APIs was still mandatory, and therefore General-Purpose Processing on Graphics Processing Units (GPGPU) was even more challenging than it is currently.
In 2003, Buck et al. [Buck et al. 2004] implemented a subset of the Brook streaming language to program GPUs. This new language, called BrookGPU, does not expose the graphics pipeline at all. The code is compiled toward DirectX and OpenGL. BrookGPU is used, for instance, in the Folding@home project [Pande lab Stanford University 2012]. More insight into Brook and BrookGPU is given in Section 2.2.3.
Ian Buck, who designed Brook and BrookGPU, joined Nvidia to design the Compute Unified Device Architecture (CUDA) language, which shares similarities with BrookGPU. However, while BrookGPU is generic, the CUDA API is specific to Nvidia; the then-new scalar GPU architecture introduced along with CUDA is presented in Section 2.4.5. CUDA is an API and a language to program GPUs more easily. The graphics pipeline no longer exists as such; the architecture is unified and exposed as a set of Single Instruction stream, Multiple Data streams (SIMD)-like multiprocessors. CUDA is introduced in more detail in Section 2.2.4.
From 2004 to 2012, GPUs' floating-point performance increased much faster than CPUs' performance, as shown in Figure 2.1. The programmability offered by CUDA, combined with the GPU performance advantage, has made GPGPU more and more popular for scientific computing during the past five years.
The increased interest in GPGPU attracted more attention and led to the standardization of a dedicated API and language to program accelerators: the Open Computing Language, known as OpenCL (see Section 2.3).
Other programming models are emerging, such as directive-based languages. These let programmers write portable, maintainable, and hopefully efficient code. Pragma-like directives are added to a sequential code to tell the compiler which pieces of code should be executed on the accelerator. This method is less intrusive but currently may provide limited performance. Several sets of directives are presented in Section 2.2.10; a minimal sketch of the approach is shown below.
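To illustrate the directive-based approach, the following sketch annotates a sequential saxpy loop with an OpenACC-style directive. OpenACC is used here only as a representative example of the directive sets discussed in Section 2.2.10; the exact pragma syntax and data clauses vary from one set to another.

/* Hedged sketch of the directive-based approach: a sequential saxpy loop
 * annotated with an OpenACC-style pragma. The compiler is expected to
 * offload the loop to the accelerator and to manage the data transfers
 * declared in the copyin/copy clauses. Other directive sets use similar,
 * but not identical, annotations. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

Removing the pragma leaves a plain sequential C function, which is what makes this method minimally intrusive.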

 Languages, Frameworks, and Programming Models

The programming language history includes many languages, frameworks, and programming models that have been designed to program accelerators. Some were designed for the initial purpose of the accelerator, i.e., graphics computing, and were later diverted toward general-purpose computation. Others were designed entirely from scratch to address GPGPU needs.
This section surveys the major contributions, approaches, and paradigms involved during the last decade in programming hardware accelerators for general-purpose computations.


Open Graphics Library (OpenGL)

Open Graphics Library (OpenGL) is a specification for a multiplatform API that was developed in 1992 by Silicon Graphics Inc. It is used to program software that makes use of 3D or 2D graphics processing, and it provides an abstraction of the different graphics units, hiding the complexities of interfacing with different 3D accelerators. OpenGL manipulates objects such as points, lines, and polygons, and converts them into pixels via a graphics pipeline, parametrized by the OpenGL state machine.
OpenGL is a procedural API containing low-level primitives that must be used by the programmer to render a scene. OpenGL was designed around a state machine that mimics the graphics hardware available at that time. The programmer must have a good knowledge of the graphics pipeline.
OpenGL commands mostly issue objects (points, lines, and polygons) to the graphics pipeline, or configure the pipeline stages that process these objects. Basically, each stage of the pipeline performs a fixed function and is configurable only within tight limits. But since OpenGL 2.0 [Khronos OpenGL Working Group 2004] and the introduction of shaders and the OpenGL Shading Language (GLSL), several stages are now fully programmable.
In August 2012, version 4.3 was announced with a new feature: the possibility of executing compute shaders, such as the saxpy example shown in Figure 2.2, without using the full OpenGL state machine. The shader program is executed by every thread in parallel. Performing the same operation over a vector, which usually requires a loop, thus relies on an implicit iteration space. Figure 2.2 illustrates this execution model with one thread per iteration. A classic CPU version of saxpy is shown in Figure 2.4a.
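For reference, a minimal sequential C version of saxpy is sketched below; it plays the same role as the CPU code of Figure 2.4a (which is not reproduced here) and makes the contrast with the implicit iteration space explicit.

/* Sequential saxpy on the CPU: y = alpha * x + y.
 * The explicit loop enumerates the iteration space. In a compute shader
 * or kernel version, this loop disappears: each index i is handled by
 * one thread, and the iteration space is implicit in the launch
 * configuration rather than written in the code. */
void saxpy_cpu(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}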

Table of contents:

Remerciements
Abstract
Résumé
1 Introduction 
1.1 The Prophecy
1.2 Motivation
1.3 Outline
2 General-Purpose Processing on GPU: History and Context
2.1 History
2.2 Languages, Frameworks, and Programming Models
2.2.1 Open Graphics Library (OpenGL)
2.2.2 Shaders
2.2.3 Brook and BrookGPU
2.2.4 Nvidia Compute Unified Device Architecture (CUDA)
2.2.5 AMD Accelerated Parallel Processing, FireStream
2.2.6 Open Computing Language (OpenCL)
2.2.7 Microsoft DirectCompute
2.2.8 C++ Accelerated Massive Parallelism (AMP)
2.2.9 ΣC and the MPPA Accelerator
2.2.10 Directive-Based Language and Frameworks
2.2.11 Automatic Parallelization for GPGPU
2.3 Focus on OpenCL
2.3.1 Introduction
2.3.2 OpenCL Architecture
2.3.2.1 Platform Model
2.3.2.2 Execution Model
2.3.2.3 Memory Model
2.3.2.4 Programming Model
2.3.3 OpenCL Language
2.3.3.1 Conclusion
2.4 Target Architectures
2.4.1 From Specialized Hardware to a Massively Parallel Device
2.4.2 Building a GPU
2.4.3 Hardware Atomic Operations
2.4.4 AMD, from R300 to Graphics Core Next
2.4.5 Nvidia Computing Unified Device Architecture, from G80 to Kepler
2.4.6 Impact on Code Generation
2.4.7 Summary
2.5 Conclusion
3 Data Mapping, Communications and Consistency
3.1 Case Study
3.2 Array Region Analysis
3.3 Basic Transformation Process
3.4 Region Refinement Scheme
3.4.1 Converting Convex Array Regions into Data Transfers
3.4.2 Managing Variable Substitutions
3.5 Limits
3.6 Communication Optimization Algorithm
3.6.1 A New Analysis: Kernel Data Mapping
3.6.2 Definitions
3.6.3 Intraprocedural Phase
3.6.4 Interprocedural Extension
3.6.5 Runtime Library
3.7 Sequential Promotion
3.7.1 Experimental Results
3.8 Related Work
3.8.1 Redundant Load-Store Elimination
3.8.1.1 Interprocedural Propagation
3.8.1.2 Combining Load and Store Elimination
3.9 Optimizing a Tiled Loop Nest
3.10 Conclusion
4 Transformations for GPGPU 
4.1 Introduction
4.2 Loop Nest Mapping on GPU
4.3 Parallelism Detection
4.3.1 Allen and Kennedy
4.3.2 Coarse Grained Parallelization
4.3.3 Impact on Code Generation
4.4 Reduction Parallelization
4.4.1 Detection
4.4.2 Reduction Parallelization for GPU
4.4.3 Parallel Prefix Operations on GPUs
4.5 Induction Variable Substitution
4.6 Loop Fusion
4.6.1 Legality
4.6.2 Different Goals
4.6.3 Loop Fusion for GPGPU
4.6.4 Loop Fusion in PIPS
4.6.5 Loop Fusion Using Array Regions
4.6.6 Further Special Considerations
4.7 Scalarization
4.7.1 Scalarization inside Kernel
4.7.2 Scalarization after Loop Fusion
4.7.3 Perfect Nesting of Loops
4.7.4 Conclusion
4.8 Loop Unrolling
4.9 Array Linearization
4.10 Toward a Compilation Scheme
5 Heterogeneous Compiler Design and Automation
5.1 Par4All Project
5.2 Source-to-Source Transformation System
5.3 Programmable Pass Managers
5.3.1 PyPS
5.3.1.1 Benefiting from Python: on the shoulders of giants
5.3.1.2 Program Abstractions
5.3.1.3 Control Structures
5.3.1.4 Builder
5.3.1.5 Heterogeneous Compiler Developments
5.3.2 Related Work
5.3.3 Conclusion
5.4 Library Handling
5.4.1 Stubs Broker
5.4.2 Handling Multiple Implementations of an API: Dealing with External Libraries
5.5 Tool Combinations
5.6 Profitability Criteria
5.6.1 Static Approach
5.6.2 Runtime Approach
5.6.3 Conclusion
5.7 Version Selection at Runtime
5.8 Launch Configuration Heuristic
5.8.1 Tuning the Work-Group Size
5.8.2 Tuning the Block Dimensions
5.9 Conclusion
6 Management of Multi-GPUs 
6.1 Introduction
6.2 Task Parallelism
6.2.1 The StarPU Runtime
6.2.2 Task Extraction in PIPS
6.2.3 Code Generation for StarPU
6.3 Data Parallelism Using Loop Nest Tiling
6.3.1 Performance
6.4 Related Work
6.5 Conclusion
7 Experiments 
7.1 Hardware Platforms Used
7.2 Benchmarks, Kernels, and Applications Used for Experiments
7.3 Parallelization Algorithm
7.4 Launch Configuration
7.5 Scalarization
7.5.1 Scalarization inside Kernel
7.5.2 Full Array Contraction
7.5.3 Perfect Nesting of Loops
7.6 Loop Unrolling
7.7 Array Linearization
7.8 Communication Optimization
7.8.1 Metric
7.8.2 Results
7.8.3 Comparison with Respect to a Fully Dynamic Approach
7.8.4 Sequential Promotion
7.9 Overall Results
7.10 Multiple GPUs
7.10.1 Task Parallelism
7.10.2 Data Parallelism
7.11 Conclusions
8 Conclusion 
Personal Bibliography
Bibliography
Acronyms
Résumé en français
1 Introduction
2 Calcul généraliste sur processeurs graphiques : histoire et contexte
3 Placement des données, communications, et cohérence
4 Transformations pour GPGPU
5 Conception de compilateurs hétérogènes et automatisation
6 Gestion de multiples accélérateurs
7 Expériences
8 Conclusion

