Front Cover  1
Programming Massively Parallel Processors  4
Copyright Page  5
Contents  6
Preface  14
Target Audience  15
How to Use the Book  15
A Three-Phased Approach  16
Tying It All Together: The Final Project  16
Project Workshop  17
Design Document  17
Project Report  18
Online Supplements  18
Acknowledgements  20
Dedication  22
1 Introduction  24
1.1 Heterogeneous Parallel Computing  25
1.2 Architecture of a Modern GPU  31
1.3 Why More Speed or Parallelism?  33
1.4 Speeding Up Real Applications  35
1.5 Parallel Programming Languages and Models  37
1.6 Overarching Goals  39
1.7 Organization of the Book  40
References  44
2 History of GPU Computing  46
2.1 Evolution of Graphics Pipelines  46
The Era of Fixed-Function Graphics Pipelines  47
Evolution of Programmable Real-Time Graphics  51
Unified Graphics and Computing Processors  54
2.2 GPGPU: An Intermediate Step  56
2.3 GPU Computing  57
Scalable GPUs  58
Recent Developments  59
Future Trends  60
References and Further Reading  60
3 Introduction to Data Parallelism and CUDA C  64
3.1 Data Parallelism  65
3.2 CUDA Program Structure  66
3.3 A Vector Addition Kernel  68
3.4 Device Global Memory and Data Transfer  71
3.5 Kernel Functions and Threading  76
3.6 Summary  82
Function Declarations  82
Kernel Launch  82
Predefined Variables  82
Runtime API  83
3.7 Exercises  83
References  85
4 Data-Parallel Execution Model  86
4.1 CUDA Thread Organization  87
4.2 Mapping Threads to Multidimensional Data  91
4.3 Matrix–Matrix Multiplication—A More Complex Kernel  97
4.4 Synchronization and Transparent Scalability  104
4.5 Assigning Resources to Blocks  106
4.6 Querying Device Properties  108
4.7 Thread Scheduling and Latency Tolerance  110
4.8 Summary  114
4.9 Exercises  114
5 CUDA Memories  118
5.1 Importance of Memory Access Efficiency  119
5.2 CUDA Device Memory Types  120
5.3 A Strategy for Reducing Global Memory Traffic  128
5.4 A Tiled Matrix–Matrix Multiplication Kernel  132
5.5 Memory as a Limiting Factor to Parallelism  138
5.6 Summary  141
5.7 Exercises  142
6 Performance Considerations  146
6.1 Warps and Thread Execution  147
6.2 Global Memory Bandwidth  155
6.3 Dynamic Partitioning of Execution Resources  164
6.4 Instruction Mix and Thread Granularity  166
6.5 Summary  168
6.6 Exercises  168
References  172
7 Floating-Point Considerations  174
7.1 Floating-Point Format  175
Normalized Representation of M  175
Excess Encoding of E  176
7.2 Representable Numbers  178
7.3 Special Bit Patterns and Precision in IEEE Format  183
7.4 Arithmetic Accuracy and Rounding  184
7.5 Algorithm Considerations  185
7.6 Numerical Stability  187
7.7 Summary  192
7.8 Exercises  193
References  194
8 Parallel Patterns: Convolution  196
8.1 Background  197
8.2 1D Parallel Convolution—A Basic Algorithm  202
8.3 Constant Memory and Caching  204
8.4 Tiled 1D Convolution with Halo Elements  208
8.5 A Simpler Tiled 1D Convolution—General Caching  215
8.6 Summary  216
8.7 Exercises  217
9 Parallel Patterns: Prefix Sum  220
9.1 Background  221
9.2 A Simple Parallel Scan  223
9.3 Work Efficiency Considerations  227
9.4 A Work-Efficient Parallel Scan  228
9.5 Parallel Scan for Arbitrary-Length Inputs  233
9.6 Summary  237
9.7 Exercises  238
Reference  239
10 Parallel Patterns: Sparse Matrix–Vector Multiplication  240
10.1 Background  241
10.2 Parallel SpMV Using CSR  245
10.3 Padding and Transposition  247
10.4 Using Hybrid to Control Padding  249
10.5 Sorting and Partitioning for Regularization  253
10.6 Summary  255
10.7 Exercises  256
References  257
11 Application Case Study: Advanced MRI Reconstruction  258
11.1 Application Background  259
11.2 Iterative Reconstruction  262
11.3 Computing FᴴD  264
Step 1: Determine the Kernel Parallelism Structure  266
Step 2: Getting Around the Memory Bandwidth Limitation  272
Step 3: Using Hardware Trigonometry Functions  278
Step 4: Experimental Performance Tuning  282
11.4 Final Evaluation  283
11.5 Exercises  285
References  287
12 Application Case Study: Molecular Visualization and Analysis  288
12.1 Application Background  289
12.2 A Simple Kernel Implementation  291
12.3 Thread Granularity Adjustment  295
12.4 Memory Coalescing  297
12.5 Summary  300
12.6 Exercises  302
References  302
13 Parallel Programming and Computational Thinking  304
13.1 Goals of Parallel Computing  305
13.2 Problem Decomposition  306
13.3 Algorithm Selection  310
13.4 Computational Thinking  316
13.5 Summary  317
13.6 Exercises  317
References  318
14 An Introduction to OpenCL™  320
14.1 Background  320
14.2 Data Parallelism Model  322
14.3 Device Architecture  324
14.4 Kernel Functions  326
14.5 Device Management and Kernel Launch  327
14.6 Electrostatic Potential Map in OpenCL  330
14.7 Summary  334
14.8 Exercises  335
References  336
15 Parallel Programming with OpenACC  338
15.1 OpenACC Versus CUDA C  338
15.2 Execution Model  341
15.3 Memory Model  342
15.4 Basic OpenACC Programs  343
Parallel Construct  343
Parallel Region, Gangs, and Workers  343
Loop Construct  345
Gang Loop  345
Worker Loop  346
OpenACC Versus CUDA  346
Vector Loop  349
Kernels Construct  350
Prescriptive Versus Descriptive  350
Ways to Help an OpenACC Compiler  352
Data Management  354
Data Clauses  354
Data Construct  355
Asynchronous Computation and Data Transfer  358
15.5 Future Directions of OpenACC  359
15.6 Exercises  360
16 Thrust: A Productivity-Oriented Library for CUDA  362
16.1 Background  362
16.2 Motivation  365
16.3 Basic Thrust Features  366
Iterators and Memory Space  367
Interoperability  368
16.4 Generic Programming  370
16.5 Benefits of Abstraction  372
16.6 Programmer Productivity  372
Robustness  373
Real-World Performance  373
16.7 Best Practices  375
Fusion  376
Structure of Arrays  377
Implicit Ranges  379
16.8 Exercises  380
References  381
17 CUDA Fortran  382
17.1 CUDA Fortran and CUDA C Differences  383
17.2 A First CUDA Fortran Program  384
17.3 Multidimensional Array in CUDA Fortran  386
17.4 Overloading Host/Device Routines With Generic Interfaces  387
17.5 Calling CUDA C Via iso_c_binding  390
17.6 Kernel Loop Directives and Reduction Operations  392
17.7 Dynamic Shared Memory  393
17.8 Asynchronous Data Transfers  394
17.9 Compilation and Profiling  400
17.10 Calling Thrust from CUDA Fortran  401
17.11 Exercises  405
18 An Introduction to C++ AMP  406
18.1 Core C++ AMP Features  407
18.2 Details of the C++ AMP Execution Model  414
Explicit and Implicit Data Copies  414
Asynchronous Operation  416
Section Summary  418
18.3 Managing Accelerators  418
18.4 Tiled Execution  421
18.5 C++ AMP Graphics Features  424
18.6 Summary  428
18.7 Exercises  428
19 Programming a Heterogeneous Computing Cluster  430
19.1 Background  431
19.2 A Running Example  431
19.3 MPI Basics  433
19.4 MPI Point-to-Point Communication Types  437
19.5 Overlapping Computation and Communication  444
19.6 MPI Collective Communication  454
19.7 Summary  454
19.8 Exercises  455
Reference  456
20 CUDA Dynamic Parallelism  458
20.1 Background  459
20.2 Dynamic Parallelism Overview  461
20.3 Important Details  462
Launch Environment Configuration  462
API Errors and Launch Failures  462
Events  462
Streams  463
Synchronization Scope  464
20.4 Memory Visibility  465
Global Memory  465
Zero-Copy Memory  465
Constant Memory  465
Local Memory  465
Shared Memory  466
Texture Memory  466
20.5 A Simple Example  467
20.6 Runtime Limitations  469
Memory Footprint  469
Nesting Depth  471
Memory Allocation and Lifetime  471
ECC Errors  472
Streams  472
Events  472
Launch Pool  472
20.7 A More Complex Example  472
Linear Bézier Curves  473
Quadratic Bézier Curves  473
Bézier Curve Calculation (Predynamic Parallelism)  473
Bézier Curve Calculation (with Dynamic Parallelism)  476
20.8 Summary  479
Reference  480
21 Conclusion and Future Outlook  482
21.1 Goals Revisited  482
21.2 Memory Model Evolution  484
21.3 Kernel Execution Control Evolution  487
21.4 Core Performance  490
21.5 Programming Environment  490
21.6 Future Outlook  491
References  492
Appendix A: Matrix Multiplication Host-Only Version Source Code  494
A.1 matrixmul.cu  494
A.2 matrixmul_gold.cpp  497
A.3 matrixmul.h  498
A.4 assist.h  499
A.5 Expected Output  503
Appendix B: GPU Compute Capabilities  504
B.1 GPU Compute Capability Tables  504
B.2 Memory Coalescing Variations  505
Index  510