Programming Massively Parallel Processors - A Hands-on Approach

By: David B. Kirk, Wen-mei W. Hwu

Elsevier Reference Monographs, 2012

ISBN: 9780123914187, 519 pages

2nd edition

Format: PDF, ePUB

Copy protection: DRM

Compatible with: Windows PC, Mac OS X, all DRM-capable eReaders, Apple iPad, Android tablet PCs, Apple iPod touch, iPhone, and Android smartphones

Price: 53.95 EUR

Table of Contents

Front Cover  1
Programming Massively Parallel Processors  4
Copyright Page  5
Contents  6
Preface  14
  Target Audience  15
  How to Use the Book  15
    A Three-Phased Approach  16
    Tying It All Together: The Final Project  16
    Project Workshop  17
    Design Document  17
    Project Report  18
  Online Supplements  18
Acknowledgements  20
Dedication  22

1 Introduction  24
  1.1 Heterogeneous Parallel Computing  25
  1.2 Architecture of a Modern GPU  31
  1.3 Why More Speed or Parallelism?  33
  1.4 Speeding Up Real Applications  35
  1.5 Parallel Programming Languages and Models  37
  1.6 Overarching Goals  39
  1.7 Organization of the Book  40
  References  44

2 History of GPU Computing  46
  2.1 Evolution of Graphics Pipelines  46
    The Era of Fixed-Function Graphics Pipelines  47
    Evolution of Programmable Real-Time Graphics  51
    Unified Graphics and Computing Processors  54
  2.2 GPGPU: An Intermediate Step  56
  2.3 GPU Computing  57
    Scalable GPUs  58
    Recent Developments  59
    Future Trends  60
  References and Further Reading  60

3 Introduction to Data Parallelism and CUDA C  64
  3.1 Data Parallelism  65
  3.2 CUDA Program Structure  66
  3.3 A Vector Addition Kernel  68
  3.4 Device Global Memory and Data Transfer  71
  3.5 Kernel Functions and Threading  76
  3.6 Summary  82
    Function Declarations  82
    Kernel Launch  82
    Predefined Variables  82
    Runtime API  83
  3.7 Exercises  83
  References  85

4 Data-Parallel Execution Model  86
  4.1 CUDA Thread Organization  87
  4.2 Mapping Threads to Multidimensional Data  91
  4.3 Matrix-Matrix Multiplication—A More Complex Kernel  97
  4.4 Synchronization and Transparent Scalability  104
  4.5 Assigning Resources to Blocks  106
  4.6 Querying Device Properties  108
  4.7 Thread Scheduling and Latency Tolerance  110
  4.8 Summary  114
  4.9 Exercises  114

5 CUDA Memories  118
  5.1 Importance of Memory Access Efficiency  119
  5.2 CUDA Device Memory Types  120
  5.3 A Strategy for Reducing Global Memory Traffic  128
  5.4 A Tiled Matrix–Matrix Multiplication Kernel  132
  5.5 Memory as a Limiting Factor to Parallelism  138
  5.6 Summary  141
  5.7 Exercises  142

6 Performance Considerations  146
  6.1 Warps and Thread Execution  147
  6.2 Global Memory Bandwidth  155
  6.3 Dynamic Partitioning of Execution Resources  164
  6.4 Instruction Mix and Thread Granularity  166
  6.5 Summary  168
  6.6 Exercises  168
  References  172

7 Floating-Point Considerations  174
  7.1 Floating-Point Format  175
    Normalized Representation of M  175
    Excess Encoding of E  176
  7.2 Representable Numbers  178
  7.3 Special Bit Patterns and Precision in IEEE Format  183
  7.4 Arithmetic Accuracy and Rounding  184
  7.5 Algorithm Considerations  185
  7.6 Numerical Stability  187
  7.7 Summary  192
  7.8 Exercises  193
  References  194

8 Parallel Patterns: Convolution  196
  8.1 Background  197
  8.2 1D Parallel Convolution—A Basic Algorithm  202
  8.3 Constant Memory and Caching  204
  8.4 Tiled 1D Convolution with Halo Elements  208
  8.5 A Simpler Tiled 1D Convolution—General Caching  215
  8.6 Summary  216
  8.7 Exercises  217

9 Parallel Patterns: Prefix Sum  220
  9.1 Background  221
  9.2 A Simple Parallel Scan  223
  9.3 Work Efficiency Considerations  227
  9.4 A Work-Efficient Parallel Scan  228
  9.5 Parallel Scan for Arbitrary-Length Inputs  233
  9.6 Summary  237
  9.7 Exercises  238
  Reference  239

10 Parallel Patterns: Sparse Matrix–Vector Multiplication  240
  10.1 Background  241
  10.2 Parallel SpMV Using CSR  245
  10.3 Padding and Transposition  247
  10.4 Using Hybrid to Control Padding  249
  10.5 Sorting and Partitioning for Regularization  253
  10.6 Summary  255
  10.7 Exercises  256
  References  257

11 Application Case Study: Advanced MRI Reconstruction  258
  11.1 Application Background  259
  11.2 Iterative Reconstruction  262
  11.3 Computing FHD  264
    Step 1: Determine the Kernel Parallelism Structure  266
    Step 2: Getting Around the Memory Bandwidth Limitation  272
    Step 3: Using Hardware Trigonometry Functions  278
    Step 4: Experimental Performance Tuning  282
  11.4 Final Evaluation  283
  11.5 Exercises  285
  References  287

12 Application Case Study: Molecular Visualization and Analysis  288
  12.1 Application Background  289
  12.2 A Simple Kernel Implementation  291
  12.3 Thread Granularity Adjustment  295
  12.4 Memory Coalescing  297
  12.5 Summary  300
  12.6 Exercises  302
  References  302

13 Parallel Programming and Computational Thinking  304
  13.1 Goals of Parallel Computing  305
  13.2 Problem Decomposition  306
  13.3 Algorithm Selection  310
  13.4 Computational Thinking  316
  13.5 Summary  317
  13.6 Exercises  317
  References  318

14 An Introduction to OpenCL™  320
  14.1 Background  320
  14.2 Data Parallelism Model  322
  14.3 Device Architecture  324
  14.4 Kernel Functions  326
  14.5 Device Management and Kernel Launch  327
  14.6 Electrostatic Potential Map in OpenCL  330
  14.7 Summary  334
  14.8 Exercises  335
  References  336

15 Parallel Programming with OpenACC  338
  15.1 OpenACC Versus CUDA C  338
  15.2 Execution Model  341
  15.3 Memory Model  342
  15.4 Basic OpenACC Programs  343
    Parallel Construct  343
    Parallel Region, Gangs, and Workers  343
    Loop Construct  345
    Gang Loop  345
    Worker Loop  346
    OpenACC Versus CUDA  346
    Vector Loop  349
    Kernels Construct  350
    Prescriptive Versus Descriptive  350
    Ways to Help an OpenACC Compiler  352
    Data Management  354
    Data Clauses  354
    Data Construct  355
    Asynchronous Computation and Data Transfer  358
  15.5 Future Directions of OpenACC  359
  15.6 Exercises  360

16 Thrust: A Productivity-Oriented Library for CUDA  362
  16.1 Background  362
  16.2 Motivation  365
  16.3 Basic Thrust Features  366
    Iterators and Memory Space  367
    Interoperability  368
  16.4 Generic Programming  370
  16.5 Benefits of Abstraction  372
  16.6 Programmer Productivity  372
    Robustness  373
    Real-World Performance  373
  16.7 Best Practices  375
    Fusion  376
    Structure of Arrays  377
    Implicit Ranges  379
  16.8 Exercises  380
  References  381

17 CUDA FORTRAN  382
  17.1 CUDA FORTRAN and CUDA C Differences  383
  17.2 A First CUDA FORTRAN Program  384
  17.3 Multidimensional Array in CUDA FORTRAN  386
  17.4 Overloading Host/Device Routines With Generic Interfaces  387
  17.5 Calling CUDA C via iso_c_binding  390
  17.6 Kernel Loop Directives and Reduction Operations  392
  17.7 Dynamic Shared Memory  393
  17.8 Asynchronous Data Transfers  394
  17.9 Compilation and Profiling  400
  17.10 Calling Thrust from CUDA FORTRAN  401
  17.11 Exercises  405

18 An Introduction to C++ AMP  406
  18.1 Core C++ AMP Features  407
  18.2 Details of the C++ AMP Execution Model  414
    Explicit and Implicit Data Copies  414
    Asynchronous Operation  416
    Section Summary  418
  18.3 Managing Accelerators  418
  18.4 Tiled Execution  421
  18.5 C++ AMP Graphics Features  424
  18.6 Summary  428
  18.7 Exercises  428

19 Programming a Heterogeneous Computing Cluster  430
  19.1 Background  431
  19.2 A Running Example  431
  19.3 MPI Basics  433
  19.4 MPI Point-to-Point Communication Types  437
  19.5 Overlapping Computation and Communication  444
  19.6 MPI Collective Communication  454
  19.7 Summary  454
  19.8 Exercises  455
  Reference  456

20 CUDA Dynamic Parallelism  458
  20.1 Background  459
  20.2 Dynamic Parallelism Overview  461
  20.3 Important Details  462
    Launch Environment Configuration  462
    API Errors and Launch Failures  462
    Events  462
    Streams  463
    Synchronization Scope  464
  20.4 Memory Visibility  465
    Global Memory  465
    Zero-Copy Memory  465
    Constant Memory  465
    Local Memory  465
    Shared Memory  466
    Texture Memory  466
  20.5 A Simple Example  467
  20.6 Runtime Limitations  469
    Memory Footprint  469
    Nesting Depth  471
    Memory Allocation and Lifetime  471
    ECC Errors  472
    Streams  472
    Events  472
    Launch Pool  472
  20.7 A More Complex Example  472
    Linear Bezier Curves  473
    Quadratic Bezier Curves  473
    Bezier Curve Calculation (Predynamic Parallelism)  473
    Bezier Curve Calculation (with Dynamic Parallelism)  476
  20.8 Summary  479
  Reference  480

21 Conclusion and Future Outlook  482
  21.1 Goals Revisited  482
  21.2 Memory Model Evolution  484
  21.3 Kernel Execution Control Evolution  487
  21.4 Core Performance  490
  21.5 Programming Environment  490
  21.6 Future Outlook  491
  References  492

Appendix A: Matrix Multiplication Host-Only Version Source Code  494
  A.1 matrixmul.cu  494
  A.2 matrixmul_gold.cpp  497
  A.3 matrixmul.h  498
  A.4 assist.h  499
  A.5 Expected Output  503
Appendix B: GPU Compute Capabilities  504
  B.1 GPU Compute Capability Tables  504
  B.2 Memory Coalescing Variations  505
Index  510