Programming Massively Parallel Processors - A Hands-on Approach

By: David B. Kirk, Wen-mei W. Hwu

Elsevier Reference Monographs, 2012

ISBN: 9780123914187, 519 pages

2nd edition

Format: PDF, ePUB

Copy protection: DRM

Compatible with: Windows PC, Mac OS X, all DRM-capable eReaders, Apple iPad, Android tablet PCs, Apple iPod touch, iPhone, and Android smartphones

Price: 53.95 EUR

Table of Contents

Front Cover  1
Programming Massively Parallel Processors  4
Copyright Page  5
Contents  6
Preface  14
  Target Audience  15
  How to Use the Book  15
    A Three-Phased Approach  16
    Tying It All Together: The Final Project  16
    Project Workshop  17
    Design Document  17
    Project Report  18
  Online Supplements  18
Acknowledgements  20
Dedication  22

1 Introduction  24
  1.1 Heterogeneous Parallel Computing  25
  1.2 Architecture of a Modern GPU  31
  1.3 Why More Speed or Parallelism?  33
  1.4 Speeding Up Real Applications  35
  1.5 Parallel Programming Languages and Models  37
  1.6 Overarching Goals  39
  1.7 Organization of the Book  40
  References  44

2 History of GPU Computing  46
  2.1 Evolution of Graphics Pipelines  46
    The Era of Fixed-Function Graphics Pipelines  47
    Evolution of Programmable Real-Time Graphics  51
    Unified Graphics and Computing Processors  54
  2.2 GPGPU: An Intermediate Step  56
  2.3 GPU Computing  57
    Scalable GPUs  58
    Recent Developments  59
    Future Trends  60
  References and Further Reading  60

3 Introduction to Data Parallelism and CUDA C  64
  3.1 Data Parallelism  65
  3.2 CUDA Program Structure  66
  3.3 A Vector Addition Kernel  68
  3.4 Device Global Memory and Data Transfer  71
  3.5 Kernel Functions and Threading  76
  3.6 Summary  82
    Function Declarations  82
    Kernel Launch  82
    Predefined Variables  82
    Runtime API  83
  3.7 Exercises  83
  References  85

4 Data-Parallel Execution Model  86
  4.1 CUDA Thread Organization  87
  4.2 Mapping Threads to Multidimensional Data  91
  4.3 Matrix-Matrix Multiplication—A More Complex Kernel  97
  4.4 Synchronization and Transparent Scalability  104
  4.5 Assigning Resources to Blocks  106
  4.6 Querying Device Properties  108
  4.7 Thread Scheduling and Latency Tolerance  110
  4.8 Summary  114
  4.9 Exercises  114

5 CUDA Memories  118
  5.1 Importance of Memory Access Efficiency  119
  5.2 CUDA Device Memory Types  120
  5.3 A Strategy for Reducing Global Memory Traffic  128
  5.4 A Tiled Matrix–Matrix Multiplication Kernel  132
  5.5 Memory as a Limiting Factor to Parallelism  138
  5.6 Summary  141
  5.7 Exercises  142

6 Performance Considerations  146
  6.1 Warps and Thread Execution  147
  6.2 Global Memory Bandwidth  155
  6.3 Dynamic Partitioning of Execution Resources  164
  6.4 Instruction Mix and Thread Granularity  166
  6.5 Summary  168
  6.6 Exercises  168
  References  172

7 Floating-Point Considerations  174
  7.1 Floating-Point Format  175
    Normalized Representation of M  175
    Excess Encoding of E  176
  7.2 Representable Numbers  178
  7.3 Special Bit Patterns and Precision in IEEE Format  183
  7.4 Arithmetic Accuracy and Rounding  184
  7.5 Algorithm Considerations  185
  7.6 Numerical Stability  187
  7.7 Summary  192
  7.8 Exercises  193
  References  194

8 Parallel Patterns: Convolution  196
  8.1 Background  197
  8.2 1D Parallel Convolution—A Basic Algorithm  202
  8.3 Constant Memory and Caching  204
  8.4 Tiled 1D Convolution with Halo Elements  208
  8.5 A Simpler Tiled 1D Convolution—General Caching  215
  8.6 Summary  216
  8.7 Exercises  217

9 Parallel Patterns: Prefix Sum  220
  9.1 Background  221
  9.2 A Simple Parallel Scan  223
  9.3 Work Efficiency Considerations  227
  9.4 A Work-Efficient Parallel Scan  228
  9.5 Parallel Scan for Arbitrary-Length Inputs  233
  9.6 Summary  237
  9.7 Exercises  238
  Reference  239

10 Parallel Patterns: Sparse Matrix–Vector Multiplication  240
  10.1 Background  241
  10.2 Parallel SpMV Using CSR  245
  10.3 Padding and Transposition  247
  10.4 Using Hybrid to Control Padding  249
  10.5 Sorting and Partitioning for Regularization  253
  10.6 Summary  255
  10.7 Exercises  256
  References  257

11 Application Case Study: Advanced MRI Reconstruction  258
  11.1 Application Background  259
  11.2 Iterative Reconstruction  262
  11.3 Computing FHD  264
    Step 1: Determine the Kernel Parallelism Structure  266
    Step 2: Getting Around the Memory Bandwidth Limitation  272
    Step 3: Using Hardware Trigonometry Functions  278
    Step 4: Experimental Performance Tuning  282
  11.4 Final Evaluation  283
  11.5 Exercises  285
  References  287

12 Application Case Study: Molecular Visualization and Analysis  288
  12.1 Application Background  289
  12.2 A Simple Kernel Implementation  291
  12.3 Thread Granularity Adjustment  295
  12.4 Memory Coalescing  297
  12.5 Summary  300
  12.6 Exercises  302
  References  302

13 Parallel Programming and Computational Thinking  304
  13.1 Goals of Parallel Computing  305
  13.2 Problem Decomposition  306
  13.3 Algorithm Selection  310
  13.4 Computational Thinking  316
  13.5 Summary  317
  13.6 Exercises  317
  References  318

14 An Introduction to OpenCL™  320
  14.1 Background  320
  14.2 Data Parallelism Model  322
  14.3 Device Architecture  324
  14.4 Kernel Functions  326
  14.5 Device Management and Kernel Launch  327
  14.6 Electrostatic Potential Map in OpenCL  330
  14.7 Summary  334
  14.8 Exercises  335
  References  336

15 Parallel Programming with OpenACC  338
  15.1 OpenACC Versus CUDA C  338
  15.2 Execution Model  341
  15.3 Memory Model  342
  15.4 Basic OpenACC Programs  343
    Parallel Construct  343
    Parallel Region, Gangs, and Workers  343
    Loop Construct  345
    Gang Loop  345
    Worker Loop  346
    OpenACC Versus CUDA  346
    Vector Loop  349
    Kernels Construct  350
    Prescriptive Versus Descriptive  350
    Ways to Help an OpenACC Compiler  352
    Data Management  354
    Data Clauses  354
    Data Construct  355
    Asynchronous Computation and Data Transfer  358
  15.5 Future Directions of OpenACC  359
  15.6 Exercises  360

16 Thrust: A Productivity-Oriented Library for CUDA  362
  16.1 Background  362
  16.2 Motivation  365
  16.3 Basic Thrust Features  366
    Iterators and Memory Space  367
    Interoperability  368
  16.4 Generic Programming  370
  16.5 Benefits of Abstraction  372
  16.6 Programmer Productivity  372
    Robustness  373
    Real-World Performance  373
  16.7 Best Practices  375
    Fusion  376
    Structure of Arrays  377
    Implicit Ranges  379
  16.8 Exercises  380
  References  381

17 CUDA FORTRAN  382
  17.1 CUDA FORTRAN and CUDA C Differences  383
  17.2 A First CUDA FORTRAN Program  384
  17.3 Multidimensional Array in CUDA FORTRAN  386
  17.4 Overloading Host/Device Routines With Generic Interfaces  387
  17.5 Calling CUDA C via iso_c_binding  390
  17.6 Kernel Loop Directives and Reduction Operations  392
  17.7 Dynamic Shared Memory  393
  17.8 Asynchronous Data Transfers  394
  17.9 Compilation and Profiling  400
  17.10 Calling Thrust from CUDA FORTRAN  401
  17.11 Exercises  405

18 An Introduction to C++ AMP  406
  18.1 Core C++ AMP Features  407
  18.2 Details of the C++ AMP Execution Model  414
    Explicit and Implicit Data Copies  414
    Asynchronous Operation  416
    Section Summary  418
  18.3 Managing Accelerators  418
  18.4 Tiled Execution  421
  18.5 C++ AMP Graphics Features  424
  18.6 Summary  428
  18.7 Exercises  428

19 Programming a Heterogeneous Computing Cluster  430
  19.1 Background  431
  19.2 A Running Example  431
  19.3 MPI Basics  433
  19.4 MPI Point-to-Point Communication Types  437
  19.5 Overlapping Computation and Communication  444
  19.6 MPI Collective Communication  454
  19.7 Summary  454
  19.8 Exercises  455
  Reference  456

20 CUDA Dynamic Parallelism  458
  20.1 Background  459
  20.2 Dynamic Parallelism Overview  461
  20.3 Important Details  462
    Launch Environment Configuration  462
    API Errors and Launch Failures  462
    Events  462
    Streams  463
    Synchronization Scope  464
  20.4 Memory Visibility  465
    Global Memory  465
    Zero-Copy Memory  465
    Constant Memory  465
    Local Memory  465
    Shared Memory  466
    Texture Memory  466
  20.5 A Simple Example  467
  20.6 Runtime Limitations  469
    Memory Footprint  469
    Nesting Depth  471
    Memory Allocation and Lifetime  471
    ECC Errors  472
    Streams  472
    Events  472
    Launch Pool  472
  20.7 A More Complex Example  472
    Linear Bezier Curves  473
    Quadratic Bezier Curves  473
    Bezier Curve Calculation (Predynamic Parallelism)  473
    Bezier Curve Calculation (with Dynamic Parallelism)  476
  20.8 Summary  479
  Reference  480

21 Conclusion and Future Outlook  482
  21.1 Goals Revisited  482
  21.2 Memory Model Evolution  484
  21.3 Kernel Execution Control Evolution  487
  21.4 Core Performance  490
  21.5 Programming Environment  490
  21.6 Future Outlook  491
  References  492

Appendix A: Matrix Multiplication Host-Only Version Source Code  494
  A.1 matrixmul.cu  494
  A.2 matrixmul_gold.cpp  497
  A.3 matrixmul.h  498
  A.4 assist.h  499
  A.5 Expected Output  503
Appendix B: GPU Compute Capabilities  504
  B.1 GPU Compute Capability Tables  504
  B.2 Memory Coalescing Variations  505
Index  510