The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization.
The accompanying open source code (more than 25,000 lines of it, freely available at www.cudahandbook.com) is specifically intended to be reused and repurposed by developers.
Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:
Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.
Part II, Details, provides thorough descriptions of every aspect of CUDA, including memory, streams and events, kernel execution, the streaming multiprocessors, multiple GPUs, and texturing.
The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.
Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including streaming workloads, reduction, parallel prefix sum (scan), N-body computations, and image processing with normalized correlation.
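The stream-and-event idiom that runs through the Part II chapters (asynchronous memcpy, CUDA events, CPU/GPU synchronization) can be sketched in a few lines. The following minimal example is not taken from the book's sources; the kernel, names, and sizes are illustrative placeholders showing how a CPU thread queues asynchronous work in a stream and then waits on an event rather than on the whole device.

#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: scale every element of an array in place.
__global__ void scaleKernel(float *d, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

int main(void)
{
    const int N = 1 << 20;
    float *h, *d;
    cudaStream_t stream;
    cudaEvent_t done;

    cudaMallocHost((void **) &h, N * sizeof(float)); // pinned memory enables async copies
    cudaMalloc((void **) &d, N * sizeof(float));
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    // Copy, kernel launch, and copy-back are all queued asynchronously in the stream.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(N + 255) / 256, 256, 0, stream>>>(d, N, 2.0f);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    cudaEventSynchronize(done);  // CPU blocks only until this point in the stream
    printf("h[0] = %f\n", h[0]); // prints 2.000000

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}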
Preface xxi
Acknowledgments xxiii
About the Author xxv
Part I: Overview 1
Chapter 1: Background 3
1.1 Our Approach 5
1.2 Code 6
1.3 Administrative Items 7
1.4 Road Map 8
Chapter 2: Hardware Architecture 11
2.1 CPU Configurations 11
2.2 Integrated GPUs 17
2.3 Multiple GPUs 19
2.4 Address Spaces in CUDA 22
2.5 CPU/GPU Interactions 32
2.6 GPU Architecture 41
2.7 Further Reading 50
Chapter 3: Software Architecture 51
3.1 Software Layers 51
3.2 Devices and Initialization 59
3.3 Contexts 67
3.4 Modules and Functions 71
3.5 Kernels (Functions) 73
3.6 Device Memory 75
3.7 Streams and Events 76
3.8 Host Memory 79
3.9 CUDA Arrays and Texturing 82
3.10 Graphics Interoperability 86
3.11 The CUDA Runtime and CUDA Driver API 87
Chapter 4: Software Environment 93
4.1 nvcc–CUDA Compiler Driver 93
4.2 ptxas–the PTX Assembler 100
4.3 cuobjdump 105
4.4 nvidia-smi 106
4.5 Amazon Web Services 109
Part II: Details 119
Chapter 5: Memory 121
5.1 Host Memory 122
5.2 Global Memory 130
5.3 Constant Memory 156
5.4 Local Memory 158
5.5 Texture Memory 162
5.6 Shared Memory 162
5.7 Memory Copy 164
Chapter 6: Streams and Events 173
6.1 CPU/GPU Concurrency: Covering Driver Overhead 174
6.2 Asynchronous Memcpy 178
6.3 CUDA Events: CPU/GPU Synchronization 183
6.4 CUDA Events: Timing 186
6.5 Concurrent Copying and Kernel Processing 187
6.6 Mapped Pinned Memory 197
6.7 Concurrent Kernel Processing 199
6.8 GPU/GPU Synchronization: cudaStreamWaitEvent() 202
6.9 Source Code Reference 202
Chapter 7: Kernel Execution 205
7.1 Overview 205
7.2 Syntax 206
7.3 Blocks, Threads, Warps, and Lanes 211
7.4 Occupancy 220
7.5 Dynamic Parallelism 222
Chapter 8: Streaming Multiprocessors 231
8.1 Memory 233
8.2 Integer Support 241
8.3 Floating-Point Support 244
8.4 Conditional Code 267
8.5 Textures and Surfaces 269
8.6 Miscellaneous Instructions 270
8.7 Instruction Sets 275
Chapter 9: Multiple GPUs 287
9.1 Overview 287
9.2 Peer-to-Peer 288
9.3 UVA: Inferring Device from Address 291
9.4 Inter-GPU Synchronization 292
9.5 Single-Threaded Multi-GPU 294
9.6 Multithreaded Multi-GPU 299
Chapter 10: Texturing 305
10.1 Overview 305
10.2 Texture Memory 306
10.3 1D Texturing 314
10.4 Texture as a Read Path 317
10.5 Texturing with Unnormalized Coordinates 323
10.6 Texturing with Normalized Coordinates 331
10.7 1D Surface Read/Write 333
10.8 2D Texturing 335
10.9 2D Texturing: Copy Avoidance 338
10.10 3D Texturing 340
10.11 Layered Textures 342
10.12 Optimal Block Sizing and Performance 343
10.13 Texturing Quick References 345
Part III: Select Applications 351
Chapter 11: Streaming Workloads 353
11.1 Device Memory 355
11.2 Asynchronous Memcpy 358
11.3 Streams 359
11.4 Mapped Pinned Memory 361
11.5 Performance and Summary 362
Chapter 12: Reduction 365
12.1 Overview 365
12.2 Two-Pass Reduction 367
12.3 Single-Pass Reduction 373
12.4 Reduction with Atomics 376
12.5 Arbitrary Block Sizes 377
12.6 Reduction Using Arbitrary Data Types 378
12.7 Predicate Reduction 382
12.8 Warp Reduction with Shuffle 382
Chapter 13: Scan 385
13.1 Definition and Variations 385
13.2 Overview 387
13.3 Scan and Circuit Design 390
13.4 CUDA Implementations 394
13.5 Warp Scans 407
13.6 Stream Compaction 414
13.7 References (Parallel Scan Algorithms) 418
13.8 Further Reading (Parallel Prefix Sum Circuits) 419
Chapter 14: N-Body 421
14.1 Introduction 423
14.2 Naïve Implementation 428
14.3 Shared Memory 432
14.4 Constant Memory 434
14.5 Warp Shuffle 436
14.6 Multiple GPUs and Scalability 438
14.7 CPU Optimizations 439
14.8 Conclusion 444
14.9 References and Further Reading 446
Chapter 15: Image Processing: Normalized Correlation 449
15.1 Overview 449
15.2 Naïve Texture-Texture Implementation 452
15.3 Template in Constant Memory 456
15.4 Image in Shared Memory 459
15.5 Further Optimizations 463
15.6 Source Code 465
15.7 Performance and Further Reading 466
15.8 Further Reading 469
Appendix A: The CUDA Handbook Library 471
A.1 Timing 471
A.2 Threading 472
A.3 Driver API Facilities 474
A.4 Shmoos 475
A.5 Command Line Parsing 476
A.6 Error Handling 477
Glossary / TLA Decoder 481
Index 487