Data Parallel C++ : Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL.

Author: Reinders, James
Contributor(s): Ashbaugh, Ben | Brodman, James | Kinsner, Michael | Pennycook, John | Tian, Xinmin
Material type: Text
Publisher: Berkeley, CA : Apress L. P., 2020
Copyright date: ©2021
Description: 1 online resource (565 pages)
Content type: text
Media type: computer
Carrier type: online resource
ISBN: 9781484255742
Genre/Form: Electronic books
Additional physical formats: Print version: Data Parallel C++
LOC classification: QA76.76.C65
Contents:
Intro -- Table of Contents -- About the Authors -- Preface -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 1.2.1 vs. SYCL 2020, and DPC++ -- Getting a DPC++ Compiler -- Book GitHub -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of DPC++ and SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Task Graphs -- Race Conditions When We Make a Mistake -- C++ Lambda Functions -- Portability and Direct Programming -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device, When Any Device Will Do -- Method#2: Using the Host Device for Development and Debugging -- Method#3: Using a GPU (or Other Accelerators) -- Device Types -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- device_selector Base Class -- Mechanisms to Score a Device -- Three Paths to Device Code Execution on CPU -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Fallback -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers -- USM and Data Movement -- Explicit Data Movement in USM.
Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order (OoO) Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Multidimensional Kernels -- Loops vs. Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Hierarchical Parallel Kernels -- Understanding Hierarchical Data-Parallel Kernels -- Writing Hierarchical Data-Parallel Kernels -- Details of Hierarchical Data-Parallel Kernels -- The h_item Class -- The private_memory Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations.
Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators -- Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- Summary -- Chapter 7: Buffers -- Buffers -- Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in DPC++ -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a CG Executed? -- Data Movement -- Explicit -- Implicit -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Work-Group Barriers and Local Memory in Hierarchical Kernels -- Scopes for Local Memory and Barriers -- A Full Hierarchical Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Collective Functions -- Broadcast -- Votes -- Shuffles -- Loads and Stores -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels As Lambda Expressions -- Elements of a Kernel Lambda Expression -- Naming Kernel Lambda Expressions -- Kernels As Named Function Objects.
Elements of a Kernel Named Function Object -- Interoperability with Other APIs -- Interoperability with API-Defined Source Languages -- Interoperability with API-Defined Kernel Objects -- Kernels in Program Objects -- Summary -- Chapter 11: Vectors -- How to Think About Vectors -- Vector Types -- Vector Interface -- Load and Store Member Functions -- Swizzle Operations -- Vector Execution Within a Parallel Kernel -- Vector Parallelism -- Summary -- Chapter 12: Device Information -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Inquisitive: get_info<> -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Summary -- Chapter 13: Practical Tips -- Getting a DPC++ Compiler and Code Samples -- Online Forum and Documentation -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Adding SYCL to Existing C++ Programs -- Debugging -- Debugging Kernel Code -- Debugging Runtime Failures -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implications of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Migrating from CUDA to SYCL -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The DPC++ Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- oneAPI DPC++ Library -- Group Functions.
Direct Programming -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of a General-Purpose CPU -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array-of-Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs.
Exposing Parallelism.

Description based on publisher-supplied metadata and other sources.

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2022. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
