University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications, Urbana, IL

Harvard University, Cambridge, MA

Louisiana State University, Center for Computation & Technology, Baton Rouge, LA

Pittsburgh Supercomputing Center, Pittsburgh, PA

Princeton University, Princeton Institute for Computational Science and Engineering, Princeton, NJ

Rutgers University, Piscataway, NJ

University of California Los Angeles, Los Angeles, CA

University of Michigan, Ann Arbor, MI

University of Oklahoma, Norman, OK

University of South Carolina, Columbia, SC

University of Tennessee Knoxville, Knoxville, TN

University of Utah, Salt Lake City, UT

VSCSE - Virtual School of Computational Science and Engineering

Programming Heterogeneous Parallel Computing Systems

July 10–13, 2012

Studying many current GPU computing applications, we have learned that the limits of an application's scalability are often related to some combination of memory bandwidth saturation, memory contention, imbalanced data distribution, or data structure/algorithm interactions. Successful GPU application developers often adjust their data structures and problem formulations specifically for massive threading, and execute their threads leveraging shared on-chip memory resources for greater impact. We looked for patterns among those transformations, and here present the seven most common and crucial algorithm and data optimization techniques we discovered. Each can improve the performance of applicable kernels by 2-10X on current processors while improving future scalability.

High-Performance Computing clusters are increasingly built with heterogeneous parallel computing nodes to achieve higher power efficiency and computation throughput. Petascale systems such as Blue Waters and Titan will come online this year with both multicore CPUs and many-core GPUs. These systems will provide unprecedented capabilities to conduct computational experiments of historic significance. Upcoming Exascale systems are expected to embrace even more heterogeneity in order to overcome power limitations. While the computing community is racing to build tools and libraries to ease the use of these heterogeneous parallel computing systems, effective and confident use of these systems will always require knowledge of their low-level programming interfaces. This course is designed to introduce researchers in computational science and engineering disciplines to the essence of these programming interfaces (CUDA, OpenMP, and MPI) and to how they should orchestrate the use of these interfaces to achieve their application goals. The course is unique in that it is application oriented and introduces only the computer science and engineering knowledge needed to solidify understanding. The course will serve as a quick start for researchers who want to begin to use heterogeneous parallel computing systems ranging from laptops to Exascale clusters. It also provides a strong foundation for students who go on to take full-semester courses in parallel programming interfaces and techniques.


Prerequisites:

  • Experience working in a Unix environment
  • Experience developing and running scientific codes written in C or C++
  • Basic knowledge of CUDA (a short online course, Introduction to CUDA, is available to registered on-site students who need assistance in meeting this prerequisite)


Instructors:

  • Wen-Mei Hwu, professor of electrical and computer engineering, chief scientist of the Parallel Computing Institute and principal investigator of the CUDA Center of Excellence, University of Illinois at Urbana-Champaign
  • David Kirk, NVIDIA fellow
  • William Gropp, professor of computer science and director of the Parallel Computing Institute, University of Illinois at Urbana-Champaign
  • Isaac Gelado, research staff member, Barcelona Supercomputing Center

Course outline:

  • Introduction and a quick review of CUDA C
    • Heterogeneous computing architectures and programming models
    • Common algorithmic strategies for high performance
    • Work partitioning, kernels, thread grids, keywords, and run-time APIs
    • The von Neumann model and modern computer hardware
  • Kernel-based data parallelism model
    • Mapping of thread indices to data indices
    • Thread/Warp scheduling and transparent scalability
    • Mapping of the OpenCL kernel model to the CUDA kernel model
  • CUDA memory types
    • Registers, on-chip scratchpad memory, caches and DRAM
    • Shared memory vs. register tiling
  • CUDA task parallelism model
    • Streams, queues and contexts
    • Double buffering to overlap data transfers with kernel execution
    • CUDA and OpenCL host code APIs
  • GMAC API for simplified task parallelism and multi-GPU programming
    • Automated double buffering and queue/stream management
    • Multi-GPU data exchange support
  • OpenACC for simplified kernel creation and deployment
    • OpenACC directives and pragmas
    • Using OpenACC in a heterogeneous parallel application
  • MPI in a heterogeneous parallel computing environment
    • Domain partitioning, MPI ranks, and MPI communication
    • An example of a heterogeneous parallel application using MPI and CUDA/OpenCL
    • Overlapping MPI message latency with node-level computation
  • Fortran in a heterogeneous parallel computing environment
    • Memory layout considerations
    • PGI compiler pragmas
  • Important patterns of heterogeneous parallel applications
    • To be selected from convolution, reduction, prefix scan, and stencil
  • Hands-on Lab

    NOTE: Students are required to provide their own laptops.