Parallel communication patterns

map: one-to-one
transpose: one-to-one
gather: many-to-one
scatter: one-to-many
stencil: several-to-one
reduce: all-to-one
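
A minimal sketch of three of these patterns as CUDA kernels may make the read/write shapes concrete. The kernel names (map_square, stencil_avg3, scatter_thirds), the sizes N and BLOCK_WIDTH, and the 3-point averaging are all illustrative choices, not something from the notes. It also assumes a GPU and toolkit that support managed memory (cudaMallocManaged). Transpose and reduce are not shown here.

```cuda
#include <cstdio>

#define N 8            // illustrative array length (assumption)
#define BLOCK_WIDTH 4  // illustrative threads per block (assumption)

// map (one-to-one): thread i reads in[i] and writes out[i].
__global__ void map_square(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = in[i] * in[i];
}

// stencil (several-to-one): thread i reads a fixed neighborhood of inputs
// and writes the single output out[i].
__global__ void stencil_avg3(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int left  = (i > 0)     ? i - 1 : i;   // clamp at the array edges
    int right = (i < N - 1) ? i + 1 : i;
    out[i] = (in[left] + in[i] + in[right]) / 3.0f;
}

// scatter (one-to-many): thread i reads in[i] and adds a share of it to
// several outputs. Several threads write the same location, so the
// updates go through atomicAdd. out must be zeroed before the launch.
__global__ void scatter_thirds(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int left  = (i > 0)     ? i - 1 : i;
    int right = (i < N - 1) ? i + 1 : i;
    float share = in[i] / 3.0f;
    atomicAdd(&out[left],  share);
    atomicAdd(&out[i],     share);
    atomicAdd(&out[right], share);
}

int main()
{
    float *in, *squared, *averaged, *scattered;
    size_t bytes = N * sizeof(float);
    cudaMallocManaged(&in, bytes);
    cudaMallocManaged(&squared, bytes);
    cudaMallocManaged(&averaged, bytes);
    cudaMallocManaged(&scattered, bytes);

    for (int i = 0; i < N; ++i) {
        in[i] = (float)i;
        scattered[i] = 0.0f;   // scatter accumulates into this array
    }

    int blocks = (N + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
    map_square<<<blocks, BLOCK_WIDTH>>>(in, squared);
    stencil_avg3<<<blocks, BLOCK_WIDTH>>>(in, averaged);
    scatter_thirds<<<blocks, BLOCK_WIDTH>>>(in, scattered);
    cudaDeviceSynchronize();   // wait before reading results on the host

    for (int i = 0; i < N; ++i)
        printf("%d: map=%g stencil=%g scatter=%g\n",
               i, squared[i], averaged[i], scattered[i]);

    cudaFree(in);
    cudaFree(squared);
    cudaFree(averaged);
    cudaFree(scattered);
    return 0;
}
```

In this sketch the stencil and scatter kernels compute the same averages (up to floating-point rounding). The difference is who decides the memory location: in the stencil each thread owns one output and pulls its inputs, while in the scatter each thread owns one input and pushes it to several outputs, which is why it needs atomics.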

Summary of the programming model
kernels: C / C++ functions

kernel foo()
thread blocks: groups of threads that cooperate to solve a (sub)problem

kernel bar()
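
As a rough sketch of what "threads in a block cooperate" can look like, the example below launches two kernels one after the other. The names foo and bar come from the notes, but their bodies are hypothetical: here foo has each thread block sum its own slice of the input through shared memory, and bar adds up the per-block results. NUM_BLOCKS, BLOCK_WIDTH, and the use of managed memory are assumptions made for the example.

```cuda
#include <cstdio>

#define NUM_BLOCKS 4
#define BLOCK_WIDTH 64   // power of two, required by the halving loop below

// foo (hypothetical body): the threads of one block cooperate through
// shared memory to sum that block's BLOCK_WIDTH-element slice.
__global__ void foo(const int *in, int *partial)
{
    __shared__ int scratch[BLOCK_WIDTH];
    int t = threadIdx.x;
    scratch[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();   // every thread's load is visible before summing

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) scratch[t] += scratch[t + stride];
        __syncthreads();   // finish this round before starting the next
    }
    if (t == 0) partial[blockIdx.x] = scratch[0];
}

// bar (hypothetical body): a single thread adds up the per-block sums.
__global__ void bar(const int *partial, int *total)
{
    int sum = 0;
    for (int b = 0; b < NUM_BLOCKS; ++b) sum += partial[b];
    *total = sum;
}

int main()
{
    const int n = NUM_BLOCKS * BLOCK_WIDTH;
    int *in, *partial, *total;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&partial, NUM_BLOCKS * sizeof(int));
    cudaMallocManaged(&total, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    foo<<<NUM_BLOCKS, BLOCK_WIDTH>>>(in, partial);
    // Both launches use the default stream, so bar starts only after
    // every block of foo has finished.
    bar<<<1, 1>>>(partial, total);
    cudaDeviceSynchronize();

    printf("sum = %d (expected %d)\n", *total, n);

    cudaFree(in);
    cudaFree(partial);
    cudaFree(total);
    return 0;
}
```

Threads synchronize only within their own block (via __syncthreads()). The blocks of foo never wait on each other, and the combined result is formed in a second kernel.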

streaming multiprocessors (SMs)

CUDA makes few guarantees about when and where thread blocks will run.

Advantages
– hardware can run things efficiently
– no waiting on slowpokes
– scalability!
from cell phones to supercomputers
from current to future GPUs

#include <stdio.h>
 
#define NUM_BLOCKS 16   // number of thread blocks to launch
#define BLOCK_WIDTH 1   // threads per block
 
__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}
 
int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();
 
    // wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    printf("That's all!\n");
 
    return 0;
}