Parallel communication patterns

map: one-to-one
transpose: one-to-one
gather: many-to-one
scatter: one-to-many
stencil: several-to-one
reduce: all-to-one
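The patterns above can be sketched as plain serial C loops. This is only a CPU stand-in to show which inputs each output touches, not GPU code; all function names (map_square, gather_pair_average, and so on) and the specific operations (squaring, pair averaging, 3-point averaging) are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* map (one-to-one): out[i] depends only on in[i]. */
void map_square(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i] * in[i];
}

/* transpose (one-to-one): element (r, c) of a row-major rows x cols
   matrix moves to (c, r) of a cols x rows matrix. */
void transpose(const float *in, float *out, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
}

/* gather (many-to-one): each output reads several inputs.
   in holds 2 * n_out elements, out holds n_out. */
void gather_pair_average(const float *in, float *out, size_t n_out) {
    for (size_t i = 0; i < n_out; ++i)
        out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
}

/* scatter (one-to-many): each input writes to several outputs.
   out holds n_in + 1 elements and must be zeroed by the caller.
   Run in parallel, the overlapping writes would need synchronization. */
void scatter_to_neighbours(const float *in, float *out, size_t n_in) {
    for (size_t i = 0; i < n_in; ++i) {
        out[i]     += 0.5f * in[i];
        out[i + 1] += 0.5f * in[i];
    }
}

/* stencil (several-to-one): each interior output reads a fixed
   neighbourhood of inputs; the two boundary values are copied. */
void stencil_3pt(const float *in, float *out, size_t n) {
    assert(n >= 2);
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* reduce (all-to-one): every input contributes to a single output. */
float reduce_sum(const float *in, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += in[i];
    return total;
}
```

In a real kernel each loop iteration would typically become one thread; the loops here only make the read/write pattern explicit.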

Summary of the programming model
kernels – C/C++ functions

kernel foo()
thread blocks: groups of threads that cooperate to solve a (sub)problem

kernel bar()
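To make the kernel / thread block / thread hierarchy concrete, here is a minimal serial sketch of how a 1-D launch assigns each thread a unique global index. It mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x; the helper names global_index and launch_serial are hypothetical, and the nested loops only stand in for a real launch.

```c
#include <assert.h>

/* Hypothetical helper: same arithmetic as
   blockIdx.x * blockDim.x + threadIdx.x in a 1-D CUDA kernel. */
unsigned global_index(unsigned block_idx, unsigned block_dim,
                      unsigned thread_idx) {
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Serial stand-in for kernel<<<num_blocks, block_dim>>>():
   visits every (block, thread) pair once and records, in the slot
   owned by that thread, which block it belongs to.
   out must hold num_blocks * block_dim elements. */
void launch_serial(unsigned num_blocks, unsigned block_dim, unsigned *out) {
    for (unsigned b = 0; b < num_blocks; ++b)
        for (unsigned t = 0; t < block_dim; ++t)
            out[global_index(b, block_dim, t)] = b;
}
```

For example, with 3 blocks of 2 threads, launch_serial fills a 6-element array so that slots 0-1 belong to block 0, slots 2-3 to block 1, and slots 4-5 to block 2.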

streaming multiprocessors (SMs): the hardware units that thread blocks run on

CUDA makes few guarantees about when and where thread blocks will run.

Advantages
– hardware can run things efficiently
– no waiting on slowpokes
– scalability!
    from cell phones to supercomputers
    from current to future GPUs
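Because CUDA makes few guarantees about when and where blocks run, a kernel should give the same result whatever order its blocks execute in. The sketch below simulates that on the CPU: each "block" writes only to its own output slot, so any execution order produces the same array. The names block_body and run_blocks_in_order, and the choice of squaring the block index as the work, are assumptions for illustration, not CUDA API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block work: block b writes only to out[b],
   so blocks never depend on one another. */
static void block_body(unsigned b, int *out) {
    out[b] = (int)(b * b);
}

/* Run n blocks in the order given by `order`, which must be a
   permutation of 0..n-1. The order stands in for whatever schedule
   the hardware happens to choose. */
void run_blocks_in_order(const unsigned *order, size_t n, int *out) {
    for (size_t k = 0; k < n; ++k) {
        assert(order[k] < n);
        block_body(order[k], out);
    }
}
```

Running with the order 0, 1, 2, 3 and with a shuffled order such as 2, 0, 3, 1 fills the output identically, which is the property that lets the hardware schedule blocks freely.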

#include <stdio.h>

#define NUM_BLOCKS 16
#define BLOCK_WIDTH 1

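// kernel: runs on the GPU, once per thread
__global__ void hello()
{
	// blockIdx.x is the index of the block this thread belongs to
	printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}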

int main(int argc, char **argv)
{
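	// launch the kernel: NUM_BLOCKS blocks of BLOCK_WIDTH threads each
	hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

	// wait for the kernel to finish so its printf output is flushed
	cudaDeviceSynchronize();
	printf("That's all!\n");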

	return 0;
}