map: one-to-one
transpose: one-to-one
gather: many-to-one
scatter: one-to-many
stencil: several-to-one
reduce: all-to-one
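The patterns above can be sketched as ordinary sequential loops. This is a minimal illustration in plain C++, not CUDA: the function names (map_square, gather_sum, scatter_to_neighbors, stencil_sum3, reduce_sum) and the particular operations are assumptions chosen for the example. In a parallel version each loop iteration would be one thread.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// map: each output element depends on exactly one input element.
std::vector<int> map_square(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * in[i];
    return out;
}

// transpose: each element moves to exactly one new location.
// `in` is a rows x cols matrix stored in row-major order.
std::vector<int> transpose(const std::vector<int>& in,
                           std::size_t rows, std::size_t cols) {
    std::vector<int> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
    return out;
}

// gather: each output reads several inputs at computed locations.
// sources[i] lists the input indices that feed output i.
std::vector<int> gather_sum(const std::vector<int>& in,
                            const std::vector<std::vector<std::size_t>>& sources) {
    std::vector<int> out(sources.size(), 0);
    for (std::size_t i = 0; i < sources.size(); ++i)
        for (std::size_t s : sources[i]) out[i] += in[s];
    return out;
}

// scatter: each input writes to several output locations (here, its neighbors).
// Several inputs can hit the same output, which a parallel version must handle.
std::vector<int> scatter_to_neighbors(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i > 0) out[i - 1] += in[i];
        if (i + 1 < in.size()) out[i + 1] += in[i];
    }
    return out;
}

// stencil: each output reads a fixed neighborhood of the input.
// Boundary elements are copied unchanged.
std::vector<int> stencil_sum3(const std::vector<int>& in) {
    std::vector<int> out(in);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
    return out;
}

// reduce: all inputs combine into a single output.
int reduce_sum(const std::vector<int>& in) {
    return std::accumulate(in.begin(), in.end(), 0);
}
```

The difference between gather and stencil in this sketch is only where the read locations come from: gather takes them from data (sources), stencil from a fixed pattern around each index.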
summary of programming model
kernels: C/C++ functions
kernel foo()
thread blocks: groups of threads that cooperate to solve a (sub)problem
kernel bar()
streaming multiprocessors (SMs)
CUDA makes few guarantees about when and where thread blocks will run.
Advantages
– hardware can run things efficiently
– no waiting on slowpokes
– scalability!
  from cell phones to supercomputers
  from current to future GPUs
#include <stdio.h>

#define NUM_BLOCKS 16
#define BLOCK_WIDTH 1

__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel: NUM_BLOCKS blocks of BLOCK_WIDTH threads each
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

    // wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();

    printf("That's all!\n");
    return 0;
}
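Because CUDA makes few guarantees about when and where thread blocks run, the 16 lines printed by the program above may appear in a different order on each run. The sketch below models that behavior on the host in plain C++ (not CUDA). launch_shuffled and square_indices are hypothetical names made up for this example, not part of any CUDA API, and the shuffle only stands in for the scheduler's freedom. A kernel in which each thread writes only its own output slot gives the same result whatever the block order.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical host-side model of a kernel launch (not a CUDA API).
// Calls kernel(block, thread) for every thread of every block, visiting
// the blocks in a shuffled order to mimic the lack of ordering guarantees.
// Threads within a block run one after another in this model.
template <typename Kernel>
std::vector<int> launch_shuffled(int num_blocks, int block_width,
                                 unsigned seed, Kernel kernel) {
    std::vector<int> order(num_blocks);
    std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., num_blocks - 1
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    for (int block : order)
        for (int thread = 0; thread < block_width; ++thread)
            kernel(block, thread);
    return order;  // the order in which the blocks were run
}

// Example "kernel": each thread writes only its own output slot, so the
// result cannot depend on the order in which the blocks run.
std::vector<int> square_indices(int num_blocks, int block_width, unsigned seed) {
    std::vector<int> out(static_cast<std::size_t>(num_blocks) * block_width);
    launch_shuffled(num_blocks, block_width, seed, [&](int block, int thread) {
        int i = block * block_width + thread;  // unique index per thread
        out[i] = i * i;
    });
    return out;
}
```

Calling square_indices(16, 1, seed) with different seeds runs the blocks in different orders but returns the same vector. A kernel that relied on block 0 finishing before block 1 would not have that property.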