map: one-to-one
transpose: one-to-one
gather: many-to-one
scatter: one-to-many
stencil: several-to-one
reduce: all-to-one
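The first few patterns above can be sketched as small kernels. This is only an illustrative sketch: the kernel names (map_square, gather_avg3, scatter_spread), the array size N and the helper show() are made up for this example and do not come from the notes. The gather kernel reads a fixed neighborhood, so it has the same shape as a stencil. The scatter kernel uses atomicAdd because several threads may write to the same output element.

```cuda
#include <stdio.h>

#define N 8   // hypothetical array size, one thread per element

// map: thread i reads in[i] and writes out[i] (one-to-one)
__global__ void map_square(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = in[i] * in[i];
}

// gather: thread i reads several inputs and writes one output (many-to-one);
// indices are clamped at the array edges
__global__ void gather_avg3(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int left  = (i > 0)     ? i - 1 : i;
    int right = (i < N - 1) ? i + 1 : i;
    out[i] = (in[left] + in[i] + in[right]) / 3.0f;
}

// scatter: thread i reads one input and writes to several outputs (one-to-many);
// neighbors can be written by two threads, hence atomicAdd
__global__ void scatter_spread(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float share = in[i] / 2.0f;
    if (i > 0)     atomicAdd(&out[i - 1], share);
    if (i < N - 1) atomicAdd(&out[i + 1], share);
}

// copy a device array back to the host and print it
static void show(const char *name, const float *d_out)
{
    float h[N];
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%-8s", name);
    for (int i = 0; i < N; ++i) printf(" %6.2f", h[i]);
    printf("\n");
}

int main(void)
{
    float h_in[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    map_square<<<1, N>>>(d_in, d_out);
    show("map", d_out);

    gather_avg3<<<1, N>>>(d_in, d_out);
    show("gather", d_out);

    // scatter accumulates into out, so it has to start from zero
    cudaMemset(d_out, 0, N * sizeof(float));
    scatter_spread<<<1, N>>>(d_in, d_out);
    show("scatter", d_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In map and gather each thread owns exactly one output location, so no synchronization is needed. In scatter the thread decides where to write, which is why conflicting writes have to be handled.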
summary of programming model
kernels – C/C++ functions
kernel foo()
thread blocks: groups of threads that cooperate to solve a (sub)problem
kernel bar()
streaming multiprocessors (SMs)
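One way to picture threads in a block cooperating on a subproblem is a per-block sum. The sketch below assumes two CUDA features that these notes have not covered at this point: __shared__ memory (visible to all threads of one block) and __syncthreads() (a barrier for the threads of one block). The kernel name block_sum and the sizes are hypothetical, chosen only for illustration.

```cuda
#include <stdio.h>

#define NUM_BLOCKS 4
#define BLOCK_WIDTH 8                         // must be a power of two for the halving loop
#define TOTAL (NUM_BLOCKS * BLOCK_WIDTH)

// each block sums its own BLOCK_WIDTH inputs and writes one result
__global__ void block_sum(const int *in, int *out)
{
    __shared__ int tile[BLOCK_WIDTH];         // shared by the threads of this block only
    int t = threadIdx.x;

    // every thread loads one element of the block's subproblem
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                          // wait until the whole tile is loaded

    // tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) tile[t] += tile[t + stride];
        __syncthreads();                      // all threads reach this barrier
    }

    // one thread publishes the block's result
    if (t == 0) out[blockIdx.x] = tile[0];
}

int main(void)
{
    int h_in[TOTAL], h_out[NUM_BLOCKS];
    for (int i = 0; i < TOTAL; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, TOTAL * sizeof(int));
    cudaMalloc((void **)&d_out, NUM_BLOCKS * sizeof(int));
    cudaMemcpy(d_in, h_in, TOTAL * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<NUM_BLOCKS, BLOCK_WIDTH>>>(d_in, d_out);

    // this blocking copy also waits for the kernel to finish
    cudaMemcpy(h_out, d_out, NUM_BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < NUM_BLOCKS; ++b)
        printf("block %d: sum = %d\n", b, h_out[b]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With the all-ones input used here, every block should report a sum equal to BLOCK_WIDTH. Threads only synchronize within their own block; nothing in the kernel depends on which SM a block lands on or on the order in which blocks run.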
CUDA makes few guarantees about when and where thread blocks will run.
Advantages
– hardware can run things efficiently
– no waiting on slowpokes
– scalability!
from cell phones to supercomputers
from current to future GPUs
#include <stdio.h>

#define NUM_BLOCKS 16
#define BLOCK_WIDTH 1

// kernel: every thread prints the index of the block it belongs to
__global__ void hello()
{
    printf("Hello world! I'm a thread in block %d\n", blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel: NUM_BLOCKS blocks of BLOCK_WIDTH threads each
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

    // wait for all blocks to finish so their printf output is flushed
    cudaDeviceSynchronize();

    printf("That's all!\n");
    return 0;
}