A CUDA program

CPU allocates storage on the GPU
CPU copies input data from the CPU to the GPU
CPU launches kernel(s) on the GPU to process the data
CPU copies the results back from the GPU to the CPU
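A minimal sketch of these four steps using the CUDA runtime API, based on the running example of this section (squaring 64 floats); the kernel name square, the host arrays, and the single block of 64 threads are illustrative assumptions, not a fixed API.

#include <cuda_runtime.h>
#include <stdio.h>

// kernel: each thread squares one element (the kernel itself is the topic of the next part)
__global__ void square(float *d_out, const float *d_in) {
    int i = threadIdx.x;
    d_out[i] = d_in[i] * d_in[i];
}

int main() {
    const int N = 64;
    const size_t bytes = N * sizeof(float);
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    // 1. CPU allocates storage on the GPU
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);

    // 2. CPU copies input data from the CPU to the GPU
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel on the GPU to process the data
    square<<<1, N>>>(d_out, d_in);

    // 4. CPU copies the results back from the GPU to the CPU
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[3] = %f\n", h_out[3]);   // expect 9.0

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}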

Defining the GPU computation
BIG IDEA
kernels look like serial programs

// serial version: square each of the 64 elements, one after another
for (int i = 0; i < 64; i++) {
    out[i] = in[i] * in[i];
}

64 multiplications in total
if one multiplication takes 2 ns, the serial loop takes 64 * 2 ns = 128 ns

A high-level view
CPU: allocates memory, copies data to/from the GPU, launches the kernel
GPU: expresses the per-element computation, out = in * in
CPU launches 64 threads, one per element
still 64 multiplications in total
each multiplication takes 10 ns (slower than the CPU's 2 ns), but all 64 run in parallel, so the whole computation takes about 10 ns
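A sketch of the GPU side of this view, using the same assumed square kernel as above: the body of the serial loop becomes the work of a single thread, with threadIdx.x standing in for the loop index i, and the CPU launches 64 copies of it at once.

// serial version (64 iterations, one after another, on the CPU):
//     for (int i = 0; i < 64; i++) { out[i] = in[i] * in[i]; }
//
// parallel version: the loop disappears; each of the 64 threads runs
// the loop body once, with threadIdx.x in place of i
__global__ void square(float *d_out, const float *d_in) {
    int i = threadIdx.x;            // this thread's index, 0..63
    d_out[i] = d_in[i] * d_in[i];   // same body as the serial loop
}

// the CPU launches one block of 64 threads:
//     square<<<1, 64>>>(d_out, d_in);
// all 64 multiplications run at the same time, so even at 10 ns per
// multiplication the launch takes about 10 ns, not 64 * 10 ns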