Squaring Numbers

#include <stdio.h>

__global__ void square(float * d_out, float * d_in){
	int idx = threadIdx.x;
	float f = d_in[idx];
	d_out[idx] = f * f;
}

int main(int argc, char ** argv){
	const int ARRAY_SIZE = 64;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

	float h_in[ARRAY_SIZE];
	for (int i = 0; i < ARRAY_SIZE; i++){
		h_in[i] = float(i);
	}
	float h_out[ARRAY_SIZE];

	float * d_in;
	float * d_out;

	cudaMalloc((void **) &d_in, ARRAY_BYTES);
	cudaMalloc((void **) &d_out, ARRAY_BYTES);

	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	square<<<1, ARRAY_SIZE>>>(d_out, d_in);

	cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

	for (int i = 0; i < ARRAY_SIZE; i++){
		printf("%f", h_out[i]);
		printf(((i % 4) != 3) ? "\t" : "\n");
	}

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

$ less square.cu
$ nvcc -o square square.cu
$ ./square

cudaMemcpyHostToDevice: copy from host (CPU) to device (GPU)
cudaMemcpyDeviceToHost: copy from device (GPU) to host (CPU)

configuring the kernel launch
Kernel <<< grid of blocks, block of threads >>> (...)
-> grid: 1-, 2-, or 3-D    -> block: 1-, 2-, or 3-D
dim3(x, y, z)
dim3(w, 1, 1) == dim3(w) == w

square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
square<<<grid, block, shmem>>>(...), where shmem is shared memory per block in bytes
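As a sketch of a multi-dimensional launch configuration: a hypothetical kernel (`brighten`, with made-up 128x128 image and 16x16 block sizes, not from the notes) launched over a 2-D grid of 2-D blocks might look like this:

```cuda
#include <cstdio>

// Hypothetical kernel: one thread per pixel of a 2-D image.
__global__ void brighten(float *img, int width) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
	int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
	img[y * width + x] += 1.0f;
}

int main() {
	const int W = 128, H = 128;
	float *d_img;
	cudaMalloc((void **) &d_img, W * H * sizeof(float));
	cudaMemset(d_img, 0, W * H * sizeof(float));

	dim3 block(16, 16);          // 16 * 16 = 256 threads per block
	dim3 grid(W / 16, H / 16);   // 8 x 8 blocks cover the whole image
	brighten<<<grid, block>>>(d_img, W);
	cudaDeviceSynchronize();

	cudaFree(d_img);
	return 0;
}
```

Each thread computes its own (x, y) from the block and thread indices, just as `square` computed `idx` from `threadIdx.x` alone.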

A CUDA program

cpu allocates storage on gpu (cudaMalloc)
cpu copies input data from cpu to gpu (cudaMemcpy)
cpu launches kernel(s) on gpu to process the data
cpu copies results back to cpu from gpu (cudaMemcpy)

defining the gpu computation
BIG IDEA
kernels look like serial programs

for (int i = 0; i < 64; i++){
	out[i] = in[i] * in[i];
}

64 multiplications in serial:
if one multiply takes 2 ns, executing takes 64 × 2 = 128 ns

a high-level view
CPU: allocates memory, copies data to/from GPU, launches kernel
GPU: expresses out = in * in
CPU launches 64 threads
64 multiplications in parallel:
even if one multiply takes 10 ns, all 64 run at once, so executing takes 10 ns

Power-efficient

CPU: optimizes for latency (time, e.g. seconds)
GPU: optimizes for throughput (stuff/time, e.g. jobs/hour)

Latency vs Throughput

car (2 people):
latency: 22.5 hours
throughput: 2 / 22.5 ≈ 0.089 people/hour
bus (40 people):
latency: 90 hours
throughput: 40 / 90 ≈ 0.45 people/hour

8-core CPU (Intel):
8-wide AVX vector operations / core
2 threads / core (hyperthreading)
8 × 8 × 2 = 128-way parallelism

CUDA program
written in C with extensions; parts run on the CPU ("host") and parts on the GPU ("device")

GPU computing

CUDA toolkit documentation v8.0
http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/index.html#axzz4aQaLXxZw

at the limit of instruction-level parallelism per clock cycle
-> add more processors to run computations faster

modern GPU:
- thousands of ALUs
- hundreds of processors
- tens of thousands of concurrent threads

GPU:
- smaller, faster, less power, more on chip

CPU:
- complex control hardware
  + flexibility + performance
  - expensive in terms of power

GPU:
- simpler control hardware
  + more hardware for computation
  + potentially more power efficient
  - more restrictive programming model