The need for barriers

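int idx = threadIdx.x;
__shared__ int array[128];
array[idx] = threadIdx.x;
// no barrier here: array[idx+1] may not have been written yet
if (idx < 127)
	// no barrier between read and write: this store may overwrite a value
	// that the neighboring thread has not read yet
	array[idx] = array[idx+1];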
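
A minimal sketch of how the snippet above could be made race-free with __syncthreads(). It assumes a single block of 128 threads. The kernel name shift_left, the out parameter, and the host helper run_shift_left are hypothetical names added for illustration only.

```cuda
#include <cuda_runtime.h>

#define N 128  // matches the array size in the snippet above

// Each thread writes its index, then takes the value of its right neighbor.
__global__ void shift_left(int *out)
{
    int idx = threadIdx.x;
    __shared__ int array[N];

    array[idx] = threadIdx.x;
    __syncthreads();              // all writes finish before any thread reads

    int temp = 0;
    if (idx < N - 1)
        temp = array[idx + 1];    // read the neighbor into a local variable
    __syncthreads();              // all reads finish before any thread overwrites

    if (idx < N - 1)
        array[idx] = temp;
    __syncthreads();              // all updates finish before later code reads
                                  // elements written by other threads

    out[idx] = array[idx];
}

// Hypothetical host helper: launches one block of N threads and copies the
// result into host_out. Returns 0 on success, -1 on a CUDA error.
int run_shift_left(int *host_out)
{
    int *dev_out = NULL;
    if (cudaMalloc(&dev_out, N * sizeof(int)) != cudaSuccess) return -1;
    shift_left<<<1, N>>>(dev_out);
    cudaError_t err = cudaMemcpy(host_out, dev_out, N * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : -1;
}
```

The barriers sit outside the if so that every thread in the block reaches them. The read and the write are split into two statements so a barrier can go between them.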

thread, thread block

CUDA
a hierarchy of
-computation
-memory spaces
-synchronization

Writing Efficient Programs
High-level strategy
1. Maximize arithmetic intensity = math / memory (math done per unit of memory accessed)
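
One way to picture the math/memory ratio, as a rough sketch rather than a tuning recipe: the kernel below reads each element from global memory once, does all of its math in a register, and writes once. With steps iterations that is 2*steps math operations for 2 memory accesses, so the ratio grows with steps. The kernel iterate_affine, the helper run_iterate_affine, and all parameter names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: applies x = a*x + b to each element, `steps` times.
__global__ void iterate_affine(float *data, int n, float a, float b, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = data[i];            // one global read
    for (int s = 0; s < steps; ++s)
        x = a * x + b;            // 2 math ops per step, kept in a register
    data[i] = x;                  // one global write
}

// Hypothetical host helper: runs the kernel on host_data in place.
// Returns 0 on success, -1 on a CUDA error.
int run_iterate_affine(float *host_data, int n, float a, float b, int steps)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    cudaError_t err = cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        iterate_affine<<<blocks, threads>>>(dev, n, a, b, steps);
        err = cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}
```

Keeping the running value in a local variable rather than re-reading data[i] on every step is the design choice that raises the ratio. A compiler may make the same optimization on its own, so treat this as an illustration of the idea.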