k-means clustering

-pick k centers (at random)
-each center “claims” its closest points
-recompute the centers by averaging the clustered points
-repeat until convergence

P^t(x): partition / cluster of object x
C_i^t: set of all points in cluster i = {x s.t. P^t(x) = i}
center_i^t = Σ_{y ∈ C_i^t} y / |C_i^t|

P^t(x) = argmin_i ||x − center_i^(t−1)||²

K-means as optimization
configurations: (center, P)
scores: E(P, center) = Σ_x ||center_P(x) − x||₂²
neighborhood of (P, center) = {(P', center)} ∪ {(P, center')}
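
A minimal NumPy sketch of the loop above (random centers, claim, re-average), with the score E computed at the end; the variable names are mine, not from the lecture:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # pick k centers at random
    for _ in range(iters):
        # each center "claims" its closest points: P(x) = argmin_i ||x - center_i||^2
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        P = dists.argmin(axis=1)
        # recompute the centers by averaging the clustered points
        new = np.array([X[P == i].mean(axis=0) if (P == i).any() else centers[i]
                        for i in range(k)])
        if np.allclose(new, centers):                   # repeat until convergence
            break
        centers = new
    E = ((X - centers[P]) ** 2).sum()   # E(P, center) = sum_x ||center_P(x) - x||^2
    return P, centers, E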

Properties of k-means clustering
-each iteration polynomial: O(kn)
-finite (exponential) number of iterations: O(k^n)
-error decreases (if ties broken consistently) [with one exception]
-can get stuck!

Practical Matters

-MIMIC does well with structure
-representing P^θ(x)
-local optima
-probability theory
-time complexity

Clustering & EM
unsupervised learning
– supervised learning: use labeled training data to generalize labels to new instances
– unsupervised learning: make sense of unlabeled data

Basic clustering problem
given: set of objects X
inter-object distances D(x, y) = D(y, x) for all x, y ∈ X
output: a partition, with P_D(x) = P_D(y)
if x and y are in the same cluster

Genetic Algorithms

Population of individuals
Mutation – local search
Crossover – population holds information
Generations – iterations of improvement

GA Skeleton
P_0 = initial population of size K
repeat until converged:
    compute fitness of all x ∈ P_t
    select “most fit” individuals (top half, weighted prob.)
    pair up individuals, replacing “least fit” individuals
    via crossover/mutation
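
A toy bit-string version of this skeleton (plain top-half selection, one-point crossover, bit-flip mutation); the parameters and fitness function are placeholders, not from the lecture:

import random

def ga(fitness, n_bits, K=20, generations=50, p_mut=0.05, seed=0):
    rnd = random.Random(seed)
    pop = [[rnd.randint(0, 1) for _ in range(n_bits)] for _ in range(K)]  # P_0, size K
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)     # compute fitness of all x in P_t
        survivors = pop[:K // 2]                # select "most fit" (top half)
        children = []
        while len(survivors) + len(children) < K:
            a, b = rnd.sample(survivors, 2)     # pair up individuals
            cut = rnd.randrange(1, n_bits)      # one-point crossover
            child = [bit ^ (rnd.random() < p_mut)          # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = survivors + children              # children replace the "least fit"
    return max(pop, key=fitness)

best = ga(sum, 20)   # e.g. maximize the number of 1s in a 20-bit string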

Randomized Optimization: MIMIC
– only points, no structure
– unclear probability distribution

MIMIC: estimating distributions
p(x) = p(x1 | x2 … xn) p(x2 | x3 … xn) … p(xn)
x = [x1, x2, x3, …, xn]
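
MIMIC proper approximates this chain of conditionals with a dependency tree; the sketch below is a deliberately simplified cousin that keeps only independent marginals p(x_i) (no structure between bits), just to show the sample / keep-the-fittest / re-estimate loop:

import random

def mimic_like(fitness, n_bits, samples=100, keep=0.5, iters=30, seed=0):
    rnd = random.Random(seed)
    theta = [0.5] * n_bits                      # marginal p(x_i = 1) for each bit
    for _ in range(iters):
        pop = [[int(rnd.random() < t) for t in theta] for _ in range(samples)]
        pop.sort(key=fitness, reverse=True)
        elite = pop[:int(samples * keep)]       # keep samples above the fitness cutoff
        # re-estimate the distribution from the elite samples
        theta = [sum(x[i] for x in elite) / float(len(elite)) for i in range(n_bits)]
        theta = [min(max(t, 0.02), 0.98) for t in theta]   # keep probabilities off 0/1
    return max(pop, key=fitness)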

Optimize Me

x ∈ {1, …, 100}
f(x) = (x mod 6)² mod 7 − sin(x)
x* = 11

X = ℝ
f(x) = −x⁴ + 1000x³ − 20x² + 4x − 6
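
The first problem has a tiny input space, so generate & test settles it; a quick sketch (assuming sin means math.sin, in radians):

import math

def f(x):
    return (x % 6) ** 2 % 7 - math.sin(x)

best = max(range(1, 101), key=f)   # generate & test over {1, ..., 100}
# best == 11: f(11) = 4 - sin(11), about 5.0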

optimization approaches
– generate & test: small input space, complex function
– calculus: function has derivative
– Newton's method
what if assumptions don’t hold?

big input space, complex function, no derivative (or hard to find),
possibly many local optima

Randomized optimization

Hill climbing
Guess x ∈ X
repeat:
    let n* = argmax_{n ∈ N(x)} f(n)
    if f(n*) > f(x): x = n*
    else: stop

Thinking of a 5-bit sequence
f(x) = # correct bits
x: 5-bit sequences
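
The climbing loop above, run on this guessing game with N(x) = all single-bit flips; the hidden target here is an arbitrary stand-in:

target = [1, 0, 1, 1, 0]                  # the 5-bit sequence being "thought of" (stand-in)

def f(x):
    return sum(a == b for a, b in zip(x, target))   # f(x) = # correct bits

x = [0, 0, 0, 0, 0]                       # guess x in X
while True:
    N = [x[:i] + [1 - x[i]] + x[i + 1:] for i in range(5)]   # N(x): single-bit flips
    n = max(N, key=f)                     # n* = argmax over the neighborhood
    if f(n) > f(x):
        x = n
    else:
        break                             # no improving neighbor: stop (here x == target)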

Unsupervised learning

OPTIMIZATION
(Randomized optimization)

optimization
input space X
objective function (fitness function) f: X → ℝ
goal: find x* ∈ X s.t. f(x*) = max_{x ∈ X} f(x)
Find the best:

-factory, chemical, process control
-route finding
-root finding
-neural networks
    x is weights
    minimize error
-decision trees

Alice and Bob

from urllib import urlopen, urlencode   # Python 2 urllib
import json

base = "http://hgehoge.com"
Alice = base + "/alice"
Bob = base + "/bob"

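# Read the HTTP response, raise on a non-200 status code, and decode the JSON body.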
def check_output(output):
	data = output.read()
	if output.getcode() != 200:
		raise Exception(data)
	data = json.loads(data)
	return data

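# Fetch the protocol parameters from the base URL (presumably the Diffie-Hellman p and g, going by the name).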
def get_pg():
	output = urlopen(base)
	data = check_output(output)
	return data

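# Start a session with Alice or Bob; the reply is expected to carry a session token.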
def initialize(person):
	data = {'type':'init'}
	output = urlopen(person, urlencode(data))
	data = check_output(output)
	return data

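# Send our public key (and a display name) under the session token.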
def send_key(person, token, public, name):
	data = {'type':'key',
		'token':token,
		'public': public,
		'name':name}
	output = urlopen(person, urlencode(data))
	data = check_output(output)
	return data

# Ask for the next message in this session (presumably returned as ciphertext plus iv).
def receive_msg(person, token):
	data = {'type':'msg',
			'token':token}
	output = urlopen(person, urlencode(data))
	data = check_output(output)
	return data

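# Send an encrypted message: the ciphertext and the IV used to encrypt it.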
def send_msg(person, token, cipher, iv):
	data = {'type':'msg',
			'token':token,
			'message':cipher,
			'iv':iv}
	output = urlopen(person, urlencode(data))
	data = check_output(output)
	return data
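
A hypothetical end-to-end sketch of how these calls might chain together, assuming get_pg returns Diffie-Hellman parameters under 'p' and 'g' and the JSON replies carry 'token' and 'public' fields (all of these field names are guesses, not confirmed by the handout):

params = get_pg()                        # assumed fields: {'p': ..., 'g': ...}
p, g = params['p'], params['g']
a = 123456789                            # our secret exponent (pick randomly in practice)
A = pow(g, a, p)                         # our public value g^a mod p

session = initialize(Alice)              # assumed to return a session token
token = session['token']
reply = send_key(Alice, token, A, 'eve') # assumed to return Alice's public value
B = reply['public']
shared = pow(B, a, p)                    # shared secret g^(a*b) mod p
msg = receive_msg(Alice, token)          # assumed to return ciphertext + iv to decrypt with 'shared'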