Dimensionality reduction
locally linear embedding
Isomap
cluster by affinity
do EM/k-means succeed
in finding the 2 clusters?
Affinity matrix (sketch below)
dimensionality for large environments
supervised vs. unsupervised learning
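A minimal sketch of clustering by affinity: build an RBF affinity matrix and split the points by the sign of the Fiedler vector of the normalized Laplacian. The toy data and gamma below are illustrative choices, not from the lecture.

```python
# Sketch: affinity matrix + spectral split into 2 clusters (illustrative data).
import numpy as np

x = np.array([1.0, 1.2, 0.8, 8.0, 8.3, 7.9]).reshape(-1, 1)

# Affinity matrix: A[i, j] = exp(-gamma * ||x_i - x_j||^2)
gamma = 1.0
sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
A = np.exp(-gamma * sq_dists)

# Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(x)) - D_inv_sqrt @ A @ D_inv_sqrt

# For 2 clusters, the sign of the second-smallest eigenvector splits the points
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
print(labels)   # e.g. [0 0 0 1 1 1] (labels may be flipped)
```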
Maximum likelihood
data: 3, 4, 5, 6, 7   (M = 5)
μ = 5, σ² = 2
data: 3, 9, 9, 3
μ = 6, σ² = 9
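A quick check of the two maximum-likelihood fits above:

```python
# ML estimates of a Gaussian: mean and variance both divide by M, not M-1.
def ml_gaussian(xs):
    M = len(xs)
    mu = sum(xs) / M
    sigma2 = sum((x - mu) ** 2 for x in xs) / M
    return mu, sigma2

print(ml_gaussian([3, 4, 5, 6, 7]))  # (5.0, 2.0)
print(ml_gaussian([3, 9, 9, 3]))     # (6.0, 9.0)
```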
Gaussians
– functional form
– fit from data
– multivariate Gaussians
Expectation maximization
P(x) = Σi=1..k P(C=i) · p(x | C=i)
parameters: πi, μi, Σi
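A minimal 1-D EM sketch for a mixture of k Gaussians; the synthetic data, k = 2, and the iteration count are illustrative choices.

```python
import numpy as np

def em_gmm(x, k=2, iters=50):
    n = len(x)
    pi = np.full(k, 1.0 / k)                    # mixing weights P(C=i)
    mu = np.random.choice(x, k, replace=False)  # initial means
    var = np.full(k, np.var(x))                 # initial variances
    for _ in range(iters):
        # E-step: responsibilities r[j, i] = P(C=i | x_j)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_i, mu_i, var_i from the weighted data
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm(x, k=2))
```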
EM versus k-means
minimize: −Σj log p(xj | μ1…k, Σ1…k) + cost · k
guess initial k
run EM
remove unnecessary clusters (sketch below)
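A sketch of the "minimize −Σj log p(xj) + cost · k" idea, using sklearn's GaussianMixture as the EM step; the data and the per-cluster cost are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

x = np.concatenate([np.random.normal(0, 1, 200),
                    np.random.normal(6, 1, 200)]).reshape(-1, 1)
cost_per_cluster = 10.0

scores = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k).fit(x)
    neg_log_lik = -gm.score(x) * len(x)          # score() is the mean log-likelihood
    scores[k] = neg_log_lik + cost_per_cluster * k

print(min(scores, key=scores.get))               # typically 2 for this data
```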
clustering
– k-means, EM
Unsupervised learning
- find structure in the data
- density estimation
- clustering
- dimensionality reduction
- blind source separation
K-means clustering (sketch below)
– need to know k
– local minima
– high dimensionality
– lack of a mathematical basis
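A minimal k-means sketch; the data, k, and the random initialization are illustrative, and the result depends on the init (local minima).

```python
# Assumes no cluster goes empty during the iterations.
import numpy as np

def kmeans(x, k, iters=100):
    centers = x[np.random.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([x[labels == i].mean(axis=0) for i in range(k)])
    return labels, centers

x = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(x, k=2)
print(centers)   # roughly [0, 0] and [5, 5], in some order
```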
Gaussian Learning
parameters of a Gaussian
f(x | μ, σ²) = 1/√(2πσ²) · exp(−(x−μ)² / 2σ²)
μ = (1/M) Σj=1..M xj,   σ² = (1/M) Σj=1..M (xj − μ)²
Data x1 … xM:  p(x1 … xM | μ, σ²) = Πi f(xi | μ, σ²) = (1/(2πσ²))^(M/2) exp(−Σi (xi−μ)² / 2σ²)
log-likelihood: (M/2) log(1/(2πσ²)) − (1/(2σ²)) Σi=1..M (xi−μ)²
Gradient
L = Σj (yj − w1xj − w0)² -> min
∂L/∂w1 = −2 Σj (yj − w1xj − w0) xj
∂L/∂w0 = −2 Σj (yj − w1xj − w0)
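A sketch of gradient descent on L using the derivatives above; the learning rate, step count, and data are illustrative.

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        d_w1 = -2 * sum((y - w1 * x - w0) * x for x, y in zip(xs, ys))
        d_w0 = -2 * sum((y - w1 * x - w0) for x, y in zip(xs, ys))
        w1 -= lr * d_w1
        w0 -= lr * d_w0
    return w0, w1

print(fit_line([0, 1, 2, 3], [3, 2, 1, 0]))   # approaches (3.0, -1.0)
```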
Perceptron algorithm
Linear separator
f(x) = 1 if w1x + w0 ≥ 0
     = 0 if w1x + w0 < 0
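A minimal perceptron sketch for the 1-D rule above (predict 1 iff w1·x + w0 ≥ 0); the training data and learning rate are illustrative.

```python
def train_perceptron(data, lr=0.1, epochs=100):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if w1 * x + w0 >= 0 else 0
            # update only on mistakes, nudging the separator toward the example
            w0 += lr * (y - pred)
            w1 += lr * (y - pred) * x
    return w0, w1

data = [(0, 0), (1, 0), (3, 1), (4, 1)]   # linearly separable around x ≈ 2
print(train_perceptron(data))
```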
Linear function
Linear Method
-regression vs classification
-exact solution vs iterative solution
-smoothing
-non-linear problems
Supervised Learning
-> parametric vs. non-parametric
KNN definition
learning: memorize all data; label a new example by the majority vote of its k nearest neighbors
Problems of KNN
- very large data sets -> k-d trees (sketch below)
- very large feature spaces
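A k-nearest-neighbour sketch: "learning" just stores the data, and a k-d tree speeds up the neighbour search. The data, k, and query points are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

tree = cKDTree(train_x)                    # built once at "training" time

def knn_predict(query, k=3):
    _, idx = tree.query(query, k=k)        # indices of the k nearest stored points
    return np.bincount(train_y[idx]).argmax()   # majority label

print(knn_predict([0.5, 0.5]))   # 0
print(knn_predict([5.5, 5.5]))   # 1
```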
Minimize quadratic loss
min L = Σi (yi − w1xi − w0)²
∂L/∂w0 = 0  =>  w0 = (1/M) Σyi − (w1/M) Σxi
∂L/∂w1 = 0  =>  Σxiyi − (1/M) Σyi Σxi + (w1/M) (Σxi)² = w1 Σxi²
f(x)= w1X + w0
w0 = 3
w1 = -1
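A sketch of the closed-form fit implied by the normal equations above; the four data points are an illustrative set chosen to give w0 = 3, w1 = −1 (the data behind the notes' example isn't recorded here).

```python
def fit_closed_form(xs, ys):
    M = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (sxy - sx * sy / M) / (sxx - sx * sx / M)
    w0 = (sy - w1 * sx) / M
    return w0, w1

print(fit_closed_form([0, 1, 2, 3], [3, 2, 1, 0]))   # (3.0, -1.0)
```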
Regularization
loss = loss(data)+loss(parameters)
Σj (yj − w1xj − w0)² + Σi |wi|^p
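A sketch of the regularized loss: data loss plus a penalty on the weights; p = 2 gives ridge-style shrinkage, p = 1 lasso-style sparsity. The weight lam is an illustrative addition not in the notes.

```python
def regularized_loss(w0, w1, xs, ys, p=2, lam=0.1):
    data_loss = sum((y - w1 * x - w0) ** 2 for x, y in zip(xs, ys))
    param_loss = lam * abs(w1) ** p          # the bias w0 is usually left unpenalized
    return data_loss + param_loss

print(regularized_loss(3.0, -1.0, [0, 1, 2, 3], [3, 2, 1, 0]))   # 0.0 data loss + 0.1 penalty
```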
advanced spam filters
-known spamming IP?
-have you emailed this person before?
-have other people received the same message?
-is the email header consistent?
-all caps?
-do inline URLs point to where they say?
-are you addressed by name?
Digit recognition
-input vector = pixel values
16 x 16
Overfitting prevention
-Occam's razor
-cross-validation (sketch below)
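A k-fold cross-validation sketch for picking model complexity; the fold count, the polynomial fit, and the noise level are illustrative choices.

```python
import numpy as np

def cross_val_error(xs, ys, fit, error, folds=5):
    idx = np.random.permutation(len(xs))
    chunks = np.array_split(idx, folds)
    errs = []
    for i in range(folds):
        test = chunks[i]
        train = np.concatenate([chunks[j] for j in range(folds) if j != i])
        model = fit(xs[train], ys[train])               # train on k-1 folds
        errs.append(error(model, xs[test], ys[test]))   # evaluate on the held-out fold
    return np.mean(errs)

xs = np.linspace(0, 1, 40)
ys = 3 - xs + 0.1 * np.random.randn(40)
fit = lambda x, y: np.polyfit(x, y, deg=1)
error = lambda w, x, y: np.mean((np.polyval(w, x) - y) ** 2)
print(cross_val_error(xs, ys, fit, error))
```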
Supervised learning
-> classification: yi ∈ {0, 1}
-> regression: yi ∈ ℝ
f(x) = w1X + w0
w0 = 3, w1= -1
Linear Regression
Data: f(x) = w1x + w0 (scalar case), f(x) = w·x + w0 (vector case)
y = f(x)
Loss = Σj (yj − w1xj − w0)²
Bayes Network
-offer is secret click sports
p(“secret”|spam) = 1/3
Dictionary has 12 words -> number of parameters:
p(spam): 1
p(wi|spam): 11 (a 12-word distribution has 11 free parameters)
p(wi|ham): 11
message m=”sports”
p(spam|m) = 0.1667 or 3/18
= p(m|spam) p(spam) / [p(m|spam) p(spam) + p(m|ham) p(ham)]
m = “secret is secret”
p(spam | m) = 25 /26
Laplace smoothing
ML: p(x) = count(x) / N
LS(k): p(x) = (count(x) + k) / (N + k·|x|),  |x| = number of possible values
k = 1: 1 message, 1 spam -> p(spam) = 2/3
10 messages, 6 spam -> p(spam) = 7/12
100 messages, 60 spam -> p(spam) = 61/102
k = 1, p(spam) = 2/5 p(ham) = 3/5 p(“today”|spam) = 1/21 p(“today”|ham) = 3/27
M = “today is secret” P(spam|m)= 0.4858
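A sketch of the naive Bayes spam filter with Laplace smoothing k, trained on the eight SPAM/HAM messages listed later in these notes; with k = 1 it reproduces P(spam | "today is secret") ≈ 0.4858, and k = 0 gives the unsmoothed answers above.

```python
spam = ["offer is secret", "click secret link", "secret sports link"]
ham = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]

def train(spam, ham, k=1):
    vocab = set(" ".join(spam + ham).split())            # 12 distinct words
    def word_probs(msgs):
        words = " ".join(msgs).split()
        return {w: (words.count(w) + k) / (len(words) + k * len(vocab)) for w in vocab}
    p_spam = (len(spam) + k) / (len(spam) + len(ham) + 2 * k)   # the class has 2 values
    return p_spam, word_probs(spam), word_probs(ham)

def p_spam_given(msg, p_spam, p_w_spam, p_w_ham):
    s, h = p_spam, 1 - p_spam
    for w in msg.split():
        s *= p_w_spam[w]
        h *= p_w_ham[w]
    return s / (s + h)

print(p_spam_given("today is secret", *train(spam, ham, k=1)))  # ~0.4858
print(p_spam_given("sports", *train(spam, ham, k=0)))           # 1/6 ≈ 0.1667
```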
Summary: Naive Bayes
Supervised learning
x11, x12, x13 … x1n -> y1
x21, x22, x23 … x2n -> y2
…
xm1, xm2, xm3 … xmn -> ym
goal: find f with f(xm) = ym
OCCAM’S RAZOR
everything else being equal,
choose the less complex hypothesis
trade-off: fit <-> low complexity
generalization error and overfitting
SPAM: offer is secret; click secret link; secret sports link
HAM: play sports today; went play sports; secret sports event; sports is today; sports costs money
P(spam) = 3/8
Maximum likelihood
data: s s s h h h h h,  p(s) = π
p(yi) = { π if yi = s;  1−π if yi = h }
p(yi) = π^yi · (1−π)^(1−yi)
p(data) = Πi p(yi) = π^count(yi=1) · (1−π)^count(yi=0) = π³ · (1−π)⁵
maximizing gives π = count(yi=1)/N = 3/8
ML solutions:
P(“secret” | spam) = 1/3
P(“secret” | ham) = 1/15
Machine learning
-> Bayes networks = reason with known models
-> Machine learning = learn models from data
Supervised Learning
Unsupervised Learning
Famous for using machine learning
-Google: web mining
-Netflix: DVD recommendations
-Amazon: product placement
Machine Learning
what?
->parameters, structure, hidden concepts
what from?
->supervised, unsupervised, reinforcement
what for?
-> prediction, diagnostics, summarization, …
How?
-> passive, active, online, offline
output?
-> classification, regression
Details
->generative, discriminative
Approximate Inference Sampling
P(B|+a)
Likelihood weighting
Inconsistent
Gibbs sampling
Markov chain Monte Carlo (MCMC)
+c +s -r -w
+c -s -r -w
+c -s +r -w
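A sketch of likelihood weighting on the cloudy/sprinkler/rain/wet-grass network the samples above come from; the CPT numbers are assumed textbook values, not taken from these notes.

```python
import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}                 # P(+s | C)
P_R = {True: 0.8, False: 0.2}                 # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(+w | S, R)

def weighted_sample():
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    weight = P_W[(s, r)]        # evidence W = +w is fixed; weight by its likelihood
    return r, weight

num = den = 0.0
for _ in range(100_000):
    r, w = weighted_sample()
    num += w * r
    den += w
print(num / den)                # estimate of P(+r | +w), roughly 0.71 with these CPTs
```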
P(A) = 0.5, P(B|A)=0.2, P(B|¬A)=0.8
P(¬A)=1-P(A)=0.5
P(A|B)=0.2
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|¬A)P(¬A)]
= 0.1 / (0.1 + 0.8·0.5) = 0.1 / 0.5 = 0.2
Simple Bayes Net
P(A) = 0.5,  ∀i: P(Xi|A) = 0.2, P(Xi|¬A) = 0.6
P(A | X1, X2, ¬X3) ∝ P(¬X3|A) P(X2|A) P(X1|A) P(A)
P(¬A | X1, X2, ¬X3) ∝ P(¬X3|¬A) P(X2|¬A) P(X1|¬A) P(¬A)
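A quick check of the query above, normalizing the two unnormalized products from the stated numbers:

```python
p_a, p_x_a, p_x_not_a = 0.5, 0.2, 0.6

# P(X1|A) P(X2|A) P(¬X3|A) P(A), and the same with ¬A
num_a     = p_x_a * p_x_a * (1 - p_x_a) * p_a
num_not_a = p_x_not_a * p_x_not_a * (1 - p_x_not_a) * (1 - p_a)

print(num_a / (num_a + num_not_a))   # 0.016 / 0.088 ≈ 0.1818
```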