-classification
-clustering
-dimensionality reduction
-graph analysis
Reinforcement-learning basics
Agent, Environment
a, r, s
Representing and reasoning with probabilities
Bayesian Networks
Joint Distribution
storm  lightning  Pr(storm, lightning)  Pr(..., thunder=T)  Pr(..., thunder=F)
T      T          .25                   .20                 .05
T      F          .40                   .04                 .36
F      T          .05                   .04                 .01
F      F          .30                   .03                 .27
Pick a random summer day at 2 pm and look outside:
Pr(¬storm) = 0.35
Pr(lightning|storm) = .25/.65 ≈ .3846
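These two numbers can be checked with a few lines of Python over the joint table above:

```python
# Joint distribution Pr(storm, lightning) from the table above.
joint = {
    (True, True): 0.25,
    (True, False): 0.40,
    (False, True): 0.05,
    (False, False): 0.30,
}

# Marginal: Pr(storm) = sum over lightning.
p_storm = sum(p for (s, _), p in joint.items() if s)
print(1 - p_storm)                     # Pr(not storm) = 0.35

# Conditional: Pr(lightning | storm) = Pr(storm, lightning) / Pr(storm).
print(joint[(True, True)] / p_storm)   # 0.25 / 0.65 ≈ 0.3846
```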
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
∀x, y, z  P(X=x|Y=y, Z=z) = P(X=x|Z=z)
more compactly, we write
P(X|Y,Z) = P(X|Z)
sampling
a distribution is used for two things: the probability of a value, and generating values
simulation of a complex process
approximate inference
P(x) = Σy P(x, y)
P(x, y) = P(x) P(y|x)
P(y|x) = P(x|y) P(y) / P(x)
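A minimal sketch of sampling for approximate inference in Python, using the numbers implied by the table above (Pr(thunder|lightning) = .8, Pr(thunder|¬lightning) = .1):

```python
import random

# Ancestral sampling from the storm -> lightning -> thunder network.
# All numbers follow from the joint table above.
def sample():
    storm = random.random() < 0.65                        # Pr(storm) = .65
    p_lightning = 0.25 / 0.65 if storm else 0.05 / 0.35   # Pr(lightning | storm)
    lightning = random.random() < p_lightning
    p_thunder = 0.8 if lightning else 0.1                 # Pr(thunder | lightning)
    thunder = random.random() < p_thunder
    return storm, lightning, thunder

# Approximate inference: estimate Pr(thunder) by simulation.
samples = [sample() for _ in range(100_000)]
print(sum(t for _, _, t in samples) / len(samples))       # ~0.31
```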
Learn the best hypothesis given data
+ some domain knowledge
Learn the most probable hypothesis given data
+ domain knowledge
Pr(h|D)
Pr(h|D) = Pr(D|h)*Pr(h) / Pr(D) … Bayes’ rule
Pr(a,b) = Pr(a|b)P(b)
Pr(b,a) = Pr(b|a)P(a)
Bayesian Learning
For each h ∈ H
calculate Pr(h|D) = P(D|h)P(h)/P(D)
Output:
h_MAP = argmax_h Pr(h|D)
h_ML = argmax_h Pr(D|h)   (maximum likelihood; same as MAP when the prior is uniform)
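A toy sketch of this loop in Python; the threshold hypothesis class (as in the example further down), the data set, and the noise model are illustrative assumptions:

```python
import numpy as np

# Toy Bayesian learning: hypotheses are thresholds theta in {1..10},
# h(x) = (x >= theta), matching the threshold example later in these notes.
thetas = np.arange(1, 11)
prior = np.full(len(thetas), 1 / len(thetas))          # uniform Pr(h)

# Noisy training data: (x, label) pairs; the true theta is assumed to be 6.
D = [(2, 0), (9, 1), (5, 0), (7, 1), (6, 1)]

def likelihood(theta, data, noise=0.05):
    # Pr(D|h): each label agrees with h(x) with probability 1 - noise.
    p = 1.0
    for x, y in data:
        p *= (1 - noise) if (x >= theta) == y else noise
    return p

post = np.array([likelihood(t, D) for t in thetas]) * prior
post /= post.sum()                                     # Pr(h|D), up to Pr(D)
print(thetas[np.argmax(post)])                         # h_MAP -> 6
```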
Infinite Hypothesis Spaces
m ≥ (1/ε)(ln|H| + ln(1/δ))
Many hypothesis spaces are infinite:
– linear separators
– artificial neural networks
– decision trees (continuous input)
X:{1,2,3,4,5,6,7,8,9,10}
H: h(x) = (x ≥ θ)
|H|=∞
Track all hypotheses? Only need to track the non-negative integers 1–10; keep the version space
X = ℝ
H = {h(x) = (x ∈ [a, b])}
parameterized by a, b ∈ ℝ
VC = 2
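A brute-force check in Python that intervals shatter two points but not three, which is what VC = 2 means; the candidate-endpoint trick is my own illustration:

```python
from itertools import product

def shatters(points):
    # H = intervals [a, b]; `points` is shattered if every labeling is achievable.
    pts = sorted(points)
    # Candidate endpoints: just outside the data and midway between neighbors.
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([False, True], repeat=len(pts)):
        achievable = any(
            all((a <= x <= b) == want for x, want in zip(pts, labels))
            for a in cands for b in cands if a <= b
        )
        if not achievable:
            return False
    return True

print(shatters([1.0, 2.0]))        # True  -> intervals shatter 2 points
print(shatters([1.0, 2.0, 3.0]))   # False -> (T, F, T) is impossible, so VC = 2
```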
Mondrian Composition
https://www.khanacademy.org/humanities/ap-art-history/later-europe-and-americas/modernity-ap/a/mondrian-composition
Colored Voronoi Diagram
support vector machines (SVMs), perceptron
Nearest neighbor 1-NN
decision trees
-defining learning problems
-showing specific algorithms work
-showing these problems are fundamentally hard
Theory of computing analyzes how algorithms use resources: time, space, e.g. O(n log n), O(n^2)
Inductive learning
1.probability of successful training
2.number of examples to train on
3.complexity of hypothesis class
4.accuracy to which target concept is approximated
5.manner in which training examples are presented
6.manner in which training examples are selected
computational complexity
– how much computational effort is needed for a learner to converge?
sample complexity – batch
– how many training examples are needed for a learner to create a successful hypothesis?
mistake bounds – online
– how many misclassifications can a learner make over an infinite run?
true hypothesis: c ∈ H
training set: S ⊆ X
candidate hypothesis: h ∈ H
consistent learner: produces h with c(x) = h(x) for x ∈ S
version space: VS(S) = {h ∈ H : h consistent with S}, i.e. the hypotheses consistent with the examples
error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]
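A small Python sketch tying these definitions to the threshold class from earlier, together with the sample-complexity bound above; the choices of target, training set, ε, and δ are illustrative:

```python
import math

# Hypothesis class: h_theta(x) = (x >= theta), theta in {1..10}; target c = h_6.
H = list(range(1, 11))
c = lambda x: x >= 6

# Training set S and the version space: hypotheses consistent with S.
S = [2, 9, 5, 7]
VS = [t for t in H if all((x >= t) == c(x) for x in S)]
print(VS)                               # thetas still consistent -> [6, 7]

# Sample-complexity bound: m >= (1/eps) * (ln|H| + ln(1/delta))
eps, delta = 0.1, 0.05
m = (1 / eps) * (math.log(len(H)) + math.log(1 / delta))
print(math.ceil(m))                     # 53 examples suffice
```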
y = w^T x + b
y: label; w, b: parameters of the plane
w^T x + b = 1
w^T x + b = 0
w^T x + b = -1
y ∈ {-1, +1}
w^T x1 + b = 1
w^T x2 + b = -1
w^T(x1 − x2)/||w|| = 2/||w||   (the margin)
max 2/||w|| while classifying everything correctly:
yi(w^T xi + b) ≥ 1
equivalently, min ½||w||^2: a quadratic programming problem
W(α) = Σi αi − ½ Σi,j αi αj yi yj xi^T xj
s.t. αi ≥ 0, Σi αi yi = 0
SVMs: Linearly Separable
– margins: generalization vs. overfitting
– bigger margin is better
– optimization problem for finding the max margin: a QP (see the sketch below)
– support vectors
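A minimal sketch of the max-margin classifier on made-up linearly separable data, using scikit-learn's SVC (which solves the QP internally); a large C approximates the hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable data set; labels in {-1, +1}.
X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 3], [4, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C approximates the hard-margin max-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(clf.support_vectors_)      # the points with alpha_i > 0
print(2 / np.linalg.norm(w))     # the margin 2/||w||
```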
spam email {+, -}
sample rules: body "manly" +, from spouse -, short +, just URLs +, just an image +, "p0rn" +, "make money easy" +
1. Learn over a subset of the data -> a rule: pick data uniformly at random and apply a learner
2. Combine the rules into a complex rule -> e.g. take the mean
Boosting
“hardest” examples
weighted mean
Error: mismatches
PrD[h(x) ≠ c(x)]
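A compact sketch of the boosting idea using scikit-learn's AdaBoostClassifier; the synthetic data stands in for the spam example, and the default weak learner is a decision stump:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data standing in for the spam example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# Boosting: each round reweights the "hardest" (currently misclassified)
# examples, fits a weak learner (default: a decision stump), and the final
# classifier is a weighted vote of the weak learners.
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(boost.score(X, y))     # training accuracy
```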
Before: learn a function f from the training data
Now: just store the data
f(x) = lookup(x)
+ remember, fast, simple
– generalization, overfit
distance stands for similarity
K-NN
Given: Training data D = {xi, yi}
distance metric d(q, x)
number of neighbors k
query point q
– NN = {i : d(q, xi) among the k smallest}
– Return
– classification: plurality vote
– regression: mean
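A small sketch of this procedure in Python/NumPy; the toy data and the Euclidean metric are illustrative choices:

```python
import numpy as np
from collections import Counter

def knn(D_x, D_y, q, k=3, classify=True):
    # d(q, x): Euclidean distance to every training point.
    dists = np.linalg.norm(D_x - q, axis=1)
    # NN = indices of the k smallest distances.
    nn = np.argsort(dists)[:k]
    if classify:
        return Counter(D_y[nn]).most_common(1)[0][0]   # plurality vote
    return float(np.mean(D_y[nn]))                     # regression: mean

D_x = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
D_y = np.array([0, 0, 0, 1, 1, 1])
print(knn(D_x, D_y, np.array([0.5, 0.5])))   # -> 0
print(knn(D_x, D_y, np.array([5.5, 5.0])))   # -> 1
```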
Preference Bias
+ locality -> near points are similar
+ smoothness -> averaging
+ all features matter equally
Curse of dimensionality
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
Neural Networks
cell body, neuron, axon, synapses, spike trains
computational unit
Artificial Neural Networks
x1, x2, x3 -> θ (perceptron unit) -> y
Σi xi·wi (activation) ≥ θ -> firing
output: yes -> y = 1, no -> y = 0
e.g.
x1 = 1, x2 = 0, x3 = -1.5, w1 = 1/2, w2 = 3/5, w3 = 1
1·(1/2) + 0·(3/5) + (-1.5)·1 = -1
θ = 0
output should be y = 0, since the activation -1 < θ
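The same computation as a tiny Python sketch:

```python
# Perceptron unit: fire (output 1) iff the weighted sum reaches the threshold.
def perceptron(x, w, theta):
    activation = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if activation >= theta else 0

# The worked example above: activation = 0.5 - 1.5 = -1 < 0, so y = 0.
print(perceptron(x=[1, 0, -1.5], w=[0.5, 0.6, 1.0], theta=0))   # 0
```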
How powerful is a perceptron unit?
Regression
supervised learning: take examples of inputs and outputs. Now, given a new input, predict its output.
Mapping continuous inputs to outputs.
discrete, continuous
child height vs. parent height
slope ≈ 2/3 < 1: regression to the mean
Reinforcement learning
Regression in machine learning
Finding the best constant function
f(x) = c
E(c) = Σi (yi − c)^2
LOSS, ERROR
dE/dc = −2 Σi (yi − c) = 0  ->  c = mean(y)
Order of polynomial
k = 0:constant
k = 1:line
k = 2:parabola
f(x) = c0 + c1x + c2x^2 + ... + ckx^k
polynomial regression
c0 + c1x + c2x^2 + c3x^3 = y
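A quick sketch of fitting such a polynomial by least squares with NumPy; the cubic-plus-noise data is made up:

```python
import numpy as np

# Synthetic data: a cubic plus noise (f + epsilon).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 1 + 2 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.2, size=x.shape)

# Solve c0 + c1*x + c2*x^2 + c3*x^3 ≈ y in the least-squares sense.
k = 3
X = np.vander(x, k + 1, increasing=True)        # columns: 1, x, x^2, x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                                   # ~[1, 2, -0.5, 0.3]
```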
Errors
Training data has errors: we are not modeling f, but f + ε. Where do the errors come from?
sensor error
Cross Validation
Fundamental assumption
use a model that is complex enough to fit the data without causing problems on the test set
-training error
-cross validation error
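A minimal sketch comparing training error with cross-validation error as the polynomial degree grows (Python, scikit-learn; the quadratic-plus-noise data is made up):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data: a quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = 1 + 2 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(scale=0.3, size=40)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for k in (0, 1, 2, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree=k), LinearRegression())
    train_err = 1 - model.fit(x, y).score(x, y)              # training error (1 - R^2)
    cv_err = 1 - cross_val_score(model, x, y, cv=cv).mean()  # cross-validation error
    print(k, round(train_err, 3), round(cv_err, 3))
# Training error only goes down as the degree grows; the cross-validation
# error is what exposes overfitting.
```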
-> scalar input, continuous
-> vector input, continuous
include more input features (size, distance from zoo)
predict credit score
job? age? assets?
-> distance, vector or scalar