Bayesian Learning

Learn the best hypothesis given the data and some domain knowledge.

Learn the most probable hypothesis H given the data and domain knowledge.

Pr(h|D)
Pr(h|D) = Pr(D|h) Pr(h) / Pr(D) … Bayes' rule
Pr(a,b) = Pr(a|b) Pr(b)
Pr(b,a) = Pr(b|a) Pr(a)

Bayesian Learning
For each h ∈ H
calculate Pr(h|D) = Pr(D|h) Pr(h) / Pr(D)
Output:
h_MAP = argmax_h Pr(h|D)
h_ML = argmax_h Pr(D|h)
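A minimal sketch of this loop in Python (the threshold hypotheses, noise model, and data below are made up for illustration):

```python
# Brute-force Bayesian learning over a small, made-up hypothesis space:
# score each h by Pr(D|h) Pr(h) and return the argmax (h_MAP).
hypotheses = {theta: (lambda x, t=theta: x >= t) for theta in range(1, 11)}
prior = {theta: 1 / len(hypotheses) for theta in hypotheses}     # uniform Pr(h)
data = [(2, False), (5, False), (7, True), (9, True)]            # (x, label) pairs

def likelihood(h, data, noise=0.05):
    """Pr(D|h) under a simple noise model: each label matches h with prob. 1 - noise."""
    p = 1.0
    for x, y in data:
        p *= (1 - noise) if h(x) == y else noise
    return p

# With a uniform prior, h_MAP (argmax Pr(D|h) Pr(h)) coincides with h_ML (argmax Pr(D|h)).
h_map = max(hypotheses, key=lambda t: likelihood(hypotheses[t], data) * prior[t])
print("MAP threshold:", h_map)
```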

VC Dimensions

Infinite Hypothesis Spaces
m >= (1/ε)(ln|H| + ln(1/δ))
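Plugging hypothetical numbers into this bound (note it needs |H| finite, which is exactly what breaks for the spaces below):

```python
from math import ceil, log

# Hypothetical numbers for m >= (1/eps) * (ln|H| + ln(1/delta)):
# |H| = 1000 hypotheses, eps = 0.1 (error bound), delta = 0.05 (failure probability).
H_size, eps, delta = 1000, 0.1, 0.05
m = ceil((1 / eps) * (log(H_size) + log(1 / delta)))
print(m)   # 100 examples suffice under this bound, but only because |H| is finite
```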

Many hypothesis spaces are infinite:
– linear separators
– artificial neural networks
– decision trees (continuous input)

X:{1,2,3,4,5,6,7,8,9,10}
H: h(x) = x>=Θ
|H|=∞

Track all hypotheses? Only the non-negative integer thresholds behave differently on X, so we only need to track those and keep the version space.

X = R
H = {h(x) = x ∈ [a, b]}
parameterized by a, b ∈ R
VC = 2: any two points can be shattered, but no interval can label three points +, -, + (the middle point cannot be excluded).
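A small brute-force check of this claim (a sketch; the finite candidate-endpoint grid and the test points are my own shortcut):

```python
from itertools import product

# Check whether H = {h(x) = 1[a <= x <= b]} shatters a point set,
# i.e. whether every +/- labeling is achievable by some interval [a, b].
def shatters_with_intervals(points):
    pts = sorted(points)
    # Candidate endpoints: one cut outside each end and one between each pair of points.
    cuts = [pts[0] - 1] + [(u + v) / 2 for u, v in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([False, True], repeat=len(pts)):
        achievable = any(
            all((a <= x <= b) == want for x, want in zip(pts, labels))
            for a in cuts for b in cuts
        )
        if not achievable:
            return False
    return True

print(shatters_with_intervals([1.0, 2.0]))       # True: two points can be shattered
print(shatters_with_intervals([1.0, 2.0, 3.0]))  # False: +,-,+ is impossible, so VC = 2
```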

Computational Learning Theory

Mondrian Composition
https://www.khanacademy.org/humanities/ap-art-history/later-europe-and-americas/modernity-ap/a/mondrian-composition
Colored Voronoi Diagram

– support vector machines (SVMs), perceptron
– nearest neighbor (1-NN)
– decision trees

-defining learning problems
-showing that specific algorithms work
-showing that some problems are fundamentally hard

The theory of computing analyzes how algorithms use resources: time and space, e.g. O(n log n) or O(n^2). Computational learning theory asks the same kind of questions about learning, where data is the key resource.

Inductive learning
1.probability of successful training
2.number of examples to train on
3.complexity of hypothesis class
4.accuracy to which target concept is approximated
5.manner in which training examples presented
6.manner in which training examples selected

Computational complexity
– how much computational effort is needed for a learner to converge?
Sample complexity (batch)
– how many training examples are needed for a learner to create a successful hypothesis?
Mistake bounds (online)
– how many misclassifications can a learner make over an infinite run?

true hypothesis: c ∈ H
training set: S ⊆ X
candidate hypothesis: h ∈ H
consistent learner: produces h with c(x) = h(x) for all x ∈ S
version space: VS(S) = {h ∈ H : h consistent with S}, i.e. the hypotheses consistent with the examples
true error: error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]

Support Vector Machines

y = w^T x + b
y is the label; w and b are the parameters of the plane
w^T x + b = 1
w^T x + b = 0
w^T x + b = -1
y ∈ {-1, +1}

w^T x1 + b = 1
w^T x2 + b = -1
w^T (x1 - x2) / ||w|| = 2 / ||w|| … the margin

max 2/||w|| while classifying everything correctly:
y_i (w^T x_i + b) >= 1
equivalently min (1/2)||w||^2 … quadratic programming
dual: W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
s.t. α_i >= 0, Σ_i α_i y_i = 0
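A quick sketch of this setup using scikit-learn's linear SVC (an off-the-shelf solver standing in for the QP; the toy points are made up). The learned w and b recover the margin 2/||w|| and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

# Tiny linearly separable toy set with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],     # class -1
              [4.0, 4.5], [5.0, 5.0], [4.5, 4.0]])    # class +1
y = np.array([-1, -1, -1, +1, +1, +1])

# A large C approximates the hard-margin problem: min 1/2 ||w||^2 s.t. y_i(w.x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("margin 2/||w||:", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)  # only these points determine the plane
```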

SVMs: Linearly Separable Case
– margins: bigger margin -> better generalization, less overfitting
– bigger is better
– finding the maximum margin is an optimization problem: a QP
– support vectors: only the points on the margin matter

Ensemble Learning: Boosting

Spam email: classify as {+, -}
Sample rules: body contains "manly" -> +, from spouse -> -, message is short -> +, just URLs -> +, just an image -> +, contains "porn" -> +, "make money easy" -> +

1. Learn over a subset of the data -> get a rule: pick the subset uniformly at random and apply a learner
2. Combine the simple rules into a complex rule -> e.g. take the mean

Boosting
– focus on the "hardest" examples
– combine with a weighted mean

Error: mismatches
error_D(h) = Pr_D[h(x) ≠ c(x)]
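A minimal illustration with scikit-learn's AdaBoostClassifier (its default weak learner is a depth-1 decision stump, roughly one simple rule); the toy "spam" features are invented for the sketch:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier   # assumes scikit-learn is available

# Made-up toy "spam" data: each row is [just_URLs, from_spouse, is_short].
X = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1],
              [0, 1, 1], [1, 0, 1], [0, 1, 0], [1, 1, 0]])
y = np.array([+1, -1, +1, +1, -1, +1, -1, -1])    # +1 = spam, -1 = not spam

# Each round fits a weak learner, reweights the examples it got wrong (the "hardest"
# ones), and the final hypothesis is a weighted vote over the rounds.
boost = AdaBoostClassifier(n_estimators=10).fit(X, y)

print(boost.predict([[1, 0, 1]]))                         # classify a new message
print("training error:", np.mean(boost.predict(X) != y))  # Pr_D[h(x) != c(x)] on D
```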

Instance Based Learning

Before
learn from examples (x1, y1), …, (xn, yn) => a function f(x)

Now
store the examples (x1, y1), …, (xn, yn)
f(x) = lookup(x)

+ remembers the data exactly, fast, simple
– no real generalization, tends to overfit

distance stands in for similarity

k-NN
Given: training data D = {(xi, yi)}
distance metric d(q, x)
number of neighbors k
query point q
– NN = {i : d(q, xi) is among the k smallest}
– Return:
– classification: plurality (majority) vote over {yi : i ∈ NN}
– regression: mean of {yi : i ∈ NN}
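A direct translation of this procedure into Python (a sketch; the Euclidean default distance and the toy data are assumptions):

```python
import numpy as np

# Minimal k-NN: find the k points closest to the query q under a distance metric d,
# then vote (classification) or average (regression).
def knn(D, q, k, d=lambda a, b: np.linalg.norm(a - b), task="classification"):
    xs, ys = D
    idx = np.argsort([d(q, x) for x in xs])[:k]          # indices of the k smallest distances
    neighbors = [ys[i] for i in idx]
    if task == "classification":
        vals, counts = np.unique(neighbors, return_counts=True)
        return vals[np.argmax(counts)]                   # plurality vote
    return float(np.mean(neighbors))                     # regression: mean of the neighbors

xs = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
ys = np.array([0, 0, 1, 1])
print(knn((xs, ys), np.array([0.5, 0.5]), k=3))          # -> 0
```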

Preference Bias
+ locality -> near points are similar
+ smoothness -> averaging
+ all features matter equally

Curse of dimensionality
As the number of features (dimensions) grows, the amount of data we need to generalize accurately grows exponentially.
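A back-of-envelope illustration (my own numbers, not from the notes):

```python
# To cover the unit cube [0,1]^d at a resolution of 0.1 per axis you need 10^d cells,
# so the data required to keep neighbors "near" grows exponentially with d.
for d in [1, 2, 3, 10, 20]:
    print(f"d={d:2d}: {10**d:>25,} cells at resolution 0.1")
```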

Neural Networks

Neural Networks
cell body, neuron, axon, synapses, spike trains
computational unit
Artificial Neural networks

x1, x2, x3 -> unit (perceptron, threshold Θ) -> y
activation: Σi xi·wi >= Θ -> the unit fires
output: fires -> y = 1, otherwise y = 0

e.g.
x1 = 1, x2 = 0, x3 = -1.5, w1 = 1/2, w2 = 3/5, w3 = 1
activation = 1·(1/2) + 0·(3/5) + (-1.5)·1 = -1
Θ = 0
-1 < 0, so the output is y = 0
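The same unit as a few lines of Python, reproducing the worked example:

```python
# Perceptron unit: fire (y = 1) when the activation sum(w_i * x_i) meets the threshold theta.
def perceptron(x, w, theta):
    activation = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if activation >= theta else 0

x = [1, 0, -1.5]
w = [1/2, 3/5, 1]
print(perceptron(x, w, theta=0))   # activation = 0.5 + 0 - 1.5 = -1 < 0, so y = 0
```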

How powerful is a perceptron unit?

Regression & Classification

Regression
supervised learning: take examples of inputs and outputs. Now, given a new input, predict its output.

Mapping continuous inputs to outputs.
outputs can be discrete (classification) or continuous (regression)

child height vs. parent height
slope 2/3 < 1 -> "regression to the mean" (the historical origin of the term; distinct from reinforcement learning)

Regression in machine learning
Finding the best constant function: f(x) = c
E(c) = Σ_{i=1}^{n} (y_i - c)^2 … the loss / error
Order of polynomial: k = 0 constant, k = 1 line, k = 2 parabola
f(x) = c_0 + c_1 x + c_2 x^2 + … + c_k x^k … polynomial regression
c_0 + c_1 x + c_2 x^2 + c_3 x^3 = y

Errors
Training data has errors: we are not modeling f, but f + ε
Where do the errors come from? e.g. sensor error

Cross Validation
Fundamental assumption: training and test data come from the same underlying distribution (IID); use a model that is complex enough to fit the data without causing problems on the test set
– training error
– cross-validation error

-> scalar input, continuous output
-> vector input, continuous output
include more input features (size, distance from zoo)
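A small sketch of polynomial regression with a held-out split (the sine target, noise level, and split are made up; numpy's polyfit stands in for the least-squares solve):

```python
import numpy as np

# Fit polynomials of increasing order k by least squares (minimizing
# E = sum_i (y_i - f(x_i))^2) and compare training error vs. held-out error.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # f plus noise epsilon

train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)           # simple even/odd split
for k in [0, 1, 2, 5, 9]:
    coeffs = np.polyfit(x[train], y[train], deg=k)               # polynomial regression
    err = lambda idx: np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2)
    print(f"k={k}: train error {err(train):.3f}, held-out error {err(test):.3f}")
```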

predict credit score
job? age? assets?
-> distance, vector or scalar

ID3

LOOP:
– A <- best attribute
– Assign A as the decision attribute for Node
– For each value of A, create a descendant of Node
– Sort training examples to the leaves
– If the examples are perfectly classified, stop; else iterate over the leaves

Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v)
Entropy(S) = -Σ_v P(v) log P(v)

ID3: Bias (inductive bias)
Restriction bias: the hypothesis space H (decision trees)
Preference bias:
– good splits near the top
– correct over incorrect
– shorter trees

Decision trees: other considerations
– continuous attributes? e.g. age, weight, distance
– when do we stop?
-> everything classified correctly
-> no more attributes
-> no overfitting
– regression?
splitting on variance?
output: average, or a local linear fit
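The two quantities ID3 relies on, as a short sketch (the toy example format is hypothetical):

```python
import math
from collections import Counter

# Entropy(S) = -sum_v P(v) log2 P(v) and
# Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) Entropy(S_v).
def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def gain(examples, labels, attribute):
    """examples: list of dicts; attribute: the key to split on (toy format)."""
    total = entropy(labels)
    for v in set(e[attribute] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attribute] == v]
        total -= (len(subset) / len(labels)) * entropy(subset)
    return total

examples = [{"occupied": True}, {"occupied": True}, {"occupied": False}, {"occupied": False}]
labels = ["yes", "yes", "no", "no"]
print(gain(examples, labels, "occupied"))   # 1.0: this attribute splits the data perfectly
```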

Decision Trees

Supervised Learning
classification: true or false
regression

Credit history: lend money? -> classification: binary task

classification learning
– instances (inputs)
– concept: a function from inputs to {T, F}
– target concept: the actual answer we want
– hypothesis class: the set of all functions we are willing to consider
– sample (training set)
– candidate: a concept that might be the target concept
– testing set

Decision Tree
example attributes for a restaurant:
– type: Italian, French, Thai
– atmosphere: fancy, hole-in-the-wall, casual
– occupied?
– hot date?
– cost, hungry, raining

a node represents an attribute; its outgoing edges represent that attribute's values

representation vs algorithm

Decision Trees: Learning
1. Pick the best attribute
(best ≈ splits the data well)
2. Ask the question
3. Follow the answer path
4. Go to 1
…until you arrive at an answer
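The same loop run through scikit-learn's DecisionTreeClassifier (a library stand-in; the restaurant-style encoding below is invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_text  # assumes scikit-learn

# The fitted tree repeatedly picks the attribute that best splits the data
# and asks that question first.
# Toy encoding (hypothetical): columns = [occupied, hot_date, raining], target = go?
X = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["occupied", "hot_date", "raining"]))
print(tree.predict([[1, 1, 1]]))   # follow the answer path for a new instance
```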

Decision Trees: Expressiveness
Boolean
A and B, A or B, A xor B

n-OR: "any"; n-XOR: parity (odd)

XOR is hard. With n boolean attributes (roughly n! ways to order them), how many trees are there? The output is boolean.

Truth table
a1  a2  a3  …  an | output
T   T   T   …  T  | ?
T   T   F   …  T  | ?
…
The table has 2^n rows, and each row's output can independently be T or F, so there are 2^(2^n) possible boolean functions (candidate trees).
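A tiny enumeration of this table (my own illustration) that also fills in the n-OR and n-XOR outputs and counts the possible functions:

```python
from itertools import product

# Enumerate the full truth table over n boolean attributes (2^n rows) and fill in
# the n-OR ("any") and n-XOR (parity: odd number of trues) outputs.
n = 3
for row in product([True, False], repeat=n):
    n_or = any(row)
    n_xor = sum(row) % 2 == 1
    print(row, "| OR:", n_or, "| XOR(parity):", n_xor)

# 2^n rows, each of which can get either output, so 2**(2**n) distinct boolean
# functions (possible decision trees) over n attributes.
print("rows:", 2**n, "functions:", 2**(2**n))
```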