-classification
-clustering
-dimensionality reduction
-graph analysis
Reinforcement-learning basics
Agent, Environment
a, r, s
Representing and reasoning with probabilities
Bayesian Networks
Joint Distribution
storm  lightning  Pr(storm, lightning)  Pr(..., thunder=T)  Pr(..., thunder=F)
T      T          .25                   .20                 .05
T      F          .40                   .04                 .36
F      T          .05                   .04                 .01
F      F          .30                   .03                 .27
Pick a random summer day at 2 pm and look outside:
Pr(¬storm) = 0.35
Pr(lightning|storm) = .25/.65 ≈ .3846
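These two numbers can be checked with a few lines of Python over the joint table above:

```python
# Joint distribution Pr(storm, lightning) from the table above.
joint = {
    (True, True): 0.25,
    (True, False): 0.40,
    (False, True): 0.05,
    (False, False): 0.30,
}

# Marginal: Pr(storm) = sum over lightning.
p_storm = sum(p for (s, _), p in joint.items() if s)
print(1 - p_storm)                     # Pr(not storm) = 0.35

# Conditional: Pr(lightning | storm) = Pr(storm, lightning) / Pr(storm).
print(joint[(True, True)] / p_storm)   # 0.25 / 0.65 ≈ 0.3846
```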
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
∀x, y, z  P(X=x|Y=y, Z=z) = P(X=x|Z=z)
more compactly, we write
P(X|Y,Z) = P(X|Z)
sampling
a distribution is used for two things: the probability of a value, and generating values
simulation of a complex process
approximate inference
P(x) = Σy P(x, y)
P(x, y) = P(x) P(y|x)
P(y|x) = P(x|y) P(y) / P(x)
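A minimal sketch of sampling for approximate inference in Python, using the numbers implied by the table above (Pr(thunder|lightning) = .8, Pr(thunder|¬lightning) = .1):

```python
import random

# Ancestral sampling from the storm -> lightning -> thunder network.
# All numbers follow from the joint table above.
def sample():
    storm = random.random() < 0.65                        # Pr(storm) = .65
    p_lightning = 0.25 / 0.65 if storm else 0.05 / 0.35   # Pr(lightning | storm)
    lightning = random.random() < p_lightning
    p_thunder = 0.8 if lightning else 0.1                 # Pr(thunder | lightning)
    thunder = random.random() < p_thunder
    return storm, lightning, thunder

# Approximate inference: estimate Pr(thunder) by simulation.
samples = [sample() for _ in range(100_000)]
print(sum(t for _, _, t in samples) / len(samples))       # ~0.31
```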
Learn the best hypothesis given data
+ some domain knowledge
Learn the most probable hypothesis given data
+ domain knowledge
Pr(h|D)
Pr(h|D) = Pr(D|h)*Pr(h) / Pr(D) … Bayes’ rule
Pr(a,b) = Pr(a|b)P(b)
Pr(b,a) = Pr(b|a)P(a)
Bayesian Learning
For each h ∈ H
calculate Pr(h|D) = P(D|h)P(h)/P(D)
Output:
h_MAP = argmax_h Pr(h|D)
h_ML = argmax_h Pr(D|h)   (maximum likelihood; same as MAP when the prior is uniform)
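A toy sketch of this loop in Python; the threshold hypothesis class (as in the example further down), the data set, and the noise model are illustrative assumptions:

```python
import numpy as np

# Toy Bayesian learning: hypotheses are thresholds theta in {1..10},
# h(x) = (x >= theta), matching the threshold example later in these notes.
thetas = np.arange(1, 11)
prior = np.full(len(thetas), 1 / len(thetas))          # uniform Pr(h)

# Noisy training data: (x, label) pairs; the true theta is assumed to be 6.
D = [(2, 0), (9, 1), (5, 0), (7, 1), (6, 1)]

def likelihood(theta, data, noise=0.05):
    # Pr(D|h): each label agrees with h(x) with probability 1 - noise.
    p = 1.0
    for x, y in data:
        p *= (1 - noise) if (x >= theta) == y else noise
    return p

post = np.array([likelihood(t, D) for t in thetas]) * prior
post /= post.sum()                                     # Pr(h|D), up to Pr(D)
print(thetas[np.argmax(post)])                         # h_MAP -> 6
```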
Infinite Hypothesis Spaces
m ≥ (1/ε)(ln|H| + ln(1/δ))
Many hypothesis spaces are infinite:
– linear separators
– artificial neural networks
– decision trees (continuous input)
X:{1,2,3,4,5,6,7,8,9,10}
H: h(x) = (x ≥ θ)
|H|=∞
Track all hypotheses? Only need to track the non-negative integers 1–10; keep the version space
X = ℝ
H = {h(x) = (x ∈ [a, b])}
parameterized by a, b ∈ ℝ
VC = 2
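A brute-force check in Python that intervals shatter two points but not three, which is what VC = 2 means; the candidate-endpoint trick is my own illustration:

```python
from itertools import product

def shatters(points):
    # H = intervals [a, b]; `points` is shattered if every labeling is achievable.
    pts = sorted(points)
    # Candidate endpoints: just outside the data and midway between neighbors.
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([False, True], repeat=len(pts)):
        achievable = any(
            all((a <= x <= b) == want for x, want in zip(pts, labels))
            for a in cands for b in cands if a <= b
        )
        if not achievable:
            return False
    return True

print(shatters([1.0, 2.0]))        # True  -> intervals shatter 2 points
print(shatters([1.0, 2.0, 3.0]))   # False -> (T, F, T) is impossible, so VC = 2
```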
Mondrian Composition
https://www.khanacademy.org/humanities/ap-art-history/later-europe-and-americas/modernity-ap/a/mondrian-composition
Colored Voronoi Diagram
support vector machines (SVMs), perceptron
Nearest neighbor 1-NN
decision trees
-defining learning problems
-showing specific algorithms work
-showing these problems are fundamentally hard
Theory of computing analyzes how algorithms use resources: time, space, e.g. O(n log n), O(n^2)
Inductive learning
1.probability of successful training
2.number of examples to train on
3.complexity of hypothesis class
4.accuracy to which target concept is approximated
5.manner in which training examples are presented
6.manner in which training examples are selected
computational complexity
– how much computational effort is needed for a learner to converge?
sample complexity – batch
– how many training examples are needed for a learner to create a successful hypothesis?
mistake bounds – online
– how many misclassifications can a learner make over an infinite run?
true hypothesis: c ∈ H
training set: S ⊆ X
candidate hypothesis: h ∈ H
consistent learner: produces h with c(x) = h(x) for x ∈ S
version space: VS(S) = {h ∈ H : h consistent with S}, i.e. the hypotheses consistent with the examples
error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]
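A small Python sketch tying these definitions to the threshold class from earlier, together with the sample-complexity bound above; the choices of target, training set, ε, and δ are illustrative:

```python
import math

# Hypothesis class: h_theta(x) = (x >= theta), theta in {1..10}; target c = h_6.
H = list(range(1, 11))
c = lambda x: x >= 6

# Training set S and the version space: hypotheses consistent with S.
S = [2, 9, 5, 7]
VS = [t for t in H if all((x >= t) == c(x) for x in S)]
print(VS)                               # thetas still consistent -> [6, 7]

# Sample-complexity bound: m >= (1/eps) * (ln|H| + ln(1/delta))
eps, delta = 0.1, 0.05
m = (1 / eps) * (math.log(len(H)) + math.log(1 / delta))
print(math.ceil(m))                     # 53 examples suffice
```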
y = w^T x + b
y: label; w, b: parameters of the plane
w^T x + b = 1
w^T x + b = 0
w^T x + b = -1
y ∈ {-1, +1}
w^T x1 + b = 1
w^T x2 + b = -1
w^T(x1 − x2)/||w|| = 2/||w||   (the margin)
max 2/||w|| while classifying everything correctly:
yi(w^T xi + b) ≥ 1
equivalently, min ½||w||^2: a quadratic programming problem
W(α) = Σi αi − ½ Σi,j αi αj yi yj xi^T xj
s.t. αi ≥ 0, Σi αi yi = 0
SVMs: Linearly Separable
– margins: generalization vs. overfitting
– bigger margin is better
– optimization problem for finding the max margin: a QP (see the sketch below)
– support vectors
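A minimal sketch of the max-margin classifier on made-up linearly separable data, using scikit-learn's SVC (which solves the QP internally); a large C approximates the hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable data set; labels in {-1, +1}.
X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 3], [4, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C approximates the hard-margin max-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(clf.support_vectors_)      # the points with alpha_i > 0
print(2 / np.linalg.norm(w))     # the margin 2/||w||
```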
spam email {+, -}
sample rules: body "manly" +, from spouse -, short +, just URLs +, just an image +, "p0rn" +, "make money easy" +
1. Learn over a subset of the data -> a rule: pick data uniformly at random and apply a learner
2. Combine the rules into a complex rule -> e.g. take the mean
Boosting
“hardest” examples
weighted mean
Error: mismatches
PrD[h(x) ≠ c(x)]
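A compact sketch of the boosting idea using scikit-learn's AdaBoostClassifier; the synthetic data stands in for the spam example, and the default weak learner is a decision stump:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data standing in for the spam example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# Boosting: each round reweights the "hardest" (currently misclassified)
# examples, fits a weak learner (default: a decision stump), and the final
# classifier is a weighted vote of the weak learners.
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(boost.score(X, y))     # training accuracy
```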
Before: learn a function f from the training data
Now: just store the data
f(x) = lookup(x)
+ remember, fast, simple
– generalization, overfit
distance stands for similarity
K-NN
Given: Training data D = {xi, yi}
distance metric d(q, x)
number of neighbors k
query point q
– NN = {i : d(q, xi) among the k smallest}
– Return
– classification: plurality vote
– regression: mean
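A small sketch of this procedure in Python/NumPy; the toy data and the Euclidean metric are illustrative choices:

```python
import numpy as np
from collections import Counter

def knn(D_x, D_y, q, k=3, classify=True):
    # d(q, x): Euclidean distance to every training point.
    dists = np.linalg.norm(D_x - q, axis=1)
    # NN = indices of the k smallest distances.
    nn = np.argsort(dists)[:k]
    if classify:
        return Counter(D_y[nn]).most_common(1)[0][0]   # plurality vote
    return float(np.mean(D_y[nn]))                     # regression: mean

D_x = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
D_y = np.array([0, 0, 0, 1, 1, 1])
print(knn(D_x, D_y, np.array([0.5, 0.5])))   # -> 0
print(knn(D_x, D_y, np.array([5.5, 5.0])))   # -> 1
```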
Preference Bias
+ locality -> near points are similar
+ smoothness -> averaging
+ all features matter equally
Curse of dimensionality
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
Neural Networks
cell body, neuron, axon, synapses, spike trains
computational unit
Artificial Neural Networks
x1, x2, x3 -> θ (perceptron unit) -> y
Σi xi·wi (activation) ≥ θ -> firing
output: yes -> y = 1, no -> y = 0
e.g.
x1 = 1, x2 = 0, x3 = -1.5, w1 = 1/2, w2 = 3/5, w3 = 1
1·(1/2) + 0·(3/5) + (-1.5)·1 = -1
θ = 0
output should be y = 0, since the activation -1 < θ
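The same computation as a tiny Python sketch:

```python
# Perceptron unit: fire (output 1) iff the weighted sum reaches the threshold.
def perceptron(x, w, theta):
    activation = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if activation >= theta else 0

# The worked example above: activation = 0.5 - 1.5 = -1 < 0, so y = 0.
print(perceptron(x=[1, 0, -1.5], w=[0.5, 0.6, 1.0], theta=0))   # 0
```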
How powerful is a perceptron unit?
Regression
supervised learning: take examples of inputs and outputs. Now, given a new input, predict its output.
Mapping continuous inputs to outputs.
discrete, continuous
child height vs. parent height
slope ≈ 2/3 < 1: regression to the mean
Reinforcement learning
Regression in machine learning
Finding the best constant function
f(x) = c
E(c) = Σi (yi − c)^2
LOSS, ERROR
dE/dc = −2 Σi (yi − c) = 0  ->  c = mean(y)
Order of polynomial
k = 0:constant
k = 1:line
k = 2:parabola
f(x) = c0 + c1x + c2x^2 + ... + ckx^k
polynomial regression
c0 + c1x + c2x^2 + c3x^3 = y
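A quick sketch of fitting such a polynomial by least squares with NumPy; the cubic-plus-noise data is made up:

```python
import numpy as np

# Synthetic data: a cubic plus noise (f + epsilon).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 1 + 2 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.2, size=x.shape)

# Solve c0 + c1*x + c2*x^2 + c3*x^3 ≈ y in the least-squares sense.
k = 3
X = np.vander(x, k + 1, increasing=True)        # columns: 1, x, x^2, x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                                   # ~[1, 2, -0.5, 0.3]
```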
Errors
Training data has errors: we are not modeling f, but f + ε. Where do the errors come from?
sensor error
Cross Validation
Fundamental assumption
use a model that is complex enough to fit the data without causing problems on the test set
-training error
-cross validation error
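A minimal sketch comparing training error with cross-validation error as the polynomial degree grows (Python, scikit-learn; the quadratic-plus-noise data is made up):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data: a quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = 1 + 2 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(scale=0.3, size=40)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for k in (0, 1, 2, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree=k), LinearRegression())
    train_err = 1 - model.fit(x, y).score(x, y)              # training error (1 - R^2)
    cv_err = 1 - cross_val_score(model, x, y, cv=cv).mean()  # cross-validation error
    print(k, round(train_err, 3), round(cv_err, 3))
# Training error only goes down as the degree grows; the cross-validation
# error is what exposes overfitting.
```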
-> scalar input, continuous
-> vector input, continuous
include more input features (size, distance from zoo)
predict credit score
job? age? assets?
-> distance, vector or scalar