information gain = entropy(parent) – [weighted average] entropy(children)
decision tree algorithm: maximize information gain
>>> import math
>>> -2/3.*math.log(2/3., 2) - 1/3.*math.log(1/3., 2)
0.9182958340544896
entropy(children) = 3/4(0.9184) + 1/4(0) = 0.6888
information gain = 1.0 - 0.6888 = 0.3112
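A small reusable version of the same calculation (the function names and the list-of-lists split format are my own sketch, not from the lesson):

import math

def entropy(labels):
    # entropy = sum over classes of -p_i * log2(p_i), p_i = fraction of examples in class i
    total = float(len(labels))
    return sum(-labels.count(c)/total * math.log(labels.count(c)/total, 2)
               for c in set(labels))

def information_gain(parent, children):
    # entropy(parent) minus the weighted average entropy of the children
    weighted = sum(len(child)/float(len(parent)) * entropy(child) for child in children)
    return entropy(parent) - weighted

# the grade split from the notes: parent labels ssff, one child with ssf and one with f
print(information_gain(list("ssff"), [list("ssf"), list("f")]))   # ~0.3113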
Adapt to circumstances. ABCD: Always Be Coding and … : good
Entropy: controls how a DT decides where to split the data
definition: measure of impurity in a bunch of examples
entropy = Σ_i -p_i * log2(p_i)
p_i is the fraction of examples in class i
all examples are in the same class -> entropy = 0
examples are evenly split between classes -> entropy = 1.0
features: grade, bumpiness, speed limit; label: speed
the four training examples have speed labels s s f f (slow, slow, fast, fast)
p_slow = 2/4 = 0.5, p_fast = 2/4 = 0.5
entropy
>>> import math
>>> -0.5*math.log(0.5, 2) - 0.5*math.log(0.5, 2)
1.0
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from sklearn import tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

# one tree that may split down to 2 samples, one that needs at least 50 samples to split
clf_2 = tree.DecisionTreeClassifier(min_samples_split=2)
clf_2.fit(features_train, labels_train)
acc_min_samples_split_2 = accuracy_score(clf_2.predict(features_test), labels_test)

clf_50 = tree.DecisionTreeClassifier(min_samples_split=50)
clf_50.fit(features_train, labels_train)
acc_min_samples_split_50 = accuracy_score(clf_50.predict(features_test), labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}
Decision Trees: very popular; among the oldest and most useful classifiers
-> trick: chain simple linear questions together to do non-linear decision making
windsurfing example
linearly separable?
Decision Trees:
ask a series of questions with two possible outcomes (yes or no) to classify the data
e.g. X < 3, then Y < 2
sklearn decision tree (classification): http://scikit-learn.org/stable/modules/tree.html
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[ 0.,  1.]])

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
>>> with open("iris.dot", 'w') as f:
...     f = tree.export_graphviz(clf, out_file=f)

>>> import os
>>> os.unlink('iris.dot')

>>> import pydotplus
>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = pydotplus.graph_from_dot_data(dot_data)
>>> graph.write_pdf("iris.pdf")

>>> from IPython.display import Image
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                                 feature_names=iris.feature_names,
...                                 class_names=iris.target_names,
...                                 filled=True, rounded=True,
...                                 special_characters=True)
>>> graph = pydotplus.graph_from_dot_data(dot_data)
>>> Image(graph.create_png())
DT decision boundary
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from sklearn import tree
from sklearn.metrics import accuracy_score
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

# fit the decision tree and measure its accuracy on the test set
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)
acc = accuracy_score(clf.predict(features_test), labels_test)

def submitAccuracies():
    return {"acc": round(acc, 3)}
x, y -> svm -> label
z= x^2 + y^2
Kernel Trick
x, y -> x1, x2, x3, x4, x5
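A small sketch of the idea (the toy data and variable names below are mine, not from the lesson): points inside vs. outside a circle are not linearly separable in (x, y), but after adding the extra feature z = x^2 + y^2 a plain linear SVM separates them; an rbf kernel gets a similar effect without building the feature by hand.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))             # points in the square [-1, 1] x [-1, 1]
y = (X[:, 0]**2 + X[:, 1]**2 > 0.5).astype(int)   # label: outside (1) or inside (0) a circle

linear_xy = SVC(kernel="linear").fit(X, y)        # linear boundary in (x, y): struggles

Z = np.c_[X, X[:, 0]**2 + X[:, 1]**2]             # add the new feature z = x^2 + y^2
linear_xyz = SVC(kernel="linear").fit(Z, y)       # a plane in (x, y, z) now separates the classes

rbf_xy = SVC(kernel="rbf").fit(X, y)              # kernel trick: similar effect, no manual feature

print(linear_xy.score(X, y), linear_xyz.score(Z, y), rbf_xy.score(X, y))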
SVM γ(gamma) parameter
γ defines how far the influence of a single training example reaches
low values: influence reaches far (smoother decision boundary)
high values: influence reaches only nearby points (boundary hugs individual examples)
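A hedged sketch of what that looks like in scikit-learn (the toy data is mine): with a very large gamma only the closest points matter, so the training score climbs toward 1.0 while the boundary gets wiggly and overfit.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.uniform(0, 1, size=(300, 2))                           # noisy two-class toy data
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.3, 300) > 1).astype(int)

for gamma in [1.0, 1000.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # small gamma -> broad influence, smooth boundary; large gamma -> boundary follows single points
    print(gamma, clf.score(X, y))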
Overfitting: avoid it by tuning the kernel, C, and γ parameters
# keep only the first 1% of the training data (smaller training set -> much faster SVM training)
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]
SVM Support Vector Machine
Maximizes distance to nearest point
= margin
if you go to a machine learning party, everybody talks about machine learning
SVMs – Outliers
SVM in SKlearn
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
import numpy as np
import pylab as pl
from prep_terrain_data import makeTerrainData
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

features_train, labels_train, features_test, labels_test = makeTerrainData()

clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
acc = accuracy_score(pred, labels_test)

def submitAccuracy():
    return acc
Example P(C)=0.01
test:
90% it is positive if you have cancer (sensitivity)
90% it is negative if you don’t have cancer (specificity)
Bayes Rule
prior probability * test evidence -> posterior probability
prior: P(c) = 0.01 = 1%
P(Positive|Cancer) = 0.9 = 90%
P(Neg|¬C)=0.9, P(positive|¬cancer) = 0.1
joint: P(Cancer, Positive) = P(Cancer) * P(Positive|Cancer) = 0.01 * 0.9 = 0.009
joint: P(¬Cancer, Positive) = P(¬Cancer) * P(Positive|¬Cancer) = 0.99 * 0.1 = 0.099
normalizer: P(Positive) = 0.009 + 0.099 = 0.108
posterior: P(Cancer|Positive) = 0.009 / 0.108 ≈ 0.083, P(¬Cancer|Positive) = 0.099 / 0.108 ≈ 0.917
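The same arithmetic as a few lines of Python (variable names are my own):

p_cancer = 0.01                 # prior P(C)
p_pos_given_cancer = 0.9        # sensitivity
p_pos_given_no_cancer = 0.1     # 1 - specificity

joint_cancer = p_cancer * p_pos_given_cancer               # 0.009
joint_no_cancer = (1 - p_cancer) * p_pos_given_no_cancer   # 0.099
p_pos = joint_cancer + joint_no_cancer                     # 0.108, the normalizer

print(joint_cancer / p_pos)     # posterior P(Cancer|Positive) ~ 0.083
print(joint_no_cancer / p_pos)  # posterior P(not Cancer|Positive) ~ 0.917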
Text Learning – Naive Bayes
Chris: love 1, deal 8, life 1
Sara: love 3, deal 2, life 3
P(Chris) = 0.5
P(Sara) = 0.5
Sara uses "love" and "life" more frequently.
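A rough sketch of how those counts turn into a prediction. I am treating the numbers above as raw word tallies and normalizing them into per-author word probabilities, which is an assumption on my part, not something the notes state:

# word tallies from the notes above (assumed to be raw counts per author)
counts = {"Chris": {"love": 1, "deal": 8, "life": 1},
          "Sara":  {"love": 3, "deal": 2, "life": 3}}
priors = {"Chris": 0.5, "Sara": 0.5}

def word_probs(tally):
    total = float(sum(tally.values()))
    return {w: c / total for w, c in tally.items()}

def score(author, words):
    # naive Bayes: prior times the product of per-word probabilities
    p = priors[author]
    probs = word_probs(counts[author])
    for w in words:
        p *= probs[w]
    return p

email = ["love", "life"]
scores = {a: score(a, email) for a in counts}
total = sum(scores.values())
for a in scores:
    print(a, scores[a] / total)   # normalized posterior; Sara comes out more likely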
def NBAccuracy(features_train, labels_train, features_test, labels_test):
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score
    # create, fit, and evaluate the classifier
    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(pred, labels_test)
    return accuracy

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
#!/usr/bin/python

from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture, output_image
from ClassifyNB import classify

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

# split the training points by label so the two classes can be plotted in different colors
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

clf = classify(features_train, labels_train)

# draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)
output_image("test.png", "png", open("test.png", "rb").read())
#!/usr/bin/python
#from ***plots import *

import warnings
warnings.filterwarnings("ignore")

import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # plot the decision boundary: assign a color to each point in the mesh
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # also plot the test points, split by label
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color="b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color="r", label="slow")
    plt.legend()
    plt.xlabel("bumpiness")
    plt.ylabel("grade")

    plt.savefig("test.png")

import base64
import json
import subprocess

def output_image(name, format, bytes):
    image_start = "BEGIN_IMAGE_f9825uweof8jw9fj4r8"
    image_end = "END_IMAGE_0238jfw08fjsiufhw8frs"
    data = {}
    data['name'] = name
    data['format'] = format
    data['bytes'] = base64.encodestring(bytes)
    print image_start+json.dumps(data)+image_end
#!/usr/bin/python
import random

def makeTerrainData(n_points=1000):
    random.seed(42)
    grade = [random.random() for ii in range(0,n_points)]
    bumpy = [random.random() for ii in range(0,n_points)]
    error = [random.random() for ii in range(0,n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
    # very steep or very bumpy terrain is always slow
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

    # split into training and test sets (75% / 25%)
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test = X[split:]
    y_train = y[0:split]
    y_test = y[split:]

    # training points separated by class (kept here but not returned)
    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

    # test points separated by class
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig},
                 "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test
Bumpiness: smooth – very bumpy
Slope: flat – very steep
is a new point more like a red X or a blue circle? that question is what matters most in machine learning
Decision surface: Linear
Naive Bayes
Zooming ahead on supervised classification with python!
goal: draw decision boundary
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
the example code comes with that page; just run it in a Python interpreter
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB(priors=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, np.unique(Y))
GaussianNB(priors=None)
>>> print(clf_pf.predict([[-0.8, -1]]))
[1]
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.scatter(grade_sig, bumpy_sig, color="b", label="fast!")
plt.scatter(grade_bkg, bumpy_bkg, color="r", label="slow")
plt.legend()
plt.xlabel("bumpiness")
plt.ylabel("grade")
plt.show()

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
#coding:utf-8
import math
import sys
from collections import defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes"""
    def __init__(self):
        self.categories = set()
        self.vocabularies = set()
        self.wordcount = {}
        self.catcount = {}
        self.denominator = {}

    def train(self, data):
        """Train the Naive Bayes classifier"""
        # extract the categories from the document collection and initialize the dictionaries
        for d in data:
            cat = d[0]
            self.categories.add(cat)
        for cat in self.categories:
            self.wordcount[cat] = defaultdict(int)
            self.catcount[cat] = 0
        # count categories and words over the document collection
        for d in data:
            cat, doc = d[0], d[1:]
            self.catcount[cat] += 1
            for word in doc:
                self.vocabularies.add(word)
                self.wordcount[cat][word] += 1
        # precompute the denominators of the conditional word probabilities (for speed)
        for cat in self.categories:
            self.denominator[cat] = sum(self.wordcount[cat].values()) + len(self.vocabularies)

    def classify(self, doc):
        """Return the category with the largest log posterior log(P(cat|doc))"""
        best = None
        max = -sys.maxint
        for cat in self.catcount.keys():
            p = self.score(doc, cat)
            if p > max:
                max = p
                best = cat
        return best

    def wordProb(self, word, cat):
        """Conditional word probability P(word|cat)"""
        # apply Laplace smoothing
        # wordcount[cat] is a defaultdict(int), so a word never seen in the category defaults to 0
        return float(self.wordcount[cat][word] + 1) / float(self.denominator[cat])

    def score(self, doc, cat):
        """Log posterior log(P(cat|doc)) of a category given a document"""
        total = sum(self.catcount.values())  # total number of documents
        score = math.log(float(self.catcount[cat]) / total)  # log P(cat)
        for word in doc:
            # taking logs turns the product into a sum
            score += math.log(self.wordProb(word, cat))  # log P(word|cat)
        return score

    def __str__(self):
        total = sum(self.catcount.values())  # total number of documents
        return "documents: %d, vocabularies: %d, categories: %d" % (total, len(self.vocabularies), len(self.categories))

if __name__ == "__main__":
    # Introduction to Information Retrieval, example 13.2
    data = [["yes", "Chinese", "Beijing", "Chinese"],
            ["yes", "Chinese", "Chinese", "Shanghai"],
            ["yes", "Chinese", "Macao"],
            ["no", "Tokyo", "Japan", "Chinese"]]
    # train the Naive Bayes classifier
    nb = NaiveBayes()
    nb.train(data)
    print nb
    print "P(Chinese|yes) =", nb.wordProb("Chinese", "yes")
    print "P(Tokyo|yes) =", nb.wordProb("Tokyo", "yes")
    print "P(Japan|yes) =", nb.wordProb("Japan", "yes")
    print "P(Chinese|no) =", nb.wordProb("Chinese", "no")
    print "P(Tokyo|no) =", nb.wordProb("Tokyo", "no")
    print "P(Japan|no) =", nb.wordProb("Japan", "no")
    test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
    print "log P(yes|test) =", nb.score(test, "yes")
    print "log P(no|test) =", nb.score(test, "no")
    print nb.classify(test)