Regression

Continuous supervised learning
discrete output: a class label (e.g. fast / slow)
continuous output: a number (e.g. net worth)

sklearn regression
http://scikit-learn.org/stable/modules/linear_model.html

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> reg.coef_
array([ 0.5,  0.5])

#!/usr/bin/python

import numpy
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from studentRegression import studentReg
from class_vis import prettyPicture, output_image

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

reg = studentReg(ages_train, net_worths_train)

plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")

# studentRegression.py
def studentReg(ages_train, net_worths_train):
	# fit a linear regression of net worth on age and return the fitted model
	from sklearn.linear_model import LinearRegression
	reg = LinearRegression()
	reg.fit(ages_train, net_worths_train)

	return reg
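
For a single feature, the line that a linear regression fits can also be computed by hand with the ordinary least squares formulas. The sketch below is only an illustration: the ages and net worths are made-up numbers, and fit_line is a hypothetical helper, not part of sklearn or of the course code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on net_worth = 5 * age - 50.
ages = [20, 30, 40, 50]
net_worths = [50, 100, 150, 200]

slope, intercept = fit_line(ages, net_worths)
predicted_at_35 = slope * 35 + intercept
print(slope, intercept, predicted_at_35)
```

With real, noisy data the points no longer sit on the line, and the slope and intercept are the values that minimize the sum of squared errors.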

New algorithm

k nearest neighbors: classic, simple, easy to understand
random forest: "ensemble method", a meta classifier built from (usually) decision trees
AdaBoost: another ensemble method (boosted decision trees)
(previous algorithms: Naive Bayes, SVM, decision tree)
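
To show why k nearest neighbors counts as easy to understand, here is a minimal sketch in plain Python. It assumes Euclidean distance and a majority vote; knn_predict and the sample points are hypothetical and are not taken from sklearn or the course material.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature points labeled "fast" or "slow".
train_points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]
train_labels = ["fast", "fast", "fast", "slow", "slow", "slow"]

near_fast = knn_predict(train_points, train_labels, (0.2, 0.2))
near_slow = knn_predict(train_points, train_labels, (0.8, 0.8))
print(near_fast, near_slow)
```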

Process
1) do some research!
– get a general understanding
2) find sklearn documentation
3) deploy it!
4) use it to make predictions

What is a person of interest?
– indicted
– settled without admitting guilt
– testified in exchange for immunity

More data usually beats a fine-tuned algorithm.

numerical – numerical values (numbers)
categorical – limited number of discrete values (category)
time series – temporal value (date, timestamp)
text – words

Information Gain

information gain = entropy(parent) – [weighted average] entropy(children)
decision tree algorithm: maximize information gain

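>>> import math
>>> -(2/3.0)*math.log(2/3.0, 2) - (1/3.0)*math.log(1/3.0, 2)   # about 0.918, taken as 0.9184 below

entropy(children) = 3/4 (0.9184) + 1/4 (0) = 0.6888
information gain = entropy(parent) - entropy(children) = 1.0 - 0.6888 = 0.3112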
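
The arithmetic above can be checked with a short script. This is a sketch under one assumption: the split sends three of the four examples (slow, slow, fast) to one branch and the remaining fast example to the other, which matches the 3/4 and 1/4 weights. The helper names are hypothetical.

```python
from math import log

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log(p, 2)
    return result

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    total = float(len(parent))
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["slow", "slow", "fast", "fast"]
children = [["slow", "slow", "fast"], ["fast"]]

gain = information_gain(parent, children)
print(round(gain, 4))
```

The printed value is about 0.311. It can differ from the hand calculation in the last digit because the notes round the child entropy to 0.9184 before multiplying.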

Entropy

Entropy: controls how a DT decides where to split the data
definition: measure of impurity in a bunch of examples

entropy = Σ_i -P_i log2(P_i)
P_i is the fraction of examples in class i

all examples are the same class -> entropy = 0
examples are evenly split between two classes -> entropy = 1.0

features: grade, bumpiness, speed limit; label: speed
speed labels of the four examples: s s f f (2 slow, 2 fast)
P_slow = P_fast = 2 / 4 = 0.5

entropy

>>> import math
>>> -0.5*math.log(0.5, 2) - 0.5*math.log(0.5, 2)
1.0
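
The same formula can be wrapped in a small function to see the two limiting cases next to an in-between one. This is a sketch; entropy_from_fractions is a hypothetical helper and the 0.75 / 0.25 split is just an illustrative input.

```python
from math import log

def entropy_from_fractions(fractions):
    """Entropy in bits; classes with a fraction of 0 contribute nothing."""
    return sum(-p * log(p, 2) for p in fractions if p > 0)

pure = entropy_from_fractions([1.0, 0.0])      # all examples in one class
even = entropy_from_fractions([0.5, 0.5])      # two classes, evenly split
skewed = entropy_from_fractions([0.75, 0.25])  # somewhere in between

print(pure, even, round(skewed, 4))
```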

min_samples_split

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

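from sklearn import tree

# fit one tree per min_samples_split value and score each on the test set
clf_2 = tree.DecisionTreeClassifier(min_samples_split=2).fit(features_train, labels_train)
clf_50 = tree.DecisionTreeClassifier(min_samples_split=50).fit(features_train, labels_train)
acc_min_samples_split_2 = clf_2.score(features_test, labels_test)
acc_min_samples_split_50 = clf_50.score(features_test, labels_test)

def submitAccuracies():
	return {"acc_min_samples_split_2":round(acc_min_samples_split_2,3),
		"acc_min_samples_split_50":round(acc_min_samples_split_50,3)}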

Decision Tree

Decision Tree: very popular, one of the oldest and most useful algorithms
-> the trick: non-linear decision making built from simple linear decision surfaces

wind surf
linearly separable?

Decision Trees:
ask a sequence of questions with two outcomes (yes or no) to classify the data.
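
A chain of yes/no questions maps directly onto nested if statements. The sketch below hand-writes such a tree; the thresholds x < 3 and y < 2 and the region names are assumptions chosen for illustration, whereas a real decision tree learns its thresholds from the training data.

```python
def classify(x, y):
    """A two-level decision tree written as nested yes/no questions."""
    if x < 3:                # first question
        if y < 2:            # second question, asked only when x < 3
            return "left-low"
        return "left-high"
    return "right"

print(classify(1, 1), classify(1, 5), classify(4, 1))
```

Each question cuts the plane with a straight line parallel to an axis, and the combination of cuts gives a boundary that is not a single straight line.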

X < 3, Y < 2

sklearn: decision tree
http://scikit-learn.org/stable/modules/tree.html
classification

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[ 0.,  1.]])
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
>>> with open("iris.dot", 'w') as f:
...     f = tree.export_graphviz(clf, out_file=f)
>>> import os
>>> os.unlink('iris.dot')
>>> import pydotplus
>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = pydotplus.graph_from_dot_data(dot_data)
>>> graph.write_pdf("iris.pdf")
>>> from IPython.display import Image
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                          feature_names=iris.feature_names,
...                          class_names=iris.target_names,
...                          filled=True, rounded=True,
...                          special_characters=True)
>>> graph = pydotplus.graph_from_dot_data(dot_data)
>>> Image(graph.create_png())

DT decision boundary

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import numpy as np 
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

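from sklearn import tree

# fit a decision tree on the terrain data and score it on the test set
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)
acc = clf.score(features_test, labels_test)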

def submitAccuracies():
	return {"acc":round(acc,3)}

SVM features

x, y -> svm -> label
new feature: z = x^2 + y^2 (squared distance from the origin)

Kernel Trick
x, y -> x1, x2, x3, x4, x5 (low-dimensional input mapped to a higher-dimensional feature space)
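
The effect of the extra feature z = x^2 + y^2 can be seen on data that no straight line in (x, y) can separate. The sketch below uses two made-up rings of points around the origin; the ring helper and the threshold 2.5 are assumptions for illustration only.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

inner = ring(1.0, 8)   # class 0, surrounded by class 1
outer = ring(2.0, 8)   # class 1

# New feature: squared distance from the origin.
z_inner = [x * x + y * y for x, y in inner]
z_outer = [x * x + y * y for x, y in outer]

# A single threshold on z now separates the two classes.
threshold = 2.5
separable = max(z_inner) < threshold < min(z_outer)
print(separable)
```

A kernel does this kind of mapping implicitly, so the new features never have to be written out by hand.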

SVM γ(gamma) parameter
γ defines how far the influence of a single training example reaches
low values – far
high values – close
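
One way to see the reach of a training example is to evaluate the kernel directly. The sketch below assumes the RBF kernel, exp(-γ * squared distance), and compares a low and a high γ; rbf_kernel and the three points are hypothetical.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

training_point = (0.0, 0.0)
nearby = (0.5, 0.0)
far_away = (3.0, 0.0)

for gamma in (0.1, 10.0):
    print(gamma,
          rbf_kernel(training_point, nearby, gamma),
          rbf_kernel(training_point, far_away, gamma))
```

With the low γ the far point still gets a noticeable kernel value, so the training example influences it. With the high γ the value at the far point is practically zero, and only close points are affected.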

Overfitting: stop overfitting

# keep only 1% of the training data (// keeps the slice index an integer)
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]

Support Vector Machine

SVM (Support Vector Machine)
Maximizes the distance to the nearest points of either class
that distance = margin
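
The margin of a given separating line can be computed as the smallest point-to-line distance. The sketch below compares two candidate lines on made-up points; both lines classify the points correctly, and the margin helper is hypothetical. It only measures margins and does not search for the best line the way an SVM does.

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the line w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, p)) + b) / norm
               for p in points)

# Hypothetical points: the first two belong to one class, the last two to the other.
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]

margin_a = margin((1.0, 1.0), -5.0, points)   # line x + y = 5
margin_b = margin((1.0, 1.0), -3.0, points)   # line x + y = 3
print(margin_a, margin_b)
```

Of these two lines, the first leaves more room to the nearest point, so it is the one a margin-maximizing classifier would prefer.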

If you go to a machine learning party, everybody talks about machine learning.

SVMs – Outliers

SVM in SKlearn
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

import numpy as np
import pylab as pl
from prep_terrain_data import makeTerrainData

features_train, labels_train, features_test, labels_test = makeTerrainData()

# fit a linear-kernel SVM and predict the test labels
from sklearn.svm import SVC
clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

# accuracy_score takes the true labels first, then the predictions
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)

def submitAccuracy():
	return acc

Cancer Test

Example P(C)=0.01
test:
90% of the time it is positive if you have cancer (sensitivity)
90% of the time it is negative if you don’t have cancer (specificity)

Bayes Rule
prior probability * test evidence -> posterior probability

prior: P(C) = 0.01 = 1%
P(Pos|C) = 0.9 = 90%
P(Neg|¬C) = 0.9, P(Pos|¬C) = 0.1
joint: P(C, Pos) = P(C) * P(Pos|C) = 0.01 * 0.9 = 0.009
P(¬C, Pos) = P(¬C) * P(Pos|¬C) = 0.99 * 0.1 = 0.099
normalize: P(Pos) = P(C, Pos) + P(¬C, Pos) = 0.108
posterior: P(C|Pos) = 0.009 / 0.108 = 0.0833
P(¬C|Pos) = 0.099 / 0.108 = 0.9167
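
The cancer test numbers can be carried through to the posterior in a few lines. This sketch only uses the values given above (prior 0.01, sensitivity 0.9, specificity 0.9); the variable names are my own.

```python
p_cancer = 0.01
sensitivity = 0.9   # P(Pos | C)
specificity = 0.9   # P(Neg | not C)

# Joint probabilities of a positive test with and without cancer.
joint_cancer = p_cancer * sensitivity
joint_no_cancer = (1 - p_cancer) * (1 - specificity)

# Normalizer: total probability of a positive test.
p_positive = joint_cancer + joint_no_cancer

# Posteriors: divide each joint probability by the normalizer.
posterior_cancer = joint_cancer / p_positive
posterior_no_cancer = joint_no_cancer / p_positive
print(round(posterior_cancer, 4), round(posterior_no_cancer, 4))
```

Even after a positive test the probability of cancer stays below 10%, because the prior is so small.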

Text Learning – Naive Bayes
Chris: love 1, deal 8, life 1
Sara: love 3, deal 2, life 3
P(Chris) = 0.5
P(Sara) = 0.5
Sara uses "love" and "life" more frequently than Chris does.
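
The word numbers above can be turned into a Naive Bayes guess about who wrote a message. The sketch below assumes the numbers are word counts per person, which is my reading of the notes and may not match the original slide; the posterior helper and the message "love life" are hypothetical.

```python
# Assumption: the numbers in the notes are word counts per person.
word_counts = {
    "Chris": {"love": 1, "deal": 8, "life": 1},
    "Sara": {"love": 3, "deal": 2, "life": 3},
}
priors = {"Chris": 0.5, "Sara": 0.5}

def posterior(message):
    """Naive Bayes: prior times the product of per-word probabilities, normalized."""
    scores = {}
    for person, counts in word_counts.items():
        total = float(sum(counts.values()))
        score = priors[person]
        for word in message:
            score *= counts[word] / total
        scores[person] = score
    normalizer = sum(scores.values())
    return {person: score / normalizer for person, score in scores.items()}

result = posterior(["love", "life"])
print(result)
```

Under these assumptions the message is far more likely to come from Sara, which agrees with the observation that she uses both words more often.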

Calculating NBAccuracy

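def NBAccuracy(features_train, labels_train, features_test, labels_test):
	""" compute the accuracy of a Gaussian Naive Bayes classifier """
	from sklearn.naive_bayes import GaussianNB
	from sklearn.metrics import accuracy_score

	clf = GaussianNB()
	# the classifier has to be fit before it can predict
	clf.fit(features_train, labels_train)
	pred = clf.predict(features_test)
	accuracy = accuracy_score(labels_test, pred)
	return accuracy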

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
	accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
	return accuracy