-A <- best attribute
-Assign A as decision attribute for Node
-For each value of A, create a descendant of Node
-Sort training examples to leaves
-If examples perfectly classified, stop; else iterate over leaves

gain(S, A) = entropy(S) - Σv (|Sv| / |S|) entropy(Sv)
entropy(S) = -Σv P(v) log2 P(v)

ID3: Bias
INDUCTIVE BIAS
Restriction bias: H
Preference bias:
-good splits at top
-correct over incorrect
-shorter trees

Decision trees: other considerations
-continuous attributes? e.g. age, weight, distance
-When do we stop?
-> everything classified correctly!
-> no more attributes!
-> no overfitting
-splitting? variance?
-> output: average, local linear fit
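A minimal sketch of the entropy and gain formulas above, to show how "best attribute" can be chosen. The toy restaurant-style data and the attribute names are made up for illustration; log base 2 is assumed.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(examples, labels, attribute):
    """Information gain from splitting `examples` on `attribute`.

    `examples` is a list of dicts mapping attribute name -> value.
    """
    total = len(labels)
    # Group the labels by the value each example takes for the attribute.
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "hungry" predicts the label perfectly, "raining" does not.
examples = [
    {"hungry": "yes", "raining": "yes"},
    {"hungry": "yes", "raining": "no"},
    {"hungry": "no", "raining": "yes"},
    {"hungry": "no", "raining": "no"},
]
labels = [True, True, False, False]

# The "best" attribute is the one with the highest gain.
best = max(["hungry", "raining"], key=lambda a: gain(examples, labels, a))
```

Here splitting on "hungry" leaves two pure subsets, so it gets the full one bit of gain, while "raining" gets none.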

Decision Trees

Supervised Learning
classification: true or false

Credit history: lend money? -> classification: binary task

classification learning
– instances (input)
– concept: function -> T, F
– target concept -> actual answer
– hypothesis class -> all functions considered
– sample (training set)
– candidate -> concept that might be the target concept
– testing set

Decision Tree
entry: type Italian, French, Thai
atmosphere: fancy, hole-in-the-wall, casual
hot date?
cost, hungry, raining

node -> values -> attribute

representation vs algorithm

Decision Trees: Learning
1. Pick best attribute
Best ~ splits the data
2. Ask the question
3. Follow the answer path
4. Go to 1
until you get an answer

Decision Trees: Expressiveness
A and B, A or B, A xor B

n-OR: any, n-XOR: parity (odd)

XOR is hard. n attributes (boolean): O(n!) node orderings. How many trees? Output is boolean.

Truth table
a1, a2, a3, ..., an, output
t, t, t, ..., t
t, t, t, ..., t
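A small sketch of the truth-table view, to make the "how many trees?" question concrete. It assumes the usual counting argument: n boolean attributes give 2^n rows, and since each row's output can be set independently there are 2^(2^n) distinct boolean functions to represent. The helper names are made up.

```python
from itertools import product


def truth_table(n, func):
    """Return every row (inputs..., output) of an n-input boolean function."""
    return [row + (func(row),) for row in product([True, False], repeat=n)]


def parity(bits):
    """n-XOR: true when an odd number of inputs are true."""
    return sum(bits) % 2 == 1


n = 3
rows = truth_table(n, parity)
num_rows = len(rows)            # 2 ** n rows
num_functions = 2 ** num_rows   # each row's output can be T or F independently
```

Passing the built-in `any` instead of `parity` gives the n-OR table.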

Philosophy of Machine Learning

Theoretical, Practical
What is machine learning? × Proving theorems
computational statistics
broader notion of building computational artifacts that learn over time based on experience.

-supervised learning
-unsupervised learning
-reinforcement learning

1:1, 2:4, 3:9, 4:16, 5:25, 6:36
output <- input^2
induction and deduction
supervised learning = approximation
unsupervised learning = description
pixels -> function approximator -> labels
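A toy sketch of supervised learning as function approximation, using the input/output pairs from the notes. The three candidate functions are a made-up hypothesis class; the point is only that induction picks the candidate that agrees with the sample and then generalizes to an unseen input.

```python
# Training sample from the notes: input -> output pairs.
sample = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36}

# Hypothetical hypothesis class: a few candidate functions to choose from.
hypotheses = {
    "double": lambda x: 2 * x,
    "square": lambda x: x ** 2,
    "cube": lambda x: x ** 3,
}


def training_errors(h):
    """Count the sample points that hypothesis h gets wrong."""
    return sum(1 for x, y in sample.items() if h(x) != y)


# Induction: pick the candidate that best agrees with the examples...
best_name = min(hypotheses, key=lambda name: training_errors(hypotheses[name]))
# ...then generalize to an input that was never seen.
prediction = hypotheses[best_name](7)
```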
Reinforcement learning

supervised learning: labels data well
reinforcement learning: behavior scores well
unsupervised learning: cluster scores well

Evaluation Metrics

accuracy = number of items labeled correctly / all items

positive – negative
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
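A quick sketch that counts the confusion-matrix cells for the two lists above and derives the metrics from them. It assumes label 1 is the positive class; the variable names are illustrative.

```python
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

pairs = list(zip(predictions, true_labels))
true_pos = sum(1 for p, t in pairs if p == 1 and t == 1)
false_pos = sum(1 for p, t in pairs if p == 1 and t == 0)
false_neg = sum(1 for p, t in pairs if p == 0 and t == 1)
true_neg = sum(1 for p, t in pairs if p == 0 and t == 0)

accuracy = (true_pos + true_neg) / len(pairs)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
```

For these lists that gives 6 true positives, 3 false positives and 2 false negatives, so precision is 6/9 and recall is 6/8.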

data set, features, algorithms, evaluation


from sklearn import svm
from sklearn.model_selection import GridSearchCV

# grid of SVC settings to try: every kernel/C combination is evaluated
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

svr = svm.SVC()
clf = GridSearchCV(svr, parameters)


import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris.data.shape, iris.target.shape   # ((150, 4), (150,))

# hold out 40% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape   # ((90, 4), (90,))
X_test.shape, y_test.shape     # ((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)      # 0.96

Training, Transforms, Predicting
Train/test split -> pca -> svm

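from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# word_data and authors are parallel lists of texts and labels loaded elsewhere
clf = GaussianNB()
t0 = time()
kf = KFold(n_splits=2)
for train_indices, test_indices in kf.split(word_data):
	features_train = [word_data[ii] for ii in train_indices]
	features_test = [word_data[ii] for ii in test_indices]
	authors_train = [authors[ii] for ii in train_indices]
	authors_test = [authors[ii] for ii in test_indices]

	# TF-IDF is fitted on the training fold only, then applied to the test fold
	vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
	features_train_transformed = vectorizer.fit_transform(features_train)
	features_test_transformed = vectorizer.transform(features_test)
	# keep the top 10% of features, scored on the training fold
	selector = SelectPercentile(f_classif, percentile=10)
	selector.fit(features_train_transformed, authors_train)
	features_train_transformed = selector.transform(features_train_transformed).toarray()
	features_test_transformed = selector.transform(features_test_transformed).toarray()
	clf.fit(features_train_transformed, authors_train)
	print("training time:", round(time()-t0, 3), "s")
	t0 = time()
	pred = clf.predict( features_test_transformed )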

When to use PCA

-> latent features driving the patterns in data
-> dimensional reduction
-> visualize high-dimensional data, reduce noise
-> make other algorithms (regression, classification) work better with fewer inputs

PCA for facial recognition

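from time import time
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# X, y and the image height/width h, w are assumed to be defined earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

n_components = 150

print("Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0]))
t0 = time()
# RandomizedPCA is gone from current sklearn; PCA with the randomized solver replaces it
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))

eigenfaces = pca.components_.reshape((n_components, h, w))

print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))

print("Fitting the classifier to the training set")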


Principal Component Analysis – PCA
Dimensionality of data: 2

x = 2
y = 3
Δx = 1
Δy = 2

square footage + No.Rooms -> Size

How to determine the principal component
variance – the willingness/flexibility of an algorithm to learn
technical term in statistics – roughly the “spread” of a data distribution (similar to standard deviation)

– principal component = direction of maximum variance, which minimizes information loss
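A hand-rolled sketch of finding the maximum-variance direction for 2-D data, without sklearn. The three points are made up (they sit on a line through (2, 3) with Δy/Δx = 2), and the closed-form angle formula is the standard one for a 2x2 covariance matrix; this is an illustration, not how a library computes PCA.

```python
from math import atan2, cos, sin

# Hypothetical points lying exactly on a line through (2, 3) with slope 2.
points = [(1.0, 1.0), (2.0, 3.0), (3.0, 5.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Entries of the 2x2 covariance matrix.
var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
var_y = sum((y - mean_y) ** 2 for _, y in points) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

# Angle of the maximum-variance direction of a 2x2 covariance matrix.
theta = 0.5 * atan2(2 * cov_xy, var_x - var_y)
first_pc = (cos(theta), sin(theta))

# Variance of the data projected onto the first principal component.
projected = [(x - mean_x) * first_pc[0] + (y - mean_y) * first_pc[1] for x, y in points]
var_along_pc = sum(p ** 2 for p in projected) / n
explained_ratio = var_along_pc / (var_x + var_y)
```

Because the points are perfectly collinear, the first component points along the line and keeps all of the variance, so nothing is lost by dropping the second dimension.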

import matplotlib.pyplot as plt

def doPCA():
	# data is the 2-column feature array built elsewhere
	from sklearn.decomposition import PCA
	pca = PCA(n_components=2)
	pca.fit(data)
	return pca

pca = doPCA()
print(pca.explained_variance_ratio_)
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
	# red/cyan: the point projected onto the first/second principal component
	plt.scatter( first_pc[0]*ii[0],  first_pc[1]*ii[0], color="r")
	plt.scatter( second_pc[0]*ii[1], second_pc[1]*ii[1], color="c")
	# blue: the original data point
	plt.scatter( jj[0], jj[1], color="b")
plt.ylabel("long-term incentive")

Features != Information

There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
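A plain-Python sketch of the difference between the two selection rules. This is not sklearn's implementation; the scores and feature names are invented, and the rounding rule for the percentile is an assumption made for the example.

```python
def select_k_best(scores, k):
    """Keep the k highest-scoring feature names."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def select_percentile(scores, percentile):
    """Keep the top `percentile` percent of feature names (at least one)."""
    k = max(1, int(len(scores) * percentile / 100))
    return select_k_best(scores, k)


# Hypothetical univariate scores for five made-up features.
scores = {"salary": 9.1, "bonus": 7.4, "to_messages": 1.2,
          "from_messages": 0.3, "shared_receipt": 4.8}

top_two = select_k_best(scores, 2)
top_forty_percent = select_percentile(scores, 40)
```

With five features, K = 2 and 40% pick the same two features; the two rules only diverge once the number of features changes.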

high bias
pays little attention to data, oversimplified, high error on training set
high variance
pays too much attention to data (does not generalize well), overfits

Regularization in Regression
method for automatically penalizing extra features
-Lasso Regression: minimize SSE + γ|β|

m1 – m4: coefficients of regression
x1-x4: features
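A small sketch of the quantity Lasso minimizes, SSE + γ|β|, to show how an extra nonzero coefficient is penalized. The data, the coefficients and γ = 1 are all made up, and the model has no intercept to keep the example short.

```python
def lasso_cost(features, labels, coefficients, gamma):
    """SSE plus gamma times the sum of absolute coefficient values."""
    sse = 0.0
    for row, label in zip(features, labels):
        prediction = sum(m * x for m, x in zip(coefficients, row))
        sse += (label - prediction) ** 2
    penalty = gamma * sum(abs(m) for m in coefficients)
    return sse + penalty


# Hypothetical data: the label depends only on the first of two features.
features = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0]]
labels = [2.0, 4.0, 6.0]

# Same first coefficient; the second model also uses the irrelevant feature.
sparse_cost = lasso_cost(features, labels, [2.0, 0.0], gamma=1.0)
dense_cost = lasso_cost(features, labels, [2.0, 0.1], gamma=1.0)
```

The second model pays twice: a larger penalty term and a worse fit, which is why Lasso tends to push such coefficients to zero.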

from sklearn.linear_model import Lasso
features, labels = GetMyData()
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])


import pickle
from get_data import getData

def computeFraction( poi_messages, all_messages ):
	# placeholder: should return poi_messages as a fraction of all_messages
	fraction = 0.
	return fraction

data_dict = getData()

submit_dict = {}
for name in data_dict:

	data_point = data_dict[name]

	# share of this person's incoming mail that came from a POI
	from_poi_to_this_person = data_point["from_poi_to_this_person"]
	to_messages = data_point["to_messages"]
	fraction_from_poi = computeFraction( from_poi_to_this_person, to_messages )
	print(fraction_from_poi)
	data_point["fraction_from_poi"] = fraction_from_poi

	# share of this person's outgoing mail that went to a POI
	from_this_person_to_poi = data_point["from_this_person_to_poi"]
	from_messages = data_point["from_messages"]
	fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
	print(fraction_to_poi)
	# second key is assumed: the original line was cut off after the first entry
	submit_dict[name] = {"from_poi_to_this_person":fraction_from_poi,
	                     "from_this_person_to_poi":fraction_to_poi}
	data_point["fraction_to_poi"] = fraction_to_poi

def submitDict():
	return submit_dict
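One possible way to fill in the fraction helper, as a sketch only. It assumes missing counts show up as the string "NaN" (or a float NaN) and that a missing or zero total should give 0.; neither assumption is confirmed by the notes. The snake_case name is used to keep it distinct from the stub above.

```python
from math import isnan


def compute_fraction(poi_messages, all_messages):
    """Fraction of a person's messages that involve a POI.

    Returns 0. when either count is missing ("NaN" string or float NaN)
    or when there are no messages at all.
    """
    def missing(value):
        return value == "NaN" or (isinstance(value, float) and isnan(value))

    if missing(poi_messages) or missing(all_messages) or all_messages == 0:
        return 0.
    return float(poi_messages) / float(all_messages)
```

For example, `compute_fraction(2, 10)` gives 0.2, while a missing count or an empty mailbox falls back to 0. instead of raising an error.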