LOOP:
-A <- best attribute
-Assign A as decision attribute for Node
-For each value of A create a descendant of Node
-Sort training examples to the leaves
-If examples perfectly classified, stop; else iterate over leaves
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v)
Entropy(S) = -Σ_v p(v) · log₂ p(v)
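A minimal sketch of these two formulas in Python; the helper names entropy and information_gain are mine, not from the notes:

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_v p(v) * log2 p(v) over the label values in S
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v),
    # where S_v is the subset of examples with attribute value v
    total = float(len(labels))
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum((len(s) / total) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ['T', 'T', 'F', 'F']
print(information_gain(labels, ['a', 'a', 'b', 'b']))   # perfect split  -> 1.0
print(information_gain(labels, ['a', 'b', 'a', 'b']))   # useless split  -> 0.0

The attribute with the highest gain is the "best attribute" picked at each step of the loop.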
ID3: Bias
INDUCTIVE BIAS
Restriction bias: the hypothesis space H (only decision trees are considered)
Preference bias:
-good splits near the top
-correct over incorrect
-shorter trees
Decision trees: other considerations
- continuous attributes?
e.g. age, weight, distance -> split on a range or threshold (e.g. age < 30?)
When do we stop?
-> everything classified correctly!
-> no more attributes!
-> don't overfit (stop before the tree gets too big, or prune)
Regression
splitting criterion? variance (reduction)
leaf output: average of the examples, or a local linear fit
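A quick sketch of a regression tree in sklearn; the toy data and the max_depth value are made up for illustration:

from sklearn.tree import DecisionTreeRegressor

# toy data: y is roughly x^2, values chosen only for illustration
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 4, 9, 16, 25, 36]

reg = DecisionTreeRegressor(max_depth=2)   # limit depth to avoid overfitting
reg.fit(X, y)
print(reg.predict([[3.5]]))   # prediction = average of the targets in the matching leaf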
Decision Trees
Supervised Learning
classification: true or false
regression
Credit history: lend money? -> classification: binary task
classification learning
– instances (inputs)
– concept: a function mapping inputs to T/F
– target concept: the actual answer we want
– hypothesis class: the set of all functions we are willing to consider
– sample (training set)
– candidate: a concept that might be the target concept
– testing set
Decision Tree
type: Italian, French, Thai
atmosphere: fancy, hole-in-the-wall, casual
occupied
hot date?
cost, hungry, raining
nodes -> attributes, edges -> values
representation vs algorithm
Decision Trees: Learning
1. Pick best attribute
Best ~ splits the data
2. Ask a question
3. Follow the answer path
4. Go to 1
…and so on until you get an answer
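In sklearn this loop is handled by DecisionTreeClassifier; a minimal sketch on the iris data (the parameter choices here are mine, not from the notes):

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=2)
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:5]))         # follow the answer path for new samples
print(clf.score(iris.data, iris.target))  # training accuracy (beware overfitting)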
Decision trees: Expressiveness
Boolean
A and B, A or B, A xor B
n-OR: any, n-XOR: parity (odd)
XOR is hard: with n boolean attributes an n-XOR tree needs O(2^n) nodes; how many possible trees? The output is boolean, so there are 2^(2^n) of them.
Truth table
a1  a2  a3  …  an | output
t   t   t   …  t  | ?
f   t   t   …  t  | ?
… (2^n rows in total; the output column can be filled in 2^(2^n) ways)
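A quick check of the counting; n = 4 is just an example value:

n = 4                   # number of boolean attributes
rows = 2 ** n           # rows in the truth table
functions = 2 ** rows   # ways to fill the output column: 2^(2^n)
print(rows)             # 16
print(functions)        # 65536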
Philosophy of Machine Learning
Theoretical (proving theorems) vs. practical
What is machine learning? Narrowly, computational statistics; more broadly,
the notion of building computational artifacts that learn over time based on experience.
-supervised learning
-unsupervised learning
-reinforcement learning
1:1, 2:4, 3:9, 4:16, 5:25, 6:36
output <- input ^2
induction and deduction
supervised learning = function approximation
unsupervised learning = description
pixels -> Function approximator -> labels
Reinforcement learning
Optimization
supervised learning: labels data well
reinforcement learning: behavior scores well
unsupervised learning: clusters score well
Evaluation Metrics
accuracy = number of items labeled correctly / all items
positive vs. negative classes
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
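Plugging the two lists above into sklearn's metric functions (the commented values follow from the lists):

from sklearn.metrics import accuracy_score, precision_score, recall_score

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

print(accuracy_score(true_labels, predictions))    # (6 TP + 9 TN) / 20 = 0.75
print(precision_score(true_labels, predictions))   # 6 TP / (6 TP + 3 FP) ≈ 0.667
print(recall_score(true_labels, predictions))      # 6 TP / (6 TP + 2 FN) = 0.75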
data set, features, algorithms, evaluation
GridSearchCV
from sklearn import svm, grid_search, datasets

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Validation
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape    # ((150, 4), (150,))

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape          # ((90, 4), (90,))
X_test.shape, y_test.shape            # ((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)             # 0.96
Training, Transforms, Predicting
Train/test split -> pca -> svm
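A minimal sketch of that chain with an sklearn Pipeline; the iris data and the parameter values are stand-ins, and the older cross_validation import matches the sklearn API used elsewhere in these notes:

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# fit PCA on the training split only, then feed the components to the SVM
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('svm', SVC(kernel='linear', C=1))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))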
from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

clf = GaussianNB()
t0 = time()
kf = KFold(len(authors), 2)
for train_indices, test_indices in kf:
    # split the email data and author labels into folds
    features_train = [word_data[ii] for ii in train_indices]
    features_test = [word_data[ii] for ii in test_indices]
    authors_train = [authors[ii] for ii in train_indices]
    authors_test = [authors[ii] for ii in test_indices]

    # text vectorization: fit on the training fold, transform both folds
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed = vectorizer.transform(features_test)

    # feature selection: keep only the top 10% of features
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, authors_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed = selector.transform(features_test_transformed).toarray()

    clf.fit(features_train_transformed, authors_train)
    print "training time:", round(time()-t0, 3), "s"

    t0 = time()
    pred = clf.predict(features_test_transformed)
When to use PCA
-> latent features driving the patterns in data
-> dimensionality reduction
-> visualize high-dimensional data, reduce noise
-> make other algorithms (regression, classification) work better with fewer inputs
PCA for facial recognition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)
print "done in %0.3fs" % (time() - t0)

eigenfaces = pca.components_.reshape((n_components, h, w))

print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)

print "Fitting the classifier to the training set"
http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
PCA
Principal Component Analysis – PCA
Dimensionality of data: 2
x = 2
y = 3
Δx = 1
Δy = 2
square footage + No.Rooms -> Size
How to determine the principal component
variance – the willingness/flexibility of an algorithm to learn
technical term in statistics – roughly the “spread” of a data distribution (similar to standard deviation)
– maximum variance and information loss
import matplotlib.pyplot as plt

def doPCA():
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(data)
    return pca

pca = doPCA()
print pca.explained_variance_ratio_
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    # projections onto the first principal component (red) and the second (cyan)
    plt.scatter(first_pc[0]*ii[0], first_pc[1]*ii[0], color="r")
    plt.scatter(second_pc[0]*ii[1], second_pc[1]*ii[1], color="c")
    # original data points (blue)
    plt.scatter(jj[0], jj[1], color="b")

plt.xlabel("bonus")
plt.ylabel("long-term incentive")
plt.show()
Features != Information
There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
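A minimal sketch of both selectors, with the iris data standing in for whatever feature matrix you have:

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

iris = datasets.load_iris()

# keep the 2 most powerful features
k_best = SelectKBest(f_classif, k=2)
X_k = k_best.fit_transform(iris.data, iris.target)
print(X_k.shape)        # (150, 2)

# keep the top 50% of features
percentile = SelectPercentile(f_classif, percentile=50)
X_p = percentile.fit_transform(iris.data, iris.target)
print(X_p.shape)        # (150, 2)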
high bias
pays little attention to the data (oversimplified); high error on the training set
high variance
pays too much attention to the data (does not generalize well); overfits
Regularization in Regression
method for automatically penalizing extra features
-Lasso Regression: minimize SSE + γ|β| (an L1 penalty on the regression coefficients)
m1 – m4: coefficients of regression
x1-x4: features
from sklearn.linear_model import Lasso

features, labels = GetMyData()    # GetMyData() is a placeholder from the notes
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])      # predict for one sample with two features
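To see the penalty at work, a small made-up demo (the data and the alpha value are mine): the second feature is pure noise, so Lasso squeezes its coefficient toward zero.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] + rng.randn(100) * 0.1   # only the first feature matters

regression = Lasso(alpha=0.1)
regression.fit(X, y)
print(regression.coef_)   # first coefficient near 3, second squeezed toward 0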
Visualizing
import pickle
from get_data import getData

def computeFraction( poi_messages, all_messages ):
    fraction = 0.
    return fraction

data_dict = getData()

submit_dict = {}
for name in data_dict:
    data_point = data_dict[name]

    print
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction( from_poi_to_this_person, to_messages )
    print fraction_from_poi
    data_point["fraction_from_poi"] = fraction_from_poi

    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
    print fraction_to_poi
    submit_dict[name] = {"from_poi_to_this_person": fraction_from_poi,
                         "from_this_person_to_poi": fraction_to_poi}
    data_point["fraction_to_poi"] = fraction_to_poi

def submitDict():
    return submit_dict
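computeFraction above is still a stub; a plausible completion (my guess, guarding against the "NaN" strings in the Enron data set) would be:

def computeFraction(poi_messages, all_messages):
    # fraction of this person's messages that involve a POI;
    # the Enron dict stores missing values as the string "NaN"
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.
    return float(poi_messages) / float(all_messages)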