Feature scaling

– try to determine Chris's t-shirt size: 140 lbs, 6.1 ft
– training set: Cameron: 175 lbs, 5.9 ft; Sarah: 115 lbs, 5.2 ft
– measure height + weight
-> who is Chris closer to in height + weight?
Cameron (large shirt) or Sarah (small shirt)?
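A quick check of that naive height + weight metric (a sketch using the numbers above; the point is that weight swamps height because of its much larger scale):

[python]
# naive metric: just add the raw weight and height of each person
chris   = {"weight": 140., "height": 6.1}
cameron = {"weight": 175., "height": 5.9}   # wears a large shirt
sarah   = {"weight": 115., "height": 5.2}   # wears a small shirt

def metric(person):
    return person["weight"] + person["height"]

print(abs(metric(chris) - metric(cameron)))   # ~34.8
print(abs(metric(chris) - metric(sarah)))     # ~25.9 -> Chris looks "closer" to Sarah
[/python]

Without rescaling, the tall Chris would get the small shirt, which is exactly the problem min-max scaling below addresses.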
Feature Scaling

X’ = (X – Xmin)/(Xmax – Xmin)
weights: [115, 140, 175]
X' for 140: (140 – 115) / (175 – 115) = 25 / 60 = 0.417
0 <= X' <= 1

[python]
from sklearn.preprocessing import MinMaxScaler
import numpy

weights = numpy.array([[115], [140], [175]])    # integer array
scaler = MinMaxScaler()
rescaled_weight = scaler.fit_transform(weights)

# older sklearn versions complain about integer input, so use floats instead
weights = numpy.array([[115.], [140.], [175.]])
rescaled_weight = scaler.fit_transform(weights)
rescaled_weight   # array([[ 0.        ], [ 0.41666667], [ 1.        ]])
[/python]

Which algorithms would be affected by feature rescaling?
– SVM with an RBF kernel
– k-means clustering

Clustering

Unsupervised Learning

K-MEANS
how many clusters?
-> 2

two steps, repeated until convergence: assign (each point to its nearest cluster center), optimize (move each center to minimize the total distance to its assigned points)

Visualizing K-Means Clustering
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
– uniform points

K-MEANS
will the output for a fixed training set always be the same?
-> no: k-means can converge to a local minimum, depending on the initial cluster centers
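A minimal sklearn sketch of the ideas above (the data values are made up; n_init reruns the assign/optimize loop from several random initializations to reduce the risk of a bad local minimum):

[python]
from sklearn.cluster import KMeans
import numpy

X = numpy.array([[1., 2.], [1.5, 1.8], [1., 0.6],
                 [8., 8.], [9., 11.], [8.5, 9.]])

clf = KMeans(n_clusters=2, n_init=10)   # 2 clusters, as in the quiz above
clf.fit(X)
print(clf.cluster_centers_)
print(clf.labels_)
[/python]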

Outliers

what causes outliers?
-sensor malfunction: ignore
-data entry errors
-freak event: pay attention

Outlier Detection / Rejection
- train
- remove the points with the largest residual error (e.g. the top ~10%)
- train again
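A sketch of that train / remove / retrain loop (the helper name, the 10% default and the use of LinearRegression are assumptions for illustration):

[python]
import numpy
from sklearn.linear_model import LinearRegression

def remove_outliers(features, targets, fraction=0.1):
    """Drop the given fraction of points with the largest squared residual."""
    features = numpy.asarray(features)   # 2D array: (n_samples, n_features)
    targets = numpy.asarray(targets)
    reg = LinearRegression()
    reg.fit(features, targets)                         # 1) train
    errors = (reg.predict(features) - targets) ** 2
    keep = errors.argsort()[: int(len(errors) * (1.0 - fraction))]
    return features[keep], targets[keep]               # 2) remove the worst points

# 3) train again on the cleaned data:
# features_clean, targets_clean = remove_outliers(features, targets)
# reg = LinearRegression().fit(features_clean, targets_clean)
[/python]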

import matplotlib.pyplot

# scatter plot of salary vs. bonus for every data point
for point in data:
	salary = point[0]
	bonus = point[1]
	matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

r^2 of a regression

r^2
how much of the change in the output (y) is explained by the change in the input (x)?
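In sklearn, a fitted regression reports r^2 through its score() method; a minimal sketch with made-up numbers:

[python]
from sklearn.linear_model import LinearRegression

X = [[1.], [2.], [3.], [4.]]
y = [2., 4.1, 5.9, 8.2]

reg = LinearRegression().fit(X, y)
# score() returns r^2: how much of the variation in y the fitted line explains
print(reg.score(X, y))
[/python]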

0.0 <= r^2 <= 1.0

classification vs. regression:

property                      | supervised classification | regression
output type                   | discrete (class labels)   | continuous (number)
what are you trying to find?  | decision boundary         | best fit line
evaluation                    | accuracy                  | sum of squared error, r^2

regression can also be multi-variate, e.g. age, IQ, education -> net worth

Multi-variate regression
y = 5*x1 + 2.5*x2 - 200

y = house price
y = x1 - 10*x2 + 500
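A minimal multi-variate fit in sklearn (the data is generated from the house-price formula above; the point is that coef_ holds one coefficient per input feature):

[python]
from sklearn.linear_model import LinearRegression

# two input features per sample: (x1, x2) -> y
X = [[1., 10.], [2., 20.], [3., 10.], [4., 30.]]
y = [x1 - 10*x2 + 500 for x1, x2 in X]

reg = LinearRegression().fit(X, y)
print(reg.coef_)        # ~[1., -10.] : one coefficient per feature
print(reg.intercept_)   # ~500.
[/python]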

import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl","r"))

features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"   # use a different color for the test points

import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
	plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
	plt.scatter( feature, target, color=train_color )

# two extra points just to get "test" and "train" entries into the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_test[0], target_test[0], color=train_color, label="train")

# draw the regression line, once it exists
try:
	plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
	pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()

Linear Regression Errors

error = actual net worth – predicted net worth
(taken from training data, predicted by regression line)

predicted net worth = 218.75
actual net worth = 200
error(distance) = -18.75

Σ |error| over all data points
Σ error^2 over all data points

linear regression minimizes SSE = Σ over all training points (actual - predicted)^2
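A tiny numeric illustration of those two error sums (the 218.75/200 pair is from the example above, the other values are made up):

[python]
import numpy

actual    = numpy.array([200., 150., 300.])
predicted = numpy.array([218.75, 140., 310.])

errors = actual - predicted
print(numpy.sum(numpy.abs(errors)))   # sum of absolute errors
print(numpy.sum(errors ** 2))         # SSE, which linear regression minimizes
[/python]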

several algorithms
-Ordinary least squares(OLS)
-> used in sklearn LinearRegression
-Gradient descent

SSE isn't perfect! it grows as more data points are added, even if the fit is just as good; one reason to report r^2 instead

Regression

Continuous supervised learning
discrete output: labels, e.g. fast / slow
continuous output: a number

sklearn regression
http://scikit-learn.org/stable/modules/linear_model.html

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> reg.coef_
array([ 0.5,  0.5])
#!/usr/bin/python

import numpy
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from studentRegression import studentReg
from class_vis import prettyPicture, output_image

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

reg = studentReg(ages_train, net_worths_train)

plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")

# studentRegression.py (imported above as studentReg)
def studentReg( ages_train, net_worths_train):
	from sklearn.linear_model import LinearRegression
	reg = LinearRegression()
	reg.fit( ages_train, net_worths_train )

	return reg

New algorithm

k nearest neighbors: classic, simple, easy to understand
random forest: an "ensemble method", a meta classifier built from (usually) decision trees
AdaBoost (boosted decision trees)
(previous algorithms: Naive Bayes, SVM, decision tree)

Process
1) do some research!
– get a general understanding
2) find sklearn documentation
3) deploy it!
4) use it to make predictions
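As an example of steps 2)–4) for the first of the new algorithms, a k-nearest-neighbors sketch on made-up two-feature data:

[python]
from sklearn.neighbors import KNeighborsClassifier

# toy training data: two features per point, binary labels
features_train = [[0.1, 0.7], [0.3, 0.8], [0.8, 0.2], [0.9, 0.1]]
labels_train   = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)   # 3) deploy it
clf.fit(features_train, labels_train)

print(clf.predict([[0.2, 0.9], [0.85, 0.15]]))   # 4) use it to make predictions
[/python]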

What is a person of interest?
– indicted
– settled without admitting guilt
– testified in exchange for immunity

MORE DATA > fine-tuned algorithm

numerical – numerical values(numbers)
categorical – limited number of discrete values(category)
time series – temporal value(date, timestamp)
text – words

Information Gain

information gain = entropy(parent) – [weighted average] entropy(children)
decision tree algorithm: maximize information gain

>>> -2.0/3*math.log(2.0/3, 2) - 1.0/3*math.log(1.0/3, 2)
0.9182958340544896

entropy(children) = 3/4 * 0.9184 + 1/4 * 0 = 0.6888
information gain = 1.0 - 0.6888 = 0.3112
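A small helper that reproduces the numbers above (a sketch; the labels stand for the slow/fast speed example described below, split on one feature into an [s, s, f] branch and an [f] branch):

[python]
import math
from collections import Counter

def entropy(labels):
    """entropy = sum over classes of -P_i * log2(P_i)."""
    total = float(len(labels))
    return sum(-count / total * math.log(count / total, 2)
               for count in Counter(labels).values())

parent = ["s", "s", "f", "f"]               # entropy = 1.0
children = [["s", "s", "f"], ["f"]]         # the two branches after the split

weighted = sum(len(c) / float(len(parent)) * entropy(c) for c in children)
print(entropy(parent) - weighted)           # information gain ~ 0.3112
[/python]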

Entropy

Entropy: controls how a DT decides where to split the data
definition: measure of impurity in a bunch of examples

entropy = Σ_i ( -P_i * log2(P_i) )
P_i is the fraction of examples in class i

all examples are in the same class -> entropy = 0
examples are evenly split between classes -> entropy = 1.0

features: grade, bumpiness, speed limit; label: speed
speed labels: s, s, f, f
P_slow = 2/4 = 0.5, P_fast = 2/4 = 0.5

entropy

>>> import math
>>> -0.5*math.log(0.5, 2) - 0.5*math.log(0.5, 2)
1.0

min_samples_split

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracies():
	return {"acc_min_samples_split_2":round(acc_min_samples_split_2,3),
		"acc_min_samples_split_50":round(acc_min_samples_split_50,3)}