Principal Component Analysis – PCA
Dimensionality of data: 2
x = 2
y = 3
Δx = 1
Δy = 2
square footage + no. of rooms -> size (two correlated input features combined into one composite feature; see the sketch below)
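A minimal sketch of this idea, assuming a tiny made-up dataset of [square footage, number of rooms] pairs (the numbers are invented for illustration): PCA with n_components=1 collapses the two correlated features into a single composite "size" feature.

import numpy as np
from sklearn.decomposition import PCA

# made-up data: [square footage, number of rooms] per house
X = np.array([[1000, 2],
              [1500, 3],
              [1800, 3],
              [2400, 4],
              [3000, 5]], dtype=float)

pca = PCA(n_components=1)             # keep only one composite feature
size = pca.fit_transform(X)           # shape (5, 1): one "size" score per house
print(pca.explained_variance_ratio_)  # fraction of the original spread retained

(In practice the two features would usually be standardized first, since square footage dominates the raw variance.)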
How to determine the principal component
variance (in machine learning) – the willingness/flexibility of an algorithm to learn
variance (technical term in statistics) – roughly the "spread" of a data distribution (similar to standard deviation)
the first principal component is the direction of maximum variance; projecting the data onto it retains the most spread and therefore minimizes information loss
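A small check of the maximum-variance claim, using made-up correlated 2D data (the data-generating code is an assumption, not from the notes): the spread of the data projected onto the first principal component is at least as large as the spread along any other single direction, such as the plain x-axis.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# made-up 2D data stretched along a diagonal direction
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=500)])

first_pc = PCA(n_components=2).fit(X).components_[0]

var_along_pc = np.var(X @ first_pc)               # spread along the first PC
var_along_axis = np.var(X @ np.array([1.0, 0.0])) # spread along the x-axis
print(var_along_pc, var_along_axis)               # the first value is the larger one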
import matplotlib.pyplot as plt

def doPCA(data):
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(data)
    return pca

# data: 2D array of [bonus, long-term incentive] pairs, loaded elsewhere
pca = doPCA(data)
print(pca.explained_variance_ratio_)
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    # each point's coordinate along the first PC, drawn back in the original space
    plt.scatter(first_pc[0] * ii[0], first_pc[1] * ii[0], color="r")
    # the same for the second PC
    plt.scatter(second_pc[0] * ii[1], second_pc[1] * ii[1], color="c")
    # the original data point
    plt.scatter(jj[0], jj[1], color="b")
plt.xlabel("bonus")
plt.ylabel("long-term incentive")
plt.show()
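explained_variance_ratio_ reports the fraction of the total variance captured by each principal component; if the first entry is close to 1, the bonus / long-term-incentive data is essentially one-dimensional and the second component could be dropped with little information loss.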