Gaussian Kernel and Bandwidth
Mean Integrated Squared Error (MISE)
$\mathrm{MISE} = E\left[\lVert \hat{P}_n - P \rVert^2\right] = E \int \left(\hat{P}_n(x) - P(x)\right)^2 \, dx$
We can use the MISE, or its asymptotic approximation (AMISE), to select the optimal bandwidth.
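A hedged sketch (not from the notes themselves): for a Gaussian kernel, the AMISE-minimizing bandwidth under a Gaussian reference density reduces to Silverman's rule of thumb, h ≈ 1.06 · σ̂ · n^(−1/5), which scipy's gaussian_kde applies via bw_method='silverman'. The sample below is illustrative.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(size=500)          # illustrative 1-D sample

# Silverman's rule of thumb: h = 1.06 * sigma_hat * n^(-1/5),
# the AMISE-optimal bandwidth under a Gaussian reference density
n = sample.size
h = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)

# scipy applies essentially the same factor internally
kde = gaussian_kde(sample, bw_method='silverman')
grid = np.linspace(-4, 4, 200)
density = kde(grid)                    # estimated density on the grid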
import numpy as np

def MahalanobisDist(x, y):
    # x, y: coordinate arrays of paired 2-D observations
    covariance_xy = np.cov(x, y, rowvar=0)
    inv_covariance_xy = np.linalg.inv(covariance_xy)
    # Center each coordinate at its sample mean
    xy_mean = np.mean(x), np.mean(y)
    x_diff = np.array([x_i - xy_mean[0] for x_i in x])
    y_diff = np.array([y_i - xy_mean[1] for y_i in y])
    diff_xy = np.transpose([x_diff, y_diff])
    # Distance of each centered point under the inverse-covariance metric
    md = []
    for i in range(len(diff_xy)):
        md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),
                                        inv_covariance_xy), diff_xy[i])))
    return md
md = MahalanobisDist(x, y)
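A follow-up usage sketch, continuing from the snippet above (the chi-square cutoff is an assumption for illustration, not part of the notes): for roughly Gaussian 2-D data, squared Mahalanobis distances are approximately chi-square with 2 degrees of freedom, so large distances can flag outliers.

from scipy.stats import chi2

# 97.5% quantile of chi2 with df=2 (points are 2-D); cutoff is illustrative
cutoff = np.sqrt(chi2.ppf(0.975, df=2))
outliers = [i for i, d in enumerate(md) if d > cutoff]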
purpose of the model => problem formulation => choice of loss/risk
Identification: natural sciences, economics, medicine, some engineering
Prediction/Generalization: statistics/machine learning, complex phenomena, general applications
Models: logistic regression, support vector machine, random forest
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def init_models(X_train, y_train):
    # Instantiate the three candidate classifiers and fit each on the training set
    models = [LogisticRegression(),
              RandomForestClassifier(),
              SVC(probability=True)]
    for model in models:
        model.fit(X_train, y_train)
    return models

models = init_models(X_train, y_train)
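A hedged usage sketch (X_test and y_test are assumed held-out splits, not defined in the notes): the default .score of a scikit-learn classifier is mean accuracy, so the fitted models can be compared directly.

for model in models:
    # score() returns mean accuracy on held-out data for classifiers
    print(type(model).__name__, model.score(X_test, y_test))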
Learning Curves
Plot of the model performance: the risk (cost or score) vs. the size of the training and test sets.
For classifiers: score or l-score.
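A minimal sketch of such a plot with scikit-learn's learning_curve (the estimator choice, cv=5, and plotting details are illustrative assumptions; X_train, y_train as above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Cross-validated train/test scores over increasing training-set sizes
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('score (accuracy)')
plt.legend()
plt.show()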