MISE

Gaussian Kernel and Bandwidth
Mean Integrated Squared Error:
MISE = E[ ||Pn - P||^2 ] = E[ ∫ (Pn(x) - P(x))^2 dx ]

We can use MISE, or its asymptotic approximation AMISE, to select the optimal bandwidth.
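MISE is defined against the unknown density P, so it cannot be computed from data directly; a common data-driven stand-in is cross-validated likelihood. A minimal sketch with scikit-learn's KernelDensity (the sample data and bandwidth grid here are illustrative assumptions):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Illustrative sample; substitute your own observations.
data = np.random.normal(size=(200, 1))

# Search a grid of bandwidths for the Gaussian kernel,
# scoring each by cross-validated log-likelihood.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(data)
print(grid.best_params_['bandwidth'])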

import numpy as np

def MahalanobisDist(x, y):
    # x and y are paired 1-D coordinate arrays of one 2-D dataset.
    covariance_xy = np.cov(x, y, rowvar=False)
    inv_covariance_xy = np.linalg.inv(covariance_xy)

    # Center each coordinate on its sample mean.
    xy_mean = np.mean(x), np.mean(y)
    x_diff = np.array([x_i - xy_mean[0] for x_i in x])
    y_diff = np.array([y_i - xy_mean[1] for y_i in y])
    diff_xy = np.transpose([x_diff, y_diff])

    # Distance of each point from the mean: sqrt(d^T S^-1 d).
    md = []
    for diff in diff_xy:
        md.append(np.sqrt(diff @ inv_covariance_xy @ diff))
    return md
 
md = MahalanobisDist(x, y)  # x, y: the two coordinate arrays
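As a cross-check, SciPy provides a per-point Mahalanobis distance; a minimal sketch, assuming the same paired arrays x and y as above:

from scipy.spatial.distance import mahalanobis

# Stack the coordinates into points and invert their covariance.
points = np.column_stack((x, y))
center = points.mean(axis=0)
VI = np.linalg.inv(np.cov(x, y))

# scipy.spatial.distance.mahalanobis takes two points and VI.
md_scipy = [mahalanobis(p, center, VI) for p in points]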

purpose of the model => problem formulation
problem formulation => choice of loss/risk
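For example, a regression formulation naturally pairs with a squared-error risk, while a probabilistic classification formulation pairs with log loss; a minimal illustration with sklearn.metrics (the numbers are made up):

from sklearn.metrics import mean_squared_error, log_loss

# Regression formulation -> squared-error loss.
print(mean_squared_error([1.0, 2.0], [1.1, 1.8]))

# Probabilistic classification formulation -> log loss.
print(log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]]))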

Identification
natural sciences, economics, medicine, some engineering

Prediction/Generalization
statistics/machine learning, complex phenomena, general applications

Models
logistic regression, support vector machine, random forest

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def init_models(X_train, y_train):
    # Candidate classifiers; probability=True lets SVC emit class probabilities.
    models = [LogisticRegression(),
              RandomForestClassifier(),
              SVC(probability=True)]

    # Fit every model on the same training data.
    for model in models:
        model.fit(X_train, y_train)

    return models

models = init_models(X_train, y_train)
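A usage sketch for comparing the fitted classifiers, assuming a held-out X_test/y_test from the same split as X_train/y_train:

# Report held-out accuracy for each fitted model.
for model in models:
    print(type(model).__name__, model.score(X_test, y_test))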

Learning Curves
Plot of the model performance:
the risk, cost, or score vs.
the size of the training and test sets
Classifiers: score or 1 - score (error rate)
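A minimal sketch of such a plot with scikit-learn's learning_curve, assuming the X_train/y_train from above (the estimator and sizes are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

# Score vs. training-set size; cross-validation supplies the test-side curve.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label='training score')
plt.plot(sizes, test_scores.mean(axis=1), label='cross-validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend()
plt.show()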