Gaussian Kernel and Bandwidth
Mean Integrated Squared Error (MISE)
$\mathbb{E}\big[\lVert P_n - P \rVert^2\big] = \mathbb{E}\Big[\int \big(P_n(x) - P(x)\big)^2 \, dx\Big]$
We can use the MISE, or its asymptotic approximation AMISE, to select the optimal bandwidth.
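A minimal sketch of bandwidth selection for a Gaussian kernel (the sample x below is synthetic and illustrative): Silverman's rule of thumb gives the AMISE-optimal bandwidth under a Gaussian reference density, while a cross-validated grid search is a data-driven alternative.

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # illustrative 1-D sample

# Silverman's rule of thumb: AMISE-optimal for a Gaussian kernel
# when the true density is itself Gaussian
h_silverman = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)

# Data-driven alternative: maximize the cross-validated log-likelihood
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.1, 1.0, 30)}, cv=5)
grid.fit(x[:, None])  # KernelDensity expects shape (n_samples, 1)
h_cv = grid.best_params_["bandwidth"]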
import numpy as np

def MahalanobisDist(x, y):
    # Joint covariance of the two coordinate arrays and its inverse
    covariance_xy = np.cov(x, y, rowvar=0)
    inv_covariance_xy = np.linalg.inv(covariance_xy)
    # Center each coordinate at its mean
    xy_mean = np.mean(x), np.mean(y)
    x_diff = np.array([x_i - xy_mean[0] for x_i in x])
    y_diff = np.array([y_i - xy_mean[1] for y_i in y])
    diff_xy = np.transpose([x_diff, y_diff])
    # Distance of each point from the mean: sqrt(d^T Sigma^-1 d)
    md = []
    for i in range(len(diff_xy)):
        md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),
                                        inv_covariance_xy), diff_xy[i])))
    return md

md = MahalanobisDist(x, y)  # x, y: the two 1-D coordinate arrays of the sample
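A hedged usage sketch with synthetic, correlated 2-D data (all names below are illustrative, not from the original); points with a large distance can then be flagged as outliers:

rng = np.random.default_rng(0)
x = rng.normal(size=500)            # first coordinate
y = 0.5 * x + rng.normal(size=500)  # correlated second coordinate
md = MahalanobisDist(x, y)
outliers = [i for i, d in enumerate(md) if d > 3.0]  # e.g. flag distances beyond 3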
purpose of the model => problem formulation
problem formulation => choice of loss/risk
Identification
natural sciences, economics, medicine, some engineering
Prediction/Generalization
statistics/machine learning, complex phenomena, general applications
Models
logistic regression, support vector machine, random forest
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def init_models(X_train, y_train):
    # Fit one instance of each classifier on the training data
    models = [LogisticRegression(),
              RandomForestClassifier(),
              SVC(probability=True)]  # probability=True enables predict_proba
    for model in models:
        model.fit(X_train, y_train)
    return models

models = init_models(X_train, y_train)
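As a follow-up sketch, the fitted models can be compared on a held-out split; X_test and y_test are assumed here, not defined above:

# Hypothetical held-out evaluation; X_test / y_test are assumed to exist
for model in models:
    print(type(model).__name__, model.score(X_test, y_test))  # mean accuracy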
Learning Curves
Plot of the model performance:
the risk, cost, or score vs. the size of the training set,
evaluated on both the training set and the test set
Classifiers: Score or log-Score
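A minimal learning-curve sketch with scikit-learn; the dataset X, y is assumed (hypothetical), and the default score for classifiers is accuracy:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

# X, y: assumed labeled classification data
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Average over the cross-validation folds and plot both curves
plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, test_scores.mean(axis=1), label="test score")
plt.xlabel("training set size")
plt.ylabel("score (accuracy)")
plt.legend()
plt.show()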