Getting the SHA1 key for Android

Find the .android directory on your computer,
then run the keytool command:

~$ cd ~/.android
.android$ keytool -list -v -keystore ~/.android/debug.keystore -alias androiddebugkey -storepass android -keypass android

The SHA1 fingerprint appears in the command's output under "Certificate fingerprints".

Setting Up Maps
-set up billing
-create a project
-enable maps API
-set up credentials

Google Developer Console
https://console.cloud.google.com/

Billing Account Steps
-Name the account
-Specify country for the Account
-Account Type: Business or Individual
-Payer Detail
-Payment Type

Pick the Google Maps Android API and click "Enable"

MISE

Gaussian Kernel and Bandwidth
MISE: Mean Integrated Squared Error
E[ ||P_n – P||^2 ] = E[ ∫ (P_n(x) – P(x))^2 dx ]

In principle, we can use MISE or AMISE to select the optimal bandwidth.
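In practice the bandwidth is often chosen by cross-validation rather than by minimizing MISE directly. A minimal sketch, assuming a recent scikit-learn (KernelDensity and GridSearchCV) and a synthetic 1-D sample standing in for real data:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# synthetic 1-D sample for illustration; KernelDensity expects shape (n, 1)
sample = np.random.normal(size=500)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)  # maximizes the held-out log-likelihood
grid.fit(sample[:, np.newaxis])
best_bw = grid.best_params_['bandwidth']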

import numpy as np

def MahalanobisDist(x, y):
	# covariance of the two series and its inverse
	covariance_xy = np.cov(x, y, rowvar=0)
	inv_covariance_xy = np.linalg.inv(covariance_xy)
	# center each series on its mean
	xy_mean = np.mean(x), np.mean(y)
	x_diff = np.array([x_i - xy_mean[0] for x_i in x])
	y_diff = np.array([y_i - xy_mean[1] for y_i in y])
	diff_xy = np.transpose([x_diff, y_diff])
	# Mahalanobis distance of each point from the mean
	md = []
	for i in range(len(diff_xy)):
		md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]), inv_covariance_xy), diff_xy[i])))
	return md

md = MahalanobisDist(x,xbar)
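A common use of these distances is flagging outliers. A sketch; the 2-standard-deviation cut-off is an arbitrary choice for illustration, not from the notes:

import numpy as np

md = np.array(md)
cutoff = md.mean() + 2 * md.std()        # arbitrary cut-off, for illustration only
outlier_indices = [i for i, d in enumerate(md) if d > cutoff]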

problem formulation => choice of loss / risk
purpose of the model => problem formulation

Identification
natural sciences, economics, medicine, some engineering

Prediction/Generalization
statistics / machine learning, complex phenomena, general applications

Models
logistic regression, support vector machine, random forest

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def init_models(X_train, y_train):
	# instantiate the three candidate classifiers
	models = [LogisticRegression(),
			RandomForestClassifier(),
			SVC(probability=True)]

	# fit each one on the same training data
	for model in models:
		model.fit(X_train, y_train)

	return models

models = init_models(X_train, y_train)
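A quick way to compare the three fitted models is to score each on a held-out set. A sketch; X_test and y_test are assumed to exist alongside the training data (they are not defined in the notes):

# compare held-out accuracy for each fitted classifier (assumes X_test, y_test exist)
for model in models:
	print("%s: %.3f" % (model.__class__.__name__, model.score(X_test, y_test)))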

Learning Curves
Plot of model performance: the risk, cost, or score vs.
the size of the training and test sets.
For classifiers: the score or (1 - score).
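A sketch of such a plot with scikit-learn's learning_curve helper (in sklearn.model_selection in recent versions; older releases had it in sklearn.learning_curve), reusing the X_train, y_train arrays from above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

# score vs. training set size, averaged over 5 cross-validation folds
sizes, train_scores, test_scores = learning_curve(
	LogisticRegression(), X_train, y_train,
	train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label="train score")
plt.plot(sizes, test_scores.mean(axis=1), label="cv score")
plt.xlabel("training set size")
plt.ylabel("score")
plt.legend(loc="best")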

Kernel Density Estimates

Non-Parametric Models: KDEs

Derived Feature: x = |f0 – f1|/f0
Definition: the ratio of the difference between the submitted charge and the Medicare payment amount to the submitted charge.

x = abs(f0-f1)/f0
n0, bins0, patches0=plt.hist(x,100,normed=0,range=(0,1),histtype='stepfilled')
plt.setp(patches0, 'facecolor','g','alpha', 0.75)

from scipy import stats
from functools import partial

def my_kde_bandwidth(obj, fac=1./5):
	"""We use Scott's Rule, multiplied by a constant factor."""
	return np.power(obj.n, -1./(obj.d+4)) * fac

def getKDE(data, name="", bwfac=0.2):
	x2 = data
	x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500)
	kde = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=bwfac))
	fig1 = plt.figure(figsize=(8, 6))
	ax = fig1.add_subplot(111)
	plt.yscale('log')
	plt.grid(True)
	# histogram of the data on a fixed set of bins
	x2h1, x2h2 = np.histogram(x2, bins=[0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], normed=True)
	# plot the density estimate and mark the individual data points along the x-axis
	ax.plot(x_eval, kde(x_eval), 'k-', label=name)
	ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12)
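A usage line for the helper above, applied to the derived-feature array x from this section (a sketch):

getKDE(x, name="abs(f0 - f1) / f0", bwfac=0.2)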

Distribution of variables

Inspecting Distribution of Variables
f0: Average Submitted Charge
f1: Average Payment Amount
f2: Average Allowed Amount

g0 = f_CA.average_submitted_chrg_amt.values
g1 = f_CA.average_Medicare_payment_amt.values
g2 = f_CA.average_Medicare_allowed_amt.values

n0, bins0, patches0=plt.hist(g0,50,normed=0, range=(0,1000),histtype='stepfilled')
n2, bins2, patches2=plt.hist(g2,50,normed=0, range=(0,1000),histtype='stepfilled')
plt.setp(patches0, 'facecolor','g','alpha', 0.75)
plt.setp(patches2, 'facecolor','b','alpha', 0.75)

n0, bins0, patches0=plt.hist(f0,50,normed=1,log=1,range=(0,1000))

We can scale a variable to [0, 1]:
(f0 – f0.min()) / (f0.max() – f0.min())

n0, bins0, patches0=plt.hist((f0-f0.min())/(f0.max()-f0.min()),50,normed=1,log=1,range=(-0.2,1.2),histtype='stepfilled')
n1, bins1, patches1=plt.hist((f2-f2.min())/(f2.max()-f2.min()),40,normed=1,log=1,range=(-0.2,1.2),histtype='stepfilled')
plt.setp(patches0,'facecolor','g','alpha',0.75)
plt.setp(patches1,'facecolor','r','alpha',0.75)

The range of scaled variables: [0, 1]

Calculating Correlations
f1, f2: linearly correlated
Covariance: E[(x-μx)(y-μy)]
Pearson's correlation coefficient: ρ_{x,y} = cov(x,y)/(σx σy) = E[(x-μx)(y-μy)]/(σx σy)
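These can be checked numerically on the f1 and f2 arrays defined in the Medicare notebook (a sketch using numpy and scipy):

import numpy as np
from scipy.stats import pearsonr

cov_f1_f2 = np.cov(f1, f2)[0, 1]      # sample covariance between f1 and f2
r, p_value = pearsonr(f1, f2)         # Pearson correlation coefficient and p-value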

Parametric, Non-Parametric, Mathematical
Kernel (K), Density (D), Estimates (E)

The Medicare Code

CA_data = cwd +"/medicare_data/Medicare_Data_CA_xxx.csv"
f_CA = read_csv(CA_data)

f_CA.describe()
f_CA.head(10)
len(f_CA.columns)
for c in f_CA.columns: print c

IPython notebook

%pylab inline
from IPython.display import HTML
%matplotlib inline

import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns

from pandas import read_csv
import matplotlib.pyplot as pyplot

cwd = os.getcwd()

IL_data = cwd +"/medicare_data/Medicare_Data_IL_xxx.csv"
f_IL = read_csv(IL_data)

f_IL.describe()
f_IL.head(5)
len(f_IL.columns)
for c in f_IL.columns: print c

print len(f_IL.provider_type.unique())
print len(f_IL.nppes_provider_city.unique())
print len(f_IL.hcpcs_description.unique())

f0 = f_IL.average_submitted_chrg_amt.values
f1 = f_IL.average_Medicare_payment_amt.values
f2 = f_IL.average_Medicare_allowed_amt.values

Validate model

import sklearn
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(5)
knn_3 = KNeighborsRegressor(3)
len(knn_data_points)
.70*len(knn_data_points)
.30*len(knn_data_points)
import random
# note: random.choice samples with replacement
trainingPoints = [random.choice(knn_data_points) for i in xrange(966880)]
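A cleaner split-and-score sketch with scikit-learn (train_test_split is in sklearn.model_selection in recent versions, sklearn.cross_validation in older ones); the shape of knn_data_points is an assumption, taken here as (feature, target) pairs:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

pts = np.array(knn_data_points)                    # assumed shape (n, 2): (feature, target)
X_tr, X_te, y_tr, y_te = train_test_split(pts[:, :1], pts[:, 1], test_size=0.30)
knn = KNeighborsRegressor(5).fit(X_tr, y_tr)
print(knn.score(X_te, y_te))                       # R^2 on the held-out 30%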

Modelling Techniques
Q, M, V
-Mathematical models: natural, social, and economic sciences; differential/integral equations, algebra, and other advanced math; generalized knowledge of the system
-Statistical / machine learning: industry, business, and general applications; statistics, machine learning, linear algebra, information theory
-Combination: science, business, engineering, optimization

The Medicare Challenge: The Data
Inpatient Charges 2012
Look for files with: “IL” and “CA” in the instructor’s notes

%pylab inline
from IPython.display import HTML
%matplotlib inline

import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns

from pandas import read_csv
import matplotlib.pyplot as pyplot

cwd = os.getcwd()

Identify features with an influencing relationship.

X, Y
[Pr{X=x1}, Pr{X=x2}] = [0.29, 0.71]
[Pr{X=x1|Y=y1}, Pr{X=x2|Y=y1}] = [0.16, 0.84]

X: x1, x2, …, xk
H(X) = -p1*log(p1) - p2*log(p2) - … - pk*log(pk)
     = -Σ_{i=1}^{k} pi*log(pi)

H(X) = 0 for a degenerate distribution: [0, 0, …, 1, …, 0]
H(X) is maximal for the uniform distribution: [1/k, 1/k, …, 1/k]

Observing Y reduces uncertainty: H(X) > H(X|Y=y1)
Information gain: H(X) – H(X|Y)

Among candidate features A, B, C, …, choose the feature v that maximizes H(X) – H(X|v), i.e. v = arg min H(X|v).
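A small sketch computing these entropies (base-2 logs) for the example distributions above; the helper function is hypothetical, the probabilities are the ones from the notes:

import numpy as np

def entropy(probs):
	"""H(X) = -sum_i p_i * log2(p_i), ignoring zero-probability outcomes."""
	probs = np.asarray(probs, dtype=float)
	probs = probs[probs > 0]
	return -np.sum(probs * np.log2(probs))

h_x = entropy([0.29, 0.71])           # marginal entropy H(X)
h_x_given_y1 = entropy([0.16, 0.84])  # conditional entropy H(X | Y = y1)
gain = h_x - h_x_given_y1             # reduction in uncertainty from observing y1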

Covariance between intertweet and mention distance

mentionDists = [[v[0]] for v in nearestMentionToTimeDiff]   # column vectors for sklearn
intertweetTimes = [v[1] for v in nearestMentionToTimeDiff]

import sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
clf = linear_model.LinearRegression()

clf.fit(mentionDists, intertweetTimes)
clf.coef_
mentionDists = [v[0] for v in nearestMentionToTimeDiff]
pyplt.scatter(mentionDists[0:6], intertweetTimes[0:6])
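The covariance itself can be read off directly with numpy (the fitted slope above equals cov(x, y)/var(x) for simple linear regression); a sketch on the flat lists:

import numpy as np

flat_dists = [v[0] for v in nearestMentionToTimeDiff]
cov_xy = np.cov(flat_dists, intertweetTimes)[0, 1]   # cov(mention distance, intertweet time)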

sources of prediction error

pl = pyplt.plot(xVals, yVals, label="std dev(sec)")
pyplt.legend(loc=2, prop={'size':18})

Inherent variance
Suppose the true relationship is f = a*x, and we observe (x, f) pairs.
In practice each observation is noisy: f = a*x + NOISE.
Repeated measurements at the same input x1 give different values (x1, v1), (x1, v2), …
The spread of v1, v2, … is the inherent variance.

dataPoints_1 = []
x = np.arange(0, 100, 10)
# simulate 100 noisy samples of the true line y = 2x + 3 at each x value
for j in xrange(100):
	points = [(i, i*2 + 3 + np.random.normal(scale=50.0)) for i in x]
	dataPoints_1.extend(points)

pointToVals = []
pointToBounds = []
for i in np.arange(0, 100, 10):
	# collect the simulated y-values at this x and take the 5th/95th percentiles
	valsForDataPoint = [v[1] for v in dataPoints_1 if v[0] == i]
	pointToVals.append(valsForDataPoint)
	upperBound = np.percentile(valsForDataPoint, 95)
	lowerBound = np.percentile(valsForDataPoint, 5)
	pointToBounds.append([i, (upperBound, lowerBound)])

pyplt.plot(x, [v[1][0] for v in pointToBounds])
pyplt.plot(x, [v[1][1] for v in pointToBounds])
pyplt.plot(x, [(i*2+3) for i in x], color="red", label="true")

Average Absolute Error

Refining the exponential fit
f(t) = a * (1/β) * e^(-t/β) + c   (with b = 1/β this is the fitted form a*b*e^(-b*t) + c)
Free parameters: a, β (or b), c

Fit a more general exponential:

from scipy.optimize import curve_fit

def fitFunc_gen(t, a, b, c):
	return a*(b)*numpy.exp(-b*t)+c

fitParams_gen, fitCov_gen = curve_fit(fitFunc_gen, division[0:len(division)])
fitParams_gen
fitCov_gen
(1/fitParams_gen[1])*fitParams_gen[0]+fitParams_gen[1]
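Note that curve_fit needs both x and y data; the call above only shows one data argument. A self-contained sketch of the same kind of fit on synthetic data (the sample and histogram variables here are placeholders, not the notes' division array; the mean 1451.5 s is the β value quoted later in the notes):

import numpy as np
from scipy.optimize import curve_fit

# synthetic wait times, exponential with mean beta = 1451.5 seconds
samples = np.random.exponential(scale=1451.5, size=3200)
counts, edges = np.histogram(samples, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2.0

params, cov = curve_fit(fitFunc_gen, centers, counts, p0=[1.0, 1.0/1500.0, 0.0])
a_fit, b_fit, c_fit = params      # b_fit approximates 1/beta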

Intertweet time:
t1, t2
<-Δt-><-p->

Training examples
(Δt, p)
elapsed time, time until next tweet

step_size = 10
data_points = []
for v in timeUntilNext:
	bin_left_edges = np.arange(0, v, step_size)

	for l_edge in bin_left_edges:
		tempNewPoint = [l_edge, v-l_edge]
		data_points.append(tempNewPoint)

data_points.sort()
deltat_100 = [v[1] for v in data_points if v[0]==100]
deltat_150 = [v[1] for v in data_points if v[0]==150]
deltat_10 = [v[1] for v in data_points if v[0]==10]

pandas.Series(deltat_10).hist(bins=30, alpha=0.5, color="blue")
d_150 = pandas.Series(deltat_150)
pandas.Series(deltat_150).hist(bins=30, alpha=0.3, color="red")

Pr{X > t}

Markov's inequality (for non-negative X): Pr{X > t} <= E[X]/t; apply it to |X – μ|.

Chebyshev's inequality: Pr{|X – μ| >= t} <= σ²/t²

Confidence bounds and data: given samples X1, …, Xn from the exponential density f(x) = (1/β) e^(-x/β), we can attach a confidence bound to the estimate β*:
Pr{|β* – β| >= ε} <= 2e^(-2nε²), with β = 1451.5 seconds and n = 3200.
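A quick worked example of the Chebyshev bound for the exponential model above (for an exponential distribution the standard deviation equals the mean, so σ = β; the choice t = 2β is illustrative, not from the notes):

beta = 1451.5                      # estimated mean intertweet time, seconds
sigma = beta                       # std dev of an exponential equals its mean
t = 2 * beta                       # deviations of at least two "sigmas" from the mean
bound = sigma**2 / float(t**2)     # Chebyshev: Pr{|X - mu| >= t} <= 0.25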

exp_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
pandas.Series(exp_diffs).hist(bins=50)
pandas.Series(exp_diffs).describe()
import math
exp_diffs = []
abs_diffs = []
# signed and absolute differences between each wait time and the fitted value 1/fitParams[0]
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
	abs_diffs.append(math.fabs(t-1/fitParams[0]))

pandas.Series(abs_diffs).hist()
pandas.Series(abs_diffs).describe()
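The average absolute error from the section heading is just the mean of abs_diffs; a one-line check:

mean_abs_error = pandas.Series(abs_diffs).mean()   # average absolute error, seconds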