Distribution of variables

Inspecting Distribution of Variables
f0: Average Submitted Charge
f1: Average Payment Amount
f2: Average Allowed Amount

g0 = f_CA.average_submitted_chrg_amt.values
g1 = f_CA.average_Medicare_payment_amt.values
g2 = f_CA.average_Medicare_allowed_amt.values

n0, bins0, patches0 = plt.hist(g0, 50, normed=0, range=(0, 1000), histtype='stepfilled')
n2, bins2, patches2 = plt.hist(g2, 50, normed=0, range=(0, 1000), histtype='stepfilled')
plt.setp(patches0, 'facecolor', 'g', 'alpha', 0.75)
plt.setp(patches2, 'facecolor', 'b', 'alpha', 0.75)

n0, bins0, patches0 = plt.hist(f0, 50, normed=1, log=1, range=(0, 1000))

To scale a variable to a common range we can use min-max scaling:
(f0 - f0.min()) / (f0.max() - f0.min())

n0, bins0, patches0 = plt.hist((f0 - f0.min()) / (f0.max() - f0.min()), 50, normed=1, log=1, range=(-0.2, 1.2), histtype='stepfilled')
n1, bins1, patches1 = plt.hist((f2 - f2.min()) / (f2.max() - f2.min()), 40, normed=1, log=1, range=(-0.2, 1.2), histtype='stepfilled')
plt.setp(patches0, 'facecolor', 'g', 'alpha', 0.75)
plt.setp(patches1, 'facecolor', 'r', 'alpha', 0.75)

The range of scaled variables: [0, 1]

Calculating Correlations
f1, f2: linearly correlated
Covariance: cov(x, y) = E[(x − μx)(y − μy)]
Pearson’s correlation coefficient: ρx,y = cov(x, y) / (σx·σy) = E[(x − μx)(y − μy)] / (σx·σy)
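A minimal check of both quantities with NumPy/SciPy, assuming f1 and f2 are the payment and allowed-amount arrays extracted elsewhere in these notes:

import numpy as np
from scipy.stats import pearsonr

# cov(x, y) = E[(x - mu_x)(y - mu_y)]; np.cov returns the 2x2 covariance matrix
cov_xy = np.cov(f1, f2)[0, 1]

# Pearson's rho = cov(x, y) / (sigma_x * sigma_y), always in [-1, 1]
rho, p_value = pearsonr(f1, f2)
print(rho)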

Parametric, non-parametric, mathematical
Kernel Density Estimates (KDE)
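Since seaborn is imported in these notes, a quick non-parametric look at a distribution is a kernel density estimate; a minimal sketch on the scaled charge variable:

# KDE: smoothed, non-parametric estimate of the distribution
g0_scaled = (g0 - g0.min()) / (g0.max() - g0.min())
sns.kdeplot(g0_scaled)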

The Medicare Code

CA_data = cwd +"/medicare_data/Medicare_Data_CA_xxx.csv"
f_CA = read_csv(CA_data)

f_CA.describe()
f_CA.head(10)
len(f_CA.columns)
for c in f_CA.columns: print c

IPython notebook

%pylab inline
from IPython.display import HTML
%matplotlib inline

import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns

from pandas import read_csv
import matplotlib.pyplot as pyplt

cwd = os.getcwd()

IL_data = cwd +"/medicare_data/Medicare_Data_IL_xxx.csv"
f_IL = read_csv(IL_data)

f_IL.describe()
f_IL.head(5)
len(f_IL.columns)
for c in f_IL.columns: print c

print len(f_IL.provider_type.unique())
print len(f_IL.nppes_provider_city.unique())
print len(f_IL.hcpcs_description.unique())

f0 = f_IL.average_submitted_chrg_amt.values
f1 = f_IL.average_Medicare_payment_amt.values
f2 = f_IL.average_Medicare_allowed_amt.values

Validate model

import sklearn
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(5)       # k = 5 neighbors
knn_3 = KNeighborsRegressor(3)     # k = 3 neighbors
len(knn_data_points)
.70*len(knn_data_points)           # size of a 70% training set
.30*len(knn_data_points)           # size of a 30% test set
import random
trainingPoints = [random.choice(knn_data_points) for i in xrange(966880)]  # sample with replacement
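A minimal validation sketch, assuming each element of knn_data_points is a ([features…], target) pair (the structure isn't shown in these notes):

# Hypothetical 70/30 split; knn_data_points structure is assumed
random.shuffle(knn_data_points)
split = int(.70 * len(knn_data_points))
train, test = knn_data_points[:split], knn_data_points[split:]

knn.fit([v[0] for v in train], [v[1] for v in train])
knn.score([v[0] for v in test], [v[1] for v in test])   # R^2 on held-out data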

Modelling Techniques
Q, M, V
- Mathematical models: natural, social, and economic sciences; differential/integral equations, algebra, and other advanced math; require generalized knowledge of the system
- Statistical machine learning: industry and general business applications; statistics, machine learning, linear algebra, information theory
- Combination: science, business, engineering, optimization

The Medicare Challenge: The Data
Inpatient Charges 2012
Look for files with: “IL” and “CA” in the instructor’s notes

%pylab inline
from IPython.display import HTML
%matplotlib inline

import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns

import pandas
from pandas import read_csv
import matplotlib.pyplot as pyplt

cwd = os.getcwd()

Identify features with an influencing relationship

X, Y
Prior: [Pr{X=x1}, Pr{X=x2}] = [0.29, 0.71]
Conditional: [Pr{X=x1|Y=y1}, Pr{X=x2|Y=y1}] = [0.16, 0.84]

X: x1, x2, …, xk
H(X) = −p1·log(p1) − p2·log(p2) − … − pk·log(pk)
     = −Σ_{i=1}^{k} pi·log(pi)

H(X) = 0 for a degenerate distribution: [0, 0, 0, …, 1, 0, …, 0]
H(X) is at its maximum for the uniform distribution: [1/k, 1/k, …, 1/k]

H(X) > H(X|Y=y1)
Information gain: H(X) − H(X|Y)

Given candidate features A, B, C, …: pick arg max_v [H(X) − H(X|v)]
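A minimal sketch of the entropy computation; the feature-selection step would evaluate it on the conditional distribution for each candidate feature:

import numpy as np

def entropy(probs):
	# H(X) = -sum_i pi*log(pi); zero-probability terms contribute 0
	probs = np.asarray(probs, dtype=float)
	nz = probs[probs > 0]
	return -np.sum(nz * np.log2(nz))

print(entropy([0.29, 0.71]))                           # prior H(X)
print(entropy([0.16, 0.84]))                           # H(X | Y=y1)
print(entropy([0.29, 0.71]) - entropy([0.16, 0.84]))   # gain for this outcome of Y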

Covariance between intertweet time and mention distance

mentionDists = [[v[0]] for v in nearestMentionToTimeDiff]   # 2-D X, as sklearn expects
intertweetTimes = [v[1] for v in nearestMentionToTimeDiff]

import sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
clf = linear_model.LinearRegression()

clf.fit(mentionDists, intertweetTimes)
clf.coef_
mentionDists = [v[0] for v in nearestMentionToTimeDiff]   # flatten back for plotting
pyplt.scatter(mentionDists[0:6], intertweetTimes[0:6])

Sources of prediction error

pl = pyplt.plot(xVals, yVals, label="std dev(sec)")
pyplt.legend(loc=2, prop={'size':18})
<matplotlib.legend.Legend at 0x145c51a50>

Inherent variance
Deterministic model: f = a·x
Observations: (x, f), …
Each x1 yields exactly a·x1.
With noise: f = a·x + noise
Repeated observations at the same x1 now differ: (x1, v1), (x1, v2), …
v1, v2, … scatter around a·x1

dataPoints_1 = []
x = np.arange(0, 100, 10)
for j in xrange(100):
	# true line y = 2x + 3, plus Gaussian noise with std dev 50
	points = [(i, i*2 + 3 + np.random.normal(scale=50.0)) for i in x]
	dataPoints_1.extend(points)

pointToVals = []
pointToBounds = []
for i in np.arange(0, 100, 10):
	# all noisy y observations at this x value
	valsForDataPoint = [v[1] for v in dataPoints_1 if v[0] == i]
	pointToVals.append(valsForDataPoint)
	upperBound = np.percentile(valsForDataPoint, 95)
	lowerBound = np.percentile(valsForDataPoint, 5)
	pointToBounds.append([i, (upperBound, lowerBound)])

pyplt.plot(x, [v[1][0] for v in pointToBounds])
pyplt.plot(x, [v[1][1] for v in pointToBounds])
pyplt.plot(x, [(i*2+3) for i in x], color="red", label="true")

Average Absolute Error

Refining exponential fit
f(t) = a·(1/β)·e^(−t/β) + c
Parameters: β, a, c

Fit a more generalized exponential

def fitFunc_gen(t, a, b, c):
	return a*(b)*numpy.exp(-b*t)+c

fitParams_gen, fitCov_gen = curve_fit(fitFunc_gen, division[0:len(division)-1], count)
fitParams_gen
fitCov_gen
(1/fitParams_gen[1])*fitParams_gen[0]+fitParams_gen[1]
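To eyeball the fit, the fitted curve can be overlaid on the normalized histogram; a minimal sketch, assuming count and division come from the np.histogram call shown later in these notes:

# Overlay the fitted curve on the binned data
t_vals = division[0:len(division)-1]
pyplt.plot(t_vals, count, label="data")
pyplt.plot(t_vals, fitFunc_gen(t_vals, *fitParams_gen), label="fit")
pyplt.legend()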

Intertweet time:
t1, t2
<-Δt-><-p->

Training examples
(Δt, p)
elapsed time, time until next tweet

step_size = 10
data_points = []
for v in timeUntilNext:
	# left edges of the bins the gap v passes through
	bin_left_edges = np.arange(0, v, step_size)

	for l_edge in bin_left_edges:
		# (elapsed time, time remaining until the next tweet)
		tempNewPoint = [l_edge, v - l_edge]
		data_points.append(tempNewPoint)

data_points.sort()
deltat_100 = [v[1] for v in data_points if v[0] == 100]
deltat_150 = [v[1] for v in data_points if v[0] == 150]
deltat_10 = [v[1] for v in data_points if v[0] == 10]

pandas.Series(deltat_10).hist(bins=30, alpha=0.5, color="blue")
d_150 = pandas.Series(deltat_150)
pandas.Series(deltat_150).hist(bins=30, alpha=0.3, color="red")

Pr{X > t}

Markov's inequality: Pr{X > t} ≤ E(X)/t. Apply it to X → |X − μ|:

Chebyshev’s inequality:
Pr{|X − μ| ≥ t} ≤ σ²/t²

Confidence bounds and data: given samples X1, …, Xn from f(x) = (1/β)·e^(−x/β), we can add confidence to the estimate β*:
Pr{|β* − β| > ε} ≤ 2e^(−2nε²)
β* = 1451.5 seconds, n = 3200
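Plugging in: a minimal sketch that inverts the bound 2e^(−2nε²) ≤ δ to get the half-width ε at 95% confidence (δ = 0.05). Note the textbook form of this bound assumes observations scaled to [0, 1], so ε is in those normalized units:

import math

n = 3200
delta = 0.05   # 1 - confidence level
# 2*exp(-2*n*eps^2) = delta  =>  eps = sqrt(log(2/delta) / (2*n))
eps = math.sqrt(math.log(2.0 / delta) / (2 * n))
print(eps)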

exp_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
pandas.Series(exp_diffs).hist(bins=50)
pandas.Series(exp_diffs).describe()
import math
exp_diffs = []
abs_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
	abs_diffs.append(math.fabs(t-1/fitParams[0]))

pandas.Series(abs_diffs).hist()
pandas.Series(abs_diffs).describe()

Exponential fit

Exponential distribution:
f(y) = (1/β)·e^(−y/β)

Maximum likelihood estimation
Data: d1, d2, …, dn
Likelihood: Pr{d1}·Pr{d2}·…·Pr{dn}

X = time until next tweet
f(X = t) = (1/β)·e^(−t/β)
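For the exponential, the likelihood is maximized in closed form at the sample mean, so the MLE of β can be checked directly against the curve fit below:

# MLE for an exponential distribution: beta* = sample mean
beta_mle = numpy.mean(timeUntilNext)
print(beta_mle)   # compare with 1/fitParams[0] from curve_fit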

from scipy.optimize import curve_fit
def fitFunc(t, b):
	return b*numpy.exp(-b*t)
count, division = np.histogram(timeUntilNext, bins=100, normed=True)
fitParams, fitCov = curve_fit(fitFunc, division[0:len(division)-1], count)

The Questioning Phase

Questioning
Modeling
Validating
=>
Answers!

e.g.
“Can we predict the next time a person will tweet?”
=> time of day

regression estimator, hypothesis test, classification

r(time since last tweet (Δt)) = time of next tweet

Prepare data for histogram

tweetsDF = pandas.io.json.read_json("new_gruber_tweets.json")
createdDF = tweetsDF.ix[0:, ["created_at"]]
createdTextDF = tweetsDF.ix[0:, ["created_at", "text"]]
createdTextVals = createdTextDF.values

Collect the "created_at" attribute for each tweet in tweetsDF

tweetTimes = []
for i, row in createdDF.iterrows():
	tweetTimes.append(row["created_at"])
tweetTimes.sort()
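The timeToNextSeries used below isn't constructed in these notes; a minimal sketch, assuming the created_at values parse to timestamps and the gaps are wanted in seconds:

# Gap (seconds) between each tweet and the next
timeToNext = [(t2 - t1).total_seconds()
              for t1, t2 in zip(tweetTimes[:-1], tweetTimes[1:])]
timeToNextSeries = pandas.Series(timeToNext)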

Create initial histogram

timeToNextSeries.hist(bins=30, normed=True)
<matplotlib.axes.AxesSubplot at 0x10c625390>

The QMV process of model building

The QMV iterative process of analysis
“when will my service tech arrive?”

Dispatch system's estimated arrival time
tech’s reported arrival time
time of tech’s prior finished job
=>
Classification, e.g. before noon vs. afternoon
regression estimator

Classification route
Bayes Net, Perceptron, Logistic Regression

Candy bowl and consumer choice modeling
Consumer choice modeling: understand how consumers make decisions
– Is it possible to find a preference ordering for product brands?
– Can we infer that there even exists a preference between brands?

Description of the Candy Bowl Data
consumer choice

-name
-gender
-candy
-candy color/flavor
-age
-ethnicity

Time between selections
“Interselection time” for candy c = # of turns between selections of c

In []: plot_interselection_time(event_list, "orange", "airhead")
In []: plot_interselection_time(event_list, "red", "starburst")
		plot_interselection_time(event_list, "orange", "airhead")
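plot_interselection_time isn't defined in these notes; a minimal sketch of the underlying computation, assuming event_list is an ordered list of (person, candy, color) selections (a hypothetical structure):

def interselection_times(event_list, color, candy):
	# Turn indices at which this candy/color was selected
	turns = [i for i, e in enumerate(event_list)
	         if e[1] == candy and e[2] == color]
	# Number of turns between successive selections
	return [b - a for a, b in zip(turns[:-1], turns[1:])]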

Point estimation, Confidence sets, Classification, Hypothesis testing

r: interselection time for candy c at a given turn
x = ("airhead", 1), ("role", 5), ("starburst", 7), …

Features: candy c, choice #, and the interselection times of the other candies in the bowl → r(c, choice #)