Distribution of variables

Inspecting Distribution of Variables
f0: Average Submitted Charge
f1: Average Payment Amount
f2: Average Allowed Amount

g0 = f_CA.average_submitted_chrg_amt.values
g1 = f_CA.average_Medicare_payment_amt.values
g2 = f_CA.average_Medicare_allowed_amt.values
 
n0, bins0, patches0 = plt.hist(g0, 50, normed=0, range=(0,1000), histtype='stepfilled')
n2, bins2, patches2 = plt.hist(g2, 50, normed=0, range=(0,1000), histtype='stepfilled')
plt.setp(patches0, 'facecolor', 'g', 'alpha', 0.75)
plt.setp(patches2, 'facecolor', 'b', 'alpha', 0.75)
 
n0, bins0, patches0 = plt.hist(g0, 50, normed=1, log=1, range=(0,1000), histtype='stepfilled')

To scale a variable to a common range, we can use min-max scaling:
(f0 – f0.min()) / (f0.max() – f0.min())

n0, bins0, patches0 = plt.hist((f0-f0.min())/(f0.max()-f0.min()), 50, normed=1, log=1, range=(-0.2,1.2), histtype='stepfilled')
n1, bins1, patches1 = plt.hist((f2-f2.min())/(f2.max()-f2.min()), 40, normed=1, log=1, range=(-0.2,1.2), histtype='stepfilled')
plt.setp(patches0, 'facecolor', 'g', 'alpha', 0.75)
plt.setp(patches1, 'facecolor', 'r', 'alpha', 0.75)

The range of scaled variables: [0, 1]

Calculating Correlations
f1, f2: linearly correlated
Covariance: cov(x,y) = E[(x−μx)(y−μy)]
Pearson’s Correlation Coefficient: ρx,y = cov(x,y)/(σx·σy) = E[(x−μx)(y−μy)]/(σx·σy)
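
As a quick check, both quantities can be computed directly (a minimal sketch, assuming f1 and f2 are the numpy arrays extracted from the dataframe):

import numpy as np
from scipy.stats import pearsonr

cov_xy = np.cov(f1, f2)[0, 1]    # E[(x-μx)(y-μy)], off-diagonal of the 2x2 covariance matrix
rho, p_value = pearsonr(f1, f2)  # cov(x,y)/(σx·σy), plus a two-sided p-value
print(cov_xy)
print(rho)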

Estimators: Parametric, Non-parametric, Mathematical
Kernel Density Estimates (KDE)

The Medicare Code

CA_data = cwd + "/medicare_data/Medicare_Data_CA_xxx.csv"
f_CA = read_csv(CA_data)
 
f_CA.describe()
f_CA.head(10)
len(f_CA.columns)
for c in f_CA.columns: print c

IPython notebook

%pylab inline
from IPython.display import HTML
%matplotlib inline
 
import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns
 
from pandas import read_csv
import matplotlib.pyplot as pyplot
 
cwd = os.getcwd()
 
IL_data = cwd +"/medicare_data/Medicare_Data_IL_xxx.csv"
f_IL = read_csv(IL_data)
 
f_IL.describe()
f_IL.head(5)
len(f_IL.columns)
for c in f_IL.columns: print c
 
print len(f_IL.provider_type.unique())
print len(f_IL.nppes_provider_city.unique())
print len(f_IL.hcpcs_description.unique())
 
f0 = f_IL.average_submitted_chrg_amt.values
f1 = f_IL.average_Medicare_payment_amt.values
f2 = f_IL.average_Medicare_allowed_amt.values

Validate model

import sklearn
from sklearn.neighbors import KNeighborsRegressor
 
knn = KNeighborsRegressor(5)    # k = 5 nearest neighbors
knn_3 = KNeighborsRegressor(3)  # k = 3, for comparison
len(knn_data_points)
.70*len(knn_data_points)  # size of a 70% training split
.30*len(knn_data_points)  # size of the 30% held-out split
import random
# Draw a training sample (with replacement) from the data points
trainingPoints = [random.choice(knn_data_points) for i in xrange(966880)]
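
A cleaner alternative to sampling with replacement is to shuffle and slice into a 70/30 split, then score on the held-out part; a minimal sketch, assuming knn_data_points is a list of (feature, target) pairs:

import random

# 70/30 split without replacement: shuffle, then slice
random.shuffle(knn_data_points)
split = int(0.70 * len(knn_data_points))
train, test = knn_data_points[:split], knn_data_points[split:]

X_train = [[p[0]] for p in train]  # sklearn expects 2-D feature arrays
y_train = [p[1] for p in train]
X_test = [[p[0]] for p in test]
y_test = [p[1] for p in test]

knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # R^2 on the held-out 30%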

Modelling Techniques (Q, M, V)
– Mathematical models: natural, social, and economic sciences; differential/integral equations, algebra, and other advanced math; require generalized knowledge of the system
– Statistical machine learning: industry and general business applications; statistics, machine learning, linear algebra, information theory
– Combination: science, business, and engineering optimization

The Medicare Challenge: The Data
Inpatient Charges 2012
Look for files with: “IL” and “CA” in the instructor’s notes

%pylab inline
from IPython.display import HTML
%matplotlib inline
 
import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns
 
from pandas import read_csv
import matplotlib.pyplot as pyplot
 
cwd = os.getcwd()

Identify features with an influencing relationship

X,Y
[Pr{x=x1},Pr{x=x2}]
[0.29, 0.71]
[Pr{x=x1|y=y1}, Pr{x=x2|y=y1}]
[0.16, 0.84]

X: x1, x2, …, xk
H(X) = −p1·log(p1) − p2·log(p2) − … − pk·log(pk) = −Σi pi·log(pi)

H(X) = 0 for a degenerate distribution: [0, 0, …, 1, …, 0]
H(X) reaches its maximum for the uniform distribution: [1/k, 1/k, …, 1/k]

Conditioning on an observation typically lowers entropy: H(X) > H(X|Y=y1)
Information gain: H(X) − H(X|Y)

Among candidate features v ∈ {A, B, C, …}, choose arg maxv [H(X) − H(X|v)], i.e. the feature that most reduces the entropy of X.
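
These quantities are straightforward to estimate from samples; a minimal sketch, assuming xs and ys are lists of categorical values:

import math
from collections import Counter

def entropy(values):
    # H(X) = -sum_i p_i * log2(p_i), with p_i estimated from sample frequencies
    n = float(len(values))
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(values).values())

def information_gain(xs, ys):
    # H(X) - H(X|Y), where H(X|Y) = sum_y Pr{Y=y} * H(X | Y=y)
    n = float(len(xs))
    h_cond = 0.0
    for y, c in Counter(ys).items():
        h_cond += (c / n) * entropy([x for x, yy in zip(xs, ys) if yy == y])
    return entropy(xs) - h_cond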

Covariance between intertweet and mention distance

mentionDists = [[v[0]] for v in nearestMentionToTimeDiff]  # 2-D, as sklearn expects
intertweetTimes = [v[1] for v in nearestMentionToTimeDiff]
 
import sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
clf = linear_model.LinearRegression()
 
clf.fit(mentionDists, intertweetTimes)
clf.coef_
mentionDists = [v[0] for v in nearestMentionToTimeDiff]  # back to 1-D for plotting
pyplt.scatter(mentionDists[0:6], intertweetTimes[0:6])
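
The covariance named in the heading can also be read off directly with numpy, without going through the regression coefficient (a sketch over the same nearestMentionToTimeDiff pairs):

import numpy as np

m = [v[0] for v in nearestMentionToTimeDiff]  # mention distances
t = [v[1] for v in nearestMentionToTimeDiff]  # intertweet times
print(np.cov(m, t)[0, 1])       # covariance between the two
print(np.corrcoef(m, t)[0, 1])  # Pearson correlation, for a scale-free comparison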

sources of prediction error

pl = pyplt.plot(xVals, yVals, label="std dev(sec)")
pyplt.legend(loc=2, prop={'size':18})
<matplotlib.legend.Legend at 0x145c51a50>

Inherent variance
A noiseless model f = a·x gives data points (x1, a·x1), (x2, a·x2), …
In practice: f = a·x + noise

Repeated observations at the same x1 then give different values: (x1, v1), (x1, v2), …
v1, v2, … vary around a·x1

dataPoints_1 = []
x = np.arange(0, 100, 10)
for j in xrange(100):
    # True model y = 2x + 3, plus Gaussian noise
    points = [(i, i*2 + 3 + numpy.random.normal(scale=50.0)) for i in x]
    dataPoints_1.extend(points)
 
pointToVals = []
pointToBounds = []
for i in np.arange(0, 100, 10):
    valsForDataPoint = [v[1] for v in dataPoints_1 if v[0] == i]
    pointToVals.append(valsForDataPoint)
    upperBound = numpy.percentile(valsForDataPoint, 95)
    lowerBound = numpy.percentile(valsForDataPoint, 5)
    pointToBounds.append([i, (upperBound, lowerBound)])
 
pyplt.plot(x, [v[1][0] for v in pointToBounds])
pyplt.plot(x, [v[1][1] for v in pointToBounds])
pyplt.plot(x, [(i*2+3) for i in x], color="red", label="true")

Average Absolute Error

Refining exponential fit
f(t) = a·(1/β)·e^(−t/β) + c
Parameters: β, a, c

Fit a more generalized exponential

def fitFunc_gen(t, a, b, c):
    return a*(b)*numpy.exp(-b*t)+c
 
fitParams_gen, fitCov_gen = curve_fit(fitFunc_gen, division[0:len(division)-1], count)
fitParams_gen
fitCov_gen
(1/fitParams_gen[1])*fitParams_gen[0] + fitParams_gen[2]  # a*(1/b) + c

Intertweet time:
Between consecutive tweets t1 and t2: t1 <-Δt-> now <-p-> t2

Training examples
(Δt, p)
Δt = elapsed time since the last tweet, p = time until the next tweet

step_size = 10
data_points = []
for v in timeUntilNext:
    # Left edges of the bins that fit inside this intertweet gap
    bin_left_edges = np.arange(0, v, step_size)
 
    for l_edge in bin_left_edges:
        # (elapsed time, remaining time until the next tweet)
        tempNewPoint = [l_edge, v - l_edge]
        data_points.append(tempNewPoint)
 
data_points.sort()
deltat_100 = [v[1] for v in data_points if v[0]==100]
deltat_150 = [v[1] for v in data_points if v[0]==150]
deltat_10 = [v[1] for v in data_points if v[0]==10]
 
pandas.Series(deltat_10).hist(bins=30, alpha=0.5, color="blue")
d_150 = pandas.Series(deltat_150)
d_150.hist(bins=30, alpha=0.3, color="red")

Pr{X > t}

Markov’s inequality: Pr{X > t} ≤ E(X)/t; applying it to X → |X − μ| bounds deviations from the mean.

Chebyshev’s inequality
Pr{|X − μ| ≥ t} ≤ σ²/t²

Confidence bounds and data: given samples X1, …, Xn from f(x) = (1/β)·e^(−x/β), we can add confidence to the estimate of β:
Pr{|β̂ − β| > ε} ≤ 2e^(−2nε²)
β̂ = 1451.5 seconds, n = 3200
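
Solving 2e^(−2nε²) = δ for ε gives the half-width of a confidence interval at level 1 − δ; a minimal sketch, noting that this Hoeffding-style form strictly applies to variables rescaled into [0, 1]:

import math

n = 3200      # number of intertweet samples, from the notes above
delta = 0.05  # allowed failure probability (assumption: 95% confidence)

# Invert 2*exp(-2*n*eps^2) = delta for eps
eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
print(eps)    # half-width of the confidence interval around the estimate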

exp_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
pandas.Series(exp_diffs).hist(bins=50)
pandas.Series(exp_diffs).describe()
import math
exp_diffs = []
abs_diffs = []
for t in timeUntilNext:
    exp_diffs.append(t-1/fitParams[0])
    abs_diffs.append(math.fabs(t-1/fitParams[0]))
 
pandas.Series(abs_diffs).hist()
pandas.Series(abs_diffs).describe()

exponential fit

exponential distribution
f(y) = (1/β)·e^(−y/β)

Maximum likelihood estimation
Given data d1, d2, …, dn, choose the parameters that maximize Pr{d1}·Pr{d2}·…·Pr{dn}

X = time until next tweet
f(x = t) = (1/β)·e^(−t/β)
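
For the exponential, this maximization has a closed form: the β maximizing the product of densities is the sample mean (a sketch, assuming timeUntilNext holds the intertweet times in seconds):

import numpy as np

# Maximizing prod_i (1/beta) * exp(-d_i/beta) over beta
# gives beta_hat = mean(d_1, ..., d_n)
beta_hat = np.mean(timeUntilNext)
print(beta_hat)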

from scipy.optimize import curve_fit
def fitFunc(t, b):
    return b*numpy.exp(-b*t)
count, division = np.histogram(timeUntilNext, bins=100, normed=True)
fitParams, fitCov = curve_fit(fitFunc, division[0:len(division)-1], count)

The Questioning Phase

Questioning
Modeling
Validating
=>
Answers!

e.g.
“Can we predict the next time a person will tweet?”
=> time of day

regression estimator, hypothesis test, classification

r(time since last tweet (Δt)) = time of next tweet

Prepare data for histogram

tweetsDF = pandas.io.json.read_json("new_gruber_tweets.json")
createdDF = tweetsDF.ix[0:, ["created_at"]]
createdTextDF = tweetsDF.ix[0:, ["created_at", "text"]]
createdTextVals = createdTextDF.values
 
# Collect "created_at" attributes for each tweet in tweetsDF
 
tweetTimes = []
for i, row in createdDF.iterrows():
    tweetTimes.append(row["created_at"])
tweetTimes.sort()
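
The histogram cell below uses timeToNextSeries, which is not built yet; a minimal sketch of constructing it, assuming created_at was parsed into timestamps (if they arrive as strings, convert them first):

# Seconds between consecutive tweets
timeToNext = []
for i in range(len(tweetTimes) - 1):
    delta = tweetTimes[i + 1] - tweetTimes[i]
    timeToNext.append(delta.total_seconds())
timeToNextSeries = pandas.Series(timeToNext)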

Create initial histogram

timeToNextSeries.hist(bins=30, normed=True)
<matplotlib.axes.AxesSubplot at 0x10c625390>

The QMV process of model building

The QMV iterative process of analysis
“when will my service tech arrive?”

Dispatch system’s estimated arrival time
tech’s reported arrival time
time of tech’s prior finished job
=>
Classification, e.g. before noon vs. afternoon
regression estimator

Classification route
Bayes Net, Perceptron, Logistic Regression

Candy bowl and consumer choice modeling
Consumer choice modeling: understand how consumers make decisions
– Is it possible to find a preference ordering for product brands?
– Can we infer that there even exists a preference between brands?

Description of the Candy Bowl Data
consumer choice

-name
-gender
-candy
-candy color/flavor
-age
-ethnicity

Time between selections
“Interselection time” for candy c = # of turns between selections of c
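
A minimal sketch of computing interselection times, under the assumption (hypothetical; adjust to the real schema) that event_list is an ordered list of picks with the candy name at index 1:

def interselection_times(event_list, candy):
    # Turn indices at which this candy was selected
    turns = [i for i, ev in enumerate(event_list) if ev[1] == candy]
    # Gaps (in turns) between consecutive selections
    return [b - a for a, b in zip(turns, turns[1:])]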

In []: plot_interselection_time(event_list, "orange", "airhead")
In []: plot_interselection_time(event_list, "red", "starburst")
        plot_interselection_time(event_list, "orange", "airhead")

Point estimation, Confidence sets, Classification, Hypothesis testing

r: interselection time for candy, c, at a given turn
x = (“airhead”, 1), (“role”, 5), (“starburst”, 7), …

Features: c, choice #, interselection time of the other candies in the bowl => r(c, choice #)