Validate model

import sklearn
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
knn_3 = KNeighborsRegressor(n_neighbors=3)

len(knn_data_points)         # total number of points
.70*len(knn_data_points)     # target size of the training set
.30*len(knn_data_points)     # target size of the test set

import random
# Draw the training points (note: random.choice samples WITH replacement)
trainingPoints = [random.choice(knn_data_points) for i in xrange(966880)]
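
The notes stop after drawing training points; a minimal sketch of finishing the validation with a held-out test set (newer sklearn API; the row layout of knn_data_points is my assumption, not the course's):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical layout: each row of knn_data_points is (features..., target)
data = np.array(knn_data_points)
X, y = data[:, :-1], data[:, -1]

# 70/30 split WITHOUT replacement, unlike random.choice above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # R^2 on the held-out data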

Modelling Techniques
Q, M, V

- Mathematical models: natural, social, and economic sciences; built on differential/integral equations, algebra, and other advanced math; require generalized knowledge of the system
- Statistical / machine learning models: industry, business, and general applications; built on statistics, machine learning, linear algebra, and information theory
- Combination models: science, business, and engineering optimization

The Medicare Challenge: The Data
Inpatient Charges 2012
Look for files with: “IL” and “CA” in the instructor’s notes

%pylab inline
from IPython.display import HTML
%matplotlib inline

import os
import sys
from StringIO import StringIO
import scipy
import seaborn as sns

import pandas
from pandas import read_csv
import matplotlib.pyplot as pyplt

cwd = os.getcwd()

Identify features with an influencing relationship

X, Y
Marginal: [Pr{X=x1}, Pr{X=x2}] = [0.29, 0.71]
Conditional: [Pr{X=x1|Y=y1}, Pr{X=x2|Y=y1}] = [0.16, 0.84]

X: x1, x2, …, xk with probabilities p1, p2, …, pk

H(X) = -p1*log(p1) - p2*log(p2) - … - pk*log(pk)
     = -Σ_{i=1}^{k} pi*log(pi)

H(X) = 0 for a point-mass distribution: [0, 0, 0, …, 1, 0, …, 0]
H(X) at max for the uniform distribution: [1/k, 1/k, …, 1/k]

Observing Y reduces uncertainty: H(X) > H(X|Y=y1)
Information gain: H(X) – H(X|Y)

Over candidate features v ∈ {A, B, C, …}, pick arg min_v H(X|v), i.e. the feature maximizing the gain H(X) – H(X|v)
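
A minimal sketch of selecting the most influential feature by information gain (the function and variable names are mine, not the course's; assumes discrete-valued columns):

import math
from collections import Counter

def entropy(values):
    # H(X) = -sum_i pi*log(pi), estimated from value counts
    n = float(len(values))
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def cond_entropy(xs, ys):
    # H(X|Y) = sum_y Pr{Y=y} * H(X | Y=y)
    n = float(len(xs))
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def pick_feature(target, features):
    # arg max over v of the information gain H(X) - H(X|v)
    return max(features, key=lambda v: entropy(target) - cond_entropy(target, features[v]))

Usage would look like pick_feature(x_column, {"A": colA, "B": colB, "C": colC}).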

Covariance between intertweet time and mention distance

mentionDists = [[v[0]] for v in nearestMentionToTimeDiff]    # 2-D: one feature per row, as sklearn expects
intertweetTimes = [v[1] for v in nearestMentionToTimeDiff]

import sklearn
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
clf = linear_model.LinearRegression()

clf.fit(mentionDists, intertweetTimes)
clf.coef_
mentionDists = [v[0] for v in nearestMentionToTimeDiff]   # flatten back to 1-D for plotting
pyplt.scatter(mentionDists[0:6], intertweetTimes[0:6])
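
The covariance named in the heading can also be read off directly (a sketch; nearestMentionToTimeDiff is assumed to hold (mention distance, intertweet time) pairs as above):

import numpy as np

# Off-diagonal entry of the 2x2 covariance matrix = Cov(mention distance, intertweet time)
print(np.cov(mentionDists, intertweetTimes)[0, 1])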

Sources of prediction error

pl = pyplt.plot(xVals, yVals, label="std dev(sec)")
pyplt.legend(loc=2, prop={'size':18})
<matplotlib.legend.Legend at 0x145c51a50>

Inherent variance

True relationship: f = a*x, giving pairs (x, f), …
e.g. x1 maps to a*x1
Observed: f = a*x + NOISE

Repeated observations at the same x1: (x1, v1), (x1, v2), …
the values v1, v2, … differ even though x is fixed

dataPoints_1 = []
x = np.arange(0, 100, 10)
for j in xrange(100):
	points = [(i, i*2 + 3 + numpy.random.normal(scale=50.0)) for i in x]
	dataPoints_1.extend(points)

pointToVals = []
pointToBounds = []
for i in np.arange(0, 100, 10):
	valsForDataPoint = [v[1] for v in dataPoints_1 if v[0]==i]
	pointToVals.append(valsForDataPoint)
	upperBound = numpy.percentile(valsForDataPoint, 95)
	lowerBound = numpy.percentile(valsForDataPoint, 5)
	pointToBounds.append([i, (upperBound, lowerBound)])

pyplt.plot(x, [v[1][0] for v in pointToBounds])
pyplt.plot(x, [v[1][1] for v in pointToBounds])
pyplt.plot(x, [(i*2+3) for i in x], color="red", label="true")

Average Absolute Error

Refining exponential fit
f(t) = a*(1/β)*e^(-t/β) + c
free parameters: β, a, c

Fit a more generalized exponential

def fitFunc_gen(t, a, b, c):
	return a*(b)*numpy.exp(-b*t)+c

fitParams_gen, fitCov_gen = curve_fit(fitFunc_gen, division[0:len(division)-1], count)
fitParams_gen
fitCov_gen
(1/fitParams_gen[1])*fitParams_gen[0]+fitParams_gen[1]

Intertweet time:
consecutive tweets at t1 and t2
t1 <-Δt-> now <-p-> t2

Training examples
(Δt, p)
= (elapsed time since the last tweet, time until the next tweet)

step_size = 10
data_points = []
for v in timeUntilNext:
	bin_left_edges = np.arange(0, v, step_size)

	for l_edge in bin_left_edges:
		tempNewPoint = [l_edge, v-l_edge]
		data_points.append(tempNewPoint)

data_points.sort()
deltat_100 = [v[1] for v in data_points if v[0]==100]
deltat_150 = [v[1] for v in data_points if v[0]==150]
deltat_10 = [v[1] for v in data_points if v[0]==10]

pandas.Series(deltat_10).hist(bins=30, alpha=0.5, color="blue")
d_150 = pandas.Series(deltat_150)
d_150.hist(bins=30, alpha=0.3, color="red")

Pr{X > t}

Markov's inequality: Pr{X > t} <= E(X)/t; apply it to the variable |X – μ|

Chebyshev's inequality
Pr{|X – μ| >= t} <= σ²/t²

Confidence bounds and data: given samples X1, …, Xn from f(x) = (1/β)e^(-x/β), add confidence to the estimate β*:
Pr{|β* – β| > ε} <= 2e^(-2nε²)
with β = 1451.5 seconds, n = 3200
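
Plugging in the notes' numbers: solving 2e^(-2nε²) = δ for ε gives the half-width of a (1 – δ) confidence interval (a sketch; the Hoeffding-style bound as written assumes suitably normalized samples):

import math

n = 3200
delta = 0.05  # allowed failure probability
# invert 2*e^(-2*n*eps^2) = delta for eps
eps = math.sqrt(math.log(2.0 / delta) / (2 * n))
print(eps)    # ~0.024 in the normalized units of the bound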

exp_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t - 1/fitParams[0])   # 1/fitParams[0] = fitted mean of the exponential
pandas.Series(exp_diffs).hist(bins=50)
pandas.Series(exp_diffs).describe()
import math
exp_diffs = []
abs_diffs = []
for t in timeUntilNext:
	exp_diffs.append(t-1/fitParams[0])
	abs_diffs.append(math.fabs(t-1/fitParams[0]))

pandas.Series(abs_diffs).hist()
pandas.Series(abs_diffs).describe()

exponential fit

exponential distribution
f(y) = (1/β)e^(-y/β)

Maximum likelihood estimation
observations d1, d2, …, dn
maximize the likelihood Pr{d1}*Pr{d2}*…*Pr{dn}

X = time until next tweet
f(X = t) = (1/β)e^(-t/β)
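
For the exponential, the likelihood maximization has a closed form, which gives a check on the curve_fit estimate below: setting the derivative of the log-likelihood to zero yields β̂ = sample mean (a sketch, reusing timeUntilNext from the notes):

import numpy as np

# log-likelihood: sum_i [-log(beta) - d_i/beta]; its maximizer is the mean
beta_hat = np.mean(timeUntilNext)
print(beta_hat)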

from scipy.optimize import curve_fit
def fitFunc(t, b):
	return b*numpy.exp(-b*t)
count, division = np.histogram(timeUntilNext, bins=100, normed=True)
# division holds the bin edges (one more entry than count), so drop the last edge
fitParams, fitCov = curve_fit(fitFunc, division[0:len(division)-1], count)

The Questioning Phase

Questioning
Modeling
Validating
=>
Answers!

e.g.
“Can we predict the next time a person will tweet?”
=> time of day

regression estimator, hypothesis test, classification

r(time since last tweet (Δt)) = time until next tweet

Prepare data for histogram

tweetsDF = pandas.io.json.read_json("new_gruber_tweets.json")
createdDF = tweetsDF.ix[0:, ["created_at"]]
createdTextDF = tweetsDF.ix[0:, ["created_at", "text"]]
createdTextVals = createdTextDF.values

Collect "created_at" attributes for each tweetsDF

tweetTimes = []
for i, row in createdDF.iterrows():
	tweetTimes.append(row["created_at"])
tweetTimes.sort()
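
timeToNextSeries used below is not constructed in these notes; presumably it holds the gaps between consecutive tweet times, roughly:

# successive gaps between tweets, in seconds (assumes the sorted
# created_at values are pandas Timestamps)
timeToNext = [(tweetTimes[i + 1] - tweetTimes[i]).total_seconds()
              for i in range(len(tweetTimes) - 1)]
timeToNextSeries = pandas.Series(timeToNext)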

Create initial histogram

timeToNextSeries.hist(bins=30, normed=True)
<matplotlib.axes.AxesSubplot at 0x10c625390>

The QMV process of model building

The QMV iterative process of analysis
“when will my service tech arrive?”

Dispatch system's estimated arrival time
tech’s reported arrival time
time of tech’s prior finished job
=>
Classification e.g. before noon/afternoon
regression estimator

Classification route
Bayes Net, Perceptron, Logistic Regression
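
A minimal sketch of the classification route with one of the listed models (logistic regression; the feature layout and toy data are hypothetical):

from sklearn.linear_model import LogisticRegression

# Hypothetical features per job: [dispatch estimate (hr), prior-job finish time (hr)]
X = [[9.5, 8.0], [13.0, 11.5], [10.0, 9.0], [15.5, 14.0]]
y = [1, 0, 1, 0]  # 1 = arrived before noon, 0 = afternoon

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[10.5, 9.0]]))  # predicted before-noon/afternoon class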

Candy bowl and consumer choice modeling
Consumer choice modeling: understand how consumers make decisions
– Is it possible to find a preference ordering for product brands?
– Can we infer that there even exists a preference between brands?

Description of the Candy Bowl Data
consumer choice

-name
-gender
-candy
-candy color/flavor
-age
-ethnicity

Time between selections
“Interselection time” for candy c = # of turns between selections of c
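
plot_interselection_time below is the instructor's helper; a sketch of the computation it presumably wraps (the structure of event_list is my assumption):

def interselection_times(event_list, color, candy):
    # turns between successive selections of the given (color, candy);
    # assumes event_list is an ordered list of (candy, color) picks
    turns = [i for i, (c, col) in enumerate(event_list)
             if c == candy and col == color]
    return [b - a for a, b in zip(turns, turns[1:])]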

In []: plot_interselection_time(event_list, "orange", "airhead")
In []: plot_interselection_time(event_list, "red", "starburst")
		plot_interselection_time(event_list, "orange", "airhead")

Point estimation, Confidence sets, Classification, Hypothesis testing

r: interselection time for candy c at a given turn
x = ("airhead", 1), ("role", 5), ("starburst", 7), …

c, choice #, interselection times of other candies in bowl => r(c, choice #)

Cloud Deployment Models

Public
-third party customers/tenants
Private
-leverage technology internally
Hybrid (Public + Private)
-failover, dealing with spikes, testing
Community
-used by certain types of users

On-premises
Infrastructure (IaaS)
Platform (PaaS)
Software (SaaS)

1. “fungible” resources
2. elastic, dynamic resource allocations
3. scale: management at scale, scalable resources
4. dealing with failures
5. multi-tenancy: performance & isolation
6. security

Cloud-enabling Technologies
-virtualization
-Resource provisioning (scheduling): Mesos, YARN…

Storage
-distributed FS(“append only”)
-NoSQL, distributed in-memory caches…

Software-defined… networking, storage, datacenters…

“the cloud as a big data engine”
-data storage layer
-data processing layer
-caching layer
-language front-ends

Datacenter Technologies

Internet service == any type of service provided via web interface

-presentation == static content
-business logic == dynamic content
-database tier == data store

-not necessarily separate processes on separate machines
-many available open source and proprietary technologies

…in multi process configurations ->
some form of IPC used, including RPC/RMI, shared memory …

For scale: multi-process, multi-node
=> “scale out” architecture

1. “Boss-worker”: front-end distributes requests to nodes
2. “All Equal”: all nodes execute any possible step in request processing, for any request

Functionally heterogeneous…
-different nodes, different tasks/requests
-data doesn’t have to be uniformly accessible everywhere

Traditional Approach:
– buy and configure resources
=> determine capacity based on expected (peak) demand
– when demand exceeds capacity:
dropped requests
lost opportunity

- on-demand elastic resources and services
- fine-grained pricing based on usage
- professionally managed and hosted
- API-based access

shared resources
– infrastructure and software/services
APIs for access & configuration
– web-based, libraries, command line…

Law of large numbers
– per customer, there is large variation in resource needs
– the average across many customers is roughly constant
Economies of Scale
– unit cost of providing resources or services drops “in bulk”
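
An illustrative simulation of the law-of-large-numbers claim (the numbers are arbitrary): individual customers' demands swing widely, but the average across many customers is nearly constant day to day.

import numpy as np

rng = np.random.RandomState(0)
# 1000 customers x 30 days of noisy per-customer demand
demand = rng.exponential(scale=10.0, size=(1000, 30))

print(demand[0].std())            # one customer: large day-to-day swings
print(demand.mean(axis=0).std())  # average over customers: nearly flat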