exponential fit

exponential distribution
f(y) = (1/β) e^(-y/β)

Maximum likelihood estimation
d1, d2, …, dn
Pr{d1} * Pr{d2} * … * Pr{dn}
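A worked step the notes skip: for the exponential pdf, maximizing this product over β lands on the sample mean.

L(β) = Π_i (1/β) e^(-d_i/β)
log L(β) = -n log β - (1/β) Σ_i d_i
d/dβ log L(β) = -n/β + (1/β²) Σ_i d_i = 0  =>  β̂ = (1/n) Σ_i d_i  (the sample mean of the observed gaps)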

X = time until next tweet
f(X = t) = (1/β) e^(-t/β)

from scipy.optimize import curve_fit
import numpy as np

def fitFunc(t, b):
    # exponential pdf with rate b (b = 1/β)
    return b * np.exp(-b * t)

# normalized histogram of inter-tweet times; fit the pdf to the left bin edges
count, division = np.histogram(timeUntilNext, bins=100, density=True)
fitParams, fitCov = curve_fit(fitFunc, division[:-1], count)
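As a quick sanity check (a sketch, assuming timeUntilNext holds the observed gaps between consecutive tweets), the closed-form MLE rate 1/mean should roughly agree with the fitted parameter:

# closed-form MLE of the rate is 1 / sample mean; compare with curve_fit's estimate
rate_mle = 1.0 / np.mean(timeUntilNext)
print(fitParams[0], rate_mle)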

The Questioning Phase

Questioning
Modeling
Validating
=>
Answers!

e.g.
“Can we predict the next time a person will tweet?”
=> time of day

regression estimator, hypothesis test, classification

r(time since last tweet (Δt)) = time of next tweet

Prepare data for histogram

import pandas

tweetsDF = pandas.read_json("new_gruber_tweets.json")
createdDF = tweetsDF.loc[0:, ["created_at"]]
createdTextDF = tweetsDF.loc[0:, ["created_at", "text"]]
createdTextVals = createdTextDF.values

Collect "created_at" attributes for each tweetsDF

tweetTimes = []
for i, row in createdDF.iterrows():
	tweetTimes.append(row["created_at"])
tweetTimes.sort()
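The histogram below uses timeToNextSeries, which these notes never construct. One plausible construction (an assumption, not from the original notes) is the gap, in seconds, between each tweet and the next in the sorted list:

# gaps between consecutive tweets, in seconds
timeToNext = [(t2 - t1).total_seconds()
              for t1, t2 in zip(tweetTimes[:-1], tweetTimes[1:])]
timeToNextSeries = pandas.Series(timeToNext)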

Create initial histogram

timeToNextSeries.hist(bins=30, density=True)
<matplotlib.axes.AxesSubplot at 0x10c625390>

The QMV process of model building

The QMV iterative process of analysis
“when will my service tech arrive?”

Dispatch system's estimated arrival time
tech’s reported arrival time
time of tech’s prior finished job
=>
Classification e.g. before noon/afternoon
regression estimator

Classification route
Bayes Net, Perceptron, Logistic Regression
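A minimal sketch of the classification route with one of the listed models (scikit-learn's LogisticRegression); the features and toy values here are illustrative assumptions, not the actual dispatch data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# features: dispatch-system estimated arrival hour, hour the tech's prior job finished
X = np.array([[9.0, 8.5], [13.0, 11.5], [10.5, 9.0], [15.0, 14.0]])
y = np.array([1, 0, 1, 0])            # 1 = before noon, 0 = afternoon

clf = LogisticRegression().fit(X, y)
print(clf.predict([[11.0, 10.0]]))    # predicted class for a new job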

Candy bowl and consumer choice modeling
Consumer choice modeling: Understand how consumers make decisions
– Is it possible to find a preference ordering for product brands?
– Can we infer that there even exists a preference between brands?

Description of the Candy Bowl Data
consumer choice

-name
-gender
-candy
-candy color/flavor
-age
-ethnicity

Time between selections
“Interselection time” for candy c = # of turns between selections of c
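A sketch of how interselection time could be computed; the event_list format (an ordered list of (person, candy) picks, one per turn) is an assumption, and plot_interselection_time below presumably wraps something like this:

def interselection_times(event_list, candy):
    # turn indices at which this candy was picked
    turns = [i for i, (person, chosen) in enumerate(event_list) if chosen == candy]
    # number of turns between consecutive picks of the candy
    return [later - earlier for earlier, later in zip(turns, turns[1:])]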

In []: plot_interselection_time(event_list, "orange", "airhead")
In []: plot_interselection_time(event_list, "red", "starburst")
		plot_interselection_time(event_list, "orange", "airhead")

Point estimation, Confidence sets, Classification, Hypothesis testing

r: interselection time for candy, c, at a given turn
x = (“airhead”, 1), (“role”, 5), (“starburst”, 7), …

c, choice #, interselection time of other candies in bowl, r(c, choice #)

Cloud Deployment Models

Public
-third party customers/tenants
Private
-leverage technology internally
Hybrid (Public + Private)
-failover, dealing with spikes, testing
Community
-used by certain type of users

On-premises
Infrastructure (IaaS)
Platform (PaaS)
Software (SaaS)

1. “fungible” resources
2. elastic, dynamic resource allocations
3. scale: management at scale, scalable resources
4. dealing with failures
5. multi-tenancy: performance & isolation
6. security

Cloud-enabling Technologies
-virtualization
-Resource provisioning (scheduling): Mesos, YARN…

Storage
-distributed FS(“append only”)
-NoSQL, distributed in-memory caches…

Software-defined… networking, storage, datacenters…

“the cloud as a big data engine”
-data storage layer
-data processing layer
-caching layer
-language front-ends

Datacenter Technologies

Internet service == any type of service provided via web interface

-presentation == static content
-business logic == dynamic content
-database tier == data store

-not necessarily separate processes on separate machines
-many available open source and proprietary technologies

…in multi-process configurations ->
some form of IPC used, including RPC/RMI, shared memory…

For scale: multi-process, multi-node
=> “scale out” architecture

1. “Boss-worker”: front-end distributes requests to worker nodes (see the sketch after this list)
2. “All Equal”: all nodes execute any possible step in request processing, for any request
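A toy sketch (illustrative only, not from the notes) of the “boss-worker” pattern: a front-end thread pushes requests onto a queue and a pool of worker threads pulls them off.

import queue, threading

requests = queue.Queue()

def worker():
    while True:
        req = requests.get()
        if req is None:               # sentinel: shut this worker down
            break
        print("handled", req)         # stand-in for real request processing

# boss: start a small worker pool, then hand out incoming requests
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for req in ["GET /a", "GET /b", "GET /c"]:
    requests.put(req)
for _ in workers:
    requests.put(None)
for w in workers:
    w.join()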

Functionally heterogeneous…
-different nodes, different tasks/requests
-data doesn’t have to be uniformly accessible everywhere

Traditional Approach:
– buy and configure resources
=> determine capacity based on expected (peak) demand
– When demand exceeds capacity
dropped requests
lost opportunity

-on-demand elastic resources and services
-fine-grained pricing based on usage
-professionally managed and hosted
-API-based access

shared resources
– infrastructure and software/services
APIs for access & configuration
– web-based, libraries, command line…

Law of large numbers
– per customer there is large variation in resource needs
– average across many customers is roughly constant (see the sketch after this list)
Economies of Scale
– unit cost of providing resources or services drops at “bulk”
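A small simulation of that claim (assumed demand distribution, purely illustrative): individual customers swing a lot day to day, while the average across many customers barely moves.

import numpy as np

rng = np.random.default_rng(0)
demand = rng.exponential(scale=10.0, size=(10000, 365))   # customers x days

per_customer_swing = demand.std(axis=1).mean()   # typical day-to-day variation per customer
aggregate_swing = demand.mean(axis=0).std()      # day-to-day variation of the cross-customer average
print(per_customer_swing, aggregate_swing)       # the aggregate is far more stable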

Distributed Shared Memory

-must decide placement
place memory (pages) close to relevant processes
-must decide migration
when to copy memory (pages) from remote to local
-must decide sharing rules

Client
-send requests to file service

Caching
-improve performance (seen by client) and scalability

Servers
-own and manage state (files)
-provide service (file access)

Each node…
-“owns” state => memory
-provides service
memory reads/writes
from any node
consistency protocols

permits scaling beyond single machine memory limits
– more shared memory at lower cost
– slower overall memory access
– commodity interconnect technologies support this (RDMA)

Hardware vs Software DSM
hardware supported
– relies on interconnect
– os manages larger physical memory
– NICs translate remote memory accesses to messages
– NICs involved in all aspects of memory management, support atomics
Software-supported
– everything done by software
– os or language runtime

Application access algorithm
-single reader / single writer (SRSW)
-multiple readers / single writer (MRSW)
-multiple readers / multiple writers (MRMW)

Performance considerations
DSM performance metric == access latency
Achieving low latency through… migration
-makes sense for SRSW
-requires data movement
Replication (caching)
-more general
-requires consistency management

DSM Design: Consistency management
DSM ~shared memory in SMPs
in SMP
– write-invalidate
– write-update

DSM Design: Consistency management
Push invalidations when data is written to…
Pull modification info periodically…

if MRMW…
– need local caches for performance
– home node drives coherence
– all nodes responsible for part of distributed memory management

“Home” node
– keep state: pages accessed, modifications, caching enabled/disabled, locked…
– current “owner”

Consistency model == agreement between memory(state) and upper software layers

“memory behaves correctly if and only if software follows specific rules”

Replication vs. Partitioning

Replication == each machine holds all files
load balancing, availability, fault tolerance
writes become more complex
-> synchronously to all
-> or, write to one, then propagate to others
replicas must be reconciled
Partitioning
== each machine has subset of files
availability vs. single server DFS
scalability w/ file system size
single file writes simpler
on failure, lose portion of data
load balancing harder; if not balanced, then hot spots possible

NFSv3 == stateless, NFSv4 == stateful
caching
session-based (non-concurrent)
periodic updates
– default: 3 sec for files; 30 sec for dir
NFSv4 => delegation to client for a period of time (avoids ‘update checks’)
locking
lease-based
NFSv4 => also “share reservation” – reader/writer lock

Access Pattern (workload) analysis
-33% of all file accesses are writes

Distributed File Systems

DFS design and implementation
Networked File System(NFS)
Caching in the Sprite Network File System by Nelson et al.

-Accessed via well-defined interface
– access via VFS
-Focus on consistent state
-mixed distribution models possible

client application machine
(file-system interface, vfs interface, local file system)
file server machine

Remote File Service: Extremes
Upload/Download
– like FTP, SVN…
local reads/writes at client
entire file download/upload even for small access
True Remote File Access
– every access to remote file, nothing done locally

file access centralized, easy to reason about consistency
every file operation pays network cost

A More Practical Remote File Access (with Caching)
1. allow clients to store parts of files locally
low latency on file operations; server load reduced => more scalable
2. Force clients to interact w/ server (frequently)
server has insight into what clients are doing; server has control over which accesses can be permitted => easier to maintain consistency

however, server more complex, requires different file sharing semantics

Stateless vs. stateful file server

Caching State in a DFS (optimization)
-locally, clients maintain portion of state (file blocks)
-locally, clients perform operations on cached state (open/read/write)

File Sharing
-in client memory
-on client storage device(HDD/SDD…)
-in buffer cache in memory on server (usefulness will depend on client load, request interleaving…)

File sharing semantics in DFS
UNIX semantics => every write visible immediately
Session semantics (between open-close => session)
– write back on close(), update on open()
– easy to reason about, but may be insufficient

Periodic Updates
– client writes-back periodically
– server invalidates periodically

Java RMI

Java Remote Method Invocation (RMI)
– among address spaces in JVM(s)
– matches Java OO semantics
– IDL == Java (language-specific)

RMI Runtime
– Remote Reference Layer
unicast, broadcast, return-first-response, return-if-all-match
-Transport
TCP, UDP, shared memory

SunRPC Binding

CLIENT* clnt_create(char* host, unsigned long prog,
	unsigned long vers, char* proto);

// for the square example: bind a client handle to the remote
// SQUARE_PROG / SQUARE_VERS service over TCP
CLIENT* clnt_handle;
clnt_handle = clnt_create(rpc_host_name, SQUARE_PROG, SQUARE_VERS, "tcp");

CLIENT type
– client handle
– status, error, authentication…

XDR Data Types
Default Types
-char, byte, int, float…
Additional XDR types
-const (#define), hyper (64-bit integer), quadruple (128-bit float), opaque (~ C byte)
– uninterpreted binary data

Fixed length array
– e.g., int data[80]
Variable length array
– e.g., int data<80> => translate into a data structure with “len” and “val” fields

except for strings
– string line<80> => C pointer to char
– stored in memory as a normal null-terminated string
– encoded (for transmission) as a pair of length and data

XDR Routines
marshalling/unmarshalling
-found in square_xdr.c
Clean-up
-xdr_free()
-user-defined freeresult procedure
-e.g., square_prog_1_freeresult
-called after result returned

RPC header
-service procedure ID, version number, request ID…
Actual data
– arguments or results
– encoded into a byte stream depending on data type

XDR IDL + the encoding
– i.e., the binary representation of data “on-the-wire”
XDR Encoding Rules
– all data types are encoded in multiples of 4 bytes
– big endian is the transmission standard
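A quick check of these rules with Python's struct module (an illustration, not the rpcgen toolchain): values go out big endian, and variable-length items are sent as a length word followed by data padded to a multiple of 4 bytes.

import struct

# a 32-bit int occupies exactly 4 bytes on the wire, big endian
print(struct.pack(">i", 5).hex())                 # 00000005

# a variable-length byte string: length word, then data padded to a 4-byte boundary
data = b"abcde"
padding = (4 - len(data) % 4) % 4
wire = struct.pack(">I", len(data)) + data + b"\x00" * padding
print(wire.hex(), len(wire))                      # 12 bytes total: 4 (length) + 8 (padded data)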