Why learn EDA?

So what’s getting ubiquitous and cheap?
Data.
And what is complementary to data?
Analysis.
-Hal Varian

Netflix Prize Competition
EDA: exploratory data analysis

Netflix Prize Dataset Visualization

Television Size Over the Years

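from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data: 10,000 samples, 10 features (5 informative)
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)

# Hold out the last 1,000 samples for testing
Xtrain = X[:9000]
Xtest = X[9000:]
ytrain = y[:9000]
ytest = y[9000:]

# Fit on the training split, then report accuracy on the held-out split
clf = LogisticRegression()
clf.fit(Xtrain, ytrain)
print(clf.score(Xtest, ytest))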

500 TB of data

Facebook processes more than 500 TB of data daily
https://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/

One of Facebook’s tools, Presto (mainly used for ad hoc analysis), processes over 1 petabyte of data per day.

Google Trends
Chicken, Music, Movies
https://trends.google.com/trends/explore?date=all&q=chicken,music,movies

Connecting Count Bolt

class: WordSpout
componentId: "word-spout"

builder.setSpout("word-spout", new WordSpout(), 5);

class: CountBolt
componentId: "count-bolt"

builder.setBolt("count-bolt",
	new CountBolt(), 15)
		.fieldsGrouping("word-spout",
			new Fields("word"));

Beautiful Soup

pip install beautifulsoup4
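
A minimal sketch of what Beautiful Soup is used for: pulling the title and links out of HTML. The HTML string below is a made-up stand-in for a downloaded page (an assumption, not from these notes), and the parser is Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<html><head><title>Google Trends</title></head>
<body>
  <a href="https://trends.google.com/trends/">Trends</a>
  <a href="https://d3js.org/">D3.js</a>
</body></html>
"""

# Parse with the standard-library parser so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag.
title = soup.title.string

# The href attribute of every <a> tag, in document order.
links = [a.get("href") for a in soup.find_all("a")]

print(title)
print(links)
```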

Storm & Hadoop

Storm & Hadoop are complementary!
Hadoop => big batch processing
Storm => fast, reactive, real time processing

Storm data model
-Spouts
->sources of data for the topology, e.g. Postgres/MySQL/Kafka/Kestrel
-Bolts
->units of computation on data, e.g. filtering/aggregation/join/transformations

Live stream of Tweets
tweet spout, parse tweet bolt, word count bolt
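
The tweet pipeline above can be imitated in plain Python to see how tuples flow from a spout through two bolts. This is only a single-process toy, not Storm itself: the function names and the hard-coded tweets are hypothetical, and real Storm runs each component as parallel tasks.

```python
from collections import Counter

def tweet_spout():
    """Spout: source of the stream (hypothetical hard-coded tweets)."""
    tweets = ["storm is fast", "hadoop is batch", "storm is reactive"]
    for text in tweets:
        yield (text,)  # each emitted item is a one-field tuple

def parse_tweet_bolt(stream):
    """Bolt: split each tweet into lowercase words, one tuple per word."""
    for (text,) in stream:
        for word in text.lower().split():
            yield (word,)

def word_count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after each update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# Wire the topology: spout -> parse bolt -> count bolt.
running = list(word_count_bolt(parse_tweet_bolt(tweet_spout())))

# The last count emitted for each word is its final count.
final_counts = dict(running)
print(final_counts)
```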

Stream grouping
shuffle, fields, all, global
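
A rough sketch of how the four groupings decide which task(s) of a bolt receive a tuple. The function names are hypothetical, and the CRC32 hash is a stand-in assumption for whatever hashing Storm uses internally. The point is only the routing behavior: random, by key, to everyone, or to one task.

```python
import random
import zlib

def shuffle_grouping(tup, n_tasks, rng):
    """Shuffle: send the tuple to one randomly chosen task."""
    return [rng.randrange(n_tasks)]

def fields_grouping(tup, n_tasks, field_index=0):
    """Fields: hash the chosen field so equal values always reach the same task."""
    key = str(tup[field_index]).encode("utf-8")
    return [zlib.crc32(key) % n_tasks]

def all_grouping(tup, n_tasks):
    """All: replicate the tuple to every task."""
    return list(range(n_tasks))

def global_grouping(tup, n_tasks):
    """Global: send the whole stream to a single task (the lowest id)."""
    return [0]

rng = random.Random(0)
n_tasks = 15  # same parallelism as the count bolt above
for word in ["storm", "hadoop", "storm"]:
    print(word, fields_grouping((word,), n_tasks), shuffle_grouping((word,), n_tasks, rng))
```

With fields grouping on "word", both occurrences of "storm" land on the same task, which is what lets a count bolt keep a correct per-word total.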

tuple: immutable ordered list of elements
topology: directed acyclic graph, vertices = computation and edges = streams of data

What is analytics?

Discovery: Ability to identify patterns in data
Communication: Provide insights in a meaningful way

Types of analytics, varieties
Cube Analytics: business intelligence
Predictive analytics: statistics and machine learning

Realtime: ability to analyze the data instantly
Batch: ability to provide insights after several hours/days when a query is posed

Realtime analytics
-streaming
-interactive

OLTP/OLAP
< 500 ms: latency-sensitive, deterministic workflows

Visualization Spectrum

HTML5 Canvas: https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API
WebGL: https://www.khronos.org/webgl/
SVG (Scalable Vector Graphics): https://developer.mozilla.org/en-US/docs/Web/SVG
D3.js: https://d3js.org/
NVD3: http://nvd3.org/
Dimple.js: http://dimplejs.org/
Rickshaw: http://code.shutterstock.com/rickshaw/
Chartio: https://chartio.com/
RAW: http://rawgraphs.io/

Predefined chart
python/ruby, c/c++, assembly

VizWiz

Data visualization
http://www.vizwiz.com/

Data Science Process
-Computer Science, Statistics and Data Mining, Graphic Design, Infovis and HCI
-acquire, parse, filter, mine, represent, refine, interact

D3.js
https://d3js.org/

Retinal Variables
size, color hue, orientation, shape, color saturation, texture

Display
position x, y, size, color

more accurate <-> less accurate

https://www.targetprocess.com/articles/visual-encoding/

– WebGL, Canvas, SVG
efficient, performant
flexible
low level
hard to develop with

Calculating a confidence interval

p̂ = x/N
p̂ = 100/1000 = 0.1

m = z × SE
m = z × √(p̂(1 − p̂)/N)
m = 1.96 × √(0.1 × 0.9/1000) ≈ 0.019
z distribution: μ = 0, σ = 1; 95% of the area lies between z = −1.96 and z = 1.96

N = 2000, x = 300
p̂ = 300/2000 = 0.15
center of the confidence interval: 0.15
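
The same calculation as a short Python sketch, using the margin-of-error formula above with z = 1.96 for a 95% interval. The function name is hypothetical. It finishes the second example (N = 2000, x = 300) by computing the margin and both ends of the interval.

```python
import math

def proportion_confidence_interval(x, n, z=1.96):
    """Return (p_hat, margin, (low, high)) for x successes out of n trials."""
    p_hat = x / n                            # estimated probability
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    m = z * se                               # margin of error
    return p_hat, m, (p_hat - m, p_hat + m)

p_hat, m, (low, high) = proportion_confidence_interval(300, 2000)
print(f"p_hat={p_hat:.3f} margin={m:.4f} interval=({low:.4f}, {high:.4f})")
```

For N = 2000 and x = 300 the margin comes out near 0.016, so the interval runs from roughly 0.134 to 0.166.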

Hypothesis Testing
P(results due to chance)
Pcont, Pexp
Pcont = Pexp
Pexp-Pcont = 0
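
One common way to test the null hypothesis Pexp − Pcont = 0 is a two-proportion z-test with a pooled standard error. The sketch below assumes that approach. The function name and the click counts are made up for illustration and are not from these notes.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_cont, n_cont, x_exp, n_exp):
    """Return (difference, z, two-sided p-value) under the null Pexp - Pcont = 0."""
    p_cont = x_cont / n_cont
    p_exp = x_exp / n_exp
    # Under the null both groups share one probability, so pool the data.
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    z = (p_exp - p_cont) / se_pool
    # P(results at least this extreme are due to chance), both tails.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_exp - p_cont, z, p_value

# Hypothetical counts: 100/1000 clicks in control, 130/1000 in experiment.
diff, z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"diff={diff:.3f} z={z:.2f} p={p_value:.4f}")
```

If the p-value falls below the chosen α (for example 0.05), the difference is unlikely to be due to chance alone and the null is rejected.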

Size vs. Power Trade-Off
How many page views
α = P(reject null | null true)
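
To make the size vs. power trade-off concrete, here is a sketch using the usual normal approximation for a two-sided two-proportion test with equal group sizes. The function name and the example rates (0.10 vs. 0.12) are assumptions chosen for illustration. When the two probabilities are equal, the function returns α itself, matching α = P(reject null | null true).

```python
import math
from statistics import NormalDist

def power_two_proportions(p_cont, p_exp, n_per_group, alpha=0.05):
    """Approximate P(reject null) when the true rates are p_cont and p_exp."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    p_pool = (p_cont + p_exp) / 2       # pooled rate for equal group sizes
    se_null = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_group)
    se_alt = math.sqrt(p_cont * (1 - p_cont) / n_per_group
                       + p_exp * (1 - p_exp) / n_per_group)
    d = abs(p_exp - p_cont)
    # Reject when the observed difference falls beyond either critical value.
    upper = 1 - nd.cdf((z_crit * se_null - d) / se_alt)
    lower = nd.cdf((-z_crit * se_null - d) / se_alt)
    return upper + lower

for n in (1000, 5000, 10000):
    print(n, round(power_two_proportions(0.10, 0.12, n), 3))
```

More page views per group shrink the standard error, so the same true difference is detected more often at the same α.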