Alpha and Jitter

'''(r)
ggplot(aes(x = age, y = friends_initiated), data = pf) +
	geom_point(alpha = 1/10, position = 'jitter')
'''
'''(r)
library(dplyr)

# Mean and median friend count for each age, plus the number of users
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
	friend_count_mean = mean(friend_count),
	friend_count_median = median(friend_count),
	n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
'''
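
The grouped summary above yields one row per age. Here is a minimal sketch of the same idea on a made-up data frame. The name `toy` and its values are invented for illustration, and base R's `tapply()` is swapped in for dplyr so the snippet runs without extra packages.

```r
# Hypothetical toy data standing in for pf (values are made up)
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14, 15),
  friend_count = c(10, 20, 60, 5, 15, 100)
)

# One row per age: mean, median, and group size
# (tapply() returns its results ordered by increasing age)
fc_by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(tapply(toy$friend_count, toy$age, length))
)
print(fc_by_age)
```

The mean and median differ most for age 13, where a single large value pulls the mean up. That is the usual reason for reporting both.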

Explore Variables

Scatterplots

'''(r)
library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')

qplot(x = age, y = friend_count, data = pf)
qplot(age, friend_count, data = pf)
'''
'''(r)
qplot(x = age, y = friend_count, data = pf)

ggplot(aes(x = age, y= friend_count), data = pf) + geom_point()

summary(pf$age)
'''
'''(r)
ggplot(aes(x = age, y = friend_count), data = pf) +
	geom_point(alpha = 1/20) + xlim(13, 90)
'''
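
The `alpha = 1/20` setting makes each point 5% opaque, so dark regions appear only where many points overlap. As a rough sketch, assuming the graphics device uses standard alpha compositing, the combined opacity of `n` stacked points is `1 - (1 - alpha)^n`. The helper name `stacked_opacity` is made up for this illustration.

```r
# Hypothetical helper: combined opacity of n overlapping points,
# assuming standard "over" alpha compositing
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

alpha <- 1/20
overlaps <- c(1, 5, 20, 100)
opacity <- stacked_opacity(alpha, overlaps)

# Show how opacity grows with the number of overlapping points
print(data.frame(overlaps = overlaps, opacity = round(opacity, 3)))
```

Under this assumption a single point is barely visible, while heavily overplotted areas approach solid black. That is why the dense bands stand out in the scatterplot.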

Histogram of Users’ Birthdays

'''(r)
install.packages('ggplot2')
library(ggplot2)

names(pf)
qplot(x = dob_day, data = pf)
'''
'''(r)
qplot(x = friend_count, data = pf)
'''
'''(r)
qplot(x = friend_count, data = pf, xlim = c(0, 1000))

qplot(x = friend_count, data = pf) +
	scale_x_continuous(limits = c(0, 1000))
'''

R Markdown Documents

'''{r}
# The hash or pound symbol inside the block creates
# a comment. These comment lines are not code and are not run.
x <- 1:10
mean(x)
'''
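'''(r)
a <- c(1, 2, 5.3, 6, -2, 4)                   # numeric vector
b <- c("one", "two", "three")                 # character vector
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)  # logical vector
'''
'''(r)
# stringsAsFactors = TRUE keeps text columns as factors (no longer the
# default since R 4.0), so levels() below returns the categories
reddit <- read.csv('reddit.csv', stringsAsFactors = TRUE)

table(reddit$employment)

str(reddit)
levels(reddit$age.range)

library(ggplot2)
qplot(data = reddit, x = age.range)
'''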
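
By default `levels()` returns a factor's categories sorted as text, and a bar chart of that factor follows the same order, which is rarely the natural order for age ranges. Here is a minimal sketch of setting the order explicitly. The labels are hypothetical stand-ins, and the real levels in `reddit.csv` may differ.

```r
# Hypothetical age-range labels; the real levels in reddit.csv may differ
age_range <- c("25-34", "18-24", "Under 18", "65 or Above", "18-24", "35-44")

# Default factor: levels are sorted as text, so "Under 18" typically ends up last
f_default <- factor(age_range)
print(levels(f_default))

# Ordered factor: levels follow the order we specify
ordered_levels <- c("Under 18", "18-24", "25-34", "35-44", "65 or Above")
f_ordered <- factor(age_range, levels = ordered_levels, ordered = TRUE)
print(levels(f_ordered))
print(table(f_ordered))
```

Plotting the ordered factor should then draw the bars from youngest to oldest.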

R

the leading tool
many packages
active community

install.packages("swirl")
library(swirl)
swirl()
> ?mean
> x <- c(0:10, 50)
> x
 [1]  0  1  2  3  4  5  6  7  8  9 10 50
> xm <- mean(x)
> xm
[1] 8.75
> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
> subset(statesInfo, state.region == 1)
               X state.abb state.area state.region population income illiteracy life.exp murder
7    Connecticut        CT       5009            1       3100   5348        1.1    72.48    3.1
19         Maine        ME      33215            1       1058   3694        0.7    70.39    2.7
21 Massachusetts        MA       8257            1       5814   4755        1.1    71.83    3.3
29 New Hampshire        NH       9304            1        812   4281        0.7    71.23    3.3
30    New Jersey        NJ       7836            1       7333   5237        1.1    70.93    5.2
32      New York        NY      49576            1      18076   4903        1.4    70.55   10.9
38  Pennsylvania        PA      45333            1      11860   4449        1.0    70.43    6.1
39  Rhode Island        RI       1214            1        931   4558        1.3    71.90    2.4
45       Vermont        VT       9609            1        472   3907        0.6    71.64    5.5
   highSchoolGrad frost  area
7            56.0   139  4862
19           54.7   161 30920
21           58.5   103  7826
29           57.6   174  9027
30           52.5   115  7521
32           52.7    82 47831
38           50.2   126 44966
39           46.4   127  1049
45           57.1   168  9267
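
The `mean(x, trim = 0.10)` call in the session above drops a fraction of the observations from each end of the sorted vector before averaging. Here is a minimal sketch that reproduces it by hand. The variable names `k`, `sorted`, `trimmed`, and `manual` are introduced only for this illustration.

```r
x <- c(0:10, 50)

trim <- 0.10
# Number of observations dropped from each end: floor(12 * 0.10) = 1
k <- floor(length(x) * trim)

sorted  <- sort(x)
# Dropping one value per end removes 0 and 50 and keeps 1..10
trimmed <- sorted[(k + 1):(length(x) - k)]

manual <- mean(trimmed)
print(manual)

# The manual result agrees with the built-in trimmed mean
stopifnot(isTRUE(all.equal(manual, mean(x, trim = trim))))
```

Trimming removes the outlier 50, which is why the trimmed mean (5.5) sits well below the ordinary mean (8.75).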
Title
========================================================
This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).

When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Why learn EDA?

So what’s getting ubiquitous and cheap?
Data.
And what is complementary to data?
Analysis.
-Hal Varian

Netflix Prize Competition
EDA: exploratory data analysis

Netflix Prize Dataset Visualization

Television Size Over the Years

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)
Xtrain = X[:9000]
Xtest = X[9000:]
ytrain = y[:9000]
ytest = y[9000:]

clf = LogisticRegression()
clf.fit(Xtrain, ytrain)

500 TB of data

Facebook processes more than 500 TB of data daily
https://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/

One of Facebook’s tools, Presto (mainly used for ad hoc analysis), processes over 1 petabyte of data per day.

Google Trends
Chicken, Music, Movies
https://trends.google.com/trends/explore?date=all&q=chicken,music,movies

Example: Google Privacy Policy

What information is collected about you?
– Personal information such as name, email address, credit card, and telephone number that we provide to create an account.
– Services we use and websites we visit; used for advertising.
– Device information: hardware model, OS, and network information (IP address).
– Search queries
– Telephony logs: whom we call and for how long we talk.
– Cookies
– Location information
– Applications

How is collected information used?
– To improve the user experience (personalization).
– To serve targeted advertisements; we can set ad preferences.

Who do they share it with?
– With opt-in consent, Google can share it with companies, individuals, and organizations outside of Google.
– Domain administrators and resellers who provide user support to your organization can get certain information that you give to Google.
– Affiliates and other trusted businesses or persons with appropriate confidentiality and security measures.
– For legal reasons.

Information security
-Many services use encryption
-Stronger authentication (two-factor)
-Other safeguards

Changes to privacy policy
-Will not reduce user rights without your consent

Facebook Privacy Policies
Do companies adhere to the privacy policy you consented to?
Not always. Facebook had issues, and the US Federal Trade Commission went after it for violating user privacy.

Privacy

Do we need privacy only for individuals?
Universities, hospitals, charities require privacy and need to protect data of people they serve or have as employees.

Threats to Privacy
-Traffic analysis
-Surveillance
-Linking and making inferences

Sources include social media, tracking of web browsing, and location-aware applications; sometimes we are willing parties.

Privacy Threats from Online Tracking Information
-Collection of information about you – with or without your consent?
-Usage – used only for the specified purpose you agreed to?
-Information retention – how long can they keep it?
-Information disclosure and sharing – disclosed only to authorized or agreed-to parties?
-Privacy policy change – can the information collector/holder change to a more lax policy without your agreement?
-Information security – identity and access management, monitoring, and protection against the various threats we discussed.