Git intro

creating repositories with git init and git clone
reviewing a repo's state with git status
reviewing past commits with git log and git show
staging changes with git add
committing them to the repo with git commit
branching, merging branches together, and resolving merge conflicts
undoing things in Git:
git commit --amend to alter the most recent commit or to change the wording of its commit message
git reset
If you’re comfortable with all of these, then you’ll be good to go.

It’s incredibly helpful to make all of your commits on descriptively named topic branches. Branches help isolate unrelated changes from each other.
So when you’re collaborating with other developers, create a new branch whose name describes the changes it contains.


git remote
git push
git pull

Git is a distributed version control system, which means there is no single main repository of information. Each developer has a copy of the repository. So you can have a copy of the repository (which includes the published commits and version history) and your friend can also have a copy of the same repository. Each copy holds the same information as the others; no one repository is the main one.

The way we can interact with and control a remote repository is through the git remote command:

$ git remote
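
To make the idea of remotes concrete, here is a toy Python model. It is not how Git is implemented: each repository simply keeps its own full list of commits plus a table of named remotes, and push and pull copy over whatever commits the other side is missing. The class `ToyRepo` and all of its methods are hypothetical names invented for this sketch.

```python
class ToyRepo:
    """Toy stand-in for a Git repository: a commit list plus named remotes."""

    def __init__(self):
        self.commits = []   # the full history lives in every copy
        self.remotes = {}   # remote name -> another ToyRepo

    def clone(self):
        """Copy the whole history and remember the source as 'origin'."""
        copy = ToyRepo()
        copy.commits = list(self.commits)
        copy.remotes["origin"] = self
        return copy

    def commit(self, message):
        self.commits.append(message)

    def push(self, remote_name):
        """Send commits that the remote does not have yet."""
        remote = self.remotes[remote_name]
        for c in self.commits:
            if c not in remote.commits:
                remote.commits.append(c)

    def pull(self, remote_name):
        """Fetch commits that this copy does not have yet."""
        remote = self.remotes[remote_name]
        for c in remote.commits:
            if c not in self.commits:
                self.commits.append(c)


shared = ToyRepo()
shared.commit("initial commit")

mine = shared.clone()    # my copy
yours = shared.clone()   # my friend's copy

mine.commit("add README")
mine.push("origin")
yours.pull("origin")

print(sorted(mine.remotes))   # configured remote names, loosely like `git remote`
print(yours.commits)
```

After the push and pull, all three copies hold the same history, which is the point of the paragraph above: no copy is special, and remotes are just named links between copies.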

Alpha and Jitter

'''(r)
ggplot(aes(x = age, y = friends_initiated), data = pf) +
	geom_point(alpha = 1/10, position = 'jitter')
'''
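
Ages are stored as whole numbers, so the points in a plain scatterplot stack up in vertical columns. Jitter spreads them out by adding a small amount of random noise to each value. A minimal Python sketch of the idea, assuming uniform noise with a half-width of 0.4 (ggplot2 documents its default jitter width as 40% of the data's resolution, which is 0.4 for integer ages); the helper `jitter` and the sample ages are made up for illustration:

```python
import random

def jitter(values, width=0.4, seed=0):
    """Return values with uniform noise in [-width, width] added to each."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    return [v + rng.uniform(-width, width) for v in values]

ages = [13, 13, 13, 14, 14, 15]   # made-up integer ages
jittered = jitter(ages)

for original, moved in zip(ages, jittered):
    print(original, round(moved, 2))
```

Each jittered age stays within 0.4 of its true value, so the points still read as belonging to the same age.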
library(dplyr)
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
	friend_count_mean = mean(friend_count),
	friend_count_median = median(friend_count),
	n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
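
What the group_by / summarise / arrange steps above compute can be sketched in plain Python. This is only an illustration of the logic, not a replacement for dplyr; the `rows` list is made-up data standing in for the `pf` data frame.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (age, friend_count) rows standing in for the pf data frame.
rows = [(15, 10), (13, 4), (13, 8), (14, 0), (15, 30), (13, 6)]

# group_by(pf, age): collect the friend counts for each age.
groups = defaultdict(list)
for age, friend_count in rows:
    groups[age].append(friend_count)

# summarise(...) then arrange(..., age): one summary row per age, sorted by age.
fc_by_age = [
    {
        "age": age,
        "friend_count_mean": mean(counts),
        "friend_count_median": median(counts),
        "n": len(counts),
    }
    for age, counts in sorted(groups.items())
]

for row in fc_by_age:
    print(row)
```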

Explore Variables

Scatterplots

'''(r)
library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')

qplot(x = age, y = friend_count, data = pf)
qplot(age, friend_count, data = pf)
'''
'''(r)
qplot(x = age, y = friend_count, data = pf)

ggplot(aes(x = age, y = friend_count), data = pf) + geom_point()

summary(pf$age)
'''
'''(r)
ggplot(aes(x = age, y = friend_count), data = pf) +
	geom_point(alpha = 1/20) + xlim(13, 90)
'''

Histogram of Users’ Birthdays

'''(r)
install.packages('ggplot2')
library(ggplot2)

names(pf)
qplot(x = dob_day, data = pf)
'''
'''(r)
qplot(x = friend_count, data = pf)
'''
'''(r)
qplot(x = friend_count, data = pf, xlim = c(0, 1000))

qplot(x = friend_count, data = pf) +
	scale_x_continuous(limits = c(0, 1000))
'''
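
A histogram counts how many values fall into each equal-width bin, and setting limits on the x axis drops values outside that range before counting. A rough Python sketch of that logic follows; the helper `histogram`, the bin width of 250, and the friend counts are all made up for illustration, and ggplot2's exact bin boundaries differ.

```python
def histogram(values, lower, upper, binwidth):
    """Count integer values per bin of size binwidth within [lower, upper]."""
    n_bins = (upper - lower) // binwidth
    counts = [0] * n_bins
    for v in values:
        if v < lower or v > upper:
            continue   # outside the limits: dropped, not counted
        # A value equal to the upper limit joins the last bin.
        i = min((v - lower) // binwidth, n_bins - 1)
        counts[i] += 1
    return counts

friend_counts = [0, 3, 120, 260, 499, 750, 1000, 4200]   # made-up values
counts = histogram(friend_counts, lower=0, upper=1000, binwidth=250)
print(counts)   # [3, 2, 0, 2]
```

The value 4200 lies outside the 0 to 1000 limits, so only seven of the eight values are counted.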

R Markdown Documents

'''{r}
# the hash or pound symbol inside the block creates
# a comment. These three lines are not code and cannot be
# executed.
x <- 1:10
mean(x)
'''
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
reddit <- read.csv('reddit.csv')

table(reddit$employment)

str(reddit)
levels(reddit$age.range)

library(ggplot2)
qplot(data = reddit, x = age.range)

R

a leading tool for statistics and data analysis
many packages
active community

install.packages("swirl")
library(swirl)
swirl()
> ?mean
> x <- c(0:10, 50)
> x
 [1]  0  1  2  3  4  5  6  7  8  9 10 50
> xm <- mean(x)
> xm
[1] 8.75
> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
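
The trim argument drops a fraction of the observations from each end of the sorted data before averaging, which is why the outlier 50 no longer pulls the mean up. A minimal Python sketch of that calculation, assuming trim is below 0.5; the function name `trimmed_mean` is made up for this example:

```python
import math

def trimmed_mean(values, trim=0.0):
    """Mean after dropping a fraction `trim` of the values from each end.

    Assumes 0 <= trim < 0.5 so that at least one value remains.
    """
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim)   # observations to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

x = list(range(0, 11)) + [50]        # same values as c(0:10, 50)
print(trimmed_mean(x))               # 8.75
print(trimmed_mean(x, trim=0.10))    # 5.5
```

With 12 values and trim = 0.10, one value is dropped from each end (0 and 50), leaving 1 through 10, whose mean is 5.5.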
> subset(statesInfo, state.region == 1)
               X state.abb state.area state.region population income illiteracy life.exp murder
7    Connecticut        CT       5009            1       3100   5348        1.1    72.48    3.1
19         Maine        ME      33215            1       1058   3694        0.7    70.39    2.7
21 Massachusetts        MA       8257            1       5814   4755        1.1    71.83    3.3
29 New Hampshire        NH       9304            1        812   4281        0.7    71.23    3.3
30    New Jersey        NJ       7836            1       7333   5237        1.1    70.93    5.2
32      New York        NY      49576            1      18076   4903        1.4    70.55   10.9
38  Pennsylvania        PA      45333            1      11860   4449        1.0    70.43    6.1
39  Rhode Island        RI       1214            1        931   4558        1.3    71.90    2.4
45       Vermont        VT       9609            1        472   3907        0.6    71.64    5.5
   highSchoolGrad frost  area
7            56.0   139  4862
19           54.7   161 30920
21           58.5   103  7826
29           57.6   174  9027
30           52.5   115  7521
32           52.7    82 47831
38           50.2   126 44966
39           46.4   127  1049
45           57.1   168  9267
Title
========================================================
This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).

When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Why learn EDA?

So what’s getting ubiquitous and cheap?
Data.
And what is complementary to data?
Analysis.
-Hal Varian

Netflix Prize Competition
EDA: exploratory data analysis

Netflix Prize Dataset Visualization

Television Size Over the Years

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

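# Generate a synthetic two-class dataset: 10,000 samples, 10 features, 5 of them informative.
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)
# Hold out the last 1,000 samples for testing.
Xtrain = X[:9000]
Xtest = X[9000:]
ytrain = y[:9000]
ytest = y[9000:]

# Fit a logistic regression classifier on the training split.
clf = LogisticRegression()
clf.fit(Xtrain, ytrain)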

500 TB of data

Facebook processes more than 500 TB of data daily
https://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/

One of Facebook’s tools, Presto (mainly used for ad hoc analysis), processes over 1 petabyte of data per day.

Google Trends
Chicken, Music, Movies
https://trends.google.com/trends/explore?date=all&q=chicken,music,movies

Example: Google Privacy Policy

What information is collected about you?
– Personal information such as name, email address, credit card, and telephone number that we provide to create an account.
– Service usage: which websites we visit, used for advertising.
– Device information: hardware model, OS, network information (IP address), etc.
– Search queries
– Phone calls: who we call and how long we talk.
– Cookies
– Location information
– Applications

How is collected information used?
improve user experience (personalization)
for serving you targeted advertisements – we can set ad preferences

Who do they share it with?
With opt-in consent, Google can share it with companies, individuals, and organizations outside of Google.
Domain administrators and resellers who provide user support to your organization can get certain information about you that you give to Google.
Affiliates and other trusted businesses or persons with appropriate confidentiality and security measures.
For legal reasons.

Information security
-Many services use encryption
-Stronger authentication (two-factor)
-Other safeguards

Changes to privacy policy
-Will not reduce user rights without your consent

Facebook Privacy Policies
Do companies adhere and operate according to the privacy policy you gave consent to?
Not always. Facebook has had issues, and the US Federal Trade Commission went after it for violating user privacy.