R Markdown Documents

```{r}
# The hash (pound) symbol inside the block creates a comment.
# These lines are not code and cannot be run.
x <- 1:10
mean(x)
```
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
reddit <- read.csv('reddit.csv')

table(reddit$employment)

str(reddit)
levels(reddit$age.range)

library(ggplot2)
qplot(data = reddit, x = age.range)

R

the leading tool
many packages
active community

install.packages("swirl")
library(swirl)
swirl()
> ?mean
> x <- c(0:10, 50)
> x
 [1]  0  1  2  3  4  5  6  7  8  9 10 50
> xm <- mean(x)
> xm
[1] 8.75
> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
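
The two results differ because of trimming: `trim = 0.10` drops the lowest and highest 10% of the sorted observations before averaging, so the outlier 50 no longer pulls the mean up. The arithmetic can be sketched in Python as below. This is only an illustration: the helper name `trimmed_mean` is made up, and it assumes the number of observations dropped from each end is the fraction times the sample size, rounded down.

```python
import math

def trimmed_mean(values, trim=0.0):
    """Mean after dropping a fraction `trim` of observations from each end."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim)  # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

x = list(range(0, 11)) + [50]  # same data as c(0:10, 50)
print(trimmed_mean(x))         # plain mean: 8.75
print(trimmed_mean(x, 0.10))   # drops 0 and 50, averages 1..10: 5.5
```

With 12 observations and a 10% trim, one value is removed from each end, which is why the trimmed result is the mean of 1 through 10.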
> subset(statesInfo, state.region == 1)
               X state.abb state.area state.region population income illiteracy life.exp murder
7    Connecticut        CT       5009            1       3100   5348        1.1    72.48    3.1
19         Maine        ME      33215            1       1058   3694        0.7    70.39    2.7
21 Massachusetts        MA       8257            1       5814   4755        1.1    71.83    3.3
29 New Hampshire        NH       9304            1        812   4281        0.7    71.23    3.3
30    New Jersey        NJ       7836            1       7333   5237        1.1    70.93    5.2
32      New York        NY      49576            1      18076   4903        1.4    70.55   10.9
38  Pennsylvania        PA      45333            1      11860   4449        1.0    70.43    6.1
39  Rhode Island        RI       1214            1        931   4558        1.3    71.90    2.4
45       Vermont        VT       9609            1        472   3907        0.6    71.64    5.5
   highSchoolGrad frost  area
7            56.0   139  4862
19           54.7   161 30920
21           58.5   103  7826
29           57.6   174  9027
30           52.5   115  7521
32           52.7    82 47831
38           50.2   126 44966
39           46.4   127  1049
45           57.1   168  9267
Title
========================================================
This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).

When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Why learn EDA?

So what’s getting ubiquitous and cheap?
Data.
And what is complementary to data?
Analysis.
-Hal Varian

Netflix Prize Competition
EDA: exploratory data analysis

Netflix Prize Dataset Visualization

Television Size Over the Years

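from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)

# Hold out the last 1,000 samples for testing
Xtrain = X[:9000]
Xtest = X[9000:]
ytrain = y[:9000]
ytest = y[9000:]

# Fit a logistic regression classifier on the training split
clf = LogisticRegression()
clf.fit(Xtrain, ytrain)

# Mean accuracy on the held-out samples
print(clf.score(Xtest, ytest))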

500 TB of data

Facebook processes more than 500 TB of data daily
https://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/

One of Facebook’s tools, Presto (mainly used for ad hoc analysis), processes over 1 petabyte of data per day.

Google Trends
Chicken, Music, Movies
https://trends.google.com/trends/explore?date=all&q=chicken,music,movies

Example: Google Privacy Policy

What information is collected about you?
– Personal information like name, email address, credit card, telephone number etc. that we provide to create an account.
– Service usage: when we visit a certain website; used for advertising.
– Device information: hardware model, OS, network information (IP address), etc.
– Search queries
– Whom we call and for how long we talk
– Cookies
– Location information
– Applications

How is collected information used?
improve user experience (personalization)
for serving you targeted advertisements – we can set ad preferences

Who do they share it with?
With opt-in consent, it can be shared with companies, individuals, and organizations outside of Google.
Domain administrators and resellers who provide user support to your organization can get certain information about you that you give to Google.
Affiliates and other trusted businesses or persons with appropriate confidentiality and security measures.
For legal reasons.

Information security
-Many services use encryption
-Stronger authentication (two-factor)
-Other safeguards

Changes to privacy policy
-Will not reduce user rights without your consent

Facebook Privacy Policies
Do companies adhere and operate according to the privacy policy you gave consent to?
Not really. Facebook had issues, and the US Federal Trade Commission went after it for violating user privacy.

Privacy

Do we need privacy only for individuals?
Universities, hospitals, charities require privacy and need to protect data of people they serve or have as employees.

Threats to Privacy
-Traffic analysis
-Surveillance
-Linking and making inferences

Sources include social media, tracking of web browsing, and location-aware applications; sometimes we are willing parties.

Privacy Threats to Online Tracking Info
-collection of information about you – with or without your consent?
-Usage – only used for specified purpose you agreed to?
-Information retention – how long can they keep it?
-Information disclosure and sharing – disclosed only to authorized or agreed-to parties?
-Privacy policy change – can information collector/holder change to a more lax policy without your agreement?
-Information security – identity and access management, monitoring, secure against various threats we discussed.

Ethical Issues

Difference between law and ethics
– individual standard vs. societal
– No external arbiter and enforcement unlike law
– Examples – What do you do when you discover a vulnerability in a commercial product? Ethical disclosure?
– Code of ethical conduct (IEEE, ACM, university)

Privacy
Definition: A user’s ability to control how data pertaining to him/her can be collected, used and shared by someone else.

Privacy is not a new problem
– People have always worried about what others (friends, enemies, governments) might know about what they do.
– What is new is the scale and magnitude at which information about us and our activities can be collected, and the ways in which it can be used, shared, or sold.

Privacy
– financial statements, credit card statements, banking records etc.
– Health/medical conditions
– legal matters
– biometrics
– political beliefs
– school and employer records
– web browsing habits? what do we search, what do we browse? websites we visit?
– Communication (emails and calls)
– Past history (right to be forgotten)

What is not private?
Where I live? My citizenship?
Whether I am registered to vote?
My salary (as a state employee, because Georgia Tech is a public university)

Law, Ethics, and Privacy

Cyber crime
– data theft, identity theft, extortion, etc.
Copying and distribution of digital objects (software, music)
– copyright, patents, trade secrets
– how are these applicable in the context of digital/computer objects?
Privacy
– Who can collect my information, how can I control it, how could it be used etc.?

US Computer Fraud and Abuse Act(CFAA)
– Defines criminal sanctions against various types of abuse
– Unauthorized access to a computer containing:
  – data protected for national defense
  – banking or financial information
– Unauthorized access, use, modification, destruction, or disclosure of a computer or information on a system operated by or on behalf of the US government.

US Computer Fraud and Abuse Act
– Accessing a protected computer (any computer connected to the internet) without permission
– Transmitting code that causes damage to computers (malware)
– Trafficking in computer passwords

Cyber Risk Assessment

– Investments in cyber security are driven by risk and how certain controls may reduce it
– Some risk will always remain
– How can risk be assessed?

Risk exposure = Prob[adverse security event] * Impact[adverse security event]
Risk leverage = (risk exposure before/without a certain control – risk exposure after the control) / cost of control

Risk leverage > 1 for the control to make sense
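
The two formulas above can be worked through as a small calculation. The sketch below assumes leverage is the reduction in exposure (before minus after) divided by the cost of the control. The function names and all dollar figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def risk_exposure(probability, impact):
    """Expected loss: probability of the adverse event times its impact."""
    return probability * impact

def risk_leverage(exposure_before, exposure_after, control_cost):
    """Reduction in risk exposure per unit of money spent on the control."""
    return (exposure_before - exposure_after) / control_cost

# Hypothetical figures, for illustration only.
before = risk_exposure(0.25, 1_000_000)   # 25% chance of a $1M loss
after = risk_exposure(0.125, 1_000_000)   # control halves the chance
leverage = risk_leverage(before, after, control_cost=50_000)

print(before, after, leverage)  # 250000.0 125000.0 2.5
print("control makes sense" if leverage > 1 else "control costs more than it saves")
```

Here a $50,000 control removes $125,000 of expected loss, so the leverage is 2.5 and the control passes the "greater than 1" test.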

How do we assess and reduce cyber risk?
Impact
– expected loss (reputational, recovery and response, legal, loss of business, etc.)
Risk management
– accept, transfer (insurance), and reduce
– reduction via technology solutions, education and awareness training

Enterprise Cyber Security Posture
– Reactive
– regulation/compliance
– customer demands
– in response to a breach (Target or Home Depot)
– In response to events

Proactive:
– champion of an organization who has influence
– board level conversation about cyber security and risk

Economic value argument:
– return on investment (RoI)
– Estimating costs and benefits is tricky
– Perception vs. data-driven risk

Values at risk
– assets, reputation etc.
Threats and attack vectors
Plan, implement and manage
– Deploy appropriate controls
– Empower people and hold them responsible
– Plan for response and remediation (do not be surprised)
– User awareness
Understand and proactively address risk