Banner Ad

package com.example.adviewer;

import ...

public class BannerActivity extends Activity {
	private AdView mAdView;

	@Override
	protected void onCreate(Bundle savedInstanceState){
		super.onCreate(savedInstanceState);
		setContentView(R.layout.activity_banner);

		mAdView = (AdView) findViewById(R.id.adView);
		AdRequest adRequest = new AdRequest.Builder()
			.build();
		mAdView.loadAd(adRequest);
	}
}

AdListener callbacks
- public void onAdLoaded()
- public void onAdFailedToLoad(int errorCode)  // errorCode is an AdRequest error constant
- public void onAdOpened()
- public void onAdLeftApplication()  // e.g., the user left the app for the browser

AdMob

Model               Easy to Launch   Long Term   User Value
Paid downloads      no               no          yes
Subscription        no               yes         yes
Displaying ads      yes              yes         yes
In-app purchases    yes              yes         yes

Common Monetization Models
- Ads & in-app purchases
- Subscription & in-app purchases
- Paid download & in-app purchases

USERS -> APPS(publishers) -> AdMob -> ADVERTISERS
https://www.google.co.jp/admob/

Types of Ads
- Banner ads: text, image
- Interstitial ads: text, image, video
- Native ads

<?xml version="1.0" encoding="utf-8"?>

<RelativeLayout
	xmlns:android="http://schemas.android.com/apk/res/android"
	xmlns:ads="http://schemas.android.com/apk/res-auto"
	android:id="@+id/mainLayout"
	android:layout_width="match_parent"
	android:layout_height="match_parent">

	<com.google.android.gms.ads.AdView
		android:id="@+id/adView"
		android:layout_width="match_parent"
		android:layout_height="wrap_content"
		android:layout_alignParentBottom="true"
		android:layout_alignParentLeft="true"
		ads:adSize="BANNER"
		ads:adUnitId="ca-app-pub-xxxx/xxxx"/>
</RelativeLayout>

Mapper and Reducer

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
	level=logging.INFO, filemode='w')

def mapper():
	for line in sys.stdin:
		data = line.strip().split(",")
		# Skip malformed rows and the header row.
		if len(data) != 12 or data[0] == 'Register':
			continue
		# Emit "key<TAB>count": column 3 is the grouping key, column 8 the count.
		print("{0}\t{1}".format(data[3], data[8]))

mapper()

Reducer
import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
	level=logging.INFO, filemode='w')

def reducer():
	aadhaar_generated = 0
	old_key = None

	for line in sys.stdin:
		data = line.strip().split("\t")

		if len(data) != 2:
			continue

		this_key, count = data
		# New key: flush the running total for the previous key.
		if old_key and old_key != this_key:
			print("{0}\t{1}".format(old_key, aadhaar_generated))
			aadhaar_generated = 0

		old_key = this_key
		aadhaar_generated += float(count)

	# Flush the final key.
	if old_key is not None:
		print("{0}\t{1}".format(old_key, aadhaar_generated))

reducer()
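To see the mapper/reducer pair working end to end without Hadoop, here is a sketch that drives the same logic with in-memory streams. The two fabricated 12-column rows and the "Delhi" key are made up; as in the mapper above, column 3 is treated as the grouping key and column 8 as the count.

```python
import io

def mapper(lines, out):
    # Emit "key<TAB>count" for each valid 12-column row.
    for line in lines:
        data = line.strip().split(",")
        if len(data) != 12 or data[0] == 'Register':
            continue
        out.write("{0}\t{1}\n".format(data[3], data[8]))

def reducer(lines, out):
    # Sum counts per key; input must be sorted by key.
    total, old_key = 0, None
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        this_key, count = data
        if old_key and old_key != this_key:
            out.write("{0}\t{1}\n".format(old_key, total))
            total = 0
        old_key = this_key
        total += float(count)
    if old_key is not None:
        out.write("{0}\t{1}\n".format(old_key, total))

# Two fabricated rows, 12 comma-separated fields each.
rows = [
    ",".join(["r", "a", "b", "Delhi", "e", "f", "g", "h", "5", "j", "k", "l"]),
    ",".join(["r", "a", "b", "Delhi", "e", "f", "g", "h", "3", "j", "k", "l"]),
]
mapped = io.StringIO()
mapper(iter(rows), mapped)
pairs = sorted(mapped.getvalue().splitlines())  # stands in for `sort`
reduced = io.StringIO()
reducer(iter(pairs), reduced)
print(reduced.getvalue().strip())  # Delhi	8.0
```

The `sorted()` call plays the role of Hadoop's shuffle/sort phase: it guarantees all pairs with the same key arrive at the reducer consecutively.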

MapReduce programming model -> Hadoop!
(1) Hive, (2) Pig
Mahout, Giraph, Cassandra

Using MapReduce with Subway data

Mapper

import sys
import string

def mapper():
	for line in sys.stdin:
		# Split on whitespace to get individual words.
		data = line.strip().split()

		for i in data:
			# Strip punctuation and lower-case each word.
			cleaned_data = i.translate(str.maketrans("", "", string.punctuation)).lower()
			print("{0}\t{1}".format(cleaned_data, 1))

mapper()

Reduce stage -> reducer

import sys

def reducer():
	word_count = 0
	old_key = None

	for line in sys.stdin:
		data = line.strip().split("\t")

		if len(data) != 2:
			continue

		this_key, count = data
		# New word: flush the count for the previous word.
		if old_key and old_key != this_key:
			print("{0}\t{1}".format(old_key, word_count))
			word_count = 0

		old_key = this_key
		word_count += int(count)

	# Flush the final word.
	if old_key is not None:
		print("{0}\t{1}".format(old_key, word_count))

reducer()

#!/bin/bash

cat ../../data/aliceInWorderland.txt | python word_count_mapper.py | sort | python word_count_reducer.py
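The same map -> sort -> reduce flow as the shell pipeline can be mimicked in one Python process; a minimal sketch (the function name and the sample sentence are made up):

```python
import string

def word_count_pipeline(text):
    # Map: one (word, 1) pair per cleaned, lower-cased word.
    strip_punct = str.maketrans("", "", string.punctuation)
    pairs = []
    for line in text.splitlines():
        for word in line.split():
            pairs.append((word.translate(strip_punct).lower(), 1))
    # Shuffle/sort: group identical keys together (what `sort` does).
    pairs.sort()
    # Reduce: sum counts over each run of identical keys.
    counts, old_key, total = {}, None, 0
    for key, n in pairs:
        if old_key is not None and key != old_key:
            counts[old_key] = total
            total = 0
        old_key = key
        total += n
    if old_key is not None:
        counts[old_key] = total
    return counts

print(word_count_pipeline("Alice was beginning. Alice was tired"))
# {'alice': 2, 'beginning': 1, 'tired': 1, 'was': 2}
```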

Scatter Plot

Line Chart
– Mitigate some shortcomings of scatterplot
– Emphasize trends
– Focus on year-to-year variability, not overall trends

LOESS Curve
– Emphasize long term trends
– LOESS weighted regression
– Easier to take a quick look at chart and understand big picture

Multivariate
– How to incorporate more variables
– Use an additional encoding
– Size
– Color / Saturation

Basics of MapReduce
MapReduce is a parallel programming model

Python dictionary
{"alice": 1, "was": 1, "of": 2, ..., "do": 1}
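On a single machine such a word-count dictionary can be built directly; a toy sketch with collections.Counter (the sample text is a made-up stand-in for the real corpus):

```python
from collections import Counter

# Hypothetical snippet of text, not the real book.
text = "alice was beginning to get very tired of sitting by alice"
word_counts = Counter(text.split())
print(word_counts["alice"])  # 2
print(word_counts["was"])    # 1
```

MapReduce earns its keep only when the corpus no longer fits on one machine; the model of the computation (count per key) is the same.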

Data Types

– Numeric data
A measurement (e.g., height, weight) or count (e.g., HR or hits)
Discrete and continuous
discrete: whole numbers (e.g., 10, 34, 25)
continuous: any number within a range (e.g., .250, .357, .511)

Categorical Data
Represent characteristics (e.g., position, team, hometown, handedness)
Can take on numerical values, but those values don't have mathematical meaning
Ordinal data
categories with some order or ranking
e.g., very low, average, high

Time-series data
– data collected via repeated measurements over time
– example: average HR

import pandas
from ggplot import *

def lineplot_compare(hr_by_team_year_sf_la_csv):
	# Read the CSV passed in rather than a hard-coded path.
	hr_year = pandas.read_csv(hr_by_team_year_sf_la_csv)
	gg = ggplot(hr_year, aes('yearID', 'HR', color='teamID')) + \
		geom_point() + geom_line() + \
		ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')
	print(gg)

if __name__ == '__main__':
	lineplot_compare('hr_by_team_year_sf_la.csv')

Effective Information Visualization

-effective communication of complex quantitative ideas
clarity, precision, efficiency

Visual Encoding
position x, y
Length A, B, C
Angle

Visual Encoding: Direction, Shape, Area/Volume
Color: Hue, Saturation
Combination: min, max
Limit Hues

Plotting in Python
– Many packages
– matplotlib <- very popular
– ggplot <- use this; looks nicer, grammar of graphics
	ggplot(data, aes(xvar, yvar)) + geom_point() + geom_line()
	first step: create plot
	second step: represent data with geometric objects
	third step: add labels

import pandas
from ggplot import *

def lineplot(hr_year_csv):
	hr_year = pandas.read_csv(hr_year_csv)
	gg = ggplot(hr_year, aes('yearID', 'HR')) + \
		geom_point(color='red') + geom_line(color='red') + \
		ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')
	print(gg)

if __name__ == '__main__':
	lineplot('hr_year.csv')

Coefficient of Determination

- data = y1 ... yn
- predictions = f1 ... fn
- average of data = ȳ

R^2 = 1 - Σi(yi - fi)^2 / Σi(yi - ȳ)^2

Calculating R^2

import numpy as np

def compute_r_squared(data, predictions):
	# Total sum of squares: spread of the data around its mean.
	SST = ((data - np.mean(data)) ** 2).sum()
	# Residual sum of squares: spread of the data around the predictions.
	SSReg = ((predictions - data) ** 2).sum()
	r_squared = 1 - SSReg / SST

	return r_squared
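Two quick sanity checks (a sketch, with the function restated so the snippet runs on its own): perfect predictions should give R^2 = 1, and predicting the mean for every point should give R^2 = 0.

```python
import numpy as np

def compute_r_squared(data, predictions):
    # R^2 = 1 - SS_res / SS_tot
    SST = ((data - np.mean(data)) ** 2).sum()
    SSReg = ((predictions - data) ** 2).sum()
    return 1 - SSReg / SST

data = np.array([1.0, 2.0, 3.0, 4.0])
perfect = compute_r_squared(data, data)                     # residuals all zero
mean_only = compute_r_squared(data, np.full(4, 2.5))        # predicting the mean
print(perfect, mean_only)  # 1.0 0.0
```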

Additional Considerations
– other types of linear regression
– ordinary least squares regression
– parameter estimation
– under / overfitting
– multiple local minima

Types of Machine Learning

Different types of learning
Data -> Model -> Predictions

Unsupervised Learning
- trying to understand the structure of the data
- clustering

Linear Regression with gradient descent
Σ_{i=1}^{m} (Ypredicted - Yactual)^2

Gradient Descent – Cost Function: J(Θ)
Minimize J(Θ) … how?

import numpy
import pandas

def compute_cost(features, values, theta):
	m = len(values)
	sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
	cost = sum_of_square_errors / (2*m)

	return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
	m = len(values)
	cost_history = []

	for _ in range(num_iterations):
		# Batch gradient step: theta := theta + (alpha/m) * X^T (y - X theta)
		predicted_values = numpy.dot(features, theta)
		theta = theta + (alpha / m) * numpy.dot(values - predicted_values, features)
		cost_history.append(compute_cost(features, values, theta))

	return theta, pandas.Series(cost_history)
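Putting the two functions together on toy data (a self-contained sketch: both functions are restated, the update loop is filled in, and the y = 2x data, alpha, and iteration count are arbitrary illustration choices), the cost should fall as theta approaches the true parameters:

```python
import numpy as np

def compute_cost(features, values, theta):
    # J(theta) = (1/2m) * sum((X.theta - y)^2)
    m = len(values)
    return np.square(np.dot(features, theta) - values).sum() / (2 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        # Batch gradient step toward lower cost.
        predicted = np.dot(features, theta)
        theta = theta + (alpha / m) * np.dot(values - predicted, features)
        cost_history.append(compute_cost(features, values, theta))
    return theta, cost_history

# Tiny synthetic problem: y = 2x, features = [1, x] (bias + slope).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, costs = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iterations=500)
print(costs[0] > costs[-1])  # True: cost decreases over the iterations
```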

Tip of the Imputation Iceberg

Imputation
– just the tip of the iceberg; more sophisticated methods exist
– Fill in mean: simple and relatively effective
– Linear regression

– Both have negative side effects and can obscure or oversimplify trends in the data
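A minimal sketch of mean imputation with pandas (the "height" column and its values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing heights.
df = pd.DataFrame({"height": [70.0, np.nan, 74.0, np.nan, 72.0]})
# Replace NaNs with the mean of the observed values: (70 + 74 + 72) / 3 = 72.
df["height"] = df["height"].fillna(df["height"].mean())
print(df["height"].tolist())  # [70.0, 72.0, 74.0, 72.0, 72.0]
```

Note the side effect mentioned above: the imputed column's variance is now artificially lower than the true column's would be.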

Statistical Rigor
significance tests
– using data, can we disprove an assumption with a pre-defined level of confidence?

Normal Distribution, Two parameters:
・μ(mean)
・σ(standard deviation)

f(x) = 1/√(2πσ^2) * e^(-(x-μ)^2 / (2σ^2))
Mean = μ
Variance = σ^2
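The density formula can be checked against scipy.stats.norm; a small sketch (the evaluation point x = 1.5 and parameters μ = 0, σ = 2 are arbitrary):

```python
import math
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    # f(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return (1.0 / math.sqrt(2 * math.pi * sigma ** 2)
            * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)))

x, mu, sigma = 1.5, 0.0, 2.0
print(abs(normal_pdf(x, mu, sigma) - norm.pdf(x, loc=mu, scale=sigma)) < 1e-12)  # True
```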

Statistical Significance Tests
t-test
Accept or reject a null hypothesis
NULL HYPOTHESIS: a statement we are trying to disprove by running the test

import numpy
import scipy.stats
import pandas

def compare_averages(filename):
	baseball_data = pandas.read_csv(filename)

	baseball_data_left = baseball_data[baseball_data['handedness'] == 'L']
	baseball_data_right = baseball_data[baseball_data['handedness'] == 'R']

	# Welch's t-test (unequal variances) on the two groups' batting averages.
	result = scipy.stats.ttest_ind(baseball_data_left['avg'],
	                               baseball_data_right['avg'],
	                               equal_var=False)

	# p-value <= .05: reject the null hypothesis that the means are equal.
	if result[1] <= .05:
		return (False, result)
	else:
		return (True, result)

if __name__ == '__main__':
	result = compare_averages('../data/baseball_data.csv')
	print(result)

Machine Learning: A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.

Statistics is focused on analyzing existing data and drawing valid conclusions; machine learning is focused on making predictions.