data science – ソフトウェアエンジニアの技術ブログ：Software engineer tech blog

Time Series Forecasting

Time series forecasting can be used to predict values in business situations such as the following:
– Monthly beach bike rentals
– A stock’s daily closing value
– Annual sheep population

Average Method
The best predictor of what will happen tomorrow is the average of everything that has happened up until now.

Moving average method

Naive Method
If there is not enough data to create a predictive model, the Naive method can supplement forecasts for the near future.

Seasonal Naive Method
Assumes that the magnitude of the seasonal pattern will remain constant.
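The three simple methods above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the history is a list of numbers ordered oldest to newest; the function names and the quarterly rental figures are made up for the example and are not part of the original notes.

```python
def average_forecast(history):
    """Average method: forecast the mean of all past observations."""
    return sum(history) / len(history)


def naive_forecast(history):
    """Naive method: forecast the most recent observation."""
    return history[-1]


def seasonal_naive_forecast(history, season_length, steps_ahead=1):
    """Seasonal naive method: repeat the value from the same point
    in the most recent full season."""
    # Position of the target period within the last observed season.
    offset = (steps_ahead - 1) % season_length
    return history[-season_length + offset]


# Hypothetical quarterly bike rentals over two years (illustrative numbers).
rentals = [10, 40, 80, 30, 12, 44, 84, 34]

print(average_forecast(rentals))            # mean of all eight quarters
print(naive_forecast(rentals))              # last observed quarter
print(seasonal_naive_forecast(rentals, 4))  # same quarter one year earlier
```

With seasonal data like this, the seasonal naive forecast for the next quarter picks the first quarter of the latest year rather than the last value seen, which is why it tends to suit strongly seasonal series better than the plain naive method.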

Exponential Smoothing Model
The forecast is a weighted average of past observations, with more recent observations weighted more heavily.
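A minimal sketch of simple exponential smoothing, assuming a single smoothing parameter `alpha` and no trend or seasonal component; the function name, the choice to start the level at the first observation, and the sample values are assumptions made for illustration.

```python
def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing: one-step-ahead forecast.

    alpha in (0, 1] controls how quickly the weights on older
    observations decay; alpha = 1 reduces to the naive method.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = history[0]  # initialise the level with the first observation
    for observation in history[1:]:
        # New level = weighted average of the latest value and the old level.
        level = alpha * observation + (1 - alpha) * level
    return level


closing_values = [100.0, 102.0, 101.0, 105.0]  # illustrative numbers
print(exponential_smoothing_forecast(closing_values, alpha=0.5))
```

A smaller `alpha` gives a smoother forecast that reacts slowly to new values; a larger `alpha` tracks the latest observations more closely.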

Mapper and Reducer

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
	level=logging.INFO, filemode='w')

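def mapper():
    # Each input line is one comma-separated row of the data file.
    for line in sys.stdin:
        data = line.strip().split(",")
        # Skip malformed rows and the header row.
        if len(data) != 12 or data[0] == 'Register':
            continue
        # Emit a tab-separated key (column 3) and value (column 8).
        print("{0}\t{1}".format(data[3], data[8]))

mapper()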

import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
	level=logging.INFO, filemode='w')

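def reducer():
    # Assumes the input lines arrive sorted by key, so all values for
    # one key are adjacent.
    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")

        # Skip lines that are not a key/value pair.
        if len(data) != 2:
            continue

        this_key, count = data
        # The key changed: emit the total for the previous key and reset.
        if old_key and old_key != this_key:
            print("{0}\t{1}".format(old_key, aadhaar_generated))
            aadhaar_generated = 0

        old_key = this_key
        aadhaar_generated += float(count)

    # Emit the total for the final key.
    if old_key is not None:
        print("{0}\t{1}".format(old_key, aadhaar_generated))

reducer()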

MapReduce programming model -> Hadoop
(1) Hive, (2) Pig
Mahout, Giraph, Cassandra

Using Mapreduce with Subway data

Scatter Plot

Line Chart
– Mitigate some shortcomings of scatterplot
– Emphasize trends
– Focus on year to year variability, not overall trends

LOESS Curve
– Emphasize long term trends
– LOESS weighted regression
– Easier to take a quick look at chart and understand big picture

Multivariate
– How to incorporate more variables
– Use an additional encoding
– Size
– Color / Saturation

Basics of mapreduce
MapReduce is a parallel programming model.

Python dictionary
{"alice": 1, "was": 1, "of": 2, … "do": 1}
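A word-count dictionary like the one above can be built in the map-then-reduce style. The sketch below runs both steps in a single process purely to show the shape of the computation; the function names and the shortened sample text are assumptions for illustration.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


# Shortened sample text, used only to exercise the two steps.
lines = ["Alice was beginning to get very tired of sitting",
         "and of having nothing to do"]
pairs = [pair for line in lines for pair in map_words(line)]
word_counts = reduce_counts(pairs)
print(word_counts["of"], word_counts["alice"])
```

In a real MapReduce job the map calls run in parallel on different chunks of the input, and the framework groups the emitted pairs by key before the reduce step sees them.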

– Numeric data
A measurement (e.g., height, weight) or a count (e.g., HR or hits)
Discrete and continuous
discrete: whole numbers (e.g., 10, 34, 25)
continuous: any number within a range (e.g., .250, .357, .511)

Categorical Data
Represents characteristics (e.g., position, team, hometown, handedness)
Can take on numerical values, but these have no mathematical meaning
Ordinal data
categories with some order or ranking
e.g., very low, average, high

Time-series data
– data collected via repeated measurements over time
– example: average HR

import pandas
from ggplot import *

def lineplot_compare(hr_by_team_year_sf_la_csv):
    # Read yearly home-run totals per team from the CSV file passed in.
    hr_year = pandas.read_csv(hr_by_team_year_sf_la_csv)
    # One colored line per team, with a point for each year.
    print(ggplot(hr_year, aes('yearID', 'HR', color='teamID')) + geom_point() + geom_line() + ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR'))

if __name__ == '__main__':
    lineplot_compare('hr_by_team_year_sf_la.csv')

Effective Information Visualization

-effective communication of complex quantitative ideas
clarity, precision, efficiency

Visual Encoding
position x, y
Length A, B, C
Angle

Visual Encoding: Direction, Shape, Area/Volume
Color: Hue, Saturation
Combination: min, max
Limit Hues

Plotting in Python
– Many packages
– matplotlib <- very popular
– ggplot <- use this, looks nicer, grammar of graphics
ggplot(data, aes(xvar, yvar)) + geom_point() + geom_line()
first step: create plot
second step: represent data with geometric objects
third step: add labels

import pandas
from ggplot import *

def lineplot(hr_year_csv):
    # Read total home runs per year from the CSV file passed in.
    hr_year = pandas.read_csv(hr_year_csv)
    # Red points joined by a red line, with a title and axis labels.
    print(ggplot(hr_year, aes('yearID', 'HR')) + geom_point(color='red') + geom_line(color='red') + ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR'))

if __name__ == '__main__':
    lineplot('hr_year.csv')

Coefficient of Determination

Coefficient of Determination
-data = y_1 … y_n
-predictions = f_1 … f_n
-average of data = \bar{y}

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - f_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

Calculating R^2

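import numpy as np

def compute_r_squared(data, predictions):
    # Total sum of squares: spread of the data around its mean.
    SST = ((data - np.mean(data))**2).sum()
    # Residual sum of squares: squared prediction errors.
    SSRes = ((predictions - data)**2).sum()
    r_squared = 1 - SSRes / SST

    return r_squared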

Additional Considerations
– other types of linear regression
– ordinary least squares regression
– parameter estimation
– under / overfitting
– multiple local minima

Types of Machine Learning

Different types of learning
Data -> Model -> Predictions

Unsupervised Learning
-trying to understand structure of data
-clustering

Linear Regression with gradient descent
\sum_{i=1}^{m} (Y_{predicted} - Y_{actual})^2

Gradient Descent – Cost Function: J(Θ)
Minimize J(Θ) … how?

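import numpy
import pandas

def compute_cost(features, values, theta):
    # Cost J(theta): sum of squared errors divided by 2m.
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        # Standard batch update: step theta against the gradient of J(theta),
        # then record the cost so convergence can be inspected.
        predicted_values = numpy.dot(features, theta)
        theta = theta - alpha / m * numpy.dot(predicted_values - values, features)
        cost_history.append(compute_cost(features, values, theta))

    return theta, pandas.Series(cost_history)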

Tip of the Imputation Iceberg

Imputation
– Just the tip of the iceberg: more sophisticated methods exist
– Fill in the mean: simple and relatively effective
– Linear regression

– Both have negative side effects and can obscure or simplify trends in the data
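One side effect of filling in the mean can be seen with a small example: the mean is preserved, but the spread of the data shrinks because every filled value sits exactly at the center. The sketch below uses made-up weights with `None` marking missing entries; the numbers are illustrative, not real data.

```python
import statistics

# Illustrative weights with two missing entries (None); not real data.
weights = [180.0, None, 200.0, 220.0, None, 160.0]

observed = [w for w in weights if w is not None]
mean_weight = statistics.mean(observed)

# Fill every missing value with the mean of the observed values.
imputed = [mean_weight if w is None else w for w in weights]

# The mean is unchanged, but the standard deviation gets smaller.
print(statistics.mean(observed), statistics.mean(imputed))
print(statistics.pstdev(observed), statistics.pstdev(imputed))
```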

Statistical Rigor
significance tests
– using data, can we disprove an assumption with a pre-defined level of confidence

Normal Distribution, Two parameters:
・μ(mean)
・σ(standard deviation)

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Mean = μ
Variance = σ^2
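The density can be evaluated directly from its two parameters. A minimal sketch using only the standard library; the function name `normal_pdf` is chosen for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    variance = sigma ** 2
    coefficient = 1.0 / math.sqrt(2 * math.pi * variance)
    exponent = -((x - mu) ** 2) / (2 * variance)
    return coefficient * math.exp(exponent)


# Peak of the standard normal distribution, about 0.3989.
print(normal_pdf(0.0, mu=0.0, sigma=1.0))
```

The curve is symmetric around `mu`, so `normal_pdf(mu - d, mu, sigma)` and `normal_pdf(mu + d, mu, sigma)` give the same value for any `d`.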

Statistical Significance Tests
t-test
Accept or reject a null hypothesis
NULL HYPOTHESIS: A statement we are trying to disprove by running test
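To make the test less of a black box, the t statistic for two independent samples can be computed by hand. The sketch below uses the unequal-variance (Welch) form and stops at the statistic; turning it into a p-value needs the t distribution, which a library such as scipy handles. The function name and the two samples are assumptions for illustration.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """t statistic for two samples without assuming equal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Illustrative samples; the null hypothesis is that both means are equal.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(welch_t_statistic(group_a, group_b))
```

A statistic far from zero is evidence against the null hypothesis; one close to zero means the observed difference in means is small relative to the noise in the samples.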

import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    # Returns (False, result) when the difference in mean batting average
    # is significant at the 5% level, otherwise (True, result).
    baseball_data = pandas.read_csv(filename)

    baseball_data_left = baseball_data[baseball_data['handedness'] == 'L']
    baseball_data_right = baseball_data[baseball_data['handedness'] == 'R']

    # Welch's t-test: does not assume equal variances in the two groups.
    result = scipy.stats.ttest_ind(baseball_data_left['avg'], baseball_data_right['avg'], equal_var=False)

    if result[1] <= .05:
        return (False, result)
    else:
        return (True, result)

if __name__ == '__main__':
    result = compare_averages('../data/baseball_data.csv')
    print(result)

Machine Learning: A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.

Statistics is focused on analyzing existing data and drawing valid conclusions.
Machine learning is focused on making predictions.

Complex Query

import pandas
import pandasql

def aggregate_query(filename):
	aadhaar_data = pandas.read_csv(filename)
	aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

	q = """
	SELECT
	gender, district, sum(aadhaar_generated)
	FROM
	aadhaar_data
	WHERE
	age > 50
	GROUP BY
	gender, district;
	"""

	aadhaar_solution = pandasql.sqldf(q.lower(), locals())
	return aadhaar_solution

Files, Databases, APIs
Application Programming Interface
https://www.last.fm/api

import numpy
import pandas

def imputation(filename):
    # Fill missing weights with the mean of the observed weights.
    baseball = pandas.read_csv(filename)

    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))

    print("{0} {1}".format(numpy.sum(baseball['weight']), numpy.mean(baseball['weight'])))

Matrix Multiplication

>>> import numpy
>>> a = [1,2,3,4,5]
>>> b = [2,3,4,5,6]
>>> numpy.dot(a,b)
70

Data Wrangling Manipulation
files, databases, web APIs

Dealing with Messy Data
Acquiring Data
– Acquiring data often isn’t fancy
– Find stuff on the internet!
– A lot of data is stored in text files and on government websites

Common Data Formats
– csv, xml, json
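All three formats can be read with the Python standard library. The sketch below parses the same made-up record from each format; the field names mirror the `nameFirst` and `nameLast` columns used elsewhere in these notes, and the record itself is a placeholder.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same placeholder record in three formats.
csv_text = "nameFirst,nameLast\nJane,Doe\n"
json_text = '{"nameFirst": "Jane", "nameLast": "Doe"}'
xml_text = "<player><nameFirst>Jane</nameFirst><nameLast>Doe</nameLast></player>"

# CSV: the header row supplies the keys for each following row.
csv_record = next(csv.DictReader(io.StringIO(csv_text)))
# JSON: maps directly onto Python dictionaries and lists.
json_record = json.loads(json_text)
# XML: walk the child elements of the root and collect tag/text pairs.
xml_root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in xml_root}

print(csv_record["nameLast"], json_record["nameLast"], xml_record["nameLast"])
```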

import pandas

def add_full_name(path_to_csv, path_to_new_csv):
	dataframe = pandas.read_csv(path_to_csv)
	dataframe['nameFull'] = dataframe['nameFirst'] + ' ' + dataframe['nameLast']
	dataframe.to_csv(path_to_new_csv)

if __name__ == "__main__":
	path_to_csv = ""
	path_to_new_csv = ""
	add_full_name(path_to_csv, path_to_new_csv)

Relational Database
Why useful? ->
It is straightforward to extract aggregated data with complex filters
A database scales well
It ensures all data is consistently formatted

Schemas = Blueprints
SELECT * FROM aadhaar_data;
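The schema-as-blueprint idea can be tried out with an in-memory SQLite database from the standard library. This is a sketch only: the column list is a small assumed subset chosen for the example, and the rows are placeholders rather than real enrolment figures.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# The schema is the blueprint: column names and types are fixed up front.
connection.execute("""
    CREATE TABLE aadhaar_data (
        district TEXT,
        gender TEXT,
        age INTEGER,
        aadhaar_generated INTEGER
    )
""")

# Illustrative rows only, not real enrolment figures.
rows = [("District A", "M", 62, 3),
        ("District A", "F", 55, 2),
        ("District B", "M", 30, 5)]
connection.executemany("INSERT INTO aadhaar_data VALUES (?, ?, ?, ?)", rows)

# SELECT * returns every column of every row, in schema order.
all_rows = connection.execute("SELECT * FROM aadhaar_data;").fetchall()
print(len(all_rows))
```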

import pandas
import pandasql

def select_first_50(filename):
    aadhaar_data = pandas.read_csv(filename)
    # Replace spaces in column names with underscores and lowercase them
    # so the columns can be referenced in SQL.
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

    q = """
    SELECT
    register, enrolment_agency
    FROM
    aadhaar_data
    LIMIT 50;
    """
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution
