Class 2
January 29th, 2020
Ch. 3 of Provost-Fawcett’s Data Science for Business
introduce the concepts described in Ch. 3
familiarize with the concepts, work with Entropy
detail the treatment of binary classification
A predictive model is a formula for estimating the unknown value of interest, often called the target attribute.
Regression: numerical target
Classification: class membership
E.g., Class-probability estimation
A descriptive model is a formula for estimating the underlying phenomena and causal connections between values.
Descriptive modeling often is used to work towards a causal understanding of the data-generating process. (why do users watch Sci-Fi sagas?)
divide the dataset into segments (sets of rows) that differ in the value of their output variable.
If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable.
What are the variables that contain important information about the target variable?
Can they be selected automatically?
Measure:
purity of segments: homogeneity wrt. the target variable.
attributes seldom split a dataset perfectly.
body-color=gray splits off one single data point, hence pure. Is it desirable?
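A minimal sketch of the purity measure, assuming the usual Shannon entropy in bits (the `entropy` helper below is illustrative, not from the book):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a segment, in bits: H = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A pure segment has entropy 0; a 50/50 binary segment is maximally impure.
entropy(["yes"] * 5)        # -> 0.0
entropy(["yes", "no"] * 5)  # -> 1.0
```

Note that a single-instance segment (like the one split off by body-color=gray) is trivially pure, which is why purity alone is not a good selection criterion.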
Non-binary attributes in binary classification
non-discrete attributes
It measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
How much purer are the children wrt. the parent segment?
IG(parent, children) = H(parent) − Σᵢ p(cᵢ) · H(cᵢ)
where p(cᵢ) is the proportion of the parent's instances that fall into child segment cᵢ.
H(parent) = 0.99
H(left) = 0.39
H(right) = 0.79
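The computation can be sketched directly from the entropies above. The child proportions (13/30 and 17/30 below) are hypothetical, since the slides do not give the segment sizes:

```python
def information_gain(h_parent, children):
    """IG = H(parent) - weighted sum of child entropies.

    children: list of (proportion, entropy) pairs; proportions sum to 1.
    """
    return h_parent - sum(p * h for p, h in children)

# Slide entropies with hypothetical child proportions 13/30 and 17/30:
ig = information_gain(0.99, [(13/30, 0.39), (17/30, 0.79)])
# ig is roughly 0.37: the split removes about a third of the parent's entropy.
```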
Notice the discretization of the numerical var. we split on
H(parent) = 0.99
H(left) = 0.54
H(center) = 0.97
H(right) = 0.98
IG(p, C) = 0.13
Discretization may reduce numerical dimensions to discrete ones
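One simple way to discretize, assuming split-point binning (thresholds here are made up for illustration):

```python
def discretize(values, thresholds):
    """Map each numeric value to the index of its bin, given sorted split points."""
    return [sum(v >= t for t in thresholds) for v in values]

# Hypothetical split points at 10 and 50 turn a numeric variable into 3 bins:
discretize([3, 12, 45, 60], [10, 50])  # -> [0, 1, 1, 2]
```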
In regression, variance is the analogue of information entropy!
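A sketch of the regression analogue: replace entropy with the variance of the numeric target, and information gain with variance reduction (names and data below are illustrative):

```python
def variance(values):
    """Population variance of a segment's numeric target."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Analogue of IG: parent variance minus weighted child variances."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)

# A split that separates low targets from high ones removes all the variance:
variance_reduction([1, 1, 9, 9], [[1, 1], [9, 9]])  # -> 16.0
```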
Notice the graphical method deployed to visualize Information gain:
The shaded area represents Entropy.
the white area ‘reclaimed’ from the shade is the Information gain.
total: 8124 instances
edible: 4208 (51.8%)
poisonous: 3916 (48.2%)
H = −0.518 · log₂ 0.518 − 0.482 · log₂ 0.482 ≈ 0.999
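Plugging the class counts above into the entropy formula (a quick check of the value):

```python
from math import log2

edible, poisonous = 4208, 3916
total = edible + poisonous  # 8124 instances

h = -sum((c / total) * log2(c / total) for c in (edible, poisonous))
# h is approximately 0.999: the class split is nearly 50/50,
# so entropy is close to its maximum of 1 bit.
```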
A type of iterated supervised segmentation
At each node, split the segment on the attribute with the highest information gain.
Iterate until the leaf segments are pure (or a stopping criterion is met).
Measure: total entropy of the set of leaf segments.
Decision tree:
a set of if-then rules over attribute (or discretized) values
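The iterated segmentation above can be sketched as a recursive procedure. This is a minimal illustration (toy data, dict-based rows), not the book's algorithm:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Recursive supervised segmentation: at each node, split on the
    attribute whose segmentation has the lowest weighted child entropy
    (i.e. the highest information gain)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def split_entropy(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attrs, key=split_entropy)  # max IG = min weighted child entropy
    tree = {"attr": best, "branches": {}}
    for v in {row[best] for row in rows}:
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best])
    return tree

# Toy example: color perfectly predicts the class, shape does not.
rows = [{"color": "g", "shape": "r"}, {"color": "g", "shape": "s"},
        {"color": "w", "shape": "r"}, {"color": "w", "shape": "s"}]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["color", "shape"])
# tree splits on "color" and both branches are pure leaves.
```

Reading the nested dict top-down gives exactly the if-then rules described above: if color = g then yes, else no.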