Class 2
January 29th, 2020
Ch. 3 of Provost-Fawcett’s Data Science for Business
introduce the concepts described in Ch. 3
familiarize with the concepts, work with Entropy
detail the treatment of binary classification
A predictive model is a formula for estimating the unknown value of interest, often called the target attribute.
Regression: numerical target
Classification: class membership
E.g., Class-probability estimation
A descriptive model is a formula for estimating the underlying phenomena and causal connections between values.
Descriptive modeling often is used to work towards a causal understanding of the data-generating process. (why do users watch Sci-Fi sagas?)
divide the dataset into segments (sets of rows) that differ in the value of their output variable.
If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable.
What are the variables that contain important information about the target variable?
Can they be selected automatically?
Measure:
purity of segments: homogeneity wrt. the target variable.
attributes seldom split a dataset perfectly.
body-color=gray splits off one single data point, hence pure. Is it desirable?
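A minimal sketch of the purity measure, assuming the usual Shannon entropy in bits (the `entropy` helper below is illustrative, not from the book):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a segment, in bits: H = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A pure segment has entropy 0; a 50/50 binary segment is maximally impure.
entropy(["yes"] * 5)        # -> 0.0
entropy(["yes", "no"] * 5)  # -> 1.0
```

Note that a single-instance segment (like the one split off by body-color=gray) is trivially pure, which is why purity alone is not a good selection criterion.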
Non-binary attributes in binary classification
non-discrete attributes
It measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
How much purer are the children wrt. the parent segment?
IG(parent, children) = H(parent) − Σᵢ p(cᵢ) · H(cᵢ)
where p(cᵢ) is the proportion of the parent's instances that fall into child segment cᵢ.
H(parent) = 0.99
H(left) = 0.39
H(right) = 0.79
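The computation can be sketched directly from the entropies above. The child proportions (13/30 and 17/30 below) are hypothetical, since the slides do not give the segment sizes:

```python
def information_gain(h_parent, children):
    """IG = H(parent) - weighted sum of child entropies.

    children: list of (proportion, entropy) pairs; proportions sum to 1.
    """
    return h_parent - sum(p * h for p, h in children)

# Slide entropies with hypothetical child proportions 13/30 and 17/30:
ig = information_gain(0.99, [(13/30, 0.39), (17/30, 0.79)])
# ig is roughly 0.37: the split removes about a third of the parent's entropy.
```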
Notice the discretization of the numerical var. we split on
H(parent) = 0.99
H(left) = 0.54
H(center) = 0.97
H(right) = 0.98
IG(p, C) = 0.13
Discretization may reduce numerical dimensions to discrete ones
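One simple way to discretize, assuming split-point binning (thresholds here are made up for illustration):

```python
def discretize(values, thresholds):
    """Map each numeric value to the index of its bin, given sorted split points."""
    return [sum(v >= t for t in thresholds) for v in values]

# Hypothetical split points at 10 and 50 turn a numeric variable into 3 bins:
discretize([3, 12, 45, 60], [10, 50])  # -> [0, 1, 1, 2]
```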
In regression, variance is the analogue of information entropy!
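A sketch of the regression analogue: replace entropy with the variance of the numeric target, and information gain with variance reduction (names and data below are illustrative):

```python
def variance(values):
    """Population variance of a segment's numeric target."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Analogue of IG: parent variance minus weighted child variances."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)

# A split that separates low targets from high ones removes all the variance:
variance_reduction([1, 1, 9, 9], [[1, 1], [9, 9]])  # -> 16.0
```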
Notice the graphical method deployed to visualize Information gain:
The shaded area represents Entropy.
the white area ‘reclaimed’ from the shade is the Information gain.
total: 8124 instances
edible: 4208 (51.8%)
poisonous: 3916 (48.2%)
H = −0.518 · log₂ 0.518 − 0.482 · log₂ 0.482 ≈ 0.999
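Plugging the class counts above into the entropy formula (a quick check of the value):

```python
from math import log2

edible, poisonous = 4208, 3916
total = edible + poisonous  # 8124 instances

h = -sum((c / total) * log2(c / total) for c in (edible, poisonous))
# h is approximately 0.999: the class split is nearly 50/50,
# so entropy is close to its maximum of 1 bit.
```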
A type of iterated supervised segmentation
At each node, split the segment on the attribute with the highest information gain.
Iterate until the leaf segments are pure (or a stopping criterion is met).
Measure: total entropy of the set of leaf segments.
Decision tree:
a set of if-then rules over attribute (or discretized) values
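The iterated segmentation above can be sketched as a recursive procedure. This is a minimal illustration (toy data, dict-based rows), not the book's algorithm:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Recursive supervised segmentation: at each node, split on the
    attribute whose segmentation has the lowest weighted child entropy
    (i.e. the highest information gain)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def split_entropy(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attrs, key=split_entropy)  # max IG = min weighted child entropy
    tree = {"attr": best, "branches": {}}
    for v in {row[best] for row in rows}:
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best])
    return tree

# Toy example: color perfectly predicts the class, shape does not.
rows = [{"color": "g", "shape": "r"}, {"color": "g", "shape": "s"},
        {"color": "w", "shape": "r"}, {"color": "w", "shape": "s"}]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["color", "shape"])
# tree splits on "color" and both branches are pure leaves.
```

Reading the nested dict top-down gives exactly the if-then rules described above: if color = g then yes, else no.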