
Data Science Techniques and Applications (DSTA) 2019-20

Evaluating Classification performance

Slides and code courtesy of Andreas C. Müller, NYU

https://github.com/amueller/

1 / 30

Supervised Binary Classification

Instance:

  • a collection X

  • its classification y (assume a binary label set, {0, 1} or {-1, +1})

Solution: a classifier function (here called model)

Measure: misclassification wrt. y

2 / 30

Confusion matrix

3 / 30

Accuracy is the diagonal divided by everything: the sum of the diagonal entries over the total number of samples.
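
A minimal sketch of reading accuracy off a confusion matrix (y_true and y_pred below are made-up toy labels, not data from the slides):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
accuracy = np.trace(cm) / cm.sum()      # diagonal divided by everything
print(cm)
print("accuracy:", accuracy)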

Example: evaluate the LogisticRegression classifier

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
data = load_breast_cancer()
4 / 30

Example: evaluate the LogisticRegression classifier

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
5 / 30

Example: evaluate the LogisticRegression classifier

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class
print(lr.score(X_test, y_test))           # accuracy on the test set

6 / 30

Problems with Accuracy

Data with 90% positives:

from sklearn.metrics import accuracy_score
for y_pred in [y_pred_1, y_pred_2, y_pred_3]:
    print(accuracy_score(y_true, y_pred))
0.9
0.9
0.9

7 / 30
  • Imbalanced classes lead to hard-to-interpret accuracy.
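
y_pred_1, y_pred_2 and y_pred_3 are not defined on the slide; one way to construct data with 90% positives and three very different classifiers that all reach 90% accuracy is sketched below:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1] * 90 + [0] * 10)   # 90% positives

y_pred_1 = np.ones(100, dtype=int)       # always predict the majority class
y_pred_2 = y_true.copy()
y_pred_2[:10] = 0                        # every negative right, 10 positives wrong
y_pred_3 = y_true.copy()
y_pred_3[:5] = 0                         # 5 positives wrong...
y_pred_3[-5:] = 1                        # ...and 5 negatives wrong

for y_pred in [y_pred_1, y_pred_2, y_pred_3]:
    print(accuracy_score(y_true, y_pred))   # 0.9 in all three cases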

Precision, Recall, f-score

$\text{Precision} = \frac{TP}{TP + FP}$  (also known as Positive Predictive Value, PPV)

$\text{Recall} = \frac{TP}{TP + FN}$  (also known as sensitivity, coverage, or true positive rate)

$F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$  (the harmonic mean of precision and recall)
8 / 30

All of these depend on the definition of positive and negative.
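
All three are available directly in scikit-learn; a short sketch, assuming the y_test and y_pred from the breast-cancer example above:

from sklearn.metrics import precision_score, recall_score, f1_score

print("precision: {:.3f}".format(precision_score(y_test, y_pred)))
print("recall:    {:.3f}".format(recall_score(y_test, y_pred)))
print("f1:        {:.3f}".format(f1_score(y_test, y_pred)))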

F-score

$F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}$

More weight on recall: $F_2$ (the general $F_\beta$ with $\beta = 2$)

More weight on precision: $F_{0.5}$
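
These variants are exposed in scikit-learn as fbeta_score; a minimal sketch, again assuming the y_test and y_pred from the breast-cancer example:

from sklearn.metrics import fbeta_score

# beta > 1 weights recall more heavily, beta < 1 weights precision more heavily
print("F2:   {:.3f}".format(fbeta_score(y_test, y_pred, beta=2)))
print("F0.5: {:.3f}".format(fbeta_score(y_test, y_pred, beta=0.5)))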

9 / 30






11 / 30

The importance of experiment design

  • What do I want? What do I care about? (precision, recall, or something else)
  • Can I assign costs to the confusion matrix? (e.g. a false positive costs me 10 dollars, a false negative 100 dollars)
  • What guarantees do we want to give?

12 / 30

Setting Thresholds

from sklearn.metrics import classification_report

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)                      # default threshold: 0.5
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support
           0       0.91      0.92      0.92        53
           1       0.96      0.94      0.95        90
 avg / total       0.94      0.94      0.94       143

y_pred = lr.predict_proba(X_test)[:, 1] > .85    # predict 1 only when Pr(1) > 0.85
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support
           0       0.84      1.00      0.91        53
           1       1.00      0.89      0.94        90
 avg / total       0.94      0.93      0.93       143
13 / 30

The Precision-Recall curve

SVC: an SVM classifier

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# [...]  (X and y are defined elsewhere)
X_train, X_test, y_train, y_test = train_test_split(X, y)
svc = SVC().fit(X_train, y_train)
precision, recall, thresholds = precision_recall_curve(y_test, svc.decision_function(X_test))

(some details were omitted)

14 / 30
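
Plotting the resulting curve takes only a few lines of matplotlib; a sketch that reuses the precision and recall arrays computed above (the styling of the original figure is not known, so this is only an approximation):

import matplotlib.pyplot as plt

plt.plot(recall, precision, label="SVC")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()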

The Precision-Recall curve of an SVC classifier

15 / 30

Comparing RF and SVC

16 / 30
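
The comparison figure itself is not reproduced here; a sketch of how the two curves could be overlaid, assuming a RandomForestClassifier rf fitted on the same training split (the exact model settings used in the original slides are not known):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# SVC is ranked by its decision function, the forest by the probability of the positive class
prec_svc, rec_svc, _ = precision_recall_curve(y_test, svc.decision_function(X_test))
prec_rf, rec_rf, _ = precision_recall_curve(y_test, rf.predict_proba(X_test)[:, 1])

plt.plot(rec_svc, prec_svc, label="SVC")
plt.plot(rec_rf, prec_rf, label="random forest")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()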

Alternative: Average Precision

17 / 30

Average precision is closely related to the area under the precision-recall curve (computed with step interpolation).

F1 vs. average Precision

from sklearn.metrics import f1_score
print("f1_score of random forest: {:.3f}".format(f1_score(y_test, rf.predict(X_test))))
print("f1_score of svc: {:.3f}".format(f1_score(y_test, svc.predict(X_test))))
f1_score of random forest: 0.709
f1_score of svc: 0.715
from sklearn.metrics import average_precision_score
ap_rf = average_precision_score(y_test, rf.predict_proba(X_test)[:, 1])
ap_svc = average_precision_score(y_test, svc.decision_function(X_test))
print("Average precision of random forest: {:.3f}".format(ap_rf))
print("Average precision of svc: {:.3f}".format(ap_svc))
Average precision of random forest: 0.682
Average precision of svc: 0.693
18 / 30

AP only considers ranking!

The ROC Curve

$\text{FPR} = \frac{FP}{FP + TN}$

$\text{TPR} = \frac{TP}{TP + FN} = \text{recall}$
19 / 30
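
scikit-learn computes the curve with roc_curve; a sketch for the SVC from the previous slides (the dashed diagonal is the random-guessing baseline):

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))

plt.plot(fpr, tpr, label="SVC")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
plt.legend()
plt.show()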

20 / 30

The ROC AUC

  • Area under ROC Curve

  • Always .5 for random/constant prediction

from sklearn.metrics import roc_auc_score
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:,1])
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test))
print("AUC for random forest: {:, .3f}".format(rf_auc))
print("AUC for SVC: {:, .3f}".format(svc_auc))
AUC for random forest: 0.937
AUC for SVC: 0.916

Background reading: The Relationship Between Precision-Recall and ROC Curves


https://www.biostat.wisc.edu/~page/rocpr.pdf

21 / 30

Summary of metrics for binary classification

Threshold-based:

  • precision, recall
  • accuracy
  • $F_\beta$

Ranking-based:

  • average precision

  • ROC AUC

22 / 30


Non-binary (Multi-class) classification

23 / 30

Confusion Matrix

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

digits = load_digits()
# pixel values are between 0 and 16; rescale to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16., digits.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
pred = lr.predict(X_test)
print("Accuracy: {:.3f}".format(accuracy_score(y_test, pred)))
print("Confusion matrix:")
print(confusion_matrix(y_test, pred))
24 / 30
[...]
print("Accuracy: {:.3f}".format(accuracy_score(y_test, pred)))
print("Confusion matrix:")
print(confusion_matrix(y_test, pred))
Accuracy: 0.964
Confusion matrix:
[[37  0  0  0  0  0  0  0  0  0]
 [ 0 41  0  0  0  0  1  0  1  0]
 [ 0  0 44  0  0  0  0  0  0  0]
 [ 0  0  0 43  0  0  0  0  1  1]
 [ 0  0  0  0 37  0  0  1  0  0]
 [ 0  0  0  0  0 47  0  0  0  1]
 [ 0  1  0  0  0  0 51  0  0  0]
 [ 0  0  0  0  1  0  0 47  0  0]
 [ 0  3  1  0  0  1  0  0 43  0]
 [ 0  0  0  0  0  2  0  0  1 44]]
25 / 30
print(classification_report(y_test, pred))

              precision    recall  f1-score   support
           0       1.00      1.00      1.00        37
           1       0.91      0.95      0.93        43
           2       0.98      1.00      0.99        44
           3       1.00      0.96      0.98        45
           4       0.97      0.97      0.97        38
           5       0.94      0.98      0.96        48
           6       0.98      0.98      0.98        52
           7       0.98      0.98      0.98        48
           8       0.93      0.90      0.91        48
           9       0.96      0.94      0.95        47
 avg / total       0.96      0.96      0.96       450
26 / 30

Averaging strategies

  • "macro", "weighted", "micro" (multi-label), "samples" (multi-label)

macro: $\frac{1}{|L|} \sum_{l \in L} R(y_l, \hat{y}_l)$

weighted: $\frac{1}{n} \sum_{l \in L} n_l\, R(y_l, \hat{y}_l)$

from sklearn.metrics import recall_score
print("Micro average: ", recall_score(y_test, pred, average="micro"))
print("Macro average: ", recall_score(y_test, pred, average="macro"))
Micro average: 0.964
Macro average: 0.964
27 / 30

The micro vs. macro distinction works the same way for the other metrics (precision, f1, and so on).
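
For instance, f1_score accepts the same average argument; a small sketch, using the pred from the digits example above:

from sklearn.metrics import f1_score

for average in ["micro", "macro", "weighted"]:
    score = f1_score(y_test, pred, average=average)
    print("{} average f1: {:.3f}".format(average, score))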

Multi-class ROC AUC

  • Hand & Till, 2001, one vs one
    1c(c1)j=1ckjcAUC(j,k)

develop a classifier for each pair of possible labels; let them vote on each unseen input

  • Provost & Domingo, 2000, one vs rest
    1cj=1cp(j)AUC(j,restj)

Develop a classifier for each possible label

Apply all of them to an unseen input and that give the label of the corresponding classifier that reports the highest confidence score
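
Recent scikit-learn versions (0.22 and later) expose both schemes through roc_auc_score via the multi_class parameter; a sketch on the digits model from the earlier slides (the mapping to the two papers follows the scikit-learn documentation):

from sklearn.metrics import roc_auc_score

proba = lr.predict_proba(X_test)   # class-membership probabilities are required here
# one vs one, unweighted: roughly the Hand & Till formulation
print(roc_auc_score(y_test, proba, multi_class="ovo", average="macro"))
# one vs rest, prevalence-weighted: roughly the Provost & Domingos formulation
print(roc_auc_score(y_test, proba, multi_class="ovr", average="weighted"))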

28 / 30

Observations:

  • Accuracy is rarely what you want

  • Emphasis on recall or precision?

  • Problems are rarely balanced

29 / 30

Available Scikit-learn scoring functions

from sklearn.metrics.scorer import SCORERS
print("\n".join(sorted(SCORERS.keys())))
accuracy log_loss precision_micro
adjusted_mutual_info_score mean_absolute_error precision_samples
adjusted_rand_score mean_squared_error precision_weighted
average_precision median_absolute_error r2
completeness_score mutual_info_score recall
explained_variance neg_log_loss recall_macro
f1 neg_mean_absolute_error recall_micro
f1_macro neg_mean_squared_error recall_samples
f1_micro neg_mean_squared_log_error recall_weighted
f1_samples neg_median_absolute_error roc_auc
f1_weighted normalized_mutual_info_score v_measure_score
fowlkes_mallows_score precision
homogeneity_score precision_macro
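
Any of these names can be passed as the scoring argument of the model-selection utilities; a minimal sketch with cross_val_score on the breast-cancer data from the earlier slides (the estimator and the number of folds are just for illustration):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), data.data, data.target,
                         scoring="average_precision", cv=5)
print("average precision per fold:", scores)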
30 / 30
