ROC is very useful to estimate the quality of a binary classifier. But for long time, I do confuse how it can be generated. Although I know the concept of TPR(true positive ratio), FPR(false positive ratio), I still can’t know why it displays as a curve. Because in my mind, each TPR/FPR only shows one point in a model. How can a model give lots of TPR/FPR? Thanks to the book of Introduction to Data Mining by Pang-Ning Tan etc., I finally get the point from one example in the book.

As we’re dealing with the binary classifier, it’ll only generate the probability from $0$ to $1$. For example, the Bayesian classifier can generate the posterior probability matches the above characteristics. Thus, we can assume our dataset as follows: there’re 5 positive and 5 negative data points in our testing dataset. And our binary classifier gives the dataset’s prediction probability with their class tag as follows: $(+, 0.25), (-, 0.43), (+, 0.53), (-, 0.76), (-, 0.85), (-, 0.85), (+, 0.85), (-, 0.87), (+, 0.93), (+, 0.95)$.

With my previous understanding, the whole work has finished. But in fact, it’s NOT! That’s why I got confused. The tricky part here is: although we have lots of prediction probability, what’s the predicted result? How should we give each data point a class label? Is data point 1 should be negative? Or the data point 3 should be positive? Given those results can’t show us the final class label!

So, what’s lack of here? The cut-off line! In fact, at this stage, we didn’t give the benchmark or baseline to decide how to determine the class label. And also, different cut-off line will give us different TPR/FPR pair. For example, if we treat the lowest probability above as the baseline, i.e. the $0.25$ as the positive cut-off line. Then, all the data points here will be treated as positive. Because the things that larger than and equal to $0.25$ should be treated as positive data, which is the definition of cut-off baseline.

Then, when we treated the $2^{nd}$ lowest probability $0.43$ as baseline, it’ll only treat the data point $(+, 0.25)$ as negative point. Because it’s the only data has the probability that lower than the baseline $0.43$. But here, we just made a FN point. Because the actual class label of $(+, 0.25)$ is $+$, but we treated it as negative.

The process can continue until the last case, that we treat all the dataset as negative points.

As we’ve gotten those $10$ TPR/FPR pairs, we can draw the ROC lines corresponding. So in summary, the reason that confused me is I didn’t notice that a classifier only gives the prediction probability is not enough. We also have to give the corresponding cut-off line. And each cut-off line will generate one TPR/FPR pair. That’s how we get the final ROC curve.