Machine Learning Classification
- 3.1 Classification Tasks
- 3.2 A Simple Classification Dataset
- 3.3 Training and Testing: Don't Teach to the Test
- 3.4 Evaluation: Grading the Exam
- 3.5 Simple Classifier #1: Nearest Neighbors, Long Distance Relationships, and Assumptions
- 3.6 Simple Classifier #2: Naive Bayes, Probability, and Broken Promises
- 3.7 Simplistic Evaluation of Classifiers
- 3.8 EOC
Walk through the first steps in building, training, and evaluating learning systems that classify examples (classifiers).
In [1]:
# setup
from mlwpy import *
%matplotlib inline
3.1 Classification Tasks
Now that we’ve laid a bit of groundwork, let’s turn our attention to the main attraction: building and evaluating learning systems. We’ll start with classification, and for that we need some data to play with. We also need to establish some evaluation criteria for success. All of these are just ahead.
Let me squeeze in a few quick notes on terminology. If there are only two target classes for output, we can call a learning task binary classification. You can think about {Yes, No}, {Red, Black}, or {True, False} targets. Very often, binary problems are described mathematically using {-1, +1} or {0, 1}. Computer scientists love to encode {False, True} into the numbers {0, 1} as the output values. In reality, {-1, +1} and {0, 1} are both used for mathematical convenience, and it won’t make much of a difference to us. (The two encodings often cause head-scratching if you lose focus while reading two different mathematical presentations. You might see one in a blog post and the other in an article and be unable to reconcile them. I’ll be sure to point out any differences in this book.) With more than two target classes, we have a multiclass problem.
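To make the encoding point concrete, here is a tiny sketch of my own (not part of the chapter’s code) showing that the two numeric encodings carry the same information and can be converted back and forth with simple arithmetic.

import numpy as np

# binary targets encoded as {0, 1}
y_01 = np.array([0, 1, 1, 0, 1])

# the same targets re-encoded as {-1, +1}
y_pm1 = 2 * y_01 - 1        # 0 --> -1, 1 --> +1
y_back = (y_pm1 + 1) // 2   # -1 --> 0, +1 --> 1

print(y_pm1)    # [-1  1  1 -1  1]
print(y_back)   # [0 1 1 0 1]

Whichever encoding a presentation uses, a line like the one above translates between them, so the choice really is just a matter of mathematical convenience.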
Some classifiers try to make a decision about the output in a direct fashion. The direct approach gives us great flexibility in the relationships we find, but that very flexibility also means we give up assumptions that might lead us to better decisions. These assumptions are similar to limiting the suspects in a crime to people who were near where the crime occurred. Sure, we could start with no assumptions at all and equally consider suspects from London, Tokyo, and New York for a crime that occurred in Nashville. But adding an assumption that the suspect is in Tennessee should lead to a better pool of suspects.
Other classifiers break the decision into a two-step process: (1) build a model of how likely the outcomes are and (2) pick the most likely outcome. Sometimes we prefer the second approach because we care about the confidence of the prediction, not just the predicted class. For example, we might want to know how likely it is that someone is sick. That is, we want to know that there is a 90% chance someone is sick, versus a more generic estimate “yes, we think they are sick.” That becomes important when the real-world cost of our predictions is high. When cost matters, we can combine the probabilities of events with the costs of those events and come up with a decision model to choose a real-world action that balances these, possibly competing, demands. We will consider one example of each type of classifier: Nearest Neighbors goes directly to an output class, while Naive Bayes makes an intermediate stop at an estimated probability.
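As a rough sketch of the difference (my own illustration using scikit-learn on the iris data, with invented costs rather than anything from the chapter): Nearest Neighbors can hand us a class directly, while Naive Bayes can hand us class probabilities that we may then weigh against the costs of acting on them.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# direct-style classifier: go straight to a class label
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X[:1]))            # a single class, e.g. [0]

# two-step classifier: estimate class probabilities first
nb = GaussianNB().fit(X, y)
probs = nb.predict_proba(X[:1])[0]   # one probability per class
print(probs)

# invented cost matrix for illustration: cost[action, true_class]
cost = np.array([[0.0, 10.0, 10.0],   # act as if class 0
                 [1.0,  0.0, 10.0],   # act as if class 1
                 [1.0,  1.0,  0.0]])  # act as if class 2

expected_cost = cost @ probs          # expected cost of each action
print(np.argmin(expected_cost))       # cheapest action given the probabilities

Choosing the action with the lowest expected cost, rather than simply the most probable class, is exactly the kind of decision the intermediate probabilities make possible.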