where Nc is the number of training instances that have label c, and N is the total
number of training instances. Therefore, P(c) is simply the proportion of training
instances that have label c.
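This counting estimate can be sketched in a few lines of Python. The function name and the toy labels below are illustrative, not part of the text:

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(c) for each label c as Nc / N: the proportion
    of training instances that have label c."""
    counts = Counter(labels)           # Nc for each class c
    total = len(labels)                # N, total number of training instances
    return {c: n / total for c, n in counts.items()}

# Hypothetical training labels for illustration
labels = ["spam", "ham", "spam", "ham", "ham"]
priors = estimate_priors(labels)
# P(spam) = 2/5 = 0.4, P(ham) = 3/5 = 0.6
```

Because each estimate is just a ratio of counts, computing all the priors takes a single pass over the training labels.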
Estimating P(d|c) is a little more complicated because the same “counting”
estimate that we were able to use for estimating P(c) would not work. (Why? See
exercise 9.3.) In order to make the estimation feasible, we must impose the simplifying
assumption that d can be represented as d = w1, . . . , wn and that wi
is independent of wj for every i ≠ j. Simply stated, this says that document d
1 Throughout most of this chapter, we assume that the items being classified are textual
documents. However, it is important to note that the techniques described here can
be used in a more general setting and applied to non-textual items such as images and
videos.