At its core is the assumption that a document is generated by a small number of “topics.” An LDA “topic” is a probability distribution, assigning each possible word a probability. Topics are considered hypothetical and unobservable, which is to say that they don’t actually exist in documents. Instead, we must infer the topic characteristics from a collection of documents. Each document is assumed to be generated by a few of these topics. So, every word in every document is assumed to be attributable to one of the document’s topics.