A language, whether natural (such as English) or artificial (such as Java), is a set
of strings of characters from some alphabet. The strings of a language are called
sentences or statements. The syntax rules of a language specify which strings
of characters from the language’s alphabet are in the language. English, for
example, has a large and complex collection of rules for specifying the syntax of
its sentences. By comparison, even the largest and most complex programming
languages are syntactically very simple.
Formal descriptions of the syntax of programming languages, for simplicity’s
sake, often do not include descriptions of the lowest-level syntactic
units. These small units are called lexemes. The description of lexemes can
be given by a lexical specification, which is usually separate from the syntactic
description of the language. The lexemes of a programming language include
its numeric literals, operators, and special words, among others. One can think
of programs as strings of lexemes rather than of characters.
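To make the idea of a program as a string of lexemes concrete, the following is a minimal sketch of splitting a statement into its lexemes. The class name LexemeSplitter, the method splitIntoLexemes, and the three-part pattern are assumptions made for illustration only. They stand in for a real lexical specification, which would also cover comments, string literals, floating-point literals, and multi-character operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexemeSplitter {
    // Hypothetical, deliberately tiny lexical specification (an assumption,
    // not Java's actual one): a lexeme is a name or special word, an
    // unsigned integer literal, or any single non-whitespace symbol.
    private static final Pattern LEXEME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\\S");

    // Returns the lexemes of the source string in order of appearance.
    // Whitespace separates lexemes and is not itself a lexeme.
    public static List<String> splitIntoLexemes(String source) {
        List<String> lexemes = new ArrayList<>();
        Matcher matcher = LEXEME.matcher(source);
        while (matcher.find()) {
            lexemes.add(matcher.group());
        }
        return lexemes;
    }

    public static void main(String[] args) {
        // prints [total, =, sum, +, 42, ;]
        System.out.println(splitIntoLexemes("total = sum + 42;"));
    }
}
```

Under these assumptions, the 17 characters of the statement reduce to six lexemes, and a syntactic description can be written in terms of those units rather than individual characters.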
Lexemes are partitioned into groups—for example, the names of variables,
methods, classes, and so forth in a programming language form a group called
identifiers. Each lexeme group is represented by a name, or token. So, a token
of a language is a category of its lexemes. For example, an identifier is a token
that can have lexemes, or instances, such as sum and total. In some cases, a
token has only a single possible lexeme. For example, the token for the arithmetic
operator symbol + has just one possible lexeme. Consider the following
Java statement: