MDL (minimum description length) is an information-theoretic approach to machine learning and statistical model selection, which says, basically, that you should pick the model giving you the most compact description of the data, including the description of the model itself. More precisely, given a probabilistic model, Shannon's coding theorems tell you the minimal number of bits needed to encode your data, i.e., the maximum extent to which it can be compressed. To complete the description, however, you also need to specify the model, from among some set of alternatives, and this, too, costs a certain number of bits. Hence what you really want to minimize is the combined length of the description of the model plus the description of the data under that model. This works out to be a kind of penalized maximum likelihood: the data-given-model term is the negative log-likelihood, and the model-description term is the penalty.
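To make the trade-off concrete, here is a minimal sketch of two-part MDL in Python, choosing the degree of a polynomial fit to noisy data. The score for each candidate model is L(model) + L(data | model), with the second term computed as the Gaussian negative log-likelihood converted to bits. The toy data and the crude fixed cost of 32 bits per coefficient are my own illustrative assumptions, not a canonical MDL code.

```python
# A toy two-part MDL model selection: pick the polynomial degree that
# minimizes L(model) + L(data | model), both measured in bits.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
# Quadratic truth plus Gaussian noise (illustrative data, not from the source).
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

def description_length(degree, bits_per_param=32):
    """Two-part code length in bits for a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)  # MLE of the Gaussian noise variance
    n = x.size
    # L(data | model): Gaussian negative log-likelihood at the MLE, in nats...
    nll_nats = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    # ...converted to bits.
    data_bits = nll_nats / np.log(2)
    # L(model): a deliberately crude code, a fixed cost per coefficient.
    model_bits = bits_per_param * (degree + 1)
    return model_bits + data_bits

scores = {d: description_length(d) for d in range(8)}
for d, s in scores.items():
    print(f"degree {d}: {s:8.1f} bits")
print("MDL picks degree", min(scores, key=scores.get))  # degree 2 wins here
```

Low degrees pay heavily in the data term (large residuals), high degrees pay in the model term, and the minimum lands in between. A real MDL code would encode each coefficient to a data-dependent precision, roughly (1/2) log₂ n bits per parameter, which is how the BIC penalty falls out as a special case; the fixed-cost version above already shows the penalized-likelihood structure at work.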