Today’s modeling and analysis of high-dimensional data relies either on human
expertise to hand-craft task-specific features, which suffers significantly
from the ever-increasing complexity and unknown patterns of new data, or on
simple data-driven approaches, which tend to lose the fundamental physical
insights underlying real-world datasets. Therefore, it is very difficult with today’s
modeling practice to detect reliable patterns and information in high-dimensional
data efficiently, effectively, and in an unsupervised manner. In this dissertation, we developed a
scalable data modeling framework that utilizes modern theoretical physics for unsupervised
high-dimensional data analysis and mining. Not only does it have a solid
theoretical background, but it is also capable of handling a variety of tasks,
such as clustering, anomaly detection, and feature selection. The framework
also has a probabilistic interpretation that avoids sensitivity to scaling-parameter
tuning and to noise in real-world applications. Furthermore, we
presented a fast approximate approach that makes the framework applicable to
large-scale datasets with high efficiency and effectiveness.