In this post, we will go through the concept of the manifold hypothesis. We will see how the dimension of the underlying manifold can be estimated and how the geometry of the underlying manifold can affect the difficulty of the learning task.
To a mathematician, the manifolds that appear in the machine learning context are usually embedded Riemannian manifolds. That means they are subsets of $\mathbb{R}^D$ that are locally diffeomorphic to $\mathbb{R}^d$ for some $d \le D$.
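Spelled out (this is the standard formulation of the condition, with the chart $\varphi$ introduced here just for illustration): every point of the manifold has a neighborhood that can be smoothly and invertibly flattened onto an open subset of $\mathbb{R}^d$,

$$
\forall\, x \in \mathcal{M} \quad \exists\, U \subseteq \mathbb{R}^D \text{ open, } x \in U, \qquad \varphi : U \cap \mathcal{M} \xrightarrow{\ \sim\ } V \subseteq \mathbb{R}^d \text{ open},
$$

where $\varphi$ is a diffeomorphism. The integer $d$ is the intrinsic dimension of $\mathcal{M}$, while $D$ is the ambient dimension.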
The manifold hypothesis posits that most high-dimensional real-world data sets lie on or near a low-dimensional manifold. For instance, while we encode images using pixel values (so the dimension of the data equals the number of pixels), for computer vision tasks the information we need is usually independent of the brightness, saturation, and contrast of the images themselves, which cuts down the actual dimension of the data.
Intrinsic dimension can be a useful indicator of the geometric complexity of a data set. Many algorithms have been developed for estimating intrinsic dimension. One technique that has risen to popularity lately utilizes diffusion models (see the previous post for a detailed discussion).
The idea behind it is as follows: Given a data point $x_0$ on the data manifold $\mathcal{M}$, we can perturb it slightly to get a noised data point $x_t$. Then from $x_t$, the score function $\nabla_{x_t} \log p_t(x_t)$, which the diffusion model aims to learn, approximately points from $x_t$ toward its orthogonal projection onto $\mathcal{M}$. This implies that when $t$ is small, $\nabla_{x_t} \log p_t(x_t)$ approximately lies in the normal space $N_{x_0}\mathcal{M}$. Note that if $\mathcal{M}$ is a $d$-dimensional manifold with ambient dimension $D$, we have

$$
\dim N_{x_0}\mathcal{M} = D - d.
$$
We can estimate $\dim N_{x_0}\mathcal{M}$ by repeatedly sampling noised points $x_t$ around $x_0$, evaluating the score at each sample, and applying PCA to all the resulting vectors: the number of dominant singular values estimates $D - d$, and hence the intrinsic dimension $d$.
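To make the procedure concrete, here is a minimal sketch in Python with NumPy. Since we do not have a trained diffusion model at hand, the score is replaced by its small-noise approximation on a toy manifold (the unit circle embedded in $\mathbb{R}^3$), namely the vector pointing from the noised sample back to its nearest point on the manifold. The function names, the noise level, and the singular-value threshold are all illustrative choices, not something taken from the original post.

```python
import numpy as np

# Sketch of the PCA-based intrinsic dimension estimator, assuming we can
# query an (approximate) score function. For small sigma,
#   score(x) ~ (proj_M(x) - x) / sigma^2,
# where proj_M is the nearest-point projection onto the manifold M.

D = 3          # ambient dimension
sigma = 0.01   # small noise level (plays the role of a small diffusion time t)
K = 200        # number of noised samples around the base point

def project_to_circle(x):
    """Nearest-point projection onto the unit circle in the xy-plane."""
    p = np.array([x[0], x[1], 0.0])
    return p / np.linalg.norm(p[:2])

def approx_score(x):
    """Small-noise approximation of the score, pointing back toward M."""
    return (project_to_circle(x) - x) / sigma**2

# Base point x0 on the manifold, perturbed K times with Gaussian noise.
rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.0, 0.0])
scores = np.stack(
    [approx_score(x0 + sigma * rng.standard_normal(D)) for _ in range(K)]
)

# PCA / SVD of the score vectors: the dominant singular directions span the
# normal space N_{x0} M, so the count of large singular values estimates D - d.
singular_values = np.linalg.svd(scores, compute_uv=False)
normal_dim = np.sum(singular_values > 0.1 * singular_values[0])  # crude cutoff
print("estimated intrinsic dimension:", D - normal_dim)          # expect 1
```

With a trained diffusion model, one would simply replace `approx_score` by the learned score network evaluated at a small noise level; the sampling and PCA steps stay the same.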