Exploring New Forms of Random Projections for Prediction and Dimensionality Reduction in Big-Data Regimes
Abstract
This work is about dimensionality reduction. Dimensionality reduction is a
method that takes as input a point-set P of n points in R^d, where d is
typically large, and attempts to find a lower-dimensional representation of
that dataset in order to ease the burden of processing for downstream
algorithms. In today's landscape of machine learning, researchers and
practitioners work with datasets that have a very large number of samples,
high-dimensional samples, or both. Therefore, dimensionality reduction is
applied as a pre-processing technique primarily to overcome the curse of
dimensionality.
Generally, dimensionality reduction reduces the time and storage required for
processing the point-set, removes multi-collinearity and redundancies in the
dataset where different features may depend on one another, and may enable
simple visualizations of the dataset in 2-D and 3-D, making the relationships
in the data easier for humans to comprehend. Dimensionality reduction methods
come in many shapes and sizes. Methods such as Principal Component Analysis
(PCA), Multi-dimensional Scaling, IsoMap, and Locally Linear Embedding are
amongst the most commonly used methods of this family of algorithms. However,
the choice of dimensionality reduction method proves critical in many
applications as there is no one-size-fits-all solution, and special care must
be taken for different datasets and tasks.
Furthermore, the aforementioned popular methods are data-dependent, and
commonly rely on computing either the Kernel / Gram matrix or the covariance
matrix of the dataset. These matrices scale with increasing number of samples
and increasing number of data dimensions, respectively, and are consequently
poor choices in today’s landscape of big-data applications.
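The scaling argument above can be made concrete with a minimal sketch in plain Python (illustrative helper names; not from the thesis): PCA-style methods form either a d x d covariance matrix, whose size grows with the data dimension, or an n x n kernel/Gram matrix, whose size grows with the number of samples.

```python
def covariance_matrix(X):
    """d x d sample covariance of X (rows are samples), after mean-centering.

    Storage grows as d^2, so this becomes costly for high-dimensional data.
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    return [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
             for b in range(d)] for a in range(d)]

def gram_matrix(X):
    """n x n Gram matrix of inner products between samples.

    Storage grows as n^2, so this becomes costly for many-sample data.
    """
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

# Tiny example: n = 4 samples in d = 3 dimensions.
X = [[1.0, 2.0, 0.5],
     [0.0, 1.0, 1.5],
     [2.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
C = covariance_matrix(X)   # 3 x 3: scales with dimension d
G = gram_matrix(X)         # 4 x 4: scales with sample count n
```

Both matrices must be recomputed whenever the data changes, which is what makes these methods data-dependent.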
Therefore, it is pertinent to develop new dimensionality reduction methods
that can be efficiently applied to large and high-dimensional datasets, by
either reducing the dependency on the data, or side-stepping it altogether.
Furthermore, such new dimensionality reduction methods should be able to
perform on par with, or better than, traditional methods such as PCA. To
achieve this goal, we turn to a simple and powerful method called random
projections.
Random projections are a simple, efficient, and data-independent method for
stably embedding a point-set P of n points in R^d to R^k where d is typically
large and k is on the order of log n. Random projections have a long history
of use in the dimensionality reduction literature, with great success. In this
work, we build on the ideas of random projection theory and extend the
framework into a powerful new setup of random projections for large,
high-dimensional datasets, with performance comparable to state-of-the-art
data-dependent and nonlinear methods. Furthermore, we study the use of random
projections in domains other than dimensionality reduction, including
prediction, and show the competitive performance of such methods in
small-dataset regimes.
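As a minimal sketch of the classical technique the abstract builds on (plain Python, illustrative names; not the thesis's own method): a Gaussian random projection maps each point in R^d to R^k by multiplying with a random k x d matrix whose entries are drawn i.i.d. from N(0, 1/k). By the Johnson-Lindenstrauss lemma, choosing k on the order of (log n) / eps^2 preserves all pairwise distances up to a (1 +/- eps) factor with high probability, with no dependence on the data itself.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Embed n points in R^d into R^k via a Gaussian random matrix.

    Entries of the k x d matrix R are i.i.d. N(0, 1/k), so for each x
    the expected squared norm of R @ x equals the squared norm of x;
    pairwise distances are preserved up to (1 +/- eps) w.h.p. when
    k = O(log n / eps^2).
    """
    d = len(points[0])
    rng = random.Random(seed)
    R = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)]
         for _ in range(k)]
    # Apply x -> R x to every point.
    return [[sum(r[j] * x[j] for j in range(d)) for r in R] for x in points]

# Example: 100 random points in R^1000 embedded into R^50.
n, d, k = 100, 1000, 50
rng = random.Random(1)
P = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
Q = random_projection(P, k)
```

Note that the projection matrix is drawn once, independently of P, which is exactly the data-independence property that makes random projections attractive in big-data regimes.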
Cite this version of the work
Amir-Hossein Karimi (2018). Exploring New Forms of Random Projections for Prediction and Dimensionality Reduction in Big-Data Regimes. UWSpace. http://hdl.handle.net/10012/13220
Related items
Showing items related by title, author, creator and subject.
- k-Connectedness and k-Factors in the Semi-Random Graph Process
  Koerts, Hidde (University of Waterloo, 2022-12-20): The semi-random graph process is a single-player graph game where the player is initially presented an edgeless graph with n vertices. In each round, the player is offered a vertex u uniformly at random and subsequently ...
- Modeling and managing noise in quantum error correction
  Beale, Stefanie (University of Waterloo, 2023-10-19): Simulating a quantum system to full accuracy is very costly and often impossible as we do not know the exact dynamics of a given system. In particular, the dynamics of measurement noise are not well understood. For this ...
- Randomly-connected Non-Local Conditional Random Fields
  Shafiee, Mohammad Javad (University of Waterloo, 2017-02-21): Structural data modeling is an important field of research. Structural data are the combination of latent variables being related to each other. The incorporation of these relations in modeling and taking advantage of those ...