Skip to content

Latest commit

 

History

History
82 lines (56 loc) · 2.33 KB

02a-pca-introduction.qmd

File metadata and controls

82 lines (56 loc) · 2.33 KB
title subtitle author pdf-engine format
PCA: Overview
Introduction to Statistical Modelling
Prof. Joris Vankerschaver
lualatex
beamer
theme colortheme fonttheme header-includes
Pittsburgh
default
default
\setbeamertemplate{frametitle}[default][left] \setbeamertemplate{footline}[frame number] \usepackage{emoji} \usepackage{luatexko}
# What is dimensionality reduction? ## What and why - Reduce the number of variables ("dimensionality") in a dataset **in a principled way**. - Useful for - Visualization - Data preprocessing - Computational efficiency - Many different approaches - Principal component analysis (this course) - Multidimensional scaling - t-SNE, UMAP, ... ## Visualization ![](./images/02-pca/gene-expression-profiles.png){height=75% fig-align="center"} From: Lever et al., *Principal component analysis*, Nature Methods, Vol. 14, p. 641–642, 2017. ## Visualization Genotype data 197,146 loci in 1387 Europeans, summarized in two principal components (left) and compared to geographical origin (right). ![](./images/02-pca/genetic-variation.png){height=50% fig-align="center"} From: Novembre et al., *Genes mirror geography within Europe*, Nature, Vol. 456, 6 November 2008. ## Data preprocessing Bodyfat dataset: - Suffered from high multicollinearity. - Conclusions from regression model are doubtful. ![](./images/02-pca/multicollinearity.png){height=50% fig-align="center"} ## Computational efficiency - A 250 x 250 image consists of 250$^2$ = 62,500 pixels. - Not all pixels are equally informative. - Extract signal that is maximally informative, discard rest. ![](./images/02-pca/imagenet-cats.jpeg){fig-align="center"} ## Principal component analysis - Covered in this course. - Works by finding directions in which **variance is maximized**. - Good first choice, not so good if patterns are highly nonlinear. ![](./images/02-pca/max-variance-projection.png){height=50% fig-align="center"} ## Other dimensionality reduction methods t-SNE, UMAP: - Useful for highly nonlinear relations between features. - "Deforms" data so that local structure is maintained. - Frequently used in single-cell RNA sequencing analysis. ![](./images/02-pca/tsne-cells.png){height=80% fig-align="center"} From: \url{https://www.cancer.gov/ccg/blog/2020/interview-t-sne}