- Understand the concept of multivariate statistics.
- Analyze relationships between multiple variables.
- Learn about multivariate probability distributions.
- Compute covariance and correlation matrices.
- Understand the multivariate normal distribution.
- Explore Principal Component Analysis (PCA).
Definition of Multivariate Statistics
Multivariate statistics involves analyzing multiple variables simultaneously to study their relationships.
- Univariate analysis: Single variable (e.g., height distribution).
- Bivariate analysis: Two variables (e.g., height vs. weight).
- Multivariate analysis: Three or more variables (e.g., height, weight, and age).
Covariance and Correlation Matrices
The covariance matrix describes the relationships between multiple variables:
\[ \Sigma = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \dots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Var}(X_n) \end{bmatrix} \]
The correlation matrix standardizes the relationships:
\[ R = \begin{bmatrix} 1 & \rho_{12} & \dots & \rho_{1n} \\ \rho_{21} & 1 & \dots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \dots & 1 \end{bmatrix} \]
Multivariate Probability Distributions
Multivariate probability distributions describe the joint behavior of multiple variables.
- Joint Probability Function: \( P(X_1, X_2, \dots, X_n) \).
- Marginal Distributions: Individual distributions obtained by summing/integrating over other variables.
- Conditional Distributions: Probability of one variable given values of others.
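For discrete variables, marginal and conditional distributions can be read directly off a joint probability table. A minimal sketch with a hypothetical two-variable table (the probabilities are illustrative):

```python
import numpy as np

# Hypothetical joint probability table P(X, Y): rows index values of X,
# columns index values of Y; all entries sum to 1.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

# Marginal distributions: sum out the other variable
p_x = joint.sum(axis=1)   # P(X) = sum over y of P(X, y)
p_y = joint.sum(axis=0)   # P(Y) = sum over x of P(x, Y)

# Conditional distribution P(Y | X = x0): renormalize the row for x0
x0 = 1
p_y_given_x0 = joint[x0] / p_x[x0]

print(p_x, p_y, p_y_given_x0)
```

The same sum/renormalize pattern carries over to continuous variables, with integrals in place of sums.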
Multivariate Normal Distribution
The multivariate normal distribution is an extension of the normal distribution to multiple variables:
\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]
where:
- \( \boldsymbol{\mu} \) = mean vector
- \( \Sigma \) = covariance matrix
- \( \mathbf{x} \) = multivariate variable
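The density above can be evaluated directly with NumPy. In the sketch below, `mvn_pdf` is a hypothetical helper (not a library function) that implements the formula term by term, and the bivariate parameters are illustrative:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    # Normalizing constant: 1 / ((2*pi)^(n/2) * |Sigma|^(1/2))
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    # Quadratic form in the exponent: -1/2 (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    return norm_const * np.exp(exponent)

# Illustrative bivariate normal with correlated components
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))
```

At the mean the exponent vanishes, so the density equals the normalizing constant alone.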
Derivation
The multivariate normal distribution generalizes the univariate normal:
\[ P(X) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X - \mu)^2}{2\sigma^2}} \]
Extending this to \( n \) dimensions leads to the matrix form:
\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]
Principal Component Analysis (PCA)
PCA is a technique for reducing dimensionality while preserving variance.
- Step 1: Compute the covariance matrix.
- Step 2: Find eigenvalues and eigenvectors.
- Step 3: Select top \( k \) principal components.
- Step 4: Transform data to new coordinate system.
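The four steps above can be sketched with NumPy's eigendecomposition; the dataset is made up for illustration:

```python
import numpy as np

# Illustrative data: rows are observations, columns are variables
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
])

# Step 1: center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# Step 2: eigenvalues and eigenvectors (eigh, since Sigma is symmetric)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Step 3: keep the top-k components (eigh returns ascending order)
k = 1
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]

# Step 4: project the centered data onto the new coordinate system
scores = Xc @ components
print(scores)
```

A useful sanity check: the sample variance of the scores along the first component equals the largest eigenvalue of the covariance matrix.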
Examples
Example 1: Compute the covariance matrix for the dataset:
| X | Y | Z |
|---|---|---|
| 2 | 4 | 3 |
| 3 | 5 | 4 |
| 4 | 6 | 5 |
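Example 1 can be checked with NumPy (sample covariance, denominator \( n - 1 \)):

```python
import numpy as np

# Dataset from Example 1: columns X, Y, Z
data = np.array([
    [2, 4, 3],
    [3, 5, 4],
    [4, 6, 5],
], dtype=float)

# Sample covariance matrix (denominator n - 1, NumPy's default)
Sigma = np.cov(data, rowvar=False)
print(Sigma)
# Every variance and covariance equals 1, since each column
# increases by exactly 1 per row: the deviations from the mean
# are (-1, 0, 1) for all three variables.
```

The result is the 3×3 all-ones matrix, which also means every pairwise correlation is exactly 1.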
Exercises
- Question 1: Compute the correlation matrix for the dataset:
- \( (2,3), (3,4), (5,6) \).
- Question 2: Find the first principal component of:
- \( X_1 = (1,2,3), X_2 = (4,5,6) \).
- Question 3: Compute the eigenvalues of the covariance matrix:
- Answer 1: The points satisfy \( Y = X + 1 \) exactly, so the two variables are perfectly correlated and the correlation matrix is:
- \( \begin{bmatrix}1 & 1 \\ 1 & 1\end{bmatrix} \).
- Answer 2: First principal component: \( (0.71, 0.71) \).
- Answer 3: Eigenvalues: \( \lambda_1 = 2.5, \lambda_2 = 0.5 \).
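Answer 2 can be verified numerically; the sketch below pairs the observations of \( X_1 \) and \( X_2 \) and extracts the leading eigenvector (the sign of an eigenvector is arbitrary, so \( (-0.71, -0.71) \) is equivalent):

```python
import numpy as np

# Observations from Question 2: rows pair X1 with X2
data = np.array([
    [1.0, 4.0],
    [2.0, 5.0],
    [3.0, 6.0],
])

# Covariance matrix of the two variables
Sigma = np.cov(data, rowvar=False)

# First principal component = eigenvector of the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(Sigma)
pc1 = eigvecs[:, np.argmax(eigvals)]
print(np.round(pc1, 2))  # (0.71, 0.71) up to sign
```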