Exam Preparation Flashcards
The Sum Rule states that the marginal probability of \(X\) is obtained by summing the joint probability over all states of \(Y\): $$p(X) = \sum_Y p(X, Y)$$ The Product Rule states that the joint probability can be decomposed into a conditional and a marginal: $$p(X, Y) = p(Y|X)p(X)$$
Bayes' Theorem is given by: $$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$$ where \(p(Y|X)\) is the posterior, \(p(X|Y)\) is the likelihood, and \(p(Y)\) is the prior. The denominator \(p(X)\) acts as a normalization constant to ensure the posterior probabilities sum to 1. It can be computed using the sum rule: $$p(X) = \sum_Y p(X|Y)p(Y)$$
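A minimal numeric sketch of the update (the prevalence and test rates below are illustrative assumptions, not from any card):

```python
# p(Y=1)=0.01 (prior), p(X=1|Y=1)=0.9 (likelihood), p(X=1|Y=0)=0.05 (false-positive rate)
prior = {1: 0.01, 0: 0.99}
likelihood = {1: 0.9, 0: 0.05}  # p(X=1 | Y=y)

# Sum rule for the evidence: p(X=1) = sum_y p(X=1|y) p(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: p(Y=1 | X=1)
posterior = likelihood[1] * prior[1] / evidence
print(round(posterior, 4))  # ~0.1538: a positive test is still most likely a false alarm
```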
For a continuous random variable \(x\), the PDF \(p(x)\) is a function such that the probability of \(x\) falling in an infinitesimal interval \((x, x + \delta x)\) is given by \(p(x)\delta x\). It must satisfy \(p(x) \ge 0\) and \(\int_{-\infty}^{\infty} p(x)dx = 1\). The probability that \(x\) lies in an interval \((a, b)\) is \(\int_{a}^{b} p(x)dx\).
The CDF, denoted \(P(z)\) or \(F(z)\), is the probability that the random variable \(X\) takes a value less than or equal to \(z\): $$P(z) = p(X \le z) = \int_{-\infty}^{z} p(x)dx$$ The PDF is the derivative of the CDF: \(p(x) = \frac{d}{dx}P(x)\).
The expectation (average value) of a function \(f(x)\) under a probability distribution \(p(x)\) is denoted \(\mathbb{E}[f]\).
For discrete variables:
$$\mathbb{E}[f] = \sum_x p(x)f(x)$$
For continuous variables:
$$\mathbb{E}[f] = \int p(x)f(x)dx$$
Variance measures the variability of \(f(x)\) around its mean: $$\text{var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$ Covariance measures the extent to which two variables \(x\) and \(y\) vary together: $$\text{cov}[x, y] = \mathbb{E}_{x,y}[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$
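A quick sketch verifying both identities on random samples (finite-sample estimates, so agreement is numerical rather than exact):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=100_000)

var_direct = np.mean((x - x.mean()) ** 2)
var_moment = np.mean(x ** 2) - x.mean() ** 2          # E[x^2] - E[x]^2
cov_direct = np.mean((x - x.mean()) * (y - y.mean()))
cov_moment = np.mean(x * y) - x.mean() * y.mean()     # E[xy] - E[x]E[y]

assert np.isclose(var_direct, var_moment)
assert np.isclose(cov_direct, cov_moment)
```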
Two random variables \(X\) and \(Y\) are independent, denoted \(X \perp Y\), if their joint distribution factorizes into the product of their marginals: $$p(X, Y) = p(X)p(Y)$$ This implies that \(p(Y|X) = p(Y)\) and \(p(X|Y) = p(X)\). Covariance between independent variables is zero.
\(X\) and \(Y\) are conditionally independent given \(Z\) (denoted \(X \perp Y | Z\)) if the conditional joint distribution factorizes: $$p(X, Y | Z) = p(X|Z)p(Y|Z)$$ This means that if we know \(Z\), knowing \(Y\) gives no extra information about \(X\).
A distribution for a single binary random variable \(x \in \{0, 1\}\). It is governed by a parameter \(\mu\) (probability of \(x=1\)):
$$\text{Bern}(x|\mu) = \mu^x (1-\mu)^{1-x}$$
Mean: \(\mathbb{E}[x] = \mu\)
Variance: \(\text{var}[x] = \mu(1-\mu)\)
Models the number of observations \(m\) of \(x=1\) (heads) in a set of \(N\) independent Bernoulli trials.
$$\text{Bin}(m|N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$$
Mean: \(\mathbb{E}[m] = N\mu\)
Variance: \(\text{var}[m] = N\mu(1-\mu)\)
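A sanity-check sketch of the mean/variance formulas using SciPy (the values of \(N\) and \(\mu\) are arbitrary):

```python
import numpy as np
from scipy.stats import binom

N, mu = 20, 0.3
dist = binom(N, mu)
assert np.isclose(dist.mean(), N * mu)            # E[m] = N*mu
assert np.isclose(dist.var(), N * mu * (1 - mu))  # var[m] = N*mu*(1-mu)
print(dist.pmf(6))  # probability of exactly 6 heads in 20 trials
```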
Generalization of the Binomial to \(K\) mutually exclusive states. For \(N\) trials, with \(m_k\) being the count of state \(k\): $$\text{Mult}(m_1, ..., m_K|\boldsymbol{\mu}, N) = \frac{N!}{m_1! \dots m_K!} \prod_{k=1}^K \mu_k^{m_k}$$ where \(\sum \mu_k = 1\) and \(\sum m_k = N\).
Models counts of rare events over a fixed interval. Defined for \(x \in \{0, 1, 2, ...\}\) with rate parameter \(\lambda > 0\): $$\text{Poi}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$$ Mean: \(\lambda\), Variance: \(\lambda\).
A continuous distribution defined by mean \(\mu\) and variance \(\sigma^2\): $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}$$ It has maximum entropy for a given variance. Precision is defined as \(\beta = 1/\sigma^2\).
For a \(D\)-dimensional vector \(\mathbf{x}\): $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$ where \(\boldsymbol{\mu}\) is the mean vector and \(\mathbf{\Sigma}\) is the \(D \times D\) covariance matrix. The term in the exponent is the squared Mahalanobis distance.
\(\mathbf{\Sigma}\) must be symmetric and positive semi-definite (positive definite for the density to be well-defined). Its eigendecomposition \(\mathbf{\Sigma} = \sum \lambda_i \mathbf{u}_i \mathbf{u}_i^T\) defines the orientation (eigenvectors \(\mathbf{u}_i\)) and elongation (eigenvalues \(\lambda_i\)) of the elliptical constant-density contours.
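A sketch evaluating the density directly from the formula and checking it against SciPy (the mean and covariance are example values):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 0.5])

D = len(mu)
diff = x - mu
maha_sq = diff @ np.linalg.inv(Sigma) @ diff  # squared Mahalanobis distance
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
density = norm_const * np.exp(-0.5 * maha_sq)

assert np.isclose(density, multivariate_normal(mu, Sigma).pdf(x))
```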
The theorem states that the sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables (with finite mean and variance) tends to a Gaussian distribution, regardless of the original distribution of the variables.
A continuous distribution over \(\mu \in [0, 1]\), often used as a prior for Bernoulli parameters. $$\text{Beta}(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}$$ Mean: \(a/(a+b)\). It reduces to the uniform distribution when \(a=b=1\).
A multivariate generalization of the Beta distribution over the \(K\)-dimensional probability simplex. $$\text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\dots\Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k - 1}$$ where \(\alpha_0 = \sum \alpha_k\). It is the conjugate prior for the Multinomial distribution.
A flexible distribution for positive real variables \(\lambda > 0\). $$\text{Gam}(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)$$ Often used as a conjugate prior for the precision (inverse variance) of a Gaussian.
It is a heavy-tailed distribution derived by integrating out the precision of a Gaussian with a Gamma prior. $$\text{St}(x|\mu, \lambda, \nu) \propto \left[ 1 + \frac{\lambda(x-\mu)^2}{\nu} \right]^{-(\nu+1)/2}$$ It provides robustness to outliers compared to the Gaussian because probability decays polynomially rather than exponentially.
A frequentist method for parameter estimation. It chooses the parameters \(\boldsymbol{\theta}\) that maximize the likelihood function \(p(\mathcal{D}|\boldsymbol{\theta})\), i.e., making the observed data \(\mathcal{D}\) most probable. $$\hat{\boldsymbol{\theta}}_{MLE} = \arg\max_{\boldsymbol{\theta}} \log p(\mathcal{D}|\boldsymbol{\theta})$$
Given \(N\) data points \(x_n\): $$\mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n \quad \text{(Sample Mean)}$$ $$\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})^2 \quad \text{(Sample Variance)}$$ Note: \(\sigma^2_{ML}\) is a biased estimator of the true variance.
Bias is the difference between the expected value of the estimator and the true parameter value: \(\text{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\). The MLE for Gaussian variance is biased because \(\mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2\). An unbiased estimator uses \(N-1\) in the denominator.
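A simulation sketch of the bias: for \(N=5\) draws from \(\mathcal{N}(0,1)\), the MLE variance averages to roughly \((N-1)/N = 0.8\), while the \(N-1\) estimator averages to roughly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, N))

var_ml = samples.var(axis=1, ddof=0)        # divides by N (biased MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # divides by N-1 (unbiased)
print(var_ml.mean(), var_unbiased.mean())   # ~0.8 and ~1.0
```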
A method that estimates parameters by maximizing the posterior distribution \(p(\boldsymbol{\theta}|\mathcal{D})\). It incorporates a prior \(p(\boldsymbol{\theta})\): $$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D}) = \arg\max_{\boldsymbol{\theta}} [ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) ]$$ The log prior acts as a regularizer.
A prior is conjugate to a likelihood function if the posterior distribution has the same functional form as the prior. This allows for closed-form Bayesian updates.
Example: Beta prior + Bernoulli likelihood \(\rightarrow\) Beta posterior.
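A minimal sketch of this closed-form update (the prior pseudo-counts and observed data are example values): observing \(m\) heads in \(N\) tosses turns a \(\text{Beta}(a, b)\) prior into a \(\text{Beta}(a+m, b+N-m)\) posterior.

```python
from scipy.stats import beta

a, b = 2, 2   # prior pseudo-counts (example values)
N, m = 10, 7  # observed: 7 heads in 10 tosses

posterior = beta(a + m, b + (N - m))  # Beta(9, 5)
print(posterior.mean())               # (a+m)/(a+b+N) = 9/14 ≈ 0.643
```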
1. Beta distribution (conjugate to the Bernoulli/Binomial)
2. Dirichlet distribution (conjugate to the Multinomial)
3. Gaussian distribution (conjugate to a Gaussian likelihood, for the mean with known variance)
4. Gamma (or Inverse-Gamma) distribution (conjugate to the Gaussian precision or variance)
The principle that Bayesian model selection automatically penalizes complex models. Complex models spread their probability mass over a larger space of datasets. Thus, for a specific dataset, a simpler model (that puts more mass on that dataset) often has a higher marginal likelihood (evidence) \(p(\mathcal{D}|M)\).
The distribution of a new data point \(x_{new}\) given the observed data \(\mathcal{D}\), obtained by marginalizing out the parameters \(\boldsymbol{\theta}\): $$p(x_{new}|\mathcal{D}) = \int p(x_{new}|\boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) d\boldsymbol{\theta}$$ This averages predictions over all possible parameter settings weighted by their posterior probabilities.
Entropy \(H[x]\) measures the average amount of information (or uncertainty) in a random variable \(x\): $$H[x] = -\sum_x p(x) \ln p(x)$$ It is maximized by a uniform distribution.
A measure of dissimilarity between two probability distributions \(p(x)\) and \(q(x)\): $$\text{KL}(p||q) = -\int p(x) \ln \frac{q(x)}{p(x)} dx$$ Properties: \(\text{KL}(p||q) \ge 0\) (Gibbs' inequality), and \(\text{KL}(p||q) = 0\) iff \(p=q\). It is asymmetric.
A measure of the dependence between two variables \(x\) and \(y\). It is the KL divergence between the joint distribution and the product of marginals: $$I[x, y] = \text{KL}(p(x,y) || p(x)p(y)) = H[x] - H[x|y]$$ It represents the reduction in uncertainty about \(x\) given \(y\).
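A sketch computing \(I[x, y]\) both ways for a small discrete joint (the table entries are arbitrary example values):

```python
import numpy as np

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])  # rows index x, columns index y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

mi_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))  # KL(p(x,y) || p(x)p(y))

H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))  # p_y broadcasts over rows: p(x|y)
assert np.isclose(mi_kl, H_x - H_x_given_y)       # I[x,y] = H[x] - H[x|y]
```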
If \(p(x_a, x_b) = \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right)\), then the marginal is simply: $$p(x_a) = \mathcal{N}(x_a | \mu_a, \Sigma_{aa})$$ Marginalization in Gaussians is trivial: just drop the other variables.
For the joint Gaussian defined in the previous card, the conditional \(p(x_a|x_b)\) is also Gaussian with:
Mean: \(\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\)
Covariance: \(\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\)
Note the mean depends linearly on \(x_b\), while the covariance is independent of \(x_b\).
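A 2D numeric sketch of these formulas, conditioning the first variable on an observed value of the second (all numbers are example values):

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x_b = 2.0  # observed value of the second variable

S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
mu_a_given_b = mu[0] + S_ab / S_bb * (x_b - mu[1])  # linear in x_b
var_a_given_b = S_aa - S_ab / S_bb * S_ba           # does not depend on x_b
print(mu_a_given_b, var_a_given_b)                  # 0.8 and 1.36
```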
The product of two Gaussian PDFs is proportional to another Gaussian PDF. \(\mathcal{N}(x|\mu_1, \Sigma_1)\mathcal{N}(x|\mu_2, \Sigma_2) \propto \mathcal{N}(x|\mu_{new}, \Sigma_{new})\). This property is key for Bayesian updates where prior and likelihood are both Gaussian.
A system where: $$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x)$$ $$p(y|x) = \mathcal{N}(y|Ax + b, \Sigma_y)$$ The marginal \(p(y)\) and posterior \(p(x|y)\) are also Gaussian. This forms the basis for Factor Analysis, PCA, and Kalman Filters.
It is the conjugate prior for a multivariate Gaussian with unknown mean \(\mu\) and unknown covariance \(\Sigma\). $$p(\mu, \Sigma) = \text{NIW}(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0) = \mathcal{N}(\mu | m_0, \frac{1}{\kappa_0}\Sigma) \times \text{IW}(\Sigma | S_0, \nu_0)$$ It combines a Gaussian prior on the mean (conditioned on \(\Sigma\)) and an Inverse-Wishart prior on the covariance.
A distribution over symmetric positive-definite matrices (like covariance matrices). It is the conjugate prior for the precision matrix (inverse covariance) of a Gaussian. If \(\mathbf{S} \sim \text{Wi}(\mathbf{\Sigma}, \nu)\), then \(\mathbf{S}\) is distributed as the scatter matrix \(\sum_{i=1}^\nu \mathbf{x}_i \mathbf{x}_i^T\) where \(\mathbf{x}_i \sim \mathcal{N}(0, \mathbf{\Sigma})\).
In high-dimensional spaces, volume grows exponentially with dimension. Data becomes sparse, and distance measures become less meaningful (all points are roughly equidistant). For a Gaussian, probability mass is concentrated in a thin shell at a large radius from the mean.
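A quick simulation sketch of the Gaussian shell effect: draws from \(\mathcal{N}(0, \mathbf{I}_D)\) concentrate at radius \(\approx \sqrt{D}\):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (2, 100, 10_000):
    radii = np.linalg.norm(rng.normal(size=(1000, D)), axis=1)
    print(D, radii.mean(), radii.std())  # mean ≈ sqrt(D); relative spread shrinks
```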
Decision theory combines probabilities with a Loss Function \(L(y, a)\) (cost of taking action \(a\) when truth is \(y\)). The goal is to minimize Risk (expected loss): $$\mathbb{E}[L] = \sum_y \int L(y, a(x)) p(x, y) dx$$ For 0-1 loss, this leads to picking the class with the highest posterior probability.
A generative classifier that assumes features \(x_1, \dots, x_D\) are conditionally independent given the class label \(y\): $$p(\mathbf{x}|y) = \prod_{j=1}^D p(x_j|y)$$ The classifier predicts the \(y\) that maximizes \(p(y)\prod_j p(x_j|y)\). It is "naive" because features are rarely independent in practice.
A numerical technique to compute \(\log \sum_i \exp(x_i)\) without underflow or overflow. Identity: $$\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m)$$ where \(m = \max_i x_i\). Used in Naive Bayes and HMMs.
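A minimal implementation sketch; the naive evaluation overflows for the inputs below:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))               # ~1002.41, finite
print(np.log(np.sum(np.exp(x))))  # inf: exp(1000) overflows
```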
A class of distributions whose PDF can be written as: $$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x}) g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$$ where \(\boldsymbol{\eta}\) are natural parameters and \(\mathbf{u}(\mathbf{x})\) are sufficient statistics. Includes Gaussian, Exponential, Gamma, Beta, Bernoulli, etc.
A statistic \(\phi(D)\) is sufficient for a parameter \(\theta\) if \(p(D|\theta)\) depends on the data \(D\) only through \(\phi(D)\): $$p(D|\theta) = g(\phi(D), \theta)h(D)$$ For the exponential family, \(\sum_n \mathbf{u}(x_n)\) is sufficient.
A probabilistic model that assumes data is generated from a mixture of \(K\) Gaussian distributions. $$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)$$ where \(\pi_k\) are mixing coefficients (\(\sum \pi_k = 1\)). It is a latent variable model.
An iterative algorithm to find MLE or MAP estimates in models with latent variables (like GMMs).
E-step: Compute posterior of latent variables given current parameters.
M-step: Update parameters to maximize expected log likelihood under distributions found in E-step.
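A compact EM sketch for a 1D two-component GMM, implementing the two steps above (the initial values and the synthetic data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

pi = np.array([0.5, 0.5])    # mixing coefficients
mu = np.array([-1.0, 1.0])   # component means
var = np.array([1.0, 1.0])   # component variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[n, k] = posterior of component k for point n
    r = pi * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(pi, mu, var)  # should recover roughly [0.6, 0.4], [-2, 3], [1, 0.25]
```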
Choosing a model \(M_i\) based on the posterior \(p(M_i|\mathcal{D}) \propto p(\mathcal{D}|M_i)p(M_i)\). The term \(p(\mathcal{D}|M_i)\) is the Model Evidence (or Marginal Likelihood), computed by integrating out parameters: $$p(\mathcal{D}|M_i) = \int p(\mathcal{D}|\theta, M_i)p(\theta|M_i)d\theta$$ It automatically penalizes model complexity (Bayesian Occam's Razor).
Criteria for model selection that approximate the model evidence.
BIC (Bayesian Information Criterion): \(\ln p(\mathcal{D}) \approx \ln p(\mathcal{D}|\hat{\theta}) - \frac{M}{2}\ln N\)
AIC (Akaike Information Criterion): \(\ln p(\mathcal{D}|\hat{\theta}) - M\)
where \(M\) is # of parameters, \(N\) is # of data points. BIC penalizes complexity more strongly for large \(N\).
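A sketch computing both criteria for a univariate Gaussian fit by MLE (\(M = 2\): mean and variance; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=500)
N, M = len(x), 2

mu_hat, var_hat = x.mean(), x.var()                    # MLE estimates
loglik = -0.5 * N * (np.log(2 * np.pi * var_hat) + 1)  # log p(D | theta_hat) at the MLE

bic = loglik - 0.5 * M * np.log(N)
aic = loglik - M
print(bic, aic)  # BIC's penalty grows with log N, AIC's does not
```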
A prior intended to have minimal influence on the posterior, letting data dominate. Examples include the uniform distribution or Jeffreys Prior (which is invariant to re-parameterization). They are often "improper" (do not integrate to 1).
A distance measure that accounts for correlations in data: $$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$ It appears in the exponent of the Multivariate Gaussian. It reduces to Euclidean distance if \(\mathbf{\Sigma} = \mathbf{I}\).
A distribution over positive definite matrices, which acts as the conjugate prior for the covariance matrix of a Gaussian (when the mean is known). It generalizes the Inverse-Gamma distribution.
The Bernoulli distribution models a single binary trial (0 or 1). The Binomial distribution models the sum of \(N\) independent Bernoulli trials. If \(N=1\), Binomial reduces to Bernoulli.
\(\sigma(x) = \frac{1}{1 + e^{-x}}\).
It maps \((-\infty, \infty)\) to \((0, 1)\).
Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).
Used in Logistic Regression and Neural Networks.
A generalization of the sigmoid to \(K\) classes. It converts a vector of real numbers \(\mathbf{z}\) into a probability distribution: $$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$ Used in Multiclass Logistic Regression and Neural Networks.
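Minimal sketches of both functions; the softmax subtracts the max before exponentiating, which leaves the result unchanged but avoids overflow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift-invariance: softmax(z) = softmax(z - c)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())    # non-negative entries summing to 1
print(sigmoid(0.0))  # 0.5
```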
An estimator \(\hat{\theta}_N\) is consistent if it converges in probability to the true parameter \(\theta\) as the sample size \(N \to \infty\). Under mild regularity conditions, the MLE is consistent.
For a function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian \(\mathbf{J}\) is an \(m \times n\) matrix of all first-order partial derivatives: \(J_{ij} = \frac{\partial f_i}{\partial x_j}\). It is used in the Change of Variables formula for probability densities.
If \(\mathbf{y} = f(\mathbf{x})\) is a bijective transformation, the PDF of \(\mathbf{y}\) is: $$p_y(\mathbf{y}) = p_x(\mathbf{x}) \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{y}} \right|$$ where the term in absolute values is the Jacobian determinant of the inverse transformation.
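A sketch checking the formula on \(y = e^x\) with \(x \sim \mathcal{N}(0, 1)\): the inverse is \(x = \ln y\) with \(|dx/dy| = 1/y\), so \(p_y(y) = p_x(\ln y)/y\), the standard log-normal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

y = 2.5  # arbitrary test point
p_y = norm.pdf(np.log(y)) / y                  # change-of-variables formula
assert np.isclose(p_y, lognorm(s=1.0).pdf(y))  # matches the log-normal pdf
```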
The expected squared error of an estimator can be decomposed into: $$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$ High bias implies underfitting (model too simple). High variance implies overfitting (model too complex).
The set of all possible outcomes of a random experiment. For example, in a coin toss \(\Omega = \{H, T\}\); for a die roll \(\Omega = \{1, 2, 3, 4, 5, 6\}\).
A subset of the sample space \(\Omega\). An event \(A\) occurs if the outcome of the experiment is an element of \(A\). For example, "rolling an even number" is the event \(A = \{2, 4, 6\}\).
1. Non-negativity: \(P(A) \ge 0\) for any event \(A\).
2. Normalization: \(P(\Omega) = 1\).
3. Additivity: If \(A_1, A_2, \dots\) are disjoint (mutually exclusive) events, then \(P(\cup_i A_i) = \sum_i P(A_i)\).
For any two events \(A\) and \(B\): $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$ If \(A\) and \(B\) are mutually exclusive (\(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\).
The probability of event \(A\) occurring given that event \(B\) has occurred, denoted \(P(A|B)\). $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ Defined only when \(P(B) > 0\).
The probability of a single variable taking a given value, irrespective of the values of any other variables. For discrete variables \(X, Y\): $$P(X=x) = \sum_y P(X=x, Y=y)$$ It is obtained by "summing out" the other variables from the joint distribution.
For any random variables \(X\) and \(Y\) (dependent or independent) and constants \(a, b\): $$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$
If \(X\) and \(Y\) are independent random variables: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$ If they are dependent: \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)\).
The square root of the variance: $$\sigma_X = \sqrt{\text{Var}(X)}$$ It measures the spread of the distribution in the same units as the random variable itself.
A normalized measure of linear dependence between two variables \(X\) and \(Y\), denoted \(\rho_{XY}\): $$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$ It ranges from -1 to 1. \(\rho=0\) means uncorrelated.
Two variables are uncorrelated if \(\text{Cov}(X, Y) = 0\). They are independent if \(P(X, Y) = P(X)P(Y)\). Independence implies uncorrelatedness, but uncorrelatedness does not imply independence (unless variables are jointly Gaussian).
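A sketch of the classic counterexample: \(X \sim \mathcal{N}(0,1)\) and \(Y = X^2\) are deterministically dependent, yet \(\text{Cov}(X, Y) = \mathbb{E}[X^3] = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x ** 2

print(np.cov(x, y)[0, 1])  # ~0: uncorrelated despite full dependence
```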
Independent and Identically Distributed. A sequence of random variables is i.i.d. if each variable has the same probability distribution and is mutually independent of the others. This is a common assumption for data points in machine learning.
The probability of the observed data \(\mathcal{D}\) viewed as a function of the parameters \(\theta\): $$L(\theta) = p(\mathcal{D}|\theta)$$ It is not a probability distribution over \(\theta\) (does not integrate to 1).
The probability of the parameters \(\theta\) after observing data \(\mathcal{D}\): $$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}$$ It combines the prior belief with the likelihood of the observed data.
The probability distribution \(p(\theta)\) over parameters \(\theta\) before any data is observed. It expresses initial beliefs or assumptions about the parameters.
The value of the random variable that has the highest probability (discrete) or probability density (continuous). For a symmetric unimodal distribution like the Gaussian, it coincides with the mean.
The value \(m\) such that the probability of being less than or equal to \(m\) is 0.5. For a symmetric distribution like Gaussian, median = mean = mode.
A discrete distribution describing the result of a single trial with \(K\) possible outcomes (e.g., rolling a K-sided die once). Often represented using a 1-of-K encoding vector \(\mathbf{x}\) where \(p(\mathbf{x}|\boldsymbol{\mu}) = \prod \mu_k^{x_k}\). It is a special case of the Multinomial with \(N=1\).
Continuous: \(p(x) = \frac{1}{b-a}\) for \(a \le x \le b\), 0 otherwise.
Discrete: \(p(x=k) = \frac{1}{K}\) for \(k=1 \dots K\).
It represents maximum ignorance (maximum entropy) over a bounded interval or finite set.
Allows expressing a joint distribution as a product of conditional distributions: $$p(X_1, X_2, \dots, X_N) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) \dots p(X_N|X_1, \dots, X_{N-1})$$ Valid for any ordering of variables.