Machine Learning Probability Theory

Exam Preparation Flashcards

What are the Sum Rule and Product Rule of probability?

The Sum Rule states that the marginal probability of \(X\) is obtained by summing the joint probability over all states of \(Y\): $$p(X) = \sum_Y p(X, Y)$$ The Product Rule states that the joint probability can be decomposed into a conditional and a marginal: $$p(X, Y) = p(Y|X)p(X)$$

Source: Bishop, Pattern Recognition and Machine Learning, Ch 1.2 "Probability Theory", Eq 1.10, 1.11; Murphy, Machine Learning: A Probabilistic Perspective, Ch 2.2.2.
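
A minimal numerical sketch of both rules on a small discrete joint table (the entries of \(p(X, Y)\) below are made-up illustrative values):

```python
import numpy as np

# Made-up 2x3 joint table p(X, Y); rows index states of X, columns states of Y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginalize Y out by summing over its states.
p_x = p_xy.sum(axis=1)                # p(X) = sum_Y p(X, Y)

# Product rule: the joint factors as p(Y|X) p(X).
p_y_given_x = p_xy / p_x[:, None]     # p(Y|X)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)
```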
State Bayes' Theorem and explain the role of the denominator.

Bayes' Theorem is given by: $$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$$ where \(p(Y|X)\) is the posterior, \(p(X|Y)\) is the likelihood, and \(p(Y)\) is the prior. The denominator \(p(X)\) acts as a normalization constant to ensure the posterior probabilities sum to 1. It can be computed using the sum rule: $$p(X) = \sum_Y p(X|Y)p(Y)$$

Source: Bishop, Ch 1.2.3 "Bayesian probabilities", Eq 1.12, 1.13; Murphy, Ch 2.2.3 "Bayes rule".
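
A small worked example of the theorem and its normalizing denominator, using made-up numbers for a binary \(Y\) (a rare condition) and a binary observation \(X\) (a positive test):

```python
import numpy as np

prior = np.array([0.01, 0.99])        # p(Y): [condition, no condition]
likelihood = np.array([0.90, 0.05])   # p(X = positive | Y)

# Denominator via the sum rule: p(X) = sum_Y p(X|Y) p(Y).
evidence = likelihood @ prior

posterior = likelihood * prior / evidence   # p(Y|X), sums to 1
print(posterior)                            # ~[0.154, 0.846]
```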
Define Probability Density Function (PDF).

For a continuous random variable \(x\), the PDF \(p(x)\) is a function such that the probability of \(x\) falling in an infinitesimal interval \((x, x + \delta x)\) is given by \(p(x)\delta x\). It must satisfy \(p(x) \ge 0\) and \(\int_{-\infty}^{\infty} p(x)dx = 1\). The probability that \(x\) lies in an interval \((a, b)\) is \(\int_{a}^{b} p(x)dx\).

Source: Bishop, Ch 1.2.1 "Probability densities", Eq 1.24-1.26; Murphy, Ch 2.2.5 "Continuous random variables".
What is the Cumulative Distribution Function (CDF)?

The CDF, denoted \(P(z)\) or \(F(z)\), is the probability that the random variable \(X\) takes a value less than or equal to \(z\): $$P(z) = p(X \le z) = \int_{-\infty}^{z} p(x)dx$$ The PDF is the derivative of the CDF: \(p(x) = \frac{d}{dx}P(x)\).

Source: Bishop, Ch 1.2.1, Eq 1.28; Murphy, Ch 2.2.5.
Define Expectation for discrete and continuous variables.

The expectation (average value) of a function \(f(x)\) under a probability distribution \(p(x)\) is denoted \(\mathbb{E}[f]\).
For discrete variables: $$\mathbb{E}[f] = \sum_x p(x)f(x)$$ For continuous variables: $$\mathbb{E}[f] = \int p(x)f(x)dx$$

Source: Bishop, Ch 1.2.2, Eq 1.33-1.34; Murphy, Ch 2.2.7.
Define Variance and Covariance.

Variance measures the variability of \(f(x)\) around its mean: $$\text{var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$ Covariance measures the extent to which two variables \(x\) and \(y\) vary together: $$\text{cov}[x, y] = \mathbb{E}_{x,y}[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$

Source: Bishop, Ch 1.2.2, Eq 1.38-1.41; Murphy, Ch 2.5.1.
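
A quick Monte Carlo check of the two identities above on simulated data (the sample size and the linear relationship between \(x\) and \(y\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x by construction

# var[f] = E[f^2] - E[f]^2 and cov[x, y] = E[xy] - E[x]E[y].
var_x = np.mean(x**2) - np.mean(x)**2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

print(var_x, np.var(x))                       # the two variance forms agree
print(cov_xy, np.cov(x, y, bias=True)[0, 1])  # both ~0.5 here
```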
What does it mean for two random variables to be Independent?

Two random variables \(X\) and \(Y\) are independent, denoted \(X \perp Y\), if their joint distribution factorizes into the product of their marginals: $$p(X, Y) = p(X)p(Y)$$ This implies that \(p(Y|X) = p(Y)\) and \(p(X|Y) = p(X)\). Covariance between independent variables is zero.

Source: Murphy, Ch 2.2.4; Bishop, Ch 1.2.
What is Conditional Independence?

\(X\) and \(Y\) are conditionally independent given \(Z\) (denoted \(X \perp Y | Z\)) if the conditional joint distribution factorizes: $$p(X, Y | Z) = p(X|Z)p(Y|Z)$$ This means that if we know \(Z\), knowing \(Y\) gives no extra information about \(X\).

Source: Murphy, Ch 2.2.4, Eq 2.15; Bishop Ch 8.2.
Describe the Bernoulli Distribution.

A distribution for a single binary random variable \(x \in \{0, 1\}\). It is governed by a parameter \(\mu\) (probability of \(x=1\)): $$\text{Bern}(x|\mu) = \mu^x (1-\mu)^{1-x}$$ Mean: \(\mathbb{E}[x] = \mu\)
Variance: \(\text{var}[x] = \mu(1-\mu)\)

Source: Bishop, Ch 2.1, Eq 2.2; Murphy, Ch 2.3.1.
Describe the Binomial Distribution.

Models the number of observations \(m\) of \(x=1\) (heads) in a set of \(N\) independent Bernoulli trials. $$\text{Bin}(m|N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$$ Mean: \(\mathbb{E}[m] = N\mu\)
Variance: \(\text{var}[m] = N\mu(1-\mu)\)

Source: Bishop, Ch 2.1, Eq 2.9; Murphy, Ch 2.3.1.
Describe the Multinomial Distribution.

Generalization of the Binomial to \(K\) mutually exclusive states. For \(N\) trials, with \(m_k\) being the count of state \(k\): $$\text{Mult}(m_1, ..., m_K|\boldsymbol{\mu}, N) = \frac{N!}{m_1! \dots m_K!} \prod_{k=1}^K \mu_k^{m_k}$$ where \(\sum \mu_k = 1\) and \(\sum m_k = N\).

Source: Bishop, Ch 2.2, Eq 2.34; Murphy, Ch 2.3.2.
Describe the Poisson Distribution.

Models counts of rare events over a fixed interval. Defined for \(x \in \{0, 1, 2, ...\}\) with rate parameter \(\lambda > 0\): $$\text{Poi}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$$ Mean: \(\lambda\), Variance: \(\lambda\).

Source: Murphy, Ch 2.3.3, Eq 2.39.
What is the Gaussian (Normal) Distribution (Univariate)?

A continuous distribution defined by mean \(\mu\) and variance \(\sigma^2\): $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}$$ It has maximum entropy for a given variance. Precision is defined as \(\beta = 1/\sigma^2\).

Source: Bishop, Ch 1.2.4, Eq 1.46; Murphy, Ch 2.4.1.
What is the Multivariate Gaussian Distribution?

For a \(D\)-dimensional vector \(\mathbf{x}\): $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$ where \(\boldsymbol{\mu}\) is the mean vector and \(\mathbf{\Sigma}\) is the \(D \times D\) covariance matrix. The term in the exponent is the squared Mahalanobis distance.

Source: Bishop, Ch 2.3, Eq 2.43; Murphy, Ch 2.5.2.
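
A sketch that evaluates the density directly from this formula and checks it against scipy; the values of \(\boldsymbol{\mu}\) and \(\mathbf{\Sigma}\) are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

D = len(mu)
diff = x - mu
maha2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
pdf_manual = norm_const * np.exp(-0.5 * maha2)

assert np.isclose(pdf_manual, multivariate_normal(mu, Sigma).pdf(x))
```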
What are the properties of the Covariance Matrix \(\mathbf{\Sigma}\) in a Multivariate Gaussian?

\(\mathbf{\Sigma}\) must be symmetric and positive semi-definite (positive definite for the density to be well-defined). Its eigendecomposition \(\mathbf{\Sigma} = \sum \lambda_i \mathbf{u}_i \mathbf{u}_i^T\) defines the orientation (eigenvectors \(\mathbf{u}_i\)) and elongation (eigenvalues \(\lambda_i\)) of the elliptical constant-density contours.

Source: Bishop, Ch 2.3, Eq 2.45-2.48; Murphy, Ch 2.5.2.
What is the Central Limit Theorem?

The theorem states that the sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables (with finite mean and variance) tends to a Gaussian distribution, regardless of the original distribution of the variables.

Source: Bishop, Ch 2.3; Murphy, Ch 2.6.3.
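
A minimal simulation of the theorem: standardized averages of uniform draws (a distinctly non-Gaussian base distribution) behave like \(\mathcal{N}(0, 1)\); the values of \(N\) and the number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1_000, 50_000
means = rng.uniform(0, 1, size=(trials, N)).mean(axis=1)

# Standardize using the uniform's mean 1/2 and variance 1/12.
z = (means - 0.5) / np.sqrt(1.0 / (12 * N))
print(z.mean(), z.std())   # ~0 and ~1; a histogram of z looks Gaussian
```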
Describe the Beta Distribution.

A continuous distribution over \(\mu \in [0, 1]\), often used as a prior for Bernoulli parameters. $$\text{Beta}(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}$$ Mean: \(a/(a+b)\). It reduces to the uniform distribution when \(a=b=1\).

Source: Bishop, Ch 2.1.1, Eq 2.13; Murphy, Ch 2.4.5.
Describe the Dirichlet Distribution.

A multivariate generalization of the Beta distribution over the \(K\)-dimensional probability simplex. $$\text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\dots\Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k - 1}$$ where \(\alpha_0 = \sum \alpha_k\). It is the conjugate prior for the Multinomial distribution.

Source: Bishop, Ch 2.2.1, Eq 2.38; Murphy, Ch 2.5.4.
Describe the Gamma Distribution.

A flexible distribution for positive real variables \(\lambda > 0\). $$\text{Gam}(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)$$ Often used as a conjugate prior for the precision (inverse variance) of a Gaussian.

Source: Bishop, Ch 2.3.6, Eq 2.146; Murphy, Ch 2.4.4.
What is Student's t-distribution and why is it useful?

It is a heavy-tailed distribution derived by integrating out the precision of a Gaussian with a Gamma prior. $$\text{St}(x|\mu, \lambda, \nu) \propto \left[ 1 + \frac{\lambda(x-\mu)^2}{\nu} \right]^{-(\nu+1)/2}$$ It is more robust to outliers than the Gaussian because its density decays polynomially rather than exponentially in the tails.

Source: Bishop, Ch 2.3.7, Eq 2.159; Murphy, Ch 2.5.3.
What is Maximum Likelihood Estimation (MLE)?

A frequentist method for parameter estimation. It chooses the parameters \(\boldsymbol{\theta}\) that maximize the likelihood function \(p(\mathcal{D}|\boldsymbol{\theta})\), i.e., making the observed data \(\mathcal{D}\) most probable. $$\hat{\boldsymbol{\theta}}_{MLE} = \arg\max_{\boldsymbol{\theta}} \log p(\mathcal{D}|\boldsymbol{\theta})$$

Source: Bishop, Ch 1.2.5; Murphy, Ch 6.2.2.
What is the MLE for the mean \(\mu\) and variance \(\sigma^2\) of a Gaussian?

Given \(N\) data points \(x_n\): $$\mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n \quad \text{(Sample Mean)}$$ $$\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})^2 \quad \text{(Sample Variance)}$$ Note: \(\sigma^2_{ML}\) is a biased estimator of the true variance.

Source: Bishop, Ch 1.2.4, Eq 1.55-1.56; Murphy, Ch 4.1.3.
What is bias in an estimator? Give an example.

Bias is the difference between the expected value of the estimator and the true parameter value: \(\text{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\). The MLE for Gaussian variance is biased because \(\mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2\). An unbiased estimator uses \(N-1\) in the denominator.

Source: Bishop, Ch 1.2.4; Murphy, Ch 6.4.2.
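
A simulation sketch of this bias: averaging \(\sigma^2_{ML}\) over many repeated samples of size \(N=5\) from a unit-variance Gaussian recovers roughly \((N-1)/N = 0.8\) (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
x = rng.normal(0.0, 1.0, size=(trials, N))   # true variance is 1

# MLE variance with the 1/N normalizer, computed per trial.
sigma2_ml = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_ml.mean())                 # ~0.8 = (N-1)/N: biased low
print(sigma2_ml.mean() * N / (N - 1))   # ~1.0 after the unbiased correction
```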
What is Maximum A Posteriori (MAP) estimation?

A method that estimates parameters by maximizing the posterior distribution \(p(\boldsymbol{\theta}|\mathcal{D})\). It incorporates a prior \(p(\boldsymbol{\theta})\): $$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D}) = \arg\max_{\boldsymbol{\theta}} [ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) ]$$ The log prior acts as a regularizer.

Source: Bishop, Ch 1.2.5; Murphy, Ch 5.2.1.
What is a Conjugate Prior?

A prior is conjugate to a likelihood function if the posterior distribution has the same functional form as the prior. This allows for closed-form Bayesian updates.
Example: Beta prior + Bernoulli likelihood \(\rightarrow\) Beta posterior.

Source: Bishop, Ch 2.4.2; Murphy, Ch 3.3.2.
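
A minimal sketch of the Beta-Bernoulli example: the posterior is obtained in closed form by adding the observed counts to the prior hyperparameters (the prior values and flips below are made up):

```python
import numpy as np

a, b = 2.0, 2.0                           # illustrative Beta prior
flips = np.array([1, 0, 1, 1, 1, 0, 1])   # made-up Bernoulli observations

heads = flips.sum()
tails = len(flips) - heads

# Conjugacy: Beta(a, b) prior + Bernoulli likelihood -> Beta(a + heads, b + tails).
a_post, b_post = a + heads, b + tails
print(a_post / (a_post + b_post))         # posterior mean of mu
```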
Give the conjugate priors for the parameters of:
1. Bernoulli
2. Multinomial
3. Gaussian Mean (known variance)
4. Gaussian Variance (known mean)

1. Beta distribution
2. Dirichlet distribution
3. Gaussian distribution
4. Gamma distribution (for the precision), or equivalently Inverse-Gamma (for the variance)

Source: Bishop, Ch 2; Murphy, Ch 3 & 4.
What is the Bayesian Occam's Razor?

The principle that Bayesian model selection automatically penalizes complex models. Complex models spread their probability mass over a larger space of datasets. Thus, for a specific dataset, a simpler model (that puts more mass on that dataset) often has a higher marginal likelihood (evidence) \(p(\mathcal{D}|M)\).

Source: Murphy, Ch 5.3.1; Bishop Ch 3.4.
What is the Posterior Predictive Distribution?

The distribution of a new data point \(x_{new}\) given the observed data \(\mathcal{D}\), obtained by marginalizing out the parameters \(\boldsymbol{\theta}\): $$p(x_{new}|\mathcal{D}) = \int p(x_{new}|\boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) d\boldsymbol{\theta}$$ This averages predictions over all possible parameter settings weighted by their posterior probabilities.

Source: Bishop, Ch 1.2.6; Murphy, Ch 3.2.4.
Define Entropy for a discrete random variable.

Entropy \(H[x]\) measures the average amount of information (or uncertainty) in a random variable \(x\): $$H[x] = -\sum_x p(x) \ln p(x)$$ It is maximized by a uniform distribution.

Source: Bishop, Ch 1.6, Eq 1.98; Murphy, Ch 2.8.1.
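
A two-line computation of \(H[x]\) in nats for a couple of discrete distributions, illustrating that the uniform attains the maximum \(\ln K\):

```python
import numpy as np

def entropy(p):
    """H[x] = -sum_x p(x) ln p(x), in nats (assumes all p(x) > 0)."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4 ~ 1.386: the maximum for K=4
print(entropy([0.70, 0.10, 0.10, 0.10]))  # smaller: less uncertainty
```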
What is KL Divergence (Relative Entropy)?

A measure of dissimilarity between two probability distributions \(p(x)\) and \(q(x)\): $$\text{KL}(p||q) = -\int p(x) \ln \frac{q(x)}{p(x)} dx$$ Properties: \(\text{KL}(p||q) \ge 0\) (Gibbs' inequality), and \(\text{KL}(p||q) = 0\) iff \(p=q\). It is asymmetric.

Source: Bishop, Ch 1.6.1, Eq 1.113; Murphy, Ch 2.8.2.
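
A short sketch of discrete KL divergence that also shows the asymmetry; \(p\) and \(q\) are arbitrary distributions over three states:

```python
import numpy as np

def kl(p, q):
    """KL(p||q) = sum_x p(x) ln(p(x)/q(x)), in nats (assumes matching supports)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))   # both >= 0 and generally unequal
```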
Define Mutual Information.

A measure of the dependence between two variables \(x\) and \(y\). It is the KL divergence between the joint distribution and the product of marginals: $$I[x, y] = \text{KL}(p(x,y) || p(x)p(y)) = H[x] - H[x|y]$$ It represents the reduction in uncertainty about \(x\) given \(y\).

Source: Bishop, Ch 1.6.1, Eq 1.120; Murphy, Ch 2.8.3.
How do you compute the marginal distribution \(p(x_a)\) from a partitioned joint Gaussian \(p(x_a, x_b)\)?

If \(p(x_a, x_b) = \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right)\), then the marginal is simply: $$p(x_a) = \mathcal{N}(x_a | \mu_a, \Sigma_{aa})$$ Marginalization in Gaussians is trivial: just drop the other variables.

Source: Bishop, Ch 2.3.2; Murphy, Ch 4.3.
How do you compute the conditional distribution \(p(x_a|x_b)\) from a joint Gaussian?

For the joint Gaussian defined in the previous card, the conditional \(p(x_a|x_b)\) is also Gaussian with:
Mean: \(\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\)
Covariance: \(\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\)
Note: the mean depends linearly on \(x_b\), while the covariance does not depend on \(x_b\).

Source: Bishop, Ch 2.3.1, Eq 2.81-2.82; Murphy, Ch 4.3.
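
A direct implementation of these two formulas on a 3-D example; the partition (one variable conditioned on the other two) and all parameter values are arbitrary:

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.3, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.2, 0.4, 1.5]])

a, b = [0], [1, 2]             # partition: x_a = x[0], x_b = (x[1], x[2])
x_b = np.array([1.5, 1.0])     # observed value of x_b

# Gain matrix Sigma_ab Sigma_bb^{-1}, shared by both formulas.
K = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])

mu_cond = mu[a] + K @ (x_b - mu[b])                         # conditional mean
Sigma_cond = Sigma[np.ix_(a, a)] - K @ Sigma[np.ix_(b, a)]  # conditional covariance
print(mu_cond, Sigma_cond)
```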
What is the Product of two Gaussians?

The product of two Gaussian PDFs is proportional to another Gaussian PDF. \(\mathcal{N}(x|\mu_1, \Sigma_1)\mathcal{N}(x|\mu_2, \Sigma_2) \propto \mathcal{N}(x|\mu_{new}, \Sigma_{new})\). This property is key for Bayesian updates where prior and likelihood are both Gaussian.

Source: Murphy, Ch 4.3.3.
What is the Linear Gaussian Model?

A system where: $$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x)$$ $$p(y|x) = \mathcal{N}(y|Ax + b, \Sigma_y)$$ The marginal \(p(y)\) and posterior \(p(x|y)\) are also Gaussian. This forms the basis for Factor Analysis, PCA, and Kalman Filters.

Source: Bishop, Ch 2.3.3; Murphy, Ch 4.4.
What is the Normal-Inverse-Wishart (NIW) distribution?

It is the conjugate prior for a multivariate Gaussian with unknown mean \(\mu\) and unknown covariance \(\Sigma\). $$p(\mu, \Sigma) = \text{NIW}(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0) = \mathcal{N}(\mu | m_0, \frac{1}{\kappa_0}\Sigma) \times \text{IW}(\Sigma | S_0, \nu_0)$$ It combines a Gaussian prior on the mean (conditioned on \(\Sigma\)) and an Inverse-Wishart prior on the covariance.

Source: Murphy, Ch 4.6.3.2, Eq 4.200.
What is the Wishart Distribution?

A distribution over symmetric positive-definite matrices (like covariance matrices). It is the conjugate prior for the precision matrix (inverse covariance) of a Gaussian. If \(\mathbf{S} \sim \text{Wi}(\mathbf{\Sigma}, \nu)\), then \(\mathbf{S}\) is distributed as the scatter matrix \(\sum_{i=1}^\nu \mathbf{x}_i \mathbf{x}_i^T\) where \(\mathbf{x}_i \sim \mathcal{N}(0, \mathbf{\Sigma})\).

Source: Bishop, Ch 2.3.6; Murphy, Ch 4.5.
What is the "Curse of Dimensionality"?

In high-dimensional spaces, volume grows exponentially with dimension. Data becomes sparse, and distance measures become less meaningful (all points are roughly equidistant). For a Gaussian, probability mass is concentrated in a thin shell at a large radius from the mean.

Source: Bishop, Ch 1.4; Murphy, Ch 1.4.3.
Explain Decision Theory and the concept of Risk.

Decision theory combines probabilities with a Loss Function \(L(y, a)\) (cost of taking action \(a\) when truth is \(y\)). The goal is to minimize Risk (expected loss): $$\mathbb{E}[L] = \sum_y \int L(y, a(x)) p(x, y) dx$$ For 0-1 loss, this leads to picking the class with the highest posterior probability.

Source: Bishop, Ch 1.5; Murphy, Ch 5.7.
What is the Naive Bayes Classifier?

A generative classifier that assumes features \(x_1, \dots, x_D\) are conditionally independent given the class label \(y\): $$p(\mathbf{x}|y) = \prod_{j=1}^D p(x_j|y)$$ The classifier predicts the \(y\) maximizing \(p(y)\prod_j p(x_j|y)\). It is "naive" because the conditional-independence assumption rarely holds in practice.

Source: Murphy, Ch 3.5.
What is the Log-Sum-Exp trick?

A numerical technique to compute \(\log \sum_i \exp(x_i)\) without underflow or overflow. Identity: $$\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m)$$ where \(m = \max_i x_i\). Used in Naive Bayes and HMMs.

Source: Murphy, Ch 3.5.3.
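
A minimal implementation of the identity (scipy.special.logsumexp provides the same functionality):

```python
import numpy as np

def log_sum_exp(x):
    """Compute log(sum_i exp(x_i)) stably by shifting by the max."""
    x = np.asarray(x, float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Naive evaluation underflows to -inf; the shifted version is fine.
print(log_sum_exp([-1000.0, -1001.0]))   # ~-999.69
```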
What is the Exponential Family of distributions?

A class of distributions whose PDF can be written as: $$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x}) g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$$ where \(\boldsymbol{\eta}\) are natural parameters and \(\mathbf{u}(\mathbf{x})\) are sufficient statistics. Includes Gaussian, Exponential, Gamma, Beta, Bernoulli, etc.

Source: Bishop, Ch 2.4, Eq 2.194; Murphy, Ch 9.2.
What are Sufficient Statistics?

A statistic \(\phi(D)\) is sufficient for a parameter \(\theta\) if \(p(D|\theta)\) depends on the data \(D\) only through \(\phi(D)\): $$p(D|\theta) = g(\phi(D), \theta)h(D)$$ For the exponential family, \(\sum_n \mathbf{u}(x_n)\) is a sufficient statistic.

Source: Bishop, Ch 2.4.1; Murphy, Ch 3.3.1.
What is a Gaussian Mixture Model (GMM)?

A probabilistic model that assumes data is generated from a mixture of \(K\) Gaussian distributions. $$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)$$ where \(\pi_k\) are mixing coefficients (\(\sum \pi_k = 1\)). It is a latent variable model.

Source: Bishop, Ch 2.3.9; Murphy, Ch 11.2.1.
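
A sketch evaluating a 1-D GMM density as the weighted sum above; the mixing weights, means, and standard deviations are arbitrary illustrative values:

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.5, 0.3, 0.2])       # mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])     # component means
sigmas = np.array([0.5, 1.0, 0.8])   # component standard deviations

def gmm_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(p * norm(m, s).pdf(x) for p, m, s in zip(pi, mus, sigmas))

print(gmm_pdf(0.0))
```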
What is the EM (Expectation-Maximization) Algorithm?

An iterative algorithm to find MLE or MAP estimates in models with latent variables (like GMMs).
E-step: Compute posterior of latent variables given current parameters.
M-step: Update parameters to maximize expected log likelihood under distributions found in E-step.

Source: Bishop, Ch 9; Murphy, Ch 11.4.
What is Bayesian Model Selection and the Evidence?

Choosing a model \(M_i\) based on the posterior \(p(M_i|\mathcal{D}) \propto p(\mathcal{D}|M_i)p(M_i)\). The term \(p(\mathcal{D}|M_i)\) is the Model Evidence (or Marginal Likelihood), computed by integrating out parameters: $$p(\mathcal{D}|M_i) = \int p(\mathcal{D}|\theta, M_i)p(\theta|M_i)d\theta$$ It automatically penalizes model complexity (Bayesian Occam's Razor).

Source: Bishop, Ch 3.4; Murphy, Ch 5.3.
What are BIC and AIC?

Criteria for model selection that approximate the model evidence.
BIC (Bayesian Information Criterion): \(\ln p(\mathcal{D}) \approx \ln p(\mathcal{D}|\hat{\theta}) - \frac{M}{2}\ln N\)
AIC (Akaike Information Criterion): \(\ln p(\mathcal{D}|\hat{\theta}) - M\) where \(M\) is # of parameters, \(N\) is # of data points. BIC penalizes complexity more strongly for large \(N\).

Source: Bishop, Ch 1.3, 4.4.1; Murphy, Ch 5.3.2.4.
What is a Non-informative Prior?

A prior intended to have minimal influence on the posterior, letting data dominate. Examples include the uniform distribution or Jeffreys Prior (which is invariant to re-parameterization). They are often "improper" (do not integrate to 1).

Source: Bishop, Ch 2.4.3; Murphy, Ch 5.4.1.
What is the Mahalanobis Distance?

A distance measure that accounts for correlations in data: $$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$ It appears in the exponent of the Multivariate Gaussian. It reduces to Euclidean distance if \(\mathbf{\Sigma} = \mathbf{I}\).

Source: Bishop, Ch 2.3, Eq 2.44; Murphy, Ch 4.1.2.
What is the Inverse-Wishart Distribution?

A distribution over positive definite matrices, which acts as the conjugate prior for the covariance matrix of a Gaussian (when the mean is known). It generalizes the Inverse-Gamma distribution.

Source: Murphy, Ch 4.5.1.
What is the relationship between Bernoulli and Binomial distributions?

The Bernoulli distribution models a single binary trial (0 or 1). The Binomial distribution models the sum of \(N\) independent Bernoulli trials. If \(N=1\), Binomial reduces to Bernoulli.

Source: Bishop, Ch 2.1; Murphy, Ch 2.3.1.
What is the Logistic Sigmoid Function and its derivative?

\(\sigma(x) = \frac{1}{1 + e^{-x}}\).
It maps \((-\infty, \infty)\) to \((0, 1)\).
Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).
Used in Logistic Regression and Neural Networks.

Source: Bishop, Ch 4.3.2; Murphy, Ch 1.4.6.
What is the Softmax Function?

A generalization of the sigmoid to \(K\) classes. It converts a vector of real numbers \(\mathbf{z}\) into a probability distribution: $$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$ Used in Multiclass Logistic Regression and Neural Networks.

Source: Bishop, Ch 4.3.4; Murphy, Ch 4.2.2.
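
A numerically stable softmax sketch; subtracting the max logit (the same shift as in the log-sum-exp trick) leaves the output unchanged because softmax is invariant to adding a constant to every \(z_k\):

```python
import numpy as np

def softmax(z):
    """Stable softmax: exponentiate shifted logits, then normalize."""
    z = np.asarray(z, float)
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))   # a probability vector; sums to 1
```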
What is a Consistent Estimator?

An estimator \(\hat{\theta}_N\) is consistent if it converges in probability to the true parameter \(\theta\) as the sample size \(N \to \infty\). The MLE is consistent under mild regularity conditions.

Source: Murphy, Ch 6.4.1.
What is the Jacobian Matrix?

For a function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian \(\mathbf{J}\) is an \(m \times n\) matrix of all first-order partial derivatives: \(J_{ij} = \frac{\partial f_i}{\partial x_j}\). It is used in the Change of Variables formula for probability densities.

Source: Bishop, Ch 1.2.1; Murphy, Ch 2.6.2.1.
What is the Change of Variables Formula for PDFs?

If \(\mathbf{y} = f(\mathbf{x})\) is a bijective transformation, the PDF of \(\mathbf{y}\) is: $$p_y(\mathbf{y}) = p_x(\mathbf{x}) \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{y}} \right|$$ where the term in absolute values is the Jacobian determinant of the inverse transformation.

Source: Bishop, Ch 1.2.1; Murphy, Ch 2.6.2.
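
A small check of the formula for \(y = e^x\) with \(x \sim \mathcal{N}(0, 1)\): here \(|dx/dy| = 1/y\), and the result should match the standard log-normal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

y = 1.5
# p_y(y) = p_x(ln y) |dx/dy| with x = ln y, so |dx/dy| = 1/y.
pdf_via_change_of_vars = norm.pdf(np.log(y)) / y

assert np.isclose(pdf_via_change_of_vars, lognorm(s=1.0).pdf(y))
```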
What is the Bias-Variance Decomposition?

The expected squared error of an estimator can be decomposed into: $$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$ High bias implies underfitting (model too simple). High variance implies overfitting (model too complex).

Source: Bishop, Ch 3.2; Murphy, Ch 6.4.4.
What is a Sample Space (\(\Omega\))?

The set of all possible outcomes of a random experiment. For example, in a coin toss \(\Omega = \{H, T\}\); for a die roll \(\Omega = \{1, 2, 3, 4, 5, 6\}\).

Source: Murphy, Ch 2.2.1.
What is an Event?

A subset of the sample space \(\Omega\). An event \(A\) occurs if the outcome of the experiment is an element of \(A\). For example, "rolling an even number" is the event \(A = \{2, 4, 6\}\).

Source: Murphy, Ch 2.2.1.
What are the Axioms of Probability?

1. Non-negativity: \(P(A) \ge 0\) for any event \(A\).
2. Normalization: \(P(\Omega) = 1\).
3. Additivity: If \(A_1, A_2, \dots\) are disjoint (mutually exclusive) events, then \(P(\cup_i A_i) = \sum_i P(A_i)\).

Source: Standard Probability Theory.
How do you calculate the probability of the Union of two events?

For any two events \(A\) and \(B\): $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$ If \(A\) and \(B\) are mutually exclusive (\(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\).

Source: Murphy, Ch 2.2.2.1.
What is Conditional Probability?

The probability of event \(A\) occurring given that event \(B\) has occurred, denoted \(P(A|B)\). $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ Defined only when \(P(B) > 0\).

Source: Murphy, Ch 2.2.2.3.
What is Marginal Probability?

The probability of a variable taking a particular value irrespective of the values of the other variables. For discrete variables \(X, Y\): $$P(X=x) = \sum_y P(X=x, Y=y)$$ It is obtained by "summing out" the other variables from the joint distribution.

Source: Bishop, Ch 1.2.
What is Linearity of Expectation?

For any random variables \(X\) and \(Y\) (dependent or independent) and constants \(a, b\): $$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$

Source: Murphy, Ch 2.6.1.
What is the Variance Sum Law for independent variables?

If \(X\) and \(Y\) are independent random variables: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$ If they are dependent: \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)\).

Source: Murphy, Exercise 2.3.
What is the Standard Deviation?

The square root of the variance: $$\sigma_X = \sqrt{\text{Var}(X)}$$ It measures the spread of the distribution in the same units as the random variable itself.

Source: Murphy, Ch 2.2.7.
What is the Correlation Coefficient?

A normalized measure of linear dependence between two variables \(X\) and \(Y\), denoted \(\rho_{XY}\): $$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$ It ranges from -1 to 1. \(\rho=0\) means uncorrelated.

Source: Murphy, Ch 2.5.1.
What does Uncorrelated mean vs. Independent?

Two variables are uncorrelated if \(\text{Cov}(X, Y) = 0\). They are independent if \(P(X, Y) = P(X)P(Y)\). Independence implies uncorrelatedness, but uncorrelatedness does not imply independence (unless variables are jointly Gaussian).

Source: Murphy, Ch 2.5.1.
What does I.I.D. stand for?

Independent and Identically Distributed. A sequence of random variables is i.i.d. if every variable has the same probability distribution and the variables are mutually independent. This is a common assumption for data points in machine learning.

Source: Murphy, Ch 2.6.3; Bishop, Ch 1.2.4.
What is the Likelihood Function?

The probability of the observed data \(\mathcal{D}\) viewed as a function of the parameters \(\theta\): $$L(\theta) = p(\mathcal{D}|\theta)$$ It is not a probability distribution over \(\theta\) (does not integrate to 1).

Source: Bishop, Ch 1.2.3; Murphy, Ch 2.2.3.1.
What is the Posterior Probability?

The probability of the parameters \(\theta\) after observing data \(\mathcal{D}\): $$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}$$ It combines the prior belief with the likelihood of the observed data.

Source: Bishop, Ch 1.2.3; Murphy, Ch 2.2.3.1.
What is the Prior Probability?

The probability distribution \(p(\theta)\) over parameters \(\theta\) before any data is observed. It expresses initial beliefs or assumptions about the parameters.

Source: Bishop, Ch 1.2.3; Murphy, Ch 2.2.3.1.
What is the Mode of a distribution?

The value of the random variable with the highest probability (discrete) or probability density (continuous). For a symmetric unimodal distribution such as the Gaussian, the mode coincides with the mean.

Source: Murphy, Ch 2.2.7.
What is the Median of a distribution?

The value \(m\) such that the probability of being less than or equal to \(m\) is 0.5. For a symmetric distribution like Gaussian, median = mean = mode.

Source: Murphy, Ch 2.2.6.
What is the Categorical Distribution?

A discrete distribution describing the result of a single trial with \(K\) possible outcomes (e.g., rolling a K-sided die once). Often represented using a 1-of-K encoding vector \(\mathbf{x}\) where \(p(\mathbf{x}|\boldsymbol{\mu}) = \prod \mu_k^{x_k}\). It is a special case of the Multinomial with \(N=1\).

Source: Murphy, Ch 2.3.2.
What is the Uniform Distribution?

Continuous: \(p(x) = \frac{1}{b-a}\) for \(a \le x \le b\), 0 otherwise.
Discrete: \(p(x=k) = \frac{1}{K}\) for \(k=1 \dots K\). It represents maximum ignorance (maximum entropy) over a bounded interval or finite set.

Source: Murphy, Ch 2.2.5.
What is the Chain Rule of Probability?

Allows expressing a joint distribution as a product of conditional distributions: $$p(X_1, X_2, \dots, X_N) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) \dots p(X_N|X_1, \dots, X_{N-1})$$ Valid for any ordering of variables.

Source: Murphy, Ch 2.2.2.2.