Exam Preparation Flashcards
The Sum Rule states that the marginal probability of \(X\) is obtained by summing the joint probability over all states of \(Y\): $$p(X) = \sum_Y p(X, Y)$$ The Product Rule states that the joint probability can be decomposed into a conditional and a marginal: $$p(X, Y) = p(Y|X)p(X)$$
Bayes' Theorem is given by: $$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}$$ where \(p(Y|X)\) is the posterior, \(p(X|Y)\) is the likelihood, and \(p(Y)\) is the prior. The denominator \(p(X)\) acts as a normalization constant to ensure the posterior probabilities sum to 1. It can be computed using the sum rule: $$p(X) = \sum_Y p(X|Y)p(Y)$$
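A minimal numeric sketch of the update (the prevalence and test rates below are illustrative assumptions, not from any card):

```python
# p(Y=1)=0.01 (prior), p(X=1|Y=1)=0.9 (likelihood), p(X=1|Y=0)=0.05 (false-positive rate)
prior = {1: 0.01, 0: 0.99}
likelihood = {1: 0.9, 0: 0.05}  # p(X=1 | Y=y)

# Sum rule for the evidence: p(X=1) = sum_y p(X=1|y) p(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: p(Y=1 | X=1)
posterior = likelihood[1] * prior[1] / evidence
print(round(posterior, 4))  # ~0.1538: a positive test is still most likely a false alarm
```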
For a continuous random variable \(x\), the PDF \(p(x)\) is a function such that the probability of \(x\) falling in an infinitesimal interval \((x, x + \delta x)\) is given by \(p(x)\delta x\). It must satisfy \(p(x) \ge 0\) and \(\int_{-\infty}^{\infty} p(x)dx = 1\). The probability that \(x\) lies in an interval \((a, b)\) is \(\int_{a}^{b} p(x)dx\).
The CDF, denoted \(P(z)\) or \(F(z)\), is the probability that the random variable \(X\) takes a value less than or equal to \(z\): $$P(z) = p(X \le z) = \int_{-\infty}^{z} p(x)dx$$ The PDF is the derivative of the CDF: \(p(x) = \frac{d}{dx}P(x)\).
The expectation (average value) of a function \(f(x)\) under a probability distribution \(p(x)\) is denoted \(\mathbb{E}[f]\).
For discrete variables:
$$\mathbb{E}[f] = \sum_x p(x)f(x)$$
For continuous variables:
$$\mathbb{E}[f] = \int p(x)f(x)dx$$
Variance measures the variability of \(f(x)\) around its mean: $$\text{var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$ Covariance measures the extent to which two variables \(x\) and \(y\) vary together: $$\text{cov}[x, y] = \mathbb{E}_{x,y}[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$
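A quick sketch verifying both identities on random samples (finite-sample estimates, so agreement is numerical rather than exact):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=100_000)

var_direct = np.mean((x - x.mean()) ** 2)
var_moment = np.mean(x ** 2) - x.mean() ** 2          # E[x^2] - E[x]^2
cov_direct = np.mean((x - x.mean()) * (y - y.mean()))
cov_moment = np.mean(x * y) - x.mean() * y.mean()     # E[xy] - E[x]E[y]

assert np.isclose(var_direct, var_moment)
assert np.isclose(cov_direct, cov_moment)
```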
Two random variables \(X\) and \(Y\) are independent, denoted \(X \perp Y\), if their joint distribution factorizes into the product of their marginals: $$p(X, Y) = p(X)p(Y)$$ This implies that \(p(Y|X) = p(Y)\) and \(p(X|Y) = p(X)\). Covariance between independent variables is zero.
\(X\) and \(Y\) are conditionally independent given \(Z\) (denoted \(X \perp Y | Z\)) if the conditional joint distribution factorizes: $$p(X, Y | Z) = p(X|Z)p(Y|Z)$$ This means that if we know \(Z\), knowing \(Y\) gives no extra information about \(X\).
A distribution for a single binary random variable \(x \in \{0, 1\}\). It is governed by a parameter \(\mu\) (probability of \(x=1\)):
$$\text{Bern}(x|\mu) = \mu^x (1-\mu)^{1-x}$$
Mean: \(\mathbb{E}[x] = \mu\)
Variance: \(\text{var}[x] = \mu(1-\mu)\)
Models the number of observations \(m\) of \(x=1\) (heads) in a set of \(N\) independent Bernoulli trials.
$$\text{Bin}(m|N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$$
Mean: \(\mathbb{E}[m] = N\mu\)
Variance: \(\text{var}[m] = N\mu(1-\mu)\)
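A sanity-check sketch of the mean/variance formulas using SciPy (the values of \(N\) and \(\mu\) are arbitrary):

```python
import numpy as np
from scipy.stats import binom

N, mu = 20, 0.3
dist = binom(N, mu)
assert np.isclose(dist.mean(), N * mu)            # E[m] = N*mu
assert np.isclose(dist.var(), N * mu * (1 - mu))  # var[m] = N*mu*(1-mu)
print(dist.pmf(6))  # probability of exactly 6 heads in 20 trials
```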
Generalization of the Binomial to \(K\) mutually exclusive states. For \(N\) trials, with \(m_k\) being the count of state \(k\): $$\text{Mult}(m_1, ..., m_K|\boldsymbol{\mu}, N) = \frac{N!}{m_1! \dots m_K!} \prod_{k=1}^K \mu_k^{m_k}$$ where \(\sum \mu_k = 1\) and \(\sum m_k = N\).
Models counts of rare events over a fixed interval. Defined for \(x \in \{0, 1, 2, ...\}\) with rate parameter \(\lambda > 0\): $$\text{Poi}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$$ Mean: \(\lambda\), Variance: \(\lambda\).
A continuous distribution defined by mean \(\mu\) and variance \(\sigma^2\): $$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}$$ It has maximum entropy for a given variance. Precision is defined as \(\beta = 1/\sigma^2\).
For a \(D\)-dimensional vector \(\mathbf{x}\): $$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$ where \(\boldsymbol{\mu}\) is the mean vector and \(\mathbf{\Sigma}\) is the \(D \times D\) covariance matrix. The term in the exponent is the squared Mahalanobis distance.
\(\mathbf{\Sigma}\) must be symmetric and positive semi-definite (positive definite for the density to be well-defined). Its eigendecomposition \(\mathbf{\Sigma} = \sum \lambda_i \mathbf{u}_i \mathbf{u}_i^T\) defines the orientation (eigenvectors \(\mathbf{u}_i\)) and elongation (eigenvalues \(\lambda_i\)) of the elliptical constant-density contours.
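A sketch evaluating the density directly from the formula and checking it against SciPy (the mean and covariance are example values):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 0.5])

D = len(mu)
diff = x - mu
maha_sq = diff @ np.linalg.inv(Sigma) @ diff  # squared Mahalanobis distance
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
density = norm_const * np.exp(-0.5 * maha_sq)

assert np.isclose(density, multivariate_normal(mu, Sigma).pdf(x))
```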
The theorem states that the sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables (with finite mean and variance) tends to a Gaussian distribution, regardless of the original distribution of the variables.
A continuous distribution over \(\mu \in [0, 1]\), often used as a prior for Bernoulli parameters. $$\text{Beta}(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}$$ Mean: \(a/(a+b)\). It reduces to the uniform distribution when \(a=b=1\).
A multivariate generalization of the Beta distribution over the \(K\)-dimensional probability simplex. $$\text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\dots\Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k - 1}$$ where \(\alpha_0 = \sum \alpha_k\). It is the conjugate prior for the Multinomial distribution.
A flexible distribution for positive real variables \(\lambda > 0\). $$\text{Gam}(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)$$ Often used as a conjugate prior for the precision (inverse variance) of a Gaussian.
It is a heavy-tailed distribution derived by integrating out the precision of a Gaussian with a Gamma prior. $$\text{St}(x|\mu, \lambda, \nu) \propto \left[ 1 + \frac{\lambda(x-\mu)^2}{\nu} \right]^{-(\nu+1)/2}$$ It provides robustness to outliers compared to the Gaussian because probability decays polynomially rather than exponentially.
A frequentist method for parameter estimation. It chooses the parameters \(\boldsymbol{\theta}\) that maximize the likelihood function \(p(\mathcal{D}|\boldsymbol{\theta})\), i.e., making the observed data \(\mathcal{D}\) most probable. $$\hat{\boldsymbol{\theta}}_{MLE} = \arg\max_{\boldsymbol{\theta}} \log p(\mathcal{D}|\boldsymbol{\theta})$$
Given \(N\) data points \(x_n\): $$\mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n \quad \text{(Sample Mean)}$$ $$\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})^2 \quad \text{(Sample Variance)}$$ Note: \(\sigma^2_{ML}\) is a biased estimator of the true variance.
Bias is the difference between the expected value of the estimator and the true parameter value: \(\text{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\). The MLE for Gaussian variance is biased because \(\mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2\). An unbiased estimator uses \(N-1\) in the denominator.
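A simulation sketch of the bias: for \(N=5\) draws from \(\mathcal{N}(0,1)\), the MLE variance averages to roughly \((N-1)/N = 0.8\), while the \(N-1\) estimator averages to roughly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, N))

var_ml = samples.var(axis=1, ddof=0)        # divides by N (biased MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # divides by N-1 (unbiased)
print(var_ml.mean(), var_unbiased.mean())   # ~0.8 and ~1.0
```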
A method that estimates parameters by maximizing the posterior distribution \(p(\boldsymbol{\theta}|\mathcal{D})\). It incorporates a prior \(p(\boldsymbol{\theta})\): $$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D}) = \arg\max_{\boldsymbol{\theta}} [ \log p(\mathcal{D}|\boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) ]$$ The log prior acts as a regularizer.
A prior is conjugate to a likelihood function if the posterior distribution has the same functional form as the prior. This allows for closed-form Bayesian updates.
Example: Beta prior + Bernoulli likelihood \(\rightarrow\) Beta posterior.
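A minimal sketch of this closed-form update (the prior pseudo-counts and observed data are example values): observing \(m\) heads in \(N\) tosses turns a \(\text{Beta}(a, b)\) prior into a \(\text{Beta}(a+m, b+N-m)\) posterior.

```python
from scipy.stats import beta

a, b = 2, 2   # prior pseudo-counts (example values)
N, m = 10, 7  # observed: 7 heads in 10 tosses

posterior = beta(a + m, b + (N - m))  # Beta(9, 5)
print(posterior.mean())               # (a+m)/(a+b+N) = 9/14 ≈ 0.643
```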
1. Beta distribution (conjugate to the Bernoulli/Binomial)
2. Dirichlet distribution (conjugate to the Multinomial)
3. Gaussian distribution (conjugate to a Gaussian likelihood, for the mean with known variance)
4. Gamma (or Inverse-Gamma) distribution (conjugate to the Gaussian precision or variance)
The principle that Bayesian model selection automatically penalizes complex models. Complex models spread their probability mass over a larger space of datasets. Thus, for a specific dataset, a simpler model (that puts more mass on that dataset) often has a higher marginal likelihood (evidence) \(p(\mathcal{D}|M)\).
The distribution of a new data point \(x_{new}\) given the observed data \(\mathcal{D}\), obtained by marginalizing out the parameters \(\boldsymbol{\theta}\): $$p(x_{new}|\mathcal{D}) = \int p(x_{new}|\boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) d\boldsymbol{\theta}$$ This averages predictions over all possible parameter settings weighted by their posterior probabilities.
Entropy \(H[x]\) measures the average amount of information (or uncertainty) in a random variable \(x\): $$H[x] = -\sum_x p(x) \ln p(x)$$ It is maximized by a uniform distribution.
A measure of dissimilarity between two probability distributions \(p(x)\) and \(q(x)\): $$\text{KL}(p||q) = -\int p(x) \ln \frac{q(x)}{p(x)} dx$$ Properties: \(\text{KL}(p||q) \ge 0\) (Gibbs' inequality), and \(\text{KL}(p||q) = 0\) iff \(p=q\). It is asymmetric.
A measure of the dependence between two variables \(x\) and \(y\). It is the KL divergence between the joint distribution and the product of marginals: $$I[x, y] = \text{KL}(p(x,y) || p(x)p(y)) = H[x] - H[x|y]$$ It represents the reduction in uncertainty about \(x\) given \(y\).
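A sketch computing \(I[x, y]\) both ways for a small discrete joint (the table entries are arbitrary example values):

```python
import numpy as np

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])  # rows index x, columns index y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

mi_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))  # KL(p(x,y) || p(x)p(y))

H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))  # p_y broadcasts over rows: p(x|y)
assert np.isclose(mi_kl, H_x - H_x_given_y)       # I[x,y] = H[x] - H[x|y]
```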
If \(p(x_a, x_b) = \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right)\), then the marginal is simply: $$p(x_a) = \mathcal{N}(x_a | \mu_a, \Sigma_{aa})$$ Marginalization in Gaussians is trivial: just drop the other variables.
For the joint Gaussian defined in the previous card, the conditional \(p(x_a|x_b)\) is also Gaussian with:
Mean: \(\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\)
Covariance: \(\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\)
Note the mean depends linearly on \(x_b\), while the covariance is independent of \(x_b\).
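A 2D numeric sketch of these formulas, conditioning the first variable on an observed value of the second (all numbers are example values):

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x_b = 2.0  # observed value of the second variable

S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
mu_a_given_b = mu[0] + S_ab / S_bb * (x_b - mu[1])  # linear in x_b
var_a_given_b = S_aa - S_ab / S_bb * S_ba           # does not depend on x_b
print(mu_a_given_b, var_a_given_b)                  # 0.8 and 1.36
```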
The product of two Gaussian PDFs is proportional to another Gaussian PDF. \(\mathcal{N}(x|\mu_1, \Sigma_1)\mathcal{N}(x|\mu_2, \Sigma_2) \propto \mathcal{N}(x|\mu_{new}, \Sigma_{new})\). This property is key for Bayesian updates where prior and likelihood are both Gaussian.
A system where: $$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x)$$ $$p(y|x) = \mathcal{N}(y|Ax + b, \Sigma_y)$$ The marginal \(p(y)\) and posterior \(p(x|y)\) are also Gaussian. This forms the basis for Factor Analysis, PCA, and Kalman Filters.
It is the conjugate prior for a multivariate Gaussian with unknown mean \(\mu\) and unknown covariance \(\Sigma\). $$p(\mu, \Sigma) = \text{NIW}(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0) = \mathcal{N}(\mu | m_0, \frac{1}{\kappa_0}\Sigma) \times \text{IW}(\Sigma | S_0, \nu_0)$$ It combines a Gaussian prior on the mean (conditioned on \(\Sigma\)) and an Inverse-Wishart prior on the covariance.
A distribution over symmetric positive-definite matrices (like covariance matrices). It is the conjugate prior for the precision matrix (inverse covariance) of a Gaussian. If \(\mathbf{S} \sim \text{Wi}(\mathbf{\Sigma}, \nu)\), then \(\mathbf{S}\) is distributed as the scatter matrix \(\sum_{i=1}^\nu \mathbf{x}_i \mathbf{x}_i^T\) where \(\mathbf{x}_i \sim \mathcal{N}(0, \mathbf{\Sigma})\).
In high-dimensional spaces, volume grows exponentially with dimension. Data becomes sparse, and distance measures become less meaningful (all points are roughly equidistant). For a Gaussian, probability mass is concentrated in a thin shell at a large radius from the mean.
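A quick simulation sketch of the Gaussian shell effect: draws from \(\mathcal{N}(0, \mathbf{I}_D)\) concentrate at radius \(\approx \sqrt{D}\):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (2, 100, 10_000):
    radii = np.linalg.norm(rng.normal(size=(1000, D)), axis=1)
    print(D, radii.mean(), radii.std())  # mean ≈ sqrt(D); relative spread shrinks
```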
Decision theory combines probabilities with a Loss Function \(L(y, a)\) (cost of taking action \(a\) when truth is \(y\)). The goal is to minimize Risk (expected loss): $$\mathbb{E}[L] = \sum_y \int L(y, a(x)) p(x, y) dx$$ For 0-1 loss, this leads to picking the class with the highest posterior probability.
A generative classifier that assumes features \(x_1, \dots, x_D\) are conditionally independent given the class label \(y\): $$p(\mathbf{x}|y) = \prod_{j=1}^D p(x_j|y)$$ The classifier predicts the \(y\) that maximizes \(p(y)\prod_j p(x_j|y)\). It is "naive" because features are rarely independent in practice.
A numerical technique to compute \(\log \sum_i \exp(x_i)\) without underflow or overflow. Identity: $$\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m)$$ where \(m = \max_i x_i\). Used in Naive Bayes and HMMs.
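A minimal implementation sketch; the naive evaluation overflows for the inputs below:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))               # ~1002.41, finite
print(np.log(np.sum(np.exp(x))))  # inf: exp(1000) overflows
```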
A class of distributions whose PDF can be written as: $$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x}) g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$$ where \(\boldsymbol{\eta}\) are natural parameters and \(\mathbf{u}(\mathbf{x})\) are sufficient statistics. Includes Gaussian, Exponential, Gamma, Beta, Bernoulli, etc.
A statistic \(\phi(D)\) is sufficient for a parameter \(\theta\) if \(p(D|\theta)\) depends on the data \(D\) only through \(\phi(D)\): $$p(D|\theta) = g(\phi(D), \theta)h(D)$$ For the exponential family, \(\sum_n \mathbf{u}(x_n)\) is sufficient.
A probabilistic model that assumes data is generated from a mixture of \(K\) Gaussian distributions. $$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)$$ where \(\pi_k\) are mixing coefficients (\(\sum \pi_k = 1\)). It is a latent variable model.
An iterative algorithm to find MLE or MAP estimates in models with latent variables (like GMMs).
E-step: Compute posterior of latent variables given current parameters.
M-step: Update parameters to maximize expected log likelihood under distributions found in E-step.
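A compact EM sketch for a 1D two-component GMM, implementing the two steps above (the initial values and the synthetic data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

pi = np.array([0.5, 0.5])    # mixing coefficients
mu = np.array([-1.0, 1.0])   # component means
var = np.array([1.0, 1.0])   # component variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[n, k] = posterior of component k for point n
    r = pi * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(pi, mu, var)  # should recover roughly [0.6, 0.4], [-2, 3], [1, 0.25]
```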
Choosing a model \(M_i\) based on the posterior \(p(M_i|\mathcal{D}) \propto p(\mathcal{D}|M_i)p(M_i)\). The term \(p(\mathcal{D}|M_i)\) is the Model Evidence (or Marginal Likelihood), computed by integrating out parameters: $$p(\mathcal{D}|M_i) = \int p(\mathcal{D}|\theta, M_i)p(\theta|M_i)d\theta$$ It automatically penalizes model complexity (Bayesian Occam's Razor).
Criteria for model selection that approximate the model evidence.
BIC (Bayesian Information Criterion): \(\ln p(\mathcal{D}) \approx \ln p(\mathcal{D}|\hat{\theta}) - \frac{M}{2}\ln N\)
AIC (Akaike Information Criterion): \(\ln p(\mathcal{D}|\hat{\theta}) - M\)
where \(M\) is # of parameters, \(N\) is # of data points. BIC penalizes complexity more strongly for large \(N\).
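A sketch computing both criteria for a univariate Gaussian fit by MLE (\(M = 2\): mean and variance; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=500)
N, M = len(x), 2

mu_hat, var_hat = x.mean(), x.var()                    # MLE estimates
loglik = -0.5 * N * (np.log(2 * np.pi * var_hat) + 1)  # log p(D | theta_hat) at the MLE

bic = loglik - 0.5 * M * np.log(N)
aic = loglik - M
print(bic, aic)  # BIC's penalty grows with log N, AIC's does not
```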
A prior intended to have minimal influence on the posterior, letting data dominate. Examples include the uniform distribution or Jeffreys Prior (which is invariant to re-parameterization). They are often "improper" (do not integrate to 1).
A distance measure that accounts for correlations in data: $$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$ It appears in the exponent of the Multivariate Gaussian. It reduces to Euclidean distance if \(\mathbf{\Sigma} = \mathbf{I}\).
A distribution over positive definite matrices, which acts as the conjugate prior for the covariance matrix of a Gaussian (when the mean is known). It generalizes the Inverse-Gamma distribution.
The Bernoulli distribution models a single binary trial (0 or 1). The Binomial distribution models the sum of \(N\) independent Bernoulli trials. If \(N=1\), Binomial reduces to Bernoulli.
\(\sigma(x) = \frac{1}{1 + e^{-x}}\).
It maps \((-\infty, \infty)\) to \((0, 1)\).
Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).
Used in Logistic Regression and Neural Networks.
A generalization of the sigmoid to \(K\) classes. It converts a vector of real numbers \(\mathbf{z}\) into a probability distribution: $$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$ Used in Multiclass Logistic Regression and Neural Networks.
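Minimal sketches of both functions; the softmax subtracts the max before exponentiating, which leaves the result unchanged but avoids overflow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift-invariance: softmax(z) = softmax(z - c)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())    # non-negative entries summing to 1
print(sigmoid(0.0))  # 0.5
```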
An estimator \(\hat{\theta}_N\) is consistent if it converges in probability to the true parameter \(\theta\) as the sample size \(N \to \infty\). Under mild regularity conditions, the MLE is consistent.
For a function \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian \(\mathbf{J}\) is an \(m \times n\) matrix of all first-order partial derivatives: \(J_{ij} = \frac{\partial f_i}{\partial x_j}\). It is used in the Change of Variables formula for probability densities.
If \(\mathbf{y} = f(\mathbf{x})\) is a bijective transformation, the PDF of \(\mathbf{y}\) is: $$p_y(\mathbf{y}) = p_x(\mathbf{x}) \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{y}} \right|$$ where the term in absolute values is the Jacobian determinant of the inverse transformation.
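A sketch checking the formula on \(y = e^x\) with \(x \sim \mathcal{N}(0, 1)\): the inverse is \(x = \ln y\) with \(|dx/dy| = 1/y\), so \(p_y(y) = p_x(\ln y)/y\), the standard log-normal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

y = 2.5  # arbitrary test point
p_y = norm.pdf(np.log(y)) / y                  # change-of-variables formula
assert np.isclose(p_y, lognorm(s=1.0).pdf(y))  # matches the log-normal pdf
```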
The expected squared error of an estimator can be decomposed into: $$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$ High bias implies underfitting (model too simple). High variance implies overfitting (model too complex).
The set of all possible outcomes of a random experiment. For example, in a coin toss \(\Omega = \{H, T\}\); for a die roll \(\Omega = \{1, 2, 3, 4, 5, 6\}\).
A subset of the sample space \(\Omega\). An event \(A\) occurs if the outcome of the experiment is an element of \(A\). For example, "rolling an even number" is the event \(A = \{2, 4, 6\}\).
1. Non-negativity: \(P(A) \ge 0\) for any event \(A\).
2. Normalization: \(P(\Omega) = 1\).
3. Additivity: If \(A_1, A_2, \dots\) are disjoint (mutually exclusive) events, then \(P(\cup_i A_i) = \sum_i P(A_i)\).
For any two events \(A\) and \(B\): $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$ If \(A\) and \(B\) are mutually exclusive (\(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\).
The probability of event \(A\) occurring given that event \(B\) has occurred, denoted \(P(A|B)\). $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ Defined only when \(P(B) > 0\).
The probability of a single variable taking a given value, irrespective of the values of any other variables. For discrete variables \(X, Y\): $$P(X=x) = \sum_y P(X=x, Y=y)$$ It is obtained by "summing out" the other variables from the joint distribution.
For any random variables \(X\) and \(Y\) (dependent or independent) and constants \(a, b\): $$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$$
If \(X\) and \(Y\) are independent random variables: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$ If they are dependent: \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)\).
The square root of the variance: $$\sigma_X = \sqrt{\text{Var}(X)}$$ It measures the spread of the distribution in the same units as the random variable itself.
A normalized measure of linear dependence between two variables \(X\) and \(Y\), denoted \(\rho_{XY}\): $$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$ It ranges from -1 to 1. \(\rho=0\) means uncorrelated.
Two variables are uncorrelated if \(\text{Cov}(X, Y) = 0\). They are independent if \(P(X, Y) = P(X)P(Y)\). Independence implies uncorrelatedness, but uncorrelatedness does not imply independence (unless variables are jointly Gaussian).
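A sketch of the classic counterexample: \(X \sim \mathcal{N}(0,1)\) and \(Y = X^2\) are deterministically dependent, yet \(\text{Cov}(X, Y) = \mathbb{E}[X^3] = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x ** 2

print(np.cov(x, y)[0, 1])  # ~0: uncorrelated despite full dependence
```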
Independent and Identically Distributed. A sequence of random variables is i.i.d. if each variable has the same probability distribution and is mutually independent of the others. This is a common assumption for data points in machine learning.
The probability of the observed data \(\mathcal{D}\) viewed as a function of the parameters \(\theta\): $$L(\theta) = p(\mathcal{D}|\theta)$$ It is not a probability distribution over \(\theta\) (does not integrate to 1).
The probability of the parameters \(\theta\) after observing data \(\mathcal{D}\): $$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}$$ It combines the prior belief with the likelihood of the observed data.
The probability distribution \(p(\theta)\) over parameters \(\theta\) before any data is observed. It expresses initial beliefs or assumptions about the parameters.
The value of the random variable that has the highest probability (discrete) or probability density (continuous). For a symmetric unimodal distribution like the Gaussian, it coincides with the mean.
The value \(m\) such that the probability of being less than or equal to \(m\) is 0.5. For a symmetric distribution like Gaussian, median = mean = mode.
A discrete distribution describing the result of a single trial with \(K\) possible outcomes (e.g., rolling a K-sided die once). Often represented using a 1-of-K encoding vector \(\mathbf{x}\) where \(p(\mathbf{x}|\boldsymbol{\mu}) = \prod \mu_k^{x_k}\). It is a special case of the Multinomial with \(N=1\).
Continuous: \(p(x) = \frac{1}{b-a}\) for \(a \le x \le b\), 0 otherwise.
Discrete: \(p(x=k) = \frac{1}{K}\) for \(k=1 \dots K\).
It represents maximum ignorance (maximum entropy) over a bounded interval or finite set.
Allows expressing a joint distribution as a product of conditional distributions: $$p(X_1, X_2, \dots, X_N) = p(X_1) p(X_2|X_1) p(X_3|X_1, X_2) \dots p(X_N|X_1, \dots, X_{N-1})$$ Valid for any ordering of variables.