Q: What is Bayes' theorem and what are its components in the context of statistical inference?
A: Bayes' theorem is given by \( p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \).
- \( p(\theta|D) \) is the posterior probability of the parameters \(\theta\) given the data \(D\).
- \( p(D|\theta) \) is the likelihood of the data given the parameters.
- \( p(\theta) \) is the prior probability of the parameters.
- \( p(D) \) is the marginal likelihood or evidence of the data.
Source: Bishop, Chapter 1; Murphy, Chapter 2
Q: What is a conjugate prior?
A: A prior distribution is conjugate to a likelihood function if the resulting posterior distribution is in the same probability distribution family as the prior. This provides an analytical solution for the posterior, simplifying Bayesian analysis.
Source: Bishop, Section 2.1.1; Murphy, Section 3.3.2
Q: Describe the Beta-Binomial conjugate pair.
A: If the likelihood of observing \(m\) successes in \(N\) trials is Binomial, \(p(D|\mu) \propto \mu^m (1-\mu)^{N-m}\), and the prior on the probability of success \(\mu\) is a Beta distribution, \(\text{Beta}(\mu|a,b) \propto \mu^{a-1}(1-\mu)^{b-1}\), then the posterior is also a Beta distribution: \(\text{Beta}(\mu|m+a, N-m+b)\).
Source: Bishop, Section 2.1.1; Murphy, Section 3.3
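The update above is a one-line computation; a minimal sketch (the prior and counts are made-up illustrative numbers):

```python
def beta_binomial_update(a, b, m, N):
    """Conjugate update: a Beta(a, b) prior combined with a Binomial
    likelihood (m successes in N trials) yields a Beta(a + m, b + N - m) posterior."""
    return a + m, b + N - m

# Hypothetical numbers: Beta(2, 2) prior, 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(2, 2, 7, 10)
posterior_mean = a_post / (a_post + b_post)  # (2 + 7) / (2 + 2 + 10)
```

The prior hyperparameters act as pseudo-counts that are simply added to the observed counts.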
Q: What is the posterior predictive distribution and why is it important?
A: The posterior predictive distribution gives the probability of a new data point, \(x_{new}\), given the observed data \(D\). It is calculated by integrating the likelihood of the new data point over the posterior distribution of the parameters: \( p(x_{new}|D) = \int p(x_{new}|\theta) p(\theta|D) d\theta \). It is important because it allows for prediction while accounting for parameter uncertainty.
Source: Bishop, Section 3.3.2; Murphy, Section 3.2.4
Q: What is the Maximum a Posteriori (MAP) estimate?
A: The MAP estimate is the mode of the posterior distribution, i.e., the value of the parameters \(\theta\) that maximizes the posterior probability \(p(\theta|D)\). It is often used as a point estimate of the parameters.
Source: Bishop, Section 1.2.6; Murphy, Section 5.2.1
Q: How does the MAP estimate relate to the Maximum Likelihood Estimate (MLE)?
A: The MAP estimate maximizes \(p(D|\theta)p(\theta)\), while the MLE maximizes \(p(D|\theta)\). If the prior \(p(\theta)\) is uniform, the MAP estimate is equivalent to the MLE. The log of the prior in the MAP estimation acts as a regularization term.
Source: Bishop, Section 1.2.6; Murphy, Section 5.2.1
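For the Beta-Bernoulli case the MAP/MLE relationship can be checked directly; a toy sketch (counts are made up; Beta(1,1) is the uniform prior):

```python
def mle(m, N):
    """Maximum likelihood estimate of a Bernoulli success probability."""
    return m / N

def map_estimate(m, N, a, b):
    """Mode of the Beta(a + m, b + N - m) posterior
    (valid when both posterior shape parameters exceed 1)."""
    return (m + a - 1) / (N + a + b - 2)

# A uniform Beta(1, 1) prior makes the MAP estimate coincide with the MLE.
uniform_map = map_estimate(7, 10, 1, 1)
# An informative Beta(2, 2) prior shrinks the estimate towards 0.5.
shrunk_map = map_estimate(7, 10, 2, 2)
```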
Q: What is the marginal likelihood (or model evidence) and what is its role in Bayesian inference?
A: The marginal likelihood is the probability of the data given a model, \(p(D|m) = \int p(D|\theta, m) p(\theta|m) d\theta\). It is used for Bayesian model selection, typically by comparing the marginal likelihoods of different models (e.g., via Bayes factors).
Source: Bishop, Section 3.4; Murphy, Section 5.3.2
Q: Explain the concept of the "Bayesian Occam's Razor".
A: Bayesian model selection via the marginal likelihood naturally penalizes overly complex models. A complex model must spread its predictive probability over a wider range of possible datasets. If a simpler model can explain the observed data well, it will have a higher marginal likelihood and be favored.
Source: Bishop, Section 3.4; Murphy, Section 5.3.1
Q: What is the Laplace approximation?
A: The Laplace approximation is a method to approximate a posterior distribution with a Gaussian distribution. The Gaussian is centered at the mode of the posterior (the MAP estimate), and its covariance is the inverse of the negative Hessian matrix of the log posterior at the mode.
Source: Bishop, Section 4.4; Murphy, Section 4.4.1
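As a sketch, a Beta density (whose log-density derivatives are available in closed form) can be approximated this way; the mode and negative Hessian are hand-derived, so no numerical optimizer is needed:

```python
def laplace_approx_beta(a, b):
    """Gaussian N(mu, sigma^2) Laplace approximation to a Beta(a, b) density.
    Requires a > 1 and b > 1 so that the mode is interior to (0, 1)."""
    mode = (a - 1) / (a + b - 2)                 # mode of the density (the MAP)
    # Negative Hessian of the log density at the mode:
    # -d^2/dtheta^2 [(a-1) log theta + (b-1) log(1-theta)]
    neg_hessian = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2
    return mode, 1.0 / neg_hessian               # Gaussian mean and variance

mu, var = laplace_approx_beta(9, 5)  # e.g. the posterior from a coin-flip example
```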
Q: What is the relationship between Bayesian linear regression with a Gaussian prior and Ridge regression?
A: The MAP estimate of the coefficients in a Bayesian linear regression with a zero-mean Gaussian prior is equivalent to the Ridge regression estimate. The regularization parameter \(\lambda\) equals the ratio of the prior precision to the noise precision (\(\lambda = \alpha/\beta\) in Bishop's notation).
Source: Bishop, Section 3.3; Rogers and Girolami, Chapter 3
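The equivalence is easy to verify for a single-feature model with no intercept (a toy sketch; the data values are made up):

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution w = (sum x*y) / (sum x^2 + lambda)
    for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def map_1d(xs, ys, alpha, beta):
    """MAP weight under a N(0, 1/alpha) prior and Gaussian noise with
    precision beta: identical to ridge with lambda = alpha / beta."""
    return ridge_1d(xs, ys, alpha / beta)

xs, ys = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
w_ridge = ridge_1d(xs, ys, 0.5)
w_map = map_1d(xs, ys, alpha=1.0, beta=2.0)  # alpha/beta = 0.5
```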
Q: What is the difference between a credible interval and a confidence interval?
A: A 95% credible interval is a range for which there is a 95% probability that the true parameter value lies within it. A 95% confidence interval is a range that, if the experiment were repeated many times, would contain the true parameter in 95% of the repetitions. The Bayesian credible interval is a statement about the parameter, while the frequentist confidence interval is a statement about the interval.
Source: Murphy, Section 5.2.2
Q: What is a Highest Posterior Density (HPD) region?
A: An HPD region is a type of credible interval that contains the required mass of the posterior probability, and where every point inside the region has a higher probability density than any point outside it. For a unimodal symmetric posterior, it's the same as the central interval.
Source: Murphy, Section 5.2.2.1
Q: What is the "curse of dimensionality" in the context of density estimation?
A: In high-dimensional spaces, the volume of the space grows so fast that the data becomes very sparse. This makes it very difficult to estimate probability densities accurately without an exponentially large amount of data.
Source: Bishop, Section 1.4
Q: What is the difference between a parametric and a non-parametric model?
A: A parametric model has a fixed number of parameters, regardless of the amount of training data. A non-parametric model's complexity (and number of parameters) can grow with the amount of training data.
Source: Murphy, Section 1.4.1
Q: What is the bias-variance tradeoff?
A: A fundamental concept in machine learning that describes the tradeoff between a model's ability to fit the training data well (low bias) and its ability to generalize to new data (low variance). Simple models have high bias and low variance, while complex models have low bias and high variance.
Source: Bishop, Section 3.2; Murphy, Section 6.4.4
Q: What is the exponential family of distributions?
A: A broad class of probability distributions that can be written in the form \( p(x|\eta) = h(x)g(\eta) \exp(\eta^T u(x)) \). Many common distributions, including the Gaussian, Bernoulli, and Poisson, are members of this family. They have useful properties, such as the existence of conjugate priors.
Source: Bishop, Section 2.4
Q: What is the "size principle" in Bayesian concept learning?
A: The size principle states that hypotheses with smaller extensions receive higher likelihood: assuming examples are sampled uniformly from the concept, \(p(D|h) = (1/|h|)^N\) for \(N\) examples consistent with \(h\), so the smallest hypothesis consistent with the data is favored. This is a form of Occam's Razor.
Source: Murphy, Section 3.2.1
Q: What is the Dirichlet-multinomial model?
A: A model for sequences of categorical data. It is derived by placing a Dirichlet prior on the parameters of a multinomial likelihood. It is often used in natural language processing for tasks like topic modeling.
Source: Murphy, Section 3.4
Q: What is the "black swan paradox" and how does Bayesian inference address it?
A: It's the problem of making predictions about unseen events (e.g., predicting the probability of a "black swan" is zero if you've only ever seen white swans). Bayesian inference, through the use of priors (like Laplace's rule of succession), assigns a non-zero probability to unseen events, avoiding this issue.
Source: Murphy, Section 3.3.4.1
Q: What is the Jeffreys-Lindley paradox?
A: A paradox in Bayesian hypothesis testing where using a vague (improper or very diffuse) prior for a parameter under an alternative hypothesis (M1) can lead to the Bayes factor always favoring the simpler null hypothesis (M0), regardless of the data.
Source: Murphy, Section 5.3.4
Q: What is a "spike and slab" prior?
A: A type of prior used for Bayesian variable selection. It is a mixture of a "spike" (a distribution sharply peaked at zero, like a Dirac delta or a narrow Gaussian) and a "slab" (a diffuse, flat distribution).
Source: Murphy, Section 13.2.1
Q: What is Bayesian Model Averaging (BMA)?
A: An ensemble method that accounts for model uncertainty by averaging the predictions of multiple models, weighted by their posterior model probabilities.
Source: Bishop, Section 14.1; Murphy, Section 3.2.4
Q: What is the "log-sum-exp" trick?
A: A numerical stabilization technique used to compute the logarithm of a sum of exponentials, which is common in calculating log-likelihoods or log-posteriors, while avoiding numerical underflow or overflow.
Source: Murphy, Section 3.5.3
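A minimal implementation factors the maximum out of the sum, so the exponentials are evaluated on shifted (non-overflowing) arguments:

```python
import math

def log_sum_exp(xs):
    """log(sum_i exp(x_i)) computed stably by factoring out max(xs);
    naive evaluation of exp(1000) would overflow a double."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

stable = log_sum_exp([1000.0, 1000.0])  # exact answer: 1000 + log(2)
```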
Q: What is the Normal-Inverse-Wishart (NIW) distribution?
A: It is the conjugate prior for a multivariate Normal distribution N(μ, Σ) when both the mean μ and the covariance matrix Σ are unknown.
Source: Murphy, Section 4.6.3
Q: What is shrinkage in the context of Bayesian estimation?
A: It is the phenomenon where the posterior estimate is pulled away from the maximum likelihood estimate and towards a prior belief. It is a form of regularization that reduces variance.
Source: Murphy, Section 5.6.2.1
Q: What is the posterior mean in a Normal-Normal model (inferring a mean with a Normal prior)?
A: The posterior mean is a precision-weighted average of the prior mean and the sample mean (the MLE). It represents a "shrinkage" of the sample mean towards the prior mean.
Source: Murphy, Section 4.4.2.1
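The precision-weighted update can be written out directly (toy numbers; the observation precision and the prior are assumed known):

```python
def normal_posterior_mean(mu0, prec0, prec_obs, data):
    """Posterior over a Gaussian mean with known observation precision prec_obs
    and a N(mu0, 1/prec0) prior: a precision-weighted average of prior mean and MLE."""
    n = len(data)
    xbar = sum(data) / n              # the MLE of the mean
    prec_n = prec0 + n * prec_obs     # precisions add
    mu_n = (prec0 * mu0 + n * prec_obs * xbar) / prec_n
    return mu_n, prec_n

mu_n, prec_n = normal_posterior_mean(mu0=0.0, prec0=1.0, prec_obs=1.0,
                                     data=[2.0, 2.0, 2.0])
# mu_n lies between the prior mean (0.0) and the MLE (2.0): shrinkage.
```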
Q: What is the Jeffreys prior for a location parameter, like a Gaussian mean?
A: A uniform (improper) prior, p(μ) ∝ 1. This is a translation-invariant prior.
Source: Murphy, Section 5.4.2.2
Q: What is the Jeffreys prior for a scale parameter, like a Gaussian standard deviation?
A: An improper prior proportional to 1/σ, i.e., p(σ) ∝ 1/σ. This is a scale-invariant prior.
Source: Murphy, Section 5.4.2.2
Q: What is the "evidence procedure"?
A: Another name for Empirical Bayes or type-II maximum likelihood, where hyperparameters are estimated by maximizing the marginal likelihood.
Source: Murphy, Section 5.6
Q: What is the posterior mean of the rate in a Gamma-Poisson model?
A: If the prior is \(\text{Ga}(\lambda|a,b)\) and we observe data \(D\), the posterior is \(\text{Ga}(\lambda|a+\sum y_i, b+n)\). The posterior mean is \((a+\sum y_i)/(b+n)\).
Source: Rogers and Girolami, Chapter 2
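The update is again additive in the counts; a sketch with made-up observations:

```python
def gamma_poisson_update(a, b, counts):
    """Conjugate update for a Poisson rate with a Ga(a, b) prior:
    the posterior is Ga(a + sum(y_i), b + n)."""
    a_post = a + sum(counts)
    b_post = b + len(counts)
    return a_post, b_post, a_post / b_post   # posterior shapes and posterior mean

a_post, b_post, post_mean = gamma_poisson_update(1.0, 1.0, [3, 5, 4])
```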
Q: What is the posterior predictive distribution in a Beta-Binomial model?
A: The posterior predictive distribution is the Beta-Binomial distribution. For a single new trial, the probability of success is the posterior mean of \(\theta\), namely \((m+a)/(N+a+b)\).
Source: Murphy, Section 3.3.4
Q: What is the posterior predictive distribution in a Dirichlet-Multinomial model?
A: The posterior predictive distribution is the Dirichlet-Multinomial distribution. For a single new trial, the probability of outcome \(j\) is the posterior mean of \(\theta_j\), namely \((N_j+\alpha_j)/(N+\sum_k \alpha_k)\).
Source: Murphy, Section 3.4.4
Q: What is the posterior predictive distribution in a Gaussian-Gaussian model (known variance)?
A: The posterior predictive distribution for a new observation is a Gaussian distribution with mean equal to the posterior mean of \(\mu\) and variance equal to the sum of the posterior variance of \(\mu\) and the observation variance.
Source: Murphy, Section 4.4.2.1
Q: What is the posterior predictive distribution in a Bayesian linear regression model?
A: The posterior predictive distribution for a new observation is a Gaussian distribution with mean \(x_{new}^T E[w|D]\) and variance \(\sigma^2 + x_{new}^T \text{cov}[w|D] x_{new}\).
Source: Rogers and Girolami, Chapter 3
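For a single feature the predictive mean and variance reduce to scalars; a sketch with hypothetical posterior moments:

```python
def predictive_1d(x_new, w_mean, w_var, noise_var):
    """Posterior predictive for a one-feature linear model:
    mean = x * E[w], variance = noise variance + x^2 * var(w).
    Far from the origin (large |x|) the parameter-uncertainty term dominates."""
    return x_new * w_mean, noise_var + x_new**2 * w_var

# Hypothetical posterior: E[w] = 1.5, var(w) = 0.1, noise variance 0.25.
mean_near, var_near = predictive_1d(1.0, 1.5, 0.1, 0.25)
mean_far, var_far = predictive_1d(5.0, 1.5, 0.1, 0.25)
```

Unlike the plug-in approximation, the predictive variance grows with distance from the data, reflecting parameter uncertainty.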
Q: What is the "plug-in" approximation for the posterior predictive distribution?
A: It involves finding a point estimate of the parameters (like MAP or MLE), and then "plugging" this estimate into the predictive distribution, \(p(y_{new}|\theta_{est})\), ignoring parameter uncertainty.
Source: Murphy, Section 3.2.4
Q: What is the main advantage of Bayesian Model Averaging (BMA) over using a single model?
A: BMA accounts for model uncertainty by averaging predictions across multiple models, weighted by their posterior probabilities. This typically leads to better predictive performance and more robust conclusions than relying on a single "best" model.
Source: Bishop, Section 14.1
Q: What is the relationship between the marginal likelihood and the BIC (Bayesian Information Criterion)?
A: The BIC is an approximation to the log marginal likelihood. It is given by \(\log p(D|\hat{\theta}) - \frac{\text{dof}(\theta)}{2} \log N\), where the second term penalizes model complexity.
Source: Murphy, Section 5.3.2.4
Q: What is the difference between the BIC and AIC (Akaike Information Criterion)?
A: Both are penalized log-likelihood scores for model selection. The BIC penalty grows with \(\log N\) per parameter, whereas the AIC penalty is a constant per parameter, so the BIC tends to favor simpler models.
Source: Murphy, Section 5.3.2.5
Q: What is the "effective sample size" of a prior?
A: The effective sample size (e.g., \(\alpha + \beta\) in a Beta prior) represents the strength of the prior belief in terms of the number of "pseudo-observations" it contributes. A larger effective sample size means a stronger prior that will be less influenced by the data.
Source: Murphy, Section 3.3.3
Q: What is the connection between a Bayesian neural network and dropout?
A: Dropout, a regularization technique where neurons are randomly dropped during training, can be interpreted as an approximation to Bayesian inference in a deep Gaussian process model. It approximates the effect of averaging over many different network architectures.
Source: Bishop, Section 5.7.2
Q: What is the Polya Urn scheme and how does it relate to the Dirichlet-multinomial model?
A: The Polya Urn scheme is a process where a ball of a certain color is drawn from an urn, and then returned along with another ball of the same color. This "rich get richer" process is mathematically described by the Dirichlet-multinomial distribution, which is used in Bayesian models to capture burstiness in data (e.g., word counts).
Source: Murphy, Section 3.5
Q: What is the main idea behind Bayesian concept learning?
A: Bayesian concept learning models how humans can infer general rules or concepts from a small number of positive examples. It uses Bayes' rule to update beliefs over a hypothesis space of possible concepts, favoring simpler hypotheses that explain the data concisely (the size principle).
Source: Murphy, Section 3.2
Q: What is the purpose of a loss function in Bayesian decision theory?
A: A loss function \(L(y, a)\) quantifies the "cost" or "error" of taking an action 'a' when the true state of nature is 'y'. It is used to define the optimal action as the one that minimizes the expected loss over the posterior distribution.
Source: Murphy, Section 5.7
Q: What is the relationship between the posterior mean and squared error loss?
A: The posterior mean is the Bayes estimator that minimizes the posterior expected loss when using a squared error (L2) loss function. It is also known as the Minimum Mean Squared Error (MMSE) estimate.
Source: Murphy, Section 5.7.1.3
Q: What is the relationship between the posterior median and absolute error loss?
A: The posterior median is the Bayes estimator that minimizes the posterior expected loss when using an absolute error (L1) loss function. It is more robust to outliers than the posterior mean.
Source: Murphy, Section 5.7.1.4
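Both facts about Bayes estimators can be checked empirically with a grid search over actions on a skewed set of posterior samples (made-up values):

```python
# Skewed "posterior samples" (hypothetical): sample mean = 0.92, median = 0.2.
samples = [0.0, 0.1, 0.2, 0.3, 4.0]

def expected_loss(action, loss):
    """Empirical posterior expected loss of an action over the samples."""
    return sum(loss(s, action) for s in samples) / len(samples)

grid = [i / 100 for i in range(501)]  # candidate actions in [0, 5]
best_l2 = min(grid, key=lambda a: expected_loss(a, lambda s, t: (s - t) ** 2))
best_l1 = min(grid, key=lambda a: expected_loss(a, lambda s, t: abs(s - t)))
# best_l2 lands on the sample mean; best_l1 on the sample median,
# which ignores how far away the outlier 4.0 is.
```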
Q: What is the sum rule of probability?
A: The sum rule states that the marginal probability of a random variable can be obtained by summing (or integrating) the joint probability over all possible values of the other variables. \( p(X) = \sum_Y p(X, Y) \).
Source: Bishop, Section 1.2
Q: What is the product rule of probability?
A: The product rule states that the joint probability of two random variables can be expressed as the product of the conditional probability of one variable given the other, and the marginal probability of the other. \( p(X, Y) = p(Y|X)p(X) \).
Source: Bishop, Section 1.2
Q: What is the relationship between the posterior, likelihood, prior, and evidence?
A: posterior \(\propto\) likelihood \(\times\) prior. The posterior is proportional to the product of the likelihood and the prior. The evidence is the normalizing constant.
Source: Bishop, Section 1.2.3
Q: What is the likelihood function?
A: The likelihood function \(p(D|w)\) is the probability of the observed data \(D\) as a function of the parameters \(w\). It is not a probability distribution over \(w\).
Source: Bishop, Section 1.2.3
Q: What is the difference between the Bayesian and frequentist views of parameters?
A: In the Bayesian view, parameters are random variables about which we can have uncertainty. In the frequentist view, parameters are fixed, unknown constants.
Source: Bishop, Section 1.2.3
Q: What is the expectation of a function \(f(x)\) under a probability distribution \(p(x)\)?
A: The expectation is the weighted average of the function's values, where the weights are given by the probability of each value. For a discrete variable, \(E[f] = \sum_x p(x)f(x)\). For a continuous variable, \(E[f] = \int p(x)f(x)dx\).
Source: Bishop, Section 1.2.2
Q: What is the variance of a random variable \(x\)?
A: The variance measures the spread of a distribution. It is defined as \(\text{var}[x] = E[(x - E[x])^2] = E[x^2] - (E[x])^2\).
Source: Bishop, Section 1.2.2
Q: What is the covariance of two random variables \(x\) and \(y\)?
A: The covariance measures the extent to which two variables vary together. It is defined as \(\text{cov}[x, y] = E[(x - E[x])(y - E[y])] = E[xy] - E[x]E[y]\).
Source: Bishop, Section 1.2.2
Q: What is the Gaussian (or normal) distribution?
A: A continuous probability distribution defined by its mean \(\mu\) and variance \(\sigma^2\). Its pdf is given by \( N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \).
Source: Bishop, Section 1.2.4
Q: What is the multivariate Gaussian distribution?
A: A generalization of the Gaussian distribution to multiple dimensions. It is defined by a mean vector \(\mu\) and a covariance matrix \(\Sigma\).
Source: Bishop, Section 2.3
Q: What is the central limit theorem?
A: The central limit theorem states that the (suitably standardized) sum of a large number of independent and identically distributed random variables with finite variance is approximately normally distributed, regardless of the underlying distribution.
Source: Murphy, Section 2.6.3
Q: What is the principle of maximum entropy?
A: The principle of maximum entropy states that, subject to any known constraints, the distribution that best represents the current state of knowledge is the one with the largest entropy.
Source: Bishop, Section 1.6.1
Q: What is the Kullback-Leibler (KL) divergence?
A: The KL divergence is a measure of the dissimilarity between two probability distributions \(p\) and \(q\). It is defined as \(KL(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\). It is non-negative, equals zero only when \(p = q\), and is not symmetric.
Source: Bishop, Section 1.6.1
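A direct implementation for discrete distributions makes the asymmetry easy to see:

```python
import math

def kl(p, q):
    """KL(p||q) for discrete distributions given as probability lists.
    Terms with p_i = 0 contribute zero (0 log 0 := 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
forward, reverse = kl(p, q), kl(q, p)  # not equal: KL is asymmetric
```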
Q: What is mutual information?
A: Mutual information measures the mutual dependence between two random variables. It is the KL divergence between the joint distribution and the product of the marginal distributions: \(I(X;Y) = KL(p(X,Y)||p(X)p(Y))\).
Source: Bishop, Section 1.6.1
Q: What is the relationship between mutual information and entropy?
A: Mutual information can be expressed in terms of entropy as \(I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\). It represents the reduction in uncertainty about one variable given knowledge of the other.
Source: Bishop, Section 1.6.1
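The two routes to \(I(X;Y)\) can be checked against each other on a small joint table (hypothetical numbers); this uses \(I(X;Y) = H(X) + H(Y) - H(X,Y)\), an equivalent form of \(H(X) - H(X|Y)\):

```python
import math

def entropy(ps):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in ps if p > 0)

joint = [[0.4, 0.1],   # hypothetical joint p(x, y)
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Route 1: KL between the joint and the product of marginals.
mi_kl = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))
# Route 2: H(X) + H(Y) - H(X,Y).
mi_ent = entropy(px) + entropy(py) - entropy([p for row in joint for p in row])
```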
Q: What is the "No Free Lunch" theorem?
A: The No Free Lunch theorem states that there is no universally best model or learning algorithm. A set of assumptions that works well in one domain may work poorly in another.
Source: Murphy, Section 1.4.9
Q: What is the difference between overfitting and underfitting?
A: Overfitting occurs when a model is too complex and captures noise in the training data, leading to poor generalization. Underfitting occurs when a model is too simple and fails to capture the underlying structure of the data.
Source: Bishop, Section 1.1
Q: What is regularization?
A: Regularization is a technique used to prevent overfitting by adding a penalty term to the error function. This penalty discourages complex models, such as those with large coefficient values.
Source: Bishop, Section 1.1
Q: What is cross-validation?
A: Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
Source: Bishop, Section 1.3
Q: What is the "reject option" in classification?
A: The reject option is a choice to abstain from making a classification decision when the model is uncertain. This is useful in risk-averse applications where the cost of a misclassification is high.
Source: Bishop, Section 1.5.3
Q: What is the difference between a generative and a discriminative classifier?
A: A generative classifier models the joint distribution \(p(x,y)\), often by modeling the class-conditional densities \(p(x|y)\) and the class priors \(p(y)\). A discriminative classifier models the posterior \(p(y|x)\) directly.
Source: Bishop, Section 1.5.4
Q: What is a loss function in decision theory?
A: A loss function \(L(t, y(x))\) quantifies the cost of making a prediction \(y(x)\) when the true value is \(t\). The goal of decision theory is to choose a prediction that minimizes the expected loss.
Source: Bishop, Section 1.5.5
Q: What is the optimal prediction for a regression problem with a squared loss function?
A: The optimal prediction is the conditional mean of the target variable, \(E[t|x]\).
Source: Bishop, Section 1.5.5
Q: What is the optimal prediction for a regression problem with an absolute loss function?
A: The optimal prediction is the conditional median of the target variable.
Source: Bishop, Section 1.5.5