Bayesian Inference and Learning Quiz

1. In Bayesian inference, what does the prior distribution, \(\pi(\theta)\), represent?

The prior distribution \(\pi(\theta)\) represents our uncertainty about the unknown parameters \(\theta\) before observing the data \(y\) (a priori). It allows prior knowledge or beliefs to be incorporated into the model.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Bayes theorem for statistical models.

2. What is a conjugate prior?

A prior is conjugate to the likelihood if the resulting posterior distribution is in the same family of distributions as the prior. This simplifies Bayesian analysis considerably, as the posterior can be derived analytically. For example, the Beta distribution is a conjugate prior for the Binomial likelihood.

Source: A First Course in Machine Learning, Comment 3.1.

3. In the context of Bayesian linear regression, if the likelihood is Gaussian and the prior on the regression weights \(w\) is also Gaussian, what is the form of the posterior distribution for \(w\)?

Due to the choice of a conjugate Gaussian prior distribution for the Gaussian likelihood function, the posterior distribution will also be Gaussian. This is a key property of conjugate priors.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3, Page 153.
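
A minimal numerical sketch of this result, assuming a zero-mean isotropic prior \(N(w|0, \alpha^{-1}I)\), noise precision \(\beta\), and a small polynomial design matrix (all illustrative choices, not taken from the sources):

```python
# Sketch: Gaussian posterior over the weights in Bayesian linear regression,
# assuming prior N(w | 0, alpha^{-1} I) and Gaussian noise with precision beta.
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 2.0, 25.0                        # assumed prior and noise precisions
N, M = 20, 3                                   # data points, basis functions

X = rng.uniform(-1, 1, size=N)
Phi = np.vstack([X**j for j in range(M)]).T    # polynomial design matrix (N x M)
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta), size=N)

# Posterior is Gaussian N(w | m_N, S_N) with
#   S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

print("posterior mean m_N:", m_N)
print("posterior variance (diagonal of S_N):", np.diag(S_N))
```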

4. What is the primary purpose of the marginal likelihood (or model evidence), \(f(y)\)?

The term \(f(y) = \int f(y|\theta)\pi(\theta)d\theta\) is the marginal likelihood. It acts as the normalizing constant in Bayes' theorem to ensure the posterior is a proper probability distribution. It is also central to Bayesian model comparison, where models with higher marginal likelihood are preferred.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Bayes theorem for statistical models.

5. The bias-variance decomposition expresses the expected squared loss as:

The expected squared loss can be decomposed into three terms: \((\text{bias})^2\), variance, and a constant noise term. The bias represents the difference between the average prediction and the true function, while variance measures the sensitivity of the model to the specific training dataset.

Source: Machine Learning: A Probabilistic Perspective, Section 3.2, Page 149, Eq. 3.41.
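
An illustrative simulation of this decomposition (the true function, noise level, and polynomial degrees below are assumed for the example): refitting a model on many resampled training sets lets us estimate the squared bias and the variance of its prediction at a fixed test point.

```python
# Illustrative sketch (not from the sources): estimating bias^2 and variance of a
# polynomial fit at a single test point by repeatedly resampling training sets.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)             # assumed "true" function
x_test, n_train, n_repeats, noise = 0.3, 30, 500, 0.3

def bias_variance(degree):
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, noise, n_train)
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x_test)) ** 2      # (average prediction - truth)^2
    variance = preds.var()                       # sensitivity to the training set
    return bias2, variance

for degree in (1, 7):
    b2, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```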

6. In Bayesian linear regression, the posterior mean of the weights is a weighted average of the prior mean and the Maximum Likelihood Estimate (MLE). What happens to the posterior mean as the number of data points \(n\) approaches infinity?

As the number of data points \(n\) grows, the likelihood term dominates the prior. The weight given to the MLE approaches 1, and the weight given to the prior mean approaches 0. Therefore, the posterior mean converges to the MLE.

Source: Bayesian Linear Regression.html, Section: Bayesian linear regression.

7. Which of the following is a key characteristic of a non-informative prior?

A non-informative prior is designed to "let the data speak for themselves" by having minimal influence on the final posterior distribution. Such priors can sometimes be improper (i.e., not integrate to 1), like the uniform prior over an unbounded domain.

Source: Machine Learning: A Probabilistic Perspective, Section 2.4.3, Page 118.

8. The posterior predictive distribution \(f(y_n|y)\) is calculated by:

The posterior predictive distribution is found by marginalizing (integrating) out the model parameters \(\theta\) from the joint distribution of the new data point and the parameters, conditioned on the observed data. The formula is \(f(y_n|y) = \int f(y_n|\theta)\pi(\theta|y)d\theta\). This accounts for all uncertainty in the parameters.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Prediction.
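
A small Monte Carlo sketch of this integral for the Beta-Bernoulli case (the prior and data values are assumed): draw \(\theta\) from the posterior, average the likelihood of the new observation over the draws, and compare with the closed-form answer.

```python
# Sketch: Monte Carlo approximation of f(y_new | y) = ∫ f(y_new | theta) pi(theta | y) d theta
# for a Bernoulli likelihood with a Beta posterior (assumed toy numbers).
import numpy as np

rng = np.random.default_rng(2)
alpha, beta_, n, y = 1, 1, 10, 7           # Beta(1,1) prior, 7 successes in 10 trials

theta_samples = rng.beta(alpha + y, beta_ + n - y, size=100_000)  # posterior draws
p_next_success = theta_samples.mean()      # average of f(y_new = 1 | theta) over draws

print("Monte Carlo :", round(p_next_success, 4))
print("Exact       :", (alpha + y) / (alpha + beta_ + n))         # (α+y)/(α+β+n)
```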

9. In the context of the Beta-Binomial model, if the prior is Beta(\(\alpha, \beta\)) and we observe \(y\) successes in \(n\) trials, the posterior distribution is:

Due to conjugacy, the posterior is also a Beta distribution. The new parameters are found by adding the observed counts to the prior's hyperparameters (which can be interpreted as pseudo-counts). The new \(\alpha'\) is \(\alpha + y\) (prior heads + observed heads) and the new \(\beta'\) is \(\beta + n - y\) (prior tails + observed tails).

Source: Bayesian Inference Essentials Part 1(1).html, Section: Examples: conjugate models.
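
A minimal sketch of this update with assumed pseudo-counts, using SciPy only to summarise the resulting posterior:

```python
# Conjugate Beta-Binomial update: prior Beta(alpha, beta), observe y successes in
# n trials, posterior Beta(alpha + y, beta + n - y). Values below are assumed.
from scipy import stats

alpha, beta = 2, 2            # prior pseudo-counts
n, y = 20, 14                 # observed trials and successes

alpha_post, beta_post = alpha + y, beta + n - y
posterior = stats.beta(alpha_post, beta_post)

print(f"posterior: Beta({alpha_post}, {beta_post})")
print("posterior mean:", posterior.mean())             # (alpha + y) / (alpha + beta + n)
print("95% equal-tailed credible interval:", posterior.interval(0.95))
```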

10. What is the "curse of dimensionality" in the context of k-Nearest Neighbors (k-NN)?

In a high-dimensional space, volume grows exponentially with the number of dimensions. To capture even a small fraction of the data for a local estimate, the neighborhood must therefore have a large volume, so the points within it are no longer close to the target point and become less useful as predictors. This is a key limitation of local methods like k-NN in high dimensions.

Source: Machine Learning: A Probabilistic Perspective, Section 1.4.3, Page 18.

11. The Maximum a Posteriori (MAP) estimate for a parameter \(\theta\) is equivalent to:

The MAP estimate is the value of \(\theta\) that maximizes the posterior probability density, \(\pi(\theta|y)\). This corresponds to the mode of the posterior distribution.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Bayes estimators.

12. In Bayesian linear regression, adding a Gaussian prior \(N(w|0, \alpha^{-1}I)\) to the parameters is equivalent to what in a frequentist setting?

Maximizing the posterior distribution, which is proportional to the likelihood times the prior, is equivalent to minimizing the negative log posterior. For a Gaussian likelihood and Gaussian prior, this negative log posterior is precisely a sum-of-squares error term plus a quadratic (L2) penalty on the weights, which is the objective function for Ridge regression.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3, Page 153, Eq. 3.55.

13. What does the Bayesian Occam's razor effect imply?

The Bayesian Occam's razor automatically penalizes models that are too complex. A very complex model can explain many different datasets, so it must spread its probability mass thinly. A simpler model concentrates its mass on a smaller set of possible datasets. If the observed data falls in that set, the simpler model will have higher marginal likelihood.

Source: Machine Learning: A Probabilistic Perspective, Section 5.3.1, Page 156.

14. For a Gaussian distribution, what is a sufficient statistic?

For a Gaussian distribution, the sufficient statistic is \(u(x) = (x, x^2)\). This means we only need to store the sum of the data points and the sum of the squares of the data points to have all the information needed from the data to estimate the parameters.

Source: Machine Learning: A Probabilistic Perspective, Section 2.4, Page 117.

15. In the evidence approximation framework (Type II Maximum Likelihood), how are hyperparameters like \(\alpha\) and \(\beta\) in Bayesian linear regression typically determined?

The evidence approximation is an approach where hyperparameters are set to specific values by maximizing the marginal likelihood, which is obtained by integrating out the model parameters \(w\). This avoids the need for full Bayesian integration over hyperparameters, which is often intractable, and avoids the computational cost of cross-validation.

Source: Machine Learning: A Probabilistic Perspective, Section 3.5, Page 165.

16. What is a key difference between a Bayesian credible interval and a frequentist confidence interval?

A 95% credible interval for a parameter \(\theta\) is an interval that contains \(\theta\) with 95% probability, based on the posterior distribution. A 95% confidence interval is an interval that, if the experiment were repeated many times, would contain the true (fixed) value of \(\theta\) in 95% of the repetitions. The Bayesian interpretation is often more intuitive to non-statisticians.

Source: Bayesian Inference Essentials Part 1(1).html, Section: (Bayesian) credible intervals.

17. The Jeffreys prior for a parameter \(\theta\) is proportional to:

The Jeffreys prior is defined as \(\pi(\theta) \propto \det(I(\theta))^{1/2}\), where \(I(\theta)\) is the Fisher information matrix. This prior has the desirable property of being invariant to reparameterization.

Source: Bayesian Inference Essentials Part 2(1).html, Section: Prior specification.

18. In Bayesian linear regression, the predictive variance \(\sigma_N^2(x)\) is composed of two terms. What do they represent?

The predictive variance is given by \(\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)\). The first term, \(1/\beta\), represents the intrinsic noise in the data. The second term, \(\phi(x)^T S_N \phi(x)\), reflects the uncertainty in the parameter estimate \(w\), where \(S_N\) is the posterior covariance of \(w\).

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.2, Page 156, Eq. 3.59.

19. What is the "equivalent kernel" \(k(x, x')\) in the context of Bayesian linear regression?

The predictive mean can be written as \(y(x, m_N) = \sum_{n=1}^N k(x, x_n) t_n\), where \(k(x, x_n) = \beta \phi(x)^T S_N \phi(x_n)\) is the equivalent kernel. It shows that the model's prediction at a new point \(x\) is a weighted sum of the observed target values \(t_n\), where the weights depend on the "similarity" between \(x\) and \(x_n\) as defined by the kernel.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.3, Page 159, Eq. 3.61.
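
A sketch verifying this identity numerically, with assumed Gaussian basis functions and hyperparameters: the prediction built from the equivalent-kernel weights matches the one computed directly from the posterior mean \(m_N\), and the predictive variance from question 18 is computed alongside.

```python
# Sketch: equivalent kernel k(x, x_n) = beta * phi(x)^T S_N phi(x_n) and predictive
# variance sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x), with assumed Gaussian bases.
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
centres = np.linspace(0, 1, 9)

def phi(x):
    # bias term plus Gaussian basis functions of width 0.1
    return np.concatenate([[1.0], np.exp(-(x - centres) ** 2 / (2 * 0.1 ** 2))])

X = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, X.size)
Phi = np.vstack([phi(x) for x in X])

S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = 0.5
k_weights = beta * Phi @ S_N @ phi(x_new)        # k(x_new, x_n) for every training point
pred_mean_kernel = k_weights @ t                  # weighted sum of observed targets
pred_mean_direct = m_N @ phi(x_new)               # same value, via the posterior mean
pred_var = 1 / beta + phi(x_new) @ S_N @ phi(x_new)

print(pred_mean_kernel, pred_mean_direct, pred_var)
```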

20. A key problem with MAP (Maximum a Posteriori) estimation is that it is not invariant to reparameterization. Why does this happen?

When changing variables from \(x\) to \(y=f(x)\), the new probability density is \(p_y(y) = p_x(x) |dx/dy|\). The Jacobian term \(|dx/dy|\) is not, in general, constant, which means it can change the shape of the density and thus shift the location of the mode (the MAP estimate).

Source: Machine Learning: A Probabilistic Perspective, Section 5.2.1.4, Page 152.

21. In the bias-variance trade-off, a very flexible model (e.g., a high-degree polynomial) is likely to have:

A very flexible model can fit the training data very well, so on average its prediction is close to the true function, leading to low bias. However, it is very sensitive to the specific noise in the training data, causing its predictions to change drastically with different datasets, leading to high variance.

Source: Machine Learning: A Probabilistic Perspective, Section 3.2, Page 149.

22. The posterior mean is the Bayes estimator that minimizes which loss function?

The optimal estimator that minimizes the posterior expected squared error loss is the posterior mean. This is also known as the Minimum Mean Squared Error (MMSE) estimate.

Source: Machine Learning: A Probabilistic Perspective, Section 5.7.1.3, Page 179.

23. What is the relationship between Ridge Regression and the MAP estimate in Bayesian linear regression?

The MAP estimate maximizes \(p(w|D) \propto p(D|w)p(w)\). For a Gaussian likelihood and a zero-mean Gaussian prior on the weights, maximizing the posterior is equivalent to minimizing the negative log-posterior, which results in the sum-of-squares error plus an L2 penalty, the same objective as Ridge Regression.

Source: Bayesian Linear Regression.html, Section: Bayesian linear regression.
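
A quick numerical check of this equivalence on assumed synthetic data; the correspondence used is \(\lambda = \alpha/\beta\), read off from the negative log posterior.

```python
# Sketch: the MAP estimate under a Gaussian likelihood (precision beta) and a
# zero-mean Gaussian prior N(w | 0, alpha^{-1} I) equals Ridge with lambda = alpha/beta.
import numpy as np

rng = np.random.default_rng(4)
N, M = 50, 4
alpha, beta = 1.0, 10.0

Phi = rng.normal(size=(N, M))
t = Phi @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 1 / np.sqrt(beta), N)

# MAP estimate (= posterior mean of the Gaussian posterior)
w_map = beta * np.linalg.solve(alpha * np.eye(M) + beta * Phi.T @ Phi, Phi.T @ t)

# Ridge solution: argmin ||t - Phi w||^2 + lam ||w||^2 with lam = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

print(np.allclose(w_map, w_ridge))   # True: the two estimates coincide
```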

24. The posterior predictive distribution for a future observation is generally _______ than the plug-in predictive distribution using the MAP estimate.

The full posterior predictive distribution averages over all possible parameter values weighted by their posterior probability. This accounts for parameter uncertainty. The plug-in approximation uses only a single point (the MAP estimate), ignoring parameter uncertainty. This leads to overconfident (narrower) predictions.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.4.1, Page 77.

25. For a Normal-Normal model where \(y_i \sim N(\theta, \sigma^2)\) (known \(\sigma^2\)) and the prior is \(\theta \sim N(\mu_0, \tau^2)\), the posterior mean is a weighted average of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\). The weight on the sample mean increases as:

The posterior mean is \(\frac{\frac{\sigma^2}{n}\mu_0+\tau^2 \bar{y}}{\tau^2+\frac{\sigma^2}{n}}\). As the number of samples \(n\) increases, the term \(\sigma^2/n\) decreases, giving more weight to the sample mean \(\bar{y}\) and less to the prior mean \(\mu_0\). This reflects that with more data, we trust the data more and the prior less.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Examples: conjugate models.
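
An illustrative computation of this weighting (the parameter values are assumed): as \(n\) grows, the weight \(\tau^2/(\tau^2 + \sigma^2/n)\) on \(\bar{y}\) approaches 1.

```python
# Sketch: Normal-Normal posterior mean as a weighted average of prior mean and
# sample mean, for increasing sample sizes (assumed illustrative values).
import numpy as np

rng = np.random.default_rng(5)
theta_true, sigma, mu0, tau = 3.0, 2.0, 0.0, 1.0

for n in (1, 10, 100, 1000):
    y = rng.normal(theta_true, sigma, n)
    ybar = y.mean()
    post_mean = ((sigma**2 / n) * mu0 + tau**2 * ybar) / (tau**2 + sigma**2 / n)
    w_data = tau**2 / (tau**2 + sigma**2 / n)    # weight given to the sample mean
    print(f"n={n:5d}  weight on ybar={w_data:.3f}  posterior mean={post_mean:.3f}")
```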

26. What is the "size principle" in Bayesian concept learning?

The likelihood of a hypothesis \(h\) is \(p(D|h) = (1/|h|)^N\), where \(|h|\) is the size of the hypothesis (number of items it contains) and \(N\) is the number of data points. This means smaller, more specific hypotheses that are consistent with the data receive a higher likelihood. This is a form of Occam's razor.

Source: Machine Learning: A Probabilistic Perspective, Section 3.2.1, Page 67.

27. The Lasso estimator in frequentist regression is equivalent to a MAP estimate under which prior on the regression weights?

The Lasso estimator uses an L1 penalty, \(\lambda \sum |\beta_j|\). This penalty corresponds to the negative log of a Laplace prior distribution, \(p(\beta_j) \propto \exp(-\gamma |\beta_j|)\). This is why Lasso tends to produce sparse solutions (many weights are exactly zero).

Source: Bayesian Linear Regression.html, Section: Bayesian linear regression.

28. What is a major drawback of using the posterior mode (MAP) as a point estimate for a skewed distribution?

For skewed distributions, such as a Gamma or a skewed Beta, the mode can be at the edge of the distribution (e.g., at 0), while the mean and the bulk of the probability mass are located elsewhere. In such cases, the mode is a poor summary of the distribution.

Source: Machine Learning: A Probabilistic Perspective, Section 5.2.1.3, Page 151.

29. The Jeffreys prior for the rate parameter \(\lambda\) of a Poisson distribution is:

The Fisher information for \(n\) observations from a Poisson(\(\lambda\)) distribution is \(I(\lambda) = n/\lambda\). The Jeffreys prior is proportional to the square root of the Fisher information, so \(\pi(\lambda) \propto (n/\lambda)^{1/2} \propto 1/\sqrt{\lambda}\).

Source: Bayesian Inference Essentials Part 2(1).html, Section: Prior specification.
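
A short derivation consistent with this statement, written out for completeness:

\[
\log f(y \mid \lambda) = y \log\lambda - \lambda - \log y!, \qquad
\frac{\partial^2}{\partial \lambda^2} \log f(y \mid \lambda) = -\frac{y}{\lambda^2},
\]
\[
I(\lambda) = -E\!\left[-\frac{y}{\lambda^2}\right] = \frac{E[y]}{\lambda^2} = \frac{1}{\lambda}
\ \text{ per observation, so } I(\lambda) = \frac{n}{\lambda} \text{ for } n \text{ i.i.d. observations, and }
\pi(\lambda) \propto \sqrt{I(\lambda)} \propto \lambda^{-1/2}.
\]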

30. In Bayesian model comparison, the Bayes Factor \(B_{10}\) is the ratio of:

The Bayes factor is defined as \(B_{10}(y) = \frac{P(y|H_1)}{P(y|H_0)}\), which is the ratio of the marginal likelihoods. It measures the strength of evidence provided by the data for one model over the other.

Source: Bayesian Inference Essentials Part 1(1).html, Section: Bayesian hypothesis testing and model choice.
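
A sketch of a Bayes-factor computation for a coin-flip example (the hypotheses, prior, and data are assumed, not taken from the sources): \(H_0\) fixes \(\theta = 0.5\), \(H_1\) places a Beta(1, 1) prior on \(\theta\), and both marginal likelihoods are available in closed form.

```python
# Sketch: Bayes factor B10 for a coin-flip example. H0: theta = 0.5 exactly;
# H1: theta ~ Beta(1, 1). Both marginal likelihoods have closed forms.
from math import comb
from scipy.special import betaln
import numpy as np

n, y = 20, 15                                   # assumed data: 15 heads in 20 flips

# Marginal likelihood under H0 (no free parameters): Binomial(y | n, 0.5)
log_m0 = np.log(comb(n, y)) + n * np.log(0.5)

# Marginal likelihood under H1: integrate the binomial against the Beta(1,1) prior,
#   p(y | H1) = C(n, y) * B(1 + y, 1 + n - y) / B(1, 1)
log_m1 = np.log(comb(n, y)) + betaln(1 + y, 1 + n - y) - betaln(1, 1)

B10 = np.exp(log_m1 - log_m0)
print("Bayes factor B10 =", round(B10, 3))      # > 1 favours H1, < 1 favours H0
```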

31. What is the "sifting property" of the Dirac delta function \(\delta(x)\)?

The sifting property states that integrating the product of a function \(f(x)\) and a Dirac delta function centered at \(\mu\) "sifts" out the value of the function at \(\mu\): \(\int f(x)\,\delta(x-\mu)\,dx = f(\mu)\). This is a fundamental property used in signal processing and physics.

Source: Machine Learning: A Probabilistic Perspective, Section 2.4.2, Page 39.

32. A Highest Posterior Density (HPD) region is always:

For a given probability \(1-\alpha\), the HPD region is the set of values \(C = \{\theta : p(\theta|D) \ge p^*\}\) that contains \(1-\alpha\) of the probability mass. For a unimodal distribution, this results in the shortest possible interval, as it includes the most probable values and excludes the least probable ones.

Source: Machine Learning: A Probabilistic Perspective, Section 5.2.2.1, Page 154.

33. In the Normal-Normal conjugate model, the posterior precision is the sum of:

In a simple Normal-Normal model where we observe \(N\) data points, the posterior precision is \(\lambda_N = \lambda_0 + N\lambda_y\), where \(\lambda_0\) is the prior precision and \(\lambda_y\) is the precision of each measurement. This shows that information, as measured by precision, adds up.

Source: Machine Learning: A Probabilistic Perspective, Section 4.4.2.1, Page 121.

34. What is Laplace's rule of succession used for?

Laplace's rule of succession, \(p(X=1|D) = \frac{N_1+1}{N+2}\), is the posterior predictive probability for a Bernoulli trial under a uniform Beta(1,1) prior. It is a form of add-one smoothing that prevents predicting zero probability for an event that has not yet been observed.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.4.1, Page 77.
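
A two-line sketch contrasting the plug-in MLE prediction with Laplace's rule when an event has never been observed (the counts are assumed):

```python
# Sketch: add-one smoothing via Laplace's rule of succession vs. the MLE plug-in.
N1, N = 0, 5                        # zero successes observed in five trials (assumed)

p_mle = N1 / N                      # MLE plug-in: predicts the event is impossible
p_laplace = (N1 + 1) / (N + 2)      # posterior predictive under a Beta(1,1) prior

print(p_mle, p_laplace)             # 0.0 vs. ~0.143
```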

35. The "effective number of parameters" \(\gamma\) in the evidence framework for Bayesian linear regression measures:

The quantity \(\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}\) measures the number of parameters whose values are tightly constrained by the data (where the eigenvalue of the Hessian \(\lambda_i\) is large compared to the prior precision \(\alpha\)) versus those that are determined by the prior.

Source: Machine Learning: A Probabilistic Perspective, Section 3.5.3, Page 170.
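
A sketch computing \(\gamma\) from the eigenvalues of \(\beta \Phi^T \Phi\) over a range of prior precisions \(\alpha\) (the design matrix and \(\beta\) are assumed):

```python
# Sketch: effective number of parameters gamma = sum_i lambda_i / (alpha + lambda_i),
# where lambda_i are the eigenvalues of beta * Phi^T Phi (assumed random design).
import numpy as np

rng = np.random.default_rng(6)
Phi = rng.normal(size=(100, 8))
beta = 25.0

eigvals = np.linalg.eigvalsh(beta * Phi.T @ Phi)    # the lambda_i

for alpha in (0.01, 1.0, 100.0, 1e4):
    gamma = np.sum(eigvals / (alpha + eigvals))
    print(f"alpha={alpha:>8}: effective parameters gamma = {gamma:.2f} of {Phi.shape[1]}")
```

With a weak prior (small \(\alpha\)) almost all parameters are determined by the data and \(\gamma\) approaches the total number of basis functions; with a very strong prior \(\gamma\) shrinks toward zero.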

36. Which of these is NOT a valid basis function for a linear regression model?

Linear basis function models are of the form \(y(x, w) = \sum_j w_j \phi_j(x)\). The basis functions \(\phi_j(x)\) must be fixed functions of the input variables \(x\) alone (typically nonlinear, e.g. polynomials or Gaussians). They cannot depend on the adjustable parameters \(w_j\); otherwise the model is no longer linear in its parameters.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1, Page 138.

37. In a frequentist view, what does the bias of an estimator measure?

The bias of an estimator \(\hat{\theta}\) is defined as \(\text{bias}(\hat{\theta}) = E[\hat{\theta}(D)] - \theta^*\), where the expectation is taken over the sampling distribution of the data \(D\) and \(\theta^*\) is the true parameter value. It measures the systematic error of the estimator.

Source: A First Course in Machine Learning, Section 2.8.

38. What is the primary motivation for using the log-sum-exp trick in generative classifiers?

When computing the posterior \(p(y=c|x) \propto p(x|y=c)p(y=c)\), the class-conditional density \(p(x|y=c)\) can be a very small number for high-dimensional \(x\). Working in the log domain prevents these numbers from underflowing to zero. The log-sum-exp trick is a stable way to compute the normalization constant \(\log(\sum_c \exp(b_c))\) in the log domain.

Source: Machine Learning: A Probabilistic Perspective, Section 3.5.3, Page 86.
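
A minimal sketch of the trick: subtracting the maximum before exponentiating keeps \(\log \sum_c \exp(b_c)\) finite even when every \(\exp(b_c)\) would underflow on its own (the log values below are assumed for illustration).

```python
# Sketch of the log-sum-exp trick: shift by the maximum log value before exponentiating.
import numpy as np

def logsumexp(b):
    m = np.max(b)
    return m + np.log(np.sum(np.exp(b - m)))

log_joint = np.array([-1000.0, -1001.0, -1002.0])     # log p(x, y=c): tiny probabilities

naive = np.log(np.sum(np.exp(log_joint)))             # exp underflows to 0 -> -inf (warns)
stable = logsumexp(log_joint)                         # finite, correct log normaliser

log_posterior = log_joint - stable                    # log p(y=c | x)
print(naive, stable, np.exp(log_posterior))
```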

39. The posterior predictive distribution for a Beta-Binomial model is known as the:

When you have a binomial likelihood for the data and a beta distribution as the prior for the success probability, the resulting posterior predictive distribution for the number of successes in a new set of trials is the Beta-binomial distribution. It is also known as a compound distribution.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.4.2, Page 78.

40. In Quadratic Discriminant Analysis (QDA), what assumption is made about the class-conditional densities?

QDA models the class-conditional density for each class \(c\) as a multivariate Gaussian \(N(x|\mu_c, \Sigma_c)\) with its own mean \(\mu_c\) and covariance matrix \(\Sigma_c\). This leads to quadratic decision boundaries between classes.

Source: Machine Learning: A Probabilistic Perspective, Section 4.2.1, Page 102.

41. What is the key assumption of the Naive Bayes classifier?

The "naive" assumption is that all features are conditionally independent of each other, given the class label. This allows the class-conditional density to be factored into a product of one-dimensional densities: \(p(x|y=c, \theta) = \prod_{j=1}^D p(x_j|y=c, \theta_{jc})\).

Source: Machine Learning: A Probabilistic Perspective, Section 3.5, Page 82.
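
A minimal sketch of this factorisation for binary features; the toy data and the add-one smoothing of the counts are assumptions made for illustration.

```python
# Sketch: Bernoulli naive Bayes, p(y=c | x) ∝ p(y=c) * prod_j p(x_j | y=c),
# with class-conditional feature probabilities estimated by smoothed counts.
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])          # 5 examples, 3 binary features (assumed toy data)
y = np.array([1, 1, 1, 0, 0])      # class labels

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
# theta[c, j] = p(x_j = 1 | y = c), with add-one smoothing of the counts
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

x_new = np.array([1, 0, 0])
log_post = np.log(priors) + (x_new * np.log(theta) +
                             (1 - x_new) * np.log(1 - theta)).sum(axis=1)
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(dict(zip(classes.tolist(), post.round(3))))     # p(y=c | x_new) for each class
```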

42. The posterior mode (MAP) is the Bayes estimator that minimizes which loss function?

The 0-1 loss function (\(L(y, a) = I(y \neq a)\)) incurs a loss of 1 for any incorrect prediction and 0 for a correct one. Minimizing the posterior expected loss under this function requires choosing the action (class label) that maximizes the posterior probability, which is the posterior mode or MAP estimate.

Source: Machine Learning: A Probabilistic Perspective, Section 5.7.1.1, Page 177.

43. What is the primary advantage of using a mixture of conjugate priors?

A mixture of conjugate priors is also a conjugate prior. This allows for the creation of very flexible prior distributions (e.g., multi-modal) that can approximate nearly any shape, while still resulting in a posterior (which will also be a mixture) that can be computed analytically.

Source: Machine Learning: A Probabilistic Perspective, Section 5.4.4, Page 169.

44. In the context of Bayesian inference, what is shrinkage?

Shrinkage is the phenomenon where the posterior estimate becomes a compromise between the data (represented by the MLE) and the prior beliefs. The posterior mean is "shrunk" towards the prior mean, with the amount of shrinkage depending on the relative strength (precision) of the prior and the likelihood.

Source: Machine Learning: A Probabilistic Perspective, Section 4.4.2.2, Page 122.

45. What is the primary issue with using an improper prior in Bayesian model selection?

An improper prior is defined only up to an arbitrary multiplicative constant. The marginal likelihood inherits this arbitrary constant and is therefore ill-defined, which makes comparing models via Bayes factors problematic: the resulting ratio can be made arbitrarily large or small.

Source: Machine Learning: A Probabilistic Perspective, Section 5.3.4, Page 165.

46. The Central Limit Theorem is important for Bayesian inference because it suggests that:

The Bayesian Central Limit Theorem states that, under regularity conditions, the posterior distribution converges to a Gaussian centered at the MAP estimate as the amount of data increases. This justifies the use of Gaussian approximations (like the Laplace approximation) for the posterior.

Source: Bayesian Inference Essentials Part 2(1).html, Section: Bayesian Central Limit Theorem.

47. In a linear regression model \(y = w^T x + \epsilon\) with \(\epsilon \sim N(0, \sigma^2)\), maximizing the likelihood is equivalent to:

The log-likelihood function for this model is \(\ln p(t|X, w, \beta) = -\frac{\beta}{2} \sum_n (t_n - w^T \phi(x_n))^2 + \text{const}\). Maximizing this expression with respect to \(w\) is equivalent to minimizing the sum-of-squares error term \(\sum_n (t_n - w^T \phi(x_n))^2\).

Source: Machine Learning: A Probabilistic Perspective, Section 3.1.1, Page 141.

48. What is the primary purpose of using basis functions in a linear model?

By using a fixed set of nonlinear functions of the input variables, \(\phi_j(x)\), the model \(y(x, w) = \sum_j w_j \phi_j(x)\) can model complex, non-linear responses. The key is that the model remains a linear function of the parameters \(w\), which preserves many of its simple analytical properties.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1, Page 138.

49. The "black swan paradox" in the context of Bayesian inference illustrates the problem of:

If you have only ever seen white swans, the MLE for the probability of seeing a black swan is zero. This leads to the absurd prediction that black swans are impossible. Bayesian methods, by using a prior and computing a posterior predictive distribution, can assign non-zero probability to unseen events, thus avoiding this paradox.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.4.1, Page 77.

50. In the context of Bayesian linear regression, what is the effect of a very strong prior (e.g., a Gaussian with very small variance)?

A strong prior (high precision, low variance) represents a strong belief. The posterior is a compromise between the prior and the likelihood. If the prior is much stronger than the likelihood (which happens with small datasets or very strong prior beliefs), the posterior will be dominated by the prior.

Source: A First Course in Machine Learning, Section 3.3.2.