Bayesian Inference and Learning Quiz (Set 2)

1. In the context of a linear model with Gaussian noise, the maximum likelihood estimate (MLE) for the noise variance \(\hat{\sigma}^2\) is biased. The expected value of the MLE, \(E[\hat{\sigma}^2]\), is:

The MLE for the variance is biased because the model parameters \(w\) are chosen to minimize the squared error on the training data, effectively fitting some of the noise. This leads to an underestimation of the true noise variance. The expected value is \(E[\hat{\sigma}^2] = \sigma^2(1 - D/N)\), where \(N\) is the number of training points and \(D\) the number of fitted parameters; this is always lower than \(\sigma^2\) for \(D>0\).

Source: A First Course in Machine Learning, Section 2.10.2, Page 88.
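
As a quick check of the \(\sigma^2(1 - D/N)\) result, here is a minimal simulation sketch (numpy assumed; the sizes \(N=50\), \(D=5\) and the noise variance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 50, 5, 2.0               # illustrative sizes and true noise variance

est = []
for _ in range(5000):
    X = rng.normal(size=(N, D))         # design matrix
    w = rng.normal(size=D)              # true weights
    y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=N)
    w_ml = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares (ML) fit
    est.append(np.mean((y - X @ w_ml) ** 2))         # MLE of the noise variance

print(np.mean(est))                     # close to sigma2 * (1 - D/N) = 1.8
print(sigma2 * (1 - D / N))             # theoretical expected value of the MLE
```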

2. What is the primary characteristic of the Student's t-distribution that makes it more robust to outliers than the Gaussian distribution?

The Student's t-distribution has heavier tails than a Gaussian, so it assigns more probability to points far from the mean. An outlier is therefore not wildly improbable under the model, and the fit does not need to shift its location or inflate its scale to accommodate it. As a result, outliers have less influence on the parameter estimates, making the distribution more robust.

Source: Machine Learning: A Probabilistic Perspective, Section 2.4.2, Page 40.

3. In Bayesian linear regression, if we use an infinitely broad prior for the weights (e.g., a Gaussian with infinite variance), the posterior mean of the weights will be equal to:

An infinitely broad prior (e.g., \(S_0 = \alpha^{-1}I\) with \(\alpha \to 0\)) represents a state of minimal prior information. In this case, the posterior mean \(m_N\) reduces to the maximum likelihood value \(w_{ML}\), as the data completely dominates the prior.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.1, Page 153.
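
A minimal numpy sketch of this limit, using an illustrative design matrix, a known noise precision \(\beta\), and a tiny prior precision standing in for \(\alpha \to 0\): the posterior mean \(m_N = \beta S_N \Phi^T t\) with \(S_N^{-1} = \alpha I + \beta \Phi^T\Phi\) collapses onto the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 4))                 # illustrative design matrix
t = Phi @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.3, size=30)
beta = 1.0 / 0.3**2                            # noise precision (assumed known here)

def posterior_mean(alpha):
    S_N_inv = alpha * np.eye(4) + beta * Phi.T @ Phi
    return beta * np.linalg.solve(S_N_inv, Phi.T @ t)   # m_N = beta * S_N * Phi^T t

w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
print(posterior_mean(1e-12))   # essentially equal to w_ml once the prior is very broad
print(w_ml)
```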

4. The posterior median is the Bayes estimator that minimizes which loss function?

The posterior median is the optimal estimator for the absolute loss function, \(L(y, a) = |y - a|\). This loss function is less sensitive to outliers than the squared error loss, making the median a more robust estimator than the mean.

Source: Machine Learning: A Probabilistic Perspective, Section 5.7.1.4, Page 179.
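
A small numeric illustration (numpy; the Gamma-distributed samples are an arbitrary stand-in for posterior samples): minimising the expected absolute loss over a grid of candidate estimates recovers the sample median.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.gamma(shape=2.0, scale=1.0, size=20_000)     # stand-in posterior samples

grid = np.linspace(0, 10, 2001)                            # candidate point estimates a
expected_abs_loss = [np.mean(np.abs(samples - a)) for a in grid]

print(grid[np.argmin(expected_abs_loss)])   # minimiser of E|y - a| over the grid
print(np.median(samples))                   # agrees with the sample median (up to grid spacing)
```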

5. What is the key property of the Moore-Penrose pseudo-inverse \(\Phi^\dagger = (\Phi^T\Phi)^{-1}\Phi^T\)?

The Moore-Penrose pseudo-inverse provides a "best fit" (least-squares) solution to a system of linear equations that has no exact or unique solution. It generalizes the concept of a matrix inverse to non-square matrices, which is essential for solving the normal equations in least-squares regression, where the design matrix \(\Phi\) is typically not square. The closed form \((\Phi^T\Phi)^{-1}\Phi^T\) applies when \(\Phi\) has full column rank, so that \(\Phi^T\Phi\) is invertible.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1.1, Page 142.
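
A quick numpy check, using an arbitrary tall design matrix with full column rank, that the closed form above agrees with the SVD-based pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(6)
Phi = rng.normal(size=(20, 3))                        # tall (non-square) design matrix
explicit = np.linalg.inv(Phi.T @ Phi) @ Phi.T         # (Phi^T Phi)^{-1} Phi^T
print(np.allclose(explicit, np.linalg.pinv(Phi)))     # True: matches numpy's pseudo-inverse
```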

6. In the context of the Poisson-Gamma conjugate model, if the prior is Gamma(\(\alpha, \beta\)) and we observe \(n\) data points with sum \(\sum y_i\), the posterior for the rate \(\lambda\) is:

The posterior is found by updating the prior's hyperparameters with the sufficient statistics from the data. For a Poisson likelihood, the sufficient statistic is the sum of the counts, \(\sum y_i\). The posterior is therefore \(\text{Gamma}(\alpha_{new}, \beta_{new}) = \text{Gamma}(\alpha+\sum y_i, \beta+n)\).

Source: Bayesian Inference Essentials Part 1(1).html, Section: Examples: conjugate models.
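
A minimal sketch of this update (numpy/scipy assumed; the prior hyperparameters and the counts are illustrative, and the Gamma is taken in its rate parameterisation):

```python
import numpy as np
from scipy import stats

alpha0, beta0 = 2.0, 1.0                       # illustrative Gamma(alpha, beta) prior
y = np.array([3, 5, 1, 4, 2])                  # illustrative Poisson counts

alpha_n = alpha0 + y.sum()                     # alpha + sum(y_i)
beta_n = beta0 + len(y)                        # beta + n
posterior = stats.gamma(a=alpha_n, scale=1.0 / beta_n)

print(alpha_n, beta_n)                         # 17.0, 6.0
print(posterior.mean())                        # posterior mean of lambda = 17/6 ≈ 2.83
```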

7. What is a key limitation of fixed basis function models in high-dimensional spaces?

The main difficulty with fixed basis functions is the curse of dimensionality. To cover a high-dimensional space adequately with a grid of basis functions (like Gaussians), the number of functions required grows exponentially, making the model computationally intractable and prone to overfitting.

Source: Machine Learning: A Probabilistic Perspective, Section 3.6, Page 173.

8. The Jeffreys prior for a location parameter \(\mu\) (like the mean of a Gaussian) is \(\pi(\mu) \propto 1\). This is an example of a:

A prior is translation invariant if it assigns the same probability mass to any interval of the same width, regardless of its location. The uniform prior \(\pi(\mu) \propto 1\) has this property. It means that our prior belief is not changed if we shift the origin of our coordinate system.

Source: Machine Learning: A Probabilistic Perspective, Section 5.4.2.2, Page 168.

9. In the context of Bayesian linear regression, what does the term "weight decay" refer to?

The quadratic regularization term \(\frac{\lambda}{2} w^T w\) is known as weight decay in the machine learning literature. In sequential (gradient-based) learning, its gradient contributes \(-\eta\lambda w\) to each update, so the weights are multiplied by a factor slightly less than one at every step: they decay towards zero unless supported by the data, which controls overfitting.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1.4, Page 144.
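
A minimal sketch of why the term is called weight decay, assuming a single-example squared-error objective with an L2 penalty (the learning rate and \(\lambda\) values are illustrative):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1, lam=0.5):
    """One stochastic gradient step on 0.5*(y - w @ x)**2 + 0.5*lam*||w||^2."""
    grad_data = -(y - w @ x) * x              # gradient of the squared-error term
    grad_reg = lam * w                        # gradient of the weight-decay term
    # Equivalently: w is first shrunk by the factor (1 - lr*lam) -- it "decays" --
    # and then moved in the direction suggested by the data.
    return w - lr * (grad_data + grad_reg)

w = np.array([2.0, -3.0])
print(sgd_step(w, x=np.zeros(2), y=0.0))      # with no data signal the weights just shrink: [1.9, -2.85]
```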

10. What is the primary difference between Quadratic Discriminant Analysis (QDA) and Linear Discriminant Analysis (LDA)?

The key difference is the assumption about the covariance matrices of the class-conditional Gaussians. QDA uses a separate \(\Sigma_c\) for each class, leading to quadratic boundaries. LDA assumes \(\Sigma_c = \Sigma\) for all classes, which simplifies the model and results in linear decision boundaries.

Source: Machine Learning: A Probabilistic Perspective, Section 4.2.2, Page 103.

11. The posterior predictive mean in a Bayesian linear regression model, \(E[y|x, D]\), is given by:

The optimal estimate under squared error loss is the posterior mean. For a linear model, this means we should use the posterior mean of the parameters, \(E[w|D]\), to make predictions. This is \(E[y|x, D] = E[x^T w|D] = x^T E[w|D]\).

Source: Machine Learning: A Probabilistic Perspective, Section 5.7.1.3, Page 179.

12. What is a key property of the exponential family of distributions?

A very important property of the exponential family is that for any likelihood function from this family, a conjugate prior can be constructed. This makes Bayesian inference analytically tractable for a wide range of models, including the Gaussian, Bernoulli, Poisson, and Gamma distributions.

Source: Machine Learning: A Probabilistic Perspective, Section 2.4.2, Page 117.

13. In the context of Bayesian model selection, what does it mean if the Bayes Factor \(B_{10}\) is less than 1?

The Bayes Factor \(B_{10}\) is the ratio of the marginal likelihood of Model 1 to that of Model 0. If \(B_{10} < 1\), the denominator (the marginal likelihood of Model 0) is larger, indicating that the data provides more evidence for Model 0.

Source: Machine Learning: A Probabilistic Perspective, Section 5.3.3, Page 163.

14. What is a Parzen window?

In kernel density estimation, the density is estimated by placing a kernel function at each data point. This kernel function is also known as a Parzen window. Common choices include the Gaussian kernel or a uniform (boxcar) kernel.

Source: Machine Learning: A Probabilistic Perspective, Section 2.5.1, Page 123.

15. In a hierarchical Bayesian model, what is a hyper-parameter?

In a hierarchical model, we place priors on the parameters, and these priors themselves have parameters. These are called hyper-parameters. For example, in \(\theta \sim \text{Beta}(\alpha, \beta)\), \(\alpha\) and \(\beta\) are hyper-parameters. We can place priors on them, which are called hyper-priors.

Source: A First Course in Machine Learning, Section 3.5, Page 120.

16. What is the main difference between kernel density estimation and the k-nearest-neighbor density estimation technique?

Both methods are based on the density estimate \(p(x) \approx K/(NV)\). The kernel approach fixes the volume \(V\) (the bandwidth \(h\) of the kernel) and counts the number of points \(K\) that fall within it. The k-NN approach fixes the number of points \(K\) and determines the volume \(V\) required to enclose them.

Source: Machine Learning: A Probabilistic Perspective, Section 2.5.1, Page 123.
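
A one-dimensional sketch of both estimators at a single query point (numpy; the sample, bandwidth and \(K\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=200)                   # illustrative 1-D sample from N(0, 1)
x0 = 0.5                                      # point at which to estimate the density
N = len(data)

# Kernel (Parzen) estimate with a boxcar kernel: fix the volume V = 2h, count K.
h = 0.25
K = np.sum(np.abs(data - x0) < h)
p_kernel = K / (N * 2 * h)

# k-NN estimate: fix K, grow the volume until it contains K points.
K_nn = 10
V = 2 * np.sort(np.abs(data - x0))[K_nn - 1]  # width of the smallest interval holding K_nn points
p_knn = K_nn / (N * V)

print(p_kernel, p_knn)                        # both roughly the N(0,1) density at 0.5 (≈ 0.35)
```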

17. The trace trick, \(x^T A x = \text{tr}(A x x^T)\), is useful for:

The trace trick allows one to reorder matrix products inside a trace operator. This is particularly useful when taking derivatives of matrix expressions with respect to a matrix. It is used to simplify the log-likelihood of a multivariate Gaussian, which facilitates finding the MLE for the covariance matrix.

Source: Machine Learning: A Probabilistic Perspective, Section 4.1.3.1, Page 100.
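
A quick numpy check of the identity on random values:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=3)
A = rng.normal(size=(3, 3))

lhs = x @ A @ x                      # x^T A x  (a scalar)
rhs = np.trace(A @ np.outer(x, x))   # tr(A x x^T)
print(np.isclose(lhs, rhs))          # True
```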

18. In the context of Bayesian linear regression, what is the effect of the prior precision parameter \(\alpha\) on the model complexity?

The prior is \(p(w|\alpha) = N(w|0, \alpha^{-1}I)\). A large \(\alpha\) means a small variance, forcing the weights \(w\) to be close to zero. This is a form of regularization that penalizes complex models, leading to smoother functions and preventing overfitting.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.1, Page 153.

19. What is a key property of the Dirichlet distribution?

The Dirichlet distribution is defined over the probability simplex, which is the set of vectors \(\theta\) where \(\theta_k \ge 0\) and \(\sum_k \theta_k = 1\). This makes it the natural choice as a conjugate prior for the parameters of a multinomial or categorical distribution.

Source: Machine Learning: A Probabilistic Perspective, Section 2.5.4, Page 47.

20. The sequential update property of Bayesian inference means that:

Bayesian updating is naturally sequential. The posterior distribution after seeing some data can be treated as the prior for the next piece of data. This makes Bayesian methods well-suited for online learning, where data arrives one point at a time.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.1, Page 153.
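
A minimal Beta-Bernoulli sketch (the Beta(1,1) prior and the observations are illustrative): updating one observation at a time gives exactly the same posterior as a single batch update.

```python
a, b = 1.0, 1.0                 # Beta(a, b) prior on the success probability
data = [1, 0, 1, 1, 0, 1]       # Bernoulli observations, processed sequentially

for y in data:                  # yesterday's posterior is today's prior
    a, b = a + y, b + (1 - y)

print(a, b)                     # Beta(5, 3): same as a + sum(data), b + n - sum(data)
```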

21. What is the primary reason for using Monte Carlo methods in Bayesian inference?

Many integrals required in Bayesian inference, such as the marginal likelihood or posterior expectations of complex functions, do not have a closed-form solution. Monte Carlo methods provide a way to approximate these integrals by drawing samples from the relevant distribution and computing an empirical average.

Source: Bayesian Inference Essentials Part 2(1).html, Section: Monte Carlo Integration.
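
A minimal sketch of Monte Carlo integration (numpy; the standard normal and the integrand \(e^{-x^2}\) are illustrative choices, picked because the answer has a known closed form to compare against):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)            # samples from the relevant distribution, here N(0, 1)

estimate = np.mean(np.exp(-x**2))         # Monte Carlo estimate of E[exp(-X^2)]
print(estimate)                           # ≈ 1/sqrt(3) ≈ 0.577 for this toy integrand
```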

22. In a hierarchical model for cancer rates across different cities, why is it beneficial to treat the parameters of the prior (hyperparameters) as unknown random variables?

By inferring the hyperparameters from the data, information is shared across all groups (cities). The cancer rate estimate for a city with a small population (data-poor) will be shrunk towards the overall population mean, which is heavily influenced by cities with large populations (data-rich). This leads to more stable and reliable estimates for all cities.

Source: Machine Learning: A Probabilistic Perspective, Section 5.5.1, Page 171.

23. The entropy of a discrete random variable is maximized when:

Entropy measures uncertainty. A uniform distribution, where all outcomes are equally likely, represents the state of maximum uncertainty. A degenerate distribution that places all of its mass on a single outcome, so the outcome is certain, has zero entropy.

Source: Machine Learning: A Probabilistic Perspective, Section 2.8.1, Page 56.
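
A quick numeric check (numpy; base-2 logarithms, so entropy is measured in bits):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H = sum_k p_k * log2(1/p_k); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return np.sum(nz * np.log2(1.0 / nz))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform over 4 states is maximal
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits: the outcome is certain
```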

24. What is the "bag of words" model in the context of document classification?

The bag of words model simplifies a document by treating it as an unordered collection (a "bag") of words. It assumes that each word is drawn independently from a class-conditional distribution, which makes it a suitable application for the Naive Bayes classifier.

Source: Machine Learning: A Probabilistic Perspective, Section 3.4.4.1, Page 81.

25. The value of the Kullback-Leibler (KL) divergence \(KL(p||q)\) is zero if and only if:

The KL divergence is a measure of dissimilarity between two probability distributions. It is always non-negative and is only zero when the two distributions are identical. It is not a true distance metric because it is not symmetric (i.e., \(KL(p||q) \neq KL(q||p)\)).

Source: Machine Learning: A Probabilistic Perspective, Section 2.8.2, Page 58.

26. In the context of linear models, if the basis functions are global (like polynomials), what is a major consequence?

Global basis functions, like powers of \(x\), have support across the entire input space. This means that adjusting the model to fit a data point in one area will change the function's value everywhere, which can be a limitation. Local basis functions, like Gaussians or splines, resolve this by only having influence in a specific region.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1, Page 139.

27. What is the effect of the temperature parameter \(T\) in a softmax function \(S(\eta/T)_c\)?

The temperature parameter controls the "peakedness" of the softmax distribution. A low temperature (\(T \to 0\)) makes the output approximate a one-hot vector corresponding to the maximum value (like an argmax function). A high temperature (\(T \to \infty\)) makes the output approach a uniform distribution.

Source: Machine Learning: A Probabilistic Perspective, Section 4.2.2, Page 104.
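
A minimal sketch of the effect (numpy; the scores and temperature values are illustrative):

```python
import numpy as np

def softmax(eta, T=1.0):
    z = eta / T
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

eta = np.array([2.0, 1.0, 0.1])     # illustrative scores
print(softmax(eta, T=1.0))          # moderately peaked
print(softmax(eta, T=0.1))          # nearly one-hot at the argmax
print(softmax(eta, T=100.0))        # nearly uniform
```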

28. The "base rate fallacy" occurs when one:

The base rate fallacy is the error of confusing \(p(A|B)\) with \(p(B|A)\) and ignoring the base rate, or prior probability, \(p(B)\). For example, in medical diagnosis it is the mistake of confusing the probability of a positive test given the disease with the probability of having the disease given a positive test, while ignoring the low prevalence (prior probability) of the disease.

Source: Machine Learning: A Probabilistic Perspective, Section 2.2.3.1, Page 30.
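
A worked example with purely illustrative numbers (a hypothetical test and prevalence, not figures from the source): even a fairly accurate test yields a modest posterior probability of disease when the base rate is low.

```python
# Hypothetical test: 99% sensitivity, 5% false positive rate, 1% prevalence (the base rate).
p_disease = 0.01                     # p(B): the base rate
p_pos_given_disease = 0.99           # p(A|B)
p_pos_given_healthy = 0.05           # p(A|not B)

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))          # p(A) by total probability
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos   # Bayes' rule

print(p_disease_given_pos)           # ≈ 0.17, far below p(A|B) = 0.99
```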

29. In the context of the Dirichlet-multinomial model, what is the purpose of the Dirichlet prior?

The Dirichlet distribution is the conjugate prior for the multinomial likelihood. Using it allows for an analytical posterior distribution. The posterior mean provides a smoothed estimate of the multinomial probabilities, which is particularly useful for handling the zero-count problem in language models.

Source: Machine Learning: A Probabilistic Perspective, Section 3.4, Page 79.

30. What is a key advantage of Bayesian model averaging over using a single MAP estimate for prediction?

Bayesian model averaging (BMA) computes the posterior predictive distribution by averaging the predictions of all hypotheses, weighted by their posterior probabilities. This incorporates uncertainty about which hypothesis is correct. A single MAP estimate ignores this uncertainty, which can lead to overconfident predictions, especially when the data is ambiguous.

Source: Machine Learning: A Probabilistic Perspective, Section 3.2.4, Page 71.

31. The Wishart distribution is a multivariate generalization of which distribution?

The Wishart distribution is a distribution over positive definite matrices. It is a multivariate generalization of the Gamma distribution and serves as a conjugate prior for the precision matrix of a multivariate Gaussian.

Source: Machine Learning: A Probabilistic Perspective, Section 4.5, Page 125.

32. What is the effect of a small bandwidth \(h\) in kernel density estimation?

A small bandwidth \(h\) means the kernels are very narrow. This results in a density estimate that is very spiky, with a peak at each data point. This corresponds to high variance and low bias, and can be seen as overfitting the data.

Source: Machine Learning: A Probabilistic Perspective, Section 2.5.1, Page 124.

33. In the context of linear models, what is sequential learning (or online learning)?

Sequential or online learning algorithms process data points one by one, updating the model parameters after each presentation. This is useful for large datasets where batch processing is computationally expensive, and for real-time applications where data arrives in a continuous stream. Stochastic gradient descent is a common technique for sequential learning.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1.3, Page 143.

34. The posterior predictive distribution for a Normal-Normal model with unknown mean and variance (using a Normal-gamma prior) is a:

When both the mean and variance of a Gaussian are unknown, the conjugate prior is a Normal-gamma distribution. Integrating out both unknown parameters results in a posterior predictive distribution that is a Student's t-distribution. This accounts for the uncertainty in both the mean and the variance, resulting in heavier tails than a Gaussian.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.2, Page 158.

35. What is the main purpose of the "add-one smoothing" (Laplace smoothing) technique in language models?

Add-one smoothing is a Bayesian technique that corresponds to using a uniform Dirichlet prior (or Beta prior in the binary case). By adding a pseudo-count of 1 to every word, it ensures that even words not seen in the training data are assigned a small non-zero probability, thus avoiding the zero-count problem where the model would otherwise predict unseen words are impossible.

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.4.1, Page 77.
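
A minimal sketch with illustrative counts, comparing the MLE to the add-one (Laplace) estimate:

```python
import numpy as np

counts = np.array([5, 3, 0, 2])                 # illustrative word counts; one word was never seen
K = len(counts)

p_mle = counts / counts.sum()                   # MLE assigns probability 0 to the unseen word
p_laplace = (counts + 1) / (counts.sum() + K)   # add-one (uniform Dirichlet prior) smoothing

print(p_mle)                                    # [0.5, 0.3, 0. , 0.2]
print(p_laplace)                                # ≈ [0.43, 0.29, 0.07, 0.21]: no zero probabilities
```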

36. The decision boundaries of a multi-class linear discriminant analysis (LDA) are always:

In LDA, the decision boundary between any two classes \(c\) and \(c'\) is the set of points where \(p(y=c|x) = p(y=c'|x)\). Because of the shared covariance matrix assumption, the quadratic terms in \(x\) cancel out, leaving a linear equation in \(x\). This results in linear decision boundaries (hyperplanes).

Source: Machine Learning: A Probabilistic Perspective, Section 4.2.2, Page 104.

37. What is the effect of a large regularization coefficient \(\lambda\) in ridge regression?

A large \(\lambda\) places a heavy penalty on the magnitude of the weights (\(\lambda w^T w\)). This forces the weights to be small, resulting in a simpler, smoother function. This increases the model's bias (it may not fit the training data as well) but reduces its variance (it is less sensitive to noise in the data), which can improve generalization.

Source: Machine Learning: A Probabilistic Perspective, Section 3.1.4, Page 144.

38. The posterior mean of the Beta(\(\alpha, \beta\)) distribution is \(\frac{\alpha}{\alpha+\beta}\). The posterior mode is:

The mode of the Beta(\(\alpha, \beta\)) distribution is given by \(\frac{\alpha-1}{\alpha+\beta-2}\), provided \(\alpha, \beta > 1\). This is different from the mean, which is important as it shows that the MAP estimate (mode) and posterior mean estimate are not always the same.

Source: A First Course in Machine Learning, Section 2.6, Page 128, Eq. 2.269.
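
A quick numeric check with an illustrative Beta(3, 9) posterior (scipy assumed):

```python
from scipy import stats

alpha, beta = 3.0, 9.0                        # illustrative posterior hyperparameters
mean = alpha / (alpha + beta)                 # 0.25
mode = (alpha - 1) / (alpha + beta - 2)       # 0.20: MAP estimate differs from the mean
print(mean, mode)
print(stats.beta(alpha, beta).mean())         # 0.25, agrees with the formula
```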

39. In the k-NN classification algorithm, what is a major drawback of choosing a very large value for \(K\) (e.g., \(K=N\))?

If \(K=N\) (the size of the training set), the neighborhood for any test point is the entire dataset. The classifier will therefore always predict the majority class of the entire training set, regardless of the test point's location. This is a very simple, high-bias model that underfits the data.

Source: Machine Learning: A Probabilistic Perspective, Section 1.4.7, Page 22.

40. The process of finding the hyperparameters that maximize the marginal likelihood is also known as:

This approach, where the marginal likelihood \(p(D|\eta)\) is maximized to find the hyperparameters \(\eta\), is known by several names, including Empirical Bayes, Type II Maximum Likelihood, and the evidence procedure. It's a practical way to set hyperparameters without needing to specify hyper-priors or use cross-validation.

Source: Machine Learning: A Probabilistic Perspective, Section 3.5, Page 165.

41. What is the primary difference between a generative and a discriminative classifier?

A generative model learns a model of the joint probability \(p(x,y) = p(x|y)p(y)\). It can be used to generate new data. A discriminative model learns the posterior \(p(y|x)\) directly, focusing only on the decision boundary. LDA is generative, while logistic regression is discriminative.

Source: Machine Learning: A Probabilistic Perspective, Section 4.2.2, Page 104.

42. The variance of a Beta(\(\alpha, \beta\)) distribution is given by:

The variance of a Beta distribution with parameters \(\alpha\) and \(\beta\) is \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\). This shows that the variance depends on both the sum of the parameters (the effective sample size) and their individual values.

Source: A First Course in Machine Learning, Section 3.3.1, Page 105, Eq. 3.7.

43. In Bayesian linear regression, as more data points are observed, what happens to the predictive uncertainty in regions far from the data points?

The predictive variance is \(\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)\). For localized basis functions like Gaussians, in regions far from the data, the basis functions \(\phi(x)\) go to zero. Consequently, the second term (parameter uncertainty) goes to zero, and the predictive variance reverts to the data noise variance, \(1/\beta\).

Source: Machine Learning: A Probabilistic Perspective, Section 3.3.2, Page 158.

44. What is the "one-hot encoding" of a categorical variable?

One-hot encoding is used for categorical variables in multinomial models. If a variable can take one of K states, it is represented as a vector of length K where the k-th element is 1 (or "hot") and all other elements are 0, indicating that the variable is in state k.

Source: Machine Learning: A Probabilistic Perspective, Section 2.3.2, Page 35.

45. The Fisher Information matrix provides a measure of:

The Fisher Information is the expected value of the observed information (the negative Hessian of the log-likelihood). It measures the curvature of the likelihood peak. A high curvature (high information) means the peak is sharp and the parameter is well-determined by the data, leading to low posterior variance.

Source: A First Course in Machine Learning, Section 2.9.1, Page 80.

46. What is a major advantage of using Monte Carlo simulation to approximate a posterior distribution?

The great power of Monte Carlo methods is their generality. As long as we can draw samples from a distribution (even if we don't know its analytical form), we can approximate its properties, such as its mean, variance, and percentiles, by computing statistics on the generated samples.

Source: Bayesian Inference Essentials Part 2(1).html, Section: Bayesian Inference using Monte Carlo.

47. In the context of the number game, why does the hypothesis "powers of two" have a higher likelihood than "even numbers" after observing D = {16, 8, 2, 64}?

This is a direct application of the size principle. Among the integers from 1 to 100, the set of even numbers is large (50 members), while the set of powers of two is small (6 members). The likelihood of observing 4 specific numbers from the "even" set is \((1/50)^4\), which is much smaller than the likelihood of observing them from the "powers of two" set, \((1/6)^4\). The data seems too specific to have been drawn from the very general "even numbers" concept.

Source: Machine Learning: A Probabilistic Perspective, Section 3.2.1, Page 67.
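
A quick check of the likelihood ratio implied by the size principle (sets restricted to the integers 1 to 100, as in the number game):

```python
n_obs = 4                                   # D = {16, 8, 2, 64}
lik_even = (1 / 50) ** n_obs                # 50 even numbers between 1 and 100
lik_pow2 = (1 / 6) ** n_obs                 # 6 powers of two: 2, 4, 8, 16, 32, 64

print(lik_pow2 / lik_even)                  # ≈ 4823: "powers of two" is far more likely
```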

48. What is the relationship between the mean and mode for a symmetric, unimodal posterior distribution?

For any symmetric and unimodal distribution, such as a Gaussian, the point of symmetry is simultaneously the mean (center of mass), the median (50% point), and the mode (the peak).

Source: General statistical knowledge.

49. The "evidence framework" is a method for:

The evidence framework, also known as Type II Maximum Likelihood or Empirical Bayes, is a practical approach to setting hyperparameters. It involves integrating out the main model parameters to get the marginal likelihood (the "evidence"), and then maximizing this evidence with respect to the hyperparameters.

Source: Machine Learning: A Probabilistic Perspective, Section 3.5, Page 165.

50. If two random variables X and Y are independent, their covariance cov(X, Y) is:

If X and Y are independent, then \(E[XY] = E[X]E[Y]\). Since covariance is defined as \(\text{cov}(X, Y) = E[XY] - E[X]E[Y]\), their covariance is 0. The reverse is not always true: zero covariance (uncorrelated) does not imply independence, unless the variables are jointly Gaussian.

Source: Machine Learning: A Probabilistic Perspective, Section 2.5.1, Page 46.