1 / 100
Question
What is the fundamental assumption of simple linear regression?
1 / 100
Answer
Simple linear regression assumes that there is approximately a linear relationship between a single predictor variable \(X\) and a quantitative response \(Y\). Mathematically, this is expressed as \(Y \approx \beta_0 + \beta_1X\).

ISLP, Chapter 3.1

2 / 100
Question
How are the coefficients \(\beta_0\) and \(\beta_1\) in a simple linear regression model estimated?
2 / 100
Answer
The most common method for estimating the coefficients is the least squares method. This method chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the Residual Sum of Squares (RSS), which is the sum of the squared differences between the observed responses \(y_i\) and the predicted responses \(\hat{y}_i\). The formulas are: \(\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\).
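
A minimal numpy sketch of these formulas on simulated data (the intercept 2 and slope 3 below are hypothetical, chosen only so the estimates can be checked by eye):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # simulated data with beta0=2, beta1=3

# Closed-form least squares estimates
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)  # should come out close to 2 and 3
```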

ISLP, Chapter 3.1.1

3 / 100
Question
What is the Residual Sum of Squares (RSS)?
3 / 100
Answer
The Residual Sum of Squares (RSS) is a measure of the discrepancy between the data and an estimation model. It is defined as the sum of the squared residuals (differences between observed and predicted values): \(RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2\).

ISLP, Chapter 3.1.1

4 / 100
Question
What is the difference between the population regression line and the least squares line?
4 / 100
Answer
The population regression line, \(Y = \beta_0 + \beta_1X + \epsilon\), represents the true, unknown linear relationship in the population. The least squares line, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\), is an estimate of the population regression line, computed from the sample data. The least squares line is what we have access to, while the population line is unobservable.

ISLP, Chapter 3.1.2

5 / 100
Question
What does the standard error of a regression coefficient, like SE(\(\hat{\beta}_1\)), measure?
5 / 100
Answer
The standard error of a coefficient estimate, SE(\(\hat{\beta}_1\)), measures the average amount that the estimate \(\hat{\beta}_1\) differs from the actual value of \(\beta_1\). It quantifies the accuracy of the coefficient estimate. A smaller standard error indicates a more precise estimate.

ISLP, Chapter 3.1.2

6 / 100
Question
How is a confidence interval for a regression coefficient interpreted?
6 / 100
Answer
A 95% confidence interval for a coefficient \(\beta_1\) is a range of values constructed so that, under repeated sampling, 95% of such intervals would contain the true unknown value of \(\beta_1\). In other words, if we repeatedly drew samples and built a confidence interval from each, about 95% of those intervals would cover the true value of the coefficient.

ISLP, Chapter 3.1.2

7 / 100
Question
What is the null hypothesis in the context of testing the significance of a regression coefficient \(\beta_1\)?
7 / 100
Answer
The most common null hypothesis (\(H_0\)) is that there is no relationship between the predictor \(X\) and the response \(Y\). Mathematically, this is stated as \(H_0: \beta_1 = 0\). The alternative hypothesis (\(H_a\)) is that there is some relationship, i.e., \(H_a: \beta_1 \neq 0\).

ISLP, Chapter 3.1.2

8 / 100
Question
What is a t-statistic and how is it used in linear regression?
8 / 100
Answer
The t-statistic is used to test the null hypothesis that a coefficient is zero. It is calculated as \(t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}\). It measures how many standard deviations \(\hat{\beta}_1\) is away from 0. A large absolute value for the t-statistic provides evidence against the null hypothesis.
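
A numpy/scipy sketch of this test on simulated data, using the standard simple-regression result \(SE(\hat{\beta}_1) = \hat{\sigma}/\sqrt{\sum_i (x_i - \bar{x})^2}\) with \(\hat{\sigma} = RSE\) (the true coefficients below are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)           # simulated data

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - beta0 - beta1 * x
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))      # residual standard error
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = beta1 / se_beta1                        # how many standard errors beta1_hat is from 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)
```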

ISLP, Chapter 3.1.2

9 / 100
Question
What is the p-value in hypothesis testing for a regression coefficient?
9 / 100
Answer
The p-value is the probability of observing a value of the t-statistic at least as large in absolute value as the one actually computed, assuming the null hypothesis (\(\beta_1 = 0\)) is true. A small p-value (typically < 0.05) indicates that such a strong association would be unlikely to arise by chance, leading us to reject the null hypothesis.

ISLP, Chapter 3.1.2

10 / 100
Question
What are the two main metrics used to assess the accuracy of a linear regression model?
10 / 100
Answer
The two main metrics are the Residual Standard Error (RSE) and the R-squared (\(R^2\)) statistic. RSE measures the average deviation of the response from the true regression line, while \(R^2\) measures the proportion of variance in the response that is explained by the model.

ISLP, Chapter 3.1.3

11 / 100
Question
Define Residual Standard Error (RSE).
11 / 100
Answer
The Residual Standard Error (RSE) is an estimate of the standard deviation of the error term \(\epsilon\). It is the average amount that the response will deviate from the true regression line. It is calculated as \(RSE = \sqrt{\frac{RSS}{n-2}}\).

ISLP, Chapter 3.1.3

12 / 100
Question
Define the R-squared (\(R^2\)) statistic.
12 / 100
Answer
The \(R^2\) statistic measures the proportion of the total variance in the response \(Y\) that is explained by the model. It is calculated as \(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\), where TSS is the Total Sum of Squares, \(\sum(y_i - \bar{y})^2\). It always takes a value between 0 and 1.
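
A short numpy sketch computing RSE and \(R^2\) on simulated data; the last line also checks the identity \(R^2 = Cor(X, Y)^2\) that holds in simple linear regression (the coefficients in the simulation are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)            # simulated data

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - 2))                      # estimate of the noise standard deviation
r2 = 1 - rss / tss                                # proportion of variance explained
print(rse, r2)
print(np.corrcoef(x, y)[0, 1] ** 2)               # equals r2 in simple linear regression
```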

ISLP, Chapter 3.1.3

13 / 100
Question
What is the relationship between \(R^2\) and the correlation coefficient in simple linear regression?
13 / 100
Answer
In simple linear regression, the \(R^2\) statistic is equal to the square of the correlation coefficient between the predictor \(X\) and the response \(Y\). That is, \(R^2 = r^2\), where \(r = Cor(X, Y)\).

ISLP, Chapter 3.1.3

14 / 100
Question
How does multiple linear regression differ from simple linear regression?
14 / 100
Answer
Multiple linear regression extends simple linear regression to accommodate multiple predictors. The model takes the form \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon\). Each predictor has its own slope coefficient, and the model can assess the effect of each predictor while holding others constant.
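
A minimal sketch with statsmodels on simulated data (the number of predictors and the coefficient values are hypothetical); the summary reports each coefficient with its standard error, t-statistic, and p-value, plus the overall F-statistic and \(R^2\):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))                       # three hypothetical predictors
y = 1.0 + X @ np.array([2.0, 0.0, -1.5]) + rng.normal(size=n)

X_design = sm.add_constant(X)                     # prepend a column of ones for the intercept
results = sm.OLS(y, X_design).fit()
print(results.summary())                          # coefficients, SEs, t-stats, F-statistic, R^2
```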

ISLP, Chapter 3.2

15 / 100
Question
How do you interpret a coefficient \(\beta_j\) in a multiple linear regression model?
15 / 100
Answer
In multiple linear regression, \(\beta_j\) is interpreted as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed. This allows for the separation of the individual effects of each predictor on the response.

ISLP, Chapter 3.2

16 / 100
Question
Why can the coefficient estimates for the same predictor be different in simple and multiple regression?
16 / 100
Answer
This difference arises from correlation between predictors. In simple regression, the slope term represents the average effect of the predictor, ignoring all other predictors. In multiple regression, the coefficient represents the effect of the predictor while holding other predictors constant. If predictors are correlated, the simple regression coefficient can be misleading as it captures the effects of other correlated predictors.

ISLP, Chapter 3.2.1

17 / 100
Question
What is the F-statistic used for in multiple linear regression?
17 / 100
Answer
The F-statistic is used to test the null hypothesis that all regression coefficients are zero (\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)). This tests whether at least one predictor is useful in predicting the response. A large F-statistic provides evidence against the null hypothesis.

ISLP, Chapter 3.2.2

18 / 100
Question
Why is the overall F-statistic needed when individual p-values for each coefficient are available?
18 / 100
Answer
When the number of predictors \(p\) is large, there is a high probability of observing a small p-value for an individual coefficient purely by chance, even if no variable is truly associated with the response (a problem of multiple testing). The F-statistic adjusts for the number of predictors and provides a single test for the overall significance of the model, avoiding this issue.

ISLP, Chapter 3.2.2

19 / 100
Question
What is variable selection in multiple regression?
19 / 100
Answer
Variable selection is the task of identifying which predictors are significantly associated with the response, in order to fit a single, more interpretable model that includes only those important predictors. This helps to avoid overfitting and improves model interpretability.

ISLP, Chapter 3.2.2

20 / 100
Question
Name three classical approaches for variable selection.
20 / 100
Answer
1. Forward Selection: Start with the null model and add predictors one by one, at each step adding the predictor that gives the greatest additional improvement to the fit (a minimal sketch follows this list).
2. Backward Selection: Start with all predictors and remove the least useful predictor one by one.
3. Mixed Selection: A combination of forward and backward selection, where variables are added one by one, but can also be removed if they become insignificant.
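
A minimal sketch of forward selection using statsmodels, greedily adding the predictor that most reduces the training RSS at each step; the stopping rule and a selection criterion such as adjusted \(R^2\), AIC/BIC, or cross-validation are left out for brevity:

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, n_steps):
    """Greedy forward selection: at each step, add the predictor that most reduces the RSS."""
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(n_steps):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            if fit.ssr < best_rss:                 # .ssr is the residual sum of squares
                best_rss, best_j = fit.ssr, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example on simulated data: pick the best 3 of 8 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)
print(forward_selection(X, y, n_steps=3))          # indices of the chosen predictors
```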

ISLP, Chapter 3.2.2

21 / 100
Question
How does adding more variables to a model affect the \(R^2\) statistic?
21 / 100
Answer
The \(R^2\) statistic will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is because adding a variable always allows the model to fit the training data at least as well, thus reducing the RSS. Therefore, \(R^2\) alone is not a good metric for model selection.

ISLP, Chapter 3.2.2

22 / 100
Question
What is the difference between a confidence interval and a prediction interval?
22 / 100
Answer
A confidence interval quantifies the uncertainty around the average response \(f(X)\) for a given set of predictors. A prediction interval quantifies the uncertainty for a single individual response \(Y\). Prediction intervals are always wider because they incorporate both the reducible error (uncertainty in the coefficient estimates) and the irreducible error (\(\epsilon\)).
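
A sketch with statsmodels illustrating the difference on simulated data; `get_prediction(...).summary_frame()` reports both intervals, and the `obs_ci` (prediction) interval is visibly wider than the `mean_ci` (confidence) interval:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)           # simulated data

results = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([0.0, 1.0, 2.0])                  # hypothetical new predictor values
pred = results.get_prediction(sm.add_constant(x_new)).summary_frame(alpha=0.05)
# mean_ci_*: 95% confidence interval for the average response at x_new
# obs_ci_*:  95% prediction interval for a single new observation (always wider)
print(pred[['mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])
```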

ISLP, Chapter 3.2.2

23 / 100
Question
How can qualitative predictors be included in a linear regression model?
23 / 100
Answer
Qualitative predictors are included by creating dummy variables. For a predictor with two levels (e.g., male/female), a single dummy variable is created that takes on two numerical values (e.g., 0 and 1). For a predictor with \(k > 2\) levels, \(k - 1\) dummy variables are created.

ISLP, Chapter 3.3.1

24 / 100
Question
What is the 'baseline' level in the context of dummy variables?
24 / 100
Answer
When creating dummy variables for a qualitative predictor with \(k\) levels, we create \(k-1\) dummy variables. The level that does not have its own dummy variable is known as the baseline or reference level. The coefficients of the dummy variables are interpreted as the average difference in the response relative to this baseline level.
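
A minimal pandas sketch (the factor and its levels are hypothetical); with `drop_first=True` the alphabetically first level is dropped and becomes the baseline:

```python
import pandas as pd

df = pd.DataFrame({'region': ['East', 'West', 'South', 'East', 'South']})  # hypothetical factor

# k = 3 levels -> k - 1 = 2 dummy variables; the dropped level ('East') is the baseline.
dummies = pd.get_dummies(df['region'], drop_first=True)
print(dummies)
# Coefficients on the 'South' and 'West' dummies are then interpreted as average
# differences in the response relative to the 'East' baseline.
```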

ISLP, Chapter 3.3.1

25 / 100
Question
What is an interaction effect in a linear model?
25 / 100
Answer
An interaction effect occurs when the effect of one predictor variable on the response depends on the value of another predictor variable. It suggests that the predictors have a combined, synergistic effect. In a linear model, this is incorporated by adding a new predictor that is the product of the interacting variables (e.g., \(X_1 \times X_2\)).
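
A sketch with the statsmodels formula API on simulated data (the variable names `tv`, `radio`, `sales` and the coefficient values are hypothetical); in a formula, `tv * radio` expands to `tv + radio + tv:radio`, i.e., both main effects plus the interaction term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({'tv': rng.uniform(0, 100, n), 'radio': rng.uniform(0, 50, n)})
df['sales'] = 3 + 0.05 * df.tv + 0.1 * df.radio + 0.002 * df.tv * df.radio + rng.normal(size=n)

# Fit main effects plus the tv:radio interaction (product term)
fit = smf.ols('sales ~ tv * radio', data=df).fit()
print(fit.params)
```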

ISLP, Chapter 3.3.2

26 / 100
Question
What is the hierarchical principle in modeling?
26 / 100
Answer
The hierarchical principle states that if an interaction term (e.g., \(X_1 \times X_2\)) is included in a model, the corresponding main effects (\(X_1\) and \(X_2\)) should also be included, even if their individual p-values are not significant. This is because the interaction term is correlated with the main effects, and omitting them can alter the meaning of the interaction.

ISLP, Chapter 3.3.2

27 / 100
Question
How can non-linear relationships be modeled using linear regression?
27 / 100
Answer
Non-linear relationships can be modeled by including transformed versions of the predictors in the model. A common approach is polynomial regression, where we add powers of a predictor (e.g., \(X^2, X^3\)) as new variables. The model is still linear in the coefficients, but the resulting function is non-linear with respect to the original predictor.
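
A minimal numpy sketch of quadratic (degree-2 polynomial) regression on simulated data; the design matrix gains an \(x^2\) column, but the fit is still ordinary least squares in the coefficients:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=100)  # hypothetical quadratic truth

# Polynomial regression: add powers of x as extra columns; the model stays linear in the betas.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1, 2, -1.5]
```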

ISLP, Chapter 3.3.2

28 / 100
Question
What is a residual plot and what is it used for?
28 / 100
Answer
A residual plot is a graph of the residuals (\(y_i - \hat{y}_i\)) versus the fitted values (\(\hat{y}_i\)). It is a key diagnostic tool for identifying potential problems with a linear regression model. Ideally, the plot should show no discernible pattern. Patterns, such as a U-shape, suggest non-linearity in the data.

ISLP, Chapter 3.3.3

29 / 100
Question
What is heteroscedasticity and how can it be identified?
29 / 100
Answer
Heteroscedasticity refers to the situation where the error terms have non-constant variance (i.e., \(Var(\epsilon_i)\) is not constant). It can be identified from a funnel shape in the residual plot, where the magnitude of the residuals tends to increase or decrease with the fitted values. A possible solution is to apply a concave transformation to the response variable, such as \(\log(Y)\) or \(\sqrt{Y}\).

ISLP, Chapter 3.3.3

30 / 100
Question
What is the difference between an outlier and a high-leverage point?
30 / 100
Answer
An outlier is an observation for which the response \(y_i\) is unusual given its predictor value \(x_i\). It has a large residual. A high-leverage point is an observation that has an unusual predictor value \(x_i\) (e.g., far from the mean of \(X\)). High-leverage points can have a substantial impact on the estimated regression line.

ISLP, Chapter 3.3.3

31 / 100
Question
What is collinearity?
31 / 100
Answer
Collinearity is a situation where two or more predictor variables are closely related to one another. High correlation between predictors makes it difficult to separate out their individual effects on the response, which increases the standard errors of the coefficient estimates and reduces the power of hypothesis tests.

ISLP, Chapter 3.3.3

32 / 100
Question
How can collinearity be detected?
32 / 100
Answer
A simple way is to inspect the correlation matrix of the predictors. A more reliable method is to compute the Variance Inflation Factor (VIF) for each predictor. A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

ISLP, Chapter 3.3.3

33 / 100
Question
What is the Variance Inflation Factor (VIF)?
33 / 100
Answer
The VIF measures how much the variance of an estimated regression coefficient is inflated because of collinearity. It is calculated for each predictor by regressing it on all other predictors. The VIF is \(VIF(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j|X_{-j}}}\), where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) from that regression. A value close to 1 indicates no collinearity.
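
A sketch using statsmodels' `variance_inflation_factor` on simulated data in which `x2` is deliberately built to be almost a copy of `x1`:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)          # strongly collinear with x1 (by construction)
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]  # skip the intercept column
print(vifs)  # VIFs for x1 and x2 should be very large (>> 10); x3 should be near 1
```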

ISLP, Chapter 3.3.3

34 / 100
Question
What are two simple solutions to the problem of collinearity?
34 / 100
Answer
1. Drop one of the problematic variables: Since the information provided by collinear variables is redundant, one can often be dropped without much compromise to the model fit.
2. Combine the collinear variables: Create a new single predictor by combining the collinear variables, for example, by taking their average.

ISLP, Chapter 3.3.3

35 / 100
Question
Why is linear regression considered a parametric method?
35 / 100
Answer
Linear regression is a parametric method because it assumes a specific functional form for the relationship between the predictors and the response, namely a linear one. The model is fully defined by a small number of parameters (the coefficients \(\beta_j\)) that are estimated from the data.

ISLP, Chapter 3.5

36 / 100
Question
What is the 'curse of dimensionality' and how does it affect methods like KNN?
36 / 100
Answer
The curse of dimensionality refers to the fact that in high-dimensional spaces, data points become very sparse. For a given observation, its nearest neighbors can be very far away, making predictions based on them unreliable. This degrades the performance of non-parametric methods like K-Nearest Neighbors (KNN) much more quickly than parametric methods like linear regression as the number of predictors \(p\) increases.

ISLP, Chapter 3.5

37 / 100
Question
In what situation would a parametric method like linear regression outperform a non-parametric method like KNN?
37 / 100
Answer
A parametric approach will outperform a non-parametric one if the assumed parametric form is close to the true form of the relationship. If the true relationship between \(X\) and \(Y\) is actually linear, linear regression will achieve a lower test error than KNN because it has low variance and low bias. KNN, being more flexible, would have higher variance in this scenario without a corresponding reduction in bias.

ISLP, Chapter 3.5

38 / 100
Question
What is the least squares solution for the coefficients \(\mathbf{w}\) in matrix notation?
38 / 100
Answer
The optimal parameter vector \(\mathbf{w}\) that minimizes the squared loss is given by the equation: \(\mathbf{\hat{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}\), where \(\mathbf{X}\) is the design matrix and \(\mathbf{t}\) is the vector of target values.
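
A minimal numpy sketch of the normal-equations solution on simulated data; `np.linalg.solve` is used rather than forming the inverse explicitly, which is the numerically preferable way to evaluate \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}\):

```python
import numpy as np

rng = np.random.default_rng(8)
N, D = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])   # design matrix with intercept column
w_true = np.array([1.0, 2.0, -0.5, 0.3])                     # hypothetical true weights
t = X @ w_true + rng.normal(scale=0.2, size=N)

# w_hat = (X^T X)^{-1} X^T t, computed via a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(w_hat)
```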

A First Course in Machine Learning, Chapter 1.3.1

39 / 100
Question
What is the purpose of regularization in linear regression?
39 / 100
Answer
Regularization is a technique used to control overfitting by penalizing large coefficient values. It adds a penalty term to the error function, which discourages the model from becoming too complex and fitting the noise in the training data. This generally leads to better generalization performance on new data.

A First Course in Machine Learning, Chapter 1.6

40 / 100
Question
What is ridge regression?
40 / 100
Answer
Ridge regression is a type of regularized least squares where the penalty term is the sum of the squares of the coefficients (an \(L_2\) penalty). The modified error function is \(L' = L + \lambda \mathbf{w}^T\mathbf{w}\). The solution is \(\mathbf{\hat{w}} = (\mathbf{X}^T\mathbf{X} + N\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}\).
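
A sketch of the closed-form ridge solution on simulated data, following the book's average-loss convention (hence the factor of \(N\) multiplying \(\lambda\)); the value of \(\lambda\) is arbitrary, and a common refinement not shown here is to leave the intercept unpenalized:

```python
import numpy as np

rng = np.random.default_rng(9)
N, D = 100, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
t = X @ rng.normal(size=D + 1) + rng.normal(size=N)          # simulated data

lam = 0.1                                                    # regularization strength (assumed)
I = np.eye(X.shape[1])
# Closed-form ridge solution: (X^T X + N*lambda*I)^{-1} X^T t
w_ridge = np.linalg.solve(X.T @ X + N * lam * I, X.T @ t)
print(w_ridge)
```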

A First Course in Machine Learning, Chapter 1.6

41 / 100
Question
From a probabilistic perspective, what does minimizing the sum-of-squares error function correspond to?
41 / 100
Answer
Minimizing the sum-of-squares error function is equivalent to maximizing the likelihood of the data under the assumption that the target variable has a Gaussian (normal) distribution around the model's prediction. That is, \(p(t|x, \mathbf{w}, \beta) = \mathcal{N}(t|y(x, \mathbf{w}), \beta^{-1})\).

Pattern Recognition and Machine Learning, Chapter 1.2.5

42 / 100
Question
What is the relationship between ridge regression and Bayesian inference?
42 / 100
Answer
Ridge regression is equivalent to finding the Maximum A Posteriori (MAP) estimate for the coefficients \(\mathbf{w}\) when assuming a Gaussian prior distribution on the coefficients, centered at zero. The regularization parameter \(\lambda\) is related to the precision (inverse variance) of the prior.

Pattern Recognition and Machine Learning, Chapter 1.2.6

43 / 100
Question
What is the predictive distribution in a Bayesian linear regression model?
43 / 100
Answer
The predictive distribution \(p(t|x, \mathbf{x}, \mathbf{t})\) gives the probability distribution over the target value \(t\) for a new input \(x\), after integrating out the uncertainty in the model parameters \(\mathbf{w}\). For a Gaussian likelihood and prior, the predictive distribution is also a Gaussian.

Pattern Recognition and Machine Learning, Chapter 1.2.6

44 / 100
Question
What is the key difference between the prediction from a maximum likelihood approach and a fully Bayesian approach?
44 / 100
Answer
The maximum likelihood approach gives a single point estimate for the parameters, leading to a point prediction. A fully Bayesian approach provides a posterior distribution over the parameters, and the prediction is a distribution (the predictive distribution) that averages over all possible parameter values, weighted by their posterior probability. This accounts for uncertainty in the parameters.

Pattern Recognition and Machine Learning, Chapter 1.2.6

45 / 100
Question
What is the bias-variance tradeoff?
45 / 100
Answer
The bias-variance tradeoff is a fundamental concept in supervised learning. The expected error of a model can be decomposed into bias and variance. Bias is the error from erroneous assumptions in the learning algorithm (underfitting). Variance is the error from sensitivity to small fluctuations in the training set (overfitting). Simple models have high bias and low variance, while complex models have low bias and high variance. The goal is to find a model that optimally balances the two.

A First Course in Machine Learning, Chapter 2.8

46 / 100
Question
What is the maximum likelihood estimate for the variance \(\sigma^2\) of the noise in a linear regression model?
46 / 100
Answer
The maximum likelihood estimate for the noise variance is the average squared error: \(\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^N (t_n - \mathbf{w}_{ML}^T\mathbf{x}_n)^2\). This estimator is known to be biased, as it systematically underestimates the true variance.

A First Course in Machine Learning, Chapter 2.7.2

47 / 100
Question
What is the covariance of the maximum likelihood weight vector, \(\text{cov}[\mathbf{\hat{w}}]\)?
47 / 100
Answer
The covariance of the maximum likelihood weight vector \(\mathbf{\hat{w}}\) is given by \(\text{cov}[\mathbf{\hat{w}}] = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\). This matrix quantifies the uncertainty in the parameter estimates. The diagonal elements give the variance of each coefficient, and the off-diagonal elements give their covariances.

A First Course in Machine Learning, Chapter 2.9.1

48 / 100
Question
What is the Fisher Information Matrix for the mean of a Gaussian distribution?
48 / 100
Answer
The Fisher Information Matrix, \(\mathcal{I}(\mathbf{w})\), is the expected value of the negative of the Hessian matrix of the log-likelihood. For a linear regression model with Gaussian noise, it is \(\mathcal{I}(\mathbf{w}) = \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}\). Its inverse provides the Cramér-Rao lower bound on the variance of any unbiased estimator.

A First Course in Machine Learning, Chapter 2.9.1

49 / 100
Question
What is the variance of a prediction, \(\sigma^2_{new}\), in a linear regression model?
49 / 100
Answer
The variance of a prediction for a new input \(\mathbf{x}_{new}\) has two components: one from the noise in the data (\(\sigma^2\)) and one from the uncertainty in the weight estimates. The total predictive variance is \(\sigma^2_{new} = \sigma^2 + \mathbf{x}_{new}^T \text{cov}[\mathbf{\hat{w}}] \mathbf{x}_{new} = \sigma^2 + \sigma^2 \mathbf{x}_{new}^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_{new}\).
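
A numpy sketch of this decomposition on simulated data, plugging the maximum likelihood estimates of \(\mathbf{w}\) and \(\sigma^2\) into the formula (the new input value is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
t = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)  # simulated data

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
sigma2_hat = np.mean((t - X @ w_hat) ** 2)                    # ML estimate of the noise variance

x_new = np.array([1.0, 3.0])                                  # new input (with intercept term)
var_new = sigma2_hat + sigma2_hat * x_new @ np.linalg.solve(X.T @ X, x_new)
print(var_new)  # noise variance plus the parameter-uncertainty contribution
```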

A First Course in Machine Learning, Chapter 2.10

50 / 100
Question
What is the difference between the Additive and Linearity assumptions in linear regression?
50 / 100
Answer
The additive assumption means the effect of changing one predictor \(X_j\) on the response \(Y\) is independent of the values of other predictors. The linearity assumption means the change in the response \(Y\) for a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).

ISLP, Chapter 3.3.2

51 / 100
Question
How can you use a residual plot to detect non-linearity?
51 / 100
Answer
If there is a non-linear relationship between the predictors and the response, the residual plot (residuals vs. fitted values) will exhibit a discernible pattern. A common pattern is a U-shape, which indicates that the model systematically over-predicts in some ranges of the fitted values and under-predicts in others.

ISLP, Chapter 3.3.3

52 / 100
Question
What is 'tracking' in a residual plot and what does it indicate?
52 / 100
Answer
'Tracking' refers to a pattern in a plot of residuals versus time (or observation number) where adjacent residuals have similar values. This pattern indicates that the error terms are correlated, which violates a key assumption of linear regression and can lead to underestimated standard errors and misleadingly small p-values.

ISLP, Chapter 3.3.3

53 / 100
Question
What is a studentized residual?
53 / 100
Answer
A studentized residual is computed by dividing each residual \(e_i\) by its estimated standard error. Plotting studentized residuals is a common way to identify outliers. Observations with studentized residuals greater than 3 in absolute value are often considered to be outliers.

ISLP, Chapter 3.3.3

54 / 100
Question
What is the leverage statistic \(h_i\)?
54 / 100
Answer
The leverage statistic \(h_i\) quantifies how much an observation's predictor value \(x_i\) deviates from the mean of the predictors. A large value indicates a high-leverage point. For simple linear regression, it is calculated as \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\).

ISLP, Chapter 3.3.3

55 / 100
Question
Why is an observation with both high leverage and a large residual particularly dangerous?
55 / 100
Answer
An observation that is both an outlier (large residual) and has high leverage can have a disproportionately large effect on the fitted regression line. Removing such a point can cause the line to shift significantly, indicating that the entire fit may be driven by that single problematic observation.

ISLP, Chapter 3.3.3

56 / 100
Question
What is the primary advantage of a parametric method over a non-parametric one?
56 / 100
Answer
Parametric methods are generally easier to fit because they only require estimating a small number of parameters. They also provide simple interpretations of the coefficients and allow for straightforward statistical significance testing. Non-parametric methods are more flexible but can be harder to interpret and more computationally intensive.

ISLP, Chapter 3.5

57 / 100
Question
What is the primary disadvantage of a parametric method?
57 / 100
Answer
The main disadvantage is that they make strong assumptions about the functional form of the relationship between predictors and the response. If this assumed form is far from the true relationship, the model will have a high bias and will not perform well.

ISLP, Chapter 3.5

58 / 100
Question
What is the general form of the exponential family of distributions?
58 / 100
Answer
The exponential family of distributions is a class of distributions that can be written in the form: \( p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} \). Here, \(\boldsymbol{\eta}\) are the natural parameters of the distribution.

Pattern Recognition and Machine Learning, Chapter 2.4

59 / 100
Question
What is a sufficient statistic?
59 / 100
Answer
A sufficient statistic is a function of the data that contains all the information needed to compute a maximum likelihood estimate for a parameter. For a Gaussian distribution, the sufficient statistics are \(\sum_n x_n\) and \(\sum_n x_n^2\). For the Bernoulli distribution, it is \(\sum_n x_n\).

Pattern Recognition and Machine Learning, Chapter 2.1.1

60 / 100
Question
What is a conjugate prior?
60 / 100
Answer
A prior distribution is conjugate to a likelihood function if the resulting posterior distribution is in the same family as the prior. For example, the Beta distribution is the conjugate prior for the Bernoulli likelihood, and the Gaussian distribution is a conjugate prior for a Gaussian likelihood with a known variance.

Pattern Recognition and Machine Learning, Chapter 2.1.1

61 / 100
Question
What is the relationship between the least squares solution and the maximum likelihood solution in linear regression?
61 / 100
Answer
Minimizing the sum of squared errors (the least squares solution) is equivalent to maximizing the likelihood of the data under the assumption that the errors are independent and identically distributed according to a zero-mean Gaussian distribution.

A First Course in Machine Learning, Chapter 2.7.2

62 / 100
Question
What is the bias of the maximum likelihood estimator for the variance of a Gaussian?
62 / 100
Answer
The maximum likelihood estimator for the variance, \(\hat{\sigma}^2_{ML} = \frac{1}{N}\sum(x_n - \mu_{ML})^2\), is biased. Its expected value is \(E[\hat{\sigma}^2_{ML}] = \frac{N-1}{N}\sigma^2\), meaning it systematically underestimates the true variance \(\sigma^2\).

Pattern Recognition and Machine Learning, Chapter 1.2.4

63 / 100
Question
What is the key idea behind Bayesian linear regression?
63 / 100
Answer
In Bayesian linear regression, we treat the model parameters (coefficients \(\mathbf{w}\)) as random variables. We define a prior distribution over them, \(p(\mathbf{w})\), and combine it with the likelihood \(p(\mathbf{t}|\mathbf{w})\) using Bayes' theorem to obtain a posterior distribution \(p(\mathbf{w}|\mathbf{t})\). This posterior captures our updated beliefs about the parameters after observing the data.

Pattern Recognition and Machine Learning, Chapter 1.2.6

64 / 100
Question
What is the MAP estimate and how does it relate to regularized least squares?
64 / 100
Answer
The Maximum A Posteriori (MAP) estimate is the value of the parameters \(\mathbf{w}\) that maximizes the posterior distribution. If the likelihood is Gaussian and the prior on the weights is a zero-mean Gaussian, finding the MAP estimate is equivalent to minimizing a regularized sum-of-squares error function (ridge regression).

Pattern Recognition and Machine Learning, Chapter 1.2.6

65 / 100
Question
What is the marginal likelihood and what is it used for?
65 / 100
Answer
The marginal likelihood, or model evidence, is the probability of the observed data given a model, with the parameters integrated out: \(p(\mathbf{t}|\text{model}) = \int p(\mathbf{t}|\mathbf{w})p(\mathbf{w})d\mathbf{w}\). It can be used for model selection, for example, to choose the optimal order of a polynomial or the best prior settings, by selecting the model with the highest marginal likelihood.

A First Course in Machine Learning, Chapter 3.4

66 / 100
Question
What is the difference between bias and variance in the context of an estimator?
66 / 100
Answer
The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated. An unbiased estimator has a bias of zero. The variance of an estimator measures the spread of its estimates around its expected value. A good estimator typically has low bias and low variance.

ST3189 Subject Guide - Statistical Inference on Linear Regression Models

67 / 100
Question
What is the Cramér-Rao lower bound?
67 / 100
Answer
The Cramér-Rao lower bound provides a lower bound on the variance of any unbiased estimator of a deterministic parameter. The bound is the inverse of the Fisher information. An estimator that achieves this bound is said to be efficient.

ST3189 Subject Guide - Statistical Inference on Linear Regression Models

68 / 100
Question
What are the asymptotic properties of Maximum Likelihood Estimators (MLEs)?
68 / 100
Answer
As the sample size \(n\) increases, MLEs are asymptotically unbiased, normally distributed, and have the minimum possible variance (they are efficient, achieving the Cramér-Rao lower bound). This makes them optimal estimators for large sample sizes.

ST3189 Subject Guide - Statistical Inference on Linear Regression Models

69 / 100
Question
What is the purpose of the design matrix \(\mathbf{X}\) in linear regression?
69 / 100
Answer
The design matrix \(\mathbf{X}\) is an \(n \times (p+1)\) matrix that contains the predictor values for all \(n\) observations. Each row corresponds to an observation, and each column corresponds to a predictor variable. A column of ones is typically included to account for the intercept term \(\beta_0\).

ST3189 Subject Guide - The Linear Regression Model

70 / 100
Question
What is a confounding variable?
70 / 100
Answer
A confounding variable is a variable that is correlated with both the predictor variable and the response variable. Its presence can distort the relationship between the predictor and the response, potentially leading to spurious conclusions about causality or association.

ST3189 Subject Guide - Using Linear Regression Models

71 / 100
Question
What is the difference between interpolation and extrapolation?
71 / 100
Answer
Interpolation is making a prediction for a new observation whose predictor values fall within the range of the training data. Extrapolation is making a prediction for a new observation whose predictor values fall outside the range of the training data. Extrapolation is generally much less reliable as it assumes the model holds true in regions where no data has been observed.

ST3189 Subject Guide - Using Linear Regression Models

72 / 100
Question
What is the purpose of using basis functions in linear regression?
72 / 100
Answer
Using basis functions \(h(x)\) allows us to model non-linear relationships within the linear regression framework. By transforming the original predictors (e.g., using polynomials, splines, or other functions), we create a new set of predictors. The model is still linear with respect to the coefficients of these new basis functions, but the resulting predictive function is non-linear with respect to the original input \(x\).

ST3189 Subject Guide - The Linear Regression Model

73 / 100
Question
What is the key assumption about the error terms \(\epsilon_i\) in a standard linear regression model?
73 / 100
Answer
The standard assumptions are that the error terms are independent and identically distributed (i.i.d.) with a mean of zero and constant variance \(\sigma^2\). Often, they are also assumed to follow a Normal (Gaussian) distribution, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).

ST3189 Subject Guide - The Linear Regression Model

74 / 100
Question
What is the 'leaps and bounds' procedure used for?
74 / 100
Answer
The 'leaps and bounds' procedure is an efficient algorithm for performing best subset selection. It avoids having to fit all \(2^p\) possible models by identifying the best subset of predictors for each subset size \(k\) without exhaustively searching through every single combination.

ST3189 Subject Guide - Subset Selection Methods in Linear Regression

75 / 100
Question
Why might a greedy approach like forward stepwise selection be preferred over best subset selection?
75 / 100
Answer
Forward stepwise selection is computationally much more efficient than best subset selection, especially for a large number of predictors \(p\). While best subset selection explores all possible models, forward selection follows a single path, making it feasible for high-dimensional problems where best subset is computationally intractable.

ST3189 Subject Guide - Subset Selection Methods in Linear Regression

76 / 100
Question
What is the main drawback of backward stepwise selection?
76 / 100
Answer
The main drawback of backward stepwise selection is that it requires the number of samples \(n\) to be larger than the number of predictors \(p\). This is because it starts with the full model, which cannot be fit if \(p > n\).

ISLP, Chapter 3.2.2

77 / 100
Question
Explain the concept of 'shrinking' coefficients in the context of Bayesian linear regression.
77 / 100
Answer
In Bayesian linear regression, the prior distribution on the coefficients (e.g., a Gaussian centered at zero) pulls the posterior estimates of the coefficients away from the maximum likelihood estimate and towards the prior mean (zero). This effect is called 'shrinkage'. It acts as a form of regularization, preventing coefficients from becoming too large and helping to control overfitting.

ST3189 Subject Guide - Bayesian Linear Regression

78 / 100
Question
What prior distribution on the regression coefficients \(\beta\) corresponds to LASSO regression?
78 / 100
Answer
The LASSO (Least Absolute Shrinkage and Selection Operator) estimator corresponds to the posterior mode (MAP estimate) when a Laplace prior, \(\text{La}(0, 1/\gamma)\), is placed on the regression coefficients. The Laplace prior has a sharp peak at zero, which encourages some coefficients to be exactly zero, thus performing variable selection.

ST3189 Subject Guide - Bayesian Linear Regression

79 / 100
Question
What is the primary motivation for using shrinkage methods like Ridge or LASSO over standard least squares?
79 / 100
Answer
The primary motivation is to improve prediction accuracy by reducing the variance of the model. Standard least squares can have high variance, especially when predictors are correlated or when \(p\) is large. Shrinkage methods introduce a small amount of bias but can lead to a substantial reduction in variance, resulting in a lower overall mean squared error.

ST3189 Subject Guide - Bayesian Linear Regression

80 / 100
Question
What is the role of the hyperparameter \(\lambda\) in shrinkage methods?
80 / 100
Answer
The hyperparameter \(\lambda\) controls the amount of shrinkage. A larger \(\lambda\) results in greater shrinkage, pulling the coefficients more strongly towards zero and resulting in a simpler, less flexible model. A smaller \(\lambda\) results in less shrinkage, and as \(\lambda \to 0\), the solution approaches the standard least squares estimate.

A First Course in Machine Learning, Chapter 1.6

81 / 100
Question
How is the optimal value of the shrinkage parameter \(\lambda\) typically chosen?
81 / 100
Answer
The optimal value of \(\lambda\) is typically chosen using a validation method like cross-validation. The goal is to find the value of \(\lambda\) that results in the lowest test error on unseen data, balancing the bias-variance tradeoff.
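
A sketch with scikit-learn's `RidgeCV` on simulated data (scikit-learn calls the shrinkage parameter `alpha` rather than \(\lambda\)); it evaluates each candidate value by cross-validation and keeps the one with the best held-out performance:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)            # simulated data

# Try a grid of penalties and pick the best one by 5-fold cross-validation
alphas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)                                           # the selected shrinkage parameter
```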

A First Course in Machine Learning, Chapter 1.6

82 / 100
Question
What is the matrix form of the sum-of-squares error function?
82 / 100
Answer
The sum-of-squares error function can be written in matrix form as: \( L = (\mathbf{t} - \mathbf{Xw})^T(\mathbf{t} - \mathbf{Xw}) \).

A First Course in Machine Learning, Chapter 1.3

83 / 100
Question
What is the purpose of the intercept term (\(\beta_0\)) in a linear regression model?
83 / 100
Answer
The intercept term \(\beta_0\) represents the expected value of the response variable \(Y\) when all predictor variables are equal to zero. Geometrically, it is the value where the regression line or plane crosses the Y-axis.

ISLP, Chapter 3.1

84 / 100
Question
Can linear regression be used for classification problems?
84 / 100
Answer
While it's possible to code a binary response variable as 0/1 and fit a linear regression, it's not recommended. The model can produce predictions outside the [0, 1] interval, which are difficult to interpret as probabilities. For responses with more than two classes, the arbitrary numerical coding implies a false ordering. Classification-specific methods like logistic regression are more appropriate.

ISLP, Chapter 4.2

85 / 100
Question
What is the key difference in the assumptions of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)?
85 / 100
Answer
Both LDA and QDA assume that the observations within each class are drawn from a Gaussian distribution. The key difference is that LDA assumes all classes share a common covariance matrix (\(\Sigma\)), while QDA assumes that each class has its own covariance matrix (\(\Sigma_k\)). This makes QDA more flexible but requires estimating more parameters.

Pattern Recognition and Machine Learning, Chapter 4.4.3

86 / 100
Question
What kind of decision boundary does Linear Discriminant Analysis (LDA) produce?
86 / 100
Answer
As its name implies, Linear Discriminant Analysis (LDA) produces linear decision boundaries between classes. This is a direct result of assuming a common covariance matrix for all classes.

Pattern Recognition and Machine Learning, Chapter 4.4.2

87 / 100
Question
What kind of decision boundary does Quadratic Discriminant Analysis (QDA) produce?
87 / 100
Answer
Quadratic Discriminant Analysis (QDA) produces quadratic decision boundaries. This allows for more flexible separation between classes compared to LDA, as it does not assume a common covariance matrix.

Pattern Recognition and Machine Learning, Chapter 4.4.3

88 / 100
Question
When might LDA be a better choice than QDA?
88 / 100
Answer
LDA is generally a better choice when the training set is small, as it has fewer parameters to estimate (it assumes a common covariance matrix). It is also preferred if the assumption of a common covariance matrix is reasonable. QDA's flexibility can lead to overfitting on small datasets.

Pattern Recognition and Machine Learning, Chapter 4.4.3

89 / 100
Question
When might QDA be a better choice than LDA?
89 / 100
Answer
QDA is a better choice when the training set is large and the assumption of a common covariance matrix for all classes is untenable. If the true decision boundary is non-linear, QDA's flexibility will allow it to achieve a better fit than LDA.

Pattern Recognition and Machine Learning, Chapter 4.4.3

90 / 100
Question
What is the 'naive' assumption in Naive Bayes classification?
90 / 100
Answer
The 'naive' assumption is that, within each class, the predictor variables are all conditionally independent of each other. This is a strong assumption that is often not true in reality, but it greatly simplifies the model and often works well in practice, especially for text classification.

ST3189 Subject Guide - Statistical Inference on Linear Regression Models

91 / 100
Question
What is the difference between a generative and a discriminative model?
91 / 100
Answer
A generative model (like LDA or Naive Bayes) models the joint probability distribution \(p(X, Y)\), often by modeling the class-conditional density \(p(X|Y)\) and the prior \(p(Y)\). A discriminative model (like logistic regression) directly models the posterior probability \(p(Y|X)\) without modeling the distribution of \(X\).

Pattern Recognition and Machine Learning, Chapter 4.3

92 / 100
Question
What is the logistic function (or sigmoid function)?
92 / 100
Answer
The logistic function is \(\sigma(\eta) = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}\). It takes any real-valued number and maps it to a value between 0 and 1, making it suitable for modeling probabilities in logistic regression.

Pattern Recognition and Machine Learning, Chapter 4.3.2

93 / 100
Question
How does logistic regression model the probability of a binary outcome?
93 / 100
Answer
Logistic regression models the log-odds of the outcome as a linear function of the predictors: \(\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p\). This is equivalent to modeling the probability \(p(X)\) using the logistic function.
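
A sketch with statsmodels' `Logit` on simulated data (the true coefficients \(-1\) and \(2\) are hypothetical); the fitted parameters live on the log-odds scale, and exponentiating a slope gives the corresponding odds ratio:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 500
x = rng.normal(size=n)
log_odds = -1.0 + 2.0 * x                                     # hypothetical true log-odds
p = 1 / (1 + np.exp(-log_odds))                               # logistic function
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params)              # estimates of beta0 and beta1 on the log-odds scale
print(np.exp(fit.params[1]))   # odds ratio for a one-unit increase in x
```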

ISLP, Chapter 4.3.1

94 / 100
Question
How are the coefficients in logistic regression estimated?
94 / 100
Answer
The coefficients in logistic regression are estimated using the maximum likelihood method. Unlike linear regression, there is no closed-form solution, so an iterative algorithm like gradient ascent or Newton-Raphson (often implemented as Iteratively Reweighted Least Squares) is used to find the estimates.

Pattern Recognition and Machine Learning, Chapter 4.3.3

95 / 100
Question
What is the key difference between the interpretation of coefficients in linear vs. logistic regression?
95 / 100
Answer
In linear regression, \(\beta_j\) is the change in the mean of \(Y\) for a one-unit change in \(X_j\). In logistic regression, \(\beta_j\) is the change in the log-odds of \(Y=1\) for a one-unit change in \(X_j\). Equivalently, exponentiating the coefficient, \(e^{\beta_j}\), gives the odds ratio.

ISLP, Chapter 4.3.1

96 / 100
Question
What is the Total Sum of Squares (TSS)?
96 / 100
Answer
The Total Sum of Squares (TSS) measures the total variance in the response \(Y\) before the regression is performed. It is the sum of the squared differences between each observation and the overall mean of the response: \(TSS = \sum_{i=1}^n (y_i - \bar{y})^2\).

ISLP, Chapter 3.1.3

97 / 100
Question
What does an \(R^2\) value of 0.75 mean?
97 / 100
Answer
An \(R^2\) value of 0.75 means that 75% of the variability in the response variable \(Y\) can be explained by the predictor variables included in the model. The remaining 25% of the variability is unexplained by the model.

ISLP, Chapter 3.1.3

98 / 100
Question
Can you have a negative \(R^2\) value?
98 / 100
Answer
For a linear regression model fit by least squares, the \(R^2\) on the training data will always be between 0 and 1. However, when evaluating a model on a test set, it is possible to get a negative \(R^2\) if the model fits the test data worse than a simple horizontal line at the mean of the test response.

General Knowledge

99 / 100
Question
What is the purpose of the `poly()` function in R or Python's `ISLP` library?
99 / 100
Answer
The `poly()` function is used to generate a basis matrix for polynomial regression. It creates columns representing polynomial functions of a predictor (e.g., \(x, x^2, x^3\)). By default, it often generates orthogonal polynomials, which are numerically more stable for fitting.

ISLP, Chapter 3.6.6

100 / 100
Question
What is the purpose of the `anova_lm()` function?
100 / 100
Answer
The `anova_lm()` function (Analysis of Variance) is used to compare two or more nested linear regression models. It performs an F-test to determine if the larger model provides a statistically significant improvement in fit over the smaller model.
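
A sketch comparing two nested fits with `anova_lm` from statsmodels on simulated data (the quadratic truth is hypothetical); the reported F-test asks whether the extra \(x^2\) term significantly improves the fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(13)
df = pd.DataFrame({'x': rng.uniform(-2, 2, 200)})
df['y'] = 1 + 2 * df.x - 1.5 * df.x**2 + rng.normal(scale=0.5, size=200)  # simulated data

fit1 = smf.ols('y ~ x', data=df).fit()                 # smaller (nested) model
fit2 = smf.ols('y ~ x + I(x**2)', data=df).fit()       # larger model with a quadratic term
print(anova_lm(fit1, fit2))                            # F-test comparing the two nested fits
```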

ISLP, Chapter 3.6.6