What is the fundamental trade-off in model selection according to the ST3189 guide?
The trade-off between Bias (how wrong the model is) and Variance (estimation error). Choosing a useful model involves sacrificing a little bit of bias to significantly reduce variance.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What are the two main reasons we are often not satisfied with Maximum Likelihood Estimates (MLE)?
1. Prediction accuracy: LSE/MLE often have low bias but large variance; shrinkage can improve accuracy.
2. Interpretation: With many predictors, subset selection helps identify the strongest effects for a "big picture" view.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Why is comparing models based on log-likelihood alone problematic?
The log-likelihood evaluated at the MLE never decreases as parameters are added, so it always favors the larger model, which inevitably leads to over-fitting.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Define the Akaike Information Criterion (AIC).
A metric for model comparison that penalizes model complexity:
$AIC = -2\log f(y|\hat{\beta}, \hat{\sigma}^2) + 2p$
where $p$ is the number of parameters. Smaller values indicate better models.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Define the Bayesian Information Criterion (BIC).
Also known as the Schwarz criterion:
$BIC = -2\log f(y|\hat{\beta}, \hat{\sigma}^2) + p\log n$
where $n$ is the number of observations. It uses a heavier penalty than AIC for large $n$.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
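A minimal R illustration of both criteria (not from the guide; the data frame df and its response y are hypothetical, and stats::AIC/BIC count $\hat{\sigma}^2$ as a parameter):
  fit <- lm(y ~ ., data = df)
  AIC(fit)   # -2*logLik(fit) + 2*p
  BIC(fit)   # -2*logLik(fit) + p*log(n); heavier penalty than AIC once n > 7
  # By hand, for comparison:
  -2 * as.numeric(logLik(fit)) + 2 * attr(logLik(fit), "df")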
What is the "Size Principle" in Bayesian concept learning?
The model favors the smallest (simplest) hypothesis consistent with the data. This is a mathematical embodiment of Occam's Razor.
Source: Murphy, Machine Learning, p. 98
What is "Bayesian Occam's Razor"?
The effect where the marginal likelihood naturally favors simpler models because complex models must spread their probability mass more thinly over a larger space of possible datasets.
Source: Murphy, Machine Learning, p. 187
Define Mallow's $C_p$.
An estimate of the test MSE:
$C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)$
where $d$ is the number of predictors and $\hat{\sigma}^2$ is an estimate of the error variance.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is Adjusted $R^2$?
A version of $R^2$ that penalizes unnecessary variables:
$\text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$.
Unlike $R^2$, it does not automatically increase with more variables.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What does the "No Free Lunch Theorem" state?
There is no universally best model; a set of assumptions that works well in one domain may perform poorly in another. Empirical validation (like CV) is required to find the best method for a specific problem.
Source: Murphy, Machine Learning, p. 55
Describe the first step of Best Subset Selection.
For each $k \in \{1, \dots, p\}$, fit all $\binom{p}{k}$ models with exactly $k$ predictors. Identify the "best" model $M_k$ (usually smallest RSS).
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the second step of Best Subset Selection?
Choose a single best model from $M_0, \dots, M_p$ using cross-validated prediction error, $C_p$, BIC, or Adjusted $R^2$.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
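A minimal sketch of both steps with leaps::regsubsets() (the data frame df with response y is hypothetical):
  library(leaps)
  regfit.full <- regsubsets(y ~ ., data = df, nvmax = ncol(df) - 1)  # step 1: best M_k for each k
  reg.summary <- summary(regfit.full)
  best.k <- which.min(reg.summary$bic)   # step 2: pick k by BIC
  coef(regfit.full, best.k)              # coefficients of the selected model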
Why is Best Subset Selection computationally demanding?
The number of possible models is $2^p$. For 20 predictors, there are over 1 million models; for 40, it is over 1 trillion.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Define Forward Stepwise Selection.
A greedy approach that starts with the null model and sequentially adds the predictor that results in the greatest additional improvement to the fit.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Define Backward Stepwise Selection.
Starts with the full model containing all $p$ predictors and sequentially removes the predictor that is least useful (least impact on RSS).
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is a statistical advantage of Forward Stepwise over Best Subset?
Forward stepwise is a more constrained search. While it might have slightly more bias, it often results in lower variance for the selected model.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Can Backward Stepwise Selection be used if $p > n$?
No. It requires fitting the full model initially, which cannot be done with LSE when the number of parameters exceeds the number of observations.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is "Hybrid Stepwise Selection"?
A method where variables are added sequentially (like forward selection), but after adding each new variable, the algorithm also checks if any existing variables can be removed without significantly worsening the fit.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
How does the nvmax parameter in regsubsets() work?
It specifies the maximum size of the subsets to be examined. If nvmax=20, the algorithm will return the best models of sizes 1 through 20.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the "One-Standard-Error Rule" for model selection?
Calculate the standard error of the estimated test error for each model size. Then, select the simplest model whose error is no more than one standard error above the lowest point on the curve.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
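A sketch of the rule, assuming hypothetical vectors cv.mean and cv.se holding the estimated test error and its standard error for each model size:
  best <- which.min(cv.mean)
  threshold <- cv.mean[best] + cv.se[best]
  chosen <- min(which(cv.mean <= threshold))   # simplest size within one SE of the minimum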
In Best Subset Selection, is the best model of size $k$ guaranteed to contain the best model of size $k-1$?
No. This is a key difference from Stepwise selection. The variables in the optimal subset of size $k$ may be completely different from those in the subset of size $k-1$.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the computational complexity (number of models) of Forward Stepwise?
Approximately $1 + p(p+1)/2$ models.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
How does one use regsubsets for Forward selection in R?
By setting the method="forward" argument in the regsubsets() function.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
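For example (the data frame df, response y, and nvmax value are hypothetical):
  library(leaps)
  regfit.fwd <- regsubsets(y ~ ., data = df, nvmax = 19, method = "forward")
  summary(regfit.fwd)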
What R package is required for the regsubsets() function?
The leaps package.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Why use indirect estimates (AIC/BIC) instead of direct CV error for subset selection?
Computational cost. Directly estimating test error via CV for every possible model in a search procedure is often too demanding. Indirect criteria are faster to compute.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
In the ST3189 guide, what quote is used to emphasize model utility?
"All models are wrong but some are useful!" by George Box.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the "Validation Set" approach?
Randomly dividing the available data into a Training set (to fit models) and a Validation set (to select model complexity/parameters).
Source: ST3189 Subject Guide - Subset Selection Linear Regression
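A minimal sketch of the approach (hypothetical data frame df with response y):
  set.seed(1)
  train <- sample(nrow(df), floor(nrow(df) / 2))   # random 50/50 split
  fit <- lm(y ~ ., data = df[train, ])             # fit on the training set
  pred <- predict(fit, newdata = df[-train, ])     # predict on the validation set
  mean((df$y[-train] - pred)^2)                    # validation MSE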
What is the formula for RSS in Matrix notation?
$RSS = (y - X\beta)^T(y - X\beta)$.
Source: Rogers & Girolami, FCIML, p. 19
What is the "Null Model" RSS equal to?
The Total Sum of Squares (TSS): $\sum (y_i - \bar{y})^2$.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
How does $R^2$ behave as more variables are added in Best Subset Selection?
It increases monotonically. This is why it is useless for comparing models of different sizes; it will always favor the largest model.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the Ridge Regression penalty term?
The $L_2$ norm of the coefficients: $\lambda \sum_{j=1}^p \beta_j^2$.
Source: ST3189 Subject Guide - Shrinkage Methods
What is the Lasso Regression penalty term?
The $L_1$ norm of the coefficients: $\lambda \sum_{j=1}^p |\beta_j|$.
Source: ST3189 Subject Guide - Shrinkage Methods
Compare the mathematical constraint of Ridge vs. Lasso.
Ridge: $\sum \beta_j^2 \leq s$ (circular constraint).
Lasso: $\sum |\beta_j| \leq s$ (diamond/pointy constraint).
Source: ST3189 Subject Guide - Shrinkage Methods
What is the "Ridge Estimator" formula?
$\hat{\beta}_{R} = (X^T X + \lambda I)^{-1} X^T y$.
Source: ST3189 Subject Guide - Shrinkage Methods
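The closed form can be computed directly (a sketch only; X is a hypothetical standardized design matrix, and any intercept is assumed unpenalized and handled separately):
  lambda <- 1
  beta.ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)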
Why is Ridge regression stable even if $X^T X$ is singular?
Adding $\lambda$ to the diagonal elements of $X^T X$ ensures the resulting matrix $(X^T X + \lambda I)$ is invertible, even when $p > n$.
Source: ST3189 Subject Guide - Shrinkage Methods
Define "Soft Thresholding" in Lasso.
In the orthonormal case, Lasso translates each coefficient by $\lambda/2$ towards zero, and sets it exactly to zero if its absolute value is less than $\lambda/2$.
Source: ST3189 Subject Guide - Shrinkage Methods
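A one-line sketch of soft thresholding with the $\lambda/2$ convention used above:
  soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
  soft(c(-1.0, 0.2, 0.7), lambda = 1)   # gives -0.5, 0.0, 0.2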
Define "Proportional Shrinkage" in Ridge.
In the orthonormal case, Ridge divides each least squares coefficient by a constant factor $(1 + \lambda)$.
Source: ST3189 Subject Guide - Shrinkage Methods
What is the effect of scaling on Ridge and Lasso?
They are not scale-invariant. Predictors must be standardized (mean 0, variance 1) so that the penalty is applied fairly across all variables.
Source: ST3189 Subject Guide - Shrinkage Methods
What happens to Lasso coefficients as $\lambda$ increases?
They shrink towards zero, and eventually, many are set to exactly zero, effectively performing variable selection.
Source: ST3189 Subject Guide - Shrinkage Methods
Define Elastic Net.
A method minimizing: $RSS + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$. It bridges Lasso and Ridge, useful for handling groups of correlated variables.
Source: ST3189 Subject Guide - Shrinkage Methods
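In glmnet the same idea is expressed with a single $\lambda$ and a mixing proportion alpha; a sketch (x and y are a hypothetical numeric predictor matrix and response):
  library(glmnet)
  fit.enet <- glmnet(x, y, alpha = 0.5)   # 0 < alpha < 1 mixes the L1 and L2 penalties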
When does Ridge regression tend to outperform Lasso?
When there are many predictors, all of which have a non-zero effect on the response, or when variables are highly correlated.
Source: ST3189 Subject Guide - Shrinkage Methods
When does Lasso tend to outperform Ridge regression?
When the true model is sparse (only a small number of predictors have a large effect, and others are close to zero).
Source: ST3189 Subject Guide - Shrinkage Methods
How does glmnet handle categorical variables?
It requires numerical matrices. You should use model.matrix() first to convert factors into dummy variables.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
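For example (hypothetical data frame df with response y):
  library(glmnet)
  x <- model.matrix(y ~ ., data = df)[, -1]   # dummy-code factors, drop the intercept column
  y <- df$y
  fit.lasso <- glmnet(x, y, alpha = 1)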
What is the "Lambda Grid"?
A range of $\lambda$ values (e.g., $10^{10}$ down to $10^{-2}$) tested via CV to find the optimal shrinkage level.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is lambda.min in cv.glmnet?
The value of $\lambda$ that results in the smallest cross-validated mean squared error.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is lambda.1se in cv.glmnet?
The largest $\lambda$ such that the error is within one standard error of the minimum. It typically provides a more regularized (simpler) model.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
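A sketch using the hypothetical x and y from the previous card:
  cvfit <- cv.glmnet(x, y, alpha = 0)     # 10-fold CV by default
  cvfit$lambda.min                        # lambda with the smallest CV error
  cvfit$lambda.1se                        # largest lambda within one SE of that minimum
  coef(cvfit, s = "lambda.1se")           # coefficients at the more regularized choice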
In glmnet, what does alpha=0 signify?
It specifies the Ridge Regression model ($L_2$ penalty only).
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
In glmnet, what does alpha=1 signify?
It specifies the Lasso Regression model ($L_1$ penalty only).
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
Does Ridge regression produce a "sparse" model?
No. It includes all variables in the final model, though their coefficients may be very small.
Source: ST3189 Subject Guide - Shrinkage Methods
What is the "Bias-Variance Trade-off" in Ridge regression?
As $\lambda$ increases, model complexity and Variance decrease, but Bias increases. The goal is to find $\lambda$ where the Total Error (MSE) is minimized.
Source: ST3189 Subject Guide - Shrinkage Methods
Define "L1 Norm" ($\|\beta\|_1$).
The sum of the absolute values of the coefficients: $\sum_{j=1}^p |\beta_j|$.
Source: ST3189 Subject Guide - Shrinkage Methods
Define "L2 Norm" ($\|\beta\|_2$).
The square root of the sum of the squares of the coefficients: $\sqrt{\sum \beta_j^2}$. (The penalty in Ridge uses the square of this norm).
Source: ST3189 Subject Guide - Shrinkage Methods
What is a "Bayesian Interpretation" of Ridge Regression?
Ridge regression is the MAP (Maximum A Posteriori) estimate for a linear model with a Gaussian prior (centered at zero) on the coefficients.
Source: Bishop, Pattern Recognition, p. 30
What is a "Bayesian Interpretation" of Lasso?
Lasso is the MAP estimate for a linear model with a Laplace prior (centered at zero) on the coefficients.
Source: Murphy, Machine Learning, p. 440
What is "Data Leaking"?
Using information from the test set (e.g., its mean or variance for scaling) during the training phase. This leads to overly optimistic performance estimates.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is a "Pipeline" in mlr3?
A sequence of data processing steps (like scaling) and learning algorithms bundled together to ensure valid cross-validation and prevent data leaking.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
In mlr3, what does po('scale') do?
It is a "PipeOp" (Pipe Operator) that performs data scaling (centering and scaling to unit variance).
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is the "James-Stein Estimator"?
A shrinkage estimator for the mean of a Gaussian that is proven to have lower MSE than the sample mean for $N \geq 4$.
Source: Murphy, Machine Learning, p. 230
Define "Stein's Paradox".
The counter-intuitive result that the best estimate of a vector of means is NOT the vector of individual sample means, but rather a shrunken version.
Source: Murphy, Machine Learning, p. 230
Is Ridge regression suitable for "High-Dimensional" data ($p > n$)?
Yes. It is specifically designed to handle ill-conditioned matrices where LSE fails.
Source: ST3189 Subject Guide - Shrinkage Methods
How does Lasso behave when variables are perfectly correlated?
The solution is not unique; it may pick any of the variables or a combination, but it will typically drop all but one.
Source: ST3189 Subject Guide - Shrinkage Methods
How does Ridge behave when variables are perfectly correlated?
It assigns them identical coefficients, effectively averaging their effects.
Source: ST3189 Subject Guide - Shrinkage Methods
Define "Hard-Thresholding".
The model selection strategy used in subset selection: coefficients for excluded variables are set to zero, and others are left at their LSE values.
Source: ST3189 Subject Guide - Shrinkage Methods
What is the "L1 Penalty" shape in 3D?
An octahedron (dual of a cube). It has sharp edges and vertices on the coordinate axes.
Source: ST3189 Subject Guide - Shrinkage Methods
What is the "L2 Penalty" shape in 3D?
A sphere. It is smooth and lacks the axis-aligning "corners" of L1.
Source: ST3189 Subject Guide - Shrinkage Methods
What is "Coordinate Descent"?
A numerical optimization algorithm used to fit Lasso models by optimizing one coefficient at a time while holding others fixed.
Source: Murphy, Machine Learning, p. 441
Define "Total Sum of Squares" (TSS).
The total variability in the response: $TSS = \sum (y_i - \bar{y})^2$.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the formula for the "Full LSE" in matrix notation?
$\hat{\beta} = (X^T X)^{-1} X^T y$.
Source: Rogers & Girolami, FCIML, p. 22
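Computed directly in R (a sketch; X is a hypothetical design matrix including an intercept column):
  beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations
  # lm() uses a QR decomposition instead, which is numerically preferable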
What is "Mahalanobis Distance"?
A distance measure that accounts for the correlation between variables: $\sqrt{(x-\mu)^T \Sigma^{-1}(x-\mu)}$.
Source: Bishop, Pattern Recognition, p. 129
How does "Trace Trick" apply to log-likelihood?
It allows rewriting scalar quadratic forms as traces of matrices: $x^T A x = \text{tr}(A x x^T)$. Useful for MVN parameter estimation.
Source: Murphy, Machine Learning, p. 130
What is "Shrinkage Factor" ($B$)?
In Empirical Bayes, the value $B = \frac{\sigma^2}{\sigma^2 + \tau^2}$ that determines how much the local estimate is pulled toward the global mean.
Source: Murphy, Machine Learning, p. 176
Define "Standard Error" of the mean.
$SE(\bar{x}) = \frac{s}{\sqrt{n}}$. It quantifies the uncertainty in the sample mean estimate.
Source: Murphy, Machine Learning, p. 137
What is a "Credible Interval"?
A Bayesian interval containing $1-\alpha$ of the posterior probability mass.
Source: Murphy, Machine Learning, p. 183
What is the "Highest Posterior Density" (HPD) region?
The narrowest possible interval containing $1-\alpha$ of the posterior mass; every point inside has higher density than points outside.
Source: Murphy, Machine Learning, p. 184
Define "Degrees of Freedom" in the context of AIC.
The number of free parameters ($p$) used to fit the model. It represents the "cost" paid for complexity.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the "Bayes Factor"?
The ratio of the marginal likelihoods of two competing models: $BF_{10} = \frac{p(D|M_1)}{p(D|M_0)}$.
Source: Murphy, Machine Learning, p. 194
How does "Bias" change as model complexity decreases?
Bias increases as the model becomes simpler and potentially under-fits the true underlying pattern.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
How does "Variance" change as model complexity decreases?
Variance decreases as the model becomes less sensitive to specific training data points.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is "Consensus Sequence"?
In biosequence analysis, the sequence formed by picking the most probable letter at each location.
Source: Murphy, Machine Learning, p. 67
What is "Sifting Property" of Dirac delta?
$\int f(x)\delta(x-\mu)dx = f(\mu)$. It extracts the value of a function at a specific point.
Source: Murphy, Machine Learning, p. 70
How do you plot the BIC values for models from regsubsets?
plot(summary(regfit.full)$bic, type='l').
Source: ST3189 Subject Guide - Subset Selection Linear Regression
What is the "Elbow" in a BIC plot?
The point where the curve starts to flatten out or increase, indicating that additional variables are no longer providing significant improvement relative to the penalty.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
How do you extract coefficients from a glmnet model at a specific $\lambda$?
coef(fit, s=0.01).
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is the purpose of TaskRegr$new() in mlr3?
To create a new regression task, specifying the task ID, the backend data, and the target variable name.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
What is po('scale') %>>% po(learner)?
An mlr3 pipeline that scales the data first and then passes it to the learner.
Source: ST3189 Subject Guide - Subset Selection Shrinkage Methods R
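A minimal sketch combining the two previous cards (the task id, data frame df, target name y, and the rpart learner are placeholders, not taken from the guide):
  library(mlr3)
  library(mlr3pipelines)
  task <- TaskRegr$new(id = "demo", backend = df, target = "y")
  graph <- po("scale") %>>% po("learner", learner = lrn("regr.rpart"))
  glearner <- GraphLearner$new(graph)
  resample(task, glearner, rsmp("cv", folds = 5))   # scaling is re-estimated inside each fold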
Why is LSE unbiased?
Under the standard linear model assumptions, $E(\hat{\beta}_{LSE}) = \beta$: on average, the estimator hits the true parameter value.
Source: ST3189 Subject Guide - Subset Selection Linear Regression
Is Ridge regression biased?
Yes. It intentionally introduces bias (pulling coefficients toward zero) to achieve a larger reduction in variance.
Source: ST3189 Subject Guide - Shrinkage Methods
Is Lasso regression biased?
Yes. Like Ridge, it trades bias for a reduction in variance and the benefit of variable selection.
Source: ST3189 Subject Guide - Shrinkage Methods
What is "MSE Decomposition"?
$MSE = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$.
Source: Murphy, Machine Learning, p. 233
Define "Consistent Estimator".
An estimator $\hat{\theta}$ such that $\hat{\theta} \to \theta^*$ as $n \to \infty$. MLE is a consistent estimator.
Source: Murphy, Machine Learning, p. 231
What is the "Cramer-Rao Lower Bound"?
A lower bound on the variance of any unbiased estimator. MLE achieves this bound asymptotically.
Source: Murphy, Machine Learning, p. 232
What is "Minimax Risk"?
A decision rule that minimizes the maximum possible risk (worst-case scenario). It is often very pessimistic.
Source: Murphy, Machine Learning, p. 227
Define "Admissible Estimator".
An estimator that is not "strictly dominated" by any other estimator (i.e., there's no other estimator with lower risk for all possible values of $\theta$).
Source: Murphy, Machine Learning, p. 228
What is the "Mode" of a distribution?
The point at which the probability density (or mass) function reaches its maximum value. In Bayesian context, it is the MAP estimate.
Source: Murphy, Machine Learning, p. 181
Why might the Mode be an "untypical" point?
In skewed or multimodal distributions, the mode may be at an extreme or far from the bulk of the probability mass. Mean or Median are often better summaries.
Source: Murphy, Machine Learning, p. 181
What is "Type II Maximum Likelihood"?
Also called Empirical Bayes: optimizing the hyperparameters by maximizing the marginal likelihood (evidence) instead of integrating them out.
Source: Murphy, Machine Learning, p. 173
Define "Sufficient Statistic".
A function of the data $s(D)$ that contains all the information needed to infer the parameter $\theta$.
Source: Murphy, Machine Learning, p. 105
What is "Conjugate Prior"?
A prior distribution that, when combined with a specific likelihood, results in a posterior distribution of the same functional form.
Source: Murphy, Machine Learning, p. 105
How does "Ridge Regression" handle Multi-collinearity?
By adding $\lambda$ to the diagonal of $X^T X$, it effectively "breaks" the perfect correlation, allowing for stable coefficient estimates.
Source: ST3189 Subject Guide - Shrinkage Methods