Naive Bayes
The prof’s framing: “LDA / QDA but you assume Σ_k is diagonal.” Predictors are conditionally independent within each class, which is generally false, but it slashes the parameter count and works surprisingly well. The slide deck calls it “Idiot’s Bayes.” The win is in large-p settings where you can’t afford the covariance estimates.
Definition (prof’s framing)
“Naive Bayes: assume the covariance matrix is diagonal, the predictors are conditionally independent given the class. Drops all off-diagonal covariance parameters.” - L09-classif-3
“Naive Bayes is optimal or popular when p is large because of the parameter count, fewer things to estimate, more robust.” - L09-classif-3
The single modeling assumption: within each class k, the predictors are independent.
Plug into Bayes’ theorem like LDA/QDA, classify by max posterior.
Notation & setup
- f_kj(x_j): the class-k, variable-j marginal density. Often Gaussian: X_j | Y = k ~ N(μ_kj, σ²_kj), but it doesn’t have to be.
- π_k: class prior (same as LDA/QDA).
- For Gaussian marginals: parameter count per class = 2p (a mean and a variance per predictor) → 2Kp total, plus K − 1 for the priors. Linear in p, vs the p(p+1)/2 entries per covariance matrix for LDA/QDA.
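A quick numeric check of the parameter-count gap (a minimal sketch; K = 3 and p = 50 are arbitrary illustration values, not from the course):

```python
# Parameter counts for the Gaussian generative classifiers, K classes, p predictors.
# K = 3, p = 50 are illustrative values only.
K, p = 3, 50

priors = K - 1                                  # class priors (they sum to 1)
naive_bayes = 2 * K * p + priors                # a mean + a variance per predictor, per class
lda = K * p + p * (p + 1) // 2 + priors         # K mean vectors + one pooled full covariance
qda = K * p + K * p * (p + 1) // 2 + priors     # K mean vectors + K full covariances

print(naive_bayes, lda, qda)                    # 302 1427 3977 -> naive Bayes is linear in p
```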
Formula(s) to know cold
Posterior under naive Bayes:
Pr(Y = k | X = x) = π_k ∏_{j=1}^p f_kj(x_j) / [ Σ_{l=1}^K π_l ∏_{j=1}^p f_lj(x_j) ]
Discriminant (Gaussian marginals):
δ_k(x) = log π_k − Σ_{j=1}^p [ (x_j − μ_kj)² / (2σ²_kj) + ½ log σ²_kj ]
(The ½ log σ²_kj term is included if class-specific variances are kept; with pooled variances it is constant across classes and drops out.)
This is just QDA with Σ_k restricted to be diagonal.
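As a concrete instance of the discriminant above, here is a minimal numpy sketch of Gaussian naive Bayes with class-specific variances (my own illustration, not course code; assumes a dense feature matrix X and integer labels y):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors and per-class, per-predictor means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])       # shape (K, p)
    variances = np.array([X[y == k].var(axis=0) for k in classes])    # shape (K, p)
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximizing delta_k(x) = log pi_k + sum_j log N(x_j; mu_kj, sigma2_kj)."""
    log_marginals = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                            + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    delta = np.log(priors)[None, :] + log_marginals.sum(axis=2)       # shape (n, K)
    return classes[np.argmax(delta, axis=1)]
```

This is the same model sklearn’s GaussianNB fits (sklearn additionally adds a small variance-smoothing term for numerical stability).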
Insights & mental models
- Naive Bayes = LDA/QDA with diagonal Σ_k. That’s the cleanest mental file for it. ISLP §4.5.1 makes this precise: with Gaussian marginals it’s QDA-with-diagonal-Σ_k; if you further pool the variances across classes, it’s LDA-with-diagonal-Σ.
- The independence assumption is generally false. “Do we really believe the naive Bayes assumption that the covariates are independent within each class? In most settings, we do not.” (ISLP §4.4.4) But the resulting bias is often offset by the dramatic variance reduction: the bias-variance argument in its simplest form.
- Why it’s “naive”: assuming predictors are conditionally independent given the class is a strong, usually-wrong assumption. It’s “idiotic,” hence the alternate name “Idiot’s Bayes.”
- Why it works anyway: for classification (vs density estimation), what matters is which class wins the argmax, not whether the densities are accurately estimated. Even rough-and-wrong densities often rank classes correctly.
- Mixed predictor types are easy. Continuous → Gaussian or kernel-smoothed marginal. Categorical → multinomial. They factor cleanly because of the independence assumption (see the sketch after this list).
- Prof’s slot for it: “popular when p is large”, the standard case where LDA/QDA’s covariance estimates blow up.
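The mixed-types sketch referenced above: a toy posterior with one Gaussian and one multinomial marginal. All parameter values are made up purely for illustration.

```python
from scipy.stats import norm

# Made-up per-class parameters: x1 is continuous, x2 is categorical with levels {0, 1, 2}.
priors = {"A": 0.6, "B": 0.4}
gauss_params = {"A": (0.0, 1.0), "B": (2.0, 1.5)}             # (mean, sd) of x1 within each class
cat_probs = {"A": [0.7, 0.2, 0.1], "B": [0.1, 0.3, 0.6]}      # P(x2 = level | class)

def unnormalized_posterior(x1, x2, k):
    mu, sd = gauss_params[k]
    # conditional independence within the class: the two marginals simply multiply
    return priors[k] * norm.pdf(x1, loc=mu, scale=sd) * cat_probs[k][x2]

scores = {k: unnormalized_posterior(x1=1.5, x2=2, k=k) for k in priors}
posterior = {k: s / sum(scores.values()) for k, s in scores.items()}  # normalize over classes
print(max(posterior, key=posterior.get), posterior)
```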
Exam signals
“Naive Bayes: large p. Skip the cross-covariance.” - L09-classif-3
“Σ_k is assumed diagonal, and only the diagonal elements are estimated.” - slide deck
The prof dedicated only one slide stretch to it (toward the end of L09); no exam-flagging quote, but the bias-variance argument (large p, small n, simpler model wins) is the recurring theme he’d test.
Pitfalls
- Confusing naive Bayes with the Bayes classifier. Different things. The Bayes classifier is the abstraction that uses the true posterior Pr(Y = k | X = x). Naive Bayes is a specific generative model with the conditional-independence assumption, used to estimate a posterior.
- Forgetting that “naive” is a technical word. Refers to the conditional-independence assumption, not to the classifier being simple.
- Treating naive Bayes as a strict subset of LDA. Strictly: with Gaussian marginals + pooled variances, naive Bayes ⊂ LDA (LDA with diagonal Σ). With Gaussian marginals + class-specific variances, naive Bayes ⊂ QDA. With non-Gaussian marginals, naive Bayes is its own thing.
- Standardization concerns. Same as for LDA/QDA: if predictors are on wildly different scales, the marginals’ standard deviations get badly estimated. Standardize.
Scope vs ISLP
- In scope: the conditional-independence assumption, f_k(x) = ∏_{j=1}^p f_kj(x_j), why it’s used (large p, parameter-count win), bias-variance trade-off justification.
- Look up in ISLP: §4.4.4, pp. 156–158. The toy example (p = 3, K = 2) illustrates the multiplicative posterior. §4.5.1 shows the formal connection to LDA (eq. 4.34), useful for the “naive Bayes ⊂ LDA with diagonal Σ” insight.
- Skip in ISLP:
- Detailed mixed-predictor naive-Bayes implementations (§4.4.4 final paragraphs); the concept matters, the mechanics don’t.
- Smoothing parameter / Laplace correction for zero-frequency categorical cells, never covered.
Exercise instances
None; naive Bayes has no recommended-exercise or compulsory-exercise problem in module 4. The slide deck mentions it briefly; the prof’s lecture covers it in maybe 2 minutes. Kept as a thin atom (per manifest note 5) because it’s a named method on the slide curriculum and could plausibly appear as an MCQ.
How it might appear on the exam
- MCQ: “Naive Bayes assumes which of the following?” → predictors are conditionally independent given the class.
- Method-comparison T/F: “Naive Bayes is preferred when p is large” → true. “Naive Bayes assumes the predictors are unconditionally independent” → false (only conditionally, given the class). “Naive Bayes is a special case of QDA” → true (if Gaussian marginals + class-specific variances).
- Parameter-count question: “How many parameters does Gaussian naive Bayes estimate for K classes and p predictors?” → 2Kp (plus K − 1 priors), much fewer than LDA or QDA.
- Bias-variance argument: why naive Bayes might out-perform QDA when n is small relative to p.
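A worked instance of the parameter-count question, with arbitrary numbers (K = 4, p = 10): Gaussian naive Bayes needs 2Kp + (K − 1) = 80 + 3 = 83 parameters, while QDA already needs K·p(p+1)/2 = 4·55 = 220 covariance entries before counting the 40 means and 3 priors.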
Related
- linear-discriminant-analysis: same Bayesian-discriminant machinery; LDA with diagonal Σ ≈ Gaussian naive Bayes (with shared variances).
- quadratic-discriminant-analysis: naive Bayes is QDA with diagonal Σ_k.
- multivariate-normal: the multivariate Gaussian whose diagonal-restriction gives Gaussian naive Bayes.
- diagnostic-vs-sampling-paradigm: naive Bayes is on the sampling/generative side.
- bias-variance-tradeoff: the standard justification for using a more-restricted model when p is large.
- curse-of-dimensionality: naive Bayes is one of the standard answers for “what to do in high .”