The Hierarchy of Advertising Evidence
In medicine and other sciences, the “Hierarchy of Evidence” ranks research methods by the strength of their causal claims, with meta-analyses of randomized controlled trials (RCTs) at the top and anecdotal opinion at the bottom. Advertising faces the same challenge: separating true incrementality from correlation. This hierarchy adapts that framework to advertising, showing how experimental methods provide the most credible evidence of causal impact. Quasi-experimental and observational approaches, such as marketing mix models (MMM), synthetic controls, or attribution, remain widely practiced and valuable, especially when experiments are impractical (e.g., in small national markets or for channels like in-store promotions). Still, because these approaches lack randomization, they depend on modeling choices and statistical assumptions that are harder to verify. They are therefore more prone to bias and overfitting, and they provide less certain evidence of causal impact than experiments.
Experiments
Models Plus Experiments (MPE) – Most rigorous framework: routinely validating MMM (or other models) with RCTs; combines scale of models with causal credibility of experiments.
Meta-Analysis of Many Large-Scale Experiments – Aggregating results across multiple RCTs; strongest external validity.
Large-Scale Geo Experiments (Cluster RCTs) – Random assignment of all geo regions within a country (e.g., designated market areas, DMAs); a consistent, robust, and scalable framework applicable across virtually all media channels.
Large-Scale User-Level Experiments – Randomizing millions of IDs; high internal validity but limited by cross-device fragmentation and low match rates, and noisy for measuring small ad lifts.
Small-Scale Geo Experiments – Still randomized, but with less statistical power and a higher risk of idiosyncratic bias from a few markets.
Small-Scale User-Level Experiments – Feasible but often underpowered; generalizability limited.
Interrupted Time Series / Switchback Tests – Turning campaigns on/off repeatedly; can work if treatment effect is immediate and reversible, but vulnerable to temporal confounds.
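To make the geo-experiment entries above concrete, here is a minimal sketch of a cluster randomized test: regions are randomly split into treatment and control, and lift is estimated as the difference in mean sales per region. The function names, the simulated sales figures, and the assumed true effect are all hypothetical and chosen only for illustration; a real analysis would typically also adjust for pre-period sales and region size.

```python
import random
import statistics


def assign_geos(geo_ids, seed=0):
    """Randomly split geo regions into two halves: treatment and control."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(geo_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def estimate_lift(sales_by_geo, treated, control):
    """Return (difference in mean sales per geo, standard error of that difference)."""
    t = [sales_by_geo[g] for g in treated]
    c = [sales_by_geo[g] for g in control]
    diff = statistics.mean(t) - statistics.mean(c)
    se = (statistics.variance(t) / len(t) + statistics.variance(c) / len(c)) ** 0.5
    return diff, se


if __name__ == "__main__":
    rng = random.Random(42)
    geos = [f"geo_{i}" for i in range(200)]  # hypothetical regions
    treated, control = assign_geos(geos, seed=1)
    true_effect = 5.0  # assumed incremental sales per treated geo (simulation only)
    sales = {
        g: rng.gauss(100, 10) + (true_effect if g in treated else 0.0)
        for g in geos
    }
    diff, se = estimate_lift(sales, treated, control)
    print(f"estimated lift per geo: {diff:.2f} (standard error {se:.2f})")
```

Rerunning the simulation with only a handful of regions makes the standard error grow, which is the power problem noted for small-scale geo experiments.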
Observational / Quasi-Experimental Analytics
Synthetic Controls – Weighted combination of control units to construct a “synthetic twin”; can provide credible counterfactuals but sensitive to donor pool and overfitting. (E.g., Meta's GeoLift R package.)
Bayesian Structural Time Series (BSTS) – Flexible probabilistic framework for counterfactual forecasting with covariates, trend, and seasonality; powerful but dependent on priors and specification choices. (E.g., Google's CausalImpact R package.)
Difference-in-Differences (DiD) – Compares changes in treated vs. untreated groups over time; intuitive and widely used but hinges on the parallel trends assumption.
Marketing Mix Models (MMM) – Longstanding regression framework using aggregate longitudinal data to estimate channel contributions; offers a holistic view of the mix, but its causal claims rest on unverified assumptions unless validated by experiments.
Matched Market Tests – Selects untreated markets that resemble treated ones; easy to explain but prone to hidden bias.
Lookalike Controls (Propensity Score Matching or Machine Learning–adjusted cohorts) – Construct control groups matched on demographics, purchase history, or other observables; widely used in lift tests based on research panels but subject to hidden bias from unmeasured factors.
Exposed vs. Unexposed Comparisons – Naïve lift tests comparing those who saw ads to those who didn’t; easy to run but confounded by targeting bias.
Attribution Models (MTA, heuristic last-click, etc.) – Useful for exploring customer journeys, but correlation-based and not causal evidence.
Expert Judgment – Can inform hypotheses or priors, but is not empirical evidence.
Black Box AI Optimization – Proprietary “incrementality” claims without transparency; credibility requires experimental validation.
Self-Reporting by Media Vendors – Metrics reported by the seller; rife with bias and not reliable for causal claims.
Popular Opinion / Conventional Wisdom – Lowest rung; anecdotal, not evidence.
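As an illustration of the difference-in-differences entry in the list above, the sketch below contrasts the DiD estimate with a naive post-period comparison of treated and untreated markets. The sales figures and function names are hypothetical, and the DiD number is only credible if the parallel trends assumption holds, that is, if treated markets would have changed by the same amount as control markets without the campaign.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: change in treated minus change in control."""
    return (treated_post - treated_pre) - (control_post - control_pre)


def naive_post_gap(treated_post, control_post):
    """Naive post-period comparison that ignores pre-existing differences."""
    return treated_post - control_post


# Hypothetical average weekly sales per market (illustrative numbers only).
treated_pre, treated_post = 100.0, 130.0
control_pre, control_post = 90.0, 105.0

print(did_estimate(treated_pre, treated_post, control_pre, control_post))  # 15.0
print(naive_post_gap(treated_post, control_post))  # 25.0
```

In this made-up case the naive gap of 25 credits the campaign with a 10-unit difference that already existed before it started, while DiD removes that baseline gap and attributes 15 units to the campaign.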