JPMC-quant/docs/statistical-significance-guide.md
2026-02-02 21:47:37 +01:00
# Statistical Significance Testing Guide
A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.
---
## Table of Contents
1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)
---
## Quick Decision Flowchart
```
What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│   │
│   └─► Use: compute_pairwise_significance()
│        │
│        ├─► Data normally distributed? → test_type="ttest"
│        └─► Not sure / skewed data?    → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
    │
    └─► Use: compute_ranking_significance()
         (automatically uses proportion z-test)
```
---
## Understanding Your Data Types
### Continuous Data
**What it looks like:** Numbers on a scale with many possible values.
| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |
```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Voice_Scale__V14 │ Voice_Scale__V04 │
│ str       │ f64              │ f64              │
├───────────┼──────────────────┼──────────────────┤
│ R_001     │ 7.5              │ 6.0              │
│ R_002     │ 8.0              │ 7.5              │
│ R_003     │ 6.5              │ 8.0              │
│ …         │ …                │ …                │
└───────────┴──────────────────┴──────────────────┘
```
### Ranking Data
**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.
| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |
```
shape: (5, 3)
┌───────────┬────────────┬────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str       │ i64        │ i64        │
├───────────┼────────────┼────────────┤
│ R_001     │ 1          │ null       │  ← V14 was ranked 1st
│ R_002     │ 2          │ 1          │  ← V04 was ranked 1st
│ R_003     │ null       │ 3          │  ← V04 was ranked 3rd
│ …         │ …          │ …          │
└───────────┴────────────┴────────────┘
```
### ⚠️ Aggregated Data (Cannot Test!)
**What it looks like:** Already summarized/totaled data.
```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │  ← ALREADY AGGREGATED
│ str       │ i64            │    Lost individual variance
├───────────┼────────────────┤    Cannot do significance tests!
│ V14       │ 209            │
│ V04       │ 180            │
│ …         │ …              │
└───────────┴────────────────┘
```
**Solution:** Go back to the raw data before aggregation.
---
## Available Tests
### 1. Mann-Whitney U Test (Default for Continuous)
**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings
```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # this is the default
)
```
**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure
**Cons:**
- Slightly less powerful than t-test when data IS normally distributed
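For intuition, the U statistic the test is built on is just a rank-sum computation. A minimal sketch (not the library routine, which also applies tie corrections and a normal approximation to produce the p-value; for real analyses use `compute_pairwise_significance()` or a stats library):

```python
def mann_whitney_u(a, b):
    """U statistic for group a: its rank-sum minus the minimum possible rank-sum.
    Ties receive midranks."""
    pooled = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # midrank over the tie run (1-based)
        i = j
    rank_sum_a = sum(ranks[v] for v in a)
    return rank_sum_a - len(a) * (len(a) + 1) / 2

# Illustrative rating values (not real survey data)
u = mann_whitney_u([7.5, 8.0, 6.5], [6.0, 7.0, 5.5])  # → 8.0
```

A large or small U relative to its maximum (`len(a) * len(b)`) indicates one group's ratings tend to rank above the other's.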
---
### 2. Independent t-Test
**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data
```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="ttest",
)
```
**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported
**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers
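The statistic itself is easy to compute from scratch. A sketch of the classic pooled-variance form with made-up rating samples (turning the statistic into a p-value additionally requires the t distribution, e.g. `scipy.stats.t`, so that step is omitted here):

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Independent-samples t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (ddof = 1)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

# Illustrative samples (not real survey data)
t = pooled_t([7.5, 8.0, 6.5, 7.0, 8.5], [6.0, 7.5, 8.0, 6.5, 5.5])
```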
---
### 3. Chi-Square Test
**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons
```python
pairwise_df, meta = S.compute_pairwise_significance(
    count_data,
    test_type="chi2",
)
```
**Pros:**
- Designed for count/frequency data
- Tests if distributions differ
**Cons:**
- Needs sufficient sample sizes
- Less informative about direction of difference
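The Pearson statistic is a short computation over observed vs expected counts. A sketch with a made-up 2×2 table (converting the statistic into a p-value needs the chi-square distribution, e.g. `scipy.stats.chi2`):

```python
def chi2_statistic(observed):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = groups, columns = response categories
stat = chi2_statistic([[30, 10], [20, 20]])  # → 16/3 ≈ 5.33
```

Note the "expected counts ≥ 5" assumption above refers to these `expected` values, not the observed ones.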
---
### 4. Two-Proportion Z-Test (For Rankings)
**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`
```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```
**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
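Under the hood this is a standard pooled two-proportion z-test. A self-contained sketch with hypothetical vote counts (the library call handles all of this for you):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided z-test for the difference of two proportions (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical: Voice A got 40 of 120 Rank-1 votes, Voice B got 22 of 120
z, p = two_proportion_z(40, 120, 22, 120)
```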
---
## Multiple Comparison Corrections
### Why Do We Need Corrections?
When you compare many groups, you're doing many tests. Each test has a 5% chance of a false positive (if α = 0.05). With 17 voices:
| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |
**Corrections adjust p-values to account for this.**
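The arithmetic behind that table, assuming independent tests at α = 0.05:

```python
from math import comb

n_pairs = comb(17, 2)              # 17 voices → 136 pairwise comparisons
alpha = 0.05
expected_fp = n_pairs * alpha      # ≈ 6.8 false positives expected on average
fwer = 1 - (1 - alpha) ** n_pairs  # chance of at least one false positive
```

With 136 uncorrected tests, `fwer` is above 99%: at least one spurious "significant" pair is practically guaranteed.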
---
### Bonferroni Correction (Conservative)
**Formula:** `p_adjusted = min(1, p_value × number_of_comparisons)`
```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="bonferroni",  # this is the default
)
```
**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)
**Trade-off:** May miss real differences (more false negatives)
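The adjustment itself is one line; a sketch (capping at 1 keeps adjusted values valid probabilities):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.001, 0.02, 0.04])  # each p-value tripled, capped at 1
```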
---
### Holm-Bonferroni Correction (Less Conservative)
**Formula:** Step-down procedure: sort the p-values ascending and multiply the i-th smallest by (m − i + 1), enforcing monotonicity. Only the smallest p-value gets the full Bonferroni factor; each subsequent one gets a smaller multiplier.
```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="holm",
)
```
**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly
**Trade-off:** Slightly higher false positive risk than Bonferroni
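For reference, the step-down procedure is small enough to sketch (results are returned in the input order; stats libraries such as statsmodels offer the same adjustment):

```python
def holm(p_values):
    """Holm step-down adjustment: sort ascending, multiply the i-th smallest
    p-value by (m - i) with i counted from 0, enforce monotonicity, cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # monotonicity: an adjusted p-value never drops below an earlier one
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted

adjusted = holm([0.04, 0.001, 0.02])  # ≈ [0.04, 0.003, 0.04]
```

Compare with Bonferroni on the same inputs (≈ [0.12, 0.003, 0.06]): Holm is never stricter, which is where its extra power comes from.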
---
### No Correction
**Not recommended for final analysis**, but useful for exploration.
```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="none",
)
```
**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate
---
### Correction Method Comparison
| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |
**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.
---
## Interpreting Results
### Key Output Columns
| Column | Meaning |
|--------|---------|
| `p_value` | Probability of observing a difference this large if there were no true difference |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big is the difference (practical significance) |
### What the p-value Means
| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |
### Statistical vs Practical Significance
**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.
**Practical significance** (effect size) means the difference matters in the real world.
| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |
**Example:** A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
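Cohen's d is simple to compute from the two raw samples. A sketch using the pooled standard deviation (sample standard deviation, ddof = 1), with made-up ratings:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                     / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd

d = cohens_d([7.5, 8.0, 6.5, 7.0, 8.5], [6.0, 7.5, 8.0, 6.5, 5.5])
```

Report d alongside the adjusted p-value so readers can judge both whether the difference is real and whether it is big enough to act on.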
---
## Code Examples
### Example 1: Voice Scale Ratings
```python
# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)
# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # safe default for ratings
    alpha=0.05,
    correction="bonferroni",
)
# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")
# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
    sig_pairs = pairwise_df.filter(pl.col('significant'))
    print(f"Found {sig_pairs.height} significant pairwise differences")
# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```
### Example 2: Top 3 Voice Rankings
```python
# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)
# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm",  # less conservative for many comparisons
)
# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")
# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")
# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant'))
print(sig_pairs)
```
### Example 3: Comparing Demographic Subgroups
```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)
S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)
# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```
---
## Common Mistakes to Avoid
### ❌ Using Aggregated Data
```python
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores) # Will fail!
```
### ✅ Use Raw Data
```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```
### ❌ Ignoring Multiple Comparisons
```python
# WRONG - without correction, ~5% of pairs will be "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```
### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```
### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}") # Missing context!
```
### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```
---
## Quick Reference Card
| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |

| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |
---
## Further Reading
- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)