statistical tests

This commit is contained in:
2026-02-02 21:47:37 +01:00
parent 29df6a4bd9
commit f2c659c266
9 changed files with 1679 additions and 47 deletions

# Statistical Significance Testing Guide
A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.
---
## Table of Contents
1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)
---
## Quick Decision Flowchart
```
What kind of data do you have?
├─► Continuous scores (1-10 ratings, averages)
│ │
│ └─► Use: compute_pairwise_significance()
│ │
│ ├─► Data normally distributed? → test_type="ttest"
│ └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice)
└─► Ranking data (1st, 2nd, 3rd place votes)
└─► Use: compute_ranking_significance()
(automatically uses proportion z-test)
```
---
## Understanding Your Data Types
### Continuous Data
**What it looks like:** Numbers on a scale with many possible values.
| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |
```
shape: (5, 3)
┌───────────┬─────────────────┬─────────────────┐
│ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│
│ str │ f64 │ f64 │
├───────────┼─────────────────┼─────────────────┤
│ R_001 │ 7.5 │ 6.0 │
│ R_002 │ 8.0 │ 7.5 │
│ R_003 │ 6.5 │ 8.0 │
```
### Ranking Data
**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.
| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |
```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str │ i64 │ i64 │
├───────────┼──────────────────┼──────────────────┤
│ R_001 │ 1 │ null │ ← V14 was ranked 1st
│ R_002 │ 2 │ 1 │ ← V04 was ranked 1st
│ R_003 │ null │ 3 │ ← V04 was ranked 3rd
```
### ⚠️ Aggregated Data (Cannot Test!)
**What it looks like:** Already summarized/totaled data.
```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │ ← ALREADY AGGREGATED
│ str │ i64 │ Lost individual variance
├───────────┼────────────────┤ Cannot do significance tests!
│ V14 │ 209 │
│ V04 │ 180 │
```
**Solution:** Go back to the raw data before aggregation.
---
## Available Tests
### 1. Mann-Whitney U Test (Default for Continuous)
**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney" # This is the default
)
```
**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure
**Cons:**
- Slightly less powerful than t-test when data IS normally distributed
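The internals of `compute_pairwise_significance` are not shown here, but for a single pair of voices the Mann-Whitney U comparison boils down to one call, sketched below with `scipy.stats.mannwhitneyu` and made-up rating columns (the use of scipy is an assumption, not confirmed by this guide):

```python
from scipy.stats import mannwhitneyu

# Made-up 1-10 ratings for two voices (illustration only)
v14 = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5]
v04 = [6.0, 7.5, 8.0, 5.5, 6.5, 6.0]

# Non-parametric comparison: no normality assumption needed
stat, p_value = mannwhitneyu(v14, v04, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```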
---
### 2. Independent t-Test
**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="ttest"
)
```
**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported
**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers
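If you are unsure whether your data is normal enough for a t-test, a quick Shapiro-Wilk check can guide the choice. A minimal sketch with scipy and made-up ratings (the helper names and data are illustrative, not part of the library):

```python
from scipy.stats import shapiro, ttest_ind

# Made-up ratings for two voices (illustration only)
v14 = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5, 7.8, 8.2]
v04 = [6.0, 7.5, 8.0, 5.5, 6.5, 6.0, 7.0, 6.8]

# Shapiro-Wilk: a LOW p-value suggests the sample is NOT normal
_, p_norm = shapiro(v14)
if p_norm < 0.05:
    print("Skewed data -> prefer test_type='mannwhitney'")
else:
    stat, p = ttest_ind(v14, v04)
    print(f"t = {stat:.2f}, p = {p:.4f}")
```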
---
### 3. Chi-Square Test
**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons
```python
pairwise_df, meta = S.compute_pairwise_significance(
count_data,
test_type="chi2"
)
```
**Pros:**
- Designed for count/frequency data
- Tests if distributions differ
**Cons:**
- Needs sufficient sample sizes
- Less informative about direction of difference
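For intuition, a standalone chi-square comparison of two voices' pick counts might look like this with `scipy.stats.chi2_contingency` (hypothetical counts; whether the library uses this exact call is an assumption):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: picked vs not picked, per voice
observed = [[30, 10],   # V14
            [18, 22]]   # V04

# Tests whether the pick/no-pick distribution differs between voices
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```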
---
### 4. Two-Proportion Z-Test (For Rankings)
**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`
```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```
**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
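The two-proportion z-test itself is simple enough to write out; a self-contained sketch with made-up vote counts (the function and counts are illustrative, not the library's internals):

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: 40/120 respondents ranked V14 first vs 22/120 for V04
z, p = two_proportion_ztest(40, 120, 22, 120)
print(f"z = {z:.2f}, p = {p:.4f}")
```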
---
## Multiple Comparison Corrections
### Why Do We Need Corrections?
When you compare many groups, you're doing many tests. Each test has a 5% chance of a false positive (if α = 0.05). With 17 voices:
| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |
**Corrections adjust p-values to account for this.**
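The "~7 false positives" figure above follows directly from the arithmetic, and the chance of *at least one* false positive is even more striking:

```python
alpha = 0.05
n_pairs = 17 * 16 // 2                         # 136 pairwise comparisons among 17 voices
expected_false_positives = alpha * n_pairs     # ~6.8 false "significant" results
p_at_least_one = 1 - (1 - alpha) ** n_pairs    # P(at least one false positive)
print(n_pairs, expected_false_positives, round(p_at_least_one, 3))
```

With 136 uncorrected tests, a false positive somewhere is all but guaranteed.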
---
### Bonferroni Correction (Conservative)
**Formula:** `p_adjusted = min(p_value × number_of_comparisons, 1)`
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="bonferroni" # This is the default
)
```
**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)
**Trade-off:** May miss real differences (more false negatives)
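The adjustment is a one-liner; a sketch with made-up p-values (this is the textbook formula, not necessarily the library's exact implementation):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

adj = bonferroni([0.001, 0.02, 0.04])
print(adj)
```

Note how a raw p of 0.04, "significant" on its own, is no longer significant after correction.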
---
### Holm-Bonferroni Correction (Less Conservative)
**Formula:** Step-down procedure: sort p-values ascending and multiply the k-th smallest by `(m − k + 1)`; only the smallest p-value gets the full Bonferroni factor
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="holm"
)
```
**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly
**Trade-off:** Slightly higher false positive risk than Bonferroni
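A sketch of the step-down procedure with made-up p-values (textbook algorithm, not necessarily the library's exact code) shows where the extra power comes from:

```python
def holm(p_values):
    """Holm step-down adjustment; returns p-values in original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min((m - rank) * p_values[i], 1.0)  # smallest p gets factor m
        running_max = max(running_max, adj)       # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

h = holm([0.04, 0.001, 0.02])
print(h)
```

With these inputs Holm keeps 0.02 and 0.04 significant at alpha = 0.05, while plain Bonferroni (which would multiply every p-value by 3) would reject only the smallest.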
---
### No Correction
**Not recommended for final analysis**, but useful for exploration.
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="none"
)
```
**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate
---
### Correction Method Comparison
| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |
**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.
---
## Interpreting Results
### Key Output Columns
| Column | Meaning |
|--------|---------|
| `p_value` | Probability of seeing a difference this large if no true difference exists |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big is the difference (practical significance) |
### What the p-value Means
| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |
### Statistical vs Practical Significance
**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.
**Practical significance** (effect size) means the difference matters in the real world.
| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |
**Example:** A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
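Cohen's d is the mean difference divided by the pooled standard deviation; a self-contained sketch with made-up ratings (the `effect_size` column may be computed differently internally — this shows the standard formula):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Made-up ratings: a 1-point mean gap on a fairly tight scale
d = cohens_d([7.5, 8.0, 6.5, 9.0], [6.0, 7.5, 8.0, 5.5])
print(f"d = {d:.2f}")
```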
---
## Code Examples
### Example 1: Voice Scale Ratings
```python
import polars as pl

# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)
# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney", # Safe default for ratings
alpha=0.05,
correction="bonferroni"
)
# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")
# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(f"Found {sig_pairs.height} significant pairwise differences")
# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```
### Example 2: Top 3 Voice Rankings
```python
import polars as pl

# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)
# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
ranking_data,
alpha=0.05,
correction="holm" # Less conservative for many comparisons
)
# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")
# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")
# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```
### Example 3: Comparing Demographic Subgroups
```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)
S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)
# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```
---
## Common Mistakes to Avoid
### ❌ Using Aggregated Data
```python
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores) # Will fail!
```
### ✅ Use Raw Data
```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```
### ❌ Ignoring Multiple Comparisons
```python
# WRONG - ~5% of pairs will appear "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```
### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```
### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}") # Missing context!
```
### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```
---
## Quick Reference Card
| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |

| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |
---
## Further Reading
- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)