Statistical Significance Testing Guide

A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.


Table of Contents

  1. Quick Decision Flowchart
  2. Understanding Your Data Types
  3. Available Tests
  4. Multiple Comparison Corrections
  5. Interpreting Results
  6. Code Examples
  7. Common Mistakes to Avoid
  8. Quick Reference Card

Quick Decision Flowchart

What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│   │
│   └─► Use: compute_pairwise_significance()
│       │
│       ├─► Data normally distributed? → test_type="ttest"
│       └─► Not sure / skewed data?   → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
    │
    └─► Use: compute_ranking_significance()
        (automatically uses proportion z-test)

Understanding Your Data Types

Continuous Data

What it looks like: Numbers on a scale with many possible values.

Example                  Data Source
Voice ratings 1-10       get_voice_scale_1_10()
Speaking style scores    get_ss_green_blue()
Any averaged scores      Custom aggregations

shape: (5, 3)
┌───────────┬─────────────────┬─────────────────┐
│ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│
│ str       │ f64             │ f64             │
├───────────┼─────────────────┼─────────────────┤
│ R_001     │ 7.5             │ 6.0             │
│ R_002     │ 8.0             │ 7.5             │
│ R_003     │ 6.5             │ 8.0             │
│ …         │ …               │ …               │
└───────────┴─────────────────┴─────────────────┘

Ranking Data

What it looks like: Discrete ranks (1, 2, 3) or null if not ranked.

Example                 Data Source
Top 3 voice rankings    get_top_3_voices()
Character rankings      get_character_ranking()

shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Top_3__V14       │ Top_3__V04       │
│ str       │ i64              │ i64              │
├───────────┼──────────────────┼──────────────────┤
│ R_001     │ 1                │ null             │  ← V14 was ranked 1st
│ R_002     │ 2                │ 1                │  ← V04 was ranked 1st
│ R_003     │ null             │ 3                │  ← V04 was ranked 3rd
│ …         │ …                │ …                │
└───────────┴──────────────────┴──────────────────┘

⚠️ Aggregated Data (Cannot Test!)

What it looks like: Already summarized/totaled data.

shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │  ← ALREADY AGGREGATED
│ str       │ i64            │     Lost individual variance
├───────────┼────────────────┤     Cannot do significance tests!
│ V14       │ 209            │
│ V04       │ 180            │
│ …         │ …              │
└───────────┴────────────────┘

Solution: Go back to the raw data before aggregation.


Available Tests

1. Mann-Whitney U Test (Default for Continuous)

Use when: Comparing scores/ratings between groups
Assumes: Nothing about distribution shape (non-parametric)
Best for: Most survey data, Likert scales, ratings

pairwise_df, meta = S.compute_pairwise_significance(
    voice_data, 
    test_type="mannwhitney"  # This is the default
)

Pros:

  • Works with any distribution shape
  • Robust to outliers
  • Safe choice when unsure

Cons:

  • Slightly less powerful than t-test when data IS normally distributed
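
Under the hood, each pair of voices is compared roughly like this (a minimal sketch using scipy; the real implementation inside compute_pairwise_significance may differ, and the column names are the ones from the example frames above):

from scipy.stats import mannwhitneyu

# Ratings for two voices, with unanswered rows dropped
v14 = voice_data["Voice_Scale__V14"].drop_nulls().to_list()
v04 = voice_data["Voice_Scale__V04"].drop_nulls().to_list()

# Two-sided Mann-Whitney U: are the two rating distributions shifted?
u_stat, p_value = mannwhitneyu(v14, v04, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")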

2. Independent t-Test

Use when: Comparing means between groups
Assumes: Data is approximately normally distributed
Best for: Large samples (n > 30 per group), truly continuous data

pairwise_df, meta = S.compute_pairwise_significance(
    voice_data, 
    test_type="ttest"
)

Pros:

  • Most powerful when assumptions are met
  • Well-understood, commonly reported

Cons:

  • Can give misleading results if data is skewed
  • Sensitive to outliers
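
Not sure whether your data is normal enough? A quick Shapiro-Wilk check can guide the choice (a sketch using scipy; the column name is from the examples above):

from scipy.stats import shapiro

scores = voice_data["Voice_Scale__V14"].drop_nulls().to_list()

# Shapiro-Wilk: the null hypothesis is "the data are normally distributed"
stat, p = shapiro(scores)
test_type = "ttest" if p > 0.05 else "mannwhitney"
print(f"Shapiro-Wilk p = {p:.3f} -> use test_type='{test_type}'")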

3. Chi-Square Test

Use when: Comparing frequency distributions
Assumes: Expected counts ≥ 5 in each cell
Best for: Count data, categorical comparisons

pairwise_df, meta = S.compute_pairwise_significance(
    count_data, 
    test_type="chi2"
)

Pros:

  • Designed for count/frequency data
  • Tests if distributions differ

Cons:

  • Needs sufficient sample sizes
  • Less informative about direction of difference
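
The "expected counts ≥ 5" assumption is easy to verify yourself (a sketch using scipy; the count table here is made up for illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x3 table: Rank 1/2/3 counts for two voices
observed = np.array([[40, 25, 15],
                     [22, 30, 28]])

chi2, p, dof, expected = chi2_contingency(observed)
if (expected < 5).any():
    print("Warning: some expected counts < 5; chi-square may be unreliable")
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")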

4. Two-Proportion Z-Test (For Rankings)

Use when: Comparing ranking vote proportions
Automatically used by: compute_ranking_significance()

pairwise_df, meta = S.compute_ranking_significance(ranking_data)

What it tests: "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
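
For reference, the underlying two-proportion z-test boils down to a few lines (a hand-rolled sketch with made-up counts; compute_ranking_significance does this for you):

import math
from scipy.stats import norm

# Made-up counts: Rank 1 votes and total respondents for two voices
r1_a, n_a = 45, 120   # Voice A
r1_b, n_b = 28, 115   # Voice B

p_a, p_b = r1_a / n_a, r1_b / n_b
p_pool = (r1_a + r1_b) / (n_a + n_b)                         # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                         # two-sided
print(f"z = {z:.2f}, p = {p_value:.4f}")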


Multiple Comparison Corrections

Why Do We Need Corrections?

When you compare many groups, you're running many tests, and each test has a 5% chance of a false positive (at α = 0.05). With 17 voices there are 17 × 16 / 2 = 136 pairwise comparisons:

Comparisons    Expected False Positives (no correction)
136 pairs      ~7 false "significant" results!

Corrections adjust p-values to account for this.
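
You can check this arithmetic yourself (plain Python, no project code needed; the "at least one" formula assumes the tests are independent):

from math import comb

alpha, n_voices = 0.05, 17
n_pairs = comb(n_voices, 2)             # 136 comparisons
expected_fp = n_pairs * alpha           # ~6.8 expected false positives
fwer = 1 - (1 - alpha) ** n_pairs       # chance of at least one false positive
print(f"{n_pairs} pairs, ~{expected_fp:.1f} expected false positives")
print(f"P(at least one false positive) = {fwer:.3f}")   # ~0.999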


Bonferroni Correction (Conservative)

Formula: p_adjusted = p_value × number_of_comparisons (capped at 1.0)

pairwise_df, meta = S.compute_pairwise_significance(
    data, 
    correction="bonferroni"  # This is the default
)

Use when:

  • You want to be very confident about significant results
  • False positives are costly (publishing, major decisions)
  • You have few comparisons (< 20)

Trade-off: May miss real differences (more false negatives)
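
The adjustment itself is a one-liner (a sketch with made-up p-values):

# Bonferroni: multiply every raw p-value by the number of comparisons, cap at 1
raw_p = [0.001, 0.013, 0.040]
n_comparisons = 136
adjusted = [min(p * n_comparisons, 1.0) for p in raw_p]
print([round(p, 3) for p in adjusted])   # [0.136, 1.0, 1.0] -- none survive at alpha = 0.05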


Holm-Bonferroni Correction (Less Conservative)

Formula: A step-down procedure: sort the p-values from smallest to largest and multiply the i-th smallest by (m - i + 1), where m is the number of comparisons. Less strict than Bonferroni, but still controls the family-wise error rate.

pairwise_df, meta = S.compute_pairwise_significance(
    data, 
    correction="holm"
)

Use when:

  • You have many comparisons
  • You want better power to detect real differences
  • Exploratory analysis where missing a real effect is costly

Trade-off: Slightly higher false positive risk than Bonferroni
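
For the curious, the step-down logic looks like this (a minimal sketch with made-up p-values; real implementations also enforce that adjusted p-values never decrease, as done here):

# Holm: sort p-values ascending; the i-th smallest (0-indexed) is multiplied by (m - i)
raw_p = sorted([0.001, 0.013, 0.040])
m = len(raw_p)
adjusted, running_max = [], 0.0
for i, p in enumerate(raw_p):
    running_max = max(running_max, min(p * (m - i), 1.0))
    adjusted.append(running_max)
print([round(p, 3) for p in adjusted])   # [0.003, 0.026, 0.04] -- never larger than Bonferroni's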


No Correction

Not recommended for final analysis, but useful for exploration.

pairwise_df, meta = S.compute_pairwise_significance(
    data, 
    correction="none"
)

Use when:

  • Initial exploration only
  • You'll follow up with specific hypotheses
  • You understand and accept the inflated false positive rate

Correction Method Comparison

Method        Strictness     Best For                               Risk
Bonferroni    Most strict    Few comparisons, high stakes           Miss real effects
Holm          Moderate       Many comparisons, balanced approach    Slightly more false positives
None          No control     Exploration only                       Many false positives

Recommendation for Voice Branding: Use Holm for exploratory analysis, Bonferroni for final reporting.


Interpreting Results

Key Output Columns

Column         Meaning
p_value        Probability of seeing a difference this large if there were truly none
p_adjusted     Corrected p-value (use this for decisions!)
significant    TRUE if p_adjusted < alpha (usually 0.05)
effect_size    How big the difference is (practical significance)
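
A typical first look at the output (a sketch; pairwise_df is the frame returned by compute_pairwise_significance, and the column names are the ones listed above):

import polars as pl

# Keep only significant pairs and sort by adjusted p-value
top = pairwise_df.filter(pl.col("significant")).sort("p_adjusted")
print(top.select(["p_value", "p_adjusted", "effect_size"]))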

What the p-value Means

p-value        Interpretation
< 0.001        Very strong evidence of difference
< 0.01         Strong evidence
< 0.05         Moderate evidence (traditional threshold)
0.05 - 0.10    Weak evidence, "trending"
> 0.10         No significant evidence

Statistical vs Practical Significance

Statistical significance (p < 0.05) means the difference is unlikely due to chance.

Practical significance (effect size) means the difference matters in the real world.

Effect Size (Cohen's d)    Interpretation
< 0.2                      Negligible (may not matter practically)
0.2 - 0.5                  Small
0.5 - 0.8                  Medium
> 0.8                      Large

Example: A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
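
If you want to compute an effect size yourself, Cohen's d is the mean difference divided by the pooled standard deviation (a self-contained sketch with numpy and made-up ratings):

import numpy as np

def cohens_d(a, b):
    """Mean difference divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d([7.5, 8.0, 6.5, 7.0, 8.5], [6.0, 7.5, 8.0, 5.5, 6.5])
print(f"Cohen's d = {d:.2f}")   # ~0.87: a large effect by the table above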


Code Examples

Example 1: Voice Scale Ratings

import polars as pl  # used below to filter the results

# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)

# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # Safe default for ratings
    alpha=0.05,
    correction="bonferroni"
)

# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")

# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
    sig_pairs = pairwise_df.filter(pl.col('significant'))
    print(f"Found {sig_pairs.height} significant pairwise differences")

# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)

Example 2: Top 3 Voice Rankings

# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)

# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm"  # Less conservative for many comparisons
)

# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")

# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")

# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant'))
print(sig_pairs)

Example 3: Comparing Demographic Subgroups

# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)

S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)

# Test each group separately, then compare results qualitatively
# (For a direct statistical comparison between groups, see the sketch below)
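
One such design: compare the two subgroups' ratings of a single voice directly with a Mann-Whitney U test (a hedged sketch; early_pro_data and estab_pro_data come from the example above, and the column name follows the earlier examples):

from scipy.stats import mannwhitneyu

# One voice's ratings in each demographic subgroup
early = early_pro_data["Voice_Scale__V14"].drop_nulls().to_list()
estab = estab_pro_data["Voice_Scale__V14"].drop_nulls().to_list()

u_stat, p = mannwhitneyu(early, estab, alternative="two-sided")
print(f"Early vs Established on V14: U = {u_stat:.1f}, p = {p:.4f}")

If you repeat this for all 17 voices, remember that those 17 p-values need a correction too.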

Common Mistakes to Avoid

Using Aggregated Data

# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores)  # Will fail!

Use Raw Data

# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)

Ignoring Multiple Comparisons

# WRONG - ~5% of pairs (about 7 of 136) will look "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")

Apply Correction

# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")

Only Reporting p-values

# WRONG - statistical significance isn't everything
print(f"p = {p_value}")  # Missing context!

Report Effect Sizes Too

# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")

Quick Reference Card

Data Type                 Function                                           Default Test      Recommended Correction
Ratings (1-10)            compute_pairwise_significance()                    Mann-Whitney U    Bonferroni
Rankings (1st/2nd/3rd)    compute_ranking_significance()                     Proportion Z      Holm
Count frequencies         compute_pairwise_significance(test_type="chi2")   Chi-square        Bonferroni

Scenario                  Correction
Publishing results        Bonferroni
Client presentation       Bonferroni
Exploratory analysis      Holm
Quick internal check      Holm or None

Further Reading