Statistical Significance Testing Guide
A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.
Table of Contents
- Quick Decision Flowchart
- Understanding Your Data Types
- Available Tests
- Multiple Comparison Corrections
- Interpreting Results
- Code Examples
- Common Mistakes to Avoid
- Quick Reference Card
Quick Decision Flowchart
What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│   │
│   └─► Use: compute_pairwise_significance()
│       │
│       ├─► Data normally distributed? → test_type="ttest"
│       └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
    │
    └─► Use: compute_ranking_significance()
        (automatically uses proportion z-test)
Understanding Your Data Types
Continuous Data
What it looks like: Numbers on a scale with many possible values.
| Example | Data Source |
|---|---|
| Voice ratings 1-10 | get_voice_scale_1_10() |
| Speaking style scores | get_ss_green_blue() |
| Any averaged scores | Custom aggregations |
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Voice_Scale__V14 │ Voice_Scale__V04 │
│ str       │ f64              │ f64              │
├───────────┼──────────────────┼──────────────────┤
│ R_001     │ 7.5              │ 6.0              │
│ R_002     │ 8.0              │ 7.5              │
│ R_003     │ 6.5              │ 8.0              │
│ …         │ …                │ …                │
└───────────┴──────────────────┴──────────────────┘
Ranking Data
What it looks like: Discrete ranks (1, 2, 3) or null if not ranked.
| Example | Data Source |
|---|---|
| Top 3 voice rankings | get_top_3_voices() |
| Character rankings | get_character_ranking() |
shape: (5, 3)
┌───────────┬────────────┬────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str       │ i64        │ i64        │
├───────────┼────────────┼────────────┤
│ R_001     │ 1          │ null       │  ← V14 was ranked 1st
│ R_002     │ 2          │ 1          │  ← V04 was ranked 1st
│ R_003     │ null       │ 3          │  ← V04 was ranked 3rd
│ …         │ …          │ …          │
└───────────┴────────────┴────────────┘
⚠️ Aggregated Data (Cannot Test!)
What it looks like: Already summarized/totaled data.
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │  ← ALREADY AGGREGATED:
│ str       │ i64            │    individual variance is lost,
├───────────┼────────────────┤    so significance tests are impossible!
│ V14       │ 209            │
│ V04       │ 180            │
│ …         │ …              │
└───────────┴────────────────┘
Solution: Go back to the raw data before aggregation.
Available Tests
1. Mann-Whitney U Test (Default for Continuous)
Use when: Comparing scores/ratings between groups
Assumes: Nothing about distribution shape (non-parametric)
Best for: Most survey data, Likert scales, ratings
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney"  # This is the default
)
Pros:
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure
Cons:
- Slightly less powerful than t-test when data IS normally distributed
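To see what this test computes, here is a minimal standalone sketch using scipy.stats.mannwhitneyu; the two rating lists are made-up illustration data, not output from the library:

```python
# Minimal standalone Mann-Whitney U test with SciPy.
# The two rating lists are made-up illustration data.
from scipy.stats import mannwhitneyu

ratings_v14 = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5]
ratings_v04 = [6.0, 7.5, 8.0, 5.5, 6.5, 7.0]

# Two-sided test: "do the two voices' rating distributions differ?"
stat, p_value = mannwhitneyu(ratings_v14, ratings_v04, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```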
2. Independent t-Test
Use when: Comparing means between groups
Assumes: Data is approximately normally distributed
Best for: Large samples (n > 30 per group), truly continuous data
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="ttest"
)
Pros:
- Most powerful when assumptions are met
- Well-understood, commonly reported
Cons:
- Can give misleading results if data is skewed
- Sensitive to outliers
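If you are unsure whether the normality assumption holds, a quick Shapiro-Wilk check can guide the choice between test_type="ttest" and test_type="mannwhitney". This is a sketch on made-up ratings, not part of the library API:

```python
# Quick Shapiro-Wilk normality check on made-up ratings.
from scipy.stats import shapiro

ratings = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5, 6.0, 7.5, 8.0, 7.0]

stat, p_value = shapiro(ratings)
if p_value < 0.05:
    print("Normality rejected -> prefer test_type='mannwhitney'")
else:
    print("No evidence against normality -> test_type='ttest' is reasonable")
```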
3. Chi-Square Test
Use when: Comparing frequency distributions
Assumes: Expected counts ≥ 5 in each cell
Best for: Count data, categorical comparisons
pairwise_df, meta = S.compute_pairwise_significance(
    count_data,
    test_type="chi2"
)
Pros:
- Designed for count/frequency data
- Tests if distributions differ
Cons:
- Needs sufficient sample sizes
- Less informative about direction of difference
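For intuition, here is a minimal chi-square sketch with scipy.stats.chi2_contingency on made-up counts, including a check of the expected-count assumption mentioned above:

```python
# Minimal chi-square sketch with SciPy; the counts are made up.
from scipy.stats import chi2_contingency

observed = [
    [30, 45, 25],  # e.g. V14: counts per response category
    [20, 40, 40],  # e.g. V04
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# Verify the expected-count assumption (every cell >= 5):
print("All expected counts >= 5:", bool((expected >= 5).all()))
```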
4. Two-Proportion Z-Test (For Rankings)
Use when: Comparing ranking vote proportions
Automatically used by: compute_ranking_significance()
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
What it tests: "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
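For intuition, here is the pooled two-proportion z-test written out by hand; the vote counts are made-up illustration data, and compute_ranking_significance() does all of this for you:

```python
# Pooled two-proportion z-test written out by hand; counts are made up.
from math import sqrt
from scipy.stats import norm

rank1_a, n_a = 40, 120   # Voice A: Rank-1 votes out of 120 respondents
rank1_b, n_b = 22, 120   # Voice B

p_a, p_b = rank1_a / n_a, rank1_b / n_b
p_pool = (rank1_a + rank1_b) / (n_a + n_b)               # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))                            # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")
```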
Multiple Comparison Corrections
Why Do We Need Corrections?
When you compare many groups, you're doing many tests. Each test has a 5% chance of a false positive (if α = 0.05). With 17 voices:
| Comparisons | Expected False Positives (no correction) |
|---|---|
| 136 pairs | ~7 false "significant" results! |
Corrections adjust p-values to account for this.
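The arithmetic behind the table above, as a short sketch:

```python
# Where the "~7 false positives" figure comes from.
from math import comb

n_pairs = comb(17, 2)    # 17 voices -> 136 pairwise comparisons
alpha = 0.05
print(n_pairs * alpha)   # ~6.8 expected false positives with no correction
```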
Bonferroni Correction (Conservative)
Formula: p_adjusted = min(p_value × number_of_comparisons, 1)
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="bonferroni"  # This is the default
)
Use when:
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)
Trade-off: May miss real differences (more false negatives)
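The adjustment itself is simple enough to write out; a sketch on made-up p-values:

```python
# Bonferroni adjustment on made-up p-values: multiply by the number
# of comparisons, then cap at 1.0 (a probability can't exceed 1).
p_values = [0.001, 0.020, 0.040, 0.300]
m = len(p_values)
p_adjusted = [min(p * m, 1.0) for p in p_values]
print(p_adjusted)  # [0.004, 0.08, 0.16, 1.0]
```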
Holm-Bonferroni Correction (Less Conservative)
Formula: sort the p-values; the smallest is multiplied by m (the number of comparisons), the next by m - 1, and so on. A step-down procedure that's less strict than Bonferroni.
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="holm"
)
Use when:
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly
Trade-off: Slightly higher false positive risk than Bonferroni
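If you want to see Holm-adjusted p-values outside the library, statsmodels implements the same procedure; a sketch on the same made-up p-values as above:

```python
# Holm step-down adjustment via statsmodels, on made-up p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.020, 0.040, 0.300]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(p_adjusted.round(4))  # [0.004 0.06 0.08 0.3] -- milder than Bonferroni
print(reject)               # [ True False False False]
```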
No Correction
Not recommended for final analysis, but useful for exploration.
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="none"
)
Use when:
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate
Correction Method Comparison
| Method | Strictness | Best For | Risk |
|---|---|---|---|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |
Recommendation for Voice Branding: Use Holm for exploratory analysis, Bonferroni for final reporting.
Interpreting Results
Key Output Columns
| Column | Meaning |
|---|---|
| p_value | Raw probability this difference happened by chance |
| p_adjusted | Corrected p-value (use this for decisions!) |
| significant | TRUE if p_adjusted < alpha (usually 0.05) |
| effect_size | How big is the difference (practical significance) |
What the p-value Means
| p-value | Interpretation |
|---|---|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |
Statistical vs Practical Significance
Statistical significance (p < 0.05) means the difference is unlikely due to chance.
Practical significance (effect size) means the difference matters in the real world.
| Effect Size (Cohen's d) | Interpretation |
|---|---|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |
Example: A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
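If you ever need to compute an effect size yourself, here is a minimal Cohen's d sketch using a pooled standard deviation, on made-up groups; note that the library's effect_size column may be computed with a different estimator:

```python
# Cohen's d with a pooled standard deviation; groups are made up.
import numpy as np

group_a = np.array([7.5, 8.0, 6.5, 9.0, 7.0, 8.5])
group_b = np.array([6.0, 7.5, 8.0, 5.5, 6.5, 7.0])

n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
d = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)
print(f"Cohen's d = {d:.2f}")
```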
Code Examples
Example 1: Voice Scale Ratings
import polars as pl

# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)

# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # Safe default for ratings
    alpha=0.05,
    correction="bonferroni"
)

# Check the overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")

# If the overall test is significant, look at the pairwise results
if meta['overall_p_value'] < 0.05:
    sig_pairs = pairwise_df.filter(pl.col('significant'))
    print(f"Found {sig_pairs.height} significant pairwise differences")

# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
Example 2: Top 3 Voice Rankings
import polars as pl

# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)

# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm"  # Less conservative for many comparisons
)

# Check the chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")

# View the contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")

# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant'))
print(sig_pairs)
Example 3: Comparing Demographic Subgroups
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)

S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)

# Test each group separately, then compare the results qualitatively.
# (For a direct group comparison you'd need a different test design;
# see the sketch below.)
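If you do want a direct between-group comparison for a single voice, one option is to pull that voice's column from each filtered frame and run a two-sample test yourself. A sketch, reusing the two frames from the example above and assuming they are polars DataFrames containing the Voice_Scale__V14 column shown earlier:

```python
# Direct between-group test for one voice's ratings, assuming the
# filtered frames are polars DataFrames with a Voice_Scale__V14 column.
from scipy.stats import mannwhitneyu

early = early_pro_data["Voice_Scale__V14"].drop_nulls().to_list()
estab = estab_pro_data["Voice_Scale__V14"].drop_nulls().to_list()

stat, p_value = mannwhitneyu(early, estab, alternative="two-sided")
print(f"Early vs Established on V14: p = {p_value:.4f}")
```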
Common Mistakes to Avoid
❌ Using Aggregated Data
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores) # Will fail!
✅ Use Raw Data
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
❌ Ignoring Multiple Comparisons
# WRONG - at α = 0.05, ~5% of pairs (about 7 of 136) come out "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
✅ Apply Correction
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
❌ Only Reporting p-values
# WRONG - statistical significance isn't everything
print(f"p = {p_value}") # Missing context!
✅ Report Effect Sizes Too
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
Quick Reference Card
| Data Type | Function | Default Test | Recommended Correction |
|---|---|---|---|
| Ratings (1-10) | compute_pairwise_significance() | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | compute_ranking_significance() | Proportion Z | Holm |
| Count frequencies | compute_pairwise_significance(test_type="chi2") | Chi-square | Bonferroni |
| Scenario | Correction |
|---|---|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |