# Statistical Significance Testing Guide

A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.

---

## Table of Contents

1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)

---

## Quick Decision Flowchart

```
What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│     │
│     └─► Use: compute_pairwise_significance()
│           │
│           ├─► Data normally distributed? → test_type="ttest"
│           └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
      │
      └─► Use: compute_ranking_significance()
            (automatically uses proportion z-test)
```

---

## Understanding Your Data Types

### Continuous Data

**What it looks like:** Numbers on a scale with many possible values.

| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |

```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Voice_Scale__V14 │ Voice_Scale__V04 │
│ str       │ f64              │ f64              │
├───────────┼──────────────────┼──────────────────┤
│ R_001     │ 7.5              │ 6.0              │
│ R_002     │ 8.0              │ 7.5              │
│ R_003     │ 6.5              │ 8.0              │
└───────────┴──────────────────┴──────────────────┘
```

### Ranking Data

**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.

| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |

```
shape: (5, 3)
┌───────────┬────────────┬────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str       │ i64        │ i64        │
├───────────┼────────────┼────────────┤
│ R_001     │ 1          │ null       │  ← V14 was ranked 1st
│ R_002     │ 2          │ 1          │  ← V04 was ranked 1st
│ R_003     │ null       │ 3          │  ← V04 was ranked 3rd
└───────────┴────────────┴────────────┘
```

### ⚠️ Aggregated Data (Cannot Test!)

**What it looks like:** Already summarized/totaled data.

```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │  ← ALREADY AGGREGATED:
│ str       │ i64            │    individual variance is lost,
├───────────┼────────────────┤    so significance tests are impossible
│ V14       │ 209            │
│ V04       │ 180            │
└───────────┴────────────────┘
```

**Solution:** Go back to the raw data before aggregation.

---

## Available Tests

### 1. Mann-Whitney U Test (Default for Continuous)

**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings

```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney"  # This is the default
)
```

**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure

**Cons:**
- Slightly less powerful than the t-test when data IS normally distributed

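For intuition, here is roughly what a single pairwise comparison involves under the hood, sketched with `scipy.stats.mannwhitneyu` on two hypothetical rating samples. The wrapper runs this for every voice pair, handles nulls, and applies the correction for you, so its internals may differ.

```python
# One Mann-Whitney U comparison with scipy directly (hypothetical data).
from scipy.stats import mannwhitneyu

v14_ratings = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5]  # hypothetical 1-10 ratings
v04_ratings = [6.0, 7.5, 8.0, 5.5, 6.5, 6.0]

stat, p_value = mannwhitneyu(v14_ratings, v04_ratings, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```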
---

### 2. Independent t-Test

**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data

```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="ttest"
)
```

**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported

**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers

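Not sure whether your data is normal enough for `test_type="ttest"`? A quick Shapiro-Wilk check with scipy can flag clearly skewed samples before you choose. This check is illustrative and not part of the wrapper; the data is hypothetical.

```python
# Check normality per group, then run the t-test (hypothetical ratings).
from scipy.stats import shapiro, ttest_ind

v14 = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5, 7.2, 8.1]
v04 = [6.0, 7.5, 8.0, 5.5, 6.5, 6.0, 7.1, 6.8]

for name, sample in [("V14", v14), ("V04", v04)]:
    _, p_norm = shapiro(sample)  # low p -> evidence of non-normality
    verdict = "looks normal" if p_norm > 0.05 else "skewed -> prefer mannwhitney"
    print(f"{name}: Shapiro-Wilk p = {p_norm:.3f} ({verdict})")

t_stat, p_t = ttest_ind(v14, v04)
print(f"t = {t_stat:.2f}, p = {p_t:.4f}")
```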
---

### 3. Chi-Square Test

**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons

```python
pairwise_df, meta = S.compute_pairwise_significance(
    count_data,
    test_type="chi2"
)
```

**Pros:**
- Designed for count/frequency data
- Tests if distributions differ

**Cons:**
- Needs sufficient sample sizes
- Less informative about the direction of a difference

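The expected-count assumption is easy to verify yourself: `scipy.stats.chi2_contingency` returns the expected counts alongside the p-value. A minimal sketch on a hypothetical contingency table:

```python
# Hypothetical contingency table: one row per voice, columns are counts.
from scipy.stats import chi2_contingency

observed = [[40, 25, 15],
            [22, 30, 28]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Verify the "expected counts >= 5" assumption before trusting the result.
print("All expected counts >= 5:", bool((expected >= 5).all()))
```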
---

### 4. Two-Proportion Z-Test (For Rankings)

**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`

```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```

**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"

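Written out by hand, the two-proportion z-test pools both groups to estimate a common proportion, then measures how far apart the two observed proportions are in standard-error units. The counts below are hypothetical; the wrapper computes this for every voice pair.

```python
# Hypothetical counts: rank-1 votes out of all respondents for two voices.
import math

x_a, n_a = 34, 120   # Voice A ranked 1st by 34 of 120 respondents
x_b, n_b = 18, 120   # Voice B ranked 1st by 18 of 120

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)          # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
# two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```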
---

## Multiple Comparison Corrections

### Why Do We Need Corrections?

When you compare many groups, you're running many tests, and each test has a 5% chance of a false positive (at α = 0.05) when no real difference exists. With 17 voices:

| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |

**Corrections adjust p-values to account for this.**

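The arithmetic behind that table is short enough to check directly: 17 voices give 17 choose 2 = 136 pairs, and if none of the voices truly differ, uncorrected testing still flags about 5% of them.

```python
# Expected false positives across 136 uncorrected pairwise tests.
n_voices = 17
n_pairs = n_voices * (n_voices - 1) // 2          # 136 comparisons
alpha = 0.05

expected_false_positives = n_pairs * alpha        # ~6.8 if no voice truly differs
prob_at_least_one = 1 - (1 - alpha) ** n_pairs    # near-certainty

print(f"{n_pairs} pairs -> ~{expected_false_positives:.1f} false positives expected")
print(f"Chance of at least one false positive: {prob_at_least_one:.1%}")
```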
---

### Bonferroni Correction (Conservative)

**Formula:** `p_adjusted = p_value × number_of_comparisons` (capped at 1.0)

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="bonferroni"  # This is the default
)
```

**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)

**Trade-off:** May miss real differences (more false negatives)

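The formula is simple enough to apply by hand, which makes the wrapper's output easy to sanity-check. A sketch on four hypothetical raw p-values:

```python
# Bonferroni by hand: multiply each raw p-value by the number of
# comparisons and cap at 1.0 (hypothetical p-values).
raw_p = [0.001, 0.02, 0.04, 0.30]
m = len(raw_p)

p_adjusted = [min(p * m, 1.0) for p in raw_p]
significant = [p < 0.05 for p in p_adjusted]

print(p_adjusted)   # multiplied by m = 4, capped at 1.0
print(significant)  # only the smallest raw p-value survives correction
```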
---

### Holm-Bonferroni Correction (Less Conservative)

**Formula:** A step-down procedure: the smallest p-value is compared against α/m, the next smallest against α/(m − 1), and so on, which makes it less strict than plain Bonferroni

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="holm"
)
```

**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly

**Trade-off:** Harder to explain than Bonferroni's simple multiplication (it actually controls the same family-wise error rate, just with more power)

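The step-down logic can also be written out by hand. The sketch below uses the same hypothetical p-values as the Bonferroni example so the two methods can be compared; a running maximum keeps the adjusted values monotone, matching the standard Holm adjustment (statsmodels' `multipletests(..., method="holm")` implements the same idea).

```python
# Holm step-down by hand (hypothetical p-values).
raw_p = [0.001, 0.02, 0.04, 0.30]
m = len(raw_p)

order = sorted(range(m), key=lambda i: raw_p[i])  # indices, smallest p first
adjusted = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    # the i-th smallest p-value is multiplied by (m - rank), not by m
    running_max = max(running_max, raw_p[i] * (m - rank))
    adjusted[i] = min(running_max, 1.0)

print(adjusted)  # never larger than the Bonferroni-adjusted values
```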
---

### No Correction

**Not recommended for final analysis**, but useful for exploration.

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="none"
)
```

**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate

---

### Correction Method Comparison

| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Harder to explain than Bonferroni |
| None | No control | Exploration only | Many false positives |

**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.

---

## Interpreting Results

### Key Output Columns

| Column | Meaning |
|--------|---------|
| `p_value` | Probability of seeing a difference at least this large by chance, if no real difference exists |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big the difference is (practical significance) |

### What the p-value Means

| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of a difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No meaningful evidence of a difference |

### Statistical vs Practical Significance

**Statistical significance** (p < 0.05) means the difference is unlikely to be due to chance.

**Practical significance** (effect size) means the difference matters in the real world.

| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |

**Example:** A p-value of 0.001 with an effect size of 0.1 means "we're confident there's a difference, but it's tiny."

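Cohen's d is just the mean difference divided by the pooled standard deviation, so you can compute it from raw scores to sanity-check the `effect_size` column. A sketch on hypothetical ratings:

```python
# Cohen's d by hand (hypothetical ratings for two voices).
import statistics

v14 = [7.5, 8.0, 6.5, 9.0, 7.0, 8.5]
v04 = [6.0, 7.5, 8.0, 5.5, 6.5, 6.0]

mean_diff = statistics.mean(v14) - statistics.mean(v04)
s1, s2 = statistics.stdev(v14), statistics.stdev(v04)
n1, n2 = len(v14), len(v04)
pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

d = mean_diff / pooled_sd
print(f"Cohen's d = {d:.2f}")  # > 0.8, a large effect by the table above
```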
---

## Code Examples

### Example 1: Voice Scale Ratings

```python
import polars as pl

# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)

# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # Safe default for ratings
    alpha=0.05,
    correction="bonferroni"
)

# Check the overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")

# If the overall test is significant, look at the pairwise results
if meta['overall_p_value'] < 0.05:
    sig_pairs = pairwise_df.filter(pl.col('significant') == True)
    print(f"Found {sig_pairs.height} significant pairwise differences")

# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```

### Example 2: Top 3 Voice Rankings

```python
import polars as pl

# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)

# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm"  # Less conservative for many comparisons
)

# Check the chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")

# View the contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")

# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```

### Example 3: Comparing Demographic Subgroups

```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)

S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)

# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```

---

## Common Mistakes to Avoid

### ❌ Using Aggregated Data
```python
# WRONG - already summarized, individual variance is lost
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores)  # Will fail!
```

### ✅ Use Raw Data
```python
# RIGHT - use the raw data from before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```

### ❌ Ignoring Multiple Comparisons
```python
# WRONG - ~5% of pairs (about 7 of 136) will be "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```

### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```

### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}")  # Missing context!
```

### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```

---

## Quick Reference Card

| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |

| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |

---

## Further Reading

- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)