# Statistical Significance Testing Guide A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis. --- ## Table of Contents 1. [Quick Decision Flowchart](#quick-decision-flowchart) 2. [Understanding Your Data Types](#understanding-your-data-types) 3. [Available Tests](#available-tests) 4. [Multiple Comparison Corrections](#multiple-comparison-corrections) 5. [Interpreting Results](#interpreting-results) 6. [Code Examples](#code-examples) --- ## Quick Decision Flowchart ``` What kind of data do you have? │ ├─► Continuous scores (1-10 ratings, averages) │ │ │ └─► Use: compute_pairwise_significance() │ │ │ ├─► Data normally distributed? → test_type="ttest" │ └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice) │ └─► Ranking data (1st, 2nd, 3rd place votes) │ └─► Use: compute_ranking_significance() (automatically uses proportion z-test) ``` --- ## Understanding Your Data Types ### Continuous Data **What it looks like:** Numbers on a scale with many possible values. | Example | Data Source | |---------|-------------| | Voice ratings 1-10 | `get_voice_scale_1_10()` | | Speaking style scores | `get_ss_green_blue()` | | Any averaged scores | Custom aggregations | ``` shape: (5, 3) ┌───────────┬─────────────────┬─────────────────┐ │ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│ │ str │ f64 │ f64 │ ├───────────┼─────────────────┼─────────────────┤ │ R_001 │ 7.5 │ 6.0 │ │ R_002 │ 8.0 │ 7.5 │ │ R_003 │ 6.5 │ 8.0 │ ``` ### Ranking Data **What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked. 
| Example | Data Source | |---------|-------------| | Top 3 voice rankings | `get_top_3_voices()` | | Character rankings | `get_character_ranking()` | ``` shape: (5, 3) ┌───────────┬──────────────────┬──────────────────┐ │ _recordId │ Top_3__V14 │ Top_3__V04 │ │ str │ i64 │ i64 │ ├───────────┼──────────────────┼──────────────────┤ │ R_001 │ 1 │ null │ ← V14 was ranked 1st │ R_002 │ 2 │ 1 │ ← V04 was ranked 1st │ R_003 │ null │ 3 │ ← V04 was ranked 3rd ``` ### ⚠️ Aggregated Data (Cannot Test!) **What it looks like:** Already summarized/totaled data. ``` shape: (3, 2) ┌───────────┬────────────────┐ │ Character │ Weighted Score │ ← ALREADY AGGREGATED │ str │ i64 │ Lost individual variance ├───────────┼────────────────┤ Cannot do significance tests! │ V14 │ 209 │ │ V04 │ 180 │ ``` **Solution:** Go back to the raw data before aggregation. --- ## Available Tests ### 1. Mann-Whitney U Test (Default for Continuous) **Use when:** Comparing scores/ratings between groups **Assumes:** Nothing about distribution shape (non-parametric) **Best for:** Most survey data, Likert scales, ratings ```python pairwise_df, meta = S.compute_pairwise_significance( voice_data, test_type="mannwhitney" # This is the default ) ``` **Pros:** - Works with any distribution shape - Robust to outliers - Safe choice when unsure **Cons:** - Slightly less powerful than t-test when data IS normally distributed --- ### 2. Independent t-Test **Use when:** Comparing means between groups **Assumes:** Data is approximately normally distributed **Best for:** Large samples (n > 30 per group), truly continuous data ```python pairwise_df, meta = S.compute_pairwise_significance( voice_data, test_type="ttest" ) ``` **Pros:** - Most powerful when assumptions are met - Well-understood, commonly reported **Cons:** - Can give misleading results if data is skewed - Sensitive to outliers --- ### 3. 
Chi-Square Test

**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons

```python
pairwise_df, meta = S.compute_pairwise_significance(
    count_data,
    test_type="chi2"
)
```

**Pros:**
- Designed for count/frequency data
- Tests if distributions differ

**Cons:**
- Needs sufficient sample sizes
- Less informative about direction of difference

---

### 4. Two-Proportion Z-Test (For Rankings)

**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`

```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```

**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"

---

## Multiple Comparison Corrections

### Why Do We Need Corrections?

When you compare many groups, you're running many tests. At α = 0.05, each test has a 5% chance of a false positive whenever no real difference exists. With 17 voices:

| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results!
|

**Corrections adjust p-values to account for this.**

---

### Bonferroni Correction (Conservative)

**Formula:** `p_adjusted = p_value × number_of_comparisons` (capped at 1)

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="bonferroni"  # This is the default
)
```

**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)

**Trade-off:** May miss real differences (more false negatives)

---

### Holm-Bonferroni Correction (Less Conservative)

**Formula:** Step-down procedure that's less strict than Bonferroni

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="holm"
)
```

**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly

**Trade-off:** Slightly higher false positive risk than Bonferroni

---

### No Correction

**Not recommended for final analysis**, but useful for exploration.

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="none"
)
```

**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate

---

### Correction Method Comparison

| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |

**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.

---

## Interpreting Results

### Key Output Columns

| Column | Meaning |
|--------|---------|
| `p_value` | Probability of seeing a difference this large if no real difference exists |
| `p_adjusted` | Corrected p-value (use this for decisions!)
|
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big is the difference (practical significance) |

### What the p-value Means

| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |

### Statistical vs Practical Significance

**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.

**Practical significance** (effect size) means the difference matters in the real world.

| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |

**Example:** A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."

---

## Code Examples

### Example 1: Voice Scale Ratings

```python
import polars as pl

# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)

# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney",  # Safe default for ratings
    alpha=0.05,
    correction="bonferroni"
)

# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")

# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
    sig_pairs = pairwise_df.filter(pl.col('significant') == True)
    print(f"Found {sig_pairs.height} significant pairwise differences")

# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```

### Example 2: Top 3 Voice Rankings

```python
import polars as pl

# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)

# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm"  # Less conservative for many comparisons
)

# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")

# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")

# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```

### Example 3: Comparing Demographic Subgroups

```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)

S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)

# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```

---

## Common Mistakes to Avoid

### ❌ Using Aggregated Data

```python
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores)  # Will fail!
```

### ✅ Use Raw Data

```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```

### ❌ Ignoring Multiple Comparisons

```python
# WRONG - ~5% of pairs will be "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```

### ✅ Apply Correction

```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```

### ❌ Only Reporting p-values

```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}")  # Missing context!
``` ### ✅ Report Effect Sizes Too ```python # RIGHT - include practical significance print(f"p = {p_value}, effect size = {effect_size}") print(f"Mean difference: {mean1 - mean2:.2f} points") ``` --- ## Quick Reference Card | Data Type | Function | Default Test | Recommended Correction | |-----------|----------|--------------|------------------------| | Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni | | Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm | | Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni | | Scenario | Correction | |----------|------------| | Publishing results | Bonferroni | | Client presentation | Bonferroni | | Exploratory analysis | Holm | | Quick internal check | Holm or None | --- ## Further Reading - [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/) - [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/) - [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)
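---

## Appendix: The Corrections by Hand

To make the Bonferroni and Holm formulas concrete, here is a minimal, dependency-free Python sketch of the arithmetic. This is illustration only: `compute_pairwise_significance()` applies the chosen correction for you, so you never need to do this manually. The sample p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]

def holm(pvals):
    """Holm-Bonferroni step-down: sort p-values ascending, multiply the
    i-th smallest by (m - i), then enforce that adjusted values never
    decrease as the raw p-values increase."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(pvals[idx] * (m - rank), 1.0))
        adjusted[idx] = running_max
    return adjusted

raw = [0.001, 0.015, 0.03, 0.2]  # four hypothetical pairwise p-values
print(bonferroni(raw))  # Bonferroni leaves 1 pair significant at alpha = 0.05
print(holm(raw))        # Holm leaves 2, thanks to its step-down procedure
```

Note how Holm rescues the second comparison (0.015 × 3 = 0.045 < 0.05) that Bonferroni discards (0.015 × 4 = 0.06 > 0.05). That is exactly the extra power the correction comparison table above refers to.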