Compare commits

...

52 Commits

SHA1 Message Date
03a716e8ec correlation matrix speech characteristics vs score 2026-02-10 16:50:47 +01:00
8720bb670d started speech data notebook 2026-02-10 14:58:13 +01:00
9dfab75925 missing data analysis 2026-02-10 14:24:26 +01:00
14e28cf368 stat significance nr times ranked 1st 2026-02-09 18:37:41 +01:00
8e181e193a SL filter 2026-02-09 17:57:04 +01:00
6c16993cb3 straight-liner plot analysis 2026-02-09 17:26:45 +01:00
92c6fc03ab docs datasets 2026-02-09 13:17:59 +01:00
7fb6570190 statistical significance 2026-02-05 19:49:19 +01:00
840bd2940d other top bc's 2026-02-05 11:50:00 +01:00
af9a15ccb0 renamed notebooks and added significance test 2026-02-05 10:14:53 +01:00
a3cf9f103d update plots with final data release 2026-02-04 21:15:03 +01:00
f0eab32c34 update alt-text with full filepaths 2026-02-04 17:48:48 +01:00
d231fc02db fix missing filter descr in correlation plots 2026-02-04 14:48:14 +01:00
fc76bb0ab5 voice gender split correlation plots 2026-02-04 13:44:51 +01:00
ab78276a97 male/female voices in separate plots for correlations 2026-02-04 12:35:24 +01:00
e17646eb70 correlation plots for best bc 2026-02-04 10:46:31 +01:00
ad1d8c6e58 all plots offline update 2026-02-03 22:38:15 +01:00
f5b4c247b8 tidy plots 2026-02-03 22:12:17 +01:00
a35670aa72 fixed missing ai_user category 2026-02-03 21:13:29 +01:00
36280a6ff8 fix sample size 2026-02-03 20:48:34 +01:00
9a587dcc4c add ai-user filter combinations 2026-02-03 19:46:07 +01:00
9a49d1c690 added sample size to filter text 2026-02-03 19:16:39 +01:00
8f505da550 offline update 18-30 2026-02-03 18:43:20 +01:00
495b56307c fixed filter to none 2026-02-03 18:19:06 +01:00
1e76a82f24 fix wordcloud filter values 2026-02-03 17:41:12 +01:00
01b7d50637 fixed empty plots, updated filters 2026-02-03 16:51:24 +01:00
dca9ac11ba supposed wordcloud fix, but everything broke 2026-02-03 15:36:35 +01:00
081fb0dd6e added 6 more filters 2026-02-03 15:20:01 +01:00
2817ed240a automatic generation of all plots with all combinations 2026-02-03 15:03:57 +01:00
e44251c3d6 fixed consumer and ethnicity filter combinations 2026-02-03 14:43:03 +01:00
8dd41dfc96 Start automation of running filter combinations 2026-02-03 14:33:09 +01:00
840cb4e6dc exported marimo to script form 2026-02-03 13:48:05 +01:00
a162701e94 move cell for better running 2026-02-03 02:22:06 +01:00
38f6d8a87c fixed for basic plots, filter active 2026-02-03 02:19:47 +01:00
5c39bbb23a images tagged 2026-02-03 02:05:29 +01:00
190e4fbdc4 finished correlation plots and generic voice plots 2026-02-03 01:59:26 +01:00
2408d06098 base correlations 2026-02-03 01:32:06 +01:00
1dce4db909 voice gender plots done 2026-02-03 01:03:29 +01:00
acf9c45844 male/female colored plots 2026-02-03 00:40:51 +01:00
77fdd6e8f6 fixed voices plot order 2026-02-03 00:20:56 +01:00
426495ebe3 generic voice plots 2026-02-03 00:15:10 +01:00
a7ee854ed0 voice plots 2026-02-03 00:12:18 +01:00
97c4b07208 added filter disabled broken cells and starting spoken voice generic results 2026-02-02 23:32:10 +01:00
fd14038253 comment out 'per subgroup' since these just take too long to create 2026-02-02 23:22:09 +01:00
611fc8d19a added var split_group 2026-02-02 23:15:05 +01:00
3ac330263f BC results per consumer 2026-02-02 23:04:40 +01:00
bda4d54231 split consumer groups best character 2026-02-02 22:05:56 +01:00
f2c659c266 statistical tests 2026-02-02 21:47:37 +01:00
29df6a4bd9 og traits 2026-02-02 18:37:45 +01:00
a62524c6e4 update plot agent with explicit things not to do 2026-02-02 18:26:23 +01:00
43b41a01f5 plot creator agent 2026-02-02 17:57:19 +01:00
b7cf6adfb8 fix ppt update images 2026-02-02 17:36:32 +01:00
24 changed files with 10647 additions and 667 deletions

.github/agents/plot-creator.agent.md (new file)

@@ -0,0 +1,216 @@
# Plot Creator Agent
You are a specialized agent for creating data visualizations for the Voice Branding Qualtrics survey analysis project.
## ⚠️ Critical Data Handling Rules
1. **NEVER assume or load datasets without explicit user consent** - This is confidential data
2. **NEVER guess file paths or dataset locations**
3. **DO NOT assume data comes from a `Survey.get_*()` method** - Data may have been manually manipulated in a notebook
4. **Use ONLY the data snippet provided by the user** for understanding structure and testing
## Your Workflow
When the user provides a plotting request (e.g., "I need a bar plot that shows the frequency of the times each trait is chosen per brand character"), follow this workflow:
### Step 1: Understand the Request
- Parse the user's natural language request to identify:
- **Chart type** (bar, stacked bar, line, heatmap, etc.)
- **X-axis variable**
- **Y-axis variable / aggregation** (count, mean, sum, etc.)
- **Grouping / color encoding** (if any)
- **Filtering requirements** (if any)
- Think critically about whether the requested plot is feasible with the available data.
- Think critically about the best way to visualize the requested information and whether the requested chart type is appropriate. If not, propose alternatives and ask the user for confirmation before proceeding.
### Step 2: Analyze Provided Data
The user will paste a `df.head()` output. Examine:
- Column names and their meaning (refer to column naming conventions in `.github/copilot-instructions.md`)
- Data types
- Whether the data is in the right shape for the desired plot
**Important:** Do NOT make assumptions about where this data came from. It may be:
- Output from a `Survey.get_*()` method
- Manually transformed in a notebook
- A join of multiple data sources
- Any other custom manipulation
### Step 3: Determine Data Manipulation Needs
Decide if the provided data can be plotted directly, or if transformations are needed:
- **No manipulation**: Data is ready → proceed to Step 5
- **Manipulation needed**: Aggregation, pivoting, melting, filtering, or new computed columns required → proceed to Step 4
### Step 4: Create Data Manipulation Function (if needed)
Check if an existing `transform_<descriptive_name>` function exists in `utils.py` that performs the needed data manipulation. If not, create a dedicated method in the `QualtricsSurvey` class (`utils.py`):
```python
def transform_<descriptive_name>(self, df: pl.LazyFrame | pl.DataFrame) -> tuple[pl.LazyFrame, dict | None]:
"""Transform <input_description> to <output_description>.
Original use-case: "<paste user's original question here>"
This function <concise 1-2 sentence explanation of what it does>.
Args:
df: Pre-fetched data as a Polars LazyFrame or DataFrame.
Returns:
tuple: (LazyFrame with columns [...], Optional metadata dict)
"""
# Implementation - transform the INPUT data only
# NEVER call self.get_*() methods here
return result, metadata
```
**Requirements:**
- **NEVER retrieve data inside transform functions** - The function receives already-fetched data as input
- Data retrieval (`get_*()` calls) stays in the notebook so analysts can see all steps
- Method must return `(pl.LazyFrame, Optional[dict])` tuple
- Docstring MUST contain the original question verbatim
- Follow the existing patterns of `QualtricsSurvey` class methods in `utils.py`
**❌ BAD Example (do NOT do this):**
```python
def transform_character_trait_frequency(self, q: pl.LazyFrame):
# BAD: Fetching data inside transform function
char_df, _ = self.get_character_refine(q) # ← WRONG!
# ... rest of transform
```
**✅ GOOD Example:**
```python
def transform_character_trait_frequency(self, char_df: pl.LazyFrame | pl.DataFrame):
# GOOD: Receives pre-fetched data as input
if isinstance(char_df, pl.LazyFrame):
char_df = char_df.collect()
# ... rest of transform
```
**In the notebook, the analyst writes:**
```python
char_data, _ = S.get_character_refine(data) # Step visible to analyst
trait_freq, _ = S.transform_character_trait_frequency(char_data) # Transform step
chart = S.plot_character_trait_frequency(trait_freq)
```
### Step 5: Create Temporary Test File
Create `debug_plot_temp.py` for testing. **Prefer using the data snippet already provided by the user.**
**Option A: Use provided data snippet (preferred)**
If the user provided a `df.head()` or sample data output, create inline test data from it:
```python
"""Temporary test file for <plot_name>.
Delete after testing.
"""
import polars as pl
from theme import ColorPalette
import altair as alt
# ============================================================
# TEST DATA (reconstructed from user's df.head() output)
# ============================================================
test_data = pl.DataFrame({
"Column1": ["value1", "value2", ...],
"Column2": [1, 2, ...],
# ... recreate structure from provided sample
})
# ============================================================
# Test the plot function
from plots import QualtricsPlotsMixin
# ... test code
```
**Option B: Ask user (only if necessary)**
Only ask the user for additional code if:
- The provided sample is insufficient to test the plot logic
- You need to understand complex data relationships not visible in the sample
- The transformation requires understanding the full data pipeline
If you must ask:
> "The sample data you provided should work for basic testing. However, I need [specific reason]. Could you provide:
> 1. [specific information needed]
>
> If you'd prefer, I can proceed with a minimal test using the sample data you shared."
### Step 6: Create Plot Function
Add a new method to `QualtricsPlotsMixin` in `plots.py`:
```python
def plot_<descriptive_name>(
self,
data: pl.LazyFrame | pl.DataFrame | None = None,
title: str = "<Default title>",
x_label: str = "<X label>",
y_label: str = "<Y label>",
height: int | None = None,
width: int | str | None = None,
) -> alt.Chart:
"""<Docstring with original question and description>."""
df = self._ensure_dataframe(data)
# Build chart using ONLY ColorPalette from theme.py
chart = alt.Chart(...).mark_bar(color=ColorPalette.PRIMARY)...
chart = self._save_plot(chart, title)
return chart
```
**Requirements:**
- ALL colors MUST use `ColorPalette` constants from `theme.py`
- Use `self._ensure_dataframe()` to handle LazyFrame/DataFrame
- Use `self._save_plot()` at the end to enable auto-save
- Use `self._process_title()` for titles with `<br>` tags
- Follow existing plot patterns (see `plot_average_scores_with_counts`, `plot_top3_ranking_distribution`)
### Step 7: Test
Run the temporary test file to verify the plot works:
```bash
uv run python debug_plot_temp.py
```
### Step 8: Provide Summary
After successful completion, output a summary:
```
✅ Plot created successfully!
**Data function** (if created): `S.transform_<name>(data)`
**Plot function**: `S.plot_<name>(data, title="...")`
**Usage example:**
```python
# Assuming you have your data already prepared as `plot_data`
chart = S.plot_<name>(plot_data, title="Your Title Here")
chart # Display in Marimo
```
**Files modified:**
- `utils.py` - Added `transform_<name>()` (if applicable)
- `plots.py` - Added `plot_<name>()`
- `debug_plot_temp.py` - Test file (can be deleted)
```
## Critical Rules (from .github/copilot-instructions.md)
1. **NEVER load confidential data without explicit user-provided code**
2. **NEVER assume data source** - do not guess which `get_*()` method produced the data
3. **NEVER modify Marimo notebooks** (`0X_*.py` files)
4. **NEVER run Marimo notebooks for debugging**
5. **ALL colors MUST come from `theme.py`** - use `ColorPalette.PRIMARY`, `ColorPalette.RANK_1`, etc.
6. **If a new color is needed**, add it to `ColorPalette` in `theme.py` first
7. **No changelog markdown files** - do not create new .md files documenting changes
8. **Reading notebooks is OK** to understand function usage patterns
9. **Getter methods return tuples**: `(LazyFrame, Optional[metadata])`
10. **Use Polars LazyFrames** until visualization, then `.collect()`
If any rule causes problems, ask the user for permission before deviating.
## Reference: Column Patterns
- `SS_Green_Blue__V14__Choice_1` → Speaking Style trait score
- `Voice_Scale_1_10__V48` → 1-10 voice rating
- `Top_3_Voices_ranking__V77` → Ranking position
- `Character_Ranking_<Name>` → Character personality ranking

.vscode/extensions.json (new file)

@@ -0,0 +1,5 @@
{
"recommendations": [
"wakatime.vscode-wakatime"
]
}

.vscode/settings.json (new file)

@@ -0,0 +1,5 @@
{
"chat.tools.terminal.autoApprove": {
"/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/.venv/bin/python": true
}
}

View File

@@ -1,6 +1,6 @@
import marimo
__generated_with = "0.19.2"
__generated_with = "0.19.7"
app = marimo.App(width="full")
@@ -16,8 +16,8 @@ def _():
from speaking_styles import SPEAKING_STYLES
return (
QualtricsSurvey,
Path,
QualtricsSurvey,
SPEAKING_STYLES,
calculate_weighted_ranking_scores,
check_progress,
@@ -49,7 +49,7 @@ def _(Path, file_browser, mo):
@app.cell
def _(QualtricsSurvey, QSF_FILE, RESULTS_FILE, mo):
def _(QSF_FILE, QualtricsSurvey, RESULTS_FILE, mo):
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
try:
data_all = S.load_data()
@@ -383,6 +383,12 @@ def _(S, data, mo):
return (vscales,)
@app.cell
def _(vscales):
print(vscales.collect().head())
return
@app.cell
def _(pl, vscales):
# Count non-null values per row

View File

@@ -1,6 +1,6 @@
import marimo
__generated_with = "0.19.2"
__generated_with = "0.19.7"
app = marimo.App(width="full")
with app.setup:
@@ -44,14 +44,14 @@ def _(QSF_FILE, RESULTS_FILE):
@app.cell(hide_code=True)
def _():
mo.md(r"""
def _(RESULTS_FILE, data_all):
mo.md(rf"""
---
# Load Data
**Dataset:** `{Path(RESULTS_FILE).name}`
**Dataset:** {Path(RESULTS_FILE).name}
**Responses**: `{data_all.collect().shape[0]}`
**Responses**: {data_all.collect().shape[0]}
""")
return
@@ -71,7 +71,6 @@ def _(S, data_all):
mo.md(f"""
# Data Validation
{check_progress(data_all)}
@@ -104,39 +103,20 @@ def _(data_all):
return (data_validated,)
@app.cell(hide_code=True)
@app.cell
def _():
return
@app.cell
def _(data_validated):
data = data_validated
data.collect()
return (data,)
@app.cell(hide_code=True)
def _():
mo.md(r"""
---
# Introduction (Respondent Demographics)
""")
#
return
@app.cell
def _(S, data):
demographics = S.get_demographics(data)[0].collect()
demographics
return (demographics,)
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Lucia confirmation missing 'Consumer' data
@@ -145,6 +125,13 @@ def _():
@app.cell
def _(S, data_validated):
demographics = S.get_demographics(data_validated)[0].collect()
# demographics
return (demographics,)
@app.cell(hide_code=True)
def _(demographics):
# Demographics where 'Consumer' is null
demographics_no_consumer = demographics.filter(pl.col('Consumer').is_null())['_recordId'].to_list()
@@ -160,16 +147,82 @@ def _(data_all, demographics_no_consumer):
@app.cell
def _(data_all):
def _():
mo.md(r"""
# Filter Data (Global corrections)
""")
return
@app.cell
def _():
BEST_CHOSEN_CHARACTER = "the_coach"
return (BEST_CHOSEN_CHARACTER,)
@app.cell
def _(S):
filter_form = mo.md('''
{age}
{gender}
{ethnicity}
{income}
{consumer}
'''
).batch(
age=mo.ui.multiselect(options=S.options_age, value=S.options_age, label="Select Age Group(s):"),
gender=mo.ui.multiselect(options=S.options_gender, value=S.options_gender, label="Select Gender(s):"),
ethnicity=mo.ui.multiselect(options=S.options_ethnicity, value=S.options_ethnicity, label="Select Ethnicities:"),
income=mo.ui.multiselect(options=S.options_income, value=S.options_income, label="Select Income Group(s):"),
consumer=mo.ui.multiselect(options=S.options_consumer, value=S.options_consumer, label="Select Consumer Groups:")
).form()
mo.md(f'''
---
# Data Filter
{filter_form}
''')
return (filter_form,)
@app.cell
def _(S, data_validated, filter_form):
mo.stop(filter_form.value is None, mo.md("**Please submit filter above to proceed**"))
_d = S.filter_data(data_validated, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])
# Stop execution and prevent other cells from running if no data is selected
mo.stop(len(_d.collect()) == 0, mo.md("**No Data available for current filter combination**"))
data = _d
# data = data_validated
data.collect()
return (data,)
@app.cell
def _():
return
@app.cell
def _():
# Check if all business owners are missing a 'Consumer type' in demographics
assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]) , "Not all business owners are missing 'Consumer type' in demographics."
# assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]) , "Not all business owners are missing 'Consumer type' in demographics."
return
@app.cell
def _():
mo.md(r"""
## Demographic Distributions
# Demographic Distributions
""")
return
@@ -187,14 +240,13 @@ def _():
@app.cell
def _(S, demo_plot_cols, demographics):
def _(S, data, demo_plot_cols):
_content = """
## Demographic Distributions
"""
for c in demo_plot_cols:
_fig = S.plot_demographic_distribution(
data=demographics,
data=S.get_demographics(data)[0],
column=c,
title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
)
@@ -214,7 +266,7 @@ def _():
return
@app.cell
@app.cell(disabled=True)
def _():
mo.md(r"""
## Best performing: Original vs Refined frankenstein
@@ -222,15 +274,15 @@ def _():
return
@app.cell
@app.cell(disabled=True)
def _(S, data):
char_refine_rank = S.get_character_refine(data)[0]
# print(char_rank.collect().head())
# print(char_refine_rank.collect().head())
print(char_refine_rank.collect().head())
return
@app.cell
@app.cell(disabled=True)
def _():
mo.md(r"""
## Character ranking points
@@ -266,6 +318,30 @@ def _(S, char_rank):
@app.cell
def _():
mo.md(r"""
### Statistical Significance Character Ranking
""")
return
@app.cell(disabled=True)
def _(S, char_rank):
_pairwise_df, _meta = S.compute_ranking_significance(char_rank)
# print(_pairwise_df.columns)
mo.md(f"""
{mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}
{mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
""")
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Character Ranking: times 1st place
@@ -306,9 +382,75 @@ def _():
return
@app.cell
def _(S, data):
char_df = S.get_character_refine(data)[0]
return (char_df,)
@app.cell
def _(S, char_df):
from theme import ColorPalette
# Assuming you already have char_df (your data from get_character_refine or similar)
characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
character_colors = {
'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
}
# Build consistent sort order (by total frequency across all characters)
all_trait_counts = {}
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
for row in freq_df.iter_rows(named=True):
all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']
consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])
_content = """"""
# Generate 4 plots (one per character)
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
main_color, highlight_color = character_colors[char]
chart = S.plot_single_character_trait_frequency(
data=freq_df,
character_name=char,
bar_color=main_color,
highlight_color=highlight_color,
trait_sort_order=consistent_sort_order,
)
_content += f"""
{mo.ui.altair_chart(chart)}
"""
mo.md(_content)
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Statistical significance best characters
see chat
> example: if nr 1 and 2 don't differ significantly but both differ from nr 3, for instance, that's also great. Thinking along a bit about how I can present it, you know what I mean? :)
>
""")
return
@app.cell(disabled=True)
def _():
return
@app.cell
def _():
# Join respondent
return
@@ -322,12 +464,208 @@ def _():
return
@app.cell
def _():
COLOR_GENDER = True
return (COLOR_GENDER,)
@app.cell
def _():
mo.md(r"""
## Top 8 Most Chosen out of 18
""")
return
@app.cell
def _(S, data):
v_18_8_3 = S.get_18_8_3(data)[0]
return (v_18_8_3,)
@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Top 3 most chosen out of 8
""")
return
@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Voice Ranking Weighted Score
""")
return
@app.cell
def _(S, data):
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
return top3_voices, top3_voices_weighted
@app.cell
def _(COLOR_GENDER, S, top3_voices_weighted):
S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
---
## Which voice is ranked best in the ranking question for top 3?
# Brand Character Results
(not best 3 out of 8 question)
""")
return
@app.cell
def _(COLOR_GENDER, S, top3_voices):
S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
### Statistical significance for voice ranking
""")
return
@app.cell
def _():
# print(top3_voices.collect().head())
return
@app.cell
def _():
# _pairwise_df, _metadata = S.compute_ranking_significance(
# top3_voices,alpha=0.05,correction="none")
# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# _pairwise_df,
# metadata=_metadata,
# title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# _pairwise_df,
# metadata=_metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
return
@app.cell
def _():
## Voice Ranked 1st the most
return
@app.cell
def _(COLOR_GENDER, S, top3_voices):
S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Voice Scale 1-10
""")
return
@app.cell
def _(COLOR_GENDER, S, data):
# Get your voice scale data (from notebook)
voice_1_10, _ = S.get_voice_scale_1_10(data)
S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1,10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)
return (voice_1_10,)
@app.cell(disabled=True)
def _():
mo.md(r"""
### Statistical Significance (Scale 1-10)
""")
return
@app.cell(disabled=True)
def _(S, voice_1_10):
# Compute pairwise significance tests
pairwise_df, metadata = S.compute_pairwise_significance(
voice_1_10,
test_type="mannwhitney", # or "ttest", "chi2", "auto"
alpha=0.05,
correction="bonferroni" # or "holm", "none"
)
# View significant pairs
# print(pairwise_df.filter(pl.col('significant') == True))
# Create heatmap visualization
_heatmap = S.plot_significance_heatmap(
pairwise_df,
metadata=metadata,
title="Voice Rating Significance<br>(Pairwise Comparisons)"
)
# Create summary bar chart
_summary = S.plot_significance_summary(
pairwise_df,
metadata=metadata
)
mo.md(f"""
{mo.ui.altair_chart(_heatmap)}
{mo.ui.altair_chart(_summary)}
""")
return
@app.cell
def _():
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Ranking points for Voice per Chosen Brand Character
**missing mapping**
""")
return
@@ -335,12 +673,261 @@ def _():
@app.cell(hide_code=True)
def _():
mo.md(r"""
---
# Spoken Voice Results
## Correlation Speaking Styles
""")
return
@app.cell
def _(S, data, top3_voices):
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)
df_style = utils.process_speaking_style_data(ss_all, choice_map)
vscales = S.get_voice_scale_1_10(data)[0]
df_scale_long = utils.process_voice_scale_data(vscales)
joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")
df_ranking = utils.process_voice_ranking_data(top3_voices)
joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
return joined_ranking, joined_scale
@app.cell
def _(joined_ranking):
joined_ranking.head()
return
@app.cell
def _():
mo.md(r"""
### Colors vs Scale 1-10
""")
return
@app.cell
def _(S, joined_scale):
# Transform to get one row per color with average correlation
color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
data=color_corr_scale,
title="Correlation: Speaking Style Colors and Voice Scale 1-10"
)
return
@app.cell
def _():
mo.md(r"""
### Colors vs Ranking Points
""")
return
@app.cell
def _(S, joined_ranking):
color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
SPEAKING_STYLES,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=color_corr_ranking,
title="Correlation: Speaking Style Colors and Voice Ranking Points"
)
return
@app.cell
def _():
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
return
@app.cell
def _(S, joined_scale):
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Ranking Points
""")
return
@app.cell
def _(S, joined_ranking):
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Correlations when "Best Brand Character" is chosen
Select only the traits that fit with that character
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER):
from reference import ORIGINAL_CHARACTER_TRAITS
chosen_bc_traits = ORIGINAL_CHARACTER_TRAITS[BEST_CHOSEN_CHARACTER]
return (chosen_bc_traits,)
@app.cell
def _(chosen_bc_traits):
STYLES_SUBSET = utils.filter_speaking_styles(SPEAKING_STYLES, chosen_bc_traits)
return (STYLES_SUBSET,)
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Ranking Points
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
_content = ""
for _style, _traits in STYLES_SUBSET.items():
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Ranking Points"""
)
_content += f"""
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
_content = """"""
for _style, _traits in STYLES_SUBSET.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Scale 1-10""",
)
_content += f"""
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Colors vs Scale 1-10 (Best Character)
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
# Transform to get one row per color with average correlation
_color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, STYLES_SUBSET)
S.plot_speaking_style_color_correlation(
data=_color_corr_scale,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Scale 1-10"""
)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Colors vs Ranking Points (Best Character)
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
_color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
STYLES_SUBSET,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=_color_corr_ranking,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Ranking Points"""
)
return
if __name__ == "__main__":
app.run()

View File

@@ -1,6 +1,6 @@
import marimo
__generated_with = "0.19.2"
__generated_with = "0.19.7"
app = marimo.App(width="medium")
with app.setup:
@@ -21,28 +21,24 @@ def _():
@app.cell
def _():
TAG_SOURCE = Path('data/reports/Perception-Research-Report.pptx')
TAG_TARGET = Path('data/reports/Perception-Research-Report_tagged.pptx')
TAG_IMAGE_DIR = Path('figures/OneDrive_2026-01-28/')
return TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET
@app.cell
def _(TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET):
utils.update_ppt_alt_text(ppt_path=TAG_SOURCE, image_source_dir=TAG_IMAGE_DIR, output_path=TAG_TARGET)
return
@app.cell
def _():
utils._calculate_file_sha1('figures/OneDrive_2026-01-28/All_Respondents/most_prominent_personality_traits.png')
return
TAG_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
# TAG_TARGET = Path('data/reports/Perception-Research-Report_2-2_tagged.pptx')
TAG_IMAGE_DIR = Path('figures/debug')
return TAG_IMAGE_DIR, TAG_SOURCE
@app.cell
def _():
utils._calculate_perceptual_hash('figures/Picture.png')
def _(TAG_IMAGE_DIR, TAG_SOURCE):
utils.update_ppt_alt_text(
ppt_path=TAG_SOURCE,
image_source_dir=TAG_IMAGE_DIR,
# output_path=TAG_TARGET
)
return
@@ -56,26 +52,21 @@ def _():
@app.cell
def _():
REPLACE_SOURCE = Path('data/test_replace_source.pptx')
REPLACE_TARGET = Path('data/test_replace_target.pptx')
return REPLACE_SOURCE, REPLACE_TARGET
REPLACE_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
# REPLACE_TARGET = Path('data/reports/Perception-Research-Report_2-2_updated.pptx')
app._unparsable_cell(
r"""
IMAGE_FILE = Path('figures/OneDrive_2026-01-28/Cons-Early_Professional/cold_distant_approachable_familiar_warm.png'
""",
name="_"
)
NEW_IMAGES_DIR = Path('figures/2-4-26')
return NEW_IMAGES_DIR, REPLACE_SOURCE
@app.cell
def _(IMAGE_FILE, REPLACE_SOURCE, REPLACE_TARGET):
utils.pptx_replace_named_image(
presentation_path=REPLACE_SOURCE,
target_tag=utils.image_alt_text_generator(IMAGE_FILE),
new_image_path=IMAGE_FILE,
save_path=REPLACE_TARGET)
def _(NEW_IMAGES_DIR, REPLACE_SOURCE):
# get all files in the image source directory and subdirectories
results = utils.pptx_replace_images_from_directory(
REPLACE_SOURCE, # Source presentation path,
NEW_IMAGES_DIR, # Source directory with new images
# REPLACE_TARGET # Output path (optional, defaults to overwrite)
)
return

README.md

@@ -1,5 +1,239 @@
# Voice Branding Quantitative Analysis
## Running Marimo Notebooks
Running on Ct-105 for shared access:
```bash
uv run marimo run 02_quant_analysis.py --headless --port 8080
```
---
## Batch Report Generation
The quant report can be run with different filter combinations via CLI or automated batch processing.
### Single Filter Run (CLI)
Run the report script directly with JSON-encoded filter arguments:
```bash
# Single consumer segment
uv run python 03_quant_report.script.py --consumer '["Starter"]'
# Single age group
uv run python 03_quant_report.script.py --age '["18 to 21 years"]'
# Multiple filters combined
uv run python 03_quant_report.script.py --age '["18 to 21 years", "22 to 24 years"]' --gender '["Male"]'
# All respondents (no filters = defaults to all options selected)
uv run python 03_quant_report.script.py
```
Available filter arguments:
- `--age` — JSON list of age groups
- `--gender` — JSON list of genders
- `--ethnicity` — JSON list of ethnicities
- `--income` — JSON list of income groups
- `--consumer` — JSON list of consumer segments
### Batch Runner (All Combinations)
Run all single-filter combinations automatically with progress tracking:
```bash
# Preview all combinations without running
uv run python run_filter_combinations.py --dry-run
# Run all combinations (shows progress bar)
uv run python run_filter_combinations.py
# Or use the registered CLI entry point
uv run quant-report-batch
uv run quant-report-batch --dry-run
```
This generates reports for:
- All Respondents (no filters)
- Each age group individually
- Each gender individually
- Each ethnicity individually
- Each income group individually
- Each consumer segment individually
Output figures are saved to `figures/<export_date>/<filter_slug>/`.
### Jupyter Notebook Debugging
The script auto-detects Jupyter/IPython environments. When running in VS Code's Jupyter extension, CLI args default to `None` (all options selected), so you can debug cell-by-cell normally.
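The environment detection boils down to probing for IPython's `get_ipython` builtin, which only exists inside an IPython/Jupyter kernel. A minimal sketch of the pattern the report scripts use (the helper name here is illustrative):

```python
def in_jupyter() -> bool:
    """Return True when running under IPython/Jupyter (e.g. VS Code cells)."""
    try:
        get_ipython()  # defined only inside IPython kernels  # noqa: F821
        return True
    except NameError:
        # Plain `python script.py` invocation: parse CLI args instead.
        return False
```

When this returns `True`, the scripts skip `argparse` entirely and fall back to `None` for every filter, which `filter_data()` treats as "all options selected".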
---
## Adding Custom Filter Combinations
To add new filter combinations to the batch runner, edit `run_filter_combinations.py`:
### Checklist
1. **Open** `run_filter_combinations.py`
2. **Find** the `get_filter_combinations()` function
3. **Add** your combination to the list before the `return` statement:
```python
# Example: Add a specific age + consumer cross-filter
combinations.append({
'name': 'Age-18to24_Consumer-Starter', # Used for output folder naming
'filters': {
'age': ['18 to 21 years', '22 to 24 years'],
'consumer': ['Starter']
}
})
```
4. **Filter keys** must match CLI argument names (defined in `FILTER_CONFIG` in `03_quant_report.script.py`):
- `age` — values from `survey.options_age`
- `gender` — values from `survey.options_gender`
- `ethnicity` — values from `survey.options_ethnicity`
- `income` — values from `survey.options_income`
- `consumer` — values from `survey.options_consumer`
5. **Check available values** by running:
```python
from utils import QualtricsSurvey
S = QualtricsSurvey('data/exports/2-2-26/...Labels.csv', 'data/exports/.../....qsf')
S.load_data()
print(S.options_age)
print(S.options_consumer)
# etc.
```
6. **Test** with dry-run first:
```bash
uv run python run_filter_combinations.py --dry-run
```
### Example: Adding Multiple Cross-Filters
```python
# In get_filter_combinations(), before return:
# Young professionals
combinations.append({
'name': 'Young_Professionals',
'filters': {
'age': ['22 to 24 years', '25 to 34 years'],
'consumer': ['Early Professional']
}
})
# High income males
combinations.append({
'name': 'High_Income_Male',
'filters': {
'income': ['$150,000 - $199,999', '$200,000 or more'],
'gender': ['Male']
}
})
```
### Notes
- **Empty filters dict** = all respondents (no filtering)
- **Omitted filter keys** = all options for that dimension selected
- **Output folder names** are auto-generated from active filters by `QualtricsSurvey.filter_data()`
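Since the report script expects JSON-encoded values, the batch runner ultimately has to turn each combination dict into CLI arguments. A minimal sketch of that translation, under the assumption that omitted keys simply produce no argument (the helper name is illustrative, not the actual `run_filter_combinations.py` implementation):

```python
import json

def combination_to_cli_args(combination: dict) -> list:
    """Translate a filter combination into report-script CLI arguments.

    Keys omitted from 'filters' emit no argument at all, so the report
    script defaults that dimension to "all options selected".
    """
    args = ['--filter-name', combination['name']]
    for key, values in combination['filters'].items():
        args += [f'--{key}', json.dumps(values)]
    return args

combination_to_cli_args({'name': 'High_Income_Male',
                         'filters': {'gender': ['Male']}})
# → ['--filter-name', 'High_Income_Male', '--gender', '["Male"]']
```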
---
## Adding a New Filter Dimension
To add an entirely new filter dimension (e.g., a new demographic question), you need to update several files:
### Checklist
1. **Update `utils.py` — `QualtricsSurvey.__init__()`** to initialize the filter state attribute:
```python
# In __init__(), add after existing filter_ attributes (around line 758):
self.filter_region:list = None # QID99
```
2. **Update `utils.py` — `load_data()`** to populate the `options_*` attribute:
```python
# In load_data(), add after existing options:
self.options_region = sorted(df['QID99'].drop_nulls().unique().to_list()) if 'QID99' in df.columns else []
```
3. **Update `utils.py` — `filter_data()`** to accept and apply the filter:
```python
# Add parameter to function signature:
def filter_data(self, q: pl.LazyFrame, ..., region:list=None) -> pl.LazyFrame:
# Add filter logic in function body:
self.filter_region = region
if region is not None:
q = q.filter(pl.col('QID99').is_in(region))
```
4. **Update `plots.py` — `_get_filter_slug()`** to include the filter in directory slugs:
```python
# Add to the filters list:
('region', 'Reg', getattr(self, 'filter_region', None), 'options_region'),
```
5. **Update `plots.py` — `_get_filter_description()`** for human-readable descriptions:
```python
# Add to the filters list:
('Region', getattr(self, 'filter_region', None), 'options_region'),
```
6. **Update `03_quant_report.script.py` — `FILTER_CONFIG`**:
```python
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
# ... existing filters ...
'region': 'options_region', # ← New filter
}
```
This **automatically**:
- Adds `--region` CLI argument
- Includes it in Jupyter mode (defaults to all options)
- Passes it to `S.filter_data()`
- Writes it to the `.txt` filter description file
7. **Update `run_filter_combinations.py`** to generate combinations (optional):
```python
# Add after existing filter loops:
for region in survey.options_region:
combinations.append({
'name': f'Region-{region}',
'filters': {'region': [region]}
})
```
### Currently Available Filters
| CLI Argument | Options Attribute | QID Column | Description |
|--------------|-------------------|------------|-------------|
| `--age` | `options_age` | QID1 | Age groups |
| `--gender` | `options_gender` | QID2 | Gender |
| `--ethnicity` | `options_ethnicity` | QID3 | Ethnicity |
| `--income` | `options_income` | QID15 | Income brackets |
| `--consumer` | `options_consumer` | Consumer | Consumer segments |
| `--business_owner` | `options_business_owner` | QID4 | Business owner status |
| `--employment_status` | `options_employment_status` | QID13 | Employment status |
| `--personal_products` | `options_personal_products` | QID14 | Personal products |
| `--ai_user` | `options_ai_user` | QID22 | AI user status |
| `--investable_assets` | `options_investable_assets` | QID16 | Investable assets |
| `--industry` | `options_industry` | QID17 | Industry |
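Because CLI values must match the survey's option strings exactly, it can pay to validate them against the `options_*` attribute before kicking off a long batch run. A hypothetical pre-flight check (not part of `utils.py`):

```python
def check_filter_values(survey, options_attr: str, values: list) -> list:
    """Return the requested values that are NOT valid survey options."""
    available = set(getattr(survey, options_attr, []) or [])
    return [v for v in values if v not in available]

# Any non-empty return value means a typo or stale option string;
# an empty list means all requested values are safe to pass on the CLI.
```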


@@ -0,0 +1,263 @@
"""Extra analyses of the traits"""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18 to 21 years"]'
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
# Use the same default as argparse
default_fig_dir = f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}'
return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()
# %% Voices per trait
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)
# %% Create plots
for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
trait_d = ss_long.filter(pl.col("Description") == trait)
S.plot_speaking_style_trait_scores(trait_d, title=trait.replace(":", ""), height=550, color_gender=True)
# %% Filter out straight-liner (PER TRAIT) and re-plot to see if any changes
# Save with different filename suffix so we can compare with/without straight-liners
print("\n--- Straight-lining Checks on TRAITS ---")
sl_report_traits, sl_traits_df = check_straight_liners(ss_all, max_score=5)
sl_traits_df
# %%
if sl_traits_df is not None and not sl_traits_df.is_empty():
sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()
n_sl_groups = sl_traits_df.height
print(f"\nExcluding {n_sl_groups} straight-lined question blocks from {len(sl_ids)} respondents.")
# Create key in ss_long to match sl_traits_df for anti-join
# Question Group key in sl_traits_df is like "SS_Orange_Red__V14"
# ss_long has "Style_Group" and "Voice"
ss_long_w_key = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
# Prepare filter table: Record ID + Question Group
sl_filter = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
# Anti-join to remove specific question blocks that were straight-lined
ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")
# Re-plot with suffix in title
print("Re-plotting traits (Cleaned)...")
for i, trait in enumerate(ss_long_clean.select("Description").unique().to_series().to_list()):
trait_d = ss_long_clean.filter(pl.col("Description") == trait)
# Modify title to create unique filename (and display title)
title_clean = trait.replace(":", "") + " (Excl. Straight-Liners)"
S.plot_speaking_style_trait_scores(trait_d, title=title_clean, height=550, color_gender=True)
else:
print("No straight-liners found on traits.")
# %% Compare All vs Cleaned
if sl_traits_df is not None and not sl_traits_df.is_empty():
print("Generating Comparison Plots (All vs Cleaned)...")
# Always apply the per-question-group filtering here to ensure consistency
# (Matches the logic used in the re-plotting section above)
print("Applying filter to remove straight-lined question blocks...")
ss_long_w_key = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
sl_filter = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")
sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()
# --- Verification Prints ---
print(f"\n--- Verification of Filter ---")
print(f"Original Row Count: {ss_long.height}")
print(f"Number of Straight-Liner Question Blocks: {sl_traits_df.height}")
print(f"Sample IDs affected: {sl_ids[:5]}")
print(f"Cleaned Row Count: {ss_long_clean.height}")
print(f"Rows Removed: {ss_long.height - ss_long_clean.height}")
# Verify removal
# Re-construct key to verify
ss_long_check = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
sl_filter_check = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
should_be_removed = ss_long_check.join(sl_filter_check, on=["_recordId", "Question Group"], how="inner").height
print(f"Discrepancy Check (Should be 0): { (ss_long.height - ss_long_clean.height) - should_be_removed }")
# Show what was removed (the straight lining behavior)
print("\nSample of Straight-Liner Data (Values that caused removal):")
print(sl_traits_df.head(5))
print("-" * 30 + "\n")
# ---------------------------
for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
# Get data for this trait from both datasets
trait_d_all = ss_long.filter(pl.col("Description") == trait)
trait_d_clean = ss_long_clean.filter(pl.col("Description") == trait)
# Plot comparison
title_comp = trait.replace(":", "") + " (Impact of Straight-Liners)"
S.plot_speaking_style_trait_scores_comparison(
trait_d_all,
trait_d_clean,
title=title_comp,
height=600 # Slightly taller for grouped bars
)

XX_quant_report.script.py

@@ -0,0 +1,849 @@
__generated_with = "0.19.7"
# %%
import marimo as mo
import polars as pl
from pathlib import Path
import argparse
import json
import re
from validation import check_progress, duration_validation, check_straight_liners
from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
import utils
from speaking_styles import SPEAKING_STYLES
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
# RESULTS_FILE = 'data/exports/debug/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %%
# CLI argument parsing for batch automation
# When run as script: python 03_quant_report.script.py --age '["18 to 21 years"]' --consumer '["Starter"]'
# When run in Jupyter: args will use defaults (all filters = None = all options selected)
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
parser.add_argument('--best-character', type=str, default="the_coach", help='Slug of the best chosen character (default: "the_coach")')
parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (e.g. 3 removes anyone with 3+ straight-lined groups)')
parser.add_argument('--voice-ranking-filter', type=str, default=None, choices=['only-missing', 'exclude-missing'], help='Filter by voice ranking completeness: "only-missing" keeps only respondents missing QID98 ranking data, "exclude-missing" removes them')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=f'figures/{Path(RESULTS_FILE).parts[2]}', best_character="the_coach", sl_threshold=None, voice_ranking_filter=None)
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
BEST_CHOSEN_CHARACTER = cli_args.best_character
# %%
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
try:
data_all = S.load_data()
except NotImplementedError as e:
mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
# %% Apply filters
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# %% Apply straight-liner threshold filter (if specified)
# Removes respondents who straight-lined >= N question groups across
# speaking style and voice scale questions.
if cli_args.sl_threshold is not None:
_sl_n = cli_args.sl_threshold
S.sl_threshold = _sl_n # Store on Survey so filter slug/description include it
print(f"Applying straight-liner filter: excluding respondents with ≥{_sl_n} straight-lined question groups...")
_n_before = _d.select(pl.len()).collect().item()
# Extract question groups with renamed columns for check_straight_liners
_sl_ss_or, _ = S.get_ss_orange_red(_d)
_sl_ss_gb, _ = S.get_ss_green_blue(_d)
_sl_vs, _ = S.get_voice_scale_1_10(_d)
_sl_all_q = _sl_ss_or.join(_sl_ss_gb, on='_recordId').join(_sl_vs, on='_recordId')
_, _sl_df = check_straight_liners(_sl_all_q, max_score=5)
if _sl_df is not None and not _sl_df.is_empty():
# Count straight-lined question groups per respondent
_sl_counts = (
_sl_df
.group_by("Record ID")
.agg(pl.len().alias("sl_count"))
.filter(pl.col("sl_count") >= _sl_n)
.select(pl.col("Record ID").alias("_recordId"))
)
# Anti-join to remove offending respondents
_d = _d.collect().join(_sl_counts, on="_recordId", how="anti").lazy()
# Update filtered data on the Survey object so sample size is correct
S.data_filtered = _d
_n_after = _d.select(pl.len()).collect().item()
    print(f"  Removed {_n_before - _n_after} respondents ({_n_before} → {_n_after})")
else:
print(" No straight-liners detected — no respondents removed.")
# %% Apply voice-ranking completeness filter (if specified)
# Keeps only / excludes respondents who are missing the explicit voice
# ranking question (QID98) despite completing the top-3 selection (QID36).
if cli_args.voice_ranking_filter is not None:
S.voice_ranking_filter = cli_args.voice_ranking_filter # Store on Survey so filter slug/description include it
_vr_missing = S.get_top_3_voices_missing_ranking(_d)
_vr_missing_ids = _vr_missing.select('_recordId')
_n_before = _d.select(pl.len()).collect().item()
if cli_args.voice_ranking_filter == 'only-missing':
print(f"Voice ranking filter: keeping ONLY respondents missing QID98 ranking data...")
_d = _d.collect().join(_vr_missing_ids, on='_recordId', how='inner').lazy()
elif cli_args.voice_ranking_filter == 'exclude-missing':
print(f"Voice ranking filter: EXCLUDING respondents missing QID98 ranking data...")
_d = _d.collect().join(_vr_missing_ids, on='_recordId', how='anti').lazy()
S.data_filtered = _d
_n_after = _d.select(pl.len()).collect().item()
    print(f"  {_n_before} → {_n_after} respondents ({_vr_missing_ids.height} missing ranking data)")
# Save to logical variable name for further analysis
data = _d
data.collect()
# %%
# Check if all business owners are missing a 'Consumer type' in demographics
# assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]) , "Not all business owners are missing 'Consumer type' in demographics."
# %%
mo.md(r"""
# Demographic Distributions
""")
# %%
demo_plot_cols = [
'Age',
'Gender',
# 'Race/Ethnicity',
'Bussiness_Owner',
'Consumer'
]
# %%
_content = """
"""
for c in demo_plot_cols:
_fig = S.plot_demographic_distribution(
data=S.get_demographics(data)[0],
column=c,
title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
)
_content += f"""{mo.ui.altair_chart(_fig)}\n\n"""
mo.md(_content)
# %%
mo.md(r"""
---
# Brand Character Results
""")
# %%
mo.md(r"""
## Best performing: Original vs Refined frankenstein
""")
# %%
char_refine_rank = S.get_character_refine(data)[0]
# print(char_rank.collect().head())
print(char_refine_rank.collect().head())
# %%
mo.md(r"""
## Character ranking points
""")
# %%
mo.md(r"""
## Character ranking 1-2-3
""")
# %%
char_rank = S.get_character_ranking(data)[0]
# %%
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Voice')
# %%
S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality', title='Character Personality: Rankings Top 3')
# %%
mo.md(r"""
### Statistical Significance Character Ranking
""")
# %%
# _pairwise_df, _meta = S.compute_ranking_significance(char_rank)
# # print(_pairwise_df.columns)
# mo.md(f"""
# {mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}
# {mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
# """)
# %%
mo.md(r"""
## Character Ranking: times 1st place
""")
# %%
S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality')
# %%
mo.md(r"""
## Prominent predefined personality traits wordcloud
""")
# %%
top8_traits = S.get_top_8_traits(data)[0]
S.plot_traits_wordcloud(
data=top8_traits,
column='Top_8_Traits',
title="Most Prominent Personality Traits",
)
# %%
mo.md(r"""
## Trait frequency per brand character
""")
# %%
char_df = S.get_character_refine(data)[0]
# %%
from theme import ColorPalette
# Assuming you already have char_df (your data from get_character_refine or similar)
characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
character_colors = {
'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
}
# Build consistent sort order (by total frequency across all characters)
all_trait_counts = {}
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
for row in freq_df.iter_rows(named=True):
all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']
consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])
_content = """"""
# Generate 4 plots (one per character)
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
main_color, highlight_color = character_colors[char]
chart = S.plot_single_character_trait_frequency(
data=freq_df,
character_name=char,
bar_color=main_color,
highlight_color=highlight_color,
trait_sort_order=consistent_sort_order,
)
_content += f"""
{mo.ui.altair_chart(chart)}
"""
mo.md(_content)
# %%
mo.md(r"""
## Statistical significance best characters
see chat
> example: if nos. 1 and 2 don't differ significantly from each other but both differ from no. 3, for instance, that would also be a good result. Just thinking along about how I can present it, you know what I mean? :)
>
""")
# %%
mo.md(r"""
---
# Spoken Voice Results
""")
# %%
COLOR_GENDER = True
# %%
mo.md(r"""
## Top 8 Most Chosen out of 18
""")
# %%
v_18_8_3 = S.get_18_8_3(data)[0]
# %%
S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Top 3 most chosen out of 8
""")
# %%
S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Voice Ranking Weighted Score
""")
# %%
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
# %%
S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Which voice is ranked best in the ranking question for top 3?
(not best 3 out of 8 question)
""")
# %%
S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
### Statistical significance for voice ranking
""")
# %%
# print(top3_voices.collect().head())
# %%
# _pairwise_df, _metadata = S.compute_ranking_significance(
# top3_voices,alpha=0.05,correction="none")
# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# _pairwise_df,
# metadata=_metadata,
# title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# _pairwise_df,
# metadata=_metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
# %%
## Voice Ranked 1st the most
# %%
S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Voice Scale 1-10
""")
# %%
# Get your voice scale data (from notebook)
voice_1_10, _ = S.get_voice_scale_1_10(data)
S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1,10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
### Statistical Significance (Scale 1-10)
""")
# %%
# Compute pairwise significance tests
# pairwise_df, metadata = S.compute_pairwise_significance(
# voice_1_10,
# test_type="mannwhitney", # or "ttest", "chi2", "auto"
# alpha=0.05,
# correction="bonferroni" # or "holm", "none"
# )
# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# pairwise_df,
# metadata=metadata,
# title="Voice Rating Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# pairwise_df,
# metadata=metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
# %%
mo.md(r"""
## Ranking points for Voice per Chosen Brand Character
**missing mapping**
""")
# %%
mo.md(r"""
## Correlation Speaking Styles
""")
# %%
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
df_style = utils.process_speaking_style_data(ss_all, choice_map)
ss_long = df_style  # alias; avoids processing the same data twice
vscales = voice_1_10  # reuse the voice scale data loaded above
df_scale_long = utils.process_voice_scale_data(vscales)
joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")
df_ranking = utils.process_voice_ranking_data(top3_voices)
joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
# %%
joined_ranking.head()
# %%
mo.md(r"""
### Colors vs Scale 1-10
""")
# %%
# Transform to get one row per color with average correlation
color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
data=color_corr_scale,
title="Correlation: Speaking Style Colors and Voice Scale 1-10"
)
# %%
mo.md(r"""
### Colors vs Ranking Points
""")
# %%
color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
SPEAKING_STYLES,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=color_corr_ranking,
title="Correlation: Speaking Style Colors and Voice Ranking Points"
)
# %%
# Gender-filtered correlation plots (Male vs Female voices)
from reference import VOICE_GENDER_MAPPING
MALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Male"]
FEMALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Female"]
# Filter joined data by voice gender
joined_scale_male = joined_scale.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_scale_female = joined_scale.filter(pl.col("Voice").is_in(FEMALE_VOICES))
joined_ranking_male = joined_ranking.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_ranking_female = joined_ranking.filter(pl.col("Voice").is_in(FEMALE_VOICES))
# Colors vs Scale 1-10 (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
data_male=joined_scale_male,
data_female=joined_scale_female,
speaking_styles=SPEAKING_STYLES,
target_column="Voice_Scale_Score",
title="Correlation: Speaking Style Colors and Voice Scale 1-10 (by Voice Gender)",
filename="correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_color",
)
# Colors vs Ranking Points (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
data_male=joined_ranking_male,
data_female=joined_ranking_female,
speaking_styles=SPEAKING_STYLES,
target_column="Ranking_Points",
title="Correlation: Speaking Style Colors and Voice Ranking Points (by Voice Gender)",
filename="correlation_speaking_style_and_voice_ranking_points_by_voice_gender_color",
)
# %%
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
# %%
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_scale_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
mo.md(r"""
### Individual Traits vs Ranking Points
""")
# %%
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
# Individual Traits vs Scale 1-10 (grouped by voice gender)
_content = """### Individual Traits vs Scale 1-10 (by Voice Gender)\n\n"""
for _style, _traits in SPEAKING_STYLES.items():
    _fig = S.plot_speaking_style_scale_correlation_by_gender(
        data_male=joined_scale_male,
        data_female=joined_scale_female,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10 (by Voice Gender)",
        filename=f"correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_{_style.lower()}",
    )
    _content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
# Individual Traits vs Ranking Points (grouped by voice gender)
_content = """### Individual Traits vs Ranking Points (by Voice Gender)\n\n"""
for _style, _traits in SPEAKING_STYLES.items():
    _fig = S.plot_speaking_style_ranking_correlation_by_gender(
        data_male=joined_ranking_male,
        data_female=joined_ranking_female,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Ranking Points (by Voice Gender)",
        filename=f"correlation_speaking_style_and_voice_ranking_points_by_voice_gender_{_style.lower()}",
    )
    _content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
mo.md(r"""
## Correlations when "Best Brand Character" is Chosen
For each of the 4 brand characters, filter the dataset to only those respondents
who selected that character as their #1 choice.
""")
# %%
# Prepare character-filtered data subsets
char_rank_for_filter = S.get_character_ranking(data)[0].collect()
CHARACTER_FILTER_MAP = {
    'Familiar Friend': 'Character_Ranking_Familiar_Friend',
    'The Coach': 'Character_Ranking_The_Coach',
    'Personal Assistant': 'Character_Ranking_The_Personal_Assistant',
    'Bank Teller': 'Character_Ranking_The_Bank_Teller',
}
def get_filtered_data_for_character(char_name: str) -> tuple[pl.DataFrame, pl.DataFrame, int]:
    """Filter joined_scale and joined_ranking to respondents who ranked char_name #1."""
    col = CHARACTER_FILTER_MAP[char_name]
    respondents = char_rank_for_filter.filter(pl.col(col) == 1).select('_recordId')
    n = respondents.height
    filtered_scale = joined_scale.join(respondents, on='_recordId', how='inner')
    filtered_ranking = joined_ranking.join(respondents, on='_recordId', how='inner')
    return filtered_scale, filtered_ranking, n
def _char_filename(char_name: str, suffix: str) -> str:
    """Generate filename for character-filtered plots (without n-value).

    Format: bc_ranked_1_{suffix}__{char_slug}
    This groups all plot types together in directory listings.
    """
    char_slug = char_name.lower().replace(' ', '_')
    return f"bc_ranked_1_{suffix}__{char_slug}"
# %%
# ### Voice Weighted Ranking Score (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
    _, _, n = get_filtered_data_for_character(char_name)
    # Get top-3 voices for this character subset using _recordIds
    respondents = char_rank_for_filter.filter(
        pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
    ).select('_recordId')
    # Collect top3_voices if it's a LazyFrame, then join
    top3_df = top3_voices.collect() if isinstance(top3_voices, pl.LazyFrame) else top3_voices
    filtered_top3 = top3_df.join(respondents, on='_recordId', how='inner')
    weighted = calculate_weighted_ranking_scores(filtered_top3)
    S.plot_weighted_ranking_score(
        data=weighted,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Most Popular Voice - Weighted Score (1st=3pts, 2nd=2pts, 3rd=1pt)',
        filename=_char_filename(char_name, "voice_weighted_ranking_score"),
        color_gender=COLOR_GENDER,
    )
# %%
# ### Voice Scale 1-10 Average Scores (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
    _, _, n = get_filtered_data_for_character(char_name)
    # Get voice scale data for this character subset using _recordIds
    respondents = char_rank_for_filter.filter(
        pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
    ).select('_recordId')
    # Collect voice_1_10 if it's a LazyFrame, then join
    voice_1_10_df = voice_1_10.collect() if isinstance(voice_1_10, pl.LazyFrame) else voice_1_10
    filtered_voice_1_10 = voice_1_10_df.join(respondents, on='_recordId', how='inner')
    S.plot_average_scores_with_counts(
        data=filtered_voice_1_10,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Voice General Impression (Scale 1-10)',
        filename=_char_filename(char_name, "voice_scale_1-10"),
        x_label='Voice',
        domain=[1, 10],
        color_gender=COLOR_GENDER,
    )
# %%
# ### Speaking Style Colors vs Scale 1-10 (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
    if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
        continue
    filtered_scale, _, n = get_filtered_data_for_character(char_name)
    color_corr, _ = utils.transform_speaking_style_color_correlation(filtered_scale, SPEAKING_STYLES)
    S.plot_speaking_style_color_correlation(
        data=color_corr,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Scale 1-10',
        filename=_char_filename(char_name, "colors_vs_voice_scale_1-10"),
    )
# %%
# ### Speaking Style Colors vs Ranking Points (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
    if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
        continue
    _, filtered_ranking, n = get_filtered_data_for_character(char_name)
    color_corr, _ = utils.transform_speaking_style_color_correlation(
        filtered_ranking, SPEAKING_STYLES, target_column="Ranking_Points"
    )
    S.plot_speaking_style_color_correlation(
        data=color_corr,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Ranking Points',
        filename=_char_filename(char_name, "colors_vs_voice_ranking_points"),
    )
# %%
# ### Individual Traits vs Scale 1-10 (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
    print(f"--- Speaking Style: {_style} ---")
    for char_name in CHARACTER_FILTER_MAP:
        if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
            continue
        filtered_scale, _, n = get_filtered_data_for_character(char_name)
        S.plot_speaking_style_scale_correlation(
            data=filtered_scale,
            style_color=_style,
            style_traits=_traits,
            title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Scale 1-10',
            filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_scale_1-10"),
        )
# %%
# ### Individual Traits vs Ranking Points (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
    print(f"--- Speaking Style: {_style} ---")
    for char_name in CHARACTER_FILTER_MAP:
        if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
            continue
        _, filtered_ranking, n = get_filtered_data_for_character(char_name)
        S.plot_speaking_style_ranking_correlation(
            data=filtered_ranking,
            style_color=_style,
            style_traits=_traits,
            title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Ranking Points',
            filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_ranking_points"),
        )
# %%


@@ -0,0 +1,370 @@
"""Extra statistical significance analyses for quant report."""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
    # Only parse if running as script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        # Use the same default as argparse
        default_fig_dir = f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}'
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args
cli_args = parse_cli_args()
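The JSON-list convention for filter arguments can be illustrated in a self-contained snippet; the filter names come from `FILTER_CONFIG`, while the example values are hypothetical survey labels:

```python
import argparse
import json

# Build a tiny parser with two of the dynamic filter flags
parser = argparse.ArgumentParser()
for name in ("age", "gender"):
    parser.add_argument(f"--{name}", type=str, default=None)

# Simulate: --age '["18 to 21 years", "22 to 24 years"]' (no --gender flag)
args = parser.parse_args(["--age", '["18 to 21 years", "22 to 24 years"]'])

# Decode each JSON string into a Python list; None stays None (no filter applied)
for name in ("age", "gender"):
    val = getattr(args, name)
    setattr(args, name, json.loads(val) if val else None)
```

After decoding, `args.age` is a list of labels ready for `filter_data()`, and `args.gender` remains `None`, which the filter step skips.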
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)
    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f" {display_name}: All")
    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))
    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()  # materialise once so any filter/schema errors surface early
# %% Is "The Coach" character ranked significantly higher than the others?
char_rank = S.get_character_ranking(data)[0]
_pairwise_df, _meta = S.compute_ranking_significance(
char_rank,
alpha=0.05,
correction="none",
)
# %% [markdown]
"""
### Methodology Analysis
**Input Data (`char_rank`)**:
* Generated by `S.get_character_ranking(data)`.
* Contains the ranking values (1st, 2nd, 3rd, 4th) assigned by each respondent to the four options ("The Coach", etc.).
* Columns represent the characters; rows represent individual respondents; values are the numerical rank (1 = Top Choice).
**Processing**:
* The function `compute_ranking_significance` aggregates these rankings to find the **"Rank 1 Share"** (the percentage of respondents who picked that character as their #1 favorite).
* It builds a contingency table of how many times each character was ranked 1st vs. not 1st (or 1st vs. 2nd vs. 3rd).
**Statistical Test**:
* **Test Used**: Pairwise Z-test for two proportions (uncorrected).
* **Comparison**: It compares the **Rank 1 Share** of every pair of characters.
* *Example*: "Is the 42% of people who chose 'Coach' significantly different from the 29% who chose 'Familiar Friend'?"
* **Significance**: A result of `p < 0.05` means the difference in popularity (top-choice preference) is statistically significant and not due to random chance.
"""
# %% Plot heatmap of pairwise significance
S.plot_significance_heatmap(_pairwise_df, metadata=_meta, title="Statistical Significance: Character Top Choice Preference")
# %% Plot summary of significant differences (e.g., which characters are significantly higher than others)
# S.plot_significance_summary(_pairwise_df, metadata=_meta)
# %% [markdown]
"""
# Analysis: Significance of "The Coach"
**Parameters**: `alpha=0.05`, `correction='none'`
* **Rationale**: No correction was applied to allow for detection of all potential pairwise differences (uncorrected p < 0.05). If strict control for family-wise error rate were required (e.g., Bonferroni), the significance threshold would be lower (p < 0.0083).
**Results**:
"The Coach" is the top-ranked option (42.0% Rank 1 share) and shows strong separation from the field.
* **Vs. Bottom Two**: "The Coach" is significantly higher than both "The Bank Teller" (26.9%, p < 0.001) and "Familiar Friend" (29.4%, p < 0.001).
* **Vs. Runner-Up**: "The Coach" is preferred over "The Personal Assistant" (33.4%). The difference of **8.6 percentage points** is statistically significant (p = 0.017) at the standard 0.05 level.
* *Note*: While p=0.017 is significant in isolation, it would not meet the stricter Bonferroni threshold (0.0083). However, the effect size (+8.6%) is commercially meaningful.
**Conclusion**:
Yes, "The Coach" is statistically significantly preferred over the other options. It is clearly ahead of the bottom two and holds a significant lead over the runner-up ("The Personal Assistant") in the direct comparison.
"""
# %% Mentions significance analysis
char_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
char_rank,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
char_pairwise_df_mentions,
metadata=_meta_mentions,
title="Statistical Significance: Character Total Mentions (Top 3 Visibility)"
)
# %% voices analysis
top3_voices = S.get_top_3_voices(data)[0]
_pairwise_df_voice, _metadata = S.compute_ranking_significance(
    top3_voices, alpha=0.05, correction="none")
S.plot_significance_heatmap(
_pairwise_df_voice,
metadata=_metadata,
title="Statistical Significance: Voice Top Choice Preference"
)
# %% Total Mentions Significance (Rank 1+2+3 Combined)
# This tests "Quantity" (Visibility) instead of "Quality" (Preference)
_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
top3_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_df_mentions,
metadata=_meta_mentions,
title="Statistical Significance: Voice Total Mentions (Top 3 Visibility)"
)
# %% Male Voices Only Analysis
import reference
def filter_voices_by_gender(df: pl.DataFrame, target_gender: str) -> pl.DataFrame:
    """Filter ranking columns to keep only those matching target gender."""
    cols_to_keep = []
    # Always keep identifier if present
    if '_recordId' in df.columns:
        cols_to_keep.append('_recordId')
    for col in df.columns:
        # Check if column is a voice column (contains Vxx)
        # Format is typically "Top_3_Voices_ranking__V14"
        if '__V' in col:
            voice_id = col.split('__')[1]
            if reference.VOICE_GENDER_MAPPING.get(voice_id) == target_gender:
                cols_to_keep.append(col)
    return df.select(cols_to_keep)
# Get full ranking data as DataFrame
df_voices = top3_voices.collect()
# Filter for Male voices
df_male_voices = filter_voices_by_gender(df_voices, 'Male')
# 1. Male Voices: Top Choice Preference (Rank 1)
_pairwise_male_pref, _meta_male_pref = S.compute_ranking_significance(
df_male_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_pref,
metadata=_meta_male_pref,
title="Male Voices Only: Top Choice Preference Significance"
)
# 2. Male Voices: Total Mentions (Visibility)
_pairwise_male_vis, _meta_male_vis = S.compute_mentions_significance(
df_male_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_vis,
metadata=_meta_male_vis,
title="Male Voices Only: Total Mentions Significance"
)
# %% Male Voices (Excluding Bottom 3: V88, V86, V81)
# Start with the male voices dataframe from the previous step
voices_to_exclude = ['V88', 'V86', 'V81']
def filter_exclude_voices(df: pl.DataFrame, exclude_list: list[str]) -> pl.DataFrame:
    """Filter ranking columns to exclude specific voices."""
    cols_to_keep = []
    # Always keep identifier if present
    if '_recordId' in df.columns:
        cols_to_keep.append('_recordId')
    for col in df.columns:
        # Check if column is a voice column (contains Vxx)
        if '__V' in col:
            voice_id = col.split('__')[1]
            if voice_id not in exclude_list:
                cols_to_keep.append(col)
    return df.select(cols_to_keep)
df_male_top = filter_exclude_voices(df_male_voices, voices_to_exclude)
# 1. Male Top Candidates: Top Choice Preference
_pairwise_male_top_pref, _meta_male_top_pref = S.compute_ranking_significance(
df_male_top,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_top_pref,
metadata=_meta_male_top_pref,
title="Male Voices (Excl. Bottom 3): Top Choice Preference Significance"
)
# 2. Male Top Candidates: Total Mentions
_pairwise_male_top_vis, _meta_male_top_vis = S.compute_mentions_significance(
df_male_top,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_top_vis,
metadata=_meta_male_top_vis,
title="Male Voices (Excl. Bottom 3): Total Mentions Significance"
)
# %% [markdown]
"""
# Rank 1 Selection Significance (Voice Level)
Similar to the Total Mentions significance analysis above, but counting
only how many times each voice was ranked **1st** (out of all respondents).
This isolates first-choice preference rather than overall top-3 visibility.
"""
# %% Rank 1 Significance: All Voices
_pairwise_df_rank1, _meta_rank1 = S.compute_rank1_significance(
top3_voices,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
_pairwise_df_rank1,
metadata=_meta_rank1,
title="Statistical Significance: Voice Rank 1 Selection"
)
# %% Rank 1 Significance: Male Voices Only
_pairwise_df_rank1_male, _meta_rank1_male = S.compute_rank1_significance(
df_male_voices,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
_pairwise_df_rank1_male,
metadata=_meta_rank1_male,
title="Male Voices Only: Rank 1 Selection Significance"
)
# %%

XX_straight_liners.py Normal file

@@ -0,0 +1,267 @@
"""Extra analyses of the straight-liners"""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_straight_liners.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
    # Only parse if running as script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        # Use the same default as argparse
        default_fig_dir = f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}'
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args
cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)
    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f" {display_name}: All")
    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))
    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()  # materialise once so any filter/schema errors surface early
# %% Determine straight-liner repeat offenders
# Extract question groups with renamed columns that check_straight_liners expects.
# The raw `data` has QID-based column names; the getter methods rename them to
# patterns like SS_Green_Blue__V14__Choice_1, Voice_Scale_1_10__V48, etc.
ss_or, _ = S.get_ss_orange_red(data)
ss_gb, _ = S.get_ss_green_blue(data)
vs, _ = S.get_voice_scale_1_10(data)
# Combine all question groups into one wide LazyFrame (joined on _recordId)
all_questions = ss_or.join(ss_gb, on='_recordId').join(vs, on='_recordId')
# Run straight-liner detection across all question groups
# max_score=5 catches all speaking-style straight-lining (1-5 scale)
# and voice-scale values ≤5 on the 1-10 scale
# Note: sl_threshold is NOT set on S here — this script analyses straight-liners,
# it doesn't filter them out of the dataset.
print("Running straight-liner detection across all question groups...")
sl_report, sl_df = check_straight_liners(all_questions, max_score=5)
# %% Quantify repeat offenders
# sl_df has one row per (Record ID, Question Group) that was straight-lined.
# Group by Record ID to count how many question groups each person SL'd.
if sl_df is not None and not sl_df.is_empty():
    total_respondents = data.select(pl.len()).collect().item()
    # Per-respondent count of straight-lined question groups
    respondent_sl_counts = (
        sl_df
        .group_by("Record ID")
        .agg(pl.len().alias("sl_count"))
        .sort("sl_count", descending=True)
    )
    max_sl = respondent_sl_counts["sl_count"].max()
    print(f"\nTotal respondents: {total_respondents}")
    print(f"Respondents who straight-lined at least 1 question group: "
          f"{respondent_sl_counts.height}")
    print(f"Maximum question groups straight-lined by one person: {max_sl}")
    print()
    # Build cumulative distribution: for each threshold N, count respondents
    # who straight-lined >= N question groups
    cumulative_rows = []
    for threshold in range(1, max_sl + 1):
        count = respondent_sl_counts.filter(
            pl.col("sl_count") >= threshold
        ).height
        pct = (count / total_respondents) * 100
        cumulative_rows.append({
            "threshold": threshold,
            "count": count,
            "pct": pct,
        })
        print(
            f"{threshold} question groups straight-lined: "
            f"{count} respondents ({pct:.1f}%)"
        )
    cumulative_df = pl.DataFrame(cumulative_rows)
    print(f"\n{cumulative_df}")
    # %% Save cumulative data to CSV
    _filter_slug = S._get_filter_slug()
    _csv_dir = Path(S.fig_save_dir) / _filter_slug
    _csv_dir.mkdir(parents=True, exist_ok=True)
    _csv_path = _csv_dir / "straight_liner_repeat_offenders.csv"
    cumulative_df.write_csv(_csv_path)
    print(f"Saved cumulative data to {_csv_path}")
    # %% Plot the cumulative distribution
    S.plot_straight_liner_repeat_offenders(
        cumulative_df,
        total_respondents=total_respondents,
    )
    # %% Per-question straight-lining frequency
    # Build human-readable question group names from the raw keys
    def _humanise_question_group(key: str) -> str:
        """Convert an internal question group key to a readable label.

        Examples:
            SS_Green_Blue__V14 → Green/Blue V14
            SS_Orange_Red__V48 → Orange/Red V48
            Voice_Scale_1_10 → Voice Scale (1-10)
        """
        if key.startswith("SS_Green_Blue__"):
            voice = key.split("__")[1]
            return f"Green/Blue {voice}"
        if key.startswith("SS_Orange_Red__"):
            voice = key.split("__")[1]
            return f"Orange/Red {voice}"
        if key == "Voice_Scale_1_10":
            return "Voice Scale (1-10)"
        # Fallback: replace underscores
        return key.replace("_", " ")
    per_question_counts = (
        sl_df
        .group_by("Question Group")
        .agg(pl.col("Record ID").n_unique().alias("count"))
        .sort("count", descending=True)
        .with_columns(
            (pl.col("count") / total_respondents * 100).alias("pct")
        )
    )
    # Add human-readable names
    per_question_counts = per_question_counts.with_columns(
        pl.col("Question Group").map_elements(
            _humanise_question_group, return_dtype=pl.Utf8
        ).alias("question")
    )
    print("\n--- Per-Question Straight-Lining Frequency ---")
    print(per_question_counts)
    # Save per-question data to CSV
    _csv_path_pq = _csv_dir / "straight_liner_per_question.csv"
    per_question_counts.write_csv(_csv_path_pq)
    print(f"Saved per-question data to {_csv_path_pq}")
    # Plot
    S.plot_straight_liner_per_question(
        per_question_counts,
        total_respondents=total_respondents,
    )
    # %% Show the top repeat offenders (respondents with most SL'd groups)
    print("\n--- Top Repeat Offenders ---")
    print(respondent_sl_counts.head(20))
else:
    print("No straight-liners detected in the dataset.")

File diff suppressed because one or more lines are too long

BIN
docs/README.pdf Normal file

Binary file not shown.


@@ -0,0 +1,104 @@
# Appendix: Quantitative Analysis Plots - Folder Structure Manual
This folder contains all the quantitative analysis plots, sorted by the filters applied to the dataset. Each folder corresponds to a specific demographic cut.
## Folder Overview
* `All_Respondents/`: Analysis of the full dataset (no filters).
* `filter_index.txt`: A master list of every folder code and its corresponding demographic filter.
* **Filter Folders**: All other folders represent specific demographic cuts (e.g., `Age-18to21years`, `Gen-Woman`).
## How to Navigate
Each folder contains the same set of charts generated for that specific filter.
## Directory Reference Table
Below is the complete list of folder names. Each name encodes the filter(s) applied to the dataset; the codes are kept stable so results can be cross-referenced consistently across the analysis.
| Directory Code | Filter Description |
| :--- | :--- |
| All_Respondents | All Respondents |
| Age-18to21years | Age: 18 to 21 years |
| Age-22to24years | Age: 22 to 24 years |
| Age-25to34years | Age: 25 to 34 years |
| Age-35to40years | Age: 35 to 40 years |
| Age-41to50years | Age: 41 to 50 years |
| Age-51to59years | Age: 51 to 59 years |
| Age-60to70years | Age: 60 to 70 years |
| Age-70yearsormore | Age: 70 years or more |
| Gen-Man | Gender: Man |
| Gen-Prefernottosay | Gender: Prefer not to say |
| Gen-Woman | Gender: Woman |
| Eth-6_grps_c64411 | Ethnicity: All options containing 'Alaska Native or Indigenous American' |
| Eth-6_grps_8f145b | Ethnicity: All options containing 'Asian or Asian American' |
| Eth-8_grps_71ac47 | Ethnicity: All options containing 'Black or African American' |
| Eth-7_grps_c5b3ce | Ethnicity: All options containing 'Hispanic or Latinx' |
| Eth-BlackorAfricanAmerican<br>MiddleEasternorNorthAfrican<br>WhiteorCaucasian+<br>MiddleEasternorNorthAfrican | Ethnicity: Middle Eastern or North African |
| Eth-AsianorAsianAmericanBlackorAfricanAmerican<br>NativeHawaiianorOtherPacificIslander+<br>NativeHawaiianorOtherPacificIslander | Ethnicity: Native Hawaiian or Other Pacific Islander |
| Eth-10_grps_cef760 | Ethnicity: All options containing 'White or Caucasian' |
| Inc-100000to149999 | Income: $100,000 to $149,999 |
| Inc-150000to199999 | Income: $150,000 to $199,999 |
| Inc-200000ormore | Income: $200,000 or more |
| Inc-25000to34999 | Income: $25,000 to $34,999 |
| Inc-35000to54999 | Income: $35,000 to $54,999 |
| Inc-55000to79999 | Income: $55,000 to $79,999 |
| Inc-80000to99999 | Income: $80,000 to $99,999 |
| Inc-Lessthan25000 | Income: Less than $25,000 |
| Cons-Lower_Mass_A+Lower_Mass_B | Consumer: Lower_Mass_A, Lower_Mass_B |
| Cons-MassAffluent_A+MassAffluent_B | Consumer: MassAffluent_A, MassAffluent_B |
| Cons-Mass_A+Mass_B | Consumer: Mass_A, Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_A+<br>Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_A, Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Early_Professional | Consumer: Early_Professional |
| Cons-Lower_Mass_B | Consumer: Lower_Mass_B |
| Cons-MassAffluent_B | Consumer: MassAffluent_B |
| Cons-Mass_B | Consumer: Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Starter | Consumer: Starter |
| BizOwn-No | Business Owner: No |
| BizOwn-Yes | Business Owner: Yes |
| AI-Daily | AI User: Daily |
| AI-Lessthanonceamonth | AI User: Less than once a month |
| AI-Morethanoncedaily | AI User: More than once daily |
| AI-Multipletimesperweek | AI User: Multiple times per week |
| AI-Onceamonth | AI User: Once a month |
| AI-Onceaweek | AI User: Once a week |
| AI-RarelyNever | AI User: Rarely/Never |
| AI-Daily+<br>Morethanoncedaily+<br>Multipletimesperweek | AI User: Daily, More than once daily, Multiple times per week |
| AI-4_grps_d4f57a | AI User: Once a week, Once a month, Less than once a month, Rarely/Never |
| InvAsts-0to24999 | Investable Assets: $0 to $24,999 |
| InvAsts-150000to249999 | Investable Assets: $150,000 to $249,999 |
| InvAsts-1Mto4.9M | Investable Assets: $1M to $4.9M |
| InvAsts-25000to49999 | Investable Assets: $25,000 to $49,999 |
| InvAsts-250000to499999 | Investable Assets: $250,000 to $499,999 |
| InvAsts-50000to149999 | Investable Assets: $50,000 to $149,999 |
| InvAsts-500000to999999 | Investable Assets: $500,000 to $999,999 |
| InvAsts-5Mormore | Investable Assets: $5M or more |
| InvAsts-Prefernottoanswer | Investable Assets: Prefer not to answer |
| Ind-Agricultureforestryfishingorhunting | Industry: Agriculture, forestry, fishing, or hunting |
| Ind-Artsentertainmentorrecreation | Industry: Arts, entertainment, or recreation |
| Ind-Broadcasting | Industry: Broadcasting |
| Ind-Construction | Industry: Construction |
| Ind-EducationCollegeuniversityoradult | Industry: Education College, university, or adult |
| Ind-EducationOther | Industry: Education Other |
| Ind-EducationPrimarysecondaryK-12 | Industry: Education Primary/secondary (K-12) |
| Ind-Governmentandpublicadministration | Industry: Government and public administration |
| Ind-Hotelandfoodservices | Industry: Hotel and food services |
| Ind-InformationOther | Industry: Information Other |
| Ind-InformationServicesanddata | Industry: Information Services and data |
| Ind-Legalservices | Industry: Legal services |
| Ind-ManufacturingComputerandelectronics | Industry: Manufacturing Computer and electronics |
| Ind-ManufacturingOther | Industry: Manufacturing Other |
| Ind-Notemployed | Industry: Not employed |
| Ind-Otherindustrypleasespecify | Industry: Other industry (please specify) |
| Ind-Processing | Industry: Processing |
| Ind-Publishing | Industry: Publishing |
| Ind-Realestaterentalorleasing | Industry: Real estate, rental, or leasing |
| Ind-Retired | Industry: Retired |
| Ind-Scientificortechnicalservices | Industry: Scientific or technical services |
| Ind-Software | Industry: Software |
| Ind-Telecommunications | Industry: Telecommunications |
| Ind-Transportationandwarehousing | Industry: Transportation and warehousing |
| Ind-Utilities | Industry: Utilities |
| Ind-Wholesale | Industry: Wholesale |


@@ -0,0 +1,428 @@
# Statistical Significance Testing Guide
A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.
---
## Table of Contents
1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)
---
## Quick Decision Flowchart
```
What kind of data do you have?
├─► Continuous scores (1-10 ratings, averages)
│ │
│ └─► Use: compute_pairwise_significance()
│ │
│ ├─► Data normally distributed? → test_type="ttest"
│ └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice)
└─► Ranking data (1st, 2nd, 3rd place votes)
└─► Use: compute_ranking_significance()
(automatically uses proportion z-test)
```
---
## Understanding Your Data Types
### Continuous Data
**What it looks like:** Numbers on a scale with many possible values.
| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |
```
shape: (5, 3)
┌───────────┬─────────────────┬─────────────────┐
│ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│
│ str │ f64 │ f64 │
├───────────┼─────────────────┼─────────────────┤
│ R_001 │ 7.5 │ 6.0 │
│ R_002 │ 8.0 │ 7.5 │
│ R_003 │ 6.5 │ 8.0 │
```
### Ranking Data
**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.
| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |
```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str │ i64 │ i64 │
├───────────┼──────────────────┼──────────────────┤
│ R_001 │ 1 │ null │ ← V14 was ranked 1st
│ R_002 │ 2 │ 1 │ ← V04 was ranked 1st
│ R_003 │ null │ 3 │ ← V04 was ranked 3rd
```
### ⚠️ Aggregated Data (Cannot Test!)
**What it looks like:** Already summarized/totaled data.
```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │ ← ALREADY AGGREGATED
│ str │ i64 │ Lost individual variance
├───────────┼────────────────┤ Cannot do significance tests!
│ V14 │ 209 │
│ V04 │ 180 │
```
**Solution:** Go back to the raw data before aggregation.
---
## Available Tests
### 1. Mann-Whitney U Test (Default for Continuous)
**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney" # This is the default
)
```
**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure
**Cons:**
- Slightly less powerful than t-test when data IS normally distributed
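For intuition, this is roughly what the helper runs under the hood. A minimal sketch using `scipy.stats.mannwhitneyu` on synthetic rating arrays (the voice names, means, and sample sizes here are made up for illustration):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
v14 = rng.normal(7.5, 1.0, size=120)  # hypothetical 1-10 ratings for V14
v04 = rng.normal(6.8, 1.0, size=120)  # hypothetical 1-10 ratings for V04

# Two-sided test: are the two rating distributions shifted relative to each other?
stat, p = mannwhitneyu(v14, v04, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.2g}")
```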
---
### 2. Independent t-Test
**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="ttest"
)
```
**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported
**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers
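A comparable sketch with scipy. Welch's variant (`equal_var=False`) is a reasonable default since rating groups rarely have identical variances (the data below is synthetic, for illustration only):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(7.5, 1.2, size=200)  # hypothetical ratings, voice A
b = rng.normal(6.8, 1.2, size=200)  # hypothetical ratings, voice B

# Welch's t-test: does not assume equal group variances
t, p = ttest_ind(a, b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2g}")
```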
---
### 3. Chi-Square Test
**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons
```python
pairwise_df, meta = S.compute_pairwise_significance(
count_data,
test_type="chi2"
)
```
**Pros:**
- Designed for count/frequency data
- Tests if distributions differ
**Cons:**
- Needs sufficient sample sizes
- Less informative about direction of difference
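The same test can be run directly with `scipy.stats.chi2_contingency` on a contingency table; a sketch with hypothetical rank-vote counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = voices, columns = Rank 1 / Rank 2 / Rank 3 votes
observed = np.array([
    [60, 45, 30],  # V14
    [35, 40, 50],  # V04
])

# Tests whether the two voices' rank distributions differ
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```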
---
### 4. Two-Proportion Z-Test (For Rankings)
**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`
```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```
**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
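The pooled two-proportion z-test can be written out in a few lines. A sketch (the helper's exact implementation may differ, and the vote counts below are hypothetical):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test (illustrative, not the helper's exact code)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # pooled success proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))                        # two-sided p-value
    return z, p_value

# Hypothetical: V14 got 60/200 Rank-1 votes, V04 got 35/200
z, p = two_proportion_ztest(60, 200, 35, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```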
---
## Multiple Comparison Corrections
### Why Do We Need Corrections?
When you compare many groups, you run many tests. At α = 0.05, each individual test has up to a 5% chance of a false positive when no real difference exists. With 17 voices, that adds up quickly:
| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |
**Corrections adjust p-values to account for this.**
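The numbers in the table follow directly from the pair count:

```python
from math import comb

n_voices = 17
pairs = comb(n_voices, 2)            # unordered pairs of voices
expected_fp = pairs * 0.05           # expected false positives at alpha = 0.05
print(pairs, round(expected_fp, 1))  # 136 pairs, ~7 spurious "significant" results
```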
---
### Bonferroni Correction (Conservative)
**Formula:** `p_adjusted = p_value × number_of_comparisons`
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="bonferroni" # This is the default
)
```
**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)
**Trade-off:** May miss real differences (more false negatives)
---
### Holm-Bonferroni Correction (Less Conservative)
**Formula:** Step-down procedure that's less strict than Bonferroni
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="holm"
)
```
**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly
**Trade-off:** Slightly higher false positive risk than Bonferroni
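To see how the two corrections differ, here is a minimal from-scratch sketch (illustrative only; `compute_pairwise_significance` presumably delegates to a library routine):

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(p * m, 1.0) for p in pvals]

def holm(pvals):
    """Step-down Holm: the k-th smallest p-value is scaled by (m - k)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min((m - rank) * pvals[i], 1.0))
        adjusted[i] = running_max  # keep adjusted p-values monotone
    return adjusted

raw = [0.001, 0.01, 0.02, 0.04]
print(bonferroni(raw))  # ≈ [0.004, 0.04, 0.08, 0.16] → only 2 pass alpha = 0.05
print(holm(raw))        # ≈ [0.004, 0.03, 0.04, 0.04] → all 4 pass alpha = 0.05
```

With the same four raw p-values, Bonferroni keeps two results significant while Holm keeps all four, which is exactly the power difference described above.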
---
### No Correction
**Not recommended for final analysis**, but useful for exploration.
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="none"
)
```
**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate
---
### Correction Method Comparison
| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |
**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.
---
## Interpreting Results
### Key Output Columns
| Column | Meaning |
|--------|---------|
| `p_value` | Raw probability of seeing a difference this large by chance if there is no real effect |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big is the difference (practical significance) |
### What the p-value Means
| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |
### Statistical vs Practical Significance
**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.
**Practical significance** (effect size) means the difference matters in the real world.
| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |
**Example:** A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
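Cohen's d itself is just the mean difference scaled by the pooled standard deviation. A small sketch (the helper may compute effect sizes differently; the numbers are made up):

```python
import numpy as np

def cohens_d(a, b):
    """Mean difference divided by the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(round(cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]), 3))  # -0.632: a medium effect
```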
---
## Code Examples
### Example 1: Voice Scale Ratings
```python
# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)
# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney", # Safe default for ratings
alpha=0.05,
correction="bonferroni"
)
# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")
# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(f"Found {sig_pairs.height} significant pairwise differences")
# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```
### Example 2: Top 3 Voice Rankings
```python
# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)
# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
ranking_data,
alpha=0.05,
correction="holm" # Less conservative for many comparisons
)
# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")
# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")
# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```
### Example 3: Comparing Demographic Subgroups
```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)
S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)
# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```
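For a direct between-group comparison of a single voice, one option is to run a Mann-Whitney U test on the two filtered columns yourself. A hypothetical sketch (the column name `Voice_Scale__V14` follows the pattern shown earlier; the lists stand in for the extracted frame columns):

```python
from scipy.stats import mannwhitneyu

# In practice these would come from the filtered frames, e.g.
#   early = early_pro_data["Voice_Scale__V14"].drop_nulls().to_list()
early = [7, 8, 6, 9, 7, 8, 7, 6, 8, 9]  # hypothetical Early Professional ratings
estab = [6, 5, 7, 6, 5, 6, 7, 5, 6, 6]  # hypothetical Established Professional ratings

stat, p = mannwhitneyu(early, estab, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```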
---
## Common Mistakes to Avoid
### ❌ Using Aggregated Data
```python
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores) # Will fail!
```
### ✅ Use Raw Data
```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```
### ❌ Ignoring Multiple Comparisons
```python
# WRONG - with no correction, ~5% of pairs (about 7 of the 136) will be "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```
### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```
### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}") # Missing context!
```
### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```
---
## Quick Reference Card
| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |
| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |
---
## Further Reading
- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)

2252
plots.py

File diff suppressed because it is too large


@@ -0,0 +1,3 @@
- V46 not in scale 1-10. Qualtrics
- Straightliners
- V45 good in qual but bad in quant


@@ -7,6 +7,7 @@ requires-python = ">=3.12"
dependencies = [
"altair>=6.0.0",
"imagehash>=4.3.1",
"jupyter>=1.1.1",
"marimo>=0.18.0",
"matplotlib>=3.10.8",
"modin[dask]>=0.37.1",
@@ -22,9 +23,14 @@ dependencies = [
"python-pptx>=1.0.2",
"pyzmq>=27.1.0",
"requests>=2.32.5",
"scipy>=1.14.0",
"taguette>=1.5.1",
"tqdm>=4.66.0",
"vl-convert-python>=1.9.0.post1",
"wordcloud>=1.9.5",
]
[project.scripts]
quant-report-batch = "run_filter_combinations:main"

59
reference.py Normal file

@@ -0,0 +1,59 @@
ORIGINAL_CHARACTER_TRAITS = {
"the_familiar_friend": [
"Warm",
"Friendly",
"Approachable",
"Familiar",
"Casual",
"Appreciative",
"Benevolent",
],
"the_coach": [
"Empowering",
"Encouraging",
"Caring",
"Positive",
"Optimistic",
"Guiding",
"Reassuring",
],
"the_personal_assistant": [
"Forward-thinking",
"Progressive",
"Cooperative",
"Intentional",
"Resourceful",
"Attentive",
"Adaptive",
],
"the_bank_teller": [
"Patient",
"Grounded",
"Down-to-earth",
"Stable",
"Formal",
"Balanced",
"Efficient",
]
}
VOICE_GENDER_MAPPING = {
"V14": "Female",
"V04": "Female",
"V08": "Female",
"V77": "Female",
"V48": "Female",
"V82": "Female",
"V89": "Female",
"V91": "Female",
"V34": "Male",
"V69": "Male",
"V45": "Male",
"V46": "Male",
"V54": "Male",
"V74": "Male",
"V81": "Male",
"V86": "Male",
"V88": "Male",
"V16": "Male",
}

306
run_filter_combinations.py Normal file

@@ -0,0 +1,306 @@
#!/usr/bin/env python
"""
Batch runner for quant report with different filter combinations.
Runs 03_quant_report.script.py for each single-filter combination:
- Each age group (with all others active)
- Each gender (with all others active)
- Each ethnicity (with all others active)
- Each income group (with all others active)
- Each consumer segment (with all others active)
Usage:
uv run python run_filter_combinations.py
uv run python run_filter_combinations.py --dry-run # Preview combinations without running
uv run python run_filter_combinations.py --category age # Only run age combinations
uv run python run_filter_combinations.py --category consumer # Only run consumer segment combinations
"""
import subprocess
import sys
import json
from pathlib import Path
from tqdm import tqdm
from utils import QualtricsSurvey
# Default data paths (same as in 03_quant_report.script.py)
RESULTS_FILE = 'data/exports/2-2-26/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
REPORT_SCRIPT = Path(__file__).parent / '03_quant_report.script.py'
def get_filter_combinations(survey: QualtricsSurvey, category: str | None = None) -> list[dict]:
"""
Generate all single-filter combinations.
Each combination isolates ONE filter value while keeping all others at "all selected".
Args:
survey: QualtricsSurvey instance with loaded data
category: Optional filter category to limit combinations to.
Valid values: 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer',
'business_owner', 'ai_user', 'investable_assets', 'industry'
If None or 'all', generates all combinations.
Returns:
List of dicts with filter kwargs for each run.
"""
combinations = []
# Add "All Respondents" run (no filters = all options selected)
if not category or category in ['all_filters', 'all']:
combinations.append({
'name': 'All_Respondents',
'filters': {} # Empty = use defaults (all selected)
})
# Age groups - one at a time
if not category or category in ['all_filters', 'age']:
for age in survey.options_age:
combinations.append({
'name': f'Age-{age}',
'filters': {'age': [age]}
})
# Gender - one at a time
if not category or category in ['all_filters', 'gender']:
for gender in survey.options_gender:
combinations.append({
'name': f'Gender-{gender}',
'filters': {'gender': [gender]}
})
# Ethnicity - grouped by individual values
if not category or category in ['all_filters', 'ethnicity']:
# Ethnicity options are comma-separated (e.g., "White or Caucasian, Hispanic or Latino")
# Create filters that include ALL options containing each individual ethnicity value
ethnicity_values = set()
for ethnicity_option in survey.options_ethnicity:
# Split by comma and strip whitespace
values = [v.strip() for v in ethnicity_option.split(',')]
ethnicity_values.update(values)
for ethnicity_value in sorted(ethnicity_values):
# Find all options that contain this value
matching_options = [
opt for opt in survey.options_ethnicity
if ethnicity_value in [v.strip() for v in opt.split(',')]
]
combinations.append({
'name': f'Ethnicity-{ethnicity_value}',
'filters': {'ethnicity': matching_options}
})
# Income - one at a time
if not category or category in ['all_filters', 'income']:
for income in survey.options_income:
combinations.append({
'name': f'Income-{income}',
'filters': {'income': [income]}
})
# Consumer segments - combine _A and _B options, and also include standalone
if not category or category in ['all_filters', 'consumer']:
# Group options by base name (removing _A/_B suffix)
consumer_groups = {}
for consumer in survey.options_consumer:
# Check if ends with _A or _B
if consumer.endswith('_A') or consumer.endswith('_B'):
base_name = consumer[:-2] # Remove last 2 chars (_A or _B)
if base_name not in consumer_groups:
consumer_groups[base_name] = []
consumer_groups[base_name].append(consumer)
else:
# Not an _A/_B option, keep as-is
consumer_groups[consumer] = [consumer]
# Add combined _A+_B options
for base_name, options in consumer_groups.items():
if len(options) > 1: # Only combine if there are multiple (_A and _B)
combinations.append({
'name': f'Consumer-{base_name}',
'filters': {'consumer': options}
})
# Add standalone options (including individual _A and _B)
for consumer in survey.options_consumer:
combinations.append({
'name': f'Consumer-{consumer}',
'filters': {'consumer': [consumer]}
})
# Business Owner - one at a time
if not category or category in ['all_filters', 'business_owner']:
for business_owner in survey.options_business_owner:
combinations.append({
'name': f'BusinessOwner-{business_owner}',
'filters': {'business_owner': [business_owner]}
})
# AI User - one at a time
if not category or category in ['all_filters', 'ai_user']:
for ai_user in survey.options_ai_user:
combinations.append({
'name': f'AIUser-{ai_user}',
'filters': {'ai_user': [ai_user]}
})
# Combined group: Daily, More than once daily, and Multiple times per week = frequent AI users
combinations.append({
'name': 'AIUser-Frequent',
'filters': {'ai_user': [
'Daily', 'More than once daily', 'Multiple times per week'
]}
})
combinations.append({
'name': 'AIUser-RarelyNever',
'filters': {'ai_user': [
'Once a month', 'Less than once a month', 'Once a week', 'Rarely/Never'
]}
})
# Investable Assets - one at a time
if not category or category in ['all_filters', 'investable_assets']:
for investable_assets in survey.options_investable_assets:
combinations.append({
'name': f'Assets-{investable_assets}',
'filters': {'investable_assets': [investable_assets]}
})
# Industry - one at a time
if not category or category in ['all_filters', 'industry']:
for industry in survey.options_industry:
combinations.append({
'name': f'Industry-{industry}',
'filters': {'industry': [industry]}
})
# Voice ranking completeness filter
# These use a special flag rather than demographic filters, so we store
# the mode in a dedicated key that run_report passes as --voice-ranking-filter.
if not category or category in ['all_filters', 'voice_ranking']:
combinations.append({
'name': 'VoiceRanking-OnlyMissing',
'filters': {},
'voice_ranking_filter': 'only-missing',
})
combinations.append({
'name': 'VoiceRanking-ExcludeMissing',
'filters': {},
'voice_ranking_filter': 'exclude-missing',
})
return combinations
def run_report(filters: dict, name: str | None = None, dry_run: bool = False, sl_threshold: int | None = None, voice_ranking_filter: str | None = None) -> bool:
"""
Run the report script with given filters.
Args:
filters: Dict of filter_name -> list of values
name: Name for this filter combination (used for .txt description file)
dry_run: If True, just print command without running
sl_threshold: If set, exclude respondents with >= N straight-lined question groups
voice_ranking_filter: If set, filter by voice ranking completeness.
'only-missing' keeps only respondents missing QID98 data,
'exclude-missing' removes them.
Returns:
True if successful, False otherwise
"""
cmd = [sys.executable, str(REPORT_SCRIPT)]
# Add filter-name for description file
if name:
cmd.extend(['--filter-name', name])
# Pass straight-liner threshold if specified
if sl_threshold is not None:
cmd.extend(['--sl-threshold', str(sl_threshold)])
# Pass voice ranking filter if specified
if voice_ranking_filter is not None:
cmd.extend(['--voice-ranking-filter', voice_ranking_filter])
for filter_name, values in filters.items():
if values:
cmd.extend([f'--{filter_name}', json.dumps(values)])
if dry_run:
print(f" Would run: {' '.join(cmd)}")
return True
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
cwd=Path(__file__).parent
)
if result.returncode != 0:
print(f"\n ERROR: {result.stderr[:500]}")
return False
return True
except Exception as e:
print(f"\n ERROR: {e}")
return False
def main():
import argparse
parser = argparse.ArgumentParser(description='Run quant report for all filter combinations')
parser.add_argument('--dry-run', action='store_true', help='Preview combinations without running')
parser.add_argument(
'--category',
choices=['all_filters', 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer', 'business_owner', 'ai_user', 'investable_assets', 'industry', 'voice_ranking'],
default='all_filters',
help='Filter category to run combinations for (default: all_filters)'
)
parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (passed to report script)')
args = parser.parse_args()
# Load survey to get available filter options
print("Loading survey to get filter options...")
survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
survey.load_data() # Populates options_* attributes
# Generate combinations for specified category
combinations = get_filter_combinations(survey, category=args.category)
category_desc = f" for category '{args.category}'" if args.category not in ('all', 'all_filters') else ''
print(f"Generated {len(combinations)} filter combinations{category_desc}")
if args.sl_threshold is not None:
print(f"Straight-liner threshold: excluding respondents with ≥{args.sl_threshold} straight-lined question groups")
if args.dry_run:
print("\nDRY RUN - Commands that would be executed:")
for combo in combinations:
print(f"\n{combo['name']}:")
run_report(combo['filters'], name=combo['name'], dry_run=True, sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter'))
return
# Run each combination with progress bar
successful = 0
failed = []
for combo in tqdm(combinations, desc="Running reports", unit="filter"):
tqdm.write(f"Running: {combo['name']}")
if run_report(combo['filters'], name=combo['name'], sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter')):
successful += 1
else:
failed.append(combo['name'])
# Summary
print(f"\n{'='*50}")
print(f"Completed: {successful}/{len(combinations)} successful")
if failed:
print(f"Failed: {', '.join(failed)}")
if __name__ == '__main__':
main()

File diff suppressed because one or more lines are too long


@@ -19,11 +19,32 @@ class ColorPalette:
# Neutral color for unhighlighted comparison items
NEUTRAL = "#D3D3D3" # Light Grey
# Character-specific colors (for individual character plots)
# Each character has a main color and a lighter highlight for original traits
CHARACTER_BANK_TELLER = "#004C6D" # Dark Blue
CHARACTER_BANK_TELLER_HIGHLIGHT = "#669BBC" # Light Steel Blue
CHARACTER_FAMILIAR_FRIEND = "#008493" # Teal
CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT = "#A8DADC" # Pale Cyan
CHARACTER_COACH = "#5AAE95" # Sea Green
CHARACTER_COACH_HIGHLIGHT = "#A8DADC" # Pale Cyan
CHARACTER_PERSONAL_ASSISTANT = "#457B9D" # Steel Blue
CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT = "#669BBC" # Light Steel Blue
# General UI elements
TEXT = "black"
GRID = "lightgray"
BACKGROUND = "white"
# Statistical significance colors (for heatmaps/annotations)
SIG_STRONG = "#004C6D" # p < 0.001 - Dark Blue (highly significant)
SIG_MODERATE = "#0077B6" # p < 0.01 - Medium Blue (significant)
SIG_WEAK = "#5AAE95" # p < 0.05 - Sea Green (marginally significant)
SIG_NONE = "#E8E8E8" # p >= 0.05 - Light Grey (not significant)
SIG_DIAGONAL = "#FFFFFF" # White for diagonal (self-comparison)
# Extended palette for categorical charts (e.g., pie charts with many categories)
CATEGORICAL = [
"#0077B6", # PRIMARY - Medium Blue
@@ -38,6 +59,37 @@ class ColorPalette:
"#457B9D", # Steel Blue
]
# Gender-based colors (Male = Blue tones, Female = Pink tones)
# Primary colors by gender
GENDER_MALE = "#0077B6" # Medium Blue (same as PRIMARY)
GENDER_FEMALE = "#B6007A" # Medium Pink
# Ranking colors by gender (Darkest -> Lightest)
GENDER_MALE_RANK_1 = "#004C6D" # Dark Blue
GENDER_MALE_RANK_2 = "#0077B6" # Medium Blue
GENDER_MALE_RANK_3 = "#669BBC" # Light Steel Blue
GENDER_FEMALE_RANK_1 = "#6D004C" # Dark Pink
GENDER_FEMALE_RANK_2 = "#B6007A" # Medium Pink
GENDER_FEMALE_RANK_3 = "#BC669B" # Light Pink
# Neutral colors by gender (for non-highlighted items)
GENDER_MALE_NEUTRAL = "#B8C9D9" # Grey-Blue
GENDER_FEMALE_NEUTRAL = "#D9B8C9" # Grey-Pink
# Gender colors for correlation plots (green/red indicate +/- correlation)
# Male = darker shade, Female = lighter shade
CORR_MALE_POSITIVE = "#1B5E20" # Dark Green
CORR_FEMALE_POSITIVE = "#81C784" # Light Green
CORR_MALE_NEGATIVE = "#B71C1C" # Dark Red
CORR_FEMALE_NEGATIVE = "#E57373" # Light Red
# Speaking Style Colors (named after the style quadrant colors)
STYLE_GREEN = "#2E7D32" # Forest Green
STYLE_BLUE = "#1565C0" # Strong Blue
STYLE_ORANGE = "#E07A00" # Burnt Orange
STYLE_RED = "#C62828" # Deep Red
def jpmc_altair_theme():
"""JPMC brand theme for Altair charts."""

1347
utils.py

File diff suppressed because it is too large

1455
uv.lock generated

File diff suppressed because it is too large