Compare commits

...

76 Commits

SHA1 Message Date
03a716e8ec correlation matrix speech characteristics vs score 2026-02-10 16:50:47 +01:00
8720bb670d started speech data notebook 2026-02-10 14:58:13 +01:00
9dfab75925 missing data analysis 2026-02-10 14:24:26 +01:00
14e28cf368 stat significance nr times ranked 1st 2026-02-09 18:37:41 +01:00
8e181e193a SL filter 2026-02-09 17:57:04 +01:00
6c16993cb3 straight-liner plot analysis 2026-02-09 17:26:45 +01:00
92c6fc03ab docs datasets 2026-02-09 13:17:59 +01:00
7fb6570190 statistical significance 2026-02-05 19:49:19 +01:00
840bd2940d other top bc's 2026-02-05 11:50:00 +01:00
af9a15ccb0 renamed notebooks and added significance test 2026-02-05 10:14:53 +01:00
a3cf9f103d update plots with final data release 2026-02-04 21:15:03 +01:00
f0eab32c34 update alt-text with full filepaths 2026-02-04 17:48:48 +01:00
d231fc02db fix missing filter descr in correlation plots 2026-02-04 14:48:14 +01:00
fc76bb0ab5 voice gender split correlation plots 2026-02-04 13:44:51 +01:00
ab78276a97 male/female voices in separate plots for correlations 2026-02-04 12:35:24 +01:00
e17646eb70 correlation plots for best bc 2026-02-04 10:46:31 +01:00
ad1d8c6e58 all plots offline update 2026-02-03 22:38:15 +01:00
f5b4c247b8 tidy plots 2026-02-03 22:12:17 +01:00
a35670aa72 fixed missing ai_user category 2026-02-03 21:13:29 +01:00
36280a6ff8 fix sample size 2026-02-03 20:48:34 +01:00
9a587dcc4c add ai-user filter combinations 2026-02-03 19:46:07 +01:00
9a49d1c690 added sample size to filter text 2026-02-03 19:16:39 +01:00
8f505da550 offline update 18-30 2026-02-03 18:43:20 +01:00
495b56307c fixed filter to none 2026-02-03 18:19:06 +01:00
1e76a82f24 fix wordcloud filter values 2026-02-03 17:41:12 +01:00
01b7d50637 fixed empty plots, updated filters 2026-02-03 16:51:24 +01:00
dca9ac11ba supposed wordcloud fix, but everything broke 2026-02-03 15:36:35 +01:00
081fb0dd6e added 6 more filters 2026-02-03 15:20:01 +01:00
2817ed240a automatic generation of all plots with all combinations 2026-02-03 15:03:57 +01:00
e44251c3d6 fixed consumer and ethnicity filter combinations 2026-02-03 14:43:03 +01:00
8dd41dfc96 Start automation of running filter combinations 2026-02-03 14:33:09 +01:00
840cb4e6dc exported marimo to script form 2026-02-03 13:48:05 +01:00
a162701e94 move cell for better running 2026-02-03 02:22:06 +01:00
38f6d8a87c fixed for basic plots, filter active 2026-02-03 02:19:47 +01:00
5c39bbb23a images tagged 2026-02-03 02:05:29 +01:00
190e4fbdc4 finished correlation plots and generic voice plots 2026-02-03 01:59:26 +01:00
2408d06098 base correlations 2026-02-03 01:32:06 +01:00
1dce4db909 voice gender plots done 2026-02-03 01:03:29 +01:00
acf9c45844 male/female colored plots 2026-02-03 00:40:51 +01:00
77fdd6e8f6 fixed voices plot order 2026-02-03 00:20:56 +01:00
426495ebe3 generic voice plots 2026-02-03 00:15:10 +01:00
a7ee854ed0 voice plots 2026-02-03 00:12:18 +01:00
97c4b07208 added filter disabled broken cells and starting spoken voice generic results 2026-02-02 23:32:10 +01:00
fd14038253 comment out 'per subgroup' since these just take too long to create 2026-02-02 23:22:09 +01:00
611fc8d19a added var split_group 2026-02-02 23:15:05 +01:00
3ac330263f BC results per consumer 2026-02-02 23:04:40 +01:00
bda4d54231 split consumer groups best character 2026-02-02 22:05:56 +01:00
f2c659c266 statistical tests 2026-02-02 21:47:37 +01:00
29df6a4bd9 og traits 2026-02-02 18:37:45 +01:00
a62524c6e4 update plot agent with explicit things not to do 2026-02-02 18:26:23 +01:00
43b41a01f5 plot creator agent 2026-02-02 17:57:19 +01:00
b7cf6adfb8 fix ppt update images 2026-02-02 17:36:32 +01:00
6ba30ff041 add copilot instructions and rename classes 2026-02-02 17:21:57 +01:00
02a0214539 fixed plot alt-text-tag function 2026-02-02 17:07:44 +01:00
45dd121d90 wordcloud 2026-02-02 11:12:53 +01:00
d770645d8e demographics section done 2026-02-02 09:04:29 +01:00
6b3fcb2f43 report layout 2026-01-29 22:38:31 +01:00
036dd911df fixed normalization functions 2026-01-29 21:53:58 +01:00
becc435d3c drop voice46 from scales 1-10. fix plots breakline in title 2026-01-29 21:10:56 +01:00
8aee09f968 SL validation complete 2026-01-29 20:39:16 +01:00
c1729d4896 straightliner verification for SS questions 2026-01-29 19:57:29 +01:00
2958fed780 straightliner validation 2026-01-29 18:40:18 +01:00
5f9e67a312 rename example notebooks and finish ppt pipeline functions 2026-01-29 16:07:55 +01:00
3ee25f9e33 ppt function to replace images 2026-01-29 15:36:34 +01:00
bc12df28a5 straight line fn dev 2026-01-29 13:20:32 +01:00
70719702ec added data selector ui element 2026-01-29 12:16:09 +01:00
36d8bc4d88 fixed saving png issue 2026-01-29 10:43:05 +01:00
0485f991d2 migration to altair plots 2026-01-28 21:01:39 +01:00
3f929d93fd altair migration plan 2026-01-28 18:17:45 +01:00
62e75fe899 saving plots to subdirectories grouped by filter 2026-01-28 15:58:38 +01:00
365e70b834 move plots to mixin class of JPMCSurvey to simplify file saving 2026-01-28 14:54:36 +01:00
23136b5c2e save figures to directory 2026-01-27 18:35:09 +01:00
fd4cb4b596 correlation start 2026-01-27 17:22:16 +01:00
393c527656 filters 2026-01-23 15:05:35 +01:00
0f5ecf5ac7 speaking style trait scores 2026-01-23 12:39:12 +01:00
84a0f8052e speaking style trait scores vertical 2026-01-23 12:26:47 +01:00
38 changed files with 20095 additions and 1142 deletions

.github/agents/plot-creator.agent.md vendored Normal file

@@ -0,0 +1,216 @@
# Plot Creator Agent
You are a specialized agent for creating data visualizations for the Voice Branding Qualtrics survey analysis project.
## ⚠️ Critical Data Handling Rules
1. **NEVER assume or load datasets without explicit user consent** - This is confidential data
2. **NEVER guess file paths or dataset locations**
3. **DO NOT assume data comes from a `Survey.get_*()` method** - Data may have been manually manipulated in a notebook
4. **Use ONLY the data snippet provided by the user** for understanding structure and testing
## Your Workflow
When the user provides a plotting request (e.g., "I need a bar plot that shows the frequency of the times each trait is chosen per brand character"), follow this workflow:
### Step 1: Understand the Request
- Parse the user's natural language request to identify:
  - **Chart type** (bar, stacked bar, line, heatmap, etc.)
  - **X-axis variable**
  - **Y-axis variable / aggregation** (count, mean, sum, etc.)
  - **Grouping / color encoding** (if any)
  - **Filtering requirements** (if any)
- Think critically about whether the requested plot is feasible with the available data.
- Think critically about the best way to visualize the requested information, and if the requested chart type is appropriate. If not, consider alternatives and ask the user for confirmation before proceeding.
### Step 2: Analyze Provided Data
The user will paste a `df.head()` output. Examine:
- Column names and their meaning (refer to column naming conventions in `.github/copilot-instructions.md`)
- Data types
- Whether the data is in the right shape for the desired plot
**Important:** Do NOT make assumptions about where this data came from. It may be:
- Output from a `Survey.get_*()` method
- Manually transformed in a notebook
- A join of multiple data sources
- Any other custom manipulation
### Step 3: Determine Data Manipulation Needs
Decide if the provided data can be plotted directly, or if transformations are needed:
- **No manipulation**: Data is ready → proceed to Step 5
- **Manipulation needed**: Aggregation, pivoting, melting, filtering, or new computed columns required → proceed to Step 4
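The "manipulation needed" case is most often a reshape of wide survey columns into long form. A minimal stdlib sketch of the idea (the real transforms use Polars, e.g. `unpivot`; the column names and data here are hypothetical):

```python
# Illustrative only: the kind of wide -> long reshape a transform
# method would wrap. Column names below are hypothetical examples
# following the project's naming convention, not real data.
wide_rows = [
    {"_recordId": "R1", "Voice_Scale_1_10__V48": 7, "Voice_Scale_1_10__V50": 4},
    {"_recordId": "R2", "Voice_Scale_1_10__V48": 9, "Voice_Scale_1_10__V50": None},
]

# One (record, voice, score) row per answered voice column
long_rows = [
    {"_recordId": row["_recordId"], "voice_col": col, "score": val}
    for row in wide_rows
    for col, val in row.items()
    if col != "_recordId" and val is not None  # drop unanswered voices
]
```

If the provided data already looks like `long_rows`, no transform function is needed and you can go straight to Step 5.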
### Step 4: Create Data Manipulation Function (if needed)
Check whether `utils.py` already has a `transform_<descriptive_name>` function that performs the needed data manipulation. If not, create a dedicated method on the `QualtricsSurvey` class in `utils.py`:
```python
def transform_<descriptive_name>(self, df: pl.LazyFrame | pl.DataFrame) -> tuple[pl.LazyFrame, dict | None]:
    """Transform <input_description> to <output_description>.

    Original use-case: "<paste user's original question here>"

    This function <concise 1-2 sentence explanation of what it does>.

    Args:
        df: Pre-fetched data as a Polars LazyFrame or DataFrame.

    Returns:
        tuple: (LazyFrame with columns [...], Optional metadata dict)
    """
    # Implementation - transform the INPUT data only
    # NEVER call self.get_*() methods here
    return result, metadata
```
**Requirements:**
- **NEVER retrieve data inside transform functions** - The function receives already-fetched data as input
- Data retrieval (`get_*()` calls) stays in the notebook so analysts can see all steps
- Method must return `(pl.LazyFrame, Optional[dict])` tuple
- Docstring MUST contain the original question verbatim
- Follow the existing patterns of `QualtricsSurvey` class methods in `utils.py`
**❌ BAD Example (do NOT do this):**
```python
def transform_character_trait_frequency(self, q: pl.LazyFrame):
    # BAD: Fetching data inside transform function
    char_df, _ = self.get_character_refine(q)  # ← WRONG!
    # ... rest of transform
```
**✅ GOOD Example:**
```python
def transform_character_trait_frequency(self, char_df: pl.LazyFrame | pl.DataFrame):
    # GOOD: Receives pre-fetched data as input
    if isinstance(char_df, pl.LazyFrame):
        char_df = char_df.collect()
    # ... rest of transform
```
**In the notebook, the analyst writes:**
```python
char_data, _ = S.get_character_refine(data) # Step visible to analyst
trait_freq, _ = S.transform_character_trait_frequency(char_data) # Transform step
chart = S.plot_character_trait_frequency(trait_freq)
```
### Step 5: Create Temporary Test File
Create `debug_plot_temp.py` for testing. **Prefer using the data snippet already provided by the user.**
**Option A: Use provided data snippet (preferred)**
If the user provided a `df.head()` or sample data output, create inline test data from it:
```python
"""Temporary test file for <plot_name>.
Delete after testing.
"""
import polars as pl
from theme import ColorPalette
import altair as alt
# ============================================================
# TEST DATA (reconstructed from user's df.head() output)
# ============================================================
test_data = pl.DataFrame({
"Column1": ["value1", "value2", ...],
"Column2": [1, 2, ...],
# ... recreate structure from provided sample
})
# ============================================================
# Test the plot function
from plots import QualtricsPlotsMixin
# ... test code
```
**Option B: Ask user (only if necessary)**
Only ask the user for additional code if:
- The provided sample is insufficient to test the plot logic
- You need to understand complex data relationships not visible in the sample
- The transformation requires understanding the full data pipeline
If you must ask:
> "The sample data you provided should work for basic testing. However, I need [specific reason]. Could you provide:
> 1. [specific information needed]
>
> If you'd prefer, I can proceed with a minimal test using the sample data you shared."
### Step 6: Create Plot Function
Add a new method to `QualtricsPlotsMixin` in `plots.py`:
```python
def plot_<descriptive_name>(
    self,
    data: pl.LazyFrame | pl.DataFrame | None = None,
    title: str = "<Default title>",
    x_label: str = "<X label>",
    y_label: str = "<Y label>",
    height: int | None = None,
    width: int | str | None = None,
) -> alt.Chart:
    """<Docstring with original question and description>."""
    df = self._ensure_dataframe(data)
    # Build chart using ONLY ColorPalette from theme.py
    chart = alt.Chart(...).mark_bar(color=ColorPalette.PRIMARY)...
    chart = self._save_plot(chart, title)
    return chart
```
**Requirements:**
- ALL colors MUST use `ColorPalette` constants from `theme.py`
- Use `self._ensure_dataframe()` to handle LazyFrame/DataFrame
- Use `self._save_plot()` at the end to enable auto-save
- Use `self._process_title()` for titles with `<br>` tags
- Follow existing plot patterns (see `plot_average_scores_with_counts`, `plot_top3_ranking_distribution`)
### Step 7: Test
Run the temporary test file to verify the plot works:
```bash
uv run python debug_plot_temp.py
```
### Step 8: Provide Summary
After successful completion, output a summary:
```
✅ Plot created successfully!
**Data function** (if created): `S.transform_<name>(data)`
**Plot function**: `S.plot_<name>(data, title="...")`
**Usage example:**
```python
# Assuming you have your data already prepared as `plot_data`
chart = S.plot_<name>(plot_data, title="Your Title Here")
chart # Display in Marimo
```
**Files modified:**
- `utils.py` - Added `transform_<name>()` (if applicable)
- `plots.py` - Added `plot_<name>()`
- `debug_plot_temp.py` - Test file (can be deleted)
```
## Critical Rules (from .github/copilot-instructions.md)
1. **NEVER load confidential data without explicit user-provided code**
2. **NEVER assume data source** - do not guess which `get_*()` method produced the data
3. **NEVER modify Marimo notebooks** (`0X_*.py` files)
4. **NEVER run Marimo notebooks for debugging**
5. **ALL colors MUST come from `theme.py`** - use `ColorPalette.PRIMARY`, `ColorPalette.RANK_1`, etc.
6. **If a new color is needed**, add it to `ColorPalette` in `theme.py` first
7. **No changelog markdown files** - do not create new .md files documenting changes
8. **Reading notebooks is OK** to understand function usage patterns
9. **Getter methods return tuples**: `(LazyFrame, Optional[metadata])`
10. **Use Polars LazyFrames** until visualization, then `.collect()`
If any rule causes problems, ask the user for permission before deviating.
## Reference: Column Patterns
- `SS_Green_Blue__V14__Choice_1` → Speaking Style trait score
- `Voice_Scale_1_10__V48` → 1-10 voice rating
- `Top_3_Voices_ranking__V77` → Ranking position
- `Character_Ranking_<Name>` → Character personality ranking

.github/copilot-instructions.md vendored Normal file

@@ -0,0 +1,105 @@
# Voice Branding Quantitative Analysis - Copilot Instructions
## Project Overview
Qualtrics survey analysis for brand personality research. Analyzes voice samples (V04-V91) across speaking style traits, character rankings, and demographic segments. Uses **Marimo notebooks** for interactive analysis and **Polars** for data processing.
## Architecture
### Core Components
- **`QualtricsSurvey`** (`utils.py`): Main class combining data loading, filtering, and plotting via `QualtricsPlotsMixin`
- **Marimo notebooks** (`0X_*.py`): Interactive apps run via `uv run marimo run <file>.py`
- **Data exports** (`data/exports/<date>/`): Qualtrics CSVs with `_Labels.csv` and `_Values.csv` variants
- **QSF files**: Qualtrics survey definitions for mapping QIDs to question text
### Data Flow
```
Qualtrics CSV (3-row header) → QualtricsSurvey.load_data() → LazyFrame with QID columns
filter_data() → get_*() methods → plot_*() methods → figures/<export>/<filter>/
```
## ⚠️ Critical AI Agent Rules
1. **NEVER modify Marimo notebooks directly** - The `0X_*.py` files are Marimo notebooks and should not be edited by AI agents
2. **NEVER run Marimo notebooks for debugging** - These are interactive apps, not test scripts
3. **For debugging**: Create a standalone temporary Python script (e.g., `debug_temp.py`) to test functions
4. **Reading notebooks is OK** - You may read notebook files to understand how functions are used. Ask the user which notebook they're working in for context
5. **No changelog markdown files** - Do not create new markdown files to document small changes or describe new usage
## Key Patterns
### Polars LazyFrames
Always work with `pl.LazyFrame` until visualization; call `.collect()` only when needed:
```python
data = S.load_data() # Returns LazyFrame
subset, meta = S.get_voice_scale_1_10(data) # Returns (LazyFrame, Optional[dict])
df = subset.collect() # Materialize for plotting
```
### Column Naming Convention
Survey columns follow patterns that encode voice/trait info:
- `SS_Green_Blue__V14__Choice_1` → Speaking Style, Voice 14, Trait 1
- `Voice_Scale_1_10__V48` → 1-10 rating for Voice 48
- `Top_3_Voices_ranking__V77` → Ranking position for Voice 77
### Filter State & Figure Output
`QualtricsSurvey` stores filter state and auto-generates output paths:
```python
S.filter_data(data, consumer=['Early Professional'])
# Plots save to: figures/<export>/Cons-Early_Professional/<plot_name>.png
```
### Getter Methods Return Tuples
All `get_*()` methods return `(LazyFrame, Optional[metadata])`:
```python
df, choices_map = S.get_ss_green_blue(data) # choices_map has trait descriptions
df, _ = S.get_character_ranking(data) # Second element may be None
```
## Development Commands
```bash
# Run interactive analysis notebook
uv run marimo run 02_quant_analysis.py --port 8080
# Edit notebook in editor mode
uv run marimo edit 02_quant_analysis.py
# Headless mode for shared access
uv run marimo run 02_quant_analysis.py --headless --port 8080
```
## Important Files
| File | Purpose |
|------|---------|
| `utils.py` | `QualtricsSurvey` class, data transformations, PPTX utilities |
| `plots.py` | `QualtricsPlotsMixin` with all Altair plotting methods |
| `theme.py` | `ColorPalette` and `jpmc_altair_theme()` for consistent styling |
| `validation.py` | Data quality checks (progress, duration outliers, straight-liners) |
| `speaking_styles.py` | `SPEAKING_STYLES` dict mapping colors to trait groups |
## Conventions
### Altair Charts & Colors
- **ALL colors MUST come from `theme.py`** - Use `ColorPalette.PRIMARY`, `ColorPalette.RANK_1`, etc.
- If a new color is needed, add it to `ColorPalette` in `theme.py` first, then use it
- Never hardcode hex colors directly in plotting code
- Charts auto-save via `_save_plot()` when `fig_save_dir` is set
- Filter footnotes added automatically via `_add_filter_footnote()`
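The "add to `ColorPalette` first" convention can be sketched as follows (a hypothetical minimal shape — the real class and hex values live in `theme.py`, and only `PRIMARY` and `RANK_1` are names these instructions actually mention):

```python
# Hypothetical sketch of the ColorPalette convention in theme.py.
# Hex values below are placeholders, NOT the project's real colors.
class ColorPalette:
    PRIMARY = "#117ACA"  # placeholder hex
    RANK_1 = "#1B6AC6"   # placeholder hex
    # New colors get added HERE first, then referenced in plots.py:
    #   alt.Chart(df).mark_bar(color=ColorPalette.PRIMARY)
```

Centralizing colors this way means restyling every chart is a one-file change and no hex strings leak into `plots.py`.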
### QSF Parsing
Use `_get_qsf_question_by_QID()` to extract question config:
```python
cfg = self._get_qsf_question_by_QID('QID27')['Payload']
recode_map = cfg['RecodeValues'] # Maps choice numbers to values
```
### PPTX Image Replacement
Images matched by perceptual hash (not filename); alt-text encodes figure path:
```python
utils.update_ppt_alt_text(ppt_path, image_source_dir) # Tag images with alt-text
utils.pptx_replace_named_image(ppt, target_tag, new_image) # Replace by alt-text
```
This process should ONLY be run manually by the user.

.gitignore vendored

@@ -15,3 +15,4 @@ data/
docker-volumes/
logs/
figures/

.vscode/extensions.json vendored Normal file

@@ -0,0 +1,5 @@
{
  "recommendations": [
    "wakatime.vscode-wakatime"
  ]
}

.vscode/settings.json vendored Normal file

@@ -0,0 +1,5 @@
{
  "chat.tools.terminal.autoApprove": {
    "/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/.venv/bin/python": true
  }
}


@@ -12,15 +12,24 @@ def _():
import plotly as plt
from pathlib import Path
from utils import extract_qid_descr_map
return Path, extract_qid_descr_map, mo, pd
import utils
return Path, mo, pd, utils
@app.cell
def _(Path):
# results_file = Path('data/exports/OneDrive_1_1-16-2026/JPMC_Chase Brand Personality_Quant Round 1_TestData_Labels.csv')
results_file = Path('data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Brand Personality_Quant Round 1_January 21, 2026_Soft Launch_Labels.csv')
return (results_file,)
# results_file = Path('data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Brand Personality_Quant Round 1_January 21, 2026_Soft Launch_Labels.csv')
results_file = Path('data/exports/1-23-26/JPMC_Chase Brand Personality_Quant Round 1_January 23, 2026_Labels.csv')
qsf_file = 'data/exports/OneDrive_1_1-16-2026/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
return qsf_file, results_file
@app.cell
def _(qsf_file, results_file, utils):
survey = utils.QualtricsSurvey(results_file, qsf_file)
data_all = survey.load_data()
return (survey,)
@app.cell
@@ -33,8 +42,8 @@ def _(mo):
@app.cell
def _(extract_qid_descr_map, results_file):
qid_descr_map = extract_qid_descr_map(results_file)
def _(survey):
qid_descr_map = survey.qid_descr_map
qid_descr_map
return (qid_descr_map,)


@@ -1,7 +1,7 @@
import marimo
__generated_with = "0.19.2"
app = marimo.App(width="medium")
__generated_with = "0.19.7"
app = marimo.App(width="full")
@app.cell
@@ -10,254 +10,514 @@ def _():
import polars as pl
from pathlib import Path
from validation import check_progress, duration_validation
from utils import JPMCSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
from plots import plot_average_scores_with_counts, plot_top3_ranking_distribution, plot_character_ranking_distribution, plot_most_ranked_1_character, plot_weighted_ranking_score
from validation import check_progress, duration_validation, check_straight_liners
from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
import utils
from speaking_styles import SPEAKING_STYLES
return (
JPMCSurvey,
Path,
QualtricsSurvey,
SPEAKING_STYLES,
calculate_weighted_ranking_scores,
check_progress,
check_straight_liners,
duration_validation,
mo,
plot_average_scores_with_counts,
plot_character_ranking_distribution,
plot_most_ranked_1_character,
plot_top3_ranking_distribution,
plot_weighted_ranking_score,
pl,
utils,
)
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
# Load Data
""")
return
file_browser = mo.ui.file_browser(
initial_path="./data/exports", multiple=False, restrict_navigation=True, filetypes=[".csv"], label="Select 'Labels' File"
)
file_browser
return (file_browser,)
@app.cell
def _(Path, mo):
RESULTS_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Brand Personality_Quant Round 1_January 21, 2026_Soft Launch_Labels.csv'
def _(Path, file_browser, mo):
mo.stop(file_browser.path(index=0) is None, mo.md("**⚠️ Please select a `_Labels.csv` file above to proceed**"))
# RESULTS_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Brand Personality_Quant Round 1_January 21, 2026_Soft Launch_Labels.csv'
RESULTS_FILE = Path(file_browser.path(index=0))
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
mo.md(f"**Dataset:** `{Path(RESULTS_FILE).name}`")
# RESULTS_FILE
return QSF_FILE, RESULTS_FILE
@app.cell
def _(JPMCSurvey, QSF_FILE, RESULTS_FILE):
survey = JPMCSurvey(RESULTS_FILE, QSF_FILE)
data_all = survey.load_data()
data_all.collect()
return data_all, survey
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Data Validation
""")
return
def _(QSF_FILE, QualtricsSurvey, RESULTS_FILE, mo):
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
try:
data_all = S.load_data()
except NotImplementedError as e:
mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))
return S, data_all
@app.cell
def _(check_progress, data_all):
check_progress(data_all)
return
@app.cell
def _(data_all, duration_validation):
duration_validation(data_all)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
### ToDo: "straight-liner" detection and removal
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
def _(Path, RESULTS_FILE, data_all, mo):
mo.md(f"""
---
# Load Data
# Data Filter
**Dataset:** `{Path(RESULTS_FILE).name}`
Use to select a subset of the data for the following analysis
**Responses**: `{data_all.collect().shape[0]}`
""")
return
@app.cell
def _(data_all, survey):
data = survey.filter_data(data_all, age=None, gender=None, income=None, ethnicity=None, consumer=None)
return (data,)
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
---
# Analysis
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Character personality ranking
### 1. Which character personality is ranked best?
""")
return
@app.cell
def _(data, survey):
char_rank = survey.get_character_ranking(data)[0].collect()
return (char_rank,)
@app.cell
def _(char_rank, plot_character_ranking_distribution):
plot_character_ranking_distribution(char_rank, x_label='Character Personality', width=1000)
return
@app.cell
def _(mo):
mo.md(r"""
### 2. Which character personality is ranked number 1 the most?
""")
return
def _():
sl_ss_max_score = 5
sl_v1_10_max_score = 10
return sl_ss_max_score, sl_v1_10_max_score
@app.cell
def _(
calculate_weighted_ranking_scores,
char_rank,
plot_weighted_ranking_score,
S,
check_progress,
check_straight_liners,
data_all,
duration_validation,
mo,
sl_ss_max_score,
sl_v1_10_max_score,
):
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
plot_weighted_ranking_score(char_rank_weighted, x_label='Voice', width=1000)
_ss_all = S.get_ss_green_blue(data_all)[0].join(S.get_ss_orange_red(data_all)[0], on='_recordId')
_sl_ss_c, sl_ss_df = check_straight_liners(_ss_all, max_score=sl_ss_max_score)
_sl_v1_10_c, sl_v1_10_df = check_straight_liners(
S.get_voice_scale_1_10(data_all)[0],
max_score=sl_v1_10_max_score
)
mo.md(f"""
# Data Validation
{check_progress(data_all)}
{duration_validation(data_all)}
## Speaking Style - Straight Liners
{_sl_ss_c}
## Voice Score Scale 1-10 - Straight Liners
{_sl_v1_10_c}
""")
return
@app.cell
def _(char_rank, plot_most_ranked_1_character):
plot_most_ranked_1_character(char_rank, x_label='Character Personality', width=1000)
def _(data_all):
# # Drop any Voice Scale 1-10 responses with straight-lining, using sl_v1_10_df _responseId values
# records_to_drop = sl_v1_10_df.select('Record ID').to_series().to_list()
# data_validated = data_all.filter(~pl.col('_recordId').is_in(records_to_drop))
# mo.md(f"""
# Dropped `{len(records_to_drop)}` responses with straight-lining in Voice Scale 1-10 evaluation.
# """)
data_validated = data_all
return (data_validated,)
@app.cell(hide_code=True)
def _(S, mo):
filter_form = mo.md('''
{age}
{gender}
{ethnicity}
{income}
{consumer}
'''
).batch(
age=mo.ui.multiselect(options=S.options_age, value=S.options_age, label="Select Age Group(s):"),
gender=mo.ui.multiselect(options=S.options_gender, value=S.options_gender, label="Select Gender(s):"),
ethnicity=mo.ui.multiselect(options=S.options_ethnicity, value=S.options_ethnicity, label="Select Ethnicities:"),
income=mo.ui.multiselect(options=S.options_income, value=S.options_income, label="Select Income Group(s):"),
consumer=mo.ui.multiselect(options=S.options_consumer, value=S.options_consumer, label="Select Consumer Groups:")
).form()
mo.md(f'''
---
# Data Filter
{filter_form}
''')
return
@app.cell
def _(data_validated):
# mo.stop(filter_form.value is None, mo.md("**Please submit filter above to proceed**"))
# _d = S.filter_data(data_validated, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])
# # Stop execution and prevent other cells from running if no data is selected
# mo.stop(len(_d.collect()) == 0, mo.md("**No Data available for current filter combination**"))
# data = _d
data = data_validated
data.collect()
return (data,)
@app.cell(hide_code=True)
def _(S, data, mo):
char_rank = S.get_character_ranking(data)[0]
mo.md(r"""
---
# Analysis
## Character personality ranking
""")
return (char_rank,)
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
""")
return
@app.cell(hide_code=True)
def _():
# char_rank = S.get_character_ranking(data)[0]
return
@app.cell
def _(S, char_rank, mo):
mo.md(f"""
### 1. Which character personality is ranked best?
{mo.ui.altair_chart(S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality'))}
""")
return
@app.cell
def _(S, char_rank, mo):
mo.md(f"""
### 2. Which character personality is ranked 1st the most?
{mo.ui.altair_chart(S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality', width=1000))}
""")
return
@app.cell
def _(S, calculate_weighted_ranking_scores, char_rank, mo):
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
mo.md(f"""
### 3. Which character personality is most popular based on weighted scores?
{mo.ui.altair_chart(S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Voice', width=1000))}
""")
return
@app.cell(hide_code=True)
def _(S, data, mo):
v_18_8_3 = S.get_18_8_3(data)[0].collect()
mo.md(r"""
## Voice Ranking
""")
return
return (v_18_8_3,)
@app.cell
def _(data, survey):
v_18_8_3 = survey.get_18_8_3(data)[0].collect()
print(v_18_8_3.head())
@app.cell(hide_code=True)
def _():
# print(v_18_8_3.head())
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Which 8 voices are chosen the most out of 18?
def _(S, mo, v_18_8_3):
mo.md(f"""
### Which 8 voices are chosen the most out of 18?
{mo.ui.altair_chart(S.plot_voice_selection_counts(v_18_8_3, height=500, width=1000))}
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Which 3 voices are chosen the most out of 18? How many times does each voice end up in the top 3? (This is based on the survey question where participants choose 3 out of the earlier selected 8 voices, i.e. how often each of the 18 stimuli ended up in participants' Top 3 after they first selected 8 out of 18.)
""")
return
def _(S, mo, v_18_8_3):
mo.md(f"""
### Which 3 voices are chosen the most out of 18?
How many times does each voice end up in the top 3? (This is based on the survey question where participants choose 3 out of the earlier selected 8 voices, i.e. how often each of the 18 stimuli ended up in participants' Top 3 after they first selected 8 out of 18.)
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Which voice is ranked best in the ranking question for top 3? (so not the best 3 out of 8 question)
- E.g. 1 point for place 3. 2 points for place 2 and 3 points for place 1. The voice with most points is ranked best.
{mo.ui.altair_chart(S.plot_top3_selection_counts(v_18_8_3, height=500, width=1000))}
""")
return
@app.cell
def _(plot_top3_ranking_distribution, top3_voices):
plot_top3_ranking_distribution(top3_voices, x_label='Voice', width=1000)
return
def _(S, calculate_weighted_ranking_scores, data):
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
return top3_voices, top3_voices_weighted
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Which voice is ranked number 1 the most? (not always the voice with most points)
@app.cell
def _(S, mo, top3_voices):
mo.md(f"""
### Which voice is ranked best in the ranking question for top 3?
- Each of the 350 participants gives exactly one 1st-place vote.
- Total Rank-1 votes = 350.
- Voices are sorted from most to least 1st-place votes.
- The top 3 voices with the most Rank-1 votes are colored blue.
- This can differ from the points-based winners (3-2-1 point totals), because a voice may receive many 2nd/3rd places but fewer 1st places.
(not the best 3 out of 8 question)
{mo.ui.altair_chart(S.plot_ranking_distribution(top3_voices, x_label='Voice', width=1000))}
""")
return
@app.cell
def _(S, mo, top3_voices_weighted):
mo.md(f"""
### Most popular **voice** based on weighted scores?
- E.g. 1 point for place 3. 2 points for place 2 and 3 points for place 1. The voice with most points is ranked best.
Distribution of the rankings for each voice:
{mo.ui.altair_chart(S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", height=500, width=1000))}
""")
return
@app.cell
def _(S, mo, top3_voices):
mo.md(f"""
### Which voice is ranked number 1 the most?
(not always the voice with most points)
{mo.ui.altair_chart(S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', width=1000))}
""")
return
@app.cell
def _():
return
@app.cell(hide_code=True)
def _(mo):
def _(S, data, mo, utils):
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)
mo.md(r"""
## Voice Speaking Style - Perception Traits
Here you can find the speaking styles and traits: [Speaking Style Traits Quantitative test design.docx](https://voicebranding-my.sharepoint.com/:w:/g/personal/phoebe_voicebranding_ai/IQBfM_Z8PF98Qalz4lzIbJ3RAUCdc7waB32HZXCj7k3xfo0?e=rtFd27)
""")
return choice_map, ss_all, ss_long
@app.cell
def _(S, mo, pl, ss_long):
content = """### How does each voice score for each “speaking style labeled trait”?"""
for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
trait_d = ss_long.filter(pl.col("Description") == trait)
content += f"""
### {i+1}) {trait.replace(":", "")}
{mo.ui.altair_chart(S.plot_speaking_style_trait_scores(trait_d, title=trait.replace(":", ""), height=550))}
"""
mo.md(content)
return
@app.cell
def _():
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
How does each voice score for each “speaking style labeled trait”? Here you can find the speaking styles and traits: [Speaking Style Traits Quantitative test design.docx](https://voicebranding-my.sharepoint.com/:w:/g/personal/phoebe_voicebranding_ai/IQBfM_Z8PF98Qalz4lzIbJ3RAUCdc7waB32HZXCj7k3xfo0?e=rtFd27)
- There are 4 speaking styles: Green, Blue, Orange, Red.
- There are 16 traits distributed across the 4 speaking styles.
""")
return
@app.cell(hide_code=True)
def _(S, data, mo):
vscales = S.get_voice_scale_1_10(data)[0]
# plot_average_scores_with_counts(vscales, x_label='Voice', width=1000)
mo.md(r"""
## Voice Scale 1-10
""")
return (vscales,)
@app.cell
def _(vscales):
print(vscales.collect().head())
return
@app.cell
def _(pl, vscales):
# Count non-null values per row
nn_vscale = vscales.with_columns(
non_null_count = pl.sum_horizontal(pl.all().exclude("_recordId").is_not_null())
)
nn_vscale.collect()['non_null_count'].describe()
return
@app.cell(hide_code=True)
def _(S, mo, vscales):
mo.md(f"""
### How does each voice score on a scale from 1-10?
{mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10)"))}
""")
return
@app.cell
def _(S, mo, utils, vscales):
_target_cols=[c for c in vscales.collect().columns if c not in ['_recordId']]
vscales_row_norm = utils.normalize_row_values(vscales.collect(), target_cols=_target_cols)
mo.md(f"""
### Voice scale 1-10 normalized per respondent
{mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales_row_norm, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10) - Normalized per Respondent"))}
""")
return
@app.cell
def _(S, mo, utils, vscales):
_target_cols=[c for c in vscales.collect().columns if c not in ['_recordId']]
vscales_global_norm = utils.normalize_global_values(vscales.collect(), target_cols=_target_cols)
mo.md(f"""
### Voice scale 1-10 normalized across all respondents
{mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales_global_norm, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10) - Normalized Across All Respondents"))}
""")
return
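`utils.normalize_row_values` and `utils.normalize_global_values` are project utilities whose internals are not shown here. Assuming plain min-max rescaling to the 1-10 domain, the difference between the two cells above can be sketched as:

```python
# Hypothetical scores: rows = respondents, columns = voices (scale 1-10).
rows = [
    [2, 4, 6],   # a "low scorer"
    [6, 8, 10],  # a "high scorer"
]

def minmax(values, lo, hi, new_lo=1.0, new_hi=10.0):
    """Rescale values from [lo, hi] to [new_lo, new_hi]."""
    return [new_lo + (v - lo) * (new_hi - new_lo) / (hi - lo) for v in values]

# Per respondent: each row uses its own min/max, removing individual
# response-style differences (harsh vs generous raters).
per_respondent = [minmax(r, min(r), max(r)) for r in rows]

# Global: one min/max over all responses, preserving level differences
# between respondents.
flat = [v for r in rows for v in r]
global_norm = [minmax(r, min(flat), max(flat)) for r in rows]
```

After per-respondent normalization both rows become identical; after global normalization the "high scorer" still scores higher.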
@app.cell
def _(choice_map, mo, ss_all, utils, vscales):
df_style = utils.process_speaking_style_data(ss_all, choice_map)
df_voice_long = utils.process_voice_scale_data(vscales)
joined_df = df_style.join(df_voice_long, on=["_recordId", "Voice"], how="inner")
# df_voice_long
mo.md(r"""
## Correlations Voice Speaking Styles <-> Voice Scale 1-10
Let's show how scoring better on these speaking styles correlates (or not) with a better Voice Scale 1-10 evaluation. For each speaking style we show how its traits correlate with the Voice Scale 1-10 evaluation. This gives us a total of 4 correlation diagrams.
Example for speaking style green:
- Trait 1: Friendly | Conversational | Down-to-earth
- Trait 2: Approachable | Familiar | Warm
- Trait 3: Optimistic | Benevolent | Positive | Appreciative
### How to Interpret These Correlation Results
Each bar represents the Pearson correlation coefficient (r) between a speaking style trait rating (1-5 scale) and the overall Voice Scale rating (1-10).
**Reading the Chart**
| Correlation Value | Interpretation |
|-----------|----------|
| r > 0 (Green bars)| Positive correlation — voices rated higher on this trait tend to receive higher Voice Scale scores|
| r < 0 (Red bars)| Negative correlation — voices rated higher on this trait tend to receive lower Voice Scale scores|
| r ≈ 0| No relationship — this trait doesn't predict Voice Scale ratings|
""")
return df_style, joined_df
@app.cell(hide_code=True)
def _(S, SPEAKING_STYLES, joined_df, mo):
_content = """### Total Results
"""
for style, traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
fig = S.plot_speaking_style_correlation(
data=joined_df,
style_color=style,
style_traits=traits,
title=f"Correlation: Speaking Style {style} and Voice Scale 1-10"
)
_content += f"""
#### Speaking Style **{style}**:
{mo.ui.altair_chart(fig)}
"""
mo.md(_content)
return
@app.cell
def _(mo):
mo.md(r"""
### Female / Male Voices considered separately
- [ ] 4 correlation diagrams considering each speaking style (4) and all female voice results.
- [ ] 4 correlation diagrams considering each speaking style (4) and all male voice results.
## Correlations Voice Speaking Styles <-> Voice Ranking Points
Let's show how scoring better on these speaking styles correlates (or not) with better Voice Ranking results. For each speaking style we show how its traits correlate with voice ranking points. This gives us a total of 4 correlation diagrams.
Example for speaking style green:
- Trait 1: Friendly | Conversational | Down-to-earth
- Trait 2: Approachable | Familiar | Warm
- Trait 3: Optimistic | Benevolent | Positive | Appreciative
### Total Results
- [ ] 4 correlation diagrams
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Correlations Voice Speaking Styles <-> Voice Scale 1-10
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Let's show how scoring better on these speaking styles correlates (or not) with a better Voice Scale 1-10 evaluation. For each speaking style we show how its traits correlate with the Voice Scale 1-10 evaluation. This gives us a total of 4 correlation diagrams.
Example for speaking style green:
- Trait 1: Friendly | Conversational | Down-to-earth
- Trait 2: Approachable | Familiar | Warm
- Trait 3: Optimistic | Benevolent | Positive | Appreciative
""")
return
@app.cell(hide_code=True)
def _(S, SPEAKING_STYLES, df_style, mo, top3_voices, utils):
df_ranking = utils.process_voice_ranking_data(top3_voices)
joined = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
_content = """## Correlations Voice Speaking Styles <-> Voice Ranking Points
"""
for _style, _traits in SPEAKING_STYLES.items():
_fig = S.plot_speaking_style_ranking_correlation(data=joined, style_color=_style, style_traits=_traits)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Correlations Voice Speaking Styles <-> Voice Ranking Points
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
Let's show how scoring better on these speaking styles correlates (or not) with better Voice Ranking results. For each speaking style we show how its traits correlate with voice ranking points. This gives us a total of 4 correlation diagrams.
Example for speaking style green:
- Trait 1: Friendly | Conversational | Down-to-earth
- Trait 2: Approachable | Familiar | Warm
- Trait 3: Optimistic | Benevolent | Positive | Appreciative
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
### Total Results
- [ ] 4 correlation diagrams
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
### Female / Male Voices considered separately
- [ ] 4 correlation diagrams considering each speaking style (4) and all female voice results.
- [ ] 4 correlation diagrams considering each speaking style (4) and all male voice results.
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Correlation Heatmap all evaluations <-> voice acoustic data
- [ ] Heatmap for male voices
- [ ] Heatmap for female voices
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Most Prominent Character Personality Traits
""")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
The last question of the survey is about traits for the described character's personality. For each Character personality, we want to display the 8 most chosen character personality traits. This will give us a total of 4 diagrams, one for each character personality included in the test.
- [ ] Bank Teller
- [ ] Familiar Friend
- [ ] The Coach
- [ ] Personal Assistant
""")
return
@app.cell
def _(mo):
mo.md(r"""
---
# Results per subgroup
Use the dropdown selector at the top to filter the data and generate all the plots again
""")
return
if __name__ == "__main__":
app.run()

03_quant_report.py Normal file
@@ -0,0 +1,933 @@
import marimo
__generated_with = "0.19.7"
app = marimo.App(width="full")
with app.setup:
import marimo as mo
import polars as pl
from pathlib import Path
from validation import check_progress, duration_validation, check_straight_liners
from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
import utils
from speaking_styles import SPEAKING_STYLES
@app.cell
def _():
file_browser = mo.ui.file_browser(
initial_path="./data/exports", multiple=False, restrict_navigation=True, filetypes=[".csv"], label="Select 'Labels' File"
)
file_browser
return (file_browser,)
@app.cell
def _(file_browser):
mo.stop(file_browser.path(index=0) is None, mo.md("**⚠️ Please select a `_Labels.csv` file above to proceed**"))
RESULTS_FILE = Path(file_browser.path(index=0))
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
return QSF_FILE, RESULTS_FILE
@app.cell
def _(QSF_FILE, RESULTS_FILE):
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
try:
data_all = S.load_data()
except NotImplementedError as e:
mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))
return S, data_all
@app.cell(hide_code=True)
def _(RESULTS_FILE, data_all):
mo.md(rf"""
---
# Load Data
**Dataset:** {Path(RESULTS_FILE).name}
**Responses**: {data_all.collect().shape[0]}
""")
return
@app.cell
def _(S, data_all):
sl_ss_max_score = 5
sl_v1_10_max_score = 10
_ss_all = S.get_ss_green_blue(data_all)[0].join(S.get_ss_orange_red(data_all)[0], on='_recordId')
_sl_ss_c, sl_ss_df = check_straight_liners(_ss_all, max_score=sl_ss_max_score)
_sl_v1_10_c, sl_v1_10_df = check_straight_liners(
S.get_voice_scale_1_10(data_all)[0],
max_score=sl_v1_10_max_score
)
mo.md(f"""
{check_progress(data_all)}
{duration_validation(data_all)}
## Speaking Style - Straight Liners
{_sl_ss_c}
## Voice Score Scale 1-10 - Straight Liners
{_sl_v1_10_c}
""")
return
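`check_straight_liners` is project code; a hedged sketch of the underlying idea — a straight-liner gives the same answer to every item of a matrix question — with hypothetical data:

```python
# One dict per respondent: answers to the items of a matrix question.
responses = [
    {"v1": 5, "v2": 5, "v3": 5},  # straight-liner: identical answers
    {"v1": 2, "v2": 5, "v3": 3},  # normal variation
]

def is_straight_liner(answers):
    # Ignore missing answers; flag rows where all given answers are equal.
    vals = [v for v in answers.values() if v is not None]
    return len(vals) > 1 and len(set(vals)) == 1

flagged = [r for r in responses if is_straight_liner(r)]  # only the first row
```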
@app.cell
def _(data_all):
# # Drop any Voice Scale 1-10 responses with straight-lining, using sl_v1_10_df _responseId values
# records_to_drop = sl_v1_10_df.select('Record ID').to_series().to_list()
# data_validated = data_all.filter(~pl.col('_recordId').is_in(records_to_drop))
# mo.md(f"""
# Dropped `{len(records_to_drop)}` responses with straight-lining in Voice Scale 1-10 evaluation.
# """)
data_validated = data_all
return (data_validated,)
@app.cell
def _():
return
@app.cell(hide_code=True)
def _():
#
return
@app.cell
def _():
mo.md(r"""
## Lucia's confirmation of missing 'Consumer' data
""")
return
@app.cell
def _(S, data_validated):
demographics = S.get_demographics(data_validated)[0].collect()
# demographics
return (demographics,)
@app.cell(hide_code=True)
def _(demographics):
# Demographics where 'Consumer' is null
demographics_no_consumer = demographics.filter(pl.col('Consumer').is_null())['_recordId'].to_list()
# demographics_no_consumer
return (demographics_no_consumer,)
@app.cell
def _(data_all, demographics_no_consumer):
# check if the responses with missing 'Consumer type' in demographics are all business owners as Lucia mentioned
assert all(data_all.filter(pl.col('_recordId').is_in(demographics_no_consumer)).collect()['QID4'] == 'Yes'), "Not all respondents with missing 'Consumer' are business owners."
return
@app.cell
def _():
mo.md(r"""
# Filter Data (Global corrections)
""")
return
@app.cell
def _():
BEST_CHOSEN_CHARACTER = "the_coach"
return (BEST_CHOSEN_CHARACTER,)
@app.cell
def _(S):
filter_form = mo.md('''
{age}
{gender}
{ethnicity}
{income}
{consumer}
'''
).batch(
age=mo.ui.multiselect(options=S.options_age, value=S.options_age, label="Select Age Group(s):"),
gender=mo.ui.multiselect(options=S.options_gender, value=S.options_gender, label="Select Gender(s):"),
ethnicity=mo.ui.multiselect(options=S.options_ethnicity, value=S.options_ethnicity, label="Select Ethnicities:"),
income=mo.ui.multiselect(options=S.options_income, value=S.options_income, label="Select Income Group(s):"),
consumer=mo.ui.multiselect(options=S.options_consumer, value=S.options_consumer, label="Select Consumer Groups:")
).form()
mo.md(f'''
---
# Data Filter
{filter_form}
''')
return (filter_form,)
@app.cell
def _(S, data_validated, filter_form):
mo.stop(filter_form.value is None, mo.md("**Please submit filter above to proceed**"))
_d = S.filter_data(data_validated, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])
# Stop execution and prevent other cells from running if no data is selected
mo.stop(len(_d.collect()) == 0, mo.md("**No Data available for current filter combination**"))
data = _d
# data = data_validated
data.collect()
return (data,)
@app.cell
def _():
return
@app.cell
def _():
# Check if all business owners are missing a 'Consumer type' in demographics
# assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]) , "Not all business owners are missing 'Consumer type' in demographics."
return
@app.cell
def _():
mo.md(r"""
# Demographic Distributions
""")
return
@app.cell
def _():
demo_plot_cols = [
'Age',
'Gender',
# 'Race/Ethnicity',
'Bussiness_Owner',
'Consumer'
]
return (demo_plot_cols,)
@app.cell
def _(S, data, demo_plot_cols):
_content = """
"""
for c in demo_plot_cols:
_fig = S.plot_demographic_distribution(
data=S.get_demographics(data)[0],
column=c,
title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
)
_content += f"""{mo.ui.altair_chart(_fig)}\n\n"""
mo.md(_content)
return
@app.cell
def _():
mo.md(r"""
---
# Brand Character Results
""")
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Best performing: Original vs Refined frankenstein
""")
return
@app.cell(disabled=True)
def _(S, data):
char_refine_rank = S.get_character_refine(data)[0]
# print(char_rank.collect().head())
print(char_refine_rank.collect().head())
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Character ranking points
""")
return
@app.cell
def _(S, char_rank):
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Character Personality')
return
@app.cell
def _():
mo.md(r"""
## Character ranking 1-2-3
""")
return
@app.cell
def _(S, data):
char_rank = S.get_character_ranking(data)[0]
return (char_rank,)
@app.cell
def _(S, char_rank):
S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality', title='Character Personality: Rankings Top 3')
return
@app.cell
def _():
mo.md(r"""
### Statistical Significance Character Ranking
""")
return
@app.cell(disabled=True)
def _(S, char_rank):
_pairwise_df, _meta = S.compute_ranking_significance(char_rank)
# print(_pairwise_df.columns)
mo.md(f"""
{mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}
{mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
""")
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Character Ranking: times 1st place
""")
return
@app.cell
def _(S, char_rank):
S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality')
return
@app.cell
def _():
mo.md(r"""
## Prominent predefined personality traits wordcloud
""")
return
@app.cell
def _(S, data):
top8_traits = S.get_top_8_traits(data)[0]
S.plot_traits_wordcloud(
data=top8_traits,
column='Top_8_Traits',
title="Most Prominent Personality Traits",
)
return
@app.cell
def _():
mo.md(r"""
## Trait frequency per brand character
""")
return
@app.cell
def _(S, data):
char_df = S.get_character_refine(data)[0]
return (char_df,)
@app.cell
def _(S, char_df):
from theme import ColorPalette
# Assuming you already have char_df (your data from get_character_refine or similar)
characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
character_colors = {
'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
}
# Build consistent sort order (by total frequency across all characters)
all_trait_counts = {}
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
for row in freq_df.iter_rows(named=True):
all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']
consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])
_content = """"""
# Generate 4 plots (one per character)
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
main_color, highlight_color = character_colors[char]
chart = S.plot_single_character_trait_frequency(
data=freq_df,
character_name=char,
bar_color=main_color,
highlight_color=highlight_color,
trait_sort_order=consistent_sort_order,
)
_content += f"""
{mo.ui.altair_chart(chart)}
"""
mo.md(_content)
return
@app.cell(disabled=True)
def _():
mo.md(r"""
## Statistical significance best characters
see chat
> example: if no. 1 and no. 2 don't differ significantly from each other but both differ from no. 3, for instance, that's also great. Just thinking along a bit about how I can present it, you know what I mean? :)
>
""")
return
@app.cell(disabled=True)
def _():
return
@app.cell
def _():
return
@app.cell
def _():
mo.md(r"""
---
# Spoken Voice Results
""")
return
@app.cell
def _():
COLOR_GENDER = True
return (COLOR_GENDER,)
@app.cell
def _():
mo.md(r"""
## Top 8 Most Chosen out of 18
""")
return
@app.cell
def _(S, data):
v_18_8_3 = S.get_18_8_3(data)[0]
return (v_18_8_3,)
@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Top 3 most chosen out of 8
""")
return
@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Voice Ranking Weighted Score
""")
return
@app.cell
def _(S, data):
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
return top3_voices, top3_voices_weighted
@app.cell
def _(COLOR_GENDER, S, top3_voices_weighted):
S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Which voice is ranked best in the ranking question for top 3?
(not best 3 out of 8 question)
""")
return
@app.cell
def _(COLOR_GENDER, S, top3_voices):
S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
### Statistical significance for voice ranking
""")
return
@app.cell
def _():
# print(top3_voices.collect().head())
return
@app.cell
def _():
# _pairwise_df, _metadata = S.compute_ranking_significance(
# top3_voices,alpha=0.05,correction="none")
# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# _pairwise_df,
# metadata=_metadata,
# title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# _pairwise_df,
# metadata=_metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
return
@app.cell
def _():
## Voice Ranked 1st the most
return
@app.cell
def _(COLOR_GENDER, S, top3_voices):
S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
return
@app.cell
def _():
mo.md(r"""
## Voice Scale 1-10
""")
return
@app.cell
def _(COLOR_GENDER, S, data):
# Get your voice scale data (from notebook)
voice_1_10, _ = S.get_voice_scale_1_10(data)
S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1,10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)
return (voice_1_10,)
@app.cell(disabled=True)
def _():
mo.md(r"""
### Statistical Significance (Scale 1-10)
""")
return
@app.cell(disabled=True)
def _(S, voice_1_10):
# Compute pairwise significance tests
pairwise_df, metadata = S.compute_pairwise_significance(
voice_1_10,
test_type="mannwhitney", # or "ttest", "chi2", "auto"
alpha=0.05,
correction="bonferroni" # or "holm", "none"
)
# View significant pairs
# print(pairwise_df.filter(pl.col('significant') == True))
# Create heatmap visualization
_heatmap = S.plot_significance_heatmap(
pairwise_df,
metadata=metadata,
title="Voice Rating Significance<br>(Pairwise Comparisons)"
)
# Create summary bar chart
_summary = S.plot_significance_summary(
pairwise_df,
metadata=metadata
)
mo.md(f"""
{mo.ui.altair_chart(_heatmap)}
{mo.ui.altair_chart(_summary)}
""")
return
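The (disabled) significance cell above passes `correction="bonferroni"` to `compute_pairwise_significance`. As a minimal sketch with made-up p-values, the Bonferroni correction simply scales each raw p-value by the number of pairwise comparisons before testing against alpha:

```python
p_values = [0.001, 0.02, 0.04]  # hypothetical raw pairwise p-values
m = len(p_values)               # number of comparisons
alpha = 0.05

# Multiply by m (capped at 1.0), then apply the usual alpha threshold.
adjusted = [min(p * m, 1.0) for p in p_values]
significant = [p < alpha for p in adjusted]  # only the first pair survives
```

This is why a pair that looks significant at raw p = 0.02 can drop out once all pairwise comparisons are corrected for.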
@app.cell
def _():
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Ranking points for Voice per Chosen Brand Character
**missing mapping**
""")
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Correlation Speaking Styles
""")
return
@app.cell
def _(S, data, top3_voices):
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)
df_style = utils.process_speaking_style_data(ss_all, choice_map)
vscales = S.get_voice_scale_1_10(data)[0]
df_scale_long = utils.process_voice_scale_data(vscales)
joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")
df_ranking = utils.process_voice_ranking_data(top3_voices)
joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
return joined_ranking, joined_scale
@app.cell
def _(joined_ranking):
joined_ranking.head()
return
@app.cell
def _():
mo.md(r"""
### Colors vs Scale 1-10
""")
return
@app.cell
def _(S, joined_scale):
# Transform to get one row per color with average correlation
color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
data=color_corr_scale,
title="Correlation: Speaking Style Colors and Voice Scale 1-10"
)
return
@app.cell
def _():
mo.md(r"""
### Colors vs Ranking Points
""")
return
@app.cell
def _(S, joined_ranking):
color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
SPEAKING_STYLES,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=color_corr_ranking,
title="Correlation: Speaking Style Colors and Voice Ranking Points"
)
return
@app.cell
def _():
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
return
@app.cell
def _(S, joined_scale):
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Ranking Points
""")
return
@app.cell
def _(S, joined_ranking):
_content = """"""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
## Correlations when "Best Brand Character" is chosen
Select only the traits that fit with that character
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER):
from reference import ORIGINAL_CHARACTER_TRAITS
chosen_bc_traits = ORIGINAL_CHARACTER_TRAITS[BEST_CHOSEN_CHARACTER]
return (chosen_bc_traits,)
@app.cell
def _(chosen_bc_traits):
STYLES_SUBSET = utils.filter_speaking_styles(SPEAKING_STYLES, chosen_bc_traits)
return (STYLES_SUBSET,)
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Ranking Points
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
_content = ""
for _style, _traits in STYLES_SUBSET.items():
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Ranking Points"""
)
_content += f"""
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
_content = """"""
for _style, _traits in STYLES_SUBSET.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Scale 1-10""",
)
_content += f"""
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Colors vs Scale 1-10 (Best Character)
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
# Transform to get one row per color with average correlation
_color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, STYLES_SUBSET)
S.plot_speaking_style_color_correlation(
data=_color_corr_scale,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Scale 1-10"""
)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
### Colors vs Ranking Points (Best Character)
""")
return
@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
_color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
STYLES_SUBSET,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=_color_corr_ranking,
title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Ranking Points"""
)
return
if __name__ == "__main__":
app.run()

04_PPTX_Update_Images.py Normal file

@@ -0,0 +1,74 @@
import marimo
__generated_with = "0.19.7"
app = marimo.App(width="medium")
with app.setup:
import marimo as mo
from pathlib import Path
import utils
@app.cell
def _():
mo.md(r"""
# Tag existing images with Alt-Text
Based on image content
""")
return
@app.cell
def _():
return
@app.cell
def _():
TAG_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
# TAG_TARGET = Path('data/reports/Perception-Research-Report_2-2_tagged.pptx')
TAG_IMAGE_DIR = Path('figures/debug')
return TAG_IMAGE_DIR, TAG_SOURCE
@app.cell
def _(TAG_IMAGE_DIR, TAG_SOURCE):
utils.update_ppt_alt_text(
ppt_path=TAG_SOURCE,
image_source_dir=TAG_IMAGE_DIR,
# output_path=TAG_TARGET
)
return
@app.cell(hide_code=True)
def _():
mo.md(r"""
# Replace Images using Alt-Text
""")
return
@app.cell
def _():
REPLACE_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
# REPLACE_TARGET = Path('data/reports/Perception-Research-Report_2-2_updated.pptx')
NEW_IMAGES_DIR = Path('figures/2-4-26')
return NEW_IMAGES_DIR, REPLACE_SOURCE
@app.cell
def _(NEW_IMAGES_DIR, REPLACE_SOURCE):
# get all files in the image source directory and subdirectories
results = utils.pptx_replace_images_from_directory(
REPLACE_SOURCE, # Source presentation path,
NEW_IMAGES_DIR, # Source directory with new images
# REPLACE_TARGET # Output path (optional, defaults to overwrite)
)
return
if __name__ == "__main__":
app.run()


@@ -10,16 +10,14 @@ def _():
import polars as pl
from pathlib import Path
from utils import QualtricsSurvey, combine_exclusive_columns
return QualtricsSurvey, combine_exclusive_columns, mo, pl
@app.cell
def _(mo):
mo.outline()
return
@app.cell
@@ -31,8 +29,8 @@ def _():
@app.cell
def _(QualtricsSurvey, QSF_FILE, RESULTS_FILE):
survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
data = survey.load_data()
data.collect()
return data, survey
@@ -44,14 +42,6 @@ def _(survey):
return
@app.cell
def _(mo):
mo.md(r"""
@@ -66,11 +56,10 @@ def _(data, mo, pl):
def check_progress(data):
if data.collect().select(pl.col('progress').unique()).shape[0] == 1:
return mo.md("""## ✅ All responses are complete (progress = 100) """)
return mo.md("## ⚠️ There are incomplete responses (progress < 100) ⚠️")
check_progress(data)
return
@@ -87,11 +76,11 @@ def _(data, mo, pl):
std_duration = duration_stats['std_duration'][0]
upper_outlier_threshold = mean_duration + 3 * std_duration
lower_outlier_threshold = mean_duration - 3 * std_duration
_d = data.with_columns(
((pl.col('duration') > upper_outlier_threshold) | (pl.col('duration') < lower_outlier_threshold)).alias('outlier_duration')
)
# Show durations with outlier flag is true
outlier_data = _d.filter(pl.col('outlier_duration') == True).collect()
@@ -105,16 +94,16 @@ def _(data, mo, pl):
- Upper Outlier Threshold (Mean + 3*Std): {upper_outlier_threshold:.2f} seconds
- Lower Outlier Threshold (Mean - 3*Std): {lower_outlier_threshold:.2f} seconds
- Number of Outlier Responses: {outlier_data.shape[0]}
Outliers:
{mo.ui.table(outlier_data)}
**NOTE: These have not been removed from the dataset**
""")
duration_validation(data)
return
@@ -208,7 +197,7 @@ def _(mo):
@app.cell
def _(data, survey):
vscales = survey.get_voice_scale_1_10(data)[0].collect()
vscales
print(vscales.head())
return (vscales,)
@@ -229,10 +218,18 @@ def _(mo):
@app.cell
def _(data, survey):
_lf, _choice_map = survey.get_ss_green_blue(data)
# _lf.collect()
print(_lf.collect().head())
return
@app.cell
def _(df):
df
return
@app.cell
def _(mo):
mo.md(r"""
@@ -297,7 +294,6 @@ def _(data, survey):
traits_refined = survey.get_character_refine(data)[0]
traits_refined.collect()
return (traits_refined,)


@@ -0,0 +1,73 @@
import marimo
__generated_with = "0.19.2"
app = marimo.App(width="medium")
with app.setup:
import marimo as mo
from pathlib import Path
import utils
@app.cell
def _():
mo.md(r"""
# Tag existing images with Alt-Text
Based on image content
""")
return
@app.cell
def _():
TAG_SOURCE = Path('data/test_tag_source.pptx')
TAG_TARGET = Path('data/test_tag_target.pptx')
TAG_IMAGE_DIR = Path('figures/OneDrive_2026-01-28/')
return TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET
@app.cell
def _(TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET):
utils.update_ppt_alt_text(ppt_path=TAG_SOURCE, image_source_dir=TAG_IMAGE_DIR, output_path=TAG_TARGET)
return
@app.cell
def _():
return
@app.cell
def _():
mo.md(r"""
# Replace Images using Alt-Text
""")
return
@app.cell
def _():
REPLACE_SOURCE = Path('data/test_replace_source.pptx')
REPLACE_TARGET = Path('data/test_replace_target.pptx')
return REPLACE_SOURCE, REPLACE_TARGET
@app.cell
def _():
IMAGE_FILE = Path('figures/OneDrive_2026-01-28/Cons-Early_Professional/cold_distant_approachable_familiar_warm.png')
return (IMAGE_FILE,)
@app.cell
def _(IMAGE_FILE, REPLACE_SOURCE, REPLACE_TARGET):
utils.pptx_replace_named_image(
presentation_path=REPLACE_SOURCE,
target_tag=utils.image_alt_text_generator(IMAGE_FILE),
new_image_path=IMAGE_FILE,
save_path=REPLACE_TARGET)
return
if __name__ == "__main__":
app.run()

README.md

@@ -0,0 +1,239 @@
# Voice Branding Quantitative Analysis
## Running Marimo Notebooks
Running on Ct-105 for shared access:
```bash
uv run marimo run 02_quant_analysis.py --headless --port 8080
```
---
## Batch Report Generation
The quant report can be run with different filter combinations via CLI or automated batch processing.
### Single Filter Run (CLI)
Run the report script directly with JSON-encoded filter arguments:
```bash
# Single consumer segment
uv run python 03_quant_report.script.py --consumer '["Starter"]'
# Single age group
uv run python 03_quant_report.script.py --age '["18 to 21 years"]'
# Multiple filters combined
uv run python 03_quant_report.script.py --age '["18 to 21 years", "22 to 24 years"]' --gender '["Male"]'
# All respondents (no filters = defaults to all options selected)
uv run python 03_quant_report.script.py
```
Available filter arguments:
- `--age` — JSON list of age groups
- `--gender` — JSON list of genders
- `--ethnicity` — JSON list of ethnicities
- `--income` — JSON list of income groups
- `--consumer` — JSON list of consumer segments
### Batch Runner (All Combinations)
Run all single-filter combinations automatically with progress tracking:
```bash
# Preview all combinations without running
uv run python run_filter_combinations.py --dry-run
# Run all combinations (shows progress bar)
uv run python run_filter_combinations.py
# Or use the registered CLI entry point
uv run quant-report-batch
uv run quant-report-batch --dry-run
```
This generates reports for:
- All Respondents (no filters)
- Each age group individually
- Each gender individually
- Each ethnicity individually
- Each income group individually
- Each consumer segment individually
Output figures are saved to `figures/<export_date>/<filter_slug>/`.
### Jupyter Notebook Debugging
The script auto-detects Jupyter/IPython environments. When running in VS Code's Jupyter extension, CLI args default to `None` (all options selected), so you can debug cell-by-cell normally.
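The detection boils down to probing for IPython's `get_ipython()` global, which only exists inside a kernel. A minimal, self-contained sketch of the pattern (the names and the two illustrative filters here are made up; the real logic lives in `parse_cli_args()`):

```python
import argparse

def running_in_ipython() -> bool:
    """True inside a Jupyter/IPython kernel, False in a plain `python` run."""
    try:
        get_ipython()  # noqa: F821 -- injected as a builtin by IPython only
        return True
    except NameError:
        return False

if running_in_ipython():
    # Notebook/debug mode: skip argparse, treat every filter as "all options"
    args = argparse.Namespace(age=None, gender=None)
else:
    parser = argparse.ArgumentParser()
    parser.add_argument('--age', type=str, default=None)
    parser.add_argument('--gender', type=str, default=None)
    args = parser.parse_args([])  # empty argv for this demo

print(args.age)  # None either way when no flags are passed
```

Because both branches leave unset filters as `None`, downstream code never needs to know which environment it is running in.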
---
## Adding Custom Filter Combinations
To add new filter combinations to the batch runner, edit `run_filter_combinations.py`:
### Checklist
1. **Open** `run_filter_combinations.py`
2. **Find** the `get_filter_combinations()` function
3. **Add** your combination to the list before the `return` statement:
```python
# Example: Add a specific age + consumer cross-filter
combinations.append({
'name': 'Age-18to24_Consumer-Starter', # Used for output folder naming
'filters': {
'age': ['18 to 21 years', '22 to 24 years'],
'consumer': ['Starter']
}
})
```
4. **Filter keys** must match CLI argument names (defined in `FILTER_CONFIG` in `03_quant_report.script.py`):
- `age` — values from `survey.options_age`
- `gender` — values from `survey.options_gender`
- `ethnicity` — values from `survey.options_ethnicity`
- `income` — values from `survey.options_income`
- `consumer` — values from `survey.options_consumer`
5. **Check available values** by running:
```python
from utils import QualtricsSurvey
S = QualtricsSurvey('data/exports/2-2-26/...Labels.csv', 'data/exports/.../....qsf')
S.load_data()
print(S.options_age)
print(S.options_consumer)
# etc.
```
6. **Test** with dry-run first:
```bash
uv run python run_filter_combinations.py --dry-run
```
### Example: Adding Multiple Cross-Filters
```python
# In get_filter_combinations(), before return:
# Young professionals
combinations.append({
'name': 'Young_Professionals',
'filters': {
'age': ['22 to 24 years', '25 to 34 years'],
'consumer': ['Early Professional']
}
})
# High income males
combinations.append({
'name': 'High_Income_Male',
'filters': {
'income': ['$150,000 - $199,999', '$200,000 or more'],
'gender': ['Male']
}
})
```
### Notes
- **Empty filters dict** = all respondents (no filtering)
- **Omitted filter keys** = all options for that dimension selected
- **Output folder names** are auto-generated from active filters by `QualtricsSurvey.filter_data()`
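A tiny sketch of the merge semantics behind those notes (the dimension list is an illustrative subset; `filter_data()` is assumed, as in the project code, to skip any argument that is `None`):

```python
# All filter dimensions the runner knows about (illustrative subset)
DIMENSIONS = ['age', 'gender', 'income', 'consumer']

combo = {
    'name': 'Age-18to24_Consumer-Starter',
    'filters': {
        'age': ['18 to 21 years', '22 to 24 years'],
        'consumer': ['Starter'],
    },
}

# Omitted keys resolve to None == "no filter" == all options selected
active = {dim: combo['filters'].get(dim) for dim in DIMENSIONS}
print(active['consumer'])  # ['Starter']
print(active['gender'])    # None -> every gender kept
```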
---
## Adding a New Filter Dimension
To add an entirely new filter dimension (e.g., a new demographic question), you need to update several files:
### Checklist
1. **Update `utils.py` — `QualtricsSurvey.__init__()`** to initialize the filter state attribute:
```python
# In __init__(), add after existing filter_ attributes (around line 758):
self.filter_region:list = None # QID99
```
2. **Update `utils.py` — `load_data()`** to populate the `options_*` attribute:
```python
# In load_data(), add after existing options:
self.options_region = sorted(df['QID99'].drop_nulls().unique().to_list()) if 'QID99' in df.columns else []
```
3. **Update `utils.py` — `filter_data()`** to accept and apply the filter:
```python
# Add parameter to function signature:
def filter_data(self, q: pl.LazyFrame, ..., region:list=None) -> pl.LazyFrame:
# Add filter logic in function body:
self.filter_region = region
if region is not None:
q = q.filter(pl.col('QID99').is_in(region))
```
4. **Update `plots.py` — `_get_filter_slug()`** to include the filter in directory slugs:
```python
# Add to the filters list:
('region', 'Reg', getattr(self, 'filter_region', None), 'options_region'),
```
5. **Update `plots.py` — `_get_filter_description()`** for human-readable descriptions:
```python
# Add to the filters list:
('Region', getattr(self, 'filter_region', None), 'options_region'),
```
6. **Update `03_quant_report.script.py` — `FILTER_CONFIG`**:
```python
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
# ... existing filters ...
'region': 'options_region', # ← New filter
}
```
This **automatically**:
- Adds `--region` CLI argument
- Includes it in Jupyter mode (defaults to all options)
- Passes it to `S.filter_data()`
- Writes it to the `.txt` filter description file
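A self-contained sketch of why a single `FILTER_CONFIG` entry is enough to grow a new CLI flag, mirroring the loop in `parse_cli_args()` (the `--region` values below are made up):

```python
import argparse
import json

FILTER_CONFIG = {
    'age': 'options_age',
    'region': 'options_region',  # <- the one new entry
}

parser = argparse.ArgumentParser()
for name in FILTER_CONFIG:
    # Each config key becomes a --<name> flag taking a JSON-encoded list
    parser.add_argument(f'--{name}', type=str, default=None,
                        help=f'JSON list of {name} values')

args = parser.parse_args(['--region', '["Northeast", "Midwest"]'])
for name in FILTER_CONFIG:
    raw = getattr(args, name)
    setattr(args, name, json.loads(raw) if raw else None)

print(args.region)  # ['Northeast', 'Midwest']
print(args.age)     # None: flag omitted, so all options stay selected
```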
7. **Update `run_filter_combinations.py`** to generate combinations (optional):
```python
# Add after existing filter loops:
for region in survey.options_region:
combinations.append({
'name': f'Region-{region}',
'filters': {'region': [region]}
})
```
### Currently Available Filters
| CLI Argument | Options Attribute | QID Column | Description |
|--------------|-------------------|------------|-------------|
| `--age` | `options_age` | QID1 | Age groups |
| `--gender` | `options_gender` | QID2 | Gender |
| `--ethnicity` | `options_ethnicity` | QID3 | Ethnicity |
| `--income` | `options_income` | QID15 | Income brackets |
| `--consumer` | `options_consumer` | Consumer | Consumer segments |
| `--business_owner` | `options_business_owner` | QID4 | Business owner status |
| `--employment_status` | `options_employment_status` | QID13 | Employment status |
| `--personal_products` | `options_personal_products` | QID14 | Personal products |
| `--ai_user` | `options_ai_user` | QID22 | AI user status |
| `--investable_assets` | `options_investable_assets` | QID16 | Investable assets |
| `--industry` | `options_industry` | QID17 | Industry |


@@ -0,0 +1,263 @@
"""Extra analyses of the traits"""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
# Use the same default as argparse
default_fig_dir = f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}'
return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()
# %% Voices per trait
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)
# %% Create plots
for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
trait_d = ss_long.filter(pl.col("Description") == trait)
S.plot_speaking_style_trait_scores(trait_d, title=trait.replace(":", ""), height=550, color_gender=True)
# %% Filter out straight-liner (PER TRAIT) and re-plot to see if any changes
# Save with different filename suffix so we can compare with/without straight-liners
print("\n--- Straight-lining Checks on TRAITS ---")
sl_report_traits, sl_traits_df = check_straight_liners(ss_all, max_score=5)
sl_traits_df
# %%
if sl_traits_df is not None and not sl_traits_df.is_empty():
sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()
n_sl_groups = sl_traits_df.height
print(f"\nExcluding {n_sl_groups} straight-lined question blocks from {len(sl_ids)} respondents.")
# Create key in ss_long to match sl_traits_df for anti-join
# Question Group key in sl_traits_df is like "SS_Orange_Red__V14"
# ss_long has "Style_Group" and "Voice"
ss_long_w_key = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
# Prepare filter table: Record ID + Question Group
sl_filter = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
# Anti-join to remove specific question blocks that were straight-lined
ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")
# Re-plot with suffix in title
print("Re-plotting traits (Cleaned)...")
for i, trait in enumerate(ss_long_clean.select("Description").unique().to_series().to_list()):
trait_d = ss_long_clean.filter(pl.col("Description") == trait)
# Modify title to create unique filename (and display title)
title_clean = trait.replace(":", "") + " (Excl. Straight-Liners)"
S.plot_speaking_style_trait_scores(trait_d, title=title_clean, height=550, color_gender=True)
else:
print("No straight-liners found on traits.")
# %% Compare All vs Cleaned
if sl_traits_df is not None and not sl_traits_df.is_empty():
print("Generating Comparison Plots (All vs Cleaned)...")
# Always apply the per-question-group filtering here to ensure consistency
# (Matches the logic used in the re-plotting section above)
print("Applying filter to remove straight-lined question blocks...")
ss_long_w_key = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
sl_filter = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")
sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()
# --- Verification Prints ---
print(f"\n--- Verification of Filter ---")
print(f"Original Row Count: {ss_long.height}")
print(f"Number of Straight-Liner Question Blocks: {sl_traits_df.height}")
print(f"Sample IDs affected: {sl_ids[:5]}")
print(f"Cleaned Row Count: {ss_long_clean.height}")
print(f"Rows Removed: {ss_long.height - ss_long_clean.height}")
# Verify removal
# Re-construct key to verify
ss_long_check = ss_long.with_columns(
(pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
)
sl_filter_check = sl_traits_df.select([
pl.col("Record ID").alias("_recordId"),
pl.col("Question Group")
])
should_be_removed = ss_long_check.join(sl_filter_check, on=["_recordId", "Question Group"], how="inner").height
print(f"Discrepancy Check (Should be 0): { (ss_long.height - ss_long_clean.height) - should_be_removed }")
# Show what was removed (the straight lining behavior)
print("\nSample of Straight-Liner Data (Values that caused removal):")
print(sl_traits_df.head(5))
print("-" * 30 + "\n")
# ---------------------------
for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
# Get data for this trait from both datasets
trait_d_all = ss_long.filter(pl.col("Description") == trait)
trait_d_clean = ss_long_clean.filter(pl.col("Description") == trait)
# Plot comparison
title_comp = trait.replace(":", "") + " (Impact of Straight-Liners)"
S.plot_speaking_style_trait_scores_comparison(
trait_d_all,
trait_d_clean,
title=title_comp,
height=600 # Slightly taller for grouped bars
)

XX_quant_report.script.py

@@ -0,0 +1,849 @@
__generated_with = "0.19.7"
# %%
import marimo as mo
import polars as pl
from pathlib import Path
import argparse
import json
import re
from validation import check_progress, duration_validation, check_straight_liners
from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
import utils
from speaking_styles import SPEAKING_STYLES
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
# RESULTS_FILE = 'data/exports/debug/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %%
# CLI argument parsing for batch automation
# When run as script: python 03_quant_report.script.py --age '["18 to 21 years"]' --consumer '["Starter"]'
# When run in Jupyter: args will use defaults (all filters = None = all options selected)
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
parser.add_argument('--best-character', type=str, default="the_coach", help='Slug of the best chosen character (default: "the_coach")')
parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (e.g. 3 removes anyone with 3+ straight-lined groups)')
parser.add_argument('--voice-ranking-filter', type=str, default=None, choices=['only-missing', 'exclude-missing'], help='Filter by voice ranking completeness: "only-missing" keeps only respondents missing QID98 ranking data, "exclude-missing" removes them')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
return argparse.Namespace(**no_filters, filter_name=None, figures_dir=f'figures/{Path(RESULTS_FILE).parts[2]}', best_character="the_coach", sl_threshold=None, voice_ranking_filter=None)  # match the argparse default for figures_dir
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
BEST_CHOSEN_CHARACTER = cli_args.best_character
# %%
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
try:
data_all = S.load_data()
except NotImplementedError as e:
mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
# %% Apply filters
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# %% Apply straight-liner threshold filter (if specified)
# Removes respondents who straight-lined >= N question groups across
# speaking style and voice scale questions.
if cli_args.sl_threshold is not None:
_sl_n = cli_args.sl_threshold
S.sl_threshold = _sl_n # Store on Survey so filter slug/description include it
print(f"Applying straight-liner filter: excluding respondents with ≥{_sl_n} straight-lined question groups...")
_n_before = _d.select(pl.len()).collect().item()
# Extract question groups with renamed columns for check_straight_liners
_sl_ss_or, _ = S.get_ss_orange_red(_d)
_sl_ss_gb, _ = S.get_ss_green_blue(_d)
_sl_vs, _ = S.get_voice_scale_1_10(_d)
_sl_all_q = _sl_ss_or.join(_sl_ss_gb, on='_recordId').join(_sl_vs, on='_recordId')
_, _sl_df = check_straight_liners(_sl_all_q, max_score=5)
if _sl_df is not None and not _sl_df.is_empty():
# Count straight-lined question groups per respondent
_sl_counts = (
_sl_df
.group_by("Record ID")
.agg(pl.len().alias("sl_count"))
.filter(pl.col("sl_count") >= _sl_n)
.select(pl.col("Record ID").alias("_recordId"))
)
# Anti-join to remove offending respondents
_d = _d.collect().join(_sl_counts, on="_recordId", how="anti").lazy()
# Update filtered data on the Survey object so sample size is correct
S.data_filtered = _d
_n_after = _d.select(pl.len()).collect().item()
print(f"  Removed {_n_before - _n_after} respondents ({_n_before} → {_n_after})")
else:
print(" No straight-liners detected — no respondents removed.")
# %% Apply voice-ranking completeness filter (if specified)
# Keeps only / excludes respondents who are missing the explicit voice
# ranking question (QID98) despite completing the top-3 selection (QID36).
if cli_args.voice_ranking_filter is not None:
S.voice_ranking_filter = cli_args.voice_ranking_filter # Store on Survey so filter slug/description include it
_vr_missing = S.get_top_3_voices_missing_ranking(_d)
_vr_missing_ids = _vr_missing.select('_recordId')
_n_before = _d.select(pl.len()).collect().item()
if cli_args.voice_ranking_filter == 'only-missing':
print(f"Voice ranking filter: keeping ONLY respondents missing QID98 ranking data...")
_d = _d.collect().join(_vr_missing_ids, on='_recordId', how='inner').lazy()
elif cli_args.voice_ranking_filter == 'exclude-missing':
print(f"Voice ranking filter: EXCLUDING respondents missing QID98 ranking data...")
_d = _d.collect().join(_vr_missing_ids, on='_recordId', how='anti').lazy()
S.data_filtered = _d
_n_after = _d.select(pl.len()).collect().item()
print(f"  {_n_before} → {_n_after} respondents ({_vr_missing_ids.height} missing ranking data)")
# Save to logical variable name for further analysis
data = _d
data.collect()
# %%
# Check if all business owners are missing a 'Consumer type' in demographics
# assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]) , "Not all business owners are missing 'Consumer type' in demographics."
# %%
mo.md(r"""
# Demographic Distributions
""")
# %%
demo_plot_cols = [
'Age',
'Gender',
# 'Race/Ethnicity',
'Bussiness_Owner',  # column name is misspelled in the source export
'Consumer'
]
# %%
_content = """
"""
for c in demo_plot_cols:
_fig = S.plot_demographic_distribution(
data=S.get_demographics(data)[0],
column=c,
title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
)
_content += f"""{mo.ui.altair_chart(_fig)}\n\n"""
mo.md(_content)
# %%
mo.md(r"""
---
# Brand Character Results
""")
# %%
mo.md(r"""
## Best performing: Original vs Refined Frankenstein
""")
# %%
char_refine_rank = S.get_character_refine(data)[0]
# print(char_rank.collect().head())
print(char_refine_rank.collect().head())
# %%
mo.md(r"""
## Character ranking points
""")
# %%
mo.md(r"""
## Character ranking 1-2-3
""")
# %%
char_rank = S.get_character_ranking(data)[0]
# %%
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Voice')
# %%
S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality', title='Character Personality: Rankings Top 3')
# %%
mo.md(r"""
### Statistical Significance Character Ranking
""")
# %%
# _pairwise_df, _meta = S.compute_ranking_significance(char_rank)
# # print(_pairwise_df.columns)
# mo.md(f"""
# {mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}
# {mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
# """)
# %%
mo.md(r"""
## Character Ranking: times 1st place
""")
# %%
S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality')
# %%
mo.md(r"""
## Prominent predefined personality traits wordcloud
""")
# %%
top8_traits = S.get_top_8_traits(data)[0]
S.plot_traits_wordcloud(
data=top8_traits,
column='Top_8_Traits',
title="Most Prominent Personality Traits",
)
# %%
mo.md(r"""
## Trait frequency per brand character
""")
# %%
char_df = S.get_character_refine(data)[0]
# %%
from theme import ColorPalette
# Assuming you already have char_df (your data from get_character_refine or similar)
characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
character_colors = {
'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
}
# Build consistent sort order (by total frequency across all characters)
all_trait_counts = {}
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
for row in freq_df.iter_rows(named=True):
all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']
consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])
_content = """"""
# Generate 4 plots (one per character)
for char in characters:
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
main_color, highlight_color = character_colors[char]
chart = S.plot_single_character_trait_frequency(
data=freq_df,
character_name=char,
bar_color=main_color,
highlight_color=highlight_color,
trait_sort_order=consistent_sort_order,
)
_content += f"""
{mo.ui.altair_chart(chart)}
"""
mo.md(_content)
# %%
mo.md(r"""
## Statistical significance of best characters
see chat
> example: if, say, nr 1 and nr 2 don't differ significantly from each other but both differ from nr 3, that's also a great result. Some thinking along about how I can present this would help, you know what I mean? :)
>
""")
# %%
# %%
# %%
mo.md(r"""
---
# Spoken Voice Results
""")
# %%
COLOR_GENDER = True
# %%
mo.md(r"""
## Top 8 Most Chosen out of 18
""")
# %%
v_18_8_3 = S.get_18_8_3(data)[0]
# %%
S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Top 3 most chosen out of 8
""")
# %%
S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Voice Ranking Weighted Score
""")
# %%
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
# %%
S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Which voice is ranked best in the ranking question for top 3?
(not best 3 out of 8 question)
""")
# %%
S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
### Statistical significance for voice ranking
""")
# %%
# print(top3_voices.collect().head())
# %%
# _pairwise_df, _metadata = S.compute_ranking_significance(
# top3_voices,alpha=0.05,correction="none")
# # View significant pairs
# # print(_pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# _pairwise_df,
# metadata=_metadata,
# title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# _pairwise_df,
# metadata=_metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
# %%
mo.md(r"""
## Voice Ranked 1st the Most
""")
# %%
S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Voice Scale 1-10
""")
# %%
# Voice scale (1-10) data
voice_1_10, _ = S.get_voice_scale_1_10(data)
S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1,10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
### Statistical Significance (Scale 1-10)
""")
# %%
# Compute pairwise significance tests
# pairwise_df, metadata = S.compute_pairwise_significance(
# voice_1_10,
# test_type="mannwhitney", # or "ttest", "chi2", "auto"
# alpha=0.05,
# correction="bonferroni" # or "holm", "none"
# )
# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))
# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
# pairwise_df,
# metadata=metadata,
# title="Voice Rating Significance<br>(Pairwise Comparisons)"
# )
# # Create summary bar chart
# _summary = S.plot_significance_summary(
# pairwise_df,
# metadata=metadata
# )
# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}
# {mo.ui.altair_chart(_summary)}
# """)
# %%
mo.md(r"""
## Ranking points for Voice per Chosen Brand Character
**missing mapping**
""")
# %%
mo.md(r"""
## Speaking Style Correlations
""")
# %%
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)
# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()
choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
df_style = utils.process_speaking_style_data(ss_all, choice_map)
vscales = S.get_voice_scale_1_10(data)[0]
df_scale_long = utils.process_voice_scale_data(vscales)
joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")
df_ranking = utils.process_voice_ranking_data(top3_voices)
joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
# %%
joined_ranking.head()
# %%
mo.md(r"""
### Colors vs Scale 1-10
""")
# %%
# Transform to get one row per color with average correlation
color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
data=color_corr_scale,
title="Correlation: Speaking Style Colors and Voice Scale 1-10"
)
# %%
mo.md(r"""
### Colors vs Ranking Points
""")
# %%
color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
joined_ranking,
SPEAKING_STYLES,
target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=color_corr_ranking,
title="Correlation: Speaking Style Colors and Voice Ranking Points"
)
# %%
# Gender-filtered correlation plots (Male vs Female voices)
from reference import VOICE_GENDER_MAPPING
MALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Male"]
FEMALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Female"]
# Filter joined data by voice gender
joined_scale_male = joined_scale.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_scale_female = joined_scale.filter(pl.col("Voice").is_in(FEMALE_VOICES))
joined_ranking_male = joined_ranking.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_ranking_female = joined_ranking.filter(pl.col("Voice").is_in(FEMALE_VOICES))
# Colors vs Scale 1-10 (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
data_male=joined_scale_male,
data_female=joined_scale_female,
speaking_styles=SPEAKING_STYLES,
target_column="Voice_Scale_Score",
title="Correlation: Speaking Style Colors and Voice Scale 1-10 (by Voice Gender)",
filename="correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_color",
)
# Colors vs Ranking Points (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
data_male=joined_ranking_male,
data_female=joined_ranking_female,
speaking_styles=SPEAKING_STYLES,
target_column="Ranking_Points",
title="Correlation: Speaking Style Colors and Voice Ranking Points (by Voice Gender)",
filename="correlation_speaking_style_and_voice_ranking_points_by_voice_gender_color",
)
# %%
mo.md(r"""
### Individual Traits vs Scale 1-10
""")
# %%
_content = ""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_scale_correlation(
data=joined_scale,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
mo.md(r"""
### Individual Traits vs Ranking Points
""")
# %%
_content = ""
for _style, _traits in SPEAKING_STYLES.items():
# print(f"Correlation plot for {style}...")
_fig = S.plot_speaking_style_ranking_correlation(
data=joined_ranking,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
# Individual Traits vs Scale 1-10 (grouped by voice gender)
_content = """### Individual Traits vs Scale 1-10 (by Voice Gender)\n\n"""
for _style, _traits in SPEAKING_STYLES.items():
_fig = S.plot_speaking_style_scale_correlation_by_gender(
data_male=joined_scale_male,
data_female=joined_scale_female,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10 (by Voice Gender)",
filename=f"correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_{_style.lower()}",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
# Individual Traits vs Ranking Points (grouped by voice gender)
_content = """### Individual Traits vs Ranking Points (by Voice Gender)\n\n"""
for _style, _traits in SPEAKING_STYLES.items():
_fig = S.plot_speaking_style_ranking_correlation_by_gender(
data_male=joined_ranking_male,
data_female=joined_ranking_female,
style_color=_style,
style_traits=_traits,
title=f"Correlation: Speaking Style {_style} and Voice Ranking Points (by Voice Gender)",
filename=f"correlation_speaking_style_and_voice_ranking_points_by_voice_gender_{_style.lower()}",
)
_content += f"""
#### Speaking Style **{_style}**:
{mo.ui.altair_chart(_fig)}
"""
mo.md(_content)
# %%
# ## Correlations when "Best Brand Character" is chosen
# For each of the 4 brand characters, filter the dataset to only those respondents
# who selected that character as their #1 choice.
# %%
# Prepare character-filtered data subsets
char_rank_for_filter = S.get_character_ranking(data)[0].collect()
CHARACTER_FILTER_MAP = {
'Familiar Friend': 'Character_Ranking_Familiar_Friend',
'The Coach': 'Character_Ranking_The_Coach',
'Personal Assistant': 'Character_Ranking_The_Personal_Assistant',
'Bank Teller': 'Character_Ranking_The_Bank_Teller',
}
def get_filtered_data_for_character(char_name: str) -> tuple[pl.DataFrame, pl.DataFrame, int]:
"""Filter joined_scale and joined_ranking to respondents who ranked char_name #1."""
col = CHARACTER_FILTER_MAP[char_name]
respondents = char_rank_for_filter.filter(pl.col(col) == 1).select('_recordId')
n = respondents.height
filtered_scale = joined_scale.join(respondents, on='_recordId', how='inner')
filtered_ranking = joined_ranking.join(respondents, on='_recordId', how='inner')
return filtered_scale, filtered_ranking, n
def _char_filename(char_name: str, suffix: str) -> str:
"""Generate filename for character-filtered plots (without n-value).
Format: bc_ranked_1_{suffix}__{char_slug}
This groups all plot types together in directory listings.
"""
char_slug = char_name.lower().replace(' ', '_')
return f"bc_ranked_1_{suffix}__{char_slug}"
# %%
# ### Voice Weighted Ranking Score (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
_, _, n = get_filtered_data_for_character(char_name)
# Get top3 voices for this character subset using _recordIds
respondents = char_rank_for_filter.filter(
pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
).select('_recordId')
# Collect top3_voices if it's a LazyFrame, then join
top3_df = top3_voices.collect() if isinstance(top3_voices, pl.LazyFrame) else top3_voices
filtered_top3 = top3_df.join(respondents, on='_recordId', how='inner')
weighted = calculate_weighted_ranking_scores(filtered_top3)
S.plot_weighted_ranking_score(
data=weighted,
title=f'"{char_name}" Ranked #1 (n={n})<br>Most Popular Voice - Weighted Score (1st=3pts, 2nd=2pts, 3rd=1pt)',
filename=_char_filename(char_name, "voice_weighted_ranking_score"),
color_gender=COLOR_GENDER,
)
# %%
# ### Voice Scale 1-10 Average Scores (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
_, _, n = get_filtered_data_for_character(char_name)
# Get voice scale data for this character subset using _recordIds
respondents = char_rank_for_filter.filter(
pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
).select('_recordId')
# Collect voice_1_10 if it's a LazyFrame, then join
voice_1_10_df = voice_1_10.collect() if isinstance(voice_1_10, pl.LazyFrame) else voice_1_10
filtered_voice_1_10 = voice_1_10_df.join(respondents, on='_recordId', how='inner')
S.plot_average_scores_with_counts(
data=filtered_voice_1_10,
title=f'"{char_name}" Ranked #1 (n={n})<br>Voice General Impression (Scale 1-10)',
filename=_char_filename(char_name, "voice_scale_1-10"),
x_label='Voice',
domain=[1, 10],
color_gender=COLOR_GENDER,
)
# %%
# ### Speaking Style Colors vs Scale 1-10 (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
continue
filtered_scale, _, n = get_filtered_data_for_character(char_name)
color_corr, _ = utils.transform_speaking_style_color_correlation(filtered_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
data=color_corr,
title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Scale 1-10',
filename=_char_filename(char_name, "colors_vs_voice_scale_1-10"),
)
# %%
# ### Speaking Style Colors vs Ranking Points (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
continue
_, filtered_ranking, n = get_filtered_data_for_character(char_name)
color_corr, _ = utils.transform_speaking_style_color_correlation(
filtered_ranking, SPEAKING_STYLES, target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
data=color_corr,
title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Ranking Points',
filename=_char_filename(char_name, "colors_vs_voice_ranking_points"),
)
# %%
# ### Individual Traits vs Scale 1-10 (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
print(f"--- Speaking Style: {_style} ---")
for char_name in CHARACTER_FILTER_MAP:
if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
continue
filtered_scale, _, n = get_filtered_data_for_character(char_name)
S.plot_speaking_style_scale_correlation(
data=filtered_scale,
style_color=_style,
style_traits=_traits,
title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Scale 1-10',
filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_scale_1-10"),
)
# %%
# ### Individual Traits vs Ranking Points (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
print(f"--- Speaking Style: {_style} ---")
for char_name in CHARACTER_FILTER_MAP:
if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
continue
_, filtered_ranking, n = get_filtered_data_for_character(char_name)
S.plot_speaking_style_ranking_correlation(
data=filtered_ranking,
style_color=_style,
style_traits=_traits,
title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Ranking Points',
filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_ranking_points"),
)
# %%


@@ -0,0 +1,370 @@
"""Extra statistical significance analyses for quant report."""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
# Use the same default as argparse
default_fig_dir = f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}'
return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()
# %% Is "The Coach" ranked significantly higher than the other characters?
char_rank = S.get_character_ranking(data)[0]
_pairwise_df, _meta = S.compute_ranking_significance(
char_rank,
alpha=0.05,
correction="none",
)
# %% [markdown]
"""
### Methodology Analysis
**Input Data (`char_rank`)**:
* Generated by `S.get_character_ranking(data)`.
* Contains the ranking values (1st, 2nd, 3rd, 4th) assigned by each respondent to the four options ("The Coach", etc.).
* Columns represent the characters; rows represent individual respondents; values are the numerical rank (1 = Top Choice).
**Processing**:
* The function `compute_ranking_significance` aggregates these rankings to find the **"Rank 1 Share"** (the percentage of respondents who picked that character as their #1 favorite).
* It builds a contingency table of how many times each character was ranked 1st vs. not 1st (or 1st vs. 2nd vs. 3rd).
**Statistical Test**:
* **Test Used**: Pairwise Z-test for two proportions (uncorrected).
* **Comparison**: It compares the **Rank 1 Share** of every pair of characters.
* *Example*: "Is the 42% of people who chose 'Coach' significantly different from the 29% who chose 'Familiar Friend'?"
* **Significance**: A result of `p < 0.05` means the difference in popularity (top-choice preference) is statistically significant and not due to random chance.
"""
# %% Plot heatmap of pairwise significance
S.plot_significance_heatmap(_pairwise_df, metadata=_meta, title="Statistical Significance: Character Top Choice Preference")
# %% Plot summary of significant differences (e.g., which characters are significantly higher than others)
# S.plot_significance_summary(_pairwise_df, metadata=_meta)
# %% [markdown]
"""
# Analysis: Significance of "The Coach"
**Parameters**: `alpha=0.05`, `correction='none'`
* **Rationale**: No correction was applied to allow for detection of all potential pairwise differences (uncorrected p < 0.05). If strict control for family-wise error rate were required (e.g., Bonferroni), the significance threshold would be lower (p < 0.0083).
**Results**:
"The Coach" is the top-ranked option (42.0% Rank 1 share) and shows strong separation from the field.
* **Vs. Bottom Two**: "The Coach" is significantly higher than both "The Bank Teller" (26.9%, p < 0.001) and "Familiar Friend" (29.4%, p < 0.001).
* **Vs. Runner-Up**: "The Coach" is widely preferred over "The Personal Assistant" (33.4%). The difference of **8.6 percentage points** is statistically significant (p = 0.017) at the standard 0.05 level.
* *Note*: While p=0.017 is significant in isolation, it would not meet the stricter Bonferroni threshold (0.0083). However, the effect size (+8.6%) is commercially meaningful.
**Conclusion**:
Yes, "The Coach" can be considered statistically more significant than the other options. It is clearly superior to the bottom two options and holds a statistically significant lead over the runner-up ("Personal Assistant") in direct comparison.
"""
# %% Mentions significance analysis
char_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
char_rank,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
char_pairwise_df_mentions,
metadata=_meta_mentions,
title="Statistical Significance: Character Total Mentions (Top 3 Visibility)"
)
# %% voices analysis
top3_voices = S.get_top_3_voices(data)[0]
_pairwise_df_voice, _metadata = S.compute_ranking_significance(
top3_voices,alpha=0.05,correction="none")
S.plot_significance_heatmap(
_pairwise_df_voice,
metadata=_metadata,
title="Statistical Significance: Voice Top Choice Preference"
)
# %% Total Mentions Significance (Rank 1+2+3 Combined)
# This tests "Quantity" (Visibility) instead of "Quality" (Preference)
_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
top3_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_df_mentions,
metadata=_meta_mentions,
title="Statistical Significance: Voice Total Mentions (Top 3 Visibility)"
)
# %% Male Voices Only Analysis
import reference
def filter_voices_by_gender(df: pl.DataFrame, target_gender: str) -> pl.DataFrame:
"""Filter ranking columns to keep only those matching target gender."""
cols_to_keep = []
# Always keep identifier if present
if '_recordId' in df.columns:
cols_to_keep.append('_recordId')
for col in df.columns:
# Check if column is a voice column (contains Vxx)
# Format is typically "Top_3_Voices_ranking__V14"
if '__V' in col:
voice_id = col.split('__')[1]
if reference.VOICE_GENDER_MAPPING.get(voice_id) == target_gender:
cols_to_keep.append(col)
return df.select(cols_to_keep)
# Get full ranking data as DataFrame
df_voices = top3_voices.collect()
# Filter for Male voices
df_male_voices = filter_voices_by_gender(df_voices, 'Male')
# 1. Male Voices: Top Choice Preference (Rank 1)
_pairwise_male_pref, _meta_male_pref = S.compute_ranking_significance(
df_male_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_pref,
metadata=_meta_male_pref,
title="Male Voices Only: Top Choice Preference Significance"
)
# 2. Male Voices: Total Mentions (Visibility)
_pairwise_male_vis, _meta_male_vis = S.compute_mentions_significance(
df_male_voices,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_vis,
metadata=_meta_male_vis,
title="Male Voices Only: Total Mentions Significance"
)
# %% Male Voices (Excluding Bottom 3: V88, V86, V81)
# Start with the male voices dataframe from the previous step
voices_to_exclude = ['V88', 'V86', 'V81']
def filter_exclude_voices(df: pl.DataFrame, exclude_list: list[str]) -> pl.DataFrame:
"""Filter ranking columns to exclude specific voices."""
cols_to_keep = []
# Always keep identifier if present
if '_recordId' in df.columns:
cols_to_keep.append('_recordId')
for col in df.columns:
# Check if column is a voice column (contains Vxx)
if '__V' in col:
voice_id = col.split('__')[1]
if voice_id not in exclude_list:
cols_to_keep.append(col)
return df.select(cols_to_keep)
df_male_top = filter_exclude_voices(df_male_voices, voices_to_exclude)
# 1. Male Top Candidates: Top Choice Preference
_pairwise_male_top_pref, _meta_male_top_pref = S.compute_ranking_significance(
df_male_top,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_top_pref,
metadata=_meta_male_top_pref,
title="Male Voices (Excl. Bottom 3): Top Choice Preference Significance"
)
# 2. Male Top Candidates: Total Mentions
_pairwise_male_top_vis, _meta_male_top_vis = S.compute_mentions_significance(
df_male_top,
alpha=0.05,
correction="none"
)
S.plot_significance_heatmap(
_pairwise_male_top_vis,
metadata=_meta_male_top_vis,
title="Male Voices (Excl. Bottom 3): Total Mentions Significance"
)
# %% [markdown]
"""
# Rank 1 Selection Significance (Voice Level)
Similar to the Total Mentions significance analysis above, but counting
only how many times each voice was ranked **1st** (out of all respondents).
This isolates first-choice preference rather than overall top-3 visibility.
"""
# %% Rank 1 Significance: All Voices
_pairwise_df_rank1, _meta_rank1 = S.compute_rank1_significance(
top3_voices,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
_pairwise_df_rank1,
metadata=_meta_rank1,
title="Statistical Significance: Voice Rank 1 Selection"
)
# %% Rank 1 Significance: Male Voices Only
_pairwise_df_rank1_male, _meta_rank1_male = S.compute_rank1_significance(
df_male_voices,
alpha=0.05,
correction="none",
)
S.plot_significance_heatmap(
_pairwise_df_rank1_male,
metadata=_meta_rank1_male,
title="Male Voices Only: Rank 1 Selection Significance"
)
# %%

XX_straight_liners.py Normal file

@@ -0,0 +1,267 @@
"""Extra analyses of the straight-liners"""
# %% Imports
import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners
# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
# %% CLI argument parsing for batch automation
# When run as script: uv run XX_straight_liners.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
'age': 'options_age',
'gender': 'options_gender',
'ethnicity': 'options_ethnicity',
'income': 'options_income',
'consumer': 'options_consumer',
'business_owner': 'options_business_owner',
'ai_user': 'options_ai_user',
'investable_assets': 'options_investable_assets',
'industry': 'options_industry',
}
def parse_cli_args():
parser = argparse.ArgumentParser(description='Generate quant report with optional filters')
# Dynamically add filter arguments from config
for filter_name in FILTER_CONFIG:
parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')
parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
parser.add_argument('--figures-dir', type=str, default=f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
# Only parse if running as script (not in Jupyter/interactive)
try:
# Check if running in Jupyter by looking for ipykernel
get_ipython() # noqa: F821 # type: ignore
# Return namespace with all filters set to None
no_filters = {f: None for f in FILTER_CONFIG}
# Use the same default as argparse
default_fig_dir = f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}'
return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
except NameError:
args = parser.parse_args()
# Parse JSON strings to lists
for filter_name in FILTER_CONFIG:
val = getattr(args, filter_name)
setattr(args, filter_name, json.loads(val) if val else None)
return args
cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()
# %% Build filtered dataset based on CLI args
# CLI args: None means "no filter applied" - filter_data() will skip None filters
# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}
_d = S.filter_data(data_all, **_active_filters)
# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
# Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
_filter_slug = S._get_filter_slug()
_filter_slug_dir = S.fig_save_dir / _filter_slug
_filter_slug_dir.mkdir(parents=True, exist_ok=True)
# Build filter description
_filter_desc_lines = [
f"Filter: {cli_args.filter_name}",
"",
"Applied Filters:",
]
_short_desc_parts = []
for filter_name, options_attr in FILTER_CONFIG.items():
all_options = getattr(S, options_attr)
values = _active_filters[filter_name]
display_name = filter_name.replace('_', ' ').title()
# None means no filter applied (same as "All")
if values is not None and values != all_options:
_short_desc_parts.append(f"{display_name}: {', '.join(values)}")
_filter_desc_lines.append(f" {display_name}: {', '.join(values)}")
else:
_filter_desc_lines.append(f" {display_name}: All")
# Write detailed description INSIDE the filter-slug directory
# Sanitize filter name for filename usage (replace / and other chars)
_safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
_filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
_filter_file.write_text('\n'.join(_filter_desc_lines))
# Append to summary index file at figures/<export_date>/filter_index.txt
_summary_file = S.fig_save_dir / "filter_index.txt"
_short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
_summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"
# Append or create the summary file
if _summary_file.exists():
_existing = _summary_file.read_text()
# Avoid duplicate entries for same slug
if _filter_slug not in _existing:
with _summary_file.open('a') as f:
f.write(_summary_line)
else:
_header = "Filter Index\n" + "=" * 80 + "\n\n"
_header += "Directory | Filter Name | Description\n"
_header += "-" * 80 + "\n"
_summary_file.write_text(_header + _summary_line)
# Save to logical variable name for further analysis
data = _d
data.collect()
# %% Determine straight-liner repeat offenders
# Extract question groups with renamed columns that check_straight_liners expects.
# The raw `data` has QID-based column names; the getter methods rename them to
# patterns like SS_Green_Blue__V14__Choice_1, Voice_Scale_1_10__V48, etc.
ss_or, _ = S.get_ss_orange_red(data)
ss_gb, _ = S.get_ss_green_blue(data)
vs, _ = S.get_voice_scale_1_10(data)
# Combine all question groups into one wide LazyFrame (joined on _recordId)
all_questions = ss_or.join(ss_gb, on='_recordId').join(vs, on='_recordId')
# Run straight-liner detection across all question groups
# max_score=5 catches all speaking-style straight-lining (1-5 scale)
# and voice-scale values ≤5 on the 1-10 scale
# Note: sl_threshold is NOT set on S here — this script analyses straight-liners,
# it doesn't filter them out of the dataset.
print("Running straight-liner detection across all question groups...")
sl_report, sl_df = check_straight_liners(all_questions, max_score=5)
# %% Quantify repeat offenders
# sl_df has one row per (Record ID, Question Group) that was straight-lined.
# Group by Record ID to count how many question groups each person SL'd.
if sl_df is not None and not sl_df.is_empty():
total_respondents = data.select(pl.len()).collect().item()
# Per-respondent count of straight-lined question groups
respondent_sl_counts = (
sl_df
.group_by("Record ID")
.agg(pl.len().alias("sl_count"))
.sort("sl_count", descending=True)
)
max_sl = respondent_sl_counts["sl_count"].max()
print(f"\nTotal respondents: {total_respondents}")
print(f"Respondents who straight-lined at least 1 question group: "
f"{respondent_sl_counts.height}")
print(f"Maximum question groups straight-lined by one person: {max_sl}")
print()
# Build cumulative distribution: for each threshold N, count respondents
# who straight-lined >= N question groups
cumulative_rows = []
for threshold in range(1, max_sl + 1):
count = respondent_sl_counts.filter(
pl.col("sl_count") >= threshold
).height
pct = (count / total_respondents) * 100
cumulative_rows.append({
"threshold": threshold,
"count": count,
"pct": pct,
})
print(
f"{threshold} question groups straight-lined: "
f"{count} respondents ({pct:.1f}%)"
)
cumulative_df = pl.DataFrame(cumulative_rows)
print(f"\n{cumulative_df}")
# %% Save cumulative data to CSV
_filter_slug = S._get_filter_slug()
_csv_dir = Path(S.fig_save_dir) / _filter_slug
_csv_dir.mkdir(parents=True, exist_ok=True)
_csv_path = _csv_dir / "straight_liner_repeat_offenders.csv"
cumulative_df.write_csv(_csv_path)
print(f"Saved cumulative data to {_csv_path}")
# %% Plot the cumulative distribution
S.plot_straight_liner_repeat_offenders(
cumulative_df,
total_respondents=total_respondents,
)
# %% Per-question straight-lining frequency
# Build human-readable question group names from the raw keys
def _humanise_question_group(key: str) -> str:
"""Convert internal question group key to a readable label.
Examples:
SS_Green_Blue__V14 → Green/Blue V14
SS_Orange_Red__V48 → Orange/Red V48
Voice_Scale_1_10 → Voice Scale (1-10)
"""
if key.startswith("SS_Green_Blue__"):
voice = key.split("__")[1]
return f"Green/Blue {voice}"
if key.startswith("SS_Orange_Red__"):
voice = key.split("__")[1]
return f"Orange/Red {voice}"
if key == "Voice_Scale_1_10":
return "Voice Scale (1-10)"
# Fallback: replace underscores
return key.replace("_", " ")
per_question_counts = (
sl_df
.group_by("Question Group")
.agg(pl.col("Record ID").n_unique().alias("count"))
.sort("count", descending=True)
.with_columns(
(pl.col("count") / total_respondents * 100).alias("pct")
)
)
# Add human-readable names
per_question_counts = per_question_counts.with_columns(
pl.col("Question Group").map_elements(
_humanise_question_group, return_dtype=pl.Utf8
).alias("question")
)
print("\n--- Per-Question Straight-Lining Frequency ---")
print(per_question_counts)
# Save per-question data to CSV
_csv_path_pq = _csv_dir / "straight_liner_per_question.csv"
per_question_counts.write_csv(_csv_path_pq)
print(f"Saved per-question data to {_csv_path_pq}")
# Plot
S.plot_straight_liner_per_question(
per_question_counts,
total_respondents=total_respondents,
)
# %% Show the top repeat offenders (respondents with most SL'd groups)
print("\n--- Top Repeat Offenders ---")
print(respondent_sl_counts.head(20))
else:
print("No straight-liners detected in the dataset.")

docs/README.pdf (binary file)

@@ -0,0 +1,104 @@
# Appendix: Quantitative Analysis Plots - Folder Structure Manual
This folder contains all the quantitative analysis plots, sorted by the filters applied to the dataset. Each folder corresponds to a specific demographic cut.
## Folder Overview
* `All_Respondents/`: Analysis of the full dataset (no filters).
* `filter_index.txt`: A master list of every folder code and its corresponding demographic filter.
* **Filter Folders**: All other folders represent specific demographic cuts (e.g., `Age-18to21years`, `Gen-Woman`).
## How to Navigate
Each folder contains the same set of charts generated for that specific filter.
## Directory Reference Table
Below is the complete list of folder names. These names are encodings of the filters applied to the dataset, which we use to maintain consistency across our analysis.
| Directory Code | Filter Description |
| :--- | :--- |
| All_Respondents | All Respondents |
| Age-18to21years | Age: 18 to 21 years |
| Age-22to24years | Age: 22 to 24 years |
| Age-25to34years | Age: 25 to 34 years |
| Age-35to40years | Age: 35 to 40 years |
| Age-41to50years | Age: 41 to 50 years |
| Age-51to59years | Age: 51 to 59 years |
| Age-60to70years | Age: 60 to 70 years |
| Age-70yearsormore | Age: 70 years or more |
| Gen-Man | Gender: Man |
| Gen-Prefernottosay | Gender: Prefer not to say |
| Gen-Woman | Gender: Woman |
| Eth-6_grps_c64411 | Ethnicity: All options containing 'Alaska Native or Indigenous American' |
| Eth-6_grps_8f145b | Ethnicity: All options containing 'Asian or Asian American' |
| Eth-8_grps_71ac47 | Ethnicity: All options containing 'Black or African American' |
| Eth-7_grps_c5b3ce | Ethnicity: All options containing 'Hispanic or Latinx' |
| Eth-BlackorAfricanAmerican<br>MiddleEasternorNorthAfrican<br>WhiteorCaucasian+<br>MiddleEasternorNorthAfrican | Ethnicity: Middle Eastern or North African |
| Eth-AsianorAsianAmericanBlackorAfricanAmerican<br>NativeHawaiianorOtherPacificIslander+<br>NativeHawaiianorOtherPacificIslander | Ethnicity: Native Hawaiian or Other Pacific Islander |
| Eth-10_grps_cef760 | Ethnicity: All options containing 'White or Caucasian' |
| Inc-100000to149999 | Income: $100,000 to $149,999 |
| Inc-150000to199999 | Income: $150,000 to $199,999 |
| Inc-200000ormore | Income: $200,000 or more |
| Inc-25000to34999 | Income: $25,000 to $34,999 |
| Inc-35000to54999 | Income: $35,000 to $54,999 |
| Inc-55000to79999 | Income: $55,000 to $79,999 |
| Inc-80000to99999 | Income: $80,000 to $99,999 |
| Inc-Lessthan25000 | Income: Less than $25,000 |
| Cons-Lower_Mass_A+Lower_Mass_B | Consumer: Lower_Mass_A, Lower_Mass_B |
| Cons-MassAffluent_A+MassAffluent_B | Consumer: MassAffluent_A, MassAffluent_B |
| Cons-Mass_A+Mass_B | Consumer: Mass_A, Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_A+<br>Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_A, Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Early_Professional | Consumer: Early_Professional |
| Cons-Lower_Mass_B | Consumer: Lower_Mass_B |
| Cons-MassAffluent_B | Consumer: MassAffluent_B |
| Cons-Mass_B | Consumer: Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Starter | Consumer: Starter |
| BizOwn-No | Business Owner: No |
| BizOwn-Yes | Business Owner: Yes |
| AI-Daily | AI User: Daily |
| AI-Lessthanonceamonth | AI User: Less than once a month |
| AI-Morethanoncedaily | AI User: More than once daily |
| AI-Multipletimesperweek | AI User: Multiple times per week |
| AI-Onceamonth | AI User: Once a month |
| AI-Onceaweek | AI User: Once a week |
| AI-RarelyNever | AI User: Rarely/Never |
| AI-Daily+<br>Morethanoncedaily+<br>Multipletimesperweek | AI User: Daily, More than once daily, Multiple times per week |
| AI-4_grps_d4f57a | AI User: Once a week, Once a month, Less than once a month, Rarely/Never |
| InvAsts-0to24999 | Investable Assets: $0 to $24,999 |
| InvAsts-150000to249999 | Investable Assets: $150,000 to $249,999 |
| InvAsts-1Mto4.9M | Investable Assets: $1M to $4.9M |
| InvAsts-25000to49999 | Investable Assets: $25,000 to $49,999 |
| InvAsts-250000to499999 | Investable Assets: $250,000 to $499,999 |
| InvAsts-50000to149999 | Investable Assets: $50,000 to $149,999 |
| InvAsts-500000to999999 | Investable Assets: $500,000 to $999,999 |
| InvAsts-5Mormore | Investable Assets: $5M or more |
| InvAsts-Prefernottoanswer | Investable Assets: Prefer not to answer |
| Ind-Agricultureforestryfishingorhunting | Industry: Agriculture, forestry, fishing, or hunting |
| Ind-Artsentertainmentorrecreation | Industry: Arts, entertainment, or recreation |
| Ind-Broadcasting | Industry: Broadcasting |
| Ind-Construction | Industry: Construction |
| Ind-EducationCollegeuniversityoradult | Industry: Education College, university, or adult |
| Ind-EducationOther | Industry: Education Other |
| Ind-EducationPrimarysecondaryK-12 | Industry: Education Primary/secondary (K-12) |
| Ind-Governmentandpublicadministration | Industry: Government and public administration |
| Ind-Hotelandfoodservices | Industry: Hotel and food services |
| Ind-InformationOther | Industry: Information Other |
| Ind-InformationServicesanddata | Industry: Information Services and data |
| Ind-Legalservices | Industry: Legal services |
| Ind-ManufacturingComputerandelectronics | Industry: Manufacturing Computer and electronics |
| Ind-ManufacturingOther | Industry: Manufacturing Other |
| Ind-Notemployed | Industry: Not employed |
| Ind-Otherindustrypleasespecify | Industry: Other industry (please specify) |
| Ind-Processing | Industry: Processing |
| Ind-Publishing | Industry: Publishing |
| Ind-Realestaterentalorleasing | Industry: Real estate, rental, or leasing |
| Ind-Retired | Industry: Retired |
| Ind-Scientificortechnicalservices | Industry: Scientific or technical services |
| Ind-Software | Industry: Software |
| Ind-Telecommunications | Industry: Telecommunications |
| Ind-Transportationandwarehousing | Industry: Transportation and warehousing |
| Ind-Utilities | Industry: Utilities |
| Ind-Wholesale | Industry: Wholesale |
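The prefix before the first hyphen encodes the filter category. A minimal, hypothetical decoder based on the table above (the `PREFIX_TO_CATEGORY` mapping and `category_of` helper are illustrations, not part of the codebase):

```python
# Hypothetical lookup from directory-code prefix to filter category,
# derived from the reference table above.
PREFIX_TO_CATEGORY = {
    "Age": "Age", "Gen": "Gender", "Eth": "Ethnicity", "Inc": "Income",
    "Cons": "Consumer", "BizOwn": "Business Owner", "AI": "AI User",
    "InvAsts": "Investable Assets", "Ind": "Industry",
}

def category_of(directory_code: str) -> str:
    """Return the filter category for a folder name like 'Age-18to21years'."""
    if directory_code == "All_Respondents":
        return "All Respondents"
    prefix = directory_code.split("-", 1)[0]
    return PREFIX_TO_CATEGORY.get(prefix, "Unknown")

print(category_of("Age-18to21years"))   # Age
print(category_of("InvAsts-5Mormore"))  # Investable Assets
```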


@@ -0,0 +1,428 @@
# Statistical Significance Testing Guide
A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.
---
## Table of Contents
1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)
---
## Quick Decision Flowchart
```
What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│   │
│   └─► Use: compute_pairwise_significance()
│       │
│       ├─► Data normally distributed?  → test_type="ttest"
│       └─► Not sure / skewed data?     → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
    │
    └─► Use: compute_ranking_significance()
        (automatically uses proportion z-test)
```
---
## Understanding Your Data Types
### Continuous Data
**What it looks like:** Numbers on a scale with many possible values.
| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |
```
shape: (5, 3)
┌───────────┬─────────────────┬─────────────────┐
│ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│
│ str │ f64 │ f64 │
├───────────┼─────────────────┼─────────────────┤
│ R_001 │ 7.5 │ 6.0 │
│ R_002 │ 8.0 │ 7.5 │
│ R_003 │ 6.5 │ 8.0 │
```
### Ranking Data
**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.
| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |
```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Top_3__V14 │ Top_3__V04 │
│ str │ i64 │ i64 │
├───────────┼──────────────────┼──────────────────┤
│ R_001 │ 1 │ null │ ← V14 was ranked 1st
│ R_002 │ 2 │ 1 │ ← V04 was ranked 1st
│ R_003 │ null │ 3 │ ← V04 was ranked 3rd
```
### ⚠️ Aggregated Data (Cannot Test!)
**What it looks like:** Already summarized/totaled data.
```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │ ← ALREADY AGGREGATED
│ str │ i64 │ Lost individual variance
├───────────┼────────────────┤ Cannot do significance tests!
│ V14 │ 209 │
│ V04 │ 180 │
```
**Solution:** Go back to the raw data before aggregation.
---
## Available Tests
### 1. Mann-Whitney U Test (Default for Continuous)
**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney" # This is the default
)
```
**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure
**Cons:**
- Slightly less powerful than t-test when data IS normally distributed
---
### 2. Independent t-Test
**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data
```python
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="ttest"
)
```
**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported
**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers
---
### 3. Chi-Square Test
**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons
```python
pairwise_df, meta = S.compute_pairwise_significance(
count_data,
test_type="chi2"
)
```
**Pros:**
- Designed for count/frequency data
- Tests if distributions differ
**Cons:**
- Needs sufficient sample sizes
- Less informative about direction of difference
---
### 4. Two-Proportion Z-Test (For Rankings)
**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`
```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```
**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
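For intuition, the test statistic can be sketched by hand. This is an illustration of the standard pooled two-proportion z-test, not the internal implementation of `compute_ranking_significance()`; the vote counts are made up:

```python
from math import sqrt, erf

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF at |z|
    return z, 2 * (1 - phi)                   # two-sided p-value

# Voice A: 60 of 200 respondents ranked it 1st; Voice B: 40 of 200.
z, p = two_proportion_ztest(60, 200, 40, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```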
---
## Multiple Comparison Corrections
### Why Do We Need Corrections?
When you compare many groups, you're doing many tests. Each test has a 5% chance of a false positive (if α = 0.05). With 17 voices:
| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |
**Corrections adjust p-values to account for this.**
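The arithmetic behind that table:

```python
from math import comb

n_voices, alpha = 17, 0.05
n_pairs = comb(n_voices, 2)  # 17 choose 2 = 136 pairwise tests
print(f"{n_pairs} pairs, ~{n_pairs * alpha:.1f} expected false positives")
```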
---
### Bonferroni Correction (Conservative)
**Formula:** `p_adjusted = min(p_value × number_of_comparisons, 1)`
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="bonferroni" # This is the default
)
```
**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)
**Trade-off:** May miss real differences (more false negatives)
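As a standalone sketch of the formula (an illustration, not the library's code; note the adjusted value is capped at 1.0, since a probability cannot exceed 1):

```python
def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Three comparisons: only the first survives the correction at alpha = 0.05
print(bonferroni_adjust([0.001, 0.02, 0.04]))
```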
---
### Holm-Bonferroni Correction (Less Conservative)
**Formula:** Step-down procedure that's less strict than Bonferroni
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="holm"
)
```
**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly
**Trade-off:** Slightly higher false positive risk than Bonferroni
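The step-down idea, sketched (an illustration of the standard Holm procedure, not the library's implementation):

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down: the i-th smallest p-value is multiplied by (m - i),
    and a running maximum ensures adjusted values never decrease."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(p_values[idx] * (m - rank), 1.0))
        adjusted[idx] = running_max
    return adjusted

# Compare: Bonferroni would give [0.03, 0.12, 0.09]; Holm is never stricter.
print(holm_adjust([0.01, 0.04, 0.03]))
```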
---
### No Correction
**Not recommended for final analysis**, but useful for exploration.
```python
pairwise_df, meta = S.compute_pairwise_significance(
data,
correction="none"
)
```
**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate
---
### Correction Method Comparison
| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |
**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.
---
## Interpreting Results
### Key Output Columns
| Column | Meaning |
|--------|---------|
| `p_value` | Raw probability this difference happened by chance |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big is the difference (practical significance) |
### What the p-value Means
| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |
### Statistical vs Practical Significance
**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.
**Practical significance** (effect size) means the difference matters in the real world.
| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Negligible (may not matter practically) |
| 0.2 - 0.5 | Small |
| 0.5 - 0.8 | Medium |
| > 0.8 | Large |
**Example:** A p-value of 0.001 with effect size of 0.1 means "we're confident there's a difference, but it's tiny."
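Cohen's d itself is just the mean difference divided by a pooled standard deviation. A minimal sketch (illustrative, with made-up ratings; not the codebase's helper):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Two voices rated on the 1-10 scale: a 1-point mean gap, sd ~1.58 → d ~0.63
print(round(cohens_d([7, 8, 9, 6, 10], [6, 7, 8, 5, 9]), 3))
```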
---
## Code Examples
### Example 1: Voice Scale Ratings
```python
# Get the raw rating data
voice_data, _ = S.get_voice_scale_1_10(data)
# Test for significant differences
pairwise_df, meta = S.compute_pairwise_significance(
voice_data,
test_type="mannwhitney", # Safe default for ratings
alpha=0.05,
correction="bonferroni"
)
# Check overall test first
print(f"Overall test: {meta['overall_test']}")
print(f"Overall p-value: {meta['overall_p_value']:.4f}")
# If overall is significant, look at pairwise
if meta['overall_p_value'] < 0.05:
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(f"Found {sig_pairs.height} significant pairwise differences")
# Visualize
S.plot_significance_heatmap(pairwise_df, metadata=meta)
```
### Example 2: Top 3 Voice Rankings
```python
# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)
# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
ranking_data,
alpha=0.05,
correction="holm" # Less conservative for many comparisons
)
# Check chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")
# View contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")
# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```
### Example 3: Comparing Demographic Subgroups
```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)
S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)
# Test each group separately, then compare results qualitatively
# (For direct group comparison, you'd need a different test design)
```
---
## Common Mistakes to Avoid
### ❌ Using Aggregated Data
```python
# WRONG - already summarized, lost individual variance
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores) # Will fail!
```
### ✅ Use Raw Data
```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```
### ❌ Ignoring Multiple Comparisons
```python
# WRONG - with no correction, ~5% of the 136 pairs (about 7) will look "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```
### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```
### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}") # Missing context!
```
### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```
---
## Quick Reference Card
| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |
| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |
---
## Further Reading
- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)

docs/wordcloud-usage.md

@@ -0,0 +1,85 @@
# Word Cloud for Personality Traits - Usage Example
This example shows how to use the `create_traits_wordcloud` function to visualize the most prominent personality traits from survey data.
## Basic Usage in Jupyter/Marimo Notebook
```python
from utils import QualtricsSurvey, create_traits_wordcloud
from pathlib import Path
# Load your survey data
RESULTS_FILE = "data/exports/1-23-26/JPMC_Chase Brand Personality_Quant Round 1_January 23, 2026_Labels.csv"
QSF_FILE = "data/19-dec_V1_quant_incl_shani_comments.qsf"
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
data = S.load_data()
# Get Top 3 Traits data
top3_traits = S.get_top_3_traits(data)[0]
# Create and display word cloud
fig = create_traits_wordcloud(
data=top3_traits,
column='Top_3_Traits',
title="Most Prominent Personality Traits",
fig_save_dir='figures', # Will save to figures/All_Respondents/
filter_slug='All_Respondents'
)
# Display in notebook
fig # or plt.show()
```
## With Active Filters
If you're using the survey filter methods, you can pass the filter slug:
```python
# Apply filters
S.set_filter_consumer(['Early Professional', 'Established Professional'])
filtered_data = S.get_filtered_data()
# Get traits from filtered data
top3_traits = S.get_top_3_traits(filtered_data)[0]
# Get the filter slug for directory naming
filter_slug = S._get_filter_slug()
# Create word cloud with filtered data
fig = create_traits_wordcloud(
data=top3_traits,
column='Top_3_Traits',
title="Most Prominent Personality Traits<br>(Early & Established Professionals)",
fig_save_dir='figures',
filter_slug=filter_slug # e.g., 'Cons-Early_Professional_Established_Professional'
)
fig
```
## Function Parameters
- **data**: Polars DataFrame or LazyFrame with trait data
- **column**: Column name containing comma-separated traits (default: 'Top_3_Traits')
- **title**: Title for the word cloud
- **width**: Width in pixels (default: 1600)
- **height**: Height in pixels (default: 800)
- **background_color**: Background color (default: 'white')
- **fig_save_dir**: Directory to save PNG (default: None - doesn't save)
- **filter_slug**: Subdirectory name for filtered results (default: 'All_Respondents')
## Colors
The word cloud uses colors from `theme.py`:
- PRIMARY: #0077B6 (Medium Blue)
- RANK_1: #004C6D (Dark Blue)
- RANK_2: #008493 (Teal)
- RANK_3: #5AAE95 (Sea Green)
## Output
- **Returns**: matplotlib Figure object for display in notebooks
- **Saves**: PNG file to `{fig_save_dir}/{filter_slug}/{sanitized_title}.png` at 300 DPI
The saved files follow the same naming convention as plots in `plots.py`.


@@ -0,0 +1,60 @@
import polars as pl
from utils import QualtricsSurvey, process_speaking_style_data, process_voice_scale_data, join_voice_and_style_data
from plots import plot_speaking_style_correlation
from speaking_styles import SPEAKING_STYLES
# 1. Initialize Survey and Load Data
# Paths to the Soft Launch export observed in the workspace:
RESULTS_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Brand Personality_Quant Round 1_January 21, 2026_Soft Launch_Values.csv"
QSF_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf"
survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
data = survey.load_data()
# 2. Extract Data
# Speaking Styles
ss_gb, map_gb = survey.get_ss_green_blue(data)
ss_or, map_or = survey.get_ss_orange_red(data)
# Voice Scale 1-10
voice_scale, _ = survey.get_voice_scale_1_10(data)
# 3. Process Dataframes (Wide to Long)
# Note: process_speaking_style_data handles the melt and parsing.
# The helpers in utils.py already return eager DataFrames (they call
# .collect() internally), so no extra collection step is needed here.
df_style_gb = process_speaking_style_data(ss_gb, map_gb)
df_style_or = process_speaking_style_data(ss_or, map_or)
# Combine both style dataframes
df_style_all = pl.concat([df_style_gb, df_style_or])
# Process Voice Scale
df_voice_long = process_voice_scale_data(voice_scale)
# 4. Join Style + Voice Data
joined_df = join_voice_and_style_data(df_style_all, df_voice_long)
# 5. Generate Plots for each Style Color
for style, traits in SPEAKING_STYLES.items():
print(f"Generating plot for {style}...")
fig = plot_speaking_style_correlation(
df=joined_df,
style_color=style,
style_traits=traits
)
fig.show()
# If in Marimo/Jupyter, just 'fig' or 'mo.ui.plotly(fig)'

plots.py (diff suppressed because it is too large)

@@ -0,0 +1,3 @@
- V46 not in scale 1-10. Qualtrics
- Straight-liners
- V45 good in qual but bad in quant


@@ -6,6 +6,8 @@ readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"altair>=6.0.0",
"imagehash>=4.3.1",
"jupyter>=1.1.1",
"marimo>=0.18.0",
"matplotlib>=3.10.8",
"modin[dask]>=0.37.1",
@@ -14,13 +16,21 @@ dependencies = [
"openai>=2.9.0",
"openpyxl>=3.1.5",
"pandas>=2.3.3",
"plotly>=6.5.1",
"pillow>=11.0.0",
"polars>=1.37.1",
"pyarrow>=23.0.0",
"pysqlite3>=0.6.0",
"python-pptx>=1.0.2",
"pyzmq>=27.1.0",
"requests>=2.32.5",
"scipy>=1.14.0",
"taguette>=1.5.1",
"tqdm>=4.66.0",
"vl-convert-python>=1.9.0.post1",
"wordcloud>=1.9.5",
]
[project.scripts]
quant-report-batch = "run_filter_combinations:main"

reference.py

@@ -0,0 +1,59 @@
ORIGINAL_CHARACTER_TRAITS = {
"the_familiar_friend": [
"Warm",
"Friendly",
"Approachable",
"Familiar",
"Casual",
"Appreciative",
"Benevolent",
],
"the_coach": [
"Empowering",
"Encouraging",
"Caring",
"Positive",
"Optimistic",
"Guiding",
"Reassuring",
],
"the_personal_assistant": [
"Forward-thinking",
"Progressive",
"Cooperative",
"Intentional",
"Resourceful",
"Attentive",
"Adaptive",
],
"the_bank_teller": [
"Patient",
"Grounded",
"Down-to-earth",
"Stable",
"Formal",
"Balanced",
"Efficient",
]
}
VOICE_GENDER_MAPPING = {
"V14": "Female",
"V04": "Female",
"V08": "Female",
"V77": "Female",
"V48": "Female",
"V82": "Female",
"V89": "Female",
"V91": "Female",
"V34": "Male",
"V69": "Male",
"V45": "Male",
"V46": "Male",
"V54": "Male",
"V74": "Male",
"V81": "Male",
"V86": "Male",
"V88": "Male",
"V16": "Male",
}

run_filter_combinations.py

@@ -0,0 +1,306 @@
#!/usr/bin/env python
"""
Batch runner for quant report with different filter combinations.
Runs 03_quant_report.script.py for each single-filter combination:
- Each age group (with all others active)
- Each gender (with all others active)
- Each ethnicity (with all others active)
- Each income group (with all others active)
- Each consumer segment (with all others active)
Usage:
uv run python run_filter_combinations.py
uv run python run_filter_combinations.py --dry-run # Preview combinations without running
uv run python run_filter_combinations.py --category age # Only run age combinations
uv run python run_filter_combinations.py --category consumer # Only run consumer segment combinations
"""
import subprocess
import sys
import json
from pathlib import Path
from tqdm import tqdm
from utils import QualtricsSurvey
# Default data paths (same as in 03_quant_report.script.py)
RESULTS_FILE = 'data/exports/2-2-26/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
REPORT_SCRIPT = Path(__file__).parent / '03_quant_report.script.py'
def get_filter_combinations(survey: QualtricsSurvey, category: str = None) -> list[dict]:
"""
Generate all single-filter combinations.
Each combination isolates ONE filter value while keeping all others at "all selected".
Args:
survey: QualtricsSurvey instance with loaded data
category: Optional filter category to limit combinations to.
Valid values: 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer',
'business_owner', 'ai_user', 'investable_assets', 'industry'
If None or 'all', generates all combinations.
Returns:
List of dicts with filter kwargs for each run.
"""
combinations = []
# Add "All Respondents" run (no filters = all options selected)
if not category or category in ['all_filters', 'all']:
combinations.append({
'name': 'All_Respondents',
'filters': {} # Empty = use defaults (all selected)
})
# Age groups - one at a time
if not category or category in ['all_filters', 'age']:
for age in survey.options_age:
combinations.append({
'name': f'Age-{age}',
'filters': {'age': [age]}
})
# Gender - one at a time
if not category or category in ['all_filters', 'gender']:
for gender in survey.options_gender:
combinations.append({
'name': f'Gender-{gender}',
'filters': {'gender': [gender]}
})
# Ethnicity - grouped by individual values
if not category or category in ['all_filters', 'ethnicity']:
# Ethnicity options are comma-separated (e.g., "White or Caucasian, Hispanic or Latino")
# Create filters that include ALL options containing each individual ethnicity value
ethnicity_values = set()
for ethnicity_option in survey.options_ethnicity:
# Split by comma and strip whitespace
values = [v.strip() for v in ethnicity_option.split(',')]
ethnicity_values.update(values)
for ethnicity_value in sorted(ethnicity_values):
# Find all options that contain this value
matching_options = [
opt for opt in survey.options_ethnicity
if ethnicity_value in [v.strip() for v in opt.split(',')]
]
combinations.append({
'name': f'Ethnicity-{ethnicity_value}',
'filters': {'ethnicity': matching_options}
})
# Income - one at a time
if not category or category in ['all_filters', 'income']:
for income in survey.options_income:
combinations.append({
'name': f'Income-{income}',
'filters': {'income': [income]}
})
# Consumer segments - combine _A and _B options, and also include standalone
if not category or category in ['all_filters', 'consumer']:
# Group options by base name (removing _A/_B suffix)
consumer_groups = {}
for consumer in survey.options_consumer:
# Check if ends with _A or _B
if consumer.endswith('_A') or consumer.endswith('_B'):
base_name = consumer[:-2] # Remove last 2 chars (_A or _B)
if base_name not in consumer_groups:
consumer_groups[base_name] = []
consumer_groups[base_name].append(consumer)
else:
# Not an _A/_B option, keep as-is
consumer_groups[consumer] = [consumer]
# Add combined _A+_B options
for base_name, options in consumer_groups.items():
if len(options) > 1: # Only combine if there are multiple (_A and _B)
combinations.append({
'name': f'Consumer-{base_name}',
'filters': {'consumer': options}
})
# Add standalone options (including individual _A and _B)
for consumer in survey.options_consumer:
combinations.append({
'name': f'Consumer-{consumer}',
'filters': {'consumer': [consumer]}
})
# Business Owner - one at a time
if not category or category in ['all_filters', 'business_owner']:
for business_owner in survey.options_business_owner:
combinations.append({
'name': f'BusinessOwner-{business_owner}',
'filters': {'business_owner': [business_owner]}
})
# AI User - one at a time
if not category or category in ['all_filters', 'ai_user']:
for ai_user in survey.options_ai_user:
combinations.append({
'name': f'AIUser-{ai_user}',
'filters': {'ai_user': [ai_user]}
})
# Frequent AI users: Daily, More than once daily, and Multiple times per week
combinations.append({
'name': 'AIUser-Frequent',
'filters': {'ai_user': [
'Daily', 'More than once daily', 'Multiple times per week'
]}
})
combinations.append({
'name': 'AIUser-RarelyNever',
'filters': {'ai_user': [
'Once a month', 'Less than once a month', 'Once a week', 'Rarely/Never'
]}
})
# Investable Assets - one at a time
if not category or category in ['all_filters', 'investable_assets']:
for investable_assets in survey.options_investable_assets:
combinations.append({
'name': f'Assets-{investable_assets}',
'filters': {'investable_assets': [investable_assets]}
})
# Industry - one at a time
if not category or category in ['all_filters', 'industry']:
for industry in survey.options_industry:
combinations.append({
'name': f'Industry-{industry}',
'filters': {'industry': [industry]}
})
# Voice ranking completeness filter
# These use a special flag rather than demographic filters, so we store
# the mode in a dedicated key that run_report passes as --voice-ranking-filter.
if not category or category in ['all_filters', 'voice_ranking']:
combinations.append({
'name': 'VoiceRanking-OnlyMissing',
'filters': {},
'voice_ranking_filter': 'only-missing',
})
combinations.append({
'name': 'VoiceRanking-ExcludeMissing',
'filters': {},
'voice_ranking_filter': 'exclude-missing',
})
return combinations
def run_report(filters: dict, name: str | None = None, dry_run: bool = False, sl_threshold: int | None = None, voice_ranking_filter: str | None = None) -> bool:
"""
Run the report script with given filters.
Args:
filters: Dict of filter_name -> list of values
name: Name for this filter combination (used for .txt description file)
dry_run: If True, just print command without running
sl_threshold: If set, exclude respondents with >= N straight-lined question groups
voice_ranking_filter: If set, filter by voice ranking completeness.
'only-missing' keeps only respondents missing QID98 data,
'exclude-missing' removes them.
Returns:
True if successful, False otherwise
"""
cmd = [sys.executable, str(REPORT_SCRIPT)]
# Add filter-name for description file
if name:
cmd.extend(['--filter-name', name])
# Pass straight-liner threshold if specified
if sl_threshold is not None:
cmd.extend(['--sl-threshold', str(sl_threshold)])
# Pass voice ranking filter if specified
if voice_ranking_filter is not None:
cmd.extend(['--voice-ranking-filter', voice_ranking_filter])
for filter_name, values in filters.items():
if values:
cmd.extend([f'--{filter_name}', json.dumps(values)])
if dry_run:
print(f" Would run: {' '.join(cmd)}")
return True
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
cwd=Path(__file__).parent
)
if result.returncode != 0:
print(f"\n ERROR: {result.stderr[:500]}")
return False
return True
except Exception as e:
print(f"\n ERROR: {e}")
return False
def main():
import argparse
parser = argparse.ArgumentParser(description='Run quant report for all filter combinations')
parser.add_argument('--dry-run', action='store_true', help='Preview combinations without running')
parser.add_argument(
'--category',
choices=['all_filters', 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer', 'business_owner', 'ai_user', 'investable_assets', 'industry', 'voice_ranking'],
default='all_filters',
help='Filter category to run combinations for (default: all_filters)'
)
parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (passed to report script)')
args = parser.parse_args()
# Load survey to get available filter options
print("Loading survey to get filter options...")
survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
survey.load_data() # Populates options_* attributes
# Generate combinations for specified category
combinations = get_filter_combinations(survey, category=args.category)
category_desc = f" for category '{args.category}'" if args.category != 'all' else ''
print(f"Generated {len(combinations)} filter combinations{category_desc}")
if args.sl_threshold is not None:
print(f"Straight-liner threshold: excluding respondents with ≥{args.sl_threshold} straight-lined question groups")
if args.dry_run:
print("\nDRY RUN - Commands that would be executed:")
for combo in combinations:
print(f"\n{combo['name']}:")
run_report(combo['filters'], name=combo['name'], dry_run=True, sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter'))
return
# Run each combination with progress bar
successful = 0
failed = []
for combo in tqdm(combinations, desc="Running reports", unit="filter"):
tqdm.write(f"Running: {combo['name']}")
if run_report(combo['filters'], name=combo['name'], sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter')):
successful += 1
else:
failed.append(combo['name'])
# Summary
print(f"\n{'='*50}")
print(f"Completed: {successful}/{len(combinations)} successful")
if failed:
print(f"Failed: {', '.join(failed)}")
if __name__ == '__main__':
main()
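For reference, a minimal sketch of how one generated combination turns into a report invocation, mirroring `run_report`'s argument construction above (the `report.py` path and the combination dict here are illustrative, not taken from the repo):

```python
import json
import sys

# Illustrative combination in the shape produced by get_filter_combinations()
combo = {
    'name': 'AIUser-Frequent',
    'filters': {'ai_user': ['Daily', 'More than once daily', 'Multiple times per week']},
}

# run_report passes each filter's value list as a single JSON-encoded CLI argument
cmd = [sys.executable, 'report.py', '--filter-name', combo['name']]
for filter_name, values in combo['filters'].items():
    if values:
        cmd.extend([f'--{filter_name}', json.dumps(values)])

print(cmd[2:])
```

JSON-encoding the value list keeps multi-word options (e.g. `'More than once daily'`) intact as one argument on the receiving side.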

33
speaking_styles.py Normal file

@@ -0,0 +1,33 @@
"""
Mapping of Speaking Styles (Colors) to their constituent Traits (Positive side).
Derived from "Speaking Style Traits Quantitative test design.pdf".
"""
SPEAKING_STYLES = {
"Green": [
"Friendly | Conversational | Down-to-earth",
"Approachable | Familiar | Warm",
"Optimistic | Benevolent | Positive | Appreciative"
],
"Blue": [
"Proactive | Cooperative",
"Knowledgable | Resourceful | Savvy",
"Clear | Straightforward | Direct",
"Confident | Competent",
"Respectable | Respectful"
],
"Orange": [
"Attentive | Helpful | Caring | Deliberate",
"Reassuring | Empowering",
"Progressive | Guiding | Intentional",
"Patient | Open-minded"
],
"Red": [
"Trustworthy | Reliable | Dependable",
"Calm | Steady/Stable | Controlled",
"Transparent | Upright | Altruistic",
"Adaptive | Flexible"
]
}
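If a reverse lookup is ever needed (which style a single trait belongs to), the pipe-separated trait groups can be inverted. A sketch using an abbreviated copy of the mapping; `TRAIT_TO_STYLE` is a hypothetical name, not part of the module:

```python
# Abbreviated copy of SPEAKING_STYLES, for illustration only
SPEAKING_STYLES = {
    "Green": ["Friendly | Conversational | Down-to-earth"],
    "Red": ["Trustworthy | Reliable | Dependable"],
}

# Invert the mapping: each individual trait word -> its style color
TRAIT_TO_STYLE = {
    trait.strip(): style
    for style, groups in SPEAKING_STYLES.items()
    for group in groups
    for trait in group.split('|')
}

print(TRAIT_TO_STYLE['Reliable'])  # → Red
```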

File diff suppressed because one or more lines are too long

124
theme.py

@@ -16,7 +16,131 @@ class ColorPalette:
RANK_3 = "#5AAE95" # Sea Green (3rd Choice)
RANK_4 = "#9E9E9E" # Grey (4th Choice / Worst)
# Neutral color for unhighlighted comparison items
NEUTRAL = "#D3D3D3" # Light Grey
# Character-specific colors (for individual character plots)
# Each character has a main color and a lighter highlight for original traits
CHARACTER_BANK_TELLER = "#004C6D" # Dark Blue
CHARACTER_BANK_TELLER_HIGHLIGHT = "#669BBC" # Light Steel Blue
CHARACTER_FAMILIAR_FRIEND = "#008493" # Teal
CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT = "#A8DADC" # Pale Cyan
CHARACTER_COACH = "#5AAE95" # Sea Green
CHARACTER_COACH_HIGHLIGHT = "#A8DADC" # Pale Cyan
CHARACTER_PERSONAL_ASSISTANT = "#457B9D" # Steel Blue
CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT = "#669BBC" # Light Steel Blue
# General UI elements
TEXT = "black"
GRID = "lightgray"
BACKGROUND = "white"
# Statistical significance colors (for heatmaps/annotations)
SIG_STRONG = "#004C6D" # p < 0.001 - Dark Blue (highly significant)
SIG_MODERATE = "#0077B6" # p < 0.01 - Medium Blue (significant)
SIG_WEAK = "#5AAE95" # p < 0.05 - Sea Green (marginally significant)
SIG_NONE = "#E8E8E8" # p >= 0.05 - Light Grey (not significant)
SIG_DIAGONAL = "#FFFFFF" # White for diagonal (self-comparison)
# Extended palette for categorical charts (e.g., pie charts with many categories)
CATEGORICAL = [
"#0077B6", # PRIMARY - Medium Blue
"#004C6D", # RANK_1 - Dark Blue
"#008493", # RANK_2 - Teal
"#5AAE95", # RANK_3 - Sea Green
"#9E9E9E", # RANK_4 - Grey
"#D3D3D3", # NEUTRAL - Light Grey
"#003049", # Dark Navy
"#669BBC", # Light Steel Blue
"#A8DADC", # Pale Cyan
"#457B9D", # Steel Blue
]
# Gender-based colors (Male = Blue tones, Female = Pink tones)
# Primary colors by gender
GENDER_MALE = "#0077B6" # Medium Blue (same as PRIMARY)
GENDER_FEMALE = "#B6007A" # Medium Pink
# Ranking colors by gender (Darkest -> Lightest)
GENDER_MALE_RANK_1 = "#004C6D" # Dark Blue
GENDER_MALE_RANK_2 = "#0077B6" # Medium Blue
GENDER_MALE_RANK_3 = "#669BBC" # Light Steel Blue
GENDER_FEMALE_RANK_1 = "#6D004C" # Dark Pink
GENDER_FEMALE_RANK_2 = "#B6007A" # Medium Pink
GENDER_FEMALE_RANK_3 = "#BC669B" # Light Pink
# Neutral colors by gender (for non-highlighted items)
GENDER_MALE_NEUTRAL = "#B8C9D9" # Grey-Blue
GENDER_FEMALE_NEUTRAL = "#D9B8C9" # Grey-Pink
# Gender colors for correlation plots (green/red indicate +/- correlation)
# Male = darker shade, Female = lighter shade
CORR_MALE_POSITIVE = "#1B5E20" # Dark Green
CORR_FEMALE_POSITIVE = "#81C784" # Light Green
CORR_MALE_NEGATIVE = "#B71C1C" # Dark Red
CORR_FEMALE_NEGATIVE = "#E57373" # Light Red
# Speaking Style Colors (named after the style quadrant colors)
STYLE_GREEN = "#2E7D32" # Forest Green
STYLE_BLUE = "#1565C0" # Strong Blue
STYLE_ORANGE = "#E07A00" # Burnt Orange
STYLE_RED = "#C62828" # Deep Red
def jpmc_altair_theme():
"""JPMC brand theme for Altair charts."""
return {
'config': {
'view': {
'continuousWidth': 1000,
'continuousHeight': 500,
'strokeWidth': 0
},
'background': ColorPalette.BACKGROUND,
'axis': {
'grid': True,
'gridColor': ColorPalette.GRID,
'labelFontSize': 11,
'titleFontSize': 12,
'labelColor': ColorPalette.TEXT,
'titleColor': ColorPalette.TEXT,
'labelLimit': 200 # Allow longer labels before truncation
},
'axisX': {
'labelAngle': -45,
'labelLimit': 200 # Allow longer x-axis labels
},
'axisY': {
'labelAngle': 0
},
'legend': {
'orient': 'top',
'direction': 'horizontal',
'titleFontSize': 11,
'labelFontSize': 11
},
'title': {
'fontSize': 14,
'color': ColorPalette.TEXT,
'anchor': 'start',
'subtitleFontSize': 10,
'subtitleColor': 'gray'
},
'bar': {
'color': ColorPalette.PRIMARY
}
}
}
# Register Altair theme
try:
import altair as alt
alt.themes.register('jpmc', jpmc_altair_theme)
alt.themes.enable('jpmc')
except ImportError:
pass # Altair not installed
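The `SIG_*` comments above imply a threshold lookup for annotating heatmaps. A minimal sketch, assuming the documented cutoffs (p < .001 / .01 / .05); the helper name is hypothetical — `theme.py` only defines the constants:

```python
# Hex values copied from ColorPalette above
SIG_STRONG = "#004C6D"    # p < 0.001
SIG_MODERATE = "#0077B6"  # p < 0.01
SIG_WEAK = "#5AAE95"      # p < 0.05
SIG_NONE = "#E8E8E8"      # p >= 0.05

def significance_color(p: float) -> str:
    """Hypothetical helper: map a p-value to its significance color."""
    if p < 0.001:
        return SIG_STRONG
    if p < 0.01:
        return SIG_MODERATE
    if p < 0.05:
        return SIG_WEAK
    return SIG_NONE

print(significance_color(0.03))  # → #5AAE95
```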

1828
utils.py

File diff suppressed because it is too large

1748
uv.lock generated

File diff suppressed because it is too large


@@ -1,13 +1,14 @@
 import marimo as mo
 import polars as pl
 import altair as alt
+from theme import ColorPalette
 def check_progress(data):
     """Check if all responses are complete based on 'progress' column."""
     if data.collect().select(pl.col('progress').unique()).shape[0] == 1:
-        return mo.md("""### Responses Complete: \n\n✅ All responses are complete (progress = 100) """)
+        return """## Responses Complete: \n\n✅ All responses are complete (progress = 100) """
-    return mo.md("### Responses Complete: \n\n⚠️ There are incomplete responses (progress < 100) ⚠️")
+    return "## Responses Complete: \n\n⚠️ There are incomplete responses (progress < 100) ⚠️"
def duration_validation(data):
@@ -30,18 +31,19 @@ def duration_validation(data):
     outlier_data = _d.filter(pl.col('outlier_duration') == True).collect()
     if outlier_data.shape[0] == 0:
-        return mo.md("### Duration Outliers: \n\n✅ No duration outliers detected")
+        return "## Duration Outliers: \n\n✅ No duration outliers detected"
-    return mo.md(f"""
-    ### Duration Outliers:
+    return f"""## Duration Outliers:
     **⚠️ Potential outliers detected based on response duration ⚠️**
-    - Mean Duration: {mean_duration:.2f} seconds (approximately {mean_duration/60:.2f} minutes)
-    - Standard Deviation of Duration: {std_duration:.2f} seconds
-    - Upper Outlier Threshold (Mean + 3*Std): {upper_outlier_threshold:.2f} seconds
-    - Lower Outlier Threshold (Mean - 3*Std): {lower_outlier_threshold:.2f} seconds
-    - Number of Outlier Responses: {outlier_data.shape[0]}
+    | Metric | Value |
+    |--------|-------|
+    | Mean Duration | {mean_duration:.2f} seconds (approximately {mean_duration/60:.2f} minutes) |
+    | Standard Deviation of Duration | {std_duration:.2f} seconds |
+    | Upper Outlier Threshold (Mean + 3*Std) | {upper_outlier_threshold:.2f} seconds |
+    | Lower Outlier Threshold (Mean - 3*Std) | {lower_outlier_threshold:.2f} seconds |
+    | Number of Outlier Responses | {outlier_data.shape[0]} |
     Outliers:
@@ -50,5 +52,289 @@ def duration_validation(data):
     **⚠️ NOTE: These have not been removed from the dataset ⚠️**
-    """)
+    """
def check_straight_liners(data, max_score=3):
"""
Check for straight-lining behavior (selecting same value for all attributes).
Args:
data: Polars LazyFrame
max_score: The maximum score that is flagged if straight-lined (e.g., if 4, then 5s are allowed).
"""
import re
# detect columns groups based on pattern SS_...__Vxx__Choice_y
schema_names = data.collect_schema().names()
# regex groupings
pattern_choice = re.compile(r"(.*__V\d+)__Choice_\d+")
pattern_scale = re.compile(r"Voice_Scale_1_10__V\d+")
groups = {}
for col in schema_names:
# Check for Choice pattern (SS_...__Vxx__Choice_y)
match_choice = pattern_choice.search(col)
if match_choice:
group_key = match_choice.group(1)
if group_key not in groups:
groups[group_key] = []
groups[group_key].append(col)
continue
# Check for Voice Scale pattern (Voice_Scale_1_10__Vxx)
# All of these form a single group "Voice_Scale_1_10"
if pattern_scale.search(col):
group_key = "Voice_Scale_1_10"
if group_key not in groups:
groups[group_key] = []
groups[group_key].append(col)
# Filter for groups with multiple attributes/choices
multi_attribute_groups = {k: v for k, v in groups.items() if len(v) > 1}
if not multi_attribute_groups:
# Match the (message, outlier_df) return shape of the other branches
return "### Straight-lining Checks: \n\nNo multi-attribute question groups found.", None
# Cast all involved columns to Float64 (strict=False) to handle potential string columns
# and 1-10 scale floats (e.g. 5.5). Float64 covers integers as well.
all_group_cols = [col for cols in multi_attribute_groups.values() for col in cols]
data = data.with_columns([
pl.col(col).cast(pl.Float64, strict=False) for col in all_group_cols
])
# Build expressions
expressions = []
for key, cols in multi_attribute_groups.items():
# Logic:
# 1. Create list of values
# 2. Drop nulls
# 3. Check if all remaining are equal (n_unique == 1) AND value <= max_score
list_expr = pl.concat_list(cols).list.drop_nulls()
# Use .list.min() instead of .list.get(0) to avoid "index out of bounds" on empty lists
# If n_unique == 1, min() is the same as the single value.
# If list is empty, min() is null, which is safe.
safe_val = list_expr.list.min()
is_straight = (
(list_expr.list.len() > 0) &
(list_expr.list.n_unique() == 1) &
(safe_val <= max_score)
).alias(f"__is_straight__{key}")
value_expr = safe_val.alias(f"__val__{key}")
has_data = (list_expr.list.len() > 0).alias(f"__has_data__{key}")
expressions.extend([is_straight, value_expr, has_data])
# collect data with checks
# We only need _recordId and the check columns
# We do with_columns then select implicitly/explicitly via filter/select later.
checked_data = data.with_columns(expressions).collect()
# Process results into a nice table
outliers = []
for key, group_cols in multi_attribute_groups.items():
flag_col = f"__is_straight__{key}"
val_col = f"__val__{key}"
filtered = checked_data.filter(pl.col(flag_col))
if filtered.height > 0:
# Sort group_cols logic
# If Choice columns, sort by choice number.
# If Voice Scale columns (no Choice_), sort by Voice ID (Vxx)
if all("__Choice_" in c for c in group_cols):
key_func = lambda c: int(c.split('__Choice_')[-1])
else:
# Extract digits from Vxx
def key_func(c):
m = re.search(r"__V(\d+)", c)
return int(m.group(1)) if m else 0
sorted_group_cols = sorted(group_cols, key=key_func)
# Select relevant columns: Record ID, Value, and the sorted group columns
subset = filtered.select(["_recordId", val_col] + sorted_group_cols)
for row in subset.iter_rows(named=True):
# Create ordered list of values, using 'NaN' for missing data
resp_list = [row[c] if row[c] is not None else 'NaN' for c in sorted_group_cols]
outliers.append({
"Record ID": row["_recordId"],
"Question Group": key,
"Value": row[val_col],
"Responses": str(resp_list)
})
if not outliers:
return f"### Straight-lining Checks: \n\n✅ No straight-liners detected (value <= {max_score})", None
outlier_df = pl.DataFrame(outliers)
# --- Analysis & Visualization ---
total_respondents = checked_data.height
# 1. & 3. Percentage Calculation
group_stats = []
value_dist_data = []
# Calculate Straight-Liners for ALL groups found in Data
# Condition: Respondent straight-lined ALL questions that they actually answered (ignoring empty/skipped questions)
# Logic: For every group G: if G has data (len > 0), then G must be straight.
# Also, the respondent must have answered at least one question group.
conditions = []
has_any_data_exprs = []
for key in multi_attribute_groups.keys():
flag_col = f"__is_straight__{key}"
data_col = f"__has_data__{key}"
# If has_data is True, is_straight MUST be True for it to count as valid straight-lining behavior for that user.
# Equivalent: (not has_data) OR is_straight
cond = (~pl.col(data_col)) | pl.col(flag_col)
conditions.append(cond)
has_any_data_exprs.append(pl.col(data_col))
all_straight_count = checked_data.filter(
pl.all_horizontal(conditions) & pl.any_horizontal(has_any_data_exprs)
).height
all_straight_pct = (all_straight_count / total_respondents) * 100
for key in multi_attribute_groups.keys():
flag_col = f"__is_straight__{key}"
val_col = f"__val__{key}"
# Filter for straight-liners in this specific group
sl_sub = checked_data.filter(pl.col(flag_col))
count = sl_sub.height
pct = (count / total_respondents) * 100
group_stats.append({
"Question Group": key,
"Straight-Liner %": pct,
"Count": count
})
# Get Value Distribution for this group's straight-liners
if count > 0:
# Group by the Value they straight-lined
dist = sl_sub.group_by(val_col).agg(pl.len().alias("count"))
for row in dist.iter_rows(named=True):
value_dist_data.append({
"Question Group": key,
"Value": row[val_col],
"Count": row["count"]
})
stats_df = pl.DataFrame(group_stats)
dist_df = pl.DataFrame(value_dist_data)
# Plot 1: % of Responses with Straight-Liners per Question
# Vertical bars with Count label on top
base_pct = alt.Chart(stats_df).encode(
x=alt.X("Question Group", sort=alt.EncodingSortField(field="Straight-Liner %", order="descending"))
)
bars_pct = base_pct.mark_bar(color=ColorPalette.PRIMARY).encode(
y=alt.Y("Straight-Liner %:Q", axis=alt.Axis(format=".1f", title="Share of all responses [%]")),
tooltip=["Question Group", alt.Tooltip("Straight-Liner %:Q", format=".1f"), "Count"]
)
text_pct = base_pct.mark_text(dy=-10).encode(
y=alt.Y("Straight-Liner %:Q"),
text=alt.Text("Count")
)
chart_pct = (bars_pct + text_pct).properties(
title="Share of Responses with Straight-Liners per Question",
width=800,
height=300
)
# Plot 2: Value Distribution (Horizontal Stacked Normalized Bar)
# Question Groups sorted by Total Count
# Values stacked 1 (left) -> 5 (right)
# Legend on top
# Total count at bar end
# Sort order for Y axis (Question Group) based on total Count (descending)
# Explicitly calculate sort order from stats_df to ensure consistency across layers
# High counts at the top
sorted_groups = stats_df.sort("Count", descending=True)["Question Group"].to_list()
# Base chart for Bars
# Use JPMC-aligned colors (blues) instead of default categorical rainbow
# Remove legend title as per plots.py style
bars_dist = alt.Chart(dist_df).mark_bar().encode(
y=alt.Y("Question Group", sort=sorted_groups),
x=alt.X("Count", stack="normalize", axis=alt.Axis(format="%"), title="Share of SL Responses"),
color=alt.Color("Value:O",
title=None, # explicit removal of title like in plots.py
scale=alt.Scale(scheme="blues"), # Professional blue scale
legend=alt.Legend(orient="top", direction="horizontal")
),
order=alt.Order("Value", sort="ascending"), # Ensures 1 is Left, 5 is Right
tooltip=["Question Group", "Value", "Count"]
)
# Text layer for Total Count (using stats_df which already has totals)
# using same sort for Y
text_dist = alt.Chart(stats_df).mark_text(align='left', dx=5).encode(
y=alt.Y("Question Group", sort=sorted_groups),
x=alt.datum(1.0), # Position at 100%
text=alt.Text("Count")
)
chart_dist = (bars_dist + text_dist).properties(
title="Distribution of Straight-Lined Values",
width=800,
height=500
)
analysis_md = f"""
### Straight-Lining Analysis
*"Straight-lining" is defined here as selecting the same response value for all attributes within a multi-attribute question group.*
* **Total Respondents**: {total_respondents}
* **Respondents straight-lining ALL questions presented to them**: {all_straight_pct:.2f}% ({all_straight_count} respondents)
"""
return (mo.vstack([
mo.md(f"**⚠️ Potential straight-liners detected ⚠️**\n\n"),
mo.ui.table(outlier_df),
mo.md(analysis_md),
alt.vconcat(chart_pct, chart_dist).resolve_legend(color="independent")
]), outlier_df)
if __name__ == "__main__":
from utils import QualtricsSurvey
RESULTS_FILE = "data/exports/OneDrive_2026-01-28/1-28-26 Afternoon/JPMC_Chase Brand Personality_Quant Round 1_January 28, 2026_Afternoon_Labels.csv"
QSF_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf"
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
data = S.load_data()
# print("Checking Green Blue:")
# print(check_straight_liners(S.get_ss_green_blue(data)[0]))
# print("Checking Orange Red:")
# print(check_straight_liners(S.get_ss_orange_red(data)[0]))
print("Checking Voice Scale 1-10:")
print(check_straight_liners(S.get_voice_scale_1_10(data)[0]))
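The flagging rule that the polars expressions implement can be restated in plain Python — a question group is straight-lined when, after dropping missing answers, all remaining values are identical and the shared value is at or below `max_score`. A self-contained sketch (function name illustrative):

```python
# Plain-Python restatement of the straight-lining rule above
def is_straight_lined(responses, max_score=3):
    """Return True if all answered values are identical and <= max_score."""
    answered = [v for v in responses if v is not None]
    if not answered:
        return False  # fully skipped groups are never flagged
    return len(set(answered)) == 1 and answered[0] <= max_score

print(is_straight_lined([2, 2, None, 2]))  # → True
print(is_straight_lined([5, 5, 5]))        # → False (above max_score)
print(is_straight_lined([2, 3, 2]))        # → False (not uniform)
```

This mirrors why the polars code uses `list.min()` rather than `list.get(0)`: on an empty list the minimum is null, which safely fails the comparison instead of raising an index error.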

File diff suppressed because it is too large

20
voices.py Normal file

@@ -0,0 +1,20 @@
Voice Reference Gender
Voice 14 Female
Voice 04 Female
Voice 08 Female
Voice 77 Female
Voice 48 Female
Voice 82 Female
Voice 89 Female
Voice 91 Emily (Current IVR Voice) Female
Voice 34 Male
Voice 69 Male
Voice 45 Male
Voice 46 Male
Voice 54 Male
Voice 74 Male
Voice 81 Male
Voice 86 Male
Voice 88 Male
Voice 16 Male
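For scripts that need this table programmatically, it can be parsed by treating the last whitespace-separated token of each row as the gender and the first two tokens as the voice id (the optional middle "Reference" field, as on the Voice 91 row, then falls out naturally). A sketch with three rows copied from above:

```python
# Sample rows copied from the voices table
raw = """Voice 14 Female
Voice 91 Emily (Current IVR Voice) Female
Voice 34 Male"""

voice_gender = {}
for line in raw.splitlines():
    tokens = line.split()
    voice_id = ' '.join(tokens[:2])      # e.g. "Voice 14"
    voice_gender[voice_id] = tokens[-1]  # gender is always the last column

print(voice_gender['Voice 91'])  # → Female
```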

18
wordclouds.py Normal file

@@ -0,0 +1,18 @@
"""Word cloud utilities for Voice Branding analysis.
The main wordcloud function is available as a method on QualtricsSurvey:
S.plot_traits_wordcloud(data, column='Top_3_Traits', title='...')
This module provides standalone imports for backwards compatibility.
"""
import numpy as np
from os import path
from PIL import Image, ImageDraw
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")