Compare commits

64 Commits (bc12df28a5...main)

| SHA1 |
|---|
| 03a716e8ec |
| 8720bb670d |
| 9dfab75925 |
| 14e28cf368 |
| 8e181e193a |
| 6c16993cb3 |
| 92c6fc03ab |
| 7fb6570190 |
| 840bd2940d |
| af9a15ccb0 |
| a3cf9f103d |
| f0eab32c34 |
| d231fc02db |
| fc76bb0ab5 |
| ab78276a97 |
| e17646eb70 |
| ad1d8c6e58 |
| f5b4c247b8 |
| a35670aa72 |
| 36280a6ff8 |
| 9a587dcc4c |
| 9a49d1c690 |
| 8f505da550 |
| 495b56307c |
| 1e76a82f24 |
| 01b7d50637 |
| dca9ac11ba |
| 081fb0dd6e |
| 2817ed240a |
| e44251c3d6 |
| 8dd41dfc96 |
| 840cb4e6dc |
| a162701e94 |
| 38f6d8a87c |
| 5c39bbb23a |
| 190e4fbdc4 |
| 2408d06098 |
| 1dce4db909 |
| acf9c45844 |
| 77fdd6e8f6 |
| 426495ebe3 |
| a7ee854ed0 |
| 97c4b07208 |
| fd14038253 |
| 611fc8d19a |
| 3ac330263f |
| bda4d54231 |
| f2c659c266 |
| 29df6a4bd9 |
| a62524c6e4 |
| 43b41a01f5 |
| b7cf6adfb8 |
| 6ba30ff041 |
| 02a0214539 |
| 45dd121d90 |
| d770645d8e |
| 6b3fcb2f43 |
| 036dd911df |
| becc435d3c |
| 8aee09f968 |
| c1729d4896 |
| 2958fed780 |
| 5f9e67a312 |
| 3ee25f9e33 |
.github/agents/plot-creator.agent.md (vendored, new file, 216 lines)

@@ -0,0 +1,216 @@
# Plot Creator Agent

You are a specialized agent for creating data visualizations for the Voice Branding Qualtrics survey analysis project.

## ⚠️ Critical Data Handling Rules

1. **NEVER assume or load datasets without explicit user consent** - This is confidential data
2. **NEVER guess file paths or dataset locations**
3. **DO NOT assume data comes from a `Survey.get_*()` method** - Data may have been manually manipulated in a notebook
4. **Use ONLY the data snippet provided by the user** for understanding structure and testing

## Your Workflow

When the user provides a plotting request (e.g., "I need a bar plot that shows the frequency of the times each trait is chosen per brand character"), follow this workflow:
### Step 1: Understand the Request

- Parse the user's natural language request to identify:
  - **Chart type** (bar, stacked bar, line, heatmap, etc.)
  - **X-axis variable**
  - **Y-axis variable / aggregation** (count, mean, sum, etc.)
  - **Grouping / color encoding** (if any)
  - **Filtering requirements** (if any)
- Think critically about whether the requested plot is feasible with the available data.
- Think critically about the best way to visualize the requested information, and whether the requested chart type is appropriate. If not, consider alternatives and ask the user for confirmation before proceeding.
### Step 2: Analyze Provided Data

The user will paste a `df.head()` output. Examine:

- Column names and their meaning (refer to column naming conventions in `.github/copilot-instructions.md`)
- Data types
- Whether the data is in the right shape for the desired plot

**Important:** Do NOT make assumptions about where this data came from. It may be:

- Output from a `Survey.get_*()` method
- Manually transformed in a notebook
- A join of multiple data sources
- Any other custom manipulation
### Step 3: Determine Data Manipulation Needs

Decide if the provided data can be plotted directly, or if transformations are needed:

- **No manipulation**: Data is ready → proceed to Step 5
- **Manipulation needed**: Aggregation, pivoting, melting, filtering, or new computed columns required → proceed to Step 4
### Step 4: Create Data Manipulation Function (if needed)

Check if an existing `transform_<descriptive_name>` function in `utils.py` already performs the needed data manipulation. If not, create a dedicated method in the `QualtricsSurvey` class (`utils.py`):

```python
def transform_<descriptive_name>(self, df: pl.LazyFrame | pl.DataFrame) -> tuple[pl.LazyFrame, dict | None]:
    """Transform <input_description> to <output_description>.

    Original use-case: "<paste user's original question here>"

    This function <concise 1-2 sentence explanation of what it does>.

    Args:
        df: Pre-fetched data as a Polars LazyFrame or DataFrame.

    Returns:
        tuple: (LazyFrame with columns [...], Optional metadata dict)
    """
    # Implementation - transform the INPUT data only
    # NEVER call self.get_*() methods here
    return result, metadata
```

**Requirements:**

- **NEVER retrieve data inside transform functions** - The function receives already-fetched data as input
- Data retrieval (`get_*()` calls) stays in the notebook so analysts can see all steps
- Method must return a `(pl.LazyFrame, Optional[dict])` tuple
- Docstring MUST contain the original question verbatim
- Follow the patterns of existing `QualtricsSurvey` class methods in `utils.py`
**❌ BAD Example (do NOT do this):**

```python
def transform_character_trait_frequency(self, q: pl.LazyFrame):
    # BAD: Fetching data inside transform function
    char_df, _ = self.get_character_refine(q)  # ← WRONG!
    # ... rest of transform
```

**✅ GOOD Example:**

```python
def transform_character_trait_frequency(self, char_df: pl.LazyFrame | pl.DataFrame):
    # GOOD: Receives pre-fetched data as input
    if isinstance(char_df, pl.LazyFrame):
        char_df = char_df.collect()
    # ... rest of transform
```

**In the notebook, the analyst writes:**

```python
char_data, _ = S.get_character_refine(data)  # Step visible to analyst
trait_freq, _ = S.transform_character_trait_frequency(char_data)  # Transform step
chart = S.plot_character_trait_frequency(trait_freq)
```
### Step 5: Create Temporary Test File

Create `debug_plot_temp.py` for testing. **Prefer using the data snippet already provided by the user.**

**Option A: Use provided data snippet (preferred)**

If the user provided a `df.head()` or sample data output, create inline test data from it:

```python
"""Temporary test file for <plot_name>.

Delete after testing.
"""
import altair as alt
import polars as pl

from theme import ColorPalette

# ============================================================
# TEST DATA (reconstructed from user's df.head() output)
# ============================================================
test_data = pl.DataFrame({
    "Column1": ["value1", "value2", ...],
    "Column2": [1, 2, ...],
    # ... recreate structure from provided sample
})
# ============================================================

# Test the plot function
from plots import QualtricsPlotsMixin
# ... test code
```

**Option B: Ask user (only if necessary)**

Only ask the user for additional code if:

- The provided sample is insufficient to test the plot logic
- You need to understand complex data relationships not visible in the sample
- The transformation requires understanding the full data pipeline

If you must ask:

> "The sample data you provided should work for basic testing. However, I need [specific reason]. Could you provide:
> 1. [specific information needed]
>
> If you'd prefer, I can proceed with a minimal test using the sample data you shared."
### Step 6: Create Plot Function

Add a new method to `QualtricsPlotsMixin` in `plots.py`:

```python
def plot_<descriptive_name>(
    self,
    data: pl.LazyFrame | pl.DataFrame | None = None,
    title: str = "<Default title>",
    x_label: str = "<X label>",
    y_label: str = "<Y label>",
    height: int | None = None,
    width: int | str | None = None,
) -> alt.Chart:
    """<Docstring with original question and description>."""
    df = self._ensure_dataframe(data)

    # Build chart using ONLY ColorPalette from theme.py
    chart = alt.Chart(...).mark_bar(color=ColorPalette.PRIMARY)...

    chart = self._save_plot(chart, title)
    return chart
```

**Requirements:**

- ALL colors MUST use `ColorPalette` constants from `theme.py`
- Use `self._ensure_dataframe()` to handle LazyFrame/DataFrame input
- Use `self._save_plot()` at the end to enable auto-save
- Use `self._process_title()` for titles with `<br>` tags
- Follow existing plot patterns (see `plot_average_scores_with_counts`, `plot_top3_ranking_distribution`)
### Step 7: Test

Run the temporary test file to verify the plot works:

```bash
uv run python debug_plot_temp.py
```
### Step 8: Provide Summary

After successful completion, output a summary:

````markdown
✅ Plot created successfully!

**Data function** (if created): `S.transform_<name>(data)`
**Plot function**: `S.plot_<name>(data, title="...")`

**Usage example:**
```python
# Assuming you have your data already prepared as `plot_data`
chart = S.plot_<name>(plot_data, title="Your Title Here")
chart  # Display in Marimo
```

**Files modified:**
- `utils.py` - Added `transform_<name>()` (if applicable)
- `plots.py` - Added `plot_<name>()`
- `debug_plot_temp.py` - Test file (can be deleted)
````
## Critical Rules (from .github/copilot-instructions.md)

1. **NEVER load confidential data without explicit user-provided code**
2. **NEVER assume data source** - do not guess which `get_*()` method produced the data
3. **NEVER modify Marimo notebooks** (`0X_*.py` files)
4. **NEVER run Marimo notebooks for debugging**
5. **ALL colors MUST come from `theme.py`** - use `ColorPalette.PRIMARY`, `ColorPalette.RANK_1`, etc.
6. **If a new color is needed**, add it to `ColorPalette` in `theme.py` first
7. **No changelog markdown files** - do not create new `.md` files documenting changes
8. **Reading notebooks is OK** to understand function usage patterns
9. **Getter methods return tuples**: `(LazyFrame, Optional[metadata])`
10. **Use Polars LazyFrames** until visualization, then `.collect()`

If any rule causes problems, ask the user for permission before deviating.
## Reference: Column Patterns

- `SS_Green_Blue__V14__Choice_1` → Speaking Style trait score
- `Voice_Scale_1_10__V48` → 1-10 voice rating
- `Top_3_Voices_ranking__V77` → Ranking position
- `Character_Ranking_<Name>` → Character personality ranking
.github/copilot-instructions.md (vendored, new file, 105 lines)

@@ -0,0 +1,105 @@
# Voice Branding Quantitative Analysis - Copilot Instructions

## Project Overview

Qualtrics survey analysis for brand personality research. Analyzes voice samples (V04-V91) across speaking style traits, character rankings, and demographic segments. Uses **Marimo notebooks** for interactive analysis and **Polars** for data processing.

## Architecture

### Core Components

- **`QualtricsSurvey`** (`utils.py`): Main class combining data loading, filtering, and plotting via `QualtricsPlotsMixin`
- **Marimo notebooks** (`0X_*.py`): Interactive apps run via `uv run marimo run <file>.py`
- **Data exports** (`data/exports/<date>/`): Qualtrics CSVs with `_Labels.csv` and `_Values.csv` variants
- **QSF files**: Qualtrics survey definitions for mapping QIDs to question text

### Data Flow

```
Qualtrics CSV (3-row header) → QualtricsSurvey.load_data() → LazyFrame with QID columns
        ↓
filter_data() → get_*() methods → plot_*() methods → figures/<export>/<filter>/
```
## ⚠️ Critical AI Agent Rules

1. **NEVER modify Marimo notebooks directly** - The `0X_*.py` files are Marimo notebooks and should not be edited by AI agents
2. **NEVER run Marimo notebooks for debugging** - These are interactive apps, not test scripts
3. **For debugging**: Create a standalone temporary Python script (e.g., `debug_temp.py`) to test functions
4. **Reading notebooks is OK** - You may read notebook files to understand how functions are used. Ask the user which notebook they're working in for context
5. **No changelog markdown files** - Do not create new markdown files to document small changes or describe new usage
## Key Patterns

### Polars LazyFrames

Always work with `pl.LazyFrame` until visualization; call `.collect()` only when needed:

```python
data = S.load_data()  # Returns LazyFrame
subset, meta = S.get_voice_scale_1_10(data)  # Returns (LazyFrame, Optional[dict])
df = subset.collect()  # Materialize for plotting
```

### Column Naming Convention

Survey columns follow patterns that encode voice/trait info:

- `SS_Green_Blue__V14__Choice_1` → Speaking Style, Voice 14, Trait 1
- `Voice_Scale_1_10__V48` → 1-10 rating for Voice 48
- `Top_3_Voices_ranking__V77` → Ranking position for Voice 77
### Filter State & Figure Output

`QualtricsSurvey` stores filter state and auto-generates output paths:

```python
S.filter_data(data, consumer=['Early Professional'])
# Plots save to: figures/<export>/Cons-Early_Professional/<plot_name>.png
```

### Getter Methods Return Tuples

All `get_*()` methods return `(LazyFrame, Optional[metadata])`:

```python
df, choices_map = S.get_ss_green_blue(data)  # choices_map has trait descriptions
df, _ = S.get_character_ranking(data)  # Second element may be None
```
## Development Commands

```bash
# Run interactive analysis notebook
uv run marimo run 02_quant_analysis.py --port 8080

# Edit notebook in editor mode
uv run marimo edit 02_quant_analysis.py

# Headless mode for shared access
uv run marimo run 02_quant_analysis.py --headless --port 8080
```
## Important Files
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `utils.py` | `QualtricsSurvey` class, data transformations, PPTX utilities |
|
||||
| `plots.py` | `QualtricsPlotsMixin` with all Altair plotting methods |
|
||||
| `theme.py` | `ColorPalette` and `jpmc_altair_theme()` for consistent styling |
|
||||
| `validation.py` | Data quality checks (progress, duration outliers, straight-liners) |
|
||||
| `speaking_styles.py` | `SPEAKING_STYLES` dict mapping colors to trait groups |
|
||||
|
||||
## Conventions

### Altair Charts & Colors

- **ALL colors MUST come from `theme.py`** - Use `ColorPalette.PRIMARY`, `ColorPalette.RANK_1`, etc.
- If a new color is needed, add it to `ColorPalette` in `theme.py` first, then use it
- Never hardcode hex colors directly in plotting code
- Charts auto-save via `_save_plot()` when `fig_save_dir` is set
- Filter footnotes are added automatically via `_add_filter_footnote()`
### QSF Parsing

Use `_get_qsf_question_by_QID()` to extract question config:

```python
cfg = self._get_qsf_question_by_QID('QID27')['Payload']
recode_map = cfg['RecodeValues']  # Maps choice numbers to values
```
### PPTX Image Replacement

Images are matched by perceptual hash (not filename); alt-text encodes the figure path:

```python
utils.update_ppt_alt_text(ppt_path, image_source_dir)  # Tag images with alt-text
utils.pptx_replace_named_image(ppt, target_tag, new_image)  # Replace by alt-text
```

This process should be run manually by the user ONLY.
.vscode/extensions.json (vendored, new file, 5 lines)

@@ -0,0 +1,5 @@

{
    "recommendations": [
        "wakatime.vscode-wakatime"
    ]
}
.vscode/settings.json (vendored, new file, 5 lines)

@@ -0,0 +1,5 @@

{
    "chat.tools.terminal.autoApprove": {
        "/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/.venv/bin/python": true
    }
}
@@ -27,7 +27,7 @@ def _(Path):
 
 
 @app.cell
 def _(qsf_file, results_file, utils):
-    survey = utils.JPMCSurvey(results_file, qsf_file)
+    survey = utils.QualtricsSurvey(results_file, qsf_file)
     data_all = survey.load_data()
     return (survey,)
@@ -1,7 +1,7 @@
 import marimo
 
-__generated_with = "0.19.2"
-app = marimo.App(width="medium")
+__generated_with = "0.19.7"
+app = marimo.App(width="full")
 
 
 @app.cell
@@ -11,16 +11,17 @@ def _():
     from pathlib import Path
 
     from validation import check_progress, duration_validation, check_straight_liners
-    from utils import JPMCSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
+    from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
+    import utils
 
     from speaking_styles import SPEAKING_STYLES
     return (
-        JPMCSurvey,
         Path,
+        QualtricsSurvey,
         SPEAKING_STYLES,
         calculate_weighted_ranking_scores,
         check_progress,
         check_straight_liners,
         duration_validation,
         mo,
         pl,
@@ -29,18 +30,6 @@ def _():
 
 
-@app.cell(hide_code=True)
-def _(mo):
-    mo.outline(label="Table of Contents")
-    return
-
-
-@app.cell
-def _():
-    # Select Dataset
-    return
-
-
 @app.cell
 def _(mo):
     file_browser = mo.ui.file_browser(
         initial_path="./data/exports", multiple=False, restrict_navigation=True, filetypes=[".csv"], label="Select 'Labels' File"
@@ -60,14 +49,8 @@ def _(Path, file_browser, mo):
 
 
-@app.cell
-def _(RESULTS_FILE, mo):
-    mo.stop(not RESULTS_FILE.name.lower().endswith('labels.csv'), mo.md("**⚠️ Make sure you select a `_Labels.csv` file above**"))
-    return
-
-
 @app.cell
-def _(JPMCSurvey, QSF_FILE, RESULTS_FILE, mo):
-    S = JPMCSurvey(RESULTS_FILE, QSF_FILE)
+def _(QSF_FILE, QualtricsSurvey, RESULTS_FILE, mo):
+    S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
     try:
         data_all = S.load_data()
     except NotImplementedError as e:
@@ -76,29 +59,47 @@ def _(JPMCSurvey, QSF_FILE, RESULTS_FILE, mo):
 
 
 @app.cell
 def _():
     # check_straight_liners(S.get_ss_green_blue(data_all)[0])
     return
 
 
 @app.cell(hide_code=True)
-def _(Path, RESULTS_FILE, mo):
+def _(Path, RESULTS_FILE, data_all, mo):
     mo.md(f"""
 
     ---
     # Load Data
 
     **Dataset:** `{Path(RESULTS_FILE).name}`
 
 
+    **Responses**: `{data_all.collect().shape[0]}`
     """)
     return
 
 
 @app.cell
-def _(check_progress, data_all, duration_validation, mo):
+def _():
+    sl_ss_max_score = 5
+    sl_v1_10_max_score = 10
+    return sl_ss_max_score, sl_v1_10_max_score
+
+
+@app.cell
+def _(
+    S,
+    check_progress,
+    check_straight_liners,
+    data_all,
+    duration_validation,
+    mo,
+    sl_ss_max_score,
+    sl_v1_10_max_score,
+):
+    _ss_all = S.get_ss_green_blue(data_all)[0].join(S.get_ss_orange_red(data_all)[0], on='_recordId')
+    _sl_ss_c, sl_ss_df = check_straight_liners(_ss_all, max_score=sl_ss_max_score)
+
+    _sl_v1_10_c, sl_v1_10_df = check_straight_liners(
+        S.get_voice_scale_1_10(data_all)[0],
+        max_score=sl_v1_10_max_score
+    )
+
+
     mo.md(f"""
-    ## Data Validation
+    # Data Validation
 
     {check_progress(data_all)}
@@ -107,11 +108,30 @@ def _(check_progress, data_all, duration_validation, mo):
     {duration_validation(data_all)}
 
 
+    ## Speaking Style - Straight Liners
+    {_sl_ss_c}
+
+
+    ## Voice Score Scale 1-10 - Straight Liners
+    {_sl_v1_10_c}
     """)
     return
 
 
+@app.cell
+def _(data_all):
+    # # Drop any Voice Scale 1-10 responses with straight-lining, using sl_v1_10_df _responseId values
+    # records_to_drop = sl_v1_10_df.select('Record ID').to_series().to_list()
+
+    # data_validated = data_all.filter(~pl.col('_recordId').is_in(records_to_drop))
+
+    # mo.md(f"""
+    # Dropped `{len(records_to_drop)}` responses with straight-lining in Voice Scale 1-10 evaluation.
+    # """)
+    data_validated = data_all
+    return (data_validated,)
+
+
 @app.cell(hide_code=True)
 def _(S, mo):
     filter_form = mo.md('''
@@ -142,19 +162,19 @@ def _(S, mo):

    {filter_form}
    ''')
    return


    return (filter_form,)
@app.cell
def _(data_validated):
    # mo.stop(filter_form.value is None, mo.md("**Please submit filter above to proceed**"))
    # _d = S.filter_data(data_validated, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])

    # # Stop execution and prevent other cells from running if no data is selected
    # mo.stop(len(_d.collect()) == 0, mo.md("**No Data available for current filter combination**"))
    # data = _d

@app.cell(hide_code=True)
def _(S, data_all, filter_form, mo):
    mo.stop(filter_form.value is None, mo.md("**Please submit filter above to proceed**"))
    _d = S.filter_data(data_all, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])

    # Stop execution and prevent other cells from running if no data is selected
    mo.stop(len(_d.collect()) == 0, mo.md("**No Data available for current filter combination**"))
    data = _d
    data = data_validated

    data.collect()
    return (data,)
@@ -173,6 +193,20 @@ def _(S, data, mo):
     return (char_rank,)
 
 
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+
+    """)
+    return
+
+
+@app.cell(hide_code=True)
+def _():
+    # char_rank = S.get_character_ranking(data)[0]
+    return
+
+
 @app.cell
 def _(S, char_rank, mo):
     mo.md(f"""
@@ -218,6 +252,13 @@ def _(S, data, mo):
     return (v_18_8_3,)
 
 
+@app.cell(hide_code=True)
+def _():
+
+    # print(v_18_8_3.head())
+    return
+
+
 @app.cell(hide_code=True)
 def _(S, mo, v_18_8_3):
     mo.md(f"""
@@ -240,7 +281,7 @@ def _(S, mo, v_18_8_3):
     return
 
 
-@app.cell(hide_code=True)
+@app.cell
 def _(S, calculate_weighted_ranking_scores, data):
     top3_voices = S.get_top_3_voices(data)[0]
     top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
@@ -284,6 +325,11 @@ def _(S, mo, top3_voices):
 
 
+@app.cell
+def _():
+    return
+
+
 @app.cell(hide_code=True)
 def _(S, data, mo, utils):
     ss_or, choice_map_or = S.get_ss_orange_red(data)
     ss_gb, choice_map_gb = S.get_ss_green_blue(data)
@@ -322,12 +368,12 @@ def _(S, mo, pl, ss_long):
     return
 
 
-@app.cell(hide_code=True)
+@app.cell
 def _():
     return
 
 
-@app.cell
+@app.cell(hide_code=True)
 def _(S, data, mo):
     vscales = S.get_voice_scale_1_10(data)[0]
     # plot_average_scores_with_counts(vscales, x_label='Voice', width=1000)
@@ -338,32 +384,53 @@ def _(S, data, mo):


@app.cell
def _(vscales):
    print(vscales.collect().head())
    return


@app.cell
def _(pl, vscales):
    # Count non-null values per row
    nn_vscale = vscales.with_columns(
        non_null_count = pl.sum_horizontal(pl.all().exclude("_recordID").is_not_null())
    )
    nn_vscale.collect()['non_null_count'].describe()
    return


@app.cell(hide_code=True)
def _(S, mo, vscales):
    mo.md(f"""
    ### How does each voice score on a scale from 1-10?

    {mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales, x_label='Voice', width=1000))}
    {mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10)"))}
    """)
    return


@app.cell(hide_code=True)
def _():
    return
@app.cell
def _(S, mo, utils, vscales):
    _target_cols=[c for c in vscales.collect().columns if c not in ['_recordId']]
    vscales_row_norm = utils.normalize_row_values(vscales.collect(), target_cols=_target_cols)

    mo.md(f"""
    ### Voice scale 1-10 normalized per respondent?

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""

    {mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales_row_norm, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10) - Normalized per Respondent"))}
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""

@app.cell
def _(S, mo, utils, vscales):
    _target_cols=[c for c in vscales.collect().columns if c not in ['_recordId']]
    vscales_global_norm = utils.normalize_global_values(vscales.collect(), target_cols=_target_cols)

    mo.md(f"""
    ### Voice scale 1-10 normalized per respondent?

    {mo.ui.altair_chart(S.plot_average_scores_with_counts(vscales_global_norm, x_label='Voice', width=1000, domain=[1,10], title="Voice General Impression (Scale 1-10) - Normalized Across All Respondents"))}
    """)
    return
@@ -400,7 +467,7 @@ def _(choice_map, mo, ss_all, utils, vscales):
     return df_style, joined_df
 
 
-@app.cell
+@app.cell(hide_code=True)
 def _(S, SPEAKING_STYLES, joined_df, mo):
     _content = """### Total Results
 
@@ -431,34 +498,18 @@ def _(mo):

    - [ ] 4 correlation diagrams considering each speaking style (4) and all female voice results.
    - [ ] 4 correlation diagrams considering each speaking style (4) and all male voice results.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""

    ## Correlations Voice Speaking Styles <-> Voice Ranking Points
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Let's show how scoring better on these speaking styles correlates (or not) with better Voice Ranking results. For each speaking style we show how the traits in these speaking styles correlate with voice ranking points. This gives us a total of 4 correlation diagrams.

    Example for speaking style green:
    - Trait 1: Friendly | Conversational | Down-to-earth
    - Trait 2: Approachable | Familiar | Warm
    - Trait 3: Optimistic | Benevolent | Positive | Appreciative
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### Total Results

    - [ ] 4 correlation diagrams
@@ -466,7 +517,31 @@ def _(mo):
     return
 
 
-@app.cell
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+
+    """)
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+
+    """)
+    return
+
+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+
+    """)
+    return
+
+
+@app.cell(hide_code=True)
 def _(S, SPEAKING_STYLES, df_style, mo, top3_voices, utils):
     df_ranking = utils.process_voice_ranking_data(top3_voices)
     joined = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
@@ -490,7 +565,7 @@ def _(S, SPEAKING_STYLES, df_style, mo, top3_voices, utils):
     return
 
 
-@app.cell(hide_code=True)
+@app.cell
 def _(mo):
     mo.md(r"""
     ### Female / Male Voices considered separately
@@ -501,49 +576,5 @@ def _(mo):
     return
 
 
-@app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""
-    ## Correlation Heatmap all evaluations <-> voice acoustic data
-
-    - [ ] Heatmap for male voices
-    - [ ] Heatmap for female voices
-    """)
-    return
-
-
-@app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""
-    ## Most Prominent Character Personality Traits
-    """)
-    return
-
-
-@app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""
-    The last question of the survey is about traits for the described character's personality. For each Character personality, we want to display the 8 most chosen character personality traits. This will give us a total of 4 diagrams, one for each character personality included in the test.
-
-    - [ ] Bank Teller
-    - [ ] Familiar Friend
-    - [ ] The Coach
-    - [ ] Personal Assistant
-    """)
-    return
-
-
-@app.cell
-def _(mo):
-    mo.md(r"""
-    ---
-
-    # Results per subgroup
-
-    Use the dropdown selector at the top to filter the data and generate all the plots again
-    """)
-    return
-
-
 if __name__ == "__main__":
     app.run()
03_quant_report.py (new file, 933 lines)

@@ -0,0 +1,933 @@
import marimo


__generated_with = "0.19.7"

app = marimo.App(width="full")


with app.setup:
    import marimo as mo
    import polars as pl
    from pathlib import Path

    from validation import check_progress, duration_validation, check_straight_liners
    from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
    import utils

    from speaking_styles import SPEAKING_STYLES


@app.cell
def _():
    file_browser = mo.ui.file_browser(
        initial_path="./data/exports", multiple=False, restrict_navigation=True, filetypes=[".csv"], label="Select 'Labels' File"
    )
    file_browser
    return (file_browser,)


@app.cell
def _(file_browser):
    mo.stop(file_browser.path(index=0) is None, mo.md("**⚠️ Please select a `_Labels.csv` file above to proceed**"))
    RESULTS_FILE = Path(file_browser.path(index=0))
    QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'
    return QSF_FILE, RESULTS_FILE


@app.cell
def _(QSF_FILE, RESULTS_FILE):
    S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
    try:
        data_all = S.load_data()
    except NotImplementedError as e:
        mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))
    return S, data_all


@app.cell(hide_code=True)
def _(RESULTS_FILE, data_all):
    mo.md(rf"""
    ---
    # Load Data

    **Dataset:** {Path(RESULTS_FILE).name}

    **Responses:** {data_all.collect().shape[0]}
    """)
    return


@app.cell
def _(S, data_all):
    sl_ss_max_score = 5
    sl_v1_10_max_score = 10

    _ss_all = S.get_ss_green_blue(data_all)[0].join(S.get_ss_orange_red(data_all)[0], on='_recordId')
    _sl_ss_c, sl_ss_df = check_straight_liners(_ss_all, max_score=sl_ss_max_score)

    _sl_v1_10_c, sl_v1_10_df = check_straight_liners(
        S.get_voice_scale_1_10(data_all)[0],
        max_score=sl_v1_10_max_score
    )

    mo.md(f"""
    {check_progress(data_all)}

    {duration_validation(data_all)}

    ## Speaking Style - Straight Liners
    {_sl_ss_c}

    ## Voice Score Scale 1-10 - Straight Liners
    {_sl_v1_10_c}
    """)
    return
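A straight-liner check of the kind used above flags respondents who give an identical rating on every item of a matrix question. A minimal sketch of the idea (an assumption for illustration, not the actual `check_straight_liners` implementation in `validation`):

```python
def find_straight_liners(responses, max_score=5):
    """Return IDs of respondents whose in-range ratings are all identical."""
    flagged = []
    for record_id, scores in responses.items():
        in_range = [s for s in scores if 1 <= s <= max_score]
        if len(in_range) > 1 and len(set(in_range)) == 1:
            flagged.append(record_id)
    return flagged

# Hypothetical responses: record id -> per-item scores on a 1..max_score scale
flagged = find_straight_liners({"r1": [3, 3, 3, 3], "r2": [1, 4, 2, 5]})
```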


@app.cell
def _(data_all):
    # # Drop any Voice Scale 1-10 responses with straight-lining, using sl_v1_10_df _responseId values
    # records_to_drop = sl_v1_10_df.select('Record ID').to_series().to_list()

    # data_validated = data_all.filter(~pl.col('_recordId').is_in(records_to_drop))

    # mo.md(f"""
    # Dropped `{len(records_to_drop)}` responses with straight-lining in Voice Scale 1-10 evaluation.
    # """)
    data_validated = data_all
    return (data_validated,)


@app.cell
def _():
    return


@app.cell(hide_code=True)
def _():
    #
    return


@app.cell
def _():
    mo.md(r"""
    ## Lucia's confirmation: missing 'Consumer' data
    """)
    return


@app.cell
def _(S, data_validated):
    demographics = S.get_demographics(data_validated)[0].collect()
    # demographics
    return (demographics,)


@app.cell(hide_code=True)
def _(demographics):
    # Demographics where 'Consumer' is null
    demographics_no_consumer = demographics.filter(pl.col('Consumer').is_null())['_recordId'].to_list()
    # demographics_no_consumer
    return (demographics_no_consumer,)


@app.cell
def _(data_all, demographics_no_consumer):
    # Check whether the responses with a missing 'Consumer' type in demographics are all business owners, as Lucia mentioned
    assert all(data_all.filter(pl.col('_recordId').is_in(demographics_no_consumer)).collect()['QID4'] == 'Yes'), "Not all respondents with missing 'Consumer' are business owners."
    return


@app.cell
def _():
    mo.md(r"""
    # Filter Data (Global corrections)
    """)
    return


@app.cell
def _():
    BEST_CHOSEN_CHARACTER = "the_coach"
    return (BEST_CHOSEN_CHARACTER,)


@app.cell
def _(S):
    filter_form = mo.md('''
    {age}

    {gender}

    {ethnicity}

    {income}

    {consumer}
    '''
    ).batch(
        age=mo.ui.multiselect(options=S.options_age, value=S.options_age, label="Select Age Group(s):"),
        gender=mo.ui.multiselect(options=S.options_gender, value=S.options_gender, label="Select Gender(s):"),
        ethnicity=mo.ui.multiselect(options=S.options_ethnicity, value=S.options_ethnicity, label="Select Ethnicities:"),
        income=mo.ui.multiselect(options=S.options_income, value=S.options_income, label="Select Income Group(s):"),
        consumer=mo.ui.multiselect(options=S.options_consumer, value=S.options_consumer, label="Select Consumer Groups:")
    ).form()
    mo.md(f'''
    ---

    # Data Filter

    {filter_form}
    ''')
    return (filter_form,)


@app.cell
def _(S, data_validated, filter_form):
    mo.stop(filter_form.value is None, mo.md("**Please submit the filter above to proceed**"))
    _d = S.filter_data(data_validated, age=filter_form.value['age'], gender=filter_form.value['gender'], income=filter_form.value['income'], ethnicity=filter_form.value['ethnicity'], consumer=filter_form.value['consumer'])

    # Stop execution and prevent other cells from running if no data is selected
    mo.stop(len(_d.collect()) == 0, mo.md("**No data available for the current filter combination**"))
    data = _d

    # data = data_validated
    data.collect()
    return (data,)


@app.cell
def _():
    return


@app.cell
def _():
    # Check whether all business owners are missing a 'Consumer' type in demographics
    # assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]), "Not all business owners are missing 'Consumer type' in demographics."
    return


@app.cell
def _():
    mo.md(r"""
    # Demographic Distributions
    """)
    return


@app.cell
def _():
    demo_plot_cols = [
        'Age',
        'Gender',
        # 'Race/Ethnicity',
        'Bussiness_Owner',  # column name is misspelled in the source data; corrected for display below
        'Consumer'
    ]
    return (demo_plot_cols,)


@app.cell
def _(S, data, demo_plot_cols):
    _content = "\n"
    for c in demo_plot_cols:
        _fig = S.plot_demographic_distribution(
            data=S.get_demographics(data)[0],
            column=c,
            title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
        )
        _content += f"""{mo.ui.altair_chart(_fig)}\n\n"""

    mo.md(_content)
    return


@app.cell
def _():
    mo.md(r"""
    ---

    # Brand Character Results
    """)
    return


@app.cell(disabled=True)
def _():
    mo.md(r"""
    ## Best performing: original vs refined Frankenstein
    """)
    return


@app.cell(disabled=True)
def _(S, data):
    char_refine_rank = S.get_character_refine(data)[0]
    # print(char_rank.collect().head())
    print(char_refine_rank.collect().head())
    return


@app.cell(disabled=True)
def _():
    mo.md(r"""
    ## Character ranking points
    """)
    return


@app.cell
def _(S, char_rank):
    char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
    S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Character Personality')
    return
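The weighting named in the chart title (1st = 3 pts, 2nd = 2 pts, 3rd = 1 pt) can be sketched as follows; the input shape is illustrative, and `calculate_weighted_ranking_scores` in `utils` may operate on dataframes instead:

```python
RANK_POINTS = {1: 3, 2: 2, 3: 1}

def weighted_ranking_scores(rankings):
    """Sum rank points per option; `rankings` maps respondent -> ordered top-3 list."""
    scores = {}
    for top3 in rankings.values():
        for rank, option in enumerate(top3, start=1):
            scores[option] = scores.get(option, 0) + RANK_POINTS.get(rank, 0)
    return scores

scores = weighted_ranking_scores({
    "r1": ["The Coach", "Bank Teller", "Familiar Friend"],
    "r2": ["The Coach", "Familiar Friend", "Bank Teller"],
})
```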


@app.cell
def _():
    mo.md(r"""
    ## Character ranking 1-2-3
    """)
    return


@app.cell
def _(S, data):
    char_rank = S.get_character_ranking(data)[0]
    return (char_rank,)


@app.cell
def _(S, char_rank):
    S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality', title='Character Personality: Rankings Top 3')
    return


@app.cell
def _():
    mo.md(r"""
    ### Statistical Significance of the Character Ranking
    """)
    return


@app.cell(disabled=True)
def _(S, char_rank):
    _pairwise_df, _meta = S.compute_ranking_significance(char_rank)

    # print(_pairwise_df.columns)

    mo.md(f"""
    {mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}

    {mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
    """)
    return


@app.cell(disabled=True)
def _():
    mo.md(r"""
    ## Character Ranking: times ranked 1st
    """)
    return


@app.cell
def _(S, char_rank):
    S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality')
    return


@app.cell
def _():
    mo.md(r"""
    ## Wordcloud of prominent predefined personality traits
    """)
    return


@app.cell
def _(S, data):
    top8_traits = S.get_top_8_traits(data)[0]
    S.plot_traits_wordcloud(
        data=top8_traits,
        column='Top_8_Traits',
        title="Most Prominent Personality Traits",
    )
    return


@app.cell
def _():
    mo.md(r"""
    ## Trait frequency per brand character
    """)
    return


@app.cell
def _(S, data):
    char_df = S.get_character_refine(data)[0]
    return (char_df,)


@app.cell
def _(S, char_df):
    from theme import ColorPalette

    # char_df comes from get_character_refine above
    characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
    character_colors = {
        'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
        'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
        'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
        'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
    }

    # Build a consistent sort order (by total frequency across all characters)
    all_trait_counts = {}
    for char in characters:
        freq_df, _ = S.transform_character_trait_frequency(char_df, char)
        for row in freq_df.iter_rows(named=True):
            all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']

    consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])

    _content = ""
    # Generate four plots (one per character)
    for char in characters:
        freq_df, _ = S.transform_character_trait_frequency(char_df, char)
        main_color, highlight_color = character_colors[char]
        chart = S.plot_single_character_trait_frequency(
            data=freq_df,
            character_name=char,
            bar_color=main_color,
            highlight_color=highlight_color,
            trait_sort_order=consistent_sort_order,
        )
        _content += f"""
{mo.ui.altair_chart(chart)}

"""

    mo.md(_content)
    return


@app.cell(disabled=True)
def _():
    mo.md(r"""
    ## Statistical significance of the best characters

    see chat
    > Example: if, say, no. 1 and no. 2 do not differ significantly but both differ from no. 3, that is also useful. Just thinking along about how I can present it, you know what I mean? :)
    >
    """)
    return


@app.cell(disabled=True)
def _():
    return


@app.cell
def _():
    return


@app.cell
def _():
    mo.md(r"""
    ---

    # Spoken Voice Results
    """)
    return


@app.cell
def _():
    COLOR_GENDER = True
    return (COLOR_GENDER,)


@app.cell
def _():
    mo.md(r"""
    ## Top 8 Most Chosen out of 18
    """)
    return


@app.cell
def _(S, data):
    v_18_8_3 = S.get_18_8_3(data)[0]
    return (v_18_8_3,)


@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
    S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)
    return


@app.cell
def _():
    mo.md(r"""
    ## Top 3 Most Chosen out of 8
    """)
    return


@app.cell
def _(COLOR_GENDER, S, v_18_8_3):
    S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)
    return


@app.cell
def _():
    mo.md(r"""
    ## Voice Ranking Weighted Score
    """)
    return


@app.cell
def _(S, data):
    top3_voices = S.get_top_3_voices(data)[0]
    top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)
    return top3_voices, top3_voices_weighted


@app.cell
def _(COLOR_GENDER, S, top3_voices_weighted):
    S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ## Which voice is ranked best in the top-3 ranking question?

    (not the best-3-out-of-8 question)
    """)
    return


@app.cell
def _(COLOR_GENDER, S, top3_voices):
    S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
    return


@app.cell
def _():
    mo.md(r"""
    ### Statistical significance for the voice ranking
    """)
    return


@app.cell
def _():
    # print(top3_voices.collect().head())
    return


@app.cell
def _():
    # _pairwise_df, _metadata = S.compute_ranking_significance(
    #     top3_voices, alpha=0.05, correction="none")

    # # View significant pairs
    # # print(pairwise_df.filter(pl.col('significant') == True))

    # # Create heatmap visualization
    # _heatmap = S.plot_significance_heatmap(
    #     _pairwise_df,
    #     metadata=_metadata,
    #     title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
    # )

    # # Create summary bar chart
    # _summary = S.plot_significance_summary(
    #     _pairwise_df,
    #     metadata=_metadata
    # )

    # mo.md(f"""
    # {mo.ui.altair_chart(_heatmap)}

    # {mo.ui.altair_chart(_summary)}
    # """)
    return


@app.cell
def _():
    ## Voice Ranked 1st the Most
    return


@app.cell
def _(COLOR_GENDER, S, top3_voices):
    S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
    return


@app.cell
def _():
    mo.md(r"""
    ## Voice Scale 1-10
    """)
    return


@app.cell
def _(COLOR_GENDER, S, data):
    # Get the voice scale data
    voice_1_10, _ = S.get_voice_scale_1_10(data)
    S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1, 10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)
    return (voice_1_10,)


@app.cell(disabled=True)
def _():
    mo.md(r"""
    ### Statistical Significance (Scale 1-10)
    """)
    return


@app.cell(disabled=True)
def _(S, voice_1_10):
    # Compute pairwise significance tests
    pairwise_df, metadata = S.compute_pairwise_significance(
        voice_1_10,
        test_type="mannwhitney",  # or "ttest", "chi2", "auto"
        alpha=0.05,
        correction="bonferroni"  # or "holm", "none"
    )

    # View significant pairs
    # print(pairwise_df.filter(pl.col('significant') == True))

    # Create heatmap visualization
    _heatmap = S.plot_significance_heatmap(
        pairwise_df,
        metadata=metadata,
        title="Voice Rating Significance<br>(Pairwise Comparisons)"
    )

    # Create summary bar chart
    _summary = S.plot_significance_summary(
        pairwise_df,
        metadata=metadata
    )

    mo.md(f"""
    {mo.ui.altair_chart(_heatmap)}

    {mo.ui.altair_chart(_summary)}
    """)
    return


@app.cell
def _():
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ## Ranking points for Voice per Chosen Brand Character

    **missing mapping**
    """)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ## Correlation of Speaking Styles
    """)
    return


@app.cell
def _(S, data, top3_voices):
    ss_or, choice_map_or = S.get_ss_orange_red(data)
    ss_gb, choice_map_gb = S.get_ss_green_blue(data)

    # Combine the data
    ss_all = ss_or.join(ss_gb, on='_recordId')
    _d = ss_all.collect()

    choice_map = {**choice_map_or, **choice_map_gb}
    # print(_d.head())
    # print(choice_map)
    df_style = utils.process_speaking_style_data(ss_all, choice_map)

    vscales = S.get_voice_scale_1_10(data)[0]
    df_scale_long = utils.process_voice_scale_data(vscales)

    joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")

    df_ranking = utils.process_voice_ranking_data(top3_voices)
    joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')
    return joined_ranking, joined_scale


@app.cell
def _(joined_ranking):
    joined_ranking.head()
    return


@app.cell
def _():
    mo.md(r"""
    ### Colors vs Scale 1-10
    """)
    return


@app.cell
def _(S, joined_scale):
    # Transform to get one row per color with the average correlation
    color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
    S.plot_speaking_style_color_correlation(
        data=color_corr_scale,
        title="Correlation: Speaking Style Colors and Voice Scale 1-10"
    )
    return


@app.cell
def _():
    mo.md(r"""
    ### Colors vs Ranking Points
    """)
    return


@app.cell
def _(S, joined_ranking):
    color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
        joined_ranking,
        SPEAKING_STYLES,
        target_column="Ranking_Points"
    )
    S.plot_speaking_style_color_correlation(
        data=color_corr_ranking,
        title="Correlation: Speaking Style Colors and Voice Ranking Points"
    )
    return


@app.cell
def _():
    mo.md(r"""
    ### Individual Traits vs Scale 1-10
    """)
    return


@app.cell
def _(S, joined_scale):
    _content = ""

    for _style, _traits in SPEAKING_STYLES.items():
        _fig = S.plot_speaking_style_correlation(
            data=joined_scale,
            style_color=_style,
            style_traits=_traits,
            title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
        )
        _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
    mo.md(_content)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ### Individual Traits vs Ranking Points
    """)
    return


@app.cell
def _(S, joined_ranking):
    _content = ""

    for _style, _traits in SPEAKING_STYLES.items():
        _fig = S.plot_speaking_style_ranking_correlation(
            data=joined_ranking,
            style_color=_style,
            style_traits=_traits,
            title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
        )
        _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
    mo.md(_content)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ## Correlations when the "Best Brand Character" is chosen

    Select only the traits that fit that character
    """)
    return


@app.cell
def _(BEST_CHOSEN_CHARACTER):
    from reference import ORIGINAL_CHARACTER_TRAITS
    chosen_bc_traits = ORIGINAL_CHARACTER_TRAITS[BEST_CHOSEN_CHARACTER]
    return (chosen_bc_traits,)


@app.cell
def _(chosen_bc_traits):
    STYLES_SUBSET = utils.filter_speaking_styles(SPEAKING_STYLES, chosen_bc_traits)
    return (STYLES_SUBSET,)


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ### Individual Traits vs Ranking Points
    """)
    return


@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
    _content = ""
    for _style, _traits in STYLES_SUBSET.items():
        _fig = S.plot_speaking_style_ranking_correlation(
            data=joined_ranking,
            style_color=_style,
            style_traits=_traits,
            title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Ranking Points"""
        )
        _content += f"""
{mo.ui.altair_chart(_fig)}

"""
    mo.md(_content)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ### Individual Traits vs Scale 1-10
    """)
    return


@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
    _content = ""

    for _style, _traits in STYLES_SUBSET.items():
        _fig = S.plot_speaking_style_correlation(
            data=joined_scale,
            style_color=_style,
            style_traits=_traits,
            title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style {_style} and Voice Scale 1-10""",
        )
        _content += f"""
{mo.ui.altair_chart(_fig)}

"""
    mo.md(_content)
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ### Colors vs Scale 1-10 (Best Character)
    """)
    return


@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_scale):
    # Transform to get one row per color with the average correlation
    _color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, STYLES_SUBSET)
    S.plot_speaking_style_color_correlation(
        data=_color_corr_scale,
        title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Scale 1-10"""
    )
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    ### Colors vs Ranking Points (Best Character)
    """)
    return


@app.cell
def _(BEST_CHOSEN_CHARACTER, S, STYLES_SUBSET, joined_ranking):
    _color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
        joined_ranking,
        STYLES_SUBSET,
        target_column="Ranking_Points"
    )
    S.plot_speaking_style_color_correlation(
        data=_color_corr_ranking,
        title=f"""Brand Character "{BEST_CHOSEN_CHARACTER.replace('_', ' ').title()}" - Correlation: Speaking Style Colors and Voice Ranking Points"""
    )
    return


if __name__ == "__main__":
    app.run()


74  04_PPTX_Update_Images.py  Normal file
@@ -0,0 +1,74 @@
import marimo


__generated_with = "0.19.7"

app = marimo.App(width="medium")


with app.setup:
    import marimo as mo
    from pathlib import Path
    import utils


@app.cell
def _():
    mo.md(r"""
    # Tag existing images with Alt-Text

    Based on image content
    """)
    return


@app.cell
def _():
    return


@app.cell
def _():
    TAG_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
    # TAG_TARGET = Path('data/reports/Perception-Research-Report_2-2_tagged.pptx')
    TAG_IMAGE_DIR = Path('figures/debug')
    return TAG_IMAGE_DIR, TAG_SOURCE


@app.cell
def _(TAG_IMAGE_DIR, TAG_SOURCE):
    utils.update_ppt_alt_text(
        ppt_path=TAG_SOURCE,
        image_source_dir=TAG_IMAGE_DIR,
        # output_path=TAG_TARGET
    )
    return


@app.cell(hide_code=True)
def _():
    mo.md(r"""
    # Replace Images using Alt-Text
    """)
    return


@app.cell
def _():
    REPLACE_SOURCE = Path('data/reports/VOICE_Perception-Research-Report_4-2-26_19-30.pptx')
    # REPLACE_TARGET = Path('data/reports/Perception-Research-Report_2-2_updated.pptx')

    NEW_IMAGES_DIR = Path('figures/2-4-26')
    return NEW_IMAGES_DIR, REPLACE_SOURCE


@app.cell
def _(NEW_IMAGES_DIR, REPLACE_SOURCE):
    # Get all files in the image source directory and subdirectories
    results = utils.pptx_replace_images_from_directory(
        REPLACE_SOURCE,  # Source presentation path
        NEW_IMAGES_DIR,  # Source directory with new images
        # REPLACE_TARGET  # Output path (optional; defaults to overwriting the source)
    )
    return


if __name__ == "__main__":
    app.run()

@@ -10,8 +10,8 @@ def _():
     import polars as pl
     from pathlib import Path

-    from utils import JPMCSurvey, combine_exclusive_columns
-    return JPMCSurvey, combine_exclusive_columns, mo, pl
+    from utils import QualtricsSurvey, combine_exclusive_columns
+    return QualtricsSurvey, combine_exclusive_columns, mo, pl


 @app.cell
@@ -29,8 +29,8 @@ def _():


 @app.cell
-def _(JPMCSurvey, QSF_FILE, RESULTS_FILE):
-    survey = JPMCSurvey(RESULTS_FILE, QSF_FILE)
+def _(QualtricsSurvey, QSF_FILE, RESULTS_FILE):
+    survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
     data = survey.load_data()
     data.collect()
     return data, survey
@@ -42,14 +42,6 @@ def _(survey):
     return


-app._unparsable_cell(
-    r"""
-    data.
-    """,
-    name="_"
-)
-
-
 @app.cell
 def _(mo):
     mo.md(r"""
@@ -205,7 +197,7 @@ def _(mo):
 @app.cell
 def _(data, survey):
     vscales = survey.get_voice_scale_1_10(data)[0].collect()
-    vscales
+    print(vscales.head())
     return (vscales,)


73  99_example_ppt_replace_images.py  Normal file
@@ -0,0 +1,73 @@
import marimo


__generated_with = "0.19.2"

app = marimo.App(width="medium")


with app.setup:
    import marimo as mo
    from pathlib import Path
    import utils


@app.cell
def _():
    mo.md(r"""
    # Tag existing images with Alt-Text

    Based on image content
    """)
    return


@app.cell
def _():
    TAG_SOURCE = Path('data/test_tag_source.pptx')
    TAG_TARGET = Path('data/test_tag_target.pptx')
    TAG_IMAGE_DIR = Path('figures/OneDrive_2026-01-28/')
    return TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET


@app.cell
def _(TAG_IMAGE_DIR, TAG_SOURCE, TAG_TARGET):
    utils.update_ppt_alt_text(ppt_path=TAG_SOURCE, image_source_dir=TAG_IMAGE_DIR, output_path=TAG_TARGET)
    return


@app.cell
def _():
    return


@app.cell
def _():
    mo.md(r"""
    # Replace Images using Alt-Text
    """)
    return


@app.cell
def _():
    REPLACE_SOURCE = Path('data/test_replace_source.pptx')
    REPLACE_TARGET = Path('data/test_replace_target.pptx')
    return REPLACE_SOURCE, REPLACE_TARGET


@app.cell
def _():
    IMAGE_FILE = Path('figures/OneDrive_2026-01-28/Cons-Early_Professional/cold_distant_approachable_familiar_warm.png')
    return (IMAGE_FILE,)


@app.cell
def _(IMAGE_FILE, REPLACE_SOURCE, REPLACE_TARGET):
    utils.pptx_replace_named_image(
        presentation_path=REPLACE_SOURCE,
        target_tag=utils.image_alt_text_generator(IMAGE_FILE),
        new_image_path=IMAGE_FILE,
        save_path=REPLACE_TARGET)
    return


if __name__ == "__main__":
    app.run()


238  README.md
@@ -1,5 +1,239 @@
# Voice Branding Quantitative Analysis

## Running Marimo Notebooks

Running on Ct-105 for shared access:

```bash
uv run marimo run 02_quant_analysis.py --headless --port 8080
```

---

## Batch Report Generation

The quant report can be run with different filter combinations via the CLI or automated batch processing.

### Single Filter Run (CLI)

Run the report script directly with JSON-encoded filter arguments:

```bash
# Single consumer segment
uv run python 03_quant_report.script.py --consumer '["Starter"]'

# Single age group
uv run python 03_quant_report.script.py --age '["18 to 21 years"]'

# Multiple filters combined
uv run python 03_quant_report.script.py --age '["18 to 21 years", "22 to 24 years"]' --gender '["Male"]'

# All respondents (no filters defaults to all options selected)
uv run python 03_quant_report.script.py
```

Available filter arguments:
- `--age` — JSON list of age groups
- `--gender` — JSON list of genders
- `--ethnicity` — JSON list of ethnicities
- `--income` — JSON list of income groups
- `--consumer` — JSON list of consumer segments
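Parsing JSON-encoded list arguments like these is typically a one-liner with `argparse` plus `json.loads`; a minimal sketch (the real script's argument handling may differ):

```python
import argparse
import json

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Quant report filters")
    for name in ("age", "gender", "ethnicity", "income", "consumer"):
        # Each filter arrives as a JSON list, e.g. --age '["18 to 21 years"]'
        parser.add_argument(f"--{name}", type=json.loads, default=None)
    return parser

args = build_parser().parse_args(["--age", '["18 to 21 years"]'])
```

A default of `None` lets the script distinguish "not passed" (select all options) from an explicit empty list.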

### Batch Runner (All Combinations)

Run all single-filter combinations automatically with progress tracking:

```bash
# Preview all combinations without running
uv run python run_filter_combinations.py --dry-run

# Run all combinations (shows a progress bar)
uv run python run_filter_combinations.py

# Or use the registered CLI entry point
uv run quant-report-batch
uv run quant-report-batch --dry-run
```

This generates reports for:
- All respondents (no filters)
- Each age group individually
- Each gender individually
- Each ethnicity individually
- Each income group individually
- Each consumer segment individually

Output figures are saved to `figures/<export_date>/<filter_slug>/`.
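The output layout can be built with `pathlib`; the slug rule below is an assumption for illustration, not necessarily what `run_filter_combinations.py` does:

```python
from pathlib import Path

def figure_dir(export_date: str, combo_name: str) -> Path:
    """Build figures/<export_date>/<filter_slug>/ from a combination name."""
    slug = combo_name.replace(" ", "-")
    return Path("figures") / export_date / slug

out_dir = figure_dir("2-4-26", "Age-18to24_Consumer-Starter")
```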

### Jupyter Notebook Debugging

The script auto-detects Jupyter/IPython environments. When running under VS Code's Jupyter extension, the CLI args default to `None` (all options selected), so you can debug cell-by-cell normally.
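The detection is commonly a probe for IPython's `get_ipython`; a sketch of the pattern (the script's actual logic may differ, and `parse_cli_filters` below is a hypothetical helper):

```python
def running_in_ipython() -> bool:
    """True inside an IPython/Jupyter session, False in plain CPython."""
    try:
        from IPython import get_ipython  # only importable when IPython is installed
        return get_ipython() is not None
    except ImportError:
        return False

# In the report script this gate would decide whether to parse CLI filters:
# filters = {} if running_in_ipython() else parse_cli_filters()  # hypothetical helper
```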
|
||||
|
||||
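The detection relies on the `get_ipython` builtin that IPython injects into the global namespace; in a plain Python process the name is undefined, so the script falls back to argparse. A minimal sketch of the idiom:

```python
def running_in_ipython() -> bool:
    """True when an IPython/Jupyter kernel has injected get_ipython()."""
    try:
        get_ipython()  # noqa: F821 - only defined inside IPython
        return True
    except NameError:
        return False

print(running_in_ipython())  # False in a plain Python process
```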
---
## Adding Custom Filter Combinations

To add new filter combinations to the batch runner, edit `run_filter_combinations.py`:

### Checklist

1. **Open** `run_filter_combinations.py`

2. **Find** the `get_filter_combinations()` function

3. **Add** your combination to the list before the `return` statement:

   ```python
   # Example: Add a specific age + consumer cross-filter
   combinations.append({
       'name': 'Age-18to24_Consumer-Starter',  # Used for output folder naming
       'filters': {
           'age': ['18 to 21 years', '22 to 24 years'],
           'consumer': ['Starter']
       }
   })
   ```
4. **Filter keys** must match CLI argument names (defined in `FILTER_CONFIG` in `03_quant_report.script.py`):

   - `age` — values from `survey.options_age`
   - `gender` — values from `survey.options_gender`
   - `ethnicity` — values from `survey.options_ethnicity`
   - `income` — values from `survey.options_income`
   - `consumer` — values from `survey.options_consumer`

5. **Check available values** by running:

   ```python
   from utils import QualtricsSurvey

   S = QualtricsSurvey('data/exports/2-2-26/...Labels.csv', 'data/exports/.../....qsf')
   S.load_data()
   print(S.options_age)
   print(S.options_consumer)
   # etc.
   ```

6. **Test** with dry-run first:

   ```bash
   uv run python run_filter_combinations.py --dry-run
   ```
### Example: Adding Multiple Cross-Filters

```python
# In get_filter_combinations(), before return:

# Young professionals
combinations.append({
    'name': 'Young_Professionals',
    'filters': {
        'age': ['22 to 24 years', '25 to 34 years'],
        'consumer': ['Early Professional']
    }
})

# High income males
combinations.append({
    'name': 'High_Income_Male',
    'filters': {
        'income': ['$150,000 - $199,999', '$200,000 or more'],
        'gender': ['Male']
    }
})
```
### Notes

- **Empty filters dict** = all respondents (no filtering)
- **Omitted filter keys** = all options for that dimension selected
- **Output folder names** are auto-generated from active filters by `QualtricsSurvey.filter_data()`
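If you need every pairwise cross of two dimensions rather than hand-written entries, the `combinations.append(...)` calls above can be generated in a loop. A sketch using `itertools.product` — the option lists here are stand-ins for the real `survey.options_*` attributes, and the slug scheme in `name` is illustrative:

```python
from itertools import product

# Stand-ins for survey.options_age and survey.options_consumer
options_age = ['18 to 21 years', '22 to 24 years']
options_consumer = ['Starter', 'Early Professional']

combinations = []
for age, consumer in product(options_age, options_consumer):
    combinations.append({
        'name': f'Age-{age.split()[0]}_Cons-{consumer.replace(" ", "")}',
        'filters': {'age': [age], 'consumer': [consumer]},
    })

print(len(combinations))  # one entry per (age, consumer) pair
```

Be aware that crossing many dimensions multiplies the number of reports quickly, so preview with `--dry-run` before running a generated list.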
---

## Adding a New Filter Dimension

To add an entirely new filter dimension (e.g., a new demographic question), you need to update several files:

### Checklist
1. **Update `utils.py` — `QualtricsSurvey.__init__()`** to initialize the filter state attribute:

   ```python
   # In __init__(), add after existing filter_ attributes (around line 758):
   self.filter_region: list = None  # QID99
   ```

2. **Update `utils.py` — `load_data()`** to populate the `options_*` attribute:

   ```python
   # In load_data(), add after existing options:
   self.options_region = sorted(df['QID99'].drop_nulls().unique().to_list()) if 'QID99' in df.columns else []
   ```

3. **Update `utils.py` — `filter_data()`** to accept and apply the filter:

   ```python
   # Add parameter to the function signature:
   def filter_data(self, q: pl.LazyFrame, ..., region: list = None) -> pl.LazyFrame:

   # Add filter logic in the function body:
   self.filter_region = region
   if region is not None:
       q = q.filter(pl.col('QID99').is_in(region))
   ```

4. **Update `plots.py` — `_get_filter_slug()`** to include the filter in directory slugs:

   ```python
   # Add to the filters list:
   ('region', 'Reg', getattr(self, 'filter_region', None), 'options_region'),
   ```

5. **Update `plots.py` — `_get_filter_description()`** for human-readable descriptions:

   ```python
   # Add to the filters list:
   ('Region', getattr(self, 'filter_region', None), 'options_region'),
   ```

6. **Update `03_quant_report.script.py` — `FILTER_CONFIG`**:

   ```python
   FILTER_CONFIG = {
       'age': 'options_age',
       'gender': 'options_gender',
       # ... existing filters ...
       'region': 'options_region',  # ← New filter
   }
   ```
This **automatically**:

- Adds a `--region` CLI argument
- Includes it in Jupyter mode (defaults to all options)
- Passes it to `S.filter_data()`
- Writes it to the `.txt` filter description file

7. **Update `run_filter_combinations.py`** to generate combinations (optional):

   ```python
   # Add after the existing filter loops:
   for region in survey.options_region:
       combinations.append({
           'name': f'Region-{region}',
           'filters': {'region': [region]}
       })
   ```
### Currently Available Filters

| CLI Argument | Options Attribute | QID Column | Description |
|--------------|-------------------|------------|-------------|
| `--age` | `options_age` | QID1 | Age groups |
| `--gender` | `options_gender` | QID2 | Gender |
| `--ethnicity` | `options_ethnicity` | QID3 | Ethnicity |
| `--income` | `options_income` | QID15 | Income brackets |
| `--consumer` | `options_consumer` | Consumer | Consumer segments |
| `--business_owner` | `options_business_owner` | QID4 | Business owner status |
| `--employment_status` | `options_employment_status` | QID13 | Employment status |
| `--personal_products` | `options_personal_products` | QID14 | Personal products |
| `--ai_user` | `options_ai_user` | QID22 | AI user status |
| `--investable_assets` | `options_investable_assets` | QID16 | Investable assets |
| `--industry` | `options_industry` | QID17 | Industry |
263
XX_detailed_trait_analysis.py
Normal file
@@ -0,0 +1,263 @@
"""Extra analyses of the traits"""
# %% Imports

import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners


# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'


# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
    'age': 'options_age',
    'gender': 'options_gender',
    'ethnicity': 'options_ethnicity',
    'income': 'options_income',
    'consumer': 'options_consumer',
    'business_owner': 'options_business_owner',
    'ai_user': 'options_ai_user',
    'investable_assets': 'options_investable_assets',
    'industry': 'options_industry',
}


def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')

    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')

    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')

    # Only parse if running as a script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return a namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        # Use the same default as argparse
        default_fig_dir = f'figures/traits-likert-analysis/{Path(RESULTS_FILE).parts[2]}'
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args


cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()


# %% Build filtered dataset based on CLI args

# CLI args: None means "no filter applied" - filter_data() will skip None filters

# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}

_d = S.filter_data(data_all, **_active_filters)

# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)

    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f"  {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f"  {display_name}: All")

    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))

    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"

    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for the same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)

# Save to logical variable name for further analysis
data = _d
data.collect()
# %% Voices per trait

ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)

# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()

choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
ss_long = utils.process_speaking_style_data(ss_all, choice_map)


# %% Create plots

for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):
    trait_d = ss_long.filter(pl.col("Description") == trait)

    S.plot_speaking_style_trait_scores(trait_d, title=trait.replace(":", " ↔ "), height=550, color_gender=True)


# %% Filter out straight-liners (PER TRAIT) and re-plot to see if anything changes
# Save with a different filename suffix so we can compare with/without straight-liners

print("\n--- Straight-lining Checks on TRAITS ---")
sl_report_traits, sl_traits_df = check_straight_liners(ss_all, max_score=5)
sl_traits_df

# %%

if sl_traits_df is not None and not sl_traits_df.is_empty():
    sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()
    n_sl_groups = sl_traits_df.height
    print(f"\nExcluding {n_sl_groups} straight-lined question blocks from {len(sl_ids)} respondents.")

    # Create key in ss_long to match sl_traits_df for the anti-join
    # Question Group key in sl_traits_df is like "SS_Orange_Red__V14"
    # ss_long has "Style_Group" and "Voice"
    ss_long_w_key = ss_long.with_columns(
        (pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
    )

    # Prepare filter table: Record ID + Question Group
    sl_filter = sl_traits_df.select([
        pl.col("Record ID").alias("_recordId"),
        pl.col("Question Group")
    ])

    # Anti-join to remove specific question blocks that were straight-lined
    ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")

    # Re-plot with suffix in title
    print("Re-plotting traits (Cleaned)...")
    for i, trait in enumerate(ss_long_clean.select("Description").unique().to_series().to_list()):
        trait_d = ss_long_clean.filter(pl.col("Description") == trait)

        # Modify title to create a unique filename (and display title)
        title_clean = trait.replace(":", " ↔ ") + " (Excl. Straight-Liners)"

        S.plot_speaking_style_trait_scores(trait_d, title=title_clean, height=550, color_gender=True)
else:
    print("No straight-liners found on traits.")


# %% Compare All vs Cleaned
if sl_traits_df is not None and not sl_traits_df.is_empty():
    print("Generating Comparison Plots (All vs Cleaned)...")

    # Always apply the per-question-group filtering here to ensure consistency
    # (Matches the logic used in the re-plotting section above)
    print("Applying filter to remove straight-lined question blocks...")
    ss_long_w_key = ss_long.with_columns(
        (pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
    )
    sl_filter = sl_traits_df.select([
        pl.col("Record ID").alias("_recordId"),
        pl.col("Question Group")
    ])
    ss_long_clean = ss_long_w_key.join(sl_filter, on=["_recordId", "Question Group"], how="anti").drop("Question Group")

    sl_ids = sl_traits_df.select(pl.col("Record ID").unique()).to_series().to_list()

    # --- Verification Prints ---
    print("\n--- Verification of Filter ---")
    print(f"Original Row Count: {ss_long.height}")
    print(f"Number of Straight-Liner Question Blocks: {sl_traits_df.height}")
    print(f"Sample IDs affected: {sl_ids[:5]}")
    print(f"Cleaned Row Count: {ss_long_clean.height}")
    print(f"Rows Removed: {ss_long.height - ss_long_clean.height}")

    # Verify removal
    # Re-construct key to verify
    ss_long_check = ss_long.with_columns(
        (pl.col("Style_Group") + "__" + pl.col("Voice")).alias("Question Group")
    )
    sl_filter_check = sl_traits_df.select([
        pl.col("Record ID").alias("_recordId"),
        pl.col("Question Group")
    ])

    should_be_removed = ss_long_check.join(sl_filter_check, on=["_recordId", "Question Group"], how="inner").height
    print(f"Discrepancy Check (Should be 0): {(ss_long.height - ss_long_clean.height) - should_be_removed}")

    # Show what was removed (the straight-lining behavior)
    print("\nSample of Straight-Liner Data (Values that caused removal):")
    print(sl_traits_df.head(5))
    print("-" * 30 + "\n")
    # ---------------------------

    for i, trait in enumerate(ss_long.select("Description").unique().to_series().to_list()):

        # Get data for this trait from both datasets
        trait_d_all = ss_long.filter(pl.col("Description") == trait)
        trait_d_clean = ss_long_clean.filter(pl.col("Description") == trait)

        # Plot comparison
        title_comp = trait.replace(":", " ↔ ") + " (Impact of Straight-Liners)"

        S.plot_speaking_style_trait_scores_comparison(
            trait_d_all,
            trait_d_clean,
            title=title_comp,
            height=600  # Slightly taller for grouped bars
        )
849
XX_quant_report.script.py
Normal file
@@ -0,0 +1,849 @@
__generated_with = "0.19.7"

# %%
import marimo as mo
import polars as pl
from pathlib import Path
import argparse
import json
import re
from validation import check_progress, duration_validation, check_straight_liners
from utils import QualtricsSurvey, combine_exclusive_columns, calculate_weighted_ranking_scores
import utils

from speaking_styles import SPEAKING_STYLES

# %% Fixed Variables

RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
# RESULTS_FILE = 'data/exports/debug/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'


# %%
# CLI argument parsing for batch automation
# When run as script: python 03_quant_report.script.py --age '["18 to 21 years"]' --consumer '["Starter"]'
# When run in Jupyter: args will use defaults (all filters = None = all options selected)

# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
    'age': 'options_age',
    'gender': 'options_gender',
    'ethnicity': 'options_ethnicity',
    'income': 'options_income',
    'consumer': 'options_consumer',
    'business_owner': 'options_business_owner',
    'ai_user': 'options_ai_user',
    'investable_assets': 'options_investable_assets',
    'industry': 'options_industry',
}


def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')

    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')

    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')
    parser.add_argument('--best-character', type=str, default="the_coach", help='Slug of the best chosen character (default: "the_coach")')
    parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (e.g. 3 removes anyone with 3+ straight-lined groups)')
    parser.add_argument('--voice-ranking-filter', type=str, default=None, choices=['only-missing', 'exclude-missing'], help='Filter by voice ranking completeness: "only-missing" keeps only respondents missing QID98 ranking data, "exclude-missing" removes them')

    # Only parse if running as a script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return a namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}', best_character="the_coach", sl_threshold=None, voice_ranking_filter=None)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args


cli_args = parse_cli_args()
BEST_CHOSEN_CHARACTER = cli_args.best_character
# %%
S = QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
try:
    data_all = S.load_data()
except NotImplementedError as e:
    mo.stop(True, mo.md(f"**⚠️ {str(e)}**"))


# %% Build filtered dataset based on CLI args

# CLI args: None means "no filter applied" - filter_data() will skip None filters

# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}

# %% Apply filters
_d = S.filter_data(data_all, **_active_filters)

# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)

    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f"  {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f"  {display_name}: All")

    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))

    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"

    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for the same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)


# %% Apply straight-liner threshold filter (if specified)
# Removes respondents who straight-lined >= N question groups across
# speaking style and voice scale questions.
if cli_args.sl_threshold is not None:
    _sl_n = cli_args.sl_threshold
    S.sl_threshold = _sl_n  # Store on Survey so filter slug/description include it
    print(f"Applying straight-liner filter: excluding respondents with ≥{_sl_n} straight-lined question groups...")
    _n_before = _d.select(pl.len()).collect().item()

    # Extract question groups with renamed columns for check_straight_liners
    _sl_ss_or, _ = S.get_ss_orange_red(_d)
    _sl_ss_gb, _ = S.get_ss_green_blue(_d)
    _sl_vs, _ = S.get_voice_scale_1_10(_d)
    _sl_all_q = _sl_ss_or.join(_sl_ss_gb, on='_recordId').join(_sl_vs, on='_recordId')

    _, _sl_df = check_straight_liners(_sl_all_q, max_score=5)

    if _sl_df is not None and not _sl_df.is_empty():
        # Count straight-lined question groups per respondent
        _sl_counts = (
            _sl_df
            .group_by("Record ID")
            .agg(pl.len().alias("sl_count"))
            .filter(pl.col("sl_count") >= _sl_n)
            .select(pl.col("Record ID").alias("_recordId"))
        )
        # Anti-join to remove offending respondents
        _d = _d.collect().join(_sl_counts, on="_recordId", how="anti").lazy()
        # Update filtered data on the Survey object so the sample size is correct
        S.data_filtered = _d
        _n_after = _d.select(pl.len()).collect().item()
        print(f"  Removed {_n_before - _n_after} respondents ({_n_before} → {_n_after})")
    else:
        print("  No straight-liners detected — no respondents removed.")


# %% Apply voice-ranking completeness filter (if specified)
# Keeps only / excludes respondents who are missing the explicit voice
# ranking question (QID98) despite completing the top-3 selection (QID36).
if cli_args.voice_ranking_filter is not None:
    S.voice_ranking_filter = cli_args.voice_ranking_filter  # Store on Survey so filter slug/description include it
    _vr_missing = S.get_top_3_voices_missing_ranking(_d)
    _vr_missing_ids = _vr_missing.select('_recordId')
    _n_before = _d.select(pl.len()).collect().item()

    if cli_args.voice_ranking_filter == 'only-missing':
        print("Voice ranking filter: keeping ONLY respondents missing QID98 ranking data...")
        _d = _d.collect().join(_vr_missing_ids, on='_recordId', how='inner').lazy()
    elif cli_args.voice_ranking_filter == 'exclude-missing':
        print("Voice ranking filter: EXCLUDING respondents missing QID98 ranking data...")
        _d = _d.collect().join(_vr_missing_ids, on='_recordId', how='anti').lazy()

    S.data_filtered = _d
    _n_after = _d.select(pl.len()).collect().item()
    print(f"  {_n_before} → {_n_after} respondents ({_vr_missing_ids.height} missing ranking data)")

# Save to logical variable name for further analysis
data = _d
data.collect()
# %%
# Check if all business owners are missing a 'Consumer type' in demographics
# assert all([a is None for a in data_all.filter(pl.col('QID4') == 'Yes').collect()['Consumer'].unique()]), "Not all business owners are missing 'Consumer type' in demographics."

# %%
mo.md(r"""
# Demographic Distributions
""")

# %%
demo_plot_cols = [
    'Age',
    'Gender',
    # 'Race/Ethnicity',
    'Bussiness_Owner',
    'Consumer'
]

# %%
_content = """

"""
for c in demo_plot_cols:
    _fig = S.plot_demographic_distribution(
        data=S.get_demographics(data)[0],
        column=c,
        title=f"{c.replace('Bussiness', 'Business').replace('_', ' ')} Distribution of Survey Respondents"
    )
    _content += f"""{mo.ui.altair_chart(_fig)}\n\n"""

mo.md(_content)

# %%
mo.md(r"""
---

# Brand Character Results
""")

# %%
mo.md(r"""
## Best performing: Original vs Refined frankenstein
""")

# %%
char_refine_rank = S.get_character_refine(data)[0]
# print(char_rank.collect().head())
print(char_refine_rank.collect().head())

# %%
mo.md(r"""
## Character ranking points
""")

# %%
mo.md(r"""
## Character ranking 1-2-3
""")

# %%
char_rank = S.get_character_ranking(data)[0]

# %%
char_rank_weighted = calculate_weighted_ranking_scores(char_rank)
S.plot_weighted_ranking_score(char_rank_weighted, title="Most Popular Character - Weighted Popularity Score<br>(1st=3pts, 2nd=2pts, 3rd=1pt)", x_label='Voice')

# %%
S.plot_top3_ranking_distribution(char_rank, x_label='Character Personality', title='Character Personality: Rankings Top 3')

# %%
mo.md(r"""
### Statistical Significance Character Ranking
""")

# %%
# _pairwise_df, _meta = S.compute_ranking_significance(char_rank)

# # print(_pairwise_df.columns)

# mo.md(f"""

# {mo.ui.altair_chart(S.plot_significance_heatmap(_pairwise_df, metadata=_meta))}

# {mo.ui.altair_chart(S.plot_significance_summary(_pairwise_df, metadata=_meta))}
# """)

# %%
mo.md(r"""
## Character Ranking: times 1st place
""")

# %%
S.plot_most_ranked_1(char_rank, title="Most Popular Character<br>(Number of Times Ranked 1st)", x_label='Character Personality')

# %%
mo.md(r"""
## Prominent predefined personality traits wordcloud
""")

# %%
top8_traits = S.get_top_8_traits(data)[0]
S.plot_traits_wordcloud(
    data=top8_traits,
    column='Top_8_Traits',
    title="Most Prominent Personality Traits",
)
# %%
|
||||
mo.md(r"""
|
||||
## Trait frequency per brand character
|
||||
""")
|
||||
|
||||
# %%
|
||||
char_df = S.get_character_refine(data)[0]
|
||||
|
||||
# %%
|
||||
from theme import ColorPalette
|
||||
|
||||
# Assuming you already have char_df (your data from get_character_refine or similar)
|
||||
characters = ['Bank Teller', 'Familiar Friend', 'The Coach', 'Personal Assistant']
|
||||
character_colors = {
|
||||
'Bank Teller': (ColorPalette.CHARACTER_BANK_TELLER, ColorPalette.CHARACTER_BANK_TELLER_HIGHLIGHT),
|
||||
'Familiar Friend': (ColorPalette.CHARACTER_FAMILIAR_FRIEND, ColorPalette.CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT),
|
||||
'The Coach': (ColorPalette.CHARACTER_COACH, ColorPalette.CHARACTER_COACH_HIGHLIGHT),
|
||||
'Personal Assistant': (ColorPalette.CHARACTER_PERSONAL_ASSISTANT, ColorPalette.CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT),
|
||||
}
|
||||
|
||||
# Build consistent sort order (by total frequency across all characters)
|
||||
all_trait_counts = {}
|
||||
for char in characters:
|
||||
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
|
||||
for row in freq_df.iter_rows(named=True):
|
||||
all_trait_counts[row['trait']] = all_trait_counts.get(row['trait'], 0) + row['count']
|
||||
|
||||
consistent_sort_order = sorted(all_trait_counts.keys(), key=lambda x: -all_trait_counts[x])
|
||||
|
||||
_content = """"""
|
||||
# Generate 4 plots (one per character)
|
||||
for char in characters:
|
||||
freq_df, _ = S.transform_character_trait_frequency(char_df, char)
|
||||
main_color, highlight_color = character_colors[char]
|
||||
chart = S.plot_single_character_trait_frequency(
|
||||
data=freq_df,
|
||||
character_name=char,
|
||||
bar_color=main_color,
|
||||
highlight_color=highlight_color,
|
||||
trait_sort_order=consistent_sort_order,
|
||||
)
|
||||
_content += f"""
|
||||
{mo.ui.altair_chart(chart)}
|
||||
|
||||
|
||||
"""
|
||||
|
||||
mo.md(_content)
|
||||
|
||||
# %%
mo.md(r"""
## Statistical significance best characters

See chat:
> Example: if the #1 and #2 do not differ significantly, but both differ significantly from the #3, that is also a useful result. Please think along about how this could best be presented.
>
""")
# %%


# %%


# %%
mo.md(r"""
---

# Spoken Voice Results
""")
# %%
COLOR_GENDER = True


# %%
mo.md(r"""
## Top 8 Most Chosen out of 18
""")


# %%
v_18_8_3 = S.get_18_8_3(data)[0]


# %%
S.plot_voice_selection_counts(v_18_8_3, title="Top 8 Voice Selection from 18 Voices", x_label='Voice', color_gender=COLOR_GENDER)


# %%
mo.md(r"""
## Top 3 Most Chosen out of 8
""")


# %%
S.plot_top3_selection_counts(v_18_8_3, title="Top 3 Voice Selection Counts from 8 Voices", x_label='Voice', color_gender=COLOR_GENDER)


# %%
mo.md(r"""
## Voice Ranking Weighted Score
""")


# %%
top3_voices = S.get_top_3_voices(data)[0]
top3_voices_weighted = calculate_weighted_ranking_scores(top3_voices)


# %%
S.plot_weighted_ranking_score(top3_voices_weighted, title="Most Popular Voice - Weighted Popularity Score<br>(1st = 3pts, 2nd = 2pts, 3rd = 1pt)", color_gender=COLOR_GENDER)
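The weighted score in the plot title is a Borda-style 3/2/1 points scheme. A minimal sketch of that scheme is below; the real `calculate_weighted_ranking_scores` operates on the survey's polars frames, so the helper name and the data here are illustrative only.

```python
# Illustrative 3/2/1 weighted-scoring sketch; `rankings` is made-up data,
# not the survey's actual ranking frame.
RANK_POINTS = {1: 3, 2: 2, 3: 1}

def weighted_scores(rankings: dict[str, list[int]]) -> dict[str, int]:
    """Map each voice to the sum of points for the ranks it received."""
    return {
        voice: sum(RANK_POINTS.get(rank, 0) for rank in ranks)
        for voice, ranks in rankings.items()
    }

# Two respondents ranked V14 first, one ranked it second: 3 + 3 + 2 = 8 points.
print(weighted_scores({"V14": [1, 1, 2], "V22": [3, 2]}))  # {'V14': 8, 'V22': 3}
```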


# %%
mo.md(r"""
## Which voice is ranked best in the ranking question for top 3?

(not the best 3 out of 8 question)
""")


# %%
S.plot_ranking_distribution(top3_voices, x_label='Voice', title="Distribution of Top 3 Voice Rankings (1st, 2nd, 3rd)", color_gender=COLOR_GENDER)
# %%
mo.md(r"""
### Statistical significance for voice ranking
""")


# %%
# print(top3_voices.collect().head())


# %%
# _pairwise_df, _metadata = S.compute_ranking_significance(
#     top3_voices, alpha=0.05, correction="none")

# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))

# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
#     _pairwise_df,
#     metadata=_metadata,
#     title="Weighted Voice Ranking Significance<br>(Pairwise Comparisons)"
# )

# # Create summary bar chart
# _summary = S.plot_significance_summary(
#     _pairwise_df,
#     metadata=_metadata
# )

# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}

# {mo.ui.altair_chart(_summary)}
# """)


# %%
## Voice Ranked 1st the most


# %%
S.plot_most_ranked_1(top3_voices, title="Most Popular Voice<br>(Number of Times Ranked 1st)", x_label='Voice', color_gender=COLOR_GENDER)
# %%
mo.md(r"""
## Voice Scale 1-10
""")


# %%
# Get your voice scale data (from notebook)
voice_1_10, _ = S.get_voice_scale_1_10(data)
S.plot_average_scores_with_counts(voice_1_10, x_label='Voice', domain=[1, 10], title="Voice General Impression (Scale 1-10)", color_gender=COLOR_GENDER)


# %%
mo.md(r"""
### Statistical Significance (Scale 1-10)
""")


# %%
# Compute pairwise significance tests
# pairwise_df, metadata = S.compute_pairwise_significance(
#     voice_1_10,
#     test_type="mannwhitney",  # or "ttest", "chi2", "auto"
#     alpha=0.05,
#     correction="bonferroni"  # or "holm", "none"
# )

# # View significant pairs
# # print(pairwise_df.filter(pl.col('significant') == True))

# # Create heatmap visualization
# _heatmap = S.plot_significance_heatmap(
#     pairwise_df,
#     metadata=metadata,
#     title="Voice Rating Significance<br>(Pairwise Comparisons)"
# )

# # Create summary bar chart
# _summary = S.plot_significance_summary(
#     pairwise_df,
#     metadata=metadata
# )

# mo.md(f"""
# {mo.ui.altair_chart(_heatmap)}

# {mo.ui.altair_chart(_summary)}
# """)
# %%


# %%
mo.md(r"""
## Ranking points for Voice per Chosen Brand Character

**missing mapping**
""")


# %%
mo.md(r"""
## Correlation Speaking Styles
""")


# %%
ss_or, choice_map_or = S.get_ss_orange_red(data)
ss_gb, choice_map_gb = S.get_ss_green_blue(data)

# Combine the data
ss_all = ss_or.join(ss_gb, on='_recordId')
_d = ss_all.collect()

choice_map = {**choice_map_or, **choice_map_gb}
# print(_d.head())
# print(choice_map)
df_style = utils.process_speaking_style_data(ss_all, choice_map)

vscales = S.get_voice_scale_1_10(data)[0]
df_scale_long = utils.process_voice_scale_data(vscales)

joined_scale = df_style.join(df_scale_long, on=["_recordId", "Voice"], how="inner")

df_ranking = utils.process_voice_ranking_data(top3_voices)
joined_ranking = df_style.join(df_ranking, on=['_recordId', 'Voice'], how='inner')


# %%
joined_ranking.head()
# %%
mo.md(r"""
### Colors vs Scale 1-10
""")


# %%
# Transform to get one row per color with average correlation
color_corr_scale, _ = utils.transform_speaking_style_color_correlation(joined_scale, SPEAKING_STYLES)
S.plot_speaking_style_color_correlation(
    data=color_corr_scale,
    title="Correlation: Speaking Style Colors and Voice Scale 1-10"
)


# %%
mo.md(r"""
### Colors vs Ranking Points
""")


# %%
color_corr_ranking, _ = utils.transform_speaking_style_color_correlation(
    joined_ranking,
    SPEAKING_STYLES,
    target_column="Ranking_Points"
)
S.plot_speaking_style_color_correlation(
    data=color_corr_ranking,
    title="Correlation: Speaking Style Colors and Voice Ranking Points"
)


# %%
# Gender-filtered correlation plots (Male vs Female voices)
from reference import VOICE_GENDER_MAPPING

MALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Male"]
FEMALE_VOICES = [v for v, g in VOICE_GENDER_MAPPING.items() if g == "Female"]

# Filter joined data by voice gender
joined_scale_male = joined_scale.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_scale_female = joined_scale.filter(pl.col("Voice").is_in(FEMALE_VOICES))
joined_ranking_male = joined_ranking.filter(pl.col("Voice").is_in(MALE_VOICES))
joined_ranking_female = joined_ranking.filter(pl.col("Voice").is_in(FEMALE_VOICES))

# Colors vs Scale 1-10 (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
    data_male=joined_scale_male,
    data_female=joined_scale_female,
    speaking_styles=SPEAKING_STYLES,
    target_column="Voice_Scale_Score",
    title="Correlation: Speaking Style Colors and Voice Scale 1-10 (by Voice Gender)",
    filename="correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_color",
)

# Colors vs Ranking Points (grouped by voice gender)
S.plot_speaking_style_color_correlation_by_gender(
    data_male=joined_ranking_male,
    data_female=joined_ranking_female,
    speaking_styles=SPEAKING_STYLES,
    target_column="Ranking_Points",
    title="Correlation: Speaking Style Colors and Voice Ranking Points (by Voice Gender)",
    filename="correlation_speaking_style_and_voice_ranking_points_by_voice_gender_color",
)
# %%
mo.md(r"""
### Individual Traits vs Scale 1-10
""")


# %%
_content = ""

for _style, _traits in SPEAKING_STYLES.items():
    # print(f"Correlation plot for {_style}...")
    _fig = S.plot_speaking_style_scale_correlation(
        data=joined_scale,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10",
    )
    _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
mo.md(_content)


# %%
mo.md(r"""
### Individual Traits vs Ranking Points
""")


# %%
_content = ""

for _style, _traits in SPEAKING_STYLES.items():
    # print(f"Correlation plot for {_style}...")
    _fig = S.plot_speaking_style_ranking_correlation(
        data=joined_ranking,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Ranking Points",
    )
    _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
mo.md(_content)


# %%
# Individual Traits vs Scale 1-10 (grouped by voice gender)
_content = """### Individual Traits vs Scale 1-10 (by Voice Gender)\n\n"""

for _style, _traits in SPEAKING_STYLES.items():
    _fig = S.plot_speaking_style_scale_correlation_by_gender(
        data_male=joined_scale_male,
        data_female=joined_scale_female,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Scale 1-10 (by Voice Gender)",
        filename=f"correlation_speaking_style_and_voice_scale_1-10_by_voice_gender_{_style.lower()}",
    )
    _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
mo.md(_content)


# %%
# Individual Traits vs Ranking Points (grouped by voice gender)
_content = """### Individual Traits vs Ranking Points (by Voice Gender)\n\n"""

for _style, _traits in SPEAKING_STYLES.items():
    _fig = S.plot_speaking_style_ranking_correlation_by_gender(
        data_male=joined_ranking_male,
        data_female=joined_ranking_female,
        style_color=_style,
        style_traits=_traits,
        title=f"Correlation: Speaking Style {_style} and Voice Ranking Points (by Voice Gender)",
        filename=f"correlation_speaking_style_and_voice_ranking_points_by_voice_gender_{_style.lower()}",
    )
    _content += f"""
#### Speaking Style **{_style}**:

{mo.ui.altair_chart(_fig)}

"""
mo.md(_content)
# %%
# ## Correlations when "Best Brand Character" is chosen
# For each of the 4 brand characters, filter the dataset to only those respondents
# who selected that character as their #1 choice.


# %%
# Prepare character-filtered data subsets
char_rank_for_filter = S.get_character_ranking(data)[0].collect()

CHARACTER_FILTER_MAP = {
    'Familiar Friend': 'Character_Ranking_Familiar_Friend',
    'The Coach': 'Character_Ranking_The_Coach',
    'Personal Assistant': 'Character_Ranking_The_Personal_Assistant',
    'Bank Teller': 'Character_Ranking_The_Bank_Teller',
}

def get_filtered_data_for_character(char_name: str) -> tuple[pl.DataFrame, pl.DataFrame, int]:
    """Filter joined_scale and joined_ranking to respondents who ranked char_name #1."""
    col = CHARACTER_FILTER_MAP[char_name]
    respondents = char_rank_for_filter.filter(pl.col(col) == 1).select('_recordId')
    n = respondents.height
    filtered_scale = joined_scale.join(respondents, on='_recordId', how='inner')
    filtered_ranking = joined_ranking.join(respondents, on='_recordId', how='inner')
    return filtered_scale, filtered_ranking, n

def _char_filename(char_name: str, suffix: str) -> str:
    """Generate filename for character-filtered plots (without n-value).

    Format: bc_ranked_1_{suffix}__{char_slug}
    This groups all plot types together in directory listings.
    """
    char_slug = char_name.lower().replace(' ', '_')
    return f"bc_ranked_1_{suffix}__{char_slug}"
# %%
# ### Voice Weighted Ranking Score (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
    _, _, n = get_filtered_data_for_character(char_name)
    # Get top3 voices for this character subset using _recordIds
    respondents = char_rank_for_filter.filter(
        pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
    ).select('_recordId')
    # Collect top3_voices if it's a LazyFrame, then join
    top3_df = top3_voices.collect() if isinstance(top3_voices, pl.LazyFrame) else top3_voices
    filtered_top3 = top3_df.join(respondents, on='_recordId', how='inner')
    weighted = calculate_weighted_ranking_scores(filtered_top3)
    S.plot_weighted_ranking_score(
        data=weighted,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Most Popular Voice - Weighted Score (1st=3pts, 2nd=2pts, 3rd=1pt)',
        filename=_char_filename(char_name, "voice_weighted_ranking_score"),
        color_gender=COLOR_GENDER,
    )


# %%
# ### Voice Scale 1-10 Average Scores (by Best Character)
for char_name in CHARACTER_FILTER_MAP:
    _, _, n = get_filtered_data_for_character(char_name)
    # Get voice scale data for this character subset using _recordIds
    respondents = char_rank_for_filter.filter(
        pl.col(CHARACTER_FILTER_MAP[char_name]) == 1
    ).select('_recordId')
    # Collect voice_1_10 if it's a LazyFrame, then join
    voice_1_10_df = voice_1_10.collect() if isinstance(voice_1_10, pl.LazyFrame) else voice_1_10
    filtered_voice_1_10 = voice_1_10_df.join(respondents, on='_recordId', how='inner')
    S.plot_average_scores_with_counts(
        data=filtered_voice_1_10,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Voice General Impression (Scale 1-10)',
        filename=_char_filename(char_name, "voice_scale_1-10"),
        x_label='Voice',
        domain=[1, 10],
        color_gender=COLOR_GENDER,
    )
# %%
# ### Speaking Style Colors vs Scale 1-10 (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
    if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
        continue

    filtered_scale, _, n = get_filtered_data_for_character(char_name)
    color_corr, _ = utils.transform_speaking_style_color_correlation(filtered_scale, SPEAKING_STYLES)
    S.plot_speaking_style_color_correlation(
        data=color_corr,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Scale 1-10',
        filename=_char_filename(char_name, "colors_vs_voice_scale_1-10"),
    )


# %%
# ### Speaking Style Colors vs Ranking Points (only for Best Character)
for char_name in CHARACTER_FILTER_MAP:
    if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
        continue

    _, filtered_ranking, n = get_filtered_data_for_character(char_name)
    color_corr, _ = utils.transform_speaking_style_color_correlation(
        filtered_ranking, SPEAKING_STYLES, target_column="Ranking_Points"
    )
    S.plot_speaking_style_color_correlation(
        data=color_corr,
        title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: Speaking Style Colors vs Voice Ranking Points',
        filename=_char_filename(char_name, "colors_vs_voice_ranking_points"),
    )


# %%
# ### Individual Traits vs Scale 1-10 (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
    print(f"--- Speaking Style: {_style} ---")
    for char_name in CHARACTER_FILTER_MAP:
        if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
            continue

        filtered_scale, _, n = get_filtered_data_for_character(char_name)
        S.plot_speaking_style_scale_correlation(
            data=filtered_scale,
            style_color=_style,
            style_traits=_traits,
            title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Scale 1-10',
            filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_scale_1-10"),
        )


# %%
# ### Individual Traits vs Ranking Points (only for Best Character)
for _style, _traits in SPEAKING_STYLES.items():
    print(f"--- Speaking Style: {_style} ---")
    for char_name in CHARACTER_FILTER_MAP:
        if char_name.lower().replace(' ', '_') != BEST_CHOSEN_CHARACTER:
            continue

        _, filtered_ranking, n = get_filtered_data_for_character(char_name)
        S.plot_speaking_style_ranking_correlation(
            data=filtered_ranking,
            style_color=_style,
            style_traits=_traits,
            title=f'"{char_name}" Ranked #1 (n={n})<br>Correlation: {_style} vs Voice Ranking Points',
            filename=_char_filename(char_name, f"{_style.lower()}_vs_voice_ranking_points"),
        )


# %%
XX_statistical_significance.script.py (new file, 370 lines)
"""Extra statistical significance analyses for quant report."""
# %% Imports

import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path


# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'


# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
    'age': 'options_age',
    'gender': 'options_gender',
    'ethnicity': 'options_ethnicity',
    'income': 'options_income',
    'consumer': 'options_consumer',
    'business_owner': 'options_business_owner',
    'ai_user': 'options_ai_user',
    'investable_assets': 'options_investable_assets',
    'industry': 'options_industry',
}

def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')

    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')

    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')

    # Only parse if running as script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        # Use the same default as argparse
        default_fig_dir = f'figures/statistical_significance/{Path(RESULTS_FILE).parts[2]}'
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args

cli_args = parse_cli_args()
# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()


# %% Build filtered dataset based on CLI args

# CLI args: None means "no filter applied" - filter_data() will skip None filters

# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}

_d = S.filter_data(data_all, **_active_filters)

# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)

    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f"  {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f"  {display_name}: All")

    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))

    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"

    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for the same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)

# Save to a logical variable name for further analysis
data = _d
data.collect()
# %% Character Coach significantly higher than others

char_rank = S.get_character_ranking(data)[0]

_pairwise_df, _meta = S.compute_ranking_significance(
    char_rank,
    alpha=0.05,
    correction="none",
)
# %% [markdown]
"""
### Methodology Analysis

**Input Data (`char_rank`)**:
* Generated by `S.get_character_ranking(data)`.
* Contains the ranking values (1st, 2nd, 3rd, 4th) assigned by each respondent to the four options ("The Coach", etc.).
* Columns represent the characters; rows represent individual respondents; values are the numerical rank (1 = Top Choice).

**Processing**:
* The function `compute_ranking_significance` aggregates these rankings to find the **"Rank 1 Share"** (the percentage of respondents who picked that character as their #1 favorite).
* It builds a contingency table of how many times each character was ranked 1st vs. not 1st (or 1st vs. 2nd vs. 3rd).

**Statistical Test**:
* **Test Used**: Pairwise z-test for two proportions (uncorrected).
* **Comparison**: It compares the **Rank 1 Share** of every pair of characters.
    * *Example*: "Is the 42% of people who chose 'Coach' significantly different from the 29% who chose 'Familiar Friend'?"
* **Significance**: A result of `p < 0.05` means the difference in popularity (top-choice preference) is statistically significant and unlikely to be due to random chance.
"""


# %% Plot heatmap of pairwise significance
S.plot_significance_heatmap(_pairwise_df, metadata=_meta, title="Statistical Significance: Character Top Choice Preference")


# %% Plot summary of significant differences (e.g., which characters are significantly higher than others)
# S.plot_significance_summary(_pairwise_df, metadata=_meta)


# %% [markdown]
"""
# Analysis: Significance of "The Coach"

**Parameters**: `alpha=0.05`, `correction='none'`
* **Rationale**: No correction was applied, to allow detection of all potential pairwise differences (uncorrected p < 0.05). If strict control of the family-wise error rate were required (e.g., Bonferroni), the significance threshold would be lower (p < 0.0083).

**Results**:
"The Coach" is the top-ranked option (42.0% Rank 1 share) and shows strong separation from the field.

* **Vs. Bottom Two**: "The Coach" is significantly higher than both "The Bank Teller" (26.9%, p < 0.001) and "Familiar Friend" (29.4%, p < 0.001).
* **Vs. Runner-Up**: "The Coach" is preferred over "The Personal Assistant" (33.4%). The difference of **8.6 percentage points** is statistically significant (p = 0.017) at the standard 0.05 level.
    * *Note*: While p = 0.017 is significant in isolation, it would not meet the stricter Bonferroni threshold (0.0083). However, the effect size (+8.6 points) is commercially meaningful.

**Conclusion**:
Yes, "The Coach" can be considered statistically significantly more popular than the other options. It is clearly ahead of the bottom two and holds a statistically significant lead over the runner-up ("The Personal Assistant") in direct comparison.
"""


# %% Mentions significance analysis

char_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
    char_rank,
    alpha=0.05,
    correction="none",
)

S.plot_significance_heatmap(
    char_pairwise_df_mentions,
    metadata=_meta_mentions,
    title="Statistical Significance: Character Total Mentions (Top 3 Visibility)"
)


# %% Voices analysis
top3_voices = S.get_top_3_voices(data)[0]

_pairwise_df_voice, _metadata = S.compute_ranking_significance(
    top3_voices, alpha=0.05, correction="none")

S.plot_significance_heatmap(
    _pairwise_df_voice,
    metadata=_metadata,
    title="Statistical Significance: Voice Top Choice Preference"
)

# %% Total Mentions Significance (Rank 1+2+3 Combined)
# This tests "Quantity" (Visibility) instead of "Quality" (Preference)

_pairwise_df_mentions, _meta_mentions = S.compute_mentions_significance(
    top3_voices,
    alpha=0.05,
    correction="none"
)

S.plot_significance_heatmap(
    _pairwise_df_mentions,
    metadata=_meta_mentions,
    title="Statistical Significance: Voice Total Mentions (Top 3 Visibility)"
)
# %% Male Voices Only Analysis
import reference

def filter_voices_by_gender(df: pl.DataFrame, target_gender: str) -> pl.DataFrame:
    """Filter ranking columns to keep only those matching the target gender."""
    cols_to_keep = []

    # Always keep identifier if present
    if '_recordId' in df.columns:
        cols_to_keep.append('_recordId')

    for col in df.columns:
        # Check if column is a voice column (contains Vxx)
        # Format is typically "Top_3_Voices_ranking__V14"
        if '__V' in col:
            voice_id = col.split('__')[1]
            if reference.VOICE_GENDER_MAPPING.get(voice_id) == target_gender:
                cols_to_keep.append(col)

    return df.select(cols_to_keep)

# Get full ranking data as DataFrame
df_voices = top3_voices.collect()

# Filter for Male voices
df_male_voices = filter_voices_by_gender(df_voices, 'Male')

# 1. Male Voices: Top Choice Preference (Rank 1)
_pairwise_male_pref, _meta_male_pref = S.compute_ranking_significance(
    df_male_voices,
    alpha=0.05,
    correction="none"
)

S.plot_significance_heatmap(
    _pairwise_male_pref,
    metadata=_meta_male_pref,
    title="Male Voices Only: Top Choice Preference Significance"
)

# 2. Male Voices: Total Mentions (Visibility)
_pairwise_male_vis, _meta_male_vis = S.compute_mentions_significance(
    df_male_voices,
    alpha=0.05,
    correction="none"
)

S.plot_significance_heatmap(
    _pairwise_male_vis,
    metadata=_meta_male_vis,
    title="Male Voices Only: Total Mentions Significance"
)
# %% Male Voices (Excluding Bottom 3: V88, V86, V81)

# Start with the male voices dataframe from the previous step
voices_to_exclude = ['V88', 'V86', 'V81']

def filter_exclude_voices(df: pl.DataFrame, exclude_list: list[str]) -> pl.DataFrame:
    """Filter ranking columns to exclude specific voices."""
    cols_to_keep = []

    # Always keep identifier if present
    if '_recordId' in df.columns:
        cols_to_keep.append('_recordId')

    for col in df.columns:
        # Check if column is a voice column (contains Vxx)
        if '__V' in col:
            voice_id = col.split('__')[1]
            if voice_id not in exclude_list:
                cols_to_keep.append(col)

    return df.select(cols_to_keep)

df_male_top = filter_exclude_voices(df_male_voices, voices_to_exclude)

# 1. Male Top Candidates: Top Choice Preference
_pairwise_male_top_pref, _meta_male_top_pref = S.compute_ranking_significance(
    df_male_top,
    alpha=0.05,
    correction="none"
)

S.plot_significance_heatmap(
    _pairwise_male_top_pref,
    metadata=_meta_male_top_pref,
    title="Male Voices (Excl. Bottom 3): Top Choice Preference Significance"
)

# 2. Male Top Candidates: Total Mentions
_pairwise_male_top_vis, _meta_male_top_vis = S.compute_mentions_significance(
    df_male_top,
    alpha=0.05,
    correction="none"
)

S.plot_significance_heatmap(
    _pairwise_male_top_vis,
    metadata=_meta_male_top_vis,
    title="Male Voices (Excl. Bottom 3): Total Mentions Significance"
)
# %% [markdown]
"""
# Rank 1 Selection Significance (Voice Level)

Similar to the Total Mentions significance analysis above, but counting
only how many times each voice was ranked **1st** (out of all respondents).
This isolates first-choice preference rather than overall top-3 visibility.
"""
|
||||
|
||||
# %% Rank 1 Significance: All Voices
|
||||
|
||||
_pairwise_df_rank1, _meta_rank1 = S.compute_rank1_significance(
|
||||
top3_voices,
|
||||
alpha=0.05,
|
||||
correction="none",
|
||||
)
|
||||
|
||||
S.plot_significance_heatmap(
|
||||
_pairwise_df_rank1,
|
||||
metadata=_meta_rank1,
|
||||
title="Statistical Significance: Voice Rank 1 Selection"
|
||||
)
|
||||
|
||||
# %% Rank 1 Significance: Male Voices Only
|
||||
|
||||
_pairwise_df_rank1_male, _meta_rank1_male = S.compute_rank1_significance(
|
||||
df_male_voices,
|
||||
alpha=0.05,
|
||||
correction="none",
|
||||
)
|
||||
|
||||
S.plot_significance_heatmap(
|
||||
_pairwise_df_rank1_male,
|
||||
metadata=_meta_rank1_male,
|
||||
title="Male Voices Only: Rank 1 Selection Significance"
|
||||
)
|
||||
|
||||
# %%
|
||||
267  XX_straight_liners.py  Normal file
@@ -0,0 +1,267 @@
"""Extra analyses of the straight-liners"""
# %% Imports

import utils
import polars as pl
import argparse
import json
import re
from pathlib import Path
from validation import check_straight_liners


# %% Fixed Variables
RESULTS_FILE = 'data/exports/2-4-26/JPMC_Chase Brand Personality_Quant Round 1_February 4, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'


# %% CLI argument parsing for batch automation
# When run as script: uv run XX_statistical_significance.script.py --age '["18
# Central filter configuration - add new filters here only
# Format: 'cli_arg_name': 'QualtricsSurvey.options_* attribute name'
FILTER_CONFIG = {
    'age': 'options_age',
    'gender': 'options_gender',
    'ethnicity': 'options_ethnicity',
    'income': 'options_income',
    'consumer': 'options_consumer',
    'business_owner': 'options_business_owner',
    'ai_user': 'options_ai_user',
    'investable_assets': 'options_investable_assets',
    'industry': 'options_industry',
}


def parse_cli_args():
    parser = argparse.ArgumentParser(description='Generate quant report with optional filters')

    # Dynamically add filter arguments from config
    for filter_name in FILTER_CONFIG:
        parser.add_argument(f'--{filter_name}', type=str, default=None, help=f'JSON list of {filter_name} values')

    parser.add_argument('--filter-name', type=str, default=None, help='Name for this filter combination (used for .txt description file)')
    parser.add_argument('--figures-dir', type=str, default=f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}', help='Override the default figures directory')

    # Only parse if running as script (not in Jupyter/interactive)
    try:
        # Check if running in Jupyter by looking for ipykernel
        get_ipython()  # noqa: F821 # type: ignore
        # Return namespace with all filters set to None
        no_filters = {f: None for f in FILTER_CONFIG}
        # Use the same default as argparse
        default_fig_dir = f'figures/straight-liner-analysis/{Path(RESULTS_FILE).parts[2]}'
        return argparse.Namespace(**no_filters, filter_name=None, figures_dir=default_fig_dir)
    except NameError:
        args = parser.parse_args()
        # Parse JSON strings to lists
        for filter_name in FILTER_CONFIG:
            val = getattr(args, filter_name)
            setattr(args, filter_name, json.loads(val) if val else None)
        return args


cli_args = parse_cli_args()


# %%
S = utils.QualtricsSurvey(RESULTS_FILE, QSF_FILE, figures_dir=cli_args.figures_dir)
data_all = S.load_data()


# %% Build filtered dataset based on CLI args

# CLI args: None means "no filter applied" - filter_data() will skip None filters

# Build filter values dict dynamically from FILTER_CONFIG
_active_filters = {filter_name: getattr(cli_args, filter_name) for filter_name in FILTER_CONFIG}

_d = S.filter_data(data_all, **_active_filters)

# Write filter description file if filter-name is provided
if cli_args.filter_name and S.fig_save_dir:
    # Get the filter slug (e.g., "All_Respondents", "Cons-Starter", etc.)
    _filter_slug = S._get_filter_slug()
    _filter_slug_dir = S.fig_save_dir / _filter_slug
    _filter_slug_dir.mkdir(parents=True, exist_ok=True)

    # Build filter description
    _filter_desc_lines = [
        f"Filter: {cli_args.filter_name}",
        "",
        "Applied Filters:",
    ]
    _short_desc_parts = []
    for filter_name, options_attr in FILTER_CONFIG.items():
        all_options = getattr(S, options_attr)
        values = _active_filters[filter_name]
        display_name = filter_name.replace('_', ' ').title()
        # None means no filter applied (same as "All")
        if values is not None and values != all_options:
            _short_desc_parts.append(f"{display_name}: {', '.join(values)}")
            _filter_desc_lines.append(f"  {display_name}: {', '.join(values)}")
        else:
            _filter_desc_lines.append(f"  {display_name}: All")

    # Write detailed description INSIDE the filter-slug directory
    # Sanitize filter name for filename usage (replace / and other chars)
    _safe_filter_name = re.sub(r'[^\w\s-]', '_', cli_args.filter_name)
    _filter_file = _filter_slug_dir / f"{_safe_filter_name}.txt"
    _filter_file.write_text('\n'.join(_filter_desc_lines))

    # Append to summary index file at figures/<export_date>/filter_index.txt
    _summary_file = S.fig_save_dir / "filter_index.txt"
    _short_desc = "; ".join(_short_desc_parts) if _short_desc_parts else "All Respondents"
    _summary_line = f"{_filter_slug} | {cli_args.filter_name} | {_short_desc}\n"

    # Append or create the summary file
    if _summary_file.exists():
        _existing = _summary_file.read_text()
        # Avoid duplicate entries for same slug
        if _filter_slug not in _existing:
            with _summary_file.open('a') as f:
                f.write(_summary_line)
    else:
        _header = "Filter Index\n" + "=" * 80 + "\n\n"
        _header += "Directory | Filter Name | Description\n"
        _header += "-" * 80 + "\n"
        _summary_file.write_text(_header + _summary_line)

# Save to logical variable name for further analysis
data = _d
data.collect()


# %% Determine straight-liner repeat offenders
# Extract question groups with renamed columns that check_straight_liners expects.
# The raw `data` has QID-based column names; the getter methods rename them to
# patterns like SS_Green_Blue__V14__Choice_1, Voice_Scale_1_10__V48, etc.

ss_or, _ = S.get_ss_orange_red(data)
ss_gb, _ = S.get_ss_green_blue(data)
vs, _ = S.get_voice_scale_1_10(data)

# Combine all question groups into one wide LazyFrame (joined on _recordId)
all_questions = ss_or.join(ss_gb, on='_recordId').join(vs, on='_recordId')

# Run straight-liner detection across all question groups
# max_score=5 catches all speaking-style straight-lining (1-5 scale)
# and voice-scale values ≤5 on the 1-10 scale
# Note: sl_threshold is NOT set on S here — this script analyses straight-liners,
# it doesn't filter them out of the dataset.
print("Running straight-liner detection across all question groups...")
sl_report, sl_df = check_straight_liners(all_questions, max_score=5)


# %% Quantify repeat offenders
# sl_df has one row per (Record ID, Question Group) that was straight-lined.
# Group by Record ID to count how many question groups each person SL'd.

if sl_df is not None and not sl_df.is_empty():
    total_respondents = data.select(pl.len()).collect().item()

    # Per-respondent count of straight-lined question groups
    respondent_sl_counts = (
        sl_df
        .group_by("Record ID")
        .agg(pl.len().alias("sl_count"))
        .sort("sl_count", descending=True)
    )

    max_sl = respondent_sl_counts["sl_count"].max()
    print(f"\nTotal respondents: {total_respondents}")
    print(f"Respondents who straight-lined at least 1 question group: "
          f"{respondent_sl_counts.height}")
    print(f"Maximum question groups straight-lined by one person: {max_sl}")
    print()

    # Build cumulative distribution: for each threshold N, count respondents
    # who straight-lined >= N question groups
    cumulative_rows = []
    for threshold in range(1, max_sl + 1):
        count = respondent_sl_counts.filter(
            pl.col("sl_count") >= threshold
        ).height
        pct = (count / total_respondents) * 100
        cumulative_rows.append({
            "threshold": threshold,
            "count": count,
            "pct": pct,
        })
        print(
            f"  ≥{threshold} question groups straight-lined: "
            f"{count} respondents ({pct:.1f}%)"
        )

    cumulative_df = pl.DataFrame(cumulative_rows)
    print(f"\n{cumulative_df}")

    # %% Save cumulative data to CSV
    _filter_slug = S._get_filter_slug()
    _csv_dir = Path(S.fig_save_dir) / _filter_slug
    _csv_dir.mkdir(parents=True, exist_ok=True)

    _csv_path = _csv_dir / "straight_liner_repeat_offenders.csv"
    cumulative_df.write_csv(_csv_path)
    print(f"Saved cumulative data to {_csv_path}")

    # %% Plot the cumulative distribution
    S.plot_straight_liner_repeat_offenders(
        cumulative_df,
        total_respondents=total_respondents,
    )

    # %% Per-question straight-lining frequency
    # Build human-readable question group names from the raw keys
    def _humanise_question_group(key: str) -> str:
        """Convert internal question group key to a readable label.

        Examples:
            SS_Green_Blue__V14 → Green/Blue – V14
            SS_Orange_Red__V48 → Orange/Red – V48
            Voice_Scale_1_10 → Voice Scale (1-10)
        """
        if key.startswith("SS_Green_Blue__"):
            voice = key.split("__")[1]
            return f"Green/Blue – {voice}"
        if key.startswith("SS_Orange_Red__"):
            voice = key.split("__")[1]
            return f"Orange/Red – {voice}"
        if key == "Voice_Scale_1_10":
            return "Voice Scale (1-10)"
        # Fallback: replace underscores
        return key.replace("_", " ")

    per_question_counts = (
        sl_df
        .group_by("Question Group")
        .agg(pl.col("Record ID").n_unique().alias("count"))
        .sort("count", descending=True)
        .with_columns(
            (pl.col("count") / total_respondents * 100).alias("pct")
        )
    )

    # Add human-readable names
    per_question_counts = per_question_counts.with_columns(
        pl.col("Question Group").map_elements(
            _humanise_question_group, return_dtype=pl.Utf8
        ).alias("question")
    )

    print("\n--- Per-Question Straight-Lining Frequency ---")
    print(per_question_counts)

    # Save per-question data to CSV
    _csv_path_pq = _csv_dir / "straight_liner_per_question.csv"
    per_question_counts.write_csv(_csv_path_pq)
    print(f"Saved per-question data to {_csv_path_pq}")

    # Plot
    S.plot_straight_liner_per_question(
        per_question_counts,
        total_respondents=total_respondents,
    )

    # %% Show the top repeat offenders (respondents with most SL'd groups)
    print("\n--- Top Repeat Offenders ---")
    print(respondent_sl_counts.head(20))

else:
    print("No straight-liners detected in the dataset.")
1359  analysis_missing_voice_ranking.ipynb  Normal file
File diff suppressed because one or more lines are too long

BIN  docs/README.pdf  Normal file
Binary file not shown.
@@ -1,4 +1,4 @@
-# Altair Migration Plan: Plotly → Altair for JPMCPlotsMixin
+# Altair Migration Plan: Plotly → Altair for QualtricsPlotsMixin
 
 **Date:** January 28, 2026
 **Status:** Not Started
@@ -22,9 +22,9 @@ Current Plotly implementation has a critical layout issue: filter annotations ov
 ## Current System Analysis
 
 ### File Structure
-- **`plots.py`** - Contains `JPMCPlotsMixin` class with 10 plotting methods
+- **`plots.py`** - Contains `QualtricsPlotsMixin` class with 10 plotting methods
 - **`theme.py`** - Contains `ColorPalette` class with all styling constants
-- **`utils.py`** - Contains `JPMCSurvey` class that mixes in `JPMCPlotsMixin`
+- **`utils.py`** - Contains `QualtricsSurvey` class that mixes in `QualtricsPlotsMixin`
 
 ### Color Palette (from theme.py)
 ```python
@@ -1140,10 +1140,10 @@ uv remove plotly kaleido
 ```python
 import marimo as mo
 import polars as pl
-from utils import JPMCSurvey
+from utils import QualtricsSurvey
 
 # Load sample data
-survey = JPMCSurvey()
+survey = QualtricsSurvey()
 survey.load_data('path/to/data')
 survey.fig_save_dir = 'figures/altair_test'
 
@@ -1244,7 +1244,7 @@ After completing all tasks, verify the following:
 ### Regression Testing
 - [ ] Existing Marimo notebooks still work
 - [ ] Data filtering still works (`filter_data()`)
-- [ ] `JPMCSurvey` class initialization unchanged
+- [ ] `QualtricsSurvey` class initialization unchanged
 - [ ] No breaking changes to public API
 
 ### Documentation
104  docs/figures_structure_manual.md  Normal file
@@ -0,0 +1,104 @@
# Appendix: Quantitative Analysis Plots - Folder Structure Manual

This folder contains all the quantitative analysis plots, sorted by the filters applied to the dataset. Each folder corresponds to a specific demographic cut.

## Folder Overview

* `All_Respondents/`: Analysis of the full dataset (no filters).
* `filter_index.txt`: A master list of every folder code and its corresponding demographic filter.
* **Filter Folders**: All other folders represent specific demographic cuts (e.g., `Age-18to21years`, `Gen-Woman`).

## How to Navigate

Each folder contains the same set of charts generated for that specific filter.

## Directory Reference Table

Below is the complete list of folder names. These names are encodings of the filters applied to the dataset, which we use to maintain consistency across our analysis.

| Directory Code | Filter Description |
| :--- | :--- |
| All_Respondents | All Respondents |
| Age-18to21years | Age: 18 to 21 years |
| Age-22to24years | Age: 22 to 24 years |
| Age-25to34years | Age: 25 to 34 years |
| Age-35to40years | Age: 35 to 40 years |
| Age-41to50years | Age: 41 to 50 years |
| Age-51to59years | Age: 51 to 59 years |
| Age-60to70years | Age: 60 to 70 years |
| Age-70yearsormore | Age: 70 years or more |
| Gen-Man | Gender: Man |
| Gen-Prefernottosay | Gender: Prefer not to say |
| Gen-Woman | Gender: Woman |
| Eth-6_grps_c64411 | Ethnicity: All options containing 'Alaska Native or Indigenous American' |
| Eth-6_grps_8f145b | Ethnicity: All options containing 'Asian or Asian American' |
| Eth-8_grps_71ac47 | Ethnicity: All options containing 'Black or African American' |
| Eth-7_grps_c5b3ce | Ethnicity: All options containing 'Hispanic or Latinx' |
| Eth-BlackorAfricanAmerican<br>MiddleEasternorNorthAfrican<br>WhiteorCaucasian+<br>MiddleEasternorNorthAfrican | Ethnicity: Middle Eastern or North African |
| Eth-AsianorAsianAmericanBlackorAfricanAmerican<br>NativeHawaiianorOtherPacificIslander+<br>NativeHawaiianorOtherPacificIslander | Ethnicity: Native Hawaiian or Other Pacific Islander |
| Eth-10_grps_cef760 | Ethnicity: All options containing 'White or Caucasian' |
| Inc-100000to149999 | Income: $100,000 to $149,999 |
| Inc-150000to199999 | Income: $150,000 to $199,999 |
| Inc-200000ormore | Income: $200,000 or more |
| Inc-25000to34999 | Income: $25,000 to $34,999 |
| Inc-35000to54999 | Income: $35,000 to $54,999 |
| Inc-55000to79999 | Income: $55,000 to $79,999 |
| Inc-80000to99999 | Income: $80,000 to $99,999 |
| Inc-Lessthan25000 | Income: Less than $25,000 |
| Cons-Lower_Mass_A+Lower_Mass_B | Consumer: Lower_Mass_A, Lower_Mass_B |
| Cons-MassAffluent_A+MassAffluent_B | Consumer: MassAffluent_A, MassAffluent_B |
| Cons-Mass_A+Mass_B | Consumer: Mass_A, Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_A+<br>Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_A, Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Early_Professional | Consumer: Early_Professional |
| Cons-Lower_Mass_B | Consumer: Lower_Mass_B |
| Cons-MassAffluent_B | Consumer: MassAffluent_B |
| Cons-Mass_B | Consumer: Mass_B |
| Cons-Mix_of_Affluent_Wealth__<br>High_Net_Woth_B | Consumer: Mix_of_Affluent_Wealth_&_High_Net_Woth_B |
| Cons-Starter | Consumer: Starter |
| BizOwn-No | Business Owner: No |
| BizOwn-Yes | Business Owner: Yes |
| AI-Daily | Ai User: Daily |
| AI-Lessthanonceamonth | Ai User: Less than once a month |
| AI-Morethanoncedaily | Ai User: More than once daily |
| AI-Multipletimesperweek | Ai User: Multiple times per week |
| AI-Onceamonth | Ai User: Once a month |
| AI-Onceaweek | Ai User: Once a week |
| AI-RarelyNever | Ai User: Rarely/Never |
| AI-Daily+<br>Morethanoncedaily+<br>Multipletimesperweek | Ai User: Daily, More than once daily, Multiple times per week |
| AI-4_grps_d4f57a | Ai User: Once a week, Once a month, Less than once a month, Rarely/Never |
| InvAsts-0to24999 | Investable Assets: $0 to $24,999 |
| InvAsts-150000to249999 | Investable Assets: $150,000 to $249,999 |
| InvAsts-1Mto4.9M | Investable Assets: $1M to $4.9M |
| InvAsts-25000to49999 | Investable Assets: $25,000 to $49,999 |
| InvAsts-250000to499999 | Investable Assets: $250,000 to $499,999 |
| InvAsts-50000to149999 | Investable Assets: $50,000 to $149,999 |
| InvAsts-500000to999999 | Investable Assets: $500,000 to $999,999 |
| InvAsts-5Mormore | Investable Assets: $5M or more |
| InvAsts-Prefernottoanswer | Investable Assets: Prefer not to answer |
| Ind-Agricultureforestryfishingorhunting | Industry: Agriculture, forestry, fishing, or hunting |
| Ind-Artsentertainmentorrecreation | Industry: Arts, entertainment, or recreation |
| Ind-Broadcasting | Industry: Broadcasting |
| Ind-Construction | Industry: Construction |
| Ind-EducationCollegeuniversityoradult | Industry: Education – College, university, or adult |
| Ind-EducationOther | Industry: Education – Other |
| Ind-EducationPrimarysecondaryK-12 | Industry: Education – Primary/secondary (K-12) |
| Ind-Governmentandpublicadministration | Industry: Government and public administration |
| Ind-Hotelandfoodservices | Industry: Hotel and food services |
| Ind-InformationOther | Industry: Information – Other |
| Ind-InformationServicesanddata | Industry: Information – Services and data |
| Ind-Legalservices | Industry: Legal services |
| Ind-ManufacturingComputerandelectronics | Industry: Manufacturing – Computer and electronics |
| Ind-ManufacturingOther | Industry: Manufacturing – Other |
| Ind-Notemployed | Industry: Not employed |
| Ind-Otherindustrypleasespecify | Industry: Other industry (please specify) |
| Ind-Processing | Industry: Processing |
| Ind-Publishing | Industry: Publishing |
| Ind-Realestaterentalorleasing | Industry: Real estate, rental, or leasing |
| Ind-Retired | Industry: Retired |
| Ind-Scientificortechnicalservices | Industry: Scientific or technical services |
| Ind-Software | Industry: Software |
| Ind-Telecommunications | Industry: Telecommunications |
| Ind-Transportationandwarehousing | Industry: Transportation and warehousing |
| Ind-Utilities | Industry: Utilities |
| Ind-Wholesale | Industry: Wholesale |
428  docs/statistical-significance-guide.md  Normal file
@@ -0,0 +1,428 @@
# Statistical Significance Testing Guide

A beginner-friendly reference for choosing the right statistical test and correction method for your Voice Branding analysis.

---

## Table of Contents
1. [Quick Decision Flowchart](#quick-decision-flowchart)
2. [Understanding Your Data Types](#understanding-your-data-types)
3. [Available Tests](#available-tests)
4. [Multiple Comparison Corrections](#multiple-comparison-corrections)
5. [Interpreting Results](#interpreting-results)
6. [Code Examples](#code-examples)

---

## Quick Decision Flowchart

```
What kind of data do you have?
│
├─► Continuous scores (1-10 ratings, averages)
│   │
│   └─► Use: compute_pairwise_significance()
│       │
│       ├─► Data normally distributed? → test_type="ttest"
│       └─► Not sure / skewed data? → test_type="mannwhitney" (safer choice)
│
└─► Ranking data (1st, 2nd, 3rd place votes)
    │
    └─► Use: compute_ranking_significance()
        (automatically uses proportion z-test)
```

---

## Understanding Your Data Types

### Continuous Data
**What it looks like:** Numbers on a scale with many possible values.

| Example | Data Source |
|---------|-------------|
| Voice ratings 1-10 | `get_voice_scale_1_10()` |
| Speaking style scores | `get_ss_green_blue()` |
| Any averaged scores | Custom aggregations |

```
shape: (5, 3)
┌───────────┬─────────────────┬─────────────────┐
│ _recordId │ Voice_Scale__V14│ Voice_Scale__V04│
│ str       │ f64             │ f64             │
├───────────┼─────────────────┼─────────────────┤
│ R_001     │ 7.5             │ 6.0             │
│ R_002     │ 8.0             │ 7.5             │
│ R_003     │ 6.5             │ 8.0             │
```

### Ranking Data
**What it looks like:** Discrete ranks (1, 2, 3) or null if not ranked.

| Example | Data Source |
|---------|-------------|
| Top 3 voice rankings | `get_top_3_voices()` |
| Character rankings | `get_character_ranking()` |

```
shape: (5, 3)
┌───────────┬──────────────────┬──────────────────┐
│ _recordId │ Top_3__V14       │ Top_3__V04       │
│ str       │ i64              │ i64              │
├───────────┼──────────────────┼──────────────────┤
│ R_001     │ 1                │ null             │  ← V14 was ranked 1st
│ R_002     │ 2                │ 1                │  ← V04 was ranked 1st
│ R_003     │ null             │ 3                │  ← V04 was ranked 3rd
```

### ⚠️ Aggregated Data (Cannot Test!)
**What it looks like:** Already summarized/totaled data.

```
shape: (3, 2)
┌───────────┬────────────────┐
│ Character │ Weighted Score │  ← ALREADY AGGREGATED
│ str       │ i64            │    Lost individual variance
├───────────┼────────────────┤    Cannot do significance tests!
│ V14       │ 209            │
│ V04       │ 180            │
```

**Solution:** Go back to the raw data before aggregation.

---

## Available Tests

### 1. Mann-Whitney U Test (Default for Continuous)
**Use when:** Comparing scores/ratings between groups
**Assumes:** Nothing about distribution shape (non-parametric)
**Best for:** Most survey data, Likert scales, ratings

```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="mannwhitney"  # This is the default
)
```

**Pros:**
- Works with any distribution shape
- Robust to outliers
- Safe choice when unsure

**Cons:**
- Slightly less powerful than the t-test when data IS normally distributed

---

### 2. Independent t-Test
**Use when:** Comparing means between groups
**Assumes:** Data is approximately normally distributed
**Best for:** Large samples (n > 30 per group), truly continuous data

```python
pairwise_df, meta = S.compute_pairwise_significance(
    voice_data,
    test_type="ttest"
)
```

**Pros:**
- Most powerful when assumptions are met
- Well-understood, commonly reported

**Cons:**
- Can give misleading results if data is skewed
- Sensitive to outliers

---

### 3. Chi-Square Test
**Use when:** Comparing frequency distributions
**Assumes:** Expected counts ≥ 5 in each cell
**Best for:** Count data, categorical comparisons

```python
pairwise_df, meta = S.compute_pairwise_significance(
    count_data,
    test_type="chi2"
)
```

**Pros:**
- Designed for count/frequency data
- Tests if distributions differ

**Cons:**
- Needs sufficient sample sizes
- Less informative about the direction of a difference

---

### 4. Two-Proportion Z-Test (For Rankings)
**Use when:** Comparing ranking vote proportions
**Automatically used by:** `compute_ranking_significance()`

```python
pairwise_df, meta = S.compute_ranking_significance(ranking_data)
```

**What it tests:** "Does Voice A get a significantly different proportion of Rank 1 votes than Voice B?"
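For intuition, the arithmetic behind a two-proportion z-test can be sketched in plain Python. This is an illustrative re-implementation under standard assumptions (pooled proportion under the null, normal approximation), not the project's actual code — `compute_ranking_significance()` handles this for you:

```python
import math

def two_proportion_ztest(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for H0: both groups share one underlying proportion."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical example: Voice A gets 60 of 300 Rank 1 votes, Voice B gets 40 of 300
z, p = two_proportion_ztest(60, 300, 40, 300)  # z ≈ 2.19, p ≈ 0.03
```

With these made-up counts the difference is just significant at α = 0.05 before any multiple-comparison correction.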

---

## Multiple Comparison Corrections

### Why Do We Need Corrections?

When you compare many groups, you're doing many tests. Each test has a 5% chance of a false positive (if α = 0.05). With 17 voices:

| Comparisons | Expected False Positives (no correction) |
|-------------|------------------------------------------|
| 136 pairs | ~7 false "significant" results! |

**Corrections adjust p-values to account for this.**
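The scale of the problem is easy to compute directly. A quick illustration (assuming independent tests, which is an approximation here):

```python
alpha = 0.05
m = 136  # pairwise comparisons among 17 voices: 17 * 16 / 2

# On average, about 7 pairs will look "significant" by pure chance
expected_false_positives = alpha * m  # 6.8

# Probability of AT LEAST ONE false positive across the whole family of tests
family_wise_error = 1 - (1 - alpha) ** m  # ≈ 0.999, i.e. virtually certain

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {family_wise_error:.3f}")
```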

---

### Bonferroni Correction (Conservative)
**Formula:** `p_adjusted = p_value × number_of_comparisons`

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="bonferroni"  # This is the default
)
```

**Use when:**
- You want to be very confident about significant results
- False positives are costly (publishing, major decisions)
- You have few comparisons (< 20)

**Trade-off:** May miss real differences (more false negatives)

---

### Holm-Bonferroni Correction (Less Conservative)
**Formula:** Step-down procedure that's less strict than Bonferroni

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="holm"
)
```

**Use when:**
- You have many comparisons
- You want better power to detect real differences
- Exploratory analysis where missing a real effect is costly

**Trade-off:** Slightly higher false positive risk than Bonferroni
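To make "step-down procedure" concrete, here is a minimal sketch of the standard Holm adjustment (illustrative only — `correction="holm"` does this for you): sort the p-values ascending, multiply the i-th smallest by (m − i), and enforce monotonicity so adjusted values never decrease.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])  # smallest p gets the full ×m
        running_max = max(running_max, adj)          # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

result = holm_adjust([0.01, 0.04, 0.03])  # ≈ [0.03, 0.06, 0.06]
```

Note how the smallest p-value is penalised exactly like Bonferroni (×3), while later ones get smaller multipliers — that is where the extra power comes from.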

---

### No Correction
**Not recommended for final analysis**, but useful for exploration.

```python
pairwise_df, meta = S.compute_pairwise_significance(
    data,
    correction="none"
)
```

**Use when:**
- Initial exploration only
- You'll follow up with specific hypotheses
- You understand and accept the inflated false positive rate

---

### Correction Method Comparison

| Method | Strictness | Best For | Risk |
|--------|------------|----------|------|
| Bonferroni | Most strict | Few comparisons, high stakes | Miss real effects |
| Holm | Moderate | Many comparisons, balanced approach | Slightly more false positives |
| None | No control | Exploration only | Many false positives |

**Recommendation for Voice Branding:** Use **Holm** for exploratory analysis, **Bonferroni** for final reporting.

---

## Interpreting Results

### Key Output Columns

| Column | Meaning |
|--------|---------|
| `p_value` | Raw probability this difference happened by chance |
| `p_adjusted` | Corrected p-value (use this for decisions!) |
| `significant` | TRUE if p_adjusted < alpha (usually 0.05) |
| `effect_size` | How big the difference is (practical significance) |

### What the p-value Means

| p-value | Interpretation |
|---------|----------------|
| < 0.001 | Very strong evidence of difference |
| < 0.01 | Strong evidence |
| < 0.05 | Moderate evidence (traditional threshold) |
| 0.05 - 0.10 | Weak evidence, "trending" |
| > 0.10 | No significant evidence |

### Statistical vs Practical Significance

**Statistical significance** (p < 0.05) means the difference is unlikely due to chance.

**Practical significance** (effect size) means the difference matters in the real world.

| Effect Size (Cohen's d) | Interpretation |
|-------------------------|----------------|
| < 0.2 | Small (may not matter practically) |
| 0.2 - 0.5 | Medium |
| 0.5 - 0.8 | Large |
| > 0.8 | Very large |

**Example:** A p-value of 0.001 with an effect size of 0.1 means "we're confident there's a difference, but it's tiny."
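Cohen's d is simply the difference in group means scaled by the pooled standard deviation. A minimal stdlib sketch (illustrative — the `effect_size` column is computed for you):

```python
import math

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d: mean difference divided by the pooled (sample) standard deviation."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    var_a = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    var_b = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (ma - mb) / pooled_sd

d = cohens_d([7, 8, 9], [4, 5, 6])  # → 3.0: a 3-point gap with SD of 1 is a huge effect
```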
|
||||
|
||||
---
|
||||
|
||||
## Code Examples
|
||||
|
||||
### Example 1: Voice Scale Ratings
|
||||
|
||||
```python
|
||||
# Get the raw rating data
|
||||
voice_data, _ = S.get_voice_scale_1_10(data)
|
||||
|
||||
# Test for significant differences
|
||||
pairwise_df, meta = S.compute_pairwise_significance(
|
||||
voice_data,
|
||||
test_type="mannwhitney", # Safe default for ratings
|
||||
alpha=0.05,
|
||||
correction="bonferroni"
|
||||
)
|
||||
|
||||
# Check overall test first
|
||||
print(f"Overall test: {meta['overall_test']}")
|
||||
print(f"Overall p-value: {meta['overall_p_value']:.4f}")
|
||||
|
||||
# If overall is significant, look at pairwise
|
||||
if meta['overall_p_value'] < 0.05:
|
||||
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
|
||||
print(f"Found {sig_pairs.height} significant pairwise differences")
|
||||
|
||||
# Visualize
|
||||
S.plot_significance_heatmap(pairwise_df, metadata=meta)
|
||||
```
|
||||
|
||||
### Example 2: Top 3 Voice Rankings

```python
# Get the raw ranking data (NOT the weighted scores!)
ranking_data, _ = S.get_top_3_voices(data)

# Test for significant differences in Rank 1 proportions
pairwise_df, meta = S.compute_ranking_significance(
    ranking_data,
    alpha=0.05,
    correction="holm"  # Less conservative for many comparisons
)

# Check the chi-square test
print(f"Chi-square p-value: {meta['chi2_p_value']:.4f}")

# View the contingency table (Rank 1, 2, 3 counts per voice)
for voice, counts in meta['contingency_table'].items():
    print(f"{voice}: R1={counts[0]}, R2={counts[1]}, R3={counts[2]}")

# Find significant pairs
sig_pairs = pairwise_df.filter(pl.col('significant') == True)
print(sig_pairs)
```

### Example 3: Comparing Demographic Subgroups

```python
# Filter to specific demographics
S.filter_data(data, consumer=['Early Professional'])
early_pro_data, _ = S.get_voice_scale_1_10(data)

S.filter_data(data, consumer=['Established Professional'])
estab_pro_data, _ = S.get_voice_scale_1_10(data)

# Test each group separately, then compare the results qualitatively
# (For a direct group comparison, you'd need a different test design)
```

---

## Common Mistakes to Avoid

### ❌ Using Aggregated Data
```python
# WRONG - already summarized; individual variance is lost
weighted_scores = calculate_weighted_ranking_scores(ranking_data)
S.compute_pairwise_significance(weighted_scores)  # Will fail!
```

### ✅ Use Raw Data
```python
# RIGHT - use raw data before aggregation
ranking_data, _ = S.get_top_3_voices(data)
S.compute_ranking_significance(ranking_data)
```

### ❌ Ignoring Multiple Comparisons
```python
# WRONG - at alpha = 0.05, ~5% of pairs will be "significant" by chance alone!
S.compute_pairwise_significance(data, correction="none")
```

### ✅ Apply Correction
```python
# RIGHT - corrected p-values control false positives
S.compute_pairwise_significance(data, correction="bonferroni")
```

### ❌ Only Reporting p-values
```python
# WRONG - statistical significance isn't everything
print(f"p = {p_value}")  # Missing context!
```

### ✅ Report Effect Sizes Too
```python
# RIGHT - include practical significance
print(f"p = {p_value}, effect size = {effect_size}")
print(f"Mean difference: {mean1 - mean2:.2f} points")
```

---

## Quick Reference Card

| Data Type | Function | Default Test | Recommended Correction |
|-----------|----------|--------------|------------------------|
| Ratings (1-10) | `compute_pairwise_significance()` | Mann-Whitney U | Bonferroni |
| Rankings (1st/2nd/3rd) | `compute_ranking_significance()` | Proportion Z | Holm |
| Count frequencies | `compute_pairwise_significance(test_type="chi2")` | Chi-square | Bonferroni |

| Scenario | Correction |
|----------|------------|
| Publishing results | Bonferroni |
| Client presentation | Bonferroni |
| Exploratory analysis | Holm |
| Quick internal check | Holm or None |

---
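The two corrections recommended above can be sketched directly on a list of raw p-values; these helper functions and the p-values are illustrative, not part of the survey utilities:

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1.0."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down: scale the k-th smallest p-value by (m - k),
    then enforce monotonicity so adjusted values never decrease."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

raw_p = [0.001, 0.012, 0.030, 0.047, 0.200]
print([round(p, 3) for p in bonferroni(raw_p)])  # [0.005, 0.06, 0.15, 0.235, 1.0]
print([round(p, 3) for p in holm(raw_p)])        # [0.005, 0.048, 0.09, 0.094, 0.2]
```

Holm is uniformly at least as powerful as Bonferroni, which is why it is the suggested choice for exploratory work.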
## Further Reading

- [Statistics for Dummies Cheat Sheet](https://www.dummies.com/article/academics-the-arts/math/statistics/statistics-for-dummies-cheat-sheet-208650/)
- [Choosing the Right Statistical Test](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)
- [Multiple Comparisons Problem (Wikipedia)](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)
85 docs/wordcloud-usage.md Normal file
@@ -0,0 +1,85 @@
# Word Cloud for Personality Traits - Usage Example

This example shows how to use the `create_traits_wordcloud` function to visualize the most prominent personality traits from survey data.

## Basic Usage in Jupyter/Marimo Notebook

```python
from utils import QualtricsSurvey, create_traits_wordcloud
from pathlib import Path

# Load your survey data
RESULTS_FILE = "data/exports/1-23-26/JPMC_Chase Brand Personality_Quant Round 1_January 23, 2026_Labels.csv"
QSF_FILE = "data/19-dec_V1_quant_incl_shani_comments.qsf"

S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
data = S.load_data()

# Get the Top 3 Traits data
top3_traits = S.get_top_3_traits(data)[0]

# Create and display the word cloud
fig = create_traits_wordcloud(
    data=top3_traits,
    column='Top_3_Traits',
    title="Most Prominent Personality Traits",
    fig_save_dir='figures',  # Will save to figures/All_Respondents/
    filter_slug='All_Respondents'
)

# Display in the notebook
fig  # or plt.show()
```

|
||||
## With Active Filters
|
||||
|
||||
If you're using the survey filter methods, you can pass the filter slug:
|
||||
|
||||
```python
|
||||
# Apply filters
|
||||
S.set_filter_consumer(['Early Professional', 'Established Professional'])
|
||||
filtered_data = S.get_filtered_data()
|
||||
|
||||
# Get traits from filtered data
|
||||
top3_traits = S.get_top_3_traits(filtered_data)[0]
|
||||
|
||||
# Get the filter slug for directory naming
|
||||
filter_slug = S._get_filter_slug()
|
||||
|
||||
# Create word cloud with filtered data
|
||||
fig = create_traits_wordcloud(
|
||||
data=top3_traits,
|
||||
column='Top_3_Traits',
|
||||
title="Most Prominent Personality Traits<br>(Early & Established Professionals)",
|
||||
fig_save_dir='figures',
|
||||
filter_slug=filter_slug # e.g., 'Cons-Early_Professional_Established_Professional'
|
||||
)
|
||||
|
||||
fig
|
||||
```
|
||||
|
||||
## Function Parameters

- **data**: Polars DataFrame or LazyFrame with trait data
- **column**: Column name containing comma-separated traits (default: 'Top_3_Traits')
- **title**: Title for the word cloud
- **width**: Width in pixels (default: 1600)
- **height**: Height in pixels (default: 800)
- **background_color**: Background color (default: 'white')
- **fig_save_dir**: Directory to save the PNG to (default: None - doesn't save)
- **filter_slug**: Subdirectory name for filtered results (default: 'All_Respondents')

## Colors

The word cloud uses colors from `theme.py`:

- PRIMARY: #0077B6 (Medium Blue)
- RANK_1: #004C6D (Dark Blue)
- RANK_2: #008493 (Teal)
- RANK_3: #5AAE95 (Sea Green)

## Output

- **Returns**: a matplotlib Figure object for display in notebooks
- **Saves**: a PNG file to `{fig_save_dir}/{filter_slug}/{sanitized_title}.png` at 300 DPI

The saved files follow the same naming convention as the plots in `plots.py`.
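The save-path convention can be sketched as follows; `figure_path` and its sanitization rule are assumptions for illustration (the real rule lives in `plots.py`):

```python
import re
from pathlib import Path

def figure_path(fig_save_dir: str, filter_slug: str, title: str) -> Path:
    # Assumed sanitization: drop HTML tags (e.g. <br>), keep alphanumeric runs,
    # and join them with underscores
    plain = re.sub(r"<[^>]+>", " ", title)
    sanitized = "_".join(re.findall(r"[A-Za-z0-9]+", plain))
    return Path(fig_save_dir) / filter_slug / f"{sanitized}.png"

print(figure_path("figures", "All_Respondents", "Most Prominent Personality Traits").as_posix())
# figures/All_Respondents/Most_Prominent_Personality_Traits.png
```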
@@ -1,6 +1,6 @@

 import polars as pl
-from utils import JPMCSurvey, process_speaking_style_data, process_voice_scale_data, join_voice_and_style_data
+from utils import QualtricsSurvey, process_speaking_style_data, process_voice_scale_data, join_voice_and_style_data
 from plots import plot_speaking_style_correlation
 from speaking_styles import SPEAKING_STYLES

@@ -14,7 +14,7 @@ RESULTS_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase Bra
 QSF_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf"

 try:
-    survey = JPMCSurvey(RESULTS_FILE, QSF_FILE)
+    survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
 except TypeError:
     # Fallback if the signature is different or the file is not found (just in case)
     print("Error initializing survey with paths. Checking signature...")

3 potential_dataset_issues.md Normal file
@@ -0,0 +1,3 @@
- V46 not on the 1-10 scale. Qualtrics
- Straightliners
- V45 good in qual but poor in quant
@@ -6,6 +6,8 @@ readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "altair>=6.0.0",
    "imagehash>=4.3.1",
    "jupyter>=1.1.1",
    "marimo>=0.18.0",
    "matplotlib>=3.10.8",
    "modin[dask]>=0.37.1",
@@ -14,14 +16,21 @@ dependencies = [
    "openai>=2.9.0",
    "openpyxl>=3.1.5",
    "pandas>=2.3.3",
    "pillow>=11.0.0",
    "polars>=1.37.1",
    "pyarrow>=23.0.0",
    "pysqlite3>=0.6.0",
    "python-pptx>=1.0.2",
    "pyzmq>=27.1.0",
    "requests>=2.32.5",
    "scipy>=1.14.0",
    "taguette>=1.5.1",
    "tqdm>=4.66.0",
    "vl-convert-python>=1.9.0.post1",
    "wordcloud>=1.9.5",
]

[project.scripts]
quant-report-batch = "run_filter_combinations:main"

59 reference.py Normal file
@@ -0,0 +1,59 @@
ORIGINAL_CHARACTER_TRAITS = {
    "the_familiar_friend": [
        "Warm",
        "Friendly",
        "Approachable",
        "Familiar",
        "Casual",
        "Appreciative",
        "Benevolent",
    ],
    "the_coach": [
        "Empowering",
        "Encouraging",
        "Caring",
        "Positive",
        "Optimistic",
        "Guiding",
        "Reassuring",
    ],
    "the_personal_assistant": [
        "Forward-thinking",
        "Progressive",
        "Cooperative",
        "Intentional",
        "Resourceful",
        "Attentive",
        "Adaptive",
    ],
    "the_bank_teller": [
        "Patient",
        "Grounded",
        "Down-to-earth",
        "Stable",
        "Formal",
        "Balanced",
        "Efficient",
    ]
}

VOICE_GENDER_MAPPING = {
    "V14": "Female",
    "V04": "Female",
    "V08": "Female",
    "V77": "Female",
    "V48": "Female",
    "V82": "Female",
    "V89": "Female",
    "V91": "Female",
    "V34": "Male",
    "V69": "Male",
    "V45": "Male",
    "V46": "Male",
    "V54": "Male",
    "V74": "Male",
    "V81": "Male",
    "V86": "Male",
    "V88": "Male",
    "V16": "Male",
}
306 run_filter_combinations.py Normal file
@@ -0,0 +1,306 @@
#!/usr/bin/env python
"""
Batch runner for the quant report with different filter combinations.

Runs 03_quant_report.script.py for each single-filter combination:
- Each age group (with all others active)
- Each gender (with all others active)
- Each ethnicity (with all others active)
- Each income group (with all others active)
- Each consumer segment (with all others active)

Usage:
    uv run python run_filter_combinations.py
    uv run python run_filter_combinations.py --dry-run            # Preview combinations without running
    uv run python run_filter_combinations.py --category age       # Only run age combinations
    uv run python run_filter_combinations.py --category consumer  # Only run consumer segment combinations
"""

import subprocess
import sys
import json
from pathlib import Path

from tqdm import tqdm

from utils import QualtricsSurvey


# Default data paths (same as in 03_quant_report.script.py)
RESULTS_FILE = 'data/exports/2-2-26/JPMC_Chase Brand Personality_Quant Round 1_February 2, 2026_Labels.csv'
QSF_FILE = 'data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf'

REPORT_SCRIPT = Path(__file__).parent / '03_quant_report.script.py'


def get_filter_combinations(survey: QualtricsSurvey, category: str = None) -> list[dict]:
    """
    Generate all single-filter combinations.

    Each combination isolates ONE filter value while keeping all others at "all selected".

    Args:
        survey: QualtricsSurvey instance with loaded data
        category: Optional filter category to limit combinations to.
            Valid values: 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer',
            'business_owner', 'ai_user', 'investable_assets', 'industry'.
            If None or 'all', generates all combinations.

    Returns:
        List of dicts with filter kwargs for each run.
    """
    combinations = []

    # Add the "All Respondents" run (no filters = all options selected)
    if not category or category in ['all_filters', 'all']:
        combinations.append({
            'name': 'All_Respondents',
            'filters': {}  # Empty = use defaults (all selected)
        })

    # Age groups - one at a time
    if not category or category in ['all_filters', 'age']:
        for age in survey.options_age:
            combinations.append({
                'name': f'Age-{age}',
                'filters': {'age': [age]}
            })

    # Gender - one at a time
    if not category or category in ['all_filters', 'gender']:
        for gender in survey.options_gender:
            combinations.append({
                'name': f'Gender-{gender}',
                'filters': {'gender': [gender]}
            })

    # Ethnicity - grouped by individual values
    if not category or category in ['all_filters', 'ethnicity']:
        # Ethnicity options are comma-separated (e.g., "White or Caucasian, Hispanic or Latino").
        # Create filters that include ALL options containing each individual ethnicity value.
        ethnicity_values = set()
        for ethnicity_option in survey.options_ethnicity:
            # Split by comma and strip whitespace
            values = [v.strip() for v in ethnicity_option.split(',')]
            ethnicity_values.update(values)

        for ethnicity_value in sorted(ethnicity_values):
            # Find all options that contain this value
            matching_options = [
                opt for opt in survey.options_ethnicity
                if ethnicity_value in [v.strip() for v in opt.split(',')]
            ]
            combinations.append({
                'name': f'Ethnicity-{ethnicity_value}',
                'filters': {'ethnicity': matching_options}
            })

    # Income - one at a time
    if not category or category in ['all_filters', 'income']:
        for income in survey.options_income:
            combinations.append({
                'name': f'Income-{income}',
                'filters': {'income': [income]}
            })

    # Consumer segments - combine _A and _B options, and also include standalone ones
    if not category or category in ['all_filters', 'consumer']:
        # Group options by base name (removing the _A/_B suffix)
        consumer_groups = {}
        for consumer in survey.options_consumer:
            # Check whether the option ends with _A or _B
            if consumer.endswith('_A') or consumer.endswith('_B'):
                base_name = consumer[:-2]  # Remove the last 2 chars (_A or _B)
                if base_name not in consumer_groups:
                    consumer_groups[base_name] = []
                consumer_groups[base_name].append(consumer)
            else:
                # Not an _A/_B option, keep as-is
                consumer_groups[consumer] = [consumer]

        # Add combined _A+_B options
        for base_name, options in consumer_groups.items():
            if len(options) > 1:  # Only combine if there are multiple (_A and _B)
                combinations.append({
                    'name': f'Consumer-{base_name}',
                    'filters': {'consumer': options}
                })

        # Add standalone options (including individual _A and _B)
        for consumer in survey.options_consumer:
            combinations.append({
                'name': f'Consumer-{consumer}',
                'filters': {'consumer': [consumer]}
            })

    # Business Owner - one at a time
    if not category or category in ['all_filters', 'business_owner']:
        for business_owner in survey.options_business_owner:
            combinations.append({
                'name': f'BusinessOwner-{business_owner}',
                'filters': {'business_owner': [business_owner]}
            })

    # AI User - one at a time
    if not category or category in ['all_filters', 'ai_user']:
        for ai_user in survey.options_ai_user:
            combinations.append({
                'name': f'AIUser-{ai_user}',
                'filters': {'ai_user': [ai_user]}
            })

        # Daily, more than once daily, and multiple times a week count as frequent
        combinations.append({
            'name': 'AIUser-Frequent',
            'filters': {'ai_user': [
                'Daily', 'More than once daily', 'Multiple times per week'
            ]}
        })
        combinations.append({
            'name': 'AIUser-RarelyNever',
            'filters': {'ai_user': [
                'Once a month', 'Less than once a month', 'Once a week', 'Rarely/Never'
            ]}
        })

    # Investable Assets - one at a time
    if not category or category in ['all_filters', 'investable_assets']:
        for investable_assets in survey.options_investable_assets:
            combinations.append({
                'name': f'Assets-{investable_assets}',
                'filters': {'investable_assets': [investable_assets]}
            })

    # Industry - one at a time
    if not category or category in ['all_filters', 'industry']:
        for industry in survey.options_industry:
            combinations.append({
                'name': f'Industry-{industry}',
                'filters': {'industry': [industry]}
            })

    # Voice ranking completeness filter.
    # These use a special flag rather than demographic filters, so we store
    # the mode in a dedicated key that run_report passes as --voice-ranking-filter.
    if not category or category in ['all_filters', 'voice_ranking']:
        combinations.append({
            'name': 'VoiceRanking-OnlyMissing',
            'filters': {},
            'voice_ranking_filter': 'only-missing',
        })
        combinations.append({
            'name': 'VoiceRanking-ExcludeMissing',
            'filters': {},
            'voice_ranking_filter': 'exclude-missing',
        })

    return combinations


def run_report(filters: dict, name: str = None, dry_run: bool = False, sl_threshold: int = None, voice_ranking_filter: str = None) -> bool:
    """
    Run the report script with the given filters.

    Args:
        filters: Dict of filter_name -> list of values
        name: Name for this filter combination (used for the .txt description file)
        dry_run: If True, just print the command without running it
        sl_threshold: If set, exclude respondents with >= N straight-lined question groups
        voice_ranking_filter: If set, filter by voice ranking completeness.
            'only-missing' keeps only respondents missing QID98 data,
            'exclude-missing' removes them.

    Returns:
        True if successful, False otherwise
    """
    cmd = [sys.executable, str(REPORT_SCRIPT)]

    # Add filter-name for the description file
    if name:
        cmd.extend(['--filter-name', name])

    # Pass the straight-liner threshold if specified
    if sl_threshold is not None:
        cmd.extend(['--sl-threshold', str(sl_threshold)])

    # Pass the voice ranking filter if specified
    if voice_ranking_filter is not None:
        cmd.extend(['--voice-ranking-filter', voice_ranking_filter])

    for filter_name, values in filters.items():
        if values:
            cmd.extend([f'--{filter_name}', json.dumps(values)])

    if dry_run:
        print(f"  Would run: {' '.join(cmd)}")
        return True

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            cwd=Path(__file__).parent
        )
        if result.returncode != 0:
            print(f"\n  ERROR: {result.stderr[:500]}")
            return False
        return True
    except Exception as e:
        print(f"\n  ERROR: {e}")
        return False


def main():
    import argparse
    parser = argparse.ArgumentParser(description='Run the quant report for all filter combinations')
    parser.add_argument('--dry-run', action='store_true', help='Preview combinations without running')
    parser.add_argument(
        '--category',
        choices=['all_filters', 'all', 'age', 'gender', 'ethnicity', 'income', 'consumer', 'business_owner', 'ai_user', 'investable_assets', 'industry', 'voice_ranking'],
        default='all_filters',
        help='Filter category to run combinations for (default: all_filters)'
    )
    parser.add_argument('--sl-threshold', type=int, default=None, help='Exclude respondents who straight-lined >= N question groups (passed to the report script)')
    args = parser.parse_args()

    # Load the survey to get the available filter options
    print("Loading survey to get filter options...")
    survey = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
    survey.load_data()  # Populates the options_* attributes

    # Generate combinations for the specified category
    combinations = get_filter_combinations(survey, category=args.category)
    category_desc = f" for category '{args.category}'" if args.category != 'all' else ''
    print(f"Generated {len(combinations)} filter combinations{category_desc}")

    if args.sl_threshold is not None:
        print(f"Straight-liner threshold: excluding respondents with ≥{args.sl_threshold} straight-lined question groups")

    if args.dry_run:
        print("\nDRY RUN - Commands that would be executed:")
        for combo in combinations:
            print(f"\n{combo['name']}:")
            run_report(combo['filters'], name=combo['name'], dry_run=True, sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter'))
        return

    # Run each combination with a progress bar
    successful = 0
    failed = []

    for combo in tqdm(combinations, desc="Running reports", unit="filter"):
        tqdm.write(f"Running: {combo['name']}")
        if run_report(combo['filters'], name=combo['name'], sl_threshold=args.sl_threshold, voice_ranking_filter=combo.get('voice_ranking_filter')):
            successful += 1
        else:
            failed.append(combo['name'])

    # Summary
    print(f"\n{'='*50}")
    print(f"Completed: {successful}/{len(combinations)} successful")
    if failed:
        print(f"Failed: {', '.join(failed)}")


if __name__ == '__main__':
    main()
992 speech_data_correlation.ipynb Normal file
File diff suppressed because one or more lines are too long
66 theme.py
@@ -19,11 +19,77 @@ class ColorPalette:
    # Neutral color for unhighlighted comparison items
    NEUTRAL = "#D3D3D3"  # Light Grey

    # Character-specific colors (for individual character plots).
    # Each character has a main color and a lighter highlight for original traits.
    CHARACTER_BANK_TELLER = "#004C6D"            # Dark Blue
    CHARACTER_BANK_TELLER_HIGHLIGHT = "#669BBC"  # Light Steel Blue

    CHARACTER_FAMILIAR_FRIEND = "#008493"            # Teal
    CHARACTER_FAMILIAR_FRIEND_HIGHLIGHT = "#A8DADC"  # Pale Cyan

    CHARACTER_COACH = "#5AAE95"            # Sea Green
    CHARACTER_COACH_HIGHLIGHT = "#A8DADC"  # Pale Cyan

    CHARACTER_PERSONAL_ASSISTANT = "#457B9D"            # Steel Blue
    CHARACTER_PERSONAL_ASSISTANT_HIGHLIGHT = "#669BBC"  # Light Steel Blue

    # General UI elements
    TEXT = "black"
    GRID = "lightgray"
    BACKGROUND = "white"

    # Statistical significance colors (for heatmaps/annotations)
    SIG_STRONG = "#004C6D"    # p < 0.001 - Dark Blue (highly significant)
    SIG_MODERATE = "#0077B6"  # p < 0.01 - Medium Blue (significant)
    SIG_WEAK = "#5AAE95"      # p < 0.05 - Sea Green (marginally significant)
    SIG_NONE = "#E8E8E8"      # p >= 0.05 - Light Grey (not significant)
    SIG_DIAGONAL = "#FFFFFF"  # White for the diagonal (self-comparison)

    # Extended palette for categorical charts (e.g., pie charts with many categories)
    CATEGORICAL = [
        "#0077B6",  # PRIMARY - Medium Blue
        "#004C6D",  # RANK_1 - Dark Blue
        "#008493",  # RANK_2 - Teal
        "#5AAE95",  # RANK_3 - Sea Green
        "#9E9E9E",  # RANK_4 - Grey
        "#D3D3D3",  # NEUTRAL - Light Grey
        "#003049",  # Dark Navy
        "#669BBC",  # Light Steel Blue
        "#A8DADC",  # Pale Cyan
        "#457B9D",  # Steel Blue
    ]

    # Gender-based colors (Male = blue tones, Female = pink tones)
    # Primary colors by gender
    GENDER_MALE = "#0077B6"    # Medium Blue (same as PRIMARY)
    GENDER_FEMALE = "#B6007A"  # Medium Pink

    # Ranking colors by gender (darkest -> lightest)
    GENDER_MALE_RANK_1 = "#004C6D"  # Dark Blue
    GENDER_MALE_RANK_2 = "#0077B6"  # Medium Blue
    GENDER_MALE_RANK_3 = "#669BBC"  # Light Steel Blue

    GENDER_FEMALE_RANK_1 = "#6D004C"  # Dark Pink
    GENDER_FEMALE_RANK_2 = "#B6007A"  # Medium Pink
    GENDER_FEMALE_RANK_3 = "#BC669B"  # Light Pink

    # Neutral colors by gender (for non-highlighted items)
    GENDER_MALE_NEUTRAL = "#B8C9D9"    # Grey-Blue
    GENDER_FEMALE_NEUTRAL = "#D9B8C9"  # Grey-Pink

    # Gender colors for correlation plots (green/red indicate +/- correlation).
    # Male = darker shade, Female = lighter shade.
    CORR_MALE_POSITIVE = "#1B5E20"    # Dark Green
    CORR_FEMALE_POSITIVE = "#81C784"  # Light Green
    CORR_MALE_NEGATIVE = "#B71C1C"    # Dark Red
    CORR_FEMALE_NEGATIVE = "#E57373"  # Light Red

    # Speaking Style colors (named after the style quadrant colors)
    STYLE_GREEN = "#2E7D32"   # Forest Green
    STYLE_BLUE = "#1565C0"    # Strong Blue
    STYLE_ORANGE = "#E07A00"  # Burnt Orange
    STYLE_RED = "#C62828"     # Deep Red


def jpmc_altair_theme():
    """JPMC brand theme for Altair charts."""
240 validation.py
@@ -1,13 +1,14 @@
 import marimo as mo
 import polars as pl

+import altair as alt
+from theme import ColorPalette

 def check_progress(data):
     """Check if all responses are complete based on the 'progress' column."""
     if data.collect().select(pl.col('progress').unique()).shape[0] == 1:
-        return """### Responses Complete: \n\n✅ All responses are complete (progress = 100) """
+        return """## Responses Complete: \n\n✅ All responses are complete (progress = 100) """

-    return "### Responses Complete: \n\n⚠️ There are incomplete responses (progress < 100) ⚠️"
+    return "## Responses Complete: \n\n⚠️ There are incomplete responses (progress < 100) ⚠️"


 def duration_validation(data):
@@ -30,9 +31,9 @@ def duration_validation(data):
     outlier_data = _d.filter(pl.col('outlier_duration') == True).collect()

     if outlier_data.shape[0] == 0:
-        return "### Duration Outliers: \n\n✅ No duration outliers detected"
+        return "## Duration Outliers: \n\n✅ No duration outliers detected"

-    return f"""### Duration Outliers:
+    return f"""## Duration Outliers:

 **⚠️ Potential outliers detected based on response duration ⚠️**

@@ -68,13 +69,25 @@ def check_straight_liners(data, max_score=3):
     schema_names = data.collect_schema().names()

     # regex groupings
-    pattern = re.compile(r"(.*__V\d+)__Choice_\d+")
+    pattern_choice = re.compile(r"(.*__V\d+)__Choice_\d+")
+    pattern_scale = re.compile(r"Voice_Scale_1_10__V\d+")

     groups = {}

     for col in schema_names:
-        match = pattern.search(col)
-        if match:
-            group_key = match.group(1)
+        # Check for the Choice pattern (SS_...__Vxx__Choice_y)
+        match_choice = pattern_choice.search(col)
+        if match_choice:
+            group_key = match_choice.group(1)
             if group_key not in groups:
                 groups[group_key] = []
             groups[group_key].append(col)
+            continue
+
+        # Check for the Voice Scale pattern (Voice_Scale_1_10__Vxx).
+        # All of these form a single group "Voice_Scale_1_10".
+        if pattern_scale.search(col):
+            group_key = "Voice_Scale_1_10"
+            if group_key not in groups:
+                groups[group_key] = []
+            groups[group_key].append(col)
@@ -85,6 +98,13 @@ def check_straight_liners(data, max_score=3):
     if not multi_attribute_groups:
         return "### Straight-lining Checks: \n\nℹ️ No multi-attribute question groups found."

+    # Cast all involved columns to Float64 (strict=False) to handle potential string columns
+    # and 1-10 scale floats (e.g. 5.5). Float64 covers integers as well.
+    all_group_cols = [col for cols in multi_attribute_groups.values() for col in cols]
+    data = data.with_columns([
+        pl.col(col).cast(pl.Float64, strict=False) for col in all_group_cols
+    ])
+
     # Build expressions
     expressions = []

@@ -108,8 +128,9 @@ def check_straight_liners(data, max_score=3):
         ).alias(f"__is_straight__{key}")

         value_expr = safe_val.alias(f"__val__{key}")
+        has_data = (list_expr.list.len() > 0).alias(f"__has_data__{key}")

-        expressions.extend([is_straight, value_expr])
+        expressions.extend([is_straight, value_expr, has_data])

     # collect data with checks
     # We only need _recordId and the check columns
@@ -120,33 +141,200 @@ def check_straight_liners(data, max_score=3):
|
||||
# Process results into a nice table
|
||||
outliers = []
|
||||
|
||||
for key in multi_attribute_groups.keys():
|
||||
for key, group_cols in multi_attribute_groups.items():
|
||||
flag_col = f"__is_straight__{key}"
|
||||
val_col = f"__val__{key}"
|
||||
|
||||
filtered = checked_data.filter(pl.col(flag_col))
|
||||
|
||||
if filtered.height > 0:
|
||||
rows = filtered.select(["_recordId", val_col]).rows()
|
||||
for row in rows:
|
||||
# Sort group_cols logic
|
||||
# If Choice columns, sort by choice number.
|
||||
# If Voice Scale columns (no Choice_), sort by Voice ID (Vxx)
|
||||
if all("__Choice_" in c for c in group_cols):
|
||||
key_func = lambda c: int(c.split('__Choice_')[-1])
|
||||
else:
|
||||
# Extract digits from Vxx
|
||||
def key_func(c):
|
||||
m = re.search(r"__V(\d+)", c)
|
||||
return int(m.group(1)) if m else 0
|
||||
|
||||
sorted_group_cols = sorted(group_cols, key=key_func)
|
||||
|
||||
# Select relevant columns: Record ID, Value, and the sorted group columns
|
||||
subset = filtered.select(["_recordId", val_col] + sorted_group_cols)
|
||||
|
||||
for row in subset.iter_rows(named=True):
|
||||
# Create ordered list of values, using 'NaN' for missing data
|
||||
resp_list = [row[c] if row[c] is not None else 'NaN' for c in sorted_group_cols]
|
||||
|
||||
outliers.append({
|
||||
"Record ID": row[0],
|
||||
"Record ID": row["_recordId"],
|
||||
"Question Group": key,
|
||||
"Value": row[1]
|
||||
"Value": row[val_col],
|
||||
"Responses": str(resp_list)
|
||||
})
|
||||
|
||||
if not outliers:
|
||||
return f"### Straight-lining Checks: \n\n✅ No straight-liners detected (value <= {max_score})"
|
||||
return f"### Straight-lining Checks: \n\n✅ No straight-liners detected (value <= {max_score})", None
|
||||
|
||||
outlier_df = pl.DataFrame(outliers)
|
||||
|
||||
return f"""### Straight-lining Checks:
|
||||
|
||||
**⚠️ Potential straight-liners detected ⚠️**
|
||||
|
||||
Respondents selected the same value (<= {max_score}) for all attributes in the following groups:
|
||||
|
||||
{mo.ui.table(outlier_df)}
|
||||
"""
|
||||
|
||||
    # --- Analysis & Visualization ---

    total_respondents = checked_data.height

    # 1. & 3. Percentage calculation
    group_stats = []
    value_dist_data = []

    # Calculate straight-liners for ALL groups found in the data.
    # Condition: the respondent straight-lined ALL questions they actually answered
    # (ignoring empty/skipped questions).
    # Logic: for every group G, if G has data (len > 0), then G must be straight.
    # Also, the respondent must have answered at least one question group.

    conditions = []
    has_any_data_exprs = []

    for key in multi_attribute_groups.keys():
        flag_col = f"__is_straight__{key}"
        data_col = f"__has_data__{key}"

        # If has_data is True, is_straight MUST be True for this group to count as
        # straight-lining behavior for that user. Equivalent: (not has_data) OR is_straight
        cond = (~pl.col(data_col)) | pl.col(flag_col)
        conditions.append(cond)
        has_any_data_exprs.append(pl.col(data_col))

    all_straight_count = checked_data.filter(
        pl.all_horizontal(conditions) & pl.any_horizontal(has_any_data_exprs)
    ).height
    all_straight_pct = (all_straight_count / total_respondents) * 100

    for key in multi_attribute_groups.keys():
        flag_col = f"__is_straight__{key}"
        val_col = f"__val__{key}"

        # Filter for straight-liners in this specific group
        sl_sub = checked_data.filter(pl.col(flag_col))
        count = sl_sub.height
        pct = (count / total_respondents) * 100

        group_stats.append({
            "Question Group": key,
            "Straight-Liner %": pct,
            "Count": count
        })

        # Get the value distribution for this group's straight-liners
        if count > 0:
            # Group by the value they straight-lined
            dist = sl_sub.group_by(val_col).agg(pl.len().alias("count"))
            for row in dist.iter_rows(named=True):
                value_dist_data.append({
                    "Question Group": key,
                    "Value": row[val_col],
                    "Count": row["count"]
                })

    stats_df = pl.DataFrame(group_stats)
    dist_df = pl.DataFrame(value_dist_data)

    # Plot 1: % of responses with straight-liners per question.
    # Vertical bars with the count labelled on top.
    base_pct = alt.Chart(stats_df).encode(
        x=alt.X("Question Group", sort=alt.EncodingSortField(field="Straight-Liner %", order="descending"))
    )

    bars_pct = base_pct.mark_bar(color=ColorPalette.PRIMARY).encode(
        y=alt.Y("Straight-Liner %:Q", axis=alt.Axis(format=".1f", title="Share of all responses [%]")),
        tooltip=["Question Group", alt.Tooltip("Straight-Liner %:Q", format=".1f"), "Count"]
    )

    text_pct = base_pct.mark_text(dy=-10).encode(
        y=alt.Y("Straight-Liner %:Q"),
        text=alt.Text("Count")
    )

    chart_pct = (bars_pct + text_pct).properties(
        title="Share of Responses with Straight-Liners per Question",
        width=800,
        height=300
    )

    # Plot 2: value distribution (horizontal stacked normalized bars).
    # Question groups sorted by total count; values stacked 1 (left) -> 5 (right);
    # legend on top; total count at the end of each bar.

    # Sort order for the Y axis (Question Group) based on total count, descending.
    # Explicitly calculate the sort order from stats_df to ensure consistency across layers.
    # High counts at the top.
    sorted_groups = stats_df.sort("Count", descending=True)["Question Group"].to_list()

    # Base chart for the bars.
    # Use JPMC-aligned colors (blues) instead of the default categorical rainbow,
    # and remove the legend title as per the plots.py style.
    bars_dist = alt.Chart(dist_df).mark_bar().encode(
        y=alt.Y("Question Group", sort=sorted_groups),
        x=alt.X("Count", stack="normalize", axis=alt.Axis(format="%"), title="Share of SL Responses"),
        color=alt.Color("Value:O",
                        title=None,  # explicit removal of the title, as in plots.py
                        scale=alt.Scale(scheme="blues"),  # professional blue scale
                        legend=alt.Legend(orient="top", direction="horizontal")
                        ),
        order=alt.Order("Value", sort="ascending"),  # ensures 1 is left, 5 is right
        tooltip=["Question Group", "Value", "Count"]
    )

    # Text layer for the total count (using stats_df, which already has the totals),
    # with the same sort for Y.
    text_dist = alt.Chart(stats_df).mark_text(align='left', dx=5).encode(
        y=alt.Y("Question Group", sort=sorted_groups),
        x=alt.datum(1.0),  # position at 100%
        text=alt.Text("Count")
    )

    chart_dist = (bars_dist + text_dist).properties(
        title="Distribution of Straight-Lined Values",
        width=800,
        height=500
    )

    analysis_md = f"""
### Straight-Lining Analysis

*"Straight-lining" is defined here as selecting the same response value for all attributes within a multi-attribute question group.*

* **Total Respondents**: {total_respondents}
* **Respondents straight-lining ALL questions presented to them**: {all_straight_pct:.2f}% ({all_straight_count} respondents)
"""

    return (mo.vstack([
        mo.md("**⚠️ Potential straight-liners detected ⚠️**\n\n"),
        mo.ui.table(outlier_df),
        mo.md(analysis_md),
        alt.vconcat(chart_pct, chart_dist).resolve_legend(color="independent")
    ]), outlier_df)

if __name__ == "__main__":

    from utils import QualtricsSurvey

    RESULTS_FILE = "data/exports/OneDrive_2026-01-28/1-28-26 Afternoon/JPMC_Chase Brand Personality_Quant Round 1_January 28, 2026_Afternoon_Labels.csv"
    QSF_FILE = "data/exports/OneDrive_2026-01-21/Soft Launch Data/JPMC_Chase_Brand_Personality_Quant_Round_1.qsf"

    S = QualtricsSurvey(RESULTS_FILE, QSF_FILE)
    data = S.load_data()

    # print("Checking Green Blue:")
    # print(check_straight_liners(S.get_ss_green_blue(data)[0]))
    # print("Checking Orange Red:")
    # print(check_straight_liners(S.get_ss_orange_red(data)[0]))

    print("Checking Voice Scale 1-10:")
    print(check_straight_liners(S.get_voice_scale_1_10(data)[0]))
18
wordclouds.py
Normal file
@@ -0,0 +1,18 @@
"""Word cloud utilities for Voice Branding analysis.

The main wordcloud function is available as a method on QualtricsSurvey:
    S.plot_traits_wordcloud(data, column='Top_3_Traits', title='...')

This module provides standalone imports for backwards compatibility.
"""
import numpy as np
from os import path
from PIL import Image, ImageDraw
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")