diff --git a/docs/altair-migration-plan.md b/docs/altair-migration-plan.md new file mode 100644 index 0000000..3c87863 --- /dev/null +++ b/docs/altair-migration-plan.md @@ -0,0 +1,1307 @@ +# Altair Migration Plan: Plotly → Altair for JPMCPlotsMixin + +**Date:** January 28, 2026 +**Status:** Not Started +**Objective:** Migrate all plotting methods from Plotly to Altair to solve filter annotation overlap issues and ensure proper Marimo reactivity. + +--- + +## Background + +### Problem +Current Plotly implementation has a critical layout issue: filter annotations overlap with long rotated x-axis labels because Plotly doesn't support true bounding boxes. Elements overflow their assigned subplot areas. + +### Why Altair? +1. **Better layout control** - Vega-Lite (Altair's backend) properly calculates space for all elements +2. **Marimo reactivity** - Marimo documentation states reactive plots require Altair or Plotly; Altair is preferred +3. **Clean separation** - `vconcat()` creates true vertical stacking without overflow +4. 
**Already installed** - Altair >=6.0.0 is already a dependency (unused) + +--- + +## Current System Analysis + +### File Structure +- **`plots.py`** - Contains `JPMCPlotsMixin` class with 10 plotting methods +- **`theme.py`** - Contains `ColorPalette` class with all styling constants +- **`utils.py`** - Contains `JPMCSurvey` class that mixes in `JPMCPlotsMixin` + +### Color Palette (from theme.py) +```python +class ColorPalette: + PRIMARY = "#0077B6" # Medium Blue + RANK_1 = "#004C6D" # Dark Blue + RANK_2 = "#008493" # Teal + RANK_3 = "#5AAE95" # Sea Green + RANK_4 = "#9E9E9E" # Grey + NEUTRAL = "#D3D3D3" # Light Grey + TEXT = "black" + GRID = "lightgray" + BACKGROUND = "white" +``` + +### Current Plot Methods Inventory + +| Method Name | Chart Type | Input Format | Special Features | +|-------------|-----------|--------------|------------------| +| `plot_average_scores_with_counts` | Vertical Bar | Wide DF (score columns) | Text inside bars (count) | +| `plot_top3_ranking_distribution` | Stacked Vertical Bar | Wide DF (rank values 1-3) | 3-layer stack, legend | +| `plot_ranking_distribution` | Stacked Vertical Bar | Wide DF (rank values 1-4) | 4-layer stack, legend | +| `plot_most_ranked_1` | Vertical Bar | Wide DF (ranking columns) | Top 3 highlighted | +| `plot_weighted_ranking_score` | Vertical Bar | `Character`, `Weighted Score` | Text inside bars | +| `plot_voice_selection_counts` | Vertical Bar | Comma-separated string col | Explode strings, Top 8 highlight | +| `plot_top3_selection_counts` | Vertical Bar | Comma-separated string col | Explode strings, Top 3 highlight | +| `plot_speaking_style_trait_scores` | Horizontal Bar | `Voice`, `score`, anchors | Text annotations at bottom | +| `plot_speaking_style_correlation` | Vertical Bar | Correlation data | Red/Green conditional | +| `plot_speaking_style_ranking_correlation` | Vertical Bar | Correlation data | Red/Green conditional | + +### Filter System Components + +1. 
**`_get_filter_slug()`** - Generates directory name from filters (e.g., `Age-22to24_Gen-Man`) +2. **`_get_filter_description()`** - Generates HTML text (e.g., `Filters: Age: 22-24<br>
Gender: Man`) +3. **`_add_filter_footnote(fig)`** - Currently creates 2-row Plotly subplot, adds annotation +4. **`_save_plot(fig, title)`** - Adds footer, saves to `figures/{slug}/{filename}.png` + +### Common Styling Pattern +All plots use: +- Height: 500px (default, can override) +- Width: 1000px (default, can override) +- Background: white +- Grid: light gray +- Font size: 11 +- X-axis: 45° rotated labels +- Legends (where applicable): Horizontal, positioned above plot + +--- + +## Prerequisites + +### Dependencies to Add +```bash +uv add vl-convert-python # For PNG export from Altair +``` + +### Dependencies Already Present +- `altair>=6.0.0` ✅ +- `polars>=1.37.1` ✅ +- `pandas>=2.3.3` ✅ + +--- + +## Migration Tasks + +### TASK 1: Create Altair Theme in theme.py + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/theme.py` + +**Action:** Add an Altair theme function and register it at the end of the file. + +**Code to add:** +```python +def jpmc_altair_theme(): + """JPMC brand theme for Altair charts.""" + return { + 'config': { + 'view': { + 'continuousWidth': 1000, + 'continuousHeight': 500, + 'strokeWidth': 0 + }, + 'background': ColorPalette.BACKGROUND, + 'axis': { + 'grid': True, + 'gridColor': ColorPalette.GRID, + 'labelAngle': -45, # Default rotated labels + 'labelFontSize': 11, + 'titleFontSize': 12, + 'labelColor': ColorPalette.TEXT, + 'titleColor': ColorPalette.TEXT + }, + 'axisX': { + 'labelAngle': -45 + }, + 'axisY': { + 'labelAngle': 0 + }, + 'legend': { + 'orient': 'top', + 'direction': 'horizontal', + 'titleFontSize': 11, + 'labelFontSize': 11 + }, + 'title': { + 'fontSize': 14, + 'color': ColorPalette.TEXT, + 'anchor': 'start' + }, + 'bar': { + 'color': ColorPalette.PRIMARY + } + } + } + +# Register theme (add at end of file) +try: + import altair as alt + alt.themes.register('jpmc', jpmc_altair_theme) + alt.themes.enable('jpmc') +except ImportError: + pass # Altair not installed +``` + +**Verification:** +- [ ] Function 
`jpmc_altair_theme()` exists +- [ ] Theme is registered as 'jpmc' +- [ ] Theme is enabled by default +- [ ] Import error is handled gracefully + +--- + +### TASK 2: Update imports in plots.py + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (lines 1-8) + +**Action:** Replace Plotly imports with Altair imports. + +**Current code:** +```python +import plotly.graph_objects as go +from plotly.subplots import make_subplots +``` + +**Replace with:** +```python +import altair as alt +``` + +**Keep these imports:** +```python +import re +from pathlib import Path +import polars as pl +from theme import ColorPalette +``` + +**Verification:** +- [ ] `import altair as alt` present +- [ ] No Plotly imports remain +- [ ] All other imports unchanged + +--- + +### TASK 3: Rewrite `_add_filter_footnote` for Altair + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~120-212) + +**Action:** Replace entire `_add_filter_footnote` method with Altair version. + +**New implementation:** +```python +def _add_filter_footnote(self, chart: alt.Chart) -> alt.Chart: + """Add a footnote with active filters to the chart. + + Creates a vconcat with main chart on top and filter text chart below. + Returns the combined chart (or original if no filters). + """ + filter_text = self._get_filter_description() + + # Skip if no filters active - return original chart + if not filter_text: + return chart + + # Replace <br> with newlines first; stripping all tags first would delete the <br> markers + plain_text = filter_text.replace('<br>', '\n') + # Remove remaining HTML tags for plain text (Altair doesn't support HTML in mark_text) + plain_text = re.sub(r'<[^>]+>', '', plain_text) + + # Create a text-only chart for the footer + # Use a dummy dataframe with one row + import pandas as pd + footer_df = pd.DataFrame([{'text': plain_text, 'x': 0, 'y': 0}]) + + footer_chart = alt.Chart(footer_df).mark_text( + align='left', + baseline='top', + fontSize=9, + color='gray', + lineBreak='\n', # Break the footer text at the newlines inserted above + dx=5, # Small left padding + dy=5 # Small top padding + ).encode( + text='text:N' + ).properties( + height=60, # Fixed height for footer + width=chart.width if hasattr(chart, 'width') and chart.width else 1000 + ) + + # Combine with vconcat + combined = alt.vconcat(chart, footer_chart, spacing=10) + + return combined +``` + +**Verification:** +- [ ] Method signature changed from `fig: go.Figure` to `chart: alt.Chart` +- [ ] Returns `alt.Chart` instead of `go.Figure` +- [ ] Uses `vconcat` for vertical stacking +- [ ] `<br>` is converted to a newline before the remaining HTML tags are stripped +- [ ] Footer has fixed height +- [ ] Spacing between chart and footer is set + +--- + +### TASK 4: Rewrite `_save_plot` for Altair + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~214-234) + +**Action:** Replace `_save_plot` method for Altair chart saving. + +**New implementation:** +```python +def _save_plot(self, chart: alt.Chart, title: str) -> alt.Chart: + """Save chart to PNG file if fig_save_dir is set. + + Returns the (potentially modified) chart with filter footnote added. 
+ """ + # Add filter footnote - returns combined chart if filters active + chart = self._add_filter_footnote(chart) + + if hasattr(self, 'fig_save_dir') and self.fig_save_dir: + path = Path(self.fig_save_dir) + + # Add filter slug subfolder + filter_slug = self._get_filter_slug() + path = path / filter_slug + + if not path.exists(): + path.mkdir(parents=True, exist_ok=True) + + filename = f"{self._sanitize_filename(title)}.png" + + # Save using vl-convert backend + chart.save(str(path / filename), format='png', scale_factor=2.0) + + return chart +``` + +**Verification:** +- [ ] Method signature changed from `fig: go.Figure` to `chart: alt.Chart` +- [ ] Uses `chart.save()` instead of `fig.write_image()` +- [ ] PNG format specified +- [ ] Path handling unchanged (filter slug subdirectories) +- [ ] Returns modified chart + +--- + +### TASK 5: Migrate `plot_average_scores_with_counts` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~248-313) + +**Action:** Rewrite using Altair bar chart with text overlay. 
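Before anything is drawn, this task reduces the wide score columns to one record per voice (label, average, non-null count). The reduction can be sketched with the standard library alone, which may help when checking the Polars version by hand. The helper name `column_stats` and the sample column names are hypothetical and only serve the illustration.

```python
from statistics import fmean

def column_stats(wide: dict) -> list:
    """Average and non-null count per score column, sorted by average descending."""
    stats = []
    for col, values in wide.items():
        if col == "_recordId":
            continue
        # Nulls are excluded from both the mean and the count
        present = [v for v in values if v is not None]
        if not present:
            continue  # Assumption: skip columns nobody rated instead of plotting a null bar
        # Keep only the part after the last '__' as the display label
        label = col.split("__")[-1] if "__" in col else col
        stats.append({"voice": label, "average": fmean(present), "count": len(present)})
    return sorted(stats, key=lambda row: row["average"], reverse=True)

# Hypothetical sample data: one list entry per respondent
scores = {
    "_recordId": ["r1", "r2", "r3"],
    "Voice_Scale_1_10__V14": [8, None, 6],
    "Voice_Scale_1_10__V04": [9, 10, None],
}
stats = column_stats(scores)
```

Comparing a few rows of this output against the chart tooltips is a cheap way to confirm that averages and counts survived the migration.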
+ +**Current behavior:** +- Input: Wide DataFrame with score columns +- Output: Vertical bar chart with average scores, count text inside bars + +**New implementation:** +```python +def plot_average_scores_with_counts( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str = "General Impression (1-10)\nPer Voice with Number of Participants Who Rated It", + x_label: str = "Stimuli", + y_label: str = "Average General Impression Rating (1-10)", + color: str = ColorPalette.PRIMARY, + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a bar plot showing average scores and count of non-null values for each column.""" + df = self._ensure_dataframe(data) + + # Calculate stats for each column (exclude _recordId) + stats = [] + for col in [c for c in df.columns if c != '_recordId']: + avg_score = df[col].mean() + non_null_count = df[col].drop_nulls().len() + # Extract voice ID from column name + label = col.split('__')[-1] if '__' in col else col + stats.append({ + 'voice': label, + 'average': avg_score, + 'count': non_null_count + }) + + # Convert to pandas for Altair (sort by average descending) + stats_df = pl.DataFrame(stats).sort('average', descending=True).to_pandas() + + # Base bar chart + bars = alt.Chart(stats_df).mark_bar(color=color).encode( + x=alt.X('voice:N', title=x_label, sort='-y'), + y=alt.Y('average:Q', title=y_label, scale=alt.Scale(domain=[0, 10])), + tooltip=[ + alt.Tooltip('voice:N', title='Voice'), + alt.Tooltip('average:Q', title='Average', format='.2f'), + alt.Tooltip('count:Q', title='Count') + ] + ) + + # Text overlay for counts + text = alt.Chart(stats_df).mark_text( + dy=-5, # Slight offset above bar + color='black', + fontSize=10 + ).encode( + x=alt.X('voice:N', sort='-y'), + y=alt.Y('average:Q'), + text=alt.Text('count:Q') + ) + + # Combine layers + chart = (bars + text).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 
'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` instead of `go.Figure` +- [ ] Data transformed to long format (pandas DataFrame) +- [ ] Bar chart created with `mark_bar()` +- [ ] Text overlay added with `mark_text()` +- [ ] Layers combined with `+` operator +- [ ] Sorting preserved (by average descending) +- [ ] Y-axis scale set to [0, 10] +- [ ] Tooltip includes voice, average, count +- [ ] Width/height properties set +- [ ] `_save_plot` called at end + +--- + +### TASK 6: Migrate `plot_most_ranked_1` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~537-613) + +**Action:** Rewrite using Altair with conditional coloring (top 3 highlighted). + +**Current behavior:** +- Input: Wide DataFrame with ranking columns +- Output: Vertical bar chart, top 3 bars in PRIMARY color, rest in NEUTRAL + +**New implementation:** +```python +def plot_most_ranked_1( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str = "Most Popular Choice\n(Number of Times Ranked 1st)", + x_label: str = "Item", + y_label: str = "Count of 1st Place Rankings", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a bar chart showing which item was ranked #1 the most. 
Top 3 highlighted.""" + df = self._ensure_dataframe(data) + + stats = [] + ranking_cols = [c for c in df.columns if c != '_recordId'] + + for col in ranking_cols: + count_rank_1 = df.filter(pl.col(col) == 1).height + # Clean label + label = col.replace('Character_Ranking_', '').replace('Top_3_Voices_ranking__', '').replace('_', ' ').strip() + stats.append({'item': label, 'count': count_rank_1}) + + # Convert and sort + stats_df = pl.DataFrame(stats).sort('count', descending=True) + + # Add rank column for coloring (1-3 vs 4+) + stats_df = stats_df.with_row_index('rank_index') + stats_df = stats_df.with_columns( + pl.when(pl.col('rank_index') < 3) + .then(pl.lit('Top 3')) + .otherwise(pl.lit('Other')) + .alias('category') + ).to_pandas() + + # Bar chart with conditional color + chart = alt.Chart(stats_df).mark_bar().encode( + x=alt.X('item:N', title=x_label, sort='-y'), + y=alt.Y('count:Q', title=y_label), + color=alt.Color('category:N', + scale=alt.Scale(domain=['Top 3', 'Other'], + range=[ColorPalette.PRIMARY, ColorPalette.NEUTRAL]), + legend=None), + tooltip=[ + alt.Tooltip('item:N', title='Item'), + alt.Tooltip('count:Q', title='1st Place Votes') + ] + ).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Counts rank 1 occurrences per column +- [ ] Adds `category` column for top 3 vs others +- [ ] Uses conditional color via `alt.Color()` with custom scale +- [ ] Tooltip shows item and count +- [ ] Sorted by count descending +- [ ] Legend hidden (color is self-explanatory) + +--- + +### TASK 7: Migrate `plot_weighted_ranking_score` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~615-662) + +**Action:** Rewrite simple bar chart with text overlay. 
+ +**Current behavior:** +- Input: DataFrame with `Character` and `Weighted Score` columns +- Output: Vertical bar chart with score text inside bars + +**New implementation:** +```python +def plot_weighted_ranking_score( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str = "Weighted Popularity Score\n(1st=3pts, 2nd=2pts, 3rd=1pt)", + x_label: str = "Character Personality", + y_label: str = "Total Weighted Score", + color: str = ColorPalette.PRIMARY, + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a bar chart showing the weighted ranking score for each character.""" + weighted_df = self._ensure_dataframe(data).to_pandas() + + # Bar chart + bars = alt.Chart(weighted_df).mark_bar(color=color).encode( + x=alt.X('Character:N', title=x_label), + y=alt.Y('Weighted Score:Q', title=y_label), + tooltip=[ + alt.Tooltip('Character:N'), + alt.Tooltip('Weighted Score:Q', title='Score') + ] + ) + + # Text overlay: top baseline plus a positive dy keeps the white label inside the bar + # (a negative dy would put white text above the bar, on the white background) + text = bars.mark_text( + baseline='top', + dy=5, + color='white', + fontSize=11 + ).encode( + text='Weighted Score:Q' + ) + + chart = (bars + text).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Uses input columns as-is (`Character`, `Weighted Score`) +- [ ] Text overlay with white color inside bars +- [ ] Tooltip shows character and score + +--- + +### TASK 8: Migrate `plot_top3_ranking_distribution` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~315-415) + +**Action:** Rewrite stacked bar chart (3 ranks). 
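A stacked bar in Altair expects long-format input, one row per voice and rank, whereas the survey data is wide (one column per voice). As a rough illustration of that reshaping, here is a stdlib-only sketch; `wide_to_long_rank_counts` and the sample column names are hypothetical, and the rank labels are taken from this plan.

```python
from collections import Counter

RANK_LABELS = {
    1: "Rank 1 (1st Choice)",
    2: "Rank 2 (2nd Choice)",
    3: "Rank 3 (3rd Choice)",
}

def wide_to_long_rank_counts(wide: dict) -> list:
    """Turn {column: [rank or None per respondent]} into one row per voice-rank pair."""
    rows = []
    for col, values in wide.items():
        if col == "_recordId":
            continue
        # Count only the rank values that have a label; None and other values are ignored
        counts = Counter(v for v in values if v in RANK_LABELS)
        total = sum(counts.values())
        if total == 0:
            continue  # Voices that were never ranked are dropped
        label = col.split("__")[-1] if "__" in col else col
        for rank, rank_label in RANK_LABELS.items():
            rows.append({"voice": label, "rank": rank_label, "count": counts[rank], "total": total})
    return rows

# Hypothetical sample data: one list entry per respondent
sample = {
    "_recordId": ["r1", "r2", "r3"],
    "Top_3_Voices_ranking__V14": [1, 1, 3],
    "Top_3_Voices_ranking__V04": [2, None, 1],
    "Top_3_Voices_ranking__V08": [None, None, None],
}
long_rows = wide_to_long_rank_counts(sample)
```

Every row of a voice carries the same `total`, which is what lets the x-axis sort on that field without a separate aggregation step.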
+ +**Current behavior:** +- Input: Wide DataFrame with ranking columns (values 1, 2, 3) +- Output: Stacked bar chart with 3 layers (Rank 1, 2, 3), horizontal legend + +**New implementation:** +```python +def plot_top3_ranking_distribution( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str = "Top 3 Rankings Distribution\nCount of 1st, 2nd, and 3rd Place Votes per Voice", + x_label: str = "Voices", + y_label: str = "Number of Mentions in Top 3", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a stacked bar chart showing how often each voice was ranked 1st, 2nd, or 3rd.""" + df = self._ensure_dataframe(data) + + # Calculate stats per column + stats = [] + for col in [c for c in df.columns if c != '_recordId']: + rank1 = df.filter(pl.col(col) == 1).height + rank2 = df.filter(pl.col(col) == 2).height + rank3 = df.filter(pl.col(col) == 3).height + total = rank1 + rank2 + rank3 + + if total > 0: + label = col.split('__')[-1] if '__' in col else col + # Add 3 rows (one per rank) + stats.append({'voice': label, 'rank': 'Rank 1 (1st Choice)', 'count': rank1, 'total': total}) + stats.append({'voice': label, 'rank': 'Rank 2 (2nd Choice)', 'count': rank2, 'total': total}) + stats.append({'voice': label, 'rank': 'Rank 3 (3rd Choice)', 'count': rank3, 'total': total}) + + # Convert to long format, sort by total + stats_df = pl.DataFrame(stats).to_pandas() + + # Create stacked bar chart + chart = alt.Chart(stats_df).mark_bar().encode( + x=alt.X('voice:N', title=x_label, sort=alt.EncodingSortField(field='total', op='sum', order='descending')), + y=alt.Y('count:Q', title=y_label, stack='zero'), + color=alt.Color('rank:N', + scale=alt.Scale(domain=['Rank 1 (1st Choice)', 'Rank 2 (2nd Choice)', 'Rank 3 (3rd Choice)'], + range=[ColorPalette.RANK_1, ColorPalette.RANK_2, ColorPalette.RANK_3]), + legend=alt.Legend(orient='top', direction='horizontal', title=None)), + tooltip=[ + alt.Tooltip('voice:N', title='Voice'), + 
alt.Tooltip('rank:N', title='Rank'), + alt.Tooltip('count:Q', title='Count') + ] + ).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Data converted to long format (one row per voice-rank combo) +- [ ] Stacked with `stack='zero'` in y encoding +- [ ] Custom color scale for 3 ranks +- [ ] Sorted by total (sum of all ranks per voice) +- [ ] Horizontal legend at top +- [ ] Tooltip shows voice, rank, count + +--- + +### TASK 9: Migrate `plot_ranking_distribution` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~417-536) + +**Action:** Rewrite stacked bar chart (4 ranks) - very similar to Task 8. + +**Current behavior:** +- Input: Wide DataFrame with ranking columns (values 1, 2, 3, 4) +- Output: Stacked bar chart with 4 layers + +**New implementation:** +```python +def plot_ranking_distribution( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str = "Rankings Distribution\n(1st to 4th Place)", + x_label: str = "Item", + y_label: str = "Number of Votes", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a stacked bar chart showing the distribution of rankings (1st to 4th).""" + df = self._ensure_dataframe(data) + + stats = [] + ranking_cols = [c for c in df.columns if c != '_recordId'] + + for col in ranking_cols: + r1 = df.filter(pl.col(col) == 1).height + r2 = df.filter(pl.col(col) == 2).height + r3 = df.filter(pl.col(col) == 3).height + r4 = df.filter(pl.col(col) == 4).height + total = r1 + r2 + r3 + r4 + + if total > 0: + label = col.replace('Character_Ranking_', '').replace('Top_3_Voices_ranking__', '').replace('_', ' ').strip() + stats.append({'item': label, 'rank': 'Rank 1 (Best)', 'count': r1, 'rank1': r1}) + stats.append({'item': label, 'rank': 'Rank 2', 
'count': r2, 'rank1': r1}) + stats.append({'item': label, 'rank': 'Rank 3', 'count': r3, 'rank1': r1}) + stats.append({'item': label, 'rank': 'Rank 4 (Worst)', 'count': r4, 'rank1': r1}) + + if not stats: + return alt.Chart().mark_text(text="No data") + + stats_df = pl.DataFrame(stats).to_pandas() + + chart = alt.Chart(stats_df).mark_bar().encode( + x=alt.X('item:N', title=x_label, sort=alt.EncodingSortField(field='rank1', order='descending')), + y=alt.Y('count:Q', title=y_label, stack='zero'), + color=alt.Color('rank:N', + scale=alt.Scale(domain=['Rank 1 (Best)', 'Rank 2', 'Rank 3', 'Rank 4 (Worst)'], + range=[ColorPalette.RANK_1, ColorPalette.RANK_2, ColorPalette.RANK_3, ColorPalette.RANK_4]), + legend=alt.Legend(orient='top', direction='horizontal', title=None)), + tooltip=[ + alt.Tooltip('item:N', title='Item'), + alt.Tooltip('rank:N', title='Rank'), + alt.Tooltip('count:Q', title='Count') + ] + ).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] 4 ranks supported +- [ ] Sorted by Rank 1 count (added `rank1` field for sorting) +- [ ] Custom color scale for 4 ranks +- [ ] Empty data handled (returns text mark) + +--- + +### TASK 10: Migrate `plot_voice_selection_counts` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~664-737) + +**Action:** Rewrite with Polars data transformation + conditional coloring. 
+ +**Current behavior:** +- Input: DataFrame with comma-separated string column (`8_Combined`) +- Process: Split strings, explode, count occurrences +- Output: Bar chart, top 8 bars in PRIMARY, rest in NEUTRAL + +**New implementation:** +```python +def plot_voice_selection_counts( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + target_column: str = "8_Combined", + title: str = "Most Frequently Chosen Voices\n(Top 8 Highlighted)", + x_label: str = "Voice", + y_label: str = "Number of Times Chosen", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Create a bar plot showing the frequency of voice selections.""" + df = self._ensure_dataframe(data) + + if target_column not in df.columns: + return alt.Chart().mark_text(text=f"Column '{target_column}' not found") + + # Process data: split, explode, count + stats_df = ( + df.select(pl.col(target_column)) + .drop_nulls() + .with_columns(pl.col(target_column).str.split(",")) + .explode(target_column) + .with_columns(pl.col(target_column).str.strip_chars()) + .filter(pl.col(target_column) != "") + .group_by(target_column) + .agg(pl.len().alias("count")) + .sort("count", descending=True) + .with_row_index('rank_index') + .with_columns( + pl.when(pl.col('rank_index') < 8) + .then(pl.lit('Top 8')) + .otherwise(pl.lit('Other')) + .alias('category') + ) + .to_pandas() + ) + + chart = alt.Chart(stats_df).mark_bar().encode( + x=alt.X(f'{target_column}:N', title=x_label, sort='-y'), + y=alt.Y('count:Q', title=y_label), + color=alt.Color('category:N', + scale=alt.Scale(domain=['Top 8', 'Other'], + range=[ColorPalette.PRIMARY, ColorPalette.NEUTRAL]), + legend=None), + tooltip=[ + alt.Tooltip(f'{target_column}:N', title='Voice'), + alt.Tooltip('count:Q', title='Selections') + ] + ).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + 
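The split → explode → strip → count chain in this task can be cross-checked against a plain-Python equivalent. The sketch below is only an illustration under assumptions: `selection_counts` is a hypothetical helper, the input is a list of raw cell strings rather than a DataFrame, and the output key `voice` stands in for the target column.

```python
from collections import Counter

def selection_counts(cells, top_n=8):
    """Count comma-separated selections and tag the top_n most frequent."""
    counts = Counter(
        name.strip()
        for cell in cells
        if cell is not None           # mirrors drop_nulls()
        for name in cell.split(",")   # mirrors str.split(",") followed by explode
        if name.strip()               # mirrors the filter on empty strings
    )
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return [
        {"voice": name, "count": n, "category": f"Top {top_n}" if i < top_n else "Other"}
        for i, (name, n) in enumerate(ranked)
    ]

# Hypothetical cells, including a null, stray spaces, and a trailing comma
rows = selection_counts(["V1, V2,V3", None, "V2", "V3 ,V2, "], top_n=2)
```

One difference worth keeping in mind: Polars' `group_by` does not guarantee row order by default, so voices tied at the cut-off may land on either side of the highlight between runs. Adding the voice name as a secondary sort key would make the categorization deterministic.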
+**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Polars chain: split → explode → strip → group → count +- [ ] Top 8 categorization logic correct +- [ ] Conditional coloring applied +- [ ] Sorted by count descending + +--- + +### TASK 11: Migrate `plot_top3_selection_counts` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~739-808) + +**Action:** Identical to Task 10, but default column is `3_Ranked` and top 3 highlighted. + +**New implementation:** +```python +def plot_top3_selection_counts( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + target_column: str = "3_Ranked", + title: str = "Most Frequently Chosen Top 3 Voices\n(Top 3 Highlighted)", + x_label: str = "Voice", + y_label: str = "Count of Mentions in Top 3", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Question: Which 3 voices are chosen the most out of 18?""" + df = self._ensure_dataframe(data) + + if target_column not in df.columns: + return alt.Chart().mark_text(text=f"Column '{target_column}' not found") + + stats_df = ( + df.select(pl.col(target_column)) + .drop_nulls() + .with_columns(pl.col(target_column).str.split(",")) + .explode(target_column) + .with_columns(pl.col(target_column).str.strip_chars()) + .filter(pl.col(target_column) != "") + .group_by(target_column) + .agg(pl.len().alias("count")) + .sort("count", descending=True) + .with_row_index('rank_index') + .with_columns( + pl.when(pl.col('rank_index') < 3) + .then(pl.lit('Top 3')) + .otherwise(pl.lit('Other')) + .alias('category') + ) + .to_pandas() + ) + + chart = alt.Chart(stats_df).mark_bar().encode( + x=alt.X(f'{target_column}:N', title=x_label, sort='-y'), + y=alt.Y('count:Q', title=y_label), + color=alt.Color('category:N', + scale=alt.Scale(domain=['Top 3', 'Other'], + range=[ColorPalette.PRIMARY, ColorPalette.NEUTRAL]), + legend=None), + tooltip=[ + alt.Tooltip(f'{target_column}:N', title='Voice'), + alt.Tooltip('count:Q', title='In Top 
3') + ] + ).properties( + title=title, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Default `target_column` is `"3_Ranked"` +- [ ] Top 3 categorization (not top 8) +- [ ] Otherwise identical to Task 10 + +--- + +### TASK 12: Migrate `plot_speaking_style_trait_scores` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~810-926) + +**Action:** Rewrite horizontal bar chart with text annotations. + +**Current behavior:** +- Input: DataFrame with `Voice`, `score`, `Left_Anchor`, `Right_Anchor` columns +- Output: Horizontal bar chart with anchor labels at bottom + +**New implementation:** +```python +def plot_speaking_style_trait_scores( + self, + data: pl.LazyFrame | pl.DataFrame | None = None, + trait_description: str = None, + left_anchor: str = None, + right_anchor: str = None, + title: str = "Speaking Style Trait Analysis", + height: int | None = None, + width: int | None = None, +) -> alt.Chart: + """Plot scores for a single speaking style trait across multiple voices.""" + df = self._ensure_dataframe(data) + + if df.is_empty(): + return alt.Chart().mark_text(text="No data") + + required_cols = ["Voice", "score"] + if not all(col in df.columns for col in required_cols): + return alt.Chart().mark_text(text="Missing required columns") + + # Calculate stats: Mean, Count + stats = ( + df.filter(pl.col("score").is_not_null()) + .group_by("Voice") + .agg([ + pl.col("score").mean().alias("mean_score"), + pl.col("score").count().alias("count") + ]) + .sort("mean_score", descending=False) # Ascending for bottom-to-top display + .to_pandas() + ) + + # Extract anchors from data if not provided + if (left_anchor is None or right_anchor is None) and "Left_Anchor" in df.columns: + head = df.filter(pl.col("Left_Anchor").is_not_null()).head(1) + if not head.is_empty(): + if left_anchor 
is None: + left_anchor = head["Left_Anchor"][0] + if right_anchor is None: + right_anchor = head["Right_Anchor"][0] + + if trait_description is None: + if left_anchor and right_anchor: + trait_description = f"{left_anchor.split('|')[0]} vs. {right_anchor.split('|')[0]}" + elif "Description" in df.columns: + head = df.filter(pl.col("Description").is_not_null()).head(1) + trait_description = head["Description"][0] if not head.is_empty() else "" + else: + trait_description = "" + + # Horizontal bar chart + bars = alt.Chart(stats).mark_bar(color=ColorPalette.PRIMARY).encode( + x=alt.X('mean_score:Q', title='Average Score (1-5)', scale=alt.Scale(domain=[1, 5])), + y=alt.Y('Voice:N', title='Voice', sort='-x'), + tooltip=[ + alt.Tooltip('Voice:N'), + alt.Tooltip('mean_score:Q', title='Average', format='.2f'), + alt.Tooltip('count:Q', title='Count') + ] + ) + + # Count text inside bars + text = bars.mark_text( + align='center', + baseline='middle', + color='white', + fontSize=16 + ).encode( + text='count:Q' + ) + + # Combine + chart = (bars + text).properties( + title={ + "text": title, + "subtitle": [trait_description, "(Numbers on bars indicate respondent count)"] + }, + width=width or getattr(self, 'plot_width', 1000), + height=height or getattr(self, 'plot_height', 500) + ) + + # Note: Anchor annotations at bottom would require separate text marks + # positioned at fixed coordinates - can add if needed + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Horizontal orientation (x=score, y=voice) +- [ ] X-axis domain set to [1, 5] +- [ ] Count text displayed inside bars (white, large font) +- [ ] Title includes subtitle with trait description +- [ ] Sorted by mean score (ascending for bottom-to-top) +- [ ] Anchor label annotations (optional - commented in code) + +--- + +### TASK 13: Migrate `plot_speaking_style_correlation` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` 
(currently lines ~928-1018) + +**Action:** Rewrite with red/green conditional coloring based on sign. + +**Current behavior:** +- Input: DataFrame with correlation data +- Process: Calculate Pearson correlation per trait +- Output: Bar chart, positive correlations green, negative red + +**New implementation:** +```python +def plot_speaking_style_correlation( + self, + style_color: str, + style_traits: list[str], + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str | None = None, +) -> alt.Chart: + """Plots correlation between Speaking Style Trait Scores (1-5) and Voice Scale (1-10).""" + df = self._ensure_dataframe(data) + + if title is None: + # Include the style color so each style gets its own title (and PNG filename), as in Task 14 + title = f"Speaking style {style_color} and voice scale 1-10 correlations" + + trait_correlations = [] + + # Calculate correlations + for i, trait in enumerate(style_traits): + subset = df.filter(pl.col("Right_Anchor") == trait) + valid_data = subset.select(["score", "Voice_Scale_Score"]).drop_nulls() + + if valid_data.height > 1: + corr_val = valid_data.select(pl.corr("score", "Voice_Scale_Score")).item() + # Handle trait text - wrap at '|' for display + trait_display = trait.replace('|', '\n') + trait_correlations.append({ + "trait_display": trait_display, + "trait_index": f"Trait {i+1}", + "correlation": corr_val if corr_val is not None else 0.0 + }) + + if not trait_correlations: + return alt.Chart().mark_text(text=f"No data for {style_color} Style") + + plot_df = pl.DataFrame(trait_correlations).to_pandas() + + # Conditional color based on sign + chart = alt.Chart(plot_df).mark_bar().encode( + x=alt.X('trait_display:N', title=None, axis=alt.Axis(labelAngle=0)), + y=alt.Y('correlation:Q', title='Correlation', scale=alt.Scale(domain=[-1, 1])), + color=alt.condition( + alt.datum.correlation >= 0, + alt.value('green'), + alt.value('red') + ), + tooltip=[ + alt.Tooltip('trait_display:N', title='Trait'), + alt.Tooltip('correlation:Q', format='.2f') + ] + ).properties( + title=title, + width=1000, + height=400 + ) + + chart = 
self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Pearson correlation calculated via `pl.corr()` +- [ ] Conditional coloring: green if positive, red if negative +- [ ] Y-axis domain [-1, 1] +- [ ] Trait text wrapped at '|' for display +- [ ] Tooltip shows trait and correlation value + +--- + +### TASK 14: Migrate `plot_speaking_style_ranking_correlation` + +**Location:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/plots.py` (currently lines ~1020-1105) + +**Action:** Almost identical to Task 13, but correlates with `Ranking_Points` instead of `Voice_Scale_Score`. + +**New implementation:** +```python +def plot_speaking_style_ranking_correlation( + self, + style_color: str, + style_traits: list[str], + data: pl.LazyFrame | pl.DataFrame | None = None, + title: str | None = None, +) -> alt.Chart: + """Plots correlation between Speaking Style Trait Scores (1-5) and Voice Ranking Points (0-3).""" + df = self._ensure_dataframe(data) + + if title is None: + title = f"Speaking style {style_color} and voice ranking points correlations" + + trait_correlations = [] + + for i, trait in enumerate(style_traits): + subset = df.filter(pl.col("Right_Anchor") == trait) + valid_data = subset.select(["score", "Ranking_Points"]).drop_nulls() + + if valid_data.height > 1: + corr_val = valid_data.select(pl.corr("score", "Ranking_Points")).item() + trait_display = trait.replace('|', '\n') + trait_correlations.append({ + "trait_display": trait_display, + "trait_index": f"Trait {i+1}", + "correlation": corr_val if corr_val is not None else 0.0 + }) + + if not trait_correlations: + return alt.Chart().mark_text(text=f"No data for {style_color} Style") + + plot_df = pl.DataFrame(trait_correlations).to_pandas() + + chart = alt.Chart(plot_df).mark_bar().encode( + x=alt.X('trait_display:N', title=None, axis=alt.Axis(labelAngle=0)), + y=alt.Y('correlation:Q', title='Correlation', scale=alt.Scale(domain=[-1, 1])), + color=alt.condition( + 
alt.datum.correlation >= 0, + alt.value('green'), + alt.value('red') + ), + tooltip=[ + alt.Tooltip('trait_display:N', title='Trait'), + alt.Tooltip('correlation:Q', format='.2f') + ] + ).properties( + title=title, + width=1000, + height=400 + ) + + chart = self._save_plot(chart, title) + return chart +``` + +**Verification:** +- [ ] Returns `alt.Chart` +- [ ] Uses `Ranking_Points` column instead of `Voice_Scale_Score` +- [ ] Otherwise identical to Task 13 + +--- + +### TASK 15: Install vl-convert-python + +**Action:** Add vl-convert-python to project dependencies for PNG export. + +**Command:** +```bash +cd /home/luigi/Documents/VoiceBranding/JPMC/Phase-3 +uv add vl-convert-python +``` + +**Verification:** +- [ ] `vl-convert-python` appears in `pyproject.toml` dependencies +- [ ] Installation successful (no errors) + +--- + +### TASK 16: Remove Plotly dependencies (optional cleanup) + +**Action:** Remove unused Plotly packages. + +**Command:** +```bash +cd /home/luigi/Documents/VoiceBranding/JPMC/Phase-3 +uv remove plotly kaleido +``` + +**Verification:** +- [ ] `plotly` and `kaleido` removed from `pyproject.toml` +- [ ] No other code depends on Plotly (grep check) + +--- + +### TASK 17: Test all plot methods in Marimo notebook + +**Action:** Create a test notebook to verify all 10 plotting methods work correctly. + +**Test checklist per plot:** +- [ ] Chart renders without errors +- [ ] Chart has correct dimensions (width/height) +- [ ] Colors match ColorPalette constants +- [ ] Data is displayed correctly (bars, stacks, etc.) 
+- [ ] Text overlays render (counts, scores) +- [ ] Tooltips show correct information +- [ ] Filter annotation appears below chart (if filters active) +- [ ] PNG export works (check `figures/` directory) +- [ ] No overlap between chart elements and filter text + +**Create test file:** `/home/luigi/Documents/VoiceBranding/JPMC/Phase-3/test_altair_migration.py` + +**Test template:** +```python +import marimo as mo +import polars as pl +from utils import JPMCSurvey + +# Load sample data +survey = JPMCSurvey() +survey.load_data('path/to/data') +survey.fig_save_dir = 'figures/altair_test' + +# Test each plot method +mo.md("## Testing Altair Migration") + +# 1. Test plot_average_scores_with_counts +chart1 = survey.plot_average_scores_with_counts(...) +chart1 + +# 2. Test plot_most_ranked_1 +chart2 = survey.plot_most_ranked_1(...) +chart2 + +# ... repeat for all 10 methods +``` + +--- + +## Final Verification Checklist + +After completing all tasks, verify the following: + +### Code Quality +- [ ] No Plotly imports remain in `plots.py` +- [ ] All methods return `alt.Chart` instead of `go.Figure` +- [ ] No syntax errors (`python -m py_compile plots.py`) +- [ ] Type hints updated (if any reference `go.Figure`) +- [ ] Docstrings updated (if any mention Plotly) + +### Theme & Styling +- [ ] `jpmc_altair_theme()` function exists in `theme.py` +- [ ] Theme is registered and enabled +- [ ] All charts use ColorPalette constants +- [ ] Chart dimensions match original (width=1000, height=500 defaults) +- [ ] Font sizes match original (11pt for labels, 14pt for titles) + +### Data Handling +- [ ] All methods handle empty data gracefully +- [ ] Wide-to-long transformations correct (stacked bars, selection counts) +- [ ] Sorting preserved (by average, count, rank1, etc.) 
+- [ ] Column filtering works (`_recordId` excluded) +- [ ] String processing works (comma-split, strip, explode) + +### Visual Features +- [ ] Bar charts render correctly (vertical and horizontal) +- [ ] Stacked bars have correct layer order +- [ ] Text overlays positioned correctly (inside/outside bars) +- [ ] Conditional coloring works (top N highlighting, red/green by sign) +- [ ] Tooltips show correct fields with proper formatting +- [ ] Legends positioned correctly (top horizontal for stacked bars) +- [ ] X-axis labels rotated at -45° by default +- [ ] Grid lines visible + +### Filter System +- [ ] `_get_filter_slug()` unchanged (still works) +- [ ] `_get_filter_description()` unchanged (still works) +- [ ] `_add_filter_footnote()` uses `vconcat` approach +- [ ] Filter text appears at bottom of combined chart +- [ ] No overlap between chart and filter text +- [ ] Filter text is left-aligned +- [ ] HTML tags stripped from filter text (Altair doesn't support HTML) +- [ ] Filter subdirectories created correctly + +### PNG Export +- [ ] `vl-convert-python` installed +- [ ] `chart.save()` method works +- [ ] PNG files created in correct subdirectories +- [ ] PNG files have correct filenames (sanitized titles) +- [ ] Image quality acceptable (scale_factor=2.0) + +### Marimo Integration +- [ ] Charts render in Marimo notebooks +- [ ] Charts are reactive (update when data changes) +- [ ] No JavaScript console errors +- [ ] Interactive features work (tooltips, pan, zoom if enabled) + +### All 10 Plot Methods +1. [ ] `plot_average_scores_with_counts` - vertical bar + text +2. [ ] `plot_top3_ranking_distribution` - stacked bar (3 ranks) +3. [ ] `plot_ranking_distribution` - stacked bar (4 ranks) +4. [ ] `plot_most_ranked_1` - vertical bar + conditional color +5. [ ] `plot_weighted_ranking_score` - vertical bar + text +6. [ ] `plot_voice_selection_counts` - vertical bar + conditional color +7. [ ] `plot_top3_selection_counts` - vertical bar + conditional color +8. 
[ ] `plot_speaking_style_trait_scores` - horizontal bar + text +9. [ ] `plot_speaking_style_correlation` - vertical bar + red/green +10. [ ] `plot_speaking_style_ranking_correlation` - vertical bar + red/green + +### Edge Cases +- [ ] Empty DataFrame handled gracefully +- [ ] Missing columns detected and reported +- [ ] Zero counts/values don't break charts +- [ ] Single data point renders correctly +- [ ] Very long labels don't cause layout issues +- [ ] Many categories don't cause overcrowding + +### Regression Testing +- [ ] Existing Marimo notebooks still work +- [ ] Data filtering still works (`filter_data()`) +- [ ] `JPMCSurvey` class initialization unchanged +- [ ] No breaking changes to public API + +### Documentation +- [ ] This migration plan marked as "Complete" +- [ ] Any new dependencies documented +- [ ] Any breaking changes documented +- [ ] Example usage updated (if applicable) + +--- + +## Troubleshooting + +### Issue: Charts don't render +- Check Altair version: `python -c "import altair; print(altair.__version__)"` +- Check vl-convert: `python -c "import vl_convert; print(vl_convert.__version__)"` +- Check for JavaScript errors in browser console + +### Issue: PNG export fails +- Verify vl-convert-python installed: `uv pip show vl-convert-python` +- Check write permissions on `figures/` directory +- Try saving as HTML first: `chart.save('test.html')` + +### Issue: Colors don't match theme +- Verify theme is enabled: `print(alt.theme.active)` (`alt.themes` is deprecated since Altair 5.5) +- Check color scale definitions in each plot method +- Ensure ColorPalette imported correctly + +### Issue: Filter text overlaps chart +- Increase `spacing` parameter in `vconcat(chart, footer, spacing=20)` +- Increase footer chart `height` property +- Check if footer chart is actually created (debug with `print()`) + +### Issue: Data not displaying +- Check DataFrame format (wide vs long) +- Verify column names match encoding specs +- Check for null values (`.drop_nulls()`) +- Print intermediate DataFrames 
for debugging + +--- + +## Notes + +- **Backup:** Before starting, create a backup of `plots.py`: `cp plots.py plots.py.plotly_backup` +- **Incremental testing:** Test each plot method immediately after migration +- **Marimo restart:** May need to restart Marimo kernel after major changes +- **Performance:** By default Altair raises `MaxRowsError` when a chart's data exceeds 5000 rows; aggregate in Polars before plotting, or use `.sample()` or `alt.data_transformers.disable_max_rows()` if needed + +--- + +## Completion Status + +- [ ] All tasks (1-17) completed +- [ ] All verification checks passed +- [ ] Existing notebooks tested and working +- [ ] Migration documented +- [ ] Ready for production use + +**Migration completed on:** _________________ +**Tested by:** _________________ +**Sign-off:** _________________
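
---

## Appendix: Plotly Import Check (sketch)

The cleanup checks in this plan ("No other code depends on Plotly (grep check)" and "No Plotly imports remain in `plots.py`") could be run with a plain text grep, but a grep also matches comments and docstrings. The following is a minimal sketch that inspects only real import statements using the standard-library `ast` module. The names `find_banned_imports` and `BANNED_PACKAGES` are hypothetical and not part of the project; adjust the package list as needed.

```python
import ast

# Hypothetical helper, not part of the project code base.
# Packages that should no longer be imported once the migration is done.
BANNED_PACKAGES = ("plotly", "kaleido")


def find_banned_imports(
    source: str, banned: tuple[str, ...] = BANNED_PACKAGES
) -> list[tuple[int, str]]:
    """Return (line_number, module_name) for each import of a banned package."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` -> one alias per imported module
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute `from a.b import c`; relative imports are skipped
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Compare only the top-level package name
            if module.split(".")[0] in banned:
                hits.append((node.lineno, module))
    return sorted(hits)


sample = (
    "import polars as pl\n"
    "import plotly.graph_objects as go\n"
    "from plotly.subplots import make_subplots\n"
)
print(find_banned_imports(sample))
# [(2, 'plotly.graph_objects'), (3, 'plotly.subplots')]
```

To apply this to the project, read each `.py` file (for example with `pathlib.Path("plots.py").read_text()`) and pass the contents to the function; an empty list means no Plotly or Kaleido imports remain in that file.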