Emotional Stimuli and Beyond: Prompt Engineering Strategies for LLM Performance
Overview
Large Language Models (LLMs) are highly sensitive to how they are prompted. This project studies a family of prompt engineering strategies, from neutral baselines to emotionally charged framings, to understand how subtle language choices influence task performance. We emphasize practical techniques developers can apply without any model training or fine-tuning.
Prompting Strategies Studied
- Neutral Baseline — plain task instructions.
- Emotional Stimuli — positive/negative affect and urgency modifiers (“This is critical…”, “Please be careful…”).
- Role/Persona — “You are a meticulous data scientist…”.
- Context Priming — brief domain setup and constraints before the task.
- Few‑shot — 1–3 short, curated examples that match the task distribution.
- Reasoning Scaffolds — structured steps, e.g., “think then answer”, self‑check rubrics.
- Critique & Revise — generate, critique, then rewrite.
- Calibration — ask for confidence, abstain options, or uncertainty highlights.
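As a minimal sketch of how several of the strategies above can be composed, the following prompt-builder functions wrap a plain task instruction with each framing. All template strings here are illustrative examples, not the exact wording used in the study:

```python
# Illustrative prompt builders; the wording of each modifier is a
# hypothetical example, not the study's exact text.

BASE_TASK = "Classify the sentiment of the review as positive or negative."

def neutral(task: str) -> str:
    # Neutral baseline: the plain instruction, unchanged.
    return task

def emotional(task: str) -> str:
    # Emotional stimulus: urgency/affect modifier prepended to the task.
    return "This is critical to my work, please be careful. " + task

def persona(task: str) -> str:
    # Role/persona framing placed before the instruction.
    return "You are a meticulous data scientist. " + task

def few_shot(task: str, examples: list[tuple[str, str]]) -> str:
    # 1-3 curated input/label pairs appended after the instruction.
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{task}\n{shots}\nInput:"

def scaffold(task: str) -> str:
    # Lightweight reasoning scaffold: think first, then answer.
    return task + " Think through the steps briefly, then give the final answer on its own line."
```

Builders compose naturally, e.g. `persona(emotional(BASE_TASK))` yields the persona + emotional combination studied below.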
Tasks & Metrics (High Level)
We evaluate across representative task types: classification, extraction, and short-form reasoning. Primary metrics are accuracy and F1; F1 is particularly useful when labels are imbalanced or when partial extraction correctness matters. For generative answers, we apply rubric-style checks and, when applicable, automatic exact/partial matching.
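For concreteness, the two primary metrics can be computed as follows (a minimal sketch using per-label precision/recall and macro-averaged F1; the label names are placeholders):

```python
def accuracy(gold: list[str], pred: list[str]) -> float:
    # Fraction of predictions that exactly match the gold label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_per_label(gold: list[str], pred: list[str], label: str) -> float:
    # Standard F1 for one label: harmonic mean of precision and recall.
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold: list[str], pred: list[str]) -> float:
    # Unweighted mean of per-label F1, so rare labels count equally --
    # the reason F1 is preferred under label imbalance.
    labels = set(gold)
    return sum(f1_per_label(gold, pred, l) for l in labels) / len(labels)
```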
Key Findings (at a glance)
- Emotional framings can improve adherence to instructions and reduce “casual” mistakes on constrained tasks.
- Combining persona + context often yields more stable outputs than either alone.
- Few‑shot helps most when examples tightly match the evaluation distribution; otherwise it can add noise.
- Lightweight reasoning scaffolds (self‑checks, step hints) produce gains similar to long chain‑of‑thought, with lower verbosity.
- Calibration prompts reduce over‑confident wrong answers and make abstentions more principled.
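To illustrate the last point, a calibrated reply has to be post-processed into an answer, a confidence score, and an abstain decision. The sketch below assumes the model is asked to end its reply with a line like `Confidence: 0.8` or to emit the token `ABSTAIN`; both conventions and the threshold are assumptions for illustration, not the study's protocol:

```python
import re

def parse_calibrated(reply: str, threshold: float = 0.6):
    """Return (answer, confidence, abstained) from a calibrated reply."""
    # Explicit abstention token wins outright.
    if "ABSTAIN" in reply:
        return None, 0.0, True
    # Pull a self-reported confidence in [0, 1]; default to 0.0 if absent.
    m = re.search(r"Confidence:\s*([01](?:\.\d+)?)", reply)
    conf = float(m.group(1)) if m else 0.0
    answer_line = reply.splitlines()[0].strip()
    # Treat low-confidence answers as abstentions to curb over-confident errors.
    if conf < threshold:
        return answer_line, conf, True
    return answer_line, conf, False
```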
Failure Modes & Risks
- Sentiment drift — emotional prompts can bias tone or steer the model toward the answer the framing implies.
- Over‑constrained formats — overly rigid output schemas increase refusals or truncation.
- Prompt overfitting — few‑shot with unrepresentative examples hurts generalization.
- Spurious explanations — scaffolds may sound “confident” even when wrong; prefer verifiable checks.
What We’d Do Next
- Automated prompt selection based on dev‑set telemetry (bandit‑style routing).
- Systematic ablation of each component (persona vs context vs emotions) across models.
- Task‑aware calibration targets (e.g., precision‑first vs recall‑first regimes).
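The bandit-style routing idea can be sketched with a simple epsilon-greedy selector over prompt variants, where the reward function would score each variant on dev-set telemetry. Everything here (variant names, reward signal, hyperparameters) is a hypothetical illustration of the direction, not an implemented system:

```python
import random

def epsilon_greedy_router(prompts, reward_fn, rounds=500, eps=0.1, seed=0):
    """Select among prompt variants by observed reward (epsilon-greedy bandit)."""
    rng = random.Random(seed)
    counts = {p: 0 for p in prompts}   # times each variant was tried
    totals = {p: 0.0 for p in prompts}  # cumulative reward per variant

    def mean(p):
        return totals[p] / counts[p] if counts[p] else 0.0

    for _ in range(rounds):
        # Explore with probability eps (or until something has been tried);
        # otherwise exploit the variant with the best observed mean reward.
        if rng.random() < eps or not any(counts.values()):
            p = rng.choice(prompts)
        else:
            p = max(prompts, key=mean)
        counts[p] += 1
        totals[p] += reward_fn(p)

    return max(prompts, key=mean)
```

In practice `reward_fn` would run the candidate prompt against a held-out dev batch and return a task metric such as macro-F1.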
This page is a concise summary. For the full methodology, ablations, and complete results, see the paper.