Monitoring Semantic Variance
LLMs change over time. A prompt that works perfectly today may start hallucinating next month after a provider such as OpenAI updates its model weights. We call this "Quality Drift".
Our backend runs hourly heuristic checks on your ten most-used prompts: it feeds a hidden baseline dataset through each prompt and compares the output against a known 'perfect' result. If the Semantic Variance score drops below 85%, the Dashboard flags the prompt in red.
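The check above can be sketched roughly as follows. PromptForge's actual scoring logic is internal, so this is only an illustration: a simple token-overlap (Jaccard) similarity stands in for the real semantic comparison, and the function names are made up for this example.

```python
THRESHOLD = 0.85  # prompts scoring below this are flagged red on the Dashboard


def semantic_variance(baseline_output: str, current_output: str) -> float:
    """Crude stand-in for the real semantic comparison: Jaccard token overlap."""
    a = set(baseline_output.lower().split())
    b = set(current_output.lower().split())
    if not a and not b:
        return 1.0  # two empty outputs are trivially identical
    return len(a & b) / len(a | b)


def is_flagged(baseline_output: str, current_output: str) -> bool:
    """Return True if the prompt should be flagged for Quality Drift."""
    return semantic_variance(baseline_output, current_output) < THRESHOLD


# An identical output scores 1.0 and passes; a completely
# different output scores 0.0 and gets flagged.
print(is_flagged("Paris is the capital of France.",
                 "Paris is the capital of France."))  # False (no drift)
print(is_flagged("Paris is the capital of France.",
                 "Sorry, I cannot answer that."))     # True (flagged)
```

In practice a production check would use embedding-based similarity rather than token overlap, but the flag logic (score versus threshold) is the same shape.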
If a prompt is flagged red, the recommended fix is to duplicate it in the PromptForge Lab, switch it to a newer model (e.g., GPT-4o instead of GPT-4-turbo), and re-run the benchmark.