Safety criteria | |
---|---|
Bad | Safe outputs |
Good | Fewer than 0.1% of outputs across 10,000 trials are flagged for toxicity by our content filter. |
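A criterion phrased this way can be checked mechanically. The following is a minimal sketch, assuming the content filter's verdicts are already available as one boolean per trial; the function names and the sample counts are hypothetical and not part of any particular filter's API.

```python
def flagged_rate(flags):
    """Return the fraction of outputs the content filter flagged."""
    if not flags:
        raise ValueError("need at least one trial")
    return sum(1 for flagged in flags if flagged) / len(flags)


def meets_safety_criterion(flags, max_rate=0.001):
    """True when strictly fewer than max_rate (0.1%) of outputs were flagged."""
    return flagged_rate(flags) < max_rate


# Hypothetical run: 10,000 trials, 7 of them flagged for toxicity.
flags = [True] * 7 + [False] * 9993
print(f"flagged rate: {flagged_rate(flags):.4%}")
print("criterion met:", meets_safety_criterion(flags))
```

The threshold is a parameter so the same check can be reused if the target rate changes.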
Example metrics and measurement methods
Example task fidelity criteria for sentiment analysis
Criteria | |
---|---|
Bad | The model should classify sentiments well |
Good | Our sentiment analysis model should achieve an F1 score of at least 0.85 (Measurable, Specific) on a held-out test set* of 10,000 diverse Twitter posts (Relevant), which is a 5% improvement over our current baseline (Achievable). |
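To make the F1 target concrete, here is a small sketch of how the score might be computed and compared against the 0.85 threshold. It assumes a three-class sentiment task and macro-averaged F1; the averaging method is an assumption, since the criterion above does not specify one, and the labels and predictions are made-up toy data rather than real results.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy held-out labels and model predictions (illustrative only).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "pos", "neg", "neu", "neu", "neu"]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}, meets target: {score >= 0.85}")
```

On a real test set of 10,000 posts, the same comparison against 0.85 gives a pass/fail answer.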
Task fidelity
Consistency
Relevance and coherence
Tone and style
Privacy preservation
Context utilization
Latency
Price
Example multidimensional criteria for sentiment analysis
Criteria | |
---|---|
Bad | The model should classify sentiments well |
Good | On a held-out test set of 10,000 diverse Twitter posts, our sentiment analysis model should achieve: - an F1 score of at least 0.85 - 99.5% of outputs are non-toxic - 90% of errors would cause inconvenience, not egregious error* - 95% of responses returned in under 200ms |
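Multidimensional criteria like these can be evaluated together and reported per dimension, so a failure in one area is not hidden by success in another. The sketch below assumes the evaluation run has already produced an F1 score, per-output toxicity flags, a manual "minor or egregious" label for each error, and per-response latencies; the `EvalResults` structure, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResults:
    f1: float             # F1 score on the held-out test set
    toxic_flags: list     # one bool per output, True if flagged as toxic
    error_is_minor: list  # one bool per error, True if merely inconvenient
    latencies_ms: list    # one response time per request, in milliseconds


def check_criteria(r):
    """Return a pass/fail verdict for each of the four criteria."""
    non_toxic = 1 - sum(r.toxic_flags) / len(r.toxic_flags)
    # With no errors at all, the error-severity criterion is trivially met.
    minor = sum(r.error_is_minor) / len(r.error_is_minor) if r.error_is_minor else 1.0
    fast = sum(1 for ms in r.latencies_ms if ms < 200) / len(r.latencies_ms)
    return {
        "f1": r.f1 >= 0.85,
        "non_toxic": non_toxic >= 0.995,
        "minor_errors": minor >= 0.90,
        "latency": fast >= 0.95,
    }


# Hypothetical evaluation run (numbers are illustrative, not real results).
results = EvalResults(
    f1=0.87,
    toxic_flags=[True] * 2 + [False] * 998,
    error_is_minor=[True] * 46 + [False] * 4,
    latencies_ms=[120] * 97 + [250] * 3,
)
verdicts = check_criteria(results)
print(verdicts)
print("all criteria met:", all(verdicts.values()))
```

Returning a dictionary rather than a single boolean keeps each dimension visible when reporting results.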