Reliability Scoring Win Tie Loss
Reliability scoring example for Win-Tie-Loss human evaluation
Overall logic for Reliability scoring (REL)
REL can be seen as accuracy against a trusted annotation, produced by a Subject Matter Expert (SME) or Quality Control (QC).
From an annotator's standpoint, we count all of their annotations that match QC and compute the resulting accuracy.
Given that some annotations are hierarchical by definition (first annotators decide whether a prompt can be rated, then perform the annotation), there can be a mismatch in "matchability" between QC and annotator.
For instance, if an annotator flags a prompt as NSFW, they never get the chance to select the best response. If QC does not flag it as NSFW, QC will select a response and the annotator would be penalized for an "incorrect" response, even though the error happened at the first level (flagging). We therefore need to account for this error separately so there are no confounding factors.
As an example, an annotator can perform poorly in NSFW flagging, but be extremely aligned with QC when selecting the best response. This annotator would then have low overall reliability if the first level (flagging) is not treated separately.
Below are some details about the logic.
Numerator
The numerator of REL is straightforward: it is the total number of correct matches between the annotator and QC.
Denominator
As previously introduced, some adjustments (deductions) are needed in the denominator of REL so that we do not overpenalize mistakes such as incorrect prompt flagging.
Why Adjust the Denominator?
This part explains the logic and rationale behind denominator deductions in reliability calculations.
In reliability calculations, the denominator represents the total number of opportunities to provide a correct annotation. However, not all opportunities are valid. Adjusting the denominator excludes irrelevant or invalid cases, ensuring the metric reflects only meaningful data.
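As a worked numeric sketch of this adjustment (all counts below are made up for illustration, not taken from the example dataset):

```python
# Illustrative counts for one annotator (all numbers hypothetical)
total_items = 20      # items the annotator saw
flag_mismatch = 4     # annotator's ratability flag disagreed with QC -> excluded
both_flagged = 2      # both flagged the prompt, so there is no choice to compare
correct_matches = 10  # choices that agree with QC among the remaining items

# Denominator adjustment: drop items with no valid choice comparison
denominator = total_items - flag_mismatch - both_flagged  # = 14
rel = correct_matches / denominator
print(f"REL = {rel:.2%}")  # REL = 71.43%
```

Without the deductions, this annotator would score 10/20 = 50% instead of roughly 71%, conflating flagging errors with response-selection errors.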
Reasons for Adjustments
Below are some examples that could be reasons to adjust the denominator.
- Invalid Prompts:
  - If a prompt is flagged as not suitable (for example, requesting PII, medical advice, or other flagging reasons), it should not be considered for downstream annotation (in this example, the selection of the best response).
- Skipped Cases:
  - Cases marked as skipped by annotators or QCs indicate that there could be an error or an undefined edge case during annotation, such as non-rendered Markdown, unexpected language, etc.
  - This is a very conservative approach.
  - This case will not be addressed in this example.
- No Consensus:
  - For consensus-based evaluations, cases without consensus are unsuitable for reliability calculations. E.g. for binary ranking, if annotators say [A, B, Tie], there is no consensus.
  - This case will not be addressed in this example.
Steps
Importing relevant libraries
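A minimal set of imports for this walkthrough (assuming pandas and matplotlib, which the snippets below rely on):

```python
import json

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove inside a notebook
import matplotlib.pyplot as plt
```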
In case you have your data in a JSON file, you can load it with the following code:
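A sketch of loading annotations from JSON into a DataFrame (the file name and record fields here are hypothetical; adapt them to your own export format):

```python
import json

import pandas as pd

# Demo only: write a tiny sample file so this snippet runs standalone.
# In practice, point `path` at your own annotation export.
path = "annotations.json"
sample = [
    {"item_id": 1, "evaluator": "QC", "item_flag": "No", "choice": "A"},
    {"item_id": 1, "evaluator": "ann_1", "item_flag": "No", "choice": "A"},
]
with open(path, "w") as f:
    json.dump(sample, f)

# Load the JSON records into a DataFrame
with open(path) as f:
    df = pd.DataFrame(json.load(f))
```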
Here we are going to use a toy sample of data to test the code:
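One possible toy sample, with illustrative field names: `item_flag == "Yes"` marks a prompt flagged as non-ratable (NSFW), `choice` is the Win-Tie-Loss selection, and `"QC"` plays the role of the trusted annotation:

```python
import pandas as pd

# Toy annotations: QC is the trusted annotator; ann_1/ann_2 are evaluators.
records = [
    {"item_id": 1, "evaluator": "QC",    "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_1", "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_2", "item_flag": "Yes", "choice": None},
    {"item_id": 2, "evaluator": "QC",    "item_flag": "No",  "choice": "Tie"},
    {"item_id": 2, "evaluator": "ann_1", "item_flag": "No",  "choice": "B"},
    {"item_id": 2, "evaluator": "ann_2", "item_flag": "No",  "choice": "Tie"},
]
df = pd.DataFrame(records)
```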
Computing the metrics for Reliability
Here we are going to define the function that computes the metrics for Reliability. The main goal is to compute the following metrics for each evaluator:
- Reliability: Fraction of applicable items (both QC and evaluator have "item_flag" == "No") where evaluator’s choice equals QC’s choice.
- Flag Mismatch %: Percentage of items where the evaluator's "item_flag" differs from QC's "item_flag". Here "item_flag" marks whether the item was flagged as NSFW.
- Total Items: Total number of items evaluated.
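A sketch of `compute_metrics_for_plot` implementing the three metrics above; the column names follow the toy schema and are assumptions, not a fixed API:

```python
import pandas as pd

def compute_metrics_for_plot(df: pd.DataFrame) -> pd.DataFrame:
    """Per-evaluator Reliability, Flag Mismatch %, and Total Items vs QC."""
    qc = df[df["evaluator"] == "QC"].set_index("item_id")
    rows = []
    for evaluator, grp in df[df["evaluator"] != "QC"].groupby("evaluator"):
        # Align each evaluator's items with the QC annotation for the same item
        joined = grp.set_index("item_id").join(qc, rsuffix="_qc", how="inner")
        # Flag Mismatch %: evaluator and QC disagree on ratability (NSFW flag)
        flag_mismatch = (joined["item_flag"] != joined["item_flag_qc"]).mean() * 100
        # Denominator adjustment: only items both sides judged ratable count
        applicable = joined[(joined["item_flag"] == "No")
                            & (joined["item_flag_qc"] == "No")]
        reliability = ((applicable["choice"] == applicable["choice_qc"]).mean()
                       if len(applicable) else float("nan"))
        rows.append({"evaluator": evaluator,
                     "reliability": reliability,
                     "flag_mismatch_pct": flag_mismatch,
                     "total_items": len(joined)})
    return pd.DataFrame(rows).set_index("evaluator")

# Toy data from the previous step (repeated so this snippet runs standalone)
records = [
    {"item_id": 1, "evaluator": "QC",    "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_1", "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_2", "item_flag": "Yes", "choice": None},
    {"item_id": 2, "evaluator": "QC",    "item_flag": "No",  "choice": "Tie"},
    {"item_id": 2, "evaluator": "ann_1", "item_flag": "No",  "choice": "B"},
    {"item_id": 2, "evaluator": "ann_2", "item_flag": "No",  "choice": "Tie"},
]
metrics = compute_metrics_for_plot(pd.DataFrame(records))
```

Note that `ann_2`'s NSFW flag on item 1 counts only toward Flag Mismatch %, not against their reliability: the item is removed from the reliability denominator exactly as described above.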
Generating the matrix plot for better visualization
Here we are going to define the function that generates the matrix plot using the dataframe resulting from the compute_metrics_for_plot function.
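A minimal sketch of such a plotting function, assuming the metrics DataFrame produced by `compute_metrics_for_plot`; it renders the values as a simple table-style matrix and saves the figure to disk:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_reliability_matrix(metrics: pd.DataFrame,
                            path: str = "reliability_matrix.png") -> None:
    """Render per-evaluator metrics as a matrix and save the figure to `path`."""
    cols = ["reliability", "flag_mismatch_pct", "total_items"]
    cell_text = [[f"{metrics.loc[ev, c]:.2f}" for c in cols]
                 for ev in metrics.index]
    fig, ax = plt.subplots(figsize=(6, 0.5 * len(metrics) + 1.5))
    ax.axis("off")
    ax.table(cellText=cell_text, rowLabels=list(metrics.index),
             colLabels=cols, loc="center")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

# Demo with hypothetical metrics for two evaluators
demo = pd.DataFrame(
    {"reliability": [0.5, 1.0], "flag_mismatch_pct": [0.0, 50.0],
     "total_items": [2, 2]},
    index=pd.Index(["ann_1", "ann_2"], name="evaluator"),
)
plot_reliability_matrix(demo)
```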
Putting it all together
Here we are going to put all the functions together and compute the metrics and generate the matrix plot.
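An end-to-end sketch of the two overall summary numbers over the toy sample; these percentages will of course differ from the full-dataset figures reported below:

```python
import pandas as pd

# Toy data from earlier (repeated so this snippet runs standalone)
records = [
    {"item_id": 1, "evaluator": "QC",    "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_1", "item_flag": "No",  "choice": "A"},
    {"item_id": 1, "evaluator": "ann_2", "item_flag": "Yes", "choice": None},
    {"item_id": 2, "evaluator": "QC",    "item_flag": "No",  "choice": "Tie"},
    {"item_id": 2, "evaluator": "ann_1", "item_flag": "No",  "choice": "B"},
    {"item_id": 2, "evaluator": "ann_2", "item_flag": "No",  "choice": "Tie"},
]
df = pd.DataFrame(records)

qc = df[df["evaluator"] == "QC"].set_index("item_id")
joined = df[df["evaluator"] != "QC"].set_index("item_id").join(
    qc, rsuffix="_qc", how="inner")

# Share of prompts QC flagged as non-ratable (NSFW)
pct_flagged = (qc["item_flag"] == "Yes").mean() * 100

# Pooled reliability across all evaluators, applicable items only
applicable = joined[(joined["item_flag"] == "No")
                    & (joined["item_flag_qc"] == "No")]
overall = (applicable["choice"] == applicable["choice_qc"]).mean() * 100

print(f"Overall prompts flagged as non-ratable (NSFW) by QC: {pct_flagged:.2f}%")
print(f"Overall Reliability across all evaluators: {overall:.2f}%")
```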
Overall prompts flagged as non-ratable (NSFW) by QC: 0.00%
Overall Reliability across all evaluators: 62.96%
Here is the plot showing each evaluator's reliability, the percentage of disagreement between each evaluator and QC on the ratability of the prompts, and the total number of prompts each evaluator annotated.
It is extremely important to always interpret the scores with each evaluator's total item count in mind.
Evaluators with a low number of total items might have high reliability scores just by chance.