Uqlm Phoenix Confidence Example

UQLM × Arize Phoenix: Response-Level Confidence & Hallucination Risk

This notebook shows how to compute model-agnostic, ground-truth-free confidence / risk scores using UQLM after creating a dataset and starting an experiment in Arize Phoenix.

What you'll do:

  1. Install and configure Phoenix & UQLM
  2. Create a small demo dataset with prompts & responses
  3. Upload the dataset to Phoenix and configure a task that generates the sampled responses
  4. Compute UQLM per-scorer confidences and an ensemble confidence
  5. Derive risk = 1 - confidence and an optional high_risk flag

🧩 Why: UQLM provides a production-friendly, model-agnostic uncertainty signal that complements judge-style evals and helps you flag risky answers without labeled ground truth.

After running pip install arize-phoenix in your terminal, run phoenix serve to host Phoenix locally, then continue with this notebook.

0) Install packages

[ ]

1) Imports & version check
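
A version check for this step can be sketched with the standard library alone. The distribution names arize-phoenix and uqlm are assumed to be the ones installed in step 0, and the helper installed_versions is a hypothetical name introduced here for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust if your environment uses different ones.
for name, version in installed_versions(["arize-phoenix", "uqlm"]).items():
    print(f"{name}: {version or 'not installed'}")
```

A missing package is reported rather than raising, so the cell still runs in a partially configured environment.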

[ ]

2) Configure Phoenix connection

Set these if you're sending results to Phoenix Cloud or your own self-hosted Phoenix.
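
One way to set these values is through environment variables, sketched below. This assumes Phoenix reads PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY, and that a local phoenix serve listens on port 6006; check both against your Phoenix version. The helper configure_phoenix is a hypothetical name, not part of Phoenix.

```python
import os


def configure_phoenix(endpoint="http://localhost:6006", api_key=None):
    """Point the Phoenix client at an endpoint via environment variables.

    The variable names are assumptions about what Phoenix reads;
    the default endpoint assumes a local `phoenix serve`.
    """
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
    if api_key:
        # Only needed for Phoenix Cloud or a self-hosted instance with auth enabled.
        os.environ["PHOENIX_API_KEY"] = api_key
    return os.environ["PHOENIX_COLLECTOR_ENDPOINT"]


print("Phoenix endpoint:", configure_phoenix())
```

For a hosted instance, pass your own endpoint URL and key instead of relying on the local default.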

[ ]

3) Demo dataset

We'll make a small dataset with:

  • input – the user prompt
  • output – the model response

Once we upload this dataset into Phoenix, we can run a task to create our:

  • sampled_responses – a list of stochastic responses for the same prompt (for black-box UQLM)
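
The shape of the data can be sketched in plain Python before involving Phoenix. The rows, the canned answers, and the names make_stub_llm and sampling_task are all illustrative assumptions; in the notebook itself the upload and the task run go through Phoenix's dataset and experiment APIs, which are not reproduced here.

```python
import random

# Illustrative demo rows with the two columns described above.
demo_rows = [
    {"input": "What is the capital of France?", "output": "Paris."},
    {"input": "How many days are in a leap year?", "output": "366."},
]

# Canned answers standing in for a stochastic model (made up for this sketch).
canned = {
    "What is the capital of France?": ["Paris.", "Paris.", "Lyon."],
    "How many days are in a leap year?": ["366.", "366.", "365."],
}


def make_stub_llm(answers, seed=0):
    """Return a generate(prompt) callable that samples from canned answers."""
    rng = random.Random(seed)

    def generate(prompt):
        return rng.choice(answers[prompt])

    return generate


def sampling_task(example, generate, k=5):
    """Produce k stochastic responses for one example's prompt."""
    return {"sampled_responses": [generate(example["input"]) for _ in range(k)]}


generate = make_stub_llm(canned)
for row in demo_rows:
    row.update(sampling_task(row, generate, k=5))
    print(row["input"], "->", row["sampled_responses"])
```

Swapping make_stub_llm for a real client call with a nonzero temperature gives the same row shape with genuine samples.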
[ ]
[ ]
[ ]
[ ]
[ ]

4) UQLM adapter: per-scorer + ensemble confidence

Below is a compact adapter that:

  • Runs BlackBoxUQ score(...) if you already have responses and sampled_responses.
  • Optionally runs BlackBoxUQ generate_and_score(...) if you pass an llm and set num_responses > 0.
  • Optionally runs WhiteBoxUQ if you pass an llm that returns token-level logprobs.
  • Computes an ensemble confidence (mean/median/weighted) and adds uqlm_confidence, uqlm_risk, and uqlm_high_risk.
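
The ensemble step in the last bullet can be sketched without UQLM at all, given per-scorer confidences in [0, 1]. The function names, the example scorer names, and the 0.5 risk threshold are assumptions chosen for illustration; only the output keys uqlm_confidence, uqlm_risk, and uqlm_high_risk come from the description above.

```python
from statistics import mean, median


def ensemble_confidence(scores, method="mean", weights=None):
    """Combine per-scorer confidences (dict of name -> value in [0, 1])."""
    if not scores:
        raise ValueError("scores must not be empty")
    if method == "mean":
        return mean(scores.values())
    if method == "median":
        return median(scores.values())
    if method == "weighted":
        if not weights:
            raise ValueError("weighted ensemble requires weights")
        total = sum(weights[name] for name in scores)
        return sum(scores[name] * weights[name] for name in scores) / total
    raise ValueError(f"unknown method: {method}")


def add_risk_fields(scores, method="mean", weights=None, risk_threshold=0.5):
    """Derive confidence, risk = 1 - confidence, and a high-risk flag."""
    confidence = ensemble_confidence(scores, method=method, weights=weights)
    risk = 1.0 - confidence
    return {
        "uqlm_confidence": confidence,
        "uqlm_risk": risk,
        # risk_threshold is an assumed default; tune it for your use case.
        "uqlm_high_risk": risk >= risk_threshold,
    }


example_scores = {"exact_match": 0.8, "noncontradiction": 0.6}
print(add_risk_fields(example_scores))
print(add_risk_fields(example_scores, method="weighted",
                      weights={"exact_match": 3, "noncontradiction": 1}))
```

A weighted ensemble is useful when one scorer is known to track correctness better on your data; otherwise the mean or median is a reasonable starting point.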
[ ]

4.a) Run BlackBoxUQ scoring on pre-sampled responses

This path requires no LLM calls, so it is the fastest way to test-drive UQLM.
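
To see why no LLM call is needed, consider a simplified stand-in for a black-box consistency scorer: the fraction of sampled responses that agree with the original response. This is not UQLM's implementation; the normalization rule and the function names are assumptions made for this sketch.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip(".!?")


def exact_match_confidence(response, sampled_responses):
    """Fraction of sampled responses that match the original response."""
    if not sampled_responses:
        raise ValueError("sampled_responses must not be empty")
    target = normalize(response)
    matches = sum(normalize(sample) == target for sample in sampled_responses)
    return matches / len(sampled_responses)


score = exact_match_confidence("Paris.", ["paris", "Paris.", "Lyon."])
print(f"confidence: {score:.2f}")
```

Everything here operates on strings that already exist in the dataset, which is what makes the pre-sampled path cheap to run.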

[ ]

4.b) (Optional) Generate-and-score with your LLM client

If you provide an llm client to UQLM (e.g., OpenAI/Anthropic), set num_responses > 0 to have UQLM sample K responses per prompt and score on the fly.

⚠️ Note: Replace the placeholder MyLLMClient with a real client that UQLM supports.
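
The sample-then-score flow can be illustrated with a stub client, as a rough sketch rather than UQLM's own code path. StubLLM, its agenerate method, and sample_and_score are hypothetical names; the agreement rate used as the score is a simplified placeholder for UQLM's scorers.

```python
import asyncio
import random


class StubLLM:
    """Hypothetical async client that samples from canned answers."""

    def __init__(self, canned, seed=0):
        self._canned = canned
        self._rng = random.Random(seed)

    async def agenerate(self, prompt):
        await asyncio.sleep(0)  # stands in for network latency
        return self._rng.choice(self._canned[prompt])


async def sample_and_score(prompts, llm, num_responses=5):
    """For each prompt, get one response plus K samples and score agreement."""
    if num_responses <= 0:
        raise ValueError("num_responses must be > 0")
    rows = []
    for prompt in prompts:
        response = await llm.agenerate(prompt)
        samples = await asyncio.gather(
            *(llm.agenerate(prompt) for _ in range(num_responses))
        )
        agreement = sum(s == response for s in samples) / num_responses
        rows.append({
            "input": prompt,
            "output": response,
            "sampled_responses": list(samples),
            "confidence": agreement,
        })
    return rows


if __name__ == "__main__":
    llm = StubLLM({"What is the capital of France?": ["Paris.", "Paris.", "Lyon."]})
    print(asyncio.run(sample_and_score(["What is the capital of France?"], llm)))
```

Inside a notebook, where an event loop is already running, call await sample_and_score(...) directly instead of asyncio.run.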

[ ]

4.c) (Optional) WhiteBox scoring (token-level logprobs)

If your client returns token logprobs, UQLM can compute white-box signals like minimum token probability. Replace the placeholder client below with a real logprob-capable client and set USE_WHITEBOX=True.
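
As a minimal sketch of what such white-box signals look like, the functions below compute the minimum token probability and a length-normalized probability from a list of token log-probabilities. The function names and the example values are assumptions for illustration, not UQLM's API.

```python
import math


def min_token_probability(token_logprobs):
    """Probability of the least likely token in the response."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(min(token_logprobs))


def normalized_probability(token_logprobs):
    """Geometric mean of token probabilities (length-normalized)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Illustrative log-probabilities for a three-token response.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
print(f"min token probability: {min_token_probability(logprobs):.3f}")
print(f"normalized probability: {normalized_probability(logprobs):.3f}")
```

The minimum is sensitive to a single uncertain token, while the normalized form smooths over response length; both fall in [0, 1] and can feed the same ensemble as the black-box scores.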

[ ]