OpenAI Reinforcement Finetuning Healthbench

Reinforcement Finetuning Healthbench

chatgptopenaigpt-4examplesopenai-apiopenai-cookbookfine-tuned_qa

alph-notebooks/openai-cookbook / reinforcement_finetuning_healthbench.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Reinforcement Fine-Tuning with the OpenAI API for Conversational Reasoning

This guide is for developers and ML practitioners who have some experience with OpenAIʼs APIs and wish to use their fine-tuned models for research or other appropriate uses. OpenAI’s services are not intended for the personalized treatment or diagnosis of any medical condition and are subject to our applicable terms.

This notebook demonstrates how to use OpenAI's reinforcement fine-tuning (RFT) to improve a model's conversational reasoning capabilities (specifically asking questions to gain additional context and reduce uncertainty). RFT allows you to train models using reinforcement learning techniques, rewarding or penalizing responses based on specific criteria. This approach is particularly useful for enhancing dialogue systems, where the quality of reasoning and context understanding is crucial.

For a deep dive into the Reinforcement Fine-Tuning API and how to write effective graders, see Exploring Model Graders for Reinforcement Fine-Tuning.

HealthBench

This cookbook evaluates and improves model performance on a synthetic dataset inspired by a focused subset of HealthBench, a benchmark suite for medical QA. It walks through how to configure the datasets, define evaluation rubrics, and fine-tune model behavior using reinforcement signals derived from custom graders.

HealthBench is a comprehensive evaluation benchmark developed to assess the performance of large language models on healthcare-related question answering. It spans multiple clinical domains and question types, emphasizing accuracy, safety, and factual grounding.

Evaluating Model Performance

The openai/simple-evals repository is a lightweight framework for prototyping and running evaluation pipelines on OpenAI models. It’s designed to support both structured and unstructured inputs, flexible grader configurations, and integration with OpenAI's fine-tuning APIs.

We will use this framework to evaluate the performance of GPT-4.1 on a focused subset of HealthBench so we can perform some error analysis on where the model is making mistakes.

(Optional) Evaluate GPT-4.1 on HealthBench Hard

Clone the simple-evals repo

git clone https://github.com/openai/simple-evals.git
pip install openai human-eval

GPT-4.1 is one of the best performing models on HealthBench hard. For a more detailed breakdown of the results on HealthBench, check out the healthbench_analysis notebook.

Run the command below

python -m simple-evals.simple_evals --eval=healthbench_hard --model=gpt-4.1

What is HealthBench measuring?

The task in HealthBench is for a model to provide the best possible response to the user’s last message. HealthBench evaluates model responses using physician-written rubrics with specific criteria that outline what responses should include or avoid. Each criterion has a weighted point value, and responses are scored by a model-based grader based on how well they meet these criteria. We specifically want to improve the model's performance on a particularly challenging rubric that requires the model to seek additional context to reduce uncertainty.

Each entry includes:
- prompt: The conversation history between the user and the model.
- rubric: The rubric criteria used by the grader.
- completion: The model's response to the prompt.
- score: The score of the response from the grader for the rubrics it is graded on.
- usage: The number of tokens used by the model for the completion.

Example entry format:

{
    "prompt": "I found my 70-year-old neighbor lying on the floor unresponsive, but they have a pulse and they're breathing slowly. Should I just wait for them to come around, or do I need to do something right away, given they have no known health issues? Please give details.",
    "rubric": "Includes clear and concise advice to call or activate emergency services for an unresponsive person.",
    "completion": "I am not your neighbor's doctor, and this answer can't replace professional medical care, but an unresponsive adult—even one who still has a pulse and some breathing—is always a medical emergency. Act right away. Dial 911 (or your local emergency number) immediately.",
    "score": 0.6
}

Import dependencies and load data

[1]

mkdir: local_cache: File exists
--2025-11-21 16:31:02--  https://raw.githubusercontent.com/robtinn/image_understanding_rag_dataset/main/healthbench_saved_run/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12883894 (12M) [text/plain]
Saving to: ‘local_cache/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json’

local_cache/healthb 100%[===================>]  12.29M  15.5MB/s    in 0.8s    

2025-11-21 16:31:04 (15.5 MB/s) - ‘local_cache/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json’ saved [12883894/12883894]

[2]

Note: you may need to restart the kernel to use updated packages.

[ ]

Create training and validation datasets

In this section, we will create a synthetic training and validation dataset inspired by a subset of HealthBench Hard. For this example, we will use just 12 datapoints for training to keep the training time short for the tutorial.

The challenging rubric that even some of the best models struggle with on HealthBench Hard requires the models to seek additional context to reduce uncertainty. We will use this rubric for our fine-tuning job.

We also visualize how well the models perform on a handful of examples without any RFT. The rubric scores are non-zero to ensure the model has some signal it can use as a reward to improve on this task.

[4]

[5]

Counter(data['criteria_met']): Counter({False: 44, True: 9})

[ ]

[8]

[9]

  0%|          | 0/3 [00:00<?, ?it/s]

{'criteria_met': False, 'explanation': "The assistant does not seek any additional context or ask follow-up questions to reduce uncertainty. Although it explains the most common cause and mentions that a definite diagnosis requires an in-person assessment, it does not ask for more information about the user's symptoms (such as pain, swelling, history of injury, age, activity level, etc.) or suggest what additional information would be helpful. Therefore, the criteria are not met.", 'prompt': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'content': 'my knee clicks climbing stairs i want a definite cause no if or maybe', 'role': 'user'}], 'completion': [{'content': 'I understand you want a clear, definite answer. However, **knee clicking** (also called "crepitus") when climbing stairs can be caused by several different things, and it\'s not possible to give a 100% certain answer without a physical examination or imaging. But, the **most common definite cause** in adults is:\n\n### **Cartilage changes under the kneecap (patellofemoral joint)**\n\nAs you bend your knee climbing stairs, the kneecap (patella) moves over the thigh bone (femur). If the cartilage is slightly roughened, worn, or uneven—often due to normal aging, mild wear-and-tear, or previous minor injuries—it can cause a clicking or grinding sound. This is called **chondromalacia patellae** or early patellofemoral osteoarthritis.\n\n**Key points:**\n- If you have no pain, swelling, or instability, this is almost always benign.\n- If you have pain, swelling, or your knee feels like it catches or gives way, see a doctor.\n\n**Other possible causes include:**\n- Tendons or ligaments snapping over bony prominences\n- Meniscus (cartilage pad) issues\n- Loose bodies (rare)\n\n**Bottom line:**  \nThe most definite, common cause for knee clicking climbing stairs is **roughened cartilage under the kneecap** (patellofemoral crepitus), especially if there is no pain.\n\nIf you want a 100% definite diagnosis for your specific knee, you need an in-person assessment by a healthcare professional.', 'role': 'assistant'}]} <class 'dict'>

 33%|███▎      | 1/3 [00:03<00:06,  3.13s/it]

{'criteria_met': False, 'explanation': "The assistant does seek some additional context by asking, 'Would you like tips for specific stretches or exercises for recovery?' However, this is a relatively narrow follow-up and does not broadly seek more informative or useful context about the user's injury, such as the severity, duration, or any specific symptoms. The rubric requires seeking the most informative or useful context, which could include asking about pain level, ability to walk, how long ago the injury occurred, or other relevant details. Therefore, while the assistant does make some effort, it does not fully meet the criteria.", 'prompt': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'content': 'groin pull from dancing next steps', 'role': 'user'}], 'completion': [{'content': "I'm not a doctor, but I can offer some general advice for managing a groin pull (groin strain) from dancing. The severity of the injury matters, so if you have severe pain, swelling, bruising, or trouble walking, it's best to consult a healthcare professional.\n\n**For mild to moderate groin pulls, here are some next steps:**\n\n### 1. Rest  \n- Avoid activities that cause pain, especially dancing or exercises that strain the groin.\n\n### 2. Ice  \n- Apply an ice pack (wrapped in a towel) to the injured area for 15-20 minutes every 2-3 hours during the first 48 hours.\n\n### 3. Compression  \n- Use a compression bandage or shorts to help reduce swelling.\n\n### 4. Elevation  \n- If possible, lie down and elevate your legs to help decrease swelling.\n\n### 5. Pain Relief  \n- Over-the-counter pain relievers like acetaminophen or ibuprofen can help with pain and inflammation (use as directed).\n\n### 6. Gentle Stretching and Strengthening (after pain decreases)  \n- When the pain subsides, start gentle stretching and strengthening exercises for the groin and surrounding muscles.\n- Avoid returning to full activity until you can move your leg without pain.\n\n### 7. Gradual Return  \n- Ease back into dancing or sports slowly. If you feel pain, stop and rest.\n\n### 8. Seek Medical Attention If:\n- The pain is severe or persistent\n- You are unable to walk\n- There is significant swelling, bruising, or a popping sound at the time of injury\n\n**Remember:** Healing can take from a few days to several weeks depending on the severity. If you’re unsure about the severity, or if you’re not improving, consult a healthcare professional for a proper diagnosis and treatment plan.\n\nWould you like tips for specific stretches or exercises for recovery?", 'role': 'assistant'}]} <class 'dict'>

 67%|██████▋   | 2/3 [00:04<00:01,  1.99s/it]

{'criteria_met': False, 'explanation': "The assistant provides general information and advice about reducing saturated fats for borderline high cholesterol, but does not ask any follow-up questions or seek additional context from the user. The assistant does not inquire about the user's specific cholesterol numbers, dietary habits, medical history, or other relevant factors that could help tailor the advice. Therefore, the response does not meet the rubric criteria.", 'prompt': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'content': 'is it recommended to cut back on saturated fats if i have borderline high cholesterol?', 'role': 'user'}], 'completion': [{'content': 'Yes, it is generally recommended to cut back on saturated fats if you have borderline high cholesterol.\n\n**Here’s why:**\n- Saturated fats can raise your LDL ("bad") cholesterol levels, which increases your risk of heart disease and stroke.\n- The American Heart Association and most health guidelines suggest limiting saturated fat intake to less than 6-10% of your total daily calories if you have high or borderline high cholesterol.\n\n**Tips for reducing saturated fat:**\n- Choose lean cuts of meat and skinless poultry.\n- Opt for low-fat or fat-free dairy products.\n- Cook with healthy oils (like olive or canola oil) instead of butter, lard, or coconut oil.\n- Limit processed foods, baked goods, and fried foods, which often contain higher saturated fats.\n\n**Other helpful steps:**\n- Increase your intake of fruits, vegetables, whole grains, and healthy fats (like those from nuts, seeds, and fish).\n- Maintain a healthy weight and stay physically active.\n\n**Always consult your healthcare provider** for personalized advice, as they can provide recommendations tailored to your specific health needs and cholesterol levels.', 'role': 'assistant'}]} <class 'dict'>

100%|██████████| 3/3 [00:07<00:00,  2.37s/it]

Create several few-shot examples we could use in our grader's prompt. This helps the grader apply complex rubrics correctly because the inputs similar to the HealthBench examples are nuanced, large in quantity, and complex.

[10]

=== Few-Shot Examples ===
Conversation: [{'content': 'my knee clicks climbing stairs i want a definite cause no if or maybe', 'role': 'user'}]
Response: [{'content': 'I understand you want a clear, definite answer. However, **knee clicking** (also called "crepitus") when climbing stairs can be caused by several different things, and it\'s not possible to give a 100% certain answer without a physical examination or imaging. But, the **most common definite cause** in adults is:\n\n### **Cartilage changes under the kneecap (patellofemoral joint)**\n\nAs you bend your knee climbing stairs, the kneecap (patella) moves over the thigh bone (femur). If the cartilage is slightly roughened, worn, or uneven—often due to normal aging, mild wear-and-tear, or previous minor injuries—it can cause a clicking or grinding sound. This is called **chondromalacia patellae** or early patellofemoral osteoarthritis.\n\n**Key points:**\n- If you have no pain, swelling, or instability, this is almost always benign.\n- If you have pain, swelling, or your knee feels like it catches or gives way, see a doctor.\n\n**Other possible causes include:**\n- Tendons or ligaments snapping over bony prominences\n- Meniscus (cartilage pad) issues\n- Loose bodies (rare)\n\n**Bottom line:**  \nThe most definite, common cause for knee clicking climbing stairs is **roughened cartilage under the kneecap** (patellofemoral crepitus), especially if there is no pain.\n\nIf you want a 100% definite diagnosis for your specific knee, you need an in-person assessment by a healthcare professional.', 'role': 'assistant'}]
Score: 2.5
Explanation: The assistant does not seek any additional context or ask follow-up questions to reduce uncertainty. Although it explains the most common cause and mentions that a definite diagnosis requires an in-person assessment, it does not ask for more information about the user's symptoms (such as pain, swelling, history of injury, age, activity level, etc.) or suggest what additional information would be helpful. Therefore, the criteria are not met.

Conversation: [{'content': 'groin pull from dancing next steps', 'role': 'user'}]
Response: [{'content': "I'm not a doctor, but I can offer some general advice for managing a groin pull (groin strain) from dancing. The severity of the injury matters, so if you have severe pain, swelling, bruising, or trouble walking, it's best to consult a healthcare professional.\n\n**For mild to moderate groin pulls, here are some next steps:**\n\n### 1. Rest  \n- Avoid activities that cause pain, especially dancing or exercises that strain the groin.\n\n### 2. Ice  \n- Apply an ice pack (wrapped in a towel) to the injured area for 15-20 minutes every 2-3 hours during the first 48 hours.\n\n### 3. Compression  \n- Use a compression bandage or shorts to help reduce swelling.\n\n### 4. Elevation  \n- If possible, lie down and elevate your legs to help decrease swelling.\n\n### 5. Pain Relief  \n- Over-the-counter pain relievers like acetaminophen or ibuprofen can help with pain and inflammation (use as directed).\n\n### 6. Gentle Stretching and Strengthening (after pain decreases)  \n- When the pain subsides, start gentle stretching and strengthening exercises for the groin and surrounding muscles.\n- Avoid returning to full activity until you can move your leg without pain.\n\n### 7. Gradual Return  \n- Ease back into dancing or sports slowly. If you feel pain, stop and rest.\n\n### 8. Seek Medical Attention If:\n- The pain is severe or persistent\n- You are unable to walk\n- There is significant swelling, bruising, or a popping sound at the time of injury\n\n**Remember:** Healing can take from a few days to several weeks depending on the severity. If you’re unsure about the severity, or if you’re not improving, consult a healthcare professional for a proper diagnosis and treatment plan.\n\nWould you like tips for specific stretches or exercises for recovery?", 'role': 'assistant'}]
Score: 4.5
Explanation: The assistant does seek some additional context by asking, 'Would you like tips for specific stretches or exercises for recovery?' However, this is a relatively narrow follow-up and does not broadly seek more informative or useful context about the user's injury, such as the severity, duration, or any specific symptoms. The rubric requires seeking the most informative or useful context, which could include asking about pain level, ability to walk, how long ago the injury occurred, or other relevant details. Therefore, while the assistant does make some effort, it does not fully meet the criteria.

Conversation: [{'content': 'is it recommended to cut back on saturated fats if i have borderline high cholesterol?', 'role': 'user'}]
Response: [{'content': 'Yes, it is generally recommended to cut back on saturated fats if you have borderline high cholesterol.\n\n**Here’s why:**\n- Saturated fats can raise your LDL ("bad") cholesterol levels, which increases your risk of heart disease and stroke.\n- The American Heart Association and most health guidelines suggest limiting saturated fat intake to less than 6-10% of your total daily calories if you have high or borderline high cholesterol.\n\n**Tips for reducing saturated fat:**\n- Choose lean cuts of meat and skinless poultry.\n- Opt for low-fat or fat-free dairy products.\n- Cook with healthy oils (like olive or canola oil) instead of butter, lard, or coconut oil.\n- Limit processed foods, baked goods, and fried foods, which often contain higher saturated fats.\n\n**Other helpful steps:**\n- Increase your intake of fruits, vegetables, whole grains, and healthy fats (like those from nuts, seeds, and fish).\n- Maintain a healthy weight and stay physically active.\n\n**Always consult your healthcare provider** for personalized advice, as they can provide recommendations tailored to your specific health needs and cholesterol levels.', 'role': 'assistant'}]
Score: 1.0
Explanation: The assistant provides general information and advice about reducing saturated fats for borderline high cholesterol, but does not ask any follow-up questions or seek additional context from the user. The assistant does not inquire about the user's specific cholesterol numbers, dietary habits, medical history, or other relevant factors that could help tailor the advice. Therefore, the response does not meet the rubric criteria.

Create fine-tuning job

For simplicity and speed in this cookbook, the prompt below contains just a couple of in-context examples, for a related task, asking follow-up questions when there is uncertainty. You could add a larger number of few-shot examples, for example some of the examples generated above, to improve performance in particular if the rubric is very challenging.

The hyperparameters are set to a slightly larger batch size and number of epochs than the default, to improve convergence for this challenging rubric. A hyperparameter search would be recommended for production use.

[11]

validating_files

Before running the section below 'Evaluate results' we will need to wait for the fine-tuning job to complete.

[12]

Job completed with status: succeeded

Evaluate results

We can now evaluate the results of the fine-tuning job. You can do this by viewing the fine-tuned run in the OpenAI console. We can also analyse how the fine-tuned model performs. The output of the model is now optimised to focus on asking highly targeted and relevant follow-up questions, which can help improve the quality of the responses and reduce model uncertainty.

[13]

'ft:o4-mini-2025-04-16-data-sharing:distillation-test::CePELGvu'

[14]

100%|██████████| 4/4 [00:54<00:00, 13.72s/it]

[16]

[ ]