
Cline Prompt Learning Optimization on SWE-bench - Act Mode

This notebook demonstrates how we used Prompt Learning to optimize Cline's performance on the SWE-bench dataset in Act Mode. Cline is a popular and powerful open-source coding agent. We aim to improve its performance on SWE-bench by optimizing its rules: user-specified instructions that Cline appends to its system prompt.

More on Cline

More on Prompt Learning

Act Mode - Real Code Execution

Unlike Plan Mode, this notebook runs Cline in Act Mode, where Cline actually edits the codebase and generates patches. We then run the SWE-bench tests to compute a definitive accuracy score based on whether Cline made the correct edits. This provides a ground-truth evaluation of Cline's performance.

In Act Mode, Cline:

  1. Analyzes the problem statement
  2. Explores the codebase
  3. Makes actual code edits
  4. Generates patches
  5. Has its patches validated against the SWE-bench test suite

Setup

Please visit README.md and complete all the setup steps before running this notebook!

Important Note

Running this notebook is computationally intensive and expensive, as it involves:

  • Multiple API calls to Claude for each SWE-bench instance
  • Actually cloning repositories and running tests in isolated environments
  • Running the SWE-bench harness to validate patches

Consider adjusting the training and test set sizes based on your requirements, budget constraints, and computational resources.


API Keys

Set up your API keys for OpenAI, Anthropic, and Arize. If not already in your environment, you'll be prompted to enter them.
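This step can be sketched with a small helper (the helper name is ours, not from the notebook), which prompts only for keys that are not already set in the environment:

```python
import os
from getpass import getpass

def ensure_api_keys(names):
    """Prompt for any keys missing from the environment (hypothetical helper)."""
    for name in names:
        if not os.environ.get(name):
            os.environ[name] = getpass(f"Enter {name}: ")
    # Report which keys are now available
    return {name: bool(os.environ.get(name)) for name in names}

# e.g. ensure_api_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "ARIZE_API_KEY"])
```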


Configuration

  • LOOPS: number of Prompt Learning loops, i.e., how many times you want to optimize your prompt.
  • TRAIN_SIZE: size of training set.
  • TEST_SIZE: size of test set.
  • WORKERS: number of parallel workers for the SWE-bench harness. Set this relative to your machine's capabilities and your Claude rate limits.
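Concretely, these might be set as plain constants; the values below are illustrative, not the ones we used:

```python
# Illustrative values; tune for your budget, rate limits, and hardware.
LOOPS = 3        # number of Prompt Learning optimization iterations
TRAIN_SIZE = 30  # SWE-bench instances used to optimize the ruleset
TEST_SIZE = 20   # held-out instances used to measure generalization
WORKERS = 4      # parallel SWE-bench workers; keep within your Claude rate limits
```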

Cline Environment Configuration

Set environment variables for Cline to run properly in Act Mode.
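For example (the variable names below are placeholders; consult your Cline/harness setup for the exact ones):

```python
import os

# Placeholder environment settings; the actual variable names depend on
# how your Cline harness is configured.
os.environ["CLINE_MODE"] = "act"                            # hypothetical: run in Act Mode
os.environ["CLINE_WORKSPACE"] = "/tmp/swebench_workspaces"  # hypothetical: where repos are cloned
```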


Train/Test Datasets

This code splits SWE-bench Lite into train and test sets.

The train set will be used to optimize the ruleset, while the test set will be used to measure the success of optimized rulesets.
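A sketch of such a split. SWE-bench Lite itself can be loaded with the Hugging Face `datasets` library via `load_dataset("princeton-nlp/SWE-bench_Lite", split="test")`; we omit the download here to keep the example self-contained:

```python
import random

def split_instances(instances, train_size, test_size, seed=42):
    """Deterministically split SWE-bench instances into disjoint train/test sets."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    train = shuffled[:train_size]
    test = shuffled[train_size:train_size + test_size]
    return train, test
```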


Upload Datasets to Arize

Upload datasets to Arize for experiment tracking and visualization.


Helper: Log Experiments to Arize

This helper function logs experiment results to Arize, allowing us to visualize and track optimization progress across iterations.
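The core of such a helper is assembling per-instance results into a table before upload; the schema below is an assumption for illustration, not the notebook's exact one:

```python
import pandas as pd

def build_experiment_frame(results, iteration):
    """Collect per-instance outcomes into a DataFrame for logging (assumed schema)."""
    rows = [
        {
            "iteration": iteration,
            "instance_id": r["instance_id"],
            "resolved": r["resolved"],          # pass/fail from the SWE-bench harness
            "feedback": r.get("feedback", ""),  # LLM-as-judge commentary, if any
        }
        for r in results
    ]
    return pd.DataFrame(rows)
```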


Ruleset Optimization Loop

This is the main optimization loop. For each iteration:

  1. Run Cline in Act Mode on training set with the current ruleset, generating actual code patches
  2. Run Cline in Act Mode on test set with the current ruleset to measure generalization
  3. Run SWE-bench tests to validate patches and compute pass/fail metrics
  4. Evaluate results using LLM-as-judge to provide detailed feedback on patch quality
  5. Optimize the ruleset using Prompt Learning based on training results and feedback
  6. Save results and rulesets for tracking and analysis

The optimization loop uses actual test execution results (pass/fail) as ground truth, combined with LLM evaluator feedback to iteratively improve the ruleset.
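The loop's control flow can be sketched with the expensive steps abstracted as callables; all names here are our stand-ins, not the notebook's:

```python
def optimize_ruleset(ruleset, loops, run_train, run_test, optimize, save=None):
    """Skeleton of the Prompt Learning loop; callables stand in for the real steps."""
    history = []
    for i in range(loops):
        train_results = run_train(ruleset)   # Act Mode + SWE-bench harness on train set
        test_results = run_test(ruleset)     # generalization check on held-out test set
        history.append({"iteration": i, "ruleset": ruleset,
                        "train": train_results, "test": test_results})
        ruleset = optimize(ruleset, train_results)  # Prompt Learning update from feedback
        if save:
            save(i, ruleset, history[-1])    # persist results and rulesets
    return ruleset, history
```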
