
Cline Prompt Learning Optimization on SWE-bench - Act Mode

This notebook demonstrates how we used Prompt Learning to optimize Cline's performance on the SWE-bench dataset in Act Mode. Cline is a popular and powerful open-source coding agent. We aim to improve its performance on SWE-bench by optimizing its rules: user-specified instructions that Cline appends to its system prompt.

More on Cline

More on Prompt Learning

Act Mode - Real Code Execution

Unlike Plan Mode, this notebook runs Cline in Act Mode, where Cline actually edits the codebase and generates patches. We then run the SWE-bench tests to compute a definitive accuracy score based on whether Cline made the correct edits. This provides a ground-truth evaluation of Cline's performance.

In Act Mode, Cline:

  1. Analyzes the problem statement
  2. Explores the codebase
  3. Makes actual code edits
  4. Generates patches
  5. Has its patches validated against the SWE-bench test suite

Setup

Please visit README.md and complete all the setup steps before running this notebook!

Important Note

Running this notebook is computationally intensive and expensive, as it involves:

  • Multiple API calls to Claude for each SWE-bench instance
  • Actually cloning repositories and running tests in isolated environments
  • Running the SWE-bench harness to validate patches

Consider adjusting the training and test set sizes based on your requirements, budget constraints, and computational resources.


API Keys

Set up your API keys for OpenAI, Anthropic, and Arize. If not already in your environment, you'll be prompted to enter them.
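This step can be sketched with a small helper (the helper name is ours, not from the notebook), which prompts only for keys that are not already set in the environment:

```python
import os
from getpass import getpass

def ensure_api_keys(names):
    """Prompt for any keys missing from the environment (hypothetical helper)."""
    for name in names:
        if not os.environ.get(name):
            os.environ[name] = getpass(f"Enter {name}: ")
    # Report which keys are now available
    return {name: bool(os.environ.get(name)) for name in names}

# e.g. ensure_api_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "ARIZE_API_KEY"])
```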


Configuration

  • LOOPS: number of Prompt Learning loops, i.e., how many times you want to optimize your prompt.
  • TRAIN_SIZE: size of training set.
  • TEST_SIZE: size of test set.
  • WORKERS: number of parallel workers for the SWE-bench harness. Set this relative to your machine's capabilities and your Claude rate limits.
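Concretely, these might be set as plain constants; the values below are illustrative, not the ones we used:

```python
# Illustrative values; tune for your budget, rate limits, and hardware.
LOOPS = 3        # number of Prompt Learning optimization iterations
TRAIN_SIZE = 30  # SWE-bench instances used to optimize the ruleset
TEST_SIZE = 20   # held-out instances used to measure generalization
WORKERS = 4      # parallel SWE-bench workers; keep within your Claude rate limits
```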

Cline Environment Configuration

Set environment variables for Cline to run properly in Act Mode.
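For example (the variable names below are placeholders; consult your Cline/harness setup for the exact ones):

```python
import os

# Placeholder environment settings; the actual variable names depend on
# how your Cline harness is configured.
os.environ["CLINE_MODE"] = "act"                            # hypothetical: run in Act Mode
os.environ["CLINE_WORKSPACE"] = "/tmp/swebench_workspaces"  # hypothetical: where repos are cloned
```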


Train/Test Datasets

This code splits SWE-bench Lite into train and test sets.

The train set will be used to optimize the ruleset, while the test set will be used to measure the success of optimized rulesets.
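A sketch of such a split. SWE-bench Lite itself can be loaded with the Hugging Face `datasets` library via `load_dataset("princeton-nlp/SWE-bench_Lite", split="test")`; we omit the download here to keep the example self-contained:

```python
import random

def split_instances(instances, train_size, test_size, seed=42):
    """Deterministically split SWE-bench instances into disjoint train/test sets."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    train = shuffled[:train_size]
    test = shuffled[train_size:train_size + test_size]
    return train, test
```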


Upload Datasets to Arize

Upload datasets to Arize for experiment tracking and visualization.


Helper: Log Experiments to Arize

This helper function logs experiment results to Arize, allowing us to visualize and track optimization progress across iterations.
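The core of such a helper is assembling per-instance results into a table before upload; the schema below is an assumption for illustration, not the notebook's exact one:

```python
import pandas as pd

def build_experiment_frame(results, iteration):
    """Collect per-instance outcomes into a DataFrame for logging (assumed schema)."""
    rows = [
        {
            "iteration": iteration,
            "instance_id": r["instance_id"],
            "resolved": r["resolved"],          # pass/fail from the SWE-bench harness
            "feedback": r.get("feedback", ""),  # LLM-as-judge commentary, if any
        }
        for r in results
    ]
    return pd.DataFrame(rows)
```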


Ruleset Optimization Loop

This is the main optimization loop. For each iteration:

  1. Run Cline in Act Mode on training set with the current ruleset, generating actual code patches
  2. Run Cline in Act Mode on test set with the current ruleset to measure generalization
  3. Run SWE-bench tests to validate patches and compute pass/fail metrics
  4. Evaluate results using LLM-as-judge to provide detailed feedback on patch quality
  5. Optimize the ruleset using Prompt Learning based on training results and feedback
  6. Save results and rulesets for tracking and analysis

The optimization loop uses actual test execution results (pass/fail) as ground truth, combined with LLM evaluator feedback to iteratively improve the ruleset.
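The loop's control flow can be sketched with the expensive steps abstracted as callables; all names here are our stand-ins, not the notebook's:

```python
def optimize_ruleset(ruleset, loops, run_train, run_test, optimize, save=None):
    """Skeleton of the Prompt Learning loop; callables stand in for the real steps."""
    history = []
    for i in range(loops):
        train_results = run_train(ruleset)   # Act Mode + SWE-bench harness on train set
        test_results = run_test(ruleset)     # generalization check on held-out test set
        history.append({"iteration": i, "ruleset": ruleset,
                        "train": train_results, "test": test_results})
        ruleset = optimize(ruleset, train_results)  # Prompt Learning update from feedback
        if save:
            save(i, ruleset, history[-1])    # persist results and rulesets
    return ruleset, history
```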
