
Week 1

1. Background and Objective

  • We aim to compare self-evolving/self-optimizing systems (GEPA, ACE) under different levels of inner-loop search/evaluation intensity, focusing on their performance–cost–compute scaling behavior.

The key question: how should we define the horizontal axis for scaling?

Initial candidates:

  • Number of rollouts
  • Number of iterations
  • Token usage
  • Total cost (USD)
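Whichever axis is ultimately chosen, it is cheap to record all four candidates per run so that the same results can later be re-plotted against any of them. Below is a minimal logging sketch; the RunBudget name and its fields are my own bookkeeping, not part of GEPA or ACE:

from dataclasses import dataclass

@dataclass
class RunBudget:
    # Per-run record of every candidate x-axis, so scaling curves can be
    # re-plotted against any of them without rerunning experiments.
    rollouts: int     # rollouts executed by the inner loop
    iterations: int   # optimizer iterations (definition is framework-specific)
    tokens: int       # total input + output tokens
    cost_usd: float   # total API spend in USD

# Hypothetical record for one run (numbers are illustrative only):
# budget = RunBudget(rollouts=300, iterations=10, tokens=1_200_000, cost_usd=4.75)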

2. Current GEPA Experiments and Key Findings

Model and Configurations: gpt-4o-mini, run as Baseline, GEPA-5, GEPA-10, GEPA-15, and GEPA-20 (the number indicates the iteration budget)

Evaluation Datasets: HotpotQA, HoVer (consistent with default script settings)
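Since the GEPA configurations differ only in the iteration budget, the sweep amounts to a small loop over budgets and benchmarks. The run_gepa helper below is a placeholder for however the default script is actually launched, not its real interface:

def run_gepa(model: str, benchmark: str, max_iterations: int) -> dict:
    # Placeholder body: invoke the default GEPA script with the given
    # iteration budget and return whatever scores/costs it reports.
    return {}

for benchmark in ("hotpotqa", "hover"):
    for budget in (5, 10, 15, 20):  # GEPA-5 / GEPA-10 / GEPA-15 / GEPA-20
        results = run_gepa(model="gpt-4o-mini", benchmark=benchmark,
                           max_iterations=budget)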

Performance Scaling (figure)

Cost Scaling (figure)

Performance-Cost Tradeoff (figure)

Conclusions

Costs remain roughly constant while performance improves with more iterations.

  • GEPA-10/15/20 show similar API spending, but performance continues to increase.
  • The number of external API calls increases, but total output tokens do not rise significantly.
  • Iteration count could be a candidate, but its definition varies across frameworks.
  • The GEPA paper uses rollouts as its metric, but rollout semantics differ by architecture or task.

The number of metric evaluations may serve as a better scaling indicator: every self-evolving agent must call a metric function at some point, so the count is directly comparable across different agent architectures. A single metric call looks roughly like:

def one_metric_call(program, example, metric_fn):
    # Step 1: Run the program on the example (may internally call GPT 2–3 times)
    output = program(example.question)
    # Step 2: Evaluate the output (the metric itself may call GPT once more)
    score = metric_fn(example, output)
    return score  # counted as 1 metric call
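If metric evaluations become the x-axis, the count can be collected the same way in every framework by wrapping the metric before it is handed to the optimizer. The sketch below is illustrative only; the MetricCallCounter name, the exact_match_metric placeholder, and the commented-out optimizer call are assumptions rather than the actual GEPA/ACE API.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MetricCallCounter:
    # Wraps any metric function so every evaluation is counted exactly once,
    # giving a framework-agnostic x-coordinate for scaling plots.
    metric_fn: Callable[[Any, Any], float]
    calls: int = 0

    def __call__(self, example: Any, output: Any) -> float:
        self.calls += 1
        return self.metric_fn(example, output)

# Hypothetical usage: pass the counter wherever the framework expects a metric,
# then read .calls once optimization finishes.
# counter = MetricCallCounter(exact_match_metric)          # metric name is assumed
# optimized = optimizer.compile(program, metric=counter)   # optimizer API is assumed
# print("metric evaluations:", counter.calls)

Counting at the metric boundary also avoids double-counting the GPT calls that program() makes internally.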

3. Current Issue

At present, I’m encountering a resource access issue on the Helios4 cluster.
It appears that the nodes are continuously occupied by other lab members, making it extremely difficult to find an available time slot for running experiments.
Currently, the experiments on the two benchmarks are being conducted using my personal OpenAI API key, but this setup is not sustainable for continued large-scale experimentation.

Therefore, I would like to ask:

  • Whether it's possible to grant access to additional compute nodes, or
  • If there exists a dashboard or monitoring tool to better visualize GPU availability and idle times.

Adam kindly provided me with access credentials for Helios3, but for some reason, I still haven’t been able to log in successfully.
He might be quite busy these days, so the issue hasn’t been resolved yet.