
Tutorial: Building, Running, and Publishing a Custom LLM Evaluation

by Peter de Blanc + ChatGPT Deep Research

Evaluating large language models (LLMs) on novel tasks (like game-playing) requires careful planning. This tutorial will guide you through designing a good evaluation ("eval"), preparing data, writing and running the eval, and sharing your results. We assume you have a GitHub account, basic programming knowledge, and familiarity with LLMs. By the end, you should be able to build a custom eval (e.g. testing an LLM's Go-playing skill using KataGo data) and publish it for others.

1. Design Principles of Good Evals

What Makes a Good Eval? A well-designed eval should provide meaningful insights about model performance. Generally, a good eval is one that:

  • Targets Key Outcomes: It covers the most important outcomes or abilities for your application. Focus on capabilities or failure modes that truly matter.
  • Uses Few, Clear Metrics: It uses a small number of interpretable metrics. Simpler metrics (accuracy, win-rate, etc.) are easier to understand and act on.
  • Is Automated and Fast: Running the eval should be fast and automatic, ideally without requiring manual checking. This enables frequent regression testing.
  • Is Diverse & Representative: It tests the model on a diverse, representative dataset of scenarios. This ensures the eval isn't too narrow and reflects real-world inputs.
  • Correlates with Human Judgment: The results of the eval should align with human evaluations of quality. If humans think model outputs are good, the eval metrics should also rate them highly (and vice versa).

Single vs. Multiple Evals (Scope): Decide if you need one comprehensive eval or separate evals for different skills. A broad eval (covering many sub-tasks) can give an overall picture, but might mix multiple metrics or be harder to interpret. Conversely, multiple smaller evals let you isolate specific capabilities (for example, one eval for strategy in game play, another for mathematical reasoning). If tasks are very different or require different metrics, it’s often better to create separate evals for clarity. On the other hand, if tasks are variations of a theme, grouping them into one eval can showcase combined performance. Consider maintenance too: updating a monolithic eval vs. several targeted ones.

Task Diversity vs. Specificity: There’s a trade-off in eval design between being broad or specific. A diverse eval (covering many contexts or sub-tasks) ensures the model isn’t overfitting to a narrow pattern and can reveal general robustness. However, very diverse evals might dilute focus – a model might do well on some parts and poorly on others, making it hard to pinpoint issues. A specific eval zooms in on one capability or scenario, providing detailed insight there, but might miss other weaknesses. In practice, you should align the scope with your goals: for a general model benchmark, diversity is key, while for testing a specific feature (like Go move prediction), specificity yields more actionable feedback. You can also do both: start with specific evals to diagnose particular skills, then combine them into a diverse suite for an overall assessment.

Clear Success Criteria: Before building anything, define what success looks like. Is it 90% accuracy on a quiz? Winning 50% of games against an engine? Formulate a clear goal so that the eval’s results are meaningful.

2. Data Requirements for Custom Evals

How Much Data Do You Need? The number of examples should be enough to yield statistically meaningful results, but not so large as to be unwieldy or expensive. The required size depends on the variability of the task and the differences you expect to measure. For quick iteration or demos, even a few dozen examples might suffice. For robust benchmarks, you might need hundreds or thousands. For instance, one community chess puzzle eval used 1000 puzzles to benchmark LLMs, which gave a solid basis for comparing models. If your eval will be used to detect small performance regressions, lean towards a larger sample size to reduce noise. Remember that API costs can accumulate with large evals (OpenAI’s eval framework even warns to be mindful of API usage costs), so strike a balance between thoroughness and cost.

Data Diversity and Quality: Ensure your dataset covers a range of scenarios relevant to your task. For example, if evaluating Go moves, include positions from different openings, middle-game and endgame situations, both typical and edge-case scenarios. Diversity helps test the model’s consistency. However, all examples should be quality-checked: errors in the dataset (wrong answers, ambiguous questions) will make your eval unreliable. If possible, have human experts review data or generate data with an LLM and then filter it using a stronger model or human feedback (an approach OpenAI and others have used to improve eval quality). It’s often better to have fewer high-quality examples than many noisy ones.

Data Licensing and Ethics: Always use data you have rights to. If you collect or create the examples yourself, you can choose how to license them. Common choices for open data are Creative Commons licenses like CC BY (requires attribution) or CC0 (public domain). If you’re repurposing existing data (e.g. game records, transcripts), check the source’s license or terms of use. KataGo data, for example, is open-source (the neural network weights are MIT licensed, per the katagotraining.org Neural Network License page, and many Go game records are public domain or CC-licensed). Still, always give credit to sources and ensure you’re not violating any terms. If your eval involves sensitive content (like user data or proprietary info), you might opt to keep that data private and only share aggregate results or a sanitized version.

Preparing Data Format: Once you have raw data, convert it into a convenient format for evaluation. Most eval frameworks use JSON Lines (JSONL) or CSV for datasets. JSONL is simply a text file where each line is a JSON object representing one example. For instance, an example could be:

{"problem": "2+2=", "answer": "4"}

Each line should contain all information needed for that test case: the prompt/input and the expected output or evaluation info. We’ll discuss prompt formatting next, but think ahead about what fields you need (e.g., for a question-answer task, you might have {"question": "...", "answer": "..."} for each example).

Train vs Test in Evals: Some evals (especially those using few-shot prompting) include “train” examples in the prompt and then test the model on a “test” query. In such cases, you might prepare separate sets. For instance, the OpenAI Evals tutorial creates a tiny dataset with 2 training examples and 2 test examples to include the training examples as few-shot context (see evals/docs/custom-eval.md in the openai/evals repository on GitHub). Whether you separate train/test depends on your eval design. If each test case is independent (no in-prompt examples), you can just have one set of samples. If you plan to include some examples as part of the prompt (few-shot), you may need a way to distinguish or pair them with test prompts.
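
As a concrete sketch, here is one way to write such a dataset to a JSONL file in Python (the field names and the file name are placeholders for whatever your task needs):

import json

# Hypothetical examples; replace with your own task data.
samples = [
    {"problem": "2+2=", "answer": "4"},
    {"problem": "7*6=", "answer": "42"},
]

# JSONL: one JSON object per line.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

If you keep few-shot examples separate from test cases, writing them to a second file (e.g. a hypothetical fewshot.jsonl) in the same way keeps the split explicit.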

3. Prompt and Input Design Strategies

Designing the input prompt for each example is crucial, as it directly affects how the model responds.

Prompt Formatting: Choose a format that your target models handle well. For modern chat-based LLMs (OpenAI GPT-4, Anthropic Claude, Google Gemini, etc.), using a chat format with roles may yield the best results. In OpenAI’s eval framework, they encourage using the new chat format for prompts. For example, a chat prompt might be represented in JSON as:

{
  "messages": [
    {"role": "system", "content": "You are a Go expert."},
    {"role": "user", "content": "Here's a Go board state: <state>. What is Black's best move?"}
  ],
  "ideal": "The best move is D4."
}

Here, ideal (or a similar field) would contain the expected/ideal answer. If using older models or simpler completion-style APIs, you might instead craft a single text string prompt (concatenating any context and question). OpenAI’s tooling can convert chat format to the older completion format if needed.

Few-Shot Examples: For tasks where the model benefits from seeing a few demonstrations, you can include few-shot examples in the prompt. This means prepending a few (input, output) pairs before asking the real question. For instance, if evaluating arithmetic, your prompt to the model might literally contain: "Q: 2+2=? A: 4\nQ: 4*4=? A: 16\nQ: 3+5=? A:" and you expect the model to continue with the answer. These demonstrations can improve performance on the eval if the model uses in-context learning. Make sure the few-shot examples are representative and don't give away answers to the test queries. Include variations in format and content if possible, so the model learns the pattern but not a trivial mapping.
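
As a sketch, a few-shot prompt for a completion-style API could be assembled like this (the field names and the number of demonstrations are choices, not requirements):

def build_few_shot_prompt(train_examples, test_question):
    """Prepend (question, answer) demonstrations, then ask the test question."""
    lines = [f"Q: {ex['question']} A: {ex['answer']}" for ex in train_examples]
    lines.append(f"Q: {test_question} A:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [{"question": "2+2=?", "answer": "4"}, {"question": "4*4=?", "answer": "16"}],
    "3+5=?",
)
# -> "Q: 2+2=? A: 4\nQ: 4*4=? A: 16\nQ: 3+5=? A:"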

Structured Inputs (Games, Code, etc.): Sometimes inputs aren’t plain English. For example, to test Go or chess, you need to describe a board state. You have a few options:

  • Textual Encodings: For chess, a common choice is FEN notation (a single-line string describing piece positions). In one eval, each puzzle was given in FEN and included in a prompt. Similarly for Go, you could use an SGF snippet or a list of moves played so far. Ensure the model can understand the encoding; you might need to provide an explanation in the prompt (e.g., “The board state in SGF format is: (SGF string).”).
  • Tabular/JSON: Alternatively, represent the state as JSON or a simple grid. For example, a 19x19 Go board could be represented as a matrix of symbols (. for empty, X for black, O for white) in text (see the sketch following this list). This is verbose but human-readable, and the model may find it easier to interpret than a compact notation.
  • Images: If the eval involves visual data (e.g., interpreting an image or diagram), you need a model that supports multimodal input. Some LLMs (like GPT-4 with vision or others) can accept images. In such cases, your eval framework must send the image along with the prompt. This is more complex – you might use a base64-encoded image string or a special API call for images. The OpenAI Evals framework is currently text-focused, but other frameworks or custom code might handle images. Note that EleutherAI’s evaluation harness has started prototyping text+image multimodal tasks. If you require multimodal input, be prepared to handle different API endpoints or use a specialized tool.
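
To make the grid idea concrete, here is a minimal sketch that renders hypothetical stone lists as a text matrix (column letters skip "I", following Go convention):

COLS = "ABCDEFGHJKLMNOPQRST"  # Go column letters, no "I"

def render_board(black, white, size=19):
    """Render stone lists like ["D4", "Q16"] as a text grid (. empty, X black, O white)."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for coords, symbol in ((black, "X"), (white, "O")):
        for coord in coords:
            col = COLS.index(coord[0].upper())
            row = int(coord[1:]) - 1
            grid[size - 1 - row][col] = symbol  # highest row at the top, row 1 at the bottom
    return "\n".join(" ".join(r) for r in grid)

print(render_board(["D4", "Q16"], ["C3", "D16"]))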

In summary, format the prompt in a way that the model can parse. Provide any needed instructions as part of the prompt (e.g., “Output only the move, in coordinates.” to prevent the model from chatting). Consistency in formatting across examples is important so that a single eval logic can handle all cases.

Avoiding Prompt Leaks and Clues: Ensure your prompt for each test case doesn't accidentally reveal the expected answer. For example, if your data is in question-answer format, don’t include the answer in the prompt (unless it’s part of a few-shot example clearly separated from the test query). This seems obvious, but subtle leaks (like including the answer as a “hint”) can happen if you’re not careful.

4. Scoring and Metrics Design

After the model produces an output for each prompt, your eval needs to score it. Designing the right metric is as important as the prompt.

Exact Match vs. Partial Credit: If the task has a single correct answer (e.g., a math problem or a specific fact), you can use exact match scoring: the model gets 1 point for a correct answer and 0 for anything else. This yields an accuracy score over all examples. Exact match is simple and deterministic, but it can be too strict for open-ended tasks or those with multiple correct answers. In tasks like language translation or summarization, there isn't a single "correct" output. For those, use metrics that allow partial credit or similarity:

  • String Similarity Metrics: BLEU, ROUGE, or METEOR for translation/summarization measure overlap with a reference text. These can capture if the model’s output is close to a reference answer even if not word-for-word.
  • Embedding Similarity: Compute vector embeddings (using e.g. cosine similarity) between the model output and reference output(s).
  • Human or LLM Judgement: Have a separate judge evaluate the output. This could be a human rater or an LLM used as a scoring agent.

LLM-as-a-Judge: A modern approach is to use a strong LLM to grade the outputs of another (or same) model. For example, you might prompt GPT-4 with the model’s answer and the reference, asking for a score or verdict. Research has found that these model-based judgments can correlate well with human judgment. If you do this, design a clear rubric for the judging LLM (e.g., “Score 1 if the answer is correct, 0 if not. Here is the question, answer, and solution...”). Note that this introduces an extra dependency (the judge model might have its own biases or errors). It’s often good to spot-check some scoring outputs manually or include a few cases with known outcomes to verify the judge’s reliability.
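
A minimal judge sketch might look like the following (the rubric wording, the judge model, and the 0/1 scale are all assumptions to adapt; the client call matches the OpenAI examples used later in this tutorial):

import openai

def judge_answer(question, reference, model_answer, judge_model="gpt-4"):
    """Ask a stronger model to grade an answer against a reference; returns 1 or 0."""
    rubric = (
        "You are grading an answer against a reference answer.\n"
        "Reply with a single digit: 1 if the answer is correct, 0 if it is not.\n\n"
        f"Question: {question}\nReference answer: {reference}\nModel answer: {model_answer}"
    )
    resp = openai.ChatCompletion.create(
        model=judge_model,
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
    )
    return 1 if resp["choices"][0]["message"]["content"].strip().startswith("1") else 0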

External Tools for Scoring: In specialized domains, external tools or models can provide a ground truth or evaluation:

  • In coding evals, you might run the code the model produces and check its output or whether tests pass.
  • In game-playing evals (chess, Go), you can use a game engine to evaluate moves. For instance, if testing Go moves, use KataGo to get the win-rate delta for the model's move versus the optimal move. A simple metric could be “% of moves that match KataGo’s top choice.” Or, more nuanced, “average win-rate drop from optimal.” In the chess puzzle eval, they tracked not only how many puzzles were solved but also how many illegal moves were made. An illegal move indicates the model failed to follow game rules, which is a critical error, so they treated that as a separate metric (in fact, they even computed an adjusted Elo rating penalizing illegal moves). This illustrates designing metrics that capture important failure modes (in this case, rule violations).
  • For multiple-choice questions, metrics like accuracy or F1 (if multiple labels) are common. If the eval is adversarial or probabilistic, you might measure things like false-positive rate, etc.

Deterministic vs. Probabilistic Outputs: Decide if your eval expects a deterministic output or if some randomness is acceptable. Ideally, for evals, make the model's outputs deterministic by controlling the generation settings. For instance, use temperature=0 in OpenAI API to minimize randomness so that each run of the eval is repeatable. If your eval absolutely requires sampling (say, evaluating the model’s creativity or diversity), then you might need to run multiple trials. In such cases, scoring could involve statistics (e.g., out of 5 sampled stories, 3 met the criteria). But this complicates things. As a rule of thumb, for straightforward evals, keep generation deterministic to get consistent scores run-to-run.

Reference Outputs: If you have a known correct answer for each example, store it in the data (like an "ideal" or "expected" field). Your eval code can then compare model output to this reference. If multiple answers are acceptable, you can store a list of acceptable answers or a pattern to match. For example, if any synonym of a word is fine, your scoring function could check if the model’s answer is in a set of synonyms. For numeric or structured outputs, you might compute a numeric error (e.g., difference between expected and model output if both are numbers). The key is implementing the scoring logic that aligns with what you consider "correct enough."
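
For example, a scoring helper along these lines (a sketch; the "ideal" field name and the numeric tolerance are assumptions) can handle a single reference answer, a list of acceptable answers, or numeric outputs:

def score_output(sample, output, tolerance=1e-6):
    """Return 1 if the model output matches any acceptable reference, else 0."""
    ideal = sample["ideal"]
    references = ideal if isinstance(ideal, list) else [ideal]
    cleaned = output.strip().lower()
    for ref in references:
        try:
            # Numeric references: compare within a tolerance.
            if abs(float(cleaned) - float(ref)) <= tolerance:
                return 1
        except ValueError:
            pass
        # String references: case-insensitive exact match.
        if cleaned == str(ref).strip().lower():
            return 1
    return 0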

Custom Score Functions: In some eval frameworks (like OpenAI’s), you can write custom metrics in code. For example, you could parse the model’s output and give points for certain content. If doing this, thoroughly test your scoring function on sample outputs to ensure it behaves as expected (no false positives or negatives).

To summarize, design metrics that truly reflect success on your task. Keep them as simple as possible, but not simpler – they should capture critical aspects of performance. When in doubt, correlate automatic metrics with human judgment by reviewing a subset of outputs.

5. Implementation: Writing a Custom Eval (Python/TypeScript)

Now to the hands-on part: implementing your eval. You have two broad approaches:

  • Use an evaluation framework (like OpenAI Evals, EleutherAI’s LM Evaluation Harness, etc.) where you plug in your data and possibly some code.
  • Or write a custom script (in Python, TypeScript, etc.) that calls the model API and evaluates results.

We’ll outline a Python approach, as it's common, but the logic applies in any language.

5.1 Choosing a Framework or DIY: If you use OpenAI's Evals framework, much is handled for you (data loading, logging, etc.). You'd write a small amount of code or YAML configuration to define your eval. For example, an OpenAI eval can be configured in a YAML like:

my_eval:
  id: my_eval.dev.v0
  description: "Math addition questions"
  metrics: [accuracy]

my_eval.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my_eval/samples.jsonl
    # any additional args supported by the eval class

Here class points to a built-in eval class that checks whether the model output matches the reference (the Match eval in this case), and samples_jsonl points to your data file. This YAML registers the eval under the name my_eval; you would then run the framework’s CLI (e.g. oaieval <model> my_eval) to execute it.

If you want more customization (like complex scoring or multi-turn interaction), you might write a Python class inheriting from the eval framework’s base classes. OpenAI’s guide suggests focusing on existing templates if possible, resorting to code only for novel logic. Other frameworks like LM Evaluation Harness allow adding new tasks via Python classes or configs as well.

5.2 Writing a Simple Evaluation Script (Python): If not using a specialized framework, you can directly use model APIs. Here's a simplified pseudocode of how you might implement an eval in Python using OpenAI API (this can be adapted to Anthropic’s API, etc. by changing the client library):

import openai
import json

openai.api_key = "YOUR_API_KEY"  # ensure this is set securely

# Load your eval samples
samples = [json.loads(line) for line in open("my_eval.jsonl")]

def evaluate_sample(sample):
    prompt = construct_prompt(sample)  # build the prompt text or messages from sample
    # Call the model (assuming chat format for example)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=prompt,
        temperature=0  # deterministic
    )
    output = response["choices"][0]["message"]["content"]
    score = score_output(sample, output)  # e.g., 1 if matches sample["ideal"], else 0
    return score, output

results = []
for sample in samples:
    sc, out = evaluate_sample(sample)
    results.append({"sample": sample, "output": out, "score": sc})

# Compute aggregate metrics, e.g. accuracy
accuracy = sum(r["score"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.2%}")

In the above:

  • construct_prompt(sample) is a function you write to format the sample into the model input (it might use sample["question"] to form a message list or text).
  • score_output(sample, output) is your scoring logic, e.g., compare to sample["ideal_answer"].

For Anthropic’s Claude, you’d use their client library; older SDK versions expose a completion call that takes a single prompt string with alternating Human/Assistant turns, while newer versions provide a messages API similar to OpenAI’s chat format. Google’s API differs again. The principle remains the same: send the prompt, get the output, compare it to the reference.

TypeScript Implementation: You can do the same in Node.js using fetch or official SDKs. For example, using OpenAI’s Node SDK:

import fs from "fs";
import { Configuration, OpenAIApi } from "openai";
const config = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(config);

const samples = JSON.parse(fs.readFileSync("my_eval.json", "utf8"));  // assuming the file holds a JSON array
for (const sample of samples) {  // note: top-level await below requires an ES module (or wrap this loop in an async function)
  const prompt = constructPrompt(sample);
  const response = await openai.createChatCompletion({
    model: "gpt-4",
    messages: prompt,
    temperature: 0
  });
  const output = response.data.choices[0].message.content;
  const score = scoreOutput(sample, output);
  // ...collect results
}

TypeScript can also be useful if you want to integrate with a web interface or use browser-based eval (for example, if evaluating a model through a web UI or using a browser automation to test something).

API Access vs. Browser-Based Evaluation: In most cases, using direct API calls (as above) is appropriate. This is true for evaluating pure LLM behavior. However, if your eval involves the model interacting with a browser (say you’re evaluating a browsing agent plugin that looks up information), then a different approach is needed. You might have to automate a browser (using something like Playwright or Selenium) to simulate user interactions and capture model outputs in a web interface. That’s quite advanced and usually not necessary unless evaluating a full agent with tools. Another angle: some evaluation frameworks integrate with browser for collecting human feedback or for certain web-based tasks, but for our focus (LLM capabilities via API), sticking to API is simpler and more reproducible.

Ensuring Multi-Model Compatibility: To test across OpenAI, Anthropic, etc., you might abstract the model interface in your code. For example, have a function generate_response(model_name, prompt) that internally calls the appropriate API depending on model_name. Many community frameworks (like the EleutherAI harness) do this, supporting a variety of models with a unified interface. You can also use libraries like Hugging Face’s transformers to load local models or call API endpoints, but be mindful of rate limits and format differences. The goal is to avoid writing completely separate eval code for each model vendor. Instead, parameterize the model choice.
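
Here is one minimal way to sketch such an abstraction (the Anthropic calls assume their Python SDK’s Messages API; adjust both clients to the SDK versions you actually use):

import os
import openai
import anthropic  # assumed installed; calls below follow the Messages API

openai.api_key = os.environ["OPENAI_API_KEY"]
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def generate_response(model_name, messages):
    """Route a chat-style request to the right vendor based on the model name."""
    if model_name.startswith("claude"):
        # Anthropic takes any system prompt via a separate system= argument,
        # so this sketch assumes messages contain only user/assistant turns.
        resp = claude.messages.create(model=model_name, max_tokens=512, messages=messages)
        return resp.content[0].text
    # Default: OpenAI-style chat completion.
    resp = openai.ChatCompletion.create(model=model_name, messages=messages, temperature=0)
    return resp["choices"][0]["message"]["content"]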

Sample Data and Code Structure: Organize your eval repository with clarity:

  • A data folder (e.g. data/my_eval/) containing your JSONL or JSON data.
  • Code for running the eval (like a Python script or notebook).
  • If using an eval framework, configuration files or eval definition classes.
  • A README explaining how to run it.

Keeping things well-structured will also help when publishing (others can understand and reproduce your eval).

6. Running the Eval: Execution and Debugging

Once implementation is ready, it’s time to run your eval and gather results.

Batching and Efficiency: If you have many examples, calling the API one-by-one can be slow and hit rate limits. Many APIs allow batching or streaming. For example, OpenAI’s API supports submitting prompts as a list for some endpoints, or you can use asynchronous calls. Some evaluation libraries automatically batch requests for speed. If writing from scratch, consider using Python’s asyncio or concurrent futures to send multiple requests in parallel (respect the API’s concurrency limits). Batch processing can dramatically speed up eval runs, especially for hundreds of examples.
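
For example, a minimal sketch using a thread pool around the evaluate_sample function from the Section 5 script (tune the concurrency limit to your API quota):

from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # stay well under your API rate limits

def run_all(samples):
    """Evaluate samples in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(evaluate_sample, samples))

results = run_all(samples)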

Local vs. Cloud Execution: Decide where to run the eval:

  • Locally: Running on your machine is fine for smaller evals or if using APIs (the heavy lifting is on the API side). If you have a powerful GPU and want to evaluate open-source models (like Llama 2, etc.), you can load them locally via Hugging Face Transformers. This avoids API costs, but setup can be complex for very large models. Local runs give you full control (and you avoid any internet latency).
  • Cloud: For large-scale evals or to evaluate models that you host on a server, you might use cloud resources. For example, if evaluating a custom model on Google Cloud or AWS, you could run the eval script on a VM with access to that model. Cloud is also handy if you want to run many evals (for different models or repeated trials) and need more compute. Keep in mind API-based evals can run anywhere as long as the machine can reach the API, so sometimes the cloud vs local question is about convenience and performance rather than necessity.

Debugging Tips: It’s common for something to go wrong on the first run. Here’s how to troubleshoot:

  • Start Small: Run the eval on just 1-2 examples in verbose mode to ensure the prompt formatting, API calls, and scoring logic all work. Print out the prompt and model output for inspection.
  • Check Errors & Rate Limits: API responses might contain errors (e.g., if prompt is too long or content disallowed). Handle exceptions and print/log them clearly. If you hit rate limits, insert delays or reduce batch size (see the retry sketch after this list).
  • Verify Scoring on Known Cases: Include a couple of test cases where you know what the model should do (perhaps trivial ones). Check that the scoring function gives the expected score. If not, adjust the scoring logic.
  • Model Output Format: Sometimes models might not format the output exactly as you expect (e.g. extra explanations or different casing). You may need to post-process outputs. For instance, if your eval expects just a move like "D4" but the model says "I think the best move is D4 because ...", you’d need to parse out "D4". You can either refine the prompt to only output the move or add output parsing in scoring.
  • Reproducibility: Set random seeds if any randomness (some frameworks allow setting a seed for generation). Log the model version (if using an open model checkpoint or an API snapshot). This way, you or others can reproduce the run later. Note that some APIs (like OpenAI’s) update models over time; for long-lived evals, you may want to version your results by date or model snapshot.
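
For the rate-limit point above, a simple retry wrapper with exponential backoff is often enough (a sketch; narrow the exception handling to the error types your client library actually raises):

import time

def call_with_retries(fn, max_retries=5, base_delay=2.0):
    """Call fn(), retrying with exponential backoff on transient API errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # replace with your client's rate-limit/timeout exceptions
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)

# Usage: response = call_with_retries(lambda: openai.ChatCompletion.create(...))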

Collecting Results: Decide on an output format for results. You might simply print metrics to screen. But it’s often useful to save detailed results (each example’s output and score) to a file (JSON or CSV). This allows analysis later, especially if you want to see which examples failed. OpenAI’s evals framework, for instance, can log to a JSONL or even a database, and also integrates with Weights & Biases for tracking runs. If you plan multiple eval runs (e.g., comparing models), systematically save them (naming files by model and date, for example).
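
For instance, a small helper (the file-naming scheme is just one suggestion) that writes each example’s record to a timestamped JSONL file:

import json
from datetime import datetime

def save_results(results, model_name):
    """Write per-example results to a JSONL file named by model and timestamp."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = f"results_{model_name}_{stamp}.jsonl"
    with open(path, "w") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")
    return path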

Interpreting Output: After a successful run, interpret the results carefully. Look beyond the top-line metric. For example, if accuracy is 70%, which 30% did it get wrong? Are they clustered in a certain category of input? Perhaps all failures are endgame positions in Go, indicating a weakness there. This analysis might lead you to refine your eval (maybe add more examples of that type or split out a sub-metric).

Iterate if Needed: It’s not uncommon to iterate on the eval design after seeing initial results. If your eval was too easy (all models score 100%), you might add harder cases. If it was too hard or ambiguous (all models score near 0, including ones you expected to do well), inspect if the questions are fair or if scoring is too harsh. The eval should ideally differentiate model performance in a useful range (not all 0s or all 100s).

7. Publishing and Sharing Your Eval

You’ve built and run your custom eval – now consider sharing it so others can benefit or replicate it.

Open-Source the Code and Data: The best practice is to create a repository (on GitHub or similar) containing:

  • Data files (.jsonl or whatever format) for the eval dataset. If the dataset is large, you could host it on a data hub (Hugging Face Datasets, Kaggle, etc.) and just provide a link or script to download.
  • Code to run the eval. This could be a script, notebook, or instructions for using an eval framework with your config.
  • README Documentation: Explain the purpose of the eval, how to run it, what models it’s for, and what metrics it reports. Include examples of expected input/output format.
  • License Files: Clearly license your work. For code, licenses like MIT or Apache-2.0 are popular and permissive, encouraging others to use and modify your eval. For data, you can use a Creative Commons license (e.g., CC BY 4.0 to require attribution, or CC0 to waive rights). Indicate the license in a LICENSE file and/or in the README. This helps avoid any legal ambiguity.

Publishing on Eval Platforms: In addition to your own repo, you might contribute your eval to established platforms:

  • OpenAI Evals Registry: OpenAI initially invited contributions to their openai/evals repository. However, note the current guidance: they are not accepting evals with custom code at the moment, only certain YAML-based evals. This may change over time. If your eval fits their criteria (they look for evals that surface interesting capabilities or problems), you could submit a Pull Request to add it. This makes it visible to anyone using OpenAI’s framework.
  • Hugging Face: You can create a Hugging Face dataset repository for your eval data. If you write your eval as a Python script or Jupyter Notebook, you can also share that (or even make a Hugging Face Space for an interactive demo or visualization of results).
  • Academic or Leaderboard: If your eval is akin to a benchmark (and you have results for multiple models), consider publishing a short report or blog post about it. Sometimes new evals, especially in novel domains (like an LLM playing Go), can be turned into an academic paper or at least a technical blog which can be cited.

Licensing and Permissions: Double-check any external data you included. If you incorporated data from somewhere (like game records, or KataGo’s outputs), mention the source and license in your repo. For instance, “Game positions taken from XYZ database, ©2023 ABC (used under CC BY 4.0 license).” If your eval might reveal sensitive information or you’re not sure about sharing certain pieces, consider anonymizing or aggregating. For example, if you had a proprietary dataset you can’t share, you might still share the scoring logic and perhaps a few example data points, so others could at least follow the method with their own data.

Privacy/Keeping Parts Private: If some parts must remain private (due to business or privacy reasons), you have options:

  • Share the Evaluation Code but not Data: Others could run your eval on their own similar data. Provide instructions on data format.
  • Share Synthetic Data: If you can’t share real user data, perhaps create a synthetic version that mimics it for demonstration.
  • Keep Scoring Logic Transparent: Even if data is private, share how you score (unless that itself is sensitive). This helps validate the eval design.
  • API Keys and Secrets: Never commit API keys or credentials in the repo. Use environment variables or config files (and add those to .gitignore).

Community Engagement: Announce your eval on relevant forums or communities (the OpenAI community forum, Discord channels, Reddit, etc.). You might get feedback, or others might run their models on it and share results. This can be very insightful – maybe someone finds a model that does unexpectedly well or identifies a flaw in one of your examples.

Continuous Updates: If your eval becomes popular or if LLMs improve, you might update it. For instance, if models start solving all your Go positions easily, you’d want to add harder ones. Treat an eval as a living benchmark that can evolve. Just be sure to version it (so results from v1 vs v2 aren’t confused).

Finally, when publishing, articulate why the eval is useful. For example: “This eval tests strategic planning in Go. It’s challenging for GPT-4-class models and can help identify whether new models have improved in long-term planning.” Clear motivation will attract users (and possibly contributors) to your eval.

8. Example: Evaluating LLM Go-Playing Skill with KataGo

Let’s put it all together with an example eval: assessing an LLM’s skill at Go (the board game) by using KataGo (a strong Go engine) as a reference. Our goal will be to see if an LLM can suggest good moves for a given Go board state.

8.1 Designing the Go Eval

  • Scope: We focus on move prediction in Go. This is a specific task, so we'll create a single eval for it. We won’t mix unrelated tasks (no language questions, etc.) here.
  • What is a Good Outcome? Ideally, the LLM’s suggested move should match KataGo’s top choice or at least be a high-quality move.
  • Data Needed: We need a set of Go positions and KataGo's analysis for them (i.e., the best move from KataGo or a set of strong moves). We might take positions from real games or compose some interesting scenarios (joseki, life-and-death problems, etc.). Let’s say we gather 50 board states covering different stages of the game.
  • Format: We decide to use a text-based description of the board. We’ll use a simplified coordinate system (e.g., A1 to T19 for a 19x19 board, skipping I since Go usually uses letters A-H, J-T for columns). An example position description might look like:
    • Black stones at D4, Q16, ...; White stones at C3, D16, ...; (with maybe 30-40 stones each for mid-game). We also indicate whose turn it is.
  • Prompt Design: For each example, our prompt will be something like:
    You are an expert Go player. Analyze the board and suggest the best move for the player to move.
    Board state:
    - Black: D4, Q16, ...
    - White: C3, D16, ...
    It is Black's turn.
    What is Black's best move?
    
    We might not include few-shot examples because describing one board and then another might confuse things. We’ll rely on the instruction and the board listing being clear.
  • Expected Output: We want the model to output a move coordinate, e.g., “D4” or “Q16”. Maybe with a short justification, but ideally just the move (we can instruct: “Output the move coordinates only.”). However, if it outputs reasoning, we can parse the first coordinate it mentions as the move.
  • Reference Answer: KataGo (with high playouts) might say the best move is at (say) D4. So our ideal answer for that example is "D4". If multiple moves are nearly equal (within say 1% win rate), we could allow any of those as correct (store a list like ["D4","C16"] if applicable).

8.2 Data Preparation

We create a file go_eval.jsonl with each line like:

{
  "board": "Black: D4, Q16, K10, ...; White: C3, D16, Q4, ...; turn: Black",
  "ideal": "J4"
}

This represents one test case (with a truncated list of stones for brevity). Ensure all coordinates are valid and the board state is plausible (no overlaps, etc.). We include 50 such lines, each with a unique board and ideal best move.

(If we had KataGo's full analysis, we could include win-rate info or secondary moves, but for simplicity we just put the top move.)
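
If you do want KataGo's analysis, one route is its JSON analysis engine (katago analysis), which reads one JSON query per line on stdin and writes one JSON response per line on stdout. The sketch below shows the general shape; the paths are placeholders, and you should check your KataGo version's analysis-engine documentation for the exact field names before relying on it:

import json
import subprocess

# Placeholder paths: point these at your KataGo binary, analysis config, and network file.
proc = subprocess.Popen(
    ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

query = {
    "id": "pos1",
    "initialStones": [["B", "D4"], ["W", "C3"]],  # stones already on the board
    "initialPlayer": "B",                          # whose turn it is
    "moves": [],                                   # analyze the position as given
    "rules": "japanese",
    "komi": 6.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "analyzeTurns": [0],
    "maxVisits": 500,
}
proc.stdin.write(json.dumps(query) + "\n")
proc.stdin.flush()

response = json.loads(proc.stdout.readline())
best = response["moveInfos"][0]  # moveInfos is ordered by KataGo's preference
print(best["move"], best["winrate"])
proc.terminate()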

8.3 Implementation (Python example)

We'll use the OpenAI API with gpt-4 for this eval (assuming it has been trained on enough Go-related text to have some idea). We set temperature to 0 for consistency.

import openai, json, re

openai.api_key = "YOUR_API_KEY"

def format_go_prompt(board_desc, player_to_move):
    return [
      {"role": "system", "content": "You are an expert Go player and teacher."},
      {"role": "user", "content": f"Analyze the following Go board and suggest the best move for {player_to_move}.\nBoard state:\n{board_desc}\nIt is {player_to_move}'s turn. What is the best move for {player_to_move}?"}
    ]

# Load eval data
with open("go_eval.jsonl") as f:
    samples = [json.loads(line) for line in f]

correct = 0
for sample in samples:
    # Each board description contains whose turn, but let's parse it out or store separately
    board_desc = sample["board"]
    # assume board string contains "turn: X" at end:
    player_to_move = "Black" if "turn: Black" in board_desc else "White"
    messages = format_go_prompt(board_desc, player_to_move)
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    answer = resp["choices"][0]["message"]["content"].strip()
    # Parse the first thing that looks like a board coordinate (column letter A-T skipping I, row 1-19)
    match = re.search(r"\b([A-HJ-T](?:1[0-9]|[1-9]))\b", answer.upper())
    move = match.group(1) if match else answer.split()[0].upper().strip(",.")
    if move in sample["ideal"].upper().split("/"):
        # If multiple acceptable moves are stored separated by "/", any of them counts
        correct += 1
    else:
        print(f"Board: {board_desc}\nModel answer: {answer} (expected {sample['ideal']})\n")

After running this, correct will be the number of times the model’s move matched KataGo’s top move. The print statement will output cases the model got wrong, for analysis.

8.4 Running and Refining

Suppose we run this and get output like:

  • The model often suggests moves that are decent but not KataGo’s choice. Accuracy might be, say, 10/50 = 20%. We observe that in many cases, the model's move is in KataGo's top 3 choices but not the top one.
  • In a few cases, the model output something invalid like “pass” or an obviously bad move. We examine those: maybe the prompt could be improved (perhaps the model got confused by formatting).

We might refine the eval:

  • Maybe allow top-3 moves as correct (if we have that info) for partial credit.
  • Ensure none of our positions are super obscure (the model might not “know” them).
  • Add to the instructions: “If unsure, make your best guess; do not say you don’t know.”

We rerun the eval after these tweaks and log the final results.

8.5 Results and Sharing

Finally, we prepare a summary:

  • Metric: The model got 20% exact matches with KataGo’s top move. When considering any move within 5% win-rate of optimal (a more lenient metric), it got about 50% acceptable moves.
  • Analysis: GPT-4 knows enough Go to suggest reasonable moves in some cases, but it significantly underperforms a Go engine and makes mistakes an intermediate human player would avoid. This illustrates the limitations of LLMs in precise strategic domains without specialized training.

We then share the go_eval.jsonl and the script in a GitHub repo, along with these findings. Perhaps also try a couple of other models (Claude, etc.) by adjusting the code, and include those results in our README for context.

By going through this example, we demonstrated the full process: designing an eval for a novel capability (game-playing), implementing it, running it, and analyzing outcomes. You can follow similar steps for other custom evals – whether it’s testing coding ability, logical puzzles, multi-modal reasoning, or anything else you can devise. Good luck with building your own LLM evals, and we look forward to seeing what creative benchmarks you come up with!
