Binary weighted evaluations...how to
marcosomma

Publish Date: Dec 7 '25

Evaluating LLM agents is messy.

You cannot rely on perfect determinism, you cannot just assert result == expected, and asking a model to rate itself on a 1–5 scale gives you noisy, unstable numbers.

A much simpler pattern works far better in practice:

Turn everything into yes/no checks, then combine them with explicit weights.

In this article we will walk through how to design and implement binary weighted evaluations using a real scheduling agent as an example. You can reuse the same pattern for any agent: customer support bots, coding assistants, internal workflow agents, you name it.


1. What is a binary weighted evaluation?

At a high level:

  1. You define a set of binary criteria for a task

    Each criterion is a question that can be answered with True or False.

    • Example:
      • correct_participants: Did the agent book the right people?
      • clear_explanation: Did the agent explain the outcome clearly?
  2. You assign each criterion a weight that reflects its importance

    All weights typically sum to 1.0.

   COMPLETION_WEIGHTS = {
       "correct_participants": 0.25,
       "correct_time": 0.25,
       "correct_duration": 0.10,
       "explored_alternatives": 0.20,
       "clear_explanation": 0.20,
   }
  3. For each task, you compute a score from 0.0 to 1.0 by summing the weights of all criteria that are True.
   score = sum(
       COMPLETION_WEIGHTS[k]
       for k, v in checks.items()
       if v
   )
  4. You classify the outcome based on the score and the final state. For example:
    • score >= 0.75 and booking confirmed → successful completion
    • score >= 0.50 → graceful failure
    • score > 0.0 but < 0.50 → partial failure
    • score == 0.0 and conversation failed → hard failure

This gives you a scalar metric that is:

  • Interpretable: you can see exactly which criteria failed.
  • Tunable: change the weights without touching your agent.
  • Stable: True/False decisions are far easier for humans or models to agree on.

2. Step 1 – Turn “good behavior” into boolean checks

Start by asking: What does “good” look like for this task?

For a scheduling agent, a successful task might mean:

  • It booked a meeting with the right participants.
  • At the right time.
  • With the right duration.
  • If there was a conflict, it proposed alternatives.
  • Regardless of outcome, it explained clearly what happened.

Those become boolean checks.

Conceptually:

checks = {
    "correct_participants": ... -> bool,
    "correct_time": ... -> bool,
    "correct_duration": ... -> bool,
    "explored_alternatives": ... -> bool,
    "clear_explanation": ... -> bool,
}

In the scheduling example, these checks use the agent’s final state plus a ground truth object.

Simplified version:

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected


def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]


def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

And for behavior around conflicts and explanations:

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # If there was no conflict, this is automatically ok
        return True

    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0


def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False

    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad
    if conversation_stage == "failed" and len(last_response) < 20:
        return False

    # Very simple heuristic: the user sees some explanation
    return len(last_response) > 20

The exact logic is domain specific. The key rule is:

Each check should be obviously True or False when you look at the trace.


3. Step 2 – Turn business priorities into weights

Not all criteria are equally important.

In the scheduling agent example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Booked the right people
    "correct_time": 0.25,           # Booked the right date/time
    "correct_duration": 0.10,       # Meeting length as requested
    "explored_alternatives": 0.20,  # Tried to find another slot if needed
    "clear_explanation": 0.20,      # User understands outcome
}

Why this makes sense:

  • Booking the wrong person or the wrong time is catastrophic → high weight.
  • Slightly wrong duration is annoying but not fatal → lower weight.
  • Exploring alternatives and clear explanations are key to user trust → medium weight.

Guidelines for designing weights:

  1. Start from business impact, not from what is easiest to check.
  2. Make weights sum to 1.0 so the score is intuitive (a quick sanity check is sketched below).
  3. Keep a small number of criteria at first (4 to 7 is plenty).
  4. Be willing to change weights after you see real data.
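A quick sanity check for guideline 2 keeps the weights honest whenever someone edits them. This is a minimal sketch; validate_weights is just an illustrative name, not part of the example project:

import math

def validate_weights(weights: dict) -> None:
    # Fail fast if the weights drift away from summing to 1.0.
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights sum to {total}, expected 1.0")

validate_weights(COMPLETION_WEIGHTS)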

4. Step 3 – Implement the per request evaluator

Now combine the boolean checks and weights to compute a score for a single request.

In the example repository, this machinery is wrapped in an EvaluationResult dataclass:

from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"


@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Then the core evaluation function:

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

This gives you:

  • A numeric score for analytics and thresholds.
  • A details dict for debugging.
  • A human friendly explanation for reports or console output (a minimal _generate_explanation sketch follows).
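The _generate_explanation helper is not shown in this article and its exact wording is project specific, but a minimal sketch could simply list the failed criteria:

def _generate_explanation(checks: Dict[str, bool], outcome: OutcomeType, score: float) -> str:
    # Illustrative sketch; the helper in the example repository may word this differently.
    failed = [name for name, passed in checks.items() if not passed]
    if not failed:
        return f"{outcome.value}: all criteria passed (score {score:.2f})"
    return f"{outcome.value}: score {score:.2f}, failed criteria: {', '.join(failed)}"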

5. Step 4 – Map scores to outcome classes

Users and stakeholders do not want to look at a sea of floating point numbers. They want to know:

  • How often does the agent succeed?
  • How often does it fail gracefully?
  • How often does it blow up?

You answer that by mapping scores to classes.

Example logic:

def _classify_outcome(scheduling_ctx, conversation_stage: str, score: float) -> OutcomeType:
    booking_confirmed = scheduling_ctx.get("booking_confirmed", False)

    if booking_confirmed and score >= 0.75:
        return OutcomeType.SUCCESSFUL_COMPLETION

    if conversation_stage == "failed" and score == 0.0:
        return OutcomeType.HARD_FAILURE

    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE

    return OutcomeType.PARTIAL_FAILURE

You can now define clear thresholds:

  • Successful completion: meeting booked correctly with a high score.
  • Graceful failure: the task could not be completed, but the user got a useful explanation or alternatives.
  • Partial failure: the agent tried, but did not do enough to help the user.
  • Hard failure: wrong booking or silent crash.

This gives you both quantitative and qualitative views of performance.


6. Step 5 – Aggregating into metrics like TCR

Once you can evaluate a single request, turning that into a metric is straightforward.

For example, define Task Completion Rate (TCR) as the mean of per request scores:

def compute_tcr(results: list[EvaluationResult]) -> float:
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)

Then define thresholds that match your risk tolerance (a small gate helper is sketched after the list):

  • TCR >= 0.85 → production ready
  • 0.70 <= TCR < 0.85 → usable but needs improvement
  • TCR < 0.70 → not production ready
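A minimal sketch of such a gate, with the thresholds simply mirroring the list above; readiness_label is an illustrative name, and you should tune the cutoffs to your own risk tolerance:

def readiness_label(tcr: float) -> str:
    # Map the aggregate TCR to the human readable labels used above.
    if tcr >= 0.85:
        return "production ready"
    if tcr >= 0.70:
        return "usable but needs improvement"
    return "not production ready"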

You can also break down by outcome type:

from collections import Counter

def summarize_outcomes(results: list[EvaluationResult]):
    counts = Counter(r.outcome_type for r in results)
    total = len(results) or 1

    return {
        "successful_completion": counts[OutcomeType.SUCCESSFUL_COMPLETION] / total,
        "graceful_failure": counts[OutcomeType.GRACEFUL_FAILURE] / total,
        "partial_failure": counts[OutcomeType.PARTIAL_FAILURE] / total,
        "hard_failure": counts[OutcomeType.HARD_FAILURE] / total,
    }

This lets you say things like:

  • “78 percent of requests end in successful completion, 15 percent in graceful failure, and 7 percent in partial or hard failure.”

That is far more actionable than “average rating: 3.9 out of 5”.
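You can go one step further and report per-criterion pass rates, which tells you exactly where to focus next. A minimal sketch (it assumes every EvaluationResult carries the same set of criteria):

def criterion_pass_rates(results: list[EvaluationResult]) -> dict:
    # Fraction of runs in which each binary criterion passed.
    if not results:
        return {}
    passed_counts = Counter(
        criterion
        for r in results
        for criterion, passed in r.details.items()
        if passed
    )
    return {criterion: passed_counts[criterion] / len(results) for criterion in results[0].details}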


7. Extending the pattern to other metrics

Binary weighted evaluations are not only for completion. In the example project, the same pattern is reused for:

  • Response Clarity Score (RCS)

    How clear and useful is a single answer?

  • Error Recovery Score (RTE)

    How well does the agent recover when something goes wrong?

7.1 Response clarity

Define a new set of boolean criteria:

CLARITY_WEIGHTS = {
    "addresses_request": 0.30,      # Did it answer the original question?
    "provides_next_step": 0.25,     # Does the user know what to do next?
    "is_concise": 0.20,             # Not rambling
    "no_hallucination": 0.15,       # Grounded in context
    "appropriate_tone": 0.10,       # Professional and friendly
}

Then evaluate:

def evaluate_response_clarity(user_input, agent_response, context) -> EvaluationResult:
    checks = {
        "addresses_request": _check_addresses_request(user_input, agent_response, context),
        "provides_next_step": _check_next_step(agent_response, context),
        "is_concise": len(agent_response.split()) < 100,
        "no_hallucination": _check_no_hallucination(agent_response, context),
        "appropriate_tone": _check_tone(agent_response),
    }

    score = sum(
        CLARITY_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    # You can reuse OutcomeType or define a dedicated one
    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=OutcomeType.SUCCESSFUL_COMPLETION,  # or a clarity specific enum
        explanation=f"Response clarity score: {score:.2f}",
    )
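The _check_* helpers above live in the example project and are not shown in this article. As an illustration only, two of them could start as simple keyword heuristics before you swap in anything smarter:

def _check_next_step(agent_response: str, context: dict) -> bool:
    # Sketch: does the response point the user to an action? Real projects may use an LLM judge here.
    cues = ("you can", "next", "please", "let me know", "would you like")
    return any(cue in agent_response.lower() for cue in cues)


def _check_tone(agent_response: str) -> bool:
    # Sketch: flag obviously unprofessional wording; everything else passes.
    banned = ("stupid", "whatever", "idk")
    return not any(word in agent_response.lower() for word in banned)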

7.2 Error recovery

Same pattern, different criteria:

ERROR_RECOVERY_WEIGHTS = {
    "detected_error": 0.30,
    "requested_clarification": 0.25,
    "actionable_message": 0.20,
    "no_hallucination": 0.15,
    "no_crash": 0.05,
}

You define checks for each of these and compute a weighted score in the same way.
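A sketch of the corresponding evaluator, assuming you have written the matching _check_* helpers for error scenarios. The helper names, the no_crash heuristic, and the outcome mapping below are illustrative, not taken from the example repository:

def evaluate_error_recovery(final_state, conversation_trace) -> EvaluationResult:
    checks = {
        "detected_error": _check_detected_error(final_state, conversation_trace),
        "requested_clarification": _check_requested_clarification(conversation_trace),
        "actionable_message": _check_actionable_message(conversation_trace),
        "no_hallucination": _check_no_hallucination_in_trace(conversation_trace),
        # Assumption: an empty trace means the agent crashed before producing any response.
        "no_crash": len(conversation_trace) > 0,
    }

    score = sum(ERROR_RECOVERY_WEIGHTS[k] for k, v in checks.items() if v)

    return EvaluationResult(
        score=score,
        details=checks,
        # Illustrative mapping; you may prefer a dedicated recovery specific enum.
        outcome_type=OutcomeType.GRACEFUL_FAILURE if score >= 0.5 else OutcomeType.PARTIAL_FAILURE,
        explanation=f"Error recovery score: {score:.2f}",
    )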


8. How to adopt this in your own project

Here is a practical checklist to implement binary weighted evaluations for your agents.

  1. Pick one task type

    For example:

    • Answering factual questions
    • Generating SQL queries
    • Routing support tickets
  2. Write down 3 to 7 binary criteria

    Good prompts:

    • “What must be true for this result to be useful?”
    • “What are the most expensive mistakes?”
    • “What would we highlight in a post mortem?”
  3. Assign approximate weights

    Start with something like:

    • 0.3 for the main success criterion
    • 0.2 for each secondary one
    • 0.1 or less for extras
  4. Implement check functions

    They should:

    • Receive the final state, the ground truth, and optionally the full trace.
    • Return clear booleans with simple logic, even if heuristic.
  5. Create an EvaluationResult object

    So you are not juggling loose dicts. Include:

    • score
    • details
    • outcome_type
    • explanation
  6. Write a small evaluator script

    Like scripts/run_evaluation.py in the example repository (a minimal sketch follows this checklist):

    • Load test scenarios.
    • Run the agent.
    • Evaluate each run.
    • Print a summary: TCR, outcome breakdown, top failing criteria.
  7. Iterate on weights and criteria

    After a few runs:

    • Check what failures you see in practice.
    • Adjust weights to match real risk.
    • Add or remove criteria if some are always True or always False.
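The evaluator script from point 6 can stay very small. This is a minimal sketch in the spirit of scripts/run_evaluation.py; load_scenarios and run_agent are placeholders for your own test loader and agent entry point:

from collections import Counter


def main() -> None:
    results = []

    for scenario in load_scenarios():  # placeholder: yields {"input": ..., "ground_truth": ...} dicts
        final_state, conversation_trace = run_agent(scenario["input"])  # placeholder agent entry point
        results.append(
            evaluate_task_completion(final_state, scenario["ground_truth"], conversation_trace)
        )

    print(f"TCR: {compute_tcr(results):.2f}")
    print("Outcome breakdown:", summarize_outcomes(results))

    # Top failing criteria: count how often each check failed across all runs.
    failed_counts = Counter(
        criterion
        for r in results
        for criterion, passed in r.details.items()
        if not passed
    )
    for criterion, count in failed_counts.most_common(3):
        print(f"  frequently failing criterion: {criterion} ({count} runs)")


if __name__ == "__main__":
    main()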

9. Why this works so well for LLM agents

Binary weighted evaluations match the nature of LLM work:

  • Non deterministic outputs: You care less about string equality and more about semantics: did the agent satisfy the contract of the task?

  • Complex, stateful flows: It is unrealistic to reduce a full multi turn workflow to a single “pass or fail”. Binary checks let you inspect specific aspects of behavior.

  • LLM as judge integrations: Even when you use a model like GPT-4 as a grader, it is far more stable at answering yes/no questions than at rating 1–5. You can plug an LLM into each criterion and still keep the same scoring layer (see the sketch below).

  • Easy to explain to stakeholders: You can say, “The agent passes correct_participants only 65 percent of the time, but clear_explanation is at 92 percent. We will focus on participant selection next.”
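As a concrete example, an LLM judged criterion can stay binary. The sketch below assumes a generic call_llm(prompt) -> str helper wrapping whatever client you use; it is not an API from the example project:

def _check_clear_explanation_llm(last_response: str) -> bool:
    # Ask the judge a yes/no question instead of a 1-5 rating, then parse defensively.
    prompt = (
        "Does the following message clearly explain to the user what happened "
        "and what the outcome was? Answer only YES or NO.\n\n"
        f"Message:\n{last_response}"
    )
    answer = call_llm(prompt)  # placeholder for your OpenAI/Anthropic/local client call
    return answer.strip().upper().startswith("YES")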

Comments (2)

  • okram_mAI · Dec 7, 2025

    Nice! This resonates a lot with me. One question though: what if I'm evaluating a CoT and I need to evaluate the execution order? Should I then create a set of checks like did_bot_do_xxxx_at_step_1? And how can this be validated if I expect the bot to perform X before time N? Really intriguing...

    • marcosomma · Dec 7, 2025

      Love this question, because it hits the uncomfortable bit everyone skips: order actually matters.

      Short answer: yes, you can still use binary checks, but the checks become predicates over the sequence of steps, not just the final state.
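      A tiny illustration, assuming a hypothetical trace format where each entry records the step name in execution order:

      def _check_x_before_step_n(conversation_trace, step_name: str, n: int) -> bool:
          # Hypothetical trace format: each entry has a "step" field, recorded in execution order.
          steps = [entry.get("step") for entry in conversation_trace]
          return step_name in steps[:n]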
