Guardrails

Guardrails can protect against harmful inputs, such as jailbreak attempts, and damaging outputs, such as mentions of a competitor’s product.

A specific guardrail implements the TextGuardrail interface. It takes the input or output text as a parameter and returns a result indicating whether the text passed the validation or not, including an explanation of why the decision was made. These results are included in metrics and traces. A guardrail can abort the interaction with the model, or only report the problem and continue anyway.

An example of a Guardrail implementation:

import akka.javasdk.agent.GuardrailContext;
import akka.javasdk.agent.TextGuardrail;

public class ToxicGuard implements TextGuardrail {

  private final String searchFor;

  public ToxicGuard(GuardrailContext context) {
    searchFor = context.config().getString("search-for");
  }

  @Override
  public Result evaluate(String text) {
    // this would typically be more advanced in a real implementation
    if (text.contains(searchFor)) {
      return new Result(false, "Toxic response '%s' not allowed.".formatted(searchFor));
    } else {
      return Result.OK;
    }
  }
}

Guardrails are enabled through configuration, which makes it possible to enforce at deployment time that certain guardrails are always used.

src/main/resources/application.conf
akka.javasdk.agent.guardrails {
  "pii guard" {                                     (1)
    class = "com.example.guardrail.PiiGuard"        (2)
    agents = ["planner-agent"]                      (3)
    agent-roles = ["worker"]                        (4)
    category = PII                                  (5)
    use-for = ["model-request", "mcp-tool-request"] (6)
    report-only = false                             (7)
  }

  "toxic guard" {
    class = "com.example.guardrail.ToxicGuard"
    agent-roles = ["worker"]
    category = TOXIC
    use-for = ["model-response", "mcp-tool-response"]
    report-only = false
    search-for = "bad stuff"
  }
}
1 Each configured guardrail has a unique name.
2 Implementation class of the guardrail.
3 Enable this guardrail for agents with these component ids.
4 Enable this guardrail for agents with these roles.
5 The type of validation, such as PII and TOXIC.
6 Where to use the guardrail, such as for the model request or model response.
7 If the text doesn’t pass the evaluation criteria, execution is either aborted (report-only = false) or continues anyway (report-only = true). In both cases, the result is tracked in logs, metrics and traces.

The implementation class of the guardrail is configured with the class property. The class must implement the TextGuardrail interface. The class may optionally have a constructor with a GuardrailContext parameter, which includes the name and the config section for the specific guardrail. In the ToxicGuard example above, you can see how the configuration property search-for is read from the config of the GuardrailContext parameter.

Agents are selected by matching the agents or agent-roles configuration.

  • agents: enabled for agents with these component ids; if agents contains "*", the guardrail is enabled for all agents

  • agent-roles: enabled for agents with these roles; if agent-roles contains "*", the guardrail is enabled for all agents that have a role, but not for agents without a role

If both agents and agent-roles are defined, it’s enough that one of them matches to enable the guardrail for an agent, as in the sketch below.
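For example, a minimal sketch of a guardrail that defines both properties; the FormatGuard class is illustrative and not part of the SDK:

src/main/resources/application.conf
akka.javasdk.agent.guardrails {
  "format guard" {
    class = "com.example.guardrail.FormatGuard"
    agents = ["planner-agent"]
    agent-roles = ["worker"]
    category = FORMAT
    use-for = ["model-response"]
    report-only = true
  }
}

This guardrail is enabled for the agent with component id planner-agent and for every agent with the worker role.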

The role is defined in the @AgentDescription annotation.
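A minimal sketch of an agent declaring such a role, assuming the usual Akka SDK agent structure (the component id, name, description and system message are illustrative):

import akka.javasdk.agent.Agent;
import akka.javasdk.annotations.AgentDescription;
import akka.javasdk.annotations.ComponentId;

// Hypothetical agent with the "worker" role, which the
// agent-roles = ["worker"] selectors above would match.
@ComponentId("weather-agent")
@AgentDescription(
    name = "Weather Agent",
    description = "Answers questions about the weather",
    role = "worker")
public class WeatherAgent extends Agent {

  public Effect<String> query(String question) {
    return effects()
        .systemMessage("You are a helpful weather assistant.")
        .userMessage(question)
        .thenReply();
  }
}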

The name and the category are reported in logs, metrics and traces. The category should classify the type of validation. It can be any value, but a few recommended categories are JAILBREAK, PROMPT_INJECTION, PII, TOXIC, HALLUCINATED, NSFW, FORMAT.

The guardrail can be enabled for certain inputs or outputs with the use-for property. The use-for property accepts the following values: model-request, model-response, mcp-tool-request, mcp-tool-response, and *.
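As a sketch, the ToxicGuard from above could be applied to all of these at once with the wildcard; the guardrail name here is illustrative:

src/main/resources/application.conf
akka.javasdk.agent.guardrails {
  "toxic guard all" {
    class = "com.example.guardrail.ToxicGuard"
    agents = ["*"]
    category = TOXIC
    use-for = ["*"]
    report-only = true
    search-for = "bad stuff"
  }
}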

Guardrail for similar text

The built-in SimilarityGuard evaluates the text by performing a similarity search in a dataset of "bad examples". If the similarity exceeds a threshold, the result is flagged as blocked.

This is how to configure the SimilarityGuard:

src/main/resources/application.conf
akka.javasdk.agent.guardrails {
  "jailbreak guard" {
    class = "akka.javasdk.agent.SimilarityGuard"
    agents = ["planner-agent", "weather-agent"]
    category = JAILBREAK
    use-for = ["model-request"]
    threshold = 0.75
    bad-examples-resource-dir = "guardrail/jailbreak"
  }
}

Here, it’s using predefined examples of jailbreak prompts in guardrail/jailbreak. Those have been incorporated from https://github.com/verazuo/jailbreak_llms, but you can define your own examples and place them in a subdirectory of src/main/resources/. All text files in the configured bad-examples-resource-dir are included in the similarity search.

This can be used for more than jailbreak attempt detection, as the sketch below illustrates.
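For example, a hypothetical configuration that uses the SimilarityGuard to block responses resembling known competitor product mentions; the category, threshold, and resource directory are illustrative, and the directory would contain your own example texts:

src/main/resources/application.conf
akka.javasdk.agent.guardrails {
  "competitor guard" {
    class = "akka.javasdk.agent.SimilarityGuard"
    agents = ["*"]
    category = COMPETITOR
    use-for = ["model-response"]
    threshold = 0.75
    bad-examples-resource-dir = "guardrail/competitors"
  }
}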