Guardrails
Guardrails can protect against harmful inputs, such as jailbreak attempts, and damaging outputs, such as mentions of a competitor’s product.
| For protecting sensitive information like PII, see Sanitization. |
A specific guardrail implements the TextGuardrail interface. It takes the input or output text as a parameter and returns a result indicating whether the text passed the validation or not, including an explanation of why the decision was made. These results are included in metrics and traces. A guardrail can abort the interaction with the model, or only report the problem and continue anyway.
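To make the contract concrete, the following sketch shows roughly what the interface looks like, inferred from the example further below. The member names of the Result type are assumptions for illustration; the real definitions are provided by akka.javasdk.agent.TextGuardrail.
// Illustrative sketch only, the SDK provides the real TextGuardrail and Result types
public interface TextGuardrail {

  // evaluate the input or output text and decide whether it passes the guardrail
  Result evaluate(String text);

  // assumed shape of the result: a pass/fail flag plus an explanation of the decision
  record Result(boolean passed, String explanation) {
    // predefined passing result
    public static final Result OK = new Result(true, "");
  }
}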
An example of a Guardrail implementation:
import akka.javasdk.agent.GuardrailContext;
import akka.javasdk.agent.TextGuardrail;

public class ToxicGuard implements TextGuardrail {

  private final String searchFor;

  public ToxicGuard(GuardrailContext context) {
    searchFor = context.config().getString("search-for");
  }

  @Override
  public Result evaluate(String text) {
    // this would typically be more advanced in a real implementation
    if (text.contains(searchFor)) {
      return new Result(false, "Toxic response '%s' not allowed.".formatted(searchFor));
    } else {
      return Result.OK;
    }
  }
}
Guardrails are enabled by configuration, which makes it possible to enforce at deployment time that certain guardrails are always used.
akka.javasdk.agent.guardrails {
  "pii guard" { (1)
    class = "com.example.guardrail.PiiGuard" (2)
    agents = ["planner-agent"] (3)
    agent-roles = ["worker"] (4)
    category = PII (5)
    use-for = ["model-request", "mcp-tool-request"] (6)
    report-only = false (7)
  }
  "toxic guard" {
    class = "com.example.guardrail.ToxicGuard"
    agent-roles = ["worker"]
    category = TOXIC
    use-for = ["model-response", "mcp-tool-response"]
    report-only = false
    search-for = "bad stuff"
  }
}
| 1 | Each configured guardrail has a unique name. |
| 2 | Implementation class of the guardrail. |
| 3 | Enable this guardrail for agents with these component ids. |
| 4 | Enable this guardrail for agents with these roles. |
| 5 | The type of validation, such as PII and TOXIC. |
| 6 | Where to use the guardrail, such as for the model request or model response. |
| 7 | If the text does not pass the evaluation, execution can either be aborted or continue anyway. In both cases, the result is tracked in logs, metrics and traces. |
The implementation class of the guardrail is configured with the class property. The class must implement the TextGuardrail interface. The class may optionally have a constructor with a GuardrailContext parameter, which includes the name and the config section for the specific guardrail. In the ToxicGuard code example above, you can see how the configuration property search-for is read from the config of the GuardrailContext parameter.
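As an illustration, the com.example.guardrail.PiiGuard class referenced in the configuration above could be sketched as follows. The pii-pattern property, the regex-based detection, and the context.name() accessor are assumptions made for this sketch, not part of the configuration shown earlier:
import java.util.regex.Pattern;

import akka.javasdk.agent.GuardrailContext;
import akka.javasdk.agent.TextGuardrail;

public class PiiGuard implements TextGuardrail {

  private final String name;
  private final Pattern piiPattern;

  public PiiGuard(GuardrailContext context) {
    // the context carries the configured guardrail name, e.g. "pii guard"
    // (accessor name assumed for this sketch)
    name = context.name();
    // hypothetical custom property, read from this guardrail's config section
    piiPattern = Pattern.compile(context.config().getString("pii-pattern"));
  }

  @Override
  public Result evaluate(String text) {
    if (piiPattern.matcher(text).find()) {
      return new Result(false, "Blocked by guardrail '%s': possible PII detected.".formatted(name));
    } else {
      return Result.OK;
    }
  }
}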
Agents are selected by matching the agents or agent-roles configuration.

- agents: enabled for agents with these component ids; if agents contains "*" the guardrail is enabled for all agents
- agent-roles: enabled for agents with these roles; if agent-roles contains "*" the guardrail is enabled for all agents that have a role, but not for agents without a role

If both agents and agent-roles are defined, it is enough that one of them matches to enable the guardrail for an agent.
The agent role is defined with the @AgentRole annotation.
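For example, an agent that matches the agent-roles = ["worker"] selection could be declared like this. The import location of @AgentRole and the agent body are assumptions for this sketch:
import akka.javasdk.agent.Agent;
import akka.javasdk.annotations.AgentRole; // import location assumed for this sketch
import akka.javasdk.annotations.ComponentId;

@ComponentId("weather-agent")
@AgentRole("worker") // matched by agent-roles = ["worker"] in the guardrail configuration
public class WeatherAgent extends Agent {

  public Effect<String> query(String question) {
    // illustrative agent body: forward the question to the model
    return effects()
        .systemMessage("You are a helpful weather assistant.")
        .userMessage(question)
        .thenReply();
  }
}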
The name and the category are reported in logs, metrics and traces. The category should classify the type of validation. It can be any value, but a few recommended categories are JAILBREAK, PROMPT_INJECTION, PII, TOXIC, HALLUCINATED, NSFW, FORMAT.
The guardrail can be enabled for certain inputs or outputs with the use-for property. The use-for property accepts the following values: model-request, model-response, mcp-tool-request, mcp-tool-response, and *.
Guardrail for similar text
The built-in SimilarityGuard evaluates the text by making a similarity search in a dataset of "bad examples". If the similarity exceeds a threshold, the result is flagged as blocked.
This is how to configure the SimilarityGuard:
akka.javasdk.agent.guardrails {
  "jailbreak guard" {
    class = "akka.javasdk.agent.SimilarityGuard"
    agents = ["planner-agent", "weather-agent"]
    category = JAILBREAK
    use-for = ["model-request"]
    threshold = 0.75
    bad-examples-resource-dir = "guardrail/jailbreak"
  }
}
Here, it’s using predefined examples of jailbreak prompts in guardrail/jailbreak. Those have been incorporated from https://github.com/verazuo/jailbreak_llms, but you can define your own examples and place them in a subdirectory of src/main/resources/. All text files in the configured bad-examples-resource-dir are included in the similarity search.
This can be used for more than jailbreak attempt detection; by providing your own examples, the same mechanism can block other categories of unwanted text, such as mentions of a competitor’s product.