Working with LLMs
Embabel supports any LLM supported by Spring AI. In practice, this is just about any LLM.
Choosing an LLM
Embabel encourages you to think about LLM choice for every LLM invocation.
The PromptRunner interface makes this easy.
Because Embabel enables you to break agentic flows up into multiple action steps, each step can use a smaller, focused prompt with fewer tools.
This means it may be able to use a smaller LLM.
Considerations:
- Consider the complexity of the return type you expect from the LLM. This is typically a good proxy for determining required LLM quality. A small LLM is likely to struggle with a deeply nested return structure.
- Consider the nature of the task. LLMs have different strengths; review any available documentation. You don’t necessarily need a huge, expensive model that is good at nearly everything, at the cost of your wallet and the environment.
- Consider the sophistication of tool calling required. Simple tool calls are fine, but complex orchestration is another indicator you’ll need a strong LLM. (It may also be an indication that you should create a more sophisticated flow using Embabel GOAP.)
- Consider trying a local LLM running under Ollama or Docker.
Trial and error is your friend.
Embabel makes it easy to switch LLMs; try the cheapest thing that could work and switch if it doesn’t.
Tuning for Smaller and Local Models
A core goal of Embabel is to make agentic flows work well across the full range of LLMs, so you can choose the cheapest, smallest, or most private model that does the job — rather than always reaching for a frontier model. Smaller chat models behave differently from frontier models in ways that the framework can compensate for:
- Silent failures after tool calls. Weaker open-weights models (e.g. gpt-oss-20b, some Qwen variants) sometimes return blank text with no further tool calls when they don’t know how to proceed. Without intervention the tool loop exits with empty content. Activate embabel.agent.platform.toolloop.empty-response.max-retries: 1 to feed a synthetic nudge back to the model and give it one more chance.
- Tool-name confusion. Smaller models more frequently call tools by approximate names. The default AutoCorrectionPolicy handles this by feeding back a "did you mean X?" suggestion; tune embabel.agent.platform.toolloop.tool-not-found.max-retries if your model needs more attempts.
- Iteration headroom. Recovery costs LLM calls. If you enable retry policies, raise embabel.agent.platform.toolloop.max-iterations so a turn that needs an extra round trip doesn’t run out of budget.
These settings are off by default, so existing deployments using strong models behave exactly as before. Turn them on per deployment when the model you’ve picked benefits from them.
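To make the empty-response recovery concrete, here is a minimal sketch of the idea in plain Java. It is not Embabel's implementation: the class name, the nudge text, and the use of a `Function` as a stand-in for the model are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of an empty-response retry; not Embabel's actual tool loop. */
final class EmptyResponseRetrySketch {

    // Assumed nudge wording; the text the framework really sends may differ.
    static final String NUDGE =
            "Your last reply was empty. Please continue or give a final answer.";

    /**
     * Calls the model once and, while the reply is blank, appends a nudge
     * and calls again, up to maxRetries extra calls.
     * Returns the last reply, which may still be empty.
     */
    static String callWithNudge(Function<List<String>, String> model,
                                List<String> messages,
                                int maxRetries) {
        List<String> history = new ArrayList<>(messages);
        String reply = model.apply(history);
        int retries = 0;
        while ((reply == null || reply.isBlank()) && retries < maxRetries) {
            history.add(NUDGE);            // synthetic message fed back to the model
            reply = model.apply(history);  // each retry costs one more LLM call
            retries++;
        }
        return reply == null ? "" : reply;
    }
}
```

Each retry is an additional model call, which is why the iteration budget may need raising when retries are enabled.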
Advanced: Custom LLM Integration
If you want to use a standard provider (Anthropic, OpenAI, DeepSeek, Mistral) with a
user-supplied key at runtime, see .
That is the recommended path for BYOK scenarios.
This section covers implementing LlmMessageSender from scratch for providers not otherwise
supported by Embabel.
Embabel’s tool loop is framework-agnostic, allowing you to integrate any LLM provider by implementing the LlmMessageSender interface.
This is useful when:
- You want to use an LLM provider not supported by Spring AI
- You need custom request/response handling
- You’re integrating with a proprietary or internal LLM service
The LlmMessageSender Interface
The core abstraction is the LlmMessageSender functional interface:
@FunctionalInterface
public interface LlmMessageSender {
    LlmMessageResponse call(
        List<Message> messages,
        List<Tool> tools
    );
}
The implementation makes a single LLM inference call and returns the response.
Importantly, it does not execute tools; it only returns any tool call requests from the LLM.
Tool execution is handled by Embabel’s DefaultToolLoop.
Tools can also be executed in parallel, controlled by ParallelToolLoop. To activate ParallelToolLoop, set the following property:
embabel.agent.platform.toolloop.type=parallel
For the full list of tool loop configuration parameters, refer to ToolLoopConfiguration.
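As an illustration of what parallel tool execution means, the following sketch runs several tool calls concurrently and returns the results in request order. It is an assumption-laden toy, not ParallelToolLoop itself: the `Call` and `Result` records and the `executeAll` method are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical sketch of parallel tool execution; not Embabel's ParallelToolLoop. */
final class ParallelToolCallsSketch {

    // Stand-ins for a tool call request and its result (assumed shapes).
    record Call(String id, String name, String arguments) {}
    record Result(String id, String content) {}

    /** Runs every call concurrently and returns results in the order of the requests. */
    static List<Result> executeAll(List<Call> calls, Function<Call, String> executor) {
        // Start all calls first so they can overlap...
        List<CompletableFuture<Result>> futures = calls.stream()
                .map(c -> CompletableFuture.supplyAsync(
                        () -> new Result(c.id(), executor.apply(c))))
                .toList();
        // ...then wait for each one, preserving request order.
        return futures.stream().map(CompletableFuture::join).toList();
    }
}
```

Keeping the results in request order lets each result be matched back to the tool call id the LLM issued.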
Tool-Not-Found Recovery Policy
When the LLM calls a tool by a name that doesn’t exist in the available set, the behavior is controlled by ToolNotFoundPolicy.
Two built-in policies are provided:
- AutoCorrectionPolicy (default): feeds the error back to the LLM so it can self-correct. Uses case-insensitive fuzzy matching to suggest corrections for hallucinated tool names (e.g., ragbot_vectorSearch → suggests vectorSearch). When multiple candidates match, all are listed. Throws ToolNotFoundException after 3 consecutive failures.
- ImmediateThrowPolicy: throws ToolNotFoundException immediately.
The system-wide default is AutoCorrectionPolicy, provided as a Spring bean with @ConditionalOnMissingBean.
To override it globally, define your own ToolNotFoundPolicy bean.
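The kind of case-insensitive matching described above can be sketched as follows. This is only one plausible approach (substring containment) and may differ from what AutoCorrectionPolicy actually does; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of tool-name suggestions; not the real AutoCorrectionPolicy. */
final class ToolNameSuggestionSketch {

    /** Returns every available name that contains, or is contained in, the requested name (ignoring case). */
    static List<String> suggest(String requested, List<String> available) {
        String needle = requested.toLowerCase(Locale.ROOT);
        return available.stream()
                .filter(name -> {
                    String candidate = name.toLowerCase(Locale.ROOT);
                    return needle.contains(candidate) || candidate.contains(needle);
                })
                .toList();
    }

    /** Builds the error text that would be fed back to the LLM. */
    static String feedback(String requested, List<String> available) {
        List<String> matches = suggest(requested, available);
        if (matches.isEmpty()) {
            return "Tool '" + requested + "' not found. Available tools: "
                    + String.join(", ", available);
        }
        // All matching candidates are listed, not just the first.
        return "Tool '" + requested + "' not found. Did you mean: "
                + String.join(", ", matches) + "?";
    }
}
```

With available tools `vectorSearch` and `webSearch`, a request for `ragbot_vectorSearch` yields the single suggestion `vectorSearch`.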
For per-interaction control, use withToolNotFoundPolicy() on PromptRunner:
promptRunner
    .withToolNotFoundPolicy(new AutoCorrectionPolicy(5))
    .creating(MyOutput.class)
    .create(messages);
To define a custom policy, implement the ToolNotFoundPolicy interface (shown here in Kotlin):
class MyEditDistancePolicy : ToolNotFoundPolicy {
    override fun handle(requestedName: String, availableTools: List<Tool>): ToolNotFoundAction {
        // Custom recovery logic, e.g. edit-distance matching
        ...
    }
}
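For the edit-distance matching mentioned in the comment above, a custom policy could build on a helper like the one below. This is a standalone Levenshtein sketch in Java with hypothetical names; wiring it into ToolNotFoundPolicy and choosing the distance threshold are left open.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for edit-distance matching of tool names. */
final class EditDistanceSketch {

    /** Classic Levenshtein distance using two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest available name if it is within maxDistance edits (case-insensitive). */
    static Optional<String> closest(String requested, List<String> available, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String name : available) {
            int d = distance(requested.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                best = name;
                bestDistance = d;
            }
        }
        return bestDistance <= maxDistance ? Optional.ofNullable(best) : Optional.empty();
    }
}
```

A small threshold (for example 2) keeps the policy from "correcting" a request to an unrelated tool.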
Response Types
The LlmMessageResponse contains:
- message: The LLM’s response as an Embabel Message
- textContent: Text content from the response
- usage: Optional token usage information
For responses that include tool calls, return an AssistantMessageWithToolCalls:
public record ToolCall(
    String id,        // Unique identifier for the tool call
    String name,      // Name of the tool to invoke
    String arguments  // JSON arguments for the tool
) {}
Example: Custom LLM Provider
Here’s an example of implementing LlmMessageSender for a hypothetical HTTP-based LLM API:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Embabel imports omitted. HttpClient and MyApiResponse stand in for your own HTTP client and response type.
public class MyCustomLlmMessageSender implements LlmMessageSender {
    private final HttpClient httpClient;
    private final String apiKey;
    private final String model;

    public MyCustomLlmMessageSender(HttpClient httpClient, String apiKey, String model) {
        this.httpClient = httpClient;
        this.apiKey = apiKey;
        this.model = model;
    }

    @Override
    public LlmMessageResponse call(List<Message> messages, List<Tool> tools) {
        // Convert Embabel messages to your API's format
        List<Map<String, Object>> apiMessages = messages.stream()
            .map(message -> Map.<String, Object>of(
                "role", message.getRole().name().toLowerCase(),
                "content", message.getTextContent()
            ))
            .toList();

        // Convert tool definitions to your API's format
        List<Map<String, Object>> apiTools = tools.stream()
            .map(tool -> Map.<String, Object>of(
                "name", tool.getDefinition().getName(),
                "description", tool.getDefinition().getDescription(),
                "parameters", tool.getDefinition().getInputSchema().jsonSchema()
            ))
            .toList();

        // Build the request body. Map.of rejects null values, so use a mutable map
        // and add "tools" only when there are tools to send.
        Map<String, Object> requestBody = new HashMap<>();
        requestBody.put("model", model);
        requestBody.put("messages", apiMessages);
        if (!apiTools.isEmpty()) {
            requestBody.put("tools", apiTools);
        }

        // Make API request (using your preferred HTTP client)
        MyApiResponse responseBody = httpClient.post("https://api.my-llm.com/chat")
            .header("Authorization", "Bearer " + apiKey)
            .body(requestBody)
            .execute(MyApiResponse.class);

        // Check if LLM requested tool calls
        List<ToolCall> toolCalls = null;
        if (responseBody.getToolCalls() != null) {
            toolCalls = responseBody.getToolCalls().stream()
                .map(call -> new ToolCall(
                    call.getId(),
                    call.getFunction().getName(),
                    call.getFunction().getArguments()
                ))
                .toList();
        }

        // Normalize missing text to an empty string once and reuse it below
        String content = responseBody.getContent() != null ? responseBody.getContent() : "";

        Message embabelMessage;
        if (toolCalls == null || toolCalls.isEmpty()) {
            embabelMessage = new AssistantMessage(content);
        } else {
            embabelMessage = new AssistantMessageWithToolCalls(content, toolCalls);
        }

        Usage usage = null;
        if (responseBody.getUsage() != null) {
            usage = new Usage(
                responseBody.getUsage().getPromptTokens(),
                responseBody.getUsage().getCompletionTokens()
            );
        }

        return new LlmMessageResponse(embabelMessage, content, usage);
    }
}
Creating an LlmService
To make your custom LLM available through Embabel’s ModelProvider, implement the LlmService interface:
public class MyCustomLlmService implements LlmService<MyCustomLlmService> {
    private final String name;
    private final String provider;
    private final HttpClient httpClient;
    private final String apiKey;
    private final LocalDate knowledgeCutoffDate;
    private final List<PromptContributor> promptContributors;
    private final PricingModel pricingModel;

    public MyCustomLlmService(
            String name,
            String provider,
            HttpClient httpClient,
            String apiKey,
            LocalDate knowledgeCutoffDate,
            PricingModel pricingModel) {
        this(name, provider, httpClient, apiKey, knowledgeCutoffDate,
            knowledgeCutoffDate != null
                ? List.of(new KnowledgeCutoffDate(knowledgeCutoffDate))
                : List.of(),
            pricingModel);
    }

    // Used by withPromptContributor to carry an explicit contributor list
    private MyCustomLlmService(
            String name,
            String provider,
            HttpClient httpClient,
            String apiKey,
            LocalDate knowledgeCutoffDate,
            List<PromptContributor> promptContributors,
            PricingModel pricingModel) {
        this.name = name;
        this.provider = provider;
        this.httpClient = httpClient;
        this.apiKey = apiKey;
        this.knowledgeCutoffDate = knowledgeCutoffDate;
        this.promptContributors = List.copyOf(promptContributors);
        this.pricingModel = pricingModel;
    }

    @Override
    public String getName() { return name; }

    @Override
    public String getProvider() { return provider; }

    @Override
    public LocalDate getKnowledgeCutoffDate() { return knowledgeCutoffDate; }

    @Override
    public List<PromptContributor> getPromptContributors() { return promptContributors; }

    @Override
    public PricingModel getPricingModel() { return pricingModel; }

    @Override
    public LlmMessageSender createMessageSender(LlmOptions options) {
        return new MyCustomLlmMessageSender(
            httpClient,
            apiKey,
            options.getModel() != null ? options.getModel() : name
        );
    }

    @Override
    public MyCustomLlmService withKnowledgeCutoffDate(LocalDate date) {
        return new MyCustomLlmService(name, provider, httpClient, apiKey, date, pricingModel);
    }

    @Override
    public MyCustomLlmService withPromptContributor(PromptContributor promptContributor) {
        var newContributors = new ArrayList<>(promptContributors);
        newContributors.add(promptContributor);
        return new MyCustomLlmService(
            name, provider, httpClient, apiKey, knowledgeCutoffDate,
            newContributors, pricingModel
        );
    }
}
Then register it as a Spring bean:
@Configuration
public class MyLlmConfiguration {

    @Bean
    public LlmService<?> myCustomLlm(
            HttpClient httpClient,
            @Value("${my-llm.api-key}") String apiKey) {
        return new MyCustomLlmService(
            "my-custom-model",
            "MyProvider",
            httpClient,
            apiKey,
            LocalDate.of(2024, 12, 1),
            null
        );
    }
}
The bean will be automatically discovered and made available through the ModelProvider.
How Model Discovery and Selection Works
When your application starts, ConfigurableModelProvider collects all LlmService beans from the Spring application context.
Your custom LLM is matched by the name property you set on your LlmService implementation.
By name: Use the name from your LlmService directly.
This works with @LlmCall, ai.withLlm(), and AgenticTool.withLlm():
// In a declarative action
@LlmCall(llm = "my-custom-model")
String myAction();
// In an imperative action
ai.withLlm("my-custom-model")
    .create("Tell me a joke", String.class);
By role: Map a role name to your model name in configuration, then reference it with the # prefix:
embabel:
  models:
    default-llm: my-custom-model # ①
    llms:
      best: my-custom-model # ②
      cheapest: my-small-model # ③
- ① Sets the default LLM used when no explicit model is specified
- ② Maps the best role to your custom model
- ③ Maps the cheapest role to a different model
Then reference roles with #:
// By role
@LlmCall(llm = "#best")
String myAction();
// Or programmatically
ai.withLlmByRole("best")
    .create("Tell me a joke", String.class);
If no LLM is specified in @LlmCall or withLlm(), the default-llm from configuration is used.
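The selection rules above (explicit name, `#role`, or the default) can be summarized in a small resolver. This sketch is not ConfigurableModelProvider; the class name, constructor, and error handling are assumptions chosen to mirror the configuration shown earlier.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of name/role resolution; not Embabel's ConfigurableModelProvider. */
final class ModelSelectionSketch {
    private final Set<String> modelNames;     // names of registered LLM services
    private final Map<String, String> roles;  // role -> model name
    private final String defaultModel;        // used when nothing is specified

    ModelSelectionSketch(Set<String> modelNames, Map<String, String> roles, String defaultModel) {
        this.modelNames = Set.copyOf(modelNames);
        this.roles = Map.copyOf(roles);
        this.defaultModel = defaultModel;
    }

    /** Resolves a selector such as "my-custom-model", "#best", or null to a model name. */
    String resolve(String llm) {
        String name;
        if (llm == null || llm.isBlank()) {
            name = defaultModel;                 // nothing specified: fall back to the default
        } else if (llm.startsWith("#")) {
            String role = llm.substring(1);      // "#best" -> role "best"
            name = roles.get(role);
            if (name == null) {
                throw new IllegalArgumentException("No model mapped to role '" + role + "'");
            }
        } else {
            name = llm;                          // plain name: use as-is
        }
        if (!modelNames.contains(name)) {
            throw new IllegalArgumentException("Unknown model '" + name + "'");
        }
        return name;
    }
}
```

Failing fast on an unknown role or name is a design choice of this sketch; it surfaces configuration typos at the first call rather than silently using another model.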
Using Your Custom Implementation (Alternative)
If you need more control over the LLM operations layer itself, you can extend ToolLoopLlmOperations:
public class MyCustomLlmOperations extends ToolLoopLlmOperations {
    private final HttpClient httpClient;
    private final String apiKey;

    public MyCustomLlmOperations(
            HttpClient httpClient,
            String apiKey,
            ModelProvider modelProvider,
            ToolDecorator toolDecorator,
            Validator validator) {
        super(modelProvider, toolDecorator, validator);
        this.httpClient = httpClient;
        this.apiKey = apiKey;
    }

    @Override
    protected LlmMessageSender createMessageSender(LlmService<?> llm, LlmOptions options) {
        return new MyCustomLlmMessageSender(
            httpClient,
            apiKey,
            options.getModel() != null ? options.getModel() : "default-model"
        );
    }
}
The ToolLoopLlmOperations base class provides several extension points:
- createMessageSender(): Create the LLM communication layer
- createOutputConverter(): Parse LLM responses into typed objects
- sanitizeStringOutput(): Clean up raw text responses
- emitCallEvent(): Emit observability events
Key Implementation Notes
- Tool calls are not executed by your sender. Just return the tool call requests; Embabel’s tool loop handles execution and continuation.
- Handle both tool and non-tool responses. Return AssistantMessage for plain text, AssistantMessageWithToolCalls when the LLM wants to invoke tools.
- Include usage information when available. This enables cost tracking and observability.
- Message types matter. The tool loop expects specific message types:
  - UserMessage: User input
  - SystemMessage: System prompts
  - AssistantMessage: LLM text response
  - AssistantMessageWithToolCalls: LLM response with tool requests
  - ToolResultMessage: Result returned to LLM after tool execution
Advanced: Custom Embedding Service
Just as you can integrate a custom LLM, you can implement a custom embedding service that doesn’t depend on Spring AI. This is useful when:
- You want to use an embedding provider not supported by Spring AI
- You need custom pre/post-processing of embeddings
- You’re integrating with a proprietary or internal embedding API
The EmbeddingService Interface
The EmbeddingService interface is framework-agnostic.
Unlike SpringAiEmbeddingService, a custom implementation does not need to wrap a Spring AI EmbeddingModel:
public interface EmbeddingService {
    float[] embed(String text);
    List<float[]> embed(List<String> texts);
    int getDimensions();
    String getName();
    String getProvider();
}
Example: Custom Embedding Provider
Here’s an example of implementing EmbeddingService for an HTTP-based embedding API:
public class MyCustomEmbeddingService implements EmbeddingService {
    private final String name;
    private final String provider;
    private final int dimensions;
    private final HttpClient httpClient;
    private final String apiKey;

    public MyCustomEmbeddingService(
            String name,
            String provider,
            int dimensions,
            HttpClient httpClient,
            String apiKey) {
        this.name = name;
        this.provider = provider;
        this.dimensions = dimensions;
        this.httpClient = httpClient;
        this.apiKey = apiKey;
    }

    @Override
    public String getName() { return name; }

    @Override
    public String getProvider() { return provider; }

    @Override
    public int getDimensions() { return dimensions; }

    @Override
    public float[] embed(String text) {
        // Delegate to the batch method with a single-element list
        return embed(List.of(text)).get(0);
    }

    @Override
    public List<float[]> embed(List<String> texts) {
        MyEmbeddingResponse response = httpClient
            .post("https://api.my-embeddings.com/embed")
            .header("Authorization", "Bearer " + apiKey)
            .body(Map.of("texts", texts, "model", name))
            .execute(MyEmbeddingResponse.class);
        return response.getEmbeddings();
    }
}
Registering as a Spring Bean
Register your custom embedding service as a Spring bean and it will be automatically discovered:
@Configuration
public class MyEmbeddingConfiguration {

    @Bean
    public EmbeddingService myCustomEmbeddings(
            HttpClient httpClient,
            @Value("${my-embeddings.api-key}") String apiKey) {
        return new MyCustomEmbeddingService(
            "my-custom-embeddings",
            "MyProvider",
            384,
            httpClient,
            apiKey
        );
    }
}
Discovery and Selection
Custom embedding services follow the same discovery and selection pattern as LLMs (see How Model Discovery and Selection Works).
By name: Use ai.withEmbeddingService() with the name from your implementation:
ai.withEmbeddingService("my-custom-embeddings")
    .embed("Hello world");
By role: Map a role name to your embedding service in configuration:
embabel:
  models:
    default-embedding-model: my-custom-embeddings # ①
    embedding-services:
      cheapest: my-custom-embeddings # ②
- ① Sets the default embedding service
- ② Maps the cheapest role to your custom embedding service
Advanced Caching with Anthropic
While many providers have implicit caching managed internally, Anthropic exposes public APIs for explicit prompt caching control. This allows you to optimize costs and latency for applications with long prompts, many tools, or extended conversations.
Motivation
Anthropic’s prompt caching feature provides significant benefits:
- Cost savings: Cache reads cost 90% less than regular input tokens
- Latency improvements: Cached content is processed faster
- Ideal for: Long system prompts, extensive tool definitions, multi-turn conversations
Without caching, every API call processes the entire prompt from scratch. With caching, repeated content (system prompts, tools, conversation history) can be cached and reused across requests.
How It Works
Anthropic caches a prefix of your prompt: everything up to and including a marked cache breakpoint, in the order tools, system prompt, then messages. A cache hit requires that prefix to match exactly; any change within the cached portion invalidates the cache.
Key concepts:
- Cache creation: First time content is seen, it’s written to cache with a 25% premium over regular input tokens (for 5-minute TTL)
- Cache reads: Subsequent requests with matching content read from cache at 10% of regular input token cost
- Cache TTL: 5 minutes (default) or 1 hour (premium, higher creation cost)
- Minimum size: Model-dependent, ranging from 1024 to 4096 tokens. Shorter prompts are not cached; check Anthropic’s documentation for the threshold that applies to your model.
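The pricing figures above translate into a simple break-even calculation. The sketch below works in relative units (one unit equals the price of one regular input token) and assumes the 5-minute TTL and that every request after the first hits the cache; the class and method names are hypothetical.

```java
/** Hypothetical arithmetic sketch for prompt-cache costs, in regular-input-token units. */
final class CacheCostSketch {
    static final double WRITE_MULTIPLIER_5M = 1.25; // cache creation: 25% premium
    static final double READ_MULTIPLIER = 0.10;     // cache read: 10% of regular cost

    /** Cost of sending the same prefix on every request without caching. */
    static double uncached(long prefixTokens, int requests) {
        return (double) prefixTokens * requests;
    }

    /** Cost with caching: one cache write, then (requests - 1) cache reads. */
    static double cached(long prefixTokens, int requests) {
        if (requests <= 0) {
            return 0.0;
        }
        return prefixTokens * (WRITE_MULTIPLIER_5M + READ_MULTIPLIER * (requests - 1));
    }
}
```

Under these assumptions a 2,000-token prefix costs 2,500 units cached versus 2,000 uncached for a single request, but about 2,700 versus 4,000 for two requests, so caching pays off from the second request that reuses the prefix within the TTL.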
Cache Strategies
Embabel provides several caching strategies through AnthropicCachingConfig:
System Prompt Caching
Cache the system prompt for reuse across multiple requests:
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setSystemPrompt(true);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
Tools Caching
Cache tool definitions when using many tools or tools with large schemas:
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setTools(true);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
System + Tools Caching
Combine both strategies:
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setSystemPrompt(true);
cachingConfig.setTools(true);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
Conversation History Caching
Cache conversation history for long multi-turn conversations:
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setConversationHistory(true);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
Advanced Configuration
Message Type Minimum Content Length
Control which messages are eligible for caching based on their content length:
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setSystemPrompt(true);
cachingConfig.messageTypeMinContentLength(MessageRole.SYSTEM, 1024);
cachingConfig.messageTypeMinContentLength(MessageRole.USER, 512);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
Message Type TTL
Set cache TTL per message type (default is 5 minutes):
AnthropicCachingConfig cachingConfig = new AnthropicCachingConfig();
cachingConfig.setSystemPrompt(true);
cachingConfig.messageTypeTtl(MessageRole.SYSTEM, AnthropicCacheTtl.ONE_HOUR);
LlmOptions options = LlmOptions.withDefaultLlm();
options = withAnthropicCaching(options, cachingConfig);
Accessing Cache Metrics
Embabel provides extension methods to access Anthropic-specific cache metrics from the Usage object:
import static com.embabel.agent.config.models.anthropic.AnthropicUsage.*;
AssistantMessage response = promptRunner.respond(messages);
Usage usage = response.getUsage();
// Check if cache was created or read
boolean cacheCreated = hasAnthropicCacheCreation(usage);
boolean cacheRead = hasAnthropicCacheRead(usage);
// Get token counts
Integer creationTokens = anthropicCacheCreationTokens(usage);
Integer readTokens = anthropicCacheReadTokens(usage);
// Get summary string for logging
String summary = anthropicCacheSummary(usage);
// Example output: "cache[creation=1061, read=0]"
Best Practices
- Cache long, stable content: System prompts and tool definitions that don’t change frequently are ideal candidates
- Mind the minimum size: Content must meet the minimum token requirement (1024 or 4096 depending on model)
- Monitor cache metrics: Use the cache extension methods to track cache hit rates and validate savings
- Consider TTL vs cost: The 1-hour TTL has a higher creation cost but is a better fit for longer sessions
- Test before deploying: Cache behavior can vary based on prompt structure and usage patterns
Reference
For complete details on Anthropic’s prompt caching, see: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching




