A working Claude API integration without prompt caching pays full price for tokens you've already paid for. Caching alone often cuts spend in half. This tutorial walks through the canonical Anthropic SDK shape and the optimizations that actually move the bill, plus what changes (and what doesn't) when a new model ships.
The smallest Anthropic SDK example in the docs does work — but it ships three soft defaults that cost real money at any non-toy traffic:
SEARCH(...), run search" is the path of pain: fragile, no schema validation, no parallel tool calls. The SDK has a first-class tools shape; use it.
LingCode knows the Anthropic SDK shape and writes integrations with caching, tool use, and a model constant by default. Ask it to build the thing and you get the optimized version on the first pass.
.env as ANTHROPIC_API_KEY.
@anthropic-ai/sdk) because it covers the broadest shape; the Python anthropic SDK mirrors it method-for-method.
Ask LingCode to scaffold the SDK setup. The shape it produces:
// src/claude.ts
import Anthropic from "@anthropic-ai/sdk";
const MODEL = "claude-sonnet-4-7-20260101"; // one place to bump
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env
export async function ask(userText: string) {
  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 1024,
    messages: [{ role: "user", content: userText }],
  });
  // The response is an array of content blocks; only read .text from a text block.
  return res.content[0].type === "text" ? res.content[0].text : "";
}
Three things to notice. MODEL is a constant — every call routes through it. max_tokens is explicit — the SDK requires it. The response is a content-block array, not a string; the model can return multiple blocks (text, tool calls, thinking), so always check the type.
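Since a response can carry several blocks of different types, reading only the first block can drop text. The sketch below shows one way to collect every text block. The block types here are simplified local stand-ins rather than the SDK's own exported types, and `extractText` is a hypothetical helper name.

```typescript
// Simplified stand-ins for the SDK's content-block types (assumed shapes).
type TextBlock = { type: "text"; text: string };
type OtherBlock = { type: "tool_use" | "thinking"; [key: string]: unknown };
type ContentBlock = TextBlock | OtherBlock;

// Hypothetical helper: concatenate every text block and skip all other block types.
function extractText(content: ContentBlock[]): string {
  return content
    .filter((block): block is TextBlock => block.type === "text")
    .map((block) => block.text)
    .join("");
}
```

With a helper like this, `ask` could return `extractText(res.content)` instead of indexing into `content[0]`, assuming the SDK's blocks are structurally compatible with the stand-in types.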
The Python equivalent: from anthropic import Anthropic; client = Anthropic(); client.messages.create(...). Pick whichever language your backend is in; the API surface is the same.
If the system prompt is the same on every request — which it usually is — mark it cacheable and pay 10% of base price after the first call. Five minutes of work, sometimes half the bill.
const SYSTEM_PROMPT = `You are a support agent for Acme Widgets.
Your knowledge base:
${KNOWLEDGE_BASE_TEXT}
Always cite the section number when answering.`;
const res = await client.messages.create({
  model: MODEL,
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      // Marks everything up to and including this block as a cacheable prefix.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userText }],
});
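To see why a cached system prompt changes the bill, it helps to express input cost in "base-token equivalents". The sketch below assumes cache reads are billed at 10% of the base input price (as stated above) and cache writes at 125% of it. The write multiplier is an assumption to check against the current pricing page, and `inputCostInBaseTokens` is a hypothetical helper.

```typescript
// Usage fields as named in this tutorial (assumed shape, not the SDK's own type).
type CacheUsage = {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
};

const CACHE_WRITE_MULTIPLIER = 1.25; // assumption: writes cost 25% more than base input
const CACHE_READ_MULTIPLIER = 0.1; // from the text: reads cost 10% of base input

// Hypothetical helper: input-side cost measured in uncached-token equivalents.
function inputCostInBaseTokens(u: CacheUsage): number {
  return (
    u.input_tokens +
    u.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER +
    u.cache_read_input_tokens * CACHE_READ_MULTIPLIER
  );
}

// A 10,000-token system prompt plus a 50-token user message:
const firstCall = inputCostInBaseTokens({
  input_tokens: 50,
  cache_creation_input_tokens: 10_000,
  cache_read_input_tokens: 0,
});
const laterCall = inputCostInBaseTokens({
  input_tokens: 50,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 10_000,
});
```

Under these assumptions the first call costs a little more than an uncached one, and each later call within the cache lifetime costs roughly a tenth as much on the input side.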
Two rules to internalize:
Check res.usage.cache_creation_input_tokens and cache_read_input_tokens on the response to verify the cache actually hit.
Tool use is a three-step protocol: you declare tools, the model emits a tool_use block, you reply with a tool_result. The SDK gives you typed shapes; don't reinvent it as string parsing.
// Typing the array keeps `type: "object"` as a literal, which the SDK's tool type expects.
const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description: "Get current weather for a city.",
    input_schema: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. 'Paris'." },
      },
      required: ["city"],
    },
  },
];
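// Runs inside an async function (note the return below).
// runTool is your own dispatcher that executes a tool by name.
const messages: Anthropic.MessageParam[] = [
  { role: "user", content: "What's the weather in Tokyo?" },
];
while (true) {
  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 1024,
    tools,
    messages,
  });
  if (res.stop_reason !== "tool_use") {
    // Plain answer: done.
    return res.content;
  }
  // The model wants one or more tools. Run them all, reply once, loop.
  const toolUses = res.content.filter(
    (b): b is Anthropic.ToolUseBlock => b.type === "tool_use",
  );
  const toolResults: Anthropic.ToolResultBlockParam[] = await Promise.all(
    toolUses.map(async (t) => ({
      type: "tool_result" as const,
      tool_use_id: t.id,
      content: JSON.stringify(await runTool(t.name, t.input)),
    })),
  );
  // Echo the assistant turn back, then answer every tool_use in a single user message.
  messages.push({ role: "assistant", content: res.content });
  messages.push({ role: "user", content: toolResults });
}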
The loop terminates when stop_reason is anything other than tool_use — typically end_turn. The model may emit multiple tool_use blocks in a single response (parallel tool calls); run them all and reply with all the tool_results in one user message.
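The "run them all and reply with all the tool_results" step can be sketched as a small dispatcher. This is one possible shape, not the SDK's own helper: the block types are simplified local stand-ins, and `runAllTools` and the handler registry are hypothetical names. Failures are reported back through the `is_error` field of the tool_result so the model can react instead of the loop crashing.

```typescript
// Simplified stand-ins for the SDK's tool blocks (assumed shapes).
type ToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};
type ToolHandler = (input: unknown) => Promise<unknown> | unknown;

// Hypothetical dispatcher: runs every requested tool concurrently and returns
// one tool_result per tool_use, in the same order.
async function runAllTools(
  toolUses: ToolUse[],
  handlers: Record<string, ToolHandler>,
): Promise<ToolResult[]> {
  return Promise.all(
    toolUses.map(async (call): Promise<ToolResult> => {
      const handler = handlers[call.name];
      if (!handler) {
        return {
          type: "tool_result",
          tool_use_id: call.id,
          content: `Unknown tool: ${call.name}`,
          is_error: true,
        };
      }
      try {
        const result = await handler(call.input);
        return { type: "tool_result", tool_use_id: call.id, content: JSON.stringify(result) };
      } catch (err) {
        return { type: "tool_result", tool_use_id: call.id, content: String(err), is_error: true };
      }
    }),
  );
}
```

The returned array can be used directly as the content of the single user message that follows the assistant turn.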
Add cache_control: { type: "ephemeral" } to the last tool in the list to mark the whole block cacheable. Big tools array, same effect as a big system prompt.
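Marking the last tool can be done mechanically so it survives reordering of the list. The sketch below uses a simplified local `ToolDef` type and a hypothetical `withCachedTools` helper; it returns a copy and leaves the original array untouched.

```typescript
// Simplified stand-in for the SDK's tool definition type (assumed shape).
type ToolDef = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: { type: "ephemeral" };
};

// Hypothetical helper: return a copy of the list with cache_control on the last tool only.
function withCachedTools(tools: ToolDef[]): ToolDef[] {
  return tools.map((tool, i): ToolDef =>
    i === tools.length - 1 ? { ...tool, cache_control: { type: "ephemeral" } } : tool,
  );
}
```

The result would be passed as `tools: withCachedTools(tools)` in the request, assuming the real tool definitions are compatible with the stand-in type.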
The same call, streamed:
const stream = client.messages.stream({
  model: MODEL,
  max_tokens: 1024,
  messages: [{ role: "user", content: userText }],
});
for await (const event of stream) {
  // Only text deltas carry renderable output; other events are bookkeeping.
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}
// Resolves once the stream ends, with usage and stop_reason populated.
const final = await stream.finalMessage();
console.log("usage:", final.usage);
Streaming events include message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop. For most UIs you only render text_delta deltas; the rest are bookkeeping. finalMessage() resolves the assembled response — usage stats and stop reason live there.
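The "render text_delta, ignore the rest" rule is easy to unit-test without a network call. The sketch below works on an array of recorded events using a simplified local event type (an assumption, not the SDK's exported union), and `collectText` is a hypothetical name.

```typescript
// Simplified stand-in for the stream event union (assumed shape).
type StreamEventLike =
  | {
      type: "content_block_delta";
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | {
      type:
        | "message_start"
        | "content_block_start"
        | "content_block_stop"
        | "message_delta"
        | "message_stop";
    };

// Hypothetical helper: assemble the visible text from a recorded list of events.
function collectText(events: StreamEventLike[]): string {
  let text = "";
  for (const event of events) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      text += event.delta.text;
    }
  }
  return text;
}
```

Recording a real stream once and replaying it through a function like this is one way to regression-test a chat UI's rendering logic.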
LingCode will wire any of these if you ask, but they're not free choices — each has a use case:
citations: { enabled: true } and the model returns citation objects pointing at the source spans. Real grounding, not vibes.
thinking: { type: "enabled", budget_tokens: 5000 } on hard problems. The model produces a thinking content block (preserved across tool use so it doesn't re-reason). Cost: those thinking tokens are billed. Worth it for math, code, multi-step planning; overkill for "summarize this paragraph."
memory tool lets the model write durable notes across sessions. Useful for long-running agents; verify the model actually has the tool in its training before depending on it (check the model card).
When a new Sonnet or Opus ships, the migration is usually a one-line change, but the regression catch is on you. The API is stable across these releases; the model's behavior isn't.
// Before
const MODEL = "claude-sonnet-4-6-20251022";
// After
const MODEL = "claude-sonnet-4-7-20260101";
What to check after the swap:
cache_read_input_tokens > 0).
max_tokens if your prompts ever truncated.
Retired-model swaps are the same shape: Anthropic publishes a sunset date, you change the constant before that date. LingCode reads the model deprecation pages when you ask "what's the current Sonnet model string" and writes the migration with a regression test.
Don't use claude-3-5-sonnet-latest-style aliases in production. They drift under you. Pin to the dated snapshot (claude-sonnet-4-7-20260101) and migrate deliberately.
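The pinning rule can be enforced at startup rather than by convention. This is a minimal sketch that assumes dated snapshots end in an eight-digit date suffix, as the model strings in this tutorial do; `isPinnedSnapshot` and `assertPinned` are hypothetical helpers.

```typescript
// Assumption: a pinned snapshot ends in "-YYYYMMDD"; floating aliases do not.
function isPinnedSnapshot(model: string): boolean {
  return /-\d{8}$/.test(model);
}

// Hypothetical guard: fail fast if the model constant is a floating alias.
function assertPinned(model: string): string {
  if (!isPinnedSnapshot(model)) {
    throw new Error(`Model "${model}" is not pinned to a dated snapshot`);
  }
  return model;
}
```

Wrapping the constant, for example `const MODEL = assertPinned("claude-sonnet-4-7-20260101")`, would turn an accidental alias into an immediate startup error.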
Every Anthropic response carries usage stats. Log them:
console.log({
  input: res.usage.input_tokens,
  output: res.usage.output_tokens,
  cache_create: res.usage.cache_creation_input_tokens,
  cache_read: res.usage.cache_read_input_tokens,
});
The two numbers worth watching: cache_read_input_tokens / (cache_read + input + cache_create) is your cache hit rate — anything above 50% on a chatbot means caching is working. output_tokens / input_tokens is your verbosity ratio — if it's climbing over weeks, your prompts are drifting toward longer answers and the bill is following.
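Both ratios can be computed from the logged usage object. The sketch below follows the formulas in the paragraph above; `UsageLike` is a local stand-in type (the cache fields are treated as possibly missing or null), and the function names are hypothetical.

```typescript
// Local stand-in for the response usage object (assumed shape).
type UsageLike = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// cache_read / (cache_read + input + cache_create); 0 when there are no input tokens.
function cacheHitRate(u: UsageLike): number {
  const read = u.cache_read_input_tokens ?? 0;
  const create = u.cache_creation_input_tokens ?? 0;
  const total = read + u.input_tokens + create;
  return total === 0 ? 0 : read / total;
}

// output_tokens / input_tokens, exactly as defined in the text; 0 when input is 0.
function verbosityRatio(u: UsageLike): number {
  return u.input_tokens === 0 ? 0 : u.output_tokens / u.input_tokens;
}
```

Aggregating these per day or per deploy, rather than per request, is usually what makes a drifting trend visible.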
Package the whole workflow as a skill so LingCode picks the right API shape, caching strategy, and model constant every time you ask it to wire up Claude:
---
name: claude-api
description: Build, debug, and optimize Claude API / Anthropic SDK apps in TypeScript or Python. Apps built with this skill should include prompt caching. Also handles migrating between Claude model versions (Sonnet/Opus/Haiku 4.5 → 4.6 → 4.7, retired-model replacements). Triggers: anthropic SDK import, @anthropic-ai/sdk usage, prompt caching questions, tool use loops, streaming, batch API, files API, citations, memory, thinking. Actions: build Claude app, add prompt caching, migrate model version, fix Anthropic API error, optimize tokens, wire tool use. Skip if: code imports openai or other-provider SDK.
---
When wiring or modifying Claude API code:
1. Use @anthropic-ai/sdk (TS) or anthropic (Python). Same shape.
2. Centralize the model string in one constant — never inline.
Pin to dated snapshots, not -latest aliases.
3. Mark stable prefixes cacheable with
cache_control: { type: "ephemeral" }: system prompt, tools
array, large reused documents. Verify cache hits via
res.usage.cache_read_input_tokens.
4. Use the SDK tool-use shape (tools: [...], tool_use block,
tool_result reply). Never string-parse tool calls.
5. Handle parallel tool calls — a single response may contain
multiple tool_use blocks; run all, reply with all tool_results
in one user message.
6. For chat UIs, stream content_block_delta events. Use
finalMessage() for the assembled response.
7. Log usage stats on every call. Watch cache hit rate and
output/input ratio over time.
8. On model migrations: change the constant, re-run the eval
suite, verify cache still hits, check tool-call shape didn't
regress.
Prefer Batch API for offline enrichment (50% cheaper, 24h SLA).
Prefer Files API + caching for repeated queries against the same
document. Use extended thinking only when the problem warrants it
(math, code, multi-step planning).
Save as ~/.lingcode/skills/claude-api/SKILL.md — see Install a skill for the exact location and how skills get discovered.