📝 Written ● Intermediate Updated 2026-05-19

Build with the Claude API — prompt caching, tools, and model migrations

A working Claude API integration without prompt caching is paying full price for tokens you've already paid for. When a large prefix repeats across requests, caching alone can cut input spend substantially, sometimes by half. This tutorial walks through the canonical Anthropic SDK shape and the optimizations that actually move the bill, plus what changes (and doesn't) when a new model ships.

Why "just works" is the expensive default

The smallest Anthropic SDK example in the docs does work — but it ships three soft defaults that cost real money at any non-toy traffic:

No prompt caching. Every request re-bills the full system prompt, the full tool spec, and any uploaded context. Tokens read from cache cost ~10% of the base input price (writing the cache entry carries a small one-time premium). Skipping caching on a chatbot with a 4 KB system prompt and 100 daily users is a recurring bill for nothing.
Tool use written as string parsing. Hand-rolling "if the model says SEARCH(...), run search" is the path of pain — fragile, no schema validation, no parallel tool calls. The SDK has a first-class tools shape; use it.
Hard-coded model strings everywhere. When the next Sonnet or Opus ships, you want one line to change, not twelve. Centralize.

LingCode knows the Anthropic SDK shape and writes integrations with caching, tool use, and a model constant by default — ask it to build the thing and you get the optimized version on the first pass.

What you need

LingCode — download the installer.
An Anthropic API key — from console.anthropic.com. Drop it in .env as ANTHROPIC_API_KEY.
A Node or Python project — examples below are TypeScript (@anthropic-ai/sdk) because it covers the broadest shape; the Python anthropic SDK mirrors it method-for-method.

The minimal canonical call

Ask LingCode to scaffold the SDK setup. The shape it produces:

// src/claude.ts
import Anthropic from "@anthropic-ai/sdk";

const MODEL = "claude-sonnet-4-7-20260101"; // one place to bump
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

export async function ask(userText: string) {
  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 1024,
    messages: [{ role: "user", content: userText }],
  });
  return res.content[0].type === "text" ? res.content[0].text : "";
}

Three things to notice. MODEL is a constant — every call routes through it. max_tokens is explicit — the SDK requires it. The response is a content-block array, not a string; the model can return multiple blocks (text, tool calls, thinking), so always check the type.
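Since the response is an array of blocks, a small helper that collects every text block is often more robust than reading content[0]. A minimal sketch, assuming simplified local stand-ins for the SDK's block types (TextPart, OtherPart, and extractText are hypothetical names, not SDK exports):

```typescript
// Simplified stand-ins for the SDK's content-block types (assumption:
// only the fields this helper needs are modeled).
type TextPart = { type: "text"; text: string };
type OtherPart = { type: "tool_use" | "thinking"; [key: string]: unknown };
type ContentPart = TextPart | OtherPart;

// Join every text block in order, skipping tool calls and thinking blocks.
function extractText(content: ContentPart[]): string {
  return content
    .filter((part): part is TextPart => part.type === "text")
    .map((part) => part.text)
    .join("");
}

const sample: ContentPart[] = [
  { type: "thinking", thinking: "..." },
  { type: "text", text: "Hello, " },
  { type: "text", text: "world." },
];
console.log(extractText(sample)); // "Hello, world."
```

In ask() above, this would replace the content[0] check with extractText(res.content), which still returns an empty string when the response has no text blocks.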

The Python shape is identical. from anthropic import Anthropic; client = Anthropic(); client.messages.create(...). Pick whichever language your backend is in; the API surface is the same.

Prompt caching — the one optimization you can't skip

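If the system prompt is the same on every request (it usually is), mark it cacheable and pay roughly 10% of the base input price for that prefix on every call after the first. The first call writes the cache entry and pays a small premium for it. Five minutes of work, sometimes half the bill.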

const SYSTEM_PROMPT = `You are a support agent for Acme Widgets.
Your knowledge base:

${KNOWLEDGE_BASE_TEXT}

Always cite the section number when answering.`;

const res = await client.messages.create({
  model: MODEL,
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userText }],
});

Two rules to internalize:

Cache the stable prefix. System prompts, tool specs, large documents the user is asking about — anything that doesn't change between turns. The cache key is the literal prefix bytes, so a single edited character invalidates the entry.
There's a minimum size. The cached prefix has a token-count floor (1024 for Sonnet, 2048 for Haiku at time of writing; the floor varies by model, so check the current docs). Tiny system prompts won't cache, and the SDK won't error; it just silently bills full price. Check res.usage.cache_creation_input_tokens and res.usage.cache_read_input_tokens on the response to verify the cache actually hit.
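The verification step in the second rule can be turned into a tiny classifier you call on every response. This is a sketch, assuming a local UsageLike type that mirrors the usage fields named above; cacheStatus is a hypothetical helper, not part of the SDK:

```typescript
// Mirrors the usage fields named in this tutorial; the cache fields may be
// null or absent when caching was not involved (assumption).
type UsageLike = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type CacheOutcome = "hit" | "write" | "uncached";

function cacheStatus(usage: UsageLike): CacheOutcome {
  // Tokens were read from an existing cache entry.
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "hit";
  // No read, but an entry was created: expected on the first request.
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write";
  // Neither: the prefix was probably below the minimum size,
  // or no cache_control marker was set.
  return "uncached";
}
```

Logging cacheStatus(res.usage) next to each request makes a silently uncached prompt show up as a run of "uncached" entries instead of a surprise on the invoice.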

Watch the TTL. Ephemeral cache entries live ~5 minutes since last hit. If your traffic is bursty (one request, then nothing for 10 minutes), you pay the full creation cost on every burst. For predictable high traffic, ephemeral is a clean win; for irregular traffic, weigh it.
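To weigh the TTL trade-off, it helps to put rough numbers on it. The sketch below assumes cache writes bill at 1.25x the base input price and cache reads at 0.1x; both multipliers, the $3 per million tokens price, and the prefixCost helper are illustrative assumptions, so substitute the current pricing for your model:

```typescript
type CacheCostInput = {
  prefixTokens: number; // tokens in the cacheable prefix
  requests: number; // total requests in the period
  cacheWrites: number; // requests that miss and re-create the entry
  basePricePerMTok: number; // base input price per million tokens
  writeMultiplier?: number; // assumption: 1.25x base for a cache write
  readMultiplier?: number; // assumption: 0.1x base for a cache read
};

// Cost of the prefix alone, with and without caching.
function prefixCost(c: CacheCostInput): { uncached: number; cached: number } {
  const write = c.writeMultiplier ?? 1.25;
  const read = c.readMultiplier ?? 0.1;
  const perRequest = (c.prefixTokens * c.basePricePerMTok) / 1_000_000;
  const reads = c.requests - c.cacheWrites;
  return {
    uncached: c.requests * perRequest,
    cached: (c.cacheWrites * write + reads * read) * perRequest,
  };
}

// Steady traffic: one write, then every request hits within the TTL.
const steady = prefixCost({ prefixTokens: 2000, requests: 1000, cacheWrites: 1, basePricePerMTok: 3 });
// Bursty traffic: every request arrives after the entry has expired.
const bursty = prefixCost({ prefixTokens: 2000, requests: 1000, cacheWrites: 1000, basePricePerMTok: 3 });
console.log({ steady, bursty });
```

With these made-up numbers, the steady case comes to roughly $0.61 against $6.00 uncached, while the fully bursty case comes to $7.50, which is worse than not caching at all. That gap is the thing to weigh for irregular traffic.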

Tool use — the real shape

Tool use is a three-step protocol: you declare tools, the model emits a tool_use block, you reply with a tool_result. The SDK gives you typed shapes; don't reinvent it as string parsing.

const tools: Anthropic.Tool[] = [ // typed so `type: "object"` stays a literal
  {
    name: "get_weather",
    description: "Get current weather for a city.",
    input_schema: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. 'Paris'." },
      },
      required: ["city"],
    },
  },
];

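// runTool is your own dispatcher: (name, input) => Promise<result>.
// Usage: await runWithTools("What's the weather in Tokyo?")
export async function runWithTools(userText: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userText },
  ];

  while (true) {
    const res = await client.messages.create({
      model: MODEL,
      max_tokens: 1024,
      tools,
      messages,
    });

    if (res.stop_reason !== "tool_use") {
      // Plain answer: done.
      return res.content;
    }

    // The model wants one or more tools. Run them all, reply once, loop.
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      const result = await runTool(block.name, block.input);
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id, // must match the id of the tool_use block
        content: JSON.stringify(result),
      });
    }

    messages.push({ role: "assistant", content: res.content });
    messages.push({ role: "user", content: toolResults });
  }
}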

The loop terminates when stop_reason is anything other than tool_use — typically end_turn. The model may emit multiple tool_use blocks in a single response (parallel tool calls); run them all and reply with all the tool_results in one user message.
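The parallel case can be factored into one helper that runs every tool_use block concurrently and returns the single user message the protocol expects. A sketch under stated assumptions: the block types are simplified local stand-ins, and buildToolResults and the injected runTool dispatcher are hypothetical names:

```typescript
// Simplified stand-ins for the SDK block shapes (assumption).
type ToolUsePart = { type: "tool_use"; id: string; name: string; input: unknown };
type ReplyPart = ToolUsePart | { type: "text"; text: string };
type ToolResultPart = { type: "tool_result"; tool_use_id: string; content: string };
type ToolRunner = (name: string, input: unknown) => Promise<unknown>;

// Run all requested tools at once and build ONE user message that
// carries a tool_result for each tool_use, matched by id.
async function buildToolResults(
  content: ReplyPart[],
  runTool: ToolRunner,
): Promise<{ role: "user"; content: ToolResultPart[] }> {
  const calls = content.filter(
    (part): part is ToolUsePart => part.type === "tool_use",
  );
  const results = await Promise.all(
    calls.map(
      async (call): Promise<ToolResultPart> => ({
        type: "tool_result",
        tool_use_id: call.id,
        content: JSON.stringify(await runTool(call.name, call.input)),
      }),
    ),
  );
  return { role: "user", content: results };
}
```

Promise.all preserves input order, so results line up with the order the model requested them. If a tool can throw, you would likely want to catch per call and report the failure in that call's tool_result instead of rejecting the whole batch.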

Cache the tools array too. Tool specs are stable across requests — append cache_control: { type: "ephemeral" } to the last tool in the list to mark the whole block cacheable. Big tools array, same effect as a big system prompt.
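Marking the last tool is easy to get wrong when the list is assembled dynamically, so it can help to do it in one place. A minimal sketch, assuming a simplified local ToolSpec type; withCachedTools is a hypothetical helper:

```typescript
// Simplified tool shape matching the example above (assumption).
type ToolSpec = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
  cache_control?: { type: "ephemeral" };
};

// Return a copy of the tools array with a cache marker on the LAST tool only.
// The input array and its elements are left untouched.
function withCachedTools(tools: ToolSpec[]): ToolSpec[] {
  return tools.map(
    (tool, i): ToolSpec =>
      i === tools.length - 1
        ? { ...tool, cache_control: { type: "ephemeral" } }
        : tool,
  );
}
```

You would then pass tools: withCachedTools(tools) in the request. Keep the tool order stable between requests: since the cache key is the literal prefix, reordering tools changes the prefix and misses the cache.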

Streaming — for chat UIs that aren't allowed to feel slow

The same call, streamed:

const stream = client.messages.stream({
  model: MODEL,
  max_tokens: 1024,
  messages: [{ role: "user", content: userText }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" &&
      event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}

const final = await stream.finalMessage();
console.log("usage:", final.usage);

Streaming events include message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop. For most UIs you only render text_delta deltas; the rest are bookkeeping. finalMessage() resolves the assembled response — usage stats and stop reason live there.
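If you want to unit-test the rendering path without hitting the API, the event handling can be written against any async iterable. The sketch below assumes a simplified local event type covering only the events named above; collectText and fakeStream are hypothetical names:

```typescript
// Simplified stand-in for the stream event union (assumption).
type StreamEventLike =
  | {
      type: "content_block_delta";
      index: number;
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | {
      type:
        | "message_start"
        | "content_block_start"
        | "content_block_stop"
        | "message_delta"
        | "message_stop";
    };

// Forward each text chunk to the UI callback and return the full text.
async function collectText(
  events: AsyncIterable<StreamEventLike>,
  onText: (chunk: string) => void,
): Promise<string> {
  let full = "";
  for await (const event of events) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      onText(event.delta.text);
      full += event.delta.text;
    }
  }
  return full;
}

// A fake stream for tests: bookkeeping events around two text deltas.
async function* fakeStream(): AsyncGenerator<StreamEventLike> {
  yield { type: "message_start" };
  yield { type: "content_block_start" };
  yield { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Hel" } };
  yield { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "lo" } };
  yield { type: "content_block_stop" };
  yield { type: "message_stop" };
}
```

In production you would pass the real stream object in place of fakeStream(), with onText wired to whatever appends text to the chat bubble.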

Other surfaces worth knowing exist

LingCode will wire any of these if you ask, but they're not free choices — each has a use case:

Batch API. Submit up to 100,000 requests as a single batch, results within 24h, 50% cheaper than synchronous. Right answer for nightly enrichment jobs; wrong answer for chat.
Files API. Upload a PDF or document once, reference it by ID across requests. Pair with prompt caching when the same file is queried repeatedly.
Citations. Pass documents as a content block with citations: { enabled: true } and the model returns citation objects pointing at the source spans. Real grounding, not vibes.
Extended thinking. Set thinking: { type: "enabled", budget_tokens: 5000 } on hard problems, with max_tokens set higher than the budget. The model produces a thinking content block; during tool use, pass it back unchanged in the assistant turn so the model doesn't re-reason. Cost: those thinking tokens are billed as output tokens. Worth it for math, code, multi-step planning; overkill for "summarize this paragraph."
Memory. The new memory tool lets the model write durable notes across sessions. Useful for long-running agents; verify that the model you've pinned actually supports the tool before depending on it (check the model card).
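For extended thinking, the budget and max_tokens interact, so it can be worth building the two fields together. This is a sketch under two assumptions worth checking against the current docs: that the minimum budget is 1024 tokens and that max_tokens must exceed budget_tokens. thinkingParams is a hypothetical helper:

```typescript
type ThinkingRequest = {
  max_tokens: number;
  thinking: { type: "enabled"; budget_tokens: number };
};

// Assumption: documented minimum thinking budget at time of writing.
const MIN_THINKING_BUDGET = 1024;

// Build max_tokens and thinking together so the answer always has room
// on top of the thinking budget.
function thinkingParams(budgetTokens: number, answerTokens: number): ThinkingRequest {
  if (!Number.isInteger(budgetTokens) || budgetTokens < MIN_THINKING_BUDGET) {
    throw new RangeError(`budget_tokens must be an integer >= ${MIN_THINKING_BUDGET}`);
  }
  if (!Number.isInteger(answerTokens) || answerTokens < 1) {
    throw new RangeError("answerTokens must be a positive integer");
  }
  return {
    max_tokens: budgetTokens + answerTokens,
    thinking: { type: "enabled", budget_tokens: budgetTokens },
  };
}

console.log(thinkingParams(5000, 1024).max_tokens); // 6024
```

The result can be spread into the request: client.messages.create({ model: MODEL, messages, ...thinkingParams(5000, 1024) }).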

Migrating between model versions

When a new Sonnet or Opus ships, the migration is usually a one-line change — but the regression catch is on you. The API is stable across these releases; the model's behavior isn't.

// Before
const MODEL = "claude-sonnet-4-6-20251022";

// After
const MODEL = "claude-sonnet-4-7-20260101";

What to check after the swap:

Run your existing eval suite. If you don't have one, this is the moment to write one — 20 representative inputs, expected outputs, diff the two model versions side by side.
Re-check prompt caching. New model = new cache. The first request will be a cache miss; that's normal. Verify subsequent requests hit cache (cache_read_input_tokens > 0).
Re-check tool calling shape. Newer models sometimes prefer different tool argument structures. If a tool call regressed, the fix is usually tightening the tool description, not rolling back.
Re-check max_tokens. Newer models may have larger output windows; consider bumping max_tokens if your prompts ever truncated.
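The eval suite in the first check does not need a framework to start. A minimal sketch, assuming each case is judged by required substrings (a deliberately crude pass criterion) and that askOld and askNew are your own wrappers pinned to the two model strings; all names here are hypothetical:

```typescript
type AskFn = (input: string) => Promise<string>;
type EvalCase = { input: string; mustInclude: string[] };
type EvalRow = { input: string; oldPass: boolean; newPass: boolean };

// Run every case through both models and flag cases that passed on the
// old model but fail on the new one.
async function compareModels(
  cases: EvalCase[],
  askOld: AskFn,
  askNew: AskFn,
): Promise<{ rows: EvalRow[]; regressions: EvalRow[] }> {
  const rows: EvalRow[] = [];
  for (const c of cases) {
    const [oldOut, newOut] = await Promise.all([askOld(c.input), askNew(c.input)]);
    const passes = (out: string) => c.mustInclude.every((s) => out.includes(s));
    rows.push({ input: c.input, oldPass: passes(oldOut), newPass: passes(newOut) });
  }
  return { rows, regressions: rows.filter((r) => r.oldPass && !r.newPass) };
}
```

A non-empty regressions list is the signal to look at prompts and tool descriptions before shipping the new constant. Substring checks are only a starting point; swap in whatever grading fits your outputs.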

Retired-model swaps are the same shape — Anthropic publishes a sunset date, you change the constant before that date. LingCode reads the model deprecation pages when you ask "what's the current Sonnet model string" and writes the migration with a regression test.

Don't pin to claude-3-5-sonnet-latest-style aliases in production. They drift under you. Pin to the dated snapshot (claude-sonnet-4-7-20260101) and migrate deliberately.
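One way to enforce the pinning rule is a startup check on the model constant. The sketch below assumes dated snapshots end in an eight-digit date, as the model strings in this tutorial do; assertPinned and isDatedSnapshot are hypothetical helpers:

```typescript
// Assumption: a dated snapshot ends in -YYYYMMDD, like the examples above.
const DATED_SNAPSHOT = /-\d{8}$/;

function isDatedSnapshot(model: string): boolean {
  return DATED_SNAPSHOT.test(model);
}

// Fail fast at startup if the constant is an alias rather than a snapshot.
function assertPinned(model: string): string {
  if (!isDatedSnapshot(model)) {
    throw new Error(`Model "${model}" is not pinned to a dated snapshot`);
  }
  return model;
}

console.log(isDatedSnapshot("claude-sonnet-4-7-20260101")); // true
console.log(isDatedSnapshot("claude-3-5-sonnet-latest")); // false
```

Wrapping the constant as const MODEL = assertPinned("...") turns an accidental alias into a crash at boot instead of silent drift.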

Cost telemetry — track the metric that matters

Every Anthropic response carries usage stats. Log them:

console.log({
  input: res.usage.input_tokens,
  output: res.usage.output_tokens,
  cache_create: res.usage.cache_creation_input_tokens,
  cache_read: res.usage.cache_read_input_tokens,
});

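The two numbers worth watching: cache_read_input_tokens / (cache_read_input_tokens + input_tokens + cache_creation_input_tokens) is your cache hit rate; anything above 50% on a chatbot means caching is working. output_tokens / input_tokens is your verbosity ratio; if it's climbing over weeks, your prompts are drifting toward longer answers and the bill is following. With caching on, input_tokens counts only the uncached part of the prompt, so use the same total-input denominator for the verbosity ratio or it will look inflated.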
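Both ratios can be computed from a single usage object. A sketch, assuming the cache fields may be null or absent and that input_tokens excludes cached tokens, so total input is the sum of all three input fields (which is why the verbosity ratio here divides by total input). usageMetrics is a hypothetical helper:

```typescript
type UsageStats = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

function usageMetrics(u: UsageStats) {
  const read = u.cache_read_input_tokens ?? 0;
  const created = u.cache_creation_input_tokens ?? 0;
  // Assumption: input_tokens is only the uncached portion of the prompt.
  const totalInput = u.input_tokens + read + created;
  return {
    totalInput,
    // Share of input tokens served from cache.
    cacheHitRate: totalInput === 0 ? 0 : read / totalInput,
    // Output tokens per input token, measured against the whole prompt.
    verbosity: totalInput === 0 ? 0 : u.output_tokens / totalInput,
  };
}

console.log(
  usageMetrics({
    input_tokens: 100,
    output_tokens: 300,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 900,
  }),
);
```

Per-request values are noisy; aggregating the raw token counts over a day and computing the ratios on the sums gives a steadier trend line.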

Use this in LingCode

Package the whole workflow as a skill so LingCode picks the right API shape, caching strategy, and model constant every time you ask it to wire up Claude:

---
name: claude-api
description: Build, debug, and optimize Claude API / Anthropic SDK apps in TypeScript or Python. Apps built with this skill should include prompt caching. Also handles migrating between Claude model versions (Sonnet/Opus/Haiku 4.5 → 4.6 → 4.7, retired-model replacements). Triggers: anthropic SDK import, @anthropic-ai/sdk usage, prompt caching questions, tool use loops, streaming, batch API, files API, citations, memory, thinking. Actions: build Claude app, add prompt caching, migrate model version, fix Anthropic API error, optimize tokens, wire tool use. Skip if: code imports openai or other-provider SDK.
---

When wiring or modifying Claude API code:

1. Use @anthropic-ai/sdk (TS) or anthropic (Python). Same shape.
2. Centralize the model string in one constant — never inline.
   Pin to dated snapshots, not -latest aliases.
3. Mark stable prefixes cacheable with
   cache_control: { type: "ephemeral" }: system prompt, tools
   array, large reused documents. Verify cache hits via
   res.usage.cache_read_input_tokens.
4. Use the SDK tool-use shape (tools: [...], tool_use block,
   tool_result reply). Never string-parse tool calls.
5. Handle parallel tool calls — a single response may contain
   multiple tool_use blocks; run all, reply with all tool_results
   in one user message.
6. For chat UIs, stream content_block_delta events. Use
   finalMessage() for the assembled response.
7. Log usage stats on every call. Watch cache hit rate and
   output/input ratio over time.
8. On model migrations: change the constant, re-run the eval
   suite, verify cache still hits, check tool-call shape didn't
   regress.

Prefer Batch API for offline enrichment (50% cheaper, results within 24h).
Prefer Files API + caching for repeated queries against the same
document. Use extended thinking only when the problem warrants it
(math, code, multi-step planning).

Save as ~/.lingcode/skills/claude-api/SKILL.md — see Install a skill for the exact location and how skills get discovered.
