📝 Written ● Intermediate Updated 2026-05-13

Vector search with Pinecone

Embed documents into vectors, store in Pinecone, query by semantic similarity. The foundation of RAG, recommendation, semantic search, and "find documents like this one."

The 60-second mental model

Embedding = a function that turns text (or image) into a fixed-length array of floats (e.g. 1536 for OpenAI's text-embedding-3-small). Similar meaning → close vectors.
Vector DB = stores millions of those arrays and answers "which N vectors are closest to this one?" in <100ms.
Pick Pinecone when — you want a managed service, no infra to run, a generous free tier, and well-documented SDKs.
Alternatives — Qdrant (open source, fastest single-node), Weaviate (open source, modular), pgvector (Postgres extension — already in your DB!), Supabase + pgvector (free for small data), Chroma (Python-first prototyping).
Don't use a vector DB if — you have <10K docs and they fit in RAM. Just embed everything on boot, do cosine similarity in-process. Avoids the network hop entirely.
Sign up — app.pinecone.io. Google/GitHub login.

Create an index

Console → Create Index. Required choices:

Dimensions — must match your embedding model. text-embedding-3-small = 1536. text-embedding-3-large = 3072. Cohere embed-v3 = 1024. You can't change this later — wrong choice means rebuilding the index.
Metric — cosine (default, normalized similarity), euclidean, or dotproduct. Use cosine unless your embedding model docs say otherwise.
Cloud — AWS / GCP / Azure. Pick the same as your app to minimize latency.
Type — Serverless (pay-per-query, scales to zero, default) or Pod (dedicated, predictable cost, legacy). New projects → serverless.
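
The same choices can also be made from code. A minimal sketch, assuming the `createIndex` method of the official TypeScript SDK; the `DIMENSIONS` table and the `indexConfigFor` helper are hypothetical names introduced here, and `us-east-1` is only a placeholder region.

```javascript
// Output dimensions for the OpenAI models mentioned above.
const DIMENSIONS = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// Hypothetical helper: builds the argument object for pc.createIndex().
function indexConfigFor(name, model, cloud = "aws", region = "us-east-1") {
  const dimension = DIMENSIONS[model];
  if (!dimension) throw new Error(`Unknown embedding model: ${model}`);
  return {
    name,
    dimension,          // must match the model; cannot be changed later
    metric: "cosine",
    spec: { serverless: { cloud, region } },
  };
}

// Usage (needs a Pinecone client, created as in "Upsert vectors"):
//   await pc.createIndex(indexConfigFor("my-docs", "text-embedding-3-small"));
```

Deriving the dimension from the model name keeps the two from drifting apart, which is the mistake that forces an index rebuild.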

Copy the index name and host URL. From API Keys, copy the key.

PINECONE_API_KEY=…
PINECONE_INDEX=my-docs
OPENAI_API_KEY=…   # for embeddings

Generate embeddings

You need something to produce vectors. OpenAI's embedding API is the default; Cohere, Voyage, and open-source models via sentence-transformers are alternatives.

npm install openai @pinecone-database/pinecone

import OpenAI from "openai";
const openai = new OpenAI();

async function embed(text) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0].embedding;  // float[1536]
}

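Batch when you can. The API bills per token, so one call with 100 strings costs the same as 100 single-string calls, but it needs one network round trip instead of 100 and counts as a single request against your rate limit: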

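// Embed many strings in one request; returns one vector per input string.
async function embedBatch(chunks) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,  // array of strings, max 2048 inputs per call
  });
  return r.data.map(d => d.embedding);
}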
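
Because one call accepts at most 2048 inputs, longer chunk lists have to be split first. A minimal sketch; `toBatches` is a hypothetical helper, not part of the OpenAI SDK.

```javascript
// Split an array into consecutive slices of at most `size` items.
function toBatches(items, size = 2048) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage with the request shown above, one call per batch:
//   const vectors = [];
//   for (const batch of toBatches(chunks)) {
//     const r = await openai.embeddings.create({
//       model: "text-embedding-3-small",
//       input: batch,
//     });
//     vectors.push(...r.data.map(d => d.embedding));
//   }
```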

Chunk your documents

Don't embed a 50-page PDF as one vector: the meaning gets averaged into mush. Split into ~500-token chunks with ~50-token overlap, so text cut at a chunk boundary keeps some of its surrounding context in the next chunk.

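// Sizes are in characters, not tokens. At roughly 4 characters per English
// token, 2000 / 200 approximates ~500-token chunks with ~50-token overlap.
function chunkText(text, size = 2000, overlap = 200) {
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const chunks = [];
  let i = 0;
  while (i < text.length) {
    chunks.push(text.slice(i, i + size));
    i += size - overlap;  // step forward, repeating `overlap` characters
  }
  return chunks;
}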

For real production chunking that respects paragraph + sentence boundaries, use LangChain's RecursiveCharacterTextSplitter or LlamaIndex node parsers.

Chunking is the single biggest determinant of retrieval quality. Bad chunks (too big, too small, split mid-paragraph) → bad search results, no matter how good your DB is.
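
To illustrate what "respecting paragraph boundaries" means, here is a minimal sketch that packs whole paragraphs into chunks. `chunkByParagraph` is a hypothetical helper written for this page, not the algorithm LangChain or LlamaIndex use; it adds no overlap and does not split paragraphs that are longer than the limit.

```javascript
// Pack whole paragraphs into chunks of at most `maxChars` characters.
// A single paragraph longer than maxChars becomes its own oversized chunk.
function chunkByParagraph(text, maxChars = 2000) {
  const paragraphs = text
    .split(/\n\s*\n/)            // a blank line marks a paragraph break
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate;       // first paragraph of a chunk, or it still fits
    } else {
      chunks.push(current);      // close the chunk and start a new one
      current = p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk starts and ends on a paragraph boundary, at the price of uneven chunk sizes.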

Upsert vectors

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index(process.env.PINECONE_INDEX);

async function ingestDocument(docId, text) {
  const chunks = chunkText(text);
  const vectors = await embedBatch(chunks);

  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}#${i}`,             // deterministic ID — re-ingesting overwrites
      values: vectors[i],
      metadata: {
        docId,
        text: chunk,                    // store the chunk for retrieval
        chunkIndex: i,
        createdAt: Date.now(),
      },
    })),
  );
}

Metadata patterns:

Always store the source text in metadata — retrieval gives you the chunk back so you can show it to the LLM
Tag with filters — userId, workspaceId, documentType, tags: ["draft", "public"]. Pinecone filters on metadata server-side; cheap.
Don't store huge blobs — metadata has a 40KB limit per vector. For PDFs, store a pointer to S3/R2.
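
The size limit is easy to check before upserting. A rough sketch: it assumes 40KB means 40 × 1024 bytes and approximates the size as the UTF-8 length of the JSON-serialized metadata, which may not match Pinecone's own accounting exactly. `metadataBytes` and `assertMetadataFits` are hypothetical helper names.

```javascript
const METADATA_LIMIT_BYTES = 40 * 1024; // assumption: 40KB = 40,960 bytes

// Approximate size: UTF-8 bytes of the JSON-serialized metadata object.
function metadataBytes(metadata) {
  return new TextEncoder().encode(JSON.stringify(metadata)).length;
}

// Throws if the metadata looks too large; otherwise returns it unchanged.
function assertMetadataFits(metadata) {
  const size = metadataBytes(metadata);
  if (size > METADATA_LIMIT_BYTES) {
    throw new Error(`Metadata is ${size} bytes; limit is ${METADATA_LIMIT_BYTES}`);
  }
  return metadata;
}

// Usage inside the upsert mapping:
//   metadata: assertMetadataFits({ docId, text: chunk, chunkIndex: i }),
```

Measuring bytes rather than string length matters for non-ASCII text, where one character can take several bytes.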

Query — semantic search

async function search(query, userId) {
  const queryVec = await embed(query);

  const result = await index.query({
    vector: queryVec,
    topK: 5,
    includeMetadata: true,
    filter: { userId: { $eq: userId } },  // multi-tenant guard
  });

  return result.matches.map(m => ({
    text: m.metadata.text,
    docId: m.metadata.docId,
    score: m.score,  // cosine similarity in -1..1; higher means more similar
  }));
}

Filter operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $exists, $and, $or. Full reference: Pinecone metadata filtering.
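
Operators nest, so a tenant guard can be combined with any extra condition. A minimal sketch; `tenantFilter` is a hypothetical helper, and the `documentType` and `createdAt` fields are assumed to have been stored as metadata at upsert time.

```javascript
// Always AND the tenant guard with whatever extra filter the caller passes.
function tenantFilter(userId, extra = null) {
  if (!userId) throw new Error("userId is required for every query");
  const guard = { userId: { $eq: userId } };
  return extra ? { $and: [guard, extra] } : guard;
}

// Example: this user's PDFs and notes created after a given timestamp.
const filter = tenantFilter("user_123", {
  $and: [
    { documentType: { $in: ["pdf", "note"] } },
    { createdAt: { $gt: 1735689600000 } }, // 2025-01-01T00:00:00Z in ms
  ],
});

// Usage: pass it as the `filter` field of index.query({ ... }).
```

Building the filter in one place means a caller cannot add conditions without also getting the `userId` guard.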

Hook up to an LLM (RAG)

Retrieval-Augmented Generation: search → stuff results into prompt → ask LLM.

async function answer(question, userId) {
  const docs = await search(question, userId);
  const context = docs.map(d => `[${d.docId}]\n${d.text}`).join("\n\n---\n\n");

  const r = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Answer using only the context below. Cite [docId] inline. If the context doesn't contain the answer, say so.",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  return r.choices[0].message.content;
}

For richer RAG flows (query rewriting, multi-hop, re-ranking), use a framework such as LangChain or LlamaIndex. For the simple "search + stuff" shape above, a framework is unnecessary overhead.

Hybrid search (semantic + keyword)

Pure vector search can miss exact-match queries. The embedding of "ZeroMQ" lands close to that of "RabbitMQ", but a user looking for ZeroMQ docs wants ZeroMQ specifically. Hybrid search combines BM25 (keyword) scoring with cosine (semantic) similarity.

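Pinecone supports hybrid via sparse-dense vectors on serverless indexes. Sparse-dense indexes require the dotproduct metric, so this has to be decided when the index is created. Generate sparse vectors with Pinecone Inference or rank-bm25.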

If your needs are simpler, run both searches in parallel and merge — reciprocal rank fusion is 20 lines of code and works well.
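
A minimal sketch of reciprocal rank fusion, assuming each search returns an array of chunk IDs ordered best first. The constant k = 60 is the value commonly used with this method; `reciprocalRankFusion` is a name chosen here.

```javascript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1 for the best match in each list.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])   // highest fused score first
    .map(([id, score]) => ({ id, score }));
}

const fused = reciprocalRankFusion([
  ["doc1#0", "doc2#3", "doc3#1"],  // vector search order
  ["doc2#3", "doc4#0", "doc1#0"],  // keyword search order
]);
// fused[0].id === "doc2#3": ranked 2nd and 1st, it beats "doc1#0" (1st and 3rd)
```

Only ranks are used, so the vector scores and keyword scores never need to be put on a common scale.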

Common failures

Top results look unrelated — usually a chunking problem (too big), not a DB problem. Try smaller chunks with overlap. Inspect the actual returned text — don't trust the score in isolation.
Slow queries — usually topK is too high. topK: 100 is rarely useful; 5–10 is the sweet spot for RAG. If quality demands more candidates, fetch a larger topK and re-rank with Cohere Rerank.
"Embedding dimension doesn't match index" — you switched models. Pinecone indexes are dimension-locked. Create a new index with the right dimension and re-ingest.
Stale vectors after document edit — you edited a source document; old chunks still exist with the old text. Use deterministic IDs (${docId}#${i}) so that upsert overwrites the old chunks, and delete any chunks beyond the new chunk count.
Multi-tenant leak — every query must include the user/workspace filter. Forgetting it once leaks documents across customers. Wrap the SDK in a per-user helper that injects the filter; never call index.query directly from app code.
Embedding cost surprise — re-ingesting 1M chunks of ~500 tokens with text-embedding-3-small costs about $10 (500M tokens at $0.02 per 1M). Not bad, but unexpected the third time you trigger it. Cache embeddings keyed by content hash.
Free tier eviction — Pinecone serverless free tier has storage limits. Hitting them silently degrades recall. Monitor index size in console.
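
For the stale-vector case, the leftover IDs can be computed from the old and new chunk counts. A minimal sketch; `staleChunkIds` is a hypothetical helper, and it assumes your app records how many chunks each document had at its last ingest (`previousChunkCount` below).

```javascript
// IDs of the chunks left behind when a document shrinks
// from oldCount chunks to newCount chunks.
function staleChunkIds(docId, oldCount, newCount) {
  const ids = [];
  for (let i = newCount; i < oldCount; i++) {
    ids.push(`${docId}#${i}`);
  }
  return ids;
}

// Usage after re-ingesting (deleteMany is the SDK's delete-by-ID call):
//   const stale = staleChunkIds(docId, previousChunkCount, chunks.length);
//   if (stale.length > 0) await index.deleteMany(stale);
```

When the document grows instead of shrinking, the list is empty and nothing needs deleting.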

Pricing reality

Pinecone Serverless free tier — 2 GB storage, 2M write units / 1M read units per month. Enough for <100K-vector apps.
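Beyond free — $0.33/GB/mo storage, $16 per 1M read units, $4 per 1M write units. A read unit measures the resources a request consumes, not a single similarity comparison; a query's read units grow with the size of the namespace it searches. The pricing page has the exact formula.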
Embeddings cost more than the DB — OpenAI text-embedding-3-small is $0.02 per 1M tokens; embedding 1M docs (avg 500 tokens) = $10. Re-embedding everything is the expensive part.
pgvector on existing Postgres = free — if you already run Postgres (Supabase / Neon / RDS) and have <10M vectors, pgvector is the cheapest option. Latency is similar at small scale; degrades faster at large scale.
Self-host Qdrant = $5–20/mo VPS — open source, single binary, fastest single-node performance. You operate it.
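
The arithmetic behind these figures is simple enough to keep in a script. A rough sketch: both function names are hypothetical, and the storage estimate assumes 4 bytes per dimension (float32) and ignores metadata, so real usage will be higher.

```javascript
// One-off embedding cost in USD.
function embeddingCostUSD(docCount, avgTokensPerDoc, usdPerMillionTokens) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * usdPerMillionTokens;
}

// Raw vector storage in GB: 4 bytes per dimension, metadata excluded.
function vectorStorageGB(vectorCount, dimensions) {
  return (vectorCount * dimensions * 4) / 1e9;
}

// 1M docs of 500 tokens at $0.02 per 1M tokens: the $10 figure above.
const cost = embeddingCostUSD(1_000_000, 500, 0.02);

// 100K vectors of 1536 dimensions: about 0.61 GB before metadata.
const storage = vectorStorageGB(100_000, 1536);
```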

Pricing: pinecone.io/pricing.

Official references

Pinecone docs home
REST API reference
TypeScript SDK · Python SDK
OpenAI embeddings guide
Pinecone integrations — LangChain, LlamaIndex, Haystack adapters

Don't reach for a vector DB until you've tried in-memory. For <10K vectors, cosine(query, every_vector) in your app is faster than the network hop to Pinecone, and free. Vector DBs earn their keep at scale — but most apps never get there.
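
The in-memory version fits in a few lines. A minimal sketch with toy 3-dimensional vectors; in a real app the vectors would come from `embed()`, and `searchInMemory` is a hypothetical name. It does not guard against all-zero vectors, for which cosine similarity is undefined.

```javascript
// Cosine similarity between two equal-length numeric arrays.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// docs: [{ id, vector }]. Returns the topK most similar docs, best first.
function searchInMemory(queryVec, docs, topK = 5) {
  return docs
    .map(d => ({ id: d.id, score: cosine(queryVec, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const docs = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0.9, 0.1, 0] },
  { id: "c", vector: [0, 1, 0] },
];
const hits = searchInMemory([1, 0, 0], docs, 2);
// hits[0].id === "a", hits[1].id === "b"
```

This compares the query against every vector, which is exactly why it stops being practical as the collection grows.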