Written ● Intermediate ● Updated 2026-05-13

Vector search with Pinecone

Embed documents into vectors, store in Pinecone, query by semantic similarity. The foundation of RAG, recommendation, semantic search, and "find documents like this one."

The 60-second mental model

  • Embedding = a function that turns text (or an image) into a fixed-length array of floats (e.g. 1536 for OpenAI's text-embedding-3-small). Similar meaning → close vectors.
  • Vector DB = stores millions of those arrays and answers "which N vectors are closest to this one?" in <100ms.
  • Pick Pinecone when: you want a managed service, no infra to run, a generous free tier, and well-documented SDKs.
  • Alternatives: Qdrant (open source, fastest single-node), Weaviate (open source, modular), pgvector (Postgres extension, possibly already in your DB), Supabase + pgvector (free for small data), Chroma (Python-first prototyping).
  • Don't use a vector DB if: you have <10K docs and they fit in RAM. Embed everything on boot and do cosine similarity in-process. That avoids the network hop entirely.
  • Sign up: app.pinecone.io. Google/GitHub login.

Create an index


Console → Create Index. Required choices:

  • Dimensions: must match your embedding model. text-embedding-3-small = 1536. text-embedding-3-large = 3072. Cohere embed-v3 = 1024. You can't change this later; a wrong choice means rebuilding the index.
  • Metric: cosine (default, normalized similarity), euclidean, or dotproduct. Use cosine unless your embedding model docs say otherwise.
  • Cloud: AWS / GCP / Azure. Pick the same as your app to minimize latency.
  • Type: Serverless (pay-per-query, scales to zero, default) or Pod (dedicated, predictable cost, legacy). New projects → serverless.
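
Since the dimension is locked at creation time, it can help to derive it from the model name instead of typing it by hand. A minimal sketch, assuming the dimensions quoted above; `EMBEDDING_DIMENSIONS` and `indexConfigFor` are hypothetical names, and the default region is an assumption you should replace with your own:

```javascript
// Dimensions as quoted in the list above. The Cohere key is illustrative, not an exact model ID.
const EMBEDDING_DIMENSIONS = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
  "embed-v3": 1024,
};

// Build a serverless index config whose dimension always matches the chosen model.
function indexConfigFor(name, model, cloud = "aws", region = "us-east-1") {
  const dimension = EMBEDDING_DIMENSIONS[model];
  if (dimension === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  return {
    name,
    dimension,
    metric: "cosine",
    spec: { serverless: { cloud, region } },
  };
}

// Usage (requires a Pinecone client, so it is left commented out here):
// await pc.createIndex(indexConfigFor("my-docs", "text-embedding-3-small"));
```

An unknown model name fails loudly here, which is cheaper than discovering a dimension mismatch after ingesting.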

Copy the index name and host URL. From API Keys, copy the key.

PINECONE_API_KEY=…
PINECONE_INDEX=my-docs
OPENAI_API_KEY=…   # for embeddings

Generate embeddings


You need something to produce vectors. OpenAI's embedding API is the default; Cohere, Voyage, and open-source models via sentence-transformers are alternatives.

npm install openai @pinecone-database/pinecone
import OpenAI from "openai";
const openai = new OpenAI();

async function embed(text) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0].embedding;  // float[1536]
}

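Batch when you can: embedding 100 strings in one call means one round trip instead of 100, which is much faster and easier on rate limits. Token cost is the same either way, since OpenAI bills embeddings per token: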

async function embedBatch(chunks) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,  // array of strings, max 2048 inputs per call
  });
  // one embedding per input string
  return r.data.map(d => d.embedding);
}
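
If a document produces more chunks than a single call accepts, the input has to be split first. A rough sketch, assuming the 2048-inputs-per-call limit noted above; `toBatches` and `embedAll` are hypothetical helpers, and `embedFn` stands for any async function that maps an array of strings to an array of vectors:

```javascript
// Split an array into consecutive groups of at most `size` items.
function toBatches(items, size) {
  if (size < 1) throw new Error("size must be at least 1");
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed any number of chunks, one API call per batch, run sequentially.
// embedFn: async (string[]) => number[][]  (assumed shape)
async function embedAll(chunks, embedFn, batchSize = 2048) {
  const vectors = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

Running batches sequentially keeps the sketch simple; whether to parallelize depends on your rate limits.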

Chunk your documents


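Don't embed a 50-page PDF as one vector: meaning gets averaged into mush. Split into ~500-token chunks with ~50-token overlap (so context isn't cut mid-sentence). The helper below counts characters, not tokens; at roughly 4 characters per English token, 1500 characters is a little under 400 tokens.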

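// Naive fixed-size chunker. size and overlap are in characters, not tokens.
function chunkText(text, size = 1500, overlap = 150) {
  // Without this guard, overlap >= size would never advance and loop forever.
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks = [];
  let i = 0;
  while (i < text.length) {
    chunks.push(text.slice(i, i + size));
    i += size - overlap;  // step forward, keeping `overlap` chars of context
  }
  return chunks;
}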

For real production chunking that respects paragraph + sentence boundaries, use LangChain's RecursiveCharacterTextSplitter or LlamaIndex node parsers.

Chunking is often the single biggest determinant of retrieval quality. Bad chunks (too big, too small, split mid-paragraph) → bad search results, no matter how good your DB is.
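
To get a feel for boundary-aware chunking without pulling in a framework, here is a minimal sketch that packs whole sentences into chunks and carries the last sentence forward as overlap. It is not how LangChain or LlamaIndex do it; `splitSentences` and `chunkBySentence` are hypothetical helpers, and the sentence regex is deliberately crude (it will mis-split on abbreviations like "e.g."):

```javascript
// Split on sentence-ending punctuation followed by whitespace.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0);
}

// Pack whole sentences into chunks of roughly maxChars characters.
// The budget is soft: carried-over overlap can push a chunk slightly past it.
function chunkBySentence(text, maxChars = 1500, overlapSentences = 1) {
  const chunks = [];
  let current = [];
  let length = 0;

  for (const sentence of splitSentences(text)) {
    // Close the current chunk if adding this sentence would exceed the budget.
    if (current.length > 0 && length + sentence.length + 1 > maxChars) {
      chunks.push(current.join(" "));
      // Carry the last few sentences forward as overlap.
      current = overlapSentences > 0 ? current.slice(-overlapSentences) : [];
      length = current.join(" ").length;
    }
    current.push(sentence);
    length += sentence.length + 1;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

No chunk starts or ends mid-sentence, which is usually the first thing to fix when results look unrelated.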

Upsert vectors

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index(process.env.PINECONE_INDEX);

async function ingestDocument(docId, text, userId) {
  const chunks = chunkText(text);
  // embedBatch: the batched embeddings call from "Generate embeddings", one vector per chunk
  const vectors = await embedBatch(chunks);

  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}#${i}`,             // deterministic ID: re-ingesting overwrites
      values: vectors[i],
      metadata: {
        docId,
        userId,                         // required by the userId filter at query time
        text: chunk,                    // store the chunk for retrieval
        chunkIndex: i,
        createdAt: Date.now(),
      },
    })),
  );
}

Metadata patterns:

  • Always store the source text in metadata: retrieval gives you the chunk back so you can show it to the LLM.
  • Tag with filters: userId, workspaceId, documentType, tags: ["draft", "public"]. Pinecone filters on metadata server-side, so filtering is cheap.
  • Don't store huge blobs: metadata has a 40KB limit per vector. For PDFs, store a pointer to S3/R2.
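
A size check before upserting catches oversized metadata early. This is a sketch under two assumptions: that the 40KB limit above is read as 40 × 1024 bytes, and that the size of the JSON-serialized metadata is a reasonable proxy for what Pinecone measures. `metadataBytes` and `assertMetadataFits` are hypothetical names:

```javascript
const MAX_METADATA_BYTES = 40 * 1024;  // assumed reading of the 40KB per-vector limit

// UTF-8 size of the metadata once serialized to JSON.
function metadataBytes(metadata) {
  return new TextEncoder().encode(JSON.stringify(metadata)).length;
}

// Return the metadata unchanged, or throw if it is over the limit.
function assertMetadataFits(metadata) {
  const bytes = metadataBytes(metadata);
  if (bytes > MAX_METADATA_BYTES) {
    throw new Error(`metadata is ${bytes} bytes, limit is ${MAX_METADATA_BYTES}`);
  }
  return metadata;
}
```

Calling `assertMetadataFits` on each record's metadata while building the upsert payload turns a vague API error into a clear local one.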

Query: semantic search

async function search(query, userId) {
  const queryVec = await embed(query);

  const result = await index.query({
    vector: queryVec,
    topK: 5,
    includeMetadata: true,
    filter: { userId: { $eq: userId } },  // multi-tenant guard
  });

  return result.matches.map(m => ({
    text: m.metadata.text,
    docId: m.metadata.docId,
    score: m.score,  // cosine similarity, higher = closer (range -1..1)
  }));
}

Filter operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $exists, $and, $or. Full reference: Pinecone metadata filtering.
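
Operators compose, so a tenant guard can be combined with whatever extra filter a caller passes. One possible shape, where `tenantFilter` and `scopedQuery` are hypothetical helpers and `index` is assumed to be anything with a `query(params)` method, such as the Pinecone index handle:

```javascript
// Always include the tenant guard; AND it with any caller-supplied filter.
function tenantFilter(userId, extra) {
  if (!userId) throw new Error("userId is required");
  const tenant = { userId: { $eq: userId } };
  return extra ? { $and: [tenant, extra] } : tenant;
}

// Return a query function that cannot run without the tenant guard.
function scopedQuery(index, userId) {
  return params =>
    index.query({ ...params, filter: tenantFilter(userId, params.filter) });
}

// Usage (requires a live index, so it is left commented out here):
// const query = scopedQuery(index, "user-123");
// await query({ vector: queryVec, topK: 5, filter: { tags: { $in: ["public"] } } });
```

App code that only ever sees the function returned by `scopedQuery` has no way to forget the filter.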

Hook up to an LLM (RAG)


Retrieval-Augmented Generation: search → stuff results into prompt → ask LLM.

async function answer(question, userId) {
  const docs = await search(question, userId);
  const context = docs.map(d => `[${d.docId}]\n${d.text}`).join("\n\n---\n\n");

  const r = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Answer using only the context below. Cite [docId] inline. If the context doesn't contain the answer, say so.",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  return r.choices[0].message.content;
}

For richer RAG flows (query rewriting, multi-hop, re-ranking) use a framework: LangChain or LlamaIndex. For the simple "search + stuff" shape above, a framework is overhead.

Hybrid search (semantic + keyword)


Pure vector search misses exact-match queries. The embedding of "ZeroMQ" sits close to that of "RabbitMQ", but a user looking for ZeroMQ docs wants ZeroMQ specifically. Hybrid combines BM25 (keyword) + cosine (semantic).

Pinecone supports hybrid via sparse-dense vectors on serverless indexes. Sparse-dense indexes require the dotproduct metric, so decide this when you create the index. Generate sparse vectors with Pinecone Inference or a BM25 encoder such as rank-bm25 (Python).

If your needs are simpler, run both searches in parallel and merge: reciprocal rank fusion is about 20 lines of code and works well.
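
Reciprocal rank fusion scores each result as the sum of 1 / (k + rank) over every list it appears in, so items ranked well by both searches float to the top. A minimal sketch; `reciprocalRankFusion` is a hypothetical helper, the inputs are assumed to be arrays of IDs ordered best first, and k = 60 is a commonly used default, not something tuned for your data:

```javascript
// rankings: array of ranked ID lists, best first (e.g. [vectorIds, keywordIds]).
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1;  // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Usage: "b" and "a" appear in both lists, so they outrank "d" and "c".
const fused = reciprocalRankFusion([
  ["a", "b", "c"],  // vector search order
  ["b", "d", "a"],  // keyword search order
]);
```

Only ranks are used, never raw scores, which is why the method works even though BM25 and cosine scores live on different scales.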

Common failures

  • Top results look unrelated: usually a chunking problem (too big), not a DB problem. Try smaller chunks with overlap. Inspect the actual returned text; don't trust the score in isolation.
  • Slow queries from an oversized topK: topK: 100 is rarely useful. 5–10 is the sweet spot for RAG. If quality demands it, use a larger topK and re-rank with Cohere Rerank.
  • "Embedding dimension doesn't match index": you switched models. Pinecone indexes are dimension-locked. Create a new index with the right dimension and re-ingest.
  • Stale vectors after document edit: you edited a source document; old chunks still exist with the old text. Use deterministic IDs (${docId}#${i}) so that upsert overwrites them. Also delete chunks beyond the new chunk count.
  • Multi-tenant leak: every query must include the user/workspace filter. Forgetting it once leaks documents across customers. Wrap the SDK in a per-user helper that injects the filter; never call index.query directly from app code.
  • Embedding cost surprise: re-ingesting 1M chunks of ~500 tokens with text-embedding-3-small costs about $10. Not bad, but unexpected the third time you trigger it. Cache embeddings keyed by content hash.
  • Free tier eviction: Pinecone serverless free tier has storage limits. Hitting them silently degrades recall. Monitor index size in console.
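
For the stale-vectors case, the leftover IDs can be computed directly from the deterministic ID scheme. A sketch that assumes your app records how many chunks the previous version of the document produced; `staleChunkIds` is a hypothetical helper:

```javascript
// IDs of chunks that existed in the old version but not in the new one.
function staleChunkIds(docId, oldCount, newCount) {
  const ids = [];
  for (let i = newCount; i < oldCount; i++) {
    ids.push(`${docId}#${i}`);
  }
  return ids;
}

// Usage after re-ingesting (requires a live index, so it is left commented out here):
// const stale = staleChunkIds("doc-42", previousChunkCount, chunks.length);
// if (stale.length > 0) await index.deleteMany(stale);
```

If the document grew instead of shrinking, the list is empty and there is nothing to delete.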

Pricing reality

  • Pinecone Serverless free tier: 2 GB storage, 2M read units / 1M write units per month. Enough for <100K-vector apps.
  • Beyond free: $0.33/GB/mo storage, $16 per 1M read units, $4 per 1M write units. Read units are billed per query and grow with the amount of data the query scans, so they are not a count of individual similarity comparisons; check the pricing page for the exact formula.
  • Embeddings cost more than the DB: OpenAI text-embedding-3-small is $0.02 per 1M tokens; embedding 1M docs (avg 500 tokens) = $10. Re-embedding everything is the expensive part.
  • pgvector on existing Postgres = free: if you already run Postgres (Supabase / Neon / RDS) and have <10M vectors, pgvector is the cheapest option. Latency is similar at small scale; degrades faster at large scale.
  • Self-host Qdrant = $5–20/mo VPS: open source, single binary, fastest single-node performance. You operate it.

Pricing: pinecone.io/pricing.
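
The embedding figure above is simple arithmetic, and wrapping it in a function makes it easy to rerun for your own corpus. A sketch using the $0.02 per 1M tokens price quoted above as the default; `embeddingCostUSD` is a hypothetical helper and the price is an input you should check against current rates:

```javascript
// Cost of embedding a corpus once, in US dollars.
function embeddingCostUSD(docCount, avgTokensPerDoc, pricePerMillionTokens = 0.02) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 1M docs at ~500 tokens each is 500M tokens, i.e. about $10 at the default price.
const fullReingest = embeddingCostUSD(1_000_000, 500);
```

Multiply by the number of times you expect to re-embed (model switches, chunking changes) to get a more honest budget.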


Don't reach for a vector DB until you've tried in-memory. For <10K vectors, cosine(query, every_vector) in your app is faster than the network hop to Pinecone, and free. Vector DBs earn their keep at scale β€” but most apps never get there.
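
The in-memory approach fits in a few lines. A minimal sketch, assuming each item is an object with a `values` array of the same length as the query vector; `cosine` and `searchInMemory` are hypothetical helpers:

```javascript
// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every item against the query and keep the topK closest.
function searchInMemory(queryVec, items, topK = 5) {
  return items
    .map(item => ({ ...item, score: cosine(queryVec, item.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Usage with toy 2-dimensional vectors.
const hits = searchInMemory([1, 0], [
  { id: "a", values: [1, 0] },
  { id: "b", values: [0, 1] },
  { id: "c", values: [1, 1] },
], 2);
```

This is a linear scan, so cost grows with the number of vectors; that is the point at which a vector DB starts to pay for itself.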