Embed documents into vectors, store in Pinecone, query by semantic similarity. The foundation of RAG, recommendation, semantic search, and "find documents like this one."
An embedding is a vector of numbers produced by a model (e.g. text-embedding-3-small). Similar meaning → close vectors.
Console → Create Index. Required choices:
Dimensions: text-embedding-3-small = 1536. text-embedding-3-large = 3072. Cohere embed-v3 = 1024. You can't change this later; a wrong choice means rebuilding the index.
Metric: cosine (default, normalized similarity), euclidean, or dotproduct. Use cosine unless your embedding model docs say otherwise.
Copy the index name and host URL. From API Keys, copy the key.
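Because the index dimension is fixed at creation, it can help to check vector length before anything is upserted. The following is a minimal sketch, not part of any SDK: the `EMBEDDING_DIMS` table and the `assertDimension` helper are hypothetical names, and the sizes are simply the ones listed above.

```javascript
// Hypothetical lookup table: output sizes for the models listed above.
// Keys are labels for this example, not necessarily exact API model IDs.
const EMBEDDING_DIMS = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
  "embed-v3": 1024, // Cohere embed-v3, as named above
};

// Throws if a vector's length does not match the dimension the index was created with.
function assertDimension(vector, indexDimension) {
  if (!Array.isArray(vector) || vector.length !== indexDimension) {
    const got = Array.isArray(vector) ? vector.length : typeof vector;
    throw new Error(`Expected ${indexDimension} dimensions, got ${got}`);
  }
  return vector;
}
```

Calling `assertDimension(vec, EMBEDDING_DIMS["text-embedding-3-small"])` before an upsert surfaces a model/index mismatch as a local error instead of a rejected request.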
PINECONE_API_KEY=…
PINECONE_INDEX=my-docs
OPENAI_API_KEY=… # for embeddings
You need something to produce vectors. OpenAI's embedding API is the default; Cohere, Voyage, and open-source models via sentence-transformers are alternatives.
npm install openai @pinecone-database/pinecone
import OpenAI from "openai";
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
async function embed(text) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0].embedding; // float[1536]
}
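Batch when you can: embedding 100 strings in one call means one network round trip instead of 100 and far fewer requests against your rate limit. Token cost is the same either way, since the API bills per token.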
// Embed many strings in one request; returns one vector per input.
async function embedBatch(chunks) {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks, // array of strings, max 2048 inputs per call
  });
  return r.data.map(d => d.embedding);
}
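If you have more strings than fit in one request, split them into batches first. This is a sketch that assumes the 2048-input limit mentioned in the comment above; `toBatches` and `embedAll` are hypothetical helpers, and `embedFn` is a placeholder for whatever batch-embedding function you use.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size = 2048) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed any number of strings by calling `embedFn` once per batch.
// `embedFn` is a placeholder: it takes string[] and resolves to one vector per string.
async function embedAll(texts, embedFn, size = 2048) {
  const vectors = [];
  for (const batch of toBatches(texts, size)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

Batches run sequentially here to stay gentle on rate limits; running them concurrently is possible but needs its own throttling.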
Don't embed a 50-page PDF as one vector: the meaning gets averaged into mush. Split into ~500-token chunks with ~50-token overlap (so context isn't cut mid-sentence).
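// Sizes are in characters, not tokens; at roughly 4 characters per English token,
// 1500/150 characters is about 375/38 tokens.
function chunkText(text, size = 1500, overlap = 150) {
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const chunks = [];
  let i = 0;
  while (i < text.length) {
    chunks.push(text.slice(i, i + size));
    if (i + size >= text.length) break; // last chunk reached; avoid a redundant tail chunk
    i += size - overlap;
  }
  return chunks;
}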
For real production chunking that respects paragraph + sentence boundaries, use LangChain's RecursiveCharacterTextSplitter or LlamaIndex node parsers.
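If you'd rather not add a framework dependency, a rough middle ground is to pack whole sentences into chunks. This is a simplified sketch, not what LangChain or LlamaIndex do internally: the regex sentence split is naive, sizes are in characters, there is no overlap, and `chunkBySentence` is a hypothetical helper.

```javascript
// Pack whole sentences into chunks of at most `maxChars` characters.
function chunkBySentence(text, maxChars = 1500) {
  // Naive split: a sentence ends at ., ! or ? followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars || current === "") {
      // Either it fits, or this is a single oversized sentence that gets its own chunk.
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Abbreviations such as "e.g." will trigger false sentence breaks, which is one reason the library splitters are the safer choice for production.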
import { Pinecone } from "@pinecone-database/pinecone";
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index(process.env.PINECONE_INDEX);
async function ingestDocument(docId, text) {
  const chunks = chunkText(text);
  const vectors = await embedBatch(chunks);
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}#${i}`, // deterministic ID: re-ingesting overwrites
      values: vectors[i],
      metadata: {
        docId,
        text: chunk, // store the chunk for retrieval
        chunkIndex: i,
        createdAt: Date.now(),
      },
    })),
  );
}
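One consequence of deterministic IDs: if a document shrinks from 10 chunks to 6, re-ingesting overwrites chunks 0 to 5 but leaves 6 to 9 in the index. Below is a sketch of computing the leftover IDs. It assumes you track the previous chunk count somewhere; the `previousCount` argument and the `staleChunkIds` helper are hypothetical, since the text doesn't say where that count would be stored.

```javascript
// Return the IDs of chunks that exist from a previous ingest but are
// no longer produced by the new one (same `${docId}#${i}` scheme as above).
function staleChunkIds(docId, previousCount, newCount) {
  const ids = [];
  for (let i = newCount; i < previousCount; i++) {
    ids.push(`${docId}#${i}`);
  }
  return ids;
}
```

After a successful upsert, the resulting array can be handed to the index's delete-by-ID call (in the Node SDK this has been `index.deleteMany(ids)`; check the signature for your SDK version). Skip the call when the array is empty.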
Metadata patterns:
userId, workspaceId, documentType, tags: ["draft", "public"]. Pinecone filters on metadata server-side, which is cheap.
async function search(query, userId) {
  const queryVec = await embed(query);
  const result = await index.query({
    vector: queryVec,
    topK: 5,
    includeMetadata: true,
    filter: { userId: { $eq: userId } }, // multi-tenant guard
  });
  return result.matches.map(m => ({
    text: m.metadata.text,
    docId: m.metadata.docId,
    score: m.score, // cosine similarity; higher means more similar
  }));
}
Filter operators: $eq, $ne, $gt, $lt, $in, $nin, $and, $or. Full reference: Pinecone metadata filtering.
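Operators compose. The sketch below builds a compound filter using only operators from the list above. The field names (`userId`, `documentType`, `createdAt`) come from the metadata examples earlier; `buildFilter` itself is a hypothetical helper, not an SDK function.

```javascript
// Build a metadata filter that always includes the tenant guard and
// adds optional clauses only when the caller supplies them.
function buildFilter({ userId, documentTypes, createdAfter }) {
  if (!userId) throw new Error("userId is required"); // never query without the tenant guard
  const clauses = [{ userId: { $eq: userId } }];
  if (documentTypes && documentTypes.length > 0) {
    clauses.push({ documentType: { $in: documentTypes } });
  }
  if (createdAfter !== undefined) {
    clauses.push({ createdAt: { $gt: createdAfter } });
  }
  // A single clause needs no $and wrapper.
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}
```

The returned object is meant to be passed as the `filter` field of a query, in place of the hard-coded `{ userId: { $eq: userId } }`.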
Retrieval-Augmented Generation: search → stuff results into prompt → ask LLM.
async function answer(question, userId) {
  const docs = await search(question, userId);
  const context = docs.map(d => `[${d.docId}]\n${d.text}`).join("\n\n---\n\n");
  const r = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Answer using only the context below. Cite [docId] inline. If the context doesn't contain the answer, say so.",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return r.choices[0].message.content;
}
For richer RAG flows (query rewriting, multi-hop, re-ranking) use a framework: LangChain or LlamaIndex. For the simple "search + stuff" shape above, a framework is overhead.
Pure vector search misses exact-match queries. "ZeroMQ" embedded looks similar to "RabbitMQ"; a user looking for ZeroMQ docs wants ZeroMQ specifically. Hybrid combines BM25 (keyword) + cosine (semantic).
Pinecone supports hybrid via sparse-dense vectors on serverless indexes. Generate sparse vectors with Pinecone Inference or rank-bm25.
If your needs are simpler, run both searches in parallel and merge the results; reciprocal rank fusion is about 20 lines of code and works well.
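Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every result list it appears in, so items ranked well by both searches rise to the top. This is a minimal sketch, assuming each search returns an array of IDs ordered best-first. The default k = 60 is the commonly used constant, and `reciprocalRankFusion` is a hypothetical name.

```javascript
// Merge several best-first ID lists into one ranking.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, position) => {
      const rank = position + 1; // ranks start at 1
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}
```

For example, fusing a keyword ranking `["a", "b", "c"]` with a vector ranking `["b", "d", "a"]` puts `b` first, because it ranks near the top of both lists.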
topK: 100 is rarely useful. 5–10 is the sweet spot for RAG. Use a larger topK plus re-ranking with Cohere Rerank if quality demands it.
Use deterministic IDs (${docId}#${i}) and upsert overwrites. Also delete chunks beyond the new chunk count.
index.query directly from app code.
text-embedding-3-small is $0.02 per 1M tokens; embedding 1M docs (avg 500 tokens) = $10. Re-embedding everything is the expensive part.
Pricing: pinecone.io/pricing.
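The embedding cost arithmetic above generalizes to a one-line estimate. The default price is the figure quoted above and may have changed since; `estimateEmbeddingCost` is a hypothetical helper.

```javascript
// Estimated embedding cost in USD for a corpus.
// usdPerMillionTokens defaults to the text-embedding-3-small price quoted above.
function estimateEmbeddingCost(docCount, avgTokensPerDoc, usdPerMillionTokens = 0.02) {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * usdPerMillionTokens;
}
```

With 1M documents averaging 500 tokens, this comes to 500M tokens and about $10, which matches the figure in the text.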
cosine(query, every_vector) in your app is faster than the network hop to Pinecone, and free. Vector DBs earn their keep at scale, but most apps never get there.
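For a small corpus, that in-app search is only a few lines. The sketch below assumes vectors are plain number arrays held in memory, with items shaped like `{ id, values }`. `cosine` and `bruteForceSearch` are hypothetical helpers.

```javascript
// Cosine similarity between two equal-length number arrays.
function cosine(a, b) {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as dissimilar
}

// Score every item against the query and keep the topK best.
function bruteForceSearch(queryVec, items, topK = 5) {
  return items
    .map(item => ({ ...item, score: cosine(queryVec, item.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

This is a linear scan, so cost grows with corpus size; the point at which a vector database becomes worth it depends on your latency budget and how often you query.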