Building a RAG System with Pinecone and Node.js

Author: Hamza Rahman

Once your documents outgrow passing whole files into the prompt, or you need search that matches on meaning rather than exact words, a vector database earns its place. Pinecone stores text as embeddings, so a question finds documents that are semantically similar even when they share no keywords. This is the step up from MongoDB keyword search.

When Pinecone fits for RAG

Reach for a vector database when you have a large collection of documents, you need semantic search that understands meaning, and retrieval has to stay fast as the data grows. The trade-off is more moving parts and the cost of generating embeddings, so it is worth it when keyword search is genuinely missing relevant results.

Setup

Install the dependencies:

npm install openai @pinecone-database/pinecone dotenv

Add your keys to a .env file:

OPENAI_API_KEY=your-api-key-here
PINECONE_API_KEY=your-pinecone-api-key

Initialize both clients:

import { OpenAI } from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'
import dotenv from 'dotenv'
dotenv.config()
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY })
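
If a key is missing from .env, the clients above are constructed with an undefined apiKey and the failure only surfaces later. One way to fail fast is a small guard like the sketch below. requireEnv is a hypothetical helper, not part of dotenv or either SDK, and its second parameter exists only to make it easy to test.

```javascript
// Hypothetical helper (not part of dotenv, openai, or the Pinecone SDK):
// return the named environment variable, or throw if it is missing or blank.
// `env` defaults to process.env and can be swapped for a plain object in tests.
function requireEnv(name, env = process.env) {
  const value = env[name]
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`)
  }
  return value
}
```

With this in place, the client setup could read new OpenAI({ apiKey: requireEnv('OPENAI_API_KEY') }), and likewise for PINECONE_API_KEY.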

Create an index whose dimension matches your embedding model. text-embedding-3-small outputs 1536 values, so the index dimension is 1536.

await pc.createIndex({
  name: 'documents',
  dimension: 1536,
  metric: 'cosine',
  spec: { serverless: { cloud: 'aws', region: 'us-east-1' } },
})
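
A dimension mismatch between the index and the embedding model only shows up when an upsert or query is rejected. The sketch below checks a vector's length before it is sent. EMBEDDING_DIMENSIONS and assertDimension are hypothetical names, and the table lists the default output sizes of three OpenAI models. It would need adjusting if you request shorter vectors through the embeddings API's dimensions option.

```javascript
// Default output sizes of OpenAI embedding models.
// Assumption: no custom `dimensions` value is passed when creating embeddings.
const EMBEDDING_DIMENSIONS = {
  'text-embedding-3-small': 1536,
  'text-embedding-3-large': 3072,
  'text-embedding-ada-002': 1536,
}

// Hypothetical guard: return the vector unchanged if its length matches the
// model's expected dimension, otherwise throw with a descriptive message.
function assertDimension(vector, model) {
  const expected = EMBEDDING_DIMENSIONS[model]
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${model}`)
  }
  if (vector.length !== expected) {
    throw new Error(`Expected ${expected} values for ${model}, got ${vector.length}`)
  }
  return vector
}
```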

Pinecone also offers an integrated embedding option, where you create the index with createIndexForModel and then upsert plain text with upsertRecords. Pinecone generates the embeddings for you, so you skip the OpenAI embedding step. This tutorial uses OpenAI embeddings directly, which is why we create a plain index and provide our own vectors below.

Add documents

To store a document, turn its text into an embedding and upsert that vector along with the original text as metadata. Keeping the text in metadata is what lets you read it back at query time.

async function indexDocument(document) {
  // 1. Create an embedding for the document text
  const embedResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: document.text,
  })
  const embedding = embedResponse.data[0].embedding
  // 2. Upsert the vector into Pinecone
  const index = pc.index('documents')
  await index.upsert([
    {
      id: document.id,
      values: embedding,
      metadata: {
        text: document.text,
        title: document.title,
        source: document.source,
      },
    },
  ])
}
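
indexDocument makes one embedding request and one upsert per document. For a larger collection it may be worth grouping documents first, since the embeddings endpoint accepts an array of inputs and upsert takes an array of records. The helper below is a minimal sketch of that grouping step only. toBatches is a hypothetical name, and the default size of 100 is an arbitrary assumption rather than a documented limit.

```javascript
// Hypothetical helper: split an array into consecutive groups of at most
// `batchSize` items. The default of 100 is an assumption, not an API limit.
function toBatches(items, batchSize = 100) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const batches = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}
```

Each batch could then be embedded with a single openai.embeddings.create call whose input is the array of texts, and written with a single index.upsert call.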

The RAG query

Embed the question with the same model, search Pinecone for the nearest vectors, then answer from the text those matches carry in their metadata.

async function pineconeRAG(question) {
  // 1. Embed the question
  const embedResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  })
  const questionEmbedding = embedResponse.data[0].embedding
  // 2. Find the most similar documents
  const index = pc.index('documents')
  const queryResponse = await index.query({
    vector: questionEmbedding,
    topK: 3,
    includeMetadata: true,
  })
  // 3. Combine their text into context, skipping matches stored without text
  const context = queryResponse.matches
    .map((match) => match.metadata?.text)
    .filter(Boolean)
    .join('\n\n')
  // 4. Answer from that context
  const completion = await openai.chat.completions.create({
    model: 'gpt-5.4-mini',
    messages: [
      {
        role: 'system',
        content: 'Answer using only the provided context. If the answer is not there, say so.',
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  })
  return completion.choices[0].message.content
}

Both the document and the question go through the same embedding model, so their vectors live in the same space and cosine similarity can compare them. topK: 3 returns the three closest matches.
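
To make the comparison concrete, here is a minimal sketch of cosine similarity on plain arrays. Pinecone computes this on the server, so the function is only an illustration, and cosineSimilarity is a hypothetical name. Returning 0 for a zero-length vector is a choice made here to avoid dividing by zero.

```javascript
// Cosine similarity: the dot product of two vectors divided by the product
// of their lengths. Ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // Assumption: treat a zero vector as unrelated to everything
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Vectors pointing the same way score 1 regardless of magnitude, which is why the metric suits embeddings: direction carries the meaning, not length.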

Chunk long documents

Embedding a whole long document into one vector blurs its meaning, and retrieval gets vague. Split long text into smaller chunks and embed each one, so a query can match the specific passage it needs.

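function chunkDocument(text, maxChunkSize = 500) {
  // Split into sentences; the second alternative keeps trailing text with no closing punctuation
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || []
  const chunks = []
  let current = ''
  for (const sentence of sentences) {
    if ((current + sentence).length <= maxChunkSize) {
      current += sentence
    } else {
      // Only push non-empty text, so an oversized first sentence does not emit an empty chunk
      if (current.trim()) chunks.push(current.trim())
      current = sentence
    }
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}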
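
Each chunk needs its own vector ID, and it helps if that ID traces back to the parent document. The sketch below turns one document's chunks into items with the same shape indexDocument expects (id, text, title, source). toChunkDocuments and the `#chunk-N` ID scheme are hypothetical choices, not a Pinecone convention.

```javascript
// Hypothetical helper: build one indexable item per chunk.
// The ID combines the parent document ID with the chunk position,
// so re-indexing the same document overwrites its earlier chunks.
function toChunkDocuments(document, chunks) {
  return chunks.map((chunk, i) => ({
    id: `${document.id}#chunk-${i}`,
    text: chunk,
    title: document.title,
    source: document.source,
  }))
}
```

A document could then be indexed by looping over toChunkDocuments(doc, chunkDocument(doc.text)) and calling indexDocument on each item.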

Handle rate limits

Embedding many documents can hit rate limits. A small retry with exponential backoff keeps indexing jobs from failing on a transient error.

async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn()
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (i === maxRetries - 1) throw error
      // Wait 1s, 2s, 4s, ... before the next attempt
      await new Promise((r) => setTimeout(r, 1000 * 2 ** i))
    }
  }
}
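
withRetry retries on every error, including ones that will never succeed, such as an invalid API key. The sketch below shows two small pieces that could refine it: the delay schedule the loop produces, and a check for which errors are worth retrying. Both function names are hypothetical, and isRetryable assumes the thrown error carries an HTTP status code on error.status, which may not hold for every client or error type.

```javascript
// The waits (in ms) between attempts: baseMs * 2 ** i after each failed
// attempt except the last, which rethrows instead of waiting.
function retryDelays(maxRetries = 3, baseMs = 1000) {
  const delays = []
  for (let i = 0; i < maxRetries - 1; i++) {
    delays.push(baseMs * 2 ** i)
  }
  return delays
}

// Assumption: the error exposes an HTTP status on `error.status`.
// Retry rate limits (429) and server errors (5xx); treat an error with no
// status as a network-level failure and retry it too.
function isRetryable(error) {
  const status = error?.status
  if (status === undefined) return true
  return status === 429 || status >= 500
}
```

With the defaults, retryDelays() gives [1000, 2000], so three attempts take at most three seconds of waiting. Inside the catch block, a check like `if (!isRetryable(error)) throw error` would stop early on errors such as 401.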

Limitations

Pinecone adds real capability, but also real cost and complexity. You pay to generate embeddings, you run another service, and you have to chunk documents well to get good results. If keyword search already returns what you need, the simpler options are worth trying first.