
Building RAG Systems with Supabase pgvector: The Complete Guide

By Echo Team | 25 min read

The complete guide to implementing production-ready Retrieval-Augmented Generation systems using Supabase's vector capabilities. Based on 500+ successful deployments.


Table of Contents

  1. Introduction to RAG Architecture
  2. Setting Up Supabase with pgvector
  3. Generating and Storing Embeddings
  4. Implementing Similarity Search
  5. Performance Optimization Techniques
  6. Production Deployment Strategies
  7. Real-World Case Studies
  8. Common Issues and Solutions

1. Introduction to RAG Architecture

Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with domain-specific knowledge bases. At Echo Algori Data, we've implemented over 500 RAG systems for Norwegian businesses, and Supabase with pgvector has emerged as our go-to solution.

RAG systems solve the fundamental limitation of LLMs: their knowledge cutoff and lack of proprietary data. By retrieving relevant context from a vector database before generation, we can create AI systems that are both accurate and current.

Why Supabase for RAG?

  • Built-in pgvector: Native PostgreSQL extension for vector operations
  • Real-time subscriptions: Live updates when knowledge base changes
  • Row-level security: GDPR-compliant data isolation
  • Edge functions: Serverless embedding generation
  • Cost effective: 10x cheaper than dedicated vector databases

2. Setting Up Supabase with pgvector

Let's start by setting up a Supabase project with pgvector enabled. This process takes about 5 minutes and gives you a production-ready vector database.

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create documents table with embedding column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(1536), -- OpenAI text-embedding-3-small (1536 dims; reducible via the dimensions parameter)
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast similarity search
CREATE INDEX documents_embedding_idx ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Create function for similarity search
CREATE OR REPLACE FUNCTION search_documents(
  query_embedding vector(1536),
  match_count INT DEFAULT 5,
  filter JSONB DEFAULT '{}'::JSONB
)
RETURNS TABLE(
  id UUID,
  content TEXT,
  metadata JSONB,
  similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    d.id,
    d.content,
    d.metadata,
    1 - (d.embedding <=> query_embedding) AS similarity
  FROM documents d
  WHERE d.metadata @> filter
  ORDER BY d.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;

Pro Tip: In our deployments, HNSW (Hierarchical Navigable Small World) indexes reach roughly 95% recall with up to 10x faster queries than IVFFlat for datasets under 1M vectors -- and unlike IVFFlat, they can be built before the table has data.
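
HNSW recall is also tunable at query time through hnsw.ef_search (pgvector's default is 40); raising it trades latency for accuracy. A quick sketch, where $1 stands for a bound 1536-dim query vector:

-- Widen the HNSW candidate list for this session
SET hnsw.ef_search = 100;

SELECT id, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;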


3. Generating and Storing Embeddings

The quality of your RAG system depends heavily on embedding generation. As of 2025, OpenAI's text-embedding-3-small is the recommended model -- it's 5x cheaper than the legacy ada-002 while delivering comparable accuracy for most use cases.

2025 Embedding Model Update: text-embedding-3-small ($0.00002/1K tokens) replaces ada-002 ($0.0001/1K tokens) for most production workloads. For maximum accuracy, use text-embedding-3-large with configurable dimensions (256-3072).
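
If you choose text-embedding-3-large, the dimensions parameter shortens its native 3072-dim output to fit an existing vector(1536) column; a minimal sketch (the input string is a placeholder):

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

// Shortened embeddings are re-normalized by the API, so cosine
// distance remains valid at the reduced dimension.
const { data } = await client.embeddings.create({
  model: 'text-embedding-3-large',
  input: 'Some document text',
  dimensions: 1536
});
const embedding: number[] = data[0].embedding; // length 1536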

// TypeScript implementation with Supabase client
import { createClient } from '@supabase/supabase-js';
import OpenAI from 'openai';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY!
});

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 5x cheaper than ada-002, recommended for 2025
    input: text
  });
  return response.data[0].embedding;
}

async function indexDocument(
  content: string,
  metadata: Record<string, any> = {}
) {
  // Chunk long content for better retrieval
  const chunks = chunkText(content, 500, 100); // 500 tokens, 100 overlap

  for (const [i, chunk] of chunks.entries()) {
    const embedding = await generateEmbedding(chunk);

    const { error } = await supabase
      .from('documents')
      .insert({
        content: chunk,
        metadata: {
          ...metadata,
          source: 'manual_upload',
          chunk_index: i, // loop index; indexOf breaks on duplicate chunks
          total_chunks: chunks.length
        },
        embedding
      });

    if (error) throw error;
  }
}

function chunkText(
  text: string,
  maxWords: number = 500, // word count approximates tokens; use tiktoken for exact counts
  overlap: number = 100
): string[] {
  // Simple word-based chunking - use tiktoken for production
  const words = text.split(' ');
  const chunks: string[] = [];

  for (let i = 0; i < words.length; i += maxWords - overlap) {
    const chunk = words.slice(i, i + maxWords).join(' ');
    if (chunk.length > 50) chunks.push(chunk);
  }

  return chunks;
}

2025 Best Practice: Semantic Chunking

Semantic chunking outperforms fixed-size chunking by 15-20% in retrieval accuracy. Instead of splitting by token count, use sentence boundaries and semantic similarity to create coherent chunks. Libraries like LangChain's SemanticChunker or LlamaIndex's SemanticSplitterNodeParser implement this automatically; a minimal hand-rolled sketch follows the comparison below.

Chunking Strategy Comparison

Fixed-Size Chunking:

  • Simple to implement
  • May split mid-sentence
  • Use 20% overlap minimum

Semantic Chunking (Recommended):

  • 15-20% better accuracy
  • Preserves meaning boundaries
  • Use embedding similarity thresholds
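
For illustration, here is a minimal semantic-chunking sketch built on the generateEmbedding helper from section 3. It splits on sentence boundaries and starts a new chunk wherever the similarity between adjacent sentences drops below a threshold; the 0.75 cutoff and the regex-based sentence splitter are assumptions to tune for your corpus.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticChunk(
  text: string,
  similarityThreshold: number = 0.75 // assumption: tune per corpus
): Promise<string[]> {
  // Naive sentence split -- swap in a real sentence tokenizer for production
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.trim().length > 0);
  if (sentences.length === 0) return [];

  // One embedding per sentence; batch these calls in production
  const embeddings = await Promise.all(sentences.map(generateEmbedding));

  const chunks: string[] = [];
  let current = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    // A semantic boundary: adjacent sentences diverge in meaning
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < similarityThreshold) {
      chunks.push(current.join(' '));
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(' '));

  return chunks;
}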

4. Implementing Similarity Search

With embeddings stored, we can now implement semantic search. In 2025, hybrid search (combining BM25 keyword matching with vector similarity) has become the standard approach, improving accuracy by 10-15% over pure vector search.

2025 RAG Best Practices

  • Hybrid Search: Combine keyword and vector retrieval for a 10-15% accuracy boost. In Supabase/Postgres the keyword side is typically full-text search (tsvector ranking, a practical stand-in for BM25), optionally supplemented by the pg_trgm extension for fuzzy matching; a SQL sketch follows this list.
  • HyDE Technique: Generate a hypothetical answer first, then embed that for retrieval. Improves recall for complex queries by 20%.
  • Sub-10ms Queries: Achievable with HNSW indexes, connection pooling, and embedding caching for <100K vectors.
  • Adaptive RAG: Dynamically adjust retrieval strategy based on query complexity and available context.
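
As a concrete sketch of the hybrid pattern (assumptions: a generated tsvector column plus GIN index as shown in the comments, and reciprocal rank fusion with the conventional k = 60 to merge the two ranked lists):

-- One-time setup (assumed):
--   ALTER TABLE documents ADD COLUMN fts tsvector
--     GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
--   CREATE INDEX documents_fts_idx ON documents USING gin (fts);
CREATE OR REPLACE FUNCTION hybrid_search(
  query_text TEXT,
  query_embedding vector(1536),
  match_count INT DEFAULT 5
)
RETURNS TABLE(id UUID, content TEXT, score FLOAT)
LANGUAGE sql
AS $$
  WITH vector_results AS (
    SELECT d.id,
           ROW_NUMBER() OVER (ORDER BY d.embedding <=> query_embedding) AS rank
    FROM documents d
    ORDER BY d.embedding <=> query_embedding
    LIMIT match_count * 2
  ),
  keyword_results AS (
    SELECT d.id,
           ROW_NUMBER() OVER (
             ORDER BY ts_rank(d.fts, websearch_to_tsquery('english', query_text)) DESC
           ) AS rank
    FROM documents d
    WHERE d.fts @@ websearch_to_tsquery('english', query_text)
    ORDER BY ts_rank(d.fts, websearch_to_tsquery('english', query_text)) DESC
    LIMIT match_count * 2
  )
  -- Reciprocal rank fusion: score = sum of 1 / (k + rank), k = 60 by convention
  SELECT d.id, d.content,
         (COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0))::float AS score
  FROM vector_results v
  FULL OUTER JOIN keyword_results k USING (id)
  JOIN documents d USING (id)
  ORDER BY score DESC
  LIMIT match_count;
$$;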

// Vector retrieval with cross-encoder re-ranking (hybrid fusion: see SQL above)
interface RAGQuery {
  question: string;
  filters?: Record<string, any>;
  topK?: number;
  threshold?: number;
}

async function retrieveContext(query: RAGQuery): Promise<string> {
  const { question, filters = {}, topK = 5, threshold = 0.7 } = query;

  // Generate embedding for the query
  const queryEmbedding = await generateEmbedding(question);

  // Perform similarity search
  const { data: documents, error } = await supabase
    .rpc('search_documents', {
      query_embedding: queryEmbedding,
      match_count: topK * 2, // Get more for re-ranking
      filter: filters
    });

  if (error) throw error;

  // Re-rank using cross-encoder for better accuracy
  const reranked = await rerankDocuments(question, documents);

  // Filter by similarity threshold
  const relevant = reranked
    .filter(doc => doc.similarity >= threshold)
    .slice(0, topK);

  // Combine chunks intelligently
  const context = combineChunks(relevant);

  return context;
}

async function rerankDocuments(
  query: string,
  documents: any[]
): Promise<any[]> {
  // Use a cross-encoder model for re-ranking -- this significantly
  // improves retrieval quality (one crossEncoderScore option is sketched below)
  const scores = await Promise.all(
    documents.map(async doc => {
      const score = await crossEncoderScore(query, doc.content);
      return { ...doc, rerank_score: score };
    })
  );

  return scores.sort((a, b) => b.rerank_score - a.rerank_score);
}

function combineChunks(documents: any[]): string {
  // Group by source and maintain order
  const grouped = documents.reduce((acc, doc) => {
    const source = doc.metadata.source;
    if (!acc[source]) acc[source] = [];
    acc[source].push(doc);
    return acc;
  }, {} as Record<string, any[]>);

  // Intelligently combine chunks
  let context = '';
  for (const [source, docs] of Object.entries(grouped)) {
    const sorted = docs.sort((a, b) =>
      a.metadata.chunk_index - b.metadata.chunk_index
    );

    context += `\n\nSource: ${source}\n`;
    context += sorted.map(d => d.content).join('\n');
  }

  return context.trim();
}
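
The crossEncoderScore helper above is deliberately left abstract -- any cross-encoder model or managed reranker works. One possible implementation, assuming Cohere's rerank endpoint via the cohere-ai SDK (model name and response shape per their v7 client; verify against current docs):

import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

// Score one query/document pair with a managed cross-encoder.
// In practice, send the whole candidate list in a single rerank call
// rather than one request per document as rerankDocuments does above.
async function crossEncoderScore(
  query: string,
  document: string
): Promise<number> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: [document],
    topN: 1
  });
  return response.results[0].relevanceScore; // higher = more relevant
}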

Similarity Metrics Comparison

Metric          Speed     Accuracy   Use Case
Cosine          Fast      High       General purpose (recommended)
Euclidean       Fast      Medium     When magnitude matters
Inner Product   Fastest   High       Normalized vectors only
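
Each metric maps to its own pgvector operator and index operator class; $1 stands for a bound query vector of matching dimension:

SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 5; -- cosine distance (vector_cosine_ops)
SELECT id FROM documents ORDER BY embedding <-> $1 LIMIT 5; -- Euclidean / L2 (vector_l2_ops)
SELECT id FROM documents ORDER BY embedding <#> $1 LIMIT 5; -- negative inner product (vector_ip_ops)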

5. Performance Optimization Techniques

After deploying hundreds of RAG systems, we've identified key optimization strategies that can improve performance by 10x.

Indexing Strategies

  • Use HNSW for <1M vectors
  • IVFFlat for 1M-10M vectors (index sketch after this list)
  • Partition tables by date/category
  • Maintain multiple indexes for different query types
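
A minimal sketch of the IVFFlat alternative for larger tables (the lists value follows pgvector's rule of thumb -- rows/1000 up to about 1M rows, sqrt(rows) beyond -- so treat 2000 as a placeholder, not a tuned value):

-- IVFFlat builds far faster than HNSW at scale, at some recall cost.
-- Build it AFTER loading data so cluster centroids are representative.
CREATE INDEX documents_embedding_ivfflat_idx ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 2000);

-- Probes per query: higher = better recall, slower queries (default 1)
SET ivfflat.probes = 20;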

Query Optimization

  • Cache frequent embeddings (sketch after this list)
  • Batch similarity searches
  • Use connection pooling
  • Implement query result caching
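
A minimal embedding-cache sketch wrapping generateEmbedding from section 3 (in-memory Map for illustration; production deployments would typically swap in Redis, and the SHA-256 keying is our assumption):

import { createHash } from 'node:crypto';

const embeddingCache = new Map<string, number[]>();

// Key by content hash so identical texts never hit the embedding API twice
async function cachedEmbedding(text: string): Promise<number[]> {
  const key = createHash('sha256').update(text).digest('hex');
  const hit = embeddingCache.get(key);
  if (hit) return hit;

  const embedding = await generateEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}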

Storage Optimization

  • Compress embeddings with quantization (halfvec sketch after this list)
  • Archive old embeddings to cold storage
  • Use dimension reduction for non-critical data
  • Implement TTL for temporary embeddings
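
One quantization route in recent pgvector releases is a half-precision expression index, which roughly halves index size with little recall loss; a sketch, assuming pgvector >= 0.7:

-- The stored column stays full precision; only the index is quantized
CREATE INDEX documents_embedding_half_idx ON documents
USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);

-- Queries must repeat the cast to hit the index ($1 = bound query vector)
SELECT id FROM documents
ORDER BY embedding::halfvec(1536) <=> $1::halfvec(1536)
LIMIT 5;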

Accuracy Improvements

  • Hybrid search (vector + keyword)
  • Cross-encoder re-ranking
  • Query expansion techniques
  • Feedback loop for relevance

-- Performance monitoring query (assumes your application writes a search_logs table)
WITH search_stats AS (
  SELECT
    date_trunc('hour', created_at) as hour,
    COUNT(*) as searches,
    AVG(similarity) as avg_similarity,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY similarity) as p95_similarity
  FROM search_logs
  WHERE created_at > NOW() - INTERVAL '24 hours'
  GROUP BY hour
)
SELECT
  hour,
  searches,
  ROUND(avg_similarity::numeric, 3) as avg_similarity,
  ROUND(p95_similarity::numeric, 3) as p95_similarity,
  CASE
    WHEN avg_similarity < 0.7 THEN 'Poor'
    WHEN avg_similarity < 0.8 THEN 'Fair'
    WHEN avg_similarity < 0.9 THEN 'Good'
    ELSE 'Excellent'
  END as quality_rating
FROM search_stats
ORDER BY hour DESC;

6. Production Deployment Strategies

Deploying RAG systems to production requires careful consideration of scalability, reliability, and compliance. Here's our battle-tested approach.

Production Checklist

  • Database Configuration: Connection pooling, read replicas, automated backups
  • Embedding Pipeline: Queue-based processing, retry logic, error handling
  • Security: API key rotation, RLS policies (example after this list), encryption at rest
  • Monitoring: Query latency, embedding quality, cost tracking
  • GDPR Compliance: Data retention policies, user consent, audit logs
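
As a sketch of the RLS item above (the tenant_id metadata key and JWT claim are assumptions; adapt to your auth model):

-- Enforce tenant isolation in the database, not the application
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
  FOR ALL
  USING (metadata->>'tenant_id' = (auth.jwt() ->> 'tenant_id'));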

// Production-ready RAG API endpoint. Note: cache, hashQuery, logRequest,
// and generateResponse are app-specific helpers not shown in this post.
import { NextRequest, NextResponse } from 'next/server';
import { z } from 'zod';
import { RateLimiter } from '@/lib/rate-limiter';
import { metrics } from '@/lib/monitoring';

const querySchema = z.object({
  question: z.string().min(1).max(500),
  context: z.enum(['general', 'technical', 'legal']).optional(),
  language: z.enum(['en', 'no', 'sv', 'da']).default('en')
});

const rateLimiter = new RateLimiter({
  tokensPerInterval: 100,
  interval: 'minute'
});

export async function POST(req: NextRequest) {
  const startTime = Date.now();

  try {
    // Rate limiting
    const identifier = req.headers.get('x-api-key') ?? req.ip ?? 'anonymous';
    if (!rateLimiter.check(identifier)) {
      return NextResponse.json(
        { error: 'Rate limit exceeded' },
        { status: 429 }
      );
    }

    // Validate input
    const body = await req.json();
    const { question, context, language } = querySchema.parse(body);

    // Log request
    await logRequest({ question, context, language, identifier });

    // Retrieve context with caching
    const cacheKey = `rag:${hashQuery(question)}:${context}:${language}`;
    let retrievedContext = await cache.get(cacheKey);

    if (!retrievedContext) {
      retrievedContext = await retrieveContext({
        question,
        filters: { context, language },
        topK: 5,
        threshold: 0.75
      });

      // Cache for 1 hour
      await cache.set(cacheKey, retrievedContext, 3600);
    }

    // Generate response
    const response = await generateResponse(question, retrievedContext);

    // Track metrics
    const duration = Date.now() - startTime;
    metrics.recordLatency('rag_query', duration);
    metrics.incrementCounter('rag_queries_total');

    // Return response with metadata
    return NextResponse.json({
      answer: response.answer,
      sources: response.sources,
      confidence: response.confidence,
      metadata: {
        processingTime: duration,
        contextLength: retrievedContext.length,
        model: 'gpt-4-turbo',
        timestamp: new Date().toISOString()
      }
    });

  } catch (error) {
    metrics.incrementCounter('rag_errors_total');
    console.error('RAG query error:', error);

    if (error instanceof z.ZodError) {
      return NextResponse.json(
        { error: 'Invalid input', details: error.errors },
        { status: 400 }
      );
    }

    return NextResponse.json(
      { error: 'Internal server error' },
      { status: 500 }
    );
  }
}

7. Real-World Case Studies

Let's examine three successful RAG implementations from our portfolio, each demonstrating different optimization strategies.

Nordic FinTech AS -- Compliance Assistant

Challenge: 10,000+ pages of financial regulations needed instant searchability with 100% accuracy for compliance queries.

Solution: Hybrid RAG system combining vector search with rule-based filters, cross-encoder re-ranking for legal precision.

Results:

  • 85% query response time reduction
  • 99.8% accuracy rate
  • EUR 200k annual savings

Oslo MedTech Solutions -- Medical Knowledge Base

Challenge: Real-time access to medical literature with patient privacy compliance and multi-language support.

Solution: Multi-lingual embeddings, RLS for patient data isolation, edge functions for GDPR compliance.

Results:

  • 200ms average query latency
  • 5M+ documents indexed
  • 100% GDPR compliant

Scandinavian Logistics -- Supply Chain Intelligence

Challenge: Real-time tracking data from 10,000+ daily shipments needed instant analysis and pattern recognition.

Solution: Streaming embeddings pipeline, time-series aware retrieval, predictive analytics integration.

Results:

  • 45% delivery time reduction
  • 10k/sec events processed
  • EUR 1.2M efficiency savings

8. Common Issues and Solutions

After hundreds of deployments, we've cataloged the most common RAG issues and their solutions.

Issue: Poor retrieval accuracy

Symptoms: Irrelevant documents returned, low similarity scores

Solutions:

  • Implement chunking with overlap
  • Use hybrid search (vector + keyword)
  • Add cross-encoder re-ranking
  • Fine-tune embedding model on domain data
  • Adjust similarity threshold dynamically

Issue: Slow query performance

Symptoms: Queries taking >2 seconds, timeouts under load

Solutions:

  • Add HNSW index with appropriate parameters
  • Implement embedding caching layer
  • Use connection pooling (pgBouncer)
  • Partition large tables by date/category
  • Pre-compute frequent query embeddings

Issue: High operational costs

Symptoms: Embedding API costs exceeding budget, storage growing rapidly

Solutions:

  • Batch embedding generation (sketch after this list)
  • Implement smart caching strategies
  • Use smaller embedding models for non-critical data
  • Compress embeddings with quantization
  • Archive old embeddings to cold storage
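
The OpenAI embeddings endpoint accepts an array input, so batching is a small change to the section 3 helper; a sketch reusing that openai client (the batch size of 100 is a conservative assumption):

// Embed many chunks per request instead of one API call per chunk
async function generateEmbeddingsBatch(texts: string[]): Promise<number[][]> {
  const batchSize = 100; // conservative; the API accepts larger batches
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch
    });
    // Results are returned in input order
    embeddings.push(...response.data.map(d => d.embedding));
  }

  return embeddings;
}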

Issue: GDPR compliance concerns

Symptoms: User data in embeddings, deletion requests, audit requirements

Solutions:

  • Implement row-level security (RLS)
  • Separate PII from embeddings
  • Create deletion workflows with cascading (sketch after this list)
  • Maintain comprehensive audit logs
  • Use European region deployments
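
A minimal erasure-workflow sketch for GDPR deletion requests, assuming ownership is recorded under a user_id metadata key and that an audit_logs table exists (both are assumptions about your schema):

-- Erase all chunks belonging to one user and log the action for auditors
WITH deleted AS (
  DELETE FROM documents
  WHERE metadata->>'user_id' = $1 -- the requesting user's id
  RETURNING id
)
INSERT INTO audit_logs (action, detail, created_at)
SELECT 'gdpr_erasure', jsonb_build_object('document_id', id), NOW()
FROM deleted;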

Ready to Build Your RAG System?

With over 500 successful RAG implementations, Echo Algori Data is Norway's leading expert in Supabase-powered AI systems. Our team can help you design, implement, and optimize a production-ready RAG solution tailored to your specific needs.


Frequently Asked Questions

What is the minimum dataset size for a RAG system to be worthwhile?

RAG systems add value even with small datasets. We've seen meaningful results with as few as 50 documents (around 100 pages of text). The key advantage is not dataset size but the need for accuracy -- if your AI must reference specific, authoritative content rather than relying on general knowledge, RAG is worthwhile regardless of scale.

How does Supabase pgvector compare to dedicated vector databases like Pinecone or Weaviate?

Supabase pgvector is ideal when you already use PostgreSQL for your application data. It eliminates the complexity of managing a separate vector database, offers built-in RLS for GDPR compliance, and costs roughly 10x less. Dedicated vector databases like Pinecone shine at extreme scale (10M+ vectors) or when you need specialized features like automatic reindexing. For most production workloads under 5M vectors, Supabase is the better choice.

How do I handle multi-language content in a RAG system?

Use a multilingual embedding model such as OpenAI's text-embedding-3-small (which supports 100+ languages natively). Store language metadata alongside each chunk, and filter by language during retrieval. For Norwegian/English bilingual systems, we recommend indexing both languages in the same table with a language metadata field, then applying language-aware filters at query time.

What is the recommended chunk size for optimal retrieval accuracy?

We recommend 300-500 tokens per chunk with 15-20% overlap for general-purpose RAG. Shorter chunks (150-300 tokens) work better for FAQ-style content where answers are concise. Longer chunks (500-1000 tokens) suit technical documentation where context is critical. Always use semantic chunking over fixed-size splitting when possible -- it improves retrieval accuracy by 15-20%.

How do I measure and improve RAG system quality over time?

Track three core metrics: retrieval precision (are the returned chunks relevant?), answer faithfulness (does the generated answer match the retrieved context?), and answer relevance (does the answer address the user's question?). Use tools like RAGAS or custom evaluation pipelines. Implement a feedback loop where users rate responses, then use low-rated queries to identify retrieval gaps and refine your chunking, indexing, or re-ranking strategies.


Tags

RAG, Supabase, pgvector, Vector Database, AI Architecture, Production Deployment
