TECHNICAL DEEP DIVE

Building RAG Systems with Supabase pgvector

The complete guide to implementing production-ready Retrieval-Augmented Generation systems using Supabase's vector capabilities. Based on 500+ successful deployments.

📖 25 min read🔥 Advanced💾 Code examples included🚀 Production ready

1. Introduction to RAG Architecture

Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with domain-specific knowledge bases. At Echo Algori Data, we've implemented over 500 RAG systems for Norwegian businesses, and Supabase with pgvector has emerged as our go-to solution.

RAG systems solve the fundamental limitation of LLMs: their knowledge cutoff and lack of proprietary data. By retrieving relevant context from a vector database before generation, we can create AI systems that are both accurate and current.

Why Supabase for RAG?

  • Built-in pgvector: Native PostgreSQL extension for vector operations
  • Real-time subscriptions: Live updates when knowledge base changes
  • Row-level security: GDPR-compliant data isolation
  • Edge functions: Serverless embedding generation
  • Cost effective: 10x cheaper than dedicated vector databases

2. Setting Up Supabase with pgvector

Let's start by setting up a Supabase project with pgvector enabled. This process takes about 5 minutes and gives you a production-ready vector database.

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create documents table with embedding column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(1536), -- OpenAI ada-002 dimensions
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast similarity search
CREATE INDEX documents_embedding_idx ON documents 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Create function for similarity search
CREATE OR REPLACE FUNCTION search_documents(
  query_embedding vector(1536),
  match_count INT DEFAULT 5,
  filter JSONB DEFAULT '{}'::JSONB
)
RETURNS TABLE(
  id UUID,
  content TEXT,
  metadata JSONB,
  similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    d.id,
    d.content,
    d.metadata,
    1 - (d.embedding <=> query_embedding) AS similarity
  FROM documents d
  WHERE d.metadata @> filter
  ORDER BY d.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;

💡 Pro Tip

HNSW (Hierarchical Navigable Small World) indexes provide 95% accuracy with 10x speed improvement over IVFFlat indexes for datasets under 1M vectors.

3. Generating and Storing Embeddings

The quality of your RAG system depends heavily on embedding generation. We'll use OpenAI's ada-002 model, which offers the best balance of performance and cost for most use cases.

// TypeScript implementation with Supabase client
import { createClient } from '@supabase/supabase-js';
import OpenAI from 'openai';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY!
});

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: text
  });
  return response.data[0].embedding;
}

async function indexDocument(
  content: string, 
  metadata: Record<string, any> = {}
) {
  // Chunk long content for better retrieval
  const chunks = chunkText(content, 500, 100); // 500 tokens, 100 overlap
  
  for (const chunk of chunks) {
    const embedding = await generateEmbedding(chunk);
    
    const { error } = await supabase
      .from('documents')
      .insert({
        content: chunk,
        metadata: {
          ...metadata,
          source: 'manual_upload',
          chunk_index: chunks.indexOf(chunk),
          total_chunks: chunks.length
        },
        embedding
      });
      
    if (error) throw error;
  }
}

function chunkText(
  text: string, 
  maxTokens: number = 500, 
  overlap: number = 100
): string[] {
  // Simple word-based chunking - use tiktoken for production
  const words = text.split(' ');
  const chunks: string[] = [];
  
  for (let i = 0; i < words.length; i += (maxTokens - overlap)) {
    const chunk = words.slice(i, i + maxTokens).join(' ');
    if (chunk.length > 50) chunks.push(chunk);
  }
  
  return chunks;
}

🎯 Best Practice

Always implement overlapping chunks to maintain context across boundaries. We've found 20% overlap (100 tokens for 500-token chunks) provides optimal retrieval accuracy.

4. Implementing Similarity Search

With embeddings stored, we can now implement semantic search. The key is combining vector similarity with metadata filtering for precise retrieval.

// Advanced RAG retrieval with hybrid search
interface RAGQuery {
  question: string;
  filters?: Record<string, any>;
  topK?: number;
  threshold?: number;
}

async function retrieveContext(query: RAGQuery): Promise<string> {
  const { question, filters = {}, topK = 5, threshold = 0.7 } = query;
  
  // Generate embedding for the query
  const queryEmbedding = await generateEmbedding(question);
  
  // Perform similarity search
  const { data: documents, error } = await supabase
    .rpc('search_documents', {
      query_embedding: queryEmbedding,
      match_count: topK * 2, // Get more for re-ranking
      filter: filters
    });
    
  if (error) throw error;
  
  // Re-rank using cross-encoder for better accuracy
  const reranked = await rerankDocuments(question, documents);
  
  // Filter by similarity threshold
  const relevant = reranked
    .filter(doc => doc.similarity >= threshold)
    .slice(0, topK);
    
  // Combine chunks intelligently
  const context = combineChunks(relevant);
  
  return context;
}

async function rerankDocuments(
  query: string, 
  documents: any[]
): Promise<any[]> {
  // Use a cross-encoder model for re-ranking
  // This significantly improves retrieval quality
  const scores = await Promise.all(
    documents.map(async doc => {
      const score = await crossEncoderScore(query, doc.content);
      return { ...doc, rerank_score: score };
    })
  );
  
  return scores.sort((a, b) => b.rerank_score - a.rerank_score);
}

function combineChunks(documents: any[]): string {
  // Group by source and maintain order
  const grouped = documents.reduce((acc, doc) => {
    const source = doc.metadata.source;
    if (!acc[source]) acc[source] = [];
    acc[source].push(doc);
    return acc;
  }, {} as Record<string, any[]>);
  
  // Intelligently combine chunks
  let context = '';
  for (const [source, docs] of Object.entries(grouped)) {
    const sorted = docs.sort((a, b) => 
      a.metadata.chunk_index - b.metadata.chunk_index
    );
    
    context += `\n\nSource: ${source}\n`;
    context += sorted.map(d => d.content).join('\n');
  }
  
  return context.trim();
}

Similarity Metrics Comparison

MetricSpeedAccuracyUse Case
CosineFastHighGeneral purpose (recommended)
EuclideanFastMediumWhen magnitude matters
Inner ProductFastestHighNormalized vectors only

5. Performance Optimization Techniques

After deploying hundreds of RAG systems, we've identified key optimization strategies that can improve performance by 10x.

🚀 Indexing Strategies

  • • Use HNSW for <1M vectors
  • • IVFFlat for 1M-10M vectors
  • • Partition tables by date/category
  • • Maintain multiple indexes for different query types

⚡ Query Optimization

  • • Cache frequent embeddings
  • • Batch similarity searches
  • • Use connection pooling
  • • Implement query result caching

💾 Storage Optimization

  • • Compress embeddings with quantization
  • • Archive old embeddings to cold storage
  • • Use dimension reduction for non-critical data
  • • Implement TTL for temporary embeddings

🎯 Accuracy Improvements

  • • Hybrid search (vector + keyword)
  • • Cross-encoder re-ranking
  • • Query expansion techniques
  • • Feedback loop for relevance
-- Performance monitoring query
WITH search_stats AS (
  SELECT 
    date_trunc('hour', created_at) as hour,
    COUNT(*) as searches,
    AVG(similarity) as avg_similarity,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY similarity) as p95_similarity
  FROM search_logs
  WHERE created_at > NOW() - INTERVAL '24 hours'
  GROUP BY hour
)
SELECT 
  hour,
  searches,
  ROUND(avg_similarity::numeric, 3) as avg_similarity,
  ROUND(p95_similarity::numeric, 3) as p95_similarity,
  CASE 
    WHEN avg_similarity < 0.7 THEN 'Poor'
    WHEN avg_similarity < 0.8 THEN 'Fair'
    WHEN avg_similarity < 0.9 THEN 'Good'
    ELSE 'Excellent'
  END as quality_rating
FROM search_stats
ORDER BY hour DESC;

6. Production Deployment Strategies

Deploying RAG systems to production requires careful consideration of scalability, reliability, and compliance. Here's our battle-tested approach.

Production Checklist

Database Configuration: Connection pooling, read replicas, automated backups
Embedding Pipeline: Queue-based processing, retry logic, error handling
Security: API key rotation, RLS policies, encryption at rest
Monitoring: Query latency, embedding quality, cost tracking
GDPR Compliance: Data retention policies, user consent, audit logs
// Production-ready RAG API endpoint
import { NextRequest, NextResponse } from 'next/server';
import { z } from 'zod';
import { RateLimiter } from '@/lib/rate-limiter';
import { metrics } from '@/lib/monitoring';

const querySchema = z.object({
  question: z.string().min(1).max(500),
  context: z.enum(['general', 'technical', 'legal']).optional(),
  language: z.enum(['en', 'no', 'sv', 'da']).default('en')
});

const rateLimiter = new RateLimiter({
  tokensPerInterval: 100,
  interval: 'minute'
});

export async function POST(req: NextRequest) {
  const startTime = Date.now();
  
  try {
    // Rate limiting
    const identifier = req.headers.get('x-api-key') || req.ip;
    if (!rateLimiter.check(identifier)) {
      return NextResponse.json(
        { error: 'Rate limit exceeded' },
        { status: 429 }
      );
    }
    
    // Validate input
    const body = await req.json();
    const { question, context, language } = querySchema.parse(body);
    
    // Log request
    await logRequest({ question, context, language, identifier });
    
    // Retrieve context with caching
    const cacheKey = `rag:${hashQuery(question)}:${context}:${language}`;
    let retrievedContext = await cache.get(cacheKey);
    
    if (!retrievedContext) {
      retrievedContext = await retrieveContext({
        question,
        filters: { context, language },
        topK: 5,
        threshold: 0.75
      });
      
      // Cache for 1 hour
      await cache.set(cacheKey, retrievedContext, 3600);
    }
    
    // Generate response
    const response = await generateResponse(question, retrievedContext);
    
    // Track metrics
    const duration = Date.now() - startTime;
    metrics.recordLatency('rag_query', duration);
    metrics.incrementCounter('rag_queries_total');
    
    // Return response with metadata
    return NextResponse.json({
      answer: response.answer,
      sources: response.sources,
      confidence: response.confidence,
      metadata: {
        processingTime: duration,
        contextLength: retrievedContext.length,
        model: 'gpt-4-turbo',
        timestamp: new Date().toISOString()
      }
    });
    
  } catch (error) {
    metrics.incrementCounter('rag_errors_total');
    console.error('RAG query error:', error);
    
    if (error instanceof z.ZodError) {
      return NextResponse.json(
        { error: 'Invalid input', details: error.errors },
        { status: 400 }
      );
    }
    
    return NextResponse.json(
      { error: 'Internal server error' },
      { status: 500 }
    );
  }
}

7. Real-World Case Studies

Let's examine three successful RAG implementations from our portfolio, each demonstrating different optimization strategies.

🏦 Nordic FinTech AS - Compliance Assistant

Challenge

10,000+ pages of financial regulations needed instant searchability with 100% accuracy for compliance queries.

Solution

Hybrid RAG system combining vector search with rule-based filters, cross-encoder re-ranking for legal precision.

Results

85%
Query response time reduction
99.8%
Accuracy rate
€200k
Annual savings

🏥 Oslo MedTech Solutions - Medical Knowledge Base

Challenge

Real-time access to medical literature with patient privacy compliance and multi-language support.

Solution

Multi-lingual embeddings, RLS for patient data isolation, edge functions for GDPR compliance.

Results

200ms
Average query latency
5M+
Documents indexed
100%
GDPR compliant

📦 Scandinavian Logistics - Supply Chain Intelligence

Challenge

Real-time tracking data from 10,000+ daily shipments needed instant analysis and pattern recognition.

Solution

Streaming embeddings pipeline, time-series aware retrieval, predictive analytics integration.

Results

45%
Delivery time reduction
10k/sec
Events processed
€1.2M
Efficiency savings

8. Common Issues and Solutions

After hundreds of deployments, we've cataloged the most common RAG issues and their solutions.

🐛 Issue: Poor retrieval accuracy

Symptoms: Irrelevant documents returned, low similarity scores

Solutions:

  • Implement chunking with overlap
  • Use hybrid search (vector + keyword)
  • Add cross-encoder re-ranking
  • Fine-tune embedding model on domain data
  • Adjust similarity threshold dynamically
⚡ Issue: Slow query performance

Symptoms: Queries taking >2 seconds, timeouts under load

Solutions:

  • Add HNSW index with appropriate parameters
  • Implement embedding caching layer
  • Use connection pooling (pgBouncer)
  • Partition large tables by date/category
  • Pre-compute frequent query embeddings
💰 Issue: High operational costs

Symptoms: Embedding API costs exceeding budget, storage growing rapidly

Solutions:

  • Batch embedding generation
  • Implement smart caching strategies
  • Use smaller embedding models for non-critical data
  • Compress embeddings with quantization
  • Archive old embeddings to cold storage
🔒 Issue: GDPR compliance concerns

Symptoms: User data in embeddings, deletion requests, audit requirements

Solutions:

  • Implement row-level security (RLS)
  • Separate PII from embeddings
  • Create deletion workflows with cascading
  • Maintain comprehensive audit logs
  • Use European region deployments

Ready to Build Your RAG System?

With over 500 successful RAG implementations, Echo Algori Data is Norway's leading expert in Supabase-powered AI systems. Our team can help you design, implement, and optimize a production-ready RAG solution tailored to your specific needs.

Additional Resources

EA

Echo Algori Data Team

Norway's leading AI consultancy specializing in Supabase, RAG systems, and automation with n8n & Make. 500+ successful implementations across Nordic enterprises.