
Machine Learning & Deep Learning Portfolio

Context: Delivered while working Full-Time (ARGO DATA) + completing MS CS (UT Austin)

This portfolio showcases my expertise in machine learning and deep learning, from foundational research to production-scale systems serving hundreds of users.

🧠 2025: RL for Mathematical Reasoning (Gemma-3 Fine-tuning)

Signal: RLHF / Post-Training / Foundation Models

🔗 Model on Hugging Face

GRPO Training Pipeline

🎯 Motivation & Research Challenge

Mathematical reasoning in small language models (270M parameters) presents unique challenges: limited capacity for complex reasoning, a tendency toward verbose explanations, and difficulty learning structured problem-solving approaches. The goal was to implement Group Relative Policy Optimization (GRPO) from scratch to enhance chain-of-thought reasoning while maintaining computational efficiency.

🔧 GRPO Implementation from Scratch

Algorithm Innovation

  • Group Relative Optimization: Custom implementation using TRL 0.21.0 framework
  • Policy Architecture: Gemma-3-270M with specialized mathematical reasoning head
  • Reference Model: Frozen SFT checkpoint for KL divergence regularization
  • Reward Engineering: Multi-component reward function with efficiency penalties

Training Infrastructure

  • Memory Optimization: 15GB VRAM constraint (down from 25GB+ baseline)
  • Precision: BF16 mixed precision for ~2x memory reduction versus FP32
  • Batch Strategy: Gradient accumulation (4-8 steps) + dynamic batching
  • Distributed Setup: Multi-GPU training with gradient synchronization
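The "group relative" part of GRPO can be shown in isolation: for each prompt, a group of completions is sampled, and each completion's advantage is its reward normalized by the group's mean and standard deviation, replacing a learned value baseline. A minimal stdlib sketch (function name is illustrative, not from the actual pipeline):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its sampling group.

    GRPO uses the group statistics as the baseline:
    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary correctness reward, correct completions in a mixed group get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the correct group members without ever training a critic.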

⚡ Advanced Reward Engineering

Custom Reward Function Design

R = R_correct − λ · (tokens in &lt;think&gt;…&lt;/think&gt;) − β · KL(π_θ ‖ π_ref)

  • Correctness Component:
    • SymPy-based mathematical equivalence checking (handles algebraic simplification)
    • Numerical tolerance for floating-point comparisons (±1e-6)
    • Multi-format answer parsing (fractions, decimals, expressions)
    • Binary reward: +1.0 for correct, 0.0 for incorrect
  • Efficiency Penalty:
    • Token count penalty: -λ × (tokens in <think>...</think>)
    • Encourages concise reasoning while maintaining accuracy
    • Hyperparameter λ = 0.01 (tuned empirically)
  • KL Regularization:
    • Prevents policy drift from SFT reference model
    • β = 0.02 (KL coefficient) for stable training
    • Computed per-token for fine-grained control
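The correctness and efficiency components above can be sketched as plain Python. This is a simplified stand-in: the real pipeline uses SymPy equivalence checking and multi-format parsing, whereas this sketch parses a single float with the ±1e-6 tolerance; the KL term is omitted because it is applied per token inside the trainer, not in the reward function. All names here are illustrative:

```python
import math
import re

LAMBDA = 0.01   # efficiency penalty weight (value from the text)
TOL = 1e-6      # numerical tolerance (value from the text)

def correctness_reward(predicted: str, target: float) -> float:
    """Binary reward: +1.0 if the parsed answer matches within tolerance."""
    try:
        return 1.0 if math.isclose(float(predicted), target, abs_tol=TOL) else 0.0
    except ValueError:
        return 0.0

def think_token_penalty(completion: str) -> float:
    """-lambda * (whitespace-token count inside <think>...</think>)."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if not m:
        return 0.0
    return -LAMBDA * len(m.group(1).split())

def total_reward(completion: str, answer: str, target: float) -> float:
    return correctness_reward(answer, target) + think_token_penalty(completion)
```

A completion with a four-token reasoning trace and a correct answer thus scores 1.0 − 4 × 0.01 = 0.96, so shorter correct traces strictly dominate longer ones.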

🔬 Memory & Compute Optimizations

Memory Engineering

  • Gradient Accumulation: 4-8 steps to simulate larger batch sizes
  • Activation Checkpointing: Trade compute for memory (30% reduction)
  • Dynamic Padding: Variable sequence lengths to minimize waste
  • Model Sharding: Distribute model weights across GPUs
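Gradient accumulation works because averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient exactly, so a large effective batch fits in a fixed memory budget. A dependency-free demonstration on a 1-D least-squares model (the real training uses PyTorch autograd, of course):

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y ~ w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    grads = []
    for i in range(0, len(xs), micro_batch):
        grads.append(grad_mse(w, xs[i:i + micro_batch], ys[i:i + micro_batch]))
    return sum(grads) / len(grads)
```

The equality is exact when micro-batches have equal size; with a ragged final micro-batch the average is a close approximation.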

Training Optimizations

  • Learning Rate Schedule: Cosine annealing with warmup
  • Attention Implementation: "eager" mode for compatibility
  • Checkpoint Strategy: Save every N steps, keep last 3
  • Monitoring: Real-time reward tracking and KL divergence
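The learning-rate schedule above is a standard warmup-then-cosine shape; a compact reference implementation (the exact warmup fraction used in the run is not specified in the text, so the parameters here are placeholders):

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large early updates while the policy is still close to the SFT checkpoint; the cosine tail shrinks step sizes as the reward plateaus, which helps keep the KL term stable late in training.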

📊 Training Results & Analysis

Achieved Performance Metrics

  • Training Steps: ~1200 GRPO iterations with stable convergence
  • Memory Usage: Reduced from 25GB+ to 15GB VRAM (40% improvement)
  • Reward Convergence: Steady improvement in mathematical accuracy
  • Efficiency Gains: 25% reduction in reasoning verbosity while maintaining correctness
  • KL Stability: Maintained <0.1 KL divergence from reference model

🔬 Technical Innovations

  • GRPO from Scratch: Complete implementation of group relative policy optimization algorithm
  • Mathematical Reward Design: Novel approach combining correctness, efficiency, and stability
  • Memory-Efficient Training: Techniques for training large models on constrained hardware
  • Structured Reasoning: <think>...</think> format for interpretable mathematical reasoning

🏭 2023-Present: Production RAG System (ARGO DATA)

Signal: Production Scale / Latency Engineering / System Design

Self-Healing RAG Architecture

🎯 Motivation & Production Challenge

Building a production RAG system that serves 200+ concurrent users with sub-100ms latency requires solving several hard problems at once: real-time index updates, content change detection, embedding consistency, and fault tolerance. The system needed to be "self-healing": able to adapt to content changes automatically, without manual intervention.

🔧 Self-Healing Architecture Design

Change Detection Pipeline

  • File System Watchers: Real-time monitoring of wiki content directories
  • Content Hash Tracking: SHA-256 hashing for detecting granular changes
  • Semantic Diff Analysis: NLP-based change significance scoring
  • Batch Processing: Intelligent grouping of related changes
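The content-hash step of the change-detection pipeline can be sketched directly with stdlib `hashlib`: store a SHA-256 digest per document, then diff stored digests against current content to classify changes. Function names are illustrative:

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 digest of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(old_hashes: dict, documents: dict) -> dict:
    """Compare stored hashes against current content.

    Returns doc_id -> "added" | "modified" | "deleted"; unchanged
    documents are omitted so downstream re-indexing stays incremental.
    """
    changes = {}
    for doc_id, text in documents.items():
        if doc_id not in old_hashes:
            changes[doc_id] = "added"
        elif old_hashes[doc_id] != content_hash(text):
            changes[doc_id] = "modified"
    for doc_id in old_hashes:
        if doc_id not in documents:
            changes[doc_id] = "deleted"
    return changes
```

In the full pipeline this sits downstream of the file-system watchers and upstream of the semantic-diff scoring, so only documents flagged here incur the (much more expensive) NLP analysis.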

Intelligent Re-indexing

  • Incremental Updates: Only re-process changed content sections
  • Dependency Tracking: Update related documents automatically
  • Zero-Downtime Deployment: Blue-green indexing strategy
  • Rollback Capability: Automatic reversion on quality degradation
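The blue-green strategy above reduces, at its core, to building the replacement index off to the side and publishing it with a single atomic reference swap, keeping the old index around for rollback. A minimal in-memory sketch (class and method names are hypothetical):

```python
class BlueGreenIndex:
    """Serve queries from a live index while a replacement is built.

    Publishing is one reference assignment, so readers never observe a
    half-built index; the previous index is retained for rollback.
    """
    def __init__(self, initial_index):
        self.live = initial_index
        self.previous = None

    def query(self, key):
        return self.live.get(key)

    def swap(self, new_index):
        self.previous, self.live = self.live, new_index

    def rollback(self):
        if self.previous is not None:
            self.live, self.previous = self.previous, self.live
```

In production the same idea applies to vector-index namespaces rather than Python dicts: queries route to the "green" namespace only after it passes quality checks, and a quality regression triggers `rollback` automatically.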

⚡ Production-Scale Optimizations

  • Embedding Pipeline Optimization:
    • Batch Processing: Process 100+ documents simultaneously
    • GPU Acceleration: CUDA-optimized embedding generation
    • Caching Strategy: Redis-based embedding cache with TTL
    • Async Processing: Non-blocking embedding updates
  • Vector Database Engineering:
    • Pinecone Optimization: Custom indexing strategy for 1M+ vectors
    • Sharding Strategy: Namespace-based data partitioning
    • Query Optimization: Metadata filtering for faster retrieval
    • Connection Pooling: Persistent connections for reduced latency
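The embedding cache can be illustrated with a tiny in-process TTL cache; the production system uses Redis with per-key TTLs, but the eviction semantics are the same. The injectable clock exists only to make expiry testable:

```python
import time

class TTLCache:
    """In-process stand-in for a Redis embedding cache with TTL."""
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if self.clock() >= expires:
            del self._store[key]   # lazy eviction on read
            return default
        return value
```

Caching embeddings keyed by content hash means a document re-saved without changes never touches the GPU, while the TTL bounds staleness if an upstream hash collision or missed invalidation ever occurs.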

🔬 Latency Engineering Deep Dive

Sub-100ms P50 Latency Breakdown

  • Query Embedding: <10ms (cached model inference)
  • Vector Search: <30ms (Pinecone optimized queries)
  • Context Assembly: <15ms (parallel document retrieval)
  • LLM Generation: <40ms (streaming response initiation)
  • Total P50: 95ms median end-to-end response time

🏗️ Infrastructure & Reliability

FastAPI Backend

  • Async Architecture: Handles 200+ concurrent connections
  • Connection Pooling: Persistent database connections
  • Rate Limiting: Per-user and global rate limits
  • Health Checks: Continuous system monitoring
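Per-user rate limiting is typically implemented as a token bucket: each user gets a burst capacity that refills at a steady rate. A minimal sketch (the production limiter also enforces a global bucket; time is passed in explicitly for testability):

```python
class TokenBucket:
    """Per-user rate limiter: `capacity` is the burst size,
    `refill_rate` is tokens added per second."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Keeping one bucket per user lets a single heavy client be throttled without affecting the other 200+ concurrent sessions.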

Monitoring & Observability

  • Prometheus Metrics: Custom metrics for RAG performance
  • Grafana Dashboards: Real-time system visualization
  • Alert System: Automated incident response
  • Performance Tracking: Query latency and accuracy metrics

📊 Production Performance Metrics

Achieved Scale & Performance

  • Concurrent Users: 200+ simultaneous active sessions
  • Response Latency: 95ms P50, 150ms P95
  • Throughput: 1000+ queries per minute sustained
  • Uptime: 99.7% availability (production SLA)
  • Index Updates: Real-time processing of content changes
  • Memory Efficiency: 8GB RAM for full system operation

🔬 Technical Innovations

  • Self-Healing Index: Automatic content change detection and re-indexing
  • Latency Optimization: Multi-level caching and async processing
  • Production Reliability: Circuit breakers, health checks, and auto-recovery
  • Scalable Architecture: Horizontal scaling with load balancing

👁️ 2023-Present: ID Verification Vision System (ARGO DATA)

Signal: Computer Vision / Cloud Cost Optimization / Production ML

Challenges Solved

Cut vendor API costs by ~60% by replacing third-party APIs with in-house PyTorch models deployed on Azure.

Technical Depth

  • Architecture: "Hydra-Net" architecture (3 heads, 1 backbone) for multi-task learning
  • Optimization: 50% memory usage reduction enabling efficient autoscaling
  • Cost Engineering: Significant cost reduction while maintaining accuracy
  • Cloud Deployment: Azure-based inference with auto-scaling capabilities
  • Model Design: Multi-head architecture for document verification tasks
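The cost and memory savings of the "Hydra-Net" layout come from evaluating the expensive shared backbone once per document image and fanning its features out to all task heads. The structure (not the actual PyTorch model, whose layers are not described in the text) can be sketched with plain callables:

```python
def hydra_forward(image, backbone, heads):
    """Run the shared backbone once and fan out to every task head.

    Compared with three independent single-task models, the (costly)
    backbone is evaluated a single time per input.
    """
    features = backbone(image)
    return {name: head(features) for name, head in heads.items()}
```

A toy instantiation shows the backbone really runs once regardless of how many heads consume its features; head names below are hypothetical examples of document-verification tasks:

```python
calls = []
backbone = lambda img: (calls.append(1), sum(img))[1]  # count invocations
heads = {
    "authenticity": lambda f: f > 10,
    "face_match": lambda f: f * 0.5,
    "text_fields": lambda f: str(f),
}
out = hydra_forward([3, 4, 5], backbone, heads)
```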

🛰️ 2022: Hyperspectral Super-Resolution (ISRO Research)

Signal: HPC / CUDA Optimization / Research Rigor

🔗 Research Publication

Challenges Solved

Accelerated training 30x on 1TB+ satellite datasets via custom CUDA-level optimizations for hyperspectral image processing.

Technical Depth

  • Performance: 30x training acceleration through CUDA optimization
  • Scale: Handling 1TB+ satellite datasets efficiently
  • Architecture: SR-GAN pipeline for 4x resolution upscaling (20m → 5m)
  • Infrastructure: Multi-GPU cluster optimization and distributed training
  • Research Impact: Published research with ISRO collaboration

🚦 2022: TrafficSwarm (Multi-Agent Reinforcement Learning)

Signal: Reinforcement Learning / Multi-Agent Systems / Distributed AI

🔗 GitHub Repository

Challenges Solved

Solved decentralized coordination for traffic grid optimization using shared-context reinforcement learning.

Technical Depth

  • Multi-Agent RL: Coordinated behavior across multiple autonomous agents
  • Algorithm Development: Custom policy optimization algorithms using RLlib and PyTorch
  • Distributed Systems: Decentralized coordination without central control
  • Simulation: Complex traffic simulation environments with realistic constraints
  • Performance: Improved traffic flow efficiency through learned coordination
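The "shared-context" idea can be made concrete: each intersection agent observes its own queues plus its neighbors' queues, so coordination emerges without a central controller. The sketch below pairs that observation with a deliberately simple greedy policy as a stand-in for the learned RLlib policy; all names and the queue representation are illustrative:

```python
def shared_context_obs(agent_id, queues, neighbors):
    """Observation = own queue lengths plus neighboring agents' queues."""
    return {
        "self": queues[agent_id],
        "neighbors": {n: queues[n] for n in neighbors[agent_id]},
    }

def greedy_phase(obs):
    """Toy policy: serve the direction with the longest local queue.
    (A stand-in for the trained policy, not the project's algorithm.)"""
    q = obs["self"]
    return max(q, key=q.get)
```

In the learned setting, the neighbor portion of the observation is what lets an agent anticipate inbound platoons from adjacent intersections instead of reacting only to its own queues.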

🔬 Research & Innovation Highlights

Core ML/DL Expertise

  • Foundation Models: RLHF, post-training optimization, and fine-tuning at scale
  • Computer Vision: Multi-task learning, super-resolution, document processing
  • Reinforcement Learning: Policy optimization, multi-agent systems, reward engineering
  • Production ML: Latency optimization, cost engineering, scalable deployment

Technical Specializations

  • Optimization: CUDA programming, memory optimization, distributed training
  • Architecture Design: Multi-head networks, attention mechanisms, efficient inference
  • MLOps: Production deployment, monitoring, auto-scaling, cost optimization
  • Research: Academic collaboration, publication-quality research, novel algorithm development

Technology Stack

  • Frameworks: PyTorch, Transformers, RLlib, LangChain
  • Infrastructure: Azure ML, CUDA, Multi-GPU clusters, Docker
  • Databases: Pinecone, Vector databases, Real-time indexing
  • APIs: FastAPI, Production serving, Auto-scaling systems

📊 Impact Metrics

  • Production Scale: 200+ concurrent users with sub-100ms latency
  • Cost Optimization: 60% reduction in vendor API costs
  • Performance: 30x training acceleration through optimization
  • Research: Published work with government research organization (ISRO)
  • Open Source: Multiple repositories with community adoption

This portfolio demonstrates deep expertise across the full ML/DL lifecycle, from research and algorithm development to production deployment and optimization.