Neural Nets

Welcome! I'm Rahul. I write detailed deep dives about LLM architectures, inference optimization, and GPU programming, bridging the gap between ML research and systems engineering.

Pinned

Latest Posts

Projects

LLM Distillation

Quintus

Quintus is a compact 1.7B assistant model that improves over Qwen3-1.7B-Instruct on key reasoning and commonsense benchmarks, with gains of +4.5pp on GSM8K flexible, +5.4pp on ARC-Challenge acc_norm, +5.4pp on WinoGrande, +6.6pp on MBPP, and +3.5pp on PIQA acc_norm. The model was built by distilling Qwen3-8B into Qwen3-1.7B through a two-stage pipeline that combines full-vocabulary online KL divergence, sequence packing, a token-chunked KD loss, targeted SFT, strict evaluation controls, and public weight-audit tooling.
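
To make the "full-vocabulary KL" and "token-chunked KD loss" ideas concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the Quintus training code: the forward KL direction (teacher || student), the `temperature` parameter, and the function names are all hypothetical choices made for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the full vocabulary for a single token.

    Assumption: forward KL with an optional temperature; the real pipeline
    may use a different direction or weighting.
    """
    t = log_softmax([x / temperature for x in teacher_logits])
    s = log_softmax([x / temperature for x in student_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def chunked_kd_loss(teacher_rows, student_rows, chunk_size=2):
    """Mean per-token KL, accumulated one chunk of tokens at a time.

    Chunking only bounds how many tokens are processed at once (and so the
    peak memory of the full-vocabulary tensors); the result is the same
    as the unchunked mean.
    """
    total, count = 0.0, 0
    for start in range(0, len(teacher_rows), chunk_size):
        t_chunk = teacher_rows[start:start + chunk_size]
        s_chunk = student_rows[start:start + chunk_size]
        total += sum(token_kl(t, s) for t, s in zip(t_chunk, s_chunk))
        count += len(t_chunk)
    return total / count
```

The loss is zero when student and teacher logits match and does not depend on the chunk size, which is the property that makes chunking a pure memory optimization.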

ML Systems

Keiro

Keiro retrofits a Sparse Mixture-of-Experts architecture into Qwen2.5-3B. A Top-2 dynamic router activates 2 of 8 LoRA experts per transformer block, expanding effective capacity while keeping active compute identical to the dense baseline. The residual design — frozen FFN + routed Rank-16 LoRA adapters — adds only 19.46M trainable parameters (0.63% of total) while retaining 95.4% of the base model's GSM8K mathematical reasoning capability. Engineering challenges included resolving a CUDA race condition in index_add_ with duplicate Top-K indices, a BFloat16 cumsum upcast mismatch in the coalesce path, and a 4.7× autoregressive inference bottleneck diagnosed and addressed by bypassing capacity buffers for single-token generation. Benchmarked via EleutherAI lm-evaluation-harness (HellaSwag −0.13%, ARC-Challenge −0.17%, GSM8K −3.19%).
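
The routing step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Keiro's implementation: renormalizing with a softmax over only the two selected logits is an assumption, `expert_deltas` stands in for precomputed LoRA outputs (a real layer would only run the selected experts), and all names are hypothetical.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts for one token.

    Returns [(expert_index, weight), (expert_index, weight)], where the
    weights are a softmax over the two selected logits (an assumption).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_lora_output(ffn_out, expert_deltas, router_logits):
    """Residual design: frozen FFN output plus the weighted LoRA deltas
    of the two routed experts."""
    out = list(ffn_out)
    for idx, weight in top2_route(router_logits):
        for d in range(len(out)):
            out[d] += weight * expert_deltas[idx][d]
    return out
```

With 8 router logits per token, only two experts contribute to the output, and their weights always sum to 1.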

ML Systems

Prolepsis

Prolepsis implements speculative decoding for LLM inference: a small Qwen 1.7B draft model races ahead, predicting future tokens that a larger Qwen 8B target then verifies in a single parallel pass. The result is 1.30× faster generation on an A100 at a ~56.5% acceptance rate across mixed-domain prompts, with a rejection sampling pipeline that mathematically guarantees the output distribution stays identical to the target model's.
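
The distribution guarantee comes from the standard speculative sampling accept/reject rule, sketched below for a single token. This is a toy version under assumptions, not the Prolepsis code: `p` and `q` are hypothetical probability lists for the target and draft models at one position, and the function names are made up for the example.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(diff)
    # If p == q the residual is empty, but rejection never happens then;
    # fall back to p so the function is always well defined.
    return [d / z for d in diff] if z > 0 else list(p)

def verify_draft_token(token, p, q, rng):
    """Accept or replace one draft token.

    token: index sampled from the draft distribution q (so q[token] > 0)
    p, q:  target and draft probabilities over the same vocabulary
    Returns (final_token, accepted).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    resampled = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return resampled, False
```

Drawing many draft tokens from `q` and passing each through `verify_draft_token` produces final tokens distributed according to `p`, which is why the speedup does not change what the target model would have generated.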

ML Systems

FlashTile

FlashTile is a reference PyTorch implementation of Flash Attention (V1/V2) and KV-cache-efficient inference variants (GQA/MQA). It uses block-wise tiling, online softmax, and recomputation-based backward passes to reduce attention-score storage from O(N²) to O(N). The project includes benchmark and validation artifacts from A100 testing, an archived H100 cross-check, and a forward-only Triton kernel for performance comparison.
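
The online softmax trick is what lets attention be computed block by block without ever storing the full score row. Below is a dependency-free sketch for a single query vector; it is illustrative only (FlashTile itself is written in PyTorch), and the function names and `block_size` default are assumptions.

```python
import math

def attention_online(q, keys, values, block_size=2):
    """Attention output for one query, streaming over key/value blocks.

    Only a running max, a running normalizer, and an output accumulator
    are kept, so extra memory does not grow with the number of keys.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale previous partial results to the new max
        # (exp(-inf) is 0.0, which handles the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

def attention_reference(q, keys, values):
    """Plain softmax attention that materializes every score at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
```

Both functions return the same output up to floating-point error, regardless of the block size used for streaming.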

Systems Engineering

Substrata9: Linux Process Introspection Toolkit

Substrata9 is a lightweight, pure-Bash toolkit for deep Linux process introspection. Built entirely without compilation or external dependencies, it mines raw /proc filesystem data to surface memory maps, file descriptors, process hierarchies, and runtime anomalies. Its modular CLI utilities emit JSON output for automation, slotting cleanly into observability, debugging, and forensics workflows.
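
To show the kind of /proc mining involved, here is a short Python sketch that turns one line of /proc/<pid>/maps into a JSON-friendly record. It is not part of Substrata9 (which is pure Bash); the function names and output keys are hypothetical, and only the documented maps field order (address range, permissions, offset, device, inode, optional path) is relied on.

```python
import json

def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line into a dict.

    Fields: address range, perms, offset, dev, inode, optional pathname.
    The pathname may contain spaces, so at most 5 splits are made.
    """
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "size": end - start,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": path,
    }

def read_maps(pid="self"):
    """Read and parse all mappings of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as fh:
        return [parse_maps_line(line) for line in fh if line.strip()]

def maps_to_json(records):
    """Serialize parsed mappings for downstream tooling."""
    return json.dumps(records, indent=2)
```

On a Linux machine, `print(maps_to_json(read_maps()))` dumps the memory map of the current Python process.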

Archived

AI & RAG

Mission Cipher

Created a Graph Retrieval-Augmented Generation (GraphRAG) web app that answers Mission: Impossible questions with context-aware, generative responses. The system enhances traditional RAG by combining cosine-similarity search on semantic embeddings with a dynamically constructed knowledge graph, enabling deeper contextual understanding. A Flask backend builds and queries the graph using NetworkX, while a language model generates responses based on rich subgraph context. The application runs under Gunicorn (WSGI) and is fronted by NGINX as a reverse proxy, with communication handled via a Unix socket for secure, low-latency performance. Hosted on multi-zone Google Compute Engine, the service leverages GCE's 99.99% uptime SLA, with tightly scoped ingress rules for high-performance, secure access.
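
The retrieval idea, similarity search to find seed nodes followed by graph expansion for context, can be sketched without any dependencies. This is a toy illustration, not the Mission Cipher code: the app uses NetworkX, whereas here a plain adjacency dict stands in for the graph, and `retrieve_subgraph`, `top_k`, and `hops` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_subgraph(query_vec, embeddings, graph, top_k=2, hops=1):
    """Pick the top_k most similar nodes, then expand along graph edges.

    embeddings: {node: vector}
    graph:      {node: iterable of neighbor nodes}
    Returns (seed_nodes, all_selected_nodes).
    """
    ranked = sorted(embeddings,
                    key=lambda n: cosine(query_vec, embeddings[n]),
                    reverse=True)
    seeds = ranked[:top_k]
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {nbr for n in frontier for nbr in graph.get(n, ())} - selected
        selected |= frontier
    return seeds, selected
```

The selected nodes would then be rendered into the prompt as context, so the language model sees related entities that pure embedding search alone would have missed.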

Security & Analytics

CloudNet Analytics

Designed and deployed a secure, real-time log-analytics platform on Google Cloud that ingests, processes, and visualizes network logs end to end. Architected a custom Virtual Private Cloud (VPC) sliced into three /24 subnets—x.y.1.0/24 (web), x.y.2.0/24 (application), and x.y.3.0/24 (processor)—each pinned to dedicated Compute Engine VMs to enforce zero-trust micro-segmentation and east-west isolation. Granular, stateful firewall rules admit traffic only from whitelisted IP prefixes and service accounts. Logs are encrypted in flight over SSH, transformed with Python, staged in Cloud Storage, and streamed through Pub/Sub to invoke Cloud Functions (1st gen.) that load structured data into BigQuery. A hardened Flask API—exposed via HTTPS and IAM-based authentication—delivers controlled, low-latency access to analytics, providing scalable, compliant, and high-performance troubleshooting insights.

Education

Project Graphil

Built an interactive visual learning platform (Graphil) using React to simplify complex technical topics—including Linux, GCP, networking, Python, and AI—through modular, pre-rendered visualizations. The intuitive UI/UX design enables self-guided exploration of technical subjects, enhancing comprehension for visual learners. The platform is open-source, encouraging community collaboration and extensibility.

Compliance

Compliance Guide

Compiled a comprehensive compliance framework for a fictional grocery delivery startup, covering 15+ critical domains, including Cybersecurity/CyberSecOps, Data Privacy (GDPR, CCPA), PCI DSS, IT Best Practices, and Legal & Operational Standards. This proactive resource demonstrates the potential to streamline onboarding and reduce initial legal/compliance research overhead by an estimated 5–10%.