Muskula Rahul • Neural Nets

Welcome! I'm Rahul. I write detailed deep dives about LLM architectures, inference optimization, and GPU programming, bridging the gap between ML research and systems engineering.

Reach Out

Pinned

Mixture of Experts - Mathematical Foundations and Scaling
7 Nov 2025

Posts

Model Compression via Knowledge Distillation
6 Mar 2026
Low-Rank Adaptation (LoRA)
10 Dec 2025
The Equations That Changed The World
10 Sep 2025
DNS Architecture And DNS Records
26 Jul 2025
Activation And Loss Functions In Deep Learning
15 Jun 2025
AI and the Art of Subtle Control
29 Apr 2025

Projects

Interpretability

L31H14+

A mechanistic interpretability study of factual recall in Qwen3.5-4B. The project identifies a reproducible attention-head cluster anchored at layer 31, head 14, then tests it with discovery probes, targeted ablations, source tracing, and a 14.mid sparse autoencoder. Ablating the L31H14+ cluster reduces the clean-vs-corrupted answer margin by ~57% across 116 contrastive factual-recall prompts, with published code, configs, SAE checkpoint, and interpretation artifacts.

Model

LLM Distillation

Quintus

Quintus is a compact 1.7B assistant model that improves over Qwen3-1.7B-Instruct on key reasoning and commonsense benchmarks: +4.5pp on GSM8K flexible, +5.4pp on ARC-Challenge acc_norm, +5.4pp on WinoGrande, +6.6pp on MBPP, and +3.5pp on PIQA acc_norm. The model was built by distilling Qwen3-8B into Qwen3-1.7B through a two-stage pipeline that combines full-vocabulary online KL divergence, sequence packing, token-chunked KD loss, targeted SFT, strict evaluation controls, and public weight-audit tooling.
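As a rough illustration of what a full-vocabulary, token-chunked KD loss can look like, here is a minimal stdlib-only sketch. It assumes forward KL(teacher || student) with no temperature, which the description above does not specify, and the function names and toy logits are hypothetical rather than taken from the Quintus code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """Forward KL(teacher || student) over the full vocabulary for one token.

    Assumption: the KL direction is a choice made for this sketch.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def chunked_kd_loss(teacher_logits, student_logits, chunk_size=2):
    """Mean per-token KL, accumulated chunk by chunk along the token axis.

    Chunking limits how many tokens' full-vocabulary distributions are
    processed at once; the result matches the unchunked mean.
    """
    n_tokens = len(teacher_logits)
    total = 0.0
    for start in range(0, n_tokens, chunk_size):
        t_chunk = teacher_logits[start:start + chunk_size]
        s_chunk = student_logits[start:start + chunk_size]
        total += sum(token_kl(t, s) for t, s in zip(t_chunk, s_chunk))
    return total / n_tokens

# Toy data: 3 tokens, vocabulary of 4 (hypothetical values).
teacher = [[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5], [1.0, 1.0, 1.0, 1.0]]
student = [[1.5, 0.7, -0.8, 0.1], [0.0, 0.0, 2.0, 0.0], [0.5, 1.5, 1.0, 1.0]]
loss = chunked_kd_loss(teacher, student, chunk_size=2)
print(f"KD loss: {loss:.4f}")
```

In a real pipeline the same accumulation would run on GPU tensors, where chunking keeps the teacher and student logit buffers from being materialized for the whole packed sequence at once.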

Model

Notebook

ML Systems

Keiro

Keiro retrofits a Sparse Mixture-of-Experts architecture into Qwen2.5-3B. A Top-2 dynamic router activates 2 of 8 LoRA experts per transformer block, expanding effective capacity while keeping active compute identical to the dense baseline. The residual design — frozen FFN + routed Rank-16 LoRA adapters — adds only 19.46M trainable parameters (0.63% of total) while retaining 95.4% of the base model's GSM8K mathematical reasoning capability. Engineering challenges included resolving a CUDA race condition in index_add_ with duplicate Top-K indices, a BFloat16 cumsum upcast mismatch in the coalesce path, and a 4.7× autoregressive inference bottleneck diagnosed and addressed by bypassing capacity buffers for single-token generation. Benchmarked via EleutherAI lm-evaluation-harness (HellaSwag −0.13%, ARC-Challenge −0.17%, GSM8K −3.19%).
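The routing idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Keiro's implementation: the dimensions are tiny (rank 2 instead of 16), the gate weights are a softmax over the two selected logits only, and every name here (`top2_route`, `moe_lora_forward`, the random matrices) is hypothetical.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def top2_route(router_logits):
    """Pick the two highest-scoring experts and renormalize their weights.

    Assumption: softmax over the selected pair only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:2]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_lora_forward(x, frozen_ffn, router, experts, scale=1.0):
    """Frozen base output plus the weighted low-rank updates of 2 experts."""
    out = matvec(frozen_ffn, x)             # frozen path, never trained
    logits = matvec(router, x)              # one logit per expert
    for idx, weight in top2_route(logits):
        a, b = experts[idx]                 # a: rank x d, b: d x rank
        delta = matvec(b, matvec(a, x))     # low-rank update B(Ax)
        out = [o + scale * weight * d for o, d in zip(out, delta)]
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, rank, n_experts = 4, 2, 8
frozen_ffn = rand_matrix(d_model, d_model, rng)
router = rand_matrix(n_experts, d_model, rng)
experts = [(rand_matrix(rank, d_model, rng), rand_matrix(d_model, rank, rng))
           for _ in range(n_experts)]
x = [0.5, -1.0, 0.25, 2.0]
y = moe_lora_forward(x, frozen_ffn, router, experts)
print(y)
```

Only the two selected adapters run for each token, so the per-token cost is the frozen FFN plus two low-rank products, regardless of how many experts exist.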

Model

Notebook

ML Systems

Prolepsis

Prolepsis explores speculative decoding with a Qwen3-4B draft model and a Qwen3-32B target. On an RTX PRO 6000 at batch size 1, the vLLM FP8 path reached 65.40 tok/s against 37.94 tok/s target-only, a 1.724x speedup. The custom Hugging Face BF16 path reached 1.319x with 64.18% acceptance. Full prompts, responses, metrics, and limitations are included.
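For readers new to the technique, the draft-then-verify loop can be illustrated with toy deterministic "models". This sketch assumes greedy decoding and omits the bonus token a full implementation emits when every draft token is accepted; the functions `target_next`, `draft_next`, and `speculative` are hypothetical stand-ins, not the Prolepsis code or the vLLM API.

```python
def target_next(seq):
    """Toy target model: a deterministic next token from the context."""
    return (sum(seq) * 31 + len(seq)) % 50

def draft_next(seq):
    """Toy draft model: agrees with the target except on some contexts."""
    tok = target_next(seq)
    return tok if sum(seq) % 3 else (tok + 1) % 50

def target_only(prompt, n_new):
    """Baseline: plain greedy decoding with the target model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative(prompt, n_new, k=4):
    """Greedy speculative decoding: draft k tokens, keep the agreeing prefix."""
    seq = list(prompt)
    limit = len(prompt) + n_new
    proposed = accepted = 0
    while len(seq) < limit:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        proposed += k
        # 2. The target checks each position. A real system scores all k
        #    positions in one batched forward pass; here it is a loop.
        for tok in draft:
            if len(seq) >= limit:
                break
            expected = target_next(seq)
            if tok == expected:
                seq.append(tok)          # draft token accepted
                accepted += 1
            else:
                seq.append(expected)     # target's correction ends the round
                break
    return seq, accepted / proposed

prompt = [3, 7, 11]
spec_seq, acceptance = speculative(prompt, n_new=40, k=4)
base_seq = target_only(prompt, n_new=40)
print(spec_seq == base_seq, f"acceptance={acceptance:.2%}")

# Speedup from the throughput figures quoted above.
speedup = 65.40 / 37.94
print(f"speedup={speedup:.3f}x")
```

The property worth noticing is that the speculative output is identical to target-only greedy decoding; the draft model changes how many target passes are needed, not which tokens are produced.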

Code

ML Systems

FlashTile

FlashTile is a reference PyTorch implementation of Flash Attention (V1/V2) and KV-cache-efficient inference variants (GQA/MQA). It uses block-wise tiling, online softmax, and recomputation-based backward passes to reduce attention-score storage from O(N²) to O(N). The project includes benchmark and validation artifacts from A100 testing, an archived H100 cross-check, and a forward-only Triton kernel for performance comparison.
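The online-softmax trick behind block-wise tiling can be shown for a single query in plain Python. This is a minimal sketch, not FlashTile's PyTorch or Triton code: the names `naive_attention` and `tiled_attention` and the toy inputs are hypothetical, and only the forward pass is covered.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax and weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Process keys/values block by block with a running max and denominator.

    Only one block of scores exists at a time, so score storage does not
    grow with the sequence length.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.3, -1.2, 0.8]
keys = [[1.0, 0.0, 2.0], [0.5, -0.5, 0.1], [3.0, 1.0, -1.0],
        [0.0, 0.2, 0.4], [-2.0, 1.5, 0.7]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.5, 2.5], [-1.0, 3.0]]
reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block=2)
print(reference, tiled)
```

The tiled result matches the naive one up to floating-point rounding for any block size, which is the invariant the validation artifacts of a real implementation would check at scale.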

Code

Systems Engineering

Substrata9: Linux Process Introspection Toolkit

Substrata9 is a lightweight, pure-Bash toolkit for deep Linux process introspection. Built entirely without compilation or external dependencies, it mines raw /proc filesystem data to surface memory maps, file descriptors, process hierarchies, and runtime anomalies. Its modular CLI utilities emit JSON output for automation, slotting cleanly into observability, debugging, and forensics workflows.

Code

Archived

AI & RAG

Mission Cipher

Created a Graph Retrieval-Augmented Generation (GraphRAG) web app that answers Mission: Impossible questions with context-aware, generative responses. The system enhances traditional RAG by combining cosine-similarity search on semantic embeddings with a dynamically constructed knowledge graph, enabling deeper contextual understanding. A Flask backend builds and queries the graph using NetworkX, while a language model generates responses based on rich subgraph context. The application runs under Gunicorn (WSGI) and is fronted by NGINX as a reverse proxy, with communication handled via a Unix socket for secure, low-latency performance. Hosted on multi-zone Google Compute Engine, the service leverages GCE's 99.99% uptime SLA, with tightly scoped ingress rules for high-performance, secure access.

Code Demo

Security & Analytics

CloudNet Analytics

Designed and deployed a secure, real-time log-analytics platform on Google Cloud that ingests, processes, and visualizes network logs end to end. Architected a custom Virtual Private Cloud (VPC) sliced into three /24 subnets—x.y.1.0/24 (web), x.y.2.0/24 (application), and x.y.3.0/24 (processor)—each pinned to dedicated Compute Engine VMs to enforce zero-trust micro-segmentation and east-west isolation. Granular, stateful firewall rules admit traffic only from whitelisted IP prefixes and service accounts. Logs are encrypted in flight over SSH, transformed with Python, staged in Cloud Storage, and streamed through Pub/Sub to invoke Cloud Functions (1st gen.) that load structured data into BigQuery. A hardened Flask API—exposed via HTTPS and IAM-based authentication—delivers controlled, low-latency access to analytics, providing scalable, compliant, and high-performance troubleshooting insights.

Code

Education

Project Graphil

Built an interactive visual learning platform (Graphil) using React to simplify complex technical topics—including Linux, GCP, networking, Python, and AI—through modular, pre-rendered visualizations. The intuitive UI/UX design enables self-guided exploration of technical subjects, enhancing comprehension for visual learners. The platform is open-source, encouraging community collaboration and extensibility.

Code Demo

Compliance

Compliance Guide

Compiled a compliance framework covering 15+ critical domains (e.g., Cybersecurity/CyberSecOps, Data Privacy (GDPR, CCPA), PCI DSS, IT Best Practices, Legal & Operational Standards) for a fictional grocery delivery startup. The resource is intended to streamline onboarding and reduce initial legal/compliance research overhead by an estimated 5–10%.

Code Demo