MSE Scientific Computing @ UPenn. BS+MS Physics @ IIT Madras.
I work on GPU programming, physics-informed ML, and distributed systems. Most of what I build is about making something run faster or fit on smaller hardware.
Languages: C++, CUDA, Python, JavaScript, Golang, Java, GLSL
Tools: PyTorch, JAX, WebGPU, OpenGL, LangChain, FAISS, Spark
- OpenGJK-GPU — CUDA implementation of GJK/EPA collision detection. Half-warp per distance query, full warp per penetration computation, all data sharing via __shfl_sync(). 37x over CPU at 1000 vertices/polytope. Built with Mattia Montanari (PhysicsX). Used by Google DeepMind and Unity.
- CUDA Path Tracer — GPU path tracer with SAH-built linear BVH for O(log n) intersection. Three BSDFs, stream compaction (99.96% ray termination by bounce 7), thin-lens DoF, stochastic AA.
- WebGPU Gaussian Splat Viewer — 3DGS in the browser. Compute preprocessing, GPU radix sort per frame, instanced indirect draw. 153 FPS on 272K gaussians.
- WebGPU Forward+ & Clustered Deferred — Three rendering techniques on Sponza at 5000 dynamic lights. G-buffer compressed to 64 bits/pixel.
- Mini Minecraft — C++/OpenGL voxel engine. Infinite terrain, 5 biomes, 3D Perlin caves, multithreaded chunk gen, ray-marched physics. (Demo)
- PDE-aware Optimizer — Custom optimizer for PINNs. Scales updates by per-sample PDE gradient variance for second-order-like preconditioning at first-order cost. Tested on Burgers, Allen-Cahn, KdV. (Paper)
- Diffusion Transformer for Flow Prediction — DDPM + Transformer for Navier-Stokes and LBM flow fields. Found and fixed a loss formulation bug in the original DiffFluid paper. Under 8% L2 error. (Paper)
- Fast Image Editing — 100x faster than DDIM inversion. SSD-1B + 4-step LCM + ControlNet Canny, 6 seconds per edit on an RTX 3060. CPU offloading gives a 4.2x speedup by avoiding VRAM fragmentation.
- KronAdaGrad + Polar Express — Replaced Newton-Schulz in KronAdaGrad with Polar Express pair iteration. Also implemented Muon and Polar Express optimizers in JAX/Optax for PINNs.
- PennCloud — Distributed KVS in C++ from raw sockets and pthreads. Synchronous replication, coordinator-based failover, 10MB memory ceiling with LRU eviction and tablet splitting. 5K+ req/s. (Demo)
LLM agent that turns text prompts into parametric CadQuery models with constraint validation and STEP/STL export. FAISS-indexed RAG over working examples. Wharton GenAI Labs, 1 of 8 selected, $4K seed.
