An independent LLM researcher with experience in distributed systems and machine learning, currently investigating approaches to shrinking the KVCache. Previously worked at Microsoft and AWS.
-
Reducing the KVCache's Footprint
- Implemented a Large Concept Model with a jointly trained encoder/decoder for KVCache compression
- Explored a shared projection for K and V as an alternative to GQA
- Implemented multi-head latent attention and MixAttention, two approaches to shrinking the KVCache during inference
- Identified patterns across heads that sparsely access V, requiring no additional training
- Investigated calibrating cluster centers for Q/K to partition and sparsify KVCache access
-
Hyperparameter Transfer
- Compared the hyperparameter-transfer approaches described by Yang et al. (2022) and Everett et al. (2024), reproducing Everett et al.'s finding that standard parameterization with scaling exponents outperforms muP when transferring hyperparameters from 37M to 1B parameters
- Implemented the power scheduler from Shen et al. (2024) to optimize batch size and training length
-
Other work
- Implemented Llama 3 and Gemma 2 within muGPT
- Designed a synthetic benchmark to assess the impact of position embeddings (RoPE, ALiBi, CoPE, NoPE) and attention modifications (e.g., KVCache reuse, multi-head latent attention) on long-context performance
- Languages: Python, C++, TypeScript
- Frameworks: JAX, Pallas, PyTorch, TensorFlow
- Cloud & Infrastructure: Google TPUs, AWS Lambda, CloudFormation
- Areas of Expertise: Transformers, Machine Learning Infrastructure, Distributed Systems
- BS in Computer Science from the University of Maryland, College Park (2021)
- Graduated in three years with the President's Scholarship
- Thomas Jefferson High School for Science and Technology (2018)