Large Language Models from Scratch

GitHub

I take a look under the hood of LLMs and implement a variety of common neural networks used in language modeling from scratch to understand the inner workings of these models. Many of these implementations are from Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series, which culminates in a ground-up implementation of a mini GPT model. I then dive deeper into representation learning.

micrograd

I begin with micrograd, a simple autograd engine whose API mirrors PyTorch's. It implements backpropagation and stochastic gradient descent for scalar-valued neural networks. The Value class supports basic algebraic operations that produce new Values, and each new Value records the operation and the operand Values that created it. Together these records form an expression graph that backpropagation walks in reverse, applying the chain rule at each node. This yields exact gradients in a single backward pass, which is both cheaper and more accurate than numerical approximation by finite differences.
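
A minimal sketch of this idea, written for illustration and not copied from the micrograd source: the attribute names (`_prev`, `_backward`) and the toy expression `a * b + a` are assumptions chosen for the example. Only addition and multiplication are shown, and the result is checked against a central finite difference.

```python
class Value:
    """Scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, _children=()):
        self.data = float(data)
        self.grad = 0.0                  # d(output)/d(self), filled in by backward()
        self._prev = set(_children)      # operand Values that produced this one
        self._backward = lambda: None    # pushes self.grad to the operands

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the expression graph, then apply the chain
        # rule from the output back to the leaves.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


a, b = Value(2.0), Value(-3.0)
c = a * b + a          # c = -4.0
c.backward()           # a.grad = b + 1 = -2.0, b.grad = a = 2.0

# Central finite difference for dc/da, for comparison with a.grad.
f = lambda x, y: x * y + x
h = 1e-6
numeric_da = (f(2.0 + h, -3.0) - f(2.0 - h, -3.0)) / (2 * h)
```

Gradients accumulate with `+=` because a Value such as `a` can feed into several operations, and its total gradient is the sum over all of those paths.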

makemore

makemore is an autoregressive, character-level language model. It generates made-up words that are structurally similar to the words it was trained on. Our implementation rebuilds key parts of torch.nn on top of torch.Tensor. makemore includes several models:

Bigram Language Model
- trained by counting character pairs (bigrams) to form a count matrix, which is normalized into next-character probabilities
- simple neural network (linear layer + softmax layer) trained on bigrams
Multi-Layer Perceptron
- MLP implementation following Bengio et al., 2003
- with Batch Normalization following Ioffe and Szegedy, 2015
Convolutional Neural Network
- CNN implementation modeled on Google DeepMind's WaveNet, 2016
Transformer
- Mini GPT-2 implementation following Vaswani et al., 2017
- with the following for optimizing deep networks:
  - Residual Connections from He et al., 2015
  - Layer Normalization from Ba, Kiros, and Hinton, 2016
  - Dropout (to prevent overfitting) from Srivastava et al., 2014
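
As a rough illustration of the simplest model in the list above, the counting-based bigram model can be sketched in a few lines. This is an assumption-laden sketch, not the repository's code: it uses plain Python dictionaries where the project uses tensors, the five training words are hypothetical toy data, and `.` is an assumed marker for the start and end of a word.

```python
import random
from collections import defaultdict

words = ["emma", "olivia", "ava", "isabella", "sophia"]  # hypothetical toy data

# Count how often each character follows another; "." marks word boundaries.
counts = defaultdict(lambda: defaultdict(int))
for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        counts[ch1][ch2] += 1

# Normalize each row of counts into a next-character distribution.
probs = {}
for ch1, row in counts.items():
    total = sum(row.values())
    probs[ch1] = {ch2: n / total for ch2, n in row.items()}


def sample_word(rng, max_len=20):
    """Generate one word by repeatedly sampling the next character."""
    out, ch = [], "."
    for _ in range(max_len):
        row = probs[ch]
        ch = rng.choices(list(row), weights=list(row.values()))[0]
        if ch == ".":          # end-of-word marker
            break
        out.append(ch)
    return "".join(out)


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_word(rng) for _ in range(5)]
```

Each generated character depends only on the one before it, which is why the samples look locally plausible but rarely resemble real names. The neural models in the list widen that context.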