A complete repository to train small LLMs from scratch
model_low.py - builds a modular model architecture framework whose classes serve as building blocks. It relies on NumPy for the demanding array computations to reduce the resources needed, but those computations are not parallelized, which makes training with it highly inefficient. It is not used for real training; instead, it exposes the internals and calculations that happen behind the scenes in model.py.
model.py - solves the lack of parallelization in model_low.py by using PyTorch, an optimized ML library with predefined methods, which greatly shortens the code needed.
dataset_download.py - downloads both datasets used during training and writes the corpus to disk as a txt file.
tokenizer_train.py - creates tokenizers using byte-pair encoding (BPE) directly from the downloaded corpus and outputs a tokenizer model and vocabulary.
train.py - tokenizes the corpus into raw token IDs, writes them to a binary file, and trains the model defined in model.py. It also saves the model and its checkpoints.
inference.py - loads a model checkpoint, using model.py for the architecture and the tokenizer to encode and decode text. The model then continues your prompt for the specified number of tokens.
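The prompt continuation described for inference.py can be sketched as a plain autoregressive loop. This is a minimal illustration, not the code in inference.py: the names generate, next_token_logits, context_size, and toy_model are hypothetical, decoding is greedy (the script may sample instead), and the toy model stands in for a real checkpoint so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence


def generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    context_size: int,
) -> List[int]:
    """Greedy autoregressive decoding (hypothetical helper)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Feed only the most recent tokens that fit in the context window.
        window = ids[-context_size:]
        logits = next_token_logits(window)
        # Pick the highest-scoring token ID and append it to the sequence.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids


def toy_model(window: Sequence[int], vocab_size: int = 5) -> List[float]:
    """Stand-in for a trained model: favours (last token + 1) mod vocab_size."""
    target = (window[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(generate(toy_model, [0, 1], max_new_tokens=4, context_size=8))
# prints [0, 1, 2, 3, 4, 0]
```

In a real run, the prompt would first be encoded to token IDs by the tokenizer, and the returned IDs decoded back to text.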
dataset_download.py => tokenizer_train.py => train.py => inference.py
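As an illustration of the handoff inside train.py, the sketch below writes token IDs to a binary file, reads them back, and slices one input/target pair for next-token prediction. It is a sketch under stated assumptions, not the repository's code: the helper names are hypothetical, the 16-bit unsigned format assumes a vocabulary of at most 65,536 tokens, and the standard library array module stands in for whatever train.py actually uses.

```python
import os
import tempfile
from array import array


def write_token_ids(path, token_ids):
    """Store token IDs as raw unsigned 16-bit integers (assumed format)."""
    data = array("H", token_ids)
    with open(path, "wb") as f:
        data.tofile(f)


def read_token_ids(path):
    """Load every token ID back from the binary file."""
    data = array("H")
    count = os.path.getsize(path) // data.itemsize
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return list(data)


def make_batch(ids, start, context_size):
    """Return (inputs, targets); targets are the inputs shifted by one token."""
    x = ids[start:start + context_size]
    y = ids[start + 1:start + context_size + 1]
    return x, y


# Round trip with made-up token IDs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.bin")  # hypothetical file name
    write_token_ids(path, [5, 17, 300, 2, 9])
    restored = read_token_ids(path)

x, y = make_batch(restored, start=0, context_size=4)
print(restored)  # [5, 17, 300, 2, 9]
print(x, y)      # [5, 17, 300, 2] [17, 300, 2, 9]
```

Storing raw IDs this way means the corpus is tokenized once and can be re-read cheaply on every training run.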