JanBartos6/Graduation-Project

LLM From Scratch

A complete repository to train small LLMs from scratch

Project Structure

model_low.py - defines a modular model architecture whose classes serve as building blocks. The heavy numerical work is delegated to NumPy's optimized array routines to reduce the resources needed for training. The computations are still not parallelized, however, so this implementation is highly inefficient. It is not used for real training; its purpose is to expose the internals and calculations that model.py hides behind library calls.
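To illustrate what a NumPy building block with explicit calculations can look like, here is a minimal sketch of a linear layer with a hand-written backward pass. The class name `Linear`, its methods, and the initialization scale are assumptions made for illustration and are not taken from model_low.py.

```python
import numpy as np


class Linear:
    """Hypothetical building block: y = x @ W + b with a manual backward pass."""

    def __init__(self, in_features, out_features, rng):
        # Small random weights and zero bias (illustrative initialization).
        self.W = rng.standard_normal((in_features, out_features)) * 0.02
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x  # cache the input, it is needed to compute dW
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients of the loss with respect to the parameters...
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # ...and with respect to the input, passed on to the previous block.
        return grad_out @ self.W.T


rng = np.random.default_rng(0)
layer = Linear(4, 3, rng)
x = rng.standard_normal((2, 4))       # batch of 2 vectors with 4 features
y = layer.forward(x)                  # shape (2, 3)
grad_in = layer.backward(np.ones_like(y))  # shape (2, 4)
```

A framework such as PyTorch derives the `backward` step automatically, which is the part a low-level version has to spell out by hand.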

model.py - solves the lack of parallelization in model_low.py by using PyTorch, an optimized ML library whose predefined modules and methods greatly shorten the code needed.

dataset_download.py - downloads both datasets used for training and writes the corpus to disk as a plain-text (.txt) file.

tokenizer_train.py - trains tokenizers with byte-pair encoding (BPE) directly on the downloaded corpus and outputs a tokenizer model and vocabulary.
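For readers unfamiliar with BPE, the sketch below shows the core training loop in plain Python: repeatedly count adjacent symbol pairs and merge the most frequent one. It is a simplified illustration that assumes whitespace-split words and character-level starting symbols; the function name `train_bpe` is hypothetical, and the tokenizer library and settings actually used by tokenizer_train.py may differ.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    # Vocabulary: base characters followed by the merged tokens, in merge order.
    base = sorted({ch for ch in text if not ch.isspace()})
    vocab = base + [a + b for a, b in merges]
    return merges, vocab


merges, vocab = train_bpe("low low low lower", num_merges=2)
```

The learned `merges` list plays the role of the tokenizer model, and `vocab` maps each token to an ID by its position.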

train.py - tokenizes the corpus into raw token IDs, writes them to a binary file, and trains the model defined in model.py. It also saves the final model and intermediate checkpoints.
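The binary token file can be pictured as a flat array of integers. The following sketch, using only the standard library, writes token IDs as unsigned 16-bit values, reads them back, and cuts out one input/target pair shifted by one position. The 16-bit width, the file name `tokens.bin`, and the helper names are assumptions for illustration; train.py may use a different layout or library.

```python
import array
import os
import tempfile


def write_tokens(ids, path):
    # "H" is an unsigned 16-bit integer; IDs above 65535 raise OverflowError.
    with open(path, "wb") as f:
        array.array("H", ids).tofile(f)


def read_tokens(path):
    tokens = array.array("H")
    with open(path, "rb") as f:
        tokens.fromfile(f, os.path.getsize(path) // tokens.itemsize)
    return tokens


def get_example(tokens, start, block_size):
    # The target sequence is the input sequence shifted one token to the right.
    x = list(tokens[start:start + block_size])
    y = list(tokens[start + 1:start + block_size + 1])
    return x, y


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")  # hypothetical file name
    write_tokens([5, 1, 4, 2, 3, 0], path)
    tokens = read_tokens(path)

x, y = get_example(tokens, start=1, block_size=3)
```

Storing raw IDs this way means the corpus is tokenized once and later training runs only need to read fixed-width integers.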

inference.py - loads a model checkpoint, rebuilds the architecture from model.py, and uses the tokenizer to encode the prompt and decode the output. The model then continues your prompt for the specified number of tokens.
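Continuing a prompt for a fixed number of tokens is an autoregressive loop: ask the model for next-token scores, sample one token, append it, and repeat. The sketch below shows that loop with temperature sampling, using a toy stand-in for the model so that it runs without a checkpoint. The names `generate`, `next_logits`, and `toy_logits` are hypothetical, and the sampling strategy in inference.py may differ.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, prompt_ids, max_new_tokens, temperature=1.0, seed=0):
    """Append `max_new_tokens` sampled token IDs to the prompt."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(ids), temperature)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


def toy_logits(ids):
    # Stand-in for a trained model: strongly prefers (last ID + 1) mod 5.
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [20.0 if i == target else 0.0 for i in range(vocab_size)]


out = generate(toy_logits, [0], max_new_tokens=4, temperature=0.5)
```

In a real run, `next_logits` would call the loaded model on the encoded prompt, and the resulting IDs would be decoded back to text by the tokenizer.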

Pipeline

dataset_download.py => tokenizer_train.py => train.py => inference.py

About

A transformer-based language model trained from scratch to showcase performance on Tiny Stories and Wikipedia datasets
