LLM From Scratch

A complete repository to train small LLMs from scratch

Project Structure

model_low.py - a modular model architecture framework whose classes serve as building blocks. It uses NumPy for the heavy numerical work, which keeps the computations optimized and lowers the resources needed to train a model. The computations are still not parallelized, however, so it is too inefficient for real training. Its purpose is to expose the internals and calculations that happen behind the curtain in model.py.
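
To illustrate what "classes as building blocks" can look like in NumPy, here is a minimal sketch of two layers with hand-written forward and backward passes. It is not taken from model_low.py: the class names `Linear` and `ReLU`, the weight scaling, and the layer sizes are assumptions chosen for the example.

```python
import numpy as np


class Linear:
    """Fully connected layer, y = x @ W + b, with a manual backward pass."""

    def __init__(self, in_features, out_features, rng):
        # Small random weights; this scaling is an illustrative choice.
        self.W = rng.standard_normal((in_features, out_features)) / np.sqrt(in_features)
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients of the loss with respect to the parameters...
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        # ...and with respect to the input, passed on to the previous layer.
        return grad_out @ self.W.T


class ReLU:
    """Element-wise max(0, x)."""

    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask


rng = np.random.default_rng(0)
layers = [Linear(4, 8, rng), ReLU(), Linear(8, 3, rng)]

x = rng.standard_normal((2, 4))  # batch of 2 inputs with 4 features each
out = x
for layer in layers:             # forward pass: chain the blocks
    out = layer.forward(out)

grad = np.ones_like(out)         # stand-in for the gradient of a loss
for layer in reversed(layers):   # backward pass: same blocks, reverse order
    grad = layer.backward(grad)
```

Because every block exposes the same `forward`/`backward` pair, a model is just a list of blocks, which is the property that makes the internals easy to inspect.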

model.py - solves the lack of parallelization in model_low.py by using PyTorch, an optimized ML library whose predefined methods greatly shorten the code needed.

dataset_download.py - downloads both datasets used during training and writes the corpus to disk as a txt file.

tokenizer_train.py - trains tokenizers with byte-pair encoding (BPE) directly on the downloaded corpus and outputs a tokenizer model and vocabulary.
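
For readers unfamiliar with BPE, the core idea is to repeatedly merge the most frequent adjacent pair of symbols into a new symbol. The sketch below is a toy character-level version written for illustration only; it is not the code in tokenizer_train.py, and the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent symbol pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` in `seq`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(sequences)
        if not counts:
            break
        # Ties are broken by first occurrence in the corpus.
        best = max(counts, key=counts.get)
        merges.append(best)
        sequences = [merge_pair(s, best, best[0] + best[1]) for s in sequences]
    return merges, sequences


merges, encoded = train_bpe(["low", "lower", "lowest", "slow"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
print(encoded)  # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't'], ['s', 'low']]
```

A production tokenizer typically works on bytes or pre-tokenized words with frequencies and stores the merge list as its model, but the merge loop follows the same pattern.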

train.py - tokenizes the corpus into raw token IDs, writes them to a binary file, and trains the model defined in model.py. It also saves the model and its checkpoints.
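
Storing token IDs as a flat binary file keeps the training data compact and quick to reload. The following is a minimal sketch of that idea using only the standard library. The 16-bit integer format, the file name `train.bin`, and the helper names are assumptions for illustration; train.py may use a different layout.

```python
import array
import os
import tempfile


def write_tokens(path, token_ids):
    """Write token IDs to `path` as raw unsigned 16-bit integers."""
    # "H" = unsigned short, so every ID must be below 65536 (assumed vocab limit).
    data = array.array("H", token_ids)
    with open(path, "wb") as f:
        data.tofile(f)


def read_tokens(path):
    """Read the whole file back into an array of token IDs."""
    data = array.array("H")
    n_items = os.path.getsize(path) // data.itemsize
    with open(path, "rb") as f:
        data.fromfile(f, n_items)
    return data


def get_example(tokens, start, block_size):
    """Return an (input, target) pair; the target is the input shifted by one."""
    x = list(tokens[start : start + block_size])
    y = list(tokens[start + 1 : start + 1 + block_size])
    return x, y


token_ids = [5, 17, 42, 8, 99, 3, 17, 42]  # stand-in for real tokenizer output
path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_tokens(path, token_ids)

tokens = read_tokens(path)
x, y = get_example(tokens, start=2, block_size=4)
print(x)  # [42, 8, 99, 3]
print(y)  # [8, 99, 3, 17]
```

The shifted target is what makes this next-token prediction: at each position the model sees `x[i]` and is trained to predict `y[i]`.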

inference.py - loads a model checkpoint, builds the architecture from model.py, and uses the tokenizer to encode the prompt and decode the output. The model then continues your prompt for the specified number of tokens.
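
Continuing a prompt for a fixed number of tokens usually comes down to a simple loop: get the scores for the next token, sample one, append it, and repeat. The sketch below shows that loop with temperature sampling. It is an assumption about the general technique, not the code in inference.py, and `toy_logits` is a hypothetical stand-in for a trained model.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(logits_fn, prompt_ids, max_new_tokens, temperature=1.0, rng=None):
    """Extend `prompt_ids` by sampling `max_new_tokens` tokens one at a time."""
    rng = rng or random.Random()
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(logits_fn(ids), temperature)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


def toy_logits(ids, vocab_size=5):
    """Hypothetical model: strongly favours (last token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 50.0
    return logits


out = generate(toy_logits, prompt_ids=[0, 1], max_new_tokens=4, rng=random.Random(0))
print(out)
```

In a real setup, the prompt IDs would come from the tokenizer's encoder, `logits_fn` would run the loaded model, and the returned IDs would be decoded back to text.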

Pipeline

dataset_download.py => tokenizer_train.py => train.py => inference.py
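
The four stages can also be chained from a small driver script. This is a sketch under the assumption that each script runs without required command-line arguments, which the README does not state; `build_commands` and `run_pipeline` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

# Stage order taken from the pipeline above.
STAGES = ["dataset_download.py", "tokenizer_train.py", "train.py", "inference.py"]


def build_commands(stages=STAGES, python=sys.executable):
    """Build one command per stage, using the current Python interpreter."""
    return [[python, script] for script in stages]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would then execute the stages in sequence. The `runner` parameter exists so the order can be checked without actually launching the scripts.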

