Stream is a modular pipeline that generates synthetic hospital data or imports real hospital data, formats it for instruction tuning, and fine-tunes a large language model (LLM) on it.
⚠️ IMPORTANT: NVIDIA GPU & CUDA Required ⚠️
This project requires an NVIDIA GPU and a properly installed CUDA toolkit.
Make sure your system has compatible GPU drivers and the correct version of CUDA installed before running anything.
🔧 See the official CUDA installation guide for setup instructions.
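A quick check from Python can confirm that the NVIDIA tooling is visible from your shell before you install anything. The sketch below is not part of Stream. It only looks for the `nvidia-smi` and `nvcc` executables on the PATH, which suggests (but does not prove) that the driver and the CUDA toolkit are installed; the function name `find_cuda_tools` is made up for this example.

```python
import shutil


def find_cuda_tools() -> dict:
    """Return the paths of the NVIDIA command-line tools found on PATH.

    A value of None means the executable was not found.
    """
    return {
        "nvidia-smi": shutil.which("nvidia-smi"),  # ships with the GPU driver
        "nvcc": shutil.which("nvcc"),              # ships with the CUDA toolkit
    }


tools = find_cuda_tools()
for tool, path in tools.items():
    print(f"{tool}: {path or 'not found'}")
```

If either tool is reported as not found, the CUDA installation guide is the place to start.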
%%{init: { "flowchart": { "htmlLabels": true } } }%%
flowchart TB
classDef next stroke-dasharray: 3
style fd stroke-dasharray: 3
style rd stroke-dasharray: 3
subgraph Datamanagement
subgraph fd["Fictive data"]
c["Generate ICD-10-coded hospital stays"]
g["Generate synthetic medical records"]
end
subgraph rd["Real data"]
i["Import ICD-10-coded hospital stays"]
r["Import real medical records"]
end
s["Create LLM response"]
o["Create a standardized input file for fine-tuning"]
end
subgraph F["Finetuning"]
direction TB
sd["Split dataset"]
t["Train Model"]
e["Evaluate model on test set"]
sa["Save and upload model to Ollama"]
re["Use finetuned model"]
end
c & i --> s --> g --> o
s & r --> o
o --> F --> sd --> t --> e --> sa --> re
├── .gitignore
├── outputs
│   └── best_models
├── pyproject.toml
├── .python-version
├── README.md
└── stream
    ├── cli.py
    ├── config
    │   ├── config.py
    │   ├── config.yaml
    │   └── stream_config.yaml
    ├── dataset.py
    ├── export.py
    ├── finetunning.py
    ├── generate.py
    ├── inference.py
    ├── __init__.py
    ├── __main__.py
    ├── prompt.yaml
    └── referentials
        └── cim_10_atih_2019.tsv

uv venv
source .venv/bin/activate
uv pip install .
⚠️ Note: uv was the main Python package manager during development of the project. pip can also be used.
Create a configuration file at stream/config/config.yaml:
cd stream/stream/config
touch config.yaml
vim config.yaml

Paste the following example settings (replace placeholders with real values):
ollama:
  host: "host_to_your_ollama_server"
  model: "your_ollama_model_for_data_generation"

💡 Tip: You can specify a default HuggingFace model for fine-tuning.
The stream/config/stream_config.yaml file sets all the parameters for LLM fine-tuning and data generation.
🔧 You can adjust the parameters as needed for your project.
Here is an example stream_config.yaml file:
path:
  pmsi_dataset: "path/to/pmsi_dataset.csv"
  train_dataset: "path/to/train_dataset.csv"
  gguf_path: "path/to/gguf/model/will/be/saved"
llm:
  epoch: 3
  batch_size: 8
  lr: 2e-4
  best_model_path: "path/to/best/model/folder"
  r: 8
  lora_alpha: 16
hf:
  model: "unsloth/the_model_i_want_to_finetune"

💡 The pmsi_dataset.csv file has the following schema:
schema = {
    "visit_occurence_id": polars.Int64,
    "los": polars.Int32,  # length of stay
    "condition_status_source_value": polars.String,  # DP, DR, DAS
    "concept_code": polars.String,  # ICD-10 code
    "concept_name": polars.String,  # ICD-10 label
}

💡 The train_dataset.csv file has the following schema:
schema = {
    "visit_occurence_id": polars.Int64,
    "instruction": polars.String,
    "input": polars.String,
    "output": polars.String,
    "text": polars.String,
}
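To make the pmsi_dataset.csv schema concrete, here is a minimal sketch that checks a CSV against it using only the standard library. It is an illustration, not Stream's own loader (which uses polars). It assumes a comma-delimited file with a header row, and it assumes that DP, DR and DAS are the only valid status values. The helper `validate_pmsi_rows` and the sample row are made up for this example.

```python
import csv
import io

# Column names copied from the pmsi_dataset.csv schema.
PMSI_COLUMNS = [
    "visit_occurence_id",
    "los",
    "condition_status_source_value",
    "concept_code",
    "concept_name",
]
# Assumption: the statuses listed in the schema comment are the only valid ones.
ALLOWED_STATUS = {"DP", "DR", "DAS"}


def validate_pmsi_rows(csv_text: str) -> list:
    """Parse CSV text and check that each row matches the expected schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != PMSI_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        # Integer columns: int() raises ValueError on malformed values.
        row["visit_occurence_id"] = int(row["visit_occurence_id"])
        row["los"] = int(row["los"])
        status = row["condition_status_source_value"]
        if status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status: {status}")
        rows.append(row)
    return rows


# Illustrative row only; the values are invented for this example.
sample = (
    "visit_occurence_id,los,condition_status_source_value,concept_code,concept_name\n"
    "1,3,DP,I10,Essential (primary) hypertension\n"
)
rows = validate_pmsi_rows(sample)
print(rows[0]["concept_code"], rows[0]["los"])
```

The same pattern applies to train_dataset.csv by swapping in its column names.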
To generate synthetic medical records for training:
uv run -m stream generate --r --n 10

The generate command creates synthetic medical records.
- Use --r for real ICD-10 codes.
- Use --f for fake ICD-10 codes.

The --n option specifies how many records to generate.
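For readers curious how flags like these fit together, the following sketch defines an equivalent command line with `argparse`. It is a hypothetical reconstruction, not the code in stream/cli.py: it assumes that --r and --f are mutually exclusive, and the default of 10 for --n is invented for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `generate` flags."""
    parser = argparse.ArgumentParser(prog="stream")
    subparsers = parser.add_subparsers(dest="command", required=True)

    generate = subparsers.add_parser("generate", help="generate synthetic medical records")
    # Assumption: exactly one of --r / --f must be given.
    source = generate.add_mutually_exclusive_group(required=True)
    source.add_argument("--r", action="store_true", help="use real ICD-10 codes")
    source.add_argument("--f", action="store_true", help="use fake ICD-10 codes")
    # Assumption: the default value is illustrative only.
    generate.add_argument("--n", type=int, default=10, help="number of records to generate")
    return parser


args = build_parser().parse_args(["generate", "--r", "--n", "10"])
print(args.command, args.r, args.f, args.n)
```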
from stream.generate import generate_MR
generate_MR(fictive=True, n=10)  # set fictive=False to use real ICD-10 codes

- To train the model specified in the config.yaml file:
uv run -m stream llm --train

💡 The training file is the one specified in stream_config.yaml.
💡 The dataset is split automatically during training: train (60%), validation (20%), test (20%).
- To evaluate the model on the test set:
uv run -m stream llm --test

💡 The test set is a split of the train_dataset.csv file that is not used during training.
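The 60/20/20 split mentioned above can be pictured with a short standard-library sketch. This is not the logic of `UnslothDatasetBuilder.split_dataset`, whose internals are not shown here. It assumes a simple shuffle followed by two cuts, and the function name `split_60_20_20` and the fixed seed are made up for this example.

```python
import random


def split_60_20_20(items: list, seed: int = 0) -> dict:
    """Shuffle a copy of `items` and cut it into train/validation/test parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to the test set
    }


splits = split_60_20_20(list(range(10)))
print({name: len(part) for name, part in splits.items()})
```

Because the three parts are slices of one shuffled list, no record can appear in more than one of them, which is what keeps the test set unseen during training.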
The following Python code uses the module to fine-tune a model:
from stream.finetunning import Cim10Trainer
from stream.dataset import UnslothDatasetBuilder
cim = Cim10Trainer()
# Create and split the dataset
pl_dataset = cim.dataset
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
dataset = dataset_builder.build_dataset(pl_dataset)
dataset = dataset.map(dataset_builder.formatting_prompts_func, batched=True)
splits = dataset_builder.split_dataset(dataset)
train = splits["train"]
val = splits["validation"]
# set and finetune the model
cim.set_model()
cim.finetune_model(train=train, val=val)

The following Python code uses the module to evaluate a model:
from stream.finetunning import Cim10Trainer
from stream.dataset import UnslothDatasetBuilder
cim = Cim10Trainer()
# Create and split the dataset
pl_dataset = cim.dataset
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
dataset = dataset_builder.build_dataset(pl_dataset)
dataset = dataset.map(dataset_builder.formatting_prompts_func, batched=True)
splits = dataset_builder.split_dataset(dataset)
test=splits["test"]
# Evaluate the model
cim.evaluate_model(
    best_model_path=cim.llm_setting.llm.best_model_path,
    tokenizer=dataset_builder.tokenizer,
    test_dataset=test,
    dataset_chat_template=dataset_builder.tokenize()
)

Once you have trained the model, you might want to use it on real data.
⚠️ Note: For the moment, only the .txt format is supported for the input file.
uv run -m stream inference --i path/to/my_MR.txt

from stream.inference import Cim10Inference
import polars as pl
from stream.dataset import UnslothDatasetBuilder

cim = Cim10Inference()

# Path to the medical record to code (the CLI reads it from --i)
inference_data = "path/to/my_MR.txt"
with open(inference_data, "r", encoding="utf-8") as f:
    lignes = [ligne.strip() for ligne in f]

# Join the stripped lines into a single user message
df = pl.DataFrame({"conversations": lignes})
text = "\n".join(df["conversations"].to_list())
message = [
    {"role": "user", "content": text}
]
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
cim.inference(
    best_model_path=cim.llm_setting.llm.best_model_path,
    tokenizer=dataset_builder.tokenizer,
    inference_data=message,
    dataset_chat_template=dataset_builder.tokenize()
)

You might want to export your model to your local Ollama server. To do so, you have to convert the model to the GGUF format. Ollama also requires a Modelfile. Both can be created with the two commands below:
uv run -m stream export
ollama create mistral-custom -f ./outputs/best_models/Modelfile

These two commands create a model named "mistral-custom" on your current Ollama server.
If you want to use Python directly:
from stream.export import setup_tokenizer, export_gguf_model
from stream.finetunning import Cim10Trainer
cim = Cim10Trainer()
tokenizer = setup_tokenizer(cim.model_name)
export_gguf_model(model=cim.model, custom_model_path=cim.gguf_path, tokenizer=tokenizer, gguf_path=cim.gguf_path)
⚠️ Note: You still need the ollama create command to upload your GGUF model to your Ollama server instance.
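Once the model has been created on the Ollama server, it can be queried over Ollama's REST API. The sketch below builds a request for the `/api/generate` endpoint without sending it. It is not part of Stream. It assumes the server listens on Ollama's default address `http://localhost:11434` and that the model is named "mistral-custom" as in the command above; the helper name `build_generate_request` and the prompt text are made up for this example.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "mistral-custom",
    host: str = "http://localhost:11434",  # assumption: default Ollama address
) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for a single JSON response instead of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Example medical record text")
print(request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["response"])
```

If your server runs elsewhere, pass the `host` value from your config.yaml instead of the default.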
Created by Arthur Lamard and Basile Fuchs MD at CHU Brest
This project is based on record-with-mistral-finetune from AP-HP.