Micropot/Stream

Stream

Stream is a modular pipeline designed to generate synthetic or use real hospital data, format it for instruction tuning, and fine-tune a large language model (LLM) using this data.


⚠️ IMPORTANT: NVIDIA GPU & CUDA Required ⚠️

This project requires an NVIDIA GPU and a properly installed CUDA toolkit.
Make sure your system has compatible GPU drivers and the correct version of CUDA installed before running anything.

🔧 See the official CUDA installation guide for setup instructions.


🧠 Workflow Overview

%%{init: { "flowchart": { "htmlLabels": true } } }%%

flowchart TB
classDef next stroke-dasharray: 3
style fd stroke-dasharray: 3
style rd stroke-dasharray: 3

subgraph Datamanagement

subgraph fd["Fictive data"]
    c["Generate ICD-10-coded hospital stays"]
    g["Generate synthetic medical records"]
end
subgraph rd["Real data"]
    i["Import ICD-10-coded hospital stays"]
    r["Import real medical records"]
end

s["Create LLM response"]
o["Create a standardized input file for fine-tuning"]

end

subgraph F["Finetuning"]
    direction TB
    sd["Split dataset"]
    t["Train Model"]
    e["Evaluate model on test set"]
    sa["Save and upload model to Ollama"]
    re["Use finetuned model"]
end

c & i --> s --> g --> o
s & r --> o
o --> F --> sd --> t --> e --> sa --> re

🚀 Getting started

📦 Project Structure

├── .gitignore
├── outputs
│   └── best_models
├── pyproject.toml
├── .python-version
├── README.md
└── stream
    ├── cli.py
    ├── config
    │   ├── config.py
    │   ├── config.yaml
    │   └── stream_config.yaml
    ├── dataset.py
    ├── export.py
    ├── finetunning.py
    ├── generate.py
    ├── inference.py
    ├── __init__.py
    ├── __main__.py
    ├── prompt.yaml
    └── referentials
        └── cim_10_atih_2019.tsv

📥 Install the dependencies

uv venv
source .venv/bin/activate
uv pip install .

⚠️ Note: uv was the main Python package manager during development of this project. pip can also be used.

✏️ Set-up connection to Ollama server

Create a configuration file at stream/config/config.yaml:

cd stream/stream/config
touch config.yaml
vim config.yaml

Paste the following example settings (replace placeholders with real values):

ollama:
  host: "host_to_your_ollama_server"
  model: "your_ollama_model_for_data_generation"

💡 Tip: You can specify a default HuggingFace model for fine-tuning.
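
Before generating data, it can help to confirm that the host set in config.yaml is reachable. The sketch below is not part of Stream: the helper names `tags_url` and `list_ollama_models` are hypothetical, and it assumes the server exposes Ollama's standard `GET /api/tags` endpoint, which lists the models available on the server.

```python
import json
import urllib.request


def tags_url(host: str) -> str:
    """Build the URL of Ollama's model-listing endpoint from a host string."""
    # Assumption: the host may be written with or without a scheme or trailing slash.
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host.rstrip("/") + "/api/tags"


def list_ollama_models(host: str, timeout: float = 5.0) -> list[str]:
    """Return the names of the models available on the Ollama server."""
    with urllib.request.urlopen(tags_url(host), timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


# Example (needs a running server; the address is a placeholder):
# print(list_ollama_models("localhost:11434"))
```

If the model named under `ollama.model` does not appear in the returned list, it probably needs to be pulled on the server first.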


✏️ Set-up configuration file

The stream/config/stream_config.yaml file sets all the parameters for LLM fine-tuning and data generation.

🔧 Adjust the parameters as needed for your project.

Here is an example stream_config.yaml file:

path:
  pmsi_dataset: "path/to/pmsi_dataset.csv"
  train_dataset: "path/to/train_dataset.csv"
  gguf_path: "path/to/gguf/model/will/be/saved"

llm:
  epoch: 3
  batch_size: 8
  lr: 2e-4
  best_model_path: "path/to/best/model/folder"
  r: 8
  lora_alpha: 16

hf:
  model: "unsloth/the_model_i_want_to_finetune"
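
The `r` and `lora_alpha` values above are LoRA hyperparameters. As a rough guide, and assuming the standard LoRA formulation (that Stream passes these values unchanged to the underlying library is an assumption), the adapter update is scaled by `lora_alpha / r`, and each adapted weight matrix with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` trainable parameters. The 4096 dimension below is purely illustrative.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16  # values from the example stream_config.yaml
print(lora_scaling(r, lora_alpha))           # 2.0
print(lora_trainable_params(4096, 4096, r))  # 65536 (illustrative layer size)
```

Raising `r` increases adapter capacity and memory use; keeping `lora_alpha / r` constant while changing `r` keeps the update scale comparable between runs.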

💡 The pmsi_dataset.csv file has the following schema:

schema={
    "visit_occurence_id": polars.Int64,
    "los": polars.Int32,  # length of stay
    "condition_status_source_value": polars.String,  # DP, DR, DAS
    "concept_code": polars.String,  # ICD-10 code
    "concept_name": polars.String,  # ICD-10 label
}
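
A quick sanity check of a PMSI file against this schema could look like the sketch below. It uses only the standard library rather than polars, assumes a comma-delimited file with a header row, and the function name `check_pmsi_csv` and the sample rows are made up for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "visit_occurence_id",
    "los",
    "condition_status_source_value",
    "concept_code",
    "concept_name",
]
# Diagnosis roles listed in the schema comment.
ALLOWED_STATUS = {"DP", "DR", "DAS"}


def check_pmsi_csv(text: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in ("visit_occurence_id", "los"):
            if not row[column].isdigit():
                problems.append(f"line {line_no}: {column} is not an integer")
        status = row["condition_status_source_value"]
        if status not in ALLOWED_STATUS:
            problems.append(f"line {line_no}: unknown status {status!r}")
    return problems


# Made-up rows, for illustration only.
sample = (
    "visit_occurence_id,los,condition_status_source_value,concept_code,concept_name\n"
    "1,3,DP,I10,Essential (primary) hypertension\n"
    "1,3,XX,I10,Essential (primary) hypertension\n"
)
print(check_pmsi_csv(sample))  # ["line 3: unknown status 'XX'"]
```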

💡 The train_dataset.csv file has the following schema:

schema={
    "visit_occurence_id": polars.Int64,
    "instruction": polars.String,
    "input": polars.String,
    "output": polars.String,
    "text": polars.String,
}
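
To make the role of each column concrete, here is a minimal sketch of how one training row might be assembled. It assumes that `text` holds the full prompt built from the other three fields; the Alpaca-style template and the `build_row` helper are hypothetical, and the template Stream actually uses may differ.

```python
# Hypothetical Alpaca-style template, NOT taken from the Stream source.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_row(visit_id: int, instruction: str, input_text: str, output: str) -> dict:
    """Assemble one row with the five columns of train_dataset.csv."""
    return {
        "visit_occurence_id": visit_id,
        "instruction": instruction,
        "input": input_text,
        "output": output,
        # Assumption: 'text' is the concatenation the model is trained on.
        "text": TEMPLATE.format(
            instruction=instruction, input=input_text, output=output
        ),
    }


# Placeholder strings, for illustration only.
row = build_row(1, "<task description>", "<medical record>", "<expected answer>")
print(list(row))
```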

🧬 Data Generation

💻 With CLI

To generate synthetic medical records for training:

uv run -m stream generate --r --n 10

The generate command creates synthetic medical records.

  • Use --r for real ICD-10 codes.
  • Use --f for fake ICD-10 codes.

Use --n to set the number of records to generate.
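
For readers who want to see how such flags fit together, the sketch below parses the same command line with `argparse`. It is not the actual stream/cli.py: making `--r` and `--f` mutually exclusive and required, and defaulting `--n` to 10, are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring `stream generate --r|--f --n N` (illustrative)."""
    parser = argparse.ArgumentParser(prog="stream")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="generate synthetic medical records")
    # Assumption: exactly one of --r / --f must be given.
    source = gen.add_mutually_exclusive_group(required=True)
    source.add_argument("--r", action="store_true", help="use real ICD-10 codes")
    source.add_argument("--f", action="store_true", help="use fake ICD-10 codes")
    # Assumption: default of 10 records.
    gen.add_argument("--n", type=int, default=10, help="number of records")
    return parser


args = build_parser().parse_args(["generate", "--r", "--n", "10"])
print(args.command, args.r, args.f, args.n)  # generate True False 10
```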

🐍 With Python

from stream.generate import generate_MR
generate_MR(fictive=True, n=10)  # set fictive=False to use real ICD-10 codes instead

πŸ” Fine-Tuning

πŸ’» Run with CLI

  • To train the model mentionned in the config.yaml file :
uv run -m stream llm --train

πŸ’‘ The train file is the one mentionned in the stream_config.yaml.

πŸ’‘ The dataset will be splited automaticaly during training : train (60%), validation (20%), test (20%).

  • To evalue the model on the test set :
uv run -m stream llm --test

πŸ’‘ The test set is a split of train_dataset.csv file, not included during training.
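
The 60/20/20 split described above can be illustrated with a few lines of plain Python. This is a sketch of the idea only, not the code used by `UnslothDatasetBuilder.split_dataset`; the function name and the fixed seed are assumptions.

```python
import random


def split_60_20_20(rows: list, seed: int = 0) -> dict:
    """Shuffle rows and cut them into train/validation/test parts."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


splits = split_60_20_20(list(range(10)))
print({name: len(part) for name, part in splits.items()})
# {'train': 6, 'validation': 2, 'test': 2}
```

Fixing the seed keeps the test set identical between the training run and a later evaluation run, which is what makes the held-out evaluation meaningful.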

🐍 Run with Python

The following Python code uses the module to fine-tune a model:

from stream.finetunning import Cim10Trainer
from stream.dataset import UnslothDatasetBuilder

cim = Cim10Trainer()
# Create and split the dataset
pl_dataset = cim.dataset
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
dataset = dataset_builder.build_dataset(pl_dataset)
dataset = dataset.map(dataset_builder.formatting_prompts_func, batched=True)
splits = dataset_builder.split_dataset(dataset)
train = splits["train"]
val = splits["validation"]

# set and finetune the model
cim.set_model()
cim.finetune_model(train=train, val=val)

The following Python code uses the module to evaluate a model:

from stream.finetunning import Cim10Trainer
from stream.dataset import UnslothDatasetBuilder

cim = Cim10Trainer()
# Create and split the dataset
pl_dataset = cim.dataset
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
dataset = dataset_builder.build_dataset(pl_dataset)
dataset = dataset.map(dataset_builder.formatting_prompts_func, batched=True)
splits = dataset_builder.split_dataset(dataset)
test = splits["test"]
# Evaluate the model
cim.evaluate_model(
    best_model_path=cim.llm_setting.llm.best_model_path,
    tokenizer=dataset_builder.tokenizer,
    test_dataset=test,
    dataset_chat_template=dataset_builder.tokenize(),
)

🎣 Inference

Once you have trained the model, you might want to use it on real data.

⚠️ Note: For now, only .txt input files are supported.

💻 Run with CLI

uv run -m stream inference --i path/to/my_MR.txt

🐍 Run with Python

import polars as pl

from stream.dataset import UnslothDatasetBuilder
from stream.inference import Cim10Inference

cim = Cim10Inference()
# Path to the medical record to analyse (.txt only)
inference_data = "path/to/my_MR.txt"
with open(inference_data, "r", encoding="utf-8") as f:
    lignes = [ligne.strip() for ligne in f]

# Join the lines back into a single user message
df = pl.DataFrame({"conversations": lignes})
text = "\n".join(df["conversations"].to_list())
message = [
    {"role": "user", "content": text}
]
dataset_builder = UnslothDatasetBuilder(cim.tokenizer)
cim.inference(
    best_model_path=cim.llm_setting.llm.best_model_path,
    tokenizer=dataset_builder.tokenizer,
    inference_data=message,
    dataset_chat_template=dataset_builder.tokenize(),
)

🌐 Export the model to Ollama server

You might want to export your model to your local Ollama server. To do so, the model must be converted to GGUF format, and Ollama also requires a Modelfile. Both steps are covered by the two commands below:

uv run -m stream export
ollama create mistral-custom -f ./outputs/best_models/Modelfile

These two commands create a model named "mistral-custom" on your current Ollama server.

To run the export directly from Python:

from stream.export import setup_tokenizer, export_gguf_model
from stream.finetunning import Cim10Trainer

cim = Cim10Trainer()
tokenizer = setup_tokenizer(cim.model_name)
export_gguf_model(
    model=cim.model,
    custom_model_path=cim.gguf_path,
    tokenizer=tokenizer,
    gguf_path=cim.gguf_path,
)

⚠️ Note: You still need to run the ollama create command to upload your GGUF model to your Ollama server instance.
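
If you prefer to stay in Python for that last step as well, the `ollama create` command can be launched with `subprocess`. This is only a sketch: the helper names are hypothetical, and it assumes the `ollama` executable is on your PATH and that the Modelfile is at the path used above.

```python
import subprocess


def ollama_create_command(model_name: str, modelfile: str) -> list[str]:
    """Build the argument list for `ollama create <name> -f <Modelfile>`."""
    return ["ollama", "create", model_name, "-f", modelfile]


def register_model(
    model_name: str = "mistral-custom",
    modelfile: str = "./outputs/best_models/Modelfile",
) -> None:
    """Run `ollama create`; raises CalledProcessError if the command fails."""
    subprocess.run(ollama_create_command(model_name, modelfile), check=True)


# Example (requires the ollama CLI and an exported model):
# register_model()
```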


👥 Credits

Created by Arthur Lamard and Basile Fuchs MD at CHU Brest

πŸ“ Source

This project is based on : record-with-mistral-finetune from AP-HP
