Wine Quality Classification — Model Comparison

Short description

Dataset: Wine Quality (data/WineQT.csv) — multiclass classification of the quality label (scores 3–8, imbalanced classes).
Models: DecisionTree, kNN, RandomForest — 3 hyperparameter variants each (9 classifiers in total).
Prediction ensemble: VotingClassifier with majority voting (voting="hard") over the same 9 pipelines.
Validation: stratified 5-fold CV (StratifiedKFold(shuffle=True, random_state=42)).
Metrics: accuracy (the fraction of correct predictions) and balanced_accuracy (mean per-class recall — important given the imbalanced distribution of wine scores).
Stack: scikit-learn, Pandas (groupby, to_latex), Plotly (charts), Streamlit (study walkthrough, term definitions, requirements compliance, .tex preview).
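
The difference between the two metrics is easiest to see on a small imbalanced example. The sketch below uses made-up labels (not taken from WineQT.csv) and a degenerate model that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels (made up for illustration): class 5 dominates, class 3 is rare.
y_true = [5, 5, 5, 5, 5, 5, 5, 5, 3, 3]
# A degenerate model that always predicts the majority class.
y_pred = [5] * 10

# 8 of 10 predictions are correct.
acc = accuracy_score(y_true, y_pred)
# Mean of recall(5) = 1.0 and recall(3) = 0.0.
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy={acc:.2f}, balanced_accuracy={bal_acc:.2f}")
# prints: accuracy=0.80, balanced_accuracy=0.50
```

Accuracy rewards ignoring the rare class, while balanced accuracy drops to the level of a coin flip between the two classes, which is why both are reported.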

Protection against data leakage

All scaling operations are performed inside a scikit-learn Pipeline object:

Pipeline([
    ("preprocess", ColumnTransformer([("num", MinMaxScaler(), feature_cols)])),
    ("clf", classifier),
])

Because scaling lives inside the Pipeline, MinMaxScaler is fitted only on the training portion of each fold and never sees the held-out data before evaluation. Passing the complete Pipeline to cross_validate applies this automatically in each of the 5 splits. Details: scikit-learn, Common pitfalls.
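
A minimal sketch of how such pipelines can be cross-validated together with a hard-voting ensemble. It is not the code from src/experiment.py: the data is synthetic (make_classification stands in for WineQT.csv), the helper name build_pipeline is hypothetical, the hyperparameters are illustrative, and only one variant per model family is shown instead of three.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: all features are numeric).
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
feature_cols = [f"f{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_cols)


def build_pipeline(classifier):
    """Wrap a classifier so that scaling is refit on every training fold."""
    return Pipeline([
        ("preprocess", ColumnTransformer([("num", MinMaxScaler(), feature_cols)])),
        ("clf", classifier),
    ])


# One illustrative variant per family (hyperparameters are assumptions).
pipelines = {
    "tree": build_pipeline(DecisionTreeClassifier(max_depth=5, random_state=42)),
    "knn": build_pipeline(KNeighborsClassifier(n_neighbors=5)),
    "forest": build_pipeline(RandomForestClassifier(n_estimators=50, random_state=42)),
}
# Majority vote over the same pipelines; each member scales its own input.
pipelines["voting"] = VotingClassifier(estimators=list(pipelines.items()), voting="hard")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "balanced_accuracy"]

results = {}
for name, model in pipelines.items():
    # cross_validate clones and refits the whole pipeline on each training fold.
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

print(pd.DataFrame(results).T.round(3))
```

Scaling the full dataset once before calling cross_validate would let the test folds influence the min and max used by the scaler, which is exactly the leak the Pipeline avoids.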

No missing data

The wczytaj_pelna_ramke function in src/experiment.py calls df.isna().any().any() and raises an exception if it finds missing values — the experiment will not run on an incomplete dataset.
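
The check can be sketched as follows. The function name load_complete_frame, the ValueError exception type, and the two tiny in-memory CSVs are assumptions made for illustration; the source only states that df.isna().any().any() is called and that an exception is raised.

```python
import io

import pandas as pd


def load_complete_frame(source):
    """Hypothetical stand-in for wczytaj_pelna_ramke: load a CSV, reject missing values."""
    df = pd.read_csv(source)
    # The first any() reduces each column to a bool, the second reduces the whole frame.
    if df.isna().any().any():
        missing = df.isna().sum()
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")
    return df


# Made-up sample data: the second CSV has an empty alcohol cell.
complete = io.StringIO("alcohol,quality\n9.4,5\n9.8,6\n")
incomplete = io.StringIO("alcohol,quality\n9.4,5\n,6\n")

df = load_complete_frame(complete)

try:
    load_complete_frame(incomplete)
    rejected = False
except ValueError:
    rejected = True

print(df.shape, rejected)
```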

Requirements

Python 3.10+ (recommended)
Install: pip install -r requirements.txt

Experiment (writes to `results/`)

Runs the validation and saves the outputs, including CSV and .tex files, wersje_bibliotek.txt, and HTML charts in results/wykresy/.

python run_experiment.py

Streamlit and start scripts

The start.sh (Linux/macOS) and start.bat (Windows) scripts run, in order:

venv + pip install -r requirements.txt
python run_experiment.py
streamlit run streamlit_app.py → usually http://localhost:8501

chmod +x start.sh
./start.sh

start.bat

Application tabs

| Tab | Content |
| --- | --- |
| Project description | Study goal, experiment flow (what → why → effect), glossary of terms, requirements compliance, LaTeX file preview |
| Requirements & methodology | Table: requirement → implementation in code / files |
| Dataset (EDA) | `quality` class distribution, correlation matrix, data preview |
| Classification results | CV metric charts, tables, preview and download of results |

Reproducibility

In line with the scikit-learn recommendations (Getting reproducible results):

One fixed seed: RANDOM_STATE = 42 in src/config.py, passed as random_state to every classifier that accepts it (DecisionTree and RandomForest; kNN has no random_state parameter) and to StratifiedKFold(shuffle=True, random_state=RANDOM_STATE).
StratifiedKFold with shuffle=True needs a fixed seed to be reproducible: without one, the fold assignment would differ on every run.
The same cv object is passed to each cross_validate call, which guarantees identical splits for all models.
The results/wersje_bibliotek.txt file records the versions of scikit-learn, numpy and pandas when the results are generated — making it possible to reproduce the environment.
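
The effect of the fixed seed on the splits can be checked directly. The sketch below uses made-up labels and a hypothetical helper, fold_test_indices; it is not part of the repository.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

RANDOM_STATE = 42

# Made-up imbalanced labels; the feature values do not affect the split.
y = np.array([5] * 30 + [6] * 20 + [7] * 10)
X = np.zeros((len(y), 1))


def fold_test_indices(cv):
    """Return the test indices of each fold as a list of tuples."""
    return [tuple(test_idx) for _, test_idx in cv.split(X, y)]


seeded = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# With an integer seed, every call to split() yields the same folds, so
# reusing the object across models evaluates them on identical data.
same_splits = fold_test_indices(seeded) == fold_test_indices(seeded)

# Stratification keeps the 3:2:1 class ratio in every test fold.
fold_counts = [
    np.bincount(y[list(idx)])[5:].tolist() for idx in fold_test_indices(seeded)
]

print(same_splits, fold_counts)
```

Each test fold holds 6, 4 and 2 samples of classes 5, 6 and 7, and the two calls to split() return the same indices.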

Repository structure

| Path | Description |
| --- | --- |
| `data/WineQT.csv` | Source data |
| `src/config.py` | Paths, seed |
| `src/experiment.py` | Pipeline, CV, ensemble, CSV/LaTeX export, chart invocation |
| `src/wykresy.py` | Plotly charts + HTML export |
| `run_experiment.py` | Command-line entry point |
| `streamlit_app.py` | Web application |
| `start.sh` / `start.bat` | Experiment + Streamlit |
| `results/` | Generated results (CSV, TeX, `wykresy/*.html`, `wersje_bibliotek.txt`) |

`.gitignore`

Ignored items include .venv/, Python cache, and IDE files. Generated files in results/ can optionally be added to the ignore list — .gitignore contains a ready, commented-out block with instructions.

License

See the LICENSE file.
