A Big Data Engineering project — detecting cyber-attacks on IoT devices in real time using an online, incrementally-trained machine learning pipeline
Most ML pipelines train once on a static dataset. This project does the opposite: it builds a custom client–server streaming pipeline that pushes IoT network records row-by-row over a WebSocket, then trains an ensemble model incrementally on each incoming chunk — continuously improving as new data arrives, the way a real intrusion-detection system would.
The pipeline classifies cyber-attacks on IoT devices (e.g. DDoS, injection, backdoor, scanning) in near real time, using the TON_IoT datasets from UNSW.
The repo ships two complete experiments:
| Experiment | Description |
|---|---|
| 🔌 IoT_Modbus | Streaming + ensemble pipeline on a single device type (Modbus industrial-protocol data, ~287K records). Includes a multithreaded client and live evaluation plots. |
| 🌐 All_IoT_Devices_Combined | Same pipeline on a heterogeneous dataset built by cleaning and merging all 7 TON_IoT device datasets — the model must classify attacks across mixed device sources. |
- ⚡ Custom real-time streaming engine — a from-scratch WebSocket server/client pipeline with per-row acknowledgements for back-pressure.
- 🔄 Online / incremental learning — trains per chunk and folds each new model into a running ensemble instead of retraining from scratch.
- 📦 Adaptive chunk sizing — automatically grows a chunk when it can't be trained (e.g. only one class present yet), solving a real streaming-data pitfall.
- 🧵 Multithreaded client — decouples network I/O from model training so streaming never blocks on a training step.
- 🗳️ 7-model voting ensemble — LR · SVM · RF · LDA · KNN · CART · Naive-Bayes, plus standalone LSTM & GRU deep-learning models.
- 🧹 End-to-end data engineering — cleans, deduplicates, encodes, and merges 7 heterogeneous datasets into one trainable source.
- 📊 Live metrics & visualization — per-chunk accuracy, precision, recall, F1 and training time, tracked and plotted with Plotly.
┌──────────────┐ row-by-row stream ┌─────────────────────────────────────┐
│ │ ws://localhost:8765 │ CLIENT │
│ SERVER │ ─────────────────────► │ ┌────────────┐ ┌───────────────┐ │
│ server.ipynb │ ◄───── ack ("1") ───── │ │ Chunk │──►│ Ensemble │ │
│ (streams CSV)│ │ │ Buffer │ │ (Voting + │ │
│ │ │ │ (adaptive) │ │ incremental)│ │
└──────────────┘ │ └────────────┘ └───────┬───────┘ │
│ │ │
│ metrics ◄────┘ │
│ accuracy · F1 · recall · │
│ precision · time │
└─────────────────────────────────────┘
| Category | Tools |
|---|---|
| Language / Env | Python 3, Jupyter Notebook |
| Streaming & Concurrency | websockets, asyncio, threading |
| Data | pandas, numpy |
| Classical ML | scikit-learn — LR, SVM, RF, LDA, KNN, CART, Gaussian Naive-Bayes, VotingClassifier |
| Deep Learning | TensorFlow / Keras — LSTM, GRU |
| Preprocessing | LabelEncoder, StandardScaler, train_test_split |
| Persistence | pickle |
| Visualization | plotly, matplotlib |
- Streaming server (
server.ipynb) — reads a CSV and streams it one row at a time overws://localhost:8765with a configurable delay, waiting for an acknowledgement after every row. - Incremental client (
client.ipynb) — buffers rows into chunks, trains an ensemble per chunk, and merges each new chunk's model into the running ensemble (online learning). - Adaptive chunking — starts at
INITIAL_CHUNK_SIZE, switches toFINAL_CHUNK_SIZE, and grows byCHUNK_SIZE_INCREMENT_FACTORwhen a chunk can't be trained. - Multithreaded variant (
client_multithread.ipynb) — uses a background thread +threading.Eventso receiving and training run concurrently. - Heterogeneous data pipeline —
data_preprocessing.ipynbcleans each device dataset (drops NaNs, groups duplicates by mode);data_merging.ipynbmerges all devices. - Run logging —
stream.log(streaming events) andensemble.log(training events) capture each run.
1. Install dependencies
pip install pandas numpy scikit-learn tensorflow websockets plotly matplotlib jupyter2. Prepare data & train the base models
- Place your IoT CSV in the
data/folder (aTest_*CSV is a trimmed copy for quick streaming demos). - Encode string columns to numeric with
LabelEncoderin eachmodels/notebooks/*notebook and inclient.ipynb. - For LR & SVM, set the
ilocupper limit to the row where the second class label first appears. - Run every notebook in
models/notebooks/to train and pickle the individual models intomodels/h5s/.
3. Run the stream
▶ Run server/server.ipynb → starts streaming on ws://localhost:8765
▶ Run client/client.ipynb → connects, chunks, and trains incrementally
(or client/client_multithread.ipynb)
Watch client/stream.log & client/ensemble.log, then open client/Graphs.ipynb to visualize per-chunk metrics.
For the combined experiment, first run
data_preparation/data_preprocessing.ipynb→data_preparation/data_merging.ipynb
- Streaming architecture — building a producer/consumer pipeline over WebSockets with
asyncio, including back-pressure via per-row acknowledgements. - Online vs. batch learning — keeping a model improving as new chunks arrive instead of retraining from scratch.
- Streaming-data pitfalls — why fixed chunk sizes break (a chunk may contain only one class) and how adaptive sizing fixes it.
- Ensemble methods — combining diverse classifiers via voting, and the gap between classical ML and deep models (LSTM/GRU) on tabular IoT data.
- Concurrency — moving from a single
asyncioloop to a multithreaded client so training doesn't block the stream. - Real-world data wrangling — cleaning, deduplicating, encoding, and merging heterogeneous datasets with inconsistent schemas.
- Production streaming stack — swap the hand-rolled server for Apache Kafka / Spark Streaming (fault tolerance, partitioning, horizontal scaling).
- True incremental learners — adopt
partial_fitmodels (SGDClassifier,river) and add concept-drift detection. - Productionize — extract logic from notebooks into reusable modules with a CLI, config files, and unit tests.
- Stronger evaluation — handle class imbalance with stratified sampling, weighted metrics, and confusion matrices.
⭐ If you found this project helpful or interesting, consider giving it a star! ⭐