The code is organised into two folders: one for our approach (AutoGraphAD) and another for running Anomal-E within the same pipeline.
To run the code, you will need to create two separate virtual environments, as the libraries needed to run each part of the code are incompatible.
The libraries needed to run each part of the experiment are listed in the corresponding requirements files and can easily be installed. For the machine learning libraries, please make sure you install the versions that match your CUDA version.
The original dataset, the generated dataset files, and the model weights have been uploaded to a FigShare repository. They can be accessed at this link: https://doi.org/10.6084/m9.figshare.30643508
These are our hyperparameter search grids. The first table covers the model hyperparameters, and the second covers the anomaly detection and scoring hyperparameters.
| Hyperparameter | Search Values |
|---|---|
| No. Layers | [1, 2] |
| No. Hidden | 32 |
| Learning Rate | 1e-3 |
| Activation Function | ReLU |
| Loss Function for Structural Loss | Binary Cross Entropy |
| Loss Function for Feature Loss | [Mean Square Error, Cosine Embedding Loss] |
| Optimiser | AdamW |
| Weight Decay | 1e-5 |
| KL Annealing | [Yes, No] |
| KL Annealing Epochs | 10 |
| KL Min Weight | 0.0 |
| Regulariser | [Yes, No] |
| Multiple Draw | [Yes, No] |
| Maximum Epochs | 100 |
| Early Stop Patience | 20 |
| Feature Importance | 1.0 |
| Structural Importance | 1.0 |
| Hyperparameter | Search Values |
|---|---|
| PCA no. components | [0.96, 0.98, 0.99] |
| PCA Weighted | True |
| PCA Whiten | [True, False] |
| PCA Standardisation | True |
| HBOS Bins | [5, 6, 8, 10, 12, 14, 16, 18, 20] |
| HBOS Alpha | [0.05, 0.1] |
| HBOS Tol | [0.1, 0.5] |
| Contamination (Anomal-E) | [0.02, 0.035, 0.05, 0.1, 0.2] |
| Alpha (AutoGraphAD) | [0.1, 0.5, 1.0] |
| Beta (AutoGraphAD) | [0.1, 0.5, 1.0] |
| Gamma (AutoGraphAD) | [0.1, 0.5, 1.0] |
| MSE use (AutoGraphAD) | [True, False] |
| Percentile (AutoGraphAD) | [95, 97, 98, 99] |
The selected hyperparameters for each estimator at each contamination level are the following:

| Estimator | Contamination Level | Hyperparameters |
|---|---|---|
| PCA | No Contamination | Contamination: 0.02, Number of Components: 0.96, Whiten: False |
| PCA | 3.5% Contamination | Contamination: 0.05, Number of Components: 0.98, Whiten: False |
| PCA | 5.7% Contamination | Contamination: 0.1, Number of Components: 0.98, Whiten: False |
| CBLOF | No Contamination | Contamination: 0.02, Alpha: 0.9, Beta: 5, Number of Clusters: 36, Use Weights: True |
| CBLOF | 3.5% Contamination | Contamination: 0.05, Alpha: 0.9, Beta: 5, Number of Clusters: 40, Use Weights: True |
| CBLOF | 5.7% Contamination | Contamination: 0.1, Alpha: 0.9, Beta: 5, Number of Clusters: 50, Use Weights: False |
| HBOS | No Contamination | Contamination: 0.02, Alpha: 0.05, Beta: 5, Number of Bins: 14, Tol: 0.1 |
| HBOS | 3.5% Contamination | Contamination: 0.05, Alpha: 0.1, Beta: 5, Number of Bins: 5, Tol: 0.1 |
| HBOS | 5.7% Contamination | Contamination: 0.1, Alpha: 0.1, Beta: 5, Number of Bins: 12, Tol: 0.1 |
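For reference, a minimal sketch of how these configurations map onto estimator objects, assuming the PyOD implementations of PCA, CBLOF, and HBOS (the HBOS Beta value from the table has no direct PyOD counterpart and is omitted here):

```python
# Sketch of instantiating the selected estimators, assuming the PyOD implementations.
from pyod.models.pca import PCA
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS

# 3.5% contamination configurations from the table above
pca = PCA(contamination=0.05, n_components=0.98, whiten=False)
cblof = CBLOF(contamination=0.05, alpha=0.9, beta=5, n_clusters=40, use_weights=True)
hbos = HBOS(contamination=0.05, alpha=0.1, n_bins=5, tol=0.1)

# Each estimator is fitted on unlabelled embeddings and then used to score samples:
# estimator.fit(train_embeddings); scores = estimator.decision_function(test_embeddings)
```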
The selected AutoGraphAD model variants for each contamination level are the following:

| Model Variant | Contamination | Model Hyperparameters |
|---|---|---|
| VGAE Regulariser | 0% | 20% Negative Sampling, Edge Dropping, Node Masking with Annealing, 1 Layer |
| VGAE Regulariser | 3.5% | 40% Negative Sampling, Edge Dropping, Node Masking with Annealing, 2 Layers |
| VGAE Multiple Draw | 5.7% | 20% Negative Sampling, Edge Dropping, Node Masking with Annealing, 1 Layer, 10 Draws |
The selected anomaly score hyperparameters for each contamination level are the following:

| Contamination | Anomaly Score Hyperparameters |
|---|---|
| 0% Contamination | Alpha: 0.1, Beta: 0.5, Gamma: 0.1, MSE: True, Percentile: 95 |
| 3.5% Contamination | Alpha: 1.0, Beta: 0.1, Gamma: 1.0, MSE: False, Percentile: 95 |
| 5.7% Contamination | Alpha: 0.5, Beta: 0.1, Gamma: 0.5, MSE: True, Percentile: 95 |
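As a rough illustration of how the Alpha, Beta, Gamma, and Percentile values are used, here is a minimal sketch assuming the anomaly score is a weighted sum of the per-sample loss terms, with the detection threshold set at the given percentile of the training scores (the exact combination is the one implemented in the repository):

```python
# Minimal sketch of percentile-based thresholding of a weighted anomaly score.
# The mapping of Alpha/Beta/Gamma to specific loss terms is an assumption.
import numpy as np

def anomaly_scores(structural, feature, kl, alpha, beta, gamma):
    """Combine per-sample loss terms into a single anomaly score."""
    return alpha * structural + beta * feature + gamma * kl

# 0% contamination configuration from the table above
alpha, beta, gamma, percentile = 0.1, 0.5, 0.1, 95

rng = np.random.default_rng(0)
train_scores = anomaly_scores(rng.random(1000), rng.random(1000), rng.random(1000),
                              alpha, beta, gamma)          # dummy training scores
threshold = np.percentile(train_scores, percentile)

test_scores = anomaly_scores(rng.random(200), rng.random(200), rng.random(200),
                             alpha, beta, gamma)           # dummy test scores
predictions = (test_scores > threshold).astype(int)        # 1 = flagged as anomalous
```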
Models are trained through the file training_script.py. Inside the file, you can declare your model and configure:
- Model architecture (e.g., embedding size, number of layers)
- Optimisation hyperparameters (e.g., learning rate, weight decay)
- KL Annealing
- Checkpoint saving paths and the dataset to be used
The models are defined in a dictionary inside the file. Please modify this dictionary to design and train your own models.
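As an illustration, here is a minimal sketch of what an entry in that dictionary could look like, using values from the search grid above; the keys and paths are hypothetical and should be adapted to the ones actually defined in training_script.py:

```python
# Hypothetical sketch of the model dictionary in training_script.py.
# Key names and paths are illustrative, not the ones used in the repository.
models = {
    "vgae_regulariser_0pct": {
        "num_layers": 1,             # model architecture
        "hidden_size": 32,
        "activation": "relu",
        "learning_rate": 1e-3,       # optimisation hyperparameters
        "weight_decay": 1e-5,
        "kl_annealing": True,        # KL annealing settings
        "kl_annealing_epochs": 10,
        "kl_min_weight": 0.0,
        "max_epochs": 100,
        "early_stop_patience": 20,
        "checkpoint_path": "checkpoints/vgae_regulariser_0pct.pt",  # where weights are saved
        "dataset_path": "graph_datasets/contamination_0/",          # dataset to be used
    },
}
```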
Hyperparameter optimisation and model evaluation happen in the optimised_grid_search.py file.
This script contains two main methods:
- One for hyperparameter optimisation
- One for testing the hyperparameters.
For hyperparameter optimisation, you only need to set the path where the results should be saved. When testing the models, you also need to set the location of the best results so the script can read the hyperparameters that achieved them during optimisation.
Additionally, you can add more hyperparameter options to expand the search space through the lists at the beginning of the file.
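For instance, a hypothetical sketch of what those lists might look like (the variable names are illustrative; extend the lists actually defined in optimised_grid_search.py):

```python
# Hypothetical search-space lists at the top of optimised_grid_search.py.
alphas      = [0.1, 0.5, 1.0]        # AutoGraphAD anomaly-score weights (see tables above)
betas       = [0.1, 0.5, 1.0]
gammas      = [0.1, 0.5, 1.0]
percentiles = [95, 97, 98, 99]       # threshold percentiles for the anomaly score
use_mse     = [True, False]          # MSE vs. cosine feature reconstruction error

# Appending a value expands the grid, e.g. trying a more permissive threshold:
percentiles.append(90)
```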
To generate the graph datasets, you will need the original dataset in .parquet format. This format was chosen for its fast loading and saving times and its built-in compression, which reduces file size and eases data transfer.
The .parquet file can be found in the FigShare repository.
To generate the graph dataset, you need to follow these instructions:
- Set a directory in which to save the dataset
- Set the path of the raw file
- Select the mode of the generator
- Select the individual settings that can be found in the datasets.py file.
The settings we used to generate the dataset with 0% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=True
- classification_threshold=0.0
- node_labels=True
- l2_norm=True
The settings we used to generate the dataset with 3.36% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=False
- classification_threshold=0.0
- node_labels=True
- l2_norm=True
The settings we used to generate the dataset with 5.76% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=False
- classification_threshold=0.0
- node_labels=True
- benign_downsampling=0.01
- l2_norm=True
To add negative edges, you will need to use the method negative_sampling() provided by the dataset class. To save the processed datasets, you will need to use the method save_datasets(). Please ensure that the folder where the dataset will be saved has been created before calling the saving method.
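Putting the pieces together, here is a sketch of generating and saving the 0% contamination dataset; the class name, constructor signature, and paths are assumptions and should be matched against datasets.py:

```python
# Hypothetical sketch of generating the 0% contamination graph dataset.
from datasets import GraphDataset  # hypothetical class name; use the one in datasets.py

dataset = GraphDataset(
    save_dir="graph_datasets/contamination_0/",  # directory in which to save the dataset
    raw_file="data/original_dataset.parquet",    # hypothetical path to the raw .parquet file
    batch_size=1,
    window_size=180,
    window_stride=180,
    train_split=70,
    test_split=10,
    remove_attacks=True,
    classification_threshold=0.0,
    node_labels=True,
    l2_norm=True,
)

dataset.negative_sampling()   # add negative edges
dataset.save_datasets()       # make sure the save directory exists beforehand
```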
Anomal-E models are trained through the Jupyter notebook AnomalERunning.ipynb. Through the notebook you can:
- Load different datasets
- Set the path for saving the model's weights at the best achieved loss
- Set the hyperparameters and the early-stopping patience
Hyperparameter optimisation and model evaluation happen in the optimised_grid_search.py file.
This script contains two main methods:
- One for hyperparameter optimisation
- One for testing the hyperparameters.
For hyperparameter optimisation, you only need to set the path where the results should be saved. When testing the models, you also need to set the location of the best results so the script can read the hyperparameters that achieved them during optimisation.
Additionally, you can add more hyperparameter options to expand the search space through the lists at the beginning of the file.
To generate the graph datasets, you will need the original dataset in .parquet format. This format was chosen for its fast loading and saving times and its built-in compression, which reduces file size and eases data transfer.
The .parquet file can be found in the FigShare repository.
To generate the graph dataset, you need to follow these instructions:
- Set a directory in which to save the dataset
- Set the path of the raw file
- Select the mode of the generator
- Select the individual settings that can be found in the datasets.py file.
The settings we used to generate the dataset with 0% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=True
- classification_threshold=0.0
- node_labels=True
- l2_norm=True
The settings we used to generate the dataset with 3.36% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=False
- classification_threshold=0.0
- node_labels=True
- l2_norm=True
The settings we used to generate the dataset with 5.76% contamination in the training set are the following:
- batch_size=1
- window_size=180
- window_stride=180
- train_split=70
- test_split=10
- remove_attacks=False
- classification_threshold=0.0
- node_labels=True
- benign_downsampling=0.01
- l2_norm=True
To save the processed datasets, you will need to use the method save_datasets() provided by the dataset class. Please ensure that the folder where the dataset will be saved has been created before calling the saving method.
To convert the PCAP files to NetFlows, you need to install the NFStream library. You can use either of the existing virtual environments, as NFStream is compatible with both sets of libraries.
To start the processing, you will need to:
- Create a folder and load all the PCAP files there.
- Run the script pcap_to_nf_v2.py.
- Configure how the PCAP files will be processed, including the PCAP folder, the inactive timeout, and the maximum flow length.
For UNSW-NB15, the same settings as mentioned in the paper were used.
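For reference, here is a minimal sketch of the NFStream conversion wrapped by pcap_to_nf_v2.py; the timeout values below are placeholders, not necessarily the ones used in the paper:

```python
# Sketch of converting each PCAP in a folder to a NetFlow CSV with NFStream.
from pathlib import Path
from nfstream import NFStreamer

pcap_dir = Path("pcaps/")       # folder containing the PCAP files
out_dir = Path("netflows/")
out_dir.mkdir(exist_ok=True)

for pcap in sorted(pcap_dir.glob("*.pcap")):
    streamer = NFStreamer(
        source=str(pcap),
        idle_timeout=15,            # inactive timeout in seconds (placeholder value)
        active_timeout=1800,        # maximum flow length in seconds (placeholder value)
        statistical_analysis=True,  # export per-flow statistical features
    )
    streamer.to_csv(path=str(out_dir / f"{pcap.stem}.csv"))
```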
Once we have created the NetFlow CSVs, the next goal is to label them. For labelling, we created a ground truth file that contains information about each anomalous flow from the original dataset. The ground truth matches flows using the following tuple:
- Source IP
- Source Port
- Destination IP
- Destination Port
- Start timestamp
- Finish timestamp
- Protocol
This tuple is then matched to the appropriate flow in the generated CSV files for labelling.
We create and use a .config file to indicate in which columns of the CSV each component of the tuple is located.
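An illustrative sketch of this matching logic follows; the column names and the exact time-overlap rule are assumptions, and the real implementation in nf_labeler_v3.py reads the column positions from the .config file:

```python
# Illustrative sketch of labelling flows against the ground-truth tuples.
import pandas as pd

ground_truth = pd.read_csv("ground_truth.csv")   # hypothetical ground-truth file
flows = pd.read_csv("netflows/capture1.csv")     # one generated NetFlow CSV

def is_attack(flow, gt):
    """Return 1 if the flow matches a ground-truth tuple, 0 otherwise."""
    match = gt[
        (gt["src_ip"] == flow["src_ip"])
        & (gt["src_port"] == flow["src_port"])
        & (gt["dst_ip"] == flow["dst_ip"])
        & (gt["dst_port"] == flow["dst_port"])
        & (gt["protocol"] == flow["protocol"])
        # the flow must overlap the attack's [start, finish] time window
        & (gt["start_ts"] <= flow["finish_ts"])
        & (gt["finish_ts"] >= flow["start_ts"])
    ]
    return int(not match.empty)

flows["label"] = flows.apply(lambda row: is_attack(row, ground_truth), axis=1)
flows.to_csv("netflows/capture1_labeled.csv", index=False)
```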
For the labelling, we use the script nf_labeler_v3.py and pass flags with the paths to:
- The ground-truth file.
- The config file.
- The folder of CSV files to be labelled.
To join any flows that may have been split, we use the script nf_joiner_v2.py. This script merges flows that are close in time into a single, larger flow composed of the smaller ones, allowing for the creation of a longer, cleaner communication episode; a rough sketch of the joining logic is shown after the flag list below.
To use this script, we need to pass the following flags:
- The path to the folder containing the CSVs.
- The maximum flow expiration timeout.
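Here is a rough sketch of the joining idea, merging consecutive flows of the same 5-tuple whose inter-flow gap is below the expiration timeout; the column names and the timeout value are assumptions, and the actual behaviour is defined in nf_joiner_v2.py:

```python
# Rough sketch of joining split flows into a single, longer communication episode.
import pandas as pd

EXPIRATION_TIMEOUT = 120  # seconds (placeholder value)

flows = pd.read_csv("netflows/capture1_labeled.csv").sort_values("start_ts")
key = ["src_ip", "src_port", "dst_ip", "dst_port", "protocol"]

joined = []
for _, group in flows.groupby(key):
    current = None
    for _, flow in group.iterrows():
        if current is not None and flow["start_ts"] - current["finish_ts"] <= EXPIRATION_TIMEOUT:
            # extend the current episode with this flow
            current["finish_ts"] = max(current["finish_ts"], flow["finish_ts"])
            current["label"] = max(current["label"], flow["label"])
        else:
            if current is not None:
                joined.append(current)
            current = flow.copy()
    if current is not None:
        joined.append(current)

pd.DataFrame(joined).to_csv("netflows/capture1_joined.csv", index=False)
```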
If you find this code or research useful, please cite our paper:
AutoGraphAD: Unsupervised network anomaly detection using Variational Graph Autoencoders
Georgios Anyfantis and Pere Barlet-Ros
@misc{anyfantis2026autographadunsupervisednetworkanomaly,
title={AutoGraphAD: Unsupervised network anomaly detection using Variational Graph Autoencoders},
author={Georgios Anyfantis and Pere Barlet-Ros},
year={2026},
eprint={2511.17113},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2511.17113},
}