106 changes: 11 additions & 95 deletions README.md
@@ -10,29 +10,12 @@

Khisto is a Python library for creating histograms using the **Khiops optimal binning algorithm**. Unlike standard histograms that use fixed-width bins or simple heuristics, Khisto automatically determines the optimal number of bins and their variable widths to best represent the underlying data distribution.

## Features

- **Optimal Binning**: Uses the MODL (Minimum Description Length) principle to find the best discretization.
- **Variable-Width Bins**: Captures dense regions with fine bins and sparse regions with wider bins.
- **NumPy Compatible**: Drop-in replacement for `numpy.histogram`.
- **Matplotlib Integration**: `khisto.matplotlib.hist` works like `plt.hist`.
- **Core Histogram API**: Inspect every available granularity with `khisto.core.compute_histograms` and `HistogramResult`.
- **Minimal Dependencies**: Only requires NumPy (matplotlib optional for plotting).
Documentation is available at **[khiopsml.github.io/khisto-python](https://khiopsml.github.io/khisto-python/)**.

| Standard Gaussian | Heavy-tailed Pareto |
| --- | --- |
| ![Adaptive Gaussian histogram](docs/images/gaussian-quick-start.png) | ![Adaptive Pareto histogram](docs/images/pareto-quick-start.png) |

## Reproducing The Example Distributions

The complete runnable script is available in `scripts/generate_distribution_examples.py`.

Run it from the repository root to regenerate both example distributions and the figure files used in this README:

```bash
python scripts/generate_distribution_examples.py
```

## Installation

```bash
# (lines collapsed in the diff view)
pip install "khisto[matplotlib]"
```

@@ -47,85 +30,28 @@

## Quick Start

### NumPy-like API

```python
import numpy as np
from khisto import histogram

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Compute optimal histogram (drop-in replacement for np.histogram)
hist, bin_edges = histogram(data)

# With density normalization
density, bin_edges = histogram(data, density=True)

# Limit maximum number of bins
hist, bin_edges = histogram(data, max_bins=10)

# Specify range
hist, bin_edges = histogram(data, range=(-2, 2))
```

Using 10,000 samples keeps the adaptive refinement visible while remaining fast to compute.

Heavy-tailed example:

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a Pareto distribution, shifted to start at 1 for better log-log visualization
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1
# Generate 10,000 samples from a Normal distribution
normal_data = np.random.normal(size=10000)

# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data, density=True)
plt.xscale("log")
plt.yscale("log")
# Plot an adaptive histogram
n, bins, patches = hist(normal_data)
plt.show()
```

### Matplotlib Integration

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)
# Generate 10,000 samples from a Pareto distribution
long_tail_data = np.random.pareto(3, size=10000)

# Density is usually the most interpretable view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Cumulative density follows matplotlib semantics.
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data)
plt.xscale("symlog")
plt.yscale("log")
plt.show()
```

## How It Works

Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimal Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:

1. Analyzes the data distribution
2. Finds bin boundaries that minimize information loss
3. Creates variable-width bins that adapt to data density

This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.
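To make the idea of density-adapted bins concrete, here is a minimal NumPy-only sketch. It does not implement the MODL criterion: it uses equal-frequency (quantile) edges as a simple stand-in, and `equal_frequency_edges` is a hypothetical helper that is not part of Khisto. It only illustrates why variable-width bins come out narrow where data is dense and wide where data is sparse.

```python
import numpy as np


def equal_frequency_edges(data, n_bins):
    """Hypothetical helper: edges holding roughly equal counts (NOT the MODL algorithm)."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(data, quantiles)


rng = np.random.default_rng(0)
# Heavy-tailed sample, shifted to start at 1 like the Pareto example in this README.
data = rng.pareto(3, size=10_000) + 1

edges = equal_frequency_edges(data, n_bins=8)
counts, _ = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Bins near 1 (dense region) are narrow; the last bin (sparse tail) is wide.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:8.3f}, {right:8.3f}]  width={right - left:8.3f}  count={count}")
```

With Khisto the edges, and the number of bins itself, are chosen by the algorithm rather than fixed in advance as they are here.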

The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].

- [1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024
- [2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180:0-0, 2023

## Development

```bash
# (lines collapsed in the diff view: @@ -140,16 +66,6 @@)
uv sync --group dev --extra all
uv run pytest
```

## Documentation

Full documentation is hosted at **[khiops.github.io/khisto-python](https://khiops.github.io/khisto-python/)**.

- [API Reference](https://khiops.github.io/khisto-python/array/histogram/index.html) — NumPy-like histogram API
- [Matplotlib Integration](https://khiops.github.io/khisto-python/matplotlib/index.html) — `hist` plotting function
- [Core API](https://khiops.github.io/khisto-python/core/index.html) — full access to histogram granularity levels
- [API Comparison](https://khiops.github.io/khisto-python/api_comparison.html) — side-by-side with NumPy and Matplotlib
- [Demo Notebook](https://khiops.github.io/khisto-python/demo.html) — interactive walkthrough

## License

[BSD 3-Clause Clear License](LICENSE)
246 changes: 246 additions & 0 deletions docs/api_comparison.md
@@ -0,0 +1,246 @@
# API Comparison

This document compares the current Khisto APIs with NumPy and Matplotlib.

## NumPy Comparison

### `numpy.histogram` vs `khisto.histogram`

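Khisto's `histogram` function is designed as a near drop-in replacement for `numpy.histogram`: it returns the same `(hist, bin_edges)` pair, but it takes `max_bins` instead of `bins` and does not accept `weights`.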

#### Signature Comparison

```python
# NumPy
numpy.histogram(
    a,
    bins=10,
    range=None,
    density=None,
    weights=None,
)

# Khisto
khisto.histogram(
    a,
    range=None,
    max_bins=None,
    density=False,
)
```

#### Key Differences

| Feature | NumPy | Khisto |
|---|---|---|
| **Binning method** | Equal-width bins, or user-supplied edges | Optimal variable-width bins |
| **Bins parameter** | `bins` (int or edges) | `max_bins` (optional limit) |
| **Default bins** | 10 equal-width bins | Determined automatically |
| **Weights support** | Yes | No |
| **Returns** | `(hist, bin_edges)` | `(hist, bin_edges)` |
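One consequence of variable-width bins is worth spelling out: with `density=True`, each value is the bin count divided by the total count times the bin width, so it is the bin areas, not the bar heights, that sum to one. The sketch below checks this with plain `numpy.histogram` and hand-picked non-uniform edges. The edges are illustrative values, not Khisto output.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0, 1, 1000)

# Hand-picked non-uniform edges, for illustration only.
edges = np.array([-5.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 5.0])

counts, _ = np.histogram(data, bins=edges)
density, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

# Density is the count divided by (number of samples in range * bin width).
manual_density = counts / (counts.sum() * widths)

# The areas of the bars, not their heights, sum to one.
total_area = float(np.sum(density * widths))
print(np.allclose(density, manual_density), total_area)
```

The same relationship should hold for any `(hist, bin_edges)` pair computed with `density=True`, whichever library chose the edges.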

#### Usage Comparison

```python
import numpy as np
from khisto import histogram

data = np.random.normal(0, 1, 1000)

# NumPy - fixed 10 bins
np_hist, np_edges = np.histogram(data)

# Khisto - optimal bins (automatic)
khisto_hist, khisto_edges = histogram(data)

# NumPy - specified bin count
np_hist, np_edges = np.histogram(data, bins=20)

# Khisto - maximum bin count
khisto_hist, khisto_edges = histogram(data, max_bins=20)

# Both support density normalization
np_density, _ = np.histogram(data, density=True)
khisto_density, _ = histogram(data, density=True)

# Both support range specification
np_hist, _ = np.histogram(data, range=(-2, 2))
khisto_hist, _ = histogram(data, range=(-2, 2))
```

#### When to Use Each

| Use NumPy | Use Khisto |
|---|---|
| Need fixed-width bins | Want optimal data representation |
| Need weighted histograms | Want automatic bin selection |
| Need specific bin edges | Want adaptive bin widths |
| Performance-critical loops | Data visualization |

---

## Matplotlib Comparison

### `matplotlib.pyplot.hist` vs `khisto.matplotlib.hist`

Khisto's `hist` function works similarly to matplotlib's `hist`, but with optimal binning.

#### Signature Comparison

```python
# Matplotlib
matplotlib.pyplot.hist(
    x,
    bins=None,  # falls back to rcParams["hist.bins"], which is 10 by default
    range=None,
    density=False,
    weights=None,
    cumulative=False,
    bottom=None,
    histtype='bar',
    align='mid',
    orientation='vertical',
    rwidth=None,
    log=False,
    color=None,
    label=None,
    stacked=False,
    **kwargs,
)

# Khisto
khisto.matplotlib.hist(
    x,
    range=None,
    max_bins=None,
    density=False,
    cumulative=False,
    histtype='bar',
    orientation='vertical',
    log=False,
    color=None,
    label=None,
    ax=None,
    edgecolor=None,
    linewidth=None,
    alpha=None,
    **kwargs,
)
```

#### Key Differences

| Feature | Matplotlib | Khisto |
|---|---|---|
| **Binning** | Equal-width by default | Optimal variable-width |
| **Bins param** | `bins` | `max_bins` |
| **Axes param** | Implicit (current) | Optional `ax` parameter |
| **Cumulative** | Supported | Supported |
| **Reverse cumulative** | Supported with negative `cumulative` | Supported with negative `cumulative` |
| **Stacked** | Supported | Not supported |
| **Weights** | Supported | Not supported |
| **Unsupported histogram args** | None | `bins`, `stacked`, and `weights` raise a `TypeError` |
| **Multiple datasets** | Supported | Not supported; only 1-D arrays are accepted |
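The cumulative rows above follow Matplotlib's documented semantics: with `density=True` the bar heights are first converted to per-bin probabilities (height times width) and then accumulated, and a negative `cumulative` accumulates from the right. The sketch below reproduces that arithmetic with NumPy. `cumulative_from_histogram` is a hypothetical helper written for this illustration, and whether Khisto computes the values this way internally is an assumption.

```python
import numpy as np


def cumulative_from_histogram(values, edges, density=False, reverse=False):
    """Hypothetical helper mirroring matplotlib's cumulative semantics."""
    values = np.asarray(values, dtype=float)
    if density:
        # Convert density heights into per-bin probabilities before accumulating.
        values = values * np.diff(edges)
    if reverse:
        # Accumulate from the last bin towards the first.
        return np.cumsum(values[::-1])[::-1]
    return np.cumsum(values)


# Small hand-made histogram with variable-width bins.
edges = np.array([0.0, 1.0, 3.0, 4.0])
counts = np.array([2, 5, 3])
density = counts / (counts.sum() * np.diff(edges))

forward = cumulative_from_histogram(density, edges, density=True)
backward = cumulative_from_histogram(density, edges, density=True, reverse=True)
reverse_counts = cumulative_from_histogram(counts, edges, reverse=True)
print(forward, backward, reverse_counts)
```

With variable-width bins the multiplication by the bin width matters: accumulating raw density heights would not end at one.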

#### Usage Comparison

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

data = np.random.normal(0, 1, 1000)

# Matplotlib - fixed bins
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(data, bins=30)
ax1.set_title("Matplotlib (30 bins)")

hist(data, ax=ax2)
ax2.set_title("Khisto (optimal bins)")

plt.tight_layout()
plt.show()
```

#### Common Parameters (Same Behavior)

```python
# Both support these parameters identically:

# density normalization
plt.hist(data, density=True)
hist(data, density=True)

# cumulative view
plt.hist(data, density=True, cumulative=True)
hist(data, density=True, cumulative=True)

# reverse cumulative view
plt.hist(data, cumulative=-1)
hist(data, cumulative=-1)

# histogram type
plt.hist(data, histtype='step')
hist(data, histtype='step')

# orientation
plt.hist(data, orientation='horizontal')
hist(data, orientation='horizontal')

# log scale
plt.hist(data, log=True)
hist(data, log=True)

# color and label
plt.hist(data, color='blue', label='Data')
hist(data, color='blue', label='Data')
```

---

## Migration Guide

### From NumPy

```python
# Before (NumPy)
import numpy as np
hist, edges = np.histogram(data, bins=30)

# After (Khisto)
from khisto import histogram
hist, edges = histogram(data, max_bins=30) # max_bins is optional
```
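Because Khisto does not accept `weights`, code that sometimes needs weighted histograms may want to keep a NumPy fallback during migration. The wrapper below is one possible sketch. `histogram_compat` is a hypothetical name, the only Khisto call it makes is `histogram(a, max_bins=..., ...)` as shown in the signature comparison above, and it falls back to `numpy.histogram` when Khisto is not installed.

```python
import numpy as np

try:
    from khisto import histogram as khisto_histogram  # optional dependency
except ImportError:
    khisto_histogram = None


def histogram_compat(data, max_bins=None, weights=None, **kwargs):
    """Hypothetical wrapper: use Khisto when possible, NumPy otherwise.

    Extra keyword arguments (for example `range` or `density`) are
    forwarded unchanged, since both functions accept them.
    """
    if weights is not None or khisto_histogram is None:
        # NumPy needs a concrete bin count; reuse max_bins or its default of 10.
        bins = 10 if max_bins is None else max_bins
        return np.histogram(data, bins=bins, weights=weights, **kwargs)
    return khisto_histogram(data, max_bins=max_bins, **kwargs)


rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

hist, edges = histogram_compat(data, max_bins=30)
weighted_hist, weighted_edges = histogram_compat(
    data, max_bins=30, weights=np.ones_like(data)
)
```

Note that the fallback treats `max_bins` as an exact bin count, which is not what Khisto does with it. The wrapper keeps call sites uniform but does not make the two results equivalent.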

### From Matplotlib

```python
# Before (Matplotlib)
import matplotlib.pyplot as plt
n, bins, patches = plt.hist(data, bins=30)

# After (Khisto)
from khisto.matplotlib import hist
n, bins, patches = hist(data, max_bins=30) # max_bins is optional
```

---

## Feature Matrix

| Feature | NumPy | Matplotlib | Khisto |
|---|---|---|---|
| Fixed-width bins | Yes | Yes | No |
| Optimal bins | No | No | Yes |
| Variable-width bins | Manual | Manual | Auto |
| Density | Yes | Yes | Yes |
| Range | Yes | Yes | Yes |
| Weights | Yes | Yes | No |
| Cumulative | No | Yes | Yes |
| Reverse cumulative | No | Yes | Yes |
| Plotting | No | Yes | Yes |
| Step histogram | No | Yes | Yes |
| Horizontal | No | Yes | Yes |
| Log scale | No | Yes | Yes |