Paper: Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study
paper.pdf
General-purpose text embeddings (e.g. OpenAI text-embedding-3-small) are trained on broad web data, encoding variation across all knowledge domains. When used for retrieval in a narrow domain, most dimensions carry cross-domain noise. This project shows that fitting PCA on a domain-specific document corpus and projecting embeddings into the top-k principal components consistently improves retrieval quality — with no fine-tuning required.
Best result: PCA-32 (corpus-only fit) on a 20-topic medical corpus achieves MAP 0.9203 vs 0.8750 baseline (+5.2%), similarity gap 2.5×, and 48× storage compression.
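The headline figures follow directly from the numbers reported in the results table. As a quick sanity check (variable names here are illustrative and not taken from the experiment scripts), the arithmetic can be reproduced like this:

```python
# Figures copied from the results reported in this README.
full_dims, pca_dims = 1536, 32
baseline_map, pca_map = 0.8750, 0.9203
baseline_gap, pca_gap = 0.250, 0.615

# The MAP gain is relative to the baseline, not an absolute difference.
relative_map_gain = (pca_map - baseline_map) / baseline_map
gap_ratio = pca_gap / baseline_gap
compression = full_dims / pca_dims

print(f"MAP gain: {relative_map_gain:+.1%}")
print(f"Similarity gap ratio: {gap_ratio:.1f}x")
print(f"Compression: {compression:.0f}x")
```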
| Hypothesis | Result |
|---|---|
| H1: Optimal PCA dim ≈ 1.6× number of topics | Confirmed (32 dims for 20 topics) |
| H2: PCA vs Random Projection | PCA wins at all 8 tested dimensionalities; domain axes are essential |
| H3: Corpus-only fit outperforms query+corpus fit | Confirmed — best overall result |
| H4: PCA more robust to hard negatives | Not confirmed — baseline slightly more robust |
| H5: PCA gain grows (not shrinks) with corpus diversity | Confirmed — largest gain at 20 topics |
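A rough way to try the H1 to H3 comparisons yourself is sketched below. It is not the code from `pca_experiment_v2.py`: the data is synthetic (random topic centroids plus noise), `suggested_dims` and `top1_accuracy` are hypothetical helpers, and `GaussianRandomProjection` is assumed as the random-projection baseline. Results on this toy data say nothing about the medical corpus.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.random_projection import GaussianRandomProjection


def suggested_dims(n_topics: int, ratio: float = 1.6) -> int:
    """H1 rule of thumb: target dimensionality is about ratio x number of topics."""
    return int(round(ratio * n_topics))


def top1_accuracy(queries, corpus, query_topics, corpus_topics) -> float:
    """Fraction of queries whose nearest corpus vector (cosine) shares their topic."""
    scores = normalize(queries) @ normalize(corpus).T
    nearest = scores.argmax(axis=1)
    return float(np.mean(corpus_topics[nearest] == query_topics))


# Synthetic stand-in data (assumption: NOT the medical corpus from the paper).
rng = np.random.default_rng(0)
n_topics, docs_per_topic, full_dim = 20, 15, 256
centroids = rng.normal(size=(n_topics, full_dim))
corpus_topics = np.repeat(np.arange(n_topics), docs_per_topic)
corpus = centroids[corpus_topics] + 2.0 * rng.normal(size=(corpus_topics.size, full_dim))
query_topics = np.arange(n_topics)
queries = centroids[query_topics] + 2.0 * rng.normal(size=(n_topics, full_dim))

k = suggested_dims(n_topics)

# Both projections are fitted on the corpus only, never on the queries (H3).
pca = PCA(n_components=k, random_state=42).fit(corpus)
rp = GaussianRandomProjection(n_components=k, random_state=42).fit(corpus)

results = {
    "full": top1_accuracy(queries, corpus, query_topics, corpus_topics),
    "pca": top1_accuracy(pca.transform(queries), pca.transform(corpus),
                         query_topics, corpus_topics),
    "random_projection": top1_accuracy(rp.transform(queries), rp.transform(corpus),
                                       query_topics, corpus_topics),
}
print(results)
```

Swapping the synthetic arrays for real corpus and query embeddings, and top-1 accuracy for MAP, would bring this closer to the comparison reported in the table.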
```bash
pip install openai scikit-learn numpy pandas matplotlib seaborn
export OPENAI_API_KEY=sk-...

# Run the full experiment (v2: 20 topics, 5 hypotheses)
python3 pca_experiment_v2.py

# Or the original smaller experiment (10 topics)
python3 pca_experiment.py
```
Embeddings are cached in embeddings_cache.json after the first run — subsequent runs are free.
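The caching idea can be sketched as follows. This is an illustration only, not the logic used in the experiment scripts: `cached_embeddings` and `embed_fn` are hypothetical names, and the cache layout (a JSON object mapping each text to its vector) is an assumption about how `embeddings_cache.json` might be organised.

```python
import json
from pathlib import Path
from typing import Callable


def cached_embeddings(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    cache_path: str = "embeddings_cache.json",
) -> list[list[float]]:
    """Return embeddings for texts, calling embed_fn only for texts not yet cached."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    # De-duplicate while keeping order, then embed only what is missing.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = list(vector)
        path.write_text(json.dumps(cache))

    return [cache[t] for t in texts]
```

In practice `embed_fn` would wrap the embedding API call (for example, a function that sends a batch of texts to text-embedding-3-small and returns one vector per text), so only the first run incurs API cost.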
```
pca_embedding/
├── pca_experiment.py      # v1: 10 topics × 6 docs, baseline experiment
├── pca_experiment_v2.py   # v2: 20 topics × 15 docs, 5 hypotheses
├── pca_results.csv        # v1 raw results
├── pca_results_v2.csv     # v2 raw results
├── paper.pdf              # Compiled PDF
```
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import numpy as np

# 1. Embed your domain corpus and queries (these arrays must already exist):
#    corpus_embeddings: np.ndarray of shape (N, 1536), with N >= 32
#    query_embeddings:  np.ndarray of shape (Q, 1536)

# 2. Fit PCA on corpus only (not queries)
pca = PCA(n_components=32, random_state=42)
pca.fit(corpus_embeddings)

# 3. Project at index time and query time
corpus_reduced = pca.transform(corpus_embeddings)  # (N, 32)
query_reduced = pca.transform(query_embeddings)    # (Q, 32)

# 4. Retrieve with cosine similarity
corpus_norm = normalize(corpus_reduced)
query_norm = normalize(query_reduced)
scores = query_norm @ corpus_norm.T  # (Q, N)
```

| | v1 | v2 |
|---|---|---|
| Topics | 10 | 20 |
| Docs / topic | 6 | 15 |
| Noise docs | 8 | 20 |
| Hard negatives | — | 10 |
| Queries | 10 | 20 |
| Total embedded | 78 | 340 |
Clinical topics covered: Type 2 Diabetes, Myocardial Infarction, Antibiotic Resistance, Pulmonary Embolism, ACE Inhibitors, Stroke, COVID-19 Vaccines, Chronic Kidney Disease, Asthma, Major Depressive Disorder, Parkinson's Disease, Alzheimer's Disease, Rheumatoid Arthritis, Sepsis, Liver Cirrhosis, Tuberculosis, Breast Cancer, Schizophrenia, Inflammatory Bowel Disease, Thyroid Disorders.
| Dims | Method | MAP | NDCG@10 | SimGap | Compression |
|---|---|---|---|---|---|
| 1536 | Baseline | 0.8750 | 0.9235 | 0.250 | 1× |
| 32 | PCA corpus-only | 0.9203 | 0.9498 | 0.615 | 48× |
| 48 | PCA corpus-only | 0.9142 | 0.9525 | 0.554 | 32× |
| 64 | PCA corpus-only | 0.9137 | 0.9465 | 0.512 | 24× |
| 32 | Random Projection | 0.3582 | 0.4356 | 0.237 | 48× |
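For readers who want to score their own runs, MAP can be computed from a query-by-document score matrix as sketched below. This uses the standard definition of average precision; the experiment scripts may differ in details such as tie handling, and the function names and toy data here are illustrative.

```python
import numpy as np


def average_precision(scores: np.ndarray, relevant: np.ndarray) -> float:
    """AP for one query: mean of precision@rank taken at the ranks of relevant docs."""
    order = np.argsort(-scores)  # best-scoring document first
    hits = relevant[order].astype(bool)
    if not hits.any():
        return 0.0
    precision_at_rank = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_rank[hits].mean())


def mean_average_precision(score_matrix: np.ndarray, relevance_matrix: np.ndarray) -> float:
    """MAP: average of per-query AP over all queries."""
    aps = [average_precision(s, r) for s, r in zip(score_matrix, relevance_matrix)]
    return float(np.mean(aps))


# Toy example: 2 queries x 4 documents (made-up scores, not experiment data).
toy_scores = np.array([[0.9, 0.2, 0.8, 0.1],
                       [0.3, 0.7, 0.6, 0.5]])
toy_relevance = np.array([[1, 0, 1, 0],
                          [1, 0, 0, 1]])
print(mean_average_precision(toy_scores, toy_relevance))
```

The `scores` matrix produced by the cosine-similarity step of the usage snippet can be passed in directly, together with a 0/1 relevance matrix of the same shape.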
- Python 3.10+
- openai: embedding API
- scikit-learn: PCA
- numpy, pandas: data handling
- matplotlib, seaborn: figures
```bash
pip install openai scikit-learn numpy pandas matplotlib seaborn
```

If you use this work, please cite:

```bibtex
@article{giri2026pca,
  title  = {Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval:
            A Medical Domain Study},
  author = {Giri, Sandeep},
  year   = {2026},
  url    = {https://github.com/sandeepgiri/pca_embedding}
}
```

License: MIT