Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval

Paper: Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study (paper.pdf)

General-purpose text embeddings (e.g. OpenAI text-embedding-3-small) are trained on broad web data, so their dimensions encode variation across many knowledge domains. When they are used for retrieval in a narrow domain, much of that variation is irrelevant and acts as cross-domain noise. This project shows that fitting PCA on a domain-specific document corpus and projecting embeddings onto the top-k principal components improved retrieval quality in the medical-domain experiments reported here, with no fine-tuning required.

Best result: PCA-32 (corpus-only fit) on a 20-topic medical corpus achieves MAP 0.9203 vs 0.8750 for the 1536-dim baseline (+5.2% relative), a similarity gap about 2.5× the baseline's (0.615 vs 0.250), and 48× storage compression.
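As a quick sanity check, the headline figures can be recomputed from the values in the results table. This is a minimal sketch; the variable names are illustrative and are not taken from the experiment scripts.

```python
# Values copied from the v2 results table (baseline vs. PCA-32, corpus-only fit).
baseline_map, pca_map = 0.8750, 0.9203
baseline_gap, pca_gap = 0.250, 0.615
full_dims, reduced_dims = 1536, 32

# The MAP gain is relative to the baseline, not an absolute difference.
relative_map_gain = (pca_map - baseline_map) / baseline_map
gap_ratio = pca_gap / baseline_gap
compression = full_dims / reduced_dims

print(f"MAP gain (relative): {relative_map_gain:+.1%}")
print(f"Similarity gap ratio: {gap_ratio:.2f}x")
print(f"Compression: {compression:.0f}x")
```

The absolute MAP difference is 0.0453, which corresponds to the relative gain of roughly 5.2% quoted above.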

Key Findings

Hypothesis	Result
H1: Optimal PCA dim ≈ 1.6× number of topics	Confirmed (32 dims for 20 topics)
H2: PCA outperforms Random Projection	Confirmed: PCA wins at all 8 tested dimensionalities; domain axes are essential
H3: Corpus-only fit outperforms query+corpus fit	Confirmed; best overall result
H4: PCA more robust to hard negatives	Not confirmed; baseline slightly more robust
H5: PCA gain grows (not shrinks) with corpus diversity	Confirmed; largest gain at 20 topics
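The H1 result suggests a simple starting point for choosing the number of components. The sketch below turns it into a helper. Both `suggest_pca_dim` and its parameters are hypothetical, and the 1.6 ratio was only observed on this 20-topic corpus, so treat the output as a first guess to validate on your own data.

```python
def suggest_pca_dim(n_topics: int, ratio: float = 1.6, max_dim: int = 1536) -> int:
    """Hypothetical helper: starting guess for PCA n_components from a topic count.

    `ratio` defaults to the value observed in this study (32 dims for 20 topics).
    `max_dim` caps the result at the embedding dimension.
    """
    if n_topics < 1:
        raise ValueError("n_topics must be positive")
    return min(max_dim, max(1, round(ratio * n_topics)))


print(suggest_pca_dim(20))  # 20 topics, as in the v2 corpus
```

Note that scikit-learn's PCA also cannot use more components than there are corpus documents, so a small corpus may force a lower value than this heuristic suggests.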

Quickstart

pip install openai scikit-learn numpy pandas matplotlib seaborn
export OPENAI_API_KEY=sk-...

# Run the full experiment (v2: 20 topics, 5 hypotheses)
python3 pca_experiment_v2.py

# Or the original smaller experiment (10 topics)
python3 pca_experiment.py

Embeddings are cached in embeddings_cache.json after the first run, so subsequent runs reuse them and incur no further API cost.
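The caching idea can be sketched as follows. This is an illustration under stated assumptions, not the layout the experiment scripts actually use: the cache format (SHA-256 of the text mapped to its vector) and the `embed_fn` callback are both hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cached_embed(texts, embed_fn, cache_path="embeddings_cache.json"):
    """Return one vector per text, calling embed_fn only for texts not yet cached.

    embed_fn is a hypothetical callback: it takes a list of strings and
    returns a list of vectors (sequences of floats) in the same order.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect uncached texts once each, preserving first-seen order.
    missing = {}
    for text, key in zip(texts, keys):
        if key not in cache:
            missing[key] = text

    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vec in zip(missing, vectors):
            cache[key] = [float(x) for x in vec]  # plain floats so JSON can store them
        path.write_text(json.dumps(cache))

    return [cache[key] for key in keys]
```

Keying on a hash of the text means edited documents are re-embedded automatically, while unchanged ones are served from disk.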

Repository Structure

pca_embedding/
├── pca_experiment.py          # v1: 10 topics × 6 docs, baseline experiment
├── pca_experiment_v2.py       # v2: 20 topics × 15 docs, 5 hypotheses
├── pca_results.csv            # v1 raw results
├── pca_results_v2.csv         # v2 raw results
└── paper.pdf                  # Compiled PDF

How It Works

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import numpy as np

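# 1. Embed your domain corpus and queries (e.g. with text-embedding-3-small).
#    The two arrays below are inputs you supply:
#    corpus_embeddings: np.ndarray, shape (N, 1536)
#    query_embeddings:  np.ndarray, shape (Q, 1536)

# 2. Fit PCA on corpus only (not queries); requires N >= n_components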
pca = PCA(n_components=32, random_state=42)
pca.fit(corpus_embeddings)

# 3. Project at index time and query time
corpus_reduced = pca.transform(corpus_embeddings)   # (N, 32)
query_reduced  = pca.transform(query_embeddings)    # (Q, 32)

# 4. Retrieve with cosine similarity
corpus_norm = normalize(corpus_reduced)
query_norm  = normalize(query_reduced)
scores = query_norm @ corpus_norm.T                 # (Q, N)
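Once the score matrix is available, ranking reduces to sorting each row. The sketch below shows one way to take the top-k documents per query; `top_k` and the toy matrix are illustrative and do not come from the experiment scripts.

```python
import numpy as np


def top_k(scores: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the indices of the k highest-scoring documents per query, best first."""
    k = min(k, scores.shape[1])
    # Sorting the negated scores gives descending order; a stable sort keeps
    # tied documents in their original index order.
    order = np.argsort(-scores, axis=1, kind="stable")
    return order[:, :k]


# Toy score matrix standing in for `scores`: 2 queries x 4 documents.
toy_scores = np.array([[0.1, 0.9, 0.4, 0.7],
                       [0.8, 0.2, 0.6, 0.3]])
print(top_k(toy_scores, k=2))
```

For a large corpus, `np.argpartition` followed by a sort of the selected columns avoids sorting every row in full.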

Corpus

	v1	v2
Topics	10	20
Docs / topic	6	15
Noise docs	8	20
Hard negatives	—	10
Queries	10	20
Total embedded	78	340

Clinical topics covered: Type 2 Diabetes, Myocardial Infarction, Antibiotic Resistance, Pulmonary Embolism, ACE Inhibitors, Stroke, COVID-19 Vaccines, Chronic Kidney Disease, Asthma, Major Depressive Disorder, Parkinson's Disease, Alzheimer's Disease, Rheumatoid Arthritis, Sepsis, Liver Cirrhosis, Tuberculosis, Breast Cancer, Schizophrenia, Inflammatory Bowel Disease, Thyroid Disorders.

Results Summary (v2, 20 topics)

Dims	Method	MAP	NDCG@10	SimGap	Compression
1536	Baseline	0.8750	0.9235	0.250	1×
32	PCA corpus-only	0.9203	0.9498	0.615	48×
48	PCA corpus-only	0.9142	0.9525	0.554	32×
64	PCA corpus-only	0.9137	0.9465	0.512	24×
32	Random Projection	0.3582	0.4356	0.237	48×
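For readers unfamiliar with the MAP and NDCG@10 columns, the sketch below shows the standard definitions with binary relevance. It is an assumption that the experiment scripts compute the metrics this way; the function names are illustrative.

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank over the ranks at which a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, relevance):
    """MAP: average precision averaged over queries."""
    pairs = list(zip(rankings, relevance))
    return sum(average_precision(r, rel) for r, rel in pairs) / len(pairs)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: documents 3 and 2 are relevant; the ranking places them 1st and 3rd.
ranking, relevant = [3, 1, 2, 0], {3, 2}
print(average_precision(ranking, relevant))
print(ndcg_at_k(ranking, relevant, k=10))
```

Both metrics reward placing relevant documents early; NDCG discounts each hit logarithmically by rank, whereas average precision uses the precision at each hit.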

Requirements

Python 3.10+
openai — embedding API
scikit-learn — PCA
numpy, pandas — data handling
matplotlib, seaborn — figures

pip install openai scikit-learn numpy pandas matplotlib seaborn

Citation

If you use this work, please cite:

@article{giri2026pca,
  title   = {Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval:
             A Medical Domain Study},
  author  = {Giri, Sandeep},
  year    = {2026},
  url     = {https://github.com/sandeepgiri/pca_embedding}
}

License

MIT
