🧠 Data Discrepancy Resolver (with Azure OpenAI)

This solution compares two CSV datasets and resolves possible data mismatches using AI. It leverages vector embeddings and GPT-based reasoning to assess record similarity.

✨ Features

Upload or use sample CSV datasets
Generate embeddings with Azure OpenAI (text-embedding-3-small)
Evaluate entity similarity using the cosine similarity of the embeddings
Get human-like judgment and explanation via GPT-4o (chat completion)

🛠 Azure Setup (Azure OpenAI)

To use this project with Azure OpenAI, follow these steps:

1. Request Access

If you haven't already, request access to Azure OpenAI:
👉 https://aka.ms/oai/access

2. Create Azure OpenAI Resource

Go to Azure Portal → Create a resource → Search for "Azure OpenAI"
Select Azure OpenAI and create the resource.
After the resource is created, open it and select "Explore Azure AI Foundry portal".

3. Deploy Required Models

Inside Azure AI Foundry portal:

Go to Deployments
Deploy the following models:
- text-embedding-3-small (for embeddings)
- gpt-4o (for reasoning via chat completion)

Note the deployment names and your resource endpoint; you'll need them in the app config.

4. Update `appsettings.json`

In the solution, edit appsettings.json:

{
  "OpenAI": {
    "ApiKey": "<your-azure-openai-api-key>",
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview"
  }
}
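
As a rough illustration of how these values fit together, the sketch below parses the `OpenAI` section and builds the request URLs for the two deployments. It is written in Python for illustration only and is not the application's own configuration code; the helper names `load_openai_settings` and `deployment_url` are hypothetical. The URL layout follows the Azure OpenAI REST convention of `/openai/deployments/<deployment>/<operation>?api-version=<version>`, with the key sent in the `api-key` request header.

```python
import json

# Same shape as the appsettings.json shown above (placeholders left as-is).
SAMPLE_SETTINGS = """
{
  "OpenAI": {
    "ApiKey": "<your-azure-openai-api-key>",
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview"
  }
}
"""

REQUIRED_KEYS = ("ApiKey", "Endpoint", "EmbeddingDeployment", "ChatDeployment", "ApiVersion")


def load_openai_settings(text: str) -> dict:
    """Hypothetical helper: return the "OpenAI" section, failing fast on gaps."""
    settings = json.loads(text)["OpenAI"]
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"Missing OpenAI settings: {', '.join(missing)}")
    return settings


def deployment_url(settings: dict, deployment: str, operation: str) -> str:
    """Hypothetical helper: build the REST URL for one deployment and operation."""
    base = settings["Endpoint"].rstrip("/")  # tolerate a trailing slash in the config
    return f"{base}/openai/deployments/{deployment}/{operation}?api-version={settings['ApiVersion']}"


settings = load_openai_settings(SAMPLE_SETTINGS)
embeddings_url = deployment_url(settings, settings["EmbeddingDeployment"], "embeddings")
chat_url = deployment_url(settings, settings["ChatDeployment"], "chat/completions")
print(embeddings_url)
print(chat_url)
```

Note that `EmbeddingDeployment` and `ChatDeployment` are the deployment names you chose in step 3, which may differ from the model names.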

🔍 Similarity Threshold Optimization

To reduce costs and improve performance, the application uses cosine similarity between text embeddings (generated by the text-embedding-3-small model) to determine whether a pair of records is similar enough to justify a more detailed GPT-based evaluation.

Cosine similarity can range from -1 to 1 in general; scores for these text embeddings are expected to fall between 0 and 1 and are read as follows:

- 1.0 = Perfect match (identical semantics)
- 0.7–1.0 = Likely to be a match
- 0.4–0.7 = Uncertain match
- < 0.4 = Unlikely to be a match
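
The score and the bands above can be sketched as follows. This is a minimal Python illustration, not the application's code: the function names are hypothetical, the vectors are toy values rather than real embeddings, and the handling of the exact boundaries (0.7 counted as "likely", 0.4 counted as "uncertain") is an assumption, since the list does not say which band they belong to.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_band(score: float) -> str:
    """Map a score to the bands listed above (boundary handling is an assumption)."""
    if score >= 0.7:
        return "Likely to be a match"
    if score >= 0.4:
        return "Uncertain match"
    return "Unlikely to be a match"


# Toy vectors standing in for two record embeddings.
record_a = [1.0, 2.0, 3.0]
record_b = [1.0, 2.0, 2.5]
score = cosine_similarity(record_a, record_b)
print(round(score, 3), similarity_band(score))
```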

⛔ Skipping Low-Value Comparisons

If the similarity score falls below a configurable threshold (e.g., 0.5), the system skips the expensive GPT call and returns an automatic judgment of "No Match".
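
A minimal sketch of this gate, in Python for illustration only. The names `resolve_pair` and `fake_judge` are hypothetical, the 0.5 default is the example value from the text, and the stub stands in for the real GPT-4o call so the snippet runs without any Azure credentials.

```python
from typing import Callable, List

DEFAULT_THRESHOLD = 0.5  # example value from the text; configurable in practice


def resolve_pair(
    score: float,
    judge: Callable[[], str],
    threshold: float = DEFAULT_THRESHOLD,
) -> str:
    """Return "No Match" without calling `judge` when the score is below the threshold."""
    if score < threshold:
        return "No Match"
    # Only pairs that pass the cheap embedding check reach the expensive call.
    return judge()


calls: List[str] = []  # records how often the stubbed GPT call is made


def fake_judge() -> str:
    """Stub standing in for the chat-completion request (assumption for the demo)."""
    calls.append("called")
    return "Match (stubbed)"


low_result = resolve_pair(0.3, fake_judge)    # below threshold: judge is skipped
high_result = resolve_pair(0.82, fake_judge)  # above threshold: judge is invoked
print(low_result, high_result, len(calls))
```

Passing the judge as a callable keeps the threshold check independent of how the chat request is made, so the gate can be tested without network access.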
