talentconsulting/OpenAiTestProject

🧠 Data Discrepancy Resolver (with Azure OpenAI)

This solution compares two CSV datasets and resolves possible data mismatches using AI. It leverages vector embeddings and GPT-based reasoning to assess record similarity.

✨ Features

  • Upload or use sample CSV datasets
  • Generate embeddings with Azure OpenAI (text-embedding-3-small)
  • Evaluate entity similarity using cosine similarity between embeddings
  • Get human-like judgment and explanation via GPT-4o (chat completion)

🛠 Azure Setup (Azure OpenAI)

To use this project with Azure OpenAI, follow these steps:

1. Request Access

If you haven't already, request access to Azure OpenAI:
👉 https://aka.ms/oai/access


2. Create Azure OpenAI Resource

  1. Go to Azure Portal → Create a resource → Search for "Azure OpenAI"
  2. Select Azure OpenAI and create the resource.
  3. After the resource is created, open it and select 'Explore Azure AI Foundry portal'

3. Deploy Required Models

Inside Azure AI Foundry portal:

  1. Go to Deployments
  2. Deploy the following models:
    • text-embedding-3-small (for embeddings)
    • gpt-4o (for reasoning via chat completion)

Note the deployment names and your resource endpoint; you'll need both in the app config.


4. Update appsettings.json

In the solution, edit appsettings.json:

{
  "OpenAI": {
    "ApiKey": "<your-azure-openai-api-key>",
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview"
  }
}
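As a rough illustration of how these settings fit together, the sketch below parses the "OpenAI" section and builds the two Azure OpenAI REST URLs that the deployments are reached through. It is written in Python purely for illustration and is not the project's own code; the helper names `load_openai_settings` and `deployment_url` and the sample resource name `example-resource` are assumptions.

```python
import json

# Sample config mirroring the appsettings.json layout above.
# "example-resource" and the key value are placeholders, not real values.
SAMPLE_SETTINGS = """
{
  "OpenAI": {
    "ApiKey": "placeholder-key",
    "Endpoint": "https://example-resource.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview"
  }
}
"""

REQUIRED_KEYS = ("ApiKey", "Endpoint", "EmbeddingDeployment",
                 "ChatDeployment", "ApiVersion")


def load_openai_settings(text: str) -> dict:
    """Parse the JSON text and return its "OpenAI" section (hypothetical helper)."""
    settings = json.loads(text)["OpenAI"]
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError("missing OpenAI settings: " + ", ".join(missing))
    return settings


def deployment_url(settings: dict, deployment_key: str, operation: str) -> str:
    """Build the REST URL for one deployment and operation (hypothetical helper)."""
    base = settings["Endpoint"].rstrip("/")  # avoid a double slash
    deployment = settings[deployment_key]
    version = settings["ApiVersion"]
    return f"{base}/openai/deployments/{deployment}/{operation}?api-version={version}"


settings = load_openai_settings(SAMPLE_SETTINGS)
embeddings_url = deployment_url(settings, "EmbeddingDeployment", "embeddings")
chat_url = deployment_url(settings, "ChatDeployment", "chat/completions")
```

Requests to these URLs carry the key in an `api-key` header. Note that the URL uses the deployment name, which may differ from the model name if you chose a custom name when deploying.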

🔍 Similarity Threshold Optimization

To reduce costs and improve performance, the application uses cosine similarity between text embeddings (generated by the text-embedding-3-small model) to determine whether a pair of records is similar enough to justify a more detailed GPT-based evaluation.

Cosine similarity can range from -1 to 1 in general, but scores for these text embeddings typically fall between 0 and 1 and are interpreted as follows:

- 1.0 = Perfect match (identical semantics)
- 0.7–1.0 = Likely to be a match
- 0.4–0.7 = Uncertain match
- < 0.4 = Unlikely to be a match
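The score and the bands above can be sketched in a few lines. This is a minimal Python illustration, not the project's implementation; the function names are hypothetical, and assigning the boundary values 0.7 and 0.4 to the higher band is an assumption, since the list does not say which side they belong to.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def describe_score(score: float) -> str:
    """Map a score to the bands listed above (boundary handling is assumed)."""
    if score >= 0.7:
        return "likely match"
    if score >= 0.4:
        return "uncertain match"
    return "unlikely match"


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because cosine similarity depends only on direction, the first pair scores as a perfect match even though the vectors differ in length, while the orthogonal pair scores 0.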

⛔ Skipping Low-Value Comparisons

If the similarity score falls below a configurable threshold (e.g., 0.5), the system skips the expensive GPT call and returns an automatic judgment of "No Match".
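The gating logic might look like the following Python sketch. It is an illustration under stated assumptions rather than the project's actual code: `resolve_pair` is a hypothetical name, and `fake_gpt` is a stand-in for the real chat-completion call so the example runs without network access.

```python
from typing import Callable

NO_MATCH = "No Match"


def resolve_pair(similarity: float,
                 ask_gpt: Callable[[], str],
                 threshold: float = 0.5) -> str:
    """Return an automatic "No Match" below the threshold, else defer to GPT."""
    if similarity < threshold:
        return NO_MATCH  # the GPT call is never made for this pair
    return ask_gpt()


# Stand-in for the GPT-4o call; it records how often it is invoked.
gpt_calls = []


def fake_gpt() -> str:
    gpt_calls.append(1)
    return "Match"


low_result = resolve_pair(0.32, fake_gpt)   # below 0.5: skipped
high_result = resolve_pair(0.81, fake_gpt)  # above 0.5: GPT is consulted
```

Passing the GPT call as a callable keeps it lazy: nothing is sent to the chat deployment unless the pair clears the threshold, which is where the cost saving comes from.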
