This solution compares two CSV datasets and resolves possible data mismatches using AI. It leverages vector embeddings and GPT-based reasoning to assess record similarity.
- Upload or use sample CSV datasets
- Generate embeddings with Azure OpenAI (text-embedding-3-small)
- Evaluate entity similarity using cosine similarity
- Get human-like judgment and explanation via GPT-4o (chat completion)
To use this project with Azure OpenAI, follow these steps:
If you haven't already, request access to Azure OpenAI:
👉 https://aka.ms/oai/access
- Go to Azure Portal → Create a resource → Search for "Azure OpenAI"
- Select Azure OpenAI and create the resource.
- After the resource is created, select 'Explore Azure AI Foundry portal'
Inside Azure AI Foundry portal:
- Go to Deployments
- Deploy the following models:
- text-embedding-3-small (for embeddings)
- gpt-4o (for reasoning via chat completion)
Note the deployment names and your resource endpoint; you'll need them in the app config.
In the solution, edit appsettings.json:
{
  "OpenAI": {
    "ApiKey": "<your-azure-openai-api-key>",
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview"
  }
}

To reduce costs and improve performance, the application uses cosine similarity between text embeddings (generated by the text-embedding-3-small model) to determine whether a pair of records is similar enough to justify a more detailed GPT-based evaluation.
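For orientation, the embeddings mentioned above are requested from the embedding deployment named in appsettings.json. The sketch below is not the app's own code: it is a minimal Python illustration, assuming the standard Azure OpenAI REST URL layout, of how those settings combine into an embeddings request. The helper name `build_embedding_request` and the sample record text are hypothetical, and the request is only built, not sent.

```python
import json

# Mirrors the "OpenAI" section of appsettings.json; values are placeholders.
settings = {
    "ApiKey": "<your-azure-openai-api-key>",
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "EmbeddingDeployment": "text-embedding-3-small",
    "ChatDeployment": "gpt-4o",
    "ApiVersion": "2024-02-15-preview",
}


def build_embedding_request(cfg, texts):
    """Build (but do not send) an embeddings request for a list of texts.

    Hypothetical helper for illustration only.
    """
    base = cfg["Endpoint"].rstrip("/")
    url = (
        f"{base}/openai/deployments/{cfg['EmbeddingDeployment']}"
        f"/embeddings?api-version={cfg['ApiVersion']}"
    )
    headers = {
        "Content-Type": "application/json",
        "api-key": cfg["ApiKey"],  # Azure OpenAI key-based auth header
    }
    body = json.dumps({"input": texts})
    return {"url": url, "headers": headers, "body": body}


request = build_embedding_request(settings, ["Acme Corp, 12 Main St"])
print(request["url"])
```

The chat deployment follows the same pattern, with `chat/completions` in place of `embeddings` in the path.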
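The similarity score is the cosine similarity of the two embedding vectors. Cosine similarity can in principle range from -1 to 1, but scores for text embeddings usually fall between 0 and 1 in practice, and the app interprets them as follows: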
- 1.0 = Perfect match (identical semantics)
- 0.7–1.0 = Likely to be a match
- 0.4–0.7 = Uncertain match
- < 0.4 = Unlikely to be a match
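The scoring bands above can be sketched in a few lines of Python. This is an illustrative sketch rather than the app's implementation: the function names `cosine_similarity` and `interpret` are hypothetical, and because the listed ranges share their boundary values, the sketch assumes a boundary score (0.7 or 0.4) belongs to the higher band.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def interpret(score):
    """Map a similarity score to the bands listed above.

    Assumption: boundary values fall into the higher band.
    """
    if score >= 0.7:
        return "Likely match"
    if score >= 0.4:
        return "Uncertain match"
    return "Unlikely match"


# Toy 3-dimensional vectors; real embeddings have many more dimensions.
score = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.5])
print(round(score, 3), interpret(score))
```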
⛔ Skipping Low-Value Comparisons
If the similarity score falls below a configurable threshold (e.g., 0.5), the system skips the expensive GPT call and returns an automatic judgment of "No Match":
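A minimal Python sketch of this gate, under stated assumptions: `judge_pair` and `ask_gpt` are hypothetical names, the GPT call is passed in as a plain callable so the example runs offline, and the 0.5 default simply echoes the example threshold above. The app's actual return format may differ.

```python
def judge_pair(similarity, record_a, record_b, ask_gpt, threshold=0.5):
    """Return "No Match" without calling GPT when similarity is below threshold.

    ask_gpt is a hypothetical callable (record_a, record_b) -> judgment string.
    """
    if similarity < threshold:
        return "No Match"  # cheap path: no chat completion is requested
    return ask_gpt(record_a, record_b)  # expensive path: defer to GPT


def fake_gpt(record_a, record_b):
    """Stand-in for the real chat completion call, for demonstration only."""
    return "Match"


print(judge_pair(0.32, "Acme Corp", "Zenith Ltd", fake_gpt))
print(judge_pair(0.81, "Acme Corp", "ACME Corporation", fake_gpt))
```

Passing the GPT call in as a parameter keeps the threshold logic easy to test in isolation from the network.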