This repo has SQL scripts that are developed to genrate IPL statistics like:
- Orange Cap
- Purple Cap
- Most Dot Balls
- Fastest 50's, 100's etx
- Hatricks.. etc
There a total of 25 reports, we will be using Joins, aggregate functions to achive the desired results.
All the stats are avalibale @ https://www.iplt20.com/stats/2008 I am trying to produce all the stats that are in the IPL website using the sample data from 2008 to 2017
Databrciks community addition works for this project, we can later put this in cloud (Azure), improve performance and more
Below is the architecture the we are going to use for this use case. There are no streaming data sources for this project, its batch oriented only
User will upload data to DBFS file system in to bronze layer and do operations on top of it. Source data for this project is available in data directory of this repo.
Data files Description:
- Ball_By_Ball.csv -- provides details about each ball bowler like runs, ball number, over number ,match_id etc.
- Team.csv -- list of IPL teams
- player.csv -- Provides information about each player played in ipl season
- Match.csv -- Information about match venu, date, toss winner, captain etc
- Player_match.csv - Information about palyer played matches, his role, seasosn etc
ETL Data Flow and Transformations sequence:
- IPL Data Analysis - Data Transform- Bronze to silver.ipynb
- IPL Data Analysis - Orange Cap Stats - SQL.ipynb
- IPL Data Analysis - Purple Cap Stats - SQL.ipynb
- IPL Data Analysis- Read from Gold Layer.ipynb
Link to above published notebooks: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2418033757997087/456113821613659/2092576197820195/latest.html https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2418033757997087/502321478328215/2092576197820195/latest.html https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2418033757997087/502321478328231/2092576197820195/latest.html https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2418033757997087/3045803298173797/2092576197820195/latest.html
Output Validation:
Each report in gold layer is comapred against the reports in https://www.iplt20.com/stats/2008 for at leat 95% accuracy, since this datasets are not from official IPL website.
Purple Cap Comparision.
Orange Cap Comparision
Next Steps: we look into performance imporovements and implement the same in cloud.




