27 commits
c2754ed
Refactor introduction and update contributor list
hathawayj Apr 14, 2026
f1ddaa1
Create build_book.yaml
hathawayj Apr 14, 2026
4b78eaf
Add GitHub Actions workflow for Quarto book build
hathawayj Apr 14, 2026
7c8012c
Delete .github/workflows/build_book.yaml
hathawayj Apr 14, 2026
17052a5
Update GitHub Actions workflow for book build
hathawayj Apr 14, 2026
363f420
Delete .github/workflows/build_book2.yaml
hathawayj Apr 14, 2026
f8097a9
Update pyproject.toml
hathawayj Apr 14, 2026
3fa7910
Update spreadsheet.ipynb and databases.ipynb to replace pandas with p…
brentomagic Apr 17, 2026
9886acc
Updated Data Visualization notebook
ugohuche Apr 22, 2026
97c56c3
WIP: Data Transformation
ugohuche Apr 22, 2026
e745615
Replace pandas with polars in rectangling.ipynb and webscraping-and-a…
brentomagic Apr 22, 2026
89fdd7b
Remove obsolete Quarto notebooks: visualise.quarto_ipynb_1, workflow-…
brentomagic Apr 22, 2026
af97f71
Refactor code and structure to polars for improved readability and ma…
ugohuche Apr 23, 2026
3391c7d
Refactor code and structure to polars
ugohuche Apr 23, 2026
032cd4d
Minor edit to data-transform.ipynb
ugohuche Apr 23, 2026
c015995
Remove unnecessary output and reset execution count in data visualisa…
ugohuche Apr 23, 2026
abdca1f
Reformat YAML workflow for improved readability and consistency
ugohuche Apr 23, 2026
18fb95a
Update workflow and notebooks for improved code execution and style c…
ugohuche Apr 23, 2026
e48a82d
Merge branch 'main' into brent-IMPORT
ugohuche Apr 23, 2026
d174e80
Refactor webscraping-and-apis notebook:
ugohuche Apr 23, 2026
1dcd47b
Update spreadsheet.ipynb and databases.ipynb to replace pandas with p…
brentomagic Apr 17, 2026
369bd45
Replace pandas with polars in rectangling.ipynb and webscraping-and-a…
brentomagic Apr 22, 2026
da6546b
Remove obsolete Quarto notebooks: visualise.quarto_ipynb_1, workflow-…
brentomagic Apr 22, 2026
11062ba
Remove obsolete Quarto notebooks: visualise.quarto_ipynb_1, workflow-…
brentomagic Apr 23, 2026
38d6a83
Refactor code structure for improved readability and maintainability
ugohuche Apr 23, 2026
40807b3
Merge branch 'brent-IMPORT' of https://github.com/datathink/python4DS…
ugohuche Apr 23, 2026
07dcbb0
Refactor notebook content for improved readability and formatting
ugohuche Apr 23, 2026
104 changes: 52 additions & 52 deletions .github/workflows/tests.yml
@@ -1,59 +1,59 @@
name: tests

on:
  push:
    branches: [main]
  pull_request:

concurrency:
  group: tests-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - uses: pre-commit/action@v3.0.1

  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Set up Python
        run: uv python install

      - name: Install dependencies
        run: uv sync --all-extras --dev

      - name: Install Quarto
        uses: quarto-dev/quarto-actions/setup@v2
        with:
          version: "1.5.57"

      - name: set timezone
        run: |
          TZ="Europe/London" &&
          sudo ln -snf /usr/share/zoneinfo/$TZ /etc/localtime

      - name: install linux deps
        run: |
          sudo apt-get -y install openssl graphviz nano texlive graphviz-dev unzip build-essential

      - name: build the book
        run: |
          uv run quarto render --execute

      - name: success
        run: |
          echo "Success in building book without errors!"
618 changes: 259 additions & 359 deletions data-transform.ipynb

Large diffs are not rendered by default.

136 changes: 67 additions & 69 deletions data-visualise.ipynb

Large diffs are not rendered by default.

Binary file modified data/bake_sale.xlsx
Binary file not shown.
36 changes: 20 additions & 16 deletions databases.ipynb
@@ -17,7 +17,7 @@
"\n",
"### Prerequisites\n",
"\n",
"You will need the **polars**, **SQLModel**, and **ibis** packages for this chapter. You probably already have **polars** installed; to install **SQLModel** and **ibis** respectively run `uv add sqlmodel` and `uv add ibis-framework` on your computer's command line. First, let's bring in some general packages and turn off verbose warnings."
]
},
{
@@ -39,10 +39,9 @@
"metadata": {},
"source": [
"## Database Basics\n",
"At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology. \n",
"Like a **Polars** DataFrame, a database table is a collection of named columns, where every value in a column shares the same data type. \n",
"There are three high-level differences between data frames and database tables:\n",
"\n",
"- Database tables are stored on disk (ie on file) and can be arbitrarily large.\n",
" Data frames are stored in memory, and are fundamentally limited (although that limit is still big enough for many problems). You can think about the difference between on disk and in memory as being like the difference between long-term and short-term memory (and you have much more limited capacity in the latter).\n",
@@ -68,7 +67,7 @@
"\n",
"- You'll always use a database interface that provides a connection to the database, for example Python's built-in **sqlite** package\n",
"\n",
"- You'll also use a package that pushes and/or pulls data to/from the database, for example **polars**\n",
"\n",
"The precise details of the connection varies a lot from DBMS to DBMS so unfortunately we can't cover all the details here. The initial setup will often take a little fiddling (and maybe some research) to get right, but you'll generally only need to do it once. We'll do the best we can to cover some basics here.\n",
"\n",
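The two-package pattern above can be sketched with Python's built-in **sqlite3** module alone, using an in-memory database. The table name and values here are made up for illustration:

```python
import sqlite3

# A minimal sketch: an in-memory SQLite database standing in for a real DBMS.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE album (AlbumId INTEGER, Title TEXT)")
con.executemany(
    "INSERT INTO album VALUES (?, ?)",
    [(1, "For Those About To Rock"), (2, "Balls to the Wall")],
)
# Queries come back as a list of plain Python tuples
rows = con.execute("SELECT * FROM album").fetchall()
print(rows)  # [(1, 'For Those About To Rock'), (2, 'Balls to the Wall')]
```

With a real DBMS you would swap `sqlite3.connect(":memory:")` for a connection to a file or server, but the execute/fetch cycle looks the same.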
@@ -112,7 +111,7 @@
"id": "2992b718",
"metadata": {},
"source": [
"Note that the output here is in the form of a Python object called a tuple. If we want to convert this into a **Polars** DataFrame, we can pass it to `pl.DataFrame()`. When working with tuples, you may need to provide column names using the **schema** argument or specify **orient=\"row\"** so Polars correctly interprets the structure."
]
},
{
@@ -122,9 +121,11 @@
"metadata": {},
"outputs": [],
"source": [
"import polars as pl\n",
"\n",
"df = pl.DataFrame(rows, orient=\"row\")\n",
"\n",
"df"
]
},
{
@@ -316,9 +317,9 @@
"source": [
"### Joins\n",
"\n",
"If you're familiar with joins in **polars**, SQL joins are very similar. Let's see if we can join the 'album' and 'track' tables to find the *name* of the albums in the above query.\n",
"\n",
"In **polars**, you use the `df.join()` method, which defaults to an \"inner\" join. Note that if you have the same column names in both tables, polars will append a suffix (like `_right`) to the duplicate names from the right table to keep them distinct, unless you specify otherwise. There are different options for joins (eg `INNER`, `LEFT`) that you can find out more about [here](https://en.wikipedia.org/wiki/Join_(SQL)).\n"
]
},
{
@@ -403,9 +404,9 @@
"id": "495f97e5",
"metadata": {},
"source": [
"## SQL with **pandas**\n",
"## SQL with **polars**\n",
"\n",
"**polars** is well-equipped for working with SQL. We can simply push the query we just created straight through using its `read_database()` function—but bear in mind we need to pass in the connection we created to the database too:"
]
},
{
@@ -415,7 +416,10 @@
"metadata": {},
"outputs": [],
"source": [
"df = pl.read_database(\n",
" query=sql_join, # your SQL query (string)\n",
" connection=con, # your connection object (SQLAlchemy, psycopg2 cursor, etc.)\n",
")"
]
},
{
@@ -435,7 +439,7 @@
"source": [
"## SQL with **ibis**\n",
"\n",
"It's not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from **pandas** commands? You can't *quite* do that, but there's a package that gets you pretty close and it's called [**ibis**](https://ibis-project.org/). **ibis** is particularly useful when you are reading from a database and want to query it just like you would a **pandas** data frame.\n",
"It's not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from **polars** commands? You can't *quite* do that, but there's a package that gets you pretty close and it's called [**ibis**](https://ibis-project.org/). **ibis** is particularly useful when you are reading from a database and want to query it just like you would a **polars** data frame.\n",
"\n",
"**Ibis** can connect to local databases (eg a SQLite database), server-based databases (eg Postgres), or cloud-based databases (eg Google's BigQuery). The syntax to make a connection is, for example, `ibis.bigquery.connect`.\n",
"\n",
@@ -462,7 +466,7 @@
"id": "6dcd7d71",
"metadata": {},
"source": [
"Okay, now let's reproduce the following query: \"SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\". We'll use a group_by, a mutate (which you can think of as the analogue of **polars**' `with_columns`), a sort, and then `limit()` to only show the first five entries."
]
},
{