Add AGENTS.md and enrich package docstring#1497
timsaucer wants to merge 9 commits into apache:main
Conversation
Add python/datafusion/AGENTS.md as a comprehensive DataFrame API guide for AI agents and users. It ships with pip automatically (Maturin includes everything under python-source = "python"). Covers core abstractions, import conventions, data loading, all DataFrame operations, expression building, a SQL-to-DataFrame reference table, common pitfalls, idiomatic patterns, and a categorized function index. Enrich the __init__.py module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md. Closes apache#1394 (PR 1a) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root AGENTS.md (symlinked as CLAUDE.md) is for contributors working on the project. Add a pointer to python/datafusion/AGENTS.md which is the user-facing DataFrame API guide shipped with the package. Also add the Apache license header to the package AGENTS.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document that all PRs must follow .github/pull_request_template.md and that pre-commit hooks must pass before committing. List all configured hooks (actionlint, ruff, ruff-format, cargo fmt, cargo clippy, codespell, uv-lock) and the command to run them manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Let the hooks be discoverable from .pre-commit-config.yaml rather than maintaining a separate list that can drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify that DataFusion works with any Arrow C Data Interface implementation, not just PyArrow. - Show the filter keyword argument on aggregate functions (the idiomatic HAVING equivalent) instead of the post-aggregate .filter() pattern. - Update the SQL reference table to show FILTER (WHERE ...) syntax. - Remove the now-incorrect "Aggregate then filter for HAVING" pitfall. - Add .collect() to the fluent chaining example so the result is clearly materialized. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…only the text description
Positive update: After my latest push 4429a08 it now correctly creates an idiomatic datafusion-python file for the first TPC-H query using only the text description from the specification, having been directed to strictly not use the SQL as a reference. I didn't feed it the SQL, but I gave it those instructions so it didn't find the answer during its searching. When I get more time I plan on working through each of the queries until we have an agent file that can reproduce all of TPC-H with idiomatic code.
FYI @ntjohnson1 you might get some value out of grabbing the
Thanks for the heads up. @iblnkn is going to do some query work in the short term, so it would be good to try this out in addition to some of the internal AGENTS.md stuff we have.
With my latest push I have a folder that contains only the text descriptions of the TPC-H queries, and I gave it this guidance:

> Review the @README.md and @AGENTS.md in this directory. Each of the problem statements is listed in @problems/. I want you to generate solutions for each problem statement. However, when you do this you are forbidden from making any changes to your solution after your first evaluation. This is an attempt to test that our agents file contains all of the necessary instructions, so you should be able to get each one right on the first attempt.

The contents of README.md were:

> **DataFusion Python - TPC-H Queries**
>
> **Overview**
>
> This project implements TPC-H benchmark queries using idiomatic datafusion-python code. The goal is to translate natural language problem descriptions into DataFrame API queries, not to transliterate SQL into Python.
>
> **Data**
>
> TPC-H parquet files are located in the
>
> **Approach**
>
> Each query should be written as idiomatic datafusion-python, using the DataFrame
>
> **Allowed Sources**
>
> **Restrictions**

Additionally I have a CLAUDE.md file with:

> Do not store auto-memory for this folder. The user is developing and testing skills here, and cross-session memory may bias how skills get written or evaluated between runs.
>
> Do not write to
>
> Do not read prior query solutions under
>
> Whenever you hit a problem while generating a query — a DataFusion error, a surprising planner rejection, a type mismatch, an API quirk not covered by the existing guide — after resolving it, propose a concrete addition or edit to

**Results**

Using this it created all 22 TPC-H queries. I then validated that they all work at scale factor 1 and produce the expected results. I also checked each file to make sure it created idiomatic code.
We need this for datafusion too :)
Pull request overview
Adds in-package, user-facing guidance for writing idiomatic DataFusion Python DataFrame API code, and makes it discoverable via the package docstring and repo root instructions.
Changes:
- Add a comprehensive `python/datafusion/AGENTS.md` DataFrame API guide intended to ship in the wheel.
- Expand the `python/datafusion/__init__.py` module docstring with core abstractions, a quick start, and a pointer to the shipped guide.
- Update repo-root `AGENTS.md` to clarify it targets contributors and link to the user-facing guide.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `python/datafusion/__init__.py` | Replaces the minimal module docstring with a richer overview + quick start and pointer to shipped AGENTS.md |
| `python/datafusion/AGENTS.md` | New, comprehensive DataFrame API reference/guide intended for agent + human consumption |
| `AGENTS.md` | Clarifies contributor-focused scope and points users/agents to `python/datafusion/AGENTS.md` |
- Wrap CASE/WHEN method-chain examples in parentheses and assign to a variable so they are valid Python as shown (Copilot #1, #2). - Fix INTERSECT/EXCEPT mapping: the default distinct=False corresponds to INTERSECT ALL / EXCEPT ALL, not the distinct forms. Updated both the Set Operations section and the SQL reference table to show both the ALL and distinct variants (Copilot apache#4). - Change write_parquet / write_csv / write_json examples to file-style paths (output.parquet, etc.) to match the convention used in existing tests and examples. Note that a directory path is also valid for partitioned output (Copilot apache#5). Verified INTERSECT/EXCEPT semantics with a script: df1.intersect(df2) -> [1, 1, 2] (= INTERSECT ALL) df1.intersect(df2, distinct=True) -> [1, 2] (= INTERSECT) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop lit() on the RHS of comparison operators since Expr auto-wraps raw Python values, matching the style the guide recommends (Copilot apache#3, apache#6). Updates examples in the Aggregation, CASE/WHEN, SQL reference table, Common Pitfalls, Fluent Chaining, and Variables-as-CTEs sections, plus the __init__.py quick-start snippet. Prose explanations of the rule (which cite the long form as the thing to avoid) are left unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ntjohnson1
left a comment
I think this looks great! I'm not sure if you want to try to land this or wait for some people to test it out first. I figure landing it then iterating might make the most sense.
One concern is how to maintain/validate that the stuff in the AGENTS.md is actually up to date. If nothing runs it, does it still execute? I think doctests can run on markdown, or we could do a more complex method where the md gets built as an artifact.
You mentioned how to distribute this; that's probably follow-on work. One idea could be to register this as a skill in one of the various online registries. Then people could install datafusion-python support and just run /dfn-py and then ask for queries.
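The validation idea mentioned above could be sketched roughly like this: extract every fenced python block from the guide and exec it in CI so stale examples fail loudly. This is a stdlib-only sketch; `run_markdown_examples` and the toy guide string are hypothetical, not part of the PR:

```python
import re
import textwrap

FENCE = "`" * 3  # spelled out so this example doesn't nest literal fences
FENCE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_markdown_examples(markdown_text: str) -> int:
    """Exec every fenced python block; return how many blocks ran."""
    blocks = FENCE_RE.findall(markdown_text)
    for block in blocks:
        exec(textwrap.dedent(block), {})  # fresh namespace per block
    return len(blocks)

# Toy guide standing in for python/datafusion/AGENTS.md
sample = (
    "Some prose.\n\n"
    + FENCE + "python\n"
    + "x = 1 + 1\n"
    + "assert x == 2\n"
    + FENCE + "\n"
)
print(run_markdown_examples(sample))  # 1
```

A real check would read the shipped AGENTS.md instead of a toy string and run under pytest so a broken example fails the build.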
```python
from datafusion import SessionContext, col
from datafusion import functions as F

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
result = (
    df.filter(col("a") > 1)
    .with_column("total", col("a") + col("b"))
    .aggregate([], [F.sum(col("total")).alias("grand_total")])
)
print(result.to_pydict())  # {'grand_total': [16]}
```
NIT: I think you can put this in doctest format and that would ensure it doesn't go stale.
```python
count = df.count()  # int

# Streaming
stream = df.execute_stream()  # RecordBatchStream (single partition)
```
I think this needs more context. Is this fetching one batch at a time, fetching everything, or fetching up to some internal prefetch buffer? When should you prefer this over collect, etc.? A few sentences would probably help a lot.
```markdown
### Date Arithmetic

`Date32` columns require `Interval` types for arithmetic, not `Duration`. Use
```
Does `Date64` work with `Duration`, or is this only discussing `Date32`?
Also, the whole datetime only has ms precision, so it feels helpful to describe exporting to numpy for their datetime64 (or having pandas installed if trying to go to raw Python types) as a dates-related footgun that is a pyarrow problem but gets inherited here.
Which issue does this PR close?
Addresses part of #1394 (PR 1a from the implementation plan)
Rationale for this change
AI agents (and humans) that encounter `datafusion` via `pip install` currently get a 2-line module docstring and no structured guide to the DataFrame API. This makes it difficult for agents to produce idiomatic DataFrame code, even though they are very capable with SQL. The goal is that any agent -- whether it encounters the package via pip, the docs site, or the repo -- gets enough context to write correct DataFrame code.

What changes are included in this PR?
- `python/datafusion/AGENTS.md` (new) -- comprehensive DataFrame API guide that ships with `pip install datafusion` (Maturin includes all files under `python-source = "python"`). Covers common pitfalls (`lit()` wrapping, column quoting, immutable DataFrames, window frame defaults, HAVING pattern).
- `python/datafusion/__init__.py` (modified) -- enriched module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md.
- `AGENTS.md` (modified, root) -- clarified that the root file is for contributors working on the project, and added a pointer to `python/datafusion/AGENTS.md` for agents that need to use the DataFrame API.
Yes -- the `datafusion` package now ships with an `AGENTS.md` guide and has a richer module docstring visible via `help(datafusion)`. No API changes.