refactor: separate statistic computation by tristan-f-r · Pull Request #411 · Reed-CompBio/spras

tristan-f-r · 2025-10-10T06:33:29Z

We also make graph statistics lazy. Laziness isn't used in summary.py, but I assume that we'll have more computationally expensive graph statistics as SPRAS develops, especially since they can take a long time to compute on our larger graphs, so this also splits up statistic generation into different rules.

Most importantly, this allows us to re-use statistics by consuming specific statistics as input files, which is currently used in #431.
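
As a rough illustration of the reuse idea (not the Snakemake rules in this PR), a statistic can be written to its own file once and then read back by any later consumer. The helper name `ensure_statistic` and the one-value-per-file layout are assumptions made for this sketch.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_statistic(path: Path, compute: Callable[[], float]) -> float:
    """Return the statistic stored at `path`, computing and writing it only if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(str(compute()))
    return float(path.read_text())


calls = []


def count_edges() -> float:
    # Stand-in for an expensive graph statistic; records each real computation.
    calls.append("edges")
    return 3.0


with tempfile.TemporaryDirectory() as tmp:
    stat_file = Path(tmp) / "statistics" / "edges.txt"
    first = ensure_statistic(stat_file, count_edges)
    second = ensure_statistic(stat_file, count_edges)  # served from the existing file
```

In the actual workflow, Snakemake plays the role of `ensure_statistic`: a rule only runs when its output file is requested and not already up to date.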

Depends on feat!: SPRAS revision #320 for the summary statistics test (integration testing over unit testing is now required since the heavy workflow lifting is done by Snakemake).

read-the-docs-community · 2025-10-10T06:34:25Z

Documentation build overview

📚 spras | 🛠️ Build #32474691 | 📁 Comparing 9053a59 against latest (caf8e9e)

🔍 Preview build

4 files changed

± genindex.html
± contributing/maintain.html
± fordevs/spras.analysis.html
± fordevs/spras.html

tristan-f-r · 2025-10-14T17:39:58Z

Building on top of this PR allows me to add graph heuristics.

Most likely, every tuning PR will be at least marked with P-medium unless it's an end result.

agitter · 2025-11-07T22:47:27Z

Before I can review the implementation of the change, I need to better understand what problem we are trying to solve with the change. Where will laziness be needed in the future?

we can reuse the code for graph heuristic pruning

Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

tristan-f-r · 2025-11-07T23:53:18Z

I was going to ask @ntalluri about this, since I wasn't quite sure if we will have expensive graph heuristics or not.

> Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

I did decouple this from analysis: summary: enabled: true, and I imagined it like this. I didn't think about that, though: would it make sense to have graph summary statistics always enabled the moment any heuristics are enabled?

agitter · 2025-11-08T04:25:01Z

There could be more than one way to design this sensibly. One would be that if heuristics are enabled in the config file, that automatically generates the graph summary table. This produces more output than requested, which is slightly undesirable.

Another could be to move the heuristic calculations inside each --parameters> subdirectory, which may be where you are headed. If that is written as a file for that one pathway, it could be consumed for heuristics (or used for heuristics and then written to disk). Later, if the graph summary table is requested, it would grab the precomputed statistics from those files in the subdirectories.
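
A minimal sketch of this second design, assuming one `<name>.txt` file per statistic inside a `statistics` folder of each pathway subdirectory. The function names and the example directory names are hypothetical and only serve to show the summary table being assembled from precomputed files rather than recomputed.

```python
import tempfile
from pathlib import Path


def write_statistics(pathway_dir: Path, values: dict[str, float]) -> None:
    """Write one `<name>.txt` file per statistic under `<pathway_dir>/statistics`."""
    stats_dir = pathway_dir / "statistics"
    stats_dir.mkdir(parents=True, exist_ok=True)
    for name, value in values.items():
        (stats_dir / f"{name}.txt").write_text(str(value))


def collect_summary(out_dir: Path) -> dict[str, dict[str, float]]:
    """Build a summary table by reading the precomputed files instead of recomputing."""
    summary = {}
    for stats_dir in sorted(out_dir.glob("*/statistics")):
        summary[stats_dir.parent.name] = {
            stat_file.stem: float(stat_file.read_text())
            for stat_file in sorted(stats_dir.glob("*.txt"))
        }
    return summary


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Hypothetical pathway subdirectories, each holding its own statistics.
    write_statistics(out_dir / "data0-pathlinker-params1", {"Number of nodes": 4, "Number of edges": 3})
    write_statistics(out_dir / "data0-mincostflow-params2", {"Number of nodes": 2, "Number of edges": 1})
    summary = collect_summary(out_dir)
```

Heuristics could read the same per-pathway files, so neither consumer forces the other's output to be generated.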

tristan-f-r · 2025-11-08T08:06:01Z

I'll mark this as a draft for now and design something in line with your second proposal.

ntalluri · 2026-02-05T16:30:16Z

Would you be able to explain the goal of this PR and what its changes are in the top comment? Also, why does this depend on the SPRAS revision?

tristan-f-r · 2026-02-05T18:42:06Z

I've edited the top comment to mention the heuristics PR 👍, though the motivation was already present.

As mentioned in the meeting and in the top-level comment, this depends on the integration testing part of the SPRAS revision and not the immutability section.

agitter

I like the new design. I'm not confident I follow everything in spras/statistics.py correctly or how the testing updates intersect with #320.

ntalluri · 2026-04-29T19:14:21Z

Is there a reason why we need to have separate statistic folders within each output subnetwork folder? I don't understand this motivation.

ntalluri · 2026-04-29T20:10:24Z

I read the conversation between you and Tony; I see the motivation now.

ntalluri

I am running this locally still, but this is my current review

ntalluri · 2026-04-29T17:17:51Z

+        return max(degrees), median(degrees)
+
+def compute_on_cc(directed_graph: nx.DiGraph) -> tuple[int, float]:
+    # We convert our directed_graph to an undirected graph as networkx (reasonably) does


I can't remember why we do this. @agitter I remember we talked about this years ago.

[I'm a little confused - is it why we compute the number of connected components in the first place?]

If it's about the undirected graph conversion, that comment should be why.

@ntalluri was your comment asking why we convert to undirected for the connected component calculation?

Related to my comment above, I recommend we stay with reading undirected graphs and use them throughout. That affects other statistics like degree as well.
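
To make the directed-versus-undirected question concrete, here is a plain-Python sketch that counts connected components while ignoring edge direction. It is a stand-in for `nx.number_connected_components(graph.to_undirected())` written with union-find, not the code in this PR; the function name and the edge-list representation are assumptions for illustration.

```python
def count_components_ignoring_direction(nodes, edges):
    """Count connected components, treating every (source, target) edge as undirected."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the representative, compressing the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        # Merge the two endpoints' components; the edge direction plays no role here.
        parent[find(source)] = find(target)

    return len({find(node) for node in nodes})


nodes = ["A", "B", "C", "D", "E"]
# A -> B and C -> B: there is no directed path between A and C,
# yet they belong to one component once direction is ignored.
edges = [("A", "B"), ("C", "B"), ("D", "E")]
components = count_components_ignoring_direction(nodes, edges)
```

Here `components` is 2: {A, B, C} and {D, E}. If graphs are read as undirected from the start, no conversion step is needed before a calculation like this.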

agitter

I finally understand the design better and like it.

The tests don't pass for me locally

FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[example] - AssertionError: assert False
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[egfr] - AssertionError: assert False

agitter · 2026-06-19T18:49:07Z


 def summarize_networks(file_paths: Iterable[Path], node_table: pd.DataFrame, algo_params: dict[str, dict],
-                       algo_with_params: list[str]) -> pd.DataFrame:
+                       algo_with_params: list[str], statistics_files: Mapping[str, Iterable[str | os.PathLike]]) -> pd.DataFrame:


We now have LoosePathLike in util. Does it make sense to use it here?

agitter · 2026-06-19T21:17:06Z

+To make the statistics allow directed graph input, they will always take
+in a networkx.DiGraph, which contains even more information, even though
+the underlying graph may be just as easily represented by networkx.Graph.


Is this a change in functionality? In summary.py we had the opposite

Network directionality is ignored and all edges are treated as undirected

When we previously assessed treating all graphs as directed or undirected, we decided that undirected would be less wrong.

agitter · 2026-06-19T21:23:10Z

+statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = {
+    ('Number of nodes',): lambda graph : (graph.number_of_nodes(),),
+    ('Number of edges',): lambda graph : (graph.number_of_edges(),),
+    ('Number of connected components',): lambda graph : (nx.number_connected_components(graph.to_undirected()),),


Again, having to convert everything to undirected is messy.

agitter · 2026-06-19T21:24:41Z

+    return (avg_path_len,)
+
+# The type signature here is meant to be 'an n-tuple has n outputs.'
+statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = {


This syntax is somewhat confusing. It took me a minute to realize why everything needs to be a tuple, that sometimes we have a function return one statistic and sometimes multiple. The design makes sense now that I get it, but a comment would have helped.

agitter · 2026-06-19T21:25:54Z

+# All of the keys inside statistics_computation, flattened.
+statistics_options: list[str] = list(itertools.chain(*(list(key) for key in statistics_computation.keys())))
+
+def from_output_pathway(lines) -> nx.Graph:


When I saw it in summary.py, I didn't recognize this function as a graph loader from the name.

agitter · 2026-06-19T21:32:00Z

    subprocess.run(["snakemake", "--cores", "1", "--configfile", f"test/analysis/input/{param}.yaml"])
    yield param # this runs the test itself: once this is passed, we go to test cleanup.
-    shutil.rmtree(f"test/analysis/input/run/{param}")
+    # shutil.rmtree(f"test/analysis/input/run/{param}")


Why do we not need this anymore?

agitter · 2026-06-19T21:39:09Z


+# We generate new Snakemake rules for every statistic
+# to allow parallel and lazy computation of individual statistics
+for keys in statistics_computation.keys():


Because of the complicated typing, I find it hard to track what keys is here. I'm pretty sure it is the tuples describing the statistics as strings.

agitter · 2026-06-19T21:40:40Z

+        # (See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#procedural-rule-definition)
+        name: pythonic_name
+        input: pathway_file = rules.parse_output.output.standardized_file
+        output: [SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'statistics', f'{key}.txt']) for key in keys]


Do we need to sanitize the keys here as done in the pythonic_name above? When I run locally, I see output files like Number of connected components.txt.
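
One possible sanitization, sketched under the assumption that lowercase names with underscores are acceptable for the output files. The function `sanitize_statistic_name` is hypothetical and is not the `pythonic_name` logic from the Snakefile.

```python
import re


def sanitize_statistic_name(name: str) -> str:
    """Lowercase the name and replace each run of non-alphanumeric characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


keys = ("Number of connected components", "Max degree")
file_names = [f"{sanitize_statistic_name(key)}.txt" for key in keys]
```

Any code that later reads these files would need to apply the same mapping from statistic name to file name.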

refactor: separate statistic computation

6ec4f62

we also make it lazy

tristan-f-r added tuning Workflow-spanning algorithm tuning refactor Changes that don't actually improve anything except for code quality. labels Oct 10, 2025

tristan-f-r added 2 commits October 10, 2025 06:48

fix: correct tuple assumption

9987189

fix: stably use graph statistic values

25eef5e

tristan-f-r requested a review from ntalluri October 14, 2025 17:38

tristan-f-r added the P-medium medium prirotity; this is needed for some external service or another PR label Oct 14, 2025

style: fmt

cb373c1

github-actions Bot added the merge-conflict This PR has merge conflicts. label Oct 30, 2025

tristan-f-r mentioned this pull request Oct 30, 2025

feat: heuristics #431

Open

1 task

Merge branch 'main' into lazy-stats

47a9e26

github-actions Bot removed the merge-conflict This PR has merge conflicts. label Oct 30, 2025

tristan-f-r and others added 2 commits October 29, 2025 18:15

style: specify zip strict

898d568

fix: make undirected for determining number of connected components

c675ece

tristan-f-r marked this pull request as draft November 8, 2025 08:06

ntalluri removed the P-medium medium prirotity; this is needed for some external service or another PR label Nov 19, 2025

tristan-f-r added 2 commits January 13, 2026 09:28

Merge branch 'main' into lazy-stats

3c81d05

feat: snakemake-based summary generation

1ca730e

tristan-f-r added the P-high This is a blocker for many PRs/issues/features label Jan 13, 2026

tristan-f-r marked this pull request as ready for review January 13, 2026 20:13

tristan-f-r added 3 commits January 13, 2026 12:19

fix(Snakefile): use parse_output for edgelist parsing

d67186d

fix: parse edgelist with rank, embed header skip inside from_edgelist

fd483c3

this had incorrect behavior ?

style: fmt

fd5046f

tristan-f-r removed the P-high This is a blocker for many PRs/issues/features label Jan 13, 2026

tristan-f-r added blocked-by-other-pr P-medium medium prirotity; this is needed for some external service or another PR labels Jan 13, 2026

tristan-f-r added 2 commits January 13, 2026 13:17

chore: mention statistics_files param

79cf748

Merge branch 'hash' into lazy-stats

339d915

agitter reviewed Feb 13, 2026

tristan-f-r added 3 commits February 14, 2026 01:09

docs: more info on summary & statistics

85e0ea8

style: fmt

804849a

Merge branch 'hash' into lazy-stats

cf3c6a0

github-actions Bot added the merge-conflict This PR has merge conflicts. label Mar 16, 2026

Merge remote-tracking branch 'upstream/main' into lazy-stats

0f7acca

tristan-f-r removed the blocked-by-other-pr label Mar 19, 2026

github-actions Bot removed the merge-conflict This PR has merge conflicts. label Mar 19, 2026

tristan-f-r mentioned this pull request Mar 26, 2026

Review Queue #466

Open

tristan-f-r added 4 commits April 17, 2026 19:27

Merge branch 'umain' into generate-all-inputs

ae61e57

Merge branch 'main' into lazy-stats

b038ecf

refactor: use dictionaries instead of a flat list

4fe949d

along with proper Snakemake procedural rule usage

docs: clarification

a86354f

ntalluri reviewed Apr 29, 2026

tristan-f-r commented Apr 29, 2026

Comment thread spras/statistics.py Outdated

refactor: apply suggestions

9053a59

tristan-f-r requested a review from ntalluri April 30, 2026 03:06

ntalluri mentioned this pull request May 8, 2026

feat: conditional runs #471

Open

agitter reviewed Jun 19, 2026
