refactor: separate statistic computation#411
Conversation
we also make it lazy
Documentation build overview
4 files changed± genindex.html± contributing/maintain.html± fordevs/spras.analysis.html± fordevs/spras.html |
|
Building on top of this PR allows me to add graph heuristics. Most likely, every |
|
Before I can review the implementation of the change, I need to better understand what problem we are tying to solve with the change. Where will laziness be needed in the future?
Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow? |
|
I was going to ask @ntalluri about this, since I wasn't quite sure if we will have expensive graph heuristics or not.
I did decouple this from |
|
There could be more than one way to design this sensibly. One would be that if heuristics are enabled in the config file, that automatically generates the graph summary table. The produces more output than requested, which is slightly undesirable. Another could be to move the heuristic calculations inside each --parameters> subdirectory, which may be where you are headed. If that is written as a file for that one pathway, it could be consumed for heuristics (or used for heuristics and then written to disk). Later, if the graph summary table is requested, it would grab the precomputed statistics from those files in the subdirectories. |
|
I'll mark this as a draft for now and design something in line with your second proposal. |
|
Would you be able to explain what the goal and what the changes are of this PR in the top comment? Also why does this depend on SPRAS revision? |
|
I've edited the top comment to mention the heuristics PR 👍, though the motivation was already present. As mentioned in the meeting and in the top-level comment, this depends on the integration testing part of the SPRAS revision and not the immutability section. |
along with proper Snakemake procedural rule usage
|
Is there a reason why we need to have separate statistic folders within each output subnetwork folder? I don't understand this motivation. |
|
I read the conversion between you and tony, I see the motivation now. |
ntalluri
left a comment
There was a problem hiding this comment.
I am running this locally still, but this is my current review
| return max(degrees), median(degrees) | ||
|
|
||
| def compute_on_cc(directed_graph: nx.DiGraph) -> tuple[int, float]: | ||
| # We convert our directed_graph to an undirected graph as networkx (reasonably) does |
There was a problem hiding this comment.
I can't remember why we do this. @agitter I remember we talked about this years ago.
There was a problem hiding this comment.
[I'm a little confused - is it why we compute the number of connected components in the first place?]
There was a problem hiding this comment.
If it's about the undirected graph conversion, that comment should be why.
There was a problem hiding this comment.
@ntalluri was your comment asking why we convert to undirected for the connected component calculation?
Related to my comment above, I recommend we stay with reading undirected graphs and use them throughout. That affects other statistics like degree as well.
agitter
left a comment
There was a problem hiding this comment.
I finally understand the design better and like it.
The tests don't pass for me locally
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[example] - AssertionError: assert False
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[egfr] - AssertionError: assert False
|
|
||
| def summarize_networks(file_paths: Iterable[Path], node_table: pd.DataFrame, algo_params: dict[str, dict], | ||
| algo_with_params: list[str]) -> pd.DataFrame: | ||
| algo_with_params: list[str], statistics_files: Mapping[str, Iterable[str | os.PathLike]]) -> pd.DataFrame: |
There was a problem hiding this comment.
We now have LoosePathLike in util. Does it make sense to use it here?
| To make the statistics allow directed graph input, they will always take | ||
| in a networkx.DiGraph, which contains even more information, even though | ||
| the underlying graph may be just as easily represented by networkx.Graph. |
There was a problem hiding this comment.
Is this a change in functionality? In summary.py we had the opposite
Network directionality is ignored and all edges are treated as undirected
When we previously assessed treating all graphs as directed or undirected, we decided that undirected would be less wrong.
| statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = { | ||
| ('Number of nodes',): lambda graph : (graph.number_of_nodes(),), | ||
| ('Number of edges',): lambda graph : (graph.number_of_edges(),), | ||
| ('Number of connected components',): lambda graph : (nx.number_connected_components(graph.to_undirected()),), |
There was a problem hiding this comment.
Again, having to convert everything to undirected is messy.
| return (avg_path_len,) | ||
|
|
||
| # The type signature here is meant to be 'an n-tuple has n outputs.' | ||
| statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = { |
There was a problem hiding this comment.
This syntax is somewhat confusing. It took me a minute to realize why everything needs to be a tuple, that sometimes we have a function return one statistic and sometimes multiple. The design makes sense now that I get it, but a comment would have helped.
| # All of the keys inside statistics_computation, flattened. | ||
| statistics_options: list[str] = list(itertools.chain(*(list(key) for key in statistics_computation.keys()))) | ||
|
|
||
| def from_output_pathway(lines) -> nx.Graph: |
There was a problem hiding this comment.
When I saw it in summary.py, I didn't recognize this function as a graph loader from the name.
| subprocess.run(["snakemake", "--cores", "1", "--configfile", f"test/analysis/input/{param}.yaml"]) | ||
| yield param # this runs the test itself: once this is passed, we go to test cleanup. | ||
| shutil.rmtree(f"test/analysis/input/run/{param}") | ||
| # shutil.rmtree(f"test/analysis/input/run/{param}") |
There was a problem hiding this comment.
Why do we not need this anymore?
|
|
||
| # We generate new Snakemake rules for every statistic | ||
| # to allow parallel and lazy computation of individual statistics | ||
| for keys in statistics_computation.keys(): |
There was a problem hiding this comment.
Because of the complicated typing, I find it hard to track what keys is here. I'm pretty sure it is the tuples describing the statistics as strings.
| # (See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#procedural-rule-definition) | ||
| name: pythonic_name | ||
| input: pathway_file = rules.parse_output.output.standardized_file | ||
| output: [SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'statistics', f'{key}.txt']) for key in keys] |
There was a problem hiding this comment.
Do we need to sanitize the keys here as done in the pythonic_name above? When I run locally, I see output files like Number of connected components.txt.
We also make graph statistics lazy. Laziness isn't used in
summary.py, but I assume that we'll have more computationally expensive graph statistics as SPRAS develops, especially when it can take long to compute for our larger graphs, so this also splits up statistic generation into different rules.Most importantly, this allows us to re-use statistics by consuming specific statistics as input files, which is currently used in #431.