Model a DRAM write of one Einsum's output overlapping with the next Einsum's compute

## Summary

When a tensor is both (a) an intermediate consumed by the next Einsum in a fused cascade *and* (b) a tensor that must also be written out to DRAM (e.g. a real workload output, a checkpoint, or a debug tap), AccelForge currently has no way to express that the DRAM write happens in the background, overlapped with the next Einsum's compute. Every Einsum-to-Einsum transition is modeled as fully sequential, so the DRAM write always sits on the critical path before the next Einsum is considered to start. On real hardware, the DRAM write port and the compute array are usually independent resources that can run concurrently.

## Motivating use case

Consider three cascaded Einsums `e1 -> e2 -> e3` on a DRAM/SRAM hierarchy, where `e2`'s output `o2` is consumed by `e3` (so it must stay resident in SRAM for fusion) but is *also* a real workload output that must land in DRAM.
Using a `Tensors.keep` set-expression override on the DRAM node (`"(~Intermediates) | o2"`) successfully gets `o2` stored in both SRAM (for `e3`) and DRAM (as a backed/persistent output). That part of the API works well and is genuinely useful.
The open question is purely about latency: is the DRAM write of `o2` allowed to happen while `e3` is already computing, or does it have to finish first?

## Reproduction

```python
import accelforge as af

def build_spec(o2_to_dram: bool):
    workload = af.Workload(
        rank_sizes={"M": 8, "N0": 8, "N1": 8, "N2": 8, "N3": 8},
        einsums=[
            {"name": "e1", "tensor_accesses": [
                {"name": "i1", "projection": ["m", "n0"]},
                {"name": "w1", "projection": ["n0", "n1"]},
                {"name": "o1", "projection": ["m", "n1"], "output": True},
            ]},
            {"name": "e2", "tensor_accesses": [
                {"name": "o1", "projection": ["m", "n1"]},
                {"name": "w2", "projection": ["n1", "n2"]},
                {"name": "o2", "projection": ["m", "n2"], "output": True},
            ]},
            {"name": "e3", "tensor_accesses": [
                {"name": "o2", "projection": ["m", "n2"]},
                {"name": "w3", "projection": ["n2", "n3"]},
                {"name": "o3", "projection": ["m", "n3"], "output": True},
            ]},
        ],
        bits_per_value={"All": 8},
    )
    mem_actions = [
        af.arch.TensorHolderAction(name="read", energy=1, throughput=1),
        af.arch.TensorHolderAction(name="write", energy=1, throughput=1),
    ]
    # DRAM keeps every non-intermediate tensor; when the flag is set, o2 is kept there too.
    keep_expr = "(~Intermediates) | o2" if o2_to_dram else "~Intermediates"
    arch = af.Arch(nodes=[
        af.arch.Memory(name="DRAM", size=float("inf"), actions=mem_actions,
            area=0, leak_power=0,
            tensors=af.arch.Tensors(keep=keep_expr, may_keep="All")),
        af.arch.Memory(name="SRAM", size=1_000_000, actions=mem_actions,
            area=0, leak_power=0,
            tensors=af.arch.Tensors(keep="Intermediates", may_keep="All")),
        af.arch.Compute(name="mac",
            actions=[af.arch.Action(name="compute", energy=1, throughput=1)],
            area=0, leak_power=0),
    ])
    return af.Spec(arch=arch, workload=workload)

for label, flag in [("o2 SRAM-only (pure intermediate)", False),
                     ("o2 also written to DRAM", True)]:
    spec = build_spec(flag)
    mappings = spec.map_workload_to_arch(print_progress=False)
    print(label, "-> latency:", mappings.latency())
```

Output:

```
o2 SRAM-only (pure intermediate) -> latency: 31744
o2 also written to DRAM          -> latency: 32768
```

Adding the DRAM write for `o2` increases total modeled latency by 1024 (from 31744 to 32768), exactly the cost of that extra write. The write is fully serialized onto the critical path, with no overlap with `e3`'s compute.
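
To make the gap concrete, here is a rough sketch of what overlap would mean for these two numbers. It is a toy model, not AccelForge's latency model: it assumes the `o2` write starts when `e2` finishes and that `e3` is the last stage. The `e3_latency` values passed in are hypothetical, since the repro above does not report per-Einsum latencies.

```python
# Reported totals from the repro above.
BASELINE = 31744         # o2 kept SRAM-only
WITH_DRAM_WRITE = 32768  # o2 also written to DRAM

# Extra latency attributed to the o2 DRAM write.
write_cost = WITH_DRAM_WRITE - BASELINE


def serialized(baseline: int, write_cost: int) -> int:
    """Current behavior: the write is added to the critical path in full."""
    return baseline + write_cost


def overlapped(baseline: int, write_cost: int, e3_latency: int) -> int:
    """Toy overlap model: the write runs concurrently with e3, so only the
    part of the write that outlasts e3 is exposed on the critical path."""
    exposed = max(0, write_cost - e3_latency)
    return baseline + exposed


print("write cost:", write_cost)
print("serialized:", serialized(BASELINE, write_cost))
# Hypothetical e3 latencies, chosen only to show both regimes.
print("overlapped, e3 longer than write:", overlapped(BASELINE, write_cost, e3_latency=2048))
print("overlapped, e3 shorter than write:", overlapped(BASELINE, write_cost, e3_latency=512))
```

Under this model the write is free whenever `e3` runs at least as long as the write, and otherwise only the remainder is paid.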

## What I found while digging in

Not proposing a fix here, just sharing what I found in case it's useful context:

- The FFM leg-merge logic (`accelforge/frontend/mapping/mapping.py`, where divergent per-Einsum legs get stitched into one `Mapping`) always wraps diverging legs in `Sequential(nodes=[Nested(...), Nested(...)])`. I couldn't find anywhere in `mapper/FFM/` that ever constructs a `Pipeline` node when joining Einsums.
- There *is* a `Pipeline` mapping node class with a docstring claiming branches "operate at the same time," and the reuse/skew analysis (`accelforge/model/_looptree/reuse/isl/mapping_to_isl/skews_from_mapping.py`) does tag `Pipeline` nodes differently from `Sequential` (`PipelineTag` vs. `SequentialTag`). However, when I manually took an FFM-generated mapping and swapped the top-level `Sequential` node for a `Pipeline` node with the same children, then re-ran it through `spec.evaluate_mapping()`, the reported latency was bit-for-bit identical (32768 in both cases). So as far as I can tell, `Pipeline` doesn't currently change the computed latency at all -- whatever distinguishes it structurally doesn't appear to be consumed by whatever ultimately produces the latency number.
- `BRANCH_TAGS = (PipelineTag, SequentialTag)` in `model/_looptree/reuse/isl/mapping_to_isl/types.py` doesn't seem to be referenced anywhere outside its own definition, and there are "PIPELINE CHANGES REQUIRED" notes in `mapper/FFM/_pareto_df/df_convention.py` that read like this is a known, partially-sketched-out direction rather than something I'm missing an existing flag for.
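
For context on why the identical result was surprising, here is a toy model of the semantics I would have guessed from the docstring. The classes `ToyLeaf`, `ToySequential`, and `ToyPipeline` are hypothetical stand-ins, not AccelForge's node classes, and the child latencies are made-up numbers. It also assumes an idealized pipeline with perfect overlap and no fill/drain cost, which a real model would likely not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyLeaf:
    """A branch with a fixed latency (hypothetical stand-in for a nested leg)."""
    latency: int

    def total(self) -> int:
        return self.latency


@dataclass
class ToySequential:
    """Branches run one after another: latencies add up."""
    nodes: List[ToyLeaf]

    def total(self) -> int:
        return sum(node.total() for node in self.nodes)


@dataclass
class ToyPipeline:
    """Branches run at the same time (idealized): the slowest one dominates."""
    nodes: List[ToyLeaf]

    def total(self) -> int:
        return max(node.total() for node in self.nodes)


# Same children under both node types; the latencies are illustrative only.
children = [ToyLeaf(1024), ToyLeaf(2048)]
print("sequential:", ToySequential(children).total())
print("pipeline:  ", ToyPipeline(children).total())
```

With semantics like these, swapping the top-level node type while keeping the same children would change the total whenever there is more than one branch with nonzero latency, which is not what I observed.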

## Question

Is overlapped/pipelined latency across Einsum (or even across loop-nest) boundaries something on the roadmap, and if so, is `Pipeline` the intended mechanism once it's wired up, or is the right mental model something different? Happy to help test against a real use case once there's something to try -- the workload above is a minimal repro if useful.

