
Model a DRAM write of one Einsum's output overlapping with the next Einsum's compute #54

Description

@okaikov

Summary

When a tensor is both (a) an intermediate consumed by the next Einsum in a fused cascade and (b) a tensor that must also be written out to DRAM (e.g. a real workload output, a checkpoint, or a debug tap), AccelForge currently has no way to express that the DRAM write happens in the background, overlapped with the next Einsum's compute. Every Einsum-to-Einsum transition is modeled as fully sequential, so the DRAM write always sits on the critical path before the next Einsum is considered to start. On real hardware these are usually two independent resources (a DRAM write port and the compute array) that can run concurrently.

Motivating use case

Three cascaded Einsums e1 -> e2 -> e3 on a DRAM/SRAM hierarchy, where e2's output o2 is consumed by e3 (so it must stay resident in SRAM for fusion) but is also a real workload output that must land in DRAM.
Using a Tensors.keep set-expression override on the DRAM node ("(~Intermediates) | o2") successfully gets o2 stored in both SRAM (for e3) and DRAM (as a backed/persistent output) -- that part of the API works well and is genuinely useful.
The open question is purely about latency: is the DRAM write of o2 allowed to happen while e3 is already computing, or does it have to finish first?
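For readers unfamiliar with the keep expression, here is a minimal sketch of how "(~Intermediates) | o2" reads as ordinary set algebra. It uses plain Python sets rather than AccelForge's expression parser, and it assumes that "Intermediates" means the tensors produced by one Einsum and consumed by another (o1 and o2 in this cascade).

```python
# Plain Python sets standing in for AccelForge's tensor set expressions.
# Assumption: "Intermediates" = tensors produced by one Einsum and consumed
# by a later one, which is {o1, o2} for the e1 -> e2 -> e3 cascade.
all_tensors = {"i1", "w1", "o1", "w2", "o2", "w3", "o3"}
intermediates = {"o1", "o2"}

# "~Intermediates": everything that is not an intermediate.
keep_default = all_tensors - intermediates

# "(~Intermediates) | o2": the same set, plus o2 explicitly.
keep_with_o2 = (all_tensors - intermediates) | {"o2"}

print(sorted(keep_default))
print(sorted(keep_with_o2))
```

Under that assumption, the only difference between the two DRAM configurations is that o2 joins the set of tensors DRAM must hold, while o1 stays SRAM-only.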

Reproduction

import accelforge as af

def build_spec(o2_to_dram: bool):
    workload = af.Workload(
        rank_sizes={"M": 8, "N0": 8, "N1": 8, "N2": 8, "N3": 8},
        einsums=[
            {"name": "e1", "tensor_accesses": [
                {"name": "i1", "projection": ["m", "n0"]},
                {"name": "w1", "projection": ["n0", "n1"]},
                {"name": "o1", "projection": ["m", "n1"], "output": True},
            ]},
            {"name": "e2", "tensor_accesses": [
                {"name": "o1", "projection": ["m", "n1"]},
                {"name": "w2", "projection": ["n1", "n2"]},
                {"name": "o2", "projection": ["m", "n2"], "output": True},
            ]},
            {"name": "e3", "tensor_accesses": [
                {"name": "o2", "projection": ["m", "n2"]},
                {"name": "w3", "projection": ["n2", "n3"]},
                {"name": "o3", "projection": ["m", "n3"], "output": True},
            ]},
        ],
        bits_per_value={"All": 8},
    )
    mem_actions = [
        af.arch.TensorHolderAction(name="read", energy=1, throughput=1),
        af.arch.TensorHolderAction(name="write", energy=1, throughput=1),
    ]
    # DRAM must keep every non-intermediate tensor; optionally force o2 into DRAM as well.
    keep_expr = "(~Intermediates) | o2" if o2_to_dram else "~Intermediates"
    arch = af.Arch(nodes=[
        af.arch.Memory(name="DRAM", size=float("inf"), actions=mem_actions, 
            area=0, leak_power=0,
            tensors=af.arch.Tensors(keep=keep_expr, may_keep="All")),
        af.arch.Memory(name="SRAM", size=1_000_000, actions=mem_actions,
            area=0, leak_power=0,
            tensors=af.arch.Tensors(keep="Intermediates", may_keep="All")),
        af.arch.Compute(name="mac",
            actions=[af.arch.Action(name="compute", energy=1, throughput=1)],
            area=0, leak_power=0),
    ])
    return af.Spec(arch=arch, workload=workload)

for label, flag in [("o2 SRAM-only (pure intermediate)", False),
                     ("o2 also written to DRAM", True)]:
    spec = build_spec(flag)
    mappings = spec.map_workload_to_arch(print_progress=False)
    print(label, "-> latency:", mappings.latency())

Output:

o2 SRAM-only (pure intermediate) -> latency: 31744
o2 also written to DRAM          -> latency: 32768

Adding the DRAM write for o2 increases total modeled latency from 31744 to 32768, a difference of 1024, which is exactly the cost of that extra write -- it is fully serialized onto the critical path, with no overlap with e3's compute.
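To make the expected behavior concrete, here is a minimal sketch of a two-resource timeline (one compute array, one DRAM write port). It is not AccelForge code: `cascade_latency` and the per-Einsum compute cycle counts are hypothetical, and only the 1024 write cost is taken from the delta observed above.

```python
def cascade_latency(compute, dram_write_after, overlap):
    """Toy latency model for a cascade of Einsums.

    compute:          compute cycles per Einsum, in execution order.
    dram_write_after: cycles of DRAM write issued after each Einsum (0 = none).
    overlap:          if True, a write runs concurrently with later compute;
                      writes still serialize on the single DRAM write port.
    """
    compute_free = 0  # time at which the compute array is next free
    dram_free = 0     # time at which the DRAM write port is next free
    for cycles, write in zip(compute, dram_write_after):
        compute_free += cycles
        if write:
            start = max(compute_free, dram_free)
            dram_free = start + write
            if not overlap:
                # Sequential model: the next Einsum waits for the write.
                compute_free = dram_free
    return max(compute_free, dram_free)


# Hypothetical compute costs; o2 and o3 each take 1024 cycles to write.
compute = [4096, 4096, 4096]
writes = [0, 1024, 1024]
sequential = cascade_latency(compute, writes, overlap=False)
overlapped = cascade_latency(compute, writes, overlap=True)
print(sequential, overlapped, sequential - overlapped)
```

With these made-up numbers the overlapped schedule is shorter by exactly the o2 write, which is the saving the repro suggests is currently not modeled.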

What I found while digging in

Not proposing a fix here, just sharing what I found in case it's useful context:

  • The FFM leg-merge logic (accelforge/frontend/mapping/mapping.py, where divergent per-Einsum legs get stitched into one Mapping) always wraps diverging legs in Sequential(nodes=[Nested(...), Nested(...)]). I couldn't find anywhere in mapper/FFM/ that ever constructs a Pipeline node when joining Einsums.
  • There is a Pipeline mapping node class with a docstring claiming branches "operate at the same time," and the reuse/skew analysis (accelforge/model/_looptree/reuse/isl/mapping_to_isl/skews_from_mapping.py) does tag Pipeline nodes differently from Sequential (PipelineTag vs. SequentialTag). However, when I manually took an FFM-generated mapping and swapped the top-level Sequential node for a Pipeline node with the same children, then re-ran it through spec.evaluate_mapping(), the reported latency was bit-for-bit identical (32768 in both cases). So as far as I can tell, Pipeline doesn't currently change the computed latency at all -- whatever distinguishes it structurally doesn't appear to be consumed by whatever ultimately produces the latency number.
  • BRANCH_TAGS = (PipelineTag, SequentialTag) in model/_looptree/reuse/isl/mapping_to_isl/types.py doesn't seem to be referenced anywhere outside its own definition, and there are "PIPELINE CHANGES REQUIRED" notes in mapper/FFM/_pareto_df/df_convention.py that read like this is a known, partially-sketched-out direction rather than something I'm missing an existing flag for.
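As a point of comparison for the swap experiment, here is a toy evaluator in which the node type does affect latency. The classes `ToyLeaf`, `ToySequential`, and `ToyPipeline` are hypothetical stand-ins, not AccelForge's mapping classes, and treating a pipeline as the maximum of its branches is an idealized assumption about what "operate at the same time" would mean.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToyLeaf:
    name: str
    cycles: int


@dataclass
class ToySequential:
    nodes: List["ToyNode"]


@dataclass
class ToyPipeline:
    nodes: List["ToyNode"]


ToyNode = Union[ToyLeaf, ToySequential, ToyPipeline]


def latency(node: ToyNode) -> int:
    """Sequential branches add up; pipelined branches overlap (idealized max)."""
    if isinstance(node, ToyLeaf):
        return node.cycles
    children = [latency(child) for child in node.nodes]
    if isinstance(node, ToyPipeline):
        return max(children)
    return sum(children)


# Same children, different top-level node, mirroring the manual swap.
# The 4096 compute cost is made up; 1024 is the observed write delta.
branches = [ToyLeaf("o2 DRAM write", 1024), ToyLeaf("e3 compute", 4096)]
seq_latency = latency(ToySequential(branches))
pipe_latency = latency(ToyPipeline(branches))
print(seq_latency, pipe_latency)
```

In a model like this, swapping the top-level node changes the result, whereas the experiment above produced 32768 both ways.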

Question

Is overlapped/pipelined latency across Einsum (or even across loop-nest) boundaries something on the roadmap, and if so, is Pipeline the intended mechanism once it's wired up, or is the right mental model something different? Happy to help test against a real use case once there's something to try -- the workload above is a minimal repro if useful.
