Summary
When a tensor is both (a) an intermediate consumed by the next Einsum in a fused cascade and (b) a tensor that must also be written out to DRAM (e.g. a real workload output, a checkpoint, or a debug tap), AccelForge currently has no way to express "that DRAM write happens in the background, overlapped with the next Einsum's compute." Every Einsum-to-Einsum transition is modeled as fully sequential, so the DRAM write always sits on the critical path before the next Einsum is considered to start. On real hardware the two usually use independent resources (a DRAM write port vs. the compute array) that can run concurrently.
Motivating use case
Three cascaded Einsums e1 -> e2 -> e3 on a DRAM/SRAM hierarchy, where e2's output o2 is consumed by e3 (so it must stay resident in SRAM for fusion) but is also a real workload output that must land in DRAM.
Using a Tensors.keep set-expression override on the DRAM node ("(~Intermediates) | o2") successfully gets o2 stored in both SRAM (for e3) and DRAM (as a backed/persistent output) -- that part of the API works well and is genuinely useful.
The open question is purely about latency: is the DRAM write of o2 allowed to happen while e3 is already computing, or does it have to finish first?
Reproduction
import accelforge as af


def build_spec(o2_to_dram: bool):
    # Three cascaded Einsums: e1 -> e2 -> e3, all ranks of size 8.
    workload = af.Workload(
        rank_sizes={"M": 8, "N0": 8, "N1": 8, "N2": 8, "N3": 8},
        einsums=[
            {"name": "e1", "tensor_accesses": [
                {"name": "i1", "projection": ["m", "n0"]},
                {"name": "w1", "projection": ["n0", "n1"]},
                {"name": "o1", "projection": ["m", "n1"], "output": True},
            ]},
            {"name": "e2", "tensor_accesses": [
                {"name": "o1", "projection": ["m", "n1"]},
                {"name": "w2", "projection": ["n1", "n2"]},
                {"name": "o2", "projection": ["m", "n2"], "output": True},
            ]},
            {"name": "e3", "tensor_accesses": [
                {"name": "o2", "projection": ["m", "n2"]},
                {"name": "w3", "projection": ["n2", "n3"]},
                {"name": "o3", "projection": ["m", "n3"], "output": True},
            ]},
        ],
        bits_per_value={"All": 8},
    )
    mem_actions = [
        af.arch.TensorHolderAction(name="read", energy=1, throughput=1),
        af.arch.TensorHolderAction(name="write", energy=1, throughput=1),
    ]
    # With the flag set, DRAM additionally keeps o2 even though it is an intermediate.
    keep_expr = "(~Intermediates) | o2" if o2_to_dram else "~Intermediates"
    arch = af.Arch(nodes=[
        af.arch.Memory(name="DRAM", size=float("inf"), actions=mem_actions,
                       area=0, leak_power=0,
                       tensors=af.arch.Tensors(keep=keep_expr, may_keep="All")),
        af.arch.Memory(name="SRAM", size=1_000_000, actions=mem_actions,
                       area=0, leak_power=0,
                       tensors=af.arch.Tensors(keep="Intermediates", may_keep="All")),
        af.arch.Compute(name="mac",
                        actions=[af.arch.Action(name="compute", energy=1, throughput=1)],
                        area=0, leak_power=0),
    ])
    return af.Spec(arch=arch, workload=workload)


# Map the workload twice: once with o2 as a pure intermediate, once with o2 also kept in DRAM.
for label, flag in [("o2 SRAM-only (pure intermediate)", False),
                    ("o2 also written to DRAM", True)]:
    spec = build_spec(flag)
    mappings = spec.map_workload_to_arch(print_progress=False)
    print(label, "-> latency:", mappings.latency())
Output:
o2 SRAM-only (pure intermediate) -> latency: 31744
o2 also written to DRAM -> latency: 32768
Adding the DRAM write for o2 increases total modeled latency by 1024 (31744 -> 32768), which is exactly the cost of that extra write -- it is fully serialized onto the critical path, with no overlap with e3's compute.
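To make the gap concrete, here is a small back-of-the-envelope sketch of what an overlapped model could report for the two numbers above. It is not AccelForge code: overlapped_latency and the "window" parameter (how much later work the write could hide behind) are hypothetical names, and the model assumes the DRAM write port and the compute array never contend with each other.

```python
# Illustrative arithmetic only; nothing here calls AccelForge.
BASE = 31744        # reported latency with o2 kept in SRAM only
SERIALIZED = 32768  # reported latency with o2 also written to DRAM
WRITE_COST = SERIALIZED - BASE  # 1024, the extra latency attributed to the write


def overlapped_latency(base: int, write: int, window: int) -> int:
    """Latency if up to `window` units of the write hide behind later work.

    `window` is a hypothetical knob: the amount of downstream work (e.g. e3's
    compute) that is assumed to run concurrently with the DRAM write.
    """
    exposed = max(0, write - window)  # part of the write still on the critical path
    return base + exposed


for window in (0, 512, 1024, 4096):
    print(window, overlapped_latency(BASE, WRITE_COST, window))
```

With window=0 this reproduces the serialized 32768; once the window reaches 1024 the write is fully hidden and the result falls back to 31744. Any overlap-aware latency for this repro should land somewhere in that range.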
What I found while digging in
Not proposing a fix here, just sharing what I found in case it's useful context:
- The FFM leg-merge logic (accelforge/frontend/mapping/mapping.py, where divergent per-Einsum legs get stitched into one Mapping) always wraps diverging legs in Sequential(nodes=[Nested(...), Nested(...)]). I couldn't find anywhere in mapper/FFM/ that ever constructs a Pipeline node when joining Einsums.
- There is a Pipeline mapping node class with a docstring claiming branches "operate at the same time," and the reuse/skew analysis (accelforge/model/_looptree/reuse/isl/mapping_to_isl/skews_from_mapping.py) does tag Pipeline nodes differently from Sequential (PipelineTag vs. SequentialTag). However, when I manually took an FFM-generated mapping and swapped the top-level Sequential node for a Pipeline node with the same children, then re-ran it through spec.evaluate_mapping(), the reported latency was bit-for-bit identical (32768 in both cases). So as far as I can tell, Pipeline doesn't currently change the computed latency at all -- whatever distinguishes it structurally doesn't appear to be consumed by whatever ultimately produces the latency number.
- BRANCH_TAGS = (PipelineTag, SequentialTag) in model/_looptree/reuse/isl/mapping_to_isl/types.py doesn't seem to be referenced anywhere outside its own definition, and there are "PIPELINE CHANGES REQUIRED" notes in mapper/FFM/_pareto_df/df_convention.py that read like this is a known, partially-sketched-out direction rather than something I'm missing an existing flag for.
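For reference, this is roughly the behavioral difference I expected from the swap, written as a self-contained toy. The classes below (Leaf, ToySequential, ToyPipeline) are hypothetical stand-ins, not AccelForge's mapping nodes, and the rule "a pipeline's latency is the max of its branches" is my assumption about what "operate at the same time" would mean in the idealized case (no fill/drain cost, no shared-resource contention). The leaf latencies are made-up numbers.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    name: str
    latency: int  # made-up cost of this branch running alone


@dataclass
class ToySequential:
    nodes: List["Node"]  # branches run one after another


@dataclass
class ToyPipeline:
    nodes: List["Node"]  # branches assumed to run fully concurrently


Node = Union[Leaf, ToySequential, ToyPipeline]


def latency(node: Node) -> int:
    """Sequential sums its children; the idealized pipeline takes their max."""
    if isinstance(node, Leaf):
        return node.latency
    children = [latency(child) for child in node.nodes]
    if isinstance(node, ToySequential):
        return sum(children)
    return max(children)


branches = [Leaf("o2 DRAM write", 1024), Leaf("e3 compute", 4096)]
print("sequential:", latency(ToySequential(branches)))
print("pipeline:  ", latency(ToyPipeline(branches)))
```

In this toy the two node types give different totals (5120 vs. 4096), whereas the real swap described above changed nothing. That is what makes me think the Pipeline tag is not yet consumed by the latency computation.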
Question
Is overlapped/pipelined latency across Einsum (or even across loop-nest) boundaries something on the roadmap, and if so, is Pipeline the intended mechanism once it's wired up, or is the right mental model something different? Happy to help test against a real use case once there's something to try -- the workload above is a minimal repro if useful.