Skip to content

Add 2D Autoencoder and GMM Integration on Siracusa Target#190

Draft
Aldrago98 wants to merge 30 commits into
pulp-platform:develfrom
Aldrago98:FIORIRE2
Draft

Add 2D Autoencoder and GMM Integration on Siracusa Target#190
Aldrago98 wants to merge 30 commits into
pulp-platform:develfrom
Aldrago98:FIORIRE2

Conversation

@Aldrago98
Copy link
Copy Markdown
Contributor

The goal of this branch is to implement a 2D autoencoder model integrated with a Gaussian Mixture Model (GMM) on the Siracusa target with Neureka support.

The implementation was developed incrementally through the following steps:

  1. initial support for a generic target;
  2. porting to Siracusa without tiling;
  3. introduction of tiled support on Siracusa;
  4. final integration with Siracusa + Neureka.

Added

Changed

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@diaconuccalin diaconuccalin added the Feature Addition of new features label May 6, 2026
@Aldrago98 Aldrago98 changed the title FIORIRE2 Add 2D Autoencoder and GMM Integration on Siracusa Target May 6, 2026
Copy link
Copy Markdown
Contributor

@diaconuccalin diaconuccalin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting the first third part of the review, I will finish going through the rest of the files as soon as possible.

Comment thread .vscode/launch.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that these changes were made to adjust your local work env, and should not be pushed to main. Please revert them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only changes to this file are blank lines. Please revert them completely.

BatchNorm_fp32(
${data_in}, ${scale}, ${bias}, ${mean}, ${variance},
${data_out}, ${batch_size}, ${channel_size}, ${window_size}
${data_out}, ${batch_size}, ${channel_size}, ${window_size}, ${epsilon}, ${channels_first}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests for this new feature (one test with channels_first 0, another with channels_first 1, and one more with non-defaul epsilon value)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert all the changes in this file. It seems to me that it was a temporary fix to assign the n_cores to a new variable, which is not needed anymore, so no need for the new variable either.

if engine is not None:
node.attrs["engine"] = engine.name
if hasattr(engine, "n_cores"):
node.attrs["n_cores"] = engine.n_cores
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good approach IMO. The number of cores is not a node attribute (conceptually, the node attributes should follow the ones that exist in the real ONNX nodes). Plus, this issue of passing the information about the number of cores should already be solved, and the value should already exist in the operator representation, it's passed here. If this value doesn't get passed in your case, we should identify the root cause.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking a little more into it, maybe you need to add NeurekaEngine in the list here.

tilerModel.addTensorDimToModel(ctxt, tensorName)

for idx, shapeDim in enumerate(_buffer.shape):
shape = [_buffer.shape] if isinstance(_buffer.shape, int) else _buffer.shape
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should identify the root cause for which this fix was needed and fix it in that location, rather than here (transforming there the shape in an enumerable, rather than here.


schedule = TilingSchedule({}, {}, [], [])
repScheme = VariableReplacementScheme({}, {})
inputLoadSchedule: List[Dict[str, HyperRectangle]] = [{}]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

shape = cls._normalizedShape(buffer.shape)
outputLoadSchedule[0][addrName] = HyperRectangle((0,) * len(shape), shape)

schedule = TilingSchedule(inputBaseOffsets, outputBaseOffsets, inputLoadSchedule, outputLoadSchedule)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why is this needed? Why separate the input and output? And why change the schedule from an empty one?

if inputShapes[1] == () or inputShapes[1] == []:
inputShapes[1] = (1,)

# Scalars and singletons should broadcast to the tensor operand,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?


def __init__(self, name: str = '', shape = [1], values = [0]):
super().__init__(name, shape, values)
# Some Neureka lowering paths inspect global constants before type inference
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which ones? This looks to me like an unstable patch that doesn't really solve the root cause (why they inspect before type inference?)

Copy link
Copy Markdown
Contributor

@diaconuccalin diaconuccalin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second part of my review. Working on the third and hopefully final part :)

node = tensor.inputs[0]
input_values = []
for input_tensor in node.inputs:
value = self._evaluate_constant_tensor(input_tensor)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason for having this kind of recursivity.

def __init__(self):
super().__init__()

def _evaluate_constant_tensor(self, tensor: gs.Tensor):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this function? Why do they need to be evaluated?

return None
input_values.append(value)

if node.op == "Constant":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's bad practice to separate node-specific checks and operations like this from either points where we would previously do this association to nodes, or to node-specific functions (if an issue appears, or a future developer wants to add a new node, they will also need to modify here, and it's not easy to find this location).


if ret:
self.operatorRepresentation['mode'] = node.attrs['mode']
self.operatorRepresentation['value'] = 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to the else of "if 'value' in node.attrs", so it's more visible

return True

return ret
if len(node.inputs) in (2, 3):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick comment (: I think it would be cleaner and easier to understand with separate ifs (if len(input) >= 2, if len(input) >= 3, etc), like you did below

for (uint32_t buf = 0; buf < DeeployNetwork_num_outputs; buf++) {
uint32_t count = DeeployNetwork_outputs_bytes[buf] / sizeof(OUTPUTTYPE);
printf("OUTPUT %u %u\r\n", buf, count);
for (uint32_t i = 0; i < count; i++) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part a debugging leftover?

if (abs_actual > scale) {
scale = abs_actual;
}
float tolerance = FLOAT_ABS_TOL + FLOAT_REL_TOL * scale;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting solution for having a relative error check. Let's wait for the opinion of the others as well (@runwangdl @Xeratec @Victor-Jung), if they think we should keep it like this.

printf("Actual: %4d ", actual);
printf("Diff: %4d at Index %12u in Output %u\r\n", diff, i, buf);
}
#if ISOUTPUTFLOAT == 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change from if to #if?

float tolerance = FLOAT_ABS_TOL + FLOAT_REL_TOL * scale;

if ((diff < -1e-4) || (diff > 1e-4) || isnan(diff)) {
if ((abs_diff > tolerance) || isnan(diff)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, let's wait for other opinions.

printf("Actual: %4d ", actual);
printf("Diff: %4d at Index %12u in Output %u\r\n", diff, i, buf);
}
#if ISOUTPUTFLOAT == 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why change from if to #if?

Copy link
Copy Markdown
Contributor

@diaconuccalin diaconuccalin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review finished. Other than the comments I left across the files, it would be useful to have some more testing for all the modified elements. Namely:

  • batchnorm with and without epsilon, with and without channel first
  • larger ConvTranspose2D with tighter memory limits, to force tiling
  • multiplication where B is not scalar, but has the same size as A
  • ReduceLogSumExp with one and more reduction axis
  • Pad and Slice, with thight mem limits for tiling
  • Full 2D autoencoder and GMM models, with tighter mem limits for tiling
  • MaxPool on non-square input, since you inverted the width and hight

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a Conv2D test in DeeployTest/Tests/Kernels/FP32/Conv/Regular_2D, why the need for this one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a Conv2D test in DeeployTest/Tests/Kernels/FP32/Conv/Regular_2D, why the need for this one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a Conv2D test in DeeployTest/Tests/Kernels/FP32/Conv/Regular_2D, why the need for this one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this test in a new subdir inside DeeployTest/Tests/Kernels/FP32/Conv

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this test in a new subdir inside DeeployTest/Tests/Kernels/FP32/Conv

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have 2 different GMM models, one here, and one in DeeployTest/Tests/Models/Autoencoder2D/GMM?

gen_args_list.extend(args.input_offset_map)

if tiling:
if hasattr(args, 'cores'):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this extra pass of the cores argument needed? I already see it around line 413 of this file.

#include "DeeployPULPMath.h"
#include "pmsis.h"

__attribute__((noinline, optnone)) void PULP_ConvTranspose2d_fp32_fp32_fp32_CHW(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optnone seems like a debugging leftover, is there a reason to keep it?

PULPConv2DTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPFloatConv2DBindings,
tileConstraint = Conv2DTileConstraint())

PULPConv2DUntiledTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPFloatConv2DBindings,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-sensical to have tiling ready bindings for untiled operation, please remove.

PULPMaxPool2DTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPMaxPool2DBindings,
tileConstraint = MaxPoolCTileConstraint())

PULPMaxPool2DUntiledTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPMaxPool2DBindings,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-sensical to have tiling ready bindings for untiled operation, please remove.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature Addition of new features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants