Potential bug in preprocess.py: Feature overwriting due to reference assignment and Pandas indexing #2

@starway11

Hi there,

Thank you for sharing the code for HEIST! While reviewing the data preprocessing pipeline in utils/preprocess.py, I noticed a couple of potential issues in the loop that constructs the graphs list.

Specifically, in this block:

Python
for k in tqdm(range(len(adata.obs.cell_type))):
    G_gene = gene_network_dict[adata.obs.cell_type[k]]
    G_gene.num_nodes = NUM_GENES
    G_gene.cell_type = G_cell.cell_type[k]
    G_gene.X = torch.from_numpy(adata.X[k].reshape(NUM_GENES, 1))
    graphs.append(G_gene)

I believe there are two unintended behaviors here:

Object Reference Overwriting: Because G_gene fetches a direct reference to the PyG graph stored in gene_network_dict, the subsequent lines (G_gene.X = ... and G_gene.cell_type = ...) mutate the shared object in place. Consequently, all cells belonging to the same cell type will point to the exact same graph object in memory. Their gene expression features (.X) will be continuously overwritten, leaving all cells of a given type with the expression profile of the last cell processed in the loop. This seems to conflict with the paper's design, which assigns cell-specific initial expression features to each graph.
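To illustrate the aliasing, here is a minimal sketch that reproduces the pattern without PyG. The `Graph` class, the dictionary contents, and the expression values are hypothetical stand-ins, not code from the repository. It also shows one possible fix, copying the per-type template before attaching per-cell features.

```python
import copy


class Graph:
    """Hypothetical stand-in for the PyG graph objects in gene_network_dict."""

    def __init__(self, edges):
        self.edges = edges
        self.X = None


gene_network_dict = {"T cell": Graph(edges=[(0, 1)])}
cell_types = ["T cell", "T cell"]
expression = [[1.0, 2.0], [3.0, 4.0]]

# Pattern from the issue: every iteration mutates the same shared object.
shared = []
for k, ct in enumerate(cell_types):
    g = gene_network_dict[ct]
    g.X = expression[k]
    shared.append(g)

# Possible fix: copy the template, then attach the per-cell features.
copied = []
for k, ct in enumerate(cell_types):
    g = copy.copy(gene_network_dict[ct])  # shallow copy: edges stay shared
    g.X = expression[k]
    copied.append(g)

print(shared[0] is shared[1])    # True: both entries are one object
print(shared[0].X)               # [3.0, 4.0]: the first cell's features are gone
print(copied[0].X, copied[1].X)  # [1.0, 2.0] [3.0, 4.0]
```

With real PyG `Data` objects, I would expect `G_gene = gene_network_dict[...].clone()` to play the role of the copy here, though I have not tested that against this repository.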

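Pandas Indexing Error: Using adata.obs.cell_type[k] with an integer k goes through Series.__getitem__, whose treatment of integer keys depends on the index. On a string-barcode index, pandas has historically fallen back to positional lookup, but that fallback is deprecated (it emits a FutureWarning since pandas 2.1), and once it is removed this line will throw a KeyError. On an integer index that is not 0..n-1 (for example, after filtering cells), k is treated as a label, so the lookup can throw a KeyError or silently return a different row. Using adata.obs.cell_type.iloc[k] would make the positional intent explicit.

Could you please confirm if this aligns with your intended logic? Thank you again for your time and the great work!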
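For reference, a small sketch of the positional access I have in mind. The barcodes and cell types below are made up for illustration, and the series stands in for adata.obs.cell_type.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs.cell_type with string barcodes as index.
cell_type = pd.Series(
    ["T cell", "B cell", "NK cell"],
    index=["AAAC-1", "AAAG-1", "AACT-1"],
    name="cell_type",
)

# .iloc is positional regardless of index type or pandas version.
by_position = [cell_type.iloc[k] for k in range(len(cell_type))]

# Converting once to a NumPy array also gives plain positional access
# and avoids per-row pandas overhead in a long loop.
as_array = cell_type.to_numpy()

# With an integer index that is not 0..n-1 (e.g. after filtering),
# position 0 and label 0 are different things.
filtered = pd.Series(["T cell", "B cell"], index=[5, 9])
print(filtered.iloc[0])         # T cell
print(0 in filtered.index)      # False: there is no row labelled 0
```

Either `.iloc[k]` or a one-time `to_numpy()` conversion before the loop should remove the dependence on how the index is labelled.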
