Hi there,
Thank you for sharing the code for HEIST! While reviewing the data preprocessing pipeline in utils/preprocess.py, I noticed a couple of potential issues in the loop that constructs the graphs list.
Specifically, in this block:
for k in tqdm(range(len(adata.obs.cell_type))):
    G_gene = gene_network_dict[adata.obs.cell_type[k]]
    G_gene.num_nodes = NUM_GENES
    G_gene.cell_type = G_cell.cell_type[k]
    G_gene.X = torch.from_numpy(adata.X[k].reshape(NUM_GENES, 1))
    graphs.append(G_gene)
I believe there are two unintended behaviors here:
Object Reference Overwriting: G_gene is a direct reference to the PyG graph stored in gene_network_dict, so the subsequent assignments (G_gene.X = ... and G_gene.cell_type = ...) mutate that shared object in place. As a result, every cell of a given cell type ends up as the same graph object in graphs, and each iteration overwrites its .X. Once the loop finishes, all cells of a type carry the expression profile of the last cell of that type processed. This seems to conflict with the paper's design, which assigns cell-specific initial expression features to each graph.
Pandas Indexing: adata.obs.cell_type[k] with an integer k is not guaranteed to be a positional lookup. When the obs index holds string barcodes, older pandas versions fall back to positional access (with a deprecation warning in recent 2.x releases, as far as I can tell), whereas pandas 3.0 treats k as a label and raises a KeyError. Writing adata.obs.cell_type.iloc[k] would make the positional intent explicit and behave the same across versions.
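To make the first point concrete, here is a minimal, library-free sketch of the aliasing behavior. The Graph class, the templates dict, and the build function are hypothetical stand-ins I made up for illustration (they are not from the HEIST code); Graph plays the role of the PyG graph and templates the role of gene_network_dict.

```python
import copy


class Graph:
    """Hypothetical stand-in for a PyG graph object."""

    def __init__(self, edges):
        self.edges = edges
        self.X = None


# One template graph per cell type, and three cells of the same type.
templates = {"T": Graph(edges=[(0, 1)])}
cell_types = ["T", "T", "T"]
expression = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def build(copy_template):
    graphs = []
    for k, ct in enumerate(cell_types):
        g = templates[ct]  # direct reference to the shared template
        if copy_template:
            g = copy.copy(g)  # new object; still shares the edge list
        g.X = expression[k]
        graphs.append(g)
    return graphs


shared = build(copy_template=False)
separate = build(copy_template=True)

print([g.X for g in shared])    # [[5.0, 6.0], [5.0, 6.0], [5.0, 6.0]]
print([g.X for g in separate])  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

If this matches what happens in preprocess.py, copying the graph before assigning to it (for example with PyG's Data.clone(), or copy.copy if sharing the edge tensors is acceptable) might be enough to give each cell its own features. I have not tested this against the repository, so please treat it as a suggestion only.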
Could you please confirm if this aligns with your intended logic? Thank you again for your time and the great work!