Split cleanData and add Parquet exports #11
Conversation
Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
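As a hedged sketch of the flow described above (the function and argument names here are assumptions, not the package's actual API; the arrow, DBI, and duckdb packages are required at call time):

```r
# Hypothetical writeCompressedParquet helper: normalize the output path,
# ensure the directory exists, and write a compressed Parquet file.
writeCompressedParquet <- function(df, out_dir, name, compression = "zstd") {
  out_dir <- normalizePath(out_dir, mustWork = FALSE)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(out_dir, paste0(name, ".parquet"))
  arrow::write_parquet(df, path, compression = compression)
  path
}

# Parquet-backed DuckDB: each cleaned table becomes a view over its
# Parquet file, so the .duckdb file stays small and the Parquet files
# remain the source of truth.
createParquetViews <- function(db_path, parquet_paths) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE))
  for (nm in names(parquet_paths)) {
    DBI::dbExecute(con, sprintf(
      "CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')",
      nm, parquet_paths[[nm]]
    ))
  }
  invisible(db_path)
}
```

With this split, `cleanMetaData` would call the helpers for metadata, amr_phenotype, genome_data and original_metadata, while the new feature-matrix `cleanData` registers its own tables the same way.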
Fixed trailing-zero bug, fixed FTP timeout bug (?), fixed downloads hanging on empty files, fixed imbalanced genome data sets (e.g., missing .fna but present .faa and .gff)
message(" amr_phenotype: ", n_amr_ids)
message(" genomes with 0 AMR rows: ", length(ids_zero_amr))
if (length(ids_zero_amr)) {
  message(" e.g.: ", paste(utils::head(ids_zero_amr, 10), collapse = ", "))
It printed
Initial summary: targets=24421 | AMR genomes=24422 | genome_data genomes=24423
Final summary:
targets : 24421
bac_data : 24421
genome_data : 24423
amr_phenotype: 24422
genomes with 0 AMR rows: 2
e.g.: , genome.genome_id
So I am guessing it is treating the column name as a genome id!
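A hedged guess at how that can happen: if the id column is read with the header row still included, the set difference happily reports the header (and any empty string) as "genomes" with zero AMR rows. A minimal base-R illustration with made-up ids:

```r
# Made-up ids: "genome.genome_id" is the header row leaking into the data,
# and "" is a blank line read as an id.
amr_ids    <- c("genome.genome_id", "", "562.1", "562.2")
target_ids <- c("562.1", "562.2")

ids_zero_amr <- setdiff(amr_ids, target_ids)
print(ids_zero_amr)
# -> "genome.genome_id" ""  : two spurious "genomes with 0 AMR rows",
# which matches the "e.g.: , genome.genome_id" output above.
```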
We can strip this out. It was for debugging originally which is why it's messy. Alternatively, if it is useful to keep a summary, I can work on some improvements for it!
I didn't find any logical impact from this misprint. We can come back to it later; I have made an issue.
Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
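A hedged, base-R sketch of what such a .clstr parser could look like; the function name is hypothetical and the column names are assumptions, not the package's actual API:

```r
# Parse CD-HIT .clstr output into a long-format mapping of clusters to
# member feature ids. .clstr files interleave ">Cluster N" headers with
# member lines like "0\t100aa, >feature_id... *" (the "*" marks the
# cluster representative).
parseClstr <- function(path) {
  lines <- readLines(path)
  is_header <- startsWith(lines, ">Cluster")
  cluster_id <- cumsum(is_header) - 1L      # 0-based cluster index per line
  members <- lines[!is_header]
  # Strip everything up to ">" and the trailing "..." suffix to get the id.
  member_id <- sub("^.*>", "", sub("\\.\\.\\..*$", "", members))
  data.frame(
    cluster_id = cluster_id[!is_header],
    member_id  = member_id,
    is_representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}
```

The long format makes the protein members table a straightforward write: one row per (cluster, member) pair.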
Minor regex change in cleanData.
Updated resistance summary calculation to read from the database and include a count of resistant classes.
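A hedged sketch of what "read from the database and count resistant classes" could look like; the table and column names (amr_phenotype, antibiotic, resistant_phenotype) are assumptions based on the fields discussed in this thread, not the package's actual schema:

```r
# Hypothetical resistance summary: for each genome, count the distinct
# antibiotics with a 'Resistant' phenotype, reading from the amr_phenotype
# view. Requires DBI + duckdb and an open connection `con`.
resistanceSummary <- function(con) {
  DBI::dbGetQuery(con, "
    SELECT genome_id,
           COUNT(DISTINCT antibiotic)
             FILTER (WHERE resistant_phenotype = 'Resistant')
             AS n_resistant_classes
    FROM amr_phenotype
    GROUP BY genome_id
  ")
}
```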
Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:
Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names
#'
#' @keywords internal
.buildProtMatrices <- function(cluster_map) {
  cluster_map[, count := 1]
The := operator is used here, but data.table is not imported into the package namespace. Currently, it is only listed under Imports: in DESCRIPTION.
When I ran runDataProcessing() locally, this resulted in an error:
[ was called on a data.table in an environment that is not data.table-aware (cedta())
CD-HIT runs, but the error comes immediately after, when R tries to parse the CD-HIT outputs into a cluster matrix.
To fix, add the data.table import to any R/*.R file with:
#' @importFrom data.table :=
NULL
Then run devtools::document() to regenerate NAMESPACE. One import covers every := call in the package, so no need to repeat it per file.
Co-authored-by: Abhirupa Ghosh <100681585+AbhirupaGhosh@users.noreply.github.com>
Co-authored-by: Emily Boyer <130874527+eboyer221@users.noreply.github.com>
jananiravi
left a comment
Because of a lot of formatting changes, I'm not sure what the actual changes were -- looks OK at a high level. What's new here, @AbhirupaGhosh?
cdhit_extra_args = c("-g", "1"),
cdhit_output_prefix = "cdhit_out",
# InterPro
ipr_appl = c("Pfam"),
are there other alternatives allowed/available?
antifam, cdd, funfam, gene3d, hamap, ncbifam, panther, pfam, phobius, pirsf, pirsr, prints, prosite, sfld, smart, superfamily, and tmhmm are available when the user downloads the InterPro data.
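For example (a hypothetical call fragment; the names follow InterProScan's --appl convention, and availability depends on the local InterPro install):

```r
# Request several member databases at once instead of Pfam only.
ipr_appl = c("Pfam", "SMART", "SUPERFAMILY")
```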
The PR started with separating the functions that create Parquet files for metadata and feature data. Then: minor changes in the regex patterns that read BV-BRC feature ids, reformatting the summary stats, and adding the missing library imports.