
Split cleanData and add Parquet exports #11

Open
AbhirupaGhosh wants to merge 19 commits into main from cleanData

Conversation

@AbhirupaGhosh
Contributor

Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
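
The export pattern described here can be sketched in R. The helper name writeCompressedParquet comes from the PR; the body, the zstd codec, the file paths, and the view names below are assumptions for illustration, not the actual implementation:

```r
library(arrow)
library(DBI)
library(duckdb)

# Hypothetical sketch of the helper: write a data frame to compressed Parquet,
# creating the output directory if needed
writeCompressedParquet <- function(df, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  arrow::write_parquet(df, path, compression = "zstd")
  invisible(path)
}

# Parquet-backed DuckDB: one view per exported table (paths assumed)
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "curated.duckdb")
for (tbl in c("metadata", "amr_phenotype", "genome_data", "original_metadata")) {
  DBI::dbExecute(con, sprintf(
    "CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('out/%s.parquet')",
    tbl, tbl
  ))
}
DBI::dbDisconnect(con, shutdown = TRUE)
```

Because the views read the Parquet files directly, the DuckDB file stays small and the views always reflect the latest exports.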

Description

What kind of change(s) are included?

  • Feature (adds or updates new capabilities)
  • Bug fix (fixes an issue)
  • Enhancement (adds functionality)
  • Breaking change (these changes would cause existing functionality to not work as expected)

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.
  • Reviewer assignment: Tag a relevant team member to review and approve the changes.

AbhirupaGhosh and others added 3 commits January 28, 2026 16:31
Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
AbhirupaGhosh and others added 3 commits February 19, 2026 15:56
Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)
Comment thread R/data_curation.R
message(" amr_phenotype: ", n_amr_ids)
message(" genomes with 0 AMR rows: ", length(ids_zero_amr))
if (length(ids_zero_amr)) {
message(" e.g.: ", paste(utils::head(ids_zero_amr, 10), collapse = ", "))
Contributor Author
It printed

Initial summary: targets=24421 | AMR genomes=24422 | genome_data genomes=24423
Final summary:
  targets      : 24421
  bac_data     : 24421
  genome_data  : 24423
  amr_phenotype: 24422
  genomes with 0 AMR rows: 2
  e.g.: , genome.genome_id

So I am guessing it is counting the column name as a genome id!

Contributor

We can strip this out. It was for debugging originally which is why it's messy. Alternatively, if it is useful to keep a summary, I can work on some improvements for it!

Contributor Author

I didn't find any logical impact of this misprinting, so we can come back to it later; I have opened an issue.
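
For reference, a minimal guard against this kind of contamination could look like the following; the variable names are hypothetical, not the PR's code:

```r
# Hypothetical: drop empty strings and the literal header value before
# computing the zero-AMR summary, so stray header rows don't count as genomes
ids_zero_amr <- setdiff(target_ids, amr_ids)
ids_zero_amr <- ids_zero_amr[nzchar(ids_zero_amr) &
                               ids_zero_amr != "genome.genome_id"]
message("  genomes with 0 AMR rows: ", length(ids_zero_amr))
```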

Comment thread R/data_curation.R Outdated
Comment thread R/data_curation.R Outdated
AbhirupaGhosh and others added 10 commits March 11, 2026 16:31
Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
Minor regex change --> cleanData
Updated resistance summary calculation to read from the database and include a count of resistant classes.
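
The .clstr parsing step from the first commit above could be sketched like this; the function and column names are assumptions, not the PR's actual code:

```r
# Hypothetical parser: turn CD-HIT .clstr output into a long-format
# cluster -> member mapping. Member lines look like:
#   "0\t245aa, >feature_id... *"   (the "*" marks the representative)
parseClstr <- function(path) {
  lines <- readLines(path)
  is_header <- startsWith(lines, ">Cluster")
  cluster_id <- cumsum(is_header) - 1L      # 0-based cluster index per line
  members <- lines[!is_header]
  feature_id <- sub("^.*>", "", members)    # keep the text after ">"
  feature_id <- sub("\\.\\.\\..*$", "", feature_id)  # drop "..." and suffix
  data.frame(
    cluster = cluster_id[!is_header],
    feature_id = feature_id,
    is_representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}
```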
Comment thread R/data_processing.R Outdated
@eboyer221 eboyer221 self-requested a review April 8, 2026 17:43
Comment thread R/data_processing.R Outdated
Contributor

@eboyer221 eboyer221 left a comment


Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:

Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names

Comment thread R/data_processing.R
#'
#' @keywords internal
.buildProtMatrices <- function(cluster_map) {
cluster_map[, count := 1]
Contributor


The := operator is used here, but data.table is not imported into the package namespace. Currently, it is only listed under Imports: in DESCRIPTION.

When I ran runDataProcessing() locally, this resulted in an error:
[ was called on a data.table in an environment that is not data.table-aware (cedta())

CD-HIT runs, but the error comes immediately after, when R tries to parse the CD-HIT outputs into a cluster matrix.

To fix, add the following roxygen block to any R/*.R file:

#' @importFrom data.table :=
NULL

Then run devtools::document() to regenerate NAMESPACE. One import covers every := call in the package, so no need to repeat it per file.
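
Concretely, the fix looks like this in a package source file; the roxygen lines are from this review, while the function body is just a sketch:

```r
# In any R/*.R file: a single package-level import of := from data.table
#' @importFrom data.table :=
NULL

# After devtools::document(), NAMESPACE gains importFrom(data.table, ":=")
# and package code can modify data.tables by reference without the
# "not data.table-aware (cedta())" error:
.buildProtMatrices <- function(cluster_map) {
  cluster_map[, count := 1]
  cluster_map
}
```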

Comment thread R/data_processing.R Outdated
Comment thread R/data_processing.R
AbhirupaGhosh and others added 3 commits April 20, 2026 10:34
Co-authored-by: Abhirupa Ghosh <100681585+AbhirupaGhosh@users.noreply.github.com>
Co-authored-by: Emily Boyer <130874527+eboyer221@users.noreply.github.com>
Member

@jananiravi jananiravi left a comment


because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

Comment thread vignettes/intro.Rmd
cdhit_extra_args = c("-g", "1"),
cdhit_output_prefix = "cdhit_out",
# InterPro
ipr_appl = c("Pfam"),
Member


are there other alternatives allowed/available?

Contributor Author


antifam, cdd, funfam, gene3d, hamap, ncbifam, panther, pfam, phobius, pirsf, pirsr, prints, prosite, sfld, smart, superfamily, and tmhmm are available once the user downloads the InterPro data.

@AbhirupaGhosh
Contributor Author

AbhirupaGhosh commented Apr 22, 2026

because of a lot of formatting changes, not sure what were the actual changes -- looks OK at a high-level. what's new here, @AbhirupaGhosh?

The PR started by separating the functions that create Parquet files from metadata and from feature data. It then made minor changes to the regex patterns that read BV-BRC feature ids, reformatted the summary stats, and added the missing imports via @importFrom.
