Update htcondor instructions #459
tristan-f-r
left a comment
[I'll have to restore my HTCondor access to follow this.]
agitter
left a comment
I'm testing the Snakemake long execution mode. The first time my jobs went on hold because I put my spras-v0.6.0.sif file in the htcondor/ directory instead of the root directory. That should have been obvious based on the comment in the .yaml file.
On the second attempt my jobs went on hold with
Transfer output files failure at execution point slot1_24@e2591.chtc.wisc.edu while sending files to access point ap2001. Details: 1 total failures: first failure: reading from file /var/lib/condor/execute/slot1/dir_3699332/scratch/output: (errno 2) No such file or directory
I converted this to a draft because these docs will depend on the explicit sif transfer PR, and I haven't yet tested everything here in that paradigm.
Also, apologies for the poor git etiquette in the last commit that rolled too many things into one diff (including running an
178a93e to
fe7fbbc
…gging I was tired of hacking around wanting verbose logging in the HTCondor Snakemake executor, so I added some plumbing to pass Snakemake's '--verbose' flag through 'snakemake_long.py' to snakemake itself. Additionally, I added '--env-manager' so I could run things with my preferred mamba env instead of conda (which is too slow to rebuild).
The executor has matured quite a bit since these instructions were first drafted, and it's my hope that these changes remove a lot of the headache for running jobs. Now, you can edit config files in `config/` and use the `input/` directory directly. Workflows should be submitted directly from the repository root.
Co-authored-by: Tristan F.-R. <pub.tristanf@gmail.com>
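As a rough illustration of the `--verbose` pass-through described in this commit, the sketch below shows one way a wrapper could forward the flag to Snakemake. It is not the actual `snakemake_long.py`: `parse_args`, `build_snakemake_cmd`, and the `spras_profile` default are hypothetical names chosen for the example.

```python
import argparse


def parse_args(argv):
    """Parse the wrapper's own flags (a hypothetical subset)."""
    parser = argparse.ArgumentParser(description="Illustrative wrapper around snakemake")
    # Hypothetical default; the real profile location may differ.
    parser.add_argument("--profile", default="spras_profile")
    parser.add_argument("--verbose", action="store_true",
                        help="Forward --verbose to snakemake")
    return parser.parse_args(argv)


def build_snakemake_cmd(args):
    """Return the snakemake command line as a list, without running it."""
    cmd = ["snakemake", "--profile", args.profile]
    if args.verbose:
        # Only add the flag when requested so default runs stay quiet.
        cmd.append("--verbose")
    return cmd
```

Returning the command as a list keeps the sketch easy to inspect before anything is executed.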
fe7fbbc to
ceea753
These came from testing Neha's real workflow in June 2026. Not totally sure how they all work (and whether additional environment variables will need to be added in the future), but they were key to getting custom sif images to unpack alongside the jobs.
This is a note for myself -- one thing I should document in the htcondor rst is the need to pre-create apptainer images before launching workflows.
Add guidance to docs/htcondor.rst encouraging users to pre-build per-algorithm container images rather than pulling them at runtime, and steer them toward the proper place to build those images. Also add a warning against running `apptainer build` directly on a shared Access Point, pointing users to CHTC's guide for building images in an interactive job.
agitter
left a comment
During my testing, I triggered a Snakemake lock error by launching a long job, killing it, changing the config file, and relaunching. That may be a common error.
python3.11/site-packages/snakemake/persistence.py", line 211, in lock
raise snakemake.exceptions.LockException()
snakemake.exceptions.LockException: Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
Should we add it to troubleshooting?
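If this goes into troubleshooting, a small check for leftover locks might help. The sketch below is a minimal example that assumes Snakemake keeps its lock files under `.snakemake/locks` in the working directory; that location and the helper name `find_snakemake_locks` are assumptions to verify against the installed Snakemake version.

```python
from pathlib import Path


def find_snakemake_locks(workdir):
    """Return lock files under <workdir>/.snakemake/locks, sorted by name.

    Assumption: Snakemake stores its locks in .snakemake/locks; check this
    against the installed Snakemake version before relying on it.
    """
    lock_dir = Path(workdir) / ".snakemake" / "locks"
    if not lock_dir.is_dir():
        return []
    return sorted(p for p in lock_dir.iterdir() if p.is_file())
```

A non-empty result while no Snakemake process is running matches the situation in the error message, which recommends rerunning with the `--unlock` argument.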
I also hit this error
$ cat htcondor/logs/merge_input/merge_input-5_7645955.err
ModuleNotFoundError in file "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8:
No module named 'spras.config.revision'
File "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8, in <module>
I'm guessing that means I need a newer version of the SPRAS sif image. However, we haven't released a SPRAS version recently. What version of the image are you testing with?
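One quick way to confirm this kind of mismatch is to check, inside the image, whether the module the Snakefile imports can be found. This is a minimal sketch using only the standard library; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located.

    Note: for a dotted name, find_spec imports the parent packages, so a
    missing parent surfaces as an ImportError rather than a None result.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Covers ModuleNotFoundError raised for a missing parent package.
        return False
```

Run inside the container, `module_available("spras.config.revision")` returning `False` would be consistent with the image being older than the checked-out Snakefile.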
For the first bug, activate the spras conda environment and then run Snakemake with the `--unlock` argument named in the error message. For the second issue, you are right: the version of SPRAS in the v0.6 Docker image isn't up to date with the current version of SPRAS. Justin has a Docker image you can pull from Docker Hub (I think it is jhiemstra/spras:update-htcondor-instructions-v2), or you will need to build an image of the updated version of SPRAS with Docker on your local machine, push it to Docker Hub, and then use that image.
In general, you should either:
The key is that the repo you're using to submit from the AP should match what's in the image. There's a callout relatively early in the documentation covering this, but I'm open to edits since it seems like this is often missed:
ntalluri
left a comment
The updated documentation looks great, I added some suggestions on how to help users more.
I also remember we were changing the config.yaml file in spras_profile, and I wasn't sure if any of the commands needed to be added to the documentation.
This is what my config.yaml looks like. I wasn't sure if we need to add the
... && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
stream_output: true
stream_error: true
parts to the documentation.
# Default configuration for the SPRAS/HTCondor executor profile. Each of these values
# can also be passed via command line flags, e.g. `--jobs 30 --executor htcondor`.
# NOTE: File paths in here should be relative to where you submit from, typically the
# root of the SPRAS repository
# 'jobs' specifies the maximum number of HTCondor jobs that can be in the queue at once.
jobs: 30
executor: htcondor
configfile: config/egfr.yaml
htcondor-jobdir: htcondor/logs
# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 2 times. After that, we assume there's
# a real error that requires user/admin intervention
retries: 2
# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
# picking up where it left off, and won't re-run steps that have already completed.
default-resources:
  job_wrapper: "htcondor/spras.sh"
  # If running in CHTC, this only works with apptainer images
  # Note requirement for quotes around the image name
  container_image: "test-htc.sif"
  universe: "container"
  # The value for request_disk should be large enough to accommodate the runtime container
  # image, any additional PRM container images, and your input data.
  request_disk: "16GB"
  request_memory: "12GB"
  retry_request_memory_increase: "RequestMemory + 4"
  retry_request_memory_max: "32GB"
  classad_WantGlideIn: true
  requirements: |
    '(HAS_SINGULARITY == True) && (Poolname =!= "CHTC") && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
  stream_output: true
  stream_error: true
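Since the profile's paths are relative to where you submit from, a pre-submit check can catch a misplaced file (such as a .sif in the wrong directory) before jobs go on hold. The sketch below is only an illustration: `missing_profile_paths` is a hypothetical helper, it takes the profile as an already-parsed dict, and it assumes `container_image` is a local file rather than a registry URL.

```python
from pathlib import Path


def missing_profile_paths(profile, submit_dir="."):
    """Return the profile keys whose files do not exist relative to submit_dir."""
    base = Path(submit_dir)
    resources = profile.get("default-resources", {})
    # Keys taken from the profile above; the last two are nested under default-resources.
    candidates = {
        "configfile": profile.get("configfile"),
        "job_wrapper": resources.get("job_wrapper"),
        "container_image": resources.get("container_image"),
    }
    return sorted(key for key, rel in candidates.items()
                  if rel is not None and not (base / rel).is_file())
```

Called from the repository root, an empty list means every referenced file was found where the profile expects it.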
I definitely don't think we should be covering the stream_{output/error} in the general purpose profile for fear that an unknowing user will take that as the default (and make a CHTC sys admin very sad when they crash an AP). These should be intentionally hard to find because users will find them very tempting to use without understanding the detrimental effects they can have on shared computing resources.
As for the other requirements, I'd also like to avoid sticking those in the profile -- they're very specific to your run that needs both the OSPool and profiling, and these requirements are generally documented elsewhere.
What was the goal of these settings for Neha's OSPool runs? I don't think we need to add them here but am curious for our SPRAS benchmarking in OSPool.
- `classad_WantGlideIn: true` configures a classad that enables submission to OSPool.
- `(HAS_SINGULARITY == True)` makes sure you land on OSPool EPs that support apptainer/singularity, which is a hard requirement for SPRAS.
- `(Poolname =!= "CHTC")` -- this can probably be omitted -- it disables submission to CHTC EPs. If you enable OSPool submissions and don't include this, you're submitting to both the OSPool and CHTC at once.
- `versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)` makes sure you match with EPs that have the features/configuration needed for the apptainer profiling code to work.
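To make the composition of that requirements string explicit, here is a small sketch that joins the individual clauses with `&&`. The helper name `build_requirements` and the list name are hypothetical; the clause strings themselves are copied from the config in this thread.

```python
def build_requirements(clauses):
    """Join ClassAd clauses with && so a job only matches EPs satisfying all of them."""
    return " && ".join(clauses)


# Clauses from the OSPool profiling run discussed in this thread.
OSPOOL_PROFILING_CLAUSES = [
    "(HAS_SINGULARITY == True)",
    '(Poolname =!= "CHTC")',
    'versionGE(split(Target.CondorVersion)[1], "24.8.0")',
    "(isenforcingdiskusage =!= true)",
]
```

Dropping the `Poolname` clause from the list before joining gives the variant that also allows CHTC EPs.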
- ✓
- Convenience wrapper (in the repository root) around
  ``snakemake_long.py``.
The next section is what I found confusing. It gives instructions to create the .sif from the existing DockerHub image. That usually breaks. I recommend we remove it and only give instructions to build a new image from source.
Maybe you can explain how these steps break for you? I don't think I've ever run into issues building from an existing Dockerhub image, outside of the mismatched Snakefile conundrum (which I try to cover more heavily in my latest round of revisions).
The new option to check out a version of the SPRAS repo that matches the existing image makes sense and should help.
My prior confusion was around what a user should do if they need to build a SPRAS image themselves to match a newer commit. I don't see a way to build the Apptainer sif image entirely on CHTC. We link to CHTC docs that expect you to have a def file, which we don't have. The Apptainer instructions for building directly from a Dockerfile didn't work for me. My understanding is that I would have to build a Docker image on a local machine with Docker, push it to my DockerHub, then run a build job in CHTC to convert it to a sif file.
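The three-step path described here can be written down as a sketch that only assembles the commands, leaving it to the user to run the first two on a local machine with Docker and the last one in a CHTC build job. The function name, the image tag, and the default Dockerfile path and build context are placeholders, not values taken from the SPRAS docs.

```python
def image_build_commands(docker_tag, sif_name, dockerfile="Dockerfile", context="."):
    """Return the commands for the local-build, push, and sif-conversion steps.

    Nothing is executed here; dockerfile and context are placeholder defaults.
    """
    return [
        # Step 1 (local machine): build the Docker image from source.
        ["docker", "build", "-t", docker_tag, "-f", dockerfile, context],
        # Step 2 (local machine): push it to a registry such as Docker Hub.
        ["docker", "push", docker_tag],
        # Step 3 (CHTC build job): convert the pushed image to a sif file.
        ["apptainer", "build", sif_name, f"docker://{docker_tag}"],
    ]
```

For example, `image_build_commands("myuser/spras:dev", "spras-dev.sif")` yields the three command lists in order, with `myuser/spras:dev` standing in for a real Docker Hub tag.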
agitter
left a comment
Looks ready to me. I'll let Neha check it as well.
This largely reformats the directory structure needed to run SPRAS workflows with HTCondor. In particular, it moves a lot of the helper code/submit files out of `docker-wrappers/SPRAS/` into a top-level `htcondor/` directory. I can do this now that the HTCondor executor has matured significantly, and can handle all the paths as they're configured in this diff.
To run a test SPRAS workflow, try following along with the instructions in `docs/htcondor.rst`. If anything is confusing, or you get hung up on any of the steps, let's discuss what I can do to make things more clear.