fix: conda support + minor issues during initialization. #85

cicirori merged 1 commit into lightseekorg:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb8338360e
```python
parquet_files = _list_hub_data_files(data_path, (".parquet",))
if not parquet_files:
    raise ValueError(f"No parquet files found in HF Hub dataset '{data_path}'.")
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
```
**Use Hub-aware paths for parquet shard URLs**

Build the parquet `data_files` list through Hugging Face Hub APIs instead of hardcoding `https://.../resolve/main/...`. The hardcoded URL drops the repository revision and auth context, so fallback loading fails for datasets whose default branch is not `main` and for private/gated repos (where unauthenticated raw URLs return 401/404), even though `list_repo_files(...)` just succeeded.
```python
except ValueError as parquet_err:
    log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
```
**Broaden parquet fallback exception handling**

Catch more than `ValueError` around `_load_hub_parquet_dataset(...)`; parquet streaming can fail with other exception types (e.g., HTTP, dataset-generation, or parsing errors), and in those cases this new branch aborts without trying the JSON fallback. That is a regression from the prior behavior, where the JSON fallback was always attempted after the initial `load_dataset(...)` failure.
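The broadened fallback this comment asks for can be sketched as a small wrapper (the function and loader names here are illustrative, not from the PR):

```python
import logging

log = logging.getLogger(__name__)

def load_with_fallback(primary, fallback, data_path):
    """Try `primary`; on *any* failure, fall back to `fallback`.

    Catching Exception rather than only ValueError keeps the JSON
    fallback reachable when parquet streaming dies with HTTP,
    dataset-generation, or parsing errors.
    """
    try:
        return primary(data_path)
    except Exception as err:
        log.info(f"primary loader failed for '{data_path}' ({err}), using fallback")
        return fallback(data_path)
```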
**Pull request overview**

This PR improves TorchSpec setup and dataset-ingestion robustness by enhancing the conda/micromamba installation flow, fixing an upstream patch-application issue, and extending Hugging Face dataset loading to handle parquet-only repos and malformed rows more gracefully.

Changes:
- Add a conda fallback (in addition to micromamba) for environment creation/install in `tools/build_conda.sh`.
- Improve the HF Hub dataset-loading fallback to support parquet shard streaming when `load_dataset(repo_id, ...)` fails, and refactor Hub file listing.
- Add troubleshooting guidance for a Mooncake TCP-store SEGFAULT workaround; update the SGLang patch file format so it applies cleanly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `torchspec/data/utils.py` | Handles `None` message content; adds a Hub parquet streaming fallback and a shared Hub file-listing helper. |
| `tools/build_conda.sh` | Detects micromamba vs. conda automatically and uses a unified env create/run flow. |
| `patches/sglang/v0.5.10.post1/sglang.patch` | Regenerated patch in email-style format to avoid `git apply` corruption errors. |
| `README.md` | Adds a Mooncake SEGFAULT troubleshooting note and an environment-variable workaround. |
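The micromamba/conda detection described for `tools/build_conda.sh` can be sketched roughly as follows (the variable names, env name, and install commands are assumptions, not the script's actual contents):

```shell
#!/bin/sh
# Prefer micromamba when available; otherwise fall back to conda.
if command -v micromamba >/dev/null 2>&1; then
  TOOL=micromamba
elif command -v conda >/dev/null 2>&1; then
  TOOL=conda
else
  TOOL=none
fi
echo "selected tool: $TOOL"

# With a tool selected, later steps can share one unified create/run flow:
#   "$TOOL" create -y -n torchspec python=3.10
#   "$TOOL" run -n torchspec pip install -e .
```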
```python
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
return load_dataset("parquet", data_files=urls, split="train", streaming=True)
```
```python
if content is None:
    msg["content"] = ""
    continue
```
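The `None`-content handling above amounts to a small normalization pass over the messages; a standalone sketch (the function name is mine, not the PR's):

```python
def normalize_messages(messages):
    # Replace missing/None "content" with an empty string so downstream
    # template rendering never sees None (mirrors the hunk above).
    for msg in messages:
        if msg.get("content") is None:
            msg["content"] = ""
    return messages
```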
```diff
 except (ValueError, TypeError, ArithmeticError, KeyError) as e:
-    # Schema inference failures (e.g., mixed-type columns in Arrow/Parquet).
-    # Fall back to manual JSON download.
+    # Schema inference / card-parsing failures (e.g., mixed-type columns in
+    # Arrow/Parquet, malformed dataset_info YAML). Fall back to direct file
+    # streaming — parquet first, then JSON/JSONL.
     import logging

-    logging.getLogger(__name__).info(
-        f"load_dataset failed for '{data_path}' ({e}), falling back to JSON download"
-    )
-    return IterableDataset.from_generator(
-        _load_hub_json_files, gen_kwargs={"data_path": data_path}
-    )
+    log = logging.getLogger(__name__)
+    log.info(f"load_dataset failed for '{data_path}' ({e}), trying direct file streaming")
+    try:
+        return _load_hub_parquet_dataset(data_path)
+    except ValueError as parquet_err:
+        log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
+        return IterableDataset.from_generator(
+            _load_hub_json_files, gen_kwargs={"data_path": data_path}
+        )
```
````diff
 TORCHSPEC_LOG_LEVEL=DEBUG ./examples/qwen3-8b-single-node/run.sh
 ```

+## Mooncake SEGFAULT
````
Thank you for your contribution!

I've encountered some issues and inconveniences while setting up TorchSpec. Running `git apply /root/TorchSpec/patches/sglang/v0.5.10.post1/sglang.patch` failed with `error: corrupt patch at line 946`; I fixed this error by re-generating the patch file.
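Email-style patches (as produced by `git format-patch`) are generally the most robust input for `git apply`. A throwaway-repo demonstration of the regeneration approach (all paths and commit messages here are illustrative, not from this repo):

```shell
#!/bin/sh
set -e
# Build a scratch repo, commit a change, and regenerate it as an
# email-style patch that `git apply --check` accepts.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m base
echo hello > f.txt
git add f.txt
git -c user.email=a@example.com -c user.name=a commit -q -m "add f"
git format-patch -1 HEAD --stdout > change.patch
git checkout -q HEAD~1           # rewind so the patch is applicable again
git apply --check change.patch   # exits 0 if the patch parses and applies
echo "patch applies cleanly"
```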