Skip to content

fix: conda support + minor issues during initialization.#85

Merged
cicirori merged 1 commit intolightseekorg:mainfrom
Dogacel:fix-starter
Apr 27, 2026
Merged

fix: conda support + minor issues during initialization.#85
cicirori merged 1 commit intolightseekorg:mainfrom
Dogacel:fix-starter

Conversation

@Dogacel
Copy link
Copy Markdown
Contributor

@Dogacel Dogacel commented Apr 27, 2026

I've encountered some issues & inconveniences while setting up TorchSpec.

  1. Automatically detect Conda / Micromamba during installation.
  2. Fix git apply /root/TorchSpec/patches/sglang/v0.5.10.post1/sglang.patch error: corrupt patch at line 946 error by re-generating the patch file.
  3. Add instructions about how to solve the SEGFAULT issue in Mooncake's TCP store in latest version.
  4. Add huggingface dataset support for Parquet.
  5. Skip data with None message content that causes a crash if dataset has an offending row.

Copilot AI review requested due to automatic review settings April 27, 2026 06:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb8338360e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread torchspec/data/utils.py
parquet_files = _list_hub_data_files(data_path, (".parquet",))
if not parquet_files:
raise ValueError(f"No parquet files found in HF Hub dataset '{data_path}'.")
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use Hub-aware paths for parquet shard URLs

Build parquet data_files through Hugging Face Hub APIs instead of hardcoding https://.../resolve/main/.... This line drops repository revision/auth context, so fallback loading fails for datasets whose default branch is not main or for private/gated repos (where unauthenticated raw URLs return 401/404), even though list_repo_files(...) just succeeded.

Useful? React with 👍 / 👎.

Comment thread torchspec/data/utils.py
Comment on lines +485 to +486
except ValueError as parquet_err:
log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Broaden parquet fallback exception handling

Catch more than ValueError around _load_hub_parquet_dataset(...); parquet streaming can fail with other exception types (e.g., HTTP/dataset-generation/parsing errors), and in those cases this new branch aborts without trying the JSON fallback. That is a regression from the prior behavior where JSON fallback was always attempted after the initial load_dataset(...) failure.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves TorchSpec setup and dataset ingestion robustness by enhancing conda/micromamba installation flow, fixing an upstream patch application issue, and extending Hugging Face dataset loading to handle parquet-only repos and malformed rows more gracefully.

Changes:

  • Add conda fallback (in addition to micromamba) for environment creation/install in tools/build_conda.sh.
  • Improve HF Hub dataset loading fallback to support parquet shard streaming when load_dataset(repo_id, ...) fails, and refactor Hub file listing.
  • Add troubleshooting guidance for a Mooncake TCP-store SEGFAULT workaround; update the SGLang patch file format to apply cleanly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
torchspec/data/utils.py Handles None message content; adds Hub parquet streaming fallback and shared Hub file listing helper.
tools/build_conda.sh Detects micromamba vs conda automatically and uses a unified env create/run flow.
patches/sglang/v0.5.10.post1/sglang.patch Regenerated patch in email-style format to avoid git apply corruption errors.
README.md Adds Mooncake SEGFAULT troubleshooting note and environment variable workaround.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread torchspec/data/utils.py
Comment on lines +427 to +428
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
return load_dataset("parquet", data_files=urls, split="train", streaming=True)
Comment thread torchspec/data/utils.py
Comment on lines +340 to +342
if content is None:
msg["content"] = ""
continue
Comment thread torchspec/data/utils.py
Comment on lines 475 to +489
except (ValueError, TypeError, ArithmeticError, KeyError) as e:
# Schema inference failures (e.g., mixed-type columns in Arrow/Parquet).
# Fall back to manual JSON download.
# Schema inference / card-parsing failures (e.g., mixed-type columns in
# Arrow/Parquet, malformed dataset_info YAML). Fall back to direct file
# streaming — parquet first, then JSON/JSONL.
import logging

logging.getLogger(__name__).info(
f"load_dataset failed for '{data_path}' ({e}), falling back to JSON download"
)
return IterableDataset.from_generator(
_load_hub_json_files, gen_kwargs={"data_path": data_path}
)
log = logging.getLogger(__name__)
log.info(f"load_dataset failed for '{data_path}' ({e}), trying direct file streaming")
try:
return _load_hub_parquet_dataset(data_path)
except ValueError as parquet_err:
log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
return IterableDataset.from_generator(
_load_hub_json_files, gen_kwargs={"data_path": data_path}
)
Comment thread README.md
TORCHSPEC_LOG_LEVEL=DEBUG ./examples/qwen3-8b-single-node/run.sh
```

## Mooncake SEGFAULT
@cicirori
Copy link
Copy Markdown
Collaborator

Thank you for your contribution!

@cicirori cicirori merged commit 39c2ee7 into lightseekorg:main Apr 27, 2026
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants