fix: conda support + minor issues during initialization. #85

cicirori merged 1 commit into lightseekorg:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb8338360e
```python
parquet_files = _list_hub_data_files(data_path, (".parquet",))
if not parquet_files:
    raise ValueError(f"No parquet files found in HF Hub dataset '{data_path}'.")
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
```
**Use Hub-aware paths for parquet shard URLs**

Build the parquet `data_files` list through Hugging Face Hub APIs instead of hardcoding `https://.../resolve/main/...`. The hardcoded URL drops the repository revision and auth context, so fallback loading fails for datasets whose default branch is not `main` and for private/gated repos (where unauthenticated raw URLs return 401/404), even though `list_repo_files(...)` just succeeded.
```python
except ValueError as parquet_err:
    log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
```
**Broaden parquet fallback exception handling**

Catch more than `ValueError` around `_load_hub_parquet_dataset(...)`; parquet streaming can fail with other exception types (e.g., HTTP, dataset-generation, or parsing errors), and in those cases this new branch aborts without trying the JSON fallback. That is a regression from the prior behavior, where the JSON fallback was always attempted after the initial `load_dataset(...)` failure.
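The broadened fallback this comment asks for can be sketched as a small wrapper (the function and loader names here are illustrative, not from the PR):

```python
import logging

log = logging.getLogger(__name__)

def load_with_fallback(primary, fallback, data_path):
    """Try `primary`; on *any* failure, fall back to `fallback`.

    Catching Exception rather than only ValueError keeps the JSON
    fallback reachable when parquet streaming dies with HTTP,
    dataset-generation, or parsing errors.
    """
    try:
        return primary(data_path)
    except Exception as err:
        log.info(f"primary loader failed for '{data_path}' ({err}), using fallback")
        return fallback(data_path)
```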
**Pull request overview**

This PR improves TorchSpec setup and dataset-ingestion robustness by enhancing the conda/micromamba installation flow, fixing an upstream patch-application issue, and extending Hugging Face dataset loading to handle parquet-only repos and malformed rows more gracefully.

Changes:
- Add a conda fallback (in addition to micromamba) for environment creation/install in `tools/build_conda.sh`.
- Improve the HF Hub dataset-loading fallback to support parquet shard streaming when `load_dataset(repo_id, ...)` fails, and refactor Hub file listing.
- Add troubleshooting guidance for a Mooncake TCP-store SEGFAULT workaround; update the SGLang patch file format so it applies cleanly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `torchspec/data/utils.py` | Handles `None` message content; adds a Hub parquet streaming fallback and a shared Hub file-listing helper. |
| `tools/build_conda.sh` | Detects micromamba vs. conda automatically and uses a unified env create/run flow. |
| `patches/sglang/v0.5.10.post1/sglang.patch` | Regenerated patch in email-style format to avoid `git apply` corruption errors. |
| `README.md` | Adds a Mooncake SEGFAULT troubleshooting note and an environment-variable workaround. |
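The micromamba/conda detection described for `tools/build_conda.sh` can be sketched roughly as follows (the variable names, env name, and install commands are assumptions, not the script's actual contents):

```shell
#!/bin/sh
# Prefer micromamba when available; otherwise fall back to conda.
if command -v micromamba >/dev/null 2>&1; then
  TOOL=micromamba
elif command -v conda >/dev/null 2>&1; then
  TOOL=conda
else
  TOOL=none
fi
echo "selected tool: $TOOL"

# With a tool selected, later steps can share one unified create/run flow:
#   "$TOOL" create -y -n torchspec python=3.10
#   "$TOOL" run -n torchspec pip install -e .
```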
```python
urls = [f"https://huggingface.co/datasets/{data_path}/resolve/main/{f}" for f in parquet_files]
return load_dataset("parquet", data_files=urls, split="train", streaming=True)
```
```python
if content is None:
    msg["content"] = ""
    continue
```
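The `None`-content handling above amounts to a small normalization pass over the messages; a standalone sketch (the function name is mine, not the PR's):

```python
def normalize_messages(messages):
    # Replace missing/None "content" with an empty string so downstream
    # template rendering never sees None (mirrors the hunk above).
    for msg in messages:
        if msg.get("content") is None:
            msg["content"] = ""
    return messages
```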
```diff
 except (ValueError, TypeError, ArithmeticError, KeyError) as e:
-    # Schema inference failures (e.g., mixed-type columns in Arrow/Parquet).
-    # Fall back to manual JSON download.
+    # Schema inference / card-parsing failures (e.g., mixed-type columns in
+    # Arrow/Parquet, malformed dataset_info YAML). Fall back to direct file
+    # streaming — parquet first, then JSON/JSONL.
     import logging

-    logging.getLogger(__name__).info(
-        f"load_dataset failed for '{data_path}' ({e}), falling back to JSON download"
-    )
-    return IterableDataset.from_generator(
-        _load_hub_json_files, gen_kwargs={"data_path": data_path}
-    )
+    log = logging.getLogger(__name__)
+    log.info(f"load_dataset failed for '{data_path}' ({e}), trying direct file streaming")
+    try:
+        return _load_hub_parquet_dataset(data_path)
+    except ValueError as parquet_err:
+        log.info(f"parquet streaming failed ({parquet_err}), falling back to JSON download")
+        return IterableDataset.from_generator(
+            _load_hub_json_files, gen_kwargs={"data_path": data_path}
+        )
```
````diff
 TORCHSPEC_LOG_LEVEL=DEBUG ./examples/qwen3-8b-single-node/run.sh
 ```

+## Mooncake SEGFAULT
````
Thank you for your contribution!

I've encountered some issues and inconveniences while setting up TorchSpec. Running `git apply /root/TorchSpec/patches/sglang/v0.5.10.post1/sglang.patch` failed with `error: corrupt patch at line 946`; I fixed this error by re-generating the patch file.
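Email-style patches (as produced by `git format-patch`) are generally the most robust input for `git apply`. A throwaway-repo demonstration of the regeneration approach (all paths and commit messages here are illustrative, not from this repo):

```shell
#!/bin/sh
set -e
# Build a scratch repo, commit a change, and regenerate it as an
# email-style patch that `git apply --check` accepts.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m base
echo hello > f.txt
git add f.txt
git -c user.email=a@example.com -c user.name=a commit -q -m "add f"
git format-patch -1 HEAD --stdout > change.patch
git checkout -q HEAD~1           # rewind so the patch is applicable again
git apply --check change.patch   # exits 0 if the patch parses and applies
echo "patch applies cleanly"
```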