We are reporting a potential privacy / IP leakage issue in REST-style datastore-backed speculative decoding.
Although the implementation caps the number of datastore tokens accepted in a single verification step, consecutive datastore-backed chunks can still accumulate into longer recovered fragments within one response. Across our 1000-prompt evaluation, the per-deployment median stitched recovery averaged 155 words for text-model deployments and 87 words for code-model deployments.
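The accumulation effect can be illustrated with a toy sketch. This is not the REST implementation and deliberately contains no reproduction details; the cap value and function names are illustrative only:

```python
# Toy illustration: a per-step acceptance cap does not bound cumulative
# recovery, because each verification step can accept a fresh
# datastore-backed chunk. Numbers and names are hypothetical.

PER_STEP_CAP = 10  # max datastore tokens accepted in one verification step


def stitched_recovery(accepted_chunk_lengths):
    """Total datastore-derived tokens recovered across one response."""
    total = 0
    for chunk_len in accepted_chunk_lengths:
        total += min(chunk_len, PER_STEP_CAP)  # the cap applies per step only
    return total


# 20 consecutive steps, each accepting a chunk at the cap:
print(stitched_recovery([10] * 20))  # 200 tokens, far above the per-step cap
```

The point is simply that the per-step bound composes additively over steps, so nothing in the cap itself limits what one generation can recover in total.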
Representative examples below are lightly normalized for readability.
Text example
- model: lmsys/vicuna-13b-v1.5
- datastore: datastore_chat_large.idx
```
Thank you for your patience and understanding during this challenging time.
Best regards,
[Your Name]
[Your Title]
[Your Company]
```
Code example
- model: codellama/CodeLlama-13b-hf
- datastore: datastore_stack_large.idx
```python
def verify_reset_password_token(token):
    try:
        id = jwt.decode(token, current_app.config['SECRET_KEY'],
                        algorithms=['HS256'])['reset_password']
    except:
        return
    return User.query.get(id)
```
We understand that this is at least partly a deployment issue rather than a core correctness bug. The risk is most serious when REST is used with streaming output and when the datastore contains private, user-derived, or otherwise sensitive content.
For that reason, it would be useful to add explicit guidance in the documentation, for example:
- avoid streaming partial outputs in privacy-sensitive REST deployments
- do not place private or sensitive content in the datastore unless the leakage risk is acceptable
- document that per-step acceptance limits do not fully bound cumulative recovery within a single generation
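As one possible direction for the last point, a cumulative per-generation budget could complement the per-step cap. The sketch below is hypothetical: REST does not currently expose such a knob, and all names and values are our own:

```python
# Hypothetical mitigation sketch: enforce a cumulative per-generation
# budget on datastore-derived tokens in addition to the per-step cap.
# Parameter names and defaults are illustrative, not REST's API.

class DatastoreBudget:
    def __init__(self, per_step_cap=10, per_generation_cap=64):
        self.per_step_cap = per_step_cap
        self.remaining = per_generation_cap  # tokens left for this generation

    def allow(self, chunk_len):
        """Return how many datastore tokens may be accepted this step."""
        allowed = min(chunk_len, self.per_step_cap, self.remaining)
        self.remaining -= allowed
        return allowed


budget = DatastoreBudget(per_step_cap=10, per_generation_cap=25)
print([budget.allow(10) for _ in range(4)])  # [10, 10, 5, 0]
```

Once the budget is exhausted, verification would fall back to plain decoding, bounding total recovery per response instead of per step.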
We are intentionally not including low-level reproduction details in this public report. We are preparing an academic disclosure and wanted to notify the maintainers before publication. We would be happy to share a private technical write-up and affected configurations directly with the maintainers.
Thanks.