Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill#20273

Merged: psiddh merged 1 commit into pytorch:main from CodeLinaro:dev1/danny/remove_token_gen_from_calib on Jun 22, 2026

@DannyYuyang-quic

Contributor

Summary

Calibration dataset:

  • Replace HF AutoModel token generation with direct tokenization of curated corpus (llm eval tasks or JSON samples)
  • Add default calibration samples: assets/samples/{text,vision,audio}.json
  • Support Dataloader-based calibration

Architecture:

  • Introduce PTQStrategy + DecoderInference as unified calibration forward-pass primitives; remove decoder_utils.graph_module_inference
  • Refactor dataset.py into dataset/ package: builders, collators, config, datasets, loaders, preprocessors, schema

Test plan

Test CI:

  • ExampleLLMScript
  • TestExampleMultimodalityScript

@pytorch-bot

pytorch-bot Bot commented Jun 15, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20273

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

As of commit 73263c2 with merge base 05b977d (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 15, 2026
@DannyYuyang-quic

DannyYuyang-quic commented Jun 15, 2026

Contributor Author

@psiddh Hi, this PR adds support for dataloader-based calibration in MLLMs. With this PR, LLMs can be calibrated on the full input sequence at once, eliminating the need for iterative autoregressive (AR) processing over long sequences. For example, instead of performing hundreds of iterations for a sequence length of 1024, calibration now completes in a single forward pass.
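To make the iteration count concrete, here is a minimal toy sketch (not the PR's code) that counts forward passes for chunked AR calibration versus a single full-sequence pass. `CountingDecoder`, both `calibrate_*` helpers, and the chunk size `ar_len=4` are hypothetical names and values chosen for illustration; a real decoder would also carry a KV cache between chunks, which this toy ignores.

```python
from typing import List


class CountingDecoder:
    """Toy stand-in for a decoder; it only counts forward passes."""

    def __init__(self) -> None:
        self.forward_calls = 0

    def forward(self, tokens: List[int]) -> int:
        self.forward_calls += 1
        # Pretend the observers saw one activation row per token position.
        return len(tokens)


def calibrate_iteratively(model: CountingDecoder, tokens: List[int], ar_len: int) -> int:
    """Feed the sequence in chunks of ar_len tokens, one forward pass per chunk."""
    for start in range(0, len(tokens), ar_len):
        model.forward(tokens[start:start + ar_len])
    return model.forward_calls


def calibrate_single_pass(model: CountingDecoder, tokens: List[int]) -> int:
    """Feed the whole sequence in one forward pass."""
    model.forward(tokens)
    return model.forward_calls


tokens = list(range(1024))  # hypothetical 1024-token calibration sample
iterative_calls = calibrate_iteratively(CountingDecoder(), tokens, ar_len=4)
single_calls = calibrate_single_pass(CountingDecoder(), tokens)
print(iterative_calls, single_calls)  # 256 1
```

The toy only models the number of forward passes; wall-clock savings in practice depend on the model and backend.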

Below is a comparison between AR iterative calibration and dataloader-based calibration across different models:

MLLMs metrics

| Model | AR iterative calibration: Time (sec) / PPL | Dataloader-based calibration: Time (sec) / PPL | Speedup |
| --- | --- | --- | --- |
| gemma-2b | 1216 / 16.588 | 100 / 16.609 | 12.16x |
| gemma2-2b | 1827 / 11.504 | 123 / 11.517 | 14.85x |
| gemma3-1b | 907 / 23.052 | 81 / 22.722 | 10.67x |
| glm-1_5b | 963 / 20.180 | 85 / 20.041 | 11.32x |
| llama3_2-3b | 2286 / 10.745 | 138 / 10.498 | 16.56x |
| phi_4_mini | 2824 / 13.437 | 180 / 13.605 | 15.68x |
| qwen2_5-0_5b | 486 / 13.951 | 77 / 13.813 | 6.31x |
| qwen2_5-1_5b | 1068 / 9.714 | 116 / 9.669 | 9.2x |
| qwen3-1_7b | 1478 / 14.756 | 111 / 14.913 | 13.31x |
| smollm2_135m | 399 / 19.797 | 80 / 19.706 | 4.98x |
| smollm3-3b | 2065 / 8.345 | 132 / 8.989 | 15.64x |
| smolvlm_500m_instruct | 170 / - | 86 / - | 1.97x |
| internvl3_1b | 170 / - | 75 / - | 2.26x |
| granite_speech_3_3-2b | 447 / - | 179 / - | 2.49x |
| llama3_2-1b | 1237 / 14.973 | 883 / 15.647 | 1.4x |
| qwen3-0_6b | 1013 / 19.740 | 408 / 19.912 | 2.48x |

cc: @shewu-quic, @haowhsu-quic

@DannyYuyang-quic DannyYuyang-quic changed the title from "Qualcomm AI Engine Direct - Support dataloader-based prefill quantize" to "Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill" Jun 15, 2026
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch 2 times, most recently from d80b723 to 01574e1 Compare June 15, 2026 07:54
@DannyYuyang-quic

Contributor Author

@pytorchbot label "release notes: qualcomm"

Calibration dataset:
- Replace HF AutoModel token generation with direct tokenization of
  curated corpus (llm eval tasks or JSON samples)
- Add default calibration samples: assets/samples/{text,vision,audio}.json

Architecture:
- Introduce PTQStrategy + DecoderInference as unified calibration
  forward-pass primitives; remove decoder_utils.graph_module_inference
- Refactor dataset.py into dataset/ package:
  builders, collators, config, datasets, loaders, preprocessors, schema
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch from 01574e1 to 73263c2 Compare June 22, 2026 03:58
@psiddh

psiddh commented Jun 22, 2026

Contributor

@claude review

@claude

claude Bot commented Jun 22, 2026


Claude finished @psiddh's task in 6m 50s.


Claude Code Review

Reviewed the refactor end-to-end (dataset/, inference/, quantize/, evaluator/, wrappers, tokenizer, masking, tests). This is a clean, well-structured rework — the DecoderInference/PTQStrategy split is a real readability win over the old decoder_utils.graph_module_inference, the collator/builder separation is sensible, and the dataloader-based single-pass calibration with PPL parity (per your table) is a strong result. A few issues and nits below.


Bugs / correctness

1. Dead, self-contradicting warning in export_llama (llama.py:653-660)

if TASKS_EVAL in args.eval_methods and args.eval_tasks is None:
    if args.calib_tasks is None:
        logging.warning(
            "--eval_tasks is set but --calib_tasks is not; ..."
        )
    raise RuntimeError("Please provide --eval_tasks to eval perplexity")

This block only runs when eval_tasks is None, yet the warning text says "--eval_tasks is set", the opposite of the condition. And since a raise follows unconditionally, the warning can never act as guidance: the script always aborts immediately after it. Either drop the inner warning or move the calib/eval-tasks guidance to where it can actually take effect.
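As a hedged illustration of the first option (dropping the inner warning), the check could reduce to the sketch below. `validate_eval_args` is a hypothetical helper name, and the `"tasks_eval"` value for `TASKS_EVAL` is taken from the review text rather than from the source file; the actual validation in llama.py may be structured differently.

```python
import argparse

TASKS_EVAL = "tasks_eval"  # value as quoted in the review; assumed to match the script


def validate_eval_args(args: argparse.Namespace) -> None:
    """Abort early when task-based eval is requested without --eval_tasks.

    Hypothetical helper: the misleading inner warning is simply removed,
    so the only message the user sees matches the actual condition.
    """
    eval_methods = args.eval_methods or []
    if TASKS_EVAL in eval_methods and args.eval_tasks is None:
        raise RuntimeError("Please provide --eval_tasks to eval perplexity")


# Usage: a complete argument set passes silently.
ok_args = argparse.Namespace(eval_methods=["tasks_eval"], eval_tasks=["wikitext"])
validate_eval_args(ok_args)
```

Any guidance about --calib_tasks would then live in a separate branch that does not end in an unconditional raise.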

2. Falsy-token-id bug in modality merging — inference/decoder.py:111-123

self.audio_token_id or self.image_token_id

appears in both the all((...)) guard and the merge_modality_embeddings call. If a model's audio_token_id/image_token_id is legitimately 0, the or falls through to the other id (or disables merging). Token id 0 is uncommon for these special tokens, but the or-on-int idiom is fragile. Prefer explicit None checks (e.g. self.audio_token_id if self.audio_token_id is not None else self.image_token_id).
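A small sketch of why the or-on-int idiom is fragile, using made-up token ids (0 and 7) and hypothetical helper names; it is not the code in inference/decoder.py:

```python
from typing import Optional


def pick_with_or(audio_token_id: Optional[int], image_token_id: Optional[int]) -> Optional[int]:
    # Fragile: 0 is falsy, so a legitimate audio id of 0 is silently skipped.
    return audio_token_id or image_token_id


def pick_with_none_check(audio_token_id: Optional[int], image_token_id: Optional[int]) -> Optional[int]:
    # Explicit: only a missing (None) audio id falls through to the image id.
    return audio_token_id if audio_token_id is not None else image_token_id


print(pick_with_or(0, 7))          # 7
print(pick_with_none_check(0, 7))  # 0
```

Both helpers agree whenever the audio id is None or non-zero; they differ only in the id-0 case the review flags.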

3. (TASKS_EVAL or SQNR_EVAL) in args.eval_methods (llama.py:646; pre-existing, but in a touched block)

TASKS_EVAL or SQNR_EVAL always short-circuits to "tasks_eval", so the SQNR_EVAL case is never checked here. Since you're already editing this validation, consider `if args.eval_methods and any(m in args.eval_methods for m in (TASKS_EVAL, SQNR_EVAL))`.
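The short-circuit can be reproduced in isolation. In this sketch the `"tasks_eval"` value comes from the review text, while `"sqnr_eval"` and both function names are assumptions made for illustration:

```python
from typing import Optional, Sequence

TASKS_EVAL = "tasks_eval"  # value as quoted in the review
SQNR_EVAL = "sqnr_eval"    # assumed value, for illustration only


def wants_eval_buggy(eval_methods: Sequence[str]) -> bool:
    # (TASKS_EVAL or SQNR_EVAL) evaluates to TASKS_EVAL, because a
    # non-empty string is truthy; SQNR_EVAL is never compared.
    return (TASKS_EVAL or SQNR_EVAL) in eval_methods


def wants_eval_fixed(eval_methods: Optional[Sequence[str]]) -> bool:
    # Check each method explicitly, and tolerate a missing list.
    return bool(eval_methods) and any(
        m in eval_methods for m in (TASKS_EVAL, SQNR_EVAL)
    )


print(wants_eval_buggy(["sqnr_eval"]))  # False
print(wants_eval_fixed(["sqnr_eval"]))  # True
```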


Nits / polish

  • dataset/config.py:19: `token_dtype: Optional[str]` is annotated as str, but from_args assigns a torch.dtype (torch.int64/int32). The annotation should be Optional[torch.dtype].
  • inference/encoder.py:21: `module(*(inputs,))` is just `module(inputs)`; the splat is noise.
  • dataset/builders.py:163-170 — in from_message_samples, the if not hasattr(self.llm_config, modality): continue is re-checked per sample inside the loop (invariant across samples) and apply_chat_template is computed before that check, so its result is discarded when the attr is missing. Hoist the hasattr check above the loop and return early. Also note this returns an empty ModalityEncoderDataset([]) rather than None when the modality is absent — intentional? The from_*/builder contract elsewhere returns None.
  • dataset/loaders.py:51: `open(p)` without `encoding="utf-8"`; decoding calibration JSON with non-ASCII content is then locale-dependent. Minor.
  • dataset/builders.py:267: `dict.fromkeys(_ALL_MODALITY_KEYS) | {...}` is a nice way to guarantee all keys are present; worth a one-line comment that the `|` lets the real DataLoaders override the None placeholders, since the ordering dependency is subtle.
  • Padding rows in calibration: LLMCalibCollator pads tokens to max_context_len with 0 and _mask_padding_positions masks those query rows. The decoder still computes activations for the padded positions (token-id-0 embeddings), feeding them to the observers. The PPL table suggests the effect is negligible, but a brief comment in LLMCalibCollator noting that padded positions are tolerated by calibration would help future readers.
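Two of the nits above can be sketched in plain Python: the placeholder-then-override dict merge, and right-padding a token row together with a mask of real positions. Everything here is illustrative: `ALL_MODALITY_KEYS`, `merge_loaders`, and `pad_tokens` are hypothetical stand-ins, not the contents of dataset/builders.py or LLMCalibCollator, and the `|` operator on dicts requires Python 3.9 or later.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key set; the real _ALL_MODALITY_KEYS is not shown in the review.
ALL_MODALITY_KEYS = ("text", "vision", "audio")


def merge_loaders(real: Dict[str, object]) -> Dict[str, Optional[object]]:
    # dict.fromkeys gives every key a None placeholder. With `|`, the
    # right-hand operand wins on key clashes, so the real loaders must be
    # on the right; swapping the operands would overwrite them with None.
    return dict.fromkeys(ALL_MODALITY_KEYS) | real


def pad_tokens(
    tokens: List[int], max_context_len: int, pad_id: int = 0
) -> Tuple[List[int], List[bool]]:
    # Right-pad to max_context_len; the mask is True for real tokens and
    # False for padding, so padded rows can be masked downstream.
    if len(tokens) > max_context_len:
        raise ValueError("sequence longer than max_context_len")
    n_pad = max_context_len - len(tokens)
    return tokens + [pad_id] * n_pad, [True] * len(tokens) + [False] * n_pad


print(merge_loaders({"text": "text_loader"}))
# {'text': 'text_loader', 'vision': None, 'audio': None}
print(pad_tokens([5, 6, 7], 5))
# ([5, 6, 7, 0, 0], [True, True, True, False, False])
```

Note that masking the padded query rows does not stop a decoder from computing activations at those positions, which is the point the last nit raises.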

Tests

test_qnn_delegate.py swaps --skip_user_prompt_calibration for explicit --calib_tasks wikitext --calib_limit 1 and adds --calib_samples .../audio.json|vision.json for the multimodal cases, bumping max_seq_len to 1024. Consistent with the new calibration entrypoints. One thing to confirm: the multimodal --calib_samples paths are relative (./examples/...), so these tests assume the repo root as CWD — fine for CI, but worth being aware of.

Overall: solid refactor, no blocking concerns beyond the dead warning (#1). Nice speedups.

@psiddh

psiddh commented Jun 22, 2026

Contributor

Nice refactor, and the calibration speedups look great!
Can you give the README another pass after this rework to make sure everything's still in order (a few of the flag/section bits may have drifted)? I'm approving and merging; anything that needs fixing we can do in a follow-up PR. (For example, is --calib_num_fewshot a no-op now?)

@psiddh psiddh merged commit 60b1351 into pytorch:main Jun 22, 2026
187 of 194 checks passed
@DannyYuyang-quic

Contributor Author

> Nice refactor, and the calibration speedups look great! Can you give the README another pass after this rework to make sure everything's still in order (a few of the flag/section bits may have drifted)? I'm approving and merging; anything that needs fixing we can do in a follow-up PR. (For example, is --calib_num_fewshot a no-op now?)

Will update those. Thanks for catching that!

Labels

  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • release notes: qualcomm (Changes to the Qualcomm backend delegate)
