Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill by DannyYuyang-quic · Pull Request #20273 · pytorch/executorch

DannyYuyang-quic · 2026-06-15T05:41:39Z

Summary

Calibration dataset:

Replace HF AutoModel token generation with direct tokenization of curated corpus (llm eval tasks or JSON samples)
Add default calibration samples: assets/samples/{text,vision,audio}.json
Support Dataloader-based calibration

Architecture:

Introduce PTQStrategy + DecoderInference as unified calibration forward-pass primitives; remove decoder_utils.graph_module_inference
Refactor dataset.py into dataset/ package: builders, collators, config, datasets, loaders, preprocessors, schema

Test plan

Test CI:

TestExampleLLMScript
TestExampleMultimodalityScript

pytorch-bot · 2026-06-15T05:41:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20273

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 2 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

As of commit 73263c2 with merge base 05b977d:

NEW FAILURES - The following jobs have failed:

pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 3ddc5a74007d45abd175b059614e2a3d78384c4b5ae731b0695efe0d2606468c /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 84a035e1103e6c2baf3358deed6574a091f292ec61ccb7817cc3319232722241 /exec failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / unittest / macos / macos-job (gh) (similar failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

BROKEN TRUNK - The following jobs failed, but the failures were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

DannyYuyang-quic · 2026-06-15T05:52:24Z

@psiddh Hi, this PR adds dataloader-based calibration for LLMs and multimodal LLMs (MLLMs). With it, a model can be calibrated on the full input sequence at once, eliminating iterative autoregressive (AR) processing over long sequences. For example, instead of performing hundreds of iterations for a sequence length of 1024, calibration now completes in a single forward pass.

Below is a comparison between AR iterative calibration and dataloader-based calibration across different models:

MLLMs metrics

| Model | AR iterative calibration: time (s) / PPL | Dataloader-based calibration: time (s) / PPL | Speedup |
|---|---|---|---|
| gemma-2b | 1216 / 16.588 | 100 / 16.609 | 12.16x |
| gemma2-2b | 1827 / 11.504 | 123 / 11.517 | 14.85x |
| gemma3-1b | 907 / 23.052 | 81 / 22.722 | 10.67x |
| glm-1_5b | 963 / 20.180 | 85 / 20.041 | 11.32x |
| llama3_2-3b | 2286 / 10.745 | 138 / 10.498 | 16.56x |
| phi_4_mini | 2824 / 13.437 | 180 / 13.605 | 15.68x |
| qwen2_5-0_5b | 486 / 13.951 | 77 / 13.813 | 6.31x |
| qwen2_5-1_5b | 1068 / 9.714 | 116 / 9.669 | 9.2x |
| qwen3-1_7b | 1478 / 14.756 | 111 / 14.913 | 13.31x |
| smollm2_135m | 399 / 19.797 | 80 / 19.706 | 4.98x |
| smollm3-3b | 2065 / 8.345 | 132 / 8.989 | 15.64x |
| smolvlm_500m_instruct | 170 / - | 86 / - | 1.97x |
| internvl3_1b | 170 / - | 75 / - | 2.26x |
| granite_speech_3_3-2b | 447 / - | 179 / - | 2.49x |
| llama3_2-1b | 1237 / 14.973 | 883 / 15.647 | 1.4x |
| qwen3-0_6b | 1013 / 19.740 | 408 / 19.912 | 2.48x |
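
The difference between the two calibration modes can be sketched with a toy example. This is only an illustration under stated assumptions, not the PR's implementation: `ToyObserver`, `toy_forward`, `calibrate_ar`, `calibrate_prefill`, and the `ar_len` chunk size are all hypothetical names, and the toy activation depends only on the current token, which is not true of a real decoder with attention.

```python
import math


class ToyObserver:
    """Tracks the min/max of every activation it sees, like a PTQ range observer."""

    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf
        self.forward_calls = 0

    def observe(self, activations):
        self.forward_calls += 1
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))


def toy_forward(tokens, observer):
    # Hypothetical stand-in for a decoder forward pass: one activation per token.
    activations = [math.sin(t) * 3.0 for t in tokens]
    observer.observe(activations)


def calibrate_ar(tokens, ar_len):
    # Iterative AR-style calibration: feed the sequence ar_len tokens at a time.
    observer = ToyObserver()
    for start in range(0, len(tokens), ar_len):
        toy_forward(tokens[start:start + ar_len], observer)
    return observer


def calibrate_prefill(tokens):
    # Dataloader-style calibration: the whole sequence in one forward pass.
    observer = ToyObserver()
    toy_forward(tokens, observer)
    return observer


tokens = list(range(1024))
ar = calibrate_ar(tokens, ar_len=4)
prefill = calibrate_prefill(tokens)
print(ar.forward_calls, prefill.forward_calls)  # 256 1
```

In this toy setting both modes observe the same activation range; the only difference is the number of forward calls. For real models the two modes can produce slightly different statistics, which is why the table reports PPL alongside time.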

cc: @shewu-quic, @haowhsu-quic

DannyYuyang-quic · 2026-06-15T16:20:30Z

@pytorchbot label "release notes: qualcomm"


psiddh · 2026-06-22T07:15:48Z

@claude review

claude · 2026-06-22T07:16:41Z

Claude finished @psiddh's task in 6m 50s

Claude Code Review

Reviewed the refactor end-to-end (dataset/, inference/, quantize/, evaluator/, wrappers, tokenizer, masking, tests). This is a clean, well-structured rework — the DecoderInference/PTQStrategy split is a real readability win over the old decoder_utils.graph_module_inference, the collator/builder separation is sensible, and the dataloader-based single-pass calibration with PPL parity (per your table) is a strong result. A few issues and nits below.

Bugs / correctness

1. Dead, self-contradicting warning in export_llama — llama.py:653-660

if TASKS_EVAL in args.eval_methods and args.eval_tasks is None:
    if args.calib_tasks is None:
        logging.warning(
            "--eval_tasks is set but --calib_tasks is not; ..."
        )
    raise RuntimeError("Please provide --eval_tasks to eval perplexity")

This block only runs when eval_tasks is None, yet the warning text says "--eval_tasks is set", which is the opposite of the condition. And since a raise follows unconditionally, the warning cannot act as guidance: the run always aborts immediately after it is logged. Either drop the inner warning or move the calib/eval-tasks guidance to where it can actually take effect.
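
One possible shape for the simplified check, as a minimal sketch rather than the actual fix. The function name `validate_eval_args` is hypothetical, and the value of `TASKS_EVAL` is taken from the review text rather than verified against the code; only the flag names and the error message come from the snippet above.

```python
import argparse

TASKS_EVAL = "tasks_eval"  # assumed value, as quoted in the review


def validate_eval_args(args):
    """Fail fast when perplexity eval is requested without --eval_tasks."""
    if TASKS_EVAL in (args.eval_methods or []) and args.eval_tasks is None:
        # No warning here: the raise makes any preceding log line redundant.
        raise RuntimeError("Please provide --eval_tasks to eval perplexity")


ok = argparse.Namespace(eval_methods=[TASKS_EVAL], eval_tasks=["wikitext"])
validate_eval_args(ok)  # passes silently

bad = argparse.Namespace(eval_methods=[TASKS_EVAL], eval_tasks=None)
try:
    validate_eval_args(bad)
except RuntimeError as err:
    print(err)
```

Any guidance about `--calib_tasks` would then live in a separate check that does not end in an unconditional raise.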

2. Falsy-token-id bug in modality merging — inference/decoder.py:111-123

self.audio_token_id or self.image_token_id

appears in both the all((...)) guard and the merge_modality_embeddings call. If a model's audio_token_id/image_token_id is legitimately 0, the or falls through to the other id (or disables merging). Token id 0 is uncommon for these special tokens, but the or-on-int idiom is fragile. Prefer explicit None checks (e.g. self.audio_token_id if self.audio_token_id is not None else self.image_token_id).
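
The pitfall is easy to reproduce in isolation. The helper names below are hypothetical and exist only to contrast the two idioms; they are not part of inference/decoder.py.

```python
from typing import Optional


def pick_with_or(audio_token_id: Optional[int], image_token_id: Optional[int]) -> Optional[int]:
    # Fragile: a valid id of 0 is falsy, so `or` skips it.
    return audio_token_id or image_token_id


def pick_explicit(audio_token_id: Optional[int], image_token_id: Optional[int]) -> Optional[int]:
    # Robust: only None means "not configured".
    return audio_token_id if audio_token_id is not None else image_token_id


print(pick_with_or(0, 7))     # 7: the valid audio id 0 is silently replaced
print(pick_explicit(0, 7))    # 0
print(pick_with_or(0, None))  # None: merging would be disabled entirely
```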

3. (TASKS_EVAL or SQNR_EVAL) in args.eval_methods — llama.py:646 (pre-existing, but in a touched block)
TASKS_EVAL or SQNR_EVAL short-circuits to "tasks_eval" always, so the SQNR_EVAL case is never checked here. Since you're already editing this validation, consider if args.eval_methods and any(m in args.eval_methods for m in (TASKS_EVAL, SQNR_EVAL)).
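
A small standalone reproduction of the short-circuit, assuming string constants. The value `"tasks_eval"` is quoted in the review; `"sqnr_eval"` is an assumed value for illustration, and the function names are hypothetical.

```python
TASKS_EVAL = "tasks_eval"
SQNR_EVAL = "sqnr_eval"  # assumed value, not confirmed by the source


def buggy_check(eval_methods):
    # (TASKS_EVAL or SQNR_EVAL) evaluates to TASKS_EVAL, so SQNR_EVAL is never tested.
    return (TASKS_EVAL or SQNR_EVAL) in eval_methods


def fixed_check(eval_methods):
    # Test each constant separately, and tolerate None or an empty list.
    return bool(eval_methods) and any(
        m in eval_methods for m in (TASKS_EVAL, SQNR_EVAL)
    )


print(buggy_check([SQNR_EVAL]))  # False
print(fixed_check([SQNR_EVAL]))  # True
```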

Nits / polish

dataset/config.py:19 — token_dtype: Optional[str] is annotated as str but from_args assigns a torch.dtype (torch.int64/int32). Annotation should be Optional[torch.dtype].
inference/encoder.py:21 — module(*(inputs,)) is just module(inputs); the splat is noise.
dataset/builders.py:163-170 — in from_message_samples, the if not hasattr(self.llm_config, modality): continue is re-checked per sample inside the loop (invariant across samples) and apply_chat_template is computed before that check, so its result is discarded when the attr is missing. Hoist the hasattr check above the loop and return early. Also note this returns an empty ModalityEncoderDataset([]) rather than None when the modality is absent — intentional? The from_*/builder contract elsewhere returns None.
dataset/loaders.py:51 — open(p) without encoding="utf-8"; calibration JSON with non-ASCII content is locale-dependent. Minor.
dataset/builders.py:267 — dict.fromkeys(_ALL_MODALITY_KEYS) | {...} is a nice way to guarantee all keys present; worth a one-line comment that the | lets the real DataLoaders override the None placeholders, since the ordering dependency is subtle.
Padding rows in calibration — LLMCalibCollator pads tokens to max_context_len with 0 and _mask_padding_positions masks those query rows. The decoder still computes activations for the padded positions (token-id-0 embeddings), feeding them to the observers. The PPL table suggests the effect is negligible, but a brief comment in LLMCalibCollator noting that padded positions are tolerated by calibration would help future readers.
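
The ordering dependency in the `dict.fromkeys(_ALL_MODALITY_KEYS) | {...}` item above can be shown in a few lines. This is a sketch with made-up contents: the key names and the string stand-ins for DataLoader objects are assumptions, and the `|` operator on dicts requires Python 3.9 or later.

```python
_ALL_MODALITY_KEYS = ("text", "vision", "audio")  # hypothetical key names

# Strings stand in for real DataLoader objects.
available = {"text": "text_loader", "audio": "audio_loader"}

# Placeholders first, real loaders second: on a key collision the
# right-hand operand of | wins, so real loaders replace the None values.
loaders = dict.fromkeys(_ALL_MODALITY_KEYS) | available
print(loaders)
# {'text': 'text_loader', 'vision': None, 'audio': 'audio_loader'}

# Swapping the operands silently discards every real loader.
reversed_order = available | dict.fromkeys(_ALL_MODALITY_KEYS)
print(reversed_order)
# {'text': None, 'audio': None, 'vision': None}
```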

Tests

test_qnn_delegate.py swaps --skip_user_prompt_calibration for explicit --calib_tasks wikitext --calib_limit 1 and adds --calib_samples .../audio.json|vision.json for the multimodal cases, bumping max_seq_len to 1024. Consistent with the new calibration entrypoints. One thing to confirm: the multimodal --calib_samples paths are relative (./examples/...), so these tests assume the repo root as CWD — fine for CI, but worth being aware of.
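
If the CWD assumption ever becomes a problem, one option is to anchor sample paths to an explicit root. The sketch below is illustrative only: `resolve_sample` is a hypothetical helper, and the directory layout is a throwaway stand-in, not the repository's real sample path.

```python
import os
import tempfile
from pathlib import Path


def resolve_sample(repo_root: Path, relative: str) -> Path:
    """Anchor a repo-relative sample path to an explicit root instead of the CWD."""
    path = (repo_root / relative).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"calibration sample not found: {path}")
    return path


# Demo with a temporary directory standing in for the repo checkout.
with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "examples" / "samples" / "audio.json"
    sample.parent.mkdir(parents=True)
    sample.write_text("[]", encoding="utf-8")

    previous_cwd = os.getcwd()
    os.chdir(tempfile.gettempdir())  # deliberately not the stand-in repo root
    try:
        resolved = resolve_sample(Path(root), "./examples/samples/audio.json")
        found = resolved.is_file()
    finally:
        os.chdir(previous_cwd)

print(found)  # True
```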

Overall: solid refactor, no blocking concerns beyond the dead warning (#1). Nice speedups.
· branch dev1/danny/remove_token_gen_from_calib

psiddh · 2026-06-22T17:27:21Z

Nice refactor, and the calibration speedups look great!
Can you give the README another pass after this rework to make sure everything's still in order (a few of the flag/section bits may have drifted)? I'm approving and merging; anything that needs fixing we can do in a follow-up PR. (For example, is --calib_num_fewshot a no-op now?)

DannyYuyang-quic · 2026-06-23T04:04:15Z

> Nice refactor, and the calibration speedups look great! Can you give the README another pass after this rework to make sure everything's still in order (a few of the flag/section bits may have drifted)? I'm approving and merging; anything that needs fixing we can do in a follow-up PR. (For example, is --calib_num_fewshot a no-op now?)

Will update those. Thanks for catching that!

DannyYuyang-quic requested review from abhinaykukkadapu and psiddh as code owners June 15, 2026 05:41

meta-cla Bot added the CLA Signed label Jun 15, 2026

DannyYuyang-quic had a problem deploying to cadence June 15, 2026 05:44 — with GitHub Actions Failure

DannyYuyang-quic changed the title ~~Qualcomm AI Engine Direct - Support dataloader-based prefill quantize~~ Qualcomm AI Engine Direct - [LLM Quantization] Support dataloader-based prefill Jun 15, 2026

DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch 2 times, most recently from d80b723 to 01574e1 Compare June 15, 2026 07:54

pytorch-bot Bot added the release notes: qualcomm label Jun 15, 2026

DannyYuyang-quic had a problem deploying to cadence June 16, 2026 16:00 — with GitHub Actions Failure

DannyYuyang-quic force-pushed the dev1/danny/remove_token_gen_from_calib branch from 01574e1 to 73263c2 Compare June 22, 2026 03:58

psiddh approved these changes Jun 22, 2026


psiddh merged commit 60b1351 into pytorch:main Jun 22, 2026
187 of 194 checks passed

DannyYuyang-quic mentioned this pull request Jun 23, 2026

Qualcomm AI Engine Direct - update llm document #20449

Merged

psiddh merged 1 commit into pytorch:main from CodeLinaro:dev1/danny/remove_token_gen_from_calib
