feat(huggingFace): refactor operator into per-task codegen + text-generation #5278
…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the upcoming HuggingFace operator UI:

- GET /api/huggingface/models: browse / search models per task
- GET /api/huggingface/tasks: list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio: upload audio for HF audio tasks
- GET /api/huggingface/audio-preview: stream uploaded audio (path-validated)
- GET /api/huggingface/media-proxy: proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end. No operator code yet; this resource is independently useful and lets the frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124: both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or an out-of-band write into the audio temp dir (/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against a missing or lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add a Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 specs pass). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper; a follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
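The two-stage cap described in this commit (a Content-Length pre-check, then a running byte counter while reading 8 KiB chunks) can be sketched as follows. This is a language-neutral illustration written in Python, not the Scala/Unirest code from the PR; `read_with_cap` and `PayloadTooLarge` are hypothetical names, and only the chunk size and the 50 MiB cap come from the commit message.

```python
import io

CHUNK_SIZE = 8 * 1024                      # 8 KiB chunks, as stated in the commit message
MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024   # 50 MiB cap, as stated in the commit message


class PayloadTooLarge(Exception):
    """Hypothetical marker for the case the resource would map to HTTP 413."""


def read_with_cap(stream, declared_length=None, cap=MAX_MEDIA_PROXY_BYTES,
                  chunk_size=CHUNK_SIZE):
    """Read a binary stream fully, refusing to buffer more than `cap` bytes."""
    # Stage 1: reject early when the upstream declares an oversized body.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared length {declared_length} exceeds cap {cap}")

    # Stage 2: count bytes while reading, so a missing or wrong
    # Content-Length cannot push the buffer past the cap.
    total = 0
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > cap:
            raise PayloadTooLarge(f"body exceeded cap {cap} mid-read")
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    small = io.BytesIO(b"x" * 100)
    print(len(read_with_cap(small, declared_length=100, cap=1000)))
```

The second stage is what makes the pre-check safe to treat as an optimization only: even if the header is absent or understated, at most `cap` bytes plus one chunk are ever held in memory.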
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## main #5278 +/- ##
============================================
+ Coverage 52.95% 53.00% +0.04%
- Complexity 2627 2651 +24
============================================
Files 1090 1094 +4
Lines 42210 42284 +74
Branches 4534 4541 +7
============================================
+ Hits 22353 22413 +60
- Misses 18546 18558 +12
- Partials 1311 1313 +2
/request-review @Ma77Ball
Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet (that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on), but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines) holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals; class constants are assigned in open(self) so self is in scope for the decode call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
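The dispatcher-plus-holes idea in this commit can be illustrated with a small Python analogue. The real classes are Scala; everything below (the template text, the `payload_python` / `parse_python` method names, the generated function names) is an assumption made for illustration, with only the task name, `self.MODEL_ID`, and the `body["choices"][0]["message"]["content"]` parse expression taken from the commit message.

```python
class TextGenCodegen:
    """Stand-in for the Scala TextGenCodegen object; snippet bodies are illustrative."""

    task = "text-generation"

    def payload_python(self) -> str:
        # Snippet spliced into the payload hole of the shared template.
        return 'payload = {"model": self.MODEL_ID, "messages": messages}'

    def parse_python(self) -> str:
        # Snippet spliced into the parse hole of the shared template.
        return 'return body["choices"][0]["message"]["content"]'


# Hypothetical shared template with two holes, loosely mirroring PythonCodegenBase.
BASE_TEMPLATE = """\
def build_payload(self, messages):
    {payload}
    return payload

def parse_response(self, body):
    {parse}
"""

# The dispatcher only needs a task -> codegen lookup.
REGISTERED_CODEGENS = {cg.task: cg for cg in (TextGenCodegen(),)}


def generate_python_code(task: str) -> str:
    """Look up the codegen for `task` and fill both holes of the template."""
    codegen = REGISTERED_CODEGENS.get(task)
    if codegen is None:
        raise ValueError(f"unsupported task: {task}")
    return BASE_TEMPLATE.format(
        payload=codegen.payload_python(),
        parse=codegen.parse_python(),
    )


if __name__ == "__main__":
    print(generate_python_code("text-generation"))
```

Under this shape, adding a task family means writing one new codegen object and registering it; the shared template and the dispatcher stay untouched.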
…degen specs

Addresses Codecov's 66.85% patch coverage warning by exercising the defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and the TextGenCodegen contract that previously had no spec hits.

- null-tolerance: feed null into every @JsonProperty (token, model, prompt col, system prompt, result col, task, maxNewTokens, temperature) and assert generatePythonCode still emits a parseable ProcessTableOperator with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and assert payloadPython / parsePython still reference self.MODEL_ID and body["choices"]…. Catches a future refactor that accidentally splices ctx fields into the static snippets.

13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec (117/117 descriptors still py_compile cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
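The null-side branches this spec exercises amount to a fallback per field. A minimal Python sketch of that behavior, assuming only the three defaults named in the commit message (the function and constant names are hypothetical, and any additional range clamping in the real operator is not shown because the commit does not state its bounds):

```python
from typing import Optional

# Defaults taken from the commit message above.
DEFAULT_TASK = "text-generation"
DEFAULT_MAX_NEW_TOKENS = 256
DEFAULT_TEMPERATURE = 0.7


def or_default(value, default):
    """Mirror of the `if (x == null) default else x` branches."""
    return default if value is None else value


def resolve_constants(task: Optional[str] = None,
                      max_new_tokens: Optional[int] = None,
                      temperature: Optional[float] = None) -> dict:
    """Return the constants the generated operator would end up with."""
    return {
        "TASK": or_default(task, DEFAULT_TASK),
        "MAX_NEW_TOKENS": or_default(max_new_tokens, DEFAULT_MAX_NEW_TOKENS),
        "TEMPERATURE": or_default(temperature, DEFAULT_TEMPERATURE),
    }


if __name__ == "__main__":
    print(resolve_constants())                       # every field null
    print(resolve_constants(temperature=0.0))        # explicit zero is kept
```

Testing against `None` rather than truthiness matters here: an explicit temperature of `0.0` is a legitimate setting and should not be replaced by the default.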
61e6c41 to 8350eb9
Ma77Ball left a comment
Please look over the suggestions below.
…NAI_COMPATIBLE_PROVIDERS to class constants
Hi @PG1204, what is the status of this PR? I'm not sure whether it is ready for review. Given the note about the stacked PRs, is the current diff accurate?
@Yicong-Huang The comments from @Ma77Ball have been resolved; awaiting further review.
@Yicong-Huang This is PR 2 in the stacked series for the HuggingFace operator. PR 1 was merged a while back.
Ma77Ball left a comment
Overall LGTM! I think the below can be implemented or left as is.
| | config | throughput (tuples/sec) | MB/s | latency p50/p95/p99 | max Δ latest / 7d |
|---|---|---|---|---|---|
| 🟢 | bs=10 sw=10 sl=64 | 449 | 0.274 | 21,682/28,635/28,635 us | 🟢 -14.0% / 🟢 -18.1% |
| 🔴 | bs=100 sw=10 sl=64 | 942 | 0.575 | 104,749/159,763/159,763 us | 🔴 +16.0% / 🔴 +14.3% |
| 🟢 | bs=1000 sw=10 sl=64 | 1,118 | 0.683 | 893,104/943,018/943,018 us | 🟢 -6.4% / 🟢 -8.2% |
Baseline details
Latest main 891d2ad from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 449 tuples/sec | 465 tuples/sec | 410.82 tuples/sec | -3.4% | +9.3% |
| bs=10 sw=10 sl=64 | MB/s | 0.274 MB/s | 0.284 MB/s | 0.251 MB/s | -3.5% | +9.3% |
| bs=10 sw=10 sl=64 | p50 | 21,682 us | 21,250 us | 23,785 us | +2.0% | -8.8% |
| bs=10 sw=10 sl=64 | p95 | 28,635 us | 33,299 us | 34,980 us | -14.0% | -18.1% |
| bs=10 sw=10 sl=64 | p99 | 28,635 us | 33,299 us | 34,980 us | -14.0% | -18.1% |
| bs=100 sw=10 sl=64 | throughput | 942 tuples/sec | 982 tuples/sec | 891.94 tuples/sec | -4.1% | +5.6% |
| bs=100 sw=10 sl=64 | MB/s | 0.575 MB/s | 0.599 MB/s | 0.544 MB/s | -4.0% | +5.6% |
| bs=100 sw=10 sl=64 | p50 | 104,749 us | 97,654 us | 112,277 us | +7.3% | -6.7% |
| bs=100 sw=10 sl=64 | p95 | 159,763 us | 137,693 us | 139,802 us | +16.0% | +14.3% |
| bs=100 sw=10 sl=64 | p99 | 159,763 us | 137,693 us | 139,802 us | +16.0% | +14.3% |
| bs=1000 sw=10 sl=64 | throughput | 1,118 tuples/sec | 1,112 tuples/sec | 1,041 tuples/sec | +0.5% | +7.4% |
| bs=1000 sw=10 sl=64 | MB/s | 0.683 MB/s | 0.679 MB/s | 0.635 MB/s | +0.6% | +7.5% |
| bs=1000 sw=10 sl=64 | p50 | 893,104 us | 894,598 us | 972,714 us | -0.2% | -8.2% |
| bs=1000 sw=10 sl=64 | p95 | 943,018 us | 1,007,417 us | 1,023,057 us | -6.4% | -7.8% |
| bs=1000 sw=10 sl=64 | p99 | 943,018 us | 1,007,417 us | 1,023,057 us | -6.4% | -7.8% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,445.39,200,128000,449,0.274,21682.12,28634.95,28634.95
1,100,10,64,20,2123.63,2000,1280000,942,0.575,104749.07,159762.78,159762.78
2,1000,10,64,20,17884.78,20000,12800000,1118,0.683,893103.94,943017.96,943017.96
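The derived columns in the tables above can be re-computed from the raw CSV as a sanity check. The sketch below assumes throughput is `total_tuples / seconds`, that the MB/s column divides bytes by 2^20 (which is what reproduces the reported values), and that Δ is the relative change of the PR value against the baseline; the function names are illustrative.

```python
import csv
import io

RAW_CSV = """\
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,445.39,200,128000,449,0.274,21682.12,28634.95,28634.95
1,100,10,64,20,2123.63,2000,1280000,942,0.575,104749.07,159762.78,159762.78
2,1000,10,64,20,17884.78,20000,12800000,1118,0.683,893103.94,943017.96,943017.96
"""


def derive(row):
    """Recompute throughput and MB/s from the raw totals of one CSV row."""
    seconds = float(row["total_ms"]) / 1000.0
    tuples_per_sec = int(row["total_tuples"]) / seconds
    mb_per_sec = int(row["total_bytes"]) / (1024 * 1024) / seconds
    return tuples_per_sec, mb_per_sec


def pct_change(pr_value, baseline):
    """Relative change of the PR measurement against a baseline, in percent."""
    return (pr_value - baseline) / baseline * 100.0


if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(RAW_CSV)):
        tps, mbs = derive(row)
        print(f"bs={row['batch_size']}: {tps:.0f} tuples/sec, {mbs:.3f} MB/s")
    # p95 for bs=100 against latest main (159,763 us vs 137,693 us)
    print(f"{pct_change(159763, 137693):+.1f}%")
```

For latency, a positive Δ is a regression, while for throughput a positive Δ is an improvement, which is why the same sign carries different status colors across rows.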
xuang7 left a comment
LGTM! I think this may be better categorized as a feature rather than a refactor.
Hi @PG1204, before merging this PR, I think it's better to remove the stack note at the beginning of the PR description, as it contains obsolete information now that the PRs it was based on are merged.
Hi @Yicong-Huang, the stack note has been removed. Thanks for the suggestion.
…#5320) ### What changes were proposed in this PR? Adds the image task family — 9 HF pipeline tasks — as the second `TaskCodegen` plugged into the dispatcher established by apache#5278: image-only: image-classification, object-detection, image-segmentation, image-to-text image + prompt: visual-question-answering, document-question-answering, zero-shot-image-classification, image-text-to-text, image-to-image - `codegen/ImageTaskCodegen.scala` supplies the per-task payload + parse Python branches for all 9 tasks. - `TaskCodegen` trait gains a `tasks: Set[String]` default method (defaults to `Set(task)`) so a single codegen can register under multiple task strings; `ImageTaskCodegen` is the first multi-task codegen to use it. - `CodegenContext` extended with `imageInput` + `inputImageColumn` (`EncodableString`). - `HuggingFaceInferenceOpDesc.scala` gains 2 new `@JsonProperty` fields and registers `ImageTaskCodegen` via the new `tasks` flat-map. `PythonCodegenBase.scala` grows to host the shared image infrastructure: - Task-family tuples (`image_only_tasks`, `image_prompt_tasks`, `image_tasks`) + `image_headers` in `process_table`. - Per-row image-bytes resolution from upload or column with `_read_image_input` / `_read_binary_value` / `_compress_image_bytes`. - `_post_with_fallback` extended with `raw_binary_headers` + `use_raw_binary_body`; adds image-text-to-text chat-completions and model-author vision branches. - `_call_provider` gains zai-org, Replicate predictions + polling, Fal-ai, Wavespeed submit+poll branches, and image embedding for OpenAI-compatible / unknown-provider fallbacks. - Image content-type response handling returns `data:image/...;base64,...` URLs. - Image helpers added: `_read_image_input`, `_compress_image_bytes`, `_image_input_as_base64`, `_read_binary_value`, `_looks_like_html`, `_html_to_image_bytes`, `_extract_json_arg`, `_url_to_data_url`. 
Frontend integration (HF lines only — no agent / dataset noise): `HuggingFaceImageUploadComponent` declared in `app.module.ts`, `huggingface-image-upload` formly type registered, image upload component .ts/.html/.scss + `HuggingFace.png` + `sample-image.png` assets. User-input strings continue to flow through `pyb"..."` + `EncodableString` so they reach Python as `self.decode_python_template('<base64>')` rather than raw literals. `PythonCodeRawInvalidTextSpec` still passes (117/117 descriptors `py_compile` cleanly). ### Any related issues, documentation, or discussions? - Tracking issue: apache#5319 - Closes: apache#5319 - Stacked on: apache#5278 (operator + text-generation — issue apache#5277) - Parent issue: apache#5041 - Closed sibling issue: apache#5134 (REST resource — landed via apache#5124) ### How was this PR tested? - `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean. - `sbt scalafmtCheck` clean. - `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` — 18/18 pass (PR 2's 13 spec tests + 5 new image-task tests: image-only routing, VQA / document-QA payload, image-text-to-text chat-completions, image-to-image data-URL parse, all-9-tasks dispatcher coverage). - `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` — 117/117 descriptors `py_compile` cleanly with the new operator code paths, no marker leaks. - Generated Python verified via `python3 -m py_compile` on sample image-task outputs. ### Was this PR authored or co-authored using generative AI tooling? Yes, co-authored with Claude Opus 4.7. --------- Signed-off-by: Prateek Ganigi <91584519+PG1204@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
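The `tasks` flat-map registration described in this commit can be sketched in Python as follows. The class names echo the Scala ones, but the code is an illustrative analogue: which single `task` value `ImageTaskCodegen` reports and the duplicate-registration check are assumptions, while the nine task strings come from the commit message.

```python
class TaskCodegen:
    """Base with the default described above: tasks defaults to {task}."""

    task = ""

    def tasks(self) -> frozenset:
        return frozenset({self.task})


class TextGenCodegen(TaskCodegen):
    task = "text-generation"


class ImageTaskCodegen(TaskCodegen):
    task = "image-classification"  # assumed canonical task for this sketch

    def tasks(self) -> frozenset:
        # The nine image-family task strings listed in the commit message.
        return frozenset({
            "image-classification", "object-detection", "image-segmentation",
            "image-to-text", "visual-question-answering",
            "document-question-answering", "zero-shot-image-classification",
            "image-text-to-text", "image-to-image",
        })


def build_registry(codegens) -> dict:
    """Flat-map every codegen over its task strings into one lookup table."""
    registry = {}
    for codegen in codegens:
        for name in codegen.tasks():
            if name in registry:  # assumed guard, not confirmed by the PR
                raise ValueError(f"task registered twice: {name}")
            registry[name] = codegen
    return registry


if __name__ == "__main__":
    registry = build_registry([TextGenCodegen(), ImageTaskCodegen()])
    print(sorted(registry))
```

Single-task codegens need no changes under this scheme, since the default `tasks()` already returns their one task string.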
…eration (apache#5278)

> ⚠️ This PR is stacked on apache#5124. Until that lands, the diff below includes apache#5124's `HuggingFaceModelResource.scala` and the 1-line registration in `TexeraWebApplication.scala`. The new code in this PR is everything under `common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/` and the new test under `common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala`. Once apache#5124 merges, this diff will auto-clean to ~839 lines.

### What changes were proposed in this PR?

Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext` that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / `process_table` / `_parse_response` infrastructure with two holes for the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions payload and the `body["choices"][0]["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher holding the `@JsonProperty` fields and the `registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the `pyb"..."` macro so values reach Python as `self.decode_python_template('<base64>')` rather than raw literals. Class constants are assigned in `open(self)` so `self` is in scope for the decode call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a single codegen can register under multiple task strings; this becomes relevant in PR 3 (image family).

### Any related issues, documentation, or discussions?

Tracked in apache#5277 & apache#5041 (umbrella issue for the HuggingFace operator end-to-end implementation).

Closes apache#5277

Stacked on apache#5124 (PR 1 - REST resource).

This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly, no raw-text leaks. The new operator is included in this scan.
- Generated Python verified via `python3 -m py_compile` on a sample output.

### Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

---------

Co-authored-by: Elliot Lin <36275109+ELin2025@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuan Gu <162244362+xuang7@users.noreply.github.com>
What changes were proposed in this PR?

Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext` that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / `process_table` / `_parse_response` infrastructure with two holes for the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions payload and the `body["choices"][0]["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher holding the `@JsonProperty` fields and the `registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the `pyb"..."` macro so values reach Python as `self.decode_python_template('<base64>')` rather than raw literals. Class constants are assigned in `open(self)` so `self` is in scope for the decode call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a single codegen can register under multiple task strings; this becomes relevant in PR 3 (image family).

Any related issues, documentation, or discussions?

Tracked in #5277 & #5041 (umbrella issue for the HuggingFace operator end-to-end implementation).

Closes #5277

Stacked on #5124 (PR 1 - REST resource).

This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.

How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly, no raw-text leaks. The new operator is included in this scan.
- Generated Python verified via `python3 -m py_compile` on a sample output.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7
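The `EncodableString` handling in the description (user input reaching Python as `self.decode_python_template('<base64>')`, assigned inside `open(self)`) can be illustrated with the sketch below. It assumes the decode step is plain base64 to UTF-8, which the PR does not spell out; `emit_encoded`, `build_open_method`, `FakeOperator`, and the `SYSTEM_PROMPT` constant are hypothetical stand-ins.

```python
import base64


def emit_encoded(value: str) -> str:
    """Return a Python expression that decodes `value` at runtime.

    The base64 alphabet contains no quotes, backslashes, or newlines,
    so the emitted literal cannot terminate the string early.
    """
    b64 = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"self.decode_python_template('{b64}')"


def build_open_method(system_prompt: str) -> str:
    """Emit an open(self) method; constants are set here so `self` is in scope."""
    return (
        "def open(self):\n"
        f"    self.SYSTEM_PROMPT = {emit_encoded(system_prompt)}\n"
    )


class FakeOperator:
    """Stand-in for the generated operator; the decode behavior is an assumption."""

    def decode_python_template(self, b64: str) -> str:
        return base64.b64decode(b64).decode("utf-8")


if __name__ == "__main__":
    hostile = "'); import os; os.system('echo pwned'); ('"
    source = build_open_method(hostile)
    namespace = {}
    exec(source, namespace)          # only defines open(); nothing else runs
    operator = FakeOperator()
    namespace["open"](operator)
    print(operator.SYSTEM_PROMPT == hostile)
```

Splicing the same hostile string into the template as a raw literal would let it close the quote and append statements, which is the class of leak the `PythonCodeRawInvalidTextSpec` scan mentioned above guards against.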