feat(huggingFace): add HuggingFaceModelResource for model browsing and media proxy by PG1204 · Pull Request #5124 · apache/texera

PG1204 · 2026-05-17T21:33:40Z

What changes were proposed in this PR?

Introduces HuggingFaceModelResource - a Jersey REST resource at /api/huggingface/* that backs the upcoming HuggingFace operator's model picker, audio upload, and media preview UI. Five endpoints:

| Endpoint | Purpose |
|---|---|
| `GET /api/huggingface/models?task=…[&search=…]` | Browse or search HF models |
| `GET /api/huggingface/tasks` | List HF pipeline tags with hosted inference |
| `POST /api/huggingface/upload-audio?filename=…` | Stream-upload audio files |
| `GET /api/huggingface/audio-preview?path=…` | Stream uploaded audio back |
| `GET /api/huggingface/media-proxy?url=…` | Proxy allowlisted remote media URLs (CORS bypass) |

Plus a single-line registration of the resource in TexeraWebApplication.

Architectural notes:

- Token sourcing: the user's HF token arrives via the X-HF-Token request header (forwarded by the frontend from the operator's property panel in a follow-up PR). When absent, requests go to HF Hub anonymously. There is no server-side env-var token.
- Caching: bounded Guava Cache (size + TTL) for /models and /tasks results. User-token requests bypass the cache to avoid serving one user's token-scoped list to another.
- Streaming upload: /upload-audio reads InputStream straight to disk in 8 KB chunks with a 25 MiB cap (returns 413 when the cap is exceeded) - the request body is never buffered in memory. Extension allowlist rejects non-audio types up front.
- SSRF protection: /media-proxy requires the URL's host to be in an allowlist (HF, fal.media, replicate.delivery/com) with a leading-dot suffix guard against lookalike domains.
- Bounded fan-out: /tasks uses a dedicated ForkJoinPool(4) for its per-task probe instead of the JVM's global common pool, with explicit 429/503 detection that logs at WARN.
- Truncation visibility: browse and search responses carry an X-Texera-Truncated: true header when results were capped, so the frontend can show "list incomplete" hints.
- Error responses: generic Jackson-built JSON bodies (no exception internals leak to clients); details are logged server-side.
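To illustrate the leading-dot suffix guard from the SSRF note, here is a minimal sketch in Python (the resource itself is Scala; this is not the PR's code). The allowlist entries, the accepted schemes, and the function name `is_allowed_media_url` are assumptions made for illustration; the thread only names the allowed providers loosely.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the PR names HF, fal.media and replicate.delivery/com,
# but the exact entries are not shown in this thread.
ALLOWED_HOSTS = ("huggingface.co", "fal.media", "replicate.delivery", "replicate.com")


def is_allowed_media_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is an allowlisted domain
    or a true subdomain of one."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal
    # Assumption: both http and https are accepted; a stricter policy may want https only.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    if not host:
        return False
    # Exact match, or a suffix match that includes the leading dot, so that
    # "evilhuggingface.co" does not pass as a subdomain of "huggingface.co".
    return any(host == d or host.endswith("." + d) for d in ALLOWED_HOSTS)
```

A plain `host.endswith("huggingface.co")` check would accept `evilhuggingface.co`; requiring the dot in the suffix is what rejects such lookalikes.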

Any related issues, documentation, or discussions?

Tracked in #5134 and #5041 (umbrella issue for the HuggingFace operator end-to-end implementation). This PR is the backend foundation; subsequent PRs will add the operator class, frontend property panel, result-panel media rendering, and developer documentation.

Closes #5134

How was this PR tested?

Unit tests: amber/src/test/scala/.../HuggingFaceModelResourceSpec.scala - 86 ScalaTest cases covering token sanitization, SSRF allowlist (including lookalike-domain rejection), JSON error escaping, MIME type inference, the audio-upload validation/size-cap/extension paths, audio-preview path validation and traversal rejection, media-proxy rejection paths, cache hit/bypass semantics, and the temp-dir sweep. Run with sbt 'WorkflowExecutionService/testOnly org.apache.texera.web.resource.HuggingFaceModelResourceSpec' - all 86 pass in ~6 seconds, no external network required.
Manual smoke tests against a local backend:
- GET /api/huggingface/tasks returns the expected JSON task list.
- GET /api/huggingface/models?task=text-generation returns the paginated model list; text-generation shows the X-Texera-Truncated: true header when MAX_PAGES=50 is hit.
- POST /upload-audio?filename=evil.sh → 400 (extension allowlist).
- POST /upload-audio with a 30 MiB body → 413 (size cap).
- GET /media-proxy?url=http://localhost:8080/ → 403 (SSRF allowlist).
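The truncation behaviour seen in the `text-generation` smoke test can be sketched as follows. This is an illustrative Python outline, not the PR's Scala code: `fetch_page`, `collect_models`, and `response_headers` are hypothetical names, and the cursor-based paging shape is an assumption. Only `MAX_PAGES=50` and the `X-Texera-Truncated: true` header come from the description.

```python
from typing import Callable, List, Optional, Tuple

MAX_PAGES = 50  # value quoted in the smoke test above

# fetch_page is a hypothetical callable: given a cursor (None for the first
# page) it returns (items, next_cursor); next_cursor is None on the last page.
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def collect_models(fetch_page: FetchPage, max_pages: int = MAX_PAGES) -> Tuple[List[dict], bool]:
    """Follow pagination for at most max_pages pages and report whether
    the listing was cut off before its real end."""
    items: List[dict] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items, False  # reached the real end of the listing
    return items, True  # stopped at the cap with more pages remaining


def response_headers(truncated: bool) -> dict:
    """Emit the header only when the list is incomplete."""
    return {"X-Texera-Truncated": "true"} if truncated else {}
```

Returning the flag alongside the items keeps the "was this capped?" decision next to the loop that enforces the cap, so the header cannot drift out of sync with the data.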

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7 in compliance with ASF

…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the upcoming HuggingFace operator UI:

- GET /api/huggingface/models - browse / search models per task
- GET /api/huggingface/tasks - list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio - upload audio for HF audio tasks
- GET /api/huggingface/audio-preview - stream uploaded audio (path-validated)
- GET /api/huggingface/media-proxy - proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end. No operator code yet; this resource is independently useful and lets the frontend integrate with HF before the operator class lands.

PG1204 · 2026-05-17T21:39:22Z

/request-review @Ma77Ball

codecov-commenter · 2026-05-17T21:44:29Z

Codecov Report

❌ Patch coverage is 66.85393% with 118 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.07%. Comparing base (34be37d) to head (4a52406).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...texera/web/resource/HuggingFaceModelResource.scala | 67.04% | 90 Missing and 27 partials ⚠️ |
| ...a/org/apache/texera/web/TexeraWebApplication.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5124      +/-   ##
============================================
- Coverage     49.63%   49.07%   -0.57%     
- Complexity     2385     2411      +26     
============================================
  Files          1051     1045       -6     
  Lines         40399    40384      -15     
  Branches       4292     4303      +11     
============================================
- Hits          20052    19817     -235     
- Misses        19165    19392     +227     
+ Partials       1182     1175       -7

| Flag | Coverage Δ | | *Carryforward flag |
|---|---|---|---|
| access-control-service | `39.41% <ø> (-2.49%)` | ⬇️ | Carriedforward from e27d4d0 |
| agent-service | `33.76% <ø> (ø)` | | Carriedforward from e27d4d0 |
| amber | `51.96% <66.85%> (+0.29%)` | ⬆️ | |
| computing-unit-managing-service | `0.00% <ø> (ø)` | | Carriedforward from e27d4d0 |
| config-service | `0.00% <ø> (ø)` | | Carriedforward from e27d4d0 |
| file-service | `37.99% <ø> (-0.43%)` | ⬇️ | Carriedforward from e27d4d0 |
| frontend | `40.39% <ø> (-1.83%)` | ⬇️ | Carriedforward from e27d4d0 |
| python | `90.79% <ø> (-0.02%)` | ⬇️ | Carriedforward from e27d4d0 |
| workflow-compiling-service | `56.81% <ø> (ø)` | | Carriedforward from e27d4d0 |

*This pull request uses carry forward flags.

Yicong-Huang · 2026-05-18T04:33:58Z

@PG1204 Thanks for opening this PR! Please do the following:

- Please follow our PR template and make the description concise.
- Please make sure your code meets the test coverage requirement.
- Please use issues to describe future plans such as stacked PRs. Each PR becomes immutable after merge, whereas issues can hold information that outlives a PR and can be updated. If you plan to open multiple PRs, I suggest using an umbrella issue that contains multiple sub-issues, one per PR.
- You can use /request-review @xxx to request a review.

PG1204 · 2026-05-18T04:39:04Z

@Yicong-Huang

Thank you for the suggestions. Will update the PR accordingly.

Ma77Ball · 2026-05-18T19:30:44Z

Hi @PG1204, while I begin my review, please address @Yicong-Huang's feedback. Specifically:

Update the PR description to follow this template exactly:

   ### What changes were proposed in this PR?
   ...
   ### Any related issues, documentation, or discussions?
   ...
   ### How was this PR tested?
   ...
   ### Was this PR authored or co-authored using generative AI tooling?
   ...

Add test coverage for as much of the new code as possible. At a minimum, please cover the main features and call paths introduced here.
Relocate the overall PR plan to the parent issue, and keep this PR's description scoped to the code changes it actually contains.
Document any architectural changes. If this PR modifies the architecture, please describe what changed and where, so reviewers can follow the design intent.

Thanks, and looking forward to the updates!

Ma77Ball

Please review and resolve the comments and ask any questions as needed.

PG1204 · 2026-05-20T03:11:07Z

/request-review @Ma77Ball requesting re-review for the changes.

Ma77Ball

LGTM!

Note

Suggestions above that were not resolved should be resolved in the upcoming PRs. Also, test cases should be added in future PRs to address the missing lines reported by codecov.

Ma77Ball · 2026-05-27T11:15:36Z

/request-review @xuang7

xuang7

The PR looks good overall. I left two comments. Please also resolve any existing comments if they can be addressed in this PR, and mark them as resolved.

Addresses xuang7's review on PR apache#5124. Both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or out-of-band write into the audio temp dir (/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against missing/lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 spec passes). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper - a follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
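The two-stage cap described for /media-proxy (a Content-Length pre-check plus a running byte counter) can be sketched in Python as below. This is an illustration of the technique only, not the `streamMediaWithCap` helper from the commit: the function name `stream_with_cap`, the `PayloadTooLarge` exception, and the stream-to-stream signature are assumptions. The 8 KiB chunk size and 50 MiB cap are the values quoted in the commit message.

```python
import io
from typing import BinaryIO, Optional

CHUNK_SIZE = 8 * 1024                     # 8 KiB, as in the commit message
MAX_MEDIA_PROXY_BYTES = 50 * 1024 * 1024  # 50 MiB, as in the commit message


class PayloadTooLarge(Exception):
    """Hypothetical signal that the caller should answer with HTTP 413."""


def stream_with_cap(src: BinaryIO, dst: BinaryIO,
                    declared_length: Optional[int] = None,
                    cap: int = MAX_MEDIA_PROXY_BYTES) -> int:
    """Copy src to dst in fixed-size chunks and return the byte count,
    refusing any payload larger than cap."""
    # Pre-check: reject early when the upstream declares an oversized body.
    if declared_length is not None and declared_length > cap:
        raise PayloadTooLarge(f"declared {declared_length} bytes exceeds cap {cap}")
    total = 0
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            return total
        total += len(chunk)
        # Mid-read check: covers a missing or understated Content-Length.
        if total > cap:
            raise PayloadTooLarge(f"body exceeded cap {cap} after {total} bytes")
        dst.write(chunk)
```

Because the counter is checked before each write, `dst` never receives more than `cap` bytes; on a mid-read abort the caller would still need to discard whatever was written so far.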

@JsonProperty

…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines) holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals; class constants are assigned in open(self) so self is in scope for the decode call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
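Why emitting a base64 payload instead of a raw literal matters can be shown with a small Python sketch. The real `pyb"..."` macro and `decode_python_template` are not shown in this thread, so `encode_for_template` and `decode_template` below are hypothetical stand-ins that only demonstrate the general idea.

```python
import base64


def encode_for_template(user_value: str) -> str:
    """Hypothetical stand-in for the code generator: return a Python expression
    that decodes a base64 payload instead of embedding a raw string literal."""
    payload = base64.b64encode(user_value.encode("utf-8")).decode("ascii")
    return f"decode_template('{payload}')"


def decode_template(payload: str) -> str:
    """Hypothetical counterpart of the runtime decode call."""
    return base64.b64decode(payload).decode("utf-8")


# A value that would break out of a naively quoted literal.
hostile = "x'); import os; os.system('echo pwned'); ('"
expr = encode_for_template(hostile)

# The text between the quotes uses only the base64 alphabet (A-Z, a-z, 0-9, +, /, =),
# so evaluating the expression yields the original text as data, never as code.
assert eval(expr, {"decode_template": decode_template}) == hostile
```

The base64 alphabet contains no quote or backslash characters, so the emitted literal cannot be terminated early regardless of what the user typed.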

PG1204 · 2026-05-28T18:27:50Z

@Ma77Ball Would you prefer that I resolve the conversations, or would you rather resolve them yourself? If any of the comments still require work, I will address them and update the PR.

xuang7

LGTM!

@RolesAllowed

Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet — that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on — but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eration (apache#5278)

> ⚠️ This PR is stacked on apache#5124. Until that lands, the diff below includes apache#5124's `HuggingFaceModelResource.scala` and the 1-line registration in `TexeraWebApplication.scala`. The new code in this PR is everything under `common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/` and the new test under `common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala`. Once apache#5124 merges, this diff will auto-clean to ~839 lines.

### What changes were proposed in this PR?

Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext` that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / `process_table` / `_parse_response` infrastructure with two holes for the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions payload and the `body["choices"][0]["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher holding the `@JsonProperty` fields and the `registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the `pyb"..."` macro so values reach Python as `self.decode_python_template('<base64>')` rather than raw literals. Class constants are assigned in `open(self)` so `self` is in scope for the decode call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a single codegen can register under multiple task strings; this becomes relevant in PR 3 (image family).

### Any related issues, documentation, or discussions?

Tracked in apache#5277 and apache#5041 (umbrella issue for the HuggingFace operator end-to-end implementation).

Closes apache#5277

Stacked on apache#5124 (PR 1 - REST resource). This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly, no raw-text leaks. The new operator is included in this scan.
- Generated Python verified via `python3 -m py_compile` on a sample output.

### Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

---------

Co-authored-by: Elliot Lin <36275109+ELin2025@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuan Gu <162244362+xuang7@users.noreply.github.com>
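A runtime model-ID check of the kind described above (validate before composing any URL) might look like the following Python sketch. The actual `_HF_MODEL_ID_PATTERN` is not shown in this thread; the regular expression, the `build_model_url` helper, and the placeholder base URL are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: "name" or "owner/name", each segment starting with a
# letter or digit and continuing with letters, digits, '.', '_' or '-'.
_HF_MODEL_ID_PATTERN = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*(/[A-Za-z0-9][A-Za-z0-9._-]*)?"
)


def build_model_url(model_id: str, base: str = "https://example.invalid/models") -> str:
    """Compose a URL only after the model id passes the pattern check.
    The base URL is a placeholder, not a real endpoint."""
    if not _HF_MODEL_ID_PATTERN.fullmatch(model_id):
        raise ValueError(f"invalid model id: {model_id!r}")
    return f"{base}/{model_id}"
```

Using `fullmatch` rather than `match` or `search` matters here: a prefix match would let trailing path segments or query strings ride along after a valid-looking start.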

…#5320)

### What changes were proposed in this PR?

Adds the image task family (9 HF pipeline tasks) as the second `TaskCodegen` plugged into the dispatcher established by apache#5278:

- image-only: image-classification, object-detection, image-segmentation, image-to-text
- image + prompt: visual-question-answering, document-question-answering, zero-shot-image-classification, image-text-to-text, image-to-image

Changes:

- `codegen/ImageTaskCodegen.scala` supplies the per-task payload + parse Python branches for all 9 tasks.
- `TaskCodegen` trait gains a `tasks: Set[String]` default method (defaults to `Set(task)`) so a single codegen can register under multiple task strings; `ImageTaskCodegen` is the first multi-task codegen to use it.
- `CodegenContext` extended with `imageInput` + `inputImageColumn` (`EncodableString`).
- `HuggingFaceInferenceOpDesc.scala` gains 2 new `@JsonProperty` fields and registers `ImageTaskCodegen` via the new `tasks` flat-map.

`PythonCodegenBase.scala` grows to host the shared image infrastructure:

- Task-family tuples (`image_only_tasks`, `image_prompt_tasks`, `image_tasks`) + `image_headers` in `process_table`.
- Per-row image-bytes resolution from upload or column with `_read_image_input` / `_read_binary_value` / `_compress_image_bytes`.
- `_post_with_fallback` extended with `raw_binary_headers` + `use_raw_binary_body`; adds image-text-to-text chat-completions and model-author vision branches.
- `_call_provider` gains zai-org, Replicate predictions + polling, Fal-ai, Wavespeed submit+poll branches, and image embedding for OpenAI-compatible / unknown-provider fallbacks.
- Image content-type response handling returns `data:image/...;base64,...` URLs.
- Image helpers added: `_read_image_input`, `_compress_image_bytes`, `_image_input_as_base64`, `_read_binary_value`, `_looks_like_html`, `_html_to_image_bytes`, `_extract_json_arg`, `_url_to_data_url`.

Frontend integration (HF lines only, no agent / dataset noise): `HuggingFaceImageUploadComponent` declared in `app.module.ts`, `huggingface-image-upload` formly type registered, image upload component .ts/.html/.scss + `HuggingFace.png` + `sample-image.png` assets.

User-input strings continue to flow through `pyb"..."` + `EncodableString` so they reach Python as `self.decode_python_template('<base64>')` rather than raw literals. `PythonCodeRawInvalidTextSpec` still passes (117/117 descriptors `py_compile` cleanly).

### Any related issues, documentation, or discussions?

- Tracking issue: apache#5319
- Closes: apache#5319
- Stacked on: apache#5278 (operator + text-generation, issue apache#5277)
- Parent issue: apache#5041
- Closed sibling issue: apache#5134 (REST resource, landed via apache#5124)

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 18/18 pass (PR 2's 13 spec tests + 5 new image-task tests: image-only routing, VQA / document-QA payload, image-text-to-text chat-completions, image-to-image data-URL parse, all-9-tasks dispatcher coverage).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly with the new operator code paths, no marker leaks.
- Generated Python verified via `python3 -m py_compile` on sample image-task outputs.

### Was this PR authored or co-authored using generative AI tooling?

Yes, co-authored with Claude Opus 4.7.

---------

Signed-off-by: Prateek Ganigi <91584519+PG1204@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
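The `tasks` default and the flat-map registration described above can be pictured with a small Python analogue. This is a sketch of the pattern only: the real code is a Scala trait, the class bodies here are empty stand-ins, the image codegen lists just three of its nine tasks for brevity, and the duplicate-registration check is an added assumption rather than something the PR states.

```python
from typing import Dict, FrozenSet, Iterable


class TaskCodegen:
    """Hypothetical Python analogue of the Scala trait described above."""
    task: str = ""

    @property
    def tasks(self) -> FrozenSet[str]:
        # Default: a codegen serves exactly its own task string.
        return frozenset({self.task})


class TextGenCodegen(TaskCodegen):
    task = "text-generation"


class ImageTaskCodegen(TaskCodegen):
    task = "image-classification"

    @property
    def tasks(self) -> FrozenSet[str]:
        # Override: one codegen registered under several task strings.
        return frozenset({"image-classification", "object-detection", "image-to-text"})


def build_registry(codegens: Iterable[TaskCodegen]) -> Dict[str, TaskCodegen]:
    """Flat-map every codegen's task set into a single task -> codegen map."""
    registry: Dict[str, TaskCodegen] = {}
    for codegen in codegens:
        for task in codegen.tasks:
            if task in registry:  # assumption: overlapping claims are an error
                raise ValueError(f"duplicate codegen for task {task!r}")
            registry[task] = codegen
    return registry
```

With this shape, single-task codegens need no extra code, and a new multi-task family only has to override `tasks` to appear under every string it handles.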

…d media proxy (apache#5124)

Co-authored-by: Elliot Lin <36275109+ELin2025@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuan Gu <162244362+xuang7@users.noreply.github.com>

…eration (apache#5278)

> ⚠️ This PR is stacked on apache#5124. Until that lands, the diff below includes apache#5124's `HuggingFaceModelResource.scala` and the 1-line registration in `TexeraWebApplication.scala`. The new code in this PR is everything under `common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/` and the new test under `common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala`. Once apache#5124 merges, this diff will auto-clean to ~839 lines.

### What changes were proposed in this PR?

Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext` that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / `process_table` / `_parse_response` infrastructure with two holes for the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions payload and the `body["choices"][0]["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher holding the `@JsonProperty` fields and the `registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the `pyb"..."` macro so values reach Python as `self.decode_python_template('<base64>')` rather than raw literals. Class constants are assigned in `open(self)` so `self` is in scope for the decode call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a single codegen can register under multiple task strings. This becomes relevant in PR 3 (image family).

### Any related issues, documentation, or discussions?

Tracked in apache#5277 & apache#5041 (umbrella issue for the HuggingFace operator end-to-end implementation).

Closes apache#5277

Stacked on apache#5124 (PR 1 - REST resource).

This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly, no raw-text leaks. The new operator is included in this scan.
- Generated Python verified via `python3 -m py_compile` on a sample output.

### Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

---------

Co-authored-by: Elliot Lin <36275109+ELin2025@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Xuan Gu <162244362+xuang7@users.noreply.github.com>
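The Base64 hand-off described above can be illustrated with a minimal sketch. This is not Texera's `pyb"..."` macro or its `EncodableString` type; `Base64Sketch` and `encodeForPython` are hypothetical names, and the only detail taken from the PR text is the emitted shape `self.decode_python_template('<base64>')`.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object Base64Sketch {
  // Hypothetical helper (an assumption, not the real macro): wraps a
  // user-supplied string as a Python expression that decodes it at runtime.
  def encodeForPython(userInput: String): String = {
    val b64 = Base64.getEncoder.encodeToString(userInput.getBytes(StandardCharsets.UTF_8))
    s"self.decode_python_template('$b64')"
  }

  def main(args: Array[String]): Unit = {
    // An input that would break out of a raw single-quoted Python literal.
    val hostile = "'); import os #"
    // The standard Base64 alphabet has no quotes or newlines, so the emitted
    // fragment stays a single well-formed string literal.
    println(encodeForPython(hostile))
  }
}
```

The point of the design, as far as the PR text describes it, is that user input never appears verbatim in the generated Python source, so quoting and escaping bugs cannot turn a property value into code.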

…#5320)

### What changes were proposed in this PR?

Adds the image task family (9 HF pipeline tasks) as the second `TaskCodegen` plugged into the dispatcher established by apache#5278:

- image-only: image-classification, object-detection, image-segmentation, image-to-text
- image + prompt: visual-question-answering, document-question-answering, zero-shot-image-classification, image-text-to-text, image-to-image

Changes:

- `codegen/ImageTaskCodegen.scala` supplies the per-task payload + parse Python branches for all 9 tasks.
- `TaskCodegen` trait gains a `tasks: Set[String]` default method (defaults to `Set(task)`) so a single codegen can register under multiple task strings; `ImageTaskCodegen` is the first multi-task codegen to use it.
- `CodegenContext` extended with `imageInput` + `inputImageColumn` (`EncodableString`).
- `HuggingFaceInferenceOpDesc.scala` gains 2 new `@JsonProperty` fields and registers `ImageTaskCodegen` via the new `tasks` flat-map.

`PythonCodegenBase.scala` grows to host the shared image infrastructure:

- Task-family tuples (`image_only_tasks`, `image_prompt_tasks`, `image_tasks`) + `image_headers` in `process_table`.
- Per-row image-bytes resolution from upload or column with `_read_image_input` / `_read_binary_value` / `_compress_image_bytes`.
- `_post_with_fallback` extended with `raw_binary_headers` + `use_raw_binary_body`; adds image-text-to-text chat-completions and model-author vision branches.
- `_call_provider` gains zai-org, Replicate predictions + polling, Fal-ai, Wavespeed submit+poll branches, and image embedding for OpenAI-compatible / unknown-provider fallbacks.
- Image content-type response handling returns `data:image/...;base64,...` URLs.
- Image helpers added: `_read_image_input`, `_compress_image_bytes`, `_image_input_as_base64`, `_read_binary_value`, `_looks_like_html`, `_html_to_image_bytes`, `_extract_json_arg`, `_url_to_data_url`.

Frontend integration (HF lines only, no agent / dataset noise): `HuggingFaceImageUploadComponent` declared in `app.module.ts`, `huggingface-image-upload` formly type registered, image upload component .ts/.html/.scss + `HuggingFace.png` + `sample-image.png` assets.

User-input strings continue to flow through `pyb"..."` + `EncodableString` so they reach Python as `self.decode_python_template('<base64>')` rather than raw literals. `PythonCodeRawInvalidTextSpec` still passes (117/117 descriptors `py_compile` cleanly).

### Any related issues, documentation, or discussions?

- Tracking issue: apache#5319
- Closes: apache#5319
- Stacked on: apache#5278 (operator + text-generation, issue apache#5277)
- Parent issue: apache#5041
- Closed sibling issue: apache#5134 (REST resource, landed via apache#5124)

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 18/18 pass (PR 2's 13 spec tests + 5 new image-task tests: image-only routing, VQA / document-QA payload, image-text-to-text chat-completions, image-to-image data-URL parse, all-9-tasks dispatcher coverage).
- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 descriptors `py_compile` cleanly with the new operator code paths, no marker leaks.
- Generated Python verified via `python3 -m py_compile` on sample image-task outputs.

### Was this PR authored or co-authored using generative AI tooling?

Yes, co-authored with Claude Opus 4.7.

---------

Signed-off-by: Prateek Ganigi <91584519+PG1204@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
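The multi-task registration described above (a `tasks` default of `Set(task)`, flat-mapped into the dispatcher's map) might look roughly like the following. Only the names `TaskCodegen`, `task`, `tasks`, `registeredCodegens`, `TextGenCodegen`, and `ImageTaskCodegen` come from the PR text; the trait here is deliberately simplified (no `CodegenContext`), and `payloadSnippet`, `Dispatcher`, and `codegenFor` are hypothetical stand-ins.

```scala
// Simplified, assumed shape of the trait; the real one is not shown in the PR text.
trait TaskCodegen {
  def task: String
  // Default: a codegen registers under its single task string.
  def tasks: Set[String] = Set(task)
  // Hypothetical placeholder for the per-task Python payload snippet.
  def payloadSnippet: String
}

object TextGenCodegen extends TaskCodegen {
  val task: String = "text-generation"
  val payloadSnippet: String = "# chat-completions payload"
}

object ImageTaskCodegen extends TaskCodegen {
  val task: String = "image-classification"
  // Overrides the default to claim several task strings (subset shown).
  override val tasks: Set[String] =
    Set("image-classification", "object-detection", "image-to-text")
  val payloadSnippet: String = "# image payload"
}

object Dispatcher {
  // Flat-map every codegen's task set into one lookup table.
  val registeredCodegens: Map[String, TaskCodegen] =
    Seq[TaskCodegen](TextGenCodegen, ImageTaskCodegen)
      .flatMap(c => c.tasks.map(t => t -> c))
      .toMap

  def codegenFor(task: String): Option[TaskCodegen] = registeredCodegens.get(task)
}
```

With this shape, adding a task family is one new object plus one entry in the `Seq`; single-task codegens need no override because the default already covers them.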

Addresses xuang7's review on PR apache#5124. Both endpoints previously buffered the full payload into a heap-resident byte[] with no upper bound, leaving the JVM open to OOM on a hostile or buggy upstream response (/media-proxy) or an out-of-band write into the audio temp dir (/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to asObject(Function<RawResponse, T>), streaming the upstream body in 8 KiB chunks with a running byte counter. Aborts with 413 if the declared Content-Length exceeds the cap (pre-check) or if the body crosses the cap mid-read (defends against missing/lying Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with headroom.
- /audio-preview: add Files.size() defense-in-depth check before readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on ingest; this catches the case where a bug or out-of-band write puts an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture so the test stays fast (87/87 spec passes). The media-proxy cap path is exercised via the existing input-validation suite plus the new streamMediaWithCap helper. A follow-up can add a fake-RawResponse unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
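As a rough illustration of the mid-read cap described in this commit, the sketch below copies a plain `java.io.InputStream` in 8 KiB chunks with a running byte counter. It is not the PR's `streamMediaWithCap` helper and does not use Unirest; `CappedCopy` and `copyWithCap` are hypothetical names, and mapping the `Left` case to an HTTP 413 is left to the caller.

```scala
import java.io.{InputStream, OutputStream}

object CappedCopy {
  // Copies `in` to `out` in 8 KiB chunks.
  // Returns Right(totalBytes) on success, or Left(bytesSeen) as soon as the
  // running total crosses `maxBytes`, without trusting any declared length.
  def copyWithCap(in: InputStream, out: OutputStream, maxBytes: Long): Either[Long, Long] = {
    val buffer = new Array[Byte](8 * 1024)
    var total = 0L
    var n = in.read(buffer)
    while (n != -1) {
      total += n
      // Stop before writing the chunk that crosses the cap.
      if (total > maxBytes) return Left(total)
      out.write(buffer, 0, n)
      n = in.read(buffer)
    }
    Right(total)
  }
}
```

A caller would typically pair this with a cheap pre-check on the declared Content-Length, since the counter alone only detects an oversized body after some of it has been read.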

@RolesAllowed

Per review on apache#5124 (xuang7, Ma77Ball): mark the resource with @RolesAllowed(Array("REGULAR", "ADMIN")) to document that all five endpoints require an authenticated user. The annotation isn't enforced yet — that's coming with the auth-enforcement PR @Yicong-Huang and @Ma77Ball are working on — but adding it now means no follow-up change is needed when enforcement lands, and it matches the convention used by UserConfigResource / AdminSettingsResource. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


github-actions Bot added the engine label May 17, 2026

github-actions Bot assigned PG1204 May 17, 2026

Ma77Ball suggested changes May 18, 2026

fix: address review feedback on HuggingFaceModelResource

935ccc1

github-actions Bot requested a review from Ma77Ball May 19, 2026 23:46

PG1204 mentioned this pull request May 20, 2026

Add HuggingFaceModelResource REST endpoints for HF operator UI #5134

Closed

6 tasks

Merge branch 'apache:main' into hf/01-backend-skeleton

089c3c4

Ma77Ball mentioned this pull request May 26, 2026

Add Hugging Face inference operator #5041

Open

Merge branch 'apache:main' into hf/01-backend-skeleton

2aa865c

Ma77Ball approved these changes May 27, 2026

xuang7 requested changes May 28, 2026

Comment thread amber/src/main/scala/org/apache/texera/web/resource/HuggingFaceModelResource.scala

Comment thread amber/src/main/scala/org/apache/texera/web/resource/HuggingFaceModelResource.scala

PG1204 and others added 2 commits May 28, 2026 07:01

Merge branch 'apache:main' into hf/01-backend-skeleton

0c30beb

chore: retrigger CI

6857e34

This was referenced May 28, 2026

Add HuggingFaceInferenceOpDesc with dispatcher + per-task codegen architecture (text-generation) #5277

Closed

feat(huggingFace): refactor operator into per-task codegen + text-generation #5278

Merged

PG1204 and others added 3 commits May 28, 2026 13:06

Merge branch 'apache:main' into hf/01-backend-skeleton

6f0f5fb

Merge branch 'main' into hf/01-backend-skeleton

fec6dfb

Merge branch 'apache:main' into hf/01-backend-skeleton

5e95bcd

xuang7 approved these changes May 29, 2026

Merge branch 'apache:main' into hf/01-backend-skeleton

4a52406

xuang7 added this pull request to the merge queue May 31, 2026

Merged via the queue into apache:main with commit 1b0ec78 May 31, 2026
14 checks passed

This was referenced Jun 3, 2026

Add image task family (ImageTaskCodegen) to HuggingFace operator #5319

Closed

feat(huggingFace): add image task family via ImageTaskCodegen #5320

Merged

Conversation

PG1204 commented May 17, 2026 (edited)

PG1204 commented May 17, 2026 (edited)

codecov-commenter commented May 17, 2026 (edited): Codecov Report

Yicong-Huang commented May 18, 2026

PG1204 commented May 18, 2026

Ma77Ball commented May 18, 2026

Ma77Ball left a comment

PG1204 commented May 20, 2026

Ma77Ball left a comment (edited)

Ma77Ball commented May 27, 2026

xuang7 left a comment (edited)

PG1204 commented May 28, 2026 (edited)

xuang7 left a comment

6 participants
