feat: nav2 sensor fix OTA demo + cross-demo script regressions #58
Draft
bburda wants to merge 60 commits into
Implements GatewayPlugin + UpdateProvider for the OTA demo. Polls a FastAPI catalog at boot and supports update / install / uninstall operations derived from SOVD ISO 17978-3 metadata.
Process model: SIGTERM old executable, swap files on disk, fork+exec new executable. No lifecycle commands. Operation kind is classified from updated_components / added_components / removed_components.
Components:
- OtaUpdatePlugin: list/get/register/delete/prepare/execute/supports_automated
- CatalogClient: cpp-httplib GET /catalog and artifact download, with parse_url
- OperationDispatcher: SOVD metadata -> Update/Install/Uninstall/Unknown
- ProcessRunner: pgrep via /proc, kill_by_executable with SIGTERM->SIGKILL fallback, fork+exec spawn
21 gtests pass (7 dispatcher, 6 parse_url, 8 plugin smoke).
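The classification step described above can be sketched in a few lines. This is an illustration only, not the plugin's C++ OperationDispatcher: the rule that exactly one populated list selects the kind, and the `classify` / `OperationKind` names, are assumptions made for the example.

```python
from enum import Enum


class OperationKind(Enum):
    UPDATE = "update"
    INSTALL = "install"
    UNINSTALL = "uninstall"
    UNKNOWN = "unknown"


def classify(metadata: dict) -> OperationKind:
    """Pick the operation kind from which SOVD component list is populated.

    Assumed rule (not confirmed by the PR): exactly one non-empty list
    selects the kind; none, or several at once, yields UNKNOWN.
    """
    updated = bool(metadata.get("updated_components"))
    added = bool(metadata.get("added_components"))
    removed = bool(metadata.get("removed_components"))

    populated = (updated, added, removed)
    if populated == (True, False, False):
        return OperationKind.UPDATE
    if populated == (False, True, False):
        return OperationKind.INSTALL
    if populated == (False, False, True):
        return OperationKind.UNINSTALL
    return OperationKind.UNKNOWN
```

Treating mixed or empty metadata as UNKNOWN keeps the dispatcher from guessing when a catalog entry is ambiguous.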
… configure, add -Wshadow -Wconversion
Adds optional --replaces-executable flag to pack_artifact.py and threads it into the catalog entry as x_medkit_replaces_executable when kind=update. This lets the gateway plugin kill the OLD executable (broken_lidar_node) before spawning the NEW one (fixed_lidar_node) when the two live in separate ROS 2 packages.
…process
When a SOVD update package swaps a node across ROS 2 packages (e.g. broken_lidar -> fixed_lidar), the OLD process binary basename differs from the new one. Read x_medkit_replaces_executable from the entry metadata before issuing the kill, falling back to x_medkit_executable when the field is absent (in-package upgrades).
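The fallback described above amounts to a two-key lookup. The sketch below shows it in Python for brevity; the real plugin is C++, and the function name `executable_to_kill` is hypothetical. Only the two metadata keys come from the commit message.

```python
def executable_to_kill(entry: dict) -> str:
    """Return the basename of the process to stop before spawning the new one.

    Cross-package updates name the old binary explicitly; in-package
    upgrades omit the field, so the old and new basenames are the same.
    """
    replaced = entry.get("x_medkit_replaces_executable")
    return replaced if replaced else entry["x_medkit_executable"]


# Cross-package swap: kill the old binary, which has a different name.
cross_package = {
    "x_medkit_executable": "fixed_lidar_node",
    "x_medkit_replaces_executable": "broken_lidar_node",
}
# In-package upgrade: the field is absent, so fall back to the new name.
in_package = {"x_medkit_executable": "fixed_lidar_node"}
```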
…ime libs in image
- ProcessRunner::pgrep now reads /proc/<pid>/cmdline argv[0] basename instead of /proc/<pid>/comm (which kernel truncates to 15 chars - 'broken_lidar_node' would never match).
- plugin_exports.cpp exports get_update_provider so the gateway's plugin_loader can resolve the UpdateProvider interface across the dlopen boundary without relying on dynamic_cast.
- Dockerfile.gateway: drop --symlink-install (broke multi-stage COPY) and add runtime libs (libcpp-httplib, libsystemd, nlohmann-json3, lifecycle, test_msgs).
- ota_update_server Dockerfile: bake artifacts/ into image (WSL2 + Docker Desktop bind mounts unreliable).
- Compose: gateway port configurable via OTA_GATEWAY_PORT (default 8080).
Verified via end-to-end smoke against the live stack:
- Plugin loads and reports as UpdateProvider
- Boot poll registers all 3 catalog entries
- Update flow kills broken_lidar_node and spawns fixed_lidar_node
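To see why matching on comm could never work for this node, the sketch below compares the two lookups on sample data. It is a Python illustration, not the plugin's C++ ProcessRunner; the helper names and the install path are hypothetical, and the 15-character limit is the kernel's TASK_COMM_LEN of 16 minus the terminating NUL.

```python
import os

TASK_COMM_MAX = 15  # visible characters in /proc/<pid>/comm


def comm_name(executable_path: str) -> str:
    """What /proc/<pid>/comm would show for a process started from this path."""
    return os.path.basename(executable_path)[:TASK_COMM_MAX]


def argv0_basename(cmdline: bytes) -> str:
    """Basename of argv[0] from the NUL-separated /proc/<pid>/cmdline contents."""
    argv0 = cmdline.split(b"\0", 1)[0]
    return os.path.basename(argv0.decode(errors="replace"))


# Hypothetical install path, used only for illustration.
path = "/opt/demo/lib/broken_lidar/broken_lidar_node"
sample_cmdline = path.encode() + b"\0--ros-args\0"

truncated = comm_name(path)             # 17-char name cut to 15
full = argv0_basename(sample_cmdline)   # full basename survives
```

Here `truncated` is `broken_lidar_no`, so an exact comparison against `broken_lidar_node` fails, while `full` matches.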
pack_artifact.py was emitting 'name' (not in SOVD ISO 17978-3 - spec uses 'update_name') and 'version' (not a SOVD field at all). Spec-compliant clients (ros2_medkit_web_ui, the Foxglove updates panel) expect update_name; vendor-specific data lives under x_medkit_*. Confirmed against the live demo gateway: the web UI happily renders the updated shape, all 3 catalog entries visible end-to-end.
…nst gateway
Verifies the canonical SOVD client flow that the Foxglove updates panel
mirrors: connect form, /api/v1/updates returns {items: [<id>]}, per-id
/status calls, all 3 catalog entries render in the dashboard.
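A minimal sketch of that client flow, assuming the per-id status endpoint lives at `/updates/<id>/status` under the same `/api/v1` base (the commit only says "per-id /status calls"). It parses a sample listing body instead of calling a live gateway, and `status_urls` is a hypothetical helper name.

```python
import json


def status_urls(base_url: str, listing_body: str) -> list[str]:
    """Turn a GET /updates response body into the per-id status URLs to poll."""
    listing = json.loads(listing_body)
    base = base_url.rstrip("/")
    return [f"{base}/updates/{update_id}/status" for update_id in listing["items"]]


# Sample body in the {items: [<id>]} shape; the ids are the three catalog
# entries named in the PR's test plan.
sample_listing = json.dumps(
    {"items": ["fixed_lidar_2_1_0", "obstacle_classifier_v2_1_0_0", "broken_lidar_legacy_remove"]}
)
urls = status_urls("http://localhost:8080/api/v1", sample_listing)
```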
Adopt the same script convention as sensor_diagnostics, multi_ecu_aggregation,
and turtlebot3_integration:
  ./run-demo.sh           build artifacts + bring up gateway + nodes + update
                          server (daemon mode by default, --attached for fg)
  ./stop-demo.sh          tear down (-v removes volumes, --images removes
                          built images)
  ./check-demo.sh         show registered updates + per-id status + live
                          plugin-managed processes inside the gateway
                          container
  ./trigger-update.sh     broken_lidar -> fixed_lidar (the headline)
  ./trigger-install.sh    install obstacle_classifier_v2 from scratch
  ./trigger-uninstall.sh  remove broken_lidar_legacy
OTA_GATEWAY_PORT (or OTA_GATEWAY_URL for full override) lets the user
sidestep collisions with another gateway on host port 8080.
README quickstart updated to point at run-demo.sh.
…ling ROS on runner The previous step installed ros-jazzy-ros-base + colcon on the runner, but that did not pull in python3-catkin-pkg, so the colcon build inside build_artifacts.sh tripped 'No module named catkin_pkg'. Easier and more faithful to how end-users build the demo: run build_artifacts.sh inside a ros:jazzy container with the demo dir bind-mounted, install only the build deps that the script actually needs, and chown the resulting catalog/tarballs back to the runner user.
Without foxglove_bridge there is nothing for Foxglove Studio to subscribe to - no /scan, no /tf, no 3D visual story. The Updates panel itself is just the SOVD HTTP client and works without it, but the broader demo narrative (phantom obstacle visible, robot stuck) needs the topic stream. Adds ros-jazzy-foxglove-bridge to the runtime stage of Dockerfile.gateway, launches it from entrypoint.sh on port 8765 (0.0.0.0), and maps the port through compose with OTA_FOXGLOVE_BRIDGE_PORT override. Verified live: 'Server listening on port 8765' and channels for /scan, /rosout, /fault_manager/events advertised at startup.
Without /fault_manager/* services running, the gateway's /faults endpoint hangs waiting for the service call (default 5s timeout) and the Faults Dashboard panel surfaces it as 503. Adds 'ros2 run ros2_medkit_fault_manager fault_manager_node' to entrypoint.sh - the gateway image already builds the package; we just need to run it. Defaults are fine for the demo (SQLite at /var/lib/ros2_medkit/faults.db, snapshot capture enabled).
The OTA story needs a robot Foxglove can render. Previously the demo
only spun up broken_lidar / broken_lidar_legacy and pushed nav2 to a
"BYO sim" footnote - so a viewer opening Foxglove just saw a /scan
topic with no robot in 3D space.
Now `docker compose up` produces a self-contained scene:
- ros-jazzy-turtlebot3-* + nav2-* + ros-gz-sim baked into the runtime
image
- ota_nav2_sensor_fix_demo package owns the launch + nav2_params +
map config; entrypoint hands off to `ros2 launch`
- spawn_turtlebot3 is wrapped in a SetRemap GroupAction that pushes
gz-bridge's /scan to /scan_sim, so broken_lidar (and later
fixed_lidar) is the sole publisher on /scan that nav2 + Foxglove
consume
- nav2 launched with use_composition:=False to dodge the apt-shipped
nav2_msgs / fastcdr 2.2.5 typesupport ABI mismatch on Jazzy
- RMW pinned to cyclonedds (same root cause - fastrtps typesupport
pulls a fastcdr symbol the runtime doesn't export)
- shm_size:2gb on the gateway service so gz-sim doesn't wedge on
/dev/shm exhaustion
Image grows by ~3GB; the trade-off is no external sim setup, the
Foxglove 3D panel renders the robot in turtlebot3_world out of the box,
and the OTA narrative ("phantom obstacle disappears after the swap")
becomes literally visible.
…m bundle
Both files predated the TB3+Nav2+headless-Gazebo integration:
- run-demo.sh header described only broken_lidar / broken_lidar_legacy; now mentions the full self-contained simulator + foxglove_bridge :8765
- usage section was missing OTA_FOXGLOVE_BRIDGE_PORT
- "Connect a UI" block now points at the Foxglove 3D panel narrative (TurtleBot3 + /scan cone) as the recommended path
- README quickstart called out a ~10 minute first-run build; bumped to ~15-20 minutes to reflect the ~3 GB TB3+Nav2+gz-sim runtime
- README port-override hint now lists both OTA_GATEWAY_PORT and OTA_FOXGLOVE_BRIDGE_PORT (the latter was used in code but undocumented)
… grep
The smoke test was missing two regression guards and had a SIGPIPE bug
that started biting now that the gateway image bakes in the nav2 stack
(~500 lines of lifecycle output before the gateway logs anything).
New checks:
- /scan SetRemap regression: assert /scan has exactly one publisher and
it is NOT ros_gz_bridge. The launch wraps spawn_turtlebot3 in a
SetRemap('/scan' -> '/scan_sim') so broken_lidar (and later
fixed_lidar) is the sole publisher on /scan; if that remap regresses
both publishers stomp each other and nav2 sees garbage scans.
- Uninstall flow: PUT /updates/broken_lidar_legacy_remove/prepare +
/execute, assert status is completed and the broken_lidar_legacy
process is gone. Closes the Update / Install / Uninstall trio - the
demo's whole SOVD ISO 17978-3 compliance story.
Bug fix:
- `docker logs $C | grep -q PATTERN` is unsafe under `set -o pipefail`.
When grep finds the match early it exits, SIGPIPEs `docker logs`, and
the pipeline returns 141 - which `if` reads as "no match" even when
the line is there. With the small pre-nav2 log this was lucky enough
to almost always pass; with the full nav2 lifecycle dump it flips to
consistent false-negative. Capture logs into a variable first.
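The mechanism behind the 141 status can be reproduced outside the smoke test. The sketch below is a stand-in, not the test script itself: `yes` plays the role of a chatty `docker logs`, `head -n 1` plays `grep -q` exiting on an early match, and it assumes a POSIX system with both tools installed.

```python
import signal
import subprocess

# Producer that never stops on its own, like `docker logs` on a busy container.
producer = subprocess.Popen(["yes", "gateway log line"], stdout=subprocess.PIPE)
# Consumer that exits after the first line, like `grep -q` on an early match.
consumer = subprocess.Popen(
    ["head", "-n", "1"], stdin=producer.stdout, stdout=subprocess.PIPE
)
# Close the parent's copy so the consumer holds the only read end of the pipe.
producer.stdout.close()

first_line, _ = consumer.communicate()
producer.wait()

# Python reports "killed by signal N" as -N; bash reports it as 128 + N.
killed_by_sigpipe = producer.returncode == -signal.SIGPIPE
bash_status = 128 + signal.SIGPIPE
```

The consumer succeeded, yet the producer's status is a failure; under `set -o pipefail` that failure becomes the status of the whole pipeline.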
…fix branch
Two demo polish items surfaced while preparing the recording:
1. discovery.runtime.create_functions_from_namespaces is now false. Our nav2 / TB3 nodes don't share a meaningful namespace - they all live at root with a few /global_costmap, /local_costmap exceptions - so the gateway was synthesizing single-host "global_costmap" / "local_costmap" / "root" Functions that don't represent any logical capability. Without a manifest those entries were noise; the tree now hides the Functions section entirely until something real exists to list.
2. Pin the gateway image to fix/component-logs-aggregation while the upstream PR lands. The per-component Logs tab in the Foxglove extension was always empty because the synthetic component prefix-match returned zero items for components without a manifest fqn. The branch makes COMPONENT log queries aggregate from hosted apps' fqns, parity with the existing AREA / FUNCTION handlers.
The previous config disabled function auto-generation but the result was zero Functions - which made the tree's Functions section empty rather than meaningful. Add a manifest defining areas, components, apps, and five logical functions:
- Autonomous Navigation (bt_navigator + planner + controller + smoother + behaviors + waypoint + velocity smoother + collision monitor + docking + costmaps + lifecycle manager)
- Localization (amcl + map server + lifecycle manager)
- Perception (scan_sensor_node + ros_gz_bridge - the OTA target)
- Fleet Diagnostics (gateway + fault manager)
- Live Telemetry (foxglove bridge + robot state publisher)
These are the capabilities the demo narrative pivots on: an operator viewing the tree sees "Perception is broken" or "Autonomous Navigation is degraded" rather than scrolling through 27 individual nodes.
Switches discovery to hybrid mode so manifest entities + runtime discovery cooperate. unmanifested_nodes: warn + manifest_strict_validation: false tolerate the OTA-driven runtime changes (broken_lidar_legacy disappears on uninstall, obstacle_classifier_v2 appears on install) without manifest reconciliation churn. create_synthetic_components and create_functions_from_namespaces both stay off - the manifest is the source of truth.
Round of work to make the OTA demo's fault story actually visible AND to lock the build/test path against regression.
Reactive SCAN_PHANTOM_RETURN fault:
- broken_lidar subscribes /cmd_vel (Twist + TwistStamped, since Nav2 Jazzy uses the stamped variant) and only reports the fault while the controller is actively driving. Empty dashboard at boot, lights up the moment the operator publishes /goal_pose, stays active through recovery behaviors. The phantom itself is now a 21-ray, 0.5 m wedge centered on index 180 - one ray at 1 m the planner just routes around because the costmap raytraces it away.
- fixed_lidar fires EVENT_PASSED at 500 ms (4x the broken_lidar tick) so the FaultManager debounce counter overtakes whatever broken_lidar accumulated, no matter how long the operator was driving before the OTA. Stops being load-bearing once the fault flips to HEALED.
- demo.launch.py sets fault_manager to in-memory storage (clean dashboard at every boot - SQLite was persisting last session's CONFIRMED entry) and turns on healing_enabled with threshold 2.
Reproducible artefact build:
- ota_update_server/Dockerfile is now multi-stage. Stage 1 (ros:jazzy) clones ros2_medkit at the same ref the gateway image pins, builds ros2_medkit_msgs, builds fixed_lidar + obstacle_classifier_v2 from the demo's ros2_packages/, and runs pack_artifact.py to produce tarballs + catalog.json. Stage 2 ships the slim FastAPI server + the tarballs via COPY --from. `docker compose build` is now the reproducible path - no "build artefacts on host first" step.
- CI: dropped the separate "Build artifacts inside ros:jazzy" step; `docker compose up -d --build` does it atomically.
- scripts/build_artifacts.sh stays as an opt-in dev convenience. It no longer silently bootstraps msgs - if the env doesn't have it on the prefix path it errors out with instructions to either source an overlay or use `docker compose build`. Less hidden state.
Demo narrative regression test:
- tests/smoke_test_demo_narrative.sh exercises the full beat: publish /goal_pose, assert SCAN_PHANTOM_RETURN appears (reactive proof - if the fault appears, broken_lidar's /cmd_vel subscription saw motion, so we don't need a brittle `ros2 topic echo` snapshot), trigger OTA prepare+execute, assert process flip, assert fault clears, assert /cmd_vel settles. CI gets a new ota-demo-narrative job that runs this in isolation from the API-only smoke check.
Areas were doing zero work in this manifest: every area held one or two
components, every component lived in exactly one area, and the operator
viewing the tree just saw the same things twice.
For a single-robot demo the meaningful boundaries are:
- Components = OTA / hardware units (lidar, nav2 stack, gateway, ...).
"What can I swap?"
- Apps = the ROS 2 nodes that live on a component.
- Functions = capability groupings (Autonomous Navigation, Perception)
that pull apps from across components.
SOVD allows Areas, but they only earn their slot when there's a real
zone partition - powertrain / body / chassis on a vehicle ECU mesh, or
multi-robot tenancy. Adding them here was decorative.
The Entity Browser Functions section already covers the cross-cutting
view (Perception spans LiDAR Sensor + TurtleBot3 Base components). With
areas gone the tree shows the operator three things, each with a job:
what to update, which node to look at, and which capability is at risk.
The previous manifest split apps across seven components - lidar-sensor,
robot-base, nav2-motion, nav2-localization, medkit-gateway-unit,
fault-manager-unit, foxglove-unit - but none of those subdivisions
correspond to anything you can actually OTA independently in this demo.
They all run on a single host, the only swap target is the LiDAR app
itself, and the seven-component tree just made the operator scroll past
artificial boundaries.
Replace with a single `turtlebot3` component that owns every app. The
hierarchy the operator now sees is:
- 1 Component: TurtleBot3 Robot (the OTA boundary).
- 22 Apps: the ROS 2 nodes inside it.
- 5 Functions: capability groupings (Autonomous Navigation /
Localization / Perception / Fleet Diagnostics /
Live Telemetry).
Functions remain the cross-cutting view - "is Perception working?"
pulls scan-sensor + ros-gz-bridge from the single component, no
component partition needed to answer that question. Multi-component
manifests still make sense in multi-host / multi-ECU scenarios; for
this single-robot demo we don't fake them.
broken_lidar's reactive-fault path was subscribing to both Twist and TwistStamped on /cmd_vel "to be safe". At runtime that worked, but Foxglove inspects every subscriber when listing topic schemas and emitted: "Multiple channels advertise the same topic /cmd_vel but the schema, schema name or encodings do not match". The mixed-types entry also confused ros2 topic info, which then reported the topic with two type strings. Nav2 Jazzy publishes TwistStamped (docking_server / collision_monitor / velocity_smoother all use the stamped variant). The demo doesn't ship any teleop or legacy publisher; there's no Twist source on this stack. Drop the Twist subscription, keep TwistStamped.
…gle ray
Two related issues that together broke the navigate_to_pose flow:
1. Spawn pose (-2.0, -0.5) was outside the global_costmap bounds. turtlebot3_world.yaml's origin is (-1.76, -2.42), so the robot sat to the west of where the costmap began. Every navigate_to_pose came back with error_code 203 (NO_VIABLE_PATH) and a planner log spam: "Sensor origin out of map bounds". Move spawn AND the AMCL initial_pose to (-1.5, -0.5), well inside the map.
2. The wide-wedge phantom (21 rays at 0.5 m, then 7 rays at 0.6 m) was too aggressive: the global planner rejected the goal up front, so /cmd_vel never spun and broken_lidar's reactive fault never fired even with a viable spawn pose. Revert to the single 1.0 m return on ray 180 - the controller engages, drives /cmd_vel, broken_lidar sees the motion and raises SCAN_PHANTOM_RETURN. The robot still completes the goal because the planner routes around a single ray; the demo's beat is "fault is visible while the robot drives" rather than "robot hangs forever".
AMCL only broadcast its map->odom TF after the robot drove update_min_d (0.25 m) or rotated update_min_a (0.2 rad). The demo's flow has the robot sitting still while the operator inspects the Foxglove dashboard - so the map frame never showed up in the Display Frame dropdown until they published a goal and the robot actually started moving. Foxglove just sat in 'Missing transform from frame <map>' for the whole pre-OTA narrative. Set both update_min_d and update_min_a to 0 so AMCL broadcasts on every laser scan. Also drops the duplicate keys that were sitting below the original block - they were silently overriding the values I'd just added.
turtlebot3_gazebo's robot_state_publisher.launch.py builds the frame_prefix parameter with PythonExpression(["'", frame_prefix, "/'"]), which appends a literal "/" even when the launch arg is empty. The result was a tf tree where AMCL / odometry broadcast frames as "base_link" (no slash) and robot_state_publisher published joint transforms as "/base_link" (with slash). Two disjoint subgraphs - so Foxglove rendered the URDF as a pile of disconnected meshes scattered around the origin instead of a robot model attached to base_footprint. Spawn robot_state_publisher directly from this launch with frame_prefix='' (empty, no PythonExpression mutilation), reading the turtlebot3 URDF off disk the same way the upstream launch does. Result: tf_static now reports "base_link", "base_footprint" without leading slashes, the URDF renders as one connected robot, and "map -> base_link" resolves end-to-end.
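The stray slash is easy to reproduce without a ROS install. The sketch below mimics what the quoted PythonExpression does (join the pieces into one Python expression, then evaluate it) using plain `eval`; it does not import `launch`, and the helper names are made up for the example.

```python
def upstream_frame_prefix(frame_prefix: str) -> str:
    """Mimic PythonExpression(["'", frame_prefix, "/'"]) from the upstream launch.

    The three pieces are concatenated into a single Python expression
    and evaluated. eval is used here purely to mirror that behaviour.
    """
    return eval("'" + frame_prefix + "/'")


def published_frame(frame_prefix_param: str, link: str) -> str:
    """robot_state_publisher prepends the frame_prefix parameter to each link name."""
    return frame_prefix_param + link


# Empty launch argument: the expression still yields "/", hence "/base_link".
broken = published_frame(upstream_frame_prefix(""), "base_link")
# The fix: pass frame_prefix='' straight through, with no expression in between.
fixed = published_frame("", "base_link")
```

With `broken` being `/base_link` and AMCL / odometry publishing `base_link`, the tf tree splits into the two disjoint subgraphs described above.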
ProcessRunner::spawn was running the new binary with execl(path, path, nullptr) - no --ros-args, no parameters. After an OTA execute the post-update node (fixed_lidar after the update flow, obstacle_classifier_v2 after install) was therefore born with use_sim_time=false while the rest of the demo runs on /clock from gz-sim. Result: every /scan / /diagnostics message stamped by the new node landed outside the gateway's TF cache, nav2's costmaps logged "Message Filter dropping message ... earlier than all the data in the transform cache" on every tick, the obstacle layer stayed empty, and NavigateToPose came back ABORTED with no progress - "robot stops responding" from the operator's seat. Pass --ros-args -p use_sim_time:=true to execl so the spawned node joins the same clock domain as the rest of the stack. Robot resumes following goals after the OTA swap. This is the minimum viable param plumbing for the demo. A full production plugin should plumb arbitrary parameters from the catalog entry through to execve - but for now use_sim_time is the only param the OTA-managed nodes care about.
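As a rough sketch of the parameter plumbing mentioned above, the helper below builds the argv that would be handed to an exec call. It is Python rather than the plugin's C++, `spawn_argv` is a hypothetical name, and passing a dict of parameters is the "full production" generalisation the commit only hints at; the demo itself hard-codes use_sim_time.

```python
def spawn_argv(executable_path: str, params: dict) -> list[str]:
    """Build argv for a ROS 2 node, appending --ros-args -p name:=value pairs.

    With no params this degenerates to [path], matching the old
    execl(path, path, nullptr) behaviour.
    """
    argv = [executable_path]
    if params:
        argv.append("--ros-args")
        for name, value in params.items():
            argv += ["-p", f"{name}:={value}"]
    return argv


# Hypothetical install path; only the use_sim_time parameter comes from the PR.
argv = spawn_argv("/opt/demo/bin/fixed_lidar_node", {"use_sim_time": "true"})
# A real spawner would now fork and call os.execv(argv[0], argv) in the child.
```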
…l/uninstall
Without this, OTA-installed apps (obstacle_classifier_v2 after the Install flow) showed up in the gateway runtime graph but never got attached to a manifest entity, so the Foxglove tree treated them as orphans and they never appeared under the turtlebot3 component or in any function's host list. The catalog said "this update adds the obstacle_classifier app", the runtime saw the new node, but the manifest tree stayed unchanged.
Wire the gateway's plugin-side manifest fragment contract:
1. Plugin reads `fragments_dir` from its config (matches the gateway's discovery.manifest.fragments_dir param). Empty = legacy "no fragments" behavior preserved.
2. set_context() now stores the PluginContext so post-execute we can call notify_entities_changed and have the gateway re-merge the base manifest with the fragments dir.
3. On Install: render a minimal manifest YAML for the new app (id from added_components[0], node_name from x_medkit_executable, located on the single turtlebot3 component), atomic-publish via tmp-rename per the ManifestManager fragment contract, then notify. Gateway picks the entity up; new app shows under /components/turtlebot3/hosts and any function that lists it.
4. On Uninstall: drop the fragment file (no-op if the entity was in the base manifest, like broken_lidar_legacy). Notify either way so the cache stops returning a now-dead app.
5. On Update: no fragment change (same app id, just different binary), but still notify so the entity cache rebuilds with the new pid.
Adds the dirs to Dockerfile and threads fragments_dir through both gateway_config.yaml (discovery.manifest.fragments_dir) and the plugin config block (plugins.ota_update_plugin.fragments_dir) so they stay in lockstep.
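The tmp-rename publish and the no-op removal from steps 3 and 4 could look roughly like this. It is a Python sketch of the general technique, not the plugin's C++ code: the function names, the dot-prefixed temp file name, and the fragment file name are assumptions, and the fragment body is left as opaque text because the manifest schema is not shown in this PR.

```python
import os
import tempfile


def publish_fragment(fragments_dir: str, file_name: str, text: str) -> str:
    """Write a fragment so readers see either the old file or the complete new one."""
    final_path = os.path.join(fragments_dir, file_name)
    # The temp file must live in the same directory: rename is only atomic
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=fragments_dir, prefix=".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, final_path)  # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def remove_fragment(fragments_dir: str, file_name: str) -> bool:
    """Drop a fragment; return False (a no-op) when no fragment file exists."""
    try:
        os.unlink(os.path.join(fragments_dir, file_name))
        return True
    except FileNotFoundError:
        return False
```

The caller would notify the gateway after either call, so the entity cache is rebuilt whether or not a file actually changed.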
The build-and-test-ota job already moved off this recipe in favor of docker compose's multi-stage build, but the newer ota-demo-narrative job still inherited the old script. With build_artifacts.sh now hard- failing when ros2_medkit_msgs isn't on the prefix path (the script is documented as opt-in dev convenience, not the reproducible path), CI hit "ros2_medkit_msgs not found on the prefix path" and exited 1. Drop the step. ota_update_server's Dockerfile multi-stage build clones ros2_medkit_msgs at the pinned ref, builds it, runs pack_artifact.py, and ships the resulting catalog + tarballs to stage 2 - all triggered by docker compose up --build. No host-side ROS prereqs required.
…y aggregation matches
Gateway's filter_faults_by_sources prefix-matches reporting_sources against
app.effective_fqn() ("/scan_sensor_node"), so a bare "scan_sensor_node" was
visible only via server-level /faults and missing from /components/turtlebot3
/faults and /apps/scan-sensor/faults - the endpoints the Foxglove panels use.
The official ros2_medkit_fault_reporter README documents get_fully_qualified_name()
as the convention, so align broken_lidar / fixed_lidar with it.
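A small sketch of why the bare name fell through the filter. It assumes the gateway's check is a plain string prefix match of each reporting source against the app's fully qualified name, as the message describes; the real filter_faults_by_sources is C++ and may differ in detail, and `fault_visible_for_app` is a hypothetical name.

```python
def fault_visible_for_app(reporting_sources: list[str], effective_fqn: str) -> bool:
    """True when any reporting source starts with the app's fully qualified name."""
    return any(source.startswith(effective_fqn) for source in reporting_sources)


app_fqn = "/scan_sensor_node"

# Bare node name: no leading slash, so the prefix match fails.
bare = fault_visible_for_app(["scan_sensor_node"], app_fqn)
# Fully qualified name, as returned by get_fully_qualified_name(): matches.
qualified = fault_visible_for_app(["/scan_sensor_node"], app_fqn)
```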
Three independent bugs in the nav2 sensor-fix demo that prevented end-to-end runs:
- Spawn pose: defaults (-2.0, -0.5) and (-1.5, -0.5) put the robot outside turtlebot3_world's map bounds, so the local_costmap reported "Sensor origin out of map bounds" and every navigate_to_pose returned NO_VIABLE_PATH 203. New default (0.5, -0.5) sits 0.71 m from each TB3 obstacle pillar, just outside the 0.55 m inflation halo, with full 1.5 m local_costmap clearance to every map edge. AMCL's initial_pose bumped to match so its particle filter converges on the actual spawn instead of drifting around the old origin.
- foxglove_bridge max_qos_depth: 25 (default) silently drops most /tf samples because the 15 nav2 publishers fan in to ~210 transforms per cycle. Bumping to 1000 keeps Foxglove's TF chain intact - costmaps align with the map, robot mesh tracks /scan, /amcl_pose stays current.
- Gateway upstream ref: the fix/component-logs-aggregation branch was merged and deleted upstream, so docker build failed at the git clone step. Switch both Dockerfiles to main; the equivalent commit is on main now anyway.
Goal-pose coordinates in README and the smoke test follow the new spawn pose so the published goal still produces a non-trivial path.
ros2_medkit removed the backwards-compat shim headers under include/ros2_medkit_gateway/plugins/, providers/, and updates/. Plugins that legitimately consume the public surface have to include the canonical paths under core/plugins/ and core/providers/ directly. The old shim paths now fail at preprocessing time:
  fatal error: ros2_medkit_gateway/plugins/gateway_plugin.hpp: No such file or directory
Switch every gateway include in the OTA plugin to its core/* equivalent:
  plugins/gateway_plugin.hpp -> core/plugins/gateway_plugin.hpp
  plugins/plugin_context.hpp -> core/plugins/plugin_context.hpp
  plugins/plugin_types.hpp -> core/plugins/plugin_types.hpp
  plugins/entity_change_scope.hpp -> core/plugins/entity_change_scope.hpp
  providers/update_provider.hpp -> core/providers/update_provider.hpp
  updates/update_provider.hpp -> core/providers/update_provider.hpp
  updates/update_types.hpp -> core/providers/update_types.hpp
ros_plugin_context.hpp stays under plugins/ - upstream kept it there because it is a ROS-aware subclass, not a shim.
…tion
With set -o pipefail, `printf '%s\n' "$LOGS" | grep -q PATTERN` returns 141 when grep -q exits early on a match and SIGPIPEs printf. The bash if-statement then reads that as "no match" even when the pattern was present in the captured log, and the gateway-plugin-loaded assertion fails on every run despite the log line being there. Capturing into a variable does not help if you re-pipe it.
Replace the two pipes with here-strings:
  grep -q PATTERN <<<"$GATEWAY_LOGS"
Here-strings feed grep via a temp fd, so there is no upstream process for SIGPIPE to land on.
The per-demo wrappers source lib/setup-trigger.sh and lib/watch-trigger.sh
from inside demos/<name>/. Both sourced files then re-source
triggers-api.sh with:
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
When sourced, $0 stays the caller's script path - the demo wrapper -
not the lib file's path. So SCRIPT_DIR resolves to demos/<name>/ and
the second source fails with:
demos/<name>/triggers-api.sh: No such file or directory
${BASH_SOURCE[0]} is the path of the file currently executing the line,
which is what we actually want.
…lisation The gateway normalises app IDs with hyphens - the entity is registered as e.g. diagnostic-bridge, manipulation-monitor, not diagnostic_bridge. The trigger wrappers were assuming the ROS-node-name spelling (underscore), which matches the reporting_sources path inside the FaultEvent payload but not the URL-addressable entity ID. setup-triggers returned HTTP 404 from /apps/<underscore_id>/triggers and watch-triggers found no active trigger to subscribe to. Switch every wrapper to the hyphenated form. For turtlebot3 also switch from anomaly-detector to diagnostic-bridge: faults on this stack arrive from /diagnostics through the bridge and bucket against the bridge's faults collection, so the anomaly-detector trigger would never see an event even with the correct ID.
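The mismatch between the two spellings can be shown in a couple of lines. This sketch assumes the gateway's normalisation is a plain underscore-to-hyphen replacement, which is all the commit message states; the real rule may do more, and both helper names are hypothetical.

```python
def entity_id(node_name: str) -> str:
    """URL-addressable entity ID for a ROS node name (assumed rule: '_' -> '-')."""
    return node_name.replace("_", "-")


def triggers_path(node_name: str) -> str:
    """Path the trigger wrappers should call for a given node."""
    return f"/apps/{entity_id(node_name)}/triggers"


# The underscore spelling is the ROS node name; the hyphenated one is the route.
path = triggers_path("diagnostic_bridge")
```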
Container scripts in moveit_pick_place and multi_ecu_aggregation start with "set -eu" and then source /opt/ros/jazzy/setup.bash. setup.bash dereferences AMENT_TRACE_SETUP_FILES without guarding for unset, so under nounset the source aborts before any payload runs: /opt/ros/jazzy/setup.bash: line 8: AMENT_TRACE_SETUP_FILES: unbound variable The Scripts API reports this as "Script exited with code 1" and the inject/restore never lands. Wrap the source pair with set +u / set -u so AMENT_* defaults can stay implicit while the rest of the script keeps strict variable checking.
…alization-failure The script reinitialises AMCL (scatters particles) and then asks bt-navigator to navigate. Under scattered particles the action server returns HTTP 400 with x-medkit-ros2-action-rejected - which is the demo's intended failure mode, the whole point of the injection. curl -sf turns that 400 into exit 22 and aborts the script before the final "Localization failure injected." print, and the Scripts API reports the script as failed even though the fault path it was supposed to trigger fired correctly. Drop -f on the navigate_to_pose call and print the response body (matching the pattern used in inject-nav-failure on the same demo).
Four host-side helper scripts lost their executable bit somewhere along the way:
  demos/sensor_diagnostics/inject-fault-scenario.sh
  demos/sensor_diagnostics/run-diagnostics.sh
  demos/moveit_pick_place/arm-self-test.sh
  demos/moveit_pick_place/planning-benchmark.sh
README and run-demo.sh both reference them as ./script.sh, so running the demos as documented hits "Permission denied". chmod +x to restore the bit; no content change.
Description
Originally landed the `demos/ota_nav2_sensor_fix/` end-to-end OTA demo (gateway plugin + FastAPI artifact server + 4 ROS 2 demo packages). End-to-end verification on this branch surfaced regressions in the other 5 demos that prevented them from running with their documented scripts; this PR now also fixes those.
OTA demo (original scope)
- `demos/ota_nav2_sensor_fix/` end-to-end OTA demo
- `ota_update_plugin` C++ gateway plugin (`UpdateProvider` + `GatewayPlugin`)
- Operation kind classified from SOVD metadata (`updated_components` / `added_components` / `removed_components`)
- `pack_artifact.py` CLI for building tarballs and catalog entries
- `docker-compose.yml` (gateway + update server); nav2 / Foxglove are bring-your-own (documented in README)
Out of scope (deliberate, dev-grade positioning)
Cross-demo regressions (added during end-to-end verification)
While running every demo with its documented scripts, three reproducible regression patterns surfaced. None are new to this PR - they accumulated upstream and went unnoticed because no single PR exercises every demo end-to-end. Bundling the fixes here so the demos folder is consistent again rather than spreading them across single-file PRs.
- `fix(ota_plugin): migrate includes to gateway core/* canonical paths` - upstream `ros2_medkit` removed the backwards-compat shim headers under `include/ros2_medkit_gateway/{plugins,providers,updates}/`; the OTA plugin now includes the canonical `core/plugins/*` and `core/providers/*` directly. Without this, `docker compose build` failed at colcon with `fatal error: ros2_medkit_gateway/plugins/gateway_plugin.hpp: No such file or directory`. PR-blocking.
- `fix(tests/ota): use here-string to avoid SIGPIPE in gateway log assertion` - `smoke_test_ota.sh` captured `docker logs` into a variable to dodge the `pipefail` race against `grep -q`, but then re-piped the variable, defeating the fix. The "Update backend provided by plugin" assertion failed even when the log line was present. Switched to `grep -q PATTERN <<<"$LOGS"`.
- `fix(demos/lib): resolve sourced lib path via BASH_SOURCE instead of $0` - `lib/setup-trigger.sh` and `lib/watch-trigger.sh` resolved `SCRIPT_DIR` from `$0`, which stays the caller's script path when the lib is sourced, so the re-source of `triggers-api.sh` always failed with `No such file or directory`.
- `fix(demos/triggers): use hyphenated entity IDs to match gateway normalisation` - gateway exposes entities as `diagnostic-bridge`, `manipulation-monitor` etc. The per-demo trigger wrappers were passing the ROS-node-name spelling (`diagnostic_bridge`), which matches the reporting source path but not the URL-addressable entity ID, so `setup-triggers` returned HTTP 404. For turtlebot3 also switched from `anomaly-detector` to `diagnostic-bridge` because faults bucket against the bridge on that stack.
- `fix(demos): relax nounset around ROS setup.bash sourcing` - container scripts in `moveit_pick_place` and `multi_ecu_aggregation` start with `set -eu`, then source `/opt/ros/jazzy/setup.bash` which dereferences `AMENT_TRACE_SETUP_FILES` without guarding for unset. Every inject/restore aborted with `AMENT_TRACE_SETUP_FILES: unbound variable` before the payload ran. Wrapped each source pair with `set +u` / `set -u`.
- `fix(demos/tb3): treat nav2 action rejection as expected in inject-localization-failure` - the script scatters AMCL particles then asks `bt-navigator` to navigate; under scattered particles the action server returns HTTP 400 (the demo's whole point), but `curl -sf` exit 22 aborted before the final print. Dropped `-f` to match `inject-nav-failure` on the same demo.
- `chore(demos): restore exec bit on injection/diagnostic scripts` - 4 host-side helper scripts in `sensor_diagnostics` and `moveit_pick_place` lost their executable bit; README and `run-demo.sh` reference them as `./script.sh`. `chmod +x`, no content change.
Test plan / verification
Unit & integration tests (OTA scope - all clean):
- `pytest -v` for `pack_artifact.py` (16 tests)
- `pytest -v` for `ota_update_server` (5 tests)
- `colcon test` for `ota_update_plugin` (24 GTest cases)
- `-Wall -Wextra -Wpedantic -Wshadow -Wconversion`
- `build_artifacts.sh` produces a 3-entry catalog + tarballs end-to-end
End-to-end smoke - OTA demo:
- Plugin loads and reports as `UpdateProvider` (gateway logs: "Update backend provided by plugin")
- Boot poll fetches `/catalog` and registers all 3 catalog entries
- `/updates/fixed_lidar_2_1_0/prepare && /execute` kills `broken_lidar_node` and spawns `fixed_lidar_node`
- `/updates/obstacle_classifier_v2_1_0_0/prepare && /execute` swaps files and spawns `obstacle_classifier_node`
- `/updates/broken_lidar_legacy_remove/prepare && /execute` returns `status: completed` and the legacy process is gone
- `tests/smoke_test_ota.sh` - 25/25 pass on a fresh stack
- `tests/smoke_test_demo_narrative.sh` - 8/8 pass on a fresh stack
- `trigger-update.sh`, `trigger-install.sh`, `trigger-uninstall.sh`, `check-demo.sh`, `stop-demo.sh` exercised end-to-end
End-to-end - other demos (after the regression fixes):
- `sensor_diagnostics` - run-demo, all 5 inject scripts (noise / failure / nan / drift / fault-scenario), restore-normal, run-diagnostics, setup-triggers + watch-triggers SSE event with inject-nan, check-demo, stop-demo
- `turtlebot3_integration` (`--headless`) - run-demo, check-entities, check-faults, inject-nav-failure, inject-localization-failure, nav-health-check, reset-navigation, restore-normal, setup-triggers + watch-triggers SSE, send-nav-goal, stop-demo
- `moveit_pick_place` (`--headless`) - run-demo, check-entities, check-faults, arm-self-test, inject-collision, inject-planning-failure, restore-normal, planning-benchmark, setup-triggers + watch-triggers SSE, stop-demo
- `multi_ecu_aggregation` - check-demo, inject-sensor-failure, inject-planning-delay, inject-gripper-jam, inject-cascade-failure, restore-normal, stop-demo (note: parallel `docker compose build` races when multiple services share `image: multi-ecu-demo:local`; workaround is to pre-build a single service)
- `mosaico_integration` - single-robot (`trigger-fault.sh`) and fleet (`trigger-fleet-faults.sh`) - bridge subscribes SSE, downloads `.mcap`, ingests into mosaicod over Apache Arrow Flight; 3/3 fleet sequences ingested
Notes
- `selfpatch/ros2_medkit` `main` for the gateway sources (clone happens at image build time)
- `pgrep` matches against `/proc/<pid>/cmdline` argv[0] basename (not `comm`, which the kernel truncates to 15 chars)
- `artifacts/` is baked into the `update_server` image at build time
- `ros2_medkit_foxglove_extension` Updates panel PR (fix(docker): add missing ros2_medkit components and submodules #6 of that repo)