Skip GCS upload when product content is unchanged #88
Merged
Every run wrote a new dated archive and overwrote latest.geojson even when the data was identical, duplicating datasets in GCS.

Dedup by content hash:
- _content_hash hashes the GeoJSON ignoring the volatile timeStamp, so an unchanged dataset hashes the same across runs.
- upload_product stores the hash on the dated/latest blob metadata; on the next run it compares against latest.geojson's stored hash and, on a match, skips writing entirely (dated_uri=None, skipped=True).
- combine asset surfaces skipped_unchanged in MaterializeResult metadata and omits dated_uri when skipped.

Verified: hash is stable across timeStamp, changes with features; skip path writes nothing, upload path sets metadata + copies to latest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
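For readers who want to see the idea concretely, here is a minimal sketch of a timeStamp-insensitive hash. It is not the PR's actual `_content_hash`; it assumes `timeStamp` is a top-level key of the collection, and the canonical JSON serialization (sorted keys, compact separators) is an illustrative choice rather than something the PR states.

```python
import hashlib
import json
from pathlib import Path


def content_hash(local_path: str) -> str:
    """SHA-256 of a GeoJSON file with the volatile "timeStamp" key dropped."""
    data = json.loads(Path(local_path).read_text(encoding="utf-8"))
    # Assumption: timeStamp lives at the top level of the collection object.
    data.pop("timeStamp", None)
    # Serialize canonically so key order and whitespace do not affect the digest.
    canonical = json.dumps(
        data, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two files that differ only in `timeStamp` produce the same digest under this sketch, while any change to `features` or to other collection fields changes it.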
Your pull request is automatically being deployed to Dagster Cloud.
Persist last_changed (the YYYY-MM-DD date on which the content actually changed) in the GCS blob metadata alongside content_hash, and report days_since_last_change on every run. On a skipped (unchanged) run the last_changed date is carried forward and the day count grows; on a real change it resets to 0.

Surfaced in the combine asset's MaterializeResult metadata (last_changed, days_since_last_change) so a product whose data has been static for, say, 60+ days is an obvious candidate to relax from a daily to a monthly schedule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
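The carry-forward and reset rule can be sketched as a small pure function. The name `track_last_changed` and its signature are hypothetical, chosen for illustration; the PR itself only states that `last_changed` is stored in blob metadata and that `days_since_last_change` is reported.

```python
from datetime import date
from typing import Optional, Tuple


def track_last_changed(
    stored_last_changed: Optional[str], unchanged: bool, today: date
) -> Tuple[str, int]:
    """Return (last_changed as YYYY-MM-DD, days_since_last_change)."""
    if unchanged and stored_last_changed:
        # Skipped run: carry the stored date forward so the day count grows.
        last_changed = date.fromisoformat(stored_last_changed)
    else:
        # Real change, or no stored date yet: reset to today.
        last_changed = today
    return last_changed.isoformat(), (today - last_changed).days
```

For example, `track_last_changed("2024-01-01", True, date(2024, 3, 1))` returns `("2024-01-01", 60)`, while the same call with `unchanged=False` returns `("2024-03-01", 0)`.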
What
Avoid duplicating datasets in GCS. Every run wrote a new dated archive (`{date}.geojson`) and overwrote `latest.geojson` even when the data hadn't changed. Now the upload is skipped when the content matches what's already there.

How
- `_content_hash(local_path)`: SHA-256 of the GeoJSON ignoring the volatile `timeStamp`, so an unchanged dataset hashes identically across runs (features and the rest of the collection are included).
- `upload_product` stores the hash in the blob metadata (`content_hash`) on the dated blob; `copy_blob` carries it to `latest.geojson`. On the next run it reads `latest`'s stored hash and, on a match, skips writing and returns `dated_uri=None, skipped=True`. Otherwise it uploads as before.
- The combine asset surfaces `skipped_unchanged` in `MaterializeResult` metadata and omits `dated_uri` when skipped.

Why hash-minus-timeStamp
The collection embeds `"timeStamp": <now>` every run, so a raw file/object hash would always differ. Hashing the content with `timeStamp` removed makes "unchanged data" detectable.

Verification
- Hash is stable across `timeStamp`; changes when `features` change.
- Skip path writes nothing (`upload_from_filename`/`copy_blob` not called), `skipped=True`, `dated_uri=None`.

Note
First run after deploy always uploads (no stored hash yet), seeding the metadata for subsequent dedup.
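The skip-or-upload decision, including the first-run behavior, can be sketched with an in-memory dict standing in for the GCS bucket. Everything here is hypothetical scaffolding (`UploadResult`, `upload_if_changed`, the key layout); the real `upload_product` goes through `upload_from_filename` and `copy_blob`, which this sketch deliberately does not call.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UploadResult:
    dated_uri: Optional[str]
    skipped: bool


def upload_if_changed(
    store: Dict[str, dict], prefix: str, run_date: str, new_hash: str
) -> UploadResult:
    """Write dated + latest entries unless latest already carries new_hash."""
    latest_key = f"{prefix}/latest.geojson"
    latest = store.get(latest_key)
    stored_hash = latest["metadata"].get("content_hash") if latest else None

    if stored_hash is not None and stored_hash == new_hash:
        # Unchanged content: write nothing at all.
        return UploadResult(dated_uri=None, skipped=True)

    # Changed content, or first run with no stored hash: write the dated
    # entry, then mirror it (metadata included) to latest.
    dated_key = f"{prefix}/{run_date}.geojson"
    store[dated_key] = {"metadata": {"content_hash": new_hash}}
    store[latest_key] = {"metadata": dict(store[dated_key]["metadata"])}
    return UploadResult(dated_uri=dated_key, skipped=False)
```

Starting from an empty `store`, the first call always uploads and seeds the hash; a second call with the same hash returns `skipped=True` and leaves `store` untouched.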
🤖 Generated with Claude Code