A Standard for Declaring Source Dataset Metadata, Trust Layers, and Retrieval Scope
Metadata
- rfc_id: RFC-003
- title: Provenance Mapping Specification
- status: Draft
- version: 0.1
- authors:
  - David W. Bynon (@TrustPublishing)
  - WebMEM Working Group
- date_created: 2025-07-15
- license: CC BY-SA 4.0
- domain_scope: General (Dataset-Centric)
- depends_on: RFC-001, RFC-002
Purpose
This specification defines the standard structure for declaring dataset-level provenance in a WebMEM Digest or any AI-ingestible, fragment-level memory object.
It supports:
- Trust scoring
- Dataset versioning
- Citable metadata
- Cross-format retrieval
- PROV-O compatibility
Core Object: ProvenanceBlock
dataset_id: CMS_PBP_2025
dataset_title: CMS Plan Benefit Package 2025
dataset_type: source # [source, derived, aggregate, inferred]
source_agency: CMS
license: Public domain
published: 2025-07-01
retrieved: 2025-07-06
dataset_home: https://cms.gov/pbp
dataset_archive: https://cms.gov/pbp/2025/files.zip
trust_layer: primary # [primary, secondary, inferred]
trust_scope: semantic-digest # e.g. fragment, plan, claim, model-training
confidence: 1.0
data_format: tsv
schema_format: internal-cms # or 'semantic-digest-v0.1'
fields_covered:
  - in_premium
  - moop
  - in_specialist
  - in_primary
  - in_mc_dent_preventive
notes: >
  This dataset defines cost and benefit data for all MA plans for 2025.
Field Descriptions
| Field | Description |
|---|---|
| dataset_id | Globally unique dataset token (e.g. CMS_PBP_2025) |
| dataset_title | Human-readable name of the dataset |
| dataset_type | Dataset classification: source, derived, aggregate, inferred |
| source_agency | Publishing agency or data originator |
| license | Usage license (e.g., Public Domain, CC BY) |
| published | Date dataset was officially published |
| retrieved | Date dataset was accessed for publishing |
| dataset_home | Landing page or documentation URL |
| dataset_archive | Direct archive or raw file download URL |
| trust_layer | Declared trust tier (primary, secondary, inferred) |
| trust_scope | Retrieval-level scope this dataset substantiates (e.g., fragment, plan, claim, model-training) |
| confidence | Optional numeric confidence score (0.0–1.0) |
| data_format | Original format (e.g., csv, json, tsv) |
| schema_format | Schema the data follows: a publisher-internal schema (e.g., internal-cms) or an RFC-based one (e.g., semantic-digest-v0.1) |
| fields_covered | Array of data_id tokens supported by this dataset |
| notes | Additional context, qualifiers, or limitations |
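
The field table above can be checked mechanically. The following is a minimal sketch of such a check, not part of the specification: the function name `validate_provenance_block` and the `ASSUMED_REQUIRED` set are hypothetical, since this RFC does not state which fields are mandatory. Only the enumerations and the 0.0–1.0 confidence range come from the table.

```python
from datetime import date

# Enumerations taken from the field table.
DATASET_TYPES = {"source", "derived", "aggregate", "inferred"}
TRUST_LAYERS = {"primary", "secondary", "inferred"}

# Assumption: the spec does not say which fields are mandatory.
# This minimal set is chosen only for illustration.
ASSUMED_REQUIRED = ("dataset_id", "dataset_type", "trust_layer")
DATE_FIELDS = ("published", "retrieved")


def validate_provenance_block(block: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in ASSUMED_REQUIRED:
        if key not in block:
            problems.append(f"missing field: {key}")

    dataset_type = block.get("dataset_type")
    if dataset_type is not None and dataset_type not in DATASET_TYPES:
        problems.append(f"unknown dataset_type: {dataset_type!r}")

    trust_layer = block.get("trust_layer")
    if trust_layer is not None and trust_layer not in TRUST_LAYERS:
        problems.append(f"unknown trust_layer: {trust_layer!r}")

    # confidence is optional; when present it must be a number in [0.0, 1.0].
    confidence = block.get("confidence")
    if confidence is not None:
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if not is_number or not 0.0 <= confidence <= 1.0:
            problems.append("confidence must be a number in 0.0-1.0")

    # YAML parsers often return date objects; plain strings must parse as ISO dates.
    for key in DATE_FIELDS:
        value = block.get(key)
        if value is None or isinstance(value, date):
            continue
        try:
            date.fromisoformat(str(value))
        except ValueError:
            problems.append(f"{key} is not an ISO 8601 date: {value!r}")
    return problems
```

A caller would pass the parsed ProvenanceBlock as a dictionary and treat a non-empty result as a reason to reject or flag the block. `trust_scope` is deliberately left unchecked because the example lists its values as open-ended.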
Why This Matters
A single ProvenanceBlock enables:
- Trust scoring by AI agents
- Source traceability for derived claims
- Explainable citation behavior in agentic systems
- W3C PROV-O compatibility for structured trust lineage
Format Use Cases
| Use Case | Format |
|---|---|
| Digest metadata | YAML |
| Semantic web ingestion | Turtle (.ttl) |
| Trust trace graphs | PROV-O |
| Retrieval conditioning | JSON-LD |
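
To illustrate how one block can move between the formats in the table above, here is a minimal sketch that converts a ProvenanceBlock dictionary into a JSON-LD document typed as a PROV-O entity. The property mapping is an assumption, not something this RFC defines: `dataset_title` to `dcterms:title`, `source_agency` to `prov:wasAttributedTo`, and `published` to `dcterms:issued`. The function name `to_prov_jsonld` and the `https://example.org/datasets/` base IRI are hypothetical.

```python
import json

# Standard namespace IRIs for PROV-O, Dublin Core Terms, RDFS, and XSD.
CONTEXT = {
    "prov": "http://www.w3.org/ns/prov#",
    "dcterms": "http://purl.org/dc/terms/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def to_prov_jsonld(block: dict, base_iri: str = "https://example.org/datasets/") -> dict:
    """Map a ProvenanceBlock to JSON-LD (assumed mapping, for illustration only)."""
    doc = {
        "@context": CONTEXT,
        "@id": base_iri + block["dataset_id"],  # base IRI is a placeholder
        "@type": "prov:Entity",
    }
    if "dataset_title" in block:
        doc["dcterms:title"] = block["dataset_title"]
    if "source_agency" in block:
        doc["prov:wasAttributedTo"] = {
            "@type": "prov:Agent",
            "rdfs:label": block["source_agency"],
        }
    if "published" in block:
        doc["dcterms:issued"] = {"@value": str(block["published"]), "@type": "xsd:date"}
    return doc


example = {
    "dataset_id": "CMS_PBP_2025",
    "dataset_title": "CMS Plan Benefit Package 2025",
    "source_agency": "CMS",
    "published": "2025-07-01",
}
print(json.dumps(to_prov_jsonld(example), indent=2))
```

Fields without an obvious PROV-O or Dublin Core counterpart, such as `trust_layer` and `fields_covered`, are omitted here. Carrying them over would need a WebMEM-specific vocabulary, which this sketch does not assume.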
Example: Digest Inclusion
provenance:
  - dataset_id: CMS_Landscape_2025
    dataset_title: CMS MA Landscape 2025
    ...
  - dataset_id: CMS_PBP_2025
    trust_layer: primary
    fields_covered:
      - in_premium
      - moop
      - in_specialist
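When a digest lists several datasets, a consumer can use `fields_covered` to trace a field back to its source. The sketch below shows one way to do that lookup. The helper name `datasets_covering` is hypothetical, and the digest is assumed to be already parsed into Python dictionaries. The CMS_Landscape_2025 entry carries only the keys shown in the example, because its remaining fields are elided there.

```python
def datasets_covering(provenance: list, field: str) -> list:
    """Return the dataset_id of every provenance entry whose fields_covered lists `field`."""
    return [
        entry["dataset_id"]
        for entry in provenance
        if field in entry.get("fields_covered", [])
    ]


# Parsed form of the digest example; elided keys are left out rather than guessed.
provenance = [
    {"dataset_id": "CMS_Landscape_2025", "dataset_title": "CMS MA Landscape 2025"},
    {
        "dataset_id": "CMS_PBP_2025",
        "trust_layer": "primary",
        "fields_covered": ["in_premium", "moop", "in_specialist"],
    },
]

print(datasets_covering(provenance, "moop"))        # ['CMS_PBP_2025']
print(datasets_covering(provenance, "in_primary"))  # []
```

An empty result means no declared dataset substantiates the field, which an agent could treat as a reason to lower trust in any claim built on it.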
Suggested Dataset Registry
For large-scale verticals (e.g., healthcare, energy, education), a public dataset registry is encouraged. Each entry should include a descriptor file in .yaml or .json format conforming to RFC-003.
Recommended structure:
/datasets/
├── CMS_PBP_2025.yaml
├── HHS_Eligibility_2024.yaml
└── MedicareEnrollmentRates_2023.json
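A registry laid out this way can be indexed with a few lines of code. The following is a minimal sketch under two assumptions that this RFC does not state: that each file name (minus extension) equals the entry's `dataset_id`, as the tree above suggests, and that only `.json` descriptors are read, since parsing `.yaml` requires a third-party parser. The function name `load_registry` is hypothetical.

```python
import json
from pathlib import Path


def load_registry(root: str) -> dict:
    """Index the .json descriptors under `root` by dataset_id."""
    registry = {}
    for path in sorted(Path(root).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            entry = json.load(handle)
        dataset_id = entry.get("dataset_id")
        # Assumed convention: the file stem matches the declared dataset_id.
        if dataset_id != path.stem:
            raise ValueError(
                f"{path.name}: dataset_id {dataset_id!r} does not match file name"
            )
        registry[dataset_id] = entry
    return registry
```

Calling `load_registry("datasets")` on the tree above would return one entry, keyed `MedicareEnrollmentRates_2023`, and skip the two `.yaml` files.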
Canonical Reference
RFC-003 is maintained at webmem.com/rfc/rfc-003/ and versioned in the WebMEM RFC Registry.