XooCode(){

Dataset

Dataset describes a structured collection of data: research datasets, government statistics, scientific measurements, public records. Google indexes Dataset markup for Google Dataset Search, a separate search engine for researchers. It also surfaces datasets in regular Google Search results when the query looks data-related. If you publish downloadable data, Dataset markup is how you get found.

The most important pattern here is distribution, which describes how the dataset can be accessed. Each distribution is a DataDownload with a URL, file format, and size. A single dataset can have multiple distributions (CSV, JSON, Parquet) so consumers can pick the format they need.

Full example of schema.org/Dataset JSON-LD markup

The markup below passes Google's Rich Results Test.

Of the properties shown, Google requires only name and description; the rest are recommended or optional.
schema.org/Dataset
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@id": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955#dataset",
  "@type": "Dataset",
  "name": "Pediatric Patient Outcomes, Denmark 1945-1955",
  "alternateName": "Xoo Pediatric Framework Trial Data",
  "description": "Anonymized patient outcome records from the Jane Xoo pediatric framework trial across three Danish hospitals. Covers 2,847 patients aged 0-14, tracking nutritional recovery, growth milestones, and infection rates under the triage protocol introduced in the 1945 clinical framework paper.",
  "url": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955",
  "identifier": "https://doi.org/10.5281/zenodo.xoo-pediatric-1945",
  "creator": {
    "@id": "https://janexoo.com#person"
  },
  "publisher": {
    "@type": "Organization",
    "@id": "https://roydanmedjournal.dk#publication",
    "name": "Royal Danish Medical Journal"
  },
  "datePublished": "1956-03-01",
  "dateModified": "2023-08-15",
  "version": "2.1",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "isAccessibleForFree": true,
  "inLanguage": "en",
  "temporalCoverage": "1945/1955",
  "spatialCoverage": {
    "@type": "Place",
    "name": "Denmark",
    "geo": {
      "@type": "GeoShape",
      "box": "54.56 8.08 57.75 15.20"
    }
  },
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "patient_age_months",
      "description": "Patient age at admission in months",
      "unitText": "months"
    },
    {
      "@type": "PropertyValue",
      "name": "weight_kg",
      "description": "Patient weight at each measurement point",
      "unitText": "kg"
    },
    {
      "@type": "PropertyValue",
      "name": "recovery_grade",
      "description": "Clinical recovery grade on the Xoo scale (1-5)"
    },
    "infection_count",
    "triage_category",
    "hospital_id",
    "follow_up_months"
  ],
  "measurementTechnique": "Clinical observation with standardized Xoo triage protocol",
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955.csv",
      "encodingFormat": "text/csv",
      "contentSize": "4.2 MB"
    },
    {
      "@type": "DataDownload",
      "contentUrl": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955.json",
      "encodingFormat": "application/json",
      "contentSize": "5.8 MB"
    }
  ],
  "citation": {
    "@id": "https://roydanmedjournal.dk/archive/1945/pediatric-care-post-war-denmark#article"
  },
  "isBasedOn": {
    "@id": "https://roydanmedjournal.dk/books/children-first#book"
  }
}
</script>

distribution and DataDownload

Each distribution entry is a DataDownload with contentUrl (the download link), encodingFormat (a MIME type such as text/csv or application/json), and optionally contentSize. Google Dataset Search reads these to show download buttons with format labels. If the dataset is served by an API rather than as a file download, note that schema.org defines no accessUrl property (accessURL is a DCAT term). One option is to point contentUrl at the endpoint that returns the data and set encodingFormat to the API response format.
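The distribution entries can also be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named data_download and a small hand-maintained extension-to-MIME table; neither is part of schema.org or any Google tooling.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup table for this sketch; extend it for your own formats.
MIME_BY_EXTENSION = {
    ".csv": "text/csv",
    ".json": "application/json",
}


def data_download(content_url, content_size=None):
    """Return a DataDownload dict for one downloadable file."""
    extension = PurePosixPath(urlparse(content_url).path).suffix.lower()
    if extension not in MIME_BY_EXTENSION:
        raise ValueError(f"no MIME type known for extension {extension!r}")
    entry = {
        "@type": "DataDownload",
        "contentUrl": content_url,
        "encodingFormat": MIME_BY_EXTENSION[extension],
    }
    # contentSize is optional, so only emit it when a value was supplied.
    if content_size is not None:
        entry["contentSize"] = content_size
    return entry


base = "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955"
distribution = [
    data_download(f"{base}.csv", "4.2 MB"),
    data_download(f"{base}.json", "5.8 MB"),
]
```

Deriving encodingFormat from a fixed table keeps a bare extension such as "csv" from slipping into the markup, and an unknown extension fails loudly instead of producing a guess.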

temporalCoverage and spatialCoverage

temporalCoverage specifies the time range of the data using ISO 8601 interval notation: 1945/1955 means the dataset covers 1945 through 1955. For ongoing datasets, use an open-ended interval like 2020-01-01/.. (two dots mean "continuing"). spatialCoverage can be a Place with a name (like "Denmark") or a GeoShape with coordinates for precise geographic bounds.
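The three shapes of temporalCoverage (single date, closed interval, open-ended interval) can be checked before publishing. This is a rough sketch, assuming a hypothetical function parse_temporal_coverage that accepts only a small subset of ISO 8601 (YYYY, YYYY-MM, YYYY-MM-DD); it is not a full ISO 8601 parser.

```python
import re

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD: a deliberately small subset of ISO 8601.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"
_COVERAGE = re.compile(rf"^({_DATE})(?:/({_DATE}|\.\.))?$")


def parse_temporal_coverage(value):
    """Return (start, end).

    end is None for an open-ended interval ("start/..") and equals
    start when the value is a single date.
    """
    match = _COVERAGE.match(value)
    if match is None:
        raise ValueError(f"unsupported temporalCoverage value: {value!r}")
    start, end = match.group(1), match.group(2)
    if end is None:
        return start, start
    if end == "..":
        return start, None
    # Compare years only; mixed precisions make a finer check ambiguous.
    if int(end[:4]) < int(start[:4]):
        raise ValueError(f"interval ends before it starts: {value!r}")
    return start, end


closed = parse_temporal_coverage("1945/1955")
ongoing = parse_temporal_coverage("2020-01-01/..")
```

A value written with a hyphen instead of a slash, such as "1945-1955", is rejected by this sketch, which catches one common typo.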

variableMeasured

variableMeasured lists the variables or columns in the dataset. Each can be a simple string or a PropertyValue object with a name, description, and unitText. Search engines can use these to match a dataset against the variables a researcher is looking for, so include them whenever you can.
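When the dataset ships as CSV, the variableMeasured list can be derived from the header row. Below is a minimal sketch, assuming a hypothetical helper variables_from_csv and a hand-written UNITS table; columns with a known unit become PropertyValue objects and the rest stay plain strings, mirroring the mix used in the full example.

```python
import csv
import io

# Hypothetical unit table for this sketch; only columns listed here get a unitText.
UNITS = {"patient_age_months": "months", "weight_kg": "kg"}


def variables_from_csv(csv_file, units=UNITS):
    """Build a variableMeasured list from the header row of an open CSV file."""
    header = next(csv.reader(csv_file))
    variables = []
    for column in header:
        if column in units:
            variables.append(
                {"@type": "PropertyValue", "name": column, "unitText": units[column]}
            )
        else:
            variables.append(column)
    return variables


# Header-only stand-in for the real file, used purely for illustration.
sample = io.StringIO("patient_age_months,weight_kg,recovery_grade,hospital_id\n")
variable_measured = variables_from_csv(sample)
```

Descriptions still need to be written by hand; a header row carries names, not meaning.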

license and isAccessibleForFree

license takes a URL pointing to the license text (Creative Commons, Open Data Commons, etc.). Google Dataset Search offers a usage-rights filter, and researchers often check the license before anything else. isAccessibleForFree signals whether the dataset can be downloaded without payment.

Minimal valid version

The smallest markup that still produces a valid Dataset entity. Use it as the floor. Reach for the full example above when you want search engines and AI agents to understand more about your content.

schema.org/Dataset (minimal)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Pediatric Patient Outcomes, Denmark 1945-1955",
  "description": "Anonymized patient outcome records from the Jane Xoo pediatric framework trial across three Danish hospitals.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955.csv",
    "encodingFormat": "text/csv"
  }
}
</script>
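If the page is generated from code, the same minimal markup can be produced from a dictionary. This is a sketch only, assuming a hypothetical render_jsonld helper; it uses nothing beyond the standard json module.

```python
import json


def render_jsonld(entity):
    """Serialize a schema.org entity as a JSON-LD script element."""
    payload = json.dumps(entity, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so parsers read the same value.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


minimal_dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Pediatric Patient Outcomes, Denmark 1945-1955",
    "description": (
        "Anonymized patient outcome records from the Jane Xoo pediatric "
        "framework trial across three Danish hospitals."
    ),
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://roydanmedjournal.dk/data/pediatric-outcomes-1945-1955.csv",
        "encodingFormat": "text/csv",
    },
}

markup = render_jsonld(minimal_dataset)
```

Building the markup from a dictionary means quoting and commas are handled by the serializer, which removes a whole class of hand-editing errors.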

Google rich results this unlocks

Markup matching this example makes your page eligible for inclusion in Google Dataset Search and for dataset results in regular Google Search. Google's requirements for that result determine which properties count as required or recommended in the full example above.

Common Dataset mistakes

These mistakes pass validation but silently fail to earn rich results, or mislead consumers walking the graph. Avoid them and your markup will be ahead of most sites in the wild.

  1. Missing distribution entirely

    Wrong
    Dataset with description and license but no distribution array
    Right
    "distribution": [{ "@type": "DataDownload", "contentUrl": "https://...", "encodingFormat": "text/csv" }]

    A dataset without a distribution has no download link. Google Dataset Search can index it, but researchers cannot access the actual data. Always include at least one DataDownload with a contentUrl and encodingFormat.

  2. encodingFormat as a file extension

    Wrong
    "encodingFormat": "csv"
    Right
    "encodingFormat": "text/csv"

    encodingFormat expects a MIME type, not a file extension. Common values: text/csv, application/json, application/vnd.ms-excel, application/vnd.apache.parquet. File extensions are ambiguous (does "json" mean application/json or application/ld+json?) and may be ignored.

  3. temporalCoverage as a single date

    Wrong
    "temporalCoverage": "1945"
    Right
    "temporalCoverage": "1945/1955"

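    temporalCoverage uses ISO 8601 notation, with a slash separating the start and end of an interval. A single value such as "1945" is accepted, but it says the dataset covers only that year, which is wrong for data that runs through 1955. For ongoing data, use an open-ended interval: "2020-01-01/.." (two dots). If the data really does cover a single year, "1945" on its own is enough.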

  4. Missing license

    Wrong
    (Dataset with no license property)
    Right
    "license": "https://creativecommons.org/licenses/by/4.0/"

    Google Dataset Search lets researchers filter by usage rights. A dataset without a license is legally ambiguous, and many researchers will skip it. Always specify a license URL, even if the license is a restrictive one.

  5. variableMeasured omitted

    Wrong
    (Dataset with only name, description, and distribution)
    Right
    "variableMeasured": [{ "@type": "PropertyValue", "name": "weight_kg", "unitText": "kg" }, "patient_age_months"]

    variableMeasured lists the columns or fields in your dataset. Search engines can use these to match your dataset against the variables researchers need. Without them, your dataset is discoverable mainly through keywords in its name and description.
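The five checks above are mechanical enough to automate. The following is a rough sketch, assuming a hypothetical lint_dataset function that takes the parsed JSON-LD as a dictionary; the MIME pattern is a loose approximation and the warning texts are invented for this example.

```python
import re

# Loose "type/subtype" check; it does not verify that the type is registered.
MIME_TYPE = re.compile(
    r"^[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*$", re.IGNORECASE
)


def lint_dataset(dataset):
    """Return a list of warnings for the five mistakes described above."""
    warnings = []

    # distribution may be a single object or a list; normalize to a list.
    distributions = dataset.get("distribution", [])
    if isinstance(distributions, dict):
        distributions = [distributions]
    if not distributions:
        warnings.append("distribution is missing")
    for entry in distributions:
        fmt = entry.get("encodingFormat", "")
        if not MIME_TYPE.match(fmt):
            warnings.append(f"encodingFormat is not a MIME type: {fmt!r}")

    coverage = dataset.get("temporalCoverage")
    if coverage is not None and "/" not in coverage:
        warnings.append("temporalCoverage is a single date; check the data does not span a range")

    if "license" not in dataset:
        warnings.append("license is missing")
    if "variableMeasured" not in dataset:
        warnings.append("variableMeasured is missing")
    return warnings


# Deliberately flawed input, using a placeholder example.org URL.
broken = {
    "@type": "Dataset",
    "name": "Pediatric Patient Outcomes, Denmark 1945-1955",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data.csv",
        "encodingFormat": "csv",
    },
    "temporalCoverage": "1945",
}
problems = lint_dataset(broken)
```

Running a check like this in a build step flags the problems that a schema validator accepts without complaint.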

About the example data

The dataset contains anonymized patient outcome records from Jane Xoo's pediatric framework trial across three Danish hospitals, 1945 through 1955. It is the underlying data behind her 1945 paper and 1948 textbook. The creator references Jane Xoo via @id, and the publisher is the Royal Danish Medical Journal. Two distributions are provided: CSV for spreadsheet analysis and JSON for programmatic access.
