You ran an OPTIMADE query in January and retrieved 847 perovskite structures. In July, the same filter returns 891. The band gaps on 23 entries have changed. Three entries from your training set no longer appear in default results.

None of that was announced. There is no changelog entry for it, no notification, no version bump you could have subscribed to. OPTIMADE (the open REST API standard for querying computational materials databases) providers update their databases continuously: DFT recalculations, pipeline bug fixes, new high-throughput runs, deprecated structures. None of these changes are surfaced through the standard query interface.

The problem is not documentation discipline. The tools that mediate database access do not capture what changed. Without immutable dataset snapshots, computationally sourced research is unreproducible by default. Not by researcher choice, but because the default toolchain does not produce that record.

What Actually Changes in OPTIMADE Databases

Entry Revisions and Property Corrections

Materials Project (MP), AFLOW, OQMD, and JARVIS-DFT all revise stored values without per-entry changelogs accessible through standard API queries. When MP migrated from GGA+U to r2SCAN (documented in Munro et al., Phys. Rev. Materials, 2020), band gaps and formation energies changed across hundreds of thousands of entries. A bug fix in a post-processing pipeline can retroactively correct values across an entire property type. The updated value is the only version the API returns. There is no record of what it was before.

MP releases periodic numbered database snapshots (e.g., v2021.11), the most transparent versioning practice among the major providers. But the mp-api Python client queries the live database by default. Unless you explicitly pin to a release and archive the data, you are querying a moving target.
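The minimum defense against a moving target is to record retrieval context at fetch time. The sketch below shows one way to do that from an OPTIMADE response: the `data_returned` and `api_version` fields come from the OPTIMADE specification's response `meta` block, but `fetch_record` is a hypothetical helper (not part of mp-api or optimade-python-tools), and the sample `meta` dict is illustrative rather than a real response.

```python
from datetime import datetime, timezone

def fetch_record(base_url, filter_string, meta):
    """Build a provenance record from an OPTIMADE response's `meta` block.

    `data_returned` and `api_version` are standard OPTIMADE meta fields;
    the timestamp is captured at fetch time, not reconstructed later.
    """
    return {
        "provider": base_url,
        "filter": filter_string,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "entry_count": meta.get("data_returned"),
        "api_version": meta.get("api_version"),
    }

# Illustrative meta block, shaped like the spec's response `meta`
meta = {"data_returned": 1247, "api_version": "1.2.0"}
record = fetch_record(
    "https://optimade.materialsproject.org",
    'elements HAS "Li" AND nelements=2',
    meta,
)
```

Archiving this record alongside the result payload turns "we queried the live database" into something a reviewer can actually check against a later state.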

Added and Removed Structures

New high-throughput runs add entries continuously. The same filter string returns more rows over time as coverage expands. More critically, entries flagged as erroneous or structurally invalid get deprecated or excluded from default queries. MP exposes a deprecated boolean field on entries; AFLOW and OQMD remove entries with considerably less visibility into what was removed and why. A training set built on 10,000 entries today may reference entry IDs that return nothing six months from now.

Schema and Property Definition Changes

The OPTIMADE specification is versioned, from v0.9 through v1.0 (Andersen et al., Scientific Data, 2021) to v1.2 and beyond. Each version can add properties, change definitions, or alter filter syntax. Individual providers also maintain their own extensions (prefixed fields like _mp_ or _aflow_) which can change at any time outside the core spec.

OPTIMADE databases are live systems. Reproducibility requires treating a query result as a snapshot, not a permanent reference.

What Does a Reproducible Dataset Citation Actually Require?

The standard citation practice reflects what the tooling makes available: a database name, sometimes an access date. That is not enough to reproduce the retrieval. A reproducible dataset citation needs five elements that no OPTIMADE client captures automatically at query time:

  • Provider name and base URL: e.g., https://optimade.materialsproject.org

  • Query filter string: the exact OPTIMADE filter applied

  • Fetch timestamp: ISO 8601 format, to the day at minimum

  • Entry count: total results returned at fetch time

  • Dataset version or snapshot ID: a stable identifier resolving to the frozen result set

Each element matters. Provider alone does not tell you which entries were returned. Query alone does not tell you what those entries contained at fetch time. Timestamp alone does not give you a reproducible handle on the data.
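As a sketch of how little machinery the five elements require, the hypothetical helper below (not a feature of any OPTIMADE client) renders them into a methods-style sentence; the values mirror the worked example that follows.

```python
def render_citation(provider, filter_string, fetched, count, snapshot_id):
    """Render the five citation elements into a methods-style sentence.

    All five arguments are required: dropping any one of them reopens
    a reproducibility gap described in the list above.
    """
    return (
        f"Data were retrieved from {provider} on {fetched} using filter "
        f"{filter_string!r}. The query returned {count} entries; results "
        f"were frozen as snapshot {snapshot_id}."
    )

sentence = render_citation(
    "https://optimade.materialsproject.org",
    'elements HAS "Li" AND elements HAS "O"',
    "2026-01-14",
    1247,
    "ds_7f3a2c",
)
```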

A complete methods paragraph looks like this:

Structural and thermodynamic data were retrieved from the Materials Project OPTIMADE endpoint (https://optimade.materialsproject.org) on 2026-01-14 using filter elements HAS "Li" AND elements HAS "O" AND _mp_stability_is_stable=true. The query returned 1,247 entries. Results were frozen as an immutable dataset snapshot (ID: ds_7f3a2c) using Alloybase and are available at [DOI or stable URL].

FAIR Principle R1.2 (Wilkinson et al., Scientific Data, 2016) explicitly requires “detailed provenance” for research data. Principle F1 requires globally unique, persistent identifiers. Together, they mean that citing a live database URL without a version or snapshot ID is a structural gap in the citation, not a formatting technicality.

Five elements make a dataset citation reproducible. Most published citations include one.

Saved Searches vs. Versioned Datasets: What Is the Difference?

These are not the same thing, and the distinction is the core of the reproducibility problem.

A saved search stores query parameters and re-executes them against the live database on demand. Results change as the underlying data changes. Useful for tracking new entries over time; not useful for reproducibility.

A versioned dataset is an immutable snapshot of query results frozen at fetch time. The data does not change after capture. It can be cited, shared, and diffed against future database states to surface exactly what changed.

                               Saved search   Versioned dataset
  Results change over time     Yes            No
  Citable in methods section   No             Yes
  Reproducible                 No             Yes
  Supports diff across time    No             Yes
  Storage required             Minimal        Proportional to result size

Most current tooling (pymatgen, aflow-py, optimade-python-tools) provides saved searches. They re-execute against the live API. None capture an immutable snapshot by default.

A saved search is a query. A versioned dataset is evidence.

How Alloybase Immutable Snapshots Work

When you run an OPTIMADE query in Alloybase, the results are stored as a write-once snapshot. What gets captured: the full response payload, query filter string, provider URL, fetch timestamp, result count, and schema version. The snapshot cannot be silently updated. Re-running the same query creates a new snapshot; the original is preserved.
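Alloybase's internals are not shown here, but the write-once property can be sketched with a standard technique: content addressing. In the assumption-laden model below, the snapshot ID is derived from a hash of the payload plus its metadata, so any edit to the stored record necessarily produces a different ID; `make_snapshot` and the `ds_` prefix are illustrative, not the actual implementation.

```python
import hashlib
import json

def make_snapshot(payload, filter_string, provider, fetched_at):
    """Freeze a query result as a content-addressed record.

    The ID is a hash over everything captured, so the snapshot cannot
    be silently edited without its identifier changing.
    """
    body = {
        "provider": provider,
        "filter": filter_string,
        "fetched_at": fetched_at,
        "result_count": len(payload),
        "payload": payload,
    }
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return {"id": "ds_" + digest[:6], **body}

snap = make_snapshot(
    [{"id": "mp-149", "band_gap": 0.61}],  # illustrative entry
    'elements HAS "Si"',
    "https://optimade.materialsproject.org",
    "2026-01-14T00:00:00Z",
)
```

Re-running the same query later hashes a new body and yields a new ID, which is exactly the "new snapshot, original preserved" behavior described above.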

Snapshots persist across session restarts, tied to your workspace rather than an active browser tab. Each snapshot gets a stable identifier you can reference directly in a methods section.

The diff feature compares two snapshots of the same query taken at different times, surfacing which entries changed, which were added, and which were removed. For datasets built over months, the diff is how you document what changed between your initial pull and final submission.
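The diff operation itself is simple once two frozen result sets exist. The sketch below assumes each entry carries a stable `id` field (true of OPTIMADE entries) and compares by set arithmetic; the function name and sample data are illustrative.

```python
def diff_snapshots(old, new, key="id"):
    """Compare two frozen result sets by entry ID.

    Returns which IDs were added, which were removed, and which are
    present in both but have different contents.
    """
    old_by = {e[key]: e for e in old}
    new_by = {e[key]: e for e in new}
    return {
        "added": sorted(new_by.keys() - old_by.keys()),
        "removed": sorted(old_by.keys() - new_by.keys()),
        "changed": sorted(
            k for k in old_by.keys() & new_by.keys()
            if old_by[k] != new_by[k]
        ),
    }

# Illustrative January vs July pulls of the same query
jan = [{"id": "mp-1", "band_gap": 1.2}, {"id": "mp-2", "band_gap": 0.4}]
jul = [{"id": "mp-2", "band_gap": 0.7}, {"id": "mp-3", "band_gap": 2.1}]
delta = diff_snapshots(jan, jul)
# delta == {"added": ["mp-3"], "removed": ["mp-1"], "changed": ["mp-2"]}
```

Without the two frozen inputs, this comparison is impossible: the live API only ever shows the current state.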

Snapshot storage is available on every tier, including Free. Paid tiers (Researcher at $9/mo, Lab at $29/mo) help sustain ongoing development and include power-user features like Parquet export, API keys, semantic search, team workspaces, audit log, and CIF export.

Alloybase snapshots are write-once, workspace-persistent, and citable by design.

Make Your Next Dataset Citable

OPTIMADE databases are maintained, living systems. That is their strength: coverage improves, errors get corrected, new structures get added. For researchers building on that data, it means a published result is only reproducible if the data was frozen at the time of use.

The gap between “we queried Materials Project” and “we queried Materials Project, and here is the exact frozen snapshot of what we retrieved” is the gap between an unverifiable claim and a reproducible methods section.

Version your dataset before you submit. Alloybase handles snapshot creation, diff comparison, and citation export across the full search-to-export workflow. alloybase.app.

FAQ

Does Materials Project have built-in versioning I can reference?

MP releases periodic numbered snapshots (e.g., v2021.11), but mp-api queries the live database by default. There is no MPRester(database_version=...) parameter to time-lock queries to a historical release. The practical approach: record the last_updated and database_version fields from your query results, archive the full result set, and cite the access date explicitly in your methods section.

Is exporting results to CSV enough for reproducibility?

Better than nothing. A CSV captures the data at a point in time but does not capture the query filter, provider URL, schema version, or fetch timestamp. Without those, a reviewer can verify your numbers but cannot reproduce your retrieval or compare it against the current database state.
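If CSV export is what your workflow produces, the missing context can at least be written next to it. The sketch below pairs the CSV with a JSON sidecar holding the retrieval metadata; `export_with_sidecar` and the `.meta.json` convention are illustrative choices, not a standard.

```python
import csv
import json
from datetime import datetime, timezone

def export_with_sidecar(rows, path, filter_string, provider):
    """Write results to CSV plus a JSON sidecar carrying the retrieval
    context (provider, filter, timestamp, count) the CSV cannot hold."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    sidecar = {
        "provider": provider,
        "filter": filter_string,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "entry_count": len(rows),
    }
    with open(path + ".meta.json", "w") as f:
        json.dump(sidecar, f, indent=2)
    return sidecar

meta = export_with_sidecar(
    [{"id": "mp-149", "band_gap": 0.61}],  # illustrative row
    "results.csv",
    'elements HAS "Si"',
    "https://optimade.materialsproject.org",
)
```

The sidecar does not make the export immutable, but it does make the retrieval describable, which is the minimum a reviewer needs.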

Can I compare what changed in a database between two time points?

With versioned snapshots, yes: run the same query twice at different times and diff the two results. This surfaces which entries changed, were added, or were removed. Without snapshots, there is no baseline to compare against.

Which OPTIMADE providers change their data most frequently?

Materials Project is the most active, with continuous updates and documented major migrations like the GGA+U to r2SCAN transition. AFLOW updates as high-throughput runs complete. OQMD and JARVIS-DFT are more stable but publish new database releases periodically. None of the four providers expose per-entry change logs through their standard APIs.

Does Alloybase deduplicate results across providers automatically?

Deduplication is user-initiated, not automatic. Alloybase flags duplicate entries across providers based on structure matching; you choose which records to keep. The dedup run and your decisions are recorded in the dataset history.