If you’ve used optimade.science or similar OPTIMADE demos for a screening run, you know where they fall apart: the session ends and everything disappears. No saved query, no stored results, no record of which provider returned which row or when the data was fetched.

That’s a reasonable tradeoff for an interactive demo. It’s a problem when you’re building a training dataset, documenting a DFT screening workflow for publication, or trying to reproduce a filter you ran three weeks ago.

Alloybase is a persistent layer on top of OPTIMADE search. The core thing it adds is simple: results that stay, and source attribution that travels with every row.

What This Looks Like in Practice

Say you’re screening perovskite oxides for photovoltaic candidates — band gap between 1.0 and 2.5 eV, small energy above the convex hull, available across at least two providers for cross-validation. In a standard OPTIMADE demo, you run the query, download a CSV, and move on. The CSV has structure data but no query record, no fetch timestamp, no MP-ID or AFLOW entry ID alongside each row.
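For concreteness, here is how criteria like these compose into an OPTIMADE filter string. This is a sketch: `elements` and `nelements` are standard OPTIMADE fields, but band gap and hull distance are not core properties, so providers expose them under database-specific prefixed names. The `_exmpl_`-prefixed field below is a placeholder in the spec's example-prefix convention, not a real provider field.

```python
# Sketch: composing an OPTIMADE filter for the perovskite screen above.
# `elements` and `nelements` are standard fields; the band-gap property
# name is a provider-specific placeholder (`_exmpl_` prefix), so you would
# substitute the real field name for each provider you query.

def perovskite_filter(gap_min: float = 1.0, gap_max: float = 2.5) -> str:
    clauses = [
        'elements HAS ALL "O"',            # oxides
        "nelements=3",                     # ABO3-type ternaries
        f"_exmpl_band_gap>={gap_min}",     # placeholder field name
        f"_exmpl_band_gap<={gap_max}",
    ]
    return " AND ".join(clauses)

print(perovskite_filter())
# → elements HAS ALL "O" AND nelements=3 AND _exmpl_band_gap>=1.0 AND _exmpl_band_gap<=2.5
```

The same string works against any OPTIMADE `/structures?filter=` endpoint once the provider-specific field name is swapped in.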

In Alloybase, that query goes into a named, versioned dataset. Every row carries its provider, OPTIMADE ID, query string, and fetch timestamp. If Materials Project and AFLOW disagree on the band gap for a given structure — which, in our early testing, occurs regularly among cross-listed materials — that disagreement surfaces automatically. You can add your own DFT results or a proprietary CSV to the same dataset and export the whole thing to Parquet, ready for a training pipeline or a methods section.
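To make the per-row attribution concrete, here is the kind of record an export carries alongside the structure data. The column names, IDs, and values are illustrative assumptions for this sketch, not Alloybase's documented export schema; the point is that the file itself answers "where did this row come from, and when?"

```python
import csv
import io

# Illustrative rows as an exported dataset might carry them. Every value
# here is a made-up placeholder; only the *shape* (attribution columns
# traveling with each row) reflects the design described above.
rows = [
    {"provider": "mp", "optimade_id": "mp-0000",
     "query": 'elements HAS ALL "O" AND nelements=3',
     "fetched_at": "2025-06-01T12:00:00Z", "formula": "CaTiO3"},
    {"provider": "aflow", "optimade_id": "aflow-0000",
     "query": 'elements HAS ALL "O" AND nelements=3',
     "fetched_at": "2025-06-01T12:00:05Z", "formula": "CaTiO3"},
]

# Writing attribution columns next to the data means a plain CSV (or a
# Parquet file with the same columns) is self-documenting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A downstream pipeline can then filter or group on `provider` and `fetched_at` without consulting anything outside the file.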

When you cite the dataset, there’s something to cite: an immutable versioned snapshot with a full attribution trail, not “data from the Materials Project, accessed sometime in Q3.”

A second example: you’re screening Li-ion cathode candidates — transition metal oxides with specific voltage windows and low volume change on lithiation, pulling from JARVIS-DFT alongside Materials Project for coverage. The two databases have overlapping entries with different calculated formation energies. In a raw OPTIMADE query, reconciling those discrepancies is manual work you do outside the tool. In Alloybase, you can run a cross-provider comparison directly on the dataset — disagreements surface row by row. Every row already carries its provider, OPTIMADE ID, query string, and fetch timestamp, so by the time the dataset goes into your cathode screening pipeline, you have an unambiguous provenance record: which entries came from where, which discrepancies exist, and exactly when and how the data was fetched.
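The disagreement check itself is simple to state. A minimal sketch with stdlib Python, using made-up formation-energy values that do not reflect real database contents, and an arbitrary tolerance:

```python
from collections import defaultdict

# Hypothetical cross-listed entries: (formula, provider, formation energy).
# The numbers are invented for illustration only.
entries = [
    ("LiCoO2",  "mp",     -2.28),
    ("LiCoO2",  "jarvis", -2.05),
    ("LiFePO4", "mp",     -2.77),
    ("LiFePO4", "jarvis", -2.76),
]

TOLERANCE = 0.05  # eV/atom; threshold chosen arbitrarily for the sketch

# Group energies by formula, then flag any formula whose providers
# disagree beyond the tolerance -- the check a cross-provider comparison
# automates per row.
by_formula = defaultdict(list)
for formula, provider, e_form in entries:
    by_formula[formula].append(e_form)

flagged = sorted(f for f, es in by_formula.items() if max(es) - min(es) > TOLERANCE)
print(flagged)  # → ['LiCoO2']
```

In practice the grouping key has to be a structure match rather than a bare formula, which is exactly the part worth automating.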

Why Provenance Matters

ML interatomic potentials and property predictors — MACE, CHGNet, M3GNet and their descendants — are now trained on these databases at scale. A training set assembled from OPTIMADE queries has stricter provenance requirements than a single DFT screening paper: the exact fetch date, database version, and functional settings matter for reproducibility and for understanding where model failures originate.

It’s also not a static problem. Materials Project’s r2SCAN functional rollout changed calculated values for existing entries. Formation energies you fetched in 2022 may not match the same entry today. “Data from the Materials Project” is not a complete citation if the values have since been recalculated.

The same logic applies in industrial R&D. If you’re using computed materials data to select a synthesis target or train a model in a battery, semiconductor, or aerospace context, the dataset needs to be documented well enough to defend if IP is ever questioned. “We ran some queries in late 2024” is not documentation.

Alloybase doesn’t solve the underlying data quality problem across OPTIMADE providers. It gives you the tooling to know exactly what you pulled, when you pulled it, and where the discrepancies were.

What’s Included

  • Multi-provider OPTIMADE search across 13+ databases (Materials Project, AFLOW, NOMAD, JARVIS-DFT, COD, and others)

  • Natural language filter builder — type what you want, get a valid OPTIMADE filter string

  • Persistent versioned datasets with source attribution on every row

  • Cross-provider comparison with automatic disagreement flagging

  • Export: CSV/JSON (all tiers), Parquet + Jupyter notebooks (Researcher), CIF (Lab)

  • Source-agnostic datasets — mix OPTIMADE results with proprietary CSV uploads in a single versioned dataset

Tiers

Free — 5 datasets, 200 rows each, 50 searches/day, CSV/JSON export. No credit card.

Researcher ($9/mo) — 25 datasets, 5,000 rows each, unlimited searches, Parquet + Jupyter notebooks, API keys, deduplication, anomaly detection, semantic search.

Lab ($29/mo) — Unlimited everything, team workspaces up to 10, audit log, CIF export.

Beta

Alloybase is in public beta starting this week at alloybase.app. Enter your email on the landing page to request access — we’ll send you a unique invite code to get started.

We’re looking for researchers who run regular screening workflows and have specific opinions about what’s missing. The tool is most useful if we hear from people who actually work with this data.

This is the first post on the Alloybase blog.