The Fragmentation Problem in Computational Materials Data

Materials Project, AFLOW, OQMD, and JARVIS-DFT each expose different APIs, different schemas, and different filter syntaxes. Querying all four for the same material requires four browser tabs, four query languages, and a spreadsheet to reconcile the results.

That spreadsheet loses context fast. Which provider returned this formation energy value? What query produced this row? When was it fetched? Copy-paste between tabs strips that information silently, and you only notice when a reviewer asks for your data sources six months later.

Session-bound results compound the problem. Close the browser, lose the dataset. Re-run the query next week and you spend rate-limit quota on data you already had. The bottleneck in most computational materials projects is not compute. It is the dataset assembly step that precedes it.

Alloybase exists to close that gap: one materials data search tool that handles the query federation, deduplication, attribution, and persistence so you can focus on the analysis.

What Does Alloybase Actually Do?

A single OPTIMADE-compliant query fans out across 13 federated providers and returns merged results. You write one query. Alloybase handles the fan-out, schema normalization, and response aggregation. No provider-specific API wrappers. No manual merging.
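To make the fan-out concrete, here is a minimal sketch of the pattern, not Alloybase's actual implementation. The two provider functions and their rows are hypothetical stand-ins; a real client would send the filter to each provider's OPTIMADE structures endpoint over HTTP.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for provider clients (assumption, for illustration only).
def fake_provider_a(filter_expr):
    return [{"id": "a-1", "chemical_formula_reduced": "O2Ti"}]

def fake_provider_b(filter_expr):
    return [
        {"id": "b-7", "chemical_formula_reduced": "O2Ti"},
        {"id": "b-9", "chemical_formula_reduced": "OZn"},
    ]

PROVIDERS = {"provider_a": fake_provider_a, "provider_b": fake_provider_b}

def fan_out(filter_expr, providers=PROVIDERS):
    """Send one filter to every provider and merge the rows, tagging each with its source."""
    def query(item):
        name, fetch = item
        return [{**row, "provider": name} for row in fetch(filter_expr)]

    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        # map() preserves provider order, so the merged list is deterministic.
        batches = list(pool.map(query, providers.items()))
    return [row for batch in batches for row in batch]

merged = fan_out('elements HAS "O" AND nelements=2')
for row in merged:
    print(row["provider"], row["id"], row["chemical_formula_reduced"])
```

Tagging each row with its provider at merge time is what makes later deduplication and attribution possible without re-querying.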

Datasets persist across sessions with version numbers. Re-open your browser three months later and the dataset is exactly where you left it, with every row intact. Immutable versioned snapshots mean your methods section can cite a specific dataset state that anyone can verify.
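One common way to make a snapshot verifiable is to fingerprint its contents. The sketch below assumes a SHA-256 digest over a canonical JSON form of the rows; that is an illustrative technique, not a description of how Alloybase stores versions, and `snapshot_digest` is a hypothetical helper name.

```python
import hashlib
import json

def snapshot_digest(rows):
    """Return a SHA-256 hex digest that ignores row order and key order."""
    # Sort rows by their own canonical JSON so that order does not change the digest.
    ordered = sorted(rows, key=lambda r: json.dumps(r, sort_keys=True))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Synthetic rows, for illustration only.
v1 = [{"id": "a-1", "band_gap": 3.0}, {"id": "b-7", "band_gap": 2.0}]
print(snapshot_digest(v1))
```

Anyone holding the same rows can recompute the digest and compare it with the one recorded alongside the cited version.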

Every row carries complete source attribution: provider name, OPTIMADE ID, the exact query string, fetch timestamp, and source URL. This is not metadata you export separately. It is on every screen, on every row, all the time.
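As a rough picture of what per-row attribution looks like, the sketch below models the five fields listed above as a small record. The class and field names are assumptions chosen for illustration, and the URL is a placeholder.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class Attribution:
    provider: str      # provider name
    optimade_id: str   # ID of the entry at the provider
    query: str         # exact filter string that produced the row
    fetched_at: str    # ISO 8601 timestamp in UTC
    source_url: str    # where the entry was fetched from

def attribute(row, provider, query, source_url, now=None):
    """Return a copy of the row with its attribution attached."""
    now = now or datetime.now(timezone.utc)
    attribution = Attribution(provider, str(row["id"]), query, now.isoformat(), source_url)
    return {**row, "attribution": asdict(attribution)}

row = attribute(
    {"id": "a-1", "chemical_formula_reduced": "O2Ti"},
    provider="provider_a",
    query='elements HAS "O" AND nelements=2',
    source_url="https://example.org/structures/a-1",  # placeholder URL
)
print(row["attribution"])
```

Keeping the attribution inside the row, rather than in a side table, means it survives filtering, sorting, and export without extra bookkeeping.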

A cross-provider comparison surfaces disagreements between sources. In our testing, roughly 7% of cross-listed materials carried conflicting property values across providers, a higher share than most people expect. These disagreements matter for ML training sets, and they are invisible when you query providers one at a time.
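A disagreement check of this kind can be sketched in a few lines. The version below groups rows by reduced formula and flags any material whose values differ by more than a relative tolerance. The 5% tolerance, the field names, and the sample values are all assumptions made up for illustration, not Alloybase's rules or real provider data.

```python
from collections import defaultdict

def find_conflicts(rows, prop, rel_tol=0.05):
    """Return {formula: [(provider, value), ...]} where values disagree beyond rel_tol."""
    by_formula = defaultdict(list)
    for row in rows:
        if row.get(prop) is not None:
            by_formula[row["formula"]].append((row["provider"], row[prop]))

    conflicts = {}
    for formula, values in by_formula.items():
        numbers = [value for _, value in values]
        spread = max(numbers) - min(numbers)
        scale = max(abs(n) for n in numbers)
        if len(values) > 1 and scale > 0 and spread / scale > rel_tol:
            conflicts[formula] = values
    return conflicts

# Synthetic values, for illustration only.
rows = [
    {"formula": "O2Ti", "provider": "provider_a", "band_gap": 3.0},
    {"formula": "O2Ti", "provider": "provider_b", "band_gap": 2.0},
    {"formula": "OZn", "provider": "provider_a", "band_gap": 1.00},
    {"formula": "OZn", "provider": "provider_b", "band_gap": 1.01},
]
print(find_conflicts(rows, "band_gap"))
```

Matching on formula alone would lump polymorphs together, so a production check would also need to compare structures before calling two entries the same material.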

Type “binary oxides with band gap above 2 eV” and Alloybase produces the corresponding OPTIMADE filter expression. The generated filter is displayed before execution so you can verify that what you meant matches what the system will run. Exported datasets retain both the natural language input and the compiled filter for reproducibility.

This addresses the OPTIMADE adoption barrier directly. The spec is well-designed, but the filter grammar is unfamiliar to most researchers. Showing the compiled filter (rather than hiding it) means you learn the syntax while using it.
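For the example phrase above, the compiled filter might look roughly like the output of this sketch. The `elements` and `nelements` properties are part of the OPTIMADE specification. Band gap is not a standard OPTIMADE property, so the field name here uses the specification's placeholder prefix `_exmpl_` and is an assumption; real providers use their own prefixed names, and the filter Alloybase generates may differ.

```python
def binary_oxide_filter(min_band_gap_ev, band_gap_field="_exmpl_band_gap"):
    """Build an OPTIMADE filter for binary oxides above a band gap threshold.

    band_gap_field is a hypothetical provider-specific property name.
    """
    return (
        'elements HAS "O" AND nelements=2 '
        f"AND {band_gap_field} > {min_band_gap_ev}"
    )

print(binary_oxide_filter(2))
# elements HAS "O" AND nelements=2 AND _exmpl_band_gap > 2
```

Reading it left to right: the entry must contain oxygen, must contain exactly two elements, and must have a band gap above the threshold.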

Exports built for real workflows

Free tier exports CSV and JSON, which covers most spreadsheet and notebook workflows. Researcher tier adds Parquet for fast columnar reads in pandas, Polars, and DuckDB pipelines. Lab tier adds CIF for direct ingestion into DFT codes and crystallographic tools. Every export bundles a manifest with provider attribution and dataset version, so downstream papers can cite the exact data state.
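The pairing of data file and manifest can be pictured with the sketch below, which writes rows to CSV text and builds a small JSON manifest next to it. The manifest keys, the dataset ID, and the function name are illustrative assumptions, not Alloybase's actual export format.

```python
import csv
import io
import json

def export_csv_with_manifest(rows, dataset_id, version):
    """Return (csv_text, manifest_json) for a list of row dicts."""
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells

    manifest = {
        "dataset_id": dataset_id,
        "version": version,
        "row_count": len(rows),
        "providers": sorted({row["provider"] for row in rows}),
    }
    return buffer.getvalue(), json.dumps(manifest, indent=2, sort_keys=True)

# Synthetic rows and a hypothetical dataset ID, for illustration only.
rows = [
    {"id": "a-1", "formula": "O2Ti", "provider": "provider_a"},
    {"id": "b-7", "formula": "O2Ti", "provider": "provider_b"},
]
csv_text, manifest_json = export_csv_with_manifest(rows, "ds-example", 3)
print(manifest_json)
```

Shipping the manifest with the data file means a downstream reader can tell which dataset version and which providers a table came from without access to the original account.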

Alloybase handles the full chain in one tool: search, curate, version, export, and cite. We built it because nothing we tried covered all five steps without manual glue between them.

Pricing: What Each Tier Costs

The subscription sustains the federated infrastructure and the OPTIMADE proxy layer.

Free ($0/mo): 5 datasets, 200 rows per dataset, 50 searches per day. CSV and JSON export. Natural language search. Dataset versioning. Enough to run a typical literature survey or course project end-to-end.

Researcher ($9/mo): 25 datasets, 5,000 rows per dataset, unlimited searches. Adds deduplication, anomaly detection, semantic search, cross-provider comparison, API keys, pre-loaded Jupyter notebooks with starter analysis code, and Parquet export. Built for researchers assembling ML training sets or running multi-provider systematic reviews.

Lab ($29/mo): Unlimited datasets and rows. Team workspaces up to 10 members. Audit log for every query and dataset change. CIF export. Built for research groups that need shared datasets with a traceable chain of custody.

For context: enterprise materials informatics (MI) platforms that charge tens of thousands of dollars per year offer comparable source attribution and reproducibility. Alloybase covers the same core workflow at individual-researcher pricing because the subscription only needs to sustain the federation layer, not a sales team.

FAQ

Can I cite an Alloybase dataset in a paper?

Yes. Each versioned dataset has a stable ID and a manifest listing every source provider, query, and fetch timestamp. Include the dataset ID and version number in your methods section the same way you would cite a code repository at a specific commit.

Which providers are federated through Alloybase right now?

Thirteen OPTIMADE-compliant providers are federated today, including Materials Project, AFLOW, OQMD, JARVIS-DFT, NOMAD, Materials Cloud, and COD. The full list is on the search page. Any provider that implements the OPTIMADE specification can be added.

What happens to my datasets if I cancel a paid plan?

Nothing is deleted, and every dataset remains readable. Your account reverts to Free tier limits (5 datasets, 200 rows each), so datasets beyond those limits become read-only until you trim them or re-subscribe.

Does the AI search ever produce wrong filters?

It can. That is why the compiled OPTIMADE filter is displayed before execution. You see exactly what will run and can edit it before submitting. The system never executes a query you have not reviewed.

Is there an API for programmatic access?

API keys are available on Researcher and Lab tiers. The auto-generated Jupyter notebooks double as runnable usage examples, with your dataset ID and API endpoints already filled in.