OPTIMADE (Open Databases Integration for Materials Design) is a standardized REST API protocol for querying computational materials databases using a shared filter syntax. Thirteen databases currently support it, including Materials Project, AFLOW, OQMD, JARVIS-DFT, NOMAD, COD, TCOD, Materials Cloud, odbx, MPDS, and NREL Materials Database. The protocol was designed so that a single query string works uniformly across all participating databases. In practice, most researchers still open a separate browser tab for each one.
Why Querying OPTIMADE Databases Individually Costs More Time Than It Should
A typical cross-database search runs something like this: query Materials Project through their REST API, switch to AFLOW's AFLOWLIB endpoint (which uses slightly different URL patterns), pull OQMD's REST interface with its own pagination defaults, then check JARVIS-DFT separately. Each database returns OPTIMADE-compliant results, but you still make four separate requests and receive four separate result sets that require manual deduplication afterward.
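The uniformity is easy to demonstrate: under OPTIMADE only the base URL changes between those requests, while the filter string is reused verbatim. A minimal sketch (the base URLs below are illustrative; resolve the real ones from the providers index):

```python
import urllib.parse

# Illustrative base URLs -- resolve the real ones from providers.optimade.org.
BASE_URLS = {
    "mp": "https://optimade.materialsproject.org",
    "oqmd": "https://oqmd.org/optimade",
}

def structures_url(base_url: str, filter_str: str, page_limit: int = 100) -> str:
    """Build the /v1/structures query URL for one provider."""
    query = urllib.parse.urlencode({"filter": filter_str, "page_limit": page_limit})
    return f"{base_url.rstrip('/')}/v1/structures?{query}"

# The same filter string, unchanged, against every provider:
filter_str = 'elements HAS ALL "Si","O" AND nelements=2'
for name, base in BASE_URLS.items():
    print(name, structures_url(base, filter_str))
```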
Each provider also has its own quirks: different default page limits, different levels of support for optional filter properties, and different uptime characteristics. A query that works cleanly against Materials Project may return an empty result set against COD because COD does not carry computed properties. These differences compound across four or more providers.
The OPTIMADE specification was designed to solve this through federated queries. The providers index at providers.optimade.org lists every compliant database with its base URL, which makes fanning out a single filter string to all providers simultaneously straightforward to implement. Most researchers do not use this approach, even when working with optimade-python-tools, which supports the protocol but queries providers individually by default.
Which OPTIMADE Providers Can You Query Today?
The providers index currently lists 13 active databases, and they differ substantially in what they return. Entry counts change as datasets are updated; the figures below reflect published statistics as of early 2026.
Materials Project: over 150,000 DFT-relaxed inorganic compounds; reliable for formation energy, band gap, and hull distance
AFLOW: nearly 4 million entries (3.93M as of early 2026) across binary and ternary systems; strong coverage for alloy screening
OQMD: over 1.4 million entries; the most internally consistent formation energy values across providers, in our testing
JARVIS-DFT: ~40,000 total calculated systems including 2D materials and ML force field benchmarks
NOMAD: the largest raw calculation repository in the index; limited computed-property filter support
COD (Crystallography Open Database): experimental crystal structures only; no DFT-computed properties
TCOD: theoretical extension of COD; sparse coverage
Materials Cloud: curated datasets from published research workflows
odbx: Oxford database of crystal structures; good coverage for organic crystals
MPDS (Materials Platform for Data Science): broad coverage; licensing restrictions apply to bulk access
NREL Materials Database: photovoltaic and energy materials focus
Alexandria: newer entrant; strong for ML benchmark structures; ~2.9M structures across PBE and PBEsol/SCAN datasets
Open Materials Database (omdb): organic and inorganic structures
These split into two groups with meaningfully different query behavior. Computed-property-capable providers (Materials Project, AFLOW, OQMD, JARVIS-DFT, and Alexandria) reliably return formation_energy_per_atom, band_gap, and other DFT-computed properties for filtered structure queries. Structure-only providers (COD, TCOD, NOMAD, and odbx) hold crystal structures without computed properties. Applying a property filter against them returns empty results rather than an error, which silently excludes an entire database from your dataset if your fan-out logic does not account for this distinction.
NOMAD is worth noting specifically: it contains the largest raw calculation repository but returns sparse results for property-filtered queries. It is most useful when queried without property filters and post-processed locally against the downloaded structure data.
Building a Cross-Provider Query: The Fan-Out Pattern
Getting Per-Provider Timeouts Right
The fan-out pattern requires concurrent requests with independent timeouts per provider. Wrapping each provider coroutine in asyncio.wait_for achieves this: a provider that exceeds the timeout is dropped from the result set without holding up the others. asyncio.gather alone does not provide this; it runs tasks concurrently but cannot cancel individual tasks on timeout, so one unresponsive provider will delay the entire batch until a global timeout fires.
```python
import asyncio

import httpx

PROVIDERS_INDEX = "https://providers.optimade.org/providers.json"


async def fetch_provider_urls() -> dict[str, str]:
    async with httpx.AsyncClient() as client:
        r = await client.get(PROVIDERS_INDEX)
        data = r.json()
        return {
            p["id"]: p["attributes"]["base_url"]
            for p in data["data"]
            if p["attributes"].get("base_url")
        }


async def query_provider(
    client: httpx.AsyncClient,
    provider_id: str,
    base_url: str,
    filter_str: str,
    timeout: float = 10.0,
) -> tuple[str, list]:
    url = f"{base_url.rstrip('/')}/v1/structures"
    params = {"filter": filter_str, "page_limit": 100}
    try:
        r = await asyncio.wait_for(
            client.get(url, params=params),
            timeout=timeout,
        )
        data = r.json()
        results = data.get("data", [])
        if data.get("meta", {}).get("more_data_available"):
            # Paginate: fetch remaining pages here
            pass
        return provider_id, results
    except Exception as e:
        print(f"[{provider_id}] failed: {e}")
        return provider_id, []


async def federated_query(filter_str: str) -> dict[str, list]:
    providers = await fetch_provider_urls()
    async with httpx.AsyncClient() as client:
        tasks = [
            query_provider(client, pid, url, filter_str)
            for pid, url in providers.items()
        ]
        results = await asyncio.gather(*tasks)
        return dict(results)


# Example: binary oxides with band gap > 3 eV
filter_str = 'elements HAS ALL "O" AND nelements=2 AND band_gap > 3.0'
results = asyncio.run(federated_query(filter_str))
```
Two details are easy to overlook. Setting page_limit=100 explicitly matters because providers default to 10 or 20 results per page, and meta.more_data_available=true is the only signal that results were truncated. Missing 80% or more of results because pagination goes unchecked is one of the most common failure modes in cross-provider queries. Logging every provider failure by name is equally important: without it, there is no way to know which databases are absent from the result set.
Not all providers support the same filter properties. A filter on band_gap against COD or NOMAD returns empty results, not an error. This is the reason for splitting providers into the two groups above rather than assuming uniform property support across all 13 databases.
How Do You Deduplicate Results Across Providers?
The same material appears in multiple databases under different identifiers. Silicon is mp-149 in Materials Project and carries a different identifier in AFLOW and OQMD. Deduplication requires matching on structure, not on database-assigned ID.
Two approaches are practical. Reduced chemical formula combined with space group number is fast, scales to any dataset size, and identifies roughly 95% of duplicates. Pymatgen’s StructureMatcher performs full geometric comparison and correctly distinguishes polymorphs, but becomes impractical for large screening sets due to its computational overhead. For most screening workflows (where the goal is filtering by composition and property rather than distinguishing structural variants), formula plus spacegroup is the right choice. StructureMatcher is worth the overhead only when polymorph discrimination is required.
```python
import pandas as pd

rows = []
for provider_id, entries in results.items():
    for entry in entries:
        attrs = entry.get("attributes", {})
        rows.append({
            "provider": provider_id,
            "optimade_id": entry.get("id"),
            "formula": attrs.get("chemical_formula_reduced"),
            "spacegroup": attrs.get("space_group_symbol"),
            "band_gap": attrs.get("band_gap"),
            "formation_energy": attrs.get("formation_energy_per_atom"),
        })

df = pd.DataFrame(rows)
df_dedup = df.drop_duplicates(subset=["formula", "spacegroup"], keep="first")
```
Not every OPTIMADE field is populated by every provider. band_gap and formation_energy_per_atom are DFT-computed properties that COD and other experimental databases will not carry. Using .get() with None defaults throughout keeps DataFrame construction from raising key errors on missing fields, and the resulting NaN values make the data source explicit in downstream analysis.
Formula plus spacegroup deduplication produces a clean, analysis-ready dataset without geometric comparison overhead, and is the appropriate default for any screening workflow over more than a few hundred structures.
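One refinement worth considering: drop_duplicates keeps the first provider's row and discards the fact that other databases reported the same material. A groupby aggregation keeps one row per material while preserving provenance (the rows below are illustrative values, not real query output):

```python
import pandas as pd

hits = pd.DataFrame([
    {"provider": "mp", "formula": "SiO2", "spacegroup": "P3121", "band_gap": 5.6},
    {"provider": "oqmd", "formula": "SiO2", "spacegroup": "P3121", "band_gap": 5.4},
    {"provider": "cod", "formula": "TiO2", "spacegroup": "P42/mnm", "band_gap": None},
])

# One row per (formula, spacegroup): record every provider that reported
# the material and keep the first non-null value for each property.
dedup = (
    hits.groupby(["formula", "spacegroup"], as_index=False)
        .agg(providers=("provider", list), band_gap=("band_gap", "first"))
)
```

The providers column then feeds directly into per-row source attribution later in the pipeline.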
From Query Results to a Reproducible Research Workflow
Provider datasets are not static. Materials Project has revised formation energies across dataset versions, and a query run today may return different numerical values for the same material six months from now. Caching raw JSON responses with a fetch timestamp makes the screening pipeline exactly reproducible and significantly reduces API load during iterative development. Running deduplication and filter logic locally against cached data is substantially faster than re-querying.
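A minimal version of that caching layer only needs to store the raw payload next to the query that produced it and a UTC timestamp (the directory name and file layout here are arbitrary choices):

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("optimade_cache")  # arbitrary location

def cache_response(provider_id: str, filter_str: str, payload: dict) -> Path:
    """Write one raw JSON response to disk alongside the query string and a
    fetch timestamp, so later runs replay exactly the data that was fetched."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha1(filter_str.encode()).hexdigest()[:12]
    path = CACHE_DIR / f"{provider_id}_{key}.json"
    path.write_text(json.dumps({
        "provider": provider_id,
        "filter": filter_str,
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "response": payload,
    }, indent=2))
    return path

def load_cached(path: Path) -> dict:
    return json.loads(path.read_text())
```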
Converting OPTIMADE structures to pymatgen Structure objects requires parsing lattice_vectors and species fields from each entry; optimade-python-tools handles this conversion automatically. From there, the structures feed into a standard screening funnel: hull distance filter, property range filter, then DFT relaxation for the remaining candidates.
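For contexts where pulling in optimade-python-tools is overkill, the relevant fields can be extracted by hand. A sketch of that extraction (field names follow the OPTIMADE v1 structures schema):

```python
def to_structure_dict(entry: dict) -> dict:
    """Pull the three fields that define a crystal from one OPTIMADE entry."""
    attrs = entry["attributes"]
    return {
        "lattice": attrs["lattice_vectors"],          # 3x3 matrix, angstrom
        "species": attrs["species_at_sites"],         # one species label per site
        "coords": attrs["cartesian_site_positions"],  # cartesian, angstrom
    }
```

From there, pymatgen's `Structure(d["lattice"], d["species"], d["coords"], coords_are_cartesian=True)` should accept the extracted values directly when symmetry analysis or polymorph checks are needed.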
The fan-out script, pagination handling, timeout logic, deduplication, and caching described here amount to roughly 200 lines of Python that also require ongoing maintenance as providers update base URLs or the index gains new entries. Alloybase implements this pipeline and stores results as versioned datasets with per-row source attribution (provider name, OPTIMADE ID, query string, and fetch timestamp), which supports citation in methods sections. The free tier covers cross-provider search; the Researcher tier ($9/month) adds persistent versioned datasets, deduplication, and API key access. That subscription is what keeps the infrastructure running and the provider index current. Start with the free tier at alloybase.app.
Common Pitfalls and How to Avoid Them
| Pitfall | What actually happens | Fix |
|---|---|---|
| Ignoring pagination | Provider defaults (10–20 results per page) silently truncate result sets; you get the first page and nothing else | Set page_limit=100 explicitly; check meta.more_data_available on every response and paginate until false |
| Assuming all providers support the same filter properties | Filters on band_gap or formation_energy_per_atom against COD or NOMAD return empty results, not errors — silently dropping those databases from your dataset | Classify providers as compute-capable or structure-only before querying; skip computed-property filters for structure-only providers |
| Using asyncio.gather without per-provider timeouts | One slow or unresponsive provider stalls the entire result set until a global timeout fires | Wrap each provider coroutine with asyncio.wait_for(coro, timeout=10.0) to isolate failures |
| Not logging provider failures | Silent failures make it impossible to know which databases are missing from the result set; results appear complete when they aren’t | Log every provider failure by name; include a failed_providers list in the result metadata |