You came to analyze data, not debug filter strings

A reasonable-looking query for binary compounds containing silicon or cobalt: elements HAS "Si" OR elements HAS "Co" AND nelements = 2. This returns thousands of results: every silicon-containing structure in the database, plus cobalt binaries. The filter is syntactically valid, but AND binds tighter than OR, so the semantics don’t match the English.

OPTIMADE (Open Databases Integration for Materials Design) is a standardized REST API for querying computational materials databases. Its filter language is well-designed. It is also genuinely annoying to write correctly on the first try.

The friction is syntactic, not conceptual. You already know what query you want. You lose time to quoting rules on element symbols, operator precedence between AND and OR, and the difference between three operators (HAS, HAS ALL, HAS ONLY) that sound interchangeable but are not. These are the traps that catch people who have read the spec, not just beginners skimming docs.

Why does elements HAS "Si" OR elements HAS "Co" AND nelements = 2 return wrong results?

Five syntax traps account for most OPTIMADE filter debugging time. Each one looks correct at first glance.

HAS vs HAS ALL vs HAS ONLY

You want all structures containing both silicon and oxygen. Two chained HAS statements work:

elements HAS "Si" AND elements HAS "O"

Now you want structures containing only silicon and oxygen, with no other elements. The same pattern fails silently. That filter still returns Si-O-N ternaries, Si-O-Fe spinels, anything with Si and O plus extras. The correct filter uses HAS ONLY:

elements HAS ONLY "Si","O"

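HAS ONLY is the operator most people don’t know exists. HAS ALL means “includes at least these.” HAS ONLY means “includes nothing outside these.” Under the specification’s definition that is a subset test, so pure silicon and pure oxygen entries also match. When both elements must be present, add nelements = 2, or write elements HAS ALL "Si","O" AND nelements = 2. The specification also lists HAS ONLY as an optional feature, so some providers may not support it, and the HAS ALL form is the more portable one.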
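If the three operators still blur together, it can help to model them as set operations. The sketch below is a mental model in Python, not provider code: the function names are made up for illustration, and the HAS ONLY behavior follows the specification's subset definition as I read it.

```python
def has(elements, symbol):
    """elements HAS "X": the element list contains X."""
    return symbol in set(elements)


def has_all(elements, *symbols):
    """elements HAS ALL ...: every listed symbol is present; extras are allowed."""
    return set(symbols) <= set(elements)


def has_only(elements, *symbols):
    """elements HAS ONLY ...: nothing outside the listed symbols is present."""
    return set(elements) <= set(symbols)


quartz = ["O", "Si"]
oxynitride = ["N", "O", "Si"]
silicon = ["Si"]

print(has_all(oxynitride, "Si", "O"))   # extras allowed
print(has_only(oxynitride, "Si", "O"))  # N falls outside the listed set
print(has_only(quartz, "Si", "O"))
print(has_only(silicon, "Si", "O"))     # a subset still passes
```

The last line is the one to remember: under the subset reading, an elemental entry passes HAS ONLY, which is why pairing it with an nelements constraint is the safer habit.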

Quoting: elements HAS Si silently fails

Element symbols must be quoted strings. This filter is invalid:

elements HAS Si

Some providers reject it with a parse error. Others silently return no results (or worse, wrong results). The correct form:

elements HAS "Si"

Double quotes, always. No single quotes, no backticks, no exceptions.
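If you assemble filters in a script, the quoting rule is easy to enforce once and forget. The following is a minimal sketch; `quote` and `elements_clause` are hypothetical helper names, not part of any OPTIMADE client library.

```python
LIST_OPERATORS = {"HAS", "HAS ALL", "HAS ANY", "HAS ONLY"}


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def elements_clause(operator: str, *symbols: str) -> str:
    """Build an `elements` clause whose operands are always quoted strings."""
    if operator not in LIST_OPERATORS:
        raise ValueError(f"unsupported list operator: {operator!r}")
    if not symbols:
        raise ValueError("at least one element symbol is required")
    if operator == "HAS" and len(symbols) != 1:
        raise ValueError("plain HAS takes exactly one value")
    return f"elements {operator} " + ",".join(quote(s) for s in symbols)


print(elements_clause("HAS", "Si"))
print(elements_clause("HAS ONLY", "Si", "O"))
```

Routing every symbol through one quoting function means an unquoted Si can never reach the provider, whatever the rest of the script does.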

AND/OR precedence drops your constraints

You want binary compounds containing either silicon or cobalt. The intuitive filter:

elements HAS "Si" OR elements HAS "Co" AND nelements = 2

AND binds tighter than OR. The parser reads this as:

elements HAS "Si" OR (elements HAS "Co" AND nelements = 2)

You get every silicon-containing structure in the database (any number of elements) plus cobalt binaries. The fix is explicit parenthesization:

(elements HAS "Si" OR elements HAS "Co") AND nelements = 2

Standard operator precedence, same as Python and SQL. Easy to miss outside of daily programming contexts.
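Because Python follows the same rule, you can reproduce the bug in a few lines. This sketch stands in for the filter with plain booleans; the function names and the example entry are illustrative only.

```python
def unparenthesized(has_si: bool, has_co: bool, nelements: int) -> bool:
    # Parsed as: has_si or (has_co and nelements == 2)
    return has_si or has_co and nelements == 2


def parenthesized(has_si: bool, has_co: bool, nelements: int) -> bool:
    # The grouping the English sentence actually means.
    return (has_si or has_co) and nelements == 2


# A silicon-containing ternary: it should not count as a binary.
ternary = {"has_si": True, "has_co": False, "nelements": 3}

print(unparenthesized(**ternary))  # True: the nelements constraint is dropped
print(parenthesized(**ternary))    # False: the constraint applies to both branches
```

The ternary slips through the first version because the nelements test is attached only to the cobalt branch.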

IS UNKNOWN, not IS NULL

OPTIMADE does not use SQL’s NULL keyword. Suppose you want structures where the space group has not been reported. The SQL-style attempt does not parse:

space_group_it_number IS NULL    ← invalid

The correct keyword:

space_group_it_number IS UNKNOWN

SQL muscle memory makes this one hard to unlearn.

The water trap: HAS ALL is not exact composition

You want water. A natural first attempt:

elements HAS ALL "H","O"

It matches every structure containing both hydrogen and oxygen: H₂O, H₂O₂, Ca(OH)₂, and thousands of hydroxides and hydrates. For an exact composition, use the Hill formula:

chemical_formula_hill = "H2O"

HAS ALL constrains the element set. It says nothing about stoichiometry or the number of atoms of each element.

Most gotchas in this section follow the same pattern: the filter looks right, raises no red flag, and returns wrong results without warning. That is what makes OPTIMADE filter syntax worth automating for routine queries.
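Three of the five traps are mechanical enough to catch with a pre-flight check. The sketch below is a rough regex-based linter, not a parser for the full grammar: `lint_filter` and its helpers are hypothetical names, it only notices AND/OR mixing at the top level of the expression, and it cannot tell whether HAS ALL was the operator you meant.

```python
import re


def _strip_strings(filter_text: str) -> str:
    """Blank out double-quoted literals so their contents are not linted."""
    return re.sub(r'"(?:\\.|[^"\\])*"', '""', filter_text)


def _top_level_operators(text: str) -> set:
    """Collect AND/OR keywords that appear outside any parentheses."""
    depth, found = 0, set()
    for token in re.findall(r"\(|\)|\bAND\b|\bOR\b", text):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif depth == 0:
            found.add(token)
    return found


def lint_filter(filter_text: str) -> list:
    """Return warnings for the mechanically detectable traps."""
    text = _strip_strings(filter_text)
    warnings = []
    if re.search(r"\bIS\s+NULL\b", text):
        warnings.append("use IS UNKNOWN instead of IS NULL")
    unquoted = r"\bHAS(?:\s+(?:ALL|ANY|ONLY))?\s+(?!ALL\b|ANY\b|ONLY\b)[A-Za-z]"
    if re.search(unquoted, text):
        warnings.append("operand after HAS is not a double-quoted string")
    if _top_level_operators(text) == {"AND", "OR"}:
        warnings.append("AND and OR mixed without parentheses; AND binds tighter")
    return warnings


print(lint_filter('elements HAS "Si" OR elements HAS "Co" AND nelements = 2'))
print(lint_filter('(elements HAS "Si" OR elements HAS "Co") AND nelements = 2'))
```

A check like this catches typos, not intent. The water trap and the HAS ALL versus HAS ONLY choice produce well-formed filters, so they pass any syntactic lint.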

What if you typed what you meant in plain English?

Before and after: natural language to OPTIMADE

Alloybase translates plain English queries into valid OPTIMADE filter strings. Here is what that looks like for the exact traps covered above:

You type | Alloybase generates
--- | ---
“structures containing silicon and oxygen” | elements HAS ALL "Si","O"
“only silicon and oxygen, nothing else” | elements HAS ONLY "Si","O"
“silicon or cobalt binaries” | (elements HAS "Si" OR elements HAS "Co") AND nelements = 2
“water” | chemical_formula_hill = "H2O"
“structures where band gap is not reported” | band_gap IS UNKNOWN
“iron oxides with fewer than 20 atoms per cell” | elements HAS ALL "Fe","O" AND nsites < 20

Every filter in that right column handles quoting, operator precedence, HAS vs HAS ALL vs HAS ONLY, and IS UNKNOWN correctly. These are the exact gotchas from the previous section, resolved automatically.

How the translation works

The translation runs on Claude Haiku (Anthropic’s lightweight LLM) with the complete OPTIMADE filter grammar loaded as system context. It is not regex or template substitution.

That distinction matters for compositional queries. A regex translator handles “silicon oxides” by mapping element names to symbols. An LLM with the grammar spec handles “silicon or cobalt binaries” by reasoning that “binary” means nelements = 2, that “or” scopes over the two elements (not the entire query), and that the filter needs parentheses to preserve that scope.

It runs server-side on each search. You type English, you get a valid filter string, and you can inspect and edit that filter before executing it.
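If you copy a generated filter out to run against a provider directly, the string still has to be URL-encoded, because the double quotes and spaces are not safe in a query string. Here is a minimal sketch using only the standard library. The base URL is a placeholder and `structures_url` is a hypothetical helper; the /v1/structures path and the filter and page_limit parameters are the ones the OPTIMADE specification defines.

```python
from urllib.parse import quote, urlencode


def structures_url(base_url: str, filter_text: str, page_limit: int = 10) -> str:
    """Build a structures query URL with the filter percent-encoded."""
    query = urlencode(
        {"filter": filter_text, "page_limit": page_limit},
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{base_url.rstrip('/')}/v1/structures?{query}"


# Placeholder base URL: substitute the provider you actually query.
url = structures_url("https://optimade.example.org", 'elements HAS "Si"')
print(url)
```

Keeping the filter as a plain string until the last step means the version you inspect and edit is the same one that gets sent.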

When should you still write filters by hand?

Provider-specific properties are the most common reason to go manual. OPTIMADE providers expose custom properties with underscore prefixes (_mp_stability on Materials Project, _aflow_prototype on AFLOW). The LLM knows the core OPTIMADE property set but may not recognize every provider extension. If your query depends on a provider-specific field, write that clause manually or edit the generated filter to add it.
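When you bolt a provider-specific clause onto a generated filter, the precedence trap comes back: an AND appended to a filter that contains OR attaches only to the last branch. A small sketch of a safe way to do it follows. `append_clause` is a hypothetical helper, and the IS KNOWN test on _mp_stability is used only because it does not depend on the field's type.

```python
def append_clause(generated_filter: str, extra_clause: str) -> str:
    """AND an extra clause onto a filter, parenthesizing both sides so an OR
    inside either part keeps its original scope."""
    return f"({generated_filter}) AND ({extra_clause})"


generated = 'elements HAS "Si" OR elements HAS "Co"'
print(append_clause(generated, "_mp_stability IS KNOWN"))
```

The redundant parentheses around a single comparison are harmless, and always adding them removes one decision you could get wrong.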

Deeply nested boolean logic is the second case. For queries with three or more boolean groups and mixed AND/OR/NOT, you need direct control over grouping. The NL layer handles two-level nesting well. Beyond that, verify the output or write the filter yourself.
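For the deeply nested case, one way to keep control over grouping is to build the filter from small combinators that always parenthesize their output. This is a sketch under that assumption; `all_of`, `any_of`, and `not_` are illustrative names rather than an existing library API.

```python
def all_of(*clauses: str) -> str:
    """Join clauses with AND inside one explicit group."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR inside one explicit group."""
    return "(" + " OR ".join(clauses) + ")"


def not_(clause: str) -> str:
    """Negate a clause, wrapping it so NOT applies to the whole clause."""
    return f"NOT ({clause})"


# Si or Co binaries that contain neither oxygen nor nitrogen.
query = all_of(
    any_of('elements HAS "Si"', 'elements HAS "Co"'),
    "nelements = 2",
    not_(any_of('elements HAS "O"', 'elements HAS "N"')),
)
print(query)
```

The output carries a few redundant parentheses, but the nesting of the Python calls mirrors the nesting of the logic, so the grouping is visible in the code rather than inferred from precedence rules.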

The third case is reproducibility. If you are documenting a query for a paper’s methods section, the exact filter string belongs in the manuscript, not the English paraphrase. Use NL search to get the filter right quickly, then copy the generated string into your methods text. Alloybase’s versioned dataset snapshots make the full query reproducible regardless of how you constructed the filter.

NL search and hand-written filters are not mutually exclusive. Every NL translation shows you the generated filter before execution. Edit it, extend it, or replace it entirely. NL is the fast default; manual is the precision escape hatch.

Frequently asked questions

Does the NL translation support provider-specific properties like _mp_stability?

Partially. The LLM knows the core OPTIMADE property set (elements, nelements, nsites, chemical_formula_hill, lattice_vectors, and other spec-defined properties). Provider-specific properties with underscore prefixes may or may not be recognized. You can always edit the generated filter to add or modify provider-specific fields before executing.

Can I see and edit the generated filter before running it?

Yes. Alloybase displays the generated OPTIMADE filter string before executing the search. You can modify it, extend it with additional clauses, or replace it entirely.

Which OPTIMADE providers does Alloybase query?

Alloybase queries 13+ OPTIMADE-compliant providers in a single request, including Materials Project, AFLOW, OQMD, JARVIS-DFT, NOMAD, and Materials Cloud. Results return in a unified schema with full source attribution on every row.

Does natural language search require a paid account?

No. NL search is included on the free tier. You do need to create an account (the translation calls a paid LLM API on the backend, so authentication is required), but the free plan includes NL search with up to 50 searches per day. Sign up at alloybase.app.