Collections Registry

The collections registry defines how BioStudies submissions are mapped into indexed fields for search, faceting, retrieval, and sorting.

It is stored as JSON and loaded at application startup. Each collection in the registry contains a set of property definitions that describe:

the field name,
the display title,
the field type,
optional JSONPath selectors,
optional parsers,
optional analyzers,
and indexing behavior flags such as retrieval, sorting, and multi-value handling.

classDiagram
    class CollectionRegistryService {
      +loadRegistry() CollectionRegistry
      +getCurrentRegistry() CollectionRegistry
      +getPropertyDescriptor(String) PropertyDescriptor
      +getPublicAndCollectionRelatedProperties(String) List~PropertyDescriptor~
    }

    class CollectionRegistry {
      +getCollectionDescriptor(String) CollectionDescriptor
      +getPropertyDescriptor(String) PropertyDescriptor
      +getCollections() List~CollectionDescriptor~
      +getGlobalPropertyRegistry() Map~String, PropertyDescriptor~
    }

    class CollectionDescriptor {
      +getCollectionName() String
      +getProperties() List~PropertyDescriptor~
      +getPropertyByName(String) PropertyDescriptor
      +containsProperty(String) boolean
    }

    class PropertyDescriptor {
      +getName() String
      +getTitle() String
      +getFieldType() FieldType
      +isFacet() boolean
      +isSortable() boolean
      +hasJsonPaths() boolean
    }

    CollectionRegistryService --> CollectionRegistry : loads / caches
    CollectionRegistry "1" o-- "*" CollectionDescriptor : contains
    CollectionDescriptor "1" o-- "*" PropertyDescriptor : contains
    CollectionRegistry --> PropertyDescriptor : global lookup

The full registry is available in collections-registry.json.

Purpose

The registry is the central schema configuration for the indexing pipeline. It allows the system to support multiple collections with different metadata models while still sharing common public fields.

Registry structure

At the top level, the registry is a list of collections.

Each collection contains:

collectionName — the logical name of the collection
properties — the fields defined for that collection
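
As a rough illustration of this layout, the Python sketch below parses a tiny registry document and lists the property names per collection. Only `collectionName`, `properties`, `name`, `title`, and `fieldType` come from this page. The sample values (including the `fieldType` strings and the `model_format` property) and the helper `index_by_collection` are assumptions made for the example and are not taken from collections-registry.json.

```python
import json

# Hypothetical two-collection registry; all values are illustrative only.
SAMPLE_REGISTRY = """
[
  {
    "collectionName": "public",
    "properties": [
      {"name": "accession", "title": "Accession", "fieldType": "string"},
      {"name": "title", "title": "Title", "fieldType": "text"}
    ]
  },
  {
    "collectionName": "biomodels",
    "properties": [
      {"name": "model_format", "title": "Model format", "fieldType": "facet"}
    ]
  }
]
"""


def index_by_collection(raw_json):
    """Map each collectionName to the list of property names it defines."""
    collections = json.loads(raw_json)  # top level is a list of collections
    return {
        c["collectionName"]: [p["name"] for p in c["properties"]]
        for c in collections
    }


registry_index = index_by_collection(SAMPLE_REGISTRY)
print(registry_index)
```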

Property definition

Each property may define the following attributes:

Attribute	Description
`name`	Unique field name used in the index
`title`	Human-readable label for the field
`fieldType`	Field type used by the indexing pipeline
`jsonPaths`	JSONPath expressions used to extract values
`parser`	Parser used to compute or normalize the field value
`analyzer`	Analyzer used for tokenization and indexing
`retrieved`	Whether the field is stored for retrieval
`sortable`	Whether search results can be sorted by this field
`multiValued`	Whether the field can contain multiple values
`expanded`	Whether the field should be expanded during indexing
`private`	Whether the field is hidden from public use
`defaultValue`	Fallback value used when no data is found
`facetType`	Facet-specific type, such as boolean
`naVisible`	Whether `NA` values should be visible
`toLowerCase`	Whether values should be normalized to lowercase
`match`	Pattern used to extract or normalize values
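
One way to picture a property definition is as a record with one slot per attribute in the table. The sketch below is a hypothetical Python model, not the project's `PropertyDescriptor`. The attribute names follow the table, while every default value, the choice of a list for `jsonPaths`, the `release_year` field, and the `fieldType` value are assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PropertyDefinition:
    # Attribute names mirror the table; every default below is an assumption.
    name: str
    title: str = ""
    fieldType: str = ""
    jsonPaths: List[str] = field(default_factory=list)
    parser: Optional[str] = None
    analyzer: Optional[str] = None
    retrieved: bool = False
    sortable: bool = False
    multiValued: bool = False
    expanded: bool = False
    private: bool = False
    defaultValue: Optional[str] = None
    facetType: Optional[str] = None
    naVisible: bool = False
    toLowerCase: bool = False
    match: Optional[str] = None


def property_from_dict(raw):
    """Build a PropertyDefinition, rejecting attributes the table does not list."""
    known = {f.name for f in fields(PropertyDefinition)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")
    return PropertyDefinition(**raw)


release_year = property_from_dict({
    "name": "release_year",        # hypothetical field name
    "title": "Release year",
    "fieldType": "facet",          # hypothetical type value
    "parser": "ReleaseYearParser",
    "sortable": True,
})
print(release_year.name, release_year.parser, release_year.sortable)
```

Rejecting unknown attributes is a design choice of this sketch; it catches misspelled keys such as `sortabel` early, but the real loader may be more lenient.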

Supported collections

The registry currently includes the following collections:

public
idr
hecatos
arrayexpress
biomodels
europepmc
eu-toxrisk
rh3r
cancermodelsorg

Public collection

The public collection contains shared fields that apply broadly across submissions. It includes core fields such as:

accession
type
title
author
content
links
files
release date and time fields
collection metadata
access-related fields

These fields form the common indexing layer used across collections.

Collection-specific fields

Collection-specific entries define additional metadata for specialized datasets.

Examples include:

study and assay metadata for arrayexpress
model metadata for biomodels
toxicology metadata for eu-toxrisk
cancer model metadata for cancermodelsorg
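
The class diagram lists a `getPublicAndCollectionRelatedProperties(String)` method, which suggests that the fields visible for a collection are the public fields plus that collection's own. The Python sketch below illustrates that idea only. The merge order, the fallback to public fields for an unknown collection, and the `model_format` property are assumptions, not confirmed behavior of the service.

```python
PUBLIC = "public"


def public_and_collection_properties(registry, collection_name):
    """Return public property names followed by those of the named collection."""
    names = list(registry.get(PUBLIC, []))
    if collection_name != PUBLIC:
        # Unknown collections contribute nothing, so only public fields remain.
        names.extend(registry.get(collection_name, []))
    return names


# Hypothetical registry index: collection name -> property names.
registry = {
    "public": ["accession", "title", "author"],
    "biomodels": ["model_format"],
}

print(public_and_collection_properties(registry, "biomodels"))
```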

Parsers and analyzers

Some fields use custom parsers to transform raw JSON into indexed values.

Examples:

AccessParser for access metadata
ContentParser for full content assembly
JPathListParser for list-like JSONPath extraction
NodeCountingParser for count-based fields
ReleaseDateParser, ReleaseTimeParser, ReleaseYearParser
CreationTimeParser, ModificationTimeParser, ModificationYearParser
ViewCountParser
FileTypeParser
TypeParser
EUToxRiskDataTypeParser
NullParser

Analyzers are likewise selected per field. The registry references the following analyzers:

AttributeFieldAnalyzer
AccessFieldAnalyzer
ExperimentTextAnalyzer
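
A property's `parser` attribute names the component that turns extracted values into the indexed value. The sketch below shows one plausible way to dispatch on that name in Python. The two functions are placeholders that merely reuse parser names from the list above; they do not reproduce the behavior of the real BioStudies parsers, and pairing `files` with `NodeCountingParser` is an assumption for the example.

```python
def count_values(values):
    """Placeholder for a count-based parser: number of extracted values."""
    return len(values)


def list_values(values):
    """Placeholder for a list-like parser: values as a list of strings."""
    return [str(v) for v in values]


# Keys reuse parser names from this page; the functions are stand-ins only.
PARSERS = {
    "NodeCountingParser": count_values,
    "JPathListParser": list_values,
}


def parse_field(prop, extracted):
    """Apply the parser named in a property definition to extracted values."""
    parser_name = prop.get("parser")
    if parser_name is None:
        return extracted  # no parser configured: pass values through unchanged
    if parser_name not in PARSERS:
        raise KeyError(f"unsupported parser: {parser_name}")
    return PARSERS[parser_name](extracted)


files_prop = {"name": "files", "parser": "NodeCountingParser"}  # assumed pairing
file_count = parse_field(files_prop, ["a.txt", "b.txt", "c.txt"])
print(file_count)  # prints 3
```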

Validation rules

The registry is expected to satisfy the following rules:

collection names must be unique
property names must be unique across the full registry, not only within a collection
each property definition must be well-formed for its declared `fieldType`
parser and analyzer names must match supported implementations
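
The uniqueness and name-matching rules above can be checked mechanically. The Python sketch below is a minimal illustration, not the project's validator: the function `validate_registry` and the error messages are hypothetical, the parser set is deliberately a small subset of the names listed on this page, and the type-specific well-formedness rule is left out because this page does not spell it out.

```python
from collections import Counter

# Subset of the parser names on this page, for illustration only.
SUPPORTED_PARSERS = {"AccessParser", "ContentParser", "NullParser"}
SUPPORTED_ANALYZERS = {
    "AttributeFieldAnalyzer",
    "AccessFieldAnalyzer",
    "ExperimentTextAnalyzer",
}


def validate_registry(collections):
    """Return a list of human-readable rule violations (empty if none)."""
    errors = []

    # Rule: collection names are unique.
    coll_counts = Counter(c["collectionName"] for c in collections)
    errors += [f"duplicate collection: {n}" for n, k in coll_counts.items() if k > 1]

    # Rule: property names are unique across the full registry.
    prop_counts = Counter(p["name"] for c in collections for p in c["properties"])
    errors += [f"duplicate property: {n}" for n, k in prop_counts.items() if k > 1]

    # Rule: parser and analyzer names match supported implementations.
    for c in collections:
        for p in c["properties"]:
            parser = p.get("parser")
            if parser is not None and parser not in SUPPORTED_PARSERS:
                errors.append(f"{p['name']}: unknown parser {parser}")
            analyzer = p.get("analyzer")
            if analyzer is not None and analyzer not in SUPPORTED_ANALYZERS:
                errors.append(f"{p['name']}: unknown analyzer {analyzer}")
    return errors


# Deliberately invalid sample: "title" is defined twice, one analyzer is unknown.
bad_registry = [
    {"collectionName": "public", "properties": [
        {"name": "title"},
        {"name": "access", "parser": "AccessParser"},
    ]},
    {"collectionName": "idr", "properties": [
        {"name": "title", "analyzer": "MadeUpAnalyzer"},
    ]},
]
problems = validate_registry(bad_registry)
print(problems)
```

Collecting all violations instead of failing on the first one makes it easier to fix a registry file in a single pass.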