
The Metadata Catalogue: the backbone of AI-on-Demand — Part One

NEWS
Thu 25 Sep 2025

AI research is booming. Every day countless papers are published, datasets are uploaded, models are trained, courses are developed, new labs are founded, and so on. Information about these developments is scattered across many different platforms. Papers may be published as preprints on arXiv, in journals (e.g., JMLR), at conferences (e.g., NeurIPS), and more. Datasets can be found on platforms such as Kaggle, Hugging Face, or OpenML. A plethora of artifacts of all kinds can be found on Zenodo. And when the moment comes that you need a specific asset, it is hard to know where to look.

With AI-on-Demand’s Metadata Catalogue, we envision one catalogue that keeps metadata about all AI research in a unified schema. This, in turn, allows the building of services that incorporate data from many different platforms. Whether you browse AI-on-Demand’s Resource collection to find datasets or visit the AI Ecosystem Map to get an overview of AI organisations in Europe, the data shown originates from the Metadata Catalogue (MDC). The MDC is an open service that provides metadata about anything that relates to AI research, and drives many of the services you see on AI-on-Demand.

This is the first post in a series where we will dive into more details about what the metadata catalogue is and isn’t, what you can expect from it now and going forward, how to use it, and how to contribute to its development.

A Metadata Catalogue

The Metadata Catalogue stores, as the name implies, metadata: information about data, but not the data itself. To give a concrete example, let’s consider a dataset hosted on Hugging Face: Palmer Penguins. The data itself is the actual table that denotes, on each row, the species of a penguin, its bill and flipper dimensions, and other measurements. Data comes in many forms: tables such as this one, sets of images and videos, and many more modalities. The metadata denotes information about that data: a description and title for the dataset, who uploaded it and when, what license it is published under, what the features mean, and so on. In the MDC, we store that metadata alongside a link to the original asset.

The strength of the MDC is that it doesn’t just store the metadata as provided by the original platform: it stores it under a unified Metadata Schema. The metadata about a dataset in AI-on-Demand therefore always has the same structure, regardless of whether the dataset was originally hosted on Hugging Face, Zenodo, or any other platform. As a result, you can search through assets hosted on various platforms in one unified way.
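To make the idea of a unified schema concrete, here is a small sketch in Python. The field names below are illustrative assumptions, not the Catalogue’s actual schema; the real field definitions are part of the Metadata Schema documented with the API.

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Illustrative unified metadata record.

    NOTE: these field names are assumptions for the sake of the
    example, not the Metadata Catalogue's actual schema.
    """
    name: str
    description: str
    platform: str             # e.g. "huggingface", "zenodo", "openml"
    platform_identifier: str  # the asset's id on the original platform
    license: str
    same_as: str              # link back to the original asset


# The record has the same structure whatever the source platform:
penguins = DatasetMetadata(
    name="Palmer Penguins",
    description="Per-penguin species, bill and flipper measurements.",
    platform="huggingface",
    platform_identifier="example-id",      # placeholder, not a real id
    license="example-license",             # placeholder
    same_as="https://huggingface.co/datasets/example",  # placeholder
)
```

Because every record shares one structure, a single search or filter (say, on `platform` or `license`) works across all indexed platforms at once.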

The decision to store only metadata is a conscious one, but it comes with trade-offs. The biggest disadvantage is that we do not store the underlying data: we have no control over the platforms from which we index it, and they may allow their users to delete their data at any time. In practice we find this rarely happens, but if it is important to you that the data remains accessible, consider finding resources on platforms with a stronger data retention policy (such as Zenodo). The MDC allows filtering by platform, so it’s easy to find those assets through the MDC, too.

So why collect metadata only? It’s simple: we don’t want to divide the research landscape further but allow people to continue using the platforms they know and love. The dedicated platforms that host datasets, models, papers, and more are already good at what they do. Introducing a new platform that competes with all of them at the same time is unlikely to be an added benefit to anyone. However, providing a unified way to search across all of these platforms, allowing the definition of additional metadata where the original platforms don’t allow it, and allowing programmatic access in a unified way is a large added benefit to the community*.

Looking Ahead

We envision this metadata catalogue being used to develop new services that may be discoverable through the AI-on-Demand platform and for people to use this metadata directly in their research. In the next posts in the Metadata Catalogue series, we will show how to contribute data to the catalogue, how to develop a programmatic integration, and how you can get involved in the development process.

If you’re already eager to get started, you may find much of this information in the documentation that is already available. The documentation at https://api.aiod.eu describes the latest endpoints and can be used interactively to construct queries to fetch, e.g., dataset metadata. If you want to contribute to the development of the MDC, have a look at our contributor guide and drop us a message on GitHub.
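As a hedged sketch of what programmatic access might look like, the snippet below builds a query against the API and fetches a page of dataset metadata with only the standard library. The endpoint path (`/datasets/v1`) and the pagination parameters are assumptions for illustration; consult the interactive documentation at https://api.aiod.eu for the authoritative, versioned endpoints.

```python
import json
import urllib.request

API_BASE = "https://api.aiod.eu"


def dataset_list_url(offset: int = 0, limit: int = 10) -> str:
    # ASSUMPTION: endpoint path and pagination parameters are
    # illustrative; check https://api.aiod.eu for the real API.
    return f"{API_BASE}/datasets/v1?offset={offset}&limit={limit}"


def fetch_datasets(offset: int = 0, limit: int = 10):
    """Fetch one page of dataset metadata records as parsed JSON."""
    with urllib.request.urlopen(dataset_list_url(offset, limit)) as resp:
        return json.load(resp)
```

With something like this, each returned record follows the unified schema, so the same parsing code works whether the dataset originally lives on Hugging Face, Zenodo, or elsewhere.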


*As a bonus, we do not have to worry about large-scale data storage: while research artefacts such as datasets or trained models can quickly demand terabytes of storage (or more), our metadata catalogue is unlikely to surpass gigabyte scale for a long time.