With big data comes big responsibility – of making it FAIR: Findable, Accessible, Interoperable and Reusable. But ensuring data is truly FAIR is challenging, because scaling search systems to handle big data requires significant computational resources.

Let’s take an example from a user’s perspective. Say you are using BioStudies, our general-purpose repository for publishing life sciences data, and want to download data from all functional genomics studies on chickens this year that use sequencing assay technology. The traditional way is to search a genomics database with keywords like “chicken” and “sequencing assay”. Then you realise you didn’t include the scientific name for chicken (Gallus gallus) in your keywords, so you look it up and search again. Then you have to sort by release date (or use a similar facet) to narrow your results to a list of studies before figuring out how to download the data files.
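To make that keyword workflow concrete, here is roughly what it looks like as an API call. This is a minimal sketch in Python; the facet parameter names are illustrative assumptions, not the exact BioStudies search API contract.

```python
import requests

# Illustrative keyword search against the BioStudies search API.
# The facet parameter names below are assumptions for the sake of the example.
SEARCH_URL = "https://www.ebi.ac.uk/biostudies/api/v1/search"

params = {
    "query": "Gallus gallus sequencing assay",  # remember the scientific name!
    "facet.released_year": "2025",              # hypothetical release-date facet
    "pageSize": 100,
}

response = requests.get(SEARCH_URL, params=params)
response.raise_for_status()

for hit in response.json().get("hits", []):
    print(hit.get("accession"), hit.get("title"))
```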

An easier way might be to ask your favourite AI to get these studies for you… but when you enter your query, it says something like “Sorry, my knowledge cutoff is October 2024, so I don’t have information about specific studies published after that date.” And you don’t really want to build a RAG for fun around millions of such studies available in a scientific data repository.

From a service provider’s perspective, search technologies were, until recently, primarily keyword-based. You format your data nicely, throw it into a Lucene-based engine (Solr, Elasticsearch, whatnot) and just wrap an API around that. As technology changed and machines could “understand” natural language, RAGs (Retrieval-Augmented Generation) were all the rage and everyone wanted a chatbot interface that supported natural language queries. RAGs are extremely effective at incorporating real-time, external knowledge into AI (see my last post), but they need compute and GPU horsepower for fast retrieval from a semantic/vector DB as well as for generating high-quality responses from the base LLM. These costs are affordable for small knowledge bases but can get out of hand when we have to serve millions of requests per day. Is there a way to leverage the existing search technology and shift the LLM costs towards the user?
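To put the cost argument in perspective, here is a sketch of just the retrieval half of a RAG pipeline, using sentence-transformers and FAISS. The model name and index choice are illustrative and the final LLM call is left as a placeholder; the point is that both embedding and generation run on your infrastructure for every request.

```python
import numpy as np
import faiss                                       # vector index (CPU here; GPUs at scale)
from sentence_transformers import SentenceTransformer

# Every document, and every incoming query, needs a pass through an embedding model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Study 1: chicken functional genomics, RNA sequencing ...",
    "Study 2: mouse proteomics, mass spectrometry ...",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

# The whole corpus lives in a vector index that you host and keep in sync.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))

query = "functional genomics studies on chicken using sequencing assays"
query_vector = encoder.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), k=2)

context = "\n".join(documents[i] for i in ids[0] if i != -1)
# Finally, a GPU-hungry LLM generates the answer from the retrieved context --
# the part of the bill that grows with every single user request.
# answer = some_llm.generate(prompt=f"{context}\n\n{query}")
```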

Enter the Model Context Protocol (MCP), Anthropic’s open standard for connecting data sources and AI-powered tools. It follows a standard client-server architecture, where each server is responsible for exchanging data with a data source and communicating with exactly one client over a standard protocol. Think of an “MCP server” as a USB-C adapter that converts one type of input into a standard USB-C interface you can plug into your computer. The USB-C port acts as the “MCP client”, which knows how to communicate over that interface, and your computer is the “host” for multiple such clients.
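Under the hood, client and server exchange JSON-RPC 2.0 messages; a tool invocation looks roughly like the following (shown here as a Python dict, with a made-up tool name and arguments purely for illustration).

```python
# Roughly the message an MCP client sends when the host decides to call a tool.
# MCP uses JSON-RPC 2.0; the tool name and arguments are made up for illustration.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_studies",                               # a tool the server exposes
        "arguments": {"query": "Gallus gallus", "released_year": 2025},
    },
}
```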

But how does MCP help in replacing RAGs? If you host a data repository, you most probably have a robust search API already. Wrap an MCP server around that API (or ask Claude to do it for you), and ask your users to “install” this server as a tool in any MCP client, such as Claude Desktop. No need to spend any budget hosting special vector databases and LLMs – users will be able to query your search API with natural language through their own accounts.
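As a rough idea of how little code such a wrapper needs, here is a sketch using the FastMCP helper from the official MCP Python SDK. It is not the actual BioStudies server linked at the end of this post; the endpoint and parameters are assumptions, and error handling is omitted.

```python
import requests
from mcp.server.fastmcp import FastMCP

# A thin MCP server that exposes an existing search API as a single tool.
mcp = FastMCP("biostudies-search")

SEARCH_URL = "https://www.ebi.ac.uk/biostudies/api/v1/search"  # assumed endpoint

@mcp.tool()
def search_studies(query: str, page_size: int = 25) -> dict:
    """Search BioStudies and return the raw JSON results for the given query."""
    response = requests.get(SEARCH_URL, params={"query": query, "pageSize": page_size})
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Claude Desktop (or any other MCP host) launches this script and talks to it over stdio.
    mcp.run()
```

The host reads the tool’s name, signature and docstring, and the model decides when to call it; the actual searching still happens in your existing API.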

And it’s a game changer! I built an MCP server wrapper around our search API a couple of weeks ago. Once “installed”, I entered “Give me download links for all functional genomics studies released in 2025 on chicken (use scientific name) using sequencing assay technology”. Claude Desktop was able to figure out which API endpoint and parameters to use for the search, do all the backend work for me, and give me the download links without my needing to look up the documentation. “AI-enabled” search is only as good as the documentation! Expand this to other repositories, tie it to an agent that can infer which accessions are used in a given paper, and you have a low-cost solution for downloading all the data used in that paper – FAIR enough?

So is there a catch? Unfortunately, yes. Firstly, the MCP protocol is Anthropic-specific (so far), and you have to use Claude Desktop or another such host. You can get around this by implementing your own MCP client, though. Secondly, it works only for data that you are willing to share with your AI provider. For example, if you want to implement a search like the one above over private data, you’ll have to look into on-premises or private-cloud AI services – the same issue as with RAGs.

In one way, we have been here before. Before this there was OpenAPI/REST (2010s), before that WSDL/SOAP/UDDI (2000s), and before that IDL/CORBA (1990s). Technologies seem to shift every decade or so, but the common goal stays the same: defining how software components communicate over a network. MCP needs to gain traction from developers, IDE/service providers, and companies other than Anthropic in order to become the standard for this decade. Meanwhile, you can try the BioStudies MCP server at https://github.com/EBIBioStudies/biostudies-mcp-server. Happy searching!