Needle in a Haystack: A RAG Architecture and Use Case for Biotech
AI Acceleration with Pluton Biosciences
If you’ve ever had to navigate a complex research process, you likely understand the challenges of sifting through vast amounts of information to find specific, valuable insights. Our partners at Pluton Biosciences have faced similar hurdles in their work. These research processes, while essential for gaining critical knowledge, can be both time-consuming and frustrating. They often involve reviewing a multitude of resources to gather the necessary context to answer a few key questions. The task can feel like searching for a needle in a haystack, especially when the stakes are high and time is limited. For Pluton Biosciences, the challenge is amplified by the technical complexity of their research and the need for precise, credible information.
Retrieval Augmented Generation (RAG)
Natural Language Processing (NLP) encompasses a broad range of methods that can offer a lot of help here. NLP is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines linguistics and computer science to process and analyze large amounts of natural language data, such as text or speech.
The general workflow we are looking at here is RAG, a method that enhances NLP models by incorporating external information retrieved from a database or the internet to improve the generation of text. Instead of relying solely on the model’s internal knowledge, RAG combines retrieval from external sources with the model’s generative abilities, helping the model generate more accurate, context-aware, and factually correct responses. The workflow augments the context of a question posed to a Large Language Model (LLM) at run time by retrieving relevant information from a vector store.

Vector stores are a common approach for indexing documents. They work by converting documents into numerical representations (vectors) that capture their semantic meaning. This indexing method allows the system to quickly search for and retrieve the most relevant documents for a given query by comparing the similarity of the vectors.
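To make the workflow concrete, here is a minimal sketch of the RAG loop in Python. The helpers (`embed`, `vector_store.search`, `llm.generate`) are hypothetical stand-ins for whatever embeddings model, vector store, and LLM a given stack provides, not the components used in this project:

```python
def answer_with_rag(question: str, embed, vector_store, llm, k: int = 5) -> str:
    # 1. Vectorize the question with the same embeddings model
    #    that was used to index the documents.
    query_vector = embed(question)

    # 2. Retrieve the k documents whose vectors are most similar.
    documents = vector_store.search(query_vector, k=k)

    # 3. Augment the prompt with the retrieved context, then generate.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```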
Document Retrievers
The first piece of the RAG workflow is the set of document retrievers; without these, there is nothing to augment with. The workflow needs access to high-quality, trustworthy data, so the data sources were selected from a curated list of trustworthy repositories specified by Pluton. The workflow must be able to query these repositories through whatever interfaces each one supports. In this case, some knowledge came from internal domain expertise provided by Pluton and some from publicly available APIs. Each retriever is unique to its source, but every retriever builds the same document structure to index into the vector store, as shown below.
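Pluton's actual schema isn't reproduced here, but a shared document structure along these lines is a reasonable sketch; every field name below is an illustrative assumption:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDocument:
    """Common shape every retriever emits, regardless of source."""
    doc_id: str      # unique identifier within the vector store
    text: str        # the passage that gets embedded and indexed
    source: str      # which repository or API the document came from
    citation: str    # full reference, used later for in-text citations
    metadata: dict = field(default_factory=dict)  # source-specific extras
```

Because every source maps into this one shape, the rest of the pipeline (embedding, indexing, citation checking) never needs to know which retriever produced a given document.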
Vector DB
If the context window of the LLM were limitless, we could indiscriminately fill it with retrieved documents. However, this approach would likely be inefficient and could degrade performance. Instead, it’s common to carefully select documents based on their relevance to the query. This selection process relies on a Vector Database, which stores document/vector pairs. An embeddings model generates a vector for each document, and these pairs are stored in the database. When a query is made, it is also vectorized using the same embeddings model. The system then identifies and retrieves the k ‘closest’ vectors, where closeness is scored by a metric such as cosine similarity (maximized) or Euclidean distance (minimized).
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It evaluates how similar the direction of the two vectors is, regardless of their magnitude. A cosine similarity of 1 indicates that the vectors are identical in direction, while a cosine similarity of 0 indicates they are orthogonal (i.e., no similarity). Note that a value of -1 is possible and would indicate vectors pointing in opposite directions. This metric is particularly useful when you want to compare the relative orientation of vectors rather than their exact distance. In the context of embeddings, cosine similarity is often used when the magnitude of the vectors matters less than their direction, and hence their semantic meaning.
Euclidean distance, on the other hand, measures the straight-line distance between two points in a multi-dimensional space. It is calculated by taking the square root of the sum of the squared differences of the corresponding components of the vectors. A smaller Euclidean distance indicates that the vectors are closer to each other in space. This metric is sensitive to the magnitude of vectors and is useful when the absolute distance between vectors is a key factor in determining similarity.
Both metrics are used to identify which vectors are closest to the query vector, but the choice of metric can depend on the specific application and the type of data being processed.
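As a concrete illustration, both metrics, and a top-k search built on one of them, take only a few lines of NumPy. This is a generic sketch rather than the project's code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|): 1 = same direction,
    # 0 = orthogonal, -1 = opposite direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between the two points: the square root
    # of the sum of squared component differences.
    return float(np.linalg.norm(a - b))

def top_k_by_cosine(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    # Score every stored vector against the query, then return the
    # indices of the k highest-scoring (most similar) vectors.
    scores = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return np.argsort(scores)[::-1][:k]
```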
Architecture
This article extends the case study we recently published on our website, taking a closer look at the details of the architecture, RAG itself, and our approach to it.
The architecture for this project was deployed in AWS using Terraform and GitHub Actions. Our agent was built into a pipeline so that, when Pluton received new genetic inputs, the model was executed, passing in relevant, ranked data from various journals and building a rich context from Pluton’s past discoveries.
Bedrock
In this project, AWS Bedrock played a crucial role in facilitating seamless access to the models required for both embedding creation and language inference. Bedrock is a managed service that simplifies the process of integrating and deploying foundation models from various providers, allowing developers to build and scale generative AI applications without the complexity of managing the underlying infrastructure. There was a need for two distinct modalities of model, both offered through Bedrock:
- Embedding Model
- Language Model
Embedding Model
The embedding model is essential for defining a mapping from the textual data in our sources to an abstract high-dimensional vector space. The vectors in this space encapsulate the semantic meaning of the documents. This model ensures that the documents stored in the vector database are represented in a format conducive to fast and precise similarity searches.
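As an illustration, invoking an embeddings model through Bedrock's runtime API looks roughly like the sketch below. The model ID (Amazon Titan Text Embeddings) and region are assumptions for the example; the case study doesn't name the specific models used:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list[float]:
    # Request/response shapes are model-specific; this is the format
    # the Amazon Titan text embeddings family expects.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed model choice
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```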
Language Model
The language model, also provisioned via Bedrock, serves as the core engine of the RAG workflow. It is tasked with understanding and processing the natural language queries posed by the researchers. Upon receiving a query, the language model leverages the contextual information retrieved from the vector database to generate a report on the given topic, informing and guiding research.
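Generation goes through the same invoke_model call, just with a different model ID and request shape. The sketch below assumes an Anthropic Claude model and its Bedrock messages format; again, the specific model used in the project isn't named here:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_report(topic: str, context: str) -> str:
    # Anthropic models on Bedrock use the messages request format.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{
            "role": "user",
            "content": (
                "Using only the sources below, write a report on the "
                "topic, with an in-text citation for every claim.\n\n"
                f"Sources:\n{context}\n\nTopic: {topic}"
            ),
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```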
Citations
Language models are inherently prone to hallucinations, making transparency a critical factor for Pluton to trust the generated reports. To ensure this transparency, we implemented strict in-text citation requirements within the agent. Additionally, as a post-processing step, we verified the existence of all cited sources, reinforcing the reliability and credibility of the information presented.
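The verification step can be fairly mechanical. The sketch below assumes citations appear as bracketed keys such as [smith2021] (the actual citation format enforced in the prompt may differ) and flags any citation that doesn't correspond to a retrieved source:

```python
import re

def find_unverified_citations(report: str, retrieved_ids: set[str]) -> list[str]:
    # Pull every bracketed citation key out of the generated report.
    cited = set(re.findall(r"\[([^\]]+)\]", report))

    # Anything cited that was never retrieved is a potential
    # hallucination and gets flagged for human review.
    return sorted(cited - retrieved_ids)
```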
In conclusion, the automated scholarly source review for Pluton Biosciences exemplifies the transformative potential of combining Retrieval-Augmented Generation (RAG) with advanced NLP techniques. By streamlining the process of extracting relevant information from vast scholarly resources, this solution not only enhances efficiency but also improves the accuracy and reliability of research findings. The integration of AWS Bedrock for embedding and language models, coupled with robust vector store capabilities, ensures precise document retrieval and context-aware responses. Addressing the dual naming conventions in microbiology further highlights the system’s adaptability and thoroughness. Overall, this project demonstrates how cutting-edge AI can revolutionize traditional research processes, empowering scientists to focus on innovation and discovery with greater confidence and clarity.