Retrieval-Augmented Generation (RAG) pipelines are AI systems that answer users' questions. They combine information retrieval with large language models: the system fetches relevant information from your data, and the model processes those findings to produce more accurate, fact-grounded responses.
However, building RAG pipelines is difficult, especially if you need high accuracy. This article covers what a RAG pipeline is, the issues it has, and how to improve it, moving from the high-level idea to technical details and diagrams.
What Is a RAG Pipeline?
Retrieval-augmented generation pipelines are AI systems that give you more accurate answers to your questions than a single large language model can on its own. They do this by combining two or more models: one retrieves information from data sources such as files or databases, and the other uses that retrieved context to generate a better answer.

Basic RAG pipeline components explained
- User query submission: The user writes a query (a prompt) to the system, which initiates the process.
- Query preprocessing: The user's query is converted into a vector (an embedding) by an embedding model so the system can capture its meaning and semantic relationships.
- Data processing: Data from PDF files, documents, Excel sheets, and databases is extracted, turned into vectors (embeddings), and stored in a vector database. Metadata about the original source is attached to each document for referencing and faster retrieval.
- Retriever: A retriever takes the query embedding and uses it to find information in the vector database or other sources. The stored information is divided into chunks, and each chunk is scored for relevance to the query.
- Generator: The generator LLM turns the most relevant retrieved chunks into a human-readable response. (A minimal end-to-end sketch of this flow follows.)
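To make the flow concrete, here is a minimal sketch of a basic RAG pipeline, assuming a sentence-transformers embedding model and an in-memory index; `call_llm` is a placeholder for whatever chat-completion client you use, and the document snippets are purely illustrative.

```python
# Minimal sketch of the basic RAG flow: embed documents, retrieve by cosine
# similarity, and hand the top chunks to a generator LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email from 9am to 5pm on weekdays.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your chat-completion client of choice.
    raise NotImplementedError

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query and return the most similar document chunks."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```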
Advanced RAG systems add one or more steps on top of this basic question-answering flow:
- Pre- or postprocessing: Optimizing certain steps before data enters the database or after retrieval happens, in order to increase the overall accuracy of the generated response.
Issues With the Basic RAG Pipeline
While most basic RAG tools are conceptually straightforward (retrieve relevant context, then use it to generate a better answer), the difficult part is doing this with high accuracy because:
- LLMs tend to hallucinate.
- There are specific scenarios that require different strategies.
- Many frameworks do not work out of the box and require custom development.
- Many tools are rigid, all-in-one solutions: you cannot swap out parts to meet your specific requirements.
Building Advanced RAG Pipelines
Knowing that pre- and postprocessing in the RAG pipeline can lead to better results, our team started asking hard questions, including:
- Can we make the AI reformulate the question in a way that will perform better?
- Can the AI add extra information to the data for LLMs to understand it better?
- If there are files, could we chunk them page by page to improve retrieval accuracy? (See the sketch below.)
And dozens more questions that directly or indirectly (through combination) influence the accuracy of the output.
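The page-by-page chunking question is easy to illustrate. Here is a short sketch, assuming the pypdf library; keeping the page number as metadata also helps with referencing the source later. The function name is illustrative.

```python
# Sketch of page-level chunking: each PDF page becomes one chunk, with the
# page number kept as metadata for later referencing.
from pypdf import PdfReader

def chunk_by_page(path: str) -> list[dict]:
    reader = PdfReader(path)
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append({"text": text, "source": path, "page": page_number})
    return chunks
```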
The Setup
We kept testing until we found four strategies that increase RAG system accuracy significantly.
Here is what we did:
Improved Prompt
Many users don't understand how AI applications work or what questions they should ask. So we tested letting the model rewrite the question based on the data available. The rewritten question makes more sense to the LLM and to the retriever.
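A minimal sketch of this idea: ask a model to reformulate the user's question before retrieval runs. The prompt wording is illustrative, and `call_llm` is the same placeholder for a chat-completion call used in the earlier sketch.

```python
# Sketch of query rewriting: the model reformulates the user's question so it
# better matches the vocabulary of the indexed data before retrieval runs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: any chat-completion API

REWRITE_PROMPT = (
    "Rewrite the user's question so it is specific and uses terminology likely "
    "to appear in the indexed documents. Return only the rewritten question.\n\n"
    "Question: {question}"
)

def rewrite_query(question: str) -> str:
    rewritten = call_llm(REWRITE_PROMPT.format(question=question))
    return rewritten.strip() or question  # fall back to the original question
```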
Contextual Summarization
When you upload files, RAG applications split the information into chunks before embedding them into the knowledge base, such as a vector database. We thought: what if we improve that part? We tried different techniques before and after chunking, and found that if you preprocess the data by letting a model summarize each document and combine that summary with the actual data, the overall outcome is much better.
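As a rough sketch of this preprocessing step, the document is summarized once and the summary is prepended to every chunk before embedding, so each chunk carries document-level context into the index. `call_llm` remains a placeholder, and the prompt and formatting are assumptions, not the exact implementation.

```python
# Sketch of contextual summarization: summarize the whole document once, then
# prepend that summary to every chunk before it gets embedded.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: any chat-completion API

def summarize_document(full_text: str) -> str:
    return call_llm(f"Summarize this document in 2-3 sentences:\n\n{full_text[:8000]}")

def contextualize_chunks(full_text: str, chunks: list[str]) -> list[str]:
    summary = summarize_document(full_text)
    return [f"Document summary: {summary}\n\nChunk: {chunk}" for chunk in chunks]
```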
BM25
Naive semantic encoding of text chunks can lose exact-term granularity during compression. For example, exact error codes are difficult to retrieve: naive embeddings may surface something about errors, but not the right one. To address this, we index the chunks twice: once as semantic vectors and once with BM25, a TF-IDF-style ranking function with a term-frequency saturation term. At query time, we apply Reciprocal Rank Fusion (RRF) to combine the two result lists and send the top chunks to the generator model.
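Here is a sketch of that hybrid retrieval, assuming the rank-bm25 package for the keyword side; `dense_ranking` stands in for the ranking your vector index would return, and the example chunks are illustrative.

```python
# Sketch of hybrid retrieval: a BM25 (keyword) ranking and a dense (semantic)
# ranking are merged with Reciprocal Rank Fusion.
from rank_bm25 import BM25Okapi

chunks = [
    "Error E4012: connection to the vector database timed out.",
    "General troubleshooting steps for network errors.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def bm25_ranking(query: str) -> list[int]:
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Each document scores sum(1 / (k + rank)) across the ranked lists."""
    fused: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = [1, 0]  # stand-in for the vector index's ranking
final_order = reciprocal_rank_fusion([bm25_ranking("error E4012"), dense_ranking])
```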
Re-ranker
Bi-encoders are, most of the time, your plain embedding models. They produce embeddings and can also perform re-ranking: the query and each retrieved item from each index are embedded independently, and the cosine similarity between them is used as the score.
Cross-encoders do not produce embeddings. They score query-document pairs directly (which is perfect for re-ranking): the query and the chunk are passed through the transformer together, and a classification head outputs a relevance score.
Generally, cross-encoders are better than bi-encoders for re-ranking, but they add overhead because you cannot generate embeddings with them. Text-embeddings-inference supports a few cross-encoder rankers, which we tested on our lab stack.
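For illustration, here is a small cross-encoder re-ranking sketch using the sentence-transformers library and a public MS MARCO re-ranker model; the model choice and function name are assumptions, not the exact stack we ran.

```python
# Sketch of cross-encoder re-ranking: each (query, chunk) pair is scored
# jointly, and the retrieved candidates are reordered by that score.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```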
The Result
The benchmarking numbers are impressive and showcase the strength of our solution compared to others. However, the effectiveness of RAG tools depends heavily on your specific use case and data. Certain features work better for particular use cases, while others may need a completely different set of tools; for example, a setup that works perfectly for legal documents might perform poorly on technical documentation. Here is why:
- Semantic search understands the relationships between the data and the user's prompt and captures the overall concept, but it is not great at locating exact matches.
- Keyword-based search (BM25) is great at retrieving exact keyword matches but bad at understanding relationships.
- Query expansion helps capture related concepts the user didn't explicitly mention but that still matter, though it can also introduce noise.
- Re-ranking is powerful for surfacing the most relevant information for a query, but tools like cross-encoders add significant overhead because they cannot generate embeddings.
- Contextual summarization helps make data more readily understandable for LLMs, but producing a good summary requires accounting for the surrounding context.
Because of these trade-offs, there is no one-size-fits-all setup. For RAG tools to be useful across different domains, we need a different approach.
Why a Modular RAG Architecture Becomes Crucial
This is why we developed a modular RAG architecture. Instead of locking you into a single approach, our solution allows you to:
- Mix and match different retrieval or generative techniques
- Configure each component independently
- Easily add new techniques as the AI evolves
- Automatically evaluate the results of different combinations (a hypothetical configuration sketch follows this list)
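The exact configuration surface depends on the product, but conceptually the mix-and-match idea looks something like the hypothetical sketch below. Every key and value here is illustrative, not a real API.

```python
# Hypothetical configuration sketch (illustrative only): each stage of the
# pipeline is an independently swappable component.
pipeline_config = {
    "preprocessing": ["query_rewriting"],
    "indexing": {"chunking": "per_page", "contextual_summarization": True},
    "retrieval": {"dense": "semantic_vectors", "sparse": "bm25", "fusion": "rrf"},
    "reranker": "cross_encoder",
    "generator": "your-llm-of-choice",
}
```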
The whole idea is that you do not have to build the pipeline yourself. Instead, you point your data source at a system like DKG via an API, so you can concentrate on building world-class experiences in your application or other digital products without the manual work and stress.
Conclusion
Advanced RAG pipelines combine retrieval from data sources with large language models, enabling more accurate responses. However, there is no one-size-fits-all with this technology: a setup that works for one use case may perform poorly for another. As a result, developers need the freedom to change parts of the system to meet their specific demands.
This is why we built a modular RAG architecture that lets you change the tools of the underlying system based on your requirements. We also optimized many of these tools for better response accuracy, so you have the flexibility to experiment and tune the RAG pipeline for your domain-specific use case. Benchmarking results show that our solution is much better than traditional RAG, and we are only getting started.