Tip Tuesday | Choosing a Chunk Size for RAG
A key step in the Retrieval Augmented Generation (RAG) pipeline is chunking your unstructured source data: the process of dividing a large body of information into small, manageable pieces. These chunks are the elements that will be searched for relevance to the user’s query and used as the context for a response, so they must be created in a way that enables efficient search while still containing meaningful information. If a chunk is too large it may cover several topics, diluting its similarity score against a query about any single one of them; if it’s too small it may not contain a complete or useful factual statement about a given subject. Furthermore, small sizes lead to a greater number of total chunks, increasing search time and complexity. The choice of chunking strategy can therefore have a significant impact on the performance of the end model.
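To make this concrete, here’s a minimal sketch of a fixed-size chunker with overlap in plain Python; the 500-character window and 50-character overlap are placeholder values for illustration, not recommendations.

```python
# Minimal fixed-size chunking with overlap (illustrative values, not recommendations).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        if window.strip():  # skip empty or whitespace-only windows
            chunks.append(window)
    return chunks

document = "..."  # your unstructured source data goes here
chunks = chunk_text(document)
```

The overlap is there to reduce the chance of a factual statement being cut cleanly in half at a chunk boundary.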
When deciding on an appropriate chunk size it’s advisable to start by understanding the source data: how it’s structured, how much there is, and how verbose it is. For example, a bulleted list will likely be relatively information-dense and suited to a smaller chunk size than a description given in a conversational style. By getting a feel for the type of data you’re working with, you can start to quantify what ‘too big’ and ‘too small’ look like. Of course, real-world data will contain a mixture of formats and styles, each of which lends itself to a different chunk size, but this isn’t a problem! It doesn’t matter if one chunk is 100 tokens long and another is 500; embedding models map a chunk to the same number of dimensions regardless of its size, so you’re free to be led by information density instead of word count.
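As a quick illustration of that last point, here’s a sketch using the sentence-transformers library (all-MiniLM-L6-v2 is just one example model): chunks of very different lengths still come out as vectors of the same dimensionality.

```python
# Embeddings have a fixed dimensionality regardless of chunk length.
# Assumes the sentence-transformers library; the model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short_chunk = "RAG chunk sizes should follow information density."
long_chunk = " ".join(["A much longer, more conversational chunk of text."] * 40)

embeddings = model.encode([short_chunk, long_chunk])
print(embeddings.shape)  # (2, 384): both chunks map into the same 384-dimensional space
```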
The format of your source documents can also help define and guide this process. Structured formats like HTML and XML self-describe their content, allowing us to detect and treat headings, paragraphs, images, lists, etc. differently, whereas code could be chunked on a function-by-function basis, or a slideshow might make most sense with one slide per chunk.
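As one sketch of format-driven chunking, this is what a function-by-function splitter for Python source might look like, using only the standard library ast module (the file name is a placeholder); the same idea applies to splitting HTML or Markdown on headings.

```python
# Sketch: chunk Python source one function per chunk (file name is a placeholder).
import ast

source_code = open("my_module.py").read()
tree = ast.parse(source_code)
lines = source_code.splitlines()

function_chunks = []
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        # lineno/end_lineno give the 1-indexed span of the function definition
        function_chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
```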
Ultimately there’s no one-size-fits-all solution to picking a chunk size, but you can take a cue from your data to make an informed estimate. Iterating from this estimate, you can try a few chunk sizes or strategies, compare them, and allow model performance to drive the final decision.
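One way to set up that comparison is sketched below, using sentence-transformers and a small hand-made evaluation set; the queries, expected snippets, and candidate sizes are all hypothetical placeholders you’d replace with your own, and the metric here is a simple retrieval hit rate.

```python
# Sketch: compare chunk sizes by retrieval hit rate (placeholders throughout).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

document = "..."  # your corpus as a single string (placeholder)
eval_set = [      # hypothetical (query, snippet a relevant chunk should contain) pairs
    ("example query", "expected fact"),
]

def hit_rate(document: str, eval_set: list[tuple[str, str]], chunk_size: int) -> float:
    """Fraction of queries whose top-ranked chunk contains the expected snippet."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    hits = 0
    for query, expected in eval_set:
        query_embedding = model.encode(query, convert_to_tensor=True)
        best = util.cos_sim(query_embedding, chunk_embeddings).argmax().item()
        hits += int(expected in chunks[best])
    return hits / len(eval_set)

for size in (200, 500, 1000):
    print(f"chunk_size={size}: hit rate {hit_rate(document, eval_set, size):.2f}")
```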