Comparing the size of a data store with the size of the Solr index built from it, one finds that the index is usually larger than the data it indexes. This may seem counterintuitive at first, but it makes perfect sense.
To understand why, it helps to build a simplified version of a document store and an index. Consider the following collection of seven documents. Each row consists of a reference number and a document. This data store is 91 bytes in size, counting one newline between rows.
[0,"ABC DEF"]
[1,"AB DEF"]
[2,"AC DEF"]
[3,"BC DEF"]
[4,"ABC DE"]
[5,"ABC EF"]
[6,"ABC DF"]
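The 91-byte figure can be checked with a short sketch. It assumes each row is written exactly as printed above (no spaces after the comma) and that rows are joined by a single newline with no trailing newline. The names `documents` and `serialize_store` are illustrative only and have nothing to do with how Solr stores data.

```python
import json

# The seven toy documents; the list position doubles as the reference number.
documents = ["ABC DEF", "AB DEF", "AC DEF", "BC DEF", "ABC DE", "ABC EF", "ABC DF"]

def serialize_store(docs):
    """Render each document as a compact [ref,"text"] row, one row per line."""
    rows = [
        json.dumps([ref, text], separators=(",", ":"))
        for ref, text in enumerate(docs)
    ]
    return "\n".join(rows)

store_text = serialize_store(documents)
store_size = len(store_text.encode("utf-8"))
print(store_size)  # 91
```

The first row is 13 characters and the other six are 12 each, giving 85 characters plus 6 newlines.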
The index is intended to provide a quick lookup by letter combination. Instead of scanning every document to find those containing a given text fragment, we simply look up the fragment and receive a list of the documents that contain that character sequence.
"A"=[0,1,2,4,5,6]
"AB"=[0,1,4,5,6]
"ABC"=[0,4,5,6]
"AC"=[2]
"B"=[0,1,3,4,5,6]
"BC"=[0,3,4,5,6]
"C"=[0,2,3,4,5,6]
"D"=[0,1,2,3,4,6]
"DE"=[0,1,2,3,4]
"DEF"=[0,1,2,3]
"DF"=[6]
"E"=[0,1,2,3,4,5]
"EF"=[0,1,2,3,5]
"F"=[0,1,2,3,5,6]
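The index is 225 bytes, roughly 2.5 times the size of our 91-byte document store. Every document is listed once for each fragment it contains, so the same reference numbers are repeated many times across the index.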
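As a rough sketch of how such an index could be produced, the code below splits each document on whitespace and records every contiguous substring of each word against the document's reference number. This is an assumption about how the toy index above was derived, not a description of Solr's internals. The helper names `substrings`, `build_index` and `serialize_index` are hypothetical, and the byte count assumes one entry per line with a single newline between entries.

```python
documents = ["ABC DEF", "AB DEF", "AC DEF", "BC DEF", "ABC DE", "ABC EF", "ABC DF"]

def substrings(token):
    """Every contiguous substring of a word, e.g. 'AC' -> {'A', 'AC', 'C'}."""
    return {
        token[i:j]
        for i in range(len(token))
        for j in range(i + 1, len(token) + 1)
    }

def build_index(docs):
    """Map each fragment to the sorted list of documents containing it."""
    index = {}
    for ref, text in enumerate(docs):
        for token in text.split():
            for fragment in substrings(token):
                index.setdefault(fragment, set()).add(ref)
    # Sort fragments alphabetically and reference numbers numerically.
    return {fragment: sorted(refs) for fragment, refs in sorted(index.items())}

def serialize_index(index):
    """Render entries in the "FRAGMENT"=[refs] form used in the listing."""
    lines = [
        '"%s"=[%s]' % (fragment, ",".join(str(ref) for ref in refs))
        for fragment, refs in index.items()
    ]
    return "\n".join(lines)

index = build_index(documents)
index_size = len(serialize_index(index).encode("utf-8"))

print(index["ABC"])  # [0, 4, 5, 6]
print(index_size)    # 225
```

A query is then a single dictionary lookup rather than a scan of all seven documents. The cost of that speed is the repeated reference numbers that make the index larger than the store.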