When Tantivy indexes documents, it can optionally store the original text as well. The stored text is used to generate snippets, which are highlights of the matching text. Tantivy keeps these documents on the filesystem in a compressed format. This works fairly well on local storage, but on a networked filesystem such as EFS the whole store for a segment needs to be pulled in order to find a given document.
.store files tend to be considerably larger than the rest of the segment:
-rw-rw-r-- 1 1001 1001 48K Nov 28 22:19 d908e24f73f04e83b85e679acf1d361b.7388140.del
-rw-rw-r-- 1 1001 1001 99 Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fast
-rw-rw-r-- 1 1001 1001 2.7M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fieldnorm
-rw-rw-r-- 1 1001 1001 30M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.idx
-rw-rw-r-- 1 1001 1001 17M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.pos
-rw-rw-r-- 1 1001 1001 91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r-- 1 1001 1001 5.8M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.term
Looking at all the .store files in the test index, it is clear that pulling .store files to look up specific documents by id would be incredibly inefficient when taking network latency into account:
-rw-rw-r-- 1 1001 1001 149M Nov 28 22:14 03578055b76b45bd961cf3931a0282d9.store
-rw-rw-r-- 1 1001 1001 176M Nov 28 23:29 103552f789714d07a2dff9f7143e001c.store
-rw-rw-r-- 1 1001 1001 162M Nov 29 00:06 1bfb3f7ef08e40b4bd166919c0786769.store
-rw-rw-r-- 1 1001 1001 9.9M Nov 29 00:38 87c0dd2f0577477ba233ee6a1c57c948.store
-rw-rw-r-- 1 1001 1001 66M Nov 29 00:16 92f9621c4f3e41a6938ad65d2c37e969.store
-rw-rw-r-- 1 1001 1001 51M Nov 29 00:32 b208026b097445c4a8eef6ea7dc6754e.store
-rw-rw-r-- 1 1001 1001 91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r-- 1 1001 1001 58M Nov 29 00:24 dd957117dbf3437ca3cdb552b38cc8c4.store
-rw-rw-r-- 1 1001 1001 38M Nov 29 00:37 ea6445abef914faeb767686e1c054987.store
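To put a rough number on the listing above, the sizes can simply be added up. This is only a back-of-the-envelope sketch: the values are copied from the `ls -lh` output, and the total is an upper bound that assumes every segment's store would have to be pulled.

```python
# Sizes of the .store files as reported by `ls -lh` above (M suffix).
store_sizes_m = [149, 176, 162, 9.9, 66, 51, 91, 58, 38]

total_m = sum(store_sizes_m)      # data moved if every store is pulled
largest_m = max(store_sizes_m)    # worst case for a single-segment lookup

print(f"{len(store_sizes_m)} segments, {total_m:.1f}M total, largest {largest_m}M")
```

That is roughly 800M across nine segments, with the largest single store at 176M.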
Instead of using Tantivy's built-in storage capability, we could store the original documents in S3 or DynamoDB so that they can be retrieved efficiently by id. A beneficial side effect is that this should also be cheaper, since both DynamoDB and S3 have a lower monthly storage cost than EFS.
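A minimal sketch of what the id-keyed lookup could look like, written in Python for brevity. Everything here is hypothetical: `DocumentStore`, `InMemoryDocumentStore`, and `fetch_hits` are illustrative names, and the in-memory dictionary only stands in for S3 or DynamoDB. It also assumes the index keeps a document id for each hit so that results can be mapped to the external store.

```python
import json
import zlib
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DocumentStore(ABC):
    """Hypothetical interface for a store addressed by document id."""

    @abstractmethod
    def put(self, doc_id: str, document: dict) -> None:
        ...

    @abstractmethod
    def get(self, doc_id: str) -> Optional[dict]:
        ...


class InMemoryDocumentStore(DocumentStore):
    """Stand-in for S3 or DynamoDB: one compressed blob per document id."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, doc_id: str, document: dict) -> None:
        # Compress each document on its own so it can be read back alone.
        payload = json.dumps(document).encode("utf-8")
        self._blobs[doc_id] = zlib.compress(payload)

    def get(self, doc_id: str) -> Optional[dict]:
        blob = self._blobs.get(doc_id)
        if blob is None:
            return None
        return json.loads(zlib.decompress(blob).decode("utf-8"))


def fetch_hits(store: DocumentStore, doc_ids: List[str]) -> List[dict]:
    """Fetch only the documents behind the given search-hit ids."""
    docs = (store.get(doc_id) for doc_id in doc_ids)
    return [doc for doc in docs if doc is not None]
```

With this shape, generating snippets for a page of results costs one small read per hit rather than a read of the whole segment store. A real backend would replace the dictionary with calls to the chosen service.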