This issue describes a possible way to ingest external SST files into a running agatedb instance, just as rocksdb does.
Background
TiKV uses the ingest external SST feature for some use cases. To integrate agatedb into TiKV, we need this feature.
According to Creating and Ingesting SST files · facebook/rocksdb Wiki (github.com), ingesting an SST file takes the following steps:
- copy or link the file(s) into the DB directory
- block (not skip) writes to the DB and assign the correct sequence number to the ingested file
- flush the memtable if ranges overlap (the SST file's key range overlaps with the memtable)
- assign the file to the LSM-tree
Things work differently in agatedb.
Analysis
Step 1 and step 4 are the same.
For step 3, agatedb does not need to flush the memtable when ranges overlap. On a get, agatedb reads all levels and picks the latest version (less than or equal to read_ts), so it's okay to have new keys below old keys in the LSM-tree. This makes the whole process efficient. (Please point out if I'm wrong.)
Step 2 is the most difficult: how do we preserve ACID during and after ingestion?
Agatedb has ACID transactions (SSI), so the architecture is totally different from rocksdb. We have a read_ts and a commit_ts, and we check for conflicts on commit.
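To make the read-path argument concrete, here is a minimal sketch of a lookup that scans every level and keeps the newest version visible at read_ts. The `Entry` and `Level` types and the `get` function are hypothetical stand-ins for illustration, not agatedb's actual structures.

```rust
/// One stored version of a key. `version` is the commit timestamp of the write.
/// (Hypothetical model; agatedb's real entry layout may differ.)
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    key: Vec<u8>,
    version: u64,
    value: Vec<u8>,
}

/// Hypothetical stand-in for a memtable or an LSM level: a bag of entries.
type Level = Vec<Entry>;

/// Look up `key` in every level and return the newest entry whose
/// version is <= read_ts. The position of a level in `levels` does not
/// matter, which is why a newer ingested file may sit below older data.
fn get<'a>(levels: &'a [Level], key: &[u8], read_ts: u64) -> Option<&'a Entry> {
    levels
        .iter()
        .flat_map(|level| level.iter())
        .filter(|e| e.key.as_slice() == key && e.version <= read_ts)
        .max_by_key(|e| e.version)
}
```

Because the result depends only on versions, an ingested table with a higher version at the bottom level still wins over an older memtable entry for readers whose read_ts is high enough.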
Possible Implementation
Main idea: Treat ingesting SST files as writing a large range of keys, and make the whole process a transaction.
Ingesting files is expected to be an expensive job.
Time Point: before commit
- Add files: fetch a new file id from `LevelsController`, then move or copy the files to the DB dir with the allocated file id (name).
- Verify file checksum (optional) and get file information (just open Table?).
- Make a new transaction and mark update as true (add an ingest flag to the transaction?).
- (TBD) other checks…
Check files before starting a txn.
Time Point: commit
- Same as current, but add an ingest range field to `CommittedTxn` in `CommitInfo`, containing the smallest and largest keys of each ingested file.
This allows a fast but inaccurate conflict check; otherwise we need to calculate the hash of every key in the SST, which really takes time.
- Set commit_ts as the Table's `global_version`.
- Send to write channel as usual (see below)
- Process the ingest: find a suitable level for each file. (This takes the write lock of `LevelHandler` and blocks reads from other txns.)
- Write the manifest file. The file's `global_version` is stored in the manifest.
A question here: I see we keep the order of Requests sent to the write channel the same as the commit_ts order, and I wonder why. Is it to keep the WAL in order?
If ingest jobs can run out of order, I think it's possible and better to run the ingest job in the current thread instead of sending it to the write channel.
You can see that the whole commit process has only one small I/O: an append to the manifest. But picking a level takes time if there are many tables in that level, and this happens under the write lock of `LevelHandler`.
Conflict Check
There are two ways to check conflicts, as described in Time Point: commit (step 1).
- Calculate the hash of every key in the ingested files, and then the conflict check works the same as it does now. To address the performance problem, we can calculate the key hashes before the transaction starts.
- Add an ingest range field to `CommittedTxn` in `CommitInfo`.
For any update txn, when adding read keys, also record the smallest and largest read key of the txn. When checking conflicts, besides checking hashes, also check whether the read range overlaps with an ingest range.
This is really inaccurate and will cause many txn conflicts, but I think it makes sense.
The second way needs much more refactoring of the transaction code.
(A simpler way is to break ACID… haha)
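The second option can be sketched roughly as follows. This is only an illustration under assumptions: the field names `conflict_keys` and `ingest_ranges`, the `ReadSet` struct, and the `has_conflict` function are hypothetical and chosen for this example, not taken from agatedb's code.

```rust
use std::collections::HashSet;

/// Inclusive key range [smallest, largest].
#[derive(Debug, Clone)]
struct KeyRange {
    smallest: Vec<u8>,
    largest: Vec<u8>,
}

impl KeyRange {
    /// Two inclusive ranges overlap when each starts before the other ends.
    fn overlaps(&self, other: &KeyRange) -> bool {
        self.smallest <= other.largest && other.smallest <= self.largest
    }
}

/// Hypothetical committed-transaction record with the proposed extra field.
struct CommittedTxn {
    ts: u64,
    conflict_keys: HashSet<u64>,  // hashes of written keys (existing check)
    ingest_ranges: Vec<KeyRange>, // proposed: one range per ingested file
}

/// What a committing update txn tracked while reading (hypothetical).
struct ReadSet {
    read_ts: u64,
    fingerprints: Vec<u64>,       // hashes of the keys it read
    read_range: Option<KeyRange>, // proposed: smallest/largest read key
}

/// Conflict if any txn committed after our read_ts wrote a key we read,
/// or ingested a file whose range overlaps our read range.
fn has_conflict(committed: &[CommittedTxn], reads: &ReadSet) -> bool {
    committed
        .iter()
        .filter(|c| c.ts > reads.read_ts)
        .any(|c| {
            let hash_hit = reads
                .fingerprints
                .iter()
                .any(|h| c.conflict_keys.contains(h));
            let range_hit = reads.read_range.as_ref().map_or(false, |r| {
                c.ingest_ranges.iter().any(|ir| ir.overlaps(r))
            });
            hash_hit || range_hit
        })
}
```

The range check is conservative: a txn that read only "a" and "z" conflicts with any ingest in between, even though it touched none of the ingested keys. That is the inaccuracy mentioned above.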
Global Version
This concept is the same as in rocksdb. In an ingested file, every key has the wrong version, and it's unreasonable to modify each one. When a file has a global version, every key in the file takes this version.
In the latest rocksdb, this value is stored in the manifest to avoid random writes to ingested files.
We need to add a field in meta.proto and update the Table iterator (also the block iterator) to support this feature.
The name `global_version` may be ambiguous. Maybe pick another name when implementing.
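As a rough illustration of the iterator change, the sketch below overrides the per-key version with the table's global version when one is set. `RawEntry`, `Table`, and the `global_version: Option<u64>` field are hypothetical simplifications for this example; in a real implementation the value would be loaded from the manifest and the version would be rewritten wherever the key's timestamp is decoded.

```rust
/// Entry as stored inside a table file (hypothetical simplified model).
#[derive(Debug, Clone, PartialEq)]
struct RawEntry {
    user_key: Vec<u8>,
    version: u64,
    value: Vec<u8>,
}

/// Hypothetical table handle. `global_version` is assumed to come from
/// the manifest; `None` means a regular, non-ingested table.
struct Table {
    entries: Vec<RawEntry>,
    global_version: Option<u64>,
}

impl Table {
    /// Iterate over entries, replacing each key's stored version with the
    /// global version when the table has one. The file itself is untouched.
    fn iter(&self) -> impl Iterator<Item = RawEntry> + '_ {
        let global = self.global_version;
        self.entries.iter().cloned().map(move |mut e| {
            if let Some(v) = global {
                e.version = v;
            }
            e
        })
    }
}
```

With this shape, regular tables pay only an `Option` check per entry, and ingested files never need to be rewritten.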
Discuss
- A new transaction struct only for ingest, or refactor the current one?
- Which way to check conflicts? (Or any better idea?)
- Send to the write channel or do it in the current thread?
- Any good ideas for implementing this feature?
- Strategy to pick a level. I think checking for overlap from the bottom level up is nice.
- What if external files (in one job) have overlapping ranges? Rocksdb will put all files at level 0.
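One possible reading of the bottom-up level picking strategy is sketched below: walk from the deepest level upward and take the first level where the file overlaps no existing table, falling back to level 0. The `KeyRange` type and `pick_level` function are hypothetical, and the sketch relies on the earlier observation that agatedb tolerates newer keys below older ones; it does not handle overlap between files in the same job.

```rust
/// Inclusive key range [smallest, largest] of a table (hypothetical type).
#[derive(Debug, Clone)]
struct KeyRange {
    smallest: Vec<u8>,
    largest: Vec<u8>,
}

impl KeyRange {
    fn overlaps(&self, other: &KeyRange) -> bool {
        self.smallest <= other.largest && other.smallest <= self.largest
    }
}

/// `levels[0]` is level 0 and the last element is the bottom level.
/// Returns the deepest level (>= 1) in which `file` overlaps no table,
/// or 0 if every deeper level has an overlap (level 0 allows overlap).
fn pick_level(levels: &[Vec<KeyRange>], file: &KeyRange) -> usize {
    (1..levels.len())
        .rev()
        .find(|&lvl| levels[lvl].iter().all(|t| !t.overlaps(file)))
        .unwrap_or(0)
}
```

The linear scan over each level's tables is what would run under the `LevelHandler` write lock; since tables in levels below 0 are sorted and disjoint, a binary search could shorten that critical section.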