简介 在之前的文章中我们知道RocksDB每一次写入,都是先写WAL,然后写Memtable,这次我们就来分析下MemTable的实现.
在RocksDB中,每个ColumnFamily都有自己的Memtable,互不影响.而在RocksDB中Memtable有多种实现(SkipList/HashSkipList/HashLinkList/Vector),具体的区别可以看这里 ,我们这次主要来分析默认的实现skiplist(只有skiplist是可以并发插入的).
实现 首先从创建Memtable开始,Memtable的创建(ColumnFamilyData::CreateNewMemtable)是在创建ColumnFamily(VersionSet::CreateColumnFamily)的时候创建的.这里就是创建memtable,然后设置到ColumnFamilyData的mem_域中.
1 2 3 4 5 6 7 8 9 10 11 12 13 MemTable* ColumnFamilyData::ConstructNewMemtable( const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) { return new MemTable(internal_comparator_, ioptions_, mutable_cf_options, write_buffer_manager_, earliest_seq, id_); } void ColumnFamilyData::CreateNewMemtable( const MutableCFOptions& mutable_cf_options, SequenceNumber earliest_seq) { if (mem_ != nullptr ) { delete mem_->Unref(); } SetMemtable(ConstructNewMemtable(mutable_cf_options, earliest_seq)); mem_->Ref(); }
上面所提及的,RocksDB有多种MemTable的实现,那么它是如何来做的呢,RocksDB通过memtable_factory来根据用户的设置来创建不同的memtable.这里要注意的是核心的memtable实现是在MemTable这个类的table_域中.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 MemTable::MemTable: table_(ioptions.memtable_factory->CreateMemTableRep( comparator_, &arena_, ioptions.prefix_extractor, ioptions.info_log, column_family_id)), class MemTableRepFactory { public : virtual ~MemTableRepFactory() {} virtual MemTableRep* CreateMemTableRep (const MemTableRep::KeyComparator&, Allocator*, const SliceTransform*, Logger* logger) = 0 ; virtual MemTableRep* CreateMemTableRep ( const MemTableRep::KeyComparator& key_cmp, Allocator* allocator, const SliceTransform* slice_transform, Logger* logger, uint32_t ) { return CreateMemTableRep(key_cmp, allocator, slice_transform, logger); } ........................
然后最后会调用对应的实现的CreateMemTableRep方法,这里我们就来看SkipList的实现.
1 2 3 4 5 MemTableRep* SkipListFactory::CreateMemTableRep( const MemTableRep::KeyComparator& compare, Allocator* allocator, const SliceTransform* transform, Logger* ) { return new SkipListRep(compare, allocator, transform, lookahead_); }
最终就是创建SkipListRep对象,在这个对象里面会创建SkipList(class InlineSkipList).
1 2 3 4 5 6 7 8 9 10 11 12 class SkipListRep : public MemTableRep { InlineSkipList<const MemTableRep::KeyComparator&> skip_list_; ................................... public : explicit SkipListRep (const MemTableRep::KeyComparator& compare, Allocator* allocator, const SliceTransform* transform, const size_t lookahead) : MemTableRep(allocator), skip_list_(compare, allocator), cmp_(compare), transform_(transform), lookahead_(lookahead) {}
这里SkipList就不分析实现了,如果对这个数据结构不了解的,可以去WIKI 看一下.这里我们只需要知道最终所有的memtable数据都是保存在SkipList中就可以了.
在之前的分析中我们知道Memtable的插入是通过WriteBatch然后遍历ColumnFamily来插入的,而最终则是会调用MemTable::Add这个函数.
1 2 3 4 5 6 7 8 9 10 bool MemTable::Add(SequenceNumber s, ValueType type, const Slice& key, const Slice& value, bool allow_concurrent, MemTablePostProcessInfo* post_process_info) { bool res = table->InsertKeyConcurrently(handle); if (UNLIKELY(!res)) { return res; } .............................. }
最终会调用InlineSkipList来对数据进行插入.
1 2 3 4 5 6 7 8 9 template <class Comparator >bool InlineSkipList <Comparator>: :InsertConcurrently(const char * key) { Node* prev[kMaxPossibleHeight]; Node* next[kMaxPossibleHeight]; Splice splice; splice.prev_ = prev; splice.next_ = next; return Insert<true >(key, &splice, false ); }
看到这里或许会有疑问了,那就是skiplist里面只有key,而RocksDB是一个KV存储,那么这个KV是如何存储的呢,这里是这样的,RocksDB会将KV打包成一个key传递给SkipList, 对应的KEY的结构是这样的.
1 2 3 4 5 // Format of an entry is concatenation of: // key_size : varint32 of internal_key.size() // key bytes : char[internal_key.size()] // value_size : varint32 of value.size() // value bytes : char[value.size()]
而数据的格式化就在之前的MemTable::Add中实现的.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 uint32_t key_size = static_cast <uint32_t >(key.size()); uint32_t val_size = static_cast <uint32_t >(value.size()); uint32_t internal_key_size = key_size + 8 ; const uint32_t encoded_len = VarintLength(internal_key_size) + internal_key_size + VarintLength(val_size) + val_size; char * buf = nullptr ; std ::unique_ptr <MemTableRep>& table = type == kTypeRangeDeletion ? range_del_table_ : table_; KeyHandle handle = table->Allocate(encoded_len, &buf); char * p = EncodeVarint32(buf, internal_key_size); memcpy (p, key.data(), key_size); Slice key_slice (p, key_size) ; p += key_size; uint64_t packed = PackSequenceAndType(s, type); EncodeFixed64(p, packed); p += 8 ; p = EncodeVarint32(p, val_size); memcpy (p, value.data(), val_size);
而对于真正的KEY的解析是在SkipList的Comparator中实现的(compare_).下面的代码片段可以看到会解析出来真正的key,然后再进行查找以及插入.
1 2 3 4 5 6 bool InlineSkipList<Comparator>::Insert(const char * key, Splice* splice, bool allow_partial_splice_fix) { Node* x = reinterpret_cast <Node*>(const_cast <char *>(key)) - 1 ; const DecodedKey key_decoded = compare_.decode_key(key); ............................... }