LSH Hashtable core format
The doubly linked lists that represent the LSH hash tables have high overhead due to being heap allocated.
Instead we are making a compressed core format that encodes the t2 hashes and point indexes in efficient form requiring no pointers.
The on-disk format is currently:
{ [ T1 t1 [T2 t2 p+]+ ]* E}^L
The in-core format will be an array of 32-bit cells:
* -> pointer
* t1 32-bit hash
* t2 32-bit hash
* p. 30-bit pointID with 2 skip bits
* p 30-bit pointID without skip bits
* E end of hash-table row token
Retrieval from Hashtable Array
t1->[ t2 p. [p. | NULL.] [p+(1..15)] ]+(1..*) E
- t2 values are in sort order <
- pointID is divided into 18bit track and 12bit vector index
- two-bit skip counter over two points (dummy point if necessary)
- CONSIDER 64-bit pointID
SKIP TABLE
| MS 2 skip bits | point1 data 30 bits |
| LS 2 skip bits | point2 data 30 bits |
Used for retrieval from query_t2:
// *p is a pointer to the beginning of a hashtable row array
// The array consists of t2 hash keys and one or more point identifiers p for each hash key
// Retrieval is performed by generating a hash key query_t2 for query point q
// We identify the row that t2 is stored in using a secondary hash t1, this row is the entry
// point for retrieve_from_core_hashtable_array
#define SKIP_BITS (0xC0000000)
void G::retrieve_from_core_hashtable_array(Uns32T* p, Uns32T qpos){
Uns32T skip;
Uns32T t2;
Uns32T p1;
Uns32T p2;
CR_ASSERT(p);
do{
t2 = *p++;
if( t2 > H::t2 )
return;
p1 = *p++;
p2 = *p++;
skip = (( p1 & SKIP_BITS ) >> SKIP_BITS_RIGHT_SHIFT_LSB) + (( p2 & SKIP_BITS ) >> SKIP_BITS_RIGHT_SHIFT_MSB);
if( t2 == H::t2 ){
add_point_callback(calling_instance, p1 ^ (p1 & SKIP_BITS), qpos, radius);
if(skip--){
add_point_callback(calling_instance, p2 ^ (p2 & SKIP_BITS), qpos, radius);
while(skip-- )
add_point_callback(calling_instance, *p++, qpos, radius);
}
}
else
if(*p != LSH_CORE_ARRAY_END_ROW_TOKEN)
p = p + skip;
}while( *p != LSH_CORE_ARRAY_END_ROW_TOKEN );
}
Insertion code sketch
// Read Disk Format2 LSH into Core Format2 LSH
Uns32T G::unserialize_hashtable_row_coreFormat2(int fid, bucket** b){
...
bucket_insert_point_format2(b, H::t2, H::p);
...
}
// pp address of Hashtable row pointer ( hashTable[ j ] + t1 )
// t2 32-bit hash value
// p 30-bit point value
void H::bucket_insert_point_format2(bucket **pp, Uns32T t2, Uns32T p, Uns32T N){
if( *pp == NULL ){
*pp = new Uns32T[ N ];
(*pp)[0] = t2;
(*pp)[1] = p; // No skip bits required
(*pp)[2] = ET;
}else
__bucket_insert_point_format2(*pp, t2, p);
}
void __bucket_insert_point_format2(bucket *pp, Uns32T t2, Uns32T p ){
bucket* p = seek_t2_or_endpos(t2);
if( *p == ET )
// Make a new chain at this cell (p)
else
// Merge existing chain at this cell (p)
}
--
MichaelCasey - 15 Aug 2008