DBX indexing internals
======================

Pagesize determines the size of each page in an index. Cachesize
determines the number of pages maintained in a cache when writing or
reading.

Pages are one of the following types, defined as a value in the page:

Value  Type       Description
    1  root       Root page
    2  internal   Internal page
    4  leaf       Leaf node
    8  bucket     Bucket node for string key
   16  overflow   Overflow page (not used)
   32  pribucket  Primary key bucket
   64  secbucket  Secondary key bucket
  128  numbucket  Numeric key bucket

Indexed key terms are stored in a single bucket. For the main keys
(id, ac, sv) the key is defined as a "hybrid key", a combination of
the key as a string and a possible list of duplicates with the same
key. The index data is stored in a bucket with the following fields:

  int dbno:        0..n number of the database source file
  int dups:        0 for a unique identifier, or number of duplicates
  long offset:     file position in sequence file
  long refoffset:  file position in reference file (if any - e.g. GCG);
                   now controlled by a setting in the cache file that
                   allows zero or more reference offsets, saving space
                   for simple flat files

Information about duplicates (entries with the same identifier) is
stored in numbuckets with the fields:

  long offset:     file position in sequence file
  long refoffset:  file position in reference file (if any - e.g. GCG)
  int dbno:        0..n number of the database source file

Note: the duplicate counts may begin to overflow for common terms -
e.g. top level terms in taxonomy - as the databases grow beyond 4
billion entries. At that point it will become necessary to increase
the field size, noting it in the text file.

For the secondary keys (key, des, org) the lists of matched IDs are
stored in a secbucket.
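As a rough illustration of the layout described above, here is a
minimal C sketch of the page type values and the per-entry fields of
the primary (hybrid key) and numeric (duplicate) buckets. The names
and exact field widths are assumptions for this sketch only and do
not claim to match the actual dbx/ajindex definitions; the field list
follows the int/long descriptions above.

    /* Illustrative sketch only - names and layout are assumptions,
     * not the actual dbx/ajindex definitions. */

    #include <stdint.h>

    /* Page type values as listed in the table above */
    enum DbxPageType
    {
        DBX_ROOT      = 1,    /* root page                  */
        DBX_INTERNAL  = 2,    /* internal page              */
        DBX_LEAF      = 4,    /* leaf node                  */
        DBX_BUCKET    = 8,    /* bucket node for string key */
        DBX_OVERFLOW  = 16,   /* overflow page (not used)   */
        DBX_PRIBUCKET = 32,   /* primary key bucket         */
        DBX_SECBUCKET = 64,   /* secondary key bucket       */
        DBX_NUMBUCKET = 128   /* numeric key bucket         */
    };

    /* One hybrid-key entry in a primary bucket */
    typedef struct DbxHybridEntry
    {
        int32_t dbno;      /* 0..n number of the database source file */
        int32_t dups;      /* 0 for a unique id, or count of duplicates */
        int64_t offset;    /* file position in the sequence file      */
        int64_t refoffset; /* file position in the reference file, if any */
    } DbxHybridEntry;

    /* One duplicate entry in a numbucket */
    typedef struct DbxNumEntry
    {
        int64_t offset;    /* file position in the sequence file      */
        int64_t refoffset; /* file position in the reference file, if any */
        int32_t dbno;      /* 0..n number of the database source file */
    } DbxNumEntry;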
Split indexes
=============

It would be convenient to split the indexing so that it could be run
as separate small jobs or on machines in a cluster with shared or
duplicated data files. There are two possible splits - by file or by
size - and possibly by both. The dbx* application could be told it is
doing a split index, told the part number, and told the files to be
indexed in some way - it could work them out by checking all the
input filenames and sizes, or it could be given a list of files and
start/end points.

First priority is to plan the merging of the resulting index files.
They could be discrete - for example an ID index where the
identifiers are unique will have no overlaps. Others could have
common terms in many or all input files - for example top level
taxonomy or some general words in descriptions.

We need to determine how many files can be open at a time. It is
clearly most efficient to process one index at a time, perhaps with a
master application generating the merge tasks for each index. On one
test system the maximum was 1021 fopen calls (plus presumably stdin,
stdout and stderr for a total of 1024). Printing the fileno confirms
this, as numbers 0, 1 and 2 are already taken. This would allow
considerable flexibility in merging as we could keep many subindex
streams open. We need to read one cache file and then one subindex
file for each job. Ideally the initial indexing job could attempt to
open the maximum number of files to check that the merge will be
possible, and perhaps adjust the split or, more likely, fail with a
message if there is an error in opening any of the files.

Possible parameters for splitting:

* by file (split by file number, many job sizes)
* by size (start at the beginning of file 1, split either on a new
  file when a limit is reached or within a file ... maybe after the
  start of the last entry. Splitting on a new file would be the
  simple option)
* by filename (tricky, would need a list of wildcards with all other
  files indexed at the end)

Basically, in each case there is a start and an end to test, and a
list of files to be processed. That list could be useful. When
merging, we should merge one index at a time, but there is no reason
not to use a master job that processes each index in succession -
using a select list of fields that can be set to 'all' for all
available fields, reporting any that are not found.

Merging
=======

Count the primary keys for each file, building a new index of primary
pages for them. It may be most efficient to run twice so that the
output is clean. Alternatively, an in-memory list may be best. We
need to be able to hold all primary keys in memory for wildcard ID
searches, for example. Then add the secondary pages, noting their
locations and updating the primaries.

There may be issues in sorting the primary keys to avoid repeated
searches through up to 1000 subindexes for each new identifier. Given
the inputs are already sorted, we could maintain a sorted list and
simply insert the next term from a subindex when one of its terms is
taken for merging ... knowing it must be in the same place or later,
so it may be an efficient list insert: test the next position and, if
it fits, insert there (see the sketch below). However, it could be
complicated by many common terms which all get deleted together.
Maybe reinsert the last one first, then there is never a surprise
downstream. We need only an array of replaced (intact) list nodes.
Note we need to fetch each node from its iterator as the next,
previous and first pointers may have changed. In fact - how safe are
we with multiple iterators from a single list? Maybe it is easiest to
deal directly with nodes, or to have functions that handle an array
of nodes.
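To make the sorted-insert reasoning concrete, here is a minimal,
self-contained C sketch of the merge step, using in-memory string
arrays in place of real subindex readers. The MergeNode type, the
insert_sorted helper and the toy accession strings are invented for
this sketch and are not the dbx merge code; a real merge would pull
the next term from each open subindex file and carry the bucket data
along with the key.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct MergeNode
    {
        const char **terms;        /* remaining terms in this stream      */
        size_t nleft;              /* how many terms are left             */
        struct MergeNode *next;    /* next stream in ascending term order */
    } MergeNode;

    /* Insert a node, keeping the list sorted by current head term */
    static MergeNode* insert_sorted(MergeNode *head, MergeNode *node)
    {
        MergeNode **p = &head;

        while(*p && strcmp((*p)->terms[0], node->terms[0]) <= 0)
            p = &(*p)->next;

        node->next = *p;
        *p = node;

        return head;
    }

    int main(void)
    {
        /* three toy "subindexes", each already sorted */
        const char *s1[] = {"AC0001", "AC0005", "AC0009"};
        const char *s2[] = {"AC0002", "AC0005", "AC0007"};
        const char *s3[] = {"AC0003", "AC0004", "AC0008"};

        MergeNode nodes[3] = {{s1, 3, NULL}, {s2, 3, NULL}, {s3, 3, NULL}};
        MergeNode *list = NULL;
        int i;

        for(i = 0; i < 3; i++)
            list = insert_sorted(list, &nodes[i]);

        /* take the smallest head term, then reinsert that stream with
         * its next term, which can only go back in at the same place
         * or later because each stream is already sorted */
        while(list)
        {
            MergeNode *node = list;

            list = list->next;
            printf("%s\n", node->terms[0]);   /* merged output */

            node->terms++;
            node->nleft--;

            if(node->nleft)
                list = insert_sorted(list, node);
        }

        return 0;
    }

Because the replacement term from a stream is never smaller than the
term just removed, scanning from the front of the list is already the
"same place or later" insert described above, and a run of common
terms simply comes off the front of the list in successive
iterations.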
Let's assume we can step through and count, and then step through and
write at the end when all the secondary buckets are done - or is that
not feasible? Assuming we can compress and uncompress a database, we
have to be able to store all the bucket offsets, old and new, in
memory and rewrite them.

Some examples would help to make things clearer. We can use
statistics from SRS for EMBL/GenBank and for UniProt as a guide to
the sizes of the memory structures needed, then create them and check
memory usage. It is likely that we should turn off the caches for the
input subindexes.

Latest release, UniProt fasta files:

  id:  24,564,446
  des: 265,082,562

First step - create subindexes for something. Let's start with the
three UniProt fasta files:

  index/varsplic  uniprot_sprot_varsplic.fasta
  index/sprot     uniprot_sprot.fasta
  index/trembl    uniprot_trembl.fasta

File fragmentation
==================

When extending the file, consider doubling the current size, or at
least a 50% increase, with a large start size. Aim to reduce the
number of fragments. Look for a way to report file fragmentation
along with cpu usage. Check the filefrag utility and the FIEMAP and
FIBMAP ioctls (see man filefrag; the ioctls themselves are not in
man, so check online for more information). Note filefrag is in man
section 8 (system administration commands).

Look for a way to defragment disk space on Fedora ... and is it done
automatically for free space? Check SELinux is turned off and find
out more about what it does.

Check what all the -stats outputs are for pages read/written and how
the numbers should add up. Are we missing anything? Did we skip
counting with some of the newer prim/sec page functions? Can we
report on the separate effectiveness of the primary and secondary
caches?

Run again with tracker-control -terminate to turn off the file
tracker, which was consuming 50% on two cpus after the run. Seemed to
be already much faster.

Need a suitable cache size - increase the defaults in
emboss.standard; see swissresource for suggestions.

Bug with extending the file ... when compressing, pages have zero
type, starting with the second page (offset 2048). Somewhere there
seems to be a read where the file is larger than the current totsize.
As it happens so early we can check on a short file. Looks like the
page at 2048 is never created! Check root page creation against
previous versions - we may have accidentally deleted a line.
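For chasing the zero-type pages, a quick stand-alone diagnostic can
walk an index file in pagesize steps and report any page whose type
value is zero. This is a sketch only: the assumption that the type is
a 4-byte integer at offset 0 of each page is purely illustrative, so
check the real page header layout in the source before trusting the
output, and the filenames used to run it are placeholders.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        FILE *fp;
        long pagesize;
        long offset = 0;
        unsigned char buf[4];
        unsigned int type;

        if(argc != 3)
        {
            fprintf(stderr, "Usage: %s indexfile pagesize\n", argv[0]);
            return 1;
        }

        pagesize = atol(argv[2]);
        if(pagesize <= 0)
        {
            fprintf(stderr, "Bad pagesize '%s'\n", argv[2]);
            return 1;
        }

        fp = fopen(argv[1], "rb");
        if(!fp)
        {
            fprintf(stderr, "Cannot open '%s'\n", argv[1]);
            return 1;
        }

        while(fseek(fp, offset, SEEK_SET) == 0 &&
              fread(buf, 1, 4, fp) == 4)
        {
            memcpy(&type, buf, 4);     /* native byte order assumed */

            if(type == 0)              /* zero type: page never written? */
                printf("page at %ld has zero type\n", offset);

            offset += pagesize;
        }

        fclose(fp);
        return 0;
    }

Run against a short test index (e.g. ./pagescan somedb.xid 2048, both
names being placeholders) to see where the zero-type pages start.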