RealTime Indexes
Preliminary Description
August 29, 2001
I. Terminology
A data base "Index" is a data structure that is used to isolate
the occurrence of one or more "Values" in the data base. For example,
an index for the data base field STATE would consist of all possible
values for STATE; for example, "ALABAMA", "ALASKA", "ARKANSAS", etc
Associated with each value is a "List" of records (or record numbers)
that contain that value.
The lists might look like this:
Value List
----- ---
ALABAMA: 5, 12, 74
ALASKA: 7
ARKANSAS: 6, 8, 22, 47, ...
etc
If a value has exactly one list item, it is called a "unique value";
e.g., ALASKA has exactly one list item.
Note that a list may be represented either as an array of numbers,
or as an array of bits.
The number of values contained in an index is called its "uniqueness".
For example, we would expect the uniqueness of STATE to be 50, assuming
no mis-spellings.
We also refer to the uniqueness of an index as its "cardinality".
In some cases, we are particularly interested in indexes with
"low-cardinality" (we will define "low" later).
We refer to the relative list size of a value as its "density".
We refer to the absolute list size of a value as its "population".
Note that a given index may contain both low-density and high-density
values.
Note also that we often use the terms density and population interchangably;
however, this is valid only for large data bases. For small data bases,
high density does not equate to high population.
II. Background
The RealTime Index feature was created to overcome inherent problems
with the existing indexing methodology. In particular, the simplistic
two level list structure required frequent and lengthy "spool" operations
when data was being created or modified. This made the existing
methodology unsuitable for real time data acquisition and massaging.
III. Description
The RealTime Index feature created two new data structures:
. RealTime Lists (RTLs) are lists similar to the existing lists
superimposed onto a tree structure. This allows fast index creation
for all density/population configurations; and it allows fast
index retrieval (i.e., querying) for low to medium density/population
configurations.
. RealTime Bitmaps (RTBs) are bitmaps that allow more efficient
index retrieval for high density/population configurations.
The user must declare in the data base schema which index(s) are to
be RealTime. The user can optionally specify additional parameters
to tune the RTI performance.
For each RealTime Index, two additional files are created when the
data base is created. The file names are the concatenation of the
data base name and the index name, plus extensions ".rtl" and ".rtb".
IV. DBD Schema Statements
A. REALTIME qualifier
The "REALTIME" qualifier can be added to the "KEY" declaration
to cause the relevant index to be a RealTime Index.
For example:
TD Customers 1000000 records
CustNo x(6) KEY REALTIME
...
TD Sales 10000000 records
CustNo x(6) KEY
...
Output from the INFO command shows a qualifier "(r)" to
indicate a RealTime Index.
B. UNIQUE statement
The UNIQUE statement can and should be used for RealTime Indexes
in the same way it is used for conventional indexes to optimize
performance and storage requirements.
C. UNIQUERTL statement
The UNIQUERTL statement is a new statement that allow the user
optionally to specify RTL parameters. The syntax is:
UNIQUERTL
specifies the number of empty nodes to be
created when the data base is created; this is useful primarily
for calculating disk storage requirements;
is a number from the set {128, 256, 512, 1024, 2048, 4096}
that specifies the RTL node size; the default is 4096.
The high number should be used whenever the average population
per value is greater than 1000. When the average population
per value is less than 1000, one of the smaller
values should be used.
For example,
UNIQUERTL LastName 0 0 256
uses 256 byte RTL blocks; this is optimum if the average population
per value is about 64.
D. UNIQUERTB statement
The UNIQUERTB statement is a new statement that allow the user
optionally to specify RTB parameters. The syntax is:
UNIQUERTB
specifies the number of empty nodes to be
created when the data base is created; this is useful primarily
for calculating disk storage requirements;
specifies the threshhold, as a percent of the
total number of records, where an RTL is to be converted
to an RTB; this feature is not yet implemented;
specifies which values are to be RTBs; this list
may be the literall ALL, to specify all values are to be RTBs;
or it may be a list in the form
{Value1, Value2, ...}
The values in the list may be enclosed in single or double
quotes if required.
For example,
UNIQUERTB state 0 0 {TX, NY, CA}
treats TX, NY and CA as RTBs; the others are treated as RTLs.
UNIQUERTB state 0 0 ALL
treats all states as RTBs.
V. Commands
A. FIND and MATCH
The FIND and MATCH commands are syntactically unchanged.
Functionally they will use the new index structures as needed.
B. STRUCTURE
The STRUCTURE command is syntactically unchanged; but one
restriction has been removed. The STRUCTURE/Cn command will
work on partially structured RealTime Indexes. In addition,
the STRUCTURE/P command is much faster.
C. LOAD
A new form of the LOAD command has been created to allow
data to be imported directly from memory. The prototype is:
void ddlLoadMem(CONTROL *ctl,
char *OptnString,
char *ArgString,
char *DataArray,
int DataItemSize,
long DataItemCount);
ddlLoadMem() should always be used with the /B option,
indicating fixed length records with no CRLF.
VI. Remaining Features
As of this writing the RTL (Realtime List) and RTB (Realtime Bitmap)
capabilities are operational and are producing spectacular results.
However, several planned features are incomplete. They are:
. Realtime Indexes in ATTACHed data bases;
. Locating RTL and/or RTB files on different logical units;
VII. Optimization
The highest rates for "load & structure" are achieved by batching
requests and using commands and command options that lend themselves
to maximum thruput. As would be expected, the overall thruput is
directly related to the size of the batch; as the batch size increases
past a certain threshhold, the rate of improvement decreases and
approaches a steady state. In addition, the higher the cardinality
of the data, the higher the steady state batch size becomes.
This is especially true for high values of "Low Cardinality"
STRUCTURE/C. For example, STRUCTURE/C of 100,000 records with
cardinality 10,000 would be, relatively speaking, very slow.
In this case, it would be better to use STRUCTURE/P.
We can generalize this concept by saying that when the ratio
/
is less than 100, it is better to use STRUCTURE/P rather than
STRUCTURE/C.
In general, the technique for optimizing "load and structure"
thruput is:
1) Batch load the data [with ddlLoadMem(), for example];
2) Batch structure the incremental data;
In particular, consider a data base with three low cardinality keys
and one high cardinality key. Consider further that we wish to
batch process 100,000 records at a time. We could do the following:
. Create a temporary file containing the 100,000 records;
. ddlLoad(ctl,"/","TransActions TransActionsRD TransActionsFile.dat");
. ddlStructure(ctl,"/C100","Field1 Field2 Field3");
. ddlStructure(ctl,"/","Field4");
Another way to approach the same example is:
. Batch 100,000 records into a memory array
. ddlLoadMem(ctl,"/B","TransActions TransActionsRD DummyFileName",...);
. ddlStructure(ctl,"/C100","Field1 Field2 Field3");
. ddlStructure(ctl,"/","Field4");
VII. Benchmarks
A. Unbatched Insert and Structure in 80,000,000 record data base
Record size 24 characters (All RTL)
Three low cardinality keys (10, 10, 1200) and one high cardinality
(10000) key
. 100,000 records / 197 sec == 507 records per second
B. Batched Insert and Structure in 80,000,000 record data base
Record size 24 characters (All RTL)
Three low cardinality keys (10, 10, 1200) and one high cardinality
(10000) key; separate STRUCTURE/P for high cardinality key
. 100,000 records / 11 sec == 9090 records per second
. 1,000,000 records / 85 sec == 11764 records per second
. 1,000,000 records / 84 sec == 11904 records per second
C. Batched Insert and Structure in 80,000,000 record data base
Record size 24 characters (All RTL)
Three low cardinality keys (10, 10, 1200) and one high cardinality
(10000) key; single STRUCTURE/C for all 4 keys
. Load & Structure 1,000,000 records & 4 fields
Load Structure
2001.05.23 00:13 00:51
2001.05.23 00:12 00:50
. Load & Structure 100,000 records & 4 fields
Load Structure
2001.05.23 00:01 00:33
D. Batched Insert and Structure in 80,000,000 record data base
Record size 24 characters (All RTL)
Three low cardinality keys (10, 10, 1200)
. Load & Structure 100,000 records & 3 fields
Load Structure
2001.05.23 00:01 00:02
2001.07.16 00:02 00:03
E. Batched Insert and Structure in 80,000,000 record data base (ships)
Record size 105 characters (All RTL)
One low cardinality key (3 chars, cardinality 100)
. 1,000 records / 4 sec (1,3) == 250
. 1,000 records / 1 sec (0,1) == 1000
. 10,000 records / 2 sec (1,1) == 5000
. 10,000 records / 1 sec (0,1) == 10000
. 100,000 records / 7 sec (5,2) == 14285
. 100,000 records / 5 sec (3,2) == 20000
. 1,000,000 records / 51 sec (44,7) == 19607
. 1,000,000 records / 42 sec (36,6) == 23809
One low cardinality key (6 chars, cardinality 100)
. 1,000 records / 1 sec (0,1) == 1000
. 1,000 records / 2 sec (0,2) == 500
. 10,000 records / 1 sec (0,1) == 10000
. 10,000 records / 1 sec (1,0) == 10000
. 100,000 records / 6 sec (4,2) == 16666
. 100,000 records / 5 sec (4,1) == 20000
. 1,000,000 records / 50 sec (43,7) == 20000
. 1,000,000 records / 42 sec (36,6) == 23809
One low cardinality key (10 chars, cardinality 100)
. 1,000 records / 2 sec (0,2) == 500
. 1,000 records / 2 sec (0,2) == 500
. 10,000 records / 1 sec (0,1) == 10000
. 10,000 records / 1 sec (1,0) == 10000
. 100,000 records / 6 sec (5,1) == 16666
. 100,000 records / 5 sec (4,1) == 20000
. 1,000,000 records / 50 sec (44,6) == 20000
. 1,000,000 records / 41 sec (35,6) == 24390
One low cardinality key (15 chars, cardinality 100)
. 1,000 records / 2 sec (0,2) == 500
. 1,000 records / 2 sec (0,2) == 500
. 10,000 records / 3 sec (1,2) == 3333
. 10,000 records / 2 sec (0,2) == 5000
. 100,000 records / 5 sec (4,1) == 20000
. 100,000 records / 5 sec (3,2) == 20000
. 1,000,000 records / 50 sec (44,6) == 20000
. 1,000,000 records / 42 sec (36,6) == 23809
Load 1,000,000 records with * 35/34
F. Structure 80,000,000 record data base
Record size 24 characters (Mixed RTL & RTB)
One low cardinality key (2 chars, cardinality 10)
. STRUCTURE/C 80,000,000 record RTB 07:17
. STRUCTURE/C 80,000,000 record RTL 07:47
. STRUCTURE/C 80,000,000 record RTB 07:26
. STRUCTURE/C 80,000,000 record RTL 07:38
. STRUCTURE/C 80,000,000 record RTB 06:52 (MODE TEMP C:\TEMP)
. STRUCTURE/C 80,000,000 record RTL 07:27 (MODE TEMP C:\TEMP)
G. Find 8,000,000 records in 80,000,000 record data base
Record size 24 characters (Mixed RTL & RTB)
One low cardinality key (2 chars, cardinality 10)
. FIND single RTB 00:01
. FIND single RTL 00:03
. FIND single RTB 00:00
. FIND single RTL 00:03
. FIND single RTB 00:01
. FIND single RTL 00:03
. FIND double (03,05) RTB 00:01
. FIND double (03,05) RTL 00:06
. FIND double (03,05) RTB 00:01
. FIND double (03,05) RTL 00:07
. FIND double (03 OR 05) RTB 00:01
. FIND double (03 OR 05) RTL 00:07
. FIND double (03 OR 05) RTB 00:03
. FIND double (03 OR 05) RTL 00:08
. FIND double (03 OR 05) RTB 00:03
. FIND double (03 OR 05) RTL 00:07
H. Batched Insert and Structure in 80,000,000 record data base
Record size 24 characters (One RTL, One RTB)
. Load & Structure 100,000 records & 2 fields
Load Struct RTB Struct RTL
2001.08.01 (4) 00:01 00:01 00:02
2001.08.01 (4) 00:01 00:01 00:02
. Load & Structure 1,000,000 records & 2 fields
Load Struct RTB Struct RTL
2001.08.01 (22) 00:13 00:05 00:04
2001.08.01 (21) 00:13 00:04 00:04
2001.08.01 (21) 00:12 00:04 00:05
Copyright © 2019 , WhamTech, Inc. All rights reserved. This
document is provided for information purposes only and the contents hereof are
subject to change without notice. Names may be
trademarks of their respective owners.