Q1. What load is on operational/transactional systems with WhamTech SmartData Fabric™ (SDF)?
A1. Load can be measured in terms of CPU, memory and disk input/output (I/O), and can vary widely between queries depending on the data, data schema, query complexity, joins, interim results, final results and more. Typically, with appropriately sized virtual or physical servers for WhamTech EIQ Adapters, there is almost zero CPU or memory load on data source systems and only minimal disk read load for results retrieval (see DETAILS below).
In general, unless index updates carry associated load, e.g., triggers instantiated on each database table, the only noticeable overhead is the results read from source systems. That overhead has been measured at less than 5% of the cost of running a similar conventional query directly on the source system, because a conventional query usually allocates user space and creates, uses and then drops temporary tables.
Load on operational/transactional systems is minimal for the following reasons:
- Index updates have a low overhead, associated with sequentially reading transaction logs, change logs, message queues or other means that are usually independent of the main data source systems (and remain available should those systems go down).
- Indexing and query processing are performed on independent servers with independent index storage. Occasionally, a complex query requires an interim read from the data source during execution (see next part), but WhamTech SDF administrators try to minimize this need by configuring indexed views and/or additional indexes.
- Results retrieval is usually a straightforward, low-level, typically sequential read from source systems at a GETREC() level. An example request would be "read records a, b and c from table 1, d and e from table 2, and f, g, h, i and j from table 3." Data source drivers, e.g., JDBC and ODBC, pass such requests through without allocating user resources or performing the operations usually associated with conventional queries.
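The pass-through style of read described above can be sketched as follows. This is illustrative only: GETREC() itself is WhamTech-internal, so a simple key-based SELECT through a standard driver stands in for it here, and the table and record names are hypothetical.

```python
import sqlite3

# Stand-in for a source system: one table with keyed records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id TEXT PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [("a", "rec-a"), ("b", "rec-b"), ("c", "rec-c"), ("z", "rec-z")])

def getrec(table, record_ids):
    """Fetch specific records by key: a low-level, pass-through read.

    No user workspace is allocated and no temporary tables are created;
    the driver simply returns the requested rows.
    """
    placeholders = ",".join("?" for _ in record_ids)
    sql = "SELECT id, payload FROM {} WHERE id IN ({})".format(table, placeholders)
    return conn.execute(sql, record_ids).fetchall()

# "Read records a, b and c from table 1" -- only those rows are touched.
rows = getrec("t1", ["a", "b", "c"])
print(rows)
```

The essential point is that the request names the exact records wanted, so the source does no query planning, joining or temporary-table work.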
Q2. What makes WhamTech SDF queries faster than other federated data systems and, in some cases, data source systems themselves?
A2. There will always be an inherent advantage when indexes and data coexist on the same system, as in most modern databases; however, if the right indexes are not available and/or query processing is not advanced, that advantage erodes.
There are many different types of indexes, from simple hash lookups to complex multi-branch tree structures, developed to reduce the number of layers a query has to traverse to resolve; however, CPU-consuming algorithms must run at each level to determine which branch the query should follow, slowing down query processing.
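The per-level cost described above can be made concrete with a minimal sketch. A binary search over a sorted array is the simplest stand-in for a tree lookup: the number of branch decisions grows with the number of levels traversed (roughly log2 of the record count), and each decision costs CPU.

```python
def lookup(sorted_keys, target):
    """Binary search that counts the branch decisions made at each level."""
    lo, hi, comparisons = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1            # one branch decision per level traversed
        if sorted_keys[mid] == target:
            return mid, comparisons
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

keys = list(range(1_000_000))
idx, comps = lookup(keys, 777_777)
print(idx, comps)   # at most ~20 decisions for a million keys
```

Multi-branch (wider) trees reduce the number of levels but make each per-level decision more expensive, which is the trade-off the paragraph above refers to.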
WhamTech SDF EIQ Adapters have always used what is widely regarded as the fastest way to query indexes (see DETAILS below). WhamTech applies the three "B"s of computing (balanced binary trees, bitmaps and Boolean operations) to indexing and query processing, making it very fast - in many cases, faster than native database indexing and query processing.
The main bottleneck that WhamTech SDF EIQ Adapters have to contend with is the time it takes to read, typically sequentially, raw results data from sources. That time consists of two major components: the size of the results dataset and the storage read time of the source system. Both can be greatly alleviated by:
• Use of indexed views, in particular for pre-aggregations, which are automatically used during query execution; this is especially useful for BI/reporting/analytics
• In some cases, index inversion instead of going to sources
• Making queries very specific and/or limited through constraints
• Staging results and delivering as needed
• Applying cutoffs that may be appropriate
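The first mitigation in the list, pre-aggregation through indexed views, can be sketched as below. The names are illustrative, not WhamTech APIs: a running aggregate is maintained as each update lands, so an aggregate query is answered from the view without re-reading the source.

```python
from collections import defaultdict

source_rows = []                      # stand-in for the source system
sales_by_region = defaultdict(float)  # the pre-aggregated "indexed view"

def ingest(region, amount):
    """Apply one change: update the source and the view incrementally."""
    source_rows.append((region, amount))
    sales_by_region[region] += amount   # near-real-time view maintenance

def total_sales(region):
    """Answered from the view: no scan of source_rows is needed."""
    return sales_by_region[region]

for rec in [("east", 10.0), ("west", 5.0), ("east", 2.5)]:
    ingest(*rec)

print(total_sales("east"))  # 12.5, served from the pre-aggregation
```

Because the view is updated per change rather than rebuilt, the aggregate query never touches the source, which is what removes both the results-size and storage-read-time costs.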
Balanced binary trees are recognized as the fastest way to query indexes, yet are usually dismissed in Database 101 as neither scalable nor updateable; however, WhamTech solved these problems and has:
• Balanced binary trees that scale to 10s of billions of records, terabytes (petabytes?) of data, and are real-time updateable
• Virtual or physical bitmap representations of intermediate and final query result-sets - these can be integer lists or actual bitmaps, depending on field data node-level "data density", and are real-time updateable
• Boolean and other arithmetic operations on virtual or physical bitmaps (some very creative)
Some modern RDBMSs now offer forms of indexing and query processing similar (although not identical) to WhamTech's, but while these tend to be scalable, they are not updateable and are intended for static BI/reporting/analytics rather than transaction processing or real-time BI/reporting/analytics.
Q3. How difficult is it to configure the various WhamTech SDF components?
A3. WhamTech SDF provides admin-friendly graphical user interface (GUI) configuration tools, with dropdown menus and options readily available. WhamTech also has an extensive admin user manual with detailed descriptions of the configuration tools and APIs for its adapters; similar user manuals will be made available for other WhamTech SDF components. As an indication of how easy it is to install, configure and use WhamTech SDF components, basic training usually takes less than two (2) days in total, typically structured as several hour sessions over five (5) days. Some of that time is used for a basic overview of WhamTech SDF's unique approach.
Q4. What type of people are needed to configure and set up WhamTech SDF solutions?
A4. Ideally, people familiar with the Extract, Transform and Load (ETL) tools associated with data warehousing and data marts; however, anyone familiar with database administration, data warehousing, data science or data marts should have no difficulty configuring and testing WhamTech SDF solutions.
Q5. What training is needed to configure and use the WhamTech SDF?
A5. Typically, it takes less than two days of training to learn the configuration and testing tools. Some customers deliberately learn on their own through online tutorials, with little or no support from WhamTech.
Q6. What drivers and query languages does the WhamTech SDF support?
A6. Query languages: SQL, PL/SQL and native TQL, plus others through conversion (e.g., OQL and SPARQL).
Q7. What level of data source access does a WhamTech SDF administrator need to configure and test a solution?
A7. For most data sources, regular ODBC/JDBC-level WhamTech SDF administrator access is sufficient - there is no need for higher, e.g., admin, access. The user should have read privileges sufficient for building indexes and for retrieving results data. Higher access may be needed for classified, confidential or private records/documents. This is different from end-user operational access through applications or middleware, where specific role-based access control (RBAC) will be in place.
Q8. What are the differences between data virtualization, federation and integration?
A8. These terms are defined well in the following article: http://www.b-eye-network.com/view/14815. They are three distinct terms; people often mistake data virtualization and federation for the same thing, but they differ, as the linked article explains. Another distinction is centralized vs. distributed: while some processing is distributed, many data virtualization and federation solutions ultimately centralize much of the work. WhamTech SDF is different in that data virtualization, federation and even some levels of integration processing are distributed, with processing pushed to the edge. The only higher-level, more centralized processes are for results data consolidation.
Q9. Does the HadoopEIQ™ HDFS Smart Connector add-on product require low level access to Hadoop file system (HDFS)?
A9. No. HadoopEIQ uses HBase row IDs to retrieve results from an HBase/Hadoop system. HadoopEIQ avoids any retrieval mechanism that touches Hadoop/HDFS directly and instead relies on the higher-level storage management system, in this case HBase. It could equally be Apache Hive, which works well with Apache Thrift, or any other higher-level storage management system that runs on top of Hadoop, over any file system that works under Hadoop.
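The retrieval pattern in A9 can be sketched as below. This is a simulation only: a dictionary stands in for HBase's row-key store (with a real cluster this would be the storage manager's own API, e.g., a per-row Get in HBase). The point is that results come back as a set of row-key lookups through the storage layer, never as a direct scan of HDFS files.

```python
# Simulated HBase table: row key -> column -> value.
hbase_table = {
    b"row-001": {b"cf:name": b"alice"},
    b"row-002": {b"cf:name": b"bob"},
    b"row-003": {b"cf:name": b"carol"},
}

def fetch_rows(table, row_keys):
    """Retrieve specific rows by key through the storage layer's API,
    without scanning the underlying (HDFS) files."""
    return {k: table[k] for k in row_keys if k in table}

result = fetch_rows(hbase_table, [b"row-001", b"row-003"])
print(sorted(result))
```

Because the adapter already knows which row keys satisfy the query (from its own indexes), only those rows are requested from the storage manager.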
Q10. Have you run the Hadoop aggregation benchmark test, Terasort, on WhamTech SDF EIQ Products?
A10. We can, of course, look into comparing Terasort, which is designed for Hadoop-based systems, with our EIQ Products, which index and process queries for distributed data sources, but the comparison may prove difficult. EIQ Products can either take advantage of aggregations that already exist in data stores or, where they do not, create and maintain pre-aggregated and pre-calculated indexes within the EIQ Adapters in near real time as the indexes are updated; these tend to be materialized views at the local EIQ Adapter level. Where a pre-aggregation does not exist in the EIQ Adapter indexes, EIQ Adapters can aggregate in batch mode, similar to a Terasort test on a Hadoop-based system.
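The two aggregation paths in A10 can be sketched side by side. The names are illustrative: one path maintains a pre-aggregated index incrementally as each update arrives (the near-real-time materialized-view case), and the other aggregates the full dataset in batch on demand (the Terasort-like fallback). Both paths yield the same answer; they differ only in when the work is done.

```python
from collections import defaultdict

records = [("east", 4), ("west", 7), ("east", 1), ("west", 2)]

# Path 1: pre-aggregated index maintained incrementally, per update.
incremental = defaultdict(int)
for region, qty in records:          # imagine these arriving over time
    incremental[region] += qty       # near-real-time index maintenance

# Path 2: batch aggregation over the full dataset when no
# pre-aggregation exists (the Terasort-like case).
def batch_aggregate(rows):
    out = defaultdict(int)
    for region, qty in rows:
        out[region] += qty
    return out

print(dict(incremental))             # {'east': 5, 'west': 9}
print(batch_aggregate(records) == incremental)  # True
```

The incremental path spreads the aggregation cost across updates, so query time is a lookup; the batch path pays the whole cost at query time, which is what a Terasort-style benchmark measures.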