SmartData Fabric® Architecture

The SmartData Fabric® architecture is designed to plug and play in existing IT infrastructures, to be transparent to them, and to complement and leverage existing IT systems, tools and applications. Its unique index-based and conventional federation adapters leave data in its sources and guard it there, with no copying or moving of data. It enables seamless data discovery, security, quality, connections/links and master data management. Some of the attributes that facilitate deployment and usefulness are as follows:

Independently configurable, flexible, comprehensive and scalable
Allow for load balancing, and failover and backup through redundancy
Conform to standards, without imposing them on data source systems and without needing standard driver access to data source systems
Cleanse, transform and standardize data as it is indexed and results data is retrieved
Absorb almost all index and query processing load, relieving data source systems
Include both structured and unstructured data with multiple types of indexes and associated processes
Communicate with other EIQ Products™
Execute multiple query languages
Deploy on any x86 64-bit hardware or virtual machine, as installed software, on the Cloud or as a service
Run on Windows and Linux and, in the past, on non-x86 IBM Power Systems

EIQ Adapters™ are based on one set of code with configuration options for product variations. The following diagram, EIA1, illustrates the component stack for an EIQ SuperAdapter™, the most fully featured EIQ Product™:

EIA1: Illustration of the component stack for an EIQ SuperAdapter™.

INDEPENDENTLY CONFIGURABLE

One of the key design features of EIQ Products™ is that each is independently configurable. Indexes are 100% contiguous, self-describing and self-contained, regardless of the type, location, size, segmentation or number of data sources involved. An EIQ Product™ can be added to or removed from a network without extensive configuration and without impacting other EIQ Products™. This allows individual EIQ Products™ to have a specific local function and also be available for a larger-scale enterprise function. For example, an EIQ SuperAdapter™ could be used locally as an EIQ TurboAdapter™, where queries are submitted directly to the indexes without going through a standard data model, and as an EIQ SuperAdapter™ for a larger-scale enterprise data integration solution with a standard data view front-end. Each EIQ Product™ can be independently accessed through standard drivers and standard query language (SQL), with other query language options. As each EIQ Product™ is independently configurable, there are multiple variations in which EIQ Products™ can connect with each other and with data sources. The following diagrams EIA2 to EIA11 illustrate some of the possible variations:

EIA2: Basic configuration of one EIQ Federation Server™ accessing five data sources, with three EIQ SuperAdapters™, an EIQ ConventionalAdapter™ and a 3rd party adapter or middleware.

EIA3: Basic configuration with a combination of regional and local indexes, which remain 100% contiguous.

EIA4: Basic configuration with all indexes at a central level – no regional or local indexes. This may be a preferred deployment for a highly centralized organization, a SaaS or public cloud solution.

EIA5: A second EIQ Federation Server™ is added for some reason, for example, to provide department-only access to the two configured data sources, execute external joins across the two data sources, or shift load from the higher-level EIQ Federation Server™.

EIA6: A third EIQ Federation Server™ is added for some reason similar to those listed in the caption to diagram EIA5.

EIA7: The first and third EIQ Federation Servers™ are reconfigured for some reason, for example, to provide additional department-only access to three configured data sources, execute local external joins across the data sources, or balance load between the first and third EIQ Federation Servers™. Of note, a LIFO or FIFO rule must be set up to manage the same query hitting the middle data source EIQ SuperAdapter™ more than once. FIFO limits the number of queries submitted and improves query response. LIFO ensures the absolute latest results, which could be important in a real-time situation, but could take longer than FIFO.

EIA8: EIQ Products™ can be configured ad hoc, for example, through Web services or another metadata-based system. EIQ Products™ only need to be preconfigured where wanted.

EIA9: SaaS configurations can be a combination of DaaS (data) and IaaS (infrastructure), as well as applications, with the benefit of leaving data in sources behind organization firewalls. Indexes can be local to data sources behind organization firewalls, regional/departmental behind firewalls or outside firewalls in regional centers, or central in the SaaS environment. Data could also exist in a cloud and be indexed in the same cloud environment or elsewhere. For example, WhamTech has implemented a Hybrid Cloud for multiple organizations with multiple data sources, some of which are in the Cloud and others on premises, all exposed as data services running in a Private Cloud.

EIA10: It does not matter where indexes or data are physically located; however, the closer an index is to its associated data source, the better for index updates and query performance. HTTP socket communication is used among EIQ Products™.

EIA11: As indexes are 100% consistent regardless of the level of indexing, whether local, regional or central, it does not matter how the various EIQ Products™ are configured.

FLEXIBILITY

EIQ Products™ have many options that allow them to be used for different purposes, including multiple types of indexes, multiple query languages on the front-end, multiple number and types of standard data views, and add-ons of virtual advanced data discovery and classification, virtual data security, virtual event processing, virtual cybersecurity, virtual master data management, virtual reporting, BI and analytics, virtual link analysis and visualization, and other process components.

COMPREHENSIVE

EIQ Products™ can build indexes on almost any and all data, including structured, unstructured, semi-structured, archived, real-time memory, in diverse locations and on diverse systems, and for both different and similar purposes. The main reasons for this are the use of multiple parsers, update processes and components to accommodate diversity, parallel distributed architecture and scalability, which is discussed next.

SCALABILITY, AND LOAD BALANCING, AND FAILOVER AND BACKUP THROUGH REDUNDANCY

One of the benefits of independently configurable EIQ Products™ is that they can scale in various ways, allowing for high performance, load balancing, and failover and backup through redundancy. The following diagram EIA12 illustrates a potential configuration:

EIA12: Diagram illustrating a potential configuration for a critical data source with three sets of indexes, EIQ SuperAdapters™ and EIQ Federation Servers™, each/some of which could be in separate locations and on separate systems.

In diagram EIA12, apart from the first EIQ Federation Server™, there are three options at each level, giving 27 possible paths to execute a query. Any failed component has two others as backup. As is, the configuration serves as a load balancing option with built-in redundancy for failover and backup. Even if the data source system becomes unavailable, multiple indexes can be inverted and results data assembled from the indexes, although this process may take longer than retrieving results data directly from the data source.
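The path count follows from three levels with three interchangeable components each (3 × 3 × 3 = 27). The following is a minimal sketch of that arithmetic, not WhamTech code. The component names are hypothetical, and the assumption that the three levels are EIQ Federation Servers™, EIQ SuperAdapters™ and index sets is taken from the diagram caption.

```python
from itertools import product

# Hypothetical component names; the text only states three options per level.
LEVELS = {
    "federation_server": ["fs_a", "fs_b", "fs_c"],
    "super_adapter": ["sa_a", "sa_b", "sa_c"],
    "index_set": ["ix_a", "ix_b", "ix_c"],
}


def available_paths(levels, failed=frozenset()):
    """Return every query path that avoids the failed components."""
    healthy = [[c for c in comps if c not in failed] for comps in levels.values()]
    return list(product(*healthy))


all_paths = available_paths(LEVELS)
after_failure = available_paths(LEVELS, failed={"sa_b"})

print(len(all_paths))      # 27 paths with every component healthy
print(len(after_failure))  # 18 paths remain when one SuperAdapter fails
```

With one component down, the two remaining components at that level still carry every query, which is the failover behavior the paragraph describes.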

Diagram EIA12 assumes that a single set of indexes meets single query performance requirements; however, a single set of indexes may not be sufficient for larger data sources. (Multiple queries can be spread across the three sets of indexes.) An option is to segment indexes for a single data source, similar to another EIQ Product™, HadoopEIQ™, developed for a customer. Segmentation (aka sharding) in the HadoopEIQ™ example is chronologically based, as it would be for many transaction systems, and allows for indefinite scalability. Each segment can contain indexes for up to 2 billion records. Solutions may require multiple parallel segments to accommodate tens of billions of records per day. Other segmentation schemes may involve a key range such as last name or customer number, with indefinite scalability for each segment key range. See HadoopEIQ™ for more on how WhamTech manages simultaneous index segmentation and query processing.
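As a rough illustration of chronological segmentation, the sketch below assumes one segment per calendar month and uses hypothetical function names; the actual HadoopEIQ™ segment boundaries are not described here. It shows how the 2-billion-record limit translates into a number of parallel segments, and how a date-range query only needs to touch the segments that overlap the range.

```python
from datetime import date

SEGMENT_CAPACITY = 2_000_000_000  # records per segment, as stated above


def segments_needed(total_records: int, capacity: int = SEGMENT_CAPACITY) -> int:
    """Number of parallel segments needed to hold total_records (ceiling division)."""
    return -(-total_records // capacity)


def segment_key(record_date: date) -> str:
    """Assumed scheme: route each record to the segment for its calendar month."""
    return f"{record_date.year:04d}-{record_date.month:02d}"


def segments_for_range(start: date, end: date) -> list[str]:
    """List the monthly segments that a date-range query has to touch."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


print(segments_needed(30_000_000_000))  # 15 segments for 30 billion records
print(segment_key(date(2023, 11, 5)))   # 2023-11
print(segments_for_range(date(2023, 11, 20), date(2024, 1, 3)))
```

A key-range scheme (for example, by last name) would replace `segment_key` with a lookup into sorted range boundaries; the routing idea is the same.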

WhamTech’s index and query processing are inherently high performance and scalable. See technologies for more information on WhamTech’s unique index and query processing.

STANDARDS, WITHOUT IMPOSING THEM ON DATA SOURCE SYSTEMS AND WITHOUT NEEDING STANDARD DRIVER ACCESS TO DATA SOURCE SYSTEMS

WhamTech enables standard driver (ODBC, JDBC and Web/data services) and standard query language (SQL) access to any of the EIQ Products™ but, in the backend, makes use of the best driver access and the best low-level results data retrieval language available for individual data sources. In many cases, data sources have standard drivers and SQL available to EIQ Products™, but for others, WhamTech has had to use or build proprietary drivers and query languages to retrieve results data. One extreme example is using file pointers to access very large COBOL-generated mainframe VSAM files on the backend, while providing access to a modern BI application through ODBC and SQL on the frontend. See solutions for more information.

WhamTech can also expose a standard data view to calling applications that could be based on internal organization standards or public industry standards, such as the Association for Cooperative Operations Research and Development (ACORD), Fast Healthcare Interoperability Resources (FHIR), Health Level 7 (HL7) Reference Information Model (RIM), National Information Exchange Model (NIEM) and eXtensible Business Reporting Language (XBRL) data models. The index and results data mapping process to a standard data model or models (as more than one can be used) is part of an EIQ Adapter configuration. Data discovery and the subsequent data profiling, discussed next, support the data mapping and standardization processes.
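To make the idea of mapping to a standard data view concrete, here is a small sketch. The source names, column names and standard field names are all hypothetical, and the real mapping lives in EIQ Adapter configuration rather than in Python; the point is only that differently named source columns resolve to the same standard fields.

```python
# Hypothetical per-source mappings from native column names to standard-view fields.
SOURCE_TO_STANDARD = {
    "crm": {"CUST_NM": "person.full_name", "DOB": "person.birth_date"},
    "billing": {"name": "person.full_name", "birthdt": "person.birth_date"},
}


def to_standard_view(source: str, row: dict) -> dict:
    """Rename the mapped columns of one source row to standard-view field names."""
    mapping = SOURCE_TO_STANDARD[source]
    return {mapping[col]: value for col, value in row.items() if col in mapping}


crm_row = to_standard_view("crm", {"CUST_NM": "Jane Doe", "DOB": "1980-02-14", "REGION": "SW"})
billing_row = to_standard_view("billing", {"name": "Jane Doe", "birthdt": "1980-02-14"})

print(crm_row)
print(crm_row == billing_row)  # both sources now present the same standard fields
```

Unmapped columns such as `REGION` are simply left out of the standard view in this sketch; a second standard model would be another mapping table alongside the first.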

CLEANSE, TRANSFORM AND STANDARDIZE DATA AS IT IS INDEXED AND RESULTS DATA RETRIEVED

Data in real-world sources does not usually conform to uniform standards, nor is it free from typos, transpositions, missing or misplaced values, obfuscation, etc., unless strict data governance has been enforced from the beginning. Data usually requires some form of data profiling, and subsequent cleansing, transformation and standardization (CTS), to conform to an organization-wide or industry-wide standard data model.

To profile data and develop CTS transforms, WhamTech takes advantage of its fast indexing to build raw indexes and then uses the tree portion of the indexes to create data profiles. These serve as “before” data profiles for developing and testing CTS transforms, whose effectiveness is then reviewed against new “after” data profiles. In this way, it is relatively easy to develop CTS transforms using lookup tables, regular expressions, APIs, DLLs, etc., in Perl, Python and C/C++. It is far better for performance to execute CTS transforms local to the indexes than to follow the conventional approach of using a centralized data transformation server, as is typical for data warehousing. Slated for development is the capability to execute CTS transforms on the initial set of raw indexes instead of the current method of reading the data source again. This development would allow CTS indexes to be built from raw indexes.
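The before/after profiling loop can be sketched in a few lines of Python. This is an illustration only: a `Counter` of distinct values stands in for a profile derived from the index tree, and the lookup table, sample values and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lookup table used for standardization.
STATE_LOOKUP = {
    "tx": "TX", "tex": "TX", "texas": "TX",
    "ca": "CA", "calif": "CA", "california": "CA",
}


def profile(values):
    """Data profile: how often each distinct value occurs."""
    return Counter(values)


def cts_state(value: str) -> str:
    """Cleanse (trim, drop punctuation), transform (lower-case), standardize (lookup)."""
    cleaned = re.sub(r"[^a-z]", "", value.strip().lower())
    return STATE_LOOKUP.get(cleaned, "UNKNOWN")


raw = ["TX", "Texas", " tex.", "CA", "Calif.", "california", "??"]

before = profile(raw)                       # "before" profile: 7 distinct spellings
after = profile(cts_state(v) for v in raw)  # "after" profile: TX, CA, UNKNOWN

print(len(before), dict(after))
```

Comparing the two profiles shows at a glance whether the transform collapsed the variants as intended and how many values still fall through to `UNKNOWN`.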

ABSORB ALMOST ALL INDEX AND QUERY PROCESSING LOAD, RELIEVING DATA SOURCE SYSTEMS

WhamTech’s philosophy is to minimize any load on or interference with data source systems. Probably the most detrimental impact of conventional adapters for federated data access is the query processing load imposed on data sources, for example, transaction systems subject to BI/analytics queries. Transaction systems are not designed or configured for BI/analytics queries, which therefore impose a heavy load that can slow down or shut down such systems to the point that they are unusable for normal, potentially mission-critical transaction query processing. WhamTech takes advantage of existing data source system components. For example, one of the best means to update indexes is to use a provided, third-party or proprietary utility to read a database’s transaction/redo/change log, as it usually resides on a separate system or disk volume from the database itself. Therefore, sequentially reading the log to obtain the latest updates has a minimal impact on the database. WhamTech has identified at least twelve different options to update indexes, as illustrated in the following diagram EIA13:

EIA13: WhamTech has identified at least twelve different options to update indexes, some more preferred than others.
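Of these options, the change-log approach described above can be illustrated with a small sketch. The log entry layout (sequence number, operation, record ID, indexed value) and the function name are assumptions made for the example; real transaction/redo logs are read through database-specific utilities.

```python
from collections import defaultdict


def apply_change_log(index, log_entries, last_applied=0):
    """Apply log entries, assumed sorted by sequence number, to a value -> record-ID index.

    Returns the sequence number of the last entry applied, so that the next
    pass can skip everything it has already seen.
    """
    for seq, op, record_id, value in log_entries:
        if seq <= last_applied:
            continue  # already applied on an earlier pass
        if op == "insert":
            index[value].add(record_id)
        elif op == "delete":
            index[value].discard(record_id)
        last_applied = seq
    return last_applied


# Hypothetical log entries: (sequence, operation, record_id, indexed value)
log = [
    (1, "insert", 101, "smith"),
    (2, "insert", 102, "jones"),
    (3, "delete", 101, "smith"),
    (4, "insert", 103, "smith"),
]

name_index = defaultdict(set)
checkpoint = apply_change_log(name_index, log)
checkpoint = apply_change_log(name_index, log, last_applied=checkpoint)  # no-op replay

print(checkpoint, dict(name_index))
```

Because only the log is read, and only sequentially, the database itself is not queried while the index is brought up to date.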

WhamTech EIQ Products™ process data for indexing and queries as much as possible within indexes, separate from data sources. Some queries can be answered directly from indexes, such as whether specific results are available or not, result counts, pointers to results, or derived value indexes such as pre-aggregations or pre-calculations. As discussed earlier, results data can also be obtained from indexes through inversion without going to data sources, although this is not the normal process. The normal process is to process queries as much as possible within indexes to obtain result-set pointers to source data, and then use normal user access to pass the pointers to data source systems to obtain raw results data. Since EIQ Products™ only request relatively simple data reads from data sources, no significant queries are executed there. The overhead associated with reading raw results data is minimal, typically < 5% of normal query execution.
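The division of labor between index and data source can be sketched as follows. This is a toy model under stated assumptions: an in-memory byte stream stands in for the data source, byte offsets stand in for result-set pointers, and all names are hypothetical. It does not represent WhamTech's index format.

```python
import io
from collections import defaultdict

# Stand-in data source: three comma-separated records in a byte stream.
RECORDS = [b"1001,smith,TX\n", b"1002,jones,CA\n", b"1003,smith,CA\n"]
source = io.BytesIO(b"".join(RECORDS))


def build_index(stream, field):
    """Map each value of one field to the byte offsets of the matching records."""
    index = defaultdict(list)
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        value = line.decode().rstrip("\n").split(",")[field]
        index[value].append(offset)
    return index


def fetch(stream, offsets):
    """Simple pointer-based reads: the only work asked of the data source."""
    rows = []
    for offset in offsets:
        stream.seek(offset)
        rows.append(stream.readline().decode().rstrip("\n"))
    return rows


name_index = build_index(source, field=1)

count = len(name_index["smith"])           # result count answered from the index alone
rows = fetch(source, name_index["smith"])  # pointers passed to the source for raw data

print(count, rows)
```

The count query never touches the source, and the row retrieval asks it only for reads at known positions rather than for any search or join work.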

INCLUDE BOTH STRUCTURED AND UNSTRUCTURED DATA WITH MULTIPLE TYPES OF INDEXES AND ASSOCIATED PROCESSES

WhamTech EIQ Products™ do not differentiate between structured and unstructured indexes as they are basically the same. Experience has shown that either type of data can exist in almost any data source. Some data sources, such as databases, contain more structured than unstructured data, while unstructured data sources, such as Web pages, PDF files and emails, can contain structured data. Structured data items, or entities, are identified in unstructured data and extracted to structured indexes. Structured and unstructured data queries can be combined in a single query and with add-on products such as Link Indexes™ (link mapping and link analysis). Structured data has a number of specific associated types of indexes, including pre-aggregated, pre-calculated and pre-joined indexes, that are extremely useful for event processing, reporting, BI and analytics. Other types of indexes can be used for both structured and unstructured data, including fuzzy match indexes, Link Indexes™ and categorization indexes using third-party software. For unstructured data, there are a number of specific indexes that are listed under WhamSearch™. In addition, using the CTS transforms option, an EIQ Products™ administrator can specify APIs and DLLs to be used when generating indexes.
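As a simple illustration of extracting entities from unstructured data into a structured index, the sketch below pulls email addresses out of free text with a regular expression. The documents, the pattern and the function name are hypothetical, and production entity extraction would cover many more entity types.

```python
import re
from collections import defaultdict

# A deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical unstructured documents.
DOCS = {
    "doc1": "Contact jane.doe@example.com about the claim.",
    "doc2": "Forwarded by bob@example.org and jane.doe@example.com.",
}


def build_entity_index(docs):
    """Map each extracted email address to the IDs of the documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for match in EMAIL.findall(text):
            index[match.lower()].add(doc_id)
    return index


email_index = build_entity_index(DOCS)

print(sorted(email_index["jane.doe@example.com"]))  # ['doc1', 'doc2']
print(sorted(email_index["bob@example.org"]))       # ['doc2']
```

Once the entities sit in a structured index, they can be queried and joined like any other structured field, which is what allows structured and unstructured conditions to be combined in one query.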

COMMUNICATE WITH OTHER EIQ PRODUCTS™

EIQ Products™ communicate with each other through TCP/IP sockets that can be secured through SSL. Regardless of how they are configured, all EIQ Products™ are accessible by third-party applications/middleware through standard drivers and Web/data services, and SQL and other query languages.

EXECUTE MULTIPLE QUERY LANGUAGES

As discussed earlier, EIQ Products™ have been designed to accept SQL queries, but WhamTech is developing/adopting open source NoSQL, OQL and SPARQL translators. The key to fully enabling these non-SQL translators is making compatible representations of standard data views available.

DEPLOY ON ANY 64-BIT HARDWARE, VIRTUAL MACHINE OR CONTAINER, AS INSTALLED SOFTWARE, IN THE CLOUD OR AS A SERVICE

Currently, EIQ Products™ are deployed on 64-bit x86-based Windows and Linux, and RISC-based IBM Power systems, with optional 32-bit drivers for 32-bit applications, that can be installed as software on physical hardware, virtual machines (VMs) or containers, in the Cloud or as a service. In the past, the core relational index and query engine ran on IBM AIX, Mac and Solaris operating systems and machines, but these versions are no longer supported.

SmartData Fabric® EIQ Products™ are server-based and multi-tiered to scale and support multiple locations, systems and connections to multiple applications/middleware, multiple data sources and multiple other EIQ Products™
