EIQ Products™
“Over the next three years, analytics will mature along a third dimension, from structured and simple data analyzed by individuals to analysis of complex information of many types (text, video, etc…) from many systems supporting a collaborative decision process that brings multiple people together to analyze, brainstorm and make decisions.”
- Forrester 2012 Predictions
Link Indexes are an add-on option for EIQ adapters. In combination with normal content indexes, Link Indexes enable accelerated database joins, link mapping, degrees of separation queries, link analysis, virtual CDI-MDM, and discovery of physical data, logical and ontology models, among other capabilities. Link Indexes change the way data is managed, as they provide structure to content and acknowledge the "connectivity" of data, regardless of whether it is structured or unstructured, and regardless of type, format, location, system, cloud or non-cloud, access, etc., and could ultimately empower user-level application development.
HISTORY
WhamTech discovered the predecessor of Link Indexes through development work on its Web search engine during 2000 - 2001, when normal content indexes were used to capture hyperlinks in Web pages and documents. By storing and indexing "from and to" hyperlinks, WhamTech established the following:
- Indexing "from and to" links and inverting them to establish "to and from" links enabled backwards and forwards Web navigation, and therefore lateral browsing, e.g., going to Web pages that are hyperlinked TO the current Web page, as well as the easier option of going to pages hyperlinked FROM the current Web page
- Hyperlinked communities on the Web could be discovered and navigated, and Web pages filtered in or out, using WhamTech's powerful search (for more information, see WHAMSEARCH)
- Hyperlink indexes added structure to content
- Metadata from hyperlink indexes could be used to calculate how popular Web pages and documents were in general, and how popular and important (using social network analysis (SNA)) they were to specific communities
- Could be used for ranking
- All networks, regardless of structural complexity
(hierarchical, relational, random and even 3D objects) could be represented by
multiple pairs of links, like hyperlinks or primary key - foreign
keys in relational databases
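The inversion described in the first bullet above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not WhamTech's implementation: the page names and the `invert_links` helper are hypothetical, and the inbound-link count stands in for the far richer popularity and SNA measures mentioned above.

```python
from collections import defaultdict

def invert_links(from_to):
    """Invert a 'from -> to' link map into a 'to -> from' link map."""
    to_from = defaultdict(set)
    for source, targets in from_to.items():
        for target in targets:
            to_from[target].add(source)
    return dict(to_from)

# Hypothetical hyperlinks captured while indexing three Web pages.
from_to = {
    "a.html": {"b.html", "c.html"},
    "b.html": {"c.html"},
    "c.html": {"a.html"},
}

to_from = invert_links(from_to)

# Backward navigation: which pages are hyperlinked TO c.html?
pages_linking_to_c = sorted(to_from["c.html"])

# A crude popularity measure: the number of inbound links per page.
inbound_counts = {page: len(sources) for page, sources in to_from.items()}
```

Keeping both maps is what allows navigation in either direction: `from_to` answers "where can I go from here?" and `to_from` answers "who points here?".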
From WhamTech's experience with content, structure and ranking of very large data sets, an understanding of data-based solutions evolved, as illustrated in the diagram LI1 below:
LI1: WhamTech EIQ Products combine content, links and ranking
LINK INDEXES EVOLVED FROM HYPERLINK INDEXES
In 2001, WhamTech started working on law enforcement and intelligence data access issues, and on link analysis. Over time, it extended hyperlink indexes to arrive at the current Link Indexes EIQ Product add-on, which can be described as follows:
- Link Indexes are representations of links between data pointers to records/files/documents in the same or other data sources
- Based on data/entity (collectively known as "entity") matches, e.g., PK-FK,
or exact,
fuzzy or algorithm match – these are direct or obvious relationships
- As a consequence, all data in the same record, file or
document/paragraph are connected – these are indirect or
non-obvious relationships
- Link Indexes can be combined, using Boolean operations, to represent networks through simple SQL - WhamTech added LINK (a specialized JOIN) and DOS (degrees of separation) to SQL
- Link Indexes can be used in conjunction with content indexes for accelerating internal and external joins, degrees of separation queries, and link mapping and link analysis solutions, again, using simple SQL
- Other applications are possible, such as federated and hybrid CDI-MDM
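To make the idea of a degrees of separation (DOS) query concrete, the following Python sketch treats a Link Index as a set of pairs of record pointers and finds every record within a given number of links of a starting record. It is an illustration only: the record identifiers and the function names are hypothetical, and it does not attempt to reproduce the LINK or DOS SQL extensions, whose syntax is not described here.

```python
from collections import defaultdict, deque

def build_adjacency(link_pairs):
    """Store each link between two record pointers in both directions."""
    adjacency = defaultdict(set)
    for left, right in link_pairs:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return adjacency

def within_degrees(adjacency, start, max_dos):
    """Return {record: degrees of separation} for records within max_dos of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        record = queue.popleft()
        if distances[record] == max_dos:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency[record]:
            if neighbor not in distances:
                distances[neighbor] = distances[record] + 1
                queue.append(neighbor)
    return distances

# Hypothetical links between five records.
link_pairs = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R1", "R5")]
adjacency = build_adjacency(link_pairs)
reachable = within_degrees(adjacency, "R1", max_dos=2)
```

Here `R4` is three links away from `R1`, so it falls outside a two-DOS query.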
There are unlimited ways that Link Indexes can be built, including using hyperlinks and primary keys - foreign keys mentioned above, and also using various matches on structured data, extracted entities, context, words, predictive analytics algorithms, combinations of these, etc. To maximize the value of links and minimize the number of links, Link Indexes should be built and maintained using high cardinality entities, i.e., entities that are either unique or have limited populations, such as person (defined as more than just a name), address, email, phone number, SSN, etc. Link Indexes can be stored at different levels to match content indexes or not, at individual data source, department, organization or regional levels, combinations of levels or as one at a central level. Regardless of where they are built and maintained, like content indexes, Link Indexes are 100% contiguous across multiple data sources and levels.
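A quick calculation suggests why entities with limited populations keep the number of links manageable. Assuming, purely for illustration, that every pair of records sharing an entity value is linked, a value shared by n records contributes n(n-1)/2 links. The record counts below are made up for the example and are not measurements.

```python
from math import comb

def pairwise_links(records_sharing_value):
    """Number of record-to-record links if every pair sharing a value is linked."""
    return comb(records_sharing_value, 2)

# An SSN that appears in 3 records versus a city name that appears in 10,000.
ssn_links = pairwise_links(3)          # 3 links
city_links = pairwise_links(10_000)    # 49,995,000 links
```

The SSN yields a handful of highly meaningful links, while the city name yields tens of millions of links that say very little about any individual record.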
The following diagram LI2 illustrates the basic differences between normal content indexes and Link Indexes:
LI2: A content index is used to generate a Link Index - in this
case, of self-joins within a single (table) index
In the above example, a simple name within a single (table) index is used to generate a Link Index through self-joins. In some cases, there may be separate Link Indexes for the different types of links - in other cases, not. The Link Index is an inversion of the content index in the sense that record numbers are made queryable, but not in the sense that content can be read from the Link Index like a normal inverted index.
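The self-join described above can be sketched as follows. This is a simplified illustration, not the EIQ implementation: the content index is modeled as a plain dictionary from a name value to the record numbers containing it, and the names, record numbers and `build_link_index` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical content index for a NAME column: value -> record numbers.
content_index = {
    "JOHN SMITH": {1, 4, 7},
    "MARY JONES": {2},
    "ALAN BROWN": {3, 5},
}

def build_link_index(content_index):
    """Self-join: link every pair of records that share an indexed value."""
    link_index = defaultdict(set)
    for record_numbers in content_index.values():
        for left, right in combinations(sorted(record_numbers), 2):
            link_index[left].add(right)
            link_index[right].add(left)
    return dict(link_index)

# Record number -> linked record numbers. No names are stored here.
link_index = build_link_index(content_index)
```

Note that the resulting structure is keyed by record number and contains only record numbers, which mirrors the point above: record numbers become queryable, but the content itself cannot be read back from the Link Index.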
LINK MAPPING - BUILDING AND MAINTAINING LINK INDEXES
The process to build and maintain Link Indexes can run in parallel with building and maintaining content indexes or after content indexes are built. The following two diagrams LI3 and LI4 illustrate the Link Index build and maintenance process:
LI3: Link Index build and maintenance process across four data
source indexes
Note that it does not matter which Link Index initiates the link mapping process, as all links will eventually be captured. The update record schema reflects the logical/standard data model mapping and not necessarily the physical record from the data source. As can be seen, the logical/standard data model represents entities. An example of a four data source link map is illustrated in the following diagram LI4:
LI4: An example of a four data source link map, built and
maintained using internal, self and external joins
The above link mapping example results in the Link Indexes illustrated in the following diagram LI5, when built and maintained per data source:
LI5: Example of a distributed Link Index built from four data
sources
As mentioned earlier, Link Indexes do not need to be distributed and the following diagram LI6 is an illustration of a consolidated version of the above distributed Link Index:
LI6: Example of a consolidated Link Index built from four data
sources
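The relationship between distributed and consolidated Link Indexes can be illustrated with a short Python sketch. It assumes, hypothetically, that each node is identified by a (data source, record number) pair and that each data source stores the links for its own records; the identifiers and the `consolidate` helper are invented for the example.

```python
from collections import defaultdict

# Hypothetical per-source Link Indexes. Nodes are (data source, record number) pairs.
distributed = {
    "DS1": {("DS1", 1): {("DS2", 10)}},
    "DS2": {("DS2", 10): {("DS1", 1), ("DS3", 100)}},
    "DS3": {("DS3", 100): {("DS2", 10), ("DS4", 7)}},
    "DS4": {("DS4", 7): {("DS3", 100)}},
}

def consolidate(distributed):
    """Merge per-source Link Indexes into one index covering all sources."""
    consolidated = defaultdict(set)
    for source_index in distributed.values():
        for node, neighbors in source_index.items():
            consolidated[node] |= neighbors
            for neighbor in neighbors:
                consolidated[neighbor].add(node)  # keep every link symmetric
    return dict(consolidated)

consolidated = consolidate(distributed)
```

Either layout holds the same set of links; the choice affects where the index is stored and maintained, not which connections can be found.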
LINK ANALYSIS
To illustrate working with Link Indexes, the above consolidated Link Index is simplified and a query submitted at a lower entity level to determine if two entities belonging to two nodes (e.g., records in separate databases) are connected in any way. These nodes may contain multiple entities that are inherently connected as being part of the same record. An example could be a link query: "Does a person with a particular SSN have any connection to a vehicle with a particular VIN?" WhamTech passes the appropriate SQL query to an EIQ Federation Server™. The degrees of separation (DOS) can be specified or not for the solution - in this example, the DOS is not specified. The example Link Index is simplified and the mentioned link query is represented in the following diagram LI7:
LI7: Simplified representation of a consolidated Link Index,
with source and target nodes specified
Entity queries on content indexes are used to initially isolate the source and target nodes, but thereafter, the process to determine the solution involves only the Link Index, or Link Indexes in the case of distributed Link Indexes. However, from a presentation point of view, entities associated with any and all nodes can be read from data sources and presented and interacted with visually, as per a conventional link analysis application - this process is discussed more below. When seeking a solution using a Link Index or Indexes, "walking the tree" (or "trees") from the source node on one side and the target node on the other, along with Boolean operations on bitmap subsets, results in a rapidly converging solution, as illustrated in the following diagram LI8:
LI8: Two solutions determined from the link analysis performed
on the Link Index
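One way to picture "walking the tree" from both sides with Boolean operations on bitmaps is the sketch below. It is a generic bidirectional breadth-first search, offered as an assumption about the general technique rather than a description of WhamTech's query engine. Nodes are numbered from 0, each node's neighbors are held in a Python integer used as a bitmap, and all names are hypothetical. It reports only the shortest degrees of separation, not the paths themselves.

```python
def adjacency_bitmaps(node_count, edges):
    """Build one bitmap per node, with a bit set for each linked node."""
    adjacency = [0] * node_count
    for left, right in edges:
        adjacency[left] |= 1 << right
        adjacency[right] |= 1 << left
    return adjacency

def neighbors_of(frontier, adjacency):
    """OR together the adjacency bitmaps of every node set in `frontier`."""
    result = 0
    node = 0
    while frontier:
        if frontier & 1:
            result |= adjacency[node]
        frontier >>= 1
        node += 1
    return result

def degrees_between(adjacency, source, target):
    """Return the shortest degrees of separation, or None if unconnected."""
    if source == target:
        return 0
    seen_s = front_s = 1 << source
    seen_t = front_t = 1 << target
    degrees = 0
    while front_s and front_t:
        # Expand the source side by one step, then AND against the target side.
        front_s = neighbors_of(front_s, adjacency) & ~seen_s
        seen_s |= front_s
        degrees += 1
        if seen_s & seen_t:
            return degrees
        # Expand the target side by one step, then AND against the source side.
        front_t = neighbors_of(front_t, adjacency) & ~seen_t
        seen_t |= front_t
        degrees += 1
        if seen_s & seen_t:
            return degrees
    return None

# Two routes from node 0 to node 5: 0-1-5 and 0-2-3-5. Node 4 is isolated.
edges = [(0, 1), (1, 5), (0, 2), (2, 3), (3, 5)]
adjacency = adjacency_bitmaps(6, edges)
shortest = degrees_between(adjacency, 0, 5)
unconnected = degrees_between(adjacency, 0, 4)
```

Because each side only has to reach roughly half the distance, the two expanding sets meet after far fewer steps than a one-sided walk would need, and the overlap test is a single AND of two bitmaps.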
Of the two solutions determined, the path in black is shorter than the one in gray; however, there could be additional metadata that favors the longer path solution such as a higher probability of the links or entities, or a more recent timeline. The link analysis process is performed in middleware, in the background, where the end-user is unaware of it, as it is a query function performed in multiple query engines associated with multiple EIQ adapters for multiple data sources. Any and all entities associated with any and all nodes can be read from data sources and visually presented. WhamTech currently includes interfaces for open source Prefuse (http://prefuse.org) and Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek), and is developing an interface for open source Gephi (http://gephi.org). WhamTech has also reviewed and discussed interfacing with third-party commercial link visualization software. An example of the transition from innate physical models to logical models is illustrated in the following diagrams LI9 and LI10:
LI9: Physical to logical data model, grouped on entities
LI10: Logical model, grouped on entities, combined with physical model
Of note in the above, even though there was more than one link discovered between URN1 and URN10, and between URN1 and URN100, only one link is represented in Link Indexes. There was more than one matching entity in each case: PERSON1 and PHONE1, and PERSON1 and EMAIL1, respectively. Link Indexes capture the "physical" links between records/files/documents and represent "physical models", which can be visualized, but this is not normally how end-users interact with link networks. An application running in middleware combines records/files/documents retrieved through content indexes with those retrieved through Link Indexes to group similar entities and provide "logical models", which are more familiar and can be visualized by almost any visualization software, as mentioned earlier. The following diagram LI11 shows a screen shot from an interactive link analysis visualization interface built using Prefuse for Teracase™, WhamTech's eDiscovery tool:
LI11: An interactive link analysis visualization interface built using Prefuse for Teracase™, WhamTech's eDiscovery tool
The end-user now has the power to:
- Apply filters, ranges, probabilities and threat/favorability scores
- Update automatically in near real-time within n degrees of separation (DOS)
- Update manually and interactively, select, combine, separate, delete, expand and analyze
- View original records/files/documents
- View more detail
- Execute social network analysis (SNA) calculations
- Set alerts/notifications with thresholds and importance
- Execute external data source queries
- Etc.
And all of this power can be made available through a highly interactive, visual and comprehensive link analysis solution built on EIQ Products that can scale across multiple large data sources and does not need data to be extracted to a database for analysis. For more information, see SOLUTIONS - VIRTUAL LINK MAPPING AND VIRTUAL LINK ANALYSIS and the near-real-time updatable SOLUTIONS - LIVING NETWORKS, which allows for DOS-based alerts/notifications, e.g., "let me know when any new entities appear in any data source that link to my network within two DOS, with a probability > 80%, and carry a threat score > 60%." If a threat or favorability score is known for an entity, e.g., address, person, vehicle, etc., WhamTech can use guilt-by-association network techniques to estimate threat or favorability scores of other entities. This, combined with the probabilities of links and entities being accurate, authority of data sources, social media analysis and social network analysis, creates a powerful analytics tool that could be used for more than just intelligence, e.g., virtual/hybrid CDI-MDM, marketing, fraud detection, anti-money laundering, predictive analytics, etc.
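The DOS-based alert quoted above can be approximated with the following Python sketch. It rests on several assumptions that are not taken from the product: each link carries a single probability, only links above the probability threshold are followed, each entity has one threat score, and the entity names, scores and `alert_candidates` function are hypothetical.

```python
from collections import defaultdict, deque

def alert_candidates(links, watched, new_entities, threat_scores,
                     max_dos=2, min_probability=0.8, min_threat=0.6):
    """Return new entities that satisfy the DOS, probability and threat thresholds."""
    adjacency = defaultdict(set)
    for left, right, probability in links:
        if probability > min_probability:  # ignore low-confidence links
            adjacency[left].add(right)
            adjacency[right].add(left)

    # Breadth-first search outward from every watched node at once.
    distances = {node: 0 for node in watched}
    queue = deque(watched)
    while queue:
        node = queue.popleft()
        if distances[node] == max_dos:
            continue  # stop expanding at the DOS limit
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)

    return sorted(
        entity for entity in new_entities
        if entity in distances and threat_scores.get(entity, 0.0) > min_threat
    )

# Hypothetical links as (entity, entity, probability) and hypothetical scores.
links = [("P1", "P2", 0.95), ("P2", "P3", 0.90), ("P3", "P4", 0.99), ("P1", "P5", 0.50)]
threat_scores = {"P3": 0.7, "P4": 0.9, "P5": 0.95}
alerts = alert_candidates(links, ["P1"], {"P3", "P4", "P5"}, threat_scores)
```

In this example only `P3` triggers an alert: `P4` is three DOS from the watched network, and `P5` is reachable only through a link whose probability is below the threshold.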
Link Indexes can be further used for data discovery and to discover, merge, extend, validate and present ontologies. For more information, see DATA DISCOVERY AND SEMANTIC REASONING.