LINK INDEXES™

Link Indexes™ are an add-on option for EIQ Adapters™. In combination with normal content indexes, Link Indexes™ enable accelerated database joins, link mapping, degrees of separation queries, link analysis, virtual CDI-MDM, and discovery of physical data, logical and ontology models, among other capabilities.

Link Indexes™ change the way data is managed: they provide structure to content and acknowledge the "connectivity" of data, regardless of whether it is structured or unstructured, and regardless of type, format, location, system, cloud or non-cloud, access, etc. They could ultimately empower user-level application development.

HISTORY

WhamTech discovered the predecessor of Link Indexes™ through development work on its Web search engine during 2000 - 2001, when normal content indexes were used to capture hyperlinks in Web pages and documents.  By storing and indexing "from and to" hyperlinks, WhamTech established the following:

  • Indexing "from and to" links and inverting them to establish "to and from" links enabled backwards and forwards Web navigation and therefore lateral browsing, e.g., going to Web pages that are hyperlinked TO the current Web page, as well as the easier option of going to pages hyperlinked FROM the current Web page
  • Hyperlinked communities on the Web could be discovered and navigated, and Web pages filtered in or out, using WhamTech's powerful search (for more information, see WHAMSEARCH)
    • Hyperlink indexes added structure to content
  • Metadata from hyperlink indexes could be used to calculate how popular Web pages and documents were in general, and how popular and important (using social network analysis (SNA)) they were to specific communities
    • Could be used for ranking
  • All networks, regardless of structural complexity (hierarchical, relational, random and even 3D objects), could be represented by multiple pairs of links, like hyperlinks or primary key - foreign key pairs in relational databases
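
As a rough illustration of the first and third points above, the following Python sketch inverts a small "from and to" hyperlink index into a "to and from" index and counts inbound links. The page names and the `invert_links` helper are hypothetical, and the inbound count is only a crude popularity signal, not WhamTech's ranking method.

```python
from collections import defaultdict

# Hypothetical "from -> to" hyperlinks captured while indexing Web pages.
forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

def invert_links(forward):
    """Build a "to -> from" index from a "from -> to" index."""
    backward = defaultdict(set)
    for source, targets in forward.items():
        for target in targets:
            backward[target].add(source)
    return {page: sorted(sources) for page, sources in backward.items()}

backward_links = invert_links(forward_links)

# Backward navigation: pages that are hyperlinked TO c.html.
pages_linking_to_c = backward_links["c.html"]

# Inbound link counts per page (assumed here as a simple popularity measure).
inbound_counts = {page: len(sources) for page, sources in backward_links.items()}
```

Each entry in either index is just a pair of endpoints, which is why the same representation carries over to primary key - foreign key pairs.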

From WhamTech's experience with content, structure and ranking of very large data sets, an understanding of data-based solutions evolved, as illustrated in the diagram LI1 below:



LI1: WhamTech EIQ Products combine content, links and ranking

LINK INDEXES™ EVOLVED FROM HYPERLINK INDEXES 

In 2001, WhamTech started working on law enforcement and intelligence data access issues, and link analysis.  Over time, WhamTech extended hyperlink indexes to arrive at the current Link Indexes™ EIQ Product add-on, which can be described as follows:

  • Link Indexes™ are representations of links between data pointers to records/files/documents in the same or other data sources
    • Based on data/entity (collectively known as "entity") matches, e.g., PK-FK, or exact, fuzzy or algorithm match – these are direct or obvious relationships
    • As a consequence, all data in the same record, file or document/paragraph are connected – these are indirect or non-obvious relationships
  • Link Indexes™ can be combined, using Boolean operations, to represent networks through simple SQL; WhamTech added LINK (a specialized JOIN) and DOS (degrees of separation) to SQL
  • Link Indexes™ can be used in conjunction with content indexes for accelerating internal and external joins, degrees of separation queries, and link mapping and link analysis solutions, again, using simple SQL
    • Other applications are possible, such as federated and hybrid CDI-MDM

There are unlimited ways that Link Indexes™ can be built, including using hyperlinks and primary keys - foreign keys mentioned above, and also using various matches on structured data, extracted entities, context, words, predictive analytics algorithms, combinations of these, etc.  To maximize the value of links and minimize the number of links, Link Indexes™ should be built and maintained using high cardinality entities, i.e., entities that are either unique or have limited populations, such as person (defined as more than just a name), address, email, phone number, SSN, etc. Link Indexes™ can be stored at different levels to match content indexes or not, at individual data source, department, organization or regional levels, combinations of levels or as one at a central level.  Regardless of where they are built and maintained, like content indexes, Link Indexes™ are 100% contiguous across multiple data sources and levels.
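
The effect of cardinality on the number of links can be seen with a small count. The sketch below assumes, purely for illustration, that every pair of records sharing a value is linked; the column values and the `pairwise_link_count` helper are hypothetical.

```python
from collections import Counter

def pairwise_link_count(values):
    """Count record-to-record links if every pair sharing a value is linked."""
    counts = Counter(values)
    # n records sharing one value produce n * (n - 1) / 2 pairs.
    return sum(n * (n - 1) // 2 for n in counts.values())

# Hypothetical column values for six records.
ssn_values = ["111", "222", "222", "333", "444", "555"]  # nearly unique
state_values = ["TX", "TX", "TX", "TX", "CA", "CA"]       # few distinct values

ssn_links = pairwise_link_count(ssn_values)      # one pair shares an SSN
state_links = pairwise_link_count(state_values)  # many pairs share a state
```

A nearly unique entity produces few, meaningful links, while a low cardinality one produces many links that say little about any real relationship.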

The following diagram LI2 illustrates the basic differences between normal content indexes and Link Indexes™:
LI2: A content index is used to generate a Link Index - in this case, of self-joins within a single (table) index

In the above example, a simple name within a single (table) index is used to generate a Link Index through self-joins.  In some cases, there may be separate Link Indexes™ for the different types of links - in other cases, not.  The Link Index is an inversion of the content index in the sense that record numbers are made queryable, but not in the sense that content can be read from the Link Index like a normal inverted index.
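
A minimal sketch of this idea, assuming a single table with a name column: the content index maps each name to record numbers, and the Link Index maps each record number to the record numbers it shares a name with. The records and in-memory structures are hypothetical stand-ins, not WhamTech's storage format.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records in a single table: record number -> name.
records = {1: "SMITH", 2: "JONES", 3: "SMITH", 4: "SMITH", 5: "JONES", 6: "BROWN"}

# Content index: value -> record numbers (a normal inverted index).
content_index = defaultdict(list)
for rec_no, name in sorted(records.items()):
    content_index[name].append(rec_no)

# Link Index sketch: record number -> record numbers it is linked to.
# Built from self-joins on the content index; no content is stored here.
link_index = defaultdict(set)
for rec_nos in content_index.values():
    for a, b in combinations(rec_nos, 2):
        link_index[a].add(b)
        link_index[b].add(a)
```

Record 6 has no matching record, so it never appears in the Link Index, and the Link Index holds only record numbers, so names cannot be read back from it.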

LINK MAPPING - BUILDING AND MAINTAINING LINK INDEXES™

The process to build and maintain Link Indexes™ can run in parallel with building and maintaining content indexes or after content indexes are built.  The following two diagrams LI3 and LI4 illustrate the Link Index build and maintenance process:
LI3: Link Index™ build and maintenance process across four data source indexes

Note that it does not matter which Link Index™ initiates the link mapping process, as all links will eventually be captured.  The update record schema reflects the logical/standard data model mapping and not necessarily the physical record from the data source.  As can be seen, the logical/standard data model represents entities.  An example of a four data source link map is illustrated in the following diagram LI4:
LI4: An example of a four data source link map, built and maintained using internal, self and external joins

The above link mapping example results in the Link Indexes™ illustrated in the following diagram LI5, when built and maintained per data source:
LI5: Example of a distributed Link Index built from four data sources

As mentioned earlier, Link Indexes™ do not need to be distributed, and the following diagram LI6 is an illustration of a consolidated version of the above distributed Link Index:
LI6: Example of a consolidated Link Index built from four data sources
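
The relationship between the distributed and consolidated forms can be sketched as a simple merge. The example below assumes each node is identified by a (data source, record number) pair and that each data source holds the links for its own records; the sample links and the `consolidate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-source Link Indexes. Nodes are (data source, record number).
distributed = {
    "DS1": {("DS1", 1): {("DS2", 7)}},
    "DS2": {("DS2", 7): {("DS1", 1), ("DS3", 4)}},
    "DS3": {("DS3", 4): {("DS2", 7), ("DS4", 9)}},
    "DS4": {("DS4", 9): {("DS3", 4)}},
}

def consolidate(per_source):
    """Merge per-source Link Indexes into one index by taking set unions."""
    merged = defaultdict(set)
    for index in per_source.values():
        for node, neighbors in index.items():
            merged[node] |= neighbors
    return dict(merged)

consolidated = consolidate(distributed)
```

Because node identifiers already include the data source, the same links are reachable whether the index is kept per source or merged.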

LINK ANALYSIS 

To illustrate working with Link Indexes™, the above consolidated Link Index is simplified and a query submitted at a lower entity level to determine if two entities belonging to two nodes (e.g., records in separate databases) are connected in any way.  These nodes may contain multiple entities that are inherently connected as being part of the same record.  An example could be a link query: "Does a person with a particular SSN have any connection to a vehicle with a particular VIN?"  WhamTech passes the appropriate SQL query to an EIQ Federation Server.  The degrees of separation (DOS) can be specified or not for the solution - in this example, the DOS is not specified.  The example Link Index is simplified and the mentioned link query is represented in the following diagram LI7:             

LI7: Simplified representation of a consolidated Link Index, with source and target nodes specified

Entity queries on content indexes are used to initially isolate the source and target nodes, but thereafter, the process to determine the solution involves only the Link Index, or Link Indexes™ in the distributed case.  However, from a presentation point-of-view, entities associated with any and all nodes can be read from data sources and presented and interacted with visually, as per a conventional link analysis application - this process is discussed more below.  When seeking a solution using a Link Index™ or Link Indexes™, "walking the tree" (or "trees") from the source node on one side and the target node on the other, along with Boolean operations on bitmap subsets, results in a rapidly converging solution, as illustrated in the following diagram LI8:

LI8: Two solutions determined from the link analysis performed on the Link Index
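
The two-sided walk described above resembles a bidirectional breadth-first search. The following sketch is an assumption about how such a walk could look, not WhamTech's implementation: Python sets stand in for bitmaps, each link counts as one degree of separation, and the node names and the `degrees_of_separation` function are hypothetical.

```python
def degrees_of_separation(link_index, source, target, max_dos=None):
    """Return the smallest number of links joining source and target, or None."""
    if source == target:
        return 0
    seen_s, seen_t = {source}, {target}
    frontier_s, frontier_t = {source}, {target}
    hops = 0
    while frontier_s and frontier_t:
        if max_dos is not None and hops >= max_dos:
            return None
        # Expand the smaller frontier to keep the working sets small.
        if len(frontier_s) <= len(frontier_t):
            frontier_s = {n for node in frontier_s
                          for n in link_index.get(node, ())} - seen_s
            seen_s |= frontier_s
            hops += 1
            if frontier_s & seen_t:  # set intersection: the two walks met
                return hops
        else:
            frontier_t = {n for node in frontier_t
                          for n in link_index.get(node, ())} - seen_t
            seen_t |= frontier_t
            hops += 1
            if frontier_t & seen_s:
                return hops
    return None

# Hypothetical simplified Link Index with two paths between N1 and N6.
link_index = {
    "N1": {"N2", "N3"}, "N2": {"N1", "N5"}, "N3": {"N1", "N4"},
    "N4": {"N3", "N5"}, "N5": {"N2", "N4", "N6"}, "N6": {"N5"},
}

shortest = degrees_of_separation(link_index, "N1", "N6")
within_two = degrees_of_separation(link_index, "N1", "N6", max_dos=2)
```

Here the shorter path is N1-N2-N5-N6 and the longer one is N1-N3-N4-N5-N6; the sketch reports only the length of the shorter path and ignores any link metadata.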

Of the two solutions determined, the path in black is shorter than the one in gray; however, there could be additional metadata that favors the longer path solution, such as a higher probability of the links or entities, or a more recent timeline.  The link analysis process is performed in middleware, in the background, where the end-user is unaware of it, as it is a query function performed in multiple query engines associated with multiple EIQ Adapters™ for multiple data sources.

Any and all entities associated with any and all nodes can be read from data sources and visually presented.  WhamTech currently includes interfaces for open source Prefuse (http://prefuse.org) and Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek), and is developing an interface for open source Gephi (http://gephi.org).  WhamTech has also reviewed and discussed interfacing with third-party commercial link visualization software.

An example of the transition from innate physical models to logical models is illustrated in the following diagrams LI9 and LI10:
LI9: Physical to logical data model, grouped on entities
LI10: Logical model, grouped on entities, combined with physical model
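
The physical-to-logical transition can be sketched with three records. The record contents below are hypothetical, chosen only to reuse the URN and entity labels from this section: the physical model keeps one link per pair of records, however many entities match, and the logical model groups records under the entities they contain.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records and the entities extracted from each.
records = {
    "URN1": {"PERSON1", "PHONE1", "EMAIL1"},
    "URN10": {"PERSON1", "PHONE1"},
    "URN100": {"PERSON1", "EMAIL1"},
}

# Physical model: one link per pair of records that share any entity.
physical_links = {}
for a, b in combinations(sorted(records), 2):
    shared = records[a] & records[b]
    if shared:
        physical_links[(a, b)] = sorted(shared)

# Logical model: group records under each entity they contain.
logical_model = defaultdict(list)
for urn in sorted(records):
    for entity in sorted(records[urn]):
        logical_model[entity].append(urn)
```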

Of note in the above, even though more than one link was discovered between URN1 and URN10, and between URN1 and URN100, only one link is represented in Link Indexes™ in each case.  There was more than one matching entity in each case: PERSON1 and PHONE1, and PERSON1 and EMAIL1, respectively.  Link Indexes™ capture the "physical" links between records/files/documents and represent "physical models", which can be visualized, but this is not normally how end-users interact with link networks.  An application running in middleware combines records/files/documents retrieved through content indexes with those retrieved through Link Indexes™ to group similar entities and provide "logical models", which are more familiar and can be visualized by almost any visualization software, as mentioned earlier.  The following diagram LI11 shows a screen shot from an interactive link analysis visualization interface built using Prefuse for Teracase, WhamTech's eDiscovery tool:

 LI11: An interactive link analysis visualization interface built using Prefuse for Teracase, WhamTech's eDiscovery tool

The end-user now has the power to:

  • Apply filters, ranges, probabilities and threat/favorability scores
  • Update automatically in near real-time within n degrees of separation (DOS)
  • Update manually and interactively, select, combine, separate, delete, expand and analyze
  • View original records/files/documents
  • View more detail
  • Execute social network analysis (SNA) calculations
  • Set alerts/notifications with thresholds and importance
  • Execute external data source queries
  • Etc.

All of this power can be made available through a highly interactive, visual and comprehensive link analysis solution built on EIQ Products that can scale across multiple large data sources and does not need data to be extracted to a database for analysis.  For more information, see solutions and the near-real-time updatable solutions, which allow for DOS-based alerts/notifications, e.g., "let me know when any new entities that appear in any data source link to my network within two DOS, with a probability > 80% and a threat score > 60%."  If a threat or favorability score is known for an entity, e.g., address, person, vehicle, etc., WhamTech can use guilt-by-association network techniques to estimate threat or favorability scores of other entities.  This, combined with the probabilities that links and entities are themselves accurate, the authority of data sources, social media analysis and social network analysis, creates a powerful analytics tool that could be used for more than just intelligence, e.g., virtual/hybrid CDI-MDM, marketing, fraud detection, anti-money laundering, predictive analytics, etc.
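
The alert rule quoted above can be read as a simple filter over newly linked entities. The sketch below is only an illustration of that rule: the `Candidate` fields, the thresholds expressed as fractions and the sample entities are all assumptions, and how DOS, probability and threat score are actually computed is outside its scope.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    entity: str          # hypothetical entity label
    dos: int             # degrees of separation from the user's network
    probability: float   # assumed probability that the link is accurate
    threat_score: float  # assumed threat score in the range 0 to 1

def should_alert(c, max_dos=2, min_probability=0.80, min_threat=0.60):
    """Apply the quoted rule: within two DOS, probability > 80%, threat > 60%."""
    return (c.dos <= max_dos
            and c.probability > min_probability
            and c.threat_score > min_threat)

# Hypothetical newly linked entities.
candidates = [
    Candidate("PERSON7", 2, 0.92, 0.75),   # meets all three conditions
    Candidate("VEHICLE3", 3, 0.95, 0.90),  # too many degrees of separation
    Candidate("ADDRESS4", 1, 0.70, 0.80),  # probability too low
    Candidate("PHONE9", 1, 0.85, 0.60),    # threat score not above 60%
]

alerts = [c.entity for c in candidates if should_alert(c)]
```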

Link Indexes™ can be further used for data discovery and to discover, merge, extend, validate and present ontologies.


 SmartData Fabric™:  UNLEASH the value of data.
