Data discovery and semantic reasoning

Data exists in physical stores/models that are usually practical implementations of logical data models, which are, in turn, usually representations of ontologies, although they may not be recognized as such.  EIQ Products and the add-on option of Link Indexes™ provide a means to semi-automatically discover physical data models in multiple data sources, present a logical standard data model-based indexed view, and discover, merge, extend, validate and present a single ontology or multiple ontologies.  For unstructured data, other EIQ Product add-ons such as WhamSearch™, WhamEE™ (entity extraction), an information geometry tool and other knowledge management (KM) tools can be used.  WhamTech is developing a method to represent an enterprise ontology standard data model indexed view that accepts SPARQL queries.

LINK INDEXES AND DATA DISCOVERY

EIQ Products can index raw data to obtain data profiles from index trees, which are then used to develop data quality transforms.  Profiling runs on entire data sources if they are not prohibitively large or, failing that, on representative samples.  Indexes, in general, consist of sorted trees with nodes representing values and sorted lists or bitmaps representing pointers to the records/files/documents in which these values exist - see TECHNOLOGY for more information on this.  Customers learn a lot from data profiles and can combine data source metadata with manual matching of data profiles to relate data within the same data source, e.g., primary key - foreign key data, and across multiple data sources, e.g., persons and addresses.  There are third-party software vendors that offer semi-automatic data matching capabilities.  WhamTech is reviewing options for a semi-automatic statistical data matching method, but a method that may prove even better is to use raw data Link Indexes, where values are matched without any logical/standard data model mapping.  There are methods to accommodate differences in format within and across multiple data sources.  The timing for implementing semi-automatic statistical data matching depends on customer demand.
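The index, profile and raw-value matching ideas described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not WhamTech's implementation: the function names, the whitespace/case normalization rule and the sample records are all hypothetical, and a plain sorted dictionary stands in for the sorted tree.

```python
from collections import defaultdict

def build_index(records, field):
    """Map each value of `field` to the ascending list of record ids holding it."""
    index = defaultdict(list)
    for record_id, record in enumerate(records):
        index[record[field]].append(record_id)
    # Sorted keys stand in for the sorted tree of values.
    return dict(sorted(index.items()))

def profile(index):
    """Derive a simple data profile directly from the index."""
    counts = {value: len(ids) for value, ids in index.items()}
    return {
        "records": sum(counts.values()),
        "distinct": len(counts),
        "most_common": max(counts, key=counts.get),
    }

def normalize(value):
    """Hypothetical format rule: case-fold and collapse runs of whitespace."""
    return " ".join(value.lower().split())

def normalized_index(index):
    """Re-key an index by normalized value, merging record id lists."""
    merged = defaultdict(list)
    for value, ids in index.items():
        merged[normalize(value)].extend(ids)
    return merged

def link_values(index_a, index_b):
    """Match raw values across two indexes without any data model mapping."""
    norm_a, norm_b = normalized_index(index_a), normalized_index(index_b)
    shared = sorted(norm_a.keys() & norm_b.keys())
    return {v: (sorted(norm_a[v]), sorted(norm_b[v])) for v in shared}

# Hypothetical sample data for two sources.
ds2 = [
    {"name": "A. Smith", "address": "12 Oak St"},
    {"name": "B. Jones", "address": "7 Elm Ave"},
    {"name": "C. Lee", "address": "12 Oak St"},
]
ds3 = [
    {"plate": "XYZ123", "address": "12  OAK ST"},
    {"plate": "ABC999", "address": "99 Pine Rd"},
]

idx2 = build_index(ds2, "address")
idx3 = build_index(ds3, "address")
ds2_profile = profile(idx2)
links = link_values(idx2, idx3)
```

Here `links` pairs the DS2 records and DS3 records that share an address value even though the two sources format it differently, which is the kind of match a raw data link would record.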

LINK INDEXES AND ONTOLOGIES / SEMANTIC REASONING

The normal process is that individual data sources are discovered, profiled, mapped and then indexed.  Either during indexing or after normal content indexes have been initially built, Link Indexes are built and subsequently maintained along with normal content indexes.  Link mapping and the physical-to-logical model transitions used for degrees of separation queries, link analysis, etc., can be further used to discover, merge, extend, validate and present ontologies, some of which is currently a semi-automatic effort.  Fully automating this further use could be a future development option, depending on customer demand.

As an example of semantic reasoning, a subset of the link mapping and data source mappings illustrated in diagram LI4 under LINK INDEXES is shown in the following diagram DDSR1:


DDSR1: A subset of the link mapping and data source mappings illustrated in diagram LI4 under LINK INDEXES used as an example for ontology data models

The fields in each of the two example data sources above are mapped to a logical/standard data model.  Some assumptions have to be made about the reason or "predicate" (in semantic reasoning) for each mapping/relationship.  At this time, unless the predicate is automatically available in some form, it should come from someone who is familiar with the data source, e.g., a DBA.  The resultant schemas are illustrated in the following diagram DDSR2:


DDSR2: Data source schemas illustrated as standard data model-generated ontologies, with associated attributes - the predicates ("lives at", "registered at" and "owns") were provided by respective DBAs

Link mapping, as per diagram DDSR1, implies that the address at which a person lives in data source DS2 is the same address at which a vehicle is registered in data source DS3, as illustrated in the following diagram DDSR3:


DDSR3: The address at which a person lives in data source DS2 is the same address at which a vehicle is registered in data source DS3

From an ontological point of view, the addresses can be the same; therefore, the ontologies can be merged, as illustrated in the following diagram DDSR4:

 
DDSR4: Merged ontology based on a common address between two separate ontologies

Also, link mapping, as per diagram DDSR1, implies that the person in data source DS2 is the same person as in data source DS3, as illustrated in the following diagram DDSR5:


DDSR5: The person in data source DS2 is the same person as in data source DS3

From an ontological point of view, the persons can be the same; therefore, the ontologies can be merged, as illustrated in the following diagram DDSR6:


DDSR6: Merged ontologies based on a common person between two separate ontologies
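The two merge steps captioned DDSR4 and DDSR6 can be sketched as operations on subject-predicate-object triples.  This is a minimal illustration, not the product's method: the node names, the `merge_on_links` helper and the assignment of each predicate to a particular data source are assumptions made for the example.

```python
def merge_on_links(triples_a, triples_b, same_as):
    """Union two triple sets, rewriting linked nodes of B to their equivalents in A."""
    def canonical(node):
        return same_as.get(node, node)

    merged = set(triples_a)
    for subject, predicate, obj in triples_b:
        merged.add((canonical(subject), predicate, canonical(obj)))
    return merged

# Schema-level triples, with predicates as a DBA might supply them (assumed layout).
ds2_triples = {("DS2:Person", "lives at", "DS2:Address")}
ds3_triples = {
    ("DS3:Vehicle", "registered at", "DS3:Address"),
    ("DS3:Person", "owns", "DS3:Vehicle"),
}

# Step 1: link mapping says the DS3 address is the same concept as the DS2 address.
address_links = {"DS3:Address": "DS2:Address"}
merged_on_address = merge_on_links(ds2_triples, ds3_triples, address_links)

# Step 2: link mapping also says the DS3 person is the same concept as the DS2 person.
person_links = {**address_links, "DS3:Person": "DS2:Person"}
merged_on_person = merge_on_links(ds2_triples, ds3_triples, person_links)
```

After the first step the two ontologies share only the address node; after the second, the person node is shared as well, so a single person both "lives at" an address and "owns" a vehicle "registered at" that address.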

If all persons with addresses in data source DS2 also owned at least one vehicle, and vice versa, then the above merged ontology would be 100% representative.  However, there are persons in data source DS2 who do not own vehicles, and there are vehicle owners in data source DS3 who are not listed as residents in data source DS2.  Therefore, persons need to be subdivided into vehicle owners and non-vehicle owners, and the merged ontology extended, as illustrated in the following diagram DDSR7:


DDSR7: Merged ontology extended to allow for vehicle owners and non-vehicle owners
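The validation and extension step behind DDSR7 amounts to comparing the set of linked persons in each source.  The sketch below assumes hypothetical person identifiers that have already been resolved across DS2 and DS3; the class names and helper functions are illustrative only.

```python
def is_fully_representative(residents, owners):
    """True only if every resident owns a vehicle and every owner is a resident."""
    return residents == owners

def extend_person_classes(residents, owners):
    """Subdivide persons into vehicle owners and non-vehicle owners."""
    return {
        "Person": residents | owners,
        "VehicleOwner": set(owners),
        "NonVehicleOwner": residents - owners,
    }

# Hypothetical linked person ids.
ds2_residents = {"p1", "p2", "p3"}   # persons with addresses in DS2
ds3_owners = {"p2", "p3", "p4"}      # persons who own vehicles in DS3

representative = is_fully_representative(ds2_residents, ds3_owners)
classes = extend_person_classes(ds2_residents, ds3_owners)
owners_not_resident = ds3_owners - ds2_residents

# Schema-level extension: both new classes specialize Person, and only owners own vehicles.
extension_triples = [
    ("VehicleOwner", "subclass of", "Person"),
    ("NonVehicleOwner", "subclass of", "Person"),
    ("VehicleOwner", "owns", "Vehicle"),
]
```

With this sample data the merged ontology is not fully representative, because `p1` lives at an address but owns no vehicle and `p4` owns a vehicle but is not listed as a resident, so the subdivision is needed.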

The ontology or ontologies eventually discovered can be captured in Resource Description Framework (RDF) form, either serialized in XML or stored in an RDF file/database, and used to develop processes and applications with Business Process Management (BPM) software or similar.
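As a rough illustration of capturing such an ontology in RDF, the sketch below writes schema-level triples in the N-Triples text format using only the Python standard library.  N-Triples is used here instead of RDF/XML purely for brevity; the `http://example.org/ontology#` namespace, the term names and the `to_ntriples` helper are assumptions for the example, and a production system would more likely use an RDF library or triple store.

```python
BASE = "http://example.org/ontology#"  # hypothetical namespace
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

def to_ntriples(triples, base=BASE):
    """Serialize (subject, predicate, object) names as N-Triples lines.

    Predicates that are already full IRIs (starting with http) are kept as-is;
    every other name is placed in the base namespace.
    """
    def iri(name):
        return f"<{name}>" if name.startswith("http") else f"<{base}{name}>"

    lines = [f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in sorted(triples)]
    return "\n".join(lines)

ontology = [
    ("Person", "livesAt", "Address"),
    ("Vehicle", "registeredAt", "Address"),
    ("VehicleOwner", "owns", "Vehicle"),
    ("VehicleOwner", RDFS_SUBCLASS_OF, "Person"),
    ("NonVehicleOwner", RDFS_SUBCLASS_OF, "Person"),
]

document = to_ntriples(ontology)
ntriples_lines = document.split("\n")
```

Each line of `document` is one statement of the form `<subject> <predicate> <object> .`, which an RDF store could load and expose to SPARQL queries.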

Please contact WhamTech if you are interested in the above capabilities.

 