WhamSearch - search for all data sources

WhamSearch™ searches for data, metadata and text in databases, files, email, Web documents, blogs, social media, etc.  It includes a content- and context-directed Intelligent Spider™ and is used as an add-on to other EIQ products for structured, unstructured and semi-structured data search.

Unlike most other text search tools, the unstructured text search option for WhamTech EIQ Products is a specialized form of structured query processing.  For example, the indexes used for structured database queries are the same as those used for unstructured text search on Web pages.  As a result:

  • Structured data queries and unstructured text search resolve to the same data source pointers and can be combined in the same SQL query

  • Numerical methods can be applied to text search, allowing word weighting, Boolean operations on bitmaps and scoring

  • Easier to implement information geometry and other text analytics that use integers instead of text; this also applies to sensor data

  • Allows more creativity for specialized queries, such as tokenization, encryption and language mapping

  • Link mapping and link analysis can be used on any data, whether structured, extracted entities or text  
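
The points above can be illustrated with a minimal Python sketch. It is not WhamTech's implementation; it only assumes a toy index in which every indexed value, whether a structured field value or a word from a text, maps to an integer used as a bitmap of record pointers. The names `RECORDS`, `build_index` and `record_ids` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: one structured field plus free text.
RECORDS = [
    {"state": "AK", "text": "brown bears fishing in Alaska"},
    {"state": "TX", "text": "fishing on the Texas coast"},
    {"state": "AK", "text": "glacier tours in Alaska"},
]

def build_index(records):
    """Map each (kind, value) key to an integer bitmap of record ids."""
    index = defaultdict(int)
    for rec_id, rec in enumerate(records):
        bit = 1 << rec_id
        index[("state", rec["state"])] |= bit
        for word in rec["text"].lower().split():
            index[("word", word)] |= bit
    return index

def record_ids(bitmap):
    """Expand a bitmap back into a sorted list of record ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

index = build_index(RECORDS)
# A structured predicate and a text predicate resolve to the same pointers,
# so they combine with a single Boolean AND on the bitmaps.
hits = index[("state", "AK")] & index[("word", "fishing")]
print(record_ids(hits))  # prints [0]
```

Because both kinds of predicate end up as integers, OR and NOT follow the same pattern with `|` and a masked `~`.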

Status:

 

Feature

Details

Example

1

ALL words indexed

Can find every single word in a document, even stop words

A form of steganography would be to communicate almost entirely in stop words, as most search engines filter stop words out and do not retain them in an index

“To be or not to be, that is the question” consists entirely of stop words, except for “question”

2

Stop words customized for specialist search

General stop words and specialized search application stop words use updateable dictionaries

General stop words are common English words like “a”, “the”, “to”, etc. In a specialized financial services search application, for example, terms like “earnings”, “quarter” and “revenue” are so common that they can be treated as stop words.
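
Features 1 and 2 can be sketched together. The following is a minimal illustration, not the product's code: it assumes every word is kept at index time and that stop words are applied only at query time from updateable dictionaries. The dictionary contents and the function names `tokenize` and `query_terms` are hypothetical.

```python
import re

# Hypothetical, updateable stop-word dictionaries.
GENERAL_STOP_WORDS = {"a", "the", "to", "be", "or", "not", "that", "is"}
FINANCE_STOP_WORDS = {"earnings", "quarter", "revenue"}

def tokenize(text):
    """Lowercase the text and keep every alphanumeric word, stop words included."""
    return re.findall(r"[a-z0-9]+", text.lower())

def query_terms(query, *stop_dictionaries):
    """Drop only the words found in the dictionaries chosen for this search."""
    stop = set().union(*stop_dictionaries)
    return [term for term in tokenize(query) if term not in stop]

quote = "To be or not to be, that is the question"
print(tokenize(quote))                         # all ten words remain searchable
print(query_terms(quote, GENERAL_STOP_WORDS))  # only "question" survives
print(query_terms("Acme quarter earnings",
                  GENERAL_STOP_WORDS, FINANCE_STOP_WORDS))
```

Keeping the dictionaries separate from the index means a specialist application can add or remove stop words without re-indexing.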

3

Word Weighting

Use tags, font size, font attributes and word frequency to weight the importance of words and associated phrases in text

Advanced Weighting: Stemming (see 4) and synonyms (applies stemming first) (see 5)

Interested in brown bears fishing in Alaska:

  • High weight: Alaska

  • Medium weight: Brown bears

  • Low weight: Fishing

Any other weighting order would produce a different result set or a different ranking of the result set
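
The effect of the weighting order can be shown with a small Python sketch. The numeric weights, the sample texts and the `score` function are assumptions made for illustration; the source does not state how weights are combined.

```python
# Hypothetical weights for the brown-bears example: high, medium, low.
WEIGHTS = {"alaska": 3.0, "brown": 2.0, "bears": 2.0, "fishing": 1.0}
# The same words with the order reversed for Alaska and fishing.
SWAPPED = {"alaska": 1.0, "brown": 2.0, "bears": 2.0, "fishing": 3.0}

DOCS = [
    "fishing trips in florida",
    "brown bears in montana",
    "alaska cruise schedule",
    "brown bears fishing in alaska",
]

def score(text, weights):
    """Sum the weights of the distinct weighted words present in the text."""
    words = set(text.lower().split())
    return sum(weight for term, weight in weights.items() if term in words)

def rank(docs, weights):
    """Return the documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda doc: score(doc, weights), reverse=True)

for doc in rank(DOCS, WEIGHTS):
    print(score(doc, WEIGHTS), doc)
```

Calling `rank(DOCS, SWAPPED)` moves the Florida fishing text above the Alaska cruise text, which is the change in ranking the example describes.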

4

Stemming algorithms

Porter stemming, with additional options for Metaphone and Soundex phonetic matching

Word Weighting can be applied to stems

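Fish, fishing and fished have the same stem: fish. A more aggressive stemmer also reduces fisher to fish, although the original Porter rules leave it as fisher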
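
Of the options named for feature 4, Soundex is compact enough to sketch in full. The code below follows the standard American Soundex rules and is an illustration only; how the product implements or configures it is not stated in the source. Porter stemming and Metaphone are not shown.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the four-character Soundex code of a word ("" if it has no letters)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both R163
```

Words that share a code can then be indexed under that code, in the same way that words sharing a stem are indexed under the stem.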

5

Synonyms

WordNet, a freely available and extensive lexical database of English; comparable wordnets are available for other languages

Word Weighting can be applied to synonyms (applies stemming first)

Car, auto, motorcar and vehicle could all be included under a single weighted word: car
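
The order of operations for feature 5 (stem first, then map synonyms) can be sketched as follows. The synonym table is hand-written for illustration rather than taken from WordNet, and `toy_stem` is a deliberately trivial stand-in for a real stemmer.

```python
# Hypothetical synonym table: every entry maps to one weighted word.
SYNONYMS = {"auto": "car", "motorcar": "car", "vehicle": "car"}

def toy_stem(word):
    """Placeholder for a real stemmer: strips a trailing plural "s" only."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def canonical(word):
    """Stem the word first, then replace the stem by its canonical synonym."""
    stem = toy_stem(word.lower())
    return SYNONYMS.get(stem, stem)

print([canonical(w) for w in ["Car", "autos", "Motorcar", "vehicles", "bus"]])
```

A weight assigned to "car" then applies to every word that resolves to it.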

6

Proximity

Adds a field to the words database that keeps track of word position; Boolean combinations of positions determine proximity and support phrase searching

A near-proximity search on “George Bush” would produce results with “George H. Bush” and “George W. Bush”, but exclude “George of the Jungle, grew up in the African Bush”
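
A positional index makes the difference between a near-proximity search and a plain AND easy to see. This is a minimal sketch under the assumption that positions are simple word offsets; the function names and the `max_gap` parameter are hypothetical.

```python
import re
from collections import defaultdict

def positions(text):
    """Map each lowercase word to the list of word offsets where it occurs."""
    index = defaultdict(list)
    for offset, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(offset)
    return index

def near(text, first, second, max_gap=2):
    """True if `second` occurs at most max_gap words after `first`."""
    index = positions(text)
    return any(0 < q - p <= max_gap
               for p in index.get(first, [])
               for q in index.get(second, []))

def both_present(text, first, second):
    """Plain Boolean AND: both words occur somewhere in the text."""
    index = positions(text)
    return first in index and second in index

jungle = "George of the Jungle, grew up in the African Bush"
print(near("George H. Bush", "george", "bush"))  # True
print(near(jungle, "george", "bush"))            # False
print(both_present(jungle, "george", "bush"))    # True
```

Setting `max_gap=1` turns the same check into an exact phrase search.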

7

Highlighting search text

Confined to Word documents, with the option to highlight either the retrieved document or a saved text version, as Google does

A non-proximity search on “George” AND “Bush” would include: “George of the Jungle, grew up in the African Bush”

8

Document Summary

Use first x words in document

As an alternative, use High and Medium weighted words. Also incorporate advanced word weighting features (stemming as in 4, and synonyms as in 5)
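
Both summary options for feature 8 can be sketched in a few lines. The weights, the threshold and the function names are assumptions for illustration; stemming and synonym handling are left out.

```python
# Hypothetical word weights; 2.0 and above counts as Medium or High.
WEIGHTS = {"alaska": 3.0, "brown": 2.0, "bears": 2.0, "fishing": 1.0}

def first_words(text, x=10):
    """Summary option 1: the first x words of the document."""
    return " ".join(text.split()[:x])

def weighted_words(text, weights, threshold=2.0):
    """Summary option 2: distinct words at or above the threshold, in document order."""
    kept = []
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if weights.get(word, 0.0) >= threshold and word not in kept:
            kept.append(word)
    return kept

text = ("Brown bears gather in Alaska each summer. "
        "The bears spend weeks fishing for salmon.")
print(first_words(text, 5))
print(weighted_words(text, WEIGHTS))
```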

 

9

Context

Based on Northrop Grumman’s or another information geometry tool; for eDiscovery and commercial applications only (not government)

Training on selected text develops highly accurate models, which are in turn used for finding similar text

Can be visually represented based on match probabilities and combined with link analysis

A text search on birds yields a few documents that are used to develop a model on birds, which is refined in two or three steps, and further refined to birds of prey in another two or three steps with very few false positives and no false negatives

10

Ranking

Based on almost any criteria

Context score, latest, oldest, most referred to, composite, etc.
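
Ranking on interchangeable criteria amounts to sorting the same result set with different keys. The sketch below assumes hypothetical result fields (`context_score`, `modified`, `inbound_links`) and an arbitrary composite formula; none of these come from the product.

```python
from datetime import date

# Hypothetical result set with one field per ranking criterion.
DOCS = [
    {"id": "a", "context_score": 0.91, "modified": date(2011, 3, 1), "inbound_links": 4},
    {"id": "b", "context_score": 0.75, "modified": date(2012, 6, 15), "inbound_links": 12},
    {"id": "c", "context_score": 0.60, "modified": date(2009, 1, 20), "inbound_links": 7},
]

# Each criterion is a (key function, descending?) pair.
CRITERIA = {
    "context": (lambda d: d["context_score"], True),
    "latest": (lambda d: d["modified"], True),
    "oldest": (lambda d: d["modified"], False),
    "most_referred": (lambda d: d["inbound_links"], True),
    # Composite: an arbitrary blend of context score and link count.
    "composite": (lambda d: d["context_score"] + 0.05 * d["inbound_links"], True),
}

def rank(docs, criterion):
    """Return document ids ordered by the chosen criterion."""
    key, descending = CRITERIA[criterion]
    return [d["id"] for d in sorted(docs, key=key, reverse=descending)]

for name in CRITERIA:
    print(name, rank(DOCS, name))
```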

11

Entity Extraction (add-on)

Based on the open source software GATE and integrated into EIQ Products

Identifies “University of Texas” and indexes it as an organization - “Texas University” would be a different organization

Identifies “John E. Smith” as either a complete person name or broken down into first, middle and last names

12

Link Analysis (add-on)

Can combine structured data, extracted entities and unstructured text

 

A screen shot from WhamSearch to select options for text search:

 