EIQ Products™
WhamSearch™ (add-on): search for data, metadata and text in databases, files,
email, Web documents, blogs, social media, etc. Content- and
context-directed Intelligent Spider™. Used as an add-on to other
EIQ Products for structured, unstructured and semi-structured
data search.
Unlike most other text search, the unstructured text search option for WhamTech EIQ
Products is a specialized form of structured query processing.
For example, the indexes used for structured database queries are
the same as those used for unstructured text search on Web pages.
As a result:
- Structured data queries and unstructured text search resolve to the same data source pointers and can be combined in the same SQL query
- Numerical methods can be applied to text search, allowing word weighting, Boolean operations on bitmaps and scoring
- Information geometry and other text analytics that use integers instead of text are easier to implement; this also applies to sensor data
- Specialized queries allow more creativity: tokenization, encryption, language mapping, etc.
- Link mapping and link analysis can be used on any data, whether structured, extracted entities or text
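
The points above about shared data source pointers, Boolean operations on bitmaps and word weighting can be illustrated with a minimal sketch. It is not WhamTech's implementation: the toy corpus, the `year` field, the weights and every function name below are assumptions made for illustration, and plain Python integers stand in for bitmap indexes.

```python
from collections import defaultdict

# Hypothetical toy corpus: integer document IDs stand in for "data source pointers".
DOCS = {
    0: {"year": 2019, "text": "brown bears fishing in alaska"},
    1: {"year": 2021, "text": "fishing boats in alaska"},
    2: {"year": 2021, "text": "brown bears in a zoo"},
}

def build_text_bitmaps(docs):
    """Map each word to an integer bitmap; bit i is set if document i contains the word."""
    bitmaps = defaultdict(int)
    for doc_id, doc in docs.items():
        for word in doc["text"].split():
            bitmaps[word] |= 1 << doc_id
    return bitmaps

def build_field_bitmap(docs, field, predicate):
    """Bitmap of the documents whose structured field satisfies the predicate."""
    bitmap = 0
    for doc_id, doc in docs.items():
        if predicate(doc[field]):
            bitmap |= 1 << doc_id
    return bitmap

def bitmap_to_ids(bitmap):
    """Expand a bitmap back into an ascending list of document IDs."""
    ids = []
    doc_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(doc_id)
        bitmap >>= 1
        doc_id += 1
    return ids

def score(doc_id, weights, bitmaps):
    """Sum the weights of the query words that occur in the document."""
    return sum(w for word, w in weights.items()
               if (bitmaps.get(word, 0) >> doc_id) & 1)

text_bitmaps = build_text_bitmaps(DOCS)
recent = build_field_bitmap(DOCS, "year", lambda y: y >= 2020)

# Structured predicate AND text term, resolved on the same document IDs.
combined = recent & text_bitmaps["alaska"]
matches = bitmap_to_ids(combined)

# Weighted ranking over every document that contains at least one query word.
weights = {"alaska": 3.0, "bears": 2.0, "fishing": 1.0}
any_word = 0
for word in weights:
    any_word |= text_bitmaps.get(word, 0)
ranked = sorted(bitmap_to_ids(any_word),
                key=lambda d: score(d, weights, text_bitmaps),
                reverse=True)
```

Because the structured predicate and the text term both reduce to bitmaps over the same document IDs, combining them is a single bitwise AND, and scoring is ordinary arithmetic on integers.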
Status:

| # | Feature | Details | Example |
|---|---|---|---|
| 1 | ALL words indexed | Can find every single word in a document, even stop words. A form of steganography would be to communicate almost entirely in stop words, as most search engines would filter them out or not retain them in an index. | “To be or not to be, that is the question” consists entirely of stop words, except for “question”. |
| 2 | Stop words customized for specialist search | General stop words and specialized search application stop words use updateable dictionaries. | General: common English words like “a”, “the”, “to”, etc. Specialized: in a financial services search application, for example, terms like “earnings”, “quarter” and “revenue” are common. |
| 3 | Word Weighting | Use tags, font size, font attributes and word frequency to weight the importance of words and associated phrases in text. Advanced Weighting: stemming (see 4) and synonyms (applies stemming first) (see 5). | Interested in brown bears fishing in Alaska. High weight: Alaska. Medium weight: Brown bears. Low weight: Fishing. Any other word weighting order would produce a different result set or a different ranking of the result set. |
| 4 | Stemming algorithms | Porter stemming, with additional options for Metaphone and Soundex. Word Weighting can be applied to stems. | Fish, fishing, fished and fisher have the same stem: fish. |
| 5 | Synonyms | WordNet (public domain, extensive linguistics, available in foreign languages). Word Weighting can be applied to synonyms (applies stemming first). | Car, auto, motorcar and vehicle could all be included under a single weighted word: car. |
| 6 | Proximity | Adds a field to the words database that keeps track of position. Boolean combination determines proximity. Supports phrase searching. | A near-proximity search on “George Bush” would produce results with “George H. Bush” and “George W. Bush”, but exclude: “George of the Jungle, grew up in the African Bush”. |
| 7 | Highlighting search text | Confine to Word documents. Option to do it on retrieved docs or on saved text version, as per Google. | A non-proximity search on “George” AND “Bush” would include: “George of the Jungle, grew up in the African Bush”. |
| 8 | Document Summary | Use first x words in document. As an alternative, use High and Medium weighted words. Also incorporate advanced word weighting features (stemming as in 4, and synonyms as in 5). | |
| 9 | Context | Based on Northrop Grumman’s or other information geometry tool, for eDiscovery and commercial applications only (not government). Training on selected text develops highly accurate models, which are in turn used for finding similar text. Can be visually represented based on match probabilities and combined with link analysis. | A text search on birds yields a few documents that are used to develop a model on birds, which is refined in two or three steps, and further refined to birds of prey in another two or three steps, with very few false positives and no false negatives. |
| 10 | Ranking | Based on almost any criteria. | Context score, latest, oldest, most referred to, composite, etc. |
| 11 | Entity Extraction (add-on) | Based on the open source software GATE and integrated into EIQ Products. | Identifies “University of Texas” and indexes it as an organization; “Texas University” would be a different organization. Identifies “John E. Smith” as either a complete person name or broken down into first, middle and last names. |
| 12 | Link Analysis (add-on) | Can combine structured data, extracted entities and unstructured text. | |
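
Features 1, 2, 4, 5 and 6 in the table can be sketched together as a small positional index. This is an illustrative sketch only, not the product's code: the dictionaries are hypothetical, `naive_stem` is a crude suffix stripper standing in for Porter stemming (it is not the Porter algorithm), and `max_gap` is an assumed way to express "near".

```python
import re
from collections import defaultdict

# Hypothetical dictionaries; the source describes these as updateable.
STOP_WORDS = {"a", "the", "to", "be", "or", "not", "that", "is"}
SYNONYMS = {"auto": "car", "motorcar": "car", "vehicle": "car"}

def naive_stem(word):
    """Crude suffix stripping for illustration only; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase word tokens in their original order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(word):
    """Stem first, then map synonyms onto one canonical word."""
    stem = naive_stem(word.lower())
    return SYNONYMS.get(stem, stem)

def content_terms(query):
    """Query-time stop word filtering; the index itself keeps every word."""
    return [normalize(t) for t in tokenize(query) if t not in STOP_WORDS]

def build_positional_index(docs):
    """word -> {doc_id: [positions]}; all words are indexed, stop words included."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[normalize(token)][doc_id].append(pos)
    return index

def near(index, first, second, max_gap):
    """Doc IDs where `second` follows `first` within `max_gap` word positions."""
    first_hits = index.get(normalize(first), {})
    second_hits = index.get(normalize(second), {})
    result = []
    for doc_id in sorted(first_hits.keys() & second_hits.keys()):
        if any(0 < q - p <= max_gap
               for p in first_hits[doc_id]
               for q in second_hits[doc_id]):
            result.append(doc_id)
    return result

DOCS = {
    0: "George H. Bush",
    1: "George W. Bush",
    2: "George of the Jungle, grew up in the African Bush",
}
index = build_positional_index(DOCS)

near_hits = near(index, "George", "Bush", max_gap=2)
# Plain Boolean AND ignores positions, so it also returns document 2.
and_hits = sorted(index["george"].keys() & index["bush"].keys())
```

With these inputs, `near_hits` contains only the two documents where “Bush” follows “George” within two positions, while `and_hits` also includes the “George of the Jungle” document, matching the examples in rows 6 and 7.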
A screen shot from WhamSearch to select options for text search:
