Entity Extraction Help
Version 8.0.0.490
Entity Extraction in EIQ Product Suite
Installation and Configuration Guidelines
Installing Prerequisite Software
Installing and Running WhamEE with JavaGateway
Configuring JavaGateway Properties
Installing and Running Standalone WhamEE
Configuring WhamEE Properties
Configuring GATE for Standalone WhamEE
Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns
Extract Entities and Build Indexes in RTI Tool
Configure the Virtual Data Source for Entity Queries
Query the Virtual Data Source
GATE Quick Start Guide
Running a Sample GATE Application
Step 1: Initialize GATE GUI
Step 2: Add Language Resources
Step 3: Load ANNIE System
Step 4: Run ANNIE application
Step 5: View ANNIE annotations in the documents
Adding a New Entity Type to GATE for Extraction
Step 1: Add JAPE rule for the new entity
Step 2: Add lookup values list for Gazetteer
Step 3: Add new entity name for extraction
Step 4: Extract new entities

Entity Extraction in EIQ Product Suite

The EIQ Product Suite comes with the following components to extract entities out of text content:
· WhamEE uses the open source software GATE (General Architecture for Text Engineering) for text analytics. See http://gate.ac.uk/ for more information on GATE.
Installation and Configuration Guidelines

Installing Prerequisite Software

· Download and install the latest version of the Java runtime. For advanced users who want to customize GATE:
  o Download the Java SDK and set the JAVA_HOME system variable to the JDK installation directory (e.g. C:\Program Files\Java\jdk1.6.0_17).
  Java downloads: http://www.java.com/en/download/index.jsp
· Download and install the latest version (6.0) of GATE. If possible, choose the Windows version of the installer, because this document refers to the Windows version of GATE. GATE provides a flexible, powerful, open source framework to process textual data and identify entities using customizable rules and lookup lists.
· Download and install the latest version of Apache Ant. Ant is used to run JavaGateway, which loads WhamEE and GATE.
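Before continuing, it can help to confirm that JAVA_HOME really points at a JDK. The following Python sketch is only an illustration and is not part of the EIQ Product Suite. The function name check_java_home is made up for this example, and the check assumes that a JDK keeps its compiler at bin/javac (bin\javac.exe on Windows).

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return True if JAVA_HOME is set and looks like a JDK directory.

    Assumption: a JDK exposes its compiler as bin/javac or bin/javac.exe.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False
    bin_dir = Path(java_home) / "bin"
    return any((bin_dir / name).is_file() for name in ("javac", "javac.exe"))


if __name__ == "__main__":
    print("JAVA_HOME looks like a JDK:", check_java_home())
```

Passing a dictionary instead of the real environment makes the check easy to try out without changing system settings.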
Installing and Running WhamEE with JavaGateway

· Copy the JavaGateway package to the C:\javagateway folder. The JavaGateway package comes with WhamEE libraries and several configuration files.
· Edit the build.xml file and change the path of gate.home to point to the GATE installation folder on your local computer. There may be multiple places in the file where you need to make this change.
· Edit the extraction.properties file and set the properties for WhamEE, including the entities to extract. For further information on the properties file, see "Configuring JavaGateway Properties" below.
· Edit the logging.properties file to set the path for log files.
· At the command prompt, change to the javagateway folder and type "ant run" to start JavaGateway. JavaGateway loads WhamEE, initializing GATE.

Configuring JavaGateway Properties

The following files under the JavaGateway install path allow users to configure JavaGateway and WhamEE.

The extraction.properties file contains settings for GATE, such as the entities to extract and the processing batch size.

########################################################################
# GATE SETUP
########################################################################
# Maximum number of files/records to process in a batch
GATE.MaxFileProcess=30
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
# Maximum corpus size
GATE.MaxCorpusSize = 500000

The server.properties file contains JavaGateway server properties, such as the thread pool size to use.

########################################################################
# JavaGateway SETUP
########################################################################
#--JavaGateway ServerBase Thread Pool Size
ServerBase.ThreadPoolSize=6

The logging.properties file contains JavaGateway logging properties.

########################################################################
# LOG SETUP
########################################################################
#Default Logging File.
# DEBUG property produces more logging information than INFO
# INFO produces minimal logging information from JavaGateway, WhamEE and GATE
# DEBUG produces more debugging information even from GATE.
log4j.rootLogger= INFO, A2
#log4j.rootLogger= DEBUG, A2
# Appender A2 writes to file.. rolls daily
log4j.appender.A2=org.apache.log4j.DailyRollingFileAppender
log4j.appender.A2.DatePattern='.'yyyyMMdd
log4j.appender.A2.append=true
# Log path
log4j.appender.A2.File=c:\\javagateway\\logs\\javaGateWay.log
# Appender A2 uses the PatternLayout.
log4j.appender.A2.layout=org.apache.log4j.PatternLayout
log4j.appender.A2.layout.ConversionPattern=%d %5p [%t] (%F:%L) - %m%n
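To make the GATE settings in extraction.properties concrete, the following Python sketch reads the three GATE.* properties shown above and converts them to usable values. It is a simplified illustration, not the way JavaGateway itself loads the file: the helper name parse_properties is hypothetical, and the parser handles only plain key=value lines and # comments, not the backslash escapes or line continuations of the full Java properties format.

```python
def parse_properties(text):
    """Parse simple key=value lines; lines starting with '#' are comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            # Whitespace around '=' is ignored, as in "GATE.MaxCorpusSize = 500000"
            props[key.strip()] = value.strip()
    return props


SAMPLE = """\
# Maximum number of files/records to process in a batch
GATE.MaxFileProcess=30
# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product
# Maximum corpus size
GATE.MaxCorpusSize = 500000
"""

props = parse_properties(SAMPLE)
entities = props["GATE.ExtractionEntities"].split(",")
batch_size = int(props["GATE.MaxFileProcess"])
max_corpus_size = int(props["GATE.MaxCorpusSize"])

if __name__ == "__main__":
    print(entities, batch_size, max_corpus_size)
```

Splitting GATE.ExtractionEntities on commas gives one entry per entity type, which is a convenient form for checking that a newly added type such as Product is present.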
Installing and Running Standalone WhamEE

· Extract the files from whamee_Setup.zip to a local temporary folder. The zip file is located on the EIQ Product Suite Installation Media under the "WhamEE" folder.
· From the command line, go to the local temporary folder and run ‘whameesetup.bat’.
· ‘whameesetup.bat’ creates the whamee folder under "C:\Program Files\Whamtech\Whamee".
· Verify that the configuration files (extraction.properties) are set up under the ‘Whamee’ folder. Set the source directory, destination directory, and output directory. For further information on the configuration file, see "Configuring WhamEE Properties" below.
· Edit the ‘whamee_run.bat’ file located in "C:\Program Files\Whamtech\Whamee" and modify the "gate.home" property to point to the GATE installation directory. Make sure to use double quotes around the path, for instance, "C:\Program Files\GATE".
· Run "whamee_run.bat" from the command prompt to launch the WhamEE server.

Configuring WhamEE Properties

server.properties - contains socket and thread
pools to use
logging.properties - logging properties and tags
extraction.properties - defined below

########################################################################
# Extraction Setups
########################################################################
# ExtractionConfig.Count used to define the number of configs below
# The below configs will be numbered 0 to n-1.
ExtractionConfig.Count=2
########################################################################
# Multiple configurations can be specified below using this format
# Extraction.<<configid>>.<<property>>=<<value>>
# configid: configuration id starting from 0
# property: configuration property name
# value: configuration property value
########################################################################
# Configuration ID=0
########################################################################
Extraction.0.SourceDir=C:\/wham\/Projects\/BulkFiles\/IGISWEBDS1
Extraction.0.DestinationDir=C:\/wham\/Projects\/BulkFiles\/whamee_output
Extraction.0.OutputDir=C:\/wham\/Projects\/BulkFiles\/whamee_output
# Unprocessed folder goes under the outputdir with the below directory
#name - full path is not needed for this as the outputdir is appended as
#a prefix
Extraction.0.UnprocessedDir=whamee_failed
# Output Format supported: CSV, UpdateSrvFile, UpdateSrvSingleFile
Extraction.0.OutputFormat=UpdateSrvSingleFile
# Extraction.0.Continuous=TRUE
#Interval in seconds
Extraction.0.Interval=30
#Max Batch Size of collecting files
Extraction.0.BatchSize=100
#ExtractionSoftware = GATE is currently the only supported extraction software
Extraction.0.ExtractionSoftware=GATE
Extraction.0.ExtractSoftwareHashName=GATE0
#Entities to extract (comma delimited)
#Person,Location,Organization,Date,Address
Extraction.0.EntitiesToExtract=Person,Location,Organization
#File mapping for the above entities
Extraction.0.OutputFiles=Person,Location,Organization
#Character separating each field value within the same record in the text file
Extraction.0.OutputFile.ColumnDelimiter=,
#character used to identify beginning and end of a string
Extraction.0.OutputFile.StringQualifier='
#FileName to be put in OutputDir for db update
Extraction.0.Person.File.Name.Prefix=
Extraction.0.Person.File.Name.Suffix=txt
#CSV file only needs columns to dump data
#DocName will be a constant for Document Name from which entity was retrieved.
#e.g. Extraction.0.Person.File.Values=DocName,Person
#Values format is entity:ColumnName
#If update server load file format we will need to have the column value
#equivalent and a line for table name and schema
Extraction.0.Person.File.Name.Prefix=Person
Extraction.0.Person.File.Name.Suffix=txt
Extraction.0.Person.File.Values=DocName:Document,DocPageHash:PageHash,Person:Name,DocLoc:documentLocation
Extraction.0.Person.File.Schema=NULL
Extraction.0.Person.File.Database=NULL
Extraction.0.Person.File.Table=PersonEntity
Extraction.0.Location.File.Name.Prefix=Location
Extraction.0.Location.File.Name.Suffix=txt
Extraction.0.Location.File.Values=DocName:Document,DocPageHash:PageHash,Location:Place,DocLoc:documentLocation
Extraction.0.Location.File.Schema=NULL
Extraction.0.Location.File.Database=NULL
Extraction.0.Location.File.Table=LocationEntity
Extraction.0.Organization.File.Name.Prefix=Organization
Extraction.0.Organization.File.Name.Suffix=txt
Extraction.0.Organization.File.Values=DocName:Document,DocPageHash:PageHash,Organization:Org,DocLoc:documentLocation
Extraction.0.Organization.File.Schema=NULL
Extraction.0.Organization.File.Database=NULL
Extraction.0.Organization.File.Table=OrganizationEntity

Configuring GATE for Standalone WhamEE

These steps are
needed only when WhamEE is invoked from the command line. Skip this section if you are running WhamEE and GATE from JavaGateway.

WhamEE must load certain GATE plugins to use their processing resources.
· Load the plugins by launching GATE and selecting "Manage CREOLE Plugins" from the "File" menu.
· Select the "Load now" and "Load always" options for the plugins given below. See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.

The required plug-ins:
· ANNIE
· Ontology
· Gazetteer_Ontology_Based
· Tools
· Ontology_Tools

Using Entity Extraction: Extracting Entities from Structured Data Source Text Columns

Many
structured data sources contain vast amounts of unstructured information in text columns. Applications benefit from applying structured queries to this unstructured information. Use entity extraction to identify and extract entities from unstructured text and build structured indexes on the entities.

Extract Entities and Build Indexes in RTI Tool

Make sure that JavaGateway, WhamEE, and GATE are set up properly and that JavaGateway is running. In the EIQ Server RTI Tool, enable the entity extraction feature:
· Connect to the structured data source and switch to RTI mode.
· Select ‘Options’ from the ‘Tools’ menu and select ‘Entity Extraction Settings’.
· Select 'Enable entity extraction'.
· Enter the server address and port for the JavaGateway server where WhamEE and GATE are configured for entity extraction.
· Select one or more entity types to extract. Note: These options are global and apply to all columns designated as Entity Fields. See below for more information on Entity Field designation.
· Designate the columns containing the unstructured text information as Entity Fields by right-clicking the column and selecting Modify Flags->Entity Field from the context menu. Entity Fields tell the EIQ Server RTI Tool to generate an entity table for each entity type and the corresponding association tables. The association tables relate the entity tables to the data source table containing the Entity Field column. These tables store the extracted entity data and allow SQL queries that relate and join the entity data with the source table. The RTI-generated tables have the following naming convention:
  o D#_E_Location – the table name for the Location entity type ('D#' for derived table; 'E' for entity type; 'Location' for the type of entity)
  o D#_EA_Location_Person – the table name for the association table relating the D#_E_Location table with the data source Person table.
· Proceed to build EIQ indexes as usual.

Configure the Virtual Data Source for Entity Queries

This step is required to configure an EIQ SuperAdapter VDS and is unnecessary for an EIQ TurboAdapter VDS.

While configuring the EIQ SuperAdapter, the only additional step is to map the entity table columns to a virtual schema view (SuperSchema). Each entity type table contains a text column named EntityValue. This column contains the extracted values for that entity type.
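The naming convention for the RTI-generated tables (D#_E_ followed by the entity type, and D#_EA_ followed by the entity type and the source table) can be summarized in a few lines of Python. This is a sketch for working out which table names to expect when mapping EntityValue columns. The function names are hypothetical and are not part of any EIQ tool.

```python
ENTITY_VALUE_COLUMN = "EntityValue"  # text column present in each entity type table


def entity_table_name(entity_type):
    """Derived entity table, e.g. D#_E_Location for the Location entity type."""
    return f"D#_E_{entity_type}"


def association_table_name(entity_type, source_table):
    """Association table relating an entity table to a data source table."""
    return f"D#_EA_{entity_type}_{source_table}"


if __name__ == "__main__":
    for entity in ("Person", "Location", "Organization"):
        print(entity_table_name(entity), association_table_name(entity, "Person"))
```

For the Location entity type and a source table named Person, these helpers reproduce the two example names given in the naming convention.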
· Map the EntityValue columns to a virtual schema view.

Query the Virtual Data Source

· Connect to the VDS and make queries involving the entity table values.

GATE Quick Start Guide

Running a Sample GATE Application

This section describes a sample scenario for running GATE to annotate sample documents for entity extraction.
Step 1: Initialize GATE GUI

· Select GATE 6.0 GUI from the Start Menu. This opens a workspace window.

Certain GATE plugins need to be loaded first.
· Load the plugins by selecting ‘Manage CREOLE Plugins’ from the File menu.
· Select the "Load now" option for the plugins given below. See http://gate.ac.uk/sale/tao/splitch3.html#x6-550003.5 for further information.

Step 2: Add Language Resources

In this sample, language resources are the documents that you want GATE to process. From GATE->Language Resources:
· Right-click on Language Resources and select ‘New -> GATE Document’ to add HTML documents from your local system.
· Right-click on Language Resources and select ‘New -> GATE Corpus’ and name the corpus.
· Double-click on the newly created corpus under Language Resources, and add the above documents to the corpus on the right by clicking the '+' button.

Step 3: Load ANNIE System

ANNIE is the default information extraction system application that
comes with GATE. It contains a collection of plugins to process the documents in the corpus created above.
· Select File -> Load ANNIE System -> with defaults.

GATE loads various processing resources such as tokenizers, gazetteers, sentence splitters, and taggers. See http://gate.ac.uk/releases/gate-6.0-build3764-ALL/doc/tao/splitch6.html#x9-1260021 for details on ANNIE processing resources.

Step 4: Run ANNIE application

· Double-click on ‘GATE->ANNIE’. GATE opens ANNIE on the right side and shows the loaded and selected processing resources. The order in which they are selected is very important:
  o Document Reset PR
  o ANNIE English Tokenizer
  o ANNIE Gazetteer
  o ANNIE Sentence Splitter
  o ANNIE POS Tagger
  o ANNIE NE Transducer
  o ANNIE OrthoMatcher
· Click 'Run this Application' at the bottom (or run it through GATE->Applications->ANNIE->Run this Application).

Step 5: View ANNIE annotations in the documents

· Double-click on a document under GATE->Language Resources. The document content is shown on the right side.
· Click 'Annotation Sets' and 'Annotation Lists' to view the corresponding information.
· In the right-most panel, expand the arrows to open the original markups and the new markups created by ANNIE (Address, Date, etc.).
· Select the markups to see the corresponding text highlighted with matching colors. The annotation lists are shown at the bottom.

Adding a New Entity Type to GATE for Extraction

GATE comes with support for several default entity types such as Person,
Organization, Address, etc. Users can create their own entity types for extraction by GATE. The following steps show an example of creating a 'Product' entity type.

Step 1: Add JAPE rule for the new entity

· Under the “GATE-6.0\plugins\ANNIE\resources\NE\” directory, make a copy of an existing file, for example jobtitle.jape, and rename it for the new entity (product.jape).
· Open and change the file by defining JAPE rules for the new entity. JAPE file contents for the 'Product' entity type:

Rule: Product1
(
 {Lookup.majorType == product}
 ( {Lookup.majorType == product} )?
)
:product
-->
:product.Product = {rule = "Product1"}

· Add an entry for 'product' in the main.jape file.

JAPE rules are used by the ANNIE NE Transducer. Verify that it can load the new file.
· From GATE->Processing Resources, double-click on 'ANNIE NE Transducer' and verify that the new entity type is listed.
· If the new name is not listed, try reinitializing the transducer.
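To illustrate what the Product1 rule matches, the Python sketch below mimics its pattern on a list of Lookup annotations: one Lookup whose majorType is product, optionally followed by a second one, with the whole span annotated as Product. This is a deliberately simplified model and not GATE's implementation. It assumes annotations are plain (start, end, majorType) tuples sorted by offset and treats neighbouring list entries as consecutive. Real JAPE matching also depends on the phase's Input annotation types and matching style, which the rule excerpt does not show.

```python
def apply_product1(lookups):
    """Simplified model of the Product1 rule.

    lookups: list of (start, end, major_type) tuples sorted by start offset.
    Returns a list of (start, end, features) tuples for Product annotations.
    """
    products = []
    i = 0
    while i < len(lookups):
        start, end, major_type = lookups[i]
        if major_type != "product":
            i += 1
            continue
        i += 1
        # Optional second Lookup with majorType == product: ( {...} )?
        if i < len(lookups) and lookups[i][2] == "product":
            end = lookups[i][1]
            i += 1
        products.append((start, end, {"rule": "Product1"}))
    return products


if __name__ == "__main__":
    sample = [(0, 3, "product"), (4, 17, "product"),
              (20, 26, "location"), (30, 34, "product")]
    print(apply_product1(sample))
```

In the sample, the first two product lookups are combined into a single Product span, the location lookup is ignored, and the last product lookup becomes a Product annotation on its own.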
Step 2: Add lookup values list for Gazetteer

· Under the “GATE-6.0\plugins\ANNIE\resources\gazetteer\” directory, make a copy of an existing file, for example jobtitles.lst, and rename it for the new entity type (product.lst).
· Open the file and delete all existing entries.
· Add a couple of lookup values for the new entity, one value per line:

EIQ Product Suite
GATE
SQL Server 2008

· Add an entry for 'product' in the lists.def file as follows:

product.lst:product

The ANNIE Gazetteer uses lists for lookups. Verify that it can access the new list.
· From the GATE GUI, double-click on 'ANNIE Gazetteer' and verify that the new entity type is listed.
· If the new type is not listed, try reinitializing the gazetteer.

Step 3: Add new entity name for extraction

In the extraction.properties file, add the new entity name as follows:

# Entity types to extract
GATE.ExtractionEntities=Person,Location,Organization,Date,Address,Product

Step 4: Extract new entities

While building indexes for text search columns in the EIQ Server RTI Tool, select the new entity type in the ‘Options’ menu under ‘Entity Extraction Settings’. The EIQ Server RTI Tool then gets the extracted entities from JavaGateway and builds indexes for the new entity in a new derived table.
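The file edits for the gazetteer list and the extraction.properties entry could be scripted roughly as shown below. This Python sketch is an assumption-laden illustration rather than a supported tool: the function names are hypothetical, the files are assumed to be UTF-8 text, and the demo writes to a temporary directory instead of a real GATE installation.

```python
import tempfile
from pathlib import Path


def add_gazetteer_list(gazetteer_dir, list_name, major_type, values):
    """Write a lookup list (one value per line) and register it in lists.def."""
    gazetteer_dir = Path(gazetteer_dir)
    (gazetteer_dir / list_name).write_text("\n".join(values) + "\n", encoding="utf-8")

    lists_def = gazetteer_dir / "lists.def"
    entry = f"{list_name}:{major_type}"  # e.g. product.lst:product
    lines = lists_def.read_text(encoding="utf-8").splitlines() if lists_def.exists() else []
    if entry not in lines:
        lines.append(entry)
        lists_def.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return entry


def add_extraction_entity(entities_value, new_entity):
    """Append new_entity to a comma-separated entity list if it is missing."""
    entities = [e.strip() for e in entities_value.split(",") if e.strip()]
    if new_entity not in entities:
        entities.append(new_entity)
    return ",".join(entities)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        print(add_gazetteer_list(demo_dir, "product.lst", "product",
                                 ["EIQ Product Suite", "GATE", "SQL Server 2008"]))
    print(add_extraction_entity("Person,Location,Organization,Date,Address", "Product"))
```

Both helpers leave existing entries untouched, so running them twice does not create duplicate lines or duplicate entity names.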
Copyright © 2023, WhamTech, Inc. All rights reserved. This document is
provided for information purposes only and the contents hereof are subject to
change without notice. Names may be trademarks of their respective owners.