- id: full file name is used for this field.
- title: title of the dataset
- provider: provider of the dataset
- providerExact: same as the previous field, but uses String data type for facets and exact match search
- description: description or abstract of the dataset
- beginDate: begin date of the temporal coverage of dataset
- endDate: end date of the temporal coverage of the dataset; when the input format supports only one dataset date, beginDate and endDate contain the same value
- westBoundingCoordinate: Geographic west coordinate
- eastBoundingCoordinate: Geographic east coordinate
- northBoundingCoordinate: Geographic north coordinate
- southBoundingCoordinate: Geographic south coordinate
- fullText: The complete text of the XML metadata document
- externalUrl: Url containing specific information about the dataset; in the case of the
- serverId: ID of the source OAI-PMH service; this information is taken from the file system structure and is used for faceted search.
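As a rough illustration, the fields listed above could be declared in schema.xml along the following lines. This is only a sketch: the field type names (string, text, date, double), the indexed/stored flags, and the copyField rule are assumptions borrowed from the conventions of the stock Solr example schema, not the actual schema used here.

```xml
<!-- Sketch only: type names, flags and the copyField rule are assumptions -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="provider" type="text" indexed="true" stored="true"/>
  <!-- untokenized copy of provider, for facets and exact match search -->
  <field name="providerExact" type="string" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true"/>
  <field name="beginDate" type="date" indexed="true" stored="true"/>
  <field name="endDate" type="date" indexed="true" stored="true"/>
  <field name="westBoundingCoordinate" type="double" indexed="true" stored="true"/>
  <field name="eastBoundingCoordinate" type="double" indexed="true" stored="true"/>
  <field name="northBoundingCoordinate" type="double" indexed="true" stored="true"/>
  <field name="southBoundingCoordinate" type="double" indexed="true" stored="true"/>
  <!-- searchable but not returned, since it holds the whole XML document -->
  <field name="fullText" type="text" indexed="true" stored="false"/>
  <field name="externalUrl" type="string" indexed="false" stored="true"/>
  <field name="serverId" type="string" indexed="true" stored="true"/>
</fields>

<uniqueKey>id</uniqueKey>

<!-- one possible way to keep providerExact in sync with provider -->
<copyField source="provider" dest="providerExact"/>
```

Keeping providerExact as an untokenized string is what makes it usable for facets: a tokenized text field would produce one facet value per word rather than one per provider name.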
The data import handlers are implemented using three main kinds of features available in Solr (a data source, entity processors, and transformers):
- FileDataSource: allows fetching content from files on disk.
- FileListEntityProcessor: an entity processor used to enumerate the list of files.
- XPathEntityProcessor: used to index the XML files; it allows XPath expressions to be defined to retrieve specific elements.
- PlainTextEntityProcessor: reads all content from the data source into a single field; this processor is used to import the whole XML file into one field.
- DateFormatTransformer: parses date/time strings into java.util.Date instances; it is used for the date fields.
- RegexTransformer: helps in extracting or manipulating values from fields (from the source) using Regular Expressions.
- TemplateTransformer: used to overwrite or modify any existing Solr field or to create new Solr fields; it is used to create the id field.
- org.gbif.solr.handler.dataimport.ListDateFormatTransformer: a custom transformer that handles non-standard date formats that are common in the input data, such as 12-2010, 09-1988, and (1998)-(2000). It has three important attributes: i) separator, the character or string that separates the year and month fields; ii) lastDay, which determines whether a bare year value (e.g., 1998) resolves to the first or the last day of that year: when the year is interpreted as a beginDate, lastDay is set to false and the value becomes yyyy-01-01; when it is interpreted as an endDate, lastDay is set to true and the value becomes yyyy-12-31; iii) selectedDatePosition, which selects the date to process when the input field contains a range of dates. For example:
<field column="beginDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="_" lastDay="false" xpath="/dc/date"/>
imports the "/dc/date" element into beginDate using "_" as the separator; selectedDatePosition="1" states that the date to be processed is the first one in the range of dates, and lastDay is therefore set to false. The implementation of this custom handler is available on the Google Code site. The web interface can be visited at this URL; in a future post I'll explain how this user interface was implemented using a simple AJAX framework.
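To show how the components described above fit together, here is a minimal sketch of a data-config.xml. Several details are assumptions made for illustration: the baseDir path, the fileName pattern, the entity names, the /dc/title XPath, the use of the absolute file path as the id, and the endDate field (which simply mirrors the beginDate example with selectedDatePosition="2" and lastDay="true"). The real configuration may differ.

```xml
<!-- Sketch only: paths, entity names and most XPaths are assumptions -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Enumerates the harvested XML files on disk; emits no documents itself -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/harvested/metadata" fileName=".*\.xml"
            recursive="true" rootEntity="false" dataSource="null">

      <!-- One Solr document per XML file, with fields extracted via XPath -->
      <entity name="record" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/dc"
              transformer="TemplateTransformer,org.gbif.solr.handler.dataimport.ListDateFormatTransformer">
        <!-- TemplateTransformer builds the id from the file name -->
        <field column="id" template="${files.fileAbsolutePath}"/>
        <field column="title" xpath="/dc/title"/>
        <field column="beginDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="_" lastDay="false" xpath="/dc/date"/>
        <!-- assumed counterpart of beginDate: second date in the range -->
        <field column="endDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="2" separator="_" lastDay="true" xpath="/dc/date"/>

        <!-- Reads the same file again as plain text into fullText -->
        <entity name="raw" processor="PlainTextEntityProcessor"
                url="${files.fileAbsolutePath}">
          <field column="plainText" name="fullText"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Further transformers such as DateFormatTransformer or RegexTransformer can be appended to the same comma-separated transformer attribute when a field needs them.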