Moving to a search server: Why & how ?
Since 2010, the GeoNetwork community has been discussing a move from Lucene to Solr in order to improve the user search experience. The main motivations for that move were:
- Improve suggestions (eg. spell check, suggest only on user records)
- Facets on any field and cross-field facet hierarchies
- Scoring, boosting results
- Similar documents
- Highlights in response
- Join query
- Fix Lucene memory issues on some setups (which require a restart)
- Reduce Lucene multilingual/search complexity
Moving from Lucene to Solr or Elasticsearch introduces a major change in the application: the search server runs alongside GeoNetwork. A proxy is implemented in GeoNetwork to run searches and to enrich queries and responses based on user privileges.
Building on the WFS data indexing work funded by Ifremer, a first codesprint took place in April/May 2016 with titellus (Francois Prunayre) and camptocamp (Patrick Valsecchi, Antoine Abt, Florent Gravin) to replace Lucene with Solr.
This codesprint focused on starting the move to Solr in order to identify the main issues, risks and benefits, and to draw up a roadmap before looking for funding. This document sums up what has been done so far and illustrates features that could be relevant for GeoNetwork.
- Analyze how to move to Solr
- Investigate Solr features and illustrate the benefits
- Start migration & refactoring focusing on main search service and CSW; identify features to deprecate
- Illustrate with a simple search interface providing the capability to search on metadata and datasets
New dependency:
- Solr 6
- Java 8 required
Removed dependency:
- Lucene 4.9 dependency
GeoNetwork major changes:
- The Angular app uses a simple HTTP interceptor to allow basic search (the interceptor mimics the q service by translating queries and responses to and from the Solr format). This enables basic functionality in the current UI.
- New experimental Angular UI for search (on features and metadata)
- Integrate cleaning PRs, ie. remove ExtJS UI, old XSL services, Z39.50 server
See branch https://github.com/geonetwork/core-geonetwork/tree/solr
First experiments:
- Geographic features https://www.youtube.com/watch?v=VFiQEi0U-yc (indexing, attribute table view, faceting, heatmaps)
- (Meta)data search https://www.youtube.com/watch?v=3FyugQMxaiE for search in metadata and data at the same time
The spell checking module suggests related searches to end users in case of a typo. The suggestion module can provide suggestions based on a field in the index.
Example of suggestions and similar words:
Examples on typos:
Spell check also works on phrases:
The current suggestion mechanism in GeoNetwork is based on a search and could not provide terms that are not matching results as the current implementation does (see https://github.com/geonetwork/core-geonetwork/issues/1466, https://github.com/geonetwork/core-geonetwork/issues/634, https://github.com/geonetwork/core-geonetwork/issues/1003).
Using the "MoreLikeThis" component, it is easy to provide documents similar to the one you are currently looking at (eg. other versions of the same dataset). See https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
eg. when searching for ortho imagery and retrieving an image for 2015, you also get similar images for 2009 and 2012. The MoreLikeThis response is structured that way.
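On the request side, the MoreLikeThis search component is enabled with a handful of `mlt.*` parameters. The following is a minimal sketch only: the `catalog_srv` URL and the `resourceTitle` / `resourceAbstract` fields are taken from other examples on this page, and the choice of similarity fields is an assumption, not the configuration used during the codesprint.

```python
from urllib.parse import urlencode


def more_like_this_params(query, fields, count=5):
    """Build Solr parameters that enable the MoreLikeThis search component."""
    return {
        "q": query,
        "wt": "json",
        "mlt": "true",               # turn the component on
        "mlt.fl": ",".join(fields),  # fields used to measure similarity (assumed)
        "mlt.count": count,          # similar documents returned per result
        "mlt.mintf": 1,              # minimum term frequency in the source document
        "mlt.mindf": 1,              # minimum number of documents containing the term
    }


params = more_like_this_params("ortho imagery", ["resourceTitle", "resourceAbstract"])
url = "http://localhost:8984/solr/catalog_srv/select?" + urlencode(params)
print(url)
```

Lowering `mlt.mintf` and `mlt.mindf` to 1 is mostly useful on small test catalogs; larger indexes would probably keep higher thresholds.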
Search can now boost specific fields at search or indexing time (eg. give a higher score to matches in the title) using the Solr search API.
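One possible way to boost at search time is the eDisMax query parser and its `qf` parameter. This is a sketch under assumptions: the page does not say which parser the branch uses, and the weights below are arbitrary illustration values.

```python
from urllib.parse import urlencode


def boosted_query(text, boosts):
    """Build eDisMax parameters where each field gets its own weight."""
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {"q": text, "defType": "edismax", "qf": qf, "wt": "json"}


# Assumed weights: a match in the title counts five times more than in the abstract.
params = boosted_query("map", {"resourceTitle": 5, "resourceAbstract": 1})
url = "http://localhost:8984/solr/catalog_srv/select?" + urlencode(params)
print(params["qf"])
```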
Solr supports synonym configuration based on a simple text file or on a more advanced synonym map (configurable using the API). Synonyms are heavily used in the INSPIRE dashboard project (eg. INSPIRE themes & annex https://github.com/INSPIRE-MIF/daobs/blob/daobs-1.0.x/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_inspireannex.json, contact and territory in France https://github.com/fxprunayre/daobs/blob/geocataloguefr/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_geocat_producer_territory.json).
Once configured, synonyms can be used in search/facets/stats components.
This extends the use of thesauri in GeoNetwork, where currently only the broader/narrower relation is used, for hierarchical facets (https://github.com/geonetwork/core-geonetwork/wiki/201411HierarchicalFacetSupport).
Query syntax could be used to make more flexible searches:
The search and index analysis chain is also better configured, which avoids search errors such as those that occur when searching on a full title.
The Highlighter module provides the capability to highlight matching words in results eg. in abstract.
- Sample query: http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=map&rows=1&wt=json&indent=true&hl=true&hl.fl=resource*&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E
- Sample response:
    document....
    },
    highlighting: {
      501: {
        resourceAbstract: [
          "Use this template to describe a static <strong>map</strong> (eg. PDF or image) or an interactive <strong>map</strong> (eg. WMC)."
        ]
      }
    }
Note: the field MUST be tokenized, eg. highlighting does not work with the String type; use the text_general type instead.
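The sample query and response above can be reproduced from a client along these lines. This is a sketch only: the helper names are made up for illustration, and the sample dictionary is a trimmed copy of the response shown above.

```python
from urllib.parse import urlencode


def highlight_params(query, field_pattern="resource*",
                     pre="<strong>", post="</strong>"):
    """Parameters asking Solr to wrap matching words in the given markers."""
    return {
        "q": query,
        "rows": 1,
        "wt": "json",
        "hl": "true",
        "hl.fl": field_pattern,
        "hl.simple.pre": pre,
        "hl.simple.post": post,
    }


def snippets_for(response, doc_id, field):
    """Return the highlighted snippets of one field, or an empty list."""
    return response.get("highlighting", {}).get(str(doc_id), {}).get(field, [])


url = ("http://localhost:8984/solr/catalog_srv_shard1_replica1/select?"
       + urlencode(highlight_params("map")))

# Trimmed copy of the sample response above.
sample = {
    "highlighting": {
        "501": {
            "resourceAbstract": [
                "Use this template to describe a static <strong>map</strong> "
                "(eg. PDF or image) or an interactive <strong>map</strong> (eg. WMC)."
            ]
        }
    }
}
print(snippets_for(sample, 501, "resourceAbstract"))
```

Returning an empty list for a missing document or field keeps the client code simple, since non-tokenized fields produce no highlighting entry at all.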
Instead of relying on the server-side config-summary.xml, which defines a predefined list of facets, Solr allows creating facets on any field. The client can request any facet it needs. For example, the WFS feature data filter automatically computes facets on all feature attributes. It computes statistics on numeric and date fields and builds the facet configuration on the fly:
GeoNetwork facets only support term facets, which return a list of values with a count of records. More advanced faceting can be done with Solr:
- range
- interval
- heatmap (for geometry)
- pivot
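As a rough illustration of two of these facet types, the sketch below builds the request parameters for a range facet and a heatmap facet. The `denominator` field is mentioned elsewhere on this page; the `geom` field name and the range bounds are assumptions chosen for the example.

```python
from urllib.parse import urlencode


def range_facet(field, start, end, gap):
    """Per-field parameters for a numeric range facet."""
    return [
        ("facet.range", field),
        (f"f.{field}.facet.range.start", start),
        (f"f.{field}.facet.range.end", end),
        (f"f.{field}.facet.range.gap", gap),
    ]


def heatmap_facet(field, west, south, east, north):
    """Parameters for a heatmap facet limited to a bounding box."""
    return [
        ("facet.heatmap", field),
        ("facet.heatmap.geom", f'["{west} {south}" TO "{east} {north}"]'),
    ]


# A list of pairs is used because facet parameters may be repeated.
params = [("q", "docType:feature"), ("rows", 0), ("wt", "json"), ("facet", "true")]
params += range_facet("denominator", 0, 1000000, 50000)  # assumed bounds
params += heatmap_facet("geom", -180, -90, 180, 90)      # "geom" is an assumed field name
print(urlencode(params))
```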
Pivots can also be quite flexible using the new Solr facet API, which allows multilevel facets. A user could for example request:
- a first level facet on resource type (eg. feature/dataset/service)
- a second level facet on point of contact
- a third level on conformity
- ... and get statistics on each pivot
The facet API also provides the capability to request more facet values, to page through facets, ...
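A multilevel request of this kind could be expressed with the JSON Facet API by nesting one terms facet inside another. The sketch below assumes hypothetical field names (`pointOfContact`, `conformity`); only `docType` appears in the examples on this page.

```python
import json


def nested_terms(levels, limit=10, offset=0):
    """Nest one terms facet per field, the first field being the outermost."""
    facet = None
    for field in reversed(levels):
        node = {
            "type": "terms",
            "field": field,
            "limit": limit,      # number of values per level
            "offset": offset,    # paging inside the facet
            "numBuckets": True,  # also report the total number of values
        }
        if facet is not None:
            node["facet"] = facet
        facet = {field: node}
    return facet


# Field names other than docType are assumptions made for this example.
facets = nested_terms(["docType", "pointOfContact", "conformity"])
params = {"q": "*:*", "rows": 0, "wt": "json", "json.facet": json.dumps(facets)}
print(json.dumps(facets, indent=2))
```

Requesting the next page of values is then a matter of calling the same helper with a higher `offset`.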
Data can be indexed when it is available through WFS (see https://github.com/geonetwork/core-geonetwork/wiki/WFS-Filters-based-on-WFS-indexing-with-SOLR). This work needs to be extended to index other types of documents as well (eg. PDF). A parser like Apache Tika can be used for this task.
Grouping and join features could be relevant for grouping results (datasets/series, features/dataset, ...). Links between documents must be added to the index, eg. so that a search can be combined on both metadata and features.
- Sample query:
- http://localhost:8984/solr/catalog_srv/select?indent=on&q=+docType:feature&wt=json&group=true&group.field=parent&group.limit=4
- Get metadata with feature http://localhost:8984/solr/catalog_srv/select?indent=on&q={!join%20from=parent%20to=metadataIdentifier}+%2BdocType:feature&wt=json&fl=resourceTitle
- Get metadata with feature about "MEDECO" http://localhost:8984/solr/catalog_srv/select?indent=on&q={!join%20from=parent%20to=metadataIdentifier}+%2BdocType:feature+%2BMEDECO&wt=json&fl=resourceTitle
    grouped: {
      parent: {
        matches: 8624,
        groups: [
          {
            groupValue: "89dee307e38c972b333b152d9bd19bb2e9bb0d4d",
            doclist: {
              numFound: 49,
              start: 0,
              docs: [
                {
                  id: "states.1",
                  docType: "feature"
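The grouping and join queries above can be assembled and read back as follows. This is a sketch: the helper names are hypothetical, and the sample dictionary is a completed version of the truncated response above, keeping only its first group and first document.

```python
from urllib.parse import urlencode


def join_query(from_field, to_field, inner_query):
    """Wrap a query in the Solr join parser local parameters."""
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"


def group_params(query, field, limit=4):
    """Parameters grouping results on one field."""
    return {"q": query, "wt": "json", "group": "true",
            "group.field": field, "group.limit": limit}


def docs_by_group(response, field):
    """Map each group value (here the parent UUID) to its documents."""
    groups = response["grouped"][field]["groups"]
    return {g["groupValue"]: g["doclist"]["docs"] for g in groups}


q = join_query("parent", "metadataIdentifier", "+docType:feature +MEDECO")
url = "http://localhost:8984/solr/catalog_srv/select?" + urlencode(
    {"q": q, "wt": "json", "fl": "resourceTitle"})

# Completed from the truncated sample above (first group, first document only).
sample = {"grouped": {"parent": {"matches": 8624, "groups": [{
    "groupValue": "89dee307e38c972b333b152d9bd19bb2e9bb0d4d",
    "doclist": {"numFound": 49, "start": 0,
                "docs": [{"id": "states.1", "docType": "feature"}]},
}]}}}
print(docs_by_group(sample, "parent"))
```

Letting `urlencode` escape the braces, `+` and spaces avoids the hand-encoded `%20` and `%2B` sequences of the sample URLs.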
More work required:
- groupValue contains the UUID (how to get label ?)
- Issue on group on multivalue field
- Block join (https://cwiki.apache.org/confluence/display/solr/Other+Parsers)
Issues:
- Does not return info about child docs.
## Spatial searches
Spatial search has been tested for both feature and metadata indexing/searching. Indexing of millions of objects was tested. Some limitations were identified and need more testing (eg. indexing ship tracks across the world took quite long, depending on the index grid size).
Heatmap feature is also used in feature analysis.
Spatial search is based on Lucene spatial and does not use GeoTools filters. So far, spatial queries look to be working fine.
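For reference, a bounding box filter on a Solr spatial (RPT) field can be written with the `ENVELOPE` syntax. The sketch below assumes a hypothetical `geom` field; the page does not name the geometry field used in the branch.

```python
def bbox_filter(field, west, south, east, north, predicate="Intersects"):
    """Build a spatial filter query on a bounding box.

    ENVELOPE takes its arguments in the order minX, maxX, maxY, minY.
    """
    return f'{field}:"{predicate}(ENVELOPE({west}, {east}, {north}, {south}))"'


# "geom" is an assumed field name; the filter would be sent as an fq parameter.
params = {"q": "*:*", "wt": "json", "fq": bbox_filter("geom", -10, 35, 30, 60)}
print(params["fq"])
```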
To be tested.
Moving from Lucene to a search engine will bring major benefits, since many features are already implemented in search servers like Solr or Elasticsearch (including better scalability). In both cases, a proxy is placed in front in order to deal with privileges and to build responses. The major tasks, which will represent most of the workload, are:
- implement multilingual support (by using one field per language instead of one index per language as we do now).
- rework the Angular client to deal with the new response format
- re-implement all search protocols (the POC focused on CSW, but GN also implements OpenSearch, OAI-PMH, SRU, Atom, ...)
Also, this move will allow building more advanced dashboards based on Banana (for Solr) or Kibana (for Elasticsearch), like the daobs project does (eg. https://inspire-dashboard.eea.europa.eu/official/dashboard2/#/dashboard/solr/INSPIRE Reporting 2011 - Ref. year 2010 - Metadata availability and conformity). Dashboards could be created dynamically from the catalog content (based on record content) and could also replace the search statistics pages available in the admin.
Sample query http://localhost:8984/solr/catalog_srv_shard1_replica1/spell?q=bosin&spellcheck=true&spellcheck.collateParam.q.op=AND
Spellcheck and suggestion configuration is made in:
- solrconfig.xml: define module configuration
- schema: define which fields use to build the dictionary (currently, title, tags, abstract)
Response contains a dedicated spellcheck and suggestion section:
    <response>
      <result name="response" numFound="0" start="0"/>
      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="bosin">
            <int name="numFound">1</int>
            <int name="startOffset">0</int>
            <int name="endOffset">5</int>
            <int name="origFreq">0</int>
            <arr name="suggestion">
              <lst>
                <str name="word">basins</str>
                <int name="freq">1</int>
              </lst>
            </arr>
          </lst>
        </lst>
        <bool name="correctlySpelled">false</bool>
        <lst name="collations">
          <lst name="collation">
            <str name="collationQuery">basins</str>
            <int name="hits">1</int>
            <lst name="misspellingsAndCorrections">
              <str name="bosin">basins</str>
- Configure the dictionary updates (on commit ?)
- Add from the admin the capability to rebuild the dictionary
- "Also don't forget to build the spellcheck dictionary before you use it:"
- URL to trigger dictionary update http://localhost:8984/solr/catalog_srv_shard1_replica1/select?&suggest=true&suggest.dictionary=mainSuggester&suggest.buildAll=true
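The spellcheck query, the dictionary rebuild URL and the reading of the collation from the XML response can be sketched as below. The function names are hypothetical, and the XML sample is a shortened version of the response above in which the closing tags, missing from the truncated sample, are assumed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://localhost:8984/solr/catalog_srv_shard1_replica1"


def spell_url(text):
    """URL of the spellcheck request for a possibly misspelled search."""
    return BASE + "/spell?" + urlencode({
        "q": text,
        "spellcheck": "true",
        "spellcheck.collateParam.q.op": "AND",
    })


def rebuild_url(dictionary="mainSuggester"):
    """URL asking Solr to rebuild the suggester dictionaries."""
    return BASE + "/select?" + urlencode({
        "suggest": "true",
        "suggest.dictionary": dictionary,
        "suggest.buildAll": "true",
    })


def collations(xml_text):
    """Extract the corrected queries proposed by the spellcheck component."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("str") if el.get("name") == "collationQuery"]


# Shortened sample; the closing tags are assumed.
SAMPLE = """<response>
  <lst name="spellcheck">
    <bool name="correctlySpelled">false</bool>
    <lst name="collations">
      <lst name="collation">
        <str name="collationQuery">basins</str>
        <int name="hits">1</int>
      </lst>
    </lst>
  </lst>
</response>"""

print(spell_url("bosin"))
print(collations(SAMPLE))
```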
The simple search application focused on drafting Angular components to easily create an interface on top of Solr search. In that work, we tried to overcome issues encountered with the first Angular components (eg. difficulties having more than one search in the same app) and we started the design of components for search (eg. requestHandler, facets, results, paging, ...).
TODO: Add some more details.
- TODO
All communication with Solr is handled by a proxy. The proxy takes care of:
- Query / Add user privileges to search filters
- Response / Add extra information on metadata document eg. can edit, is selected (formerly geonet:info)
- Provide access to search, spellcheck, suggestion, facet.
- Provide access to search for any type of document ie. metadata or data. The client should filter what to query.
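The two first responsibilities of the proxy can be pictured with the sketch below. It is not the GeoNetwork implementation (which is written in Java): the `opView` field, the `edit` and `selected` keys and the function names are all hypothetical and only illustrate the idea of adding a privilege filter to the query and user-specific information to each returned document.

```python
def with_privileges(params, viewable_groups, is_admin=False):
    """Return a copy of the query parameters restricted to what the user may see."""
    secured = dict(params)
    if is_admin:
        return secured  # administrators see everything
    if not viewable_groups:
        raise ValueError("at least one group (eg. a public one) is required")
    filters = list(secured.get("fq", []))
    groups = " OR ".join(str(g) for g in sorted(viewable_groups))
    filters.append(f"opView:({groups})")  # hypothetical privilege field
    secured["fq"] = filters
    return secured


def with_user_info(doc, editable_ids, selected_ids):
    """Add the per-user flags formerly carried by geonet:info."""
    enriched = dict(doc)
    enriched["edit"] = doc["id"] in editable_ids        # hypothetical key
    enriched["selected"] = doc["id"] in selected_ids    # hypothetical key
    return enriched


query = with_privileges({"q": "map", "wt": "json"}, {2, 1})
doc = with_user_info({"id": "501"}, editable_ids={"501"}, selected_ids=set())
print(query["fq"], doc)
```

Adding the restriction as a filter query rather than rewriting `q` keeps the user query untouched and does not affect scoring.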
Search response format is JSON.
Solr is not required to start the application, but a warning is displayed if the search engine cannot be contacted.
A health check tests whether Solr is up and running and reports the status in the admin console.
Major changes:
- Search / Parameters / No default set. The client needs to define all of them (previously, search defaulted to isTemplate:n)
- Selection / Add q parameter to select the records matching a specific query. Not related to session last search anymore. See SelectionManager
More work required
- Multilingual search / Move from one index per language to field in each language in same index
- OAI-PMH
- Atom service
- RSS search
- CSV search
- Server / Response / Can we have complex JSON object in response instead of only flat structure ?
- Client / Cannot sort on multivalue field (eg. denominator): Create min and max field in index
- GetDomain / Basic support / RangeValues is not supported
- GetRecords
- Config / Review mapping to solr field
More work required
- Virtual CSW / Needs testing
- Testing
Indexing is still made in 2 steps:
- XSL transformation to extract information from metadata record
- Add information from the database.
Atomic updates have also been implemented in order to update popularity and rating without reindexing the full document, for better performance.
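In Solr, an atomic update is a document that carries only its identifier and a modifier (`inc`, `set`, ...) for each field to change. The sketch below shows what such a payload could look like; the `popularity` and `rating` field names and the document id are assumptions, and the payload would be posted as JSON to the collection's `/update` handler.

```python
import json


def popularity_update(doc_id, increment=1):
    """Atomic update incrementing the popularity counter of one document."""
    return {"id": doc_id, "popularity": {"inc": increment}}


def rating_update(doc_id, rating):
    """Atomic update replacing the rating of one document."""
    return {"id": doc_id, "rating": {"set": rating}}


# Field names and document id are assumed for illustration.
payload = json.dumps([popularity_update("501"), rating_update("501", 4)])
print(payload)
```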
More work required:
- How to set up/start Solr for running tests ?
- Editor / Update field name in relation panel
Not taken into account during the codesprint. It sounds relevant to have one Solr collection per node and to provide one searcher per node. The way beans are accessed could probably be improved in order to make better use of Spring bean scopes.
- GetPublicMetadataAsRdf : Move from URL params to Solr query eg. /rdf.metadata.public.get?q=...
- Log search
- Removed: Analyze Solr log instead - all requests made using GET contains parameters.
- Quid: Search Solr
- Requests and Params tables removed.
- Admin console / Dashboard : Removed - Use Solr facets instead and build new dashboard from that.
- Search
- No support of geom by id geometry:region:kantone:15
- CSW
- Language is defined by URL only to return DC response (no language detection).
- GetRecords / Result_with_summary custom extension is removed
- GetDomain / no support for range
More work required:
- Homogeneous date time for records in db/index/xml
- Index / Add timezone. Value in index is in UTC. https://github.com/geonetwork/core-geonetwork/blob/develop/domain/src/main/java/org/fao/geonet/domain/ISODate.java
- Client side: move away from Wro4j to Brunch ?
- Merge cleaning PR in Solr branch ? https://github.com/geonetwork/core-geonetwork/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+cleaning to make Lucene removal easier.
- Remove Lucene* (ie. deps, index, config)
- Field constants - Move all Solr fields in one class in a module that all other modules can access https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/kernel/search/LuceneIndexField.java and https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/constants/Geonet.java#L624
- Client / Drop all @json and drop this mode in favour of _content_type
If you have some comments, start a discussion, raise an issue or use one of our other communication channels to talk to us.