Bulk index performance decays substantially (and may eventually fail) on large data sets #53

Open
2 tasks done
Trebla7th opened this issue Jul 8, 2020 · 1 comment

Comments

@Trebla7th

Trebla7th commented Jul 8, 2020

NOTE: This was discovered against MySQL, where loading data with an offset gets slower as the offset grows. This may not be an issue with other databases.

When indexing a domain class with more than 1 million records, indexing performance degrades and the run eventually fails. This is caused by the way data is loaded in ElasticSearchService.doBulkRequest(). The line:

List<Class<?>> results = domainClass.listOrderById([offset: offset, max: max, readOnly: true, sort: 'id', order: "asc"])

takes longer and longer to load data as the offset increases.
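To make the slowdown concrete, here is a rough back-of-the-envelope sketch. It assumes that MySQL serves an offset query by walking past the skipped rows before returning the requested page, which is the commonly described behaviour. The table size and batch size are made-up numbers for illustration, not measurements from this report, and rowsWalked is a hypothetical helper.

```groovy
// Hypothetical numbers for illustration only; not measurements from this issue.
int total = 1_000_000   // rows in the table
int max = 500           // assumed batch size
int batches = total.intdiv(max)

// Assumption: a query with a given offset walks roughly (offset + max) rows,
// because the skipped rows are read and discarded before the page is returned.
long rowsWalked(int offset, int max) {
    return (long) offset + max
}

// Rows walked by each batch of the offset-based loop.
def perBatch = (0..<batches).collect { i -> rowsWalked(i * max, max) }

println "first batch walks ${perBatch.first()} rows"
println "last batch walks  ${perBatch.last()} rows"
println "whole run walks   ${perBatch.sum()} rows"
```

Under that assumption the work per batch grows linearly with the offset, so the total work for the run grows roughly quadratically with the table size, which matches the decay described here.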

Locally, we fixed this by making the following changes (corporate policy prevents me from submitting an actual pull request to this project, but I can suggest the fix... bureaucracy!!!!):

// Load only the IDs, in ascending order, in a single query
def idResults = domainClass.createCriteria().list {
    projections {
        property 'id'
    }
    order("id", "asc")
}

..snip..

// The loop
idResults?.collate(max)?.eachWithIndex { subList, i ->

    // Other stuff here, then load the actual domains to index like this
    def results = domainClass.createCriteria().list {
        'in'('id', subList)
    }

    // everything else
}
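For anyone who wants to see the batching shape without a database, here is a minimal self-contained sketch of the same idea. The table list and its rows are hypothetical in-memory stand-ins for the domain class, and findAll stands in for the 'in' criteria query; only collate is the same call the fix above uses.

```groovy
// Hypothetical in-memory stand-in for the domain table (not GORM).
def table = (1..10).collect { [id: it, name: 'row-' + it] }

// Step 1: fetch only the IDs, sorted ascending (stands in for the projection query).
def ids = table*.id.sort()

// Step 2: split the ID list into batches of at most 'max' entries.
int max = 4
def batches = ids.collate(max)

// Step 3: load each batch by ID membership (stands in for 'in'('id', subList)).
def loadedPerBatch = batches.collect { subList ->
    table.findAll { it.id in subList }
}

println batches
println loadedPerBatch*.size()
```

Each batch is selected by an explicit list of at most max IDs rather than by position, so the query for a batch should not depend on how far into the table it is. The trade-off is that the full ID list is held in memory for the duration of the run.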

Task List

  • Steps to reproduce provided
  • [N/A] Stacktrace (if present) provided
  • [N/A] Example that reproduces the problem uploaded to Github
  • Full description of the issue provided (see below)

Steps to Reproduce

  1. Create 1 million or more domain objects to be indexed
  2. Start application to index (or trigger an index after startup)
  3. Observe that each subsequent iteration of the bulk loop takes longer than the last

Expected Behaviour

Indexing would continue at a consistent pace regardless of the number of records

Actual Behaviour

Indexing slows down linearly: each iteration takes longer than the last, until data connections eventually start timing out

Environment Information

  • Operating System: RHEL, MacOS Mojave
  • GORM Version: 7.0.2.RELEASE
  • Grails Version (if using Grails): 4.0.3
  • JDK Version:
java -version
openjdk version "1.8.0_192"
OpenJDK Runtime Environment (Zulu 8.33.0.1-macosx) (build 1.8.0_192-b01)
OpenJDK 64-Bit Server VM (Zulu 8.33.0.1-macosx) (build 25.192-b01, mixed mode)

Example Application

  • N/A
@puneetbehl
Contributor

Thank you, I will update this in the plugin.
