Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add SearchAfterMixin for ES search_after capability #4536

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Ali-D-Akbar
Copy link
Contributor

@Ali-D-Akbar Ali-D-Akbar commented Jan 8, 2025

PROD-4233
Adds a new SearchAfterMixin to be added in place of PkSearchableMixin that allows using search_after. Once this mixin is used, it will bypass the default search limit of 10k by making multiple calls to ES in case we have more than 10k records in an index.

Previously, we faced an issue regarding the search limit, resulting in less records to be returned. We increased the MAX_RESULT_WINDOW before but a better way is to use search_after capability for an optimal and flexible result.

This PR also adds a v2 CatalogQueryContainsViewSet that fixes the querying mechanism by filtering the items at the time of when we're executing queries on ES. In v1, our CatalogQueryContainsViewSet first searched all the records AND THEN filtered it.

Testing Instructions:

  1. Run update_index locally in Discovery shell.
  2. Visit /api/v1/catalog/query_contains/ endpoint and add a sample query like this: http://localhost:18381/api/v2/catalog/query_contains/?course_uuids=2de67490-f748-4efd-8532-b445f7ecc6f9,f9f1e100-668a-4fd5-a966-a127de1f69de&query=org:edX

You can set ELASTICSEARCH_DSL_QUERYSET_PAGINATION to your specific value in order to test the behavior. While the end result will be the same but this can affect the number of times the search_after mechanism is called.

@Ali-D-Akbar Ali-D-Akbar marked this pull request as ready for review January 10, 2025 09:59
@Ali-D-Akbar Ali-D-Akbar force-pushed the aakbar/PROD-4233 branch 4 times, most recently from 2b92b48 to d93f032 Compare January 13, 2025 20:30
Comment on lines 4245 to 4264
mock_document = MagicMock()
mock_search = MagicMock()
mock_document.search.return_value = mock_search

mock_search.query.return_value = mock_search
mock_search.sort.return_value = mock_search
mock_search.extra.return_value = mock_search

mock_result1 = MagicMock()
mock_result1.pk = 1
mock_result1.meta.sort = ["sort1"]

mock_result2 = MagicMock()
mock_result2.pk = 2
mock_result2.meta.sort = ["sort2"]

mock_search.execute.side_effect = [
[mock_result1, mock_result2],
[],
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right way to test. This is mocking everything, including ES search after behavior. All the ES related tests in course-discovery hit test ES instance. The search after mixin tests should aim for that too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the tests to use ElasticsearchTestMixin instead.

break

all_ids.update(ids)
search_after = results[-1].meta.sort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add safety check here, like

search_after = results[-1].meta.sort if results[-1] else None

Also, have you tried it locally? By default, results[-1].meta.sort is a Elastic.AttrList, which is not json serializable. Using results[-1].meta.sort.copy() will convert it into a simple list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch for the search_after check.
I've tested this locally, I'm using this search_after value just to be sent back to the ES query execution.

Comment on lines 44 to 67
identified_course_ids.update(
i.key
for i in CourseRun.search(
query,
partner=ESDSLQ('term', partner=partner.short_code),
identifiers=ESDSLQ('terms', **{'key.raw': course_run_ids})
).source(['key'])
)
if course_uuids:
course_uuids = [UUID(course_uuid) for course_uuid in course_uuids.split(',')]
specified_course_ids += course_uuids

log.info(f"Specified course ids: {specified_course_ids}")
identified_course_ids.update(
Course.search(
query,
partner=ESDSLQ('term', partner=partner.short_code),
identifiers=ESDSLQ('terms', **{'uuid': course_uuids})
).values_list('uuid', flat=True)
Copy link
Contributor Author

@Ali-D-Akbar Ali-D-Akbar Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from this code that updates the querying logic for ES (so rather than fetching all the products AND THEN filter it, we're fetching only the necessary products), this is just a copy of https://github.com/openedx/course-discovery/blob/a948a741b1839e845e07c762bab6ee51622daea5/course_discovery/apps/api/v1/views/catalog_queries.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change not compatible with v1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be, yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires changing the pkSearchableMixin a bit as well if we were to keep a single CatalogQueryContainsViewSet.

@Ali-D-Akbar Ali-D-Akbar force-pushed the aakbar/PROD-4233 branch 2 times, most recently from de8f6a3 to e2052c4 Compare January 16, 2025 17:19
Comment on lines 4209 to 4217
def test_fetch_all_courses(self, mock_get_documents):
query = "Course*"
mock_get_documents.return_value = [CourseDocument]

queryset = CourseProxy.search(query=query)

self.assertEqual(len(queryset), self.total_courses)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the page_size here? I think we should explicitly have courses greater than page size to emphasize that search will return everything in a single call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, the results should not contain any duplicates. That needs to be checked as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check that but given the correct sort id, it should never have duplicates.

For subsequent pages, they simply include this `next` value as the `search_after` parameter in their next request.

A new mixin, SearchAfterMixin, will be created and added to enable the search_after functionality in catalog query endpoints, such as `/api/v1/catalog/query_contains`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We should add more details of the mixin
  • nit: v2/catalog/query_contains

partner=ESDSLQ('term', partner=partner.short_code),
identifiers=ESDSLQ('terms', **{'uuid': course_uuids}),
document=CourseDocument
).values_list('uuid', flat=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this uses values_list while course_run_ids is using comprehension. We can should make it consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's rather consistent now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants