Add subset selection for selecting diverse subsets of large datasets #499

Open
bbrowning opened this issue Jan 23, 2025 · 0 comments

When mixing generated data, we can currently upsample or downsample datasets by random sampling. We want to add the ability to select a subset of generated data that maximizes diversity across the selected samples, so that we reduce duplicate samples without losing a meaningful amount of the diversity and variety of topics the samples cover.
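
To make the goal concrete, here is a minimal sketch of one possible diversity-driven selection strategy: greedy farthest-point (k-center style) selection over sample embeddings. This is only an illustration of the idea, not the method used by any particular project. The function names `cosine_distance` and `select_diverse_subset` are hypothetical, and the sketch assumes each generated sample has already been mapped to an embedding vector by some embedding model.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat zero vectors as maximally uninformative (assumption).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse_subset(embeddings, k, seed=0):
    """Greedily pick k indices so each new pick is far from all earlier picks.

    Hypothetical helper: `embeddings` is a list of equal-length float lists,
    one per generated sample.
    """
    n = len(embeddings)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))

    rng = random.Random(seed)
    first = rng.randrange(n)
    selected = [first]

    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = [cosine_distance(e, embeddings[first]) for e in embeddings]
    min_dist[first] = -1.0  # sentinel so a selected sample is never re-picked

    while len(selected) < k:
        # Pick the sample that is farthest from everything selected so far.
        next_idx = max(range(n), key=lambda i: min_dist[i])
        selected.append(next_idx)
        min_dist[next_idx] = -1.0
        for i in range(n):
            if min_dist[i] < 0.0:
                continue  # already selected
            d = cosine_distance(embeddings[i], embeddings[next_idx])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Unlike random downsampling, this kind of selection tends to keep one representative from each group of near-identical samples before it takes a second sample from any group. It costs O(n * k) distance computations, so a real integration over large datasets would likely need batching or an approximate method.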

There's a research project at https://github.com/krishnatejakk/DataCurate4LLMs that implements some of these ideas. Let's take a look at that project, investigate what dependencies it brings in, prototype what integration would look like, and determine the overall work required and whether this is the right approach to solving the generic problem of selecting subsets of our generated samples.

@bbrowning bbrowning added this to the 0.8.0 milestone Jan 23, 2025