Add subset selection for selecting diverse subsets of large datasets #499

bbrowning · 2025-01-23T18:58:45Z

When mixing generated data, we can currently upsample or downsample datasets using random sampling. We want to add the ability to select a subset of generated data that maximizes diversity across the selected samples, so that we reduce duplicate samples without meaningfully reducing the diversity and variety of topics the samples cover.

There's a research project at https://github.com/krishnatejakk/DataCurate4LLMs that implements some of these ideas. Let's review that project, investigate which dependencies it brings in, prototype what an integration would look like, and estimate the overall work required. We should also determine whether this is the right approach to the general problem of selecting subsets of our generated samples.
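
To make the goal concrete, here is a minimal sketch of one way to pick a diverse subset. It assumes each generated sample has already been turned into an embedding vector, and it uses greedy farthest-point selection with cosine distance. That technique is a stand-in chosen for simplicity; it is not taken from DataCurate4LLMs and may differ from whatever that project does. The names `cosine_distance` and `select_diverse_subset` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Zero vectors are treated as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse_subset(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedy farthest-point selection (illustrative, not the DataCurate4LLMs method).

    Returns the indices of up to k samples. Each step adds the sample whose
    distance to its nearest already-selected sample is largest.
    """
    n = len(embeddings)
    if n == 0 or k <= 0:
        return []

    selected = [0]  # assumption: seed with the first sample for determinism
    chosen = {0}
    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]

    while len(selected) < min(k, n):
        # Only consider samples that have not been selected yet.
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        # Refresh nearest-selected distances against the new pick.
        for i in range(n):
            d = cosine_distance(embeddings[i], embeddings[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy embeddings: samples 1 and 3 point in almost the same direction as sample 0.
toy_embeddings = [[1.0, 0.0], [1.0, 0.001], [0.0, 1.0], [0.999, 0.0], [-1.0, 0.0]]
toy_selection = select_diverse_subset(toy_embeddings, 3)
```

Unlike random downsampling, this keeps samples that point in clearly different directions and leaves the near-duplicates for last. Each pick scans all samples, so the cost grows with the dataset size times the subset size; a real integration for large datasets would likely need a vectorized or approximate approach.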

bbrowning added this to the 0.8.0 milestone Jan 23, 2025
