When mixing generated data, we can upsample or downsample datasets by random sampling. We want to add the ability to select a subset of generated data that maximizes diversity across the samples, so that we reduce duplicate samples without losing a meaningful amount of diversity or variety in the topics they cover.
There's a research project at https://github.com/krishnatejakk/DataCurate4LLMs that implements some of these ideas. Let's take a look at that project, investigate what dependencies it brings in, prototype what integration would look like, and determine the overall work required and whether this is the right approach to the general problem of selecting subsets of our generated samples.
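To make the goal concrete, here is a minimal sketch of one way diversity-based selection could work: greedy farthest-point selection over sample embeddings. This is an illustration only, not the method used by DataCurate4LLMs and not existing code in this project. The function names (`cosine_distance`, `select_diverse_subset`) are hypothetical, and it assumes each generated sample has already been turned into an embedding vector by some model.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance between two vectors (0 = same direction, 2 = opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat a zero vector as unrelated to everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse_subset(embeddings, k, seed=0):
    """Hypothetical helper: pick k indices that are spread out in embedding space.

    Starts from a random sample, then repeatedly adds the sample that is
    farthest from everything selected so far. Exact duplicates of an already
    selected sample have distance 0 and are picked last.
    """
    n = len(embeddings)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))

    rng = random.Random(seed)
    first = rng.randrange(n)
    selected = [first]
    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = [cosine_distance(e, embeddings[first]) for e in embeddings]

    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = cosine_distance(embeddings[i], embeddings[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy example: two pairs of exact duplicates plus one distinct vector.
toy_embeddings = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0]]
picked = select_diverse_subset(toy_embeddings, k=3)
print(picked)
```

This sketch costs O(n * k) distance computations and uses only the standard library, so it says nothing about the dependencies or scaling behavior of the research project above; evaluating those is still the point of this issue.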