calculate prepare LLM dataset total size #236

Open
huseinzol05 opened this issue Jul 30, 2023 · 3 comments

Comments

@huseinzol05

Notebook: https://github.com/huseinzol05/malaysian-dataset/blob/master/prepare-llm/calculate-size.ipynb

from bs4 import BeautifulSoup
import requests
import re
import json

# Fetch the project board; its items are embedded as JSON in a <script> tag.
resp = requests.get('https://github.com/users/huseinzol05/projects/1/views/1')
soup = BeautifulSoup(resp.content, "lxml")

data = json.loads(soup.find('script', {'id': 'memex-items-data'}).contents[0])
len(data)

# Keep only item titles that contain a parenthesised group, expected to be "(<number> <unit>)".
parsed = []
for d in data:
    t = d['memexProjectColumnValues'][0]['value']['title']['raw']
    if '(' in t and ')' in t:
        parsed.append(t)

sizes = []

# Decimal (SI) byte multipliers.
units = {
    'MB': 1e6,
    'GB': 1e9,
    'KB': 1e3,  # was 1e4, which overcounted KB entries tenfold
}

for string in parsed:
    for m in re.finditer(r'\([^()]*\)', string):
        subs = m.group()[1:-1]  # text between the parentheses
        s, unit = subs.split()
        sizes.append(float(s) * units[unit])

# Total size in GB.
sum(sizes) / 1e9
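
The loop above assumes every parenthesised group in a title is exactly `<number> <unit>`; a title with any other parenthetical would raise on `subs.split()` or on the `units` lookup. A minimal sketch of a more tolerant parser follows. The names `UNITS`, `SIZE_RE`, `parse_sizes`, and `total_gb` are hypothetical, and the sample titles in the usage note are made up for illustration, not taken from the project board.

```python
import re

# Decimal (SI) byte multipliers, the same three units as the notebook.
UNITS = {'KB': 1e3, 'MB': 1e6, 'GB': 1e9}

# Match only "(<number> <unit>)" groups; other parentheticals are ignored.
# Accepting lowercase units and optional whitespace is an assumption.
SIZE_RE = re.compile(r'\(\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*\)', re.IGNORECASE)


def parse_sizes(title):
    """Return the size in bytes of every '(<number> <unit>)' group in title."""
    return [float(n) * UNITS[u.upper()] for n, u in SIZE_RE.findall(title)]


def total_gb(titles):
    """Sum all sizes found across the given titles, in GB."""
    return sum(b for t in titles for b in parse_sizes(t)) / 1e9
```

For example, `total_gb(["corpus A (1.5 GB)", "corpus B (draft)", "corpus C (500 MB)"])` skips the `(draft)` group instead of failing on it.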
@huseinzol05

Currently 89.27078 GB.

@huseinzol05

Currently 97.054874 GB.

@huseinzol05

huseinzol05 commented Sep 11, 2023

Currently 97.751634 GB; freezing the dataset to train the LLM.

@huseinzol05 huseinzol05 moved this from In Progress to misc in Prepare LLM dataset (240 GB - 2023-11-21) Oct 15, 2023