calculate prepare LLM dataset total size #236

huseinzol05 · 2023-07-30T07:02:36Z

Notebook: https://github.com/huseinzol05/malaysian-dataset/blob/master/prepare-llm/calculate-size.ipynb

from bs4 import BeautifulSoup
import requests
import re
import json

r = requests.get('https://github.com/users/huseinzol05/projects/1/views/1')
soup = BeautifulSoup(r.content, "lxml")

data = json.loads(soup.find('script', {'id': 'memex-items-data'}).contents[0])
len(data)

parsed = []
for d in data:
    t = d['memexProjectColumnValues'][0]['value']['title']['raw']
    if '(' in t and ')' in t:
        parsed.append(t)

sizes = []

# Decimal (SI) multipliers used to convert each size to bytes.
units = {
    'MB': 1e6,
    'GB': 1e9,
    'KB': 1e3,
}

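for string in parsed:
    # Each parenthesised group is expected to look like "(<number> <unit>)".
    # The match variable is named `match` so it does not shadow the response `r`.
    for match in re.finditer(r'\([^()]*\)', string):
        start, end = match.span()
        subs = string[start + 1: end - 1]
        s, unit = subs.split()
        sizes.append(float(s) * units[unit])
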
sum(sizes) / 1e9
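
The size-parsing step in the script above can also be tried in isolation, without fetching the project page. The following is a minimal sketch, not the notebook's actual code: the helper names `title_sizes` and `total_gb` and the example titles are hypothetical, and it assumes every size annotation has the form `(<number> <unit>)` with decimal units. Unlike the loop above, it skips parentheticals that are not sizes instead of raising an error.

```python
import re

# Decimal (SI) byte multipliers; the unit names mirror those in the script above.
UNITS = {"KB": 1e3, "MB": 1e6, "GB": 1e9}

# Matches "(<number> <unit>)", e.g. "(1.5 GB)"; other parentheticals are ignored.
SIZE_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*\)")


def title_sizes(title):
    """Return the size in bytes of every "(N UNIT)" annotation in a title."""
    return [float(num) * UNITS[unit] for num, unit in SIZE_RE.findall(title)]


def total_gb(titles):
    """Sum all annotated sizes across the titles and express the result in GB."""
    return sum(size for t in titles for size in title_sizes(t)) / 1e9


# Hypothetical titles, for illustration only.
example_titles = [
    "crawl news site (1.5 GB)",
    "forum dump (500 MB)",
    "notes (draft)",  # no size annotation, so it contributes nothing
    "small wordlist (250 KB)",
]
print(total_gb(example_titles))
```

Restricting the pattern to known units means a title such as "notes (draft)" is skipped silently; whether silent skipping or a loud failure is preferable depends on how consistently the project titles are written.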

huseinzol05 · 2023-07-30T07:02:45Z

Currently 89.27078 GB.

huseinzol05 · 2023-09-08T07:21:43Z

Currently 97.054874 GB.

huseinzol05 · 2023-09-11T04:52:01Z

Currently 97.751634 GB. Freezing the dataset to train the LLM.

huseinzol05 added this to Prepare LLM dataset (240 GB - 2023-11-21) Jul 16, 2023

huseinzol05 converted this from a draft issue Jul 30, 2023

huseinzol05 moved this from In Progress to misc in Prepare LLM dataset (240 GB - 2023-11-21) Oct 15, 2023
