Assess migration performance #43

Open
1 of 7 tasks
Lun4m opened this issue Dec 11, 2024 · 0 comments
Labels
enhancement New feature or request

Comments


Lun4m commented Dec 11, 2024

  • Check the size of each partition after migrations to make sure they are balanced
  • Probably we should also partition the flag table
  • The import subcommands require a connection to KDVH and Kvalobs, so they can't be used when those databases go down.
  • Use parallel index creation
  • Use the CONCURRENTLY keyword for index creation to allow concurrent writes to the tables?
  • Add missing partition
  • Find a way to deal with missing metadata
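For the two index-related tasks above, one possible approach is sketched below. PostgreSQL does not support `CREATE INDEX CONCURRENTLY` directly on a partitioned parent table, so the usual workaround is to create the index on the parent with `ON ONLY`, build it concurrently on each partition, and attach the partition indexes afterwards. The table, partition, and column names (`data`, `data_y1950_to_y2000`, `obstime`) and the helper `concurrent_index_statements` are hypothetical, and the worker count is an arbitrary illustration, not a tuned value.

```python
def concurrent_index_statements(parent, partitions, column, workers=4):
    """Return SQL statements that index every partition without blocking writes.

    All names passed in are assumed to be plain, trusted identifiers.
    """
    parent_index = f"{parent}_{column}_idx"
    statements = [
        # Let PostgreSQL use parallel workers for each index build.
        f"SET max_parallel_maintenance_workers = {workers};",
        # Create the parent index only; it stays invalid until every
        # partition index has been attached.
        f"CREATE INDEX {parent_index} ON ONLY {parent} ({column});",
    ]
    for partition in partitions:
        partition_index = f"{partition}_{column}_idx"
        # CONCURRENTLY must run outside a transaction block.
        statements.append(
            f"CREATE INDEX CONCURRENTLY {partition_index} ON {partition} ({column});"
        )
        statements.append(
            f"ALTER INDEX {parent_index} ATTACH PARTITION {partition_index};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical partition names, for illustration only.
    for sql in concurrent_index_statements(
        "data", ["data_y1950_to_y2000", "data_y2000_to_y2010"], "obstime"
    ):
        print(sql)
```

Whether the concurrent build is worth its extra cost depends on whether writes actually need to continue during the migration, which is the open question in the task list.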

KDVH

  • T_SECOND_DATA, T_MINUTE_DATA, and T_10MINUTE_DATA are not that big, so it's probably not worth dumping them by year

  • Import errors are caused either by missing entries in the elem_map_cfnames_param Stinfosys table or by a missing partition (for data earlier than 1850)

  • The timings vary widely. For T_EDATA and T_ADATA I had to limit the number of import workers because they consumed too much memory.
    But maybe that also helps, since it limits the number of concurrent writes to the database?
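As an illustration of the worker limit mentioned above, here is a minimal sketch of a bounded worker pool. It is not the migration package's actual code: `run_bounded`, `demo`, and the fake import function are hypothetical names, and the limit of 3 is arbitrary. The point is only that capping the pool size caps both the memory held by in-flight imports and the number of concurrent writers.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(jobs, worker, max_workers):
    """Run worker(job) for every job, with at most max_workers at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the input order of the results.
        return list(pool.map(worker, jobs))


def demo(n_jobs=20, max_workers=3):
    """Run fake imports and report the peak number running concurrently."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_import(station):
        # Stand-in for importing one station's dump into the database.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)
        with lock:
            state["active"] -= 1
        return station

    results = run_bounded(range(n_jobs), fake_import, max_workers)
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = demo()
    print(f"imported {len(results)} stations, peak concurrency {peak}")
```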

| table | # stations | # elements | dump time (errors) | dump size | import time (errors) | # rows imported [1] |
|---|---|---|---|---|---|---|
| T_HOMOGEN_DIURNAL | 15 | 4 | 0:00:06 | 9.2M | --- | --- |
| T_HOMOGEN_MONTH | 1641 | 2 | 0:01:37 | 21M | --- | --- |
| T_CDCV_DATA | 2 | 94 | 0:01:58 | 79M | --- | --- |
| T_ADATA_LEVEL | 46 | 74 | 0:02:42 | 121M | --- | --- |
| T_TJ_DATA | 73 | 99 | 0:08:01 | 431M | 0:00:10 (1) | 691,045 |
| T_MDATA | 74 | 73 | 0:09:57 | 1.4G | 0:02:24 (64) | 18,300,433 |
| T_METARDATA | 81 | 47 | 0:15:29 | 4.8G | 0:20:17 (221) | 180,587,866 |
| T_PDATA | 321 | 4 | 0:31:21 | 3.1G | 0:03:29 (1) | 11,405,461 |
| T_AVINOR | 11 | 52 | 0:32:23 | 8.3G | --- | --- |
| T_NDATA | 1254 | 16 | 0:33:14 | 6.9G | 0:27:42 | 241,525,015 |
| T_SECOND_DATA | 135 | 1 | 0:34:09 | 237M | --- | --- |
| T_VDATA | 626 | 94 | 0:55:03 | 9.5G | 0:40:26 (523) | 334,744,233 |
| T_MONTH | 3059 | 85 | 1:09:06 | 2.9G | 0:18:17 (7650) | 25,712,282 |
| T_EDATA | 19 | 124 | 2:52:41 | 18G | 0:53:14 | 609,550,143 |
| T_SVVDATA | 460 | 46 | 3:05:13 | 44G | --- | --- |
| T_UTLANDDATA | 590 | 106 | 3:08:00 | 17G | 0:00:02 (92) | 126,403 |
| T_DIURNAL | 2939 | 79 | 3:31:25 | 35G | 2:05:45 [2] (6445) | 871,355,731 |
| T_MINUTE_DATA | 38 | 80 | 6:08:08 [3] | 11G | --- | --- |
| T_10MINUTE_DATA | 263 | 101 | [4] | 11G | --- | --- |
| T_MERMAID | 107 | 613 | 14:13:17 (4) | 19G | --- | --- |
| T_ADATA | 1001 | 294 | 2d, 12:09:13 | 55G | 1:44:02 (1028) | 1,056,754,243 |
| TOTAL | --- | --- | 3d, 2:03:03 | ~248G | 6:35:48 | 3,350,752,855 |
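To compare the import timings in the table on a common scale, one option is to convert each duration to seconds and divide the imported row count by it. The sketch below does that for a few rows copied from the table; `parse_duration` and `rows_per_second` are hypothetical helper names, and the result ignores differences in row width and worker count between tables, so it is only a rough indicator.

```python
import re


def parse_duration(text):
    """Convert 'H:MM:SS' or 'Nd, H:MM:SS' (as used in the table) to seconds."""
    match = re.fullmatch(r"(?:(\d+)d,\s*)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def rows_per_second(rows, import_time):
    """Average import throughput for one table."""
    return rows / parse_duration(import_time)


# (import time, rows imported) copied from the KDVH table.
IMPORTS = {
    "T_NDATA": ("0:27:42", 241_525_015),
    "T_EDATA": ("0:53:14", 609_550_143),
    "T_DIURNAL": ("2:05:45", 871_355_731),
    "T_ADATA": ("1:44:02", 1_056_754_243),
}

if __name__ == "__main__":
    for table, (import_time, rows) in IMPORTS.items():
        print(f"{table}: {rows_per_second(rows, import_time):,.0f} rows/s")
```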

Kvalobs

  • first row: data table; second row: text_data table
  • here dump time considers both labels and observations

| table | range | # stations | # labels | dump time | dump size | import time |
|---|---|---|---|---|---|---|
| data | 2006-01-01 to 2025-01-13 | 2,356 | 46,025 | 2:32:42 | 12G | |
| text | 2006-01-01 to 2025-01-13 | 1,608 | 3,004 | 0:05:31 | 915M | |

Histkvalobs

| table | range | # labels | labels dump time | # stations | dump time | dump size | import time |
|---|---|---|---|---|---|---|---|
| data | 1700-01-01 to 2000-01-01 | 2,029 | 0:02:13 | 201 | 0:01:26 | 2.1G | |
| data | 2000-01-01 to 2006-01-01 | 22,364 | 0:21:06 | 786 | 0:01:21 | 2G | |
| data | 2006-01-01 to 2008-01-01 | 17,693 | 1:09:38 | 843 | 0:03:21 | 5.7G | |
| data | 2006-01-01 to 2025-01-13 | 195,012 | 18:27:08 | --- | --- | --- | --- |
| data | 2008-01-01 to 2013-01-01 | 101,593 | --- | 7,721 | 1:14:44 | 26G | |
| data | 2013-01-01 to 2018-01-01 | 109,430 | --- | 6,419 | 2:36:31 | 85G | |
| data | 2018-01-01 to 2023-01-01 | --- | | | | | |
| data | 2023-01-01 to 2025-01-01 | --- | | | | | |
| text | 2006-01-01 to 2025-01-13 | 6,228 | 0:21:41 | 2,140 | 2:10:26 | 26G | |
| TOTAL | --- | --- | | | | | |

Partition size check

| range | data count (after kdvh) | text count (after kdvh) | data count (after kvalobs) | text count (after kvalobs) |
|---|---|---|---|---|
| 1750-1850 | | | | |
| 1850-1950 | 146,512,417 | 23,716,443 | | |
| 1950-2000 | 1,980,470,183 | 134,865,284 | | |
| 2000-2010 | 776,874,157 | 26,137,253 | | |
| 2010-2015 | 148,924,938 | 1,014,354 | | |
| 2015-2016 | 29,813,809 | 246,495 | | |
| 2016-2017 | 29,864,784 | 229,259 | | |
| 2017-2018 | 11,521,613 | 250,647 | | |
| 2018-2019 | 7,158,211 | 221,792 | | |
| 2019-2020 | 7,540,807 | 261,057 | | |
| 2020-2021 | 7,785,857 | 281,953 | | |
| 2021-2022 | 7,940,701 | 278,115 | | |
| 2022-2023 | 8,610,380 | 302,971 | | |
| 2023-2024 | 9,091,440 | 358,020 | | |
| 2024-2025 [5] | 75,139,998 | 8,374,721 | | |
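A quick way to judge whether the partitions are balanced is to compare each partition's row count with the mean. The sketch below uses the data counts from the table, assuming the two filled-in columns are the "after kdvh" counts; `partition_shares` and `oversized` are hypothetical helpers, and the factor of 2 is an arbitrary threshold chosen for illustration.

```python
def partition_shares(counts):
    """Fraction of all rows held by each partition."""
    total = sum(counts.values())
    return {name: rows / total for name, rows in counts.items()}


def oversized(counts, factor=2.0):
    """Partitions holding more than `factor` times the mean row count."""
    mean = sum(counts.values()) / len(counts)
    return [name for name, rows in counts.items() if rows > factor * mean]


# Data counts copied from the table (assumed to be the "after kdvh" column).
# The empty 1750-1850 partition is left out.
DATA_COUNTS = {
    "1850-1950": 146_512_417,
    "1950-2000": 1_980_470_183,
    "2000-2010": 776_874_157,
    "2010-2015": 148_924_938,
    "2015-2016": 29_813_809,
    "2016-2017": 29_864_784,
    "2017-2018": 11_521_613,
    "2018-2019": 7_158_211,
    "2019-2020": 7_540_807,
    "2020-2021": 7_785_857,
    "2021-2022": 7_940_701,
    "2022-2023": 8_610_380,
    "2023-2024": 9_091_440,
    "2024-2025": 75_139_998,
}

if __name__ == "__main__":
    for name, share in partition_shares(DATA_COUNTS).items():
        print(f"{name}: {share:.1%}")
    print("oversized:", oversized(DATA_COUNTS))
```

With these numbers the 1950-2000 and 2000-2010 partitions stand out, which suggests they are the first candidates for splitting if balanced partitions are the goal.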

Footnotes

  1. only counts rows in the data and nonscalar_data tables; the total should be roughly double if the flags.kvdata table is also counted

  2. accidentally duplicated some of the data

  3. data from 2024 was not dumped for stations 7420, 15270, 16400

  4. accidentally deleted the log file, but the dump is missing data from 2024

  5. not that reliable since we started ingesting from obsinn at the end of 2024

Lun4m added the enhancement label on Dec 14, 2024
Lun4m changed the title from "Assess performance of migration package" to "Assess migration performance" on Dec 24, 2024