
Archiving an entry increases the database file size instead of decreasing it #1

Open
eyahlin opened this issue Aug 14, 2022 · 10 comments


eyahlin commented Aug 14, 2022

Steps to reproduce

  1. Check the database file size (ls -l of Data.fs in var/filestorage)
  2. Archive an entry
  3. Check the database file size again. It has increased.
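A tiny helper like the following can be used to compare the sizes (purely illustrative; the path is an assumption based on a standard buildout layout):

```python
# Hypothetical helper to compare the Data.fs size before and after archiving.
# The path assumes a standard buildout layout (var/filestorage/Data.fs).
import os

def data_fs_size(path="var/filestorage/Data.fs"):
    """Return the current size of the ZODB file storage, in bytes."""
    return os.stat(path).st_size

size_before = data_fs_size()
# ... archive an entry (and pack the database), then:
size_after = data_fs_size()
print("Data.fs changed by %d bytes" % (size_after - size_before))
```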

Current behavior

Archiving increases the database file size.

Expected behavior

Archiving should decrease the database file size, as stated in the "About" section of the README:

[screenshot of the README's "About" section]

Screenshot (optional)

[screenshot showing the Data.fs file size]

@ramonski

Please make sure that you have packed your database before and after your test to ensure old transactions are removed.
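For reference, a minimal sketch of packing a standalone Data.fs with ZODB's own API (the path is an assumption; on a ZEO or production setup you would typically pack from the ZMI or with the zeopack script instead):

```python
# Minimal sketch: pack a standalone FileStorage with ZODB's API.
# Assumes the SENAITE/Zope instance is stopped, since a FileStorage
# cannot be opened by two processes at once.
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage

storage = FileStorage("var/filestorage/Data.fs")
db = DB(storage)
db.pack(days=0)   # drop all non-current object revisions, regardless of age
db.close()
```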


eyahlin commented Aug 15, 2022

Hello. I've just packed it in that screenshot. It's actually the first write transaction I've done after the database pack and the add-on installation.


eyahlin commented Aug 17, 2022

Any ideas why the file size increased despite having packed the database?


xispa commented Aug 17, 2022

How many objects (samples, worksheets, whatever) are we talking about? Maybe there are not enough objects stored/archived to see a significant difference.


eyahlin commented Aug 17, 2022

I am planning to archive at least 300 samples, which are contained in at least 50 batches and at least 15 worksheets. This is just the initial test data. When our retention period of 3 years kicks in, it will be significantly more (at least 10 times as much). That is why I'm curious about the file size increase. If archiving ends up increasing the size, wouldn't it do more harm than good to performance?


eyahlin commented Aug 22, 2022

Hello. What do you think?


xispa commented Aug 22, 2022

> Hello. What do you think?

I think that with that few records you won't see a significant difference.


eyahlin commented Aug 22, 2022

I can create a copy of my production environment (Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.


xispa commented Aug 22, 2022

> I can create a copy of my production environment (Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Probably yes.

> Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.

Some background first: SENAITE uses an object-oriented database (ZODB) that stores serialized objects. Direct searches against such a database are not performant, because the system would need to deserialize and wake up every single stored object and then check whether any of the values from its searchable fields match the search term. To overcome this, we make use of what is called a "catalog", which stores data from objects much like an SQL database does. We can then perform searches against the catalogs and wake up only the matching objects afterwards, if we want to.
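For illustration, this is roughly what that distinction looks like with a standard Plone portal_catalog (the index and metadata names below are examples only, not necessarily the ones SENAITE or senaite.archive uses):

```python
# Rough illustration of catalog search vs. waking up objects in Plone/SENAITE.
from plone import api

catalog = api.portal.get_tool("portal_catalog")

# The query runs against the catalog's indexes only: no persistent object
# is deserialized ("woken up") from the ZODB at this point.
brains = catalog(portal_type="AnalysisRequest", review_state="published")

for brain in brains:
    # Metadata columns are read straight from the lightweight catalog record.
    print(brain.getId, brain.Title)
    # Only when the full object is really needed is it loaded from the ZODB.
    obj = brain.getObject()
```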

Archive creates a small object for each sample/worksheet/etc. before the original object is definitively removed from the database. Archive also creates a catalog where the metadata of these "small" objects is stored. This allows you to search for basic information from historic data. Besides, objects are removed only when they no longer reference other objects. For instance, a worksheet will only be deleted after all its analyses have been deleted.

As you can imagine, for a database with few objects, the overhead that comes with the archive machinery may cause the database to grow rather than shrink. The number of objects required to see a "difference" depends on the size of each stored object (a sample with the Remarks field filled weighs more than a sample without remarks set) and on the number of objects left in place because they still keep references to other objects.
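To make that concrete, here is a toy, self-contained model of the trade-off. It is NOT the senaite.archive implementation (the real code is linked below) and the byte counts are invented; it only illustrates why adding small archive records plus catalog metadata can outweigh the space freed when few objects can actually be removed:

```python
# Toy model: net size change after archiving. All numbers are made up.
HEAVY_SAMPLE = 50_000     # assumed bytes for a full sample object
ARCHIVE_RECORD = 2_000    # assumed bytes for the small archive object
CATALOG_ROW = 500         # assumed bytes per catalog metadata entry

def size_delta(archived, removable):
    """Net change in database size after archiving `archived` samples,
    of which only `removable` can be deleted right away (and the DB packed)."""
    added = archived * (ARCHIVE_RECORD + CATALOG_ROW)
    freed = removable * HEAVY_SAMPLE
    return added - freed

# Few objects, most still referenced elsewhere: the database grows.
print(size_delta(archived=300, removable=5))          # positive -> larger file
# Many objects that can really be removed: the database shrinks after packing.
print(size_delta(archived=30000, removable=25000))    # negative -> smaller file
```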

For further info, the archiving and removal of old objects takes place here:
https://github.com/senaite/senaite.archive/blob/1.x/src/senaite/archive/utils.py#L169

Hope it helps


eyahlin commented Aug 22, 2022

Thanks for the explanation. I understand better now how archive works.

From what I understood, in order for the archive to "work", the database must be sufficiently large, with a lot of objects inside. My database in the screenshot is actually 7 GB and contains more than 31,000 samples. Is this size still too small for archiving to be worth it?

I ask this question because I just deployed senaite.archive in my production environment, and I want to know if I should enable it or not.
