
Archiving an entry increases the database file size instead of decreasing it #1

Open
eyahlin opened this issue Aug 14, 2022 · 10 comments


eyahlin commented Aug 14, 2022

Steps to reproduce

  1. Check the database file size (ls -l of Data.fs in var/filestorage)
  2. Archive an entry
  3. Check the database file size again. It has increased.
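A tiny helper like the following can be used to compare the sizes (purely illustrative; the path is an assumption based on a standard buildout layout):

```python
# Hypothetical helper to compare the Data.fs size before and after archiving.
# The path assumes a standard buildout layout (var/filestorage/Data.fs).
import os

def data_fs_size(path="var/filestorage/Data.fs"):
    """Return the current size of the ZODB file storage, in bytes."""
    return os.stat(path).st_size

size_before = data_fs_size()
# ... archive an entry (and pack the database), then:
size_after = data_fs_size()
print("Data.fs changed by %d bytes" % (size_after - size_before))
```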

Current behavior

Archiving increases the database file size.

Expected behavior

Archiving should decrease the database file size, as stated in the "About" section of the README:

[screenshot of the README's "About" section]

Screenshot (optional)

[screenshot showing the Data.fs file size]

@ramonski

Please make sure that you have packed your database before and after your test to ensure old transactions are removed.
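For reference, a minimal sketch of packing a standalone Data.fs with ZODB's own API (the path is an assumption; on a ZEO or production setup you would typically pack from the ZMI or with the zeopack script instead):

```python
# Minimal sketch: pack a standalone FileStorage with ZODB's API.
# Assumes the SENAITE/Zope instance is stopped, since a FileStorage
# cannot be opened by two processes at once.
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage

storage = FileStorage("var/filestorage/Data.fs")
db = DB(storage)
db.pack(days=0)   # drop all non-current object revisions, regardless of age
db.close()
```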


eyahlin commented Aug 15, 2022

Hello. I've just packed it in that screenshot. It's actually the first write transaction I've done after the database pack and the add-on installation.


eyahlin commented Aug 17, 2022

Any ideas why the file size increased despite having packed the database?


xispa commented Aug 17, 2022

How many objects (samples, worksheets, whatever) are we talking about? Maybe there are not enough objects stored/archived to see a significant difference.


eyahlin commented Aug 17, 2022

I am planning to archive at least 300 samples, which are contained in at least 50 batches and at least 15 worksheets. This is just the initial test data. When our retention period of 3 years kicks in, it will be significantly more (at least 10 times as much). That is why I'm curious about the file size increase. If archiving ends up increasing the size, wouldn't it do more harm than good to performance?


eyahlin commented Aug 22, 2022

Hello. What do you think?


xispa commented Aug 22, 2022

> Hello. What do you think?

I think that with that few records you won't see a significant difference.


eyahlin commented Aug 22, 2022

I can create a copy of my production environment (Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.


xispa commented Aug 22, 2022

> I can create a copy of my production environment (Data.fs is 7 GB) right now to archive with more records. I could then provide you with a screenshot of the file sizes before and after archiving. How many records are enough to see a "significant difference"?

Probably yes.

> Also, is it not an issue that the Data.fs file size has increased after archiving? I believe the behavior directly contradicts the description of senaite.archive.

Some background first: SENAITE uses an object-oriented database (ZODB) that stores serialized objects. Direct searches against such a database are not performant, because the system would need to deserialize and wake up every single stored object and then check whether any of the values from its searchable fields match the search term. To overcome this, we make use of what is called a "catalog", which stores data from objects much like an SQL database does. We can then perform searches against the catalogs and wake up only the matching objects afterwards, if we want to.
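For illustration, this is roughly what that distinction looks like with a standard Plone portal_catalog (the index and metadata names below are examples only, not necessarily the ones SENAITE or senaite.archive uses):

```python
# Rough illustration of catalog search vs. waking up objects in Plone/SENAITE.
from plone import api

catalog = api.portal.get_tool("portal_catalog")

# The query runs against the catalog's indexes only: no persistent object
# is deserialized ("woken up") from the ZODB at this point.
brains = catalog(portal_type="AnalysisRequest", review_state="published")

for brain in brains:
    # Metadata columns are read straight from the lightweight catalog record.
    print(brain.getId, brain.Title)
    # Only when the full object is really needed is it loaded from the ZODB.
    obj = brain.getObject()
```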

Archive creates a small object for each sample/worksheet/etc. before the original object is definitively removed from the database. Archive also creates a catalog where the metadata of these "small" objects is stored. This allows you to search for basic information from historic data. Besides, objects are removed only when they no longer reference other objects. For instance, a worksheet will only be deleted after all its analyses have been deleted.

As you can imagine, for a database with few objects, the overhead that comes with the archive machinery may cause the database to grow rather than shrink. The number of objects required to see a "difference" depends on the size of each stored object (a sample with the Remarks field filled weighs more than a sample without remarks set) and on the number of objects left in place because they still keep references to other objects.
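To make that concrete, here is a toy, self-contained model of the trade-off. It is NOT the senaite.archive implementation (the real code is linked below) and the byte counts are invented; it only illustrates why adding small archive records plus catalog metadata can outweigh the space freed when few objects can actually be removed:

```python
# Toy model: net size change after archiving. All numbers are made up.
HEAVY_SAMPLE = 50_000     # assumed bytes for a full sample object
ARCHIVE_RECORD = 2_000    # assumed bytes for the small archive object
CATALOG_ROW = 500         # assumed bytes per catalog metadata entry

def size_delta(archived, removable):
    """Net change in database size after archiving `archived` samples,
    of which only `removable` can be deleted right away (and the DB packed)."""
    added = archived * (ARCHIVE_RECORD + CATALOG_ROW)
    freed = removable * HEAVY_SAMPLE
    return added - freed

# Few objects, most still referenced elsewhere: the database grows.
print(size_delta(archived=300, removable=5))          # positive -> larger file
# Many objects that can really be removed: the database shrinks after packing.
print(size_delta(archived=30000, removable=25000))    # negative -> smaller file
```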

For further info, the archiving and removal of old objects takes place here:
https://github.com/senaite/senaite.archive/blob/1.x/src/senaite/archive/utils.py#L169

Hope it helps


eyahlin commented Aug 22, 2022

Thanks for the explanation. I understand better now how archive works.

From what I understood, in order for the archive to "work", the database must be sufficiently large, with a lot of objects inside. My database in the screenshot is actually 7 GB and contains more than 31,000 samples. Is this size still too small for archiving to be worth it?

I ask this question because I just deployed senaite.archive in my production environment, and I want to know if I should enable it or not.
