Reference-based IPs #747

Open
jmaferreira opened this issue Sep 4, 2024 · 6 comments

@jmaferreira

Reference-based IPs

The goal of this recommendation is to add the capability to build IPs that can transport metadata about content files without having to include the files themselves.

Essentially, the IP will only carry pointers or references to the actual files. This approach has been referred to as "Shallow IPs".

In theory, this functionality is already possible without altering the current implementation, but it would be beneficial to include a section in the CSIP detailing how to achieve this. This recommendation affects all information packages (SIP, AIP and DIP).

Use cases

One of the primary use cases supporting this recommendation concerns I/O logistics and performance. Pre-ingest is responsible for organising data into SIPs before submitting them to the archive. The traditional approach involves copying the data into a physically bound E-ARK SIP, i.e., a package that includes all the referenced files within its physical boundaries.

This SIP is then submitted to the archive, which generates E-ARK AIPs. After verifying that the ingest was successful, the system must then remove the E-ARK SIP and the original data. Additionally, the system must handle ingest failures and the potential need to regenerate SIPs. This process can lead to several issues, most notably the requirement for significant storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP, and the AIP copies.

Moreover, the I/O operations needed to move the data around can place unnecessary stress on the archival system and storage infrastructure.

Instead of duplicating the data within the archival system, Shallow IPs allow for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage and minimizing the load on the repository’s I/O operations.

This approach ensures that data can be efficiently managed and accessed without unnecessary duplication, improving performance and reducing resource consumption. However, it also increases the risk of data loss, as the repository lacks full control over the data. In certain scenarios, though, this risk may be acceptable in exchange for reducing the use of expensive storage, e.g., when the pre-ingest area and the archival storage area share the same infrastructure.

Provided that the OAIS repository is able to retrieve a file whenever it is needed for access or preservation operations, the ability to refer to external content can greatly reduce the local resources needed to manage an OAIS system, and can allow modern storage systems to be used as a backend for storing E-ARK AIPs. It can also reduce the overall amount of storage an institution needs when the content is held both in the current information management system (e.g., an ERMS) and in the OAIS archive.

For example, a TV broadcast station may have its own archive, which demands significant storage space. While backups and remote replicas are already in place, they may wish to incorporate this data into an OAIS archive to benefit from enhanced preservation features (e.g. Representation Information). However, duplicating the storage or transferring the content into the OAIS archive, and modifying the production system to rely on the OAIS archive, would not be feasible due to the prohibitive costs of such strategies.

The alternative would be for the OAIS archive to refer to the content in the production system, allowing curators to create shallow SIPs and submit them to the OAIS archive. The OAIS archive would need to access the external content and execute ingest workflow validations.

Preservation actions like fixity checks, file format identification, file format validation, and file format conversion could be done using the external data as input as well. However, every outcome of these operations shall become local files (including the outcome of the file format conversion actions). We recommend keeping all metadata (descriptive, preservation, other) local to the AIPs, while allowing representation data to be external/remote.

CSIP principles 1.4 and 3.6 state that:

  • "the Information Package SHOULD be scalable" and that
  • "[...] it is clear that any given technical implementation will become obsolete in time, for example, as new transfer methods and storage solutions emerge. As such this specification does not prohibit the take-up of any emerging logical or physical technical solutions."

Taking these principles into account, we suggest a modification to the CSIP text (which would apply to SIP, AIP, and DIP) to allow representation data to be non-local.

We propose altering the following sections in the Common IP specification:

Section 4. CSIP structure

Update text from:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended.

The main reason for this implementation decision is that a fixed and documented folder structure makes the package layout clear to both human users and automated tools. The main benefit this clarity is that many archival tasks (e.g. file format risk analysis), can be executed directly on the data portion of the package structure, as opposed to first processing potentially large amounts of metadata for file locations. This allows for more efficient processing which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

Many data storage solutions do not explicitly support folder structures, but use other means for structuring and storing AIP data and metadata. However, the purpose of this specification is to facilitate and support Information Package interoperability. When storage solutions do not support the implementation of the package structure for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs. While systems need to implement transformations between Common Specification IPs and internal AIPs it allows interoperability between tools that support the common specification, easy transfer of IPs to new repository systems or storage solutions and the establishment of multi-repository duplicated storage solutions.

To:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended. However, it is recognised that reference-based packages, where files can be physically separated and referred to by links, are also viable.

The main reason for preferring a fixed and documented folder structure is that it makes the package layout clear to both human users and automated tools. This clarity is crucial as it allows many archival tasks (e.g., file format risk analysis) to be executed directly on the data portion of the package structure, without the need to first process potentially large amounts of metadata to locate files. This leads to more efficient processing, which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

While many data storage solutions do not explicitly support folder structures and may use alternative methods, such as Content Addressable Storage systems (CAS) that allow files to be stored separately and retrieved by a unique identifier (and not by its name and location), the purpose of this specification remains to facilitate and support Information Package interoperability. Even when storage solutions do not support the implementation of the folder structures for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs mainly for transport reasons.

Systems that use reference-based packages should ensure that they implement the necessary processes to consolidate referenced files into a local physical structure, enabling interoperability between tools that support the common specification. This also allows for the easy transfer of IPs to new repository systems or storage solutions and the establishment of multi-repository duplicated storage solutions.

Section 4.3 Implementing reference-based information packages (NEW SECTION)

In certain implementation contexts, it is advantageous for the files included in an information package - whether it be an SIP, AIP, or DIP - to be referenced via links rather than physically bound together in a folder structure.

One of the primary use cases supporting this recommendation concerns I/O logistics and performance. Pre-ingest is responsible for organising data into SIPs before submitting them to the archive. The traditional approach involves copying the data into a physically bound E-ARK SIP, i.e., a package that includes all the referenced files within its physical boundaries.

This SIP is then submitted to the archive, which generates E-ARK AIPs. After verifying that the ingest was successful, the system must then remove the E-ARK SIP and the original data. Additionally, the system must handle ingest failures and the potential need to regenerate SIPs. This process can lead to several issues, most notably the requirement for significant storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP, and the AIP copies.

Moreover, the I/O operations needed to move the data around can place unnecessary stress on the archival system and storage infrastructure.

Reference-based IPs, also known as Shallow IPs, offer a solution by allowing for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage and minimising the strain on the repository’s I/O operations. This approach enables data to be managed and accessed without unnecessary duplication, leading to improved performance and reduced resource consumption. However, it is important to note that this approach carries an increased risk of data loss, as the repository does not maintain full control over the physical storage of the data, and may reduce performance when data needs to be fetched over a low-bandwidth communication channel. Despite these risks, in certain scenarios, the trade-off may be justified in order to reduce the use of costly resources.

Reference-based IPs can significantly optimise file storage usage, particularly in scenarios where large volumes of data remain actively used by other systems that function as both processing and archival platforms.

For instance, in environments where data is stored in Content Addressable Storage (CAS) or cloud-based systems like Amazon S3 and OpenStack Swift, the advantages of reference-based IPs are evident. These systems inherently manage data through references or identifiers, making reference-based IPs an ideal solution for efficient storage and retrieval without the need for duplicating data before and after archiving.

Examples

Typically, the METS file included in an Information Package contains relative URLs that reference files included within the package itself.

Below is an example of an IP that references files available locally within the IP.

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500" MIMETYPE="application/xml"
    SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="schemas/xlink.xsd"/>
</file>

However, the METS specification supports several options for LOCTYPE, and the URL option supports the full expressiveness of the URL standard. To create a reference-based information package one should make use of this expressiveness and specify the full location of external files.

The following excerpt is an example of an IP that references files available remotely instead of stored in a local folder structure:

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500"
    MIMETYPE="application/xml" SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="https://www.w3.org/XML/2008/06/xlink.xsd"/>
</file>

Other protocols may also be used. The following example depicts how files may be located on a shared drive, accessible via the NFS protocol:

<file
    ID="uuid-DD985B25-9328-4F4E-B387-A6F20CA44500"
    MIMETYPE="application/xml" SIZE="3180"
    CREATED="2020-12-22T10:28:45.668Z"
    CHECKSUM="F1F5BB6003165CDD8F6C1FCC32F8FD1F965E1681010F3B9806D9460BCFFA8A3C"
    CHECKSUMTYPE="SHA-256">
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="nfs://shared-host/schemas/xlink.xsd"/>
</file>
@jmaferreira jmaferreira added the Feature request This issue is a feature which will be implemented further on. Used together with a milestone. label Sep 4, 2024
@jmaferreira jmaferreira self-assigned this Sep 4, 2024
@adamfarquhar

It is helpful to discuss this explicitly. The TV Broadcaster provides a good use case.

“referenced via links rather than physically bound together in a folder structure” - from the METS example, everything is referenced using a link (FLocat) – both files included in the IP and ones that are not. The difference is that the included files have a relative URL. Perhaps there is another way to make the difference clear.

The METS examples would be clearer and easier to understand if they referred to a real content file rather than an XML schema file. The differences would be clearer if they were highlighted or even if the text just showed the new href. As it is, the reader has to parse it carefully and confirming that the ID and CHECKSUMs are the same is not readily done by eye!

“place unnecessary stress on the archival system and storage infrastructure” – What does this mean? The text already mentions the cost of multiple copies of the data.

I think that the text may underplay the risks that can come from referencing the content. Take the TV Broadcaster, for example. Imagine that disk space is tight and a big event risks filling up available storage. It is easy to make an operational decision to delete files that seem unlikely to be used. Or a recording is embarrassing to a presenter and they put pressure on operations staff to delete it. Or the operational version is directly edited. It is also harder to ensure that reference-based copies are physically distinct.

Using remote/cloud-based storage that is part of the archival solution does not introduce the risks that you might see when using copies in an operational system.

“the package layout clear to both human users and automated tools” – as an aside, I guess that this is more about being a lowest-common-denominator technology (file-system with files and folders) and about easily using tools designed for another context.

A final editing pass should catch the minor errors in the very clear writing.

@shsdev
Contributor

shsdev commented Sep 5, 2024

I believe the proposal is reasonable. I am familiar with requirements of this kind related to archive systems that use an S3 storage backend. However, regarding the AIP, I would suggest adding a subordinate requirement that it must always be possible to generate an all-inclusive AIP, which can be divided or split for very large AIPs. The reasoning behind this is that when using external storage for AIPs, it is then still possible to safeguard all-inclusive AIPs on more affordable long-term storage options, such as tape drives, for example.

@jmaferreira
Author

jmaferreira commented Nov 3, 2024

Revised text taking into account Adam and Sven's comments. @karinbredenberg I believe the text is good enough to initiate the voting process.

Implementing Reference-Based Information Packages (IPs)

The goal of this recommendation is to add the capability to build IPs that can transport metadata about content files without having to include the content files themselves.

Essentially, the IP will only carry pointers or references to the actual content files. This approach has been previously referred to as "Shallow IPs".

In theory, this functionality is already possible without altering the current implementation; however, it would be beneficial to include a section in the CSIP detailing how to achieve this. This recommendation affects all information packages (SIP, AIP and DIP).

Use cases

One of the primary use cases supporting this recommendation concerns I/O performance and storage space. Pre-ingest is responsible for the process of organising data into SIPs before submitting them to the OAIS. The traditional approach involves copying data into a physically bound E-ARK SIP, i.e., a package that includes all the metadata and content files within its physical boundaries.

The SIP is then submitted to the OAIS, which transforms it into proper AIPs. Typically, after verifying that the ingest was successful, the system eliminates the original SIP and all its included data and metadata.

The OAIS must be able to handle ingest failures and the potential need to reingest SIPs. This process can lead to several issues, most notably the requirement for significant storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP, and the AIP copies.

Moreover, I/O operations needed to move the data from its original location to the temporary ingest storage can place unnecessary stress on the archival storage infrastructure.

Rather than duplicating data within the archival system, Shallow IPs allow for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage and minimizing I/O operations.

This approach ensures that data can be efficiently managed and accessed without unnecessary duplication, improving performance and reducing resource consumption. However, it increases the risk of data loss, as the repository lacks full control over the data. In certain scenarios, though, this risk may be acceptable in exchange for reducing the use of large amounts of expensive storage, e.g., when the pre-ingest area and the archival storage area share the same infrastructure.

As long as the OAIS is able to retrieve a file whenever it is needed for access or preservation operations, the ability to refer to external content can greatly reduce the local resources needed to manage an OAIS system, and can allow modern storage systems to be used as a backend for storing E-ARK AIPs. It can also reduce the overall amount of storage an institution needs when the content is held both in the current information management system (e.g., an Electronic Records Management System) and in the OAIS.

For example, a TV broadcast station may have its own archival storage, which already demands significant storage space. While backups and remote replicas are already in place, they may wish to incorporate this data into an OAIS to benefit from enhanced preservation features (e.g. Representation Information and Preservation Planning). Duplicating the storage or transferring the content into the OAIS, and modifying the production system to rely on the OAIS archival storage would not be feasible due to the prohibitive costs of such a strategy.

Please note that this approach does not come without risks. Taking the TV broadcaster as an example, one can imagine that, if disk space is running tight, an operational decision may be made to delete files that seem unlikely to be used, without proper regard for the preservation policy put in place by the OAIS. An important characteristic of this strategy is immutability, which means that the content should not change once it has been ingested into the OAIS. The organisation should employ safeguards to ensure the integrity of the content stored in external environments.

The alternative would be for the OAIS to maintain references to the files managed by the production system. This can be accomplished by ingesting Reference-based SIPs that contain pointers to the content files instead of the files themselves. In this scenario, the OAIS would not have full control over the content files. Instead, it would have to request those files from an external archival storage using a well-known access protocol.

Preservation actions like fixity checks, file format identification, file format validation, and file format conversion would have to be done using the external data as input. However, every outcome of these operations shall become local files (including the outcome of the file format conversion actions). Therefore, we recommend keeping all metadata (descriptive, preservation, other) local to the AIPs within the OAIS, while allowing representation data to be externally managed by another archival storage.
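
To illustrate the fixity check mentioned above, here is a minimal Python sketch of how a repository might verify externally held content against the CHECKSUM recorded in the local METS without first copying the file into the package. The function name `verify_fixity`, the chunk size, and the in-memory stream standing in for the remote file are assumptions made for illustration; how the stream is actually obtained (HTTP, NFS, S3, ...) is outside the scope of the sketch.

```python
import hashlib
import io
from typing import BinaryIO


def verify_fixity(stream: BinaryIO, expected_hex: str,
                  checksum_type: str = "SHA-256",
                  chunk_size: int = 1024 * 1024) -> bool:
    """Hash a byte stream in chunks and compare it with a METS CHECKSUM value."""
    # METS writes algorithm names like "SHA-256"; hashlib expects "sha256".
    digest = hashlib.new(checksum_type.replace("-", "").lower())
    # Read in chunks so large remote files are never held in memory at once.
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    # METS examples use upper-case hex while hashlib returns lower-case.
    return digest.hexdigest().lower() == expected_hex.strip().lower()


# Hypothetical stand-in for content fetched from the external system.
remote_content = io.BytesIO(b"example content")
recorded_checksum = hashlib.sha256(b"example content").hexdigest().upper()
print(verify_fixity(remote_content, recorded_checksum))
```

The boolean result is the kind of outcome that, following the recommendation above, would be recorded locally in the AIP's preservation metadata rather than in the external system.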

The Common Specification for Information Packages (CSIP) principles 1.4 and 3.6 state that:

  • "the Information Package SHOULD be scalable" and that
  • "[...] it is clear that any given technical implementation will become obsolete in time, for example, as new transfer methods and storage solutions emerge. As such this specification does not prohibit the take-up of any emerging logical or physical technical solutions."

Taking these principles into account, we suggest a modification of the CSIP (which would apply to SIP, AIP, and DIP) to allow representation data to be non-local.

To attain this goal, we propose altering the following sections in the CSIP:

Section 4. CSIP structure

Update text from:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended.

The main reason for this implementation decision is that a fixed and documented folder structure makes the package layout clear to both human users and automated tools. The main benefit this clarity is that many archival tasks (e.g. file format risk analysis), can be executed directly on the data portion of the package structure, as opposed to first processing potentially large amounts of metadata for file locations. This allows for more efficient processing which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

Many data storage solutions do not explicitly support folder structures, but use other means for structuring and storing AIP data and metadata. However, the purpose of this specification is to facilitate and support Information Package interoperability. When storage solutions do not support the implementation of the package structure for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs. While systems need to implement transformations between Common Specification IPs and internal AIPs it allows interoperability between tools that support the common specification, easy transfer of IPs to new repository systems or storage solutions and the establishment of multi-repository duplicated storage solutions.

To:

The preferred implementation of the logical model described in Principle 3.6 is a strict physical (folder) structure that precisely follows the logical structure. While the specification does not prohibit alternative implementations of the conceptual model, the practice is not recommended. However, it is recognised that Reference-based packages, where files can be physically separated and referred to by links, are also viable.

The main reason for preferring a fixed and documented folder structure is that it makes the package layout clear to both human users and automated tools. This clarity is crucial as it allows many archival tasks (e.g., file format risk analysis) to be executed directly on the data portion of the package structure, without the need to first process potentially large amounts of metadata to locate files. This leads to more efficient processing, which is valuable in the case of large collections and bulk operations. A fixed folder structure, therefore, provides efficiency and scalability.

While many data storage solutions do not explicitly support folder structures and may use alternative methods, such as Content Addressable Storage systems (CAS) that allow files to be stored separately and retrieved by a unique identifier (and not by its name and location), the purpose of this specification remains to facilitate and support Information Package interoperability. Even when storage solutions lack support for the implementation of folder structures for native AIP storage, it is still possible to implement the physical structure for SIPs and DIPs mainly for transport reasons.

Systems that use Reference-based packages MUST ensure that they implement the necessary processes to consolidate referenced files into an all-inclusive local physical structure, enabling interoperability between tools that support the common specification. This approach enables easy transfer of IPs to new repository systems or to more affordable long-term storage options such as tapes or cold data storage systems and the establishment of multi-site duplicated storage solutions.
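
As a rough illustration of the consolidation requirement, the following Python sketch copies externally referenced files into a local folder structure and returns the rewritten relative references. Everything here is an assumption made for the example: the function name `consolidate`, the `fetch` callable standing in for a real protocol client, and the choice of `representations/rep1/data` as the target folder. It is not a description of any existing tool.

```python
import tempfile
from pathlib import Path, PurePosixPath
from typing import Callable, Dict, Iterable
from urllib.parse import urlsplit


def consolidate(hrefs: Iterable[str], package_root: Path,
                fetch: Callable[[str], bytes],
                data_dir: str = "representations/rep1/data") -> Dict[str, str]:
    """Copy externally referenced files into the package; return old -> new hrefs."""
    rewritten: Dict[str, str] = {}
    for href in hrefs:
        parts = urlsplit(href)
        if not parts.scheme:
            # Already a relative, package-local reference: nothing to copy.
            rewritten[href] = href
            continue
        remote_path = PurePosixPath(parts.path.lstrip("/"))
        if ".." in remote_path.parts:
            # Refuse paths that could escape the package folder.
            raise ValueError(f"unsafe path in reference: {href}")
        # Mirror the remote path below the (assumed) representation data folder.
        relative = PurePosixPath(data_dir) / remote_path
        target = package_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(href))
        rewritten[href] = relative.as_posix()
    return rewritten


with tempfile.TemporaryDirectory() as tmp:
    mapping = consolidate(
        ["nfs://shared-storage-host/council-meetings/2024-01-05/File1.doc"],
        Path(tmp),
        fetch=lambda url: b"placeholder bytes",  # stand-in for a real client
    )
    print(mapping)
```

A real implementation would also update the xlink:href values in the METS file with the returned mapping and re-verify checksums after the copy.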

Section 4.3 Implementing reference-based information packages (NEW SECTION)

In certain implementation contexts, it is advantageous for the files included in an information package - whether it be an SIP, AIP, or DIP - to be referenced via links rather than physically bound together in a folder structure.

One of the primary use cases supporting this recommendation concerns I/O performance and storage space. Pre-ingest is responsible for the process of organising data into SIPs before submitting them to the archive. The traditional approach involves copying the data into a physically bound E-ARK SIP, i.e., a package that includes all the metadata and content files within its physical boundaries.

The SIP is then submitted to the archive, which transforms it into proper AIPs. Typically, after verifying that the ingest was successful, the system eliminates the original SIP and all its included data and metadata.

The archive must be able to handle ingest failures and the potential need to reingest SIPs. This process can lead to several issues, most notably the need for a significant amount of storage space, which could reach up to three times the original size of the data if we consider the uningested data, the SIP being transferred, and the AIP copies.

Moreover, I/O operations needed to move the data from its original location to the temporary ingest storage can place unnecessary stress on the archival storage infrastructure.

Rather than duplicating data within the archival system, Shallow IPs allow for the efficient referencing of data stored in external systems, thereby reducing the need for extensive storage space and minimizing I/O operations.

Reference-based IPs, also known as Shallow IPs, offer a solution by allowing for the referencing of data stored in external systems without the need to copy it into the archival storage of the OAIS. This approach enables data to be managed and accessed without unnecessary duplication, leading to improved performance and reduced resource consumption. However, it is important to note that this approach carries an increased risk of data loss, as the repository does not maintain full control over the physical storage where the data is held, and may reduce performance when data needs to be fetched through a low-bandwidth communication channel. Despite these risks, in certain scenarios, the trade-off may be justified in order to reduce the use of costly resources and time.

Reference-based IPs can significantly optimise file storage usage, particularly in scenarios where large volumes of data remain actively used by other systems that may function both as processing and archival storage platforms.

For instance, in environments where data is stored in Content Addressable Storage (CAS) or cloud-based systems like Amazon S3 and OpenStack Swift, the advantages of reference-based IPs are evident. These systems inherently manage data through references or identifiers, making reference-based IPs an ideal solution for efficient storage and retrieval without the need for duplicating data before and after archiving.

Examples

Typically, the METS file included in an Information Package contains relative URLs that reference files included within the package itself. In the following example, the <FLocat> element uses the href attribute to refer to a local file by using a relative path to the file within the information package.

<file
    [...]
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="representations/rep1/data/council-meetings/2024-01-05/File1.doc"/>
</file>

The METS specification supports several options for the LOCTYPE attribute, such as URN, URL, PURL, HANDLE, DOI, and OTHER. The URL option supports the full expressiveness of the Uniform Resource Locator standard. To create a Reference-based information package one should make use of this expressiveness and specify the full location of the external files.

The following excerpt is an example of an IP that references files available remotely instead of stored in a local folder structure. One can easily spot this difference in the href attribute as it references a file via a full resource locator composed of scheme, server, port and a full path to the file within the server.

<file
    [...]
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="https://production-system-host:80/council-meetings/2024-01-05/File1.doc"/>
</file>

Other protocols may also be used. The following example depicts how files may be referenced on a shared folder accessible via the NFS protocol:

<file
    [...]
    <FLocat
        LOCTYPE="URL"
        xlink:type="simple"
        xlink:href="nfs://shared-storage-host/council-meetings/2024-01-05/File1.doc"/>
</file>
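
To make the distinction between the three examples concrete, here is a small Python sketch that reads FLocat references from a METS fragment and labels each one as local (relative URL) or external (URL with a scheme). The wrapper `fileGrp`, the `ID` values and the function name `classify_references` are hypothetical, added only so the sample parses on its own; the href values are the ones used in the examples above.

```python
import xml.etree.ElementTree as ET
from typing import Dict
from urllib.parse import urlsplit

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical fragment combining the three href styles shown above.
SAMPLE_METS = """\
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:file ID="file-local">
    <mets:FLocat LOCTYPE="URL" xlink:type="simple"
        xlink:href="representations/rep1/data/council-meetings/2024-01-05/File1.doc"/>
  </mets:file>
  <mets:file ID="file-https">
    <mets:FLocat LOCTYPE="URL" xlink:type="simple"
        xlink:href="https://production-system-host:80/council-meetings/2024-01-05/File1.doc"/>
  </mets:file>
  <mets:file ID="file-nfs">
    <mets:FLocat LOCTYPE="URL" xlink:type="simple"
        xlink:href="nfs://shared-storage-host/council-meetings/2024-01-05/File1.doc"/>
  </mets:file>
</mets:fileGrp>
"""


def classify_references(mets_xml: str) -> Dict[str, str]:
    """Map each file ID to 'local' or 'external' based on its FLocat href."""
    root = ET.fromstring(mets_xml)
    result: Dict[str, str] = {}
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        if flocat is None:
            continue
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        # A relative URL has no scheme; anything with a scheme is treated
        # as pointing outside the package (an assumption of this sketch).
        result[file_el.get("ID", "")] = "external" if urlsplit(href).scheme else "local"
    return result


print(classify_references(SAMPLE_METS))
```

A check like this could let a validator or ingest workflow decide which files it can read directly from the package and which it has to request from another system.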

@karinbredenberg
Contributor

karinbredenberg commented Nov 14, 2024

The suggestion is:

  • To update CSIP with information about referencing content housed outside of the IP
  • Add a new chapter with description of how referencing outside the IP is done

Board members' acknowledgment of the issue:
Tick the box in front of your name to indicate that you have looked at the suggestion.

  • Karin Bredenberg (Kommunalförbundet Sydarkivera, chair)
  • Anders Bo Nielsen (National Archives of Denmark)
  • Anja Paulič (National Archives of Slovenia)
  • Arne-Kristian Groven (National Archives of Norway)
  • Gregor Zavrsnik (Geoarh)
  • Janet Anderson (Highbury Research & Development Ltd.)
  • Maya Bangerter (Swiss Federal Archives)
  • Miguel Ferreira (KEEP Solutions)
  • Stephen Mackey (Penwern Limited)
  • Sven Schlarb (Austrian Institute of Technology)

Voting
(Decision making will be carried out on the basis of majority voting by all eligible members of the Board acknowledging the issue. In the case of a tied vote, decisions will be made at the discretion of the Chair)

Tick the box in front of your name to say yes to the suggestion.

  • Karin Bredenberg (Kommunalförbundet Sydarkivera, chair)
  • Anders Bo Nielsen (National Archives of Denmark)
  • Anja Paulič (National Archives of Slovenia)
  • Arne-Kristian Groven (National Archives of Norway)
  • Gregor Zavrsnik (Geoarh)
  • Janet Anderson (Highbury Research & Development Ltd.)
  • Maya Bangerter (Swiss Federal Archives)
  • Miguel Ferreira (KEEP Solutions)
  • Stephen Mackey (Penwern Limited)
  • Sven Schlarb (Austrian Institute of Technology)

@stephenmackey

I was not involved in the consortium when the CSIP was originally drafted, but my interpretation of the specification as a user is that it has a preference for physical IPs (i.e. ones that contain content data and preservation description information). In deference to software designs that do not manage data within the OAIS as physical folder structures, it allows for this implementation as long as physical SIPs can be ingested and physical DIPs exported. An argument can be made that the CSIP defines a folder structure but does not mandate that this actually contains the content data; however, if that were so, the following would not be stated:

The main reason for this implementation decision is that a fixed and documented folder structure makes
the package layout clear to both human users and automated tools. The main benefit of this clarity is that
many archival tasks (e.g. file format risk analysis), can be executed directly on the data portion of the package
structure, as opposed to first processing potentially large amounts of metadata for file locations. This allows
for more efficient processing which is valuable in the case of large collections and bulk operations. A fixed
folder structure, therefore, provides efficiency and scalability.

Specifically:

  1. I see this proposal as changing a fundamental principle of the CSIP rather than slightly adjusting it, and it requires more debate. Such a change, if carried, requires significant rewrites of the CSIP (and probably other specifications); otherwise the CSIP and other specifications will look (in my view) contradictory.

  2. I disagree with the term 'Shallow IP', which is not a common term for me. This would be better called a 'Virtual IP', as it is not a physical package.

  3. Section 4.3 on implementation is appropriate for a guideline, not the specification.

  4. This change is proposed for the CSIP and so should apply to all packages: SIP, AIP, DIP. Consequently, it proposes virtual SIPs and DIPs as well as AIPs. So, the OAIS sends or receives packages that contain no data and only reference external storage systems. This is not a model for an OAIS that I recognise.

Proposal:

The reference model for an OAIS is well defined and understood and contains all of the necessary functions, data and metadata. The OAIS accepts SIPs containing data and distributes DIPs containing data. AIPs are held in the OAIS. The boundaries of the OAIS are not restricted, so a system can be envisaged where different functions are performed in different systems and even different locations but are governed by the management functions of the OAIS. Any 'external systems' used must thus be considered part of the OAIS, and any data exchange with truly external systems must be via submission and dissemination packages. I think that any guideline notes on virtual AIPs should be in the AIP specification or guideline, alongside the guidance for splitting AIPs, but should not be applicable to the SIP and DIP. The CSIP should be unchanged, but if the philosophy is changing, a more fundamental rewrite is necessary.

@karinbredenberg
Contributor

4 DILCIS Board members have acknowledged the issue
1 DILCIS Board member agrees with the solution

The suggested solution will not be included in the next version because only one of the four members agreed to the solution.

Projects
Status: Candidates for voting
5 participants