Sage-Bionetworks · thomasyu888 · Oct 31, 2024 · Oct 31, 2024 · Nov 1, 2024 · Nov 1, 2024
diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -14,7 +14,8 @@ build:
     post_install:
       - pip install poetry==1.3.0
       - poetry config virtualenvs.create false
-      - poetry install --with doc
+      - poetry install --all-extras
+      - pip install typing-extensions
     #Poetry will install my dependencies into the virtualenv created by readthedocs if I set virtualenvs.create=false
     # You can also specify other tool versions:
     # nodejs: "16"

@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Sage Bionetworks
+Copyright (c) 2024 Sage Bionetworks
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -270,12 +270,17 @@ poetry debug info
 
 Before you begin, make sure you are in the latest `develop` of the repository.
 
-The following command will install the dependencies based on what we specify in the `poetry.lock` file of this repository. If this step is taking a long time, try to go back to Step 2 and check your version of `poetry`. Alternatively, you can try deleting the lock file and regenerate it by doing `poetry install` (Please note this method should be used as a last resort because this would force other developers to change their development environment)
+The following command installs the dependencies pinned in the `poetry.lock` file of this repository (which is generated from the libraries listed in the `pyproject.toml` file). If this step takes a long time, go back to Step 2 and check your version of `poetry`. Alternatively, you can delete the lock file and regenerate it with `poetry lock`. Use this method only as a last resort, because it forces other developers to change their development environments.
 
 ```
-poetry install --all-extras
+poetry install --with dev,doc
 ```
 
+This command will install:
+* The main dependencies required for running the package.
+* Development dependencies for testing, linting, and code formatting.
+* Documentation dependencies such as `sphinx` for building and maintaining documentation.
+
 ### 5. Set up configuration files
 
 The following section will walk through setting up your configuration files with your credentials to allow for communication between `schematic` and the Synapse API.
@@ -484,12 +489,23 @@ docker run -v %cd%:/schematic \
 
 # Contributors
 
-Main contributors and developers:
 
+Sage main contributors and developers:
+
+- [Gianna Jordan](https://github.com/giajordan)
+- [Lingling Peng](https://github.com/linglp)
+- [Bryan Fauble](https://github.com/BryanFauble)
+- [Andrew Lamb](https://github.com/andrewelamb)
+- [Brad Macdonald](https://github.com/BWMac)
 - [Milen Nikolov](https://github.com/milen-sage)
+
+## Alumni
 - [Mialy DeFelice](https://github.com/mialy-defelice)
 - [Sujay Patil](https://github.com/sujaypatil96)
 - [Bruno Grande](https://github.com/BrunoGrandePhD)
-- [Robert Allaway](https://github.com/allaway)
-- [Gianna Jordan](https://github.com/giajordan)
-- [Lingling Peng](https://github.com/linglp)
+- [Jason Hwee](https://github.com/hweej)
+- [Xengie Doan](https://github.com/xdoan)
+- [James Eddy](https://github.com/jaeddy)
+- [Yooree Chae](https://github.com/ychae)
+
+See all [contributors](https://github.com/Sage-Bionetworks/schematic/graphs/contributors)
@@ -0,0 +1,81 @@
+Setting up your asset store
+===========================
+
+.. note::
+
+   You can ignore this section if you are just trying to contribute manifests.
+
+This document covers the minimal recommended elements needed in Synapse to interface with the Data Curator App (DCA) and provides options for Synapse project layout.
+
+There are two options for setting up a DCC Synapse project:
+
+1. Each team of DCC contributors has its own Synapse project that stores the team's datasets.
+2. All DCC datasets are stored in the same Synapse project.
+
+Option 1: Distributed Synapse Projects
+--------------------------------------
+
+Pick **option 1** if you answer "yes" to one or more of the following questions:
+
+- Does the DCC have multiple contributing institutions/labs, each with different data governance and access controls?
+- Does the DCC have multiple institutions with limited cross-institutional sharing?
+- Will contributors submit more than 100 datasets per release or per month?
+- Are you unwilling to annotate each DCC dataset folder with the annotation `contentType:dataset`?
+
+Access & Project Setup - Multiple Contributing Projects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Create a DCC Admin Team with admin permissions.
+2. Create a Team for each data contributing institution. Begin with a "Test Team" if not all teams have been identified yet.
+3. Create a Synapse Project for each institution and grant the respective team **Edit** level access.
+   - E.g., for institutions A, B, and C, create Projects A, B, and C with Teams A, B, and C. Team A has **Edit** access to Project A, etc.
+4. Within each project, create top-level dataset folders in the **Files** tab for each dataset type.
+5. Create another Synapse Project (e.g., MyDCC) containing the main **Fileview**, whose scope includes all the DCC projects.
+   - Ensure all teams have **Download** level access to this file view.
+   - Include both file and folder entities and add ALL default columns.
+
+
+Option 2: Single Synapse Project
+--------------------------------
+
+Pick **option 2** if you don't select option 1 and you answer "yes" to any of these questions:
+
+- Does the DCC have a project with pre-existing datasets in a complex folder hierarchy?
+- Does the DCC envision collaboration on the same dataset collection across multiple teams with shared access controls?
+- Are you willing to set up local access control for each dataset folder and annotate each with `contentType:dataset`?
+
+If neither option fits, select option 1.
+
+
+Access & Project Setup - Single Contributing Project
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Create a Team for each data contributing institution.
+2. Create a single Synapse Project (e.g., MyDCC).
+3. Within this project, create dataset folders for each contributor. Organize them as needed.
+   - Annotate each dataset folder with `contentType:dataset`. Dataset folders must have unique names and must not be nested inside other dataset folders.
+4. In MyDCC, create the main **DCC Fileview** with `MyDCC` as scope. Add column `contentType` to the schema and grant teams **Download** level access.
+   - Add both file and folder entities and add ALL default columns.
+
+
+Synapse External Cloud Buckets Setup
+------------------------------------
+
+If DCC contributors require external cloud buckets, select one of the following configurations.  For more information on how to
+set this up on Synapse, view this documentation: https://help.synapse.org/docs/Custom-Storage-Locations.2048327803.html
+
+1. **Basic External Storage Bucket (Default)**:
+   - Create an S3 bucket for Synapse uploads via web or CLI. Contributors will upload data without needing AWS credentials.
+   - Provision an S3 bucket, attach it to the Synapse project, and create folders for specific assay types.
+
+2. **Custom Storage Location**:
+
+   This is an advanced setup for users who do not want to upload files directly via the Synapse API, but would rather
+   create pointers to the data.
+
+   - For large datasets or if contributors prefer cloud storage, enable uploads via AWS CLI or GCP CLI.
+   - Configure the custom storage location with an AWS Lambda or Google Cloud function for syncing.
+   - If using AWS, provision a bucket, set up Lambda sync, and assign IAM write access.
+   - For GCP, use a Google Cloud function for syncing and obtain contributor emails for access.
+
+Finally, set up a `synapse-service-lambda` account for syncing external cloud buckets with Synapse, granting "Edit & Delete" permissions on the contributor's project.
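The selection rules for the two project layouts in the page added above can be summarized as a small decision function. This is only an illustrative sketch and is not part of `schematic`; the function name `choose_layout` and its parameters are hypothetical, and each list holds the yes/no answers to the questions listed under the corresponding option.

```python
def choose_layout(option1_answers, option2_answers):
    """Return 1 (distributed projects) or 2 (single project).

    option1_answers / option2_answers are lists of booleans, one per
    question listed under each option in the document (hypothetical inputs).
    """
    # Option 1 wins as soon as any of its questions is answered "yes".
    if any(option1_answers):
        return 1
    # Option 2 applies only when option 1 was not selected.
    if any(option2_answers):
        return 2
    # "If neither option fits, select option 1."
    return 1


# Example: no option 1 question applies, but the DCC has pre-existing
# datasets in a complex folder hierarchy (an option 2 question).
layout = choose_layout([False, False, False, False], [True, False, False])
print(layout)
```

Note that option 1 is both the first choice and the fallback, so option 2 is only returned when every option 1 answer is "no" and at least one option 2 answer is "yes".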
@@ -2,6 +2,45 @@
 CLI Reference
 =============
 
+When you use this tool, the `-d` flag refers to the Synapse ID of a folder, found under the Files tab,
+that contains a manifest and data. Schematic calls such a folder a dataset. Providing a dataset_id is optional,
+but if you want to pull existing annotations with the `-a` flag and the manifest is file-based, you
+need to provide a dataset_id.
+
+
+Generate a new manifest as a Google Sheet
+-----------------------------------------
+
+
+.. code-block:: shell
+
+   schematic manifest -c /path/to/config.yml get -dt <your data type> -s
+
+Generate an existing manifest from Synapse
+------------------------------------------
+
+.. code-block:: shell
+
+   schematic manifest -c /path/to/config.yml get -dt <your data type> -d <your synapse dataset folder id> -s
+
+Validate a manifest
+-------------------
+
+.. code-block:: shell
+
+   schematic model -c /path/to/config.yml validate -dt <your data type> -mp <your csv manifest path>
+
+Submit a manifest as a file
+---------------------------
+
+.. code-block:: shell
+
+   schematic model -c /path/to/config.yml submit -mp <your csv manifest path> -d <your synapse dataset folder id> -vc <your data type> -mrt file_only
+
+
+In depth guide
+--------------
+
 .. click:: schematic.__main__:main
   :prog: schematic
   :nested: full
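The two `schematic manifest ... get` invocations added to the CLI reference above differ only in the optional `-d` argument. A minimal sketch of assembling them from Python follows, for example before passing the list to `subprocess.run`. The helper name `manifest_get_command` and the placeholder ID `syn00000000` are hypothetical; only the flags shown in the reference (`-c`, `-dt`, `-d`, `-s`) are used, and nothing is executed here.

```python
import shlex


def manifest_get_command(config_path, data_type, dataset_id=None):
    """Build the argument list for `schematic manifest ... get`.

    Hypothetical helper: it only mirrors the commands shown in the
    CLI reference and does not call schematic itself.
    """
    cmd = ["schematic", "manifest", "-c", config_path, "get", "-dt", data_type]
    if dataset_id is not None:
        # -d is the Synapse ID of the dataset folder (optional).
        cmd += ["-d", dataset_id]
    # -s appears in both reference commands, which produce a Google Sheet.
    cmd.append("-s")
    return cmd


# New manifest as a Google Sheet.
new_manifest = manifest_get_command("/path/to/config.yml", "Biospecimen")
# Existing manifest from Synapse; "syn00000000" is a placeholder ID.
existing_manifest = manifest_get_command(
    "/path/to/config.yml", "Biospecimen", dataset_id="syn00000000"
)

print(shlex.join(new_manifest))
print(shlex.join(existing_manifest))
```

Building the command as a list rather than a single string avoids shell quoting problems when a path or data type contains spaces.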
@@ -12,6 +12,9 @@
 #
 import os
 import sys
+
+import sphinx_rtd_theme
+
 file_dir = os.path.dirname(__file__) 
 sys.path.append(file_dir)
 from utils import _parse_toml
@@ -25,20 +28,21 @@
 
 toml_metadata = _parse_toml(toml_file_path)
 project = toml_metadata["name"]
-copyright = "2022, Sage Bionetworks"
+copyright = "2024, Sage Bionetworks"
 
 author = toml_metadata["authors"]
 
 # The full version, including alpha/beta/rc tags
 release = toml_metadata["version"]
 
 
+
 # -- General configuration ---------------------------------------------------
 
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ["sphinx_click"]
+extensions = ["sphinx_click", "sphinx_rtd_theme"]
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ["_templates"]
@@ -55,15 +59,21 @@
 # This pattern also affects html_static_path and html_extra_path.
 exclude_patterns = []
 
+# The master toctree document.
+master_doc = "index"
 
 # -- Options for HTML output -------------------------------------------------
 
 # The theme to use for HTML and HTML Help pages.  See the documentation for
 # a list of builtin themes.
 #
-html_theme = "alabaster"
+html_theme = "sphinx_rtd_theme"
 
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ["_static"]
+
+html_theme_options = {
+    'collapse_navigation': False,
+}
@@ -0,0 +1,84 @@
+Configure Schematic
+===================
+
+This is an example config for Schematic. Every value listed is the default that is used when no config is provided. Remove any fields in the config you don't want to change.
+If you remove all fields from a section, remove the entire section, including its header.
+Change the values of any fields you do want to change. See the installation section for details on how to set some of this up.
+
+.. code-block:: yaml
+
+    # This describes where assets such as manifests are stored
+    asset_store:
+        # This is when assets are stored in a synapse project
+        synapse:
+            # Synapse ID of the file view listing all project data assets.
+            master_fileview_id: "syn23643253"
+            # Path to the synapse config file, either absolute or relative to this file
+            config: ".synapseConfig"
+            # Base name that manifest files will be saved as
+            manifest_basename: "synapse_storage_manifest"
+
+    # This describes information about manifests as it relates to generation and validation
+    manifest:
+        # Location where manifests will be saved
+        manifest_folder: "manifests"
+        # Title or title prefix given to generated manifest(s)
+        title: "example"
+        # Data types of manifests to be generated or data type (singular) to validate manifest against
+        data_type:
+            - "Biospecimen"
+            - "Patient"
+
+    # Describes the location of your schema
+    model:
+        # Location of your schema jsonld, it must be a path relative to this file or absolute
+        location: "tests/data/example.model.jsonld"
+
+    # This section is for using google sheets with Schematic
+    google_sheets:
+        # Path to the google service account creds, either absolute or relative to this file
+        service_acct_creds: "schematic_service_account_creds.json"
+        # Controls Google Sheets validation (regex match) with the validation rules.
+        #   true alerts the user and does not allow entry of bad values.
+        #   false warns the user but allows the entry onto the sheet.
+        strict_validation: true
+
+
+The rest of this document describes in detail what each of these configuration options means.
+
+Asset Store
+-----------
+
+Synapse
+~~~~~~~
+This describes where assets such as manifests are stored. The asset store is configured
+under the `asset_store` section of the config.
+
+* master_fileview_id: Synapse ID of the file view listing all project data assets.
+* config: Path to the synapse config file, either absolute or relative to this file. Note: if you use the `synapse config` command, you will have to provide the full path to the configuration file.
+* manifest_basename: Base name that manifest files will be saved as on Synapse. The component name is appended to it, for example: `synapse_storage_manifest_biospecimen.csv`
+
+Manifest
+--------
+This describes information about manifests as it relates to generation and validation. Note: some of these configurations can be overridden by the CLI commands.
+
+* manifest_folder: Location where manifests will be saved. This can be a relative or absolute path on your local machine.
+* title: Title or title prefix given to generated manifest(s). This is used to name the manifest file saved locally.
+* data_type: Data types of manifests to be generated, or the data type (singular) to validate a manifest against. If you want all the available manifests, you can input "all manifests".
+
+
+Model
+-----
+Describes the location of your schema.
+
+* location: The location of your schema jsonld. It must be an absolute path or a path relative to this file. URLs are currently NOT supported, so you will have to download the jsonld data model. Here is an example: https://raw.githubusercontent.com/ncihtan/data-models/v24.9.1/HTAN.model.jsonld
+
+Google Sheets
+-------------
+Schematic leverages the Google API to generate manifests. This section is for using Google Sheets with Schematic.
+
+* service_acct_creds: Path to the google service account creds, either absolute or relative to this file. This is the path to the service account credentials file that you download from Google Cloud Platform.
+* strict_validation: Controls how bad values are handled during Google Sheets validation (regex match) with the validation rules.
+
+    * True alerts the user and does not allow entry of bad values.
+    * False warns the user but allows the entry onto the sheet.
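The configuration page above states that every listed value is a default and that you only keep the fields you want to change. The sketch below illustrates that override behavior with plain dictionaries. It is an assumption about the semantics, not schematic's actual loading code; `merge_config` is a hypothetical helper, and the defaults are copied from the example YAML.

```python
import copy

# Defaults copied from the example config in the document.
DEFAULTS = {
    "asset_store": {
        "synapse": {
            "master_fileview_id": "syn23643253",
            "config": ".synapseConfig",
            "manifest_basename": "synapse_storage_manifest",
        }
    },
    "manifest": {
        "manifest_folder": "manifests",
        "title": "example",
        "data_type": ["Biospecimen", "Patient"],
    },
    "model": {"location": "tests/data/example.model.jsonld"},
    "google_sheets": {
        "service_acct_creds": "schematic_service_account_creds.json",
        "strict_validation": True,
    },
}


def merge_config(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge field by field.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalar or list: the user's value replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


# A user config that changes only the manifest title.
user_config = {"manifest": {"title": "my_project"}}
effective = merge_config(DEFAULTS, user_config)

print(effective["manifest"]["title"])
print(effective["manifest"]["manifest_folder"])
```

In practice `user_config` would come from parsing the YAML file. It is written as a literal dictionary here so the example stays self-contained.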