diff --git a/docs/03-data-analysts.md b/docs/03-data-analysts.md index e286d41c..7f3cb4ec 100644 --- a/docs/03-data-analysts.md +++ b/docs/03-data-analysts.md @@ -24,7 +24,7 @@ In this guide, you will go through the following steps: 1. Create a Google account 1. [Launch Terra](https://anvil.terra.bio/#workspaces) and sign in with your Google account -1. Link external accounts (Gen3, dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization) +1. Link external accounts (e.g., dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization) ## Step 1: Create a Google Account {#data-analysts-step-1} @@ -52,8 +52,6 @@ AnVIL provides access to a wide selection of datasets, including controlled-acce The following links will take you to the Terra documentation for setting up and linking external accounts. -- [Set up a Gen3 account](https://gen3.theanvil.io/login) - allows you to use the Gen3 data explorer to create artificial cohorts over AnVIL datasets that have been indexed by Gen3. -- [Link Gen3 and Terra accounts](https://support.terra.bio/hc/en-us/articles/360050390451) - allows you to analyze Gen3 data on Terra. - [Link Terra and eRA Commons ID](https://support.terra.bio/hc/en-us/articles/360038086332-Linking-Terra-to-External-Servers) - To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). Linking to external servers will allow Terra to automatically determine if you can access controlled datasets hosted in Terra (ex. TCGA, TOPMed, etc.) based on your approved dbGaP applications. 
## Wrap-Up {#data-analysts-wrap-up} diff --git a/docs/06-tools-rstudio.md b/docs/06-tools-rstudio.md index 2fceb2af..0208f1e3 100644 --- a/docs/06-tools-rstudio.md +++ b/docs/06-tools-rstudio.md @@ -233,7 +233,7 @@ sessionInfo() ## [1] sass_0.4.8 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 ## [5] stringi_1.8.3 lattice_0.21-9 hms_1.1.3 digest_0.6.34 ## [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 timechange_0.3.0 -## [13] bookdown_0.40 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 +## [13] bookdown_0.41 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 ## [17] Matrix_1.6-1.1 processx_3.8.3 chromote_0.3.1 ps_1.7.6 ## [21] promises_1.2.1 httr_1.4.7 fansi_1.0.6 ottrpal_1.3.0 ## [25] udpipe_0.8.11 cow_0.0.0.9000 jquerylib_0.1.4 cli_3.6.2 diff --git a/docs/07-data.md b/docs/07-data.md index 52ac260b..501a1972 100644 --- a/docs/07-data.md +++ b/docs/07-data.md @@ -115,8 +115,8 @@ The [AnVIL Dataset Catalog](https://anvilproject.org/data) displays key NHGRI da Image shows a screenshot of the AnVIL Dataset Catalog website landing page. -### Gen3 Data Explorer +### AnVIL Data Explorer -The [Gen3 Data Explorer and Data Commons](https://gen3.theanvil.io/) provides their API for data queries and downloads, supporting cross-project analyses. Gen3 provides access to open and protected datasets that can be exported to an AnVIL Workspace. For example, users can find the 1000 Genomes dataset on Gen3 and filter by ancestry, age, and other features prior to performing analyses on AnVIL. +The [AnVIL Data Explorer](https://explore.anvilproject.org/datasets) enables faceted searches of open and managed access datasets hosted in AnVIL, making it easier for researchers to find and custom-build cohorts. -Image shows a screenshot of the Gen3 on AnVIL Data Explorer website landing page. +Image shows a screenshot of the AnVIL Data Explorer website landing page. 
diff --git a/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_g30d935bde8e_0_0.png b/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_g30d935bde8e_0_0.png new file mode 100644 index 00000000..6b342474 Binary files /dev/null and b/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_g30d935bde8e_0_0.png differ diff --git a/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_31.png b/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_31.png index 91e5a9b0..d7668eee 100644 Binary files a/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_31.png and b/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_31.png differ diff --git a/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_40.png b/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_40.png deleted file mode 100644 index c8c22d0b..00000000 Binary files a/docs/07-data_files/figure-html/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM_gf5fa6f264a_0_40.png and /dev/null differ diff --git a/docs/11b-irb-templates.md b/docs/11b-irb-templates.md index d632ca50..5b271d6e 100644 --- a/docs/11b-irb-templates.md +++ b/docs/11b-irb-templates.md @@ -19,7 +19,7 @@ The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Spac - The **Terra platform**: provides a compute environment with secure data and analysis sharing capabilities - **Dockstore**: provides standards based sharing of containerized tools and workflows -- **The Gen3 data commons framework**: provides data and metadata ingest, querying, and organization +- **The AnVIL Data Explorer**: enables faceted searches of open and managed access datasets hosted in AnVIL - **Bioconductor and Galaxy**: provide environments for users at different skill levels 
to construct and execute analyses ### Data Storage diff --git a/docs/12-overview-videos.md b/docs/12-overview-videos.md index f9b9b749..75f408e4 100644 --- a/docs/12-overview-videos.md +++ b/docs/12-overview-videos.md @@ -7,4 +7,3 @@ - Galaxy -- [Galaxy on AnVIL Walkthrough](https://youtube.com/watch?v=-Q4SjLEd99s) -- 5m50s 12/9/20 - Data Tables -- [Introduction to Data Tables in Terra](https://youtube.com/watch?v=IeLywroCNNA) -- 5m25s 4/5/20 - Dockstore -- [Importing a GATK workflow from Dockstore into Terra](https://youtube.com/watch?v=SGqMPNITQSE) -- 8m29s 10/15/20 -- Workflows -- [Configuring and running a GATK workflow on Gen3 data in Terra](https://youtube.com/watch?v=Vr7GH_h49ts) -- 28m10s 11/25/20 diff --git a/docs/404.html b/docs/404.html index a94b6d51..c5617232 100644 --- a/docs/404.html +++ b/docs/404.html @@ -6,7 +6,7 @@ Page not found | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/About.md b/docs/About.md index 0c402942..bf01a4bc 100644 --- a/docs/About.md +++ b/docs/About.md @@ -41,12 +41,12 @@ These credits are based on our [course contributors table guidelines](https://gi ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2024-10-22 +## date 2024-10-28 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source -## bookdown 0.40 2024-07-02 [1] CRAN (R 4.3.2) +## bookdown 0.41 2024-10-16 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) diff --git a/docs/about.html b/docs/about.html index 9312f910..dbfaafde 100644 --- a/docs/about.html +++ b/docs/about.html @@ -6,7 +6,7 @@ About the Authors | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -466,12 +466,12 @@

    About the Authors C Authorization Domains | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows diff --git a/docs/budget-templates.html b/docs/budget-templates.html index 1734ab0d..24fe088e 100644 --- a/docs/budget-templates.html +++ b/docs/budget-templates.html @@ -6,7 +6,7 @@ D Budget Templates | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/consortia.html b/docs/consortia.html index 4b874dd5..06694194 100644 --- a/docs/consortia.html +++ b/docs/consortia.html @@ -6,7 +6,7 @@ Chapter 4 Consortia | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/data-analysts.html b/docs/data-analysts.html index 3f871c34..6cc1109f 100644 --- a/docs/data-analysts.html +++ b/docs/data-analysts.html @@ -6,7 +6,7 @@ Chapter 3 Data Analysts | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -393,7 +393,7 @@

    3.1.2 Starting Setup
  • Create a Google account
  • Launch Terra and sign in with your Google account
  • -
  • Link external accounts (Gen3, dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization)
  • +
  • Link external accounts (e.g., dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization)
  • @@ -418,8 +418,6 @@

    3.4 Step 3: Link External Account

    AnVIL provides access to a wide selection of datasets, including controlled-access data. Linking your accounts will enable you to import these data into Terra.

    The following links will take you to the Terra documentation for setting up and linking external accounts.

    diff --git a/docs/data.html b/docs/data.html index cbcb7fa6..c646ec64 100644 --- a/docs/data.html +++ b/docs/data.html @@ -6,7 +6,7 @@ Chapter 9 Data | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows @@ -452,10 +452,10 @@

    9.2.2 AnVIL Dataset CatalogThe AnVIL Dataset Catalog displays key NHGRI datasets accessible in AnVIL, such as the CCDG (Centers for Common Disease Genomics), CMG (Centers for Mendelian Genomics), eMERGE (Electronic Medical Records and Genomics), as well as other relevant datasets. You will need to coordinate access to controlled data.

    Image shows a screenshot of the AnVIL Dataset Catalog website landing page.

    -
    -

    9.2.3 Gen3 Data Explorer

    -

    The Gen3 Data Explorer and Data Commons provides their API for data queries and downloads, supporting cross-project analyses. Gen3 provides access to open and protected datasets that can be exported to an AnVIL Workspace. For example, users can find the 1000 Genomes dataset on Gen3 and filter by ancestry, age, and other features prior to performing analyses on AnVIL.

    -

    Image shows a screenshot of the Gen3 on AnVIL Data Explorer website landing page.

    +
    +

    9.2.3 AnVIL Data Explorer

    +

    The AnVIL Data Explorer enables faceted searches of open and managed access datasets hosted in AnVIL, making it easier for researchers to find and custom-build cohorts.

    +

    Image shows a screenshot of the AnVIL Data Explorer website landing page.

    diff --git a/docs/faqs.html b/docs/faqs.html index f2827f3c..e05837b5 100644 --- a/docs/faqs.html +++ b/docs/faqs.html @@ -6,7 +6,7 @@ A FAQs | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows diff --git a/docs/galaxy.html b/docs/galaxy.html index ad89799a..b275ee59 100644 --- a/docs/galaxy.html +++ b/docs/galaxy.html @@ -6,7 +6,7 @@ Chapter 7 Galaxy | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/give-us-feedback.html b/docs/give-us-feedback.html index 1923324d..7c023876 100644 --- a/docs/give-us-feedback.html +++ b/docs/give-us-feedback.html @@ -6,7 +6,7 @@ G Give us Feedback | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/index.html b/docs/index.html index 244e37cf..87d236b2 100644 --- a/docs/index.html +++ b/docs/index.html @@ -6,7 +6,7 @@ Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -372,7 +372,7 @@

    About this Book

    diff --git a/docs/index.md b/docs/index.md index 0f041dd4..db2fc020 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,6 +1,6 @@ --- title: "Getting Started on AnVIL" -date: "October 22, 2024" +date: "October 28, 2024" site: bookdown::bookdown_site documentclass: book bibliography: [book.bib, packages.bib] diff --git a/docs/introduction.html b/docs/introduction.html index 3a0b2919..cf3d10e5 100644 --- a/docs/introduction.html +++ b/docs/introduction.html @@ -6,7 +6,7 @@ Chapter 1 Introduction | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows diff --git a/docs/irb-templates.html b/docs/irb-templates.html index d2064a12..5716d1b2 100644 --- a/docs/irb-templates.html +++ b/docs/irb-templates.html @@ -6,7 +6,7 @@ E IRB Templates | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -390,7 +390,7 @@

    E.0.2 Description of AnVIL
  • The Terra platform: provides a compute environment with secure data and analysis sharing capabilities
  • Dockstore: provides standards based sharing of containerized tools and workflows
  • -
  • The Gen3 data commons framework: provides data and metadata ingest, querying, and organization
  • +
  • The AnVIL Data Explorer: enables faceted searches of open and managed access datasets hosted in AnVIL
  • Bioconductor and Galaxy: provide environments for users at different skill levels to construct and execute analyses
  • diff --git a/docs/jupyter-notebook.html b/docs/jupyter-notebook.html index f7f0ae7d..afe1503c 100644 --- a/docs/jupyter-notebook.html +++ b/docs/jupyter-notebook.html @@ -6,7 +6,7 @@ Chapter 6 Jupyter Notebook | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows diff --git a/docs/overview-videos.html b/docs/overview-videos.html index 21a487f3..fc8c92f8 100644 --- a/docs/overview-videos.html +++ b/docs/overview-videos.html @@ -6,7 +6,7 @@ F Overview Videos | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -380,7 +380,6 @@

    F Overview VideosGalaxy on AnVIL Walkthrough – 5m50s 12/9/20

  • Data Tables – Introduction to Data Tables in Terra – 5m25s 4/5/20
  • Dockstore – Importing a GATK workflow from Dockstore into Terra – 8m29s 10/15/20
  • -
  • Workflows – Configuring and running a GATK workflow on Gen3 data in Terra – 28m10s 11/25/20
  • diff --git a/docs/pis-and-lab-managers.html b/docs/pis-and-lab-managers.html index 69537536..a584d9ee 100644 --- a/docs/pis-and-lab-managers.html +++ b/docs/pis-and-lab-managers.html @@ -6,7 +6,7 @@ Chapter 2 PIs and Lab Managers | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/reference-keys.txt b/docs/reference-keys.txt index c852f789..42f871bb 100644 --- a/docs/reference-keys.txt +++ b/docs/reference-keys.txt @@ -97,7 +97,7 @@ copy-files-from-your-local-computer-to-a-workspace-bucket analyze-existing-data anvil-data-library anvil-dataset-catalog -gen3-data-explorer +anvil-data-explorer workflows pre-configured-workflow create-a-custom-workflow diff --git a/docs/rstudio.html b/docs/rstudio.html index e4d6f616..bde8c0ba 100644 --- a/docs/rstudio.html +++ b/docs/rstudio.html @@ -6,7 +6,7 @@ Chapter 8 RStudio | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows @@ -514,7 +514,7 @@

    8.2.4 Delete RStudio Cloud Enviro ## [1] sass_0.4.8 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 ## [5] stringi_1.8.3 lattice_0.21-9 hms_1.1.3 digest_0.6.34 ## [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 timechange_0.3.0 -## [13] bookdown_0.40 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 +## [13] bookdown_0.41 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 ## [17] Matrix_1.6-1.1 processx_3.8.3 chromote_0.3.1 ps_1.7.6 ## [21] promises_1.2.1 httr_1.4.7 fansi_1.0.6 ottrpal_1.3.0 ## [25] udpipe_0.8.11 cow_0.0.0.9000 jquerylib_0.1.4 cli_3.6.2 diff --git a/docs/search_index.json b/docs/search_index.json index 6fc6b201..15004a84 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -1 +1 @@ -[["index.html", "Getting Started on AnVIL About this Book", " Getting Started on AnVIL October 22, 2024 About this Book This book is part of a series of books for the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) of the National Human Genome Research Institute (NHGRI). Here, we present opinionated step-by-step guides for setting up accounts focused on three personas: PIs, Analysts, and Consortia. Skills Level Please choose the closest matching persona from the lefthand menu. Genetics Novice: no genetics knowledge needed Programming skills Novice: no programming experience needed AnVIL Collection Additional guides are provided to help you with Workspaces, launch interactive tools, and start working with data. Learn more about AnVIL by visiting https://anvilproject.org or reading the article in Cell Genomics. Please check out our full collection of AnVIL and related resources: https://hutchdatascience.org/AnVIL_Collection/ "],["introduction.html", "Chapter 1 Introduction", " Chapter 1 Introduction Welcome to the AnVIL Getting Started guide! In it you will find step-by-step instructions for setting up your accounts, as well as guides on how to use some of AnVIL’s key features. 1.0.1 What Is AnVIL? 
AnVIL is NHGRI’s Genomic Data Science Analysis, Visualization, and Informatics Lab-space. It provides a platform for performing genomic data analysis on the cloud. 1.0.2 Does AnVIL Cost Money? Through AnVIL, you pay for computing resources as you use them. If you’d like to try it out, new users can claim a $300 Google Cloud credit to test out the platform and perform some small analyses. We also provide a cost estimator. 1.0.3 Where Can I Get Help? Please visit our community support forum at help.anvilproject.org with any questions (or suggestions!) you may have. 1.0.4 How to Use This Book This book is not intended to be read through sequentially, rather, it is a collection of guides that you can reference based on your needs. It is divided into two major sections: Account Setup Step-by-step instructions for new AnVIL users to set up their accounts and start using the AnVIL platform. We have included recommendations for configuring your accounts based on several common use cases: PIs and Lab Managers: managing a team of researchers working on AnVIL Data Analysts: joining a team working on AnVIL Consortia: using AnVIL as part of a research consortium Working on AnVIL Examples and walkthroughs of common tasks on the AnVIL platform: Workspaces: how to create and clone research spaces on AnVIL Tools: how to run common tools including Jupyter Notebooks, Galaxy, and RStudio Data: how to find and access AnVIL datasets, as well as upload and manage your own data Workflows: how to find and run existing automated data processing pipelines, and how to customize and share your own 1.0.5 Activate scroll_highlight Feature Note that some sections of this book cover steps in a lot of detail. When navigating the table of contents, you can click subsection (e.g., 2.1, 4.3) headers a second time to expand the table of contents and enable the scroll_highlight feature. This can help you follow the separate steps within more clearly. 
"],["pis-and-lab-managers.html", "Chapter 2 PIs and Lab Managers 2.1 Account Setup Overview 2.2 Step 1: Create a Google Account 2.3 Step 2: Set Up Google Billing 2.4 Step 3: Add Terra to Google Billing Account 2.5 Step 4: Create Terra Billing Projects 2.6 Step 5: Set Budgets and Alerts 2.7 Step 6: Add Users and Workspaces 2.8 Wrap-Up", " Chapter 2 PIs and Lab Managers This chapter is targeted towards people who are responsible for bringing a team to AnVIL. Broadly targeted towards principal investigators (PIs), but also relevant to team leads or lab managers, you will find here: Account Setup Overview – Design philosophy and goals for this guide - is this a good fit for your team? What should you know before you start? Account Setup Steps – Step-by-step instructions to create your first accounts on AnVIL and connect your team members The Appendices of this book contain additional information that may be of interest, including: Templates for including AnVIL in grant applications (Budget Templates, IRB Templates) Information regarding AnVIL’s security features for protecting sensitive research (Authorization Domains) Please click on the subsection headers in the left hand navigation bar (e.g., 2.1, 4.3) a second time to expand the table of contents and enable the scroll_highlight feature (see more). 2.1 Account Setup Overview 2.1.1 Goals for This Guide 2.1.2 Design Philosophy This guide provides an opinionated walkthrough on how to set up AnVIL for your lab, based on experiences from many labs actively using AnVIL. These step-by-step instructions take team leads that are completely new to the AnVIL through account setup to the point where team members can start working on AnVIL. Following the recommendations in this guide will help you more clearly see where charges are coming from and have greater control over which users can spend your money and access your data. 
In support of these goals we have made the following design decisions: COST CONTROL Prevent charges to your funding account until you explicitly give authorization by starting with Google’s free $300 credit program Control who can charge to your account by limiting who can “share” permission to compute - yourself and any designated “Lab Managers” COST TRANSPARENCY Allow fine-grain accounting of who spent what by creating individual “Billing Projects” for each user Monitor costs by setting up email alerts to warn you when you reach spending thresholds Enable detailed analysis of costs by exporting cost data using BigQuery DATA ACCESS CONTROLS Reduce unwanted access by limiting who can “share” your data and analyses - yourself and any designated “Lab Managers” Stricter data access management can be enforced through “Authorization Domains”; however this can make future sharing and publication difficult. This guide recommends avoiding Authorization Domains for most uses, especially as you are starting out. If you are working with highly sensitive data, see this documentation for more information. These design decisions are made to help you get up and running as quickly as possible without overwhelming new users. As your experience and comfort with AnVIL grows, you will likely change your design to better match your unique needs e.g. enabling Authorization Domains when working with protected data. 2.1.3 Before You Start You will need a credit card or bank account to activate your free trial and get started. Don’t worry! You won’t be billed until you explicitly turn on automatic billing, but payment information is needed for verification purposes. Before setting up billing yourself, you may want to check with your institutional procurement office and see if they have a preferred account set-up method with Google (such as a third party reseller or an existing account). To add lab members, you will need to know the Google account they will use to access Terra. 
You can send lab members to the Data Analysts chapter for instructions on how they can sign up and start working on AnVIL. You can complete most setup steps without this information and then add them once you know the correct accounts. 2.1.4 Starting Setup AnVIL uses Terra to run analyses. Terra operates on Google Cloud Platform (GCP), so you’ll pay for all storage and analysis costs through a Google account linked to Terra. The costs are the standard Google Cloud Platform fees for storing and moving data as well as executing an analysis. These costs are passed along through Terra without any markup. Create a Google account Set up Google Billing (and claim your free credits!). Add an administrator or viewer (optional) Link Terra to the Google Billing Account Create Terra Billing Projects Set budgets and alerts (optional, but highly recommended) Add users and Workspaces 2.1.5 Lab Management Roles While there are many ways to configure your lab, this guide defines the following roles and responsibilities: PI - The PI sets up the lab’s Google Cloud Account, creates its Google Billing Account(s) and Google Payment Method(s), links Terra with GCP, and invites Lab Managers to be Google Cloud “Billing Account Users.” Lab Manager (Optional) - A Lab Manager creates or clones Terra Workspaces and manages who can use those Workspaces. The Lab Manager is also responsible for creating one or more Terra Billing Projects and configuring GCP budgets and alerts. Importantly, Lab Managers control who can spend lab money and should have an understanding of Google Cloud Billing and Terra Billing Projects. Depending on your lab, the PI may choose to be the only Lab Manager, or may appoint trusted lab members to assist. Data Analyst - A lab member who is granted write + can-compute access on one or more Terra Workspaces by a Lab Manager and who will run analyses in Terra. Data Analysts cannot share Terra Workspaces (this prevents them from enabling others to spend lab money).
2.2 Step 1: Create a Google Account Terra operates on Google Cloud Platform, so you will need a (free) Google account which will allow you to Access the Terra platform to manage team members, data, and analyses Access Google Cloud Platform to manage billing Receive alerts when spending reaches specified thresholds If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 2.3 Step 2: Set Up Google Billing Terra operates on Google Cloud Platform, and does not charge any markup. Rather than paying Terra or AnVIL, users set up billing directly with Google Cloud Platform. Make sure to use the same Google account ID you use to log into Terra for Google Cloud Billing. To set up billing, you must first create a Google “Billing Account”. You can create multiple Billing Accounts associated with your Google ID. We recommend creating separate Billing Accounts for different funding sources. 2.3.1 Create a Google Billing Account Log in to the Google Cloud Platform console using your Google ID. Make sure to use the same Google account ID you use to log into Terra. If you are a first time user, don’t forget to claim your free credits! If you haven’t been to the console before, once you accept the Terms of Service you will be greeted with an invitation to “Try for Free.” Follow the instructions to sign up for a Billing Account and get your credits. Choose “Individual Account”. This “billing account” is just for managing billing, so you don’t need to be able to add your team members. You will need to give either a credit card or bank account for security. Don’t worry! You won’t be billed until you explicitly turn on automatic billing. 
You can view and edit your new Billing Account by selecting “Billing” from the left-hand menu, or by going directly to the billing console at console.cloud.google.com/billing. Clicking on the Billing Account name will allow you to manage the account, including accessing reports, setting alerts, and managing payments and billing. At any point, you can create additional Billing Accounts using the Create Account button. We generally recommend creating a new Billing Account for each funding source. 2.3.2 Add Users or Viewers (optional) If you have a project manager or finance administrator who needs access to a Billing Account, you can add them with a few different levels of permissions. Generally the most useful are: Users have a great deal of power over spending - they can create new “Billing Projects” and control who can spend money on those projects. If you have a lab or accounts manager responsible for expenses, it may make sense to add them as a Billing Account User. If you wish to retain full control over who can spend money on GCP, you should not add any Users. Viewers can see the activity in the Billing Account but can’t make any changes. This can be useful for finance staff who need access to the reports, or for lab members to be able to see what their analyses are costing. Anyone you wish to add to the Billing Account will need their own Google ID. To add a member to a Billing Account: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add a member to, then click “ADD MEMBER”. Enter their Google ID in the text box. In the drop-down menu, mouse over Billing, then choose the appropriate role. Click “SAVE”.
2.4 Step 3: Add Terra to Google Billing Account This gives Terra permission to create projects and send charges to the Google Billing Account, and must be done by an administrator of the Google Billing Account. Terra needs to be added as a “Billing Account User”: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add Terra to, then click “ADD MEMBER”. Enter terra-billing@terra.bio in the text box. In the drop-down menu, mouse over Billing, then choose “Billing Account User”. Click “SAVE”. 2.5 Step 4: Create Terra Billing Projects This is how you enable Terra users to charge to the Google Billing Account. Note that Google will report charges at the level of Billing Projects. If you create only one Billing Project for your lab, you will not be able to see a breakdown of where charges are coming from. It is highly recommended that you create separate Billing Projects for each category of spending you would like to track. For example: A Billing Project for each lab member, if you would like to track individual spending A Billing Project for each analysis type, if you would like to track spending on e.g. RNA-seq vs. variant calling. A Billing Project for each cohort, if you would like to track spending per data set If you are uncertain, we recommend starting by setting up a Billing Project per lab member. This makes it easy to track lab member spending, and also makes it easier to cleanly shut down projects when a member leaves the lab. 2.5.1 Create a Billing Project Launch Terra and sign in with your Google account. If this is your first time logging in to Terra, you will need to accept the Terms of Service.
In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing On the Billing page, click the “+ CREATE” button to create a new Billing Project. Select GCP Billing Project (Google’s Platform). If prompted, select the Google account to use and give Terra permission to manage Google Cloud Platform billing accounts. Enter a unique name for your Terra Billing Project and select the appropriate Google Billing Account. The name of the Terra Billing Project must: Only contain lowercase letters, numbers and hyphens Start with a lowercase letter Not end with a hyphen Be between 6 and 30 characters Select the Google Billing Account to use. All activities conducted under your new Terra Billing Project will charge to this Google Billing Account. If prompted, give Terra permission to manage Google Cloud Platform billing accounts. Click “Create”. Your new Billing Project should now show up in the list of Billing Projects Owned by You. You can add additional members or can modify or deactivate the Billing Project at any time by clicking on its name in this list. The page doesn’t always update as soon as the Billing Project is created. If it’s been a couple of minutes and you don’t see a change, try refreshing the page. As mentioned above, we recommend creating separate Terra Billing Projects for each of your team members so you can track their spending. These Billing Projects can all be associated with the same Google Billing Account if they are all funded by the same source. Having trouble? Check out the Troubleshooting appendix Visit our community support forum at help.anvilproject.org with any questions. 2.6 Step 5: Set Budgets and Alerts Cloud computing can save a great deal of money, time and effort by providing compute on an as-needed basis. 
However, care must be taken that users do not accidentally request excessive resources, or leave resources running when not needed. Unfortunately, there are two issues that make direct cost control difficult: The Google Cloud billing interface does not provide a way to automatically cancel computations when a spending threshold is reached Compute costs are reported with a delay (~1 day) As a PI or lab manager, there are some steps you can take to help monitor and limit spending: Be careful with members and permissions in your Billing Projects and Workspaces on Terra (see Adding Users and Workspaces for recommended setup) Most importantly, monitor your spending so you can shut down unnecessary expensive activities before they have time to accumulate. Terra provides extensive documentation and examples regarding cost management while working in the cloud We highly recommend that you set budgets and alerts to notify you if spending starts to exceed expectations. This will make it easier to notice and shut down any accidental overspending. A good starting point is to set a monthly budget, and then set alerts at 50 percent and 90 percent of expected spend. You can add additional alerts if you desire. You can set a single Budget for your entire lab, set up individual budgets for each Billing Project, or even set budgets for certain subsets of your Billing Projects. This will depend on the size of your lab and how closely you want to monitor spending. More granular budgets make it quicker to notice and track down overspending from a particular project but mean you will get more emails every month. When setting budgets with broader scope, you can always find out which particular Billing Project is spending the money by checking in the GCP Billing interface. Note that there may be some restrictions on the budgets and alerts you can set while you’re using GCP’s free credits.
At the time of writing (Feb 2021) you are not able to set budgets for individual projects while you are using the GCP free credits, but can still set an overall budget. Any restrictions should be lifted when you upgrade to a paid account. 2.6.1 Set Alerts Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account you want to set alerts for. In the left-hand menu, click “Budgets & alerts”. Click the “Create Budget” tab. Enter a name for your budget, and then choose which projects you want to monitor. Then click “Next”. For Budget Type, select “Specified amount”. Enter the total budget amount for the month (you will set alerts at different thresholds in the next step). Click “Next” (do not click “Finish”). Enter the threshold amounts where you want to receive an alert. We recommend starting with 50% and 90%. You can set other alerts if you prefer. Check the box for “Email alerts to billing admins and users”, then click “Finish”. Now you (as the owner and admin), along with anyone you added with admin or user privileges (e.g. lab managers) will receive alerts when your lab members reach the specified spending thresholds. These emails will be sent to the Gmail accounts associated with the Billing Account. You can edit your budgets at any time by going to Billing > Budgets & alerts, and clicking on the name of the budget you want to edit. 
2.6.2 View spend You can always check your current spend through the Google Billing console, but remember There is a reporting delay (~1 day), so you cannot immediately see what an analysis cost Costs are reported at the level of Workspaces, so if there are multiple people using a Workspace, you will not be able to determine which of them was responsible for the charges. The Google Billing console displays information by Billing Account. To view spending: Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account for the project you want to view. Look at the top of the Overview tab to see your month-to-date spending. Scroll further down the Overview tab to show your top projects. Click on the Reports tab to see more detailed information about each of your projects. This is probably the most useful tab for exploring costs of individual projects over time. Click on the Cost table tab to obtain a convenient table of spending per project. 2.6.3 Export Cost Data to BigQuery Coming soon – instructions on how to export your cost data so you can better analyze and control your expenses. 2.7 Step 6: Add Users and Workspaces Finally, back on Terra, you can add lab members and give them permission to run analyses funded through your Billing Projects. There are two primary ways to permit users to charge to your Billing Projects: Add them directly to the Billing Project. This gives them flexibility to create and manage their own Workspaces, but reduces your control over spending. Anyone they add to their Workspaces with sufficient permissions (i.e. 
permission to compute) can charge to your Billing Project. Create a Workspace yourself, and add them to the Workspace (or have a designated Lab Manager responsible for managing Workspaces). This gives you much more control over who can charge to your Billing Project. Billing permissions on Terra can be confusing. For this reason, we recommend starting by having a single person responsible for managing all Workspaces (either yourself or a trusted “lab manager”). This person should create all Workspaces and add lab members as Writers (not Owners) to the Workspaces. This provides the greatest control over spending. Once you are familiar with the permissions system and are certain your lab members understand the implications of different permission settings, you may decide to give them greater control over Workspace access. 2.7.1 Create a New Workspace Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the plus icon near the top left of the page. Name your Workspace and select the appropriate Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data).
For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CREATE WORKSPACE”. The new Workspace should now show up under your Workspaces. To start, we recommend creating one Workspace for each lab member (associated with that lab member’s Billing Project, with separate Billing Projects for your lab members). This will enable you and your lab members to familiarize yourself with Workspaces and decide how best to organize your work. You can then create additional Workspaces as needed. 2.7.2 Add Members to Workspaces Lab members must have logged in to Terra at least once before they can be added to your Billing Projects and Workspaces (they do not need to log in to Google Cloud Console). You can send lab members to the Data Analysts guide for instructions on how they can sign up and start working on AnVIL. Lab members can be added to a Workspace with a few different permission levels: Readers can view the Workspace but not make edits or run analyses (i.e. they cannot spend your money) Writers can make edits and run analyses (i.e. they can spend your money) Owners can make edits and run analyses and can also manage the permissions of other users (i.e. they can enable others to spend your money) More details about the permissions associated with each Access Level can be found in the Terra documentation. Managing permissions for a Workspace has important implications: Billing: Terra charges are associated with Workspaces rather than users. Any billable activity that takes place in a given Workspace will be charged to the associated Billing Project, regardless of who conducted the activity. If there are multiple users with permission to compute, it is impossible to tell who conducted the activity. 
Data access: Especially when working with protected data, it’s important to ensure that users have proper authorization to view the data before giving them access to a Workspace containing the data. Terra provides Authorization Domains to assist with this. In general we recommend: Writers: Lab members who need permission to compute (and charge to your Billing Project). This gives them permission to freely use the Workspace (adding and removing data, conducting analyses, etc.) but prevents them from adding additional members who could charge to your Billing Project. This ensures you have control over who is doing the spending. Readers: All other users (i.e. users who need to see the Workspace but should not charge to your Billing Project). Readers can always “clone” the Workspace (creating a copy of it associated with their own Billing Project) if they want to run computations themselves. If working with protected data, take advantage of Authorization Domains to increase security. To add a member to a Workspace: Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of the Workspace to open the Workspace. Opening a Workspace does not cost anything. Certain activities in the Workspace (such as running an analysis) will charge to the Workspace’s Billing Project. Workspace management (e.g. adding and removing members, editing the description) does not cost money. Click the teardrop button on the right hand side to open the Workspace management menu. Click “Share” Enter the email address of the user or Group you’d like to share the Workspace with. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. Choose their permission level.
Remember that all activity in the Workspace will be charged to the Workspace’s Billing Project, regardless of who conducts it, so only add members as “Writers” or “Owners” if they should be charging to the Workspace’s Billing Project. “Readers” can view all parts of the Workspace but cannot make edits or run analyses (i.e. they cannot spend money). They can also clone their own copy of the Workspace where they can conduct activity on their own Billing Project. Click “Save”. The user should now be able to see the Workspace when logged in to Terra. 2.7.3 Request Quota Increase To prevent abuse, new users of GCP are only permitted to create a few Google Cloud “Projects”. When working on Terra, each Terra Workspace is associated with its own Google Cloud Project, so if your team has multiple members you can bump up against this limit fairly quickly and won’t be able to create more Workspaces. Since this limit is imposed by Google, you will need to contact them directly to request a quota increase, using this form. At the time of writing (April 2022) Terra is working to expedite this process for Terra users; we recommend checking the relevant Terra documentation for the latest information as well as recommendations about how to fill out the form. 2.8 Wrap-Up Congratulations! You have successfully set up AnVIL for your lab! Your lab members should be free to carry out analyses in the Workspaces you created. You should not need to do any further configuration through Terra until you decide to add or change user permissions for your Billing Projects and Workspaces. You can view costs at any time through Google Cloud Billing. Note that costs are reported with a delay (~1 day). To learn more about billing and setup, we recommend checking out this Leanpub course. 
"],["data-analysts.html", "Chapter 3 Data Analysts 3.1 Account Setup Overview 3.2 Step 1: Create a Google Account 3.3 Step 2: Set Up Terra 3.4 Step 3: Link External Accounts (optional) 3.5 Wrap-Up", " Chapter 3 Data Analysts This chapter is targeted towards people who are joining an existing team on AnVIL. You will find here: Account Setup Overview – Introductory information and goals for this guide Account Setup Steps – Step-by-step instructions to create your first accounts on AnVIL 3.1 Account Setup Overview 3.1.1 Goals for This Guide 3.1.2 Starting Setup Terra is the compute engine of AnVIL; i.e. where you will run your analyses. Terra currently offers access to Jupyter Notebooks and RStudio for interactive analysis, as well as the Workflow Description Language (WDL) for batch processing of many samples. Behind the scenes, Terra runs on Google Cloud Platform, so you will need a (free) Google Account. In this guide, you will go through the following steps: Create a Google account Launch Terra and sign in with your Google account Link external accounts (Gen3, dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization) 3.2 Step 1: Create a Google Account Terra operates on Google Cloud Platform, so you will need a (free) Google account to create a Terra account and run analyses on AnVIL. If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 3.3 Step 2: Set Up Terra Launch Terra, and you should be prompted to sign in with your Google account. Once you have signed in, your Terra account is set up and your PI or manager should be able to add you to projects and/or Workspaces. If this is the first time you or your team has used Terra, the PI or manager will also need to set up billing. 
You can always access Terra by going to anvil.terra.bio, or by clicking the link on the AnVIL home page. 3.4 Step 3: Link External Accounts (optional) AnVIL provides access to a wide selection of datasets, including controlled-access data. Linking your accounts will enable you to import these data into Terra. The following links will take you to the Terra documentation for setting up and linking external accounts. Set up a Gen3 account - allows you to use the Gen3 data explorer to create artificial cohorts over AnVIL datasets that have been indexed by Gen3. Link Gen3 and Terra accounts - allows you to analyze Gen3 data on Terra. Link Terra and eRA Commons ID - To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). Linking to external servers will allow Terra to automatically determine if you can access controlled datasets hosted in Terra (e.g., TCGA, TOPMed) based on your approved dbGaP applications. 3.5 Wrap-Up Congratulations! You have successfully set up your AnVIL account! Your PI or lab manager should now be able to add you to Workspaces so that you can perform analyses. Please contact your PI or manager to coordinate your user permissions for Terra Projects and Workspaces. To learn more about how to perform analyses on AnVIL, see the Working on AnVIL section of this book. The Workspaces chapter introduces AnVIL Workspaces, the fundamental unit of research organization on AnVIL. All analyses on AnVIL are performed in a Workspace. The Tools, Data, and Workflows chapters explain how to perform a variety of common research tasks on AnVIL. 
"],["consortia.html", "Chapter 4 Consortia 4.1 Account Setup Overview 4.2 Consortium Data Managers 4.3 Consortium PIs and Lab Managers 4.4 Consortium Data Analysts and Researchers 4.5 Consortium Data Submitters 4.6 Wrap-Up", " Chapter 4 Consortia This chapter is targeted towards people who are part of a consortium and plan to use AnVIL as part of their activities. You will find here: Account Setup Overview – Goals for this guide and help choosing a role in the consortium Account Setup Steps – Step-by-step instructions to create your first accounts as part of the consortium on AnVIL Please click on the subsection headers in the left hand navigation bar (e.g., 2.1, 4.3) a second time to expand the table of contents and enable the scroll_highlight feature (see more). 4.1 Account Setup Overview 4.1.1 Goals for This Guide 4.1.2 Choosing a Role Consortia are usually made up of scientists taking on different research roles. This guide will be most helpful to you if you target the section most closely related to your role. The sections are summarized by role below: Data Manager The data manager is responsible for controlling access of consortium members. To protect data security, data managers ensure that consortium members only have access to appropriate datasets. If there are multiple datasets, the data manager will manage consortium member access as appropriate to a subset of those datasets. These limited access permissions are organized via distinct Terra Authorization Domains. Principal Investigator / Lab Manager Principal Investigators are responsible for managing billing expenses accrued by members of the consortium. Within the consortium, PIs can add members to Workspaces managed under specific Google Billing Accounts and Billing Projects, providing centralized control over computing and data storage expenses. 
Principal Investigators will need to work with data managers to ensure their team members have correct data access via Terra Authorization Domains, where necessary. Data Analyst / Researcher A consortium data analyst might describe someone who is performing analyses or writing code, but isn’t responsible for managing dataset access or billing expenses. The Analyst should be able to identify the PI or data manager point of contact in the consortium. Analysts will not be able to run analyses or access data without explicit permission from consortium management. This role also describes research leaders who will participate in activities but will not be managing personnel or expenses as part of the consortium. Data Submitter Data submitters are responsible for working with the AnVIL Data Ingestion Team as well as the National Center for Biotechnology Information (NCBI) to make data available on AnVIL. Data submitters will work with the data manager to ensure the data is protected via the appropriate Terra Authorization Domains. 4.1.3 Consortia Responsibilities It is important for consortium members of all roles to have a common understanding of the consortium’s security and data access policies. 
Consortia leadership are responsible for the following: Drafting Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) Arranging IRB oversight Drafting policy and protocol for data security and management incidents Ensuring all consortium members are aware of all data use limitations and the terms of the consortium agreements Defining what it means to be a consortium member Developing clear timelines and milestones for sharing data with the community rapidly, completely, and in NIH-designated data repositories Creating a timeline for closing out the consortium and thus the consortium-managed, streamlined access for consortium members Communicating the timeline with relevant NHGRI program staff and providing AnVIL with 6 months’ notice of when the consortium’s data access is expected to end 4.2 Consortium Data Managers 4.2.1 Managing Security and Privacy Policies As a data manager, you are responsible for managing access to sensitive data. You should be involved in managing the following: Drafting Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) Arranging and managing IRB oversight Drafting policy and protocol for reporting data security and management incidents Managing data use limitations and conditions of consortium membership 4.2.2 Account Setup To set up your account on AnVIL, please see the chapter for PIs and Lab Managers. Once the setup is complete, return to this page to continue. Go to PIs and Lab Managers chapter 4.2.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.2.4 Set Up Terra Authorization Domains Terra Authorization Domains help keep sensitive genomic data secure while still allowing easy sharing with collaborators. See the Security section on Terra Authorization Domains for more information on setting up these secure access groups. 
4.2.5 Granting Access Data managers should confirm that all other consortium members have two factor authentication set up on their Google accounts prior to granting access to datasets. Granting member access to datasets should be controlled via Terra Authorization Domains. 4.2.6 Revoking Access Revoking access to datasets, either when a member leaves the consortium or the consortium concludes, should be controlled via Terra Authorization Domains. 4.3 Consortium PIs and Lab Managers 4.3.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.3.2 Account Setup To set up your account on AnVIL, please see the chapter for PIs and Lab Managers. Once the setup is complete, return to this page to continue. Go to PIs and Lab Managers chapter 4.3.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.4 Consortium Data Analysts and Researchers 4.4.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.4.2 Account Setup To set up your account on AnVIL, please see the chapter for Data Analysts. Once the setup is complete, return to this page to continue. Go to Data Analysts chapter 4.4.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 
4.5 Consortium Data Submitters 4.5.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.5.2 Account Setup To set up your account on AnVIL, please see the documentation for Data Submitters. Once the setup is complete, return to this page to continue. Go to Data Submitters documentation 4.5.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.6 Wrap-Up Congratulations! You have successfully set up your AnVIL account! You should now be able to move forward with your consortium activities, such as adding personnel to appropriate datasets or working in a Workspace. Remember that you will need to coordinate with other members of the consortium to establish appropriate permissions for datasets, Terra Billing Projects, and Workspaces. "],["workspaces.html", "Chapter 5 Workspaces 5.1 Access Your Workspaces 5.2 Clone an Existing Workspace 5.3 Create a New Workspace 5.4 Add Members to Workspace", " Chapter 5 Workspaces Workspaces are the building blocks of projects in Terra. Inside a Workspace, you can run analyses, launch interactive tools like RStudio and Galaxy, store data, and share results. To get a Workspace of your own, you can Clone a Workspace: Cloning an existing Workspace allows you to copy existing documentation, code, and/or data into your own experimental space. Create a Workspace: Creating a new Workspace from scratch allows you to fully customize the contents. The video below gives a brief introduction to the parts of a Workspace. 5.1 Access Your Workspaces If you are part of a research team, you may have been added to some existing Workspaces. To find and access your Workspaces, follow the steps below. 
Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. You are automatically directed to the “My Workspaces” tab. Here you can see any Workspaces that have been shared with you, along with your permission level. Reader means you can open the Workspace and see everything, but can’t do any computations or make any edits. Writer means you can run computations, which will charge costs to the Workspace’s Billing Project. Writers can also make edits to the Workspace. Owner is similar to Writer, but also allows you to control who can access the Workspace. Click on the name of a Workspace to open it. Opening and viewing a Workspace does not cost anything. When you open a Workspace, you are directed to the Workspace Dashboard. This generally has a description of the Workspace contents, as well as some useful details about the Workspace itself. From here you can navigate through the different tabs of the Workspace, and if you have sufficient permission, you can start running analyses. If you are only a Reader, you may need to “clone” (make your own copy) of the Workspace before you can start working. 5.2 Clone an Existing Workspace Cloning an existing Workspace allows you to copy existing documentation, code, and/or data into your own experimental space. Cloning creates a new copy of the Workspace that will charge costs to the Billing Project of your choice. Note that you can only clone a Workspace if you are at least a “User” on the Terra Billing Project. This helps prevent unwanted charges. Workspaces charge money to their associated Billing Project, regardless of who conducts the activity, so it’s important to be careful about who has permission to use the Workspace (see Add Members to Workspace for details). 
If you need to clone a Workspace but don’t have permission to create your own Workspaces, contact your PI or lab manager so that they can either grant you permission or clone the Workspace for you. The following steps show you how to clone a Workspace that has already been developed by other AnVIL users. When cloning, AnVIL makes a copy of notebooks and code for you to modify. Data however, is linked back to the original Workspace through Data Tables, which saves space! Launch Terra Locate the Workspace you want to clone. If a Workspace has been shared with you ahead of time, it will appear in “MY WORKSPACES”. You can clone a Workspace that was shared with you to perform your own analyses. In the screenshot below, no Workspaces have been shared. If a Workspace hasn’t been shared with you, navigate to the “FEATURED” or “PUBLIC” Workspace tabs. Use the search box to find the Workspace you want to clone. Click the teardrop button on the far right next to the Workspace you want to clone. Click “Clone”. You can also clone the Workspace from the Workspace Dashboard instead of the search results. You will see a popup box appear. Name your Workspace and select the appropriate Terra Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). Remember that each Workspace should have its own Billing Project. If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. 
We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CLONE WORKSPACE”. The new Workspace should now show up under your Workspaces. 5.3 Create a New Workspace Creating a new Workspace from scratch allows you to fully customize the contents. Note that you can only create a new Workspace if you are at least a “User” on a Terra Billing Project. This helps prevent unwanted charges. Workspaces charge money to their associated Billing Project, regardless of who conducts the activity, so it’s important to be careful about who has permission to use the Workspace (see Add Members to Workspace for details). If you need to create a Workspace but don’t have permission, contact your PI or lab manager so that they can either grant you permission or create the Workspace for you. The following steps show you how to create a Workspace so you can get started. Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the plus icon near the top left of the page. Name your Workspace and select the appropriate Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). 
Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CREATE WORKSPACE”. The new Workspace should now show up under your Workspaces. 5.4 Add Members to Workspace Members can be added to a Workspace with a few different permission levels: More details about the permissions associated with each Access Level can be found in the Terra documentation. Managing permissions for a Workspace has important implications: Billing: Terra charges are associated with Workspaces rather than users. Any billable activity that takes place in a given Workspace will be charged to the associated Billing Project, regardless of who conducted the activity. If there are multiple users with permission to compute, it is impossible to tell who conducted the activity. Data access: Especially when working with protected data, it’s important to ensure that users have proper authorization to view the data before giving them access to a Workspace containing the data. Terra provides Authorization Domains to assist with this. Make sure you understand what permissions you are granting before adding someone to your Workspace. To add a member to a Workspace: Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of the Workspace to open the Workspace. Opening a Workspace does not cost anything. 
Certain activities in the Workspace (such as running an analysis) will be charged to the Workspace’s Billing Project. Workspace management (e.g. adding and removing members, editing the description) does not cost money. Click the teardrop button on the right-hand side to open the Workspace management menu. Click “Share”. Enter the email address of the user or Group you’d like to share the Workspace with. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. Choose their permission level. Remember that all activity in the Workspace will be charged to the Workspace’s Billing Project, regardless of who conducts it, so only add members as “Writers” or “Owners” if they should be charging to the Workspace’s Billing Project. “Readers” can view all parts of the Workspace but cannot make edits or run analyses (i.e. they cannot spend money). They can also clone their own copy of the Workspace where they can conduct activity on their own Billing Project. Click “Save”. The user should now be able to see the Workspace when logged in to Terra. "],["jupyter-notebook.html", "Chapter 6 Jupyter Notebook 6.1 Jupyter Notebook: Video tutorial", " Chapter 6 Jupyter Notebook One of the analysis platforms available on AnVIL is Jupyter Notebook. This platform offers accessible analysis reports that can incorporate multiple languages. This chapter focuses on launching Jupyter Notebook and highlighting a few of its features. 6.1 Jupyter Notebook: Video tutorial Here is a video tutorial that describes the basics of using Jupyter Notebook on AnVIL. 6.1.1 Objectives Start compute for your Jupyter environment Create notebook to perform analysis Stop compute to minimize expenses 6.1.2 Slides The slides for this tutorial are located here. 
"],["galaxy.html", "Chapter 7 Galaxy 7.1 Galaxy: Video tutorial 7.2 Galaxy: Step-by-step guide", " Chapter 7 Galaxy One of the analysis platforms available on AnVIL is Galaxy. This platform offers a graphical interface for thousands of tools (https://anvilproject.org/learn/interactive-analysis/getting-started-with-galaxy). This chapter focuses on launching Galaxy and highlighting a few of its features. 7.1 Galaxy: Video tutorial Here is a video tutorial that describes the basics of using Galaxy on AnVIL. 7.1.1 Objectives Start compute for your Galaxy on AnVIL Run tool to quality control sequencing reads Stop compute to minimize expenses 7.1.2 Slides The slides for this tutorial are located here. 7.2 Galaxy: Step-by-step guide This step-by-step guide provides written instructions and screenshots for getting started with Galaxy on AnVIL. 7.2.1 Starting Galaxy Note that, in order to use Galaxy, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open your Workspace, and click on the “Environment configuration” button, a cloud icon on the right-hand side of the screen. Under Galaxy, click on “Create new Environment”. Click on “Next” and “Create” to keep all settings as-is. This will take 8-10 minutes. Click on “Open Galaxy” when the environment is ready. 7.2.2 Navigating Galaxy Notice the three main sections. Tools - These are all of the bioinformatics tool packages available for you to use. The Main Dashboard - This contains flash messages and posts when you first open Galaxy, but once you are working with data it becomes the main interface area. History - When you start a project you will be able to see all of the documents in the project in the history. Be aware that this can become very busy. Also, the naming that Galaxy uses is not very intuitive, so make sure that you label your files with something that makes sense to you. On the welcome page, there are links to tutorials. 
You may try these out on your own. If you want to try a new analysis this is a good place to start. 7.2.3 Deleting Galaxy environment Once you are done with your activity, you’ll need to shut down your Galaxy cloud environment. This frees up the cloud resources for others and minimizes computing cost. The following steps will delete your work, so make sure you are completely finished at this point. Otherwise, you will have to repeat your work from the previous steps. Return to AnVIL, and find the Galaxy logo that shows your cloud environment is running. Click on this logo. Next, click on “Settings”. Click on “Delete Environment”. Finally, select “Delete everything, including persistent disk”. Make sure you are done with the activity and then click “Delete”. "],["rstudio.html", "Chapter 8 RStudio 8.1 RStudio: Video tutorial 8.2 RStudio: Step-by-step guide", " Chapter 8 RStudio One of the analysis platforms available on AnVIL is RStudio. This platform offers rich genomics support through Bioconductor. This chapter focuses on launching RStudio and highlighting a few of its features. 8.1 RStudio: Video tutorial Here is a video tutorial that describes the basics of using RStudio on AnVIL. 8.1.1 Objectives Start compute for your RStudio environment Tour RStudio on AnVIL Stop compute to minimize expenses 8.1.2 Slides The slides for this tutorial are located here. 8.2 RStudio: Step-by-step guide This step-by-step guide provides written instructions and screenshots for getting started with RStudio on AnVIL. 8.2.1 Launch RStudio Cloud Environment AnVIL is very versatile and can scale up to use very powerful cloud computers. It’s very important that you select a cloud computing environment appropriate to your needs to avoid runaway costs. If you are uncertain, start with the default settings; it is fairly easy to increase your compute resources later, if needed, but harder to scale down. 
Note that, in order to use RStudio, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open Terra - use a web browser to go to anvil.terra.bio In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of your Workspace. You should be routed to a link that looks like: https://anvil.terra.bio/#workspaces/<billing-project>/<workspace-name>. Click on the cloud icon on the far right to access your Cloud Environment options. If you don’t see this icon, you may need to scroll to the right. In the dialogue box, click the “Settings” button under RStudio. You will see some configuration options for the RStudio cloud environment, and a list of costs because it costs a small amount of money to use cloud computing. Configure any settings you need for your cloud environment. If you are uncertain about what you need, the default configuration is a reasonable, cost-conservative choice. It is fairly easy to increase your compute resources later, if needed, but harder to scale down. Scroll down and click the “CREATE” button when you are satisfied with your setup. The dialogue box will close and you will be returned to your Workspace. You can see the status of your cloud environment by hovering over the RStudio icon. It will take a few minutes for Terra to request computers and install software. When your environment is ready, its status will change to “Running”. Click on the RStudio logo to open a new dialogue box that will let you launch RStudio. Click the launch icon to open RStudio. This is also where you can pause, modify, or delete your environment when needed. You should now see the RStudio interface with information about the version printed to the console. 8.2.2 Tour RStudio Next, we will be using RStudio and the package Glimma to create interactive plots. See this vignette for more information. 
The Bioconductor team has created a very useful package to programmatically interact with Terra and Google Cloud. Install the AnVIL package. It will make some steps easier as we go along. You can now quickly install precompiled binaries using the AnVIL package’s install() function. We will use it to install the Glimma package and the airway package. The airway package contains a SummarizedExperiment data class. This data describes an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. Note: for some packages, you will have to install from the CRAN repository using the install.packages() function. The examples will show you which install method to use. Load the example data. The multidimensional scaling (MDS) plot is frequently used to explore differences in samples. When this data is MDS transformed, the first two dimensions explain the greatest variance between samples, and the amount of variance decreases monotonically with increasing dimension. The following code will launch a new window where you can interact with the MDS plot. Change the colour_by setting to “groups” so you can easily distinguish between groups. In this data, the “group” is the treatment. You can download the interactive html file by clicking on “Save As”. You can also download plots and other files created directly in RStudio. To download the following plot, click on “Export” and save in your preferred format to the default directory. This saves the file in your cloud environment. You should see the plot in the “Files” pane. Select this file and click “More” > “Export”. Select “Download” to save the file to your local machine. 
8.2.3 Pause RStudio You can view costs and make changes to your cloud environments from the panel on the far right of the page. If you don’t see this panel, you may need to scroll to the right. Running environments will have a green dot, and paused environments will have an orange dot. Hovering over the RStudio icon will show you the costs associated with your RStudio environment. Click on the RStudio icon to open the cloud environment settings. Click the Pause button to pause RStudio. This will take a few minutes. When the environment is paused, an orange dot will be displayed next to the RStudio icon. If you hover over the icon, you will see that it is paused, and has a small ongoing cost as long as it is paused. When you’re ready to resume working, you can do so by clicking the RStudio icon and clicking Resume. The right-hand side icon reminds you that you are accruing cloud computing costs. If you don’t see this icon, you may need to scroll to the right. You should minimize charges when you are not performing an analysis. You can do this by clicking on the RStudio icon and selecting “Pause”. This will release the CPU and memory resources for other people to use. Note that your work will be saved in the environment and continue to accrue a very small cost. This work will be lost if the cloud environment gets deleted. If there is anything you would like to save permanently, it’s a good idea to copy it from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. You can also pause your cloud environment(s) at https://anvil.terra.bio/#clusters. 8.2.4 Delete RStudio Cloud Environment Pausing your cloud environment only temporarily stops your work. When you are ready to delete the cloud environment, click on the RStudio icon on the right-hand side and select “Settings”. If you don’t see this icon, you may need to scroll to the right. Click on “Delete Environment”. 
If you are certain that you do not need the data and configuration on your disk, you should select “Delete everything, including persistent disk”. If there is anything you would like to save, open the compute environment and copy the file(s) from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. Select “DELETE”. You can also delete your cloud environment(s) and disk storage at https://anvil.terra.bio/#clusters. sessionInfo() ## R version 4.3.2 (2023-10-31) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 22.04.4 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Etc/UTC ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## loaded via a namespace (and not attached): ## [1] sass_0.4.8 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 ## [5] stringi_1.8.3 lattice_0.21-9 hms_1.1.3 digest_0.6.34 ## [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 timechange_0.3.0 ## [13] bookdown_0.40 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 ## [17] Matrix_1.6-1.1 processx_3.8.3 chromote_0.3.1 ps_1.7.6 ## [21] promises_1.2.1 httr_1.4.7 fansi_1.0.6 ottrpal_1.3.0 ## [25] udpipe_0.8.11 cow_0.0.0.9000 jquerylib_0.1.4 cli_3.6.2 ## [29] rlang_1.1.4 gitcreds_0.1.2 cachem_1.0.8 yaml_2.3.8 ## [33] tools_4.3.2 tzdb_0.4.0 dplyr_1.1.4 curl_5.2.0 ## [37] png_0.1-8 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 ## [41] lubridate_1.9.3 snakecase_0.11.1 stringr_1.5.1 janitor_2.2.0 ## [45] pkgconfig_2.0.3 
later_1.3.2 pillar_1.9.0 bslib_0.6.1 ## [49] data.table_1.15.0 glue_1.7.0 Rcpp_1.0.12 highr_0.11 ## [53] xfun_0.48 tibble_3.2.1 tidyselect_1.2.0 knitr_1.48 ## [57] textrank_0.3.1 websocket_1.4.2 htmltools_0.5.7 igraph_2.0.2 ## [61] rmarkdown_2.25 webshot2_0.1.1 readr_2.1.5 compiler_4.3.2 ## [65] askpass_1.2.0 openssl_2.1.1 "],["data.html", "Chapter 9 Data 9.1 Bring Your Own Data 9.2 Analyze Existing Data", " Chapter 9 Data Data is stored by Workspaces in two different locations: the Workspace Bucket and the Persistent Disk. The Workspace Bucket is a special Google Cloud Storage bucket that is governed by the built-in AnVIL security policies. This durable and scalable storage location is suitable for both raw data as well as analysis outputs that need to be preserved and/or shared. In contrast, Persistent Disks provide a working directory for Cloud Environments that run Jupyter, RStudio, and Galaxy. Input data can be localized to Persistent Disks for analysis while output data can be transferred to the Workspace Bucket for more reliable long term storage. Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize input for analyses as well as tracking workflow outputs. More details can be found in the Terra documentation. 9.1 Bring Your Own Data 9.1.1 Overview The starting point for bringing your own data to AnVIL is the Workspace Dashboard. At the bottom right, you’ll find the full path to the Google Bucket information corresponding to your Workspace. You can click the clipboard icon on the right to copy the name of your Workspace Bucket. You will be able to see any uploaded files by clicking the “Open in browser” link. You can also see any uploaded files by clicking the “Files” directory at the bottom left in the Data Tab. 9.1.2 Browser: Upload Single Files Click the “Files” directory at the bottom left of the Data Tab. 
Then click the “+” button in the bottom right corner of the screen. This will prompt a file browser on your local machine. 9.1.3 Browser: Upload Folders Click the “Open in browser” link on the bottom right of the Workspace Dashboard Tab. This will open a new browser window or tab directed to your Workspace’s Google Bucket on the Google Cloud Platform. Here, you can upload files and manage your data and folders. You can also upload an entire folder by clicking on “UPLOAD FOLDER”. 9.1.4 gsutil: Local to Cloud gsutil is a Python application that lets you access Cloud Storage from the command line in a terminal. The terminal you use can be run on your local machine (local instance) or built into the Workspace Cloud Environment. 9.1.4.1 Install gsutil on Your Local Computer or Local Server Cloud SDK is a set of tools that you can use to manage resources and applications hosted on Google Cloud. These tools include the gsutil command-line tool. Ensure you have a terminal available. MacOS and Linux users have a terminal application available by default. Terminal applications are also available through third party software, such as RStudio. Windows users should download a terminal application, such as Putty. Install Cloud SDK following the appropriate link below: Windows MacOS Linux Test that Cloud SDK has been successfully installed by typing gsutil in the terminal application prompt: gsutil If the installation was successful, you should see information about using gsutil that looks like the following: Usage: gsutil [-D] [-DD] [-h header]... [-i service_account] [-m] [-o] [-q] [-u user_project] [command [opts...] args...] If the installation was not successful, you should see a warning that gsutil was not found. Please return to the installation steps to ensure they have been completed correctly. command not found: gsutil 9.1.4.2 Copy Files From Your Local Computer to a Workspace Bucket The gsutil cp command allows you to copy data from one machine to another. 
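The install check and the copy step in this section can be combined in a small script. The following is a minimal sketch, not an official AnVIL or Google tool: the function names have_gsutil and copy_to_bucket and the DRY_RUN variable are hypothetical names introduced here for illustration, and only the plain gsutil cp command comes from this guide. With DRY_RUN left at its default of 1, the script only prints the command it would run, so paths can be checked before any data is transferred.

```shell
#!/usr/bin/env bash
# Sketch only: have_gsutil, copy_to_bucket, and DRY_RUN are illustrative
# names invented for this example; they are not part of gsutil.

# Succeed if gsutil is on the PATH (same idea as typing "gsutil" to test the install).
have_gsutil() {
  command -v gsutil >/dev/null 2>&1
}

# Copy one local file to a bucket, or only print the command when DRY_RUN=1.
copy_to_bucket() {
  local src="$1" bucket="$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "gsutil cp ${src} ${bucket}"
  elif have_gsutil; then
    gsutil cp "${src}" "${bucket}"
  else
    echo "gsutil not found; install Cloud SDK first" >&2
    return 1
  fi
}

# Example call using the file and bucket names from this section (dry run by default).
copy_to_bucket users/name/data/test.bam gs://ab5-27x
```

Setting DRY_RUN=0 before calling copy_to_bucket would perform the real copy, provided gsutil is installed and you are authenticated with an account that can write to the bucket.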
On your local machine’s terminal, you should use the command in the following format: gsutil cp where_to_copy_data_from/filename where_to_copy_data_to Example: To copy the file test.bam located on your local computer at users/name/data/ into the Workspace Bucket gs://ab5-27x on the cloud: gsutil cp users/name/data/test.bam gs://ab5-27x Remember that you can easily copy the Workspace Bucket ID using the clipboard button on the Workspace Dashboard. Please see the gsutil cp documentation for more details, such as how to do parallel multi-threaded/multi-processing copying or copying an entire directory tree. The gsutil cp command can also be used to copy files from one Workspace Bucket to another (cloud-to-cloud copying). 9.2 Analyze Existing Data In addition to bringing your own data, you can use existing data on AnVIL. Using the following resources can help you discover data to use in your analyses. 9.2.1 AnVIL Data Library The Datasets Library is a good place to get started and familiarize yourself with existing data. Here, you can find curated datasets from thousands of participants. Some of these are open access (such as the 1000 Genomes dataset) while others will require you to request access. Taking a look at Featured Workspaces can get you started quickly. Remember that when you clone a Workspace, AnVIL automatically cross-links to the original data contained within the Data Tables. 9.2.2 AnVIL Dataset Catalog The AnVIL Dataset Catalog displays key NHGRI datasets accessible in AnVIL, such as the CCDG (Centers for Common Disease Genomics), CMG (Centers for Mendelian Genomics), eMERGE (Electronic Medical Records and Genomics), as well as other relevant datasets. You will need to coordinate access to controlled data. 9.2.3 Gen3 Data Explorer The Gen3 Data Explorer and Data Commons provides their API for data queries and downloads, supporting cross-project analyses. Gen3 provides access to open and protected datasets that can be exported to an AnVIL Workspace. 
For example, users can find the 1000 Genomes dataset on Gen3 and filter by ancestry, age, and other features prior to performing analyses on AnVIL. "],["workflows.html", "Chapter 10 Workflows", " Chapter 10 Workflows Workflows allow you to run whole genomic pipelines in Terra. Workflows are written in WDL (Workflow Description Language), which is a human-readable and writable language originally developed for genomic analysis pipelines. Terra on AnVIL is specifically designed to integrate a WDL workflow’s input and output information directly into the platform. This integration allows you to easily configure a workflow from your Workspace. Established workflows help make analyses more reproducible. They also make it easier to configure, launch, and monitor analyses across many samples. You can access workflows from a Workspace by clicking on the Workflows Tab. To run a workflow you can either use a pre-configured workflow, or create your own workflow. 10.0.1 Pre-configured Workflow You will need to clone a Workspace that contains the workflow you’d like to run. Refer to Terra’s quickstart guide for running a pre-configured workflow. You can browse code and workflows in the AnVIL library here. 10.0.2 Create a Custom Workflow Custom workflows are written in WDL. Refer to Terra’s “Hello, learn-wdl!” tutorial to learn the basics before fully customizing your analysis pipeline. "],["faqs.html", "A FAQs", " A FAQs A.0.1 What is AnVIL? Watch our 4 minute video Read more at https://anvilproject.org/overview See more Overview Videos A.0.2 How Do I Get Started? Identify Billing Account(s) e.g. Google $300 Free Trial, AC2 Make an account (see PIs and Lab Managers example) A.0.3 What Data Is Available? Browse the Data Dashboard Request access if necessary for Controlled Access Data or Consortia Access Data Bring Your Own Data A.0.4 How Do I Perform an Analysis? 
Launch interactive tools like Jupyter, RStudio, Galaxy “Terminal” available through Jupyter and RStudio Migrate your HPC analysis to AnVIL using Docker/WDLs Introducing the Learn WDL Course (10 min video) Basic WDL covered in Webinar 2 from BDC Spring 21 Become a WDL super hero with https://github.com/openwdl/learn-wdl A.0.5 Where Can I Get Help? General questions in https://help.anvilproject.org Additional options at https://anvilproject.org/help "],["troubleshooting.html", "B Troubleshooting", " B Troubleshooting B.0.1 Create Terra Billing Project The error messages generated during billing project creation can be hard to understand (see this example). If you get an error message, double-check that your project name follows the rules. If your chosen name follows the rules but you are still getting an error, then the name may already be taken - try adding something (e.g. your lab or team name, date, etc.) to make it unique. "],["authorization-domains.html", "C Authorization Domains", " C Authorization Domains Terra Authorization Domains help keep sensitive genomic data secure while still allowing easy sharing with collaborators. Authorization Domains work like a badge. This badge can be associated with a Workspace that allows access only to people with the same badge. When you clone a Workspace that has an Authorization Domain, the new copy will also have the badge: anyone who wants to access the cloned Workspace has to have the badge. You don’t have to worry about accidentally sharing sensitive data because if you try to share the cloned Workspace with a user who doesn’t have the right badge, that researcher won’t be able to enter. An Authorization Domain is a Managed Group with strictly defined and enforced Workspace permissions where they: Restrict Workspace access to only the individuals in the group Are assigned to Workspaces when they are created Follow all Workspace copies Learn how to create an Authorization Domain here. 
"],["budget-templates.html", "D Budget Templates", " D Budget Templates If you want to apply for a grant and you plan to use the AnVIL platform for data storage, data movement, and data analysis, you can include the anticipated costs in your proposal. We have created a template for the budget justification paragraph of your grant proposal. The documents described below can help you estimate and justify these costs. D.0.1 Types of Costs Three types of costs typically occur when performing operations on the Google Cloud Platform. 1. Cost for Computing is driven by your particular CPU and memory requirements. Importantly, you can save money if your work can tolerate being interrupted (also known as a preemptible compute resource). In this case, you pay less per hour with the understanding that your work may be interrupted by a customer willing to pay more. Details and current pricing can be found here. 2. Cost for Storage is driven by the amount of data and the length of time you wish to store the data. Here, you can save money if you have data that you do not plan to access frequently. This would be the case for raw data that has already been processed, backups, and archives. Details and current pricing can be found here. 3. Cost for Network Usage (egress) applies to data being transferred out of a Cloud resource. In this context, a Cloud resource refers to a set of computers in a particular region. This would apply, for example, if you transferred data from Google’s East Coast computers to Amazon’s West Coast computers. In general, while it’s free to upload data to the Cloud, you will incur costs when downloading data to your local computer or between Cloud regions. Details and current pricing can be found here. D.0.2 Usage of Budget Templates In a first step, you can use the template Google Sheet AnVIL_Cost_Estimator to calculate costs for computing, storage, and network usage (egress) for your proposal. 
In a second step, you can use the template Google Doc AnVIL_Budget_Justification to create a budget justification paragraph for your proposal by including the information highlighted in pink (mostly copying entries from your Google Sheet AnVIL_Cost_Estimator). Please download and adapt both documents to your project. Please check that the prices are up to date by using the links listed below or in the AnVIL_Cost_Estimator. For further guidance, you can have a look at a completed document AnVIL_Budget_Justification_Example. "],["irb-templates.html", "E IRB Templates", " E IRB Templates If you plan to use the AnVIL platform for data storage, data movement, and data analysis, you will likely need to provide information about AnVIL in your IRB application. While all IRBs will have different questions and requirements, there are general areas of concern that are common across IRB applications. The following sections provide general information about AnVIL as well as examples of language addressing some of these concerns. E.0.1 General Information When writing an IRB research plan you will be required to fill out a few fields about data collection and data storage. The following sections provide the information that you will need for these IRB sections. There is also language regarding consent, with some suggested text that you may include in the IRB research plan and your consent form. If you are using AnVIL, you should know that the storage is on Google Cloud Platform, so it is encrypted, it is not stored on a secure server, and it is backed up through multi-site replication. 
Important things to know about AnVIL security are that there are four key design concepts: Authenticate: All components require authentication at every step, not just the perimeter Authorize: All data requires explicit authorization to access Audit: All data access is logged (to a different system), with alerts for anomalous events Encrypt: All data-in-transit and all data-at-rest is encrypted There is language below about data storage, which you will need for your research plan. There is also a section about privacy and confidentiality, which you will also need. E.0.2 Description of AnVIL The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space, or AnVIL provides a collaborative environment for creating and sharing data and analysis workflows for both users with limited computational expertise and sophisticated data scientist users. The platform is built on a set of established components that have been used in a number of flagship scientific projects: The Terra platform: provides a compute environment with secure data and analysis sharing capabilities Dockstore: provides standards based sharing of containerized tools and workflows The Gen3 data commons framework: provides data and metadata ingest, querying, and organization Bioconductor and Galaxy: provide environments for users at different skill levels to construct and execute analyses E.0.3 Data Storage We will keep a copy of the participant’s genomics data in various data formats (formats - CRAMs, BAMs, vcfs, BCLs, fastqs). This will be stored in Broad’s Google Cloud services, which is the underlying storage platform for AnVIL. The Google Cloud Platform has summarized its services with respect to genomics data processing in a white paper here: https://cloud.google.com/files/genomics-data-wp.pdf E.0.4 Privacy and Confidentiality There is a small risk of loss of privacy and confidentiality for the participant and their personal health information. 
While it is unlikely, there is a possibility that genetic information could be seen by unauthorized individuals. Researchers who will have access to genetic information about participants will take measures to maintain the privacy of the participant and confidentiality of their information. E.0.5 Security Terra (powered by Broad Genomics’ Workbench) is operated by the Broad Institute at the FISMA (Federal Information Systems Management Act) “moderate” level and received Authority to Operate from NCI and NIH in May of 2016. FISMA is a practice of documentation, audit, and organizational risk acceptance. It is centered on the controls outlined in NIST (National Institute of Standards and Technology) Special Publications 800-30 and 800-53. Covered topics include: Network penetration testing and assessment by a Federally authorized outside firm Maintaining system logs separate from the primary system for forensic analysis Regular review of logs and changes by an in-house auditor Security training and background screening for staff with elevated access to the system Documented procedures to respond to security incidents The Terra portal and its underlying platform, Broad Genomics’ Workbench, are hosted on Google’s Cloud Services. See below for details. Since Terra requires that users utilize Google logins, the application operates on top of Google’s world-class security that protects from nation-state level attacks. Terra supports the use of Google’s 2 Factor authentication as well. As a FISMA Moderate system, all logs are audited continually and various levels of security layering are required. These include Web Application Firewalls, weekly scanning, code scanning (dynamic and static), dependency scanning and manual penetration testing. Data analysis is constrained to computing nodes that are sandboxed using Docker within Google’s Pipelines API. 
Google undergoes several independent third party audits on a regular basis to provide verification of security, privacy and compliance controls including annual audits for SSAE 16/ISAE 3402 Type II. Google’s infrastructure provides reliable information security that can meet or exceed the requirements of HIPAA and protected health information. Broad Information Security & Compliance Team provides risk review and consulting security services for legacy, current, and future computing infrastructure, applications, and information services across the enterprise. Risk management and mitigation services are also provided to assist business units with meeting regulatory requirements. The team also takes a lead role in incident response processes, including network, system, and data forensic efforts and is responsible for developing and ensuring proper review of all security policies, standards, and procedures. The Senior Director of Information Security & Compliance (DISC) facilitates the development and maintenance of the information security program, leads the development of information security policies, standards and procedures, and collaborates with business and system owners in understanding their security responsibilities. The DISC also promotes consistency in the development and implementation of strategic initiatives and policies based on industry and regulatory standards (FISMA, etc). The Broad Institute has a strong commitment to information security, appropriate use of data, compliance with applicable regulations, and respect for data privacy. E.0.6 Example Language for Informed Consent Below we provide some possible language that could be used for consent forms. E.0.6.1 Types of Research We may use your sample to do other biological and genetic research. Genetic research may include looking at some or all of your genes and DNA to see if there are links to different types of health conditions. 
We may use your sample and your DNA sequence information to help create new analytical and machine learning methods to look at genomic and clinical data. We may share your samples, your DNA sequence information, your health information, and results from our research results with other data banks, such as those sponsored by the National Institutes of Health, so that researchers from around the world can use them to study many conditions. A code will be assigned to your samples and health information. Your name, medical record number, or other information that easily identifies you will not be stored with your samples or health information. The key to the code that connects your name to your samples and information will be stored securely in a controlled access database. This method is discussed more below. E.0.6.2 Data Repository Submission Your individual genomic data and health information may be put in a controlled-access data commons. This means that only researchers who apply for and get permission to use the information for a specific research project will be able to access the information. If this happens, your genomic data and health information will not be labeled with your name or other information that could be used to identify you. Your data and information will be entered in the database with the random study identifier created by the research team. Researchers approved to access information in the database will agree not to attempt to identify you. E.0.6.3 Potential Risks Every research study involves the risk of loss of privacy. The personal information that will be associated with your WGS cannot typically be used to identify you. However, it is possible that in the future there may be a way to link stored genomic information back to you. We believe that the possibility of this is low. Your information will be stored on a cloud-computing storage system administered by The Broad Institute. 
E.0.7 Additional Resources Workspace access controls https://support.terra.bio/hc/en-us/articles/360025851892 Workspace owners (e.g. PI, Lab Manager, Data Coordination Center) can grant read / write / compute access, and can revoke access Authorization Domains See our introduction here https://support.terra.bio/hc/en-us/articles/360026775691 Must be set up when a Workspace is created - prevents accidental sharing "],["overview-videos.html", "F Overview Videos", " F Overview Videos AnVIL – Streamlining Accessibility and Computability of Large-scale Genomic Datasets – 4m58s 9/30/20 Workspaces – Introduction to using Workspaces in Terra – 9m52s 4/5/20 Jupyter – Interactive Analysis using a Jupyter Notebook in Terra – 9m40s 5/7/20 RStudio – A sneak preview of RStudio in Terra – 6m57s 1/22/21 Galaxy – Galaxy on AnVIL Walkthrough – 5m50s 12/9/20 Data Tables – Introduction to Data Tables in Terra – 5m25s 4/5/20 Dockstore – Importing a GATK workflow from Dockstore into Terra – 8m29s 10/15/20 Workflows – Configuring and running a GATK workflow on Gen3 data in Terra – 28m10s 11/25/20 "],["give-us-feedback.html", "G Give us Feedback", " G Give us Feedback Thank you for your interest in this book! There are a few ways you can suggest improvements: Fill out this Google form. If you have a GitHub account, you can raise an issue in our repository. Submit a pull request! Click the pencil icon on any page (top left) to view the source .Rmd for the page and suggest changes. "],["about.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     In memory of James Taylor, who was instrumental in initiating this project.   
Credits Names Pedagogy Lead Content Instructors Katherine Cox, Ava Hoffman Content Contributors Kai Kammers, Frederick Tan, Sarah Wheelan Content Editors/Reviewers Natalie Kucher, Jeff Leek, Valerie Reeves, Frederick Tan Content Directors Jeff Leek, Frederick Tan Content Consultants Allie Cliffe, Candace Patterson Production Content Publisher Ira Gooding Technical Template Publishing Engineers Candace Savonen, Carrie Wright Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Candace Savonen Package Developers (ottrpal) John Muschelli, Candace Savonen, Carrie Wright Funding Funder National Human Genome Research Institute (NHGRI) #5U24HG010263 Funding Staff Fallon Bachman, Jennifer Vessio, Emily Voeglein   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-10-22 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.40 2024-07-02 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.48 
2024-07-07 [1] CRAN (R 4.3.2) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.25 2023-09-18 [1] RSPM (R 4.3.0) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.48 2024-10-03 [1] CRAN (R 4.3.2) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. 
"]] +[["index.html", "Getting Started on AnVIL About this Book", " Getting Started on AnVIL October 28, 2024 About this Book This book is part of a series of books for the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) of the National Human Genome Research Institute (NHGRI). Here, we present opinionated step-by-step guides for setting up accounts focused on three personas: PIs, Analysts, and Consortia. Skills Level Please choose the closest matching persona from the lefthand menu. Genetics Novice: no genetics knowledge needed Programming skills Novice: no programming experience needed AnVIL Collection Additional guides are provided to help you with Workspaces, launch interactive tools, and start working with data. Learn more about AnVIL by visiting https://anvilproject.org or reading the article in Cell Genomics. Please check out our full collection of AnVIL and related resources: https://hutchdatascience.org/AnVIL_Collection/ "],["introduction.html", "Chapter 1 Introduction", " Chapter 1 Introduction Welcome to the AnVIL Getting Started guide! In it you will find step-by-step instructions for setting up your accounts, as well as guides on how to use some of AnVIL’s key features. 1.0.1 What Is AnVIL? AnVIL is NHGRI’s Genomic Data Science Analysis, Visualization, and Informatics Lab-space. It provides a platform for performing genomic data analysis on the cloud. 1.0.2 Does AnVIL Cost Money? Through AnVIL, you pay for computing resources as you use them. If you’d like to try it out, new users can claim a $300 Google Cloud credit to test out the platform and perform some small analyses. We also provide a cost estimator. 1.0.3 Where Can I Get Help? Please visit our community support forum at help.anvilproject.org with any questions (or suggestions!) you may have. 1.0.4 How to Use This Book This book is not intended to be read through sequentially, rather, it is a collection of guides that you can reference based on your needs. 
It is divided into two major sections: Account Setup Step-by-step instructions for new AnVIL users to set up their accounts and start using the AnVIL platform. We have included recommendations for configuring your accounts based on several common use cases: PIs and Lab Managers: managing a team of researchers working on AnVIL Data Analysts: joining a team working on AnVIL Consortia: using AnVIL as part of a research consortium Working on AnVIL Examples and walkthroughs of common tasks on the AnVIL platform: Workspaces: how to create and clone research spaces on AnVIL Tools: how to run common tools including Jupyter Notebooks, Galaxy, and RStudio Data: how to find and access AnVIL datasets, as well as upload and manage your own data Workflows: how to find and run existing automated data processing pipelines, and how to customize and share your own 1.0.5 Activate scroll_highlight Feature Note that some sections of this book cover steps in a lot of detail. When navigating the table of contents, you can click subsection (e.g., 2.1, 4.3) headers a second time to expand the table of contents and enable the scroll_highlight feature. This can help you follow the separate steps within more clearly. "],["pis-and-lab-managers.html", "Chapter 2 PIs and Lab Managers 2.1 Account Setup Overview 2.2 Step 1: Create a Google Account 2.3 Step 2: Set Up Google Billing 2.4 Step 3: Add Terra to Google Billing Account 2.5 Step 4: Create Terra Billing Projects 2.6 Step 5: Set Budgets and Alerts 2.7 Step 6: Add Users and Workspaces 2.8 Wrap-Up", " Chapter 2 PIs and Lab Managers This chapter is targeted towards people who are responsible for bringing a team to AnVIL. Broadly targeted towards principal investigators (PIs), but also relevant to team leads or lab managers, you will find here: Account Setup Overview – Design philosophy and goals for this guide - is this a good fit for your team? What should you know before you start? 
Account Setup Steps – Step-by-step instructions to create your first accounts on AnVIL and connect your team members The Appendices of this book contain additional information that may be of interest, including: Templates for including AnVIL in grant applications (Budget Templates, IRB Templates) Information regarding AnVIL’s security features for protecting sensitive research (Authorization Domains) Please click on the subsection headers in the left hand navigation bar (e.g., 2.1, 4.3) a second time to expand the table of contents and enable the scroll_highlight feature (see more). 2.1 Account Setup Overview 2.1.1 Goals for This Guide 2.1.2 Design Philosophy This guide provides an opinionated walkthrough on how to set up AnVIL for your lab, based on experiences from many labs actively using AnVIL. These step-by-step instructions take team leads who are completely new to AnVIL through account setup to the point where team members can start working on AnVIL. Following the recommendations in this guide will help you more clearly see where charges are coming from and have greater control over which users can spend your money and access your data. 
In support of these goals we have made the following design decisions: COST CONTROL Prevent charges to your funding account until you explicitly give authorization by starting with Google’s free $300 credit program Control who can charge to your account by limiting who can “share” permission to compute - yourself and any designated “Lab Managers” COST TRANSPARENCY Allow fine-grain accounting of who spent what by creating individual “Billing Projects” for each user Monitor costs by setting up email alerts to warn you when you reach spending thresholds Enable detailed analysis of costs by exporting cost data using BigQuery DATA ACCESS CONTROLS Reduce unwanted access by limiting who can “share” your data and analyses - yourself and any designated “Lab Managers” Stricter data access management can be enforced through “Authorization Domains”; however this can make future sharing and publication difficult. This guide recommends avoiding Authorization Domains for most uses, especially as you are starting out. If you are working with highly sensitive data, see this documentation for more information. These design decisions are made to help you get up and running as quickly as possible without overwhelming new users. As your experience and comfort with AnVIL grows, you will likely change your design to better match your unique needs e.g. enabling Authorization Domains when working with protected data. 2.1.3 Before You Start You will need a credit card or bank account to activate your free trial and get started. Don’t worry! You won’t be billed until you explicitly turn on automatic billing, but payment information is needed for verification purposes. Before setting up billing yourself, you may want to check with your institutional procurement office and see if they have a preferred account set-up method with Google (such as a third party reseller or an existing account). To add lab members, you will need to know the Google account they will use to access Terra. 
You can send lab members to the Data Analysts chapter for instructions on how they can sign up and start working on AnVIL. You can complete most setup steps without this information and then add them once you know the correct accounts. 2.1.4 Starting Setup AnVIL uses Terra to run analyses. Terra operates on Google Cloud Platform (GCP), so you’ll pay for all storage and analysis costs through a Google account linked to Terra. The costs are the standard Google Cloud Platform fees for storing and moving data as well as executing an analysis. These costs are passed along through Terra without any markup. Create a Google account Set up Google Billing (and claim your free credits!). Add an administrator or viewer (optional) Link Terra to the Google Billing Account Create Terra Billing Projects Set budgets and alerts (optional, but highly recommended) Add users and Workspaces 2.1.5 Lab Management Roles While there are many ways to configure your lab, this guide defines the following roles and responsibilities: PI - The PI sets up the lab’s Google Cloud Account, creates its Google Billing Account(s) and Google Payment Method(s), links Terra with GCP, and invites Lab Managers to be Google Cloud “Billing Account Users.” Lab Manager (Optional) - A Lab Manager creates or clones Terra Workspaces and manages who can use those Workspaces. The Lab Manager is also responsible for creating one or more Terra Billing Projects and configuring GCP budgets and alerts. Importantly, lab managers control who can spend lab money and should have an understanding of Google Cloud Billing and Terra Billing Projects. Depending on your lab, the PI may choose to be the only Lab Manager, or may appoint trusted lab members to assist. Data Analyst - A lab member who is granted write + can-compute access on one or more Terra Workspaces by a Lab Manager and who will run analyses in Terra. Data Analysts cannot share Terra Workspaces (this prevents them from enabling others to spend lab money). 
2.2 Step 1: Create a Google Account Terra operates on Google Cloud Platform, so you will need a (free) Google account which will allow you to Access the Terra platform to manage team members, data, and analyses Access Google Cloud Platform to manage billing Receive alerts when spending reaches specified thresholds If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 2.3 Step 2: Set Up Google Billing Terra operates on Google Cloud Platform, and does not charge any markup. Rather than paying Terra or AnVIL, users set up billing directly with Google Cloud Platform. Make sure to use the same Google account ID you use to log into Terra for Google Cloud Billing. To set up billing, you must first create a Google “Billing Account”. You can create multiple Billing Accounts associated with your Google ID. We recommend creating separate Billing Accounts for different funding sources. 2.3.1 Create a Google Billing Account Log in to the Google Cloud Platform console using your Google ID. Make sure to use the same Google account ID you use to log into Terra. If you are a first time user, don’t forget to claim your free credits! If you haven’t been to the console before, once you accept the Terms of Service you will be greeted with an invitation to “Try for Free.” Follow the instructions to sign up for a Billing Account and get your credits. Choose “Individual Account”. This “billing account” is just for managing billing, so you don’t need to be able to add your team members. You will need to give either a credit card or bank account for security. Don’t worry! You won’t be billed until you explicitly turn on automatic billing. 
You can view and edit your new Billing Account by selecting “Billing” from the left-hand menu, or going directly to the billing console console.cloud.google.com/billing Clicking on the Billing Account name will allow you to manage the account, including accessing reports, setting alerts, and managing payments and billing. At any point, you can create additional Billing Accounts using the Create Account button. We generally recommend creating a new Billing Account for each funding source. 2.3.2 Add Users or Viewers (optional) If you have a project manager or finance administrator who needs access to a Billing Account, you can add them with a few different levels of permissions. Generally the most useful are: Users have a great deal of power over spending - they can create new “Billing Projects” and control who can spend money on those projects. If you have a lab or accounts manager responsible for expenses, it may make sense to add them as a Billing Account User. If you wish to retain full control over who can spend money on GCP, you should not add any Users. Viewers can see the activity in the Billing Account but can’t make any changes. This can be useful for finance staff who need access to the reports, or for lab members to be able to see what their analyses are costing. Anyone you wish to add to the Billing Account will need their own Google ID. To add a member to a Billing Account: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add a member to, click “ADD MEMBER”. Enter their Google ID in the text box. In the drop-down menu, mouse over Billing, then choose the appropriate role. Click “SAVE”. 
2.4 Step 3: Add Terra to Google Billing Account This gives Terra permission to create projects and send charges to the Google Billing Account, and must be done by an administrator of the Google Billing Account. Terra needs to be added as a “Billing Account User”: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add Terra to, click “ADD MEMBER”. Enter terra-billing@terra.bio in the text box. In the drop-down menu, mouse over Billing, then choose “Billing Account User”. Click “SAVE”. 2.5 Step 4: Create Terra Billing Projects This is how you enable Terra users to charge to the Google Billing Account. Note that Google will report charges at the level of Billing Projects. If you create only one Billing Project for your lab, you will not be able to see a breakdown of where charges are coming from. It is highly recommended that you create separate Billing Projects for each category of spending you would like to track. For example: A Billing Project for each lab member, if you would like to track individual spending A Billing Project for each analysis type, if you would like to track spending on e.g. RNA-seq vs. variant calling. A Billing Project for each cohort, if you would like to track spending per data set If you are uncertain, we recommend starting by setting up a Billing Project per lab member. This makes it easy to track lab member spending, and also makes it easier to cleanly shut down projects when a member leaves the lab. 2.5.1 Create a Billing Project Launch Terra and sign in with your Google account. If this is your first time logging in to Terra, you will need to accept the Terms of Service. 
In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing On the Billing page, click the “+ CREATE” button to create a new Billing Project. Select GCP Billing Project (Google’s Platform). If prompted, select the Google account to use and give Terra permission to manage Google Cloud Platform billing accounts. Enter a unique name for your Terra Billing Project and select the appropriate Google Billing Account. The name of the Terra Billing Project must: Only contain lowercase letters, numbers and hyphens Start with a lowercase letter Not end with a hyphen Be between 6 and 30 characters Select the Google Billing Account to use. All activities conducted under your new Terra Billing Project will charge to this Google Billing Account. If prompted, give Terra permission to manage Google Cloud Platform billing accounts. Click “Create”. Your new Billing Project should now show up in the list of Billing Projects Owned by You. You can add additional members or can modify or deactivate the Billing Project at any time by clicking on its name in this list. The page doesn’t always update as soon as the Billing Project is created. If it’s been a couple of minutes and you don’t see a change, try refreshing the page. As mentioned above, we recommend creating separate Terra Billing Projects for each of your team members so you can track their spending. These Billing Projects can all be associated with the same Google Billing Account if they are all funded by the same source. Having trouble? Check out the Troubleshooting appendix Visit our community support forum at help.anvilproject.org with any questions. 2.6 Step 5: Set Budgets and Alerts Cloud computing can save a great deal of money, time and effort by providing compute on an as-needed basis. 
However, care must be taken that users do not accidentally request excessive resources, or leave resources running when not needed. Unfortunately, there are two issues that make direct cost control difficult: The Google Cloud billing interface does not provide a way to automatically cancel computations when a spending threshold is reached Compute costs are reported with a delay (~1 day) As a PI or lab manager, there are some steps you can take to help monitor and limit spending: Be careful with members and permissions in your Billing Projects and Workspaces on Terra (see Adding Users and Workspaces for recommended setup) Most importantly, monitor your spending so you can shut down unnecessary expensive activities before they have time to accumulate. Terra provides extensive documentation and examples regarding cost management while working in the cloud We highly recommend that you set budgets and alerts to notify you if spending starts to exceed expectations. This will make it easier to notice and shut down any accidental overspending. A good starting point is to set a monthly budget, and then set alerts at 50 percent and 90 percent of expected spend. You can add additional alerts if you desire. You can set a single Budget for your entire lab, set up individual budgets for each Billing Project, or even set budgets for certain subsets of your Billing Projects. This will depend on the size of your lab and how closely you want to monitor spending. More granular budgets make it quicker to notice and track down overspending from a particular project but mean you will get more emails every month. When setting budgets with broader scope, you can always find out which particular Billing Project is spending the money by checking in the GCP Billing interface. NOTE: There may be some restrictions on the budgets and alerts you can set while you’re using GCP’s free credits. 
At the time of writing (Feb 2021) you are not able to set budgets for individual projects while you are using the GCP free credits, but can still set an overall budget. Any restrictions should be lifted when you upgrade to a paid account. 2.6.1 Set Alerts Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account you want to set alerts for. In the left-hand menu, click “Budgets & alerts”. Click the “Create Budget” tab. Enter a name for your budget, and then choose which projects you want to monitor. Then click “Next”. For Budget Type, select “Specified amount”. Enter the total budget amount for the month (you will set alerts at different thresholds in the next step). Click “Next” (do not click “Finish”). Enter the threshold amounts where you want to receive an alert. We recommend starting with 50% and 90%. You can set other alerts if you prefer. Check the box for “Email alerts to billing admins and users”, then click “Finish”. Now you (as the owner and admin), along with anyone you added with admin or user privileges (e.g. lab managers) will receive alerts when your lab members reach the specified spending thresholds. These emails will be sent to the Gmail accounts associated with the Billing Account. You can edit your budgets at any time by going to Billing > Budgets & alerts, and clicking on the name of the budget you want to edit. 
2.6.2 View spend You can always check your current spend through the Google Billing console, but remember There is a reporting delay (~1 day), so you cannot immediately see what an analysis cost Costs are reported at the level of Workspaces, so if there are multiple people using a Workspace, you will not be able to determine which of them was responsible for the charges. The Google Billing console displays information by Billing Account. To view spending: Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account for the project you want to view. Look at the top of the Overview tab to see your month-to-date spending. Scroll further down the Overview tab to show your top projects. Click on the Reports tab to see more detailed information about each of your projects. This is probably the most useful tab for exploring costs of individual projects over time. Click on the Cost table tab to obtain a convenient table of spending per project. 2.6.3 Export Cost Data to BigQuery Coming soon – instructions on how to export your cost data so you can better analyze and control your expenses. 2.7 Step 6: Add Users and Workspaces Finally, back on Terra, you can add lab members and give them permission to run analyses funded through your Billing Projects. There are two primary ways to permit users to charge to your Billing Projects: Add them directly to the Billing Project. This gives them flexibility to create and manage their own Workspaces, but reduces your control over spending. Anyone they add to their Workspaces with sufficient permissions (i.e. 
permission to compute) can charge to your Billing Project. Create a Workspace yourself, and add them to the Workspace (or have a designated Lab Manager responsible for managing Workspaces). This gives you much more control over who can charge to your Billing Project. Billing permissions on Terra can be confusing. For this reason, we recommend starting by having a single person responsible for managing all Workspaces (either yourself or a trusted “lab manager”). This person should create all Workspaces and add lab members as Writers (not Owners) to the Workspaces. This provides the greatest control over spending. Once you are familiar with the permissions system and are certain your lab members understand the implication of different permission settings, you may decide to give them greater control over Workspace access. 2.7.1 Create a New Workspace Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the plus icon near the top left of the page. Name your Workspace and select the appropriate Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). 
For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CREATE WORKSPACE”. The new Workspace should now show up under your Workspaces. To start, we recommend creating one Workspace for each lab member (associated with that lab member’s Billing Project, with separate Billing Projects for your lab members). This will enable you and your lab members to familiarize yourself with Workspaces and decide how best to organize your work. You can then create additional Workspaces as needed. 2.7.2 Add Members to Workspaces Lab members must have logged in to Terra at least once before they can be added to your Billing Projects and Workspaces (they do not need to log in to Google Cloud Console). You can send lab members to the Data Analysts guide for instructions on how they can sign up and start working on AnVIL. Lab members can be added to a Workspace with a few different permission levels: Readers can view the Workspace but not make edits or run analyses (i.e. they cannot spend your money) Writers can make edits and run analyses (i.e. they can spend your money) Owners can make edits and run analyses and can also manage the permissions of other users (i.e. they can enable others to spend your money) More details about the permissions associated with each Access Level can be found in the Terra documentation. Managing permissions for a Workspace has important implications: Billing: Terra charges are associated with Workspaces rather than users. Any billable activity that takes place in a given Workspace will be charged to the associated Billing Project, regardless of who conducted the activity. If there are multiple users with permission to compute, it is impossible to tell who conducted the activity. 
Data access: Especially when working with protected data, it’s important to ensure that users have proper authorization to view the data before giving them access to a Workspace containing the data. Terra provides Authorization Domains to assist with this. In general, we recommend: Writers: Lab members who need permission to compute (and charge to your Billing Project). This gives them permission to freely use the Workspace (adding and removing data, conducting analyses, etc.) but prevents them from adding additional members who could charge to your Billing Project. This ensures you have control over who is doing the spending. Readers: All other users (i.e. users who need to see the Workspace but should not charge to your Billing Project). Readers can always “clone” the Workspace (creating a copy of it associated with their own Billing Project) if they want to run computations themselves. If working with protected data, take advantage of Authorization Domains to increase security. To add a member to a Workspace: Launch Terra. In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of the Workspace to open the Workspace. Opening a Workspace does not cost anything. Certain activities in the Workspace (such as running an analysis) will charge to the Workspace’s Billing Project. Workspace management (e.g. adding and removing members, editing the description) does not cost money. Click the teardrop button on the right-hand side to open the Workspace management menu. Click “Share”. Enter the email address of the user or Group you’d like to share the Workspace with. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. Choose their permission level.
Remember that all activity in the Workspace will be charged to the Workspace’s Billing Project, regardless of who conducts it, so only add members as “Writers” or “Owners” if they should be charging to the Workspace’s Billing Project. “Readers” can view all parts of the Workspace but cannot make edits or run analyses (i.e. they cannot spend money). They can also clone their own copy of the Workspace where they can conduct activity on their own Billing Project. Click “Save”. The user should now be able to see the Workspace when logged in to Terra. 2.7.3 Request Quota Increase To prevent abuse, new users of GCP are only permitted to create a few Google Cloud “Projects”. When working on Terra, each Terra Workspace is associated with its own Google Cloud Project, so if your team has multiple members you can bump up against this limit fairly quickly and won’t be able to create more Workspaces. Since this limit is imposed by Google, you will need to contact them directly to request a quota increase, using this form. At the time of writing (April 2022) Terra is working to expedite this process for Terra users; we recommend checking the relevant Terra documentation for the latest information as well as recommendations about how to fill out the form. 2.8 Wrap-Up Congratulations! You have successfully set up AnVIL for your lab! Your lab members should be free to carry out analyses in the Workspaces you created. You should not need to do any further configuration through Terra until you decide to add or change user permissions for your Billing Projects and Workspaces. You can view costs at any time through Google Cloud Billing. Note that costs are reported with a delay (~1 day). To learn more about billing and setup, we recommend checking out this Leanpub course. 
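The billing behavior described in this chapter (charges follow the Workspace, not the user) can be illustrated with a small Python sketch. This is only an illustration, not part of Terra or Google Cloud: the Workspace names, Billing Project names, and amounts are hypothetical, and actual costs should be viewed through Google Cloud Billing. It suggests why one Workspace and Billing Project per lab member makes spending easier to attribute.

```python
from collections import defaultdict

# Hypothetical setup: one Workspace per lab member, each tied to its own Billing Project.
WORKSPACE_TO_BILLING_PROJECT = {
    "alice-workspace": "lab-billing-alice",
    "bob-workspace": "lab-billing-bob",
}


def charges_by_billing_project(activity):
    """Total charges (in cents) per Billing Project.

    `activity` is a list of (workspace, cost_in_cents) pairs. No user appears in a
    record: charges follow the Workspace, regardless of who conducted the activity.
    """
    totals = defaultdict(int)
    for workspace, cost_in_cents in activity:
        totals[WORKSPACE_TO_BILLING_PROJECT[workspace]] += cost_in_cents
    return dict(totals)


# Hypothetical billable activity, in cents.
activity = [
    ("alice-workspace", 125),
    ("bob-workspace", 40),
    ("alice-workspace", 310),
]
totals = charges_by_billing_project(activity)
print(totals)
```

Because each hypothetical Workspace maps to a different Billing Project, the totals double as a per-member breakdown. If several Writers shared one Workspace, the same roll-up could not tell their activity apart.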
"],["data-analysts.html", "Chapter 3 Data Analysts 3.1 Account Setup Overview 3.2 Step 1: Create a Google Account 3.3 Step 2: Set Up Terra 3.4 Step 3: Link External Accounts (optional) 3.5 Wrap-Up", " Chapter 3 Data Analysts This chapter is targeted towards people who are joining an existing team on AnVIL. You will find here: Account Setup Overview – Introductory information and goals for this guide Account Setup Steps – Step-by-step instructions to create your first accounts on AnVIL 3.1 Account Setup Overview 3.1.1 Goals for This Guide 3.1.2 Starting Setup Terra is the compute engine of AnVIL; i.e. where you will run your analyses. Terra currently offers access to Jupyter Notebooks and RStudio for interactive analysis, as well as the Workflow Description Language (WDL) for batch processing of many samples. Behind the scenes, Terra runs on Google Cloud Platform, so you will need a (free) Google Account. In this guide, you will go through the following steps: Create a Google account Launch Terra and sign in with your Google account Link external accounts (e.g., dbGaP) to Terra (optional - enables you to import AnVIL open access datasets and to access protected data if you have appropriate authorization) 3.2 Step 1: Create a Google Account Terra operates on Google Cloud Platform, so you will need a (free) Google account to create a Terra account and run analyses on AnVIL. If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 3.3 Step 2: Set Up Terra Launch Terra, and you should be prompted to sign in with your Google account. Once you have signed in, your Terra account is set up and your PI or manager should be able to add you to projects and/or Workspaces. If this is the first time you or your team has used Terra, the PI or manager will also need to set up billing. 
You can always access Terra by going to anvil.terra.bio, or by clicking the link on the AnVIL home page. 3.4 Step 3: Link External Accounts (optional) AnVIL provides access to a wide selection of datasets, including controlled-access data. Linking your accounts will enable you to import these data into Terra. The following links will take you to the Terra documentation for setting up and linking external accounts. Link Terra and eRA Commons ID - To use controlled-access data on Terra, you will need to link your Terra user ID to your authorization account (such as a dbGaP account). Linking to external servers will allow Terra to automatically determine if you can access controlled datasets hosted in Terra (e.g., TCGA, TOPMed) based on your approved dbGaP applications. 3.5 Wrap-Up Congratulations! You have successfully set up your AnVIL account! Your PI or lab manager should now be able to add you to Workspaces so that you can perform analyses. Please contact your PI or manager to coordinate your user permissions for Terra Projects and Workspaces. To learn more about how to perform analyses on AnVIL, see the Working on AnVIL section of this book. The Workspaces chapter introduces AnVIL Workspaces, the fundamental unit of research organization on AnVIL. All analyses on AnVIL are performed in a Workspace. The Tools, Data, and Workflows chapters explain how to perform a variety of common research tasks on AnVIL. "],["consortia.html", "Chapter 4 Consortia 4.1 Account Setup Overview 4.2 Consortium Data Managers 4.3 Consortium PIs and Lab Managers 4.4 Consortium Data Analysts and Researchers 4.5 Consortium Data Submitters 4.6 Wrap-Up", " Chapter 4 Consortia This chapter is targeted towards people who are part of a consortium and plan to use AnVIL as part of their activities.
You will find here: Account Setup Overview – Goals for this guide and help choosing a role in the consortium Account Setup Steps – Step-by-step instructions to create your first accounts as part of the consortium on AnVIL 4.1 Account Setup Overview 4.1.1 Goals for This Guide 4.1.2 Choosing a Role Consortia are usually made up of scientists taking on different research roles. This guide will be most helpful to you if you target the section most closely related to your role. The sections are summarized by role below: Data Manager The data manager is responsible for controlling consortium members’ access to data. To protect data security, data managers ensure that consortium members only have access to appropriate datasets. If there are multiple datasets, the data manager will manage consortium member access as appropriate to a subset of those datasets. These limited access permissions are organized via distinct Terra Authorization Domains. Principal Investigator / Lab Manager Principal Investigators are responsible for managing billing expenses accrued by members of the consortium. Within the consortium, PIs can add members to Workspaces managed under specific Google Billing Accounts and Billing Projects, providing centralized control over computing and data storage expenses. Principal Investigators will need to work with data managers to ensure their team members have correct data access via Terra Authorization Domains, where necessary. Data Analyst / Researcher A consortium data analyst is someone who performs analyses or writes code but isn’t responsible for managing dataset access or billing expenses. The analyst should be able to identify the PI or data manager point of contact in the consortium.
Analysts will not be able to run analyses or access data without explicit permission from consortium management. This role also describes research leaders who will participate in activities but will not be managing personnel or expenses as part of the consortium. Data Submitter Data submitters are responsible for working with the AnVIL Data Ingestion Team as well as the National Center for Biotechnology Information (NCBI) to make data available on AnVIL. Data submitters will work with the data manager to ensure the data is protected via the appropriate Terra Authorization Domains. 4.1.3 Consortia Responsibilities It is important for consortium members of all roles to have a common understanding of the consortium’s security and data access policies. Consortia leadership are responsible for the following: Drafting Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) Arranging IRB oversight Drafting policy and protocol for data security and management incidents Ensuring all consortium members are aware of all data use limitations and the terms of the consortium agreements Defining what it means to be a consortium member Developing clear timelines and milestones for sharing data with the community rapidly, completely, and in NIH-designated data repositories Creating a timeline for closing out the consortium and thus the consortium-managed, streamlined access for consortium members Communicating the timeline to relevant NHGRI program staff and providing AnVIL with 6 months’ notice of when the consortium’s data access is expected to end 4.2 Consortium Data Managers 4.2.1 Managing Security and Privacy Policies As a data manager, you are responsible for managing access to sensitive data.
You should be involved in managing the following: Drafting Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) Arranging and managing IRB oversight Drafting policy and protocol for reporting data security and management incidents Managing data use limitations and conditions of consortium membership 4.2.2 Account Setup To set up your account on AnVIL, please see the chapter for PIs and Lab Managers. Once the setup is complete, return to this page to continue. Go to PIs and Lab Managers chapter 4.2.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.2.4 Set Up Terra Authorization Domains Terra Authorization Domains help keep sensitive genomic data secure while still allowing easy sharing with collaborators. See the Security section on Terra Authorization Domains for more information on setting up these secure access groups. 4.2.5 Granting Access Data managers should confirm that all other consortium members have two factor authentication set up on their Google accounts prior to granting access to datasets. Granting member access to datasets should be controlled via Terra Authorization Domains. 4.2.6 Revoking Access Revoking access to datasets, either when a member leaves the consortium or the consortium concludes, should be controlled via Terra Authorization Domains. 4.3 Consortium PIs and Lab Managers 4.3.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.3.2 Account Setup To set up your account on AnVIL, please see the chapter for PIs and Lab Managers. Once the setup is complete, return to this page to continue. 
Go to PIs and Lab Managers chapter 4.3.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.4 Consortium Data Analysts and Researchers 4.4.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.4.2 Account Setup To set up your account on AnVIL, please see the chapter for Data Analysts. Once the setup is complete, return to this page to continue. Go to Data Analysts chapter 4.4.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.5 Consortium Data Submitters 4.5.1 Be Aware of Security Policies As part of a consortium on AnVIL, you should be aware of the following: Memorandums of Understanding (MOUs) and Data Use Agreements (DUAs) IRB oversight of the consortium Policy and protocol for reporting data security and management incidents 4.5.2 Account Setup To set up your account on AnVIL, please see the documentation for Data Submitters. Once the setup is complete, return to this page to continue. Go to Data Submitters documentation 4.5.3 Two Factor Authentication Note that you must establish two factor authentication on your Google Account for added security. 4.6 Wrap-Up Congratulations! You have successfully set up your AnVIL account! You should now be able to move forward with your consortium activities, such as adding personnel to appropriate datasets or working in a Workspace. Remember that you will need to coordinate with other members of the consortium to establish appropriate permissions for datasets, Terra Billing Projects, and Workspaces. 
"],["workspaces.html", "Chapter 5 Workspaces 5.1 Access Your Workspaces 5.2 Clone an Existing Workspace 5.3 Create a New Workspace 5.4 Add Members to Workspace", " Chapter 5 Workspaces Workspaces are the building blocks of projects in Terra. Inside a Workspace, you can run analyses, launch interactive tools like RStudio and Galaxy, store data, and share results. To get a Workspace of your own, you can Clone a Workspace: Cloning an existing Workspace allows you to copy existing documentation, code, and/or data into your own experimental space. Create a Workspace: Creating a new Workspace from scratch allows you to fully customize the contents. The video below gives a brief introduction to the parts of a Workspace. 5.1 Access Your Workspaces If you are part of a research team, you may have been added to some existing Workspaces. To find and access your Workspaces, follow the steps below. Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. You are automatically directed to the “My Workspaces” tab. Here you can see any Workspaces that have been shared with you, along with your permission level. Reader means you can open the Workspace and see everything, but can’t do any computations or make any edits. Writer means you can run computations, which will charge costs to the Workspace’s Billing Project. Writers can also make edits to the Workspace. Owner is similar to Writer, but also allows you to control who can access the Workspace. Click on the name of a Workspace to open it. Opening and viewing a Workspace does not cost anything. When you open a Workspace, you are directed to the Workspace Dashboard. This generally has a description of the Workspace contents, as well as some useful details about the Workspace itself. From here you can navigate through the different tabs of the Workspace, and if you have sufficient permission, you can start running analyses. 
If you are only a Reader, you may need to “clone” (make your own copy of) the Workspace before you can start working. 5.2 Clone an Existing Workspace Cloning an existing Workspace allows you to copy existing documentation, code, and/or data into your own experimental space. Cloning creates a new copy of the Workspace that will charge costs to the Billing Project of your choice. Note that you can only clone a Workspace if you are at least a “User” on the Terra Billing Project. This helps prevent unwanted charges. Workspaces charge money to their associated Billing Project, regardless of who conducts the activity, so it’s important to be careful about who has permission to use the Workspace (see Add Members to Workspace for details). If you need to clone a Workspace but don’t have permission to create your own Workspaces, contact your PI or lab manager so that they can either grant you permission or clone the Workspace for you. The following steps show you how to clone a Workspace that has already been developed by other AnVIL users. When cloning, AnVIL makes a copy of notebooks and code for you to modify. Data, however, is linked back to the original Workspace through Data Tables, which saves space! Launch Terra. Locate the Workspace you want to clone. If a Workspace has been shared with you ahead of time, it will appear in “MY WORKSPACES”. You can clone a Workspace that was shared with you to perform your own analyses. In the screenshot below, no Workspaces have been shared. If a Workspace hasn’t been shared with you, navigate to the “FEATURED” or “PUBLIC” Workspace tabs. Use the search box to find the Workspace you want to clone. Click the teardrop button on the far right next to the Workspace you want to clone. Click “Clone”. You can also clone the Workspace from the Workspace Dashboard instead of the search results. You will see a popup box appear. Name your Workspace and select the appropriate Terra Billing Project.
All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). Remember that each Workspace should have its own Billing Project. If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CLONE WORKSPACE”. The new Workspace should now show up under your Workspaces. 5.3 Create a New Workspace Creating a new Workspace from scratch allows you to fully customize the contents. Note that you can only create a new Workspace if you are at least a “User” on a Terra Billing Project. This helps prevent unwanted charges. Workspaces charge money to their associated Billing Project, regardless of who conducts the activity, so it’s important to be careful about who has permission to use the Workspace (see Add Members to Workspace for details). If you need to create a Workspace but don’t have permission, contact your PI or lab manager so that they can either grant you permission or create the Workspace for you. The following steps show you how to create a Workspace so you can get started. Launch Terra In the drop-down menu on the left, navigate to “Workspaces”. 
Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the plus icon near the top left of the page. Name your Workspace and select the appropriate Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CREATE WORKSPACE”. The new Workspace should now show up under your Workspaces. 5.4 Add Members to Workspace Members can be added to a Workspace with a few different permission levels: Readers can view the Workspace but not make edits or run analyses. Writers can make edits and run analyses. Owners can make edits and run analyses and can also manage the permissions of other users. More details about the permissions associated with each Access Level can be found in the Terra documentation. Managing permissions for a Workspace has important implications: Billing: Terra charges are associated with Workspaces rather than users. Any billable activity that takes place in a given Workspace will be charged to the associated Billing Project, regardless of who conducted the activity. If there are multiple users with permission to compute, it is impossible to tell who conducted the activity.
Data access: Especially when working with protected data, it’s important to ensure that users have proper authorization to view the data before giving them access to a Workspace containing the data. Terra provides Authorization Domains to assist with this. Make sure you understand what permissions you are granting before adding someone to your Workspace. To add a member to a Workspace: Launch Terra. In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of the Workspace to open the Workspace. Opening a Workspace does not cost anything. Certain activities in the Workspace (such as running an analysis) will charge to the Workspace’s Billing Project. Workspace management (e.g. adding and removing members, editing the description) does not cost money. Click the teardrop button on the right-hand side to open the Workspace management menu. Click “Share”. Enter the email address of the user or Group you’d like to share the Workspace with. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. Choose their permission level. Remember that all activity in the Workspace will be charged to the Workspace’s Billing Project, regardless of who conducts it, so only add members as “Writers” or “Owners” if they should be charging to the Workspace’s Billing Project. “Readers” can view all parts of the Workspace but cannot make edits or run analyses (i.e. they cannot spend money). They can also clone their own copy of the Workspace where they can conduct activity on their own Billing Project. Click “Save”. The user should now be able to see the Workspace when logged in to Terra.
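The Reader, Writer, and Owner levels described in this chapter can be summarized in a short Python sketch. This is a simplified model for illustration only, not Terra's API: the function names and email addresses are hypothetical, and the Terra documentation remains the authoritative description of what each Access Level allows.

```python
from enum import Enum


class AccessLevel(Enum):
    """Workspace permission levels as described in this chapter."""
    READER = "Reader"
    WRITER = "Writer"
    OWNER = "Owner"


def can_compute(level):
    """Writers and Owners can run analyses, which charge the Workspace's Billing Project."""
    return level in (AccessLevel.WRITER, AccessLevel.OWNER)


def can_manage_access(level):
    """Only Owners can control who can access the Workspace."""
    return level is AccessLevel.OWNER


def members_who_can_spend(members):
    """Return the emails (sorted) of members whose activity would incur charges."""
    return sorted(email for email, level in members.items() if can_compute(level))


# Hypothetical Workspace membership.
members = {
    "pi@example.org": AccessLevel.OWNER,
    "analyst@example.org": AccessLevel.WRITER,
    "collaborator@example.org": AccessLevel.READER,
}
spenders = members_who_can_spend(members)
print(spenders)
```

Listing who can spend before clicking “Save” is one way to check that you understand the permissions you are about to grant.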
"],["jupyter-notebook.html", "Chapter 6 Jupyter Notebook 6.1 Jupyter Notebook: Video tutorial", " Chapter 6 Jupyter Notebook One of the analysis platforms available on AnVIL is Jupyter Notebook. This platform offers accessible analysis reports incorporating multiple languages in the case of Jupyter. This chapter focuses on launching and highlighting a few features for Jupyter Notebook. 6.1 Jupyter Notebook: Video tutorial Here is a video tutorial that describes the basics of using Jupyter Notebook on AnVIL. 6.1.1 Objectives Start compute for your Jupyter environment Create notebook to perform analysis Stop compute to minimize expenses 6.1.2 Slides The slides for this tutorial are are located here. "],["galaxy.html", "Chapter 7 Galaxy 7.1 Galaxy: Video tutorial 7.2 Galaxy: Step-by-step guide", " Chapter 7 Galaxy One of the analysis platforms available on AnVIL is Galaxy. This platform offers a graphical interface for thousands of tools(https://anvilproject.org/learn/interactive-analysis/getting-started-with-galaxy). This chapter focuses on launching and highlighting a few features for Galaxy. 7.1 Galaxy: Video tutorial Here is a video tutorial that describes the basics of using Galaxy on AnVIL. 7.1.1 Objectives Start compute for your Galaxy on AnVIL Run tool to quality control sequencing reads Stop compute to minimize expenses 7.1.2 Slides The slides for this tutorial are are located here. 7.2 Galaxy: Step-by-step guide This step-by-step guide provides written instructions and screenshots for getting started with Galaxy on AnVIL. 7.2.1 Starting Galaxy Note that, in order to use Galaxy, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open your Workspace, and click on the “Environment configuration” button, a cloud icon on the righthand side of the screen. Under Galaxy, click on “Create new Environment”. Click on “Next” and “Create” to keep all settings as-is. This will take 8-10 minutes. 
Click on “Open Galaxy” when the environment is ready. 7.2.2 Navigating Galaxy Notice the three main sections. Tools - These are all of the bioinformatics tool packages available for you to use. The Main Dashboard - This contains flash messages and posts when you first open Galaxy, but when you are working with data, this is the main interface area. History - When you start a project, you will be able to see all of the documents in the project in the history. Be aware that this can become very busy. Also, the naming that Galaxy uses is not very intuitive, so make sure that you label your files with something that makes sense to you. On the welcome page, there are links to tutorials. You may try these out on your own. If you want to try a new analysis, this is a good place to start. 7.2.3 Deleting Galaxy environment Once you are done with your activity, you’ll need to shut down your Galaxy cloud environment. This frees up the cloud resources for others and minimizes computing cost. The following steps will delete your work, so make sure you are completely finished at this point. Otherwise, you will have to repeat your work from the previous steps. Return to AnVIL, and find the Galaxy logo that shows your cloud environment is running. Click on this logo. Next, click on “Settings”. Click on “Delete Environment”. Finally, select “Delete everything, including persistent disk”. Make sure you are done with the activity and then click “Delete”. "],["rstudio.html", "Chapter 8 RStudio 8.1 RStudio: Video tutorial 8.2 RStudio: Step-by-step guide", " Chapter 8 RStudio One of the analysis platforms available on AnVIL is RStudio. This platform offers rich genomics support through Bioconductor. This chapter focuses on launching RStudio and highlighting a few of its features. 8.1 RStudio: Video tutorial Here is a video tutorial that describes the basics of using RStudio on AnVIL.
8.1.1 Objectives Start compute for your RStudio environment Tour RStudio on AnVIL Stop compute to minimize expenses 8.1.2 Slides The slides for this tutorial are located here. 8.2 RStudio: Step-by-step guide This step-by-step guide provides written instructions and screenshots for getting started with RStudio on AnVIL. 8.2.1 Launch RStudio Cloud Environment AnVIL is very versatile and can scale up to use very powerful cloud computers. It’s very important that you select a cloud computing environment appropriate to your needs to avoid runaway costs. If you are uncertain, start with the default settings; it is fairly easy to increase your compute resources later, if needed, but harder to scale down. Note that, in order to use RStudio, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open Terra - use a web browser to go to anvil.terra.bio. In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of your Workspace. You should be routed to a link that looks like: https://anvil.terra.bio/#workspaces/<billing-project>/<workspace-name>. Click on the cloud icon on the far right to access your Cloud Environment options. If you don’t see this icon, you may need to scroll to the right. In the dialogue box, click the “Settings” button under RStudio. You will see some configuration options for the RStudio cloud environment, along with a list of costs, because using cloud computing costs a small amount of money. Configure any settings you need for your cloud environment. If you are uncertain about what you need, the default configuration is a reasonable, cost-conservative choice. It is fairly easy to increase your compute resources later, if needed, but harder to scale down. Scroll down and click the “CREATE” button when you are satisfied with your setup.
The dialogue box will close and you will be returned to your Workspace. You can see the status of your cloud environment by hovering over the RStudio icon. It will take a few minutes for Terra to request computers and install software. When your environment is ready, its status will change to “Running”. Click on the RStudio logo to open a new dialogue box that will let you launch RStudio. Click the launch icon to open RStudio. This is also where you can pause, modify, or delete your environment when needed. You should now see the RStudio interface with information about the version printed to the console. 8.2.2 Tour RStudio Next, we will be using RStudio and the package Glimma to create interactive plots. See this vignette for more information. The Bioconductor team has created a very useful package to programmatically interact with Terra and Google Cloud. Install the AnVIL package. It will make some steps easier as we go along. You can now quickly install precompiled binaries using the AnVIL package’s install() function. We will use it to install the Glimma package and the airway package. The airway package contains a SummarizedExperiment data class. This data describes an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. (Note: for some of the packages, you will have to install packages from the CRAN repository, using the install.packages() function. The examples will show you which install method to use.) <img src="06-tools-rstudio_files/figure-html//1BLTCaogA04bbeSD1tR1Wt-mVceQA6FHXa8FmFzIARrg_g11f12bc99af_0_56.png" alt="Screenshot of the RStudio environment interface. Code has been typed in the console and is highlighted." width="480" /> Load the example data. The multidimensional scaling (MDS) plot is frequently used to explore differences in samples.
When this data is MDS transformed, the first two dimensions explain the greatest variance between samples, and the amount of variance decreases monotonically with increasing dimension. The following code will launch a new window where you can interact with the MDS plot. Change the colour_by setting to “groups” so you can easily distinguish between groups. In this data, the “group” is the treatment. You can download the interactive html file by clicking on “Save As”. You can also download plots and other files created directly in RStudio. To download the following plot, click on “Export” and save in your preferred format to the default directory. This saves the file in your cloud environment. You should see the plot in the “Files” pane. Select this file and click “More” > “Export” Select “Download” to save the file to your local machine. 8.2.3 Pause RStudio You can view costs and make changes to your cloud environments from the panel on the far right of the page. If you don’t see this panel, you may need to scroll to the right. Running environments will have a green dot, and paused environments will have an orange dot. Hovering over the RStudio icon will show you the costs associated with your RStudio environment. Click on the RStudio icon to open the cloud environment settings. Click the Pause button to pause RStudio. This will take a few minutes. When the environment is paused, an orange dot will be displayed next to the RStudio icon. If you hover over the icon, you will see that it is paused, and has a small ongoing cost as long as it is paused. When you’re ready to resume working, you can do so by clicking the RStudio icon and clicking Resume. The right-hand side icon reminds you that you are accruing cloud computing costs. If you don’t see this icon, you may need to scroll to the right. You should minimize charges when you are not performing an analysis. You can do this by clicking on the RStudio icon and selecting “Pause”. 
This will release the CPU and memory resources for other people to use. Note that your work will be saved in the environment and continue to accrue a very small cost. This work will be lost if the cloud environment gets deleted. If there is anything you would like to save permanently, it’s a good idea to copy it from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. You can also pause your cloud environment(s) at https://anvil.terra.bio/#clusters. 8.2.4 Delete RStudio Cloud Environment Pausing your cloud environment only temporarily stops your work. When you are ready to delete the cloud environment, click on the RStudio icon on the right-hand side and select “Settings”. If you don’t see this icon, you may need to scroll to the right. Click on “Delete Environment”. If you are certain that you do not need the data and configuration on your disk, you should select “Delete everything, including persistent disk”. If there is anything you would like to save, open the compute environment and copy the file(s) from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. Select “DELETE”. You can also delete your cloud environment(s) and disk storage at https://anvil.terra.bio/#clusters. 
sessionInfo() ## R version 4.3.2 (2023-10-31) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 22.04.4 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Etc/UTC ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## loaded via a namespace (and not attached): ## [1] sass_0.4.8 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 ## [5] stringi_1.8.3 lattice_0.21-9 hms_1.1.3 digest_0.6.34 ## [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 timechange_0.3.0 ## [13] bookdown_0.41 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8 ## [17] Matrix_1.6-1.1 processx_3.8.3 chromote_0.3.1 ps_1.7.6 ## [21] promises_1.2.1 httr_1.4.7 fansi_1.0.6 ottrpal_1.3.0 ## [25] udpipe_0.8.11 cow_0.0.0.9000 jquerylib_0.1.4 cli_3.6.2 ## [29] rlang_1.1.4 gitcreds_0.1.2 cachem_1.0.8 yaml_2.3.8 ## [33] tools_4.3.2 tzdb_0.4.0 dplyr_1.1.4 curl_5.2.0 ## [37] png_0.1-8 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 ## [41] lubridate_1.9.3 snakecase_0.11.1 stringr_1.5.1 janitor_2.2.0 ## [45] pkgconfig_2.0.3 later_1.3.2 pillar_1.9.0 bslib_0.6.1 ## [49] data.table_1.15.0 glue_1.7.0 Rcpp_1.0.12 highr_0.11 ## [53] xfun_0.48 tibble_3.2.1 tidyselect_1.2.0 knitr_1.48 ## [57] textrank_0.3.1 websocket_1.4.2 htmltools_0.5.7 igraph_2.0.2 ## [61] rmarkdown_2.25 webshot2_0.1.1 readr_2.1.5 compiler_4.3.2 ## [65] askpass_1.2.0 openssl_2.1.1 "],["data.html", "Chapter 9 Data 9.1 Bring Your Own Data 9.2 Analyze Existing Data", " Chapter 9 Data Data is stored by Workspaces in two different locations: the Workspace 
Bucket and the Persistent Disk. The Workspace Bucket is a special Google Cloud Storage bucket that is governed by the built-in AnVIL security policies. This durable and scalable storage location is suitable for both raw data as well as analysis outputs that need to be preserved and/or shared. In contrast, Persistent Disks provide a working directory for Cloud Environments that run Jupyter, RStudio, and Galaxy. Input data can be localized to Persistent Disks for analysis while output data can be transferred to the Workspace Bucket for more reliable long term storage. Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize input for analyses as well as tracking workflow outputs. More details can be found in the Terra documentation. 9.1 Bring Your Own Data 9.1.1 Overview The starting point for bringing your own data to AnVIL is the Workspace Dashboard. At the bottom right, you’ll find the full path to the Google Bucket information corresponding to your Workspace. You can click the clipboard icon on the right to copy the name of your Workspace Bucket. You will be able to see any uploaded files by clicking the “Open in browser” link. You can also see any uploaded files by clicking the “Files” directory at the bottom left in the Data Tab. 9.1.2 Browser: Upload Single Files Click the “Files” directory at the bottom left of the Data Tab. Then click the “+” button in the bottom right corner of the screen. This will prompt a file browser on your local machine. 9.1.3 Browser: Upload Folders Click the “Open in browser” link on the bottom right of the Workspace Dashboard Tab. This will open a new browser window or tab directed to your Workspace’s Google Bucket on the Google Cloud Platform. Here, you can upload files and manage your data and folders. You can also upload an entire folder by clicking on “UPLOAD FOLDER”. 
9.1.4 gsutil: Local to Cloud gsutil is a Python application that lets you access Cloud Storage from the command line in a terminal. The terminal you use can be run on your local machine (local instance) or built into the Workspace Cloud Environment. 9.1.4.1 Install gsutil on Your Local Computer or Local Server Cloud SDK is a set of tools that you can use to manage resources and applications hosted on Google Cloud. These tools include the gsutil command-line tool. Ensure you have a terminal available. MacOS and Linux users have a terminal application available by default. Terminal applications are also available through third party software, such as RStudio. Windows users should download a terminal application, such as Putty. Install Cloud SDK following the appropriate link below: Windows MacOS Linux Test that Cloud SDK has been successfully installed by typing gsutil in the terminal application prompt: gsutil If the installation was successful, you should see information about using gsutil that looks like the following: Usage: gsutil [-D] [-DD] [-h header]... [-i service_account] [-m] [-o] [-q] [-u user_project] [command [opts...] args...] If the installation was not successful, you should see a warning that gsutil was not found. Please return to the installation steps to ensure they have been completed correctly. command not found: gsutil 9.1.4.2 Copy Files From Your Local Computer to a Workspace Bucket The gsutil cp command allows you to copy data from one machine to another. On your local machine’s terminal, you should use the command in the following format: gsutil cp where_to_copy_data_from/filename where_to_copy_data_to Example: To copy the file test.bam located on your local computer at users/name/data/ into the Workspace Bucket gs://ab5-27x on the cloud: gsutil cp users/name/data/test.bam gs://ab5-27x Remember that you can easily copy the Workspace Bucket ID using the clipboard button on the Workspace Dashboard. 
Please see the gsutil cp documentation for more details, such as how to do parallel multi-threaded/multi-processing copying or copying an entire directory tree. The gsutil cp command can also be used to copy files from one Workspace Bucket to another (cloud-to-cloud copying). 9.2 Analyze Existing Data In addition to bringing your own data, you can use existing data on AnVIL. Using the following resources can help you discover data to use in your analyses. 9.2.1 AnVIL Data Library The Datasets Library is a good place to get started and familiarize yourself with existing data. Here, you can find curated datasets from thousands of participants. Some of these are open access (such as the 1000 Genomes dataset) while others will require you to request access. Taking a look at Featured Workspaces can get you started quickly. Remember that when you clone a Workspace, AnVIL automatically cross-links to the original data contained within the Data Tables. 9.2.2 AnVIL Dataset Catalog The AnVIL Dataset Catalog displays key NHGRI datasets accessible in AnVIL, such as the CCDG (Centers for Common Disease Genomics), CMG (Centers for Mendelian Genomics), eMERGE (Electronic Medical Records and Genomics), as well as other relevant datasets. You will need to coordinate access to controlled data. 9.2.3 AnVIL Data Explorer The AnVIL Data Explorer enables faceted searches of open and managed access datasets hosted in AnVIL, making it easier for researchers to find and custom-build cohorts. "],["workflows.html", "Chapter 10 Workflows", " Chapter 10 Workflows Workflows allow you to run whole genomic pipelines in Terra. Workflows are written in WDL (Workflow Description Language), which is a human-readable and writable language originally developed for genomic analysis pipelines. Terra on AnVIL is specifically designed to integrate a WDL workflow’s input and output information directly into the platform. This integration allows you to easily configure a workflow from your Workspace. 
Established workflows help make analyses more reproducible. They also make it easier to configure, launch, and monitor analyses across many samples. You can access workflows from a Workspace by clicking on the Workflows Tab. To run a workflow you can either use a pre-configured workflow, or create your own workflow. 10.0.1 Pre-configured Workflow You will need to clone a Workspace that contains the workflow you’d like to run. Refer to Terra’s quickstart guide for running a pre-configured workflow. You can browse code and workflows in the AnVIL library here. 10.0.2 Create a Custom Workflow Custom workflows are written in WDL. Refer to Terra’s “Hello, learn-wdl!” tutorial to learn the basics before fully customizing your analysis pipeline. "],["faqs.html", "A FAQs", " A FAQs A.0.1 What is AnVIL? Watch our 4 minute video Read more at https://anvilproject.org/overview See more Overview Videos A.0.2 How Do I Get Started? Identify Billing Account(s) e.g. Google $300 Free Trial, AC2 Make an account (see PIs and Lab Managers example) A.0.3 What Data Is Available? Browse the Data Dashboard Request access if necessary for Controlled Access Data or Consortia Access Data Bring Your Own Data A.0.4 How Do I Perform an Analysis? Launch interactive tools like Jupyter, RStudio, Galaxy “Terminal” available through Jupyter and RStudio Migrate your HPC analysis to AnVIL using Docker/WDLs Introducing the Learn WDL Course (10 min video) Basic WDL covered in Webinar 2 from BDC Spring 21 Become a WDL super hero with https://github.com/openwdl/learn-wdl A.0.5 Where Can I Get Help? General questions in https://help.anvilproject.org Additional options at https://anvilproject.org/help "],["troubleshooting.html", "B Troubleshooting", " B Troubleshooting B.0.1 Create Terra Billing Project The error messages generated during billing project creation can be hard to understand (see this example). If you get an error message, double-check that your project name follows the rules.
If your chosen name follows the rules but you are getting an error then the name may be already taken - try adding something (e.g. your lab or team name, date, etc.) to make it unique. "],["authorization-domains.html", "C Authorization Domains", " C Authorization Domains Terra Authorization Domains help keep sensitive genomic data secure while still allowing easy sharing with collaborators. Authorization Domains work like a badge. This badge can be associated with a Workspace that allows access only to people with the same badge. When you clone a Workspace that has an Authorization Domain, the new copy will also have the badge: anyone who wants to access the cloned Workspace has to have the badge. You don’t have to worry about accidentally sharing sensitive data because if you try to share the cloned workspace with a user who doesn’t have the right badge, that researcher won’t be able to enter. An Authorization Domain is a Managed Group with strictly defined and enforced Workspace permissions where they: Restrict Workspace access to only the individuals in the group Are assigned to Workspaces when they are created Follow all workspace copies Learn how to create an Authorization Domain here. "],["budget-templates.html", "D Budget Templates", " D Budget Templates If you want to apply for a grant and you plan to use the AnVIL platform for data storage, data movement, and data analysis, you can include the anticipated costs in your proposal. We have created a template for the budget justification paragraph of your grant proposal. The documents described below can help you estimate and justify these costs. D.0.1 Types of Costs There are three types of costs that typically occur when performing operations on the Google Cloud Platform. 1. Cost for Computing is driven by your particular CPU and memory requirements. Importantly, you can save money if your work can tolerate being interrupted (also known as a preemptible compute resource).
In this case, you pay less per hour with the understanding that your work may be interrupted by a customer willing to pay more. Details and current pricing can be found here. 2. Cost for Storage is driven by the amount of data and the length of time you wish to store the data. Here, you can save money if you have data that you do not plan to access frequently. This would be the case for raw data that has already been processed, backups, and archives. Details and current pricing can be found here. 3. Cost for Network Usage (egress) applies to data being transferred out of a Cloud resource. In this context, a Cloud resource refers to a set of computers in a particular region. This would apply, for example, if you transferred data from Google’s East Coast computers to Amazon’s West Coast computers. In general, while it’s free to upload data to the Cloud, you will incur costs when downloading data to your local computer or between Cloud regions. Details and current pricing can be found here. D.0.2 Usage of Budget Templates In a first step, you can use the template Google Sheet AnVIL_Cost_Estimator to calculate costs for computing, storage, and network usage (egress) for your proposal. In a second step, you can use the template Google Doc AnVIL_Budget_Justification to create a budget justification paragraph for your proposal by including the information highlighted in pink (mostly copying entries from your Google Sheet AnVIL_Cost_Estimator). Please download and adapt both documents to your project. Please check that the prices are up to date by using the links listed below or in the AnVIL_Cost_Estimator. For further guidance, you can have a look at a completed document AnVIL_Budget_Justification_Example. "],["irb-templates.html", "E IRB Templates", " E IRB Templates If you plan to use the AnVIL platform for data storage, data movement, and data analysis, you will likely need to provide information about AnVIL in your IRB application. 
While all IRBs will have different questions and requirements, there are general areas of concern that are common across IRB applications. The following sections provide general information about AnVIL as well as examples of language addressing some of these concerns. E.0.1 General Information When writing an IRB research plan you will be required to fill out a few fields about data collection and data storage. The following sections provide the information that you will need for these IRB sections. There is also language regarding consent, with some suggested text that you may include in the IRB research plan and your consent form. If you are using AnVIL, you should know that the storage is on Google Cloud Platform, so it is encrypted, it is stored on a secure server, and it is backed up through multi-site replication. Important things to know about AnVIL security are that there are four key design concepts: Authenticate: All components require authentication at every step, not just the perimeter Authorize: All data requires explicit authorization to access Audit: All data access is logged (to a different system), with alerts for anomalous events Encrypt: All data-in-transit and all data-at-rest is encrypted There is language below about data storage, which you will need for your research plan. There is also a section about privacy and confidentiality, which you will also need. E.0.2 Description of AnVIL The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space, or AnVIL provides a collaborative environment for creating and sharing data and analysis workflows for both users with limited computational expertise and sophisticated data scientist users.
The platform is built on a set of established components that have been used in a number of flagship scientific projects: The Terra platform: provides a compute environment with secure data and analysis sharing capabilities Dockstore: provides standards based sharing of containerized tools and workflows The AnVIL Data Explorer: enables faceted searches of open and managed access datasets hosted in AnVIL Bioconductor and Galaxy: provide environments for users at different skill levels to construct and execute analyses E.0.3 Data Storage We will keep a copy of the participant’s genomics data in various data formats (formats - CRAMs, BAMs, vcfs, BCLs, fastqs). This will be stored in Broad’s Google Cloud services, which is the underlying storage platform for AnVIL. The Google Cloud Platform has summarized its services with respect to genomics data processing in a white paper here: https://cloud.google.com/files/genomics-data-wp.pdf E.0.4 Privacy and Confidentiality There is a small risk of loss of privacy and confidentiality for the participant and their personal health information. While it is unlikely, there is a possibility that genetic information could be seen by unauthorized individuals. Researchers who will have access to genetic information about participants will take measures to maintain the privacy of the participant and confidentiality of their information. E.0.5 Security Terra (powered by Broad Genomics’ Workbench) is operated by the Broad Institute at the FISMA (Federal Information Security Management Act) “moderate” level and received Authority to Operate from NCI and NIH in May of 2016. FISMA is a practice of documentation, audit, and organizational risk acceptance. It is centered on the controls outlined in NIST (National Institute of Standards and Technology) Special Publications 800-30 and 800-53.
Covered topics include: Network penetration testing and assessment by a Federally authorized outside firm Maintaining system logs separate from the primary system for forensic analysis Regular review of logs and changes by an in-house auditor Security training and background screening for staff with elevated access to the system Documented procedures to respond to security incidents The Terra portal and its underlying platform, Broad Genomics’ Workbench, are hosted on Google’s Cloud Services. See below for details. Since Terra requires that users utilize Google logins, the application operates on top of Google’s world-class security that protects from nation-state level attacks. Terra supports the use of Google’s 2 Factor authentication as well. As a FISMA Moderate system, all logs are audited continually and various levels of security layering are required. These include Web Application Firewalls, weekly scanning, code scanning (dynamic and static), dependency scanning and manual penetration testing. Data analysis is constrained to computing nodes that are sandboxed using Docker within Google’s Pipelines API. Google undergoes several independent third party audits on a regular basis to provide verification of security, privacy and compliance controls including annual audits for SSAE 16/ISAE 3402 Type II. Google’s infrastructure provides reliable information security that can meet or exceed the requirements of HIPAA and protected health information. Broad Information Security & Compliance Team provides risk review and consulting security services for legacy, current, and future computing infrastructure, applications, and information services across the enterprise. Risk management and mitigation services are also provided to assist business units with meeting regulatory requirements. 
The team also takes a lead role in incident response processes, including network, system, and data forensic efforts and is responsible for developing and ensuring proper review of all security policies, standards, and procedures. The Senior Director of Information Security & Compliance (DISC) facilitates the development and maintenance of the information security program, leads the development of information security policies, standards and procedures, and collaborates with business and system owners in understanding their security responsibilities. The DISC also promotes consistency in the development and implementation of strategic initiatives and policies based on industry and regulatory standards (FISMA, etc). The Broad Institute has a strong commitment to information security, appropriate use of data, compliance with applicable regulations, and respect for data privacy. E.0.6 Example Language for Informed Consent Below we provide some possible language that could be used for consent forms. E.0.6.1 Types of Research We may use your sample to do other biological and genetic research. Genetic research may include looking at some or all of your genes and DNA to see if there are links to different types of health conditions. We may use your sample and your DNA sequence information to help create new analytical and machine learning methods to look at genomic and clinical data. We may share your samples, your DNA sequence information, your health information, and results from our research with other data banks, such as those sponsored by the National Institutes of Health, so that researchers from around the world can use them to study many conditions. A code will be assigned to your samples and health information. Your name, medical record number, or other information that easily identifies you will not be stored with your samples or health information.
The key to the code that connects your name to your samples and information will be stored securely in a controlled access database. This method is discussed more below. E.0.6.2 Data Repository Submission Your individual genomic data and health information may be put in a controlled-access data commons. This means that only researchers who apply for and get permission to use the information for a specific research project will be able to access the information. If this happens, your genomic data and health information will not be labeled with your name or other information that could be used to identify you. Your data and information will be entered in the database with the random study identifier created by the research team. Researchers approved to access information in the database will agree not to attempt to identify you. E.0.6.3 Potential Risks Every research study involves the risk of loss of privacy. The personal information that will be associated with your WGS cannot typically be used to identify you. However, it is possible that in the future there may be a way to link stored genomic information back to you. We believe that the possibility of this is low. Your information will be stored on a cloud-computing storage system administered by The Broad Institute. E.0.7 Additional Resources Workspace access controls https://support.terra.bio/hc/en-us/articles/360025851892 Workspace owners (e.g. 
PI, Lab Manager, Data Coordination Center) can grant read / write / compute access, and can revoke access Authorization Domains See our introduction here https://support.terra.bio/hc/en-us/articles/360026775691 Must be set up when a Workspace is created - prevents accidental sharing "],["overview-videos.html", "F Overview Videos", " F Overview Videos AnVIL – Streamlining Accessibility and Computability of Large-scale Genomic Datasets – 4m58s 9/30/20 Workspaces – Introduction to using Workspaces in Terra – 9m52s 4/5/20 Jupyter – Interactive Analysis using a Jupyter Notebook in Terra – 9m40s 5/7/20 RStudio – A sneak preview of RStudio in Terra – 6m57s 1/22/21 Galaxy – Galaxy on AnVIL Walkthrough – 5m50s 12/9/20 Data Tables – Introduction to Data Tables in Terra – 5m25s 4/5/20 Dockstore – Importing a GATK workflow from Dockstore into Terra – 8m29s 10/15/20 "],["give-us-feedback.html", "G Give us Feedback", " G Give us Feedback Thank you for your interest in this book! There are a few ways you can suggest improvements: Fill out this Google form. If you have a GitHub account, you can raise an issue in our repository. Submit a pull request! Click the pencil icon on any page (top left) to view the source .Rmd for the page and suggest changes. "],["about.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     In memory of James Taylor, who was instrumental in initiating this project.   
Credits Names Pedagogy Lead Content Instructors Katherine Cox, Ava Hoffman Content Contributors Kai Kammers, Frederick Tan, Sarah Wheelan Content Editors/Reviewers Natalie Kucher, Jeff Leek, Valerie Reeves, Frederick Tan Content Directors Jeff Leek, Frederick Tan Content Consultants Allie Cliffe, Candace Patterson Production Content Publisher Ira Gooding Technical Template Publishing Engineers Candace Savonen, Carrie Wright Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Candace Savonen Package Developers (ottrpal) John Muschelli, Candace Savonen, Carrie Wright Funding Funder National Human Genome Research Institute (NHGRI) #5U24HG010263 Funding Staff Fallon Bachman, Jennifer Vessio, Emily Voeglein   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-10-28 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.41 2024-10-16 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.48 
2024-07-07 [1] CRAN (R 4.3.2) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.25 2023-09-18 [1] RSPM (R 4.3.0) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.48 2024-10-03 [1] CRAN (R 4.3.2) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] diff --git a/docs/troubleshooting.html b/docs/troubleshooting.html index e3711aca..df5aa6ac 100644 --- a/docs/troubleshooting.html +++ b/docs/troubleshooting.html @@ -6,7 +6,7 @@ B Troubleshooting | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@

  • 10 Workflows diff --git a/docs/workflows.html b/docs/workflows.html index 2f582b6b..cc55b1f8 100644 --- a/docs/workflows.html +++ b/docs/workflows.html @@ -6,7 +6,7 @@ Chapter 10 Workflows | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows diff --git a/docs/workspaces.html b/docs/workspaces.html index 73981b48..70f0c55f 100644 --- a/docs/workspaces.html +++ b/docs/workspaces.html @@ -6,7 +6,7 @@ Chapter 5 Workspaces | Getting Started on AnVIL - + @@ -22,7 +22,7 @@ - + @@ -295,7 +295,7 @@
  • 10 Workflows