From e2dfe7712eacd4a06c580add6b67a4c1ddcec7ff Mon Sep 17 00:00:00 2001
From: Ekaterina Sakharova <49755902+KateSakharova@users.noreply.github.com>
Date: Thu, 11 Apr 2024 12:17:28 +0100
Subject: [PATCH] Fix/update atlanteco (#47)

* wip
* added IPS check
* add BGC
* fixes after review
* Clear outputs
---
 dependencies/py-environment.yml             |   1 +
 .../Atlanteco Interactive Sample Map.ipynb  | 308 +++++++++++++++---
 2 files changed, 269 insertions(+), 40 deletions(-)

diff --git a/dependencies/py-environment.yml b/dependencies/py-environment.yml
index 26a91fe..3f57f3a 100644
--- a/dependencies/py-environment.yml
+++ b/dependencies/py-environment.yml
@@ -19,3 +19,4 @@ dependencies:
   - pip
   - pip:
     - sourmash==4.1.2
+    - gffutils==0.12
diff --git a/src/notebooks/Python Examples/Atlanteco Interactive Sample Map.ipynb b/src/notebooks/Python Examples/Atlanteco Interactive Sample Map.ipynb
index e17a219..5304359 100644
--- a/src/notebooks/Python Examples/Atlanteco Interactive Sample Map.ipynb
+++ b/src/notebooks/Python Examples/Atlanteco Interactive Sample Map.ipynb
@@ -4,9 +4,6 @@
    "cell_type": "raw",
    "id": "003ddb34-a276-4b23-8888-9f514272b7b8",
    "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
     "tags": []
    },
    "source": [
@@ -15,8 +12,7 @@
    "author: \"Kate S [Ekaterina Sakharova] (MGnify team)\"\n",
    "categories: [Python]\n",
    "execute: \n",
-    "  enabled: false\n",
-    "  eval: false\n",
+    "  eval: true\n",
    "---"
   ]
  },
@@ -40,13 +36,22 @@
    "\n",
    "The [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) returns JSON data. The `jsonapi_client` package can help you load this data into Python, e.g. into a Pandas dataframe.\n",
    "\n",
-   "**This example shows you how to load a MGnify Super Study's data from the MGnify API and display it on an interactive world map**\n",
+   "**This example shows you how to load a MGnify AtlantECO Super Study's data from the MGnify API and display it on an interactive world map.**\n",
    "\n",
    "You can find all of the other \"API endpoints\" using the [Browsable API interface in your web browser](https://www.ebi.ac.uk/metagenomics/api/v1). The URL you see in the browsable API is exactly the same as the one you can use in this code.\n",
    "\n",
    "This is an interactive code notebook (a Jupyter Notebook).\n",
    "To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.\n",
    "\n",
+   "**Contents:**\n",
+   "- Fetch the AtlantECO studies data\n",
+   "- Show the study's samples on an interactive world map\n",
+   "- Check the presence/absence of functional annotation terms:\n",
+   "    - GO term\n",
+   "    - InterPro entry\n",
+   "    - Biosynthetic Gene Clusters (BGC)\n",
+   "\n",
    "---"
   ]
  },
@@ -66,6 +71,7 @@
   "execution_count": null,
   "id": "78657238-8961-4fe9-ac64-02dda587c8b8",
   "metadata": {
+   "is_executing": true,
    "tags": []
   },
   "outputs": [],
@@ -90,7 +96,7 @@
    "We can fetch the Samples for each Study, and concatenate them all into one Dataframe.\n",
    "Each sample has geolocation data in its `attributes` - this is what we need to build a map.\n",
    "\n",
-   "It takes time to fetch data for all samples, so **let's show samples from the first 6 studies only.** "
+   "It takes time to fetch data for all samples, so **let's show samples from the chosen study PRJEB46727 only.** This study contains assembly data: https://www.ebi.ac.uk/metagenomics/studies/MGYS00005810#overview."
   ]
  },
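+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, the same URL you see in the browsable API can be fetched directly - a minimal sketch using plain `requests` (it assumes only that MGYS00005810, the study used below, is reachable and that the JSON:API record exposes `data.attributes`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "\n",
+    "# Fetch one study record straight from its browsable-API URL\n",
+    "resp = requests.get('https://www.ebi.ac.uk/metagenomics/api/v1/studies/MGYS00005810')\n",
+    "resp.raise_for_status()\n",
+    "print(resp.json()['data']['attributes']['study-name'])"
+   ]
+  },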
  {
@@ -102,10 +108,11 @@
   },
   "outputs": [],
   "source": [
+    "substudy = studies[studies['attributes.bioproject'] == 'PRJEB46727']\n",
     "studies_samples = []\n",
     "\n",
     "with Session(\"https://www.ebi.ac.uk/metagenomics/api/v1\") as mgnify:\n",
-    "    for idx, study in studies[:6].iterrows():\n",
+    "    for idx, study in substudy.iterrows():\n",
     "        print(f\"fetching {study.id} samples\")\n",
     "        samples = map(lambda r: r.json, mgnify.iterate(f'studies/{study.id}/samples?page_size=1000'))\n",
     "        samples = pd.json_normalize(samples)\n",
@@ -165,8 +172,8 @@
    "tags": []
   },
   "source": [
-    "## Check GO term presence\n",
-    "Let's check whether a specific identifier is present in each sample. This example is written for GO-term 'GO:0015878', but other identifier types are available on the MGnify API.\n",
+    "## Check presence of functional annotation terms\n",
+    "Let's check whether a specific identifier is present in each sample.\n",
     "\n",
     "We will work with the MGnify analyses (`MGYA`s) corresponding to the chosen samples. We filter the analyses by:\n",
     "- pipeline version: 5.0\n",
@@ -202,13 +209,66 @@
   "id": "0b4ad3aa-f19f-4007-b64f-59edf10b5734",
   "metadata": {},
   "source": [
-    "Next, check each analysis for GO term presence/absence. We add a column to the dataframe with a colour: blue if GO term was found and red if not."
+    "Define two functions:\n",
+    "- `identify_existence` checks each analysis for the presence/absence of an identifier, and adds a colour column to the dataframe: blue if the identifier was found, red if not.\n",
+    "- `show_on_map` plots the results on the world map. It joins the analyses and samples tables so that geolocation and identifier-presence data sit together, creates a sub-DataFrame with a subset of the fields, and adds it to the map."
   ]
  },
 {
  "cell_type": "code",
  "execution_count": null,
  "id": "7454b97b-2f9d-45b9-bc96-f1877cf3b58a",
  "metadata": {
   "tags": []
  },
  "outputs": [],
  "source": [
+    "def identify_existence(input_analyses, identifier, term):\n",
+    "    # One colour per analysis: blue = term found, red = term not found\n",
+    "    data = []\n",
+    "    with Session(\"https://www.ebi.ac.uk/metagenomics/api/v1\") as mgnify:\n",
+    "        for idx, mgya in input_analyses.iterrows():\n",
+    "            print(f\"processing {mgya.id}\")\n",
+    "            analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))\n",
+    "            analysis_identifier = pd.json_normalize(analysis_identifier)\n",
+    "            data.append(\"#0000FF\" if term in list(analysis_identifier.id) else \"#FF0000\")\n",
+    "    present = sum([1 for i in data if i == \"#0000FF\"])\n",
+    "    print(f\"{identifier} {term} is present in {present} of {len(data)} analyses\")\n",
+    "    input_analyses.insert(2, identifier, data, True)\n",
+    "    return input_analyses\n",
+    "\n",
+    "def show_on_map(input_analyses, studies_samples, identifier):\n",
+    "    # Join geolocation data onto the analyses and keep only the fields we plot\n",
+    "    df = input_analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')\n",
+    "    df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession', 'relationships.study.data.id', 'relationships.sample.data.id', 'relationships.assembly.data.id']].copy()\n",
+    "    df2 = df2.set_index(\"study\")\n",
+    "    df2 = df2.rename(columns={\"attributes.accession\": \"analysis_ID\", \n",
+    "                              'relationships.study.data.id': \"study_ID\",\n",
+    "                              'relationships.sample.data.id': \"sample_ID\", \n",
+    "                              'relationships.assembly.data.id': \"assembly_ID\"\n",
+    "                             })\n",
+    "    m = leafmap.Map(center=(0, 0), zoom=2)\n",
+    "    m.add_points_from_xy(df2, \n",
+    "                         x='lon', \n",
+    "                         y='lat', \n",
+    "                         popup=[\"study_ID\", \"sample_ID\", \"assembly_ID\", \"analysis_ID\", identifier],\n",
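+    "                         # color_column tells leafmap which column to colour the\n",
+    "                         # markers by - here the blue/red values built by\n",
+    "                         # identify_existence; add_legend=False skips the auto legend\n",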
+    "                         color_column=identifier, add_legend=False)\n",
+    "    return m"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "720a6d00-7976-4881-aed1-cd1fc756fb82",
   "metadata": {},
   "source": [
+    "## GO term\n",
+    "This example uses the GO term for **biotin transport**, [GO:0015878](http://www.candidagenome.org/cgi-bin/GO/go.pl?goid=15878).\n",
+    "\n",
+    "Other GO identifiers are available on the MGnify API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "c6980125-7fd7-4fdb-9f15-c1b9c93a21c8",
+   "id": "20035a8f-7957-4d84-9099-0f930287fd69",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "identifier = \"go-terms\"\n",
    "go_term = 'GO:0015878'\n",
-    "go_data = []\n",
-    "with Session(\"https://www.ebi.ac.uk/metagenomics/api/v1\") as mgnify:\n",
-    "    for idx, mgya in analyses.iterrows():\n",
-    "        print(f\"processing {mgya.id}\")\n",
-    "        analysis_identifier = map(lambda r: r.json, mgnify.iterate(f'analyses/{mgya.id}/{identifier}'))\n",
-    "        analysis_identifier = pd.json_normalize(analysis_identifier)\n",
-    "        go_data.append(\"#0000FF\" if go_term in list(analysis_identifier.id) else \"#FF0000\")\n",
-    "analyses.insert(2, identifier, go_data, True)"
+    "go_analyses = analyses\n",
+    "go_data = identify_existence(go_analyses, identifier, go_term)\n",
+    "map_vis = show_on_map(go_data, studies_samples, identifier)\n",
+    "map_vis"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "8f4242af-617a-4ad9-b265-f7b629da9b52",
+   "id": "923ad300-d764-499c-bd76-c3bd89e78d36",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## InterPro entry\n",
+    "This example uses the InterPro entry [IPR001650](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001650): **Helicase, C-terminal domain-like**.\n",
+    "\n",
+    "Other InterPro identifiers are available on the MGnify API."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ec7bbb7-0538-4d3d-b58e-efc10db32549",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "identifier = \"interpro-identifiers\"\n",
+    "ips_term = 'IPR001650'\n",
+    "ips_analyses = analyses\n",
+    "ips_data = identify_existence(ips_analyses, identifier, ips_term)\n",
+    "map_vis = show_on_map(ips_data, studies_samples, identifier)\n",
+    "map_vis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75614961-8aab-4aa3-870e-92c745e4332d",
   "metadata": {},
   "source": [
-    "Join the analyses and sample tables to have geolocation data and identifier presence data together.\n",
-    "\n",
-    "We'll create a new sub-DataFrame with a subset of the fields and add them to the map."
+    "## BGC (Biosynthetic Gene Clusters)\n",
+    "\n",
+    "MGnify provides additional analysis of [BGCs](https://mibig.secondarymetabolites.org/), generated by [SanntiS](https://github.com/Finn-Lab/SanntiS). These annotations are saved as [RO-Crate](https://www.researchobject.org/ro-crate/) objects and linked to assembly records.\n",
+    "\n",
+    "The following example counts the proteins that are **truncated at the beginning** (GFF `partial` attribute \"10\"), whose nearest MiBIG class is **Polyketide**, and whose Dice distance is at least **0.65**. 
We will use [gffutils](https://daler.github.io/gffutils/index.html) to parse the GFF file.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03c226f3-df3e-4cc8-8834-80d1ca602763",
   "metadata": {},
   "source": [
    "Define a function to find the GFF file in a zipped archive by URL:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c228858d-4b5d-47a2-895a-32a55bd10173",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "from zipfile import ZipFile\n",
    "from io import BytesIO\n",
    "\n",
    "def find_gff_file(link):\n",
    "    # Download the archive and return the content of the first GFF file in it\n",
    "    response = requests.get(link)\n",
    "\n",
    "    # Check if the request was successful (status code 200)\n",
    "    if response.status_code == 200:\n",
    "        # Open the zip file from the content of the response\n",
    "        with ZipFile(BytesIO(response.content)) as zip_file:\n",
    "            # List all files in the zip archive\n",
    "            file_list = zip_file.namelist()\n",
    "\n",
    "            # Filter files with .gff extension\n",
    "            gff_files = [file_name for file_name in file_list if file_name.endswith(\".gff\")]\n",
    "            print(f\"Found {gff_files}\")\n",
    "            # Read the first .gff file (you can modify this to read a specific file)\n",
    "            if gff_files:\n",
    "                first_gff_file = gff_files[0]\n",
    "                gff_content = zip_file.open(first_gff_file).read()\n",
    "                return gff_content\n",
    "            else:\n",
    "                print(\"No .gff files found in the zip archive.\")\n",
    "                return None\n",
    "    else:\n",
    "        print(\"Failed to fetch the zip file.\")\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "352dbabf-92da-470c-9b3a-cf46ee5e168f",
   "metadata": {},
   "source": [
    "Define a function to get counts from the GFF file."
   ]
  },
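+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before defining it, here is a minimal, self-contained illustration of how `gffutils` exposes GFF attributes. The inline record and its attribute values below are made up, mirroring the SanntiS fields used in the real files:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gffutils\n",
+    "\n",
+    "# Illustrative GFF3 record with SanntiS-style attributes (made-up values)\n",
+    "gff_text = (\n",
+    "    \"##gff-version 3\\n\"\n",
+    "    \"contig1\\tSanntiS\\tCDS\\t1\\t300\\t.\\t+\\t.\\t\"\n",
+    "    \"ID=gene1;nearest_MiBIG_class=Polyketide;nearest_MiBIG_diceDistance=0.7;partial=10\\n\"\n",
+    ")\n",
+    "\n",
+    "# from_string=True parses the text directly, without a file on disk\n",
+    "db = gffutils.create_db(gff_text, dbfn=':memory:', from_string=True, force=True, keep_order=True)\n",
+    "for feature in db.all_features():\n",
+    "    # attribute values are exposed as lists of strings\n",
+    "    print(feature['nearest_MiBIG_class'][0], feature['partial'][0])"
+   ]
+  },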
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bb16116-bbcf-4e74-a3ff-e7f328963ffc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import gffutils\n", + "\n", + "def get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content):\n", + " if gff_content:\n", + " try:\n", + " with tempfile.NamedTemporaryFile(delete=False) as temp_gff_file:\n", + " temp_gff_file.write(gff_content)\n", + " temp_gff_file_path = temp_gff_file.name\n", + "\n", + " # Create a GFF database using gffutils\n", + " db = gffutils.create_db(\n", + " temp_gff_file_path,\n", + " dbfn=':memory:', # Use an in-memory database\n", + " force=True, # Overwrite if the database already exists\n", + " keep_order=True, # Preserve feature order \n", + " )\n", + "\n", + " count = 0\n", + " for feature in db.all_features():\n", + " if feature[\"nearest_MiBIG_class\"][0] == nearest_MiBIG_class and \\\n", + " float(feature[\"nearest_MiBIG_diceDistance\"][0]) >= nearest_MiBIG_diceDistance and \\\n", + " feature[\"partial\"][0] == partial_value:\n", + " count += 1\n", + " print(f\"Count is {count}\")\n", + " return count\n", + " except:\n", + " print('Error in GFF DB')\n", + " return 0\n", + " else:\n", + " return 0" + ] + }, + { + "cell_type": "markdown", + "id": "47210f0e-c9b1-463f-a4e2-be97eeb41352", + "metadata": {}, + "source": [ + "Process data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7caa665a-036e-42a0-9d25-61d2e0e41239", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "nearest_MiBIG_class = \"Polyketide\" \n", + "nearest_MiBIG_diceDistance = 0.65\n", + "partial_value = \"10\"\n", + "\n", + "counts = []\n", + "\n", + "with Session(\"https://www.ebi.ac.uk/metagenomics/api/v1\") as mgnify:\n", + " for idx, mgya in analyses.iterrows():\n", + " # get ERZ assembly accession\n", + " assembly = mgya['relationships.assembly.data.id']\n", + " annotations_for_assembly = mgnify.iterate(f'assemblies/{assembly}/extra-annotations')\n", + " sanntis_annotation = None\n", + " for item in annotations_for_assembly:\n", + " if 'sanntis' in item.id:\n", + " sanntis_annotation = item.links.self\n", + " break\n", + " if not sanntis_annotation:\n", + " print('Sanntis annotation was not found')\n", + " continue\n", + " \n", + " print(f\"processing {mgya.id} {assembly}\")\n", + " \n", + " gff_content = find_gff_file(sanntis_annotation)\n", + " counts.append(get_count(nearest_MiBIG_class, nearest_MiBIG_diceDistance, partial_value, gff_content))" + ] + }, + { + "cell_type": "markdown", + "id": "2806a194-b1b9-42af-8124-c00ce66fa66b", + "metadata": { + "tags": [] + }, + "source": [ + "Display on the interactive map" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ba6ed0d-8169-4ddd-a9f9-3255350b2654", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "identifier = \"bgc\"\n", + "analyses.insert(2, identifier, counts, True)\n", + "map_vis = show_on_map(analyses, studies_samples, identifier)\n", + "map_vis" ] }, { "cell_type": "code", "execution_count": null, - "id": "3985abb6-73a6-418f-8657-5fb383022d90", + "id": "7ffb9081-73ba-4adf-9279-1d32f9e6c6fe", "metadata": {}, "outputs": [], "source": []