Fixes and cleanup
GullyBurns committed Oct 20, 2023
1 parent cb8fef0 commit 6b2ac53
Showing 7 changed files with 57 additions and 64 deletions.
20 changes: 9 additions & 11 deletions README.md
@@ -13,7 +13,7 @@ with 32+GB of memory - no support for Windows or Linux yet).

``` bash
git clone https://github.com/chanzuckerberg/alzhazen
conda create -n alhazen python=3.8
conda create -n alhazen python=3.11
conda activate alhazen
cd alhazen
pip install -e .
@@ -62,14 +62,14 @@ For example to run the chatbot for the single paper QA task, execute the
following command:

``` bash
python -m fire alhazen/tools/<tool_name>.py chatbot
python -m fire alhazen.apps <tool_name> <tool_args>
```

for example, run the following command to chat with the single paper QA
chatbot:

``` bash
python -m fire alhazen/tools/single_paper_qa.py chatbot
python -m fire alhazen.apps single_paper_chatbot '/path/to/pdf/or/nxml/files/' 'mistral-7b-instruct'
```
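
To make the `<tool_name> <tool_args>` pattern concrete, here is a minimal sketch of how positional command-line arguments line up with a function's parameters. It uses only the standard library's `inspect` module rather than `fire` itself, and the function below is a hypothetical stand-in: its parameter list is assumed from the parts of the `single_paper_chatbot` signature visible in this commit and may not match the real one.

```python
import inspect


def single_paper_chatbot(doc_dir, llm_name,
                         chunk_size=1000,
                         chunk_overlap=100,
                         section_regex=None):
    """Hypothetical stand-in for alhazen.apps.single_paper_chatbot.

    The parameter names and defaults are assumptions taken from the
    fragments of the signature shown in the diff.
    """
    return {"doc_dir": doc_dir, "llm_name": llm_name}


# The two positional arguments from the README command line.
cli_args = ["/path/to/pdf/or/nxml/files/", "mistral-7b-instruct"]

# Bind the arguments to parameters in order, then fill in the defaults.
bound = inspect.signature(single_paper_chatbot).bind(*cli_args)
bound.apply_defaults()

for name, value in bound.arguments.items():
    print(f"{name} = {value!r}")
```

`fire` additionally converts argument types and accepts `--name=value` flags for keyword parameters, which this sketch leaves out.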

## Code Status and Capabilities
@@ -80,11 +80,6 @@ them. We will provide some access to each capability through
[Gradio](https://gradio.app/) as we develop them, and will eventually
synthesise them into a single agent-driven interface.

- **Single Paper QA (SPQA)** - retrieval augmented generation (RAG)
question answering about a single paper (implemented through
[llama-index](https://www.llamaindex.ai/))
- … more to come

## Where does the Name ‘Alhazen’ come from?

One thousand years ago, Ḥasan Ibn al-Haytham (965-1039 AD) studied
@@ -96,9 +91,9 @@ following the same paradigm ([Website](https://www.ibnalhaytham.com/),
[Wikipedia](https://en.wikipedia.org/wiki/Ibn_al-Haytham), [Tbakhi &
Amir 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6074172/)).

We use the latinized form of his name (‘Alhazen’) as the name of this
project to honor his contribution (which goes largely unrecognized from
within non-Islamic communities).
We use the latinized form of his name (‘Alhazen’) to honor his
contribution (which goes largely unrecognized from within non-Islamic
communities).

Famously, he was quoted as saying:

@@ -113,3 +108,6 @@ Here, we seek to develop an AI capable of applying scientific knowledge
engineering to support CZI’s mission. We seek to honor Ibn al-Haytham’s
critical view of published knowledge by creating an AI-powered system for
scientific discovery.

Note - when describing our agent, we will use non-gendered pronouns
(they/them/it) to refer to the agent.
11 changes: 11 additions & 0 deletions alhazen/apps.py
@@ -15,6 +15,17 @@ def single_paper_chatbot(doc_dir,
chunk_overlap=100,
section_regex=None):

"""Run a chatbot on a single paper - useful for testing. Execute this as follows:
Set environment variable LLMS_TEMP_DIR to a directory where you want to store the temporary files
`os.environ['LLMS_TEMP_DIR'] = '/tmp/alhazen'`
Run the chatbot
`python -m fire alhazen.apps single_paper_chatbot '/path/to/pdf/or/nxml/files/' 'mistral-7b-instruct'`
"""

chatbot = LocalFileLangChainChatBot(
doc_dir,
llm_name,
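
The docstring above pairs a Python statement (`os.environ[...]`) with a shell command. One way to combine the two steps is to launch the command from Python with the variable set for the child process. The following is a minimal sketch under stated assumptions: `build_chatbot_command` is a hypothetical helper that is not part of alhazen, and it assumes the app reads `LLMS_TEMP_DIR` from its environment as the docstring describes.

```python
import os
import shlex
import sys


def build_chatbot_command(doc_dir, llm_name, temp_dir="/tmp/alhazen"):
    """Return (argv, env) for launching single_paper_chatbot.

    Hypothetical helper, not part of alhazen. It only assembles the
    command from the docstring and a copy of the environment.
    """
    env = dict(os.environ)           # copy, so this process is unaffected
    env["LLMS_TEMP_DIR"] = temp_dir  # variable named in the docstring
    argv = [sys.executable, "-m", "fire", "alhazen.apps",
            "single_paper_chatbot", doc_dir, llm_name]
    return argv, env


argv, env = build_chatbot_command("/path/to/pdf/or/nxml/files/",
                                  "mistral-7b-instruct")
print(shlex.join(argv))

# To actually start the chatbot (requires alhazen and fire installed):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Passing `env` explicitly keeps the setting local to the launched process instead of changing the environment of the calling script.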
35 changes: 9 additions & 26 deletions alhazen/tools/single_document_qa.py
@@ -1,8 +1,7 @@
# AUTOGENERATED! DO NOT EDIT! File to edit: ../../nbs/11_single_document_qa.ipynb.

# %% auto 0
__all__ = ['PROMPT_SINGLE_ARTICLE_QUESTION_2', 'PROMPT_SINGLE_ARTICLE_QUESTION', 'get_ft_url_from_doi',
'LocalFileLangChainChatBot']
__all__ = ['PROMPT_SINGLE_ARTICLE_QUESTION', 'get_ft_url_from_doi', 'LocalFileLangChainChatBot']

# %% ../../nbs/11_single_document_qa.ipynb 2
import os
@@ -80,8 +79,8 @@ def get_ft_url_from_doi(doi, file_path):
f.write(xml)

# %% ../../nbs/11_single_document_qa.ipynb 4
PROMPT_SINGLE_ARTICLE_QUESTION_2 = {
'name': 'SingleArticleQuestion_2',
PROMPT_SINGLE_ARTICLE_QUESTION = {
'name': 'SingleArticleQuestion',
'description': 'Ask a question about a single research paper',
'system':
'''You are an helpful assistant, well-versed in scientific language, but cautious about making sure what you say is accurate.
@@ -90,33 +89,17 @@ def get_ft_url_from_doi(doi, file_path):
'''
Use the following pieces of context to answer the question at the end.
If you don''t know the answer, just say that you don''t know, don''t try to make up an answer
The question will be enclosed by angle brackets.
The text of the progress report will be enclosed by square brackets.
The question will be enclosed by double quotes.
The text of the progress report will be enclosed by double square brackets.
CONTEXT: [[{context}]]
Then, take a deep breath and think carefully before responding as precisely as possible to this question (enclosed in double angle brackets):
Then, take a deep breath and think carefully before responding as precisely as possible to this question (enclosed in double quotes):
QUESTION: <<{question}>>''',
QUESTION: "{question}"''',

'input_variables': ["context", "question"]}

PROMPT_SINGLE_ARTICLE_QUESTION = {
'name': 'SingleArticleQuestion',
'description': 'Ask a question about a single research paper',
'system':
'''You are an helpful assistant, well-versed in scientific language, but cautious about making sure what you say is accurate.
Your task is to answer a question about a progress report concisely and accurately.
Answer the question briefly in a single paragraph or bullet-point list (if needed).
The question will be enclosed by double quotes.
The text of the progress report will be enclosed by double square brackets.
Remember - answer the with the minimum of commentary possible in one or two sentences only.''',
'instruction':
'''First read the text of this progress report (enclosed in double square brackets):
ARTICLE: [[{article}]]
Then, take a deep breath and think carefully before responding as precisely as possible to this question (enclosed in double quotes):
QUESTION: "{question}"''',
'input_variables': ["article", "question"]}

class LocalFileLangChainChatBot:
'''Uses a simple prompt to answer questions about a collection of short PDF files.'''
@@ -134,7 +117,7 @@ def __init__(self,
doc_dir += '/'
self.change_directory(self.doc_dir)
self.prompt_registry = TaskInstructionRegistry()
self.prompt_registry.register_new_instruction_template(PROMPT_SINGLE_ARTICLE_QUESTION_2)
self.prompt_registry.register_new_instruction_template(PROMPT_SINGLE_ARTICLE_QUESTION)
model_type = None

self.section_regex = section_regex
@@ -144,7 +127,7 @@ def __init__(self,
elif GGUF_LOOKUP_URL.get(llm_name) is not None:
model_type = MODEL_TYPE.LlamaCpp

t = self.prompt_registry.get_instruction_template(PROMPT_SINGLE_ARTICLE_QUESTION_2.get('name'))
t = self.prompt_registry.get_instruction_template(PROMPT_SINGLE_ARTICLE_QUESTION.get('name'))
if model_type == MODEL_TYPE.OpenAI:
self.prompt_template = t.generate_prompt_template()
else:
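
To illustrate how the retained `PROMPT_SINGLE_ARTICLE_QUESTION` instruction is meant to be filled, here is a minimal sketch that substitutes the two declared input variables, `context` and `question`, into an abbreviated copy of the instruction text. It assumes plain `str.format`-style substitution; the actual `TaskInstructionRegistry` and `generate_prompt_template()` code paths are not shown in this commit and may behave differently. `fill_instruction` is a hypothetical name.

```python
# Abbreviated copy of the 'instruction' text from the diff. The double
# square brackets and double quotes are literal delimiters; only the
# {context} and {question} fields are substituted.
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end.\n"
    "CONTEXT: [[{context}]]\n"
    'QUESTION: "{question}"'
)


def fill_instruction(context, question):
    """Hypothetical helper: substitute both input variables."""
    return INSTRUCTION.format(context=context, question=question)


prompt = fill_instruction(
    context="Some retrieved passage from the paper.",
    question="What method did the authors use?",
)
print(prompt)
```

Because the delimiters are written into the template itself, the caller passes the raw context and question without adding brackets or quotes.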
14 changes: 9 additions & 5 deletions docs/apps.html
@@ -6,7 +6,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="description" content="Command line applications that can be called.">
<meta name="description" content="Command line applications that can be called with python -m fire alhazen.apps <app>.">

<title>alhazen - Apps</title>
<style>
@@ -64,10 +64,10 @@

<link rel="stylesheet" href="styles.css">
<meta property="og:title" content="alhazen - Apps">
<meta property="og:description" content="Command line applications that can be called.">
<meta property="og:description" content="Command line applications that can be called with python -m fire alhazen.apps <app>.">
<meta property="og:site-name" content="alhazen">
<meta name="twitter:title" content="alhazen - Apps">
<meta name="twitter:description" content="Command line applications that can be called.">
<meta name="twitter:description" content="Command line applications that can be called with python -m fire alhazen.apps <app>.">
<meta name="twitter:card" content="summary">
</head>

@@ -164,7 +164,7 @@ <h1 class="title">Apps</h1>

<div>
<div class="description">
Command line applications that can be called.
Command line applications that can be called with <code>python -m fire alhazen.apps &lt;app&gt;</code>.
</div>
</div>

@@ -190,7 +190,11 @@ <h3 class="anchored" data-anchor-id="single_paper_chatbot">single_paper_chatbot<
chunk_size=1000, chunk_overlap=100,
section_regex=None)</code></pre>
</blockquote>
<p>Run a chatbot on a single paper - useful for testing. Execute this as follows: &gt; # Set environment variable LLMS_TEMP_DIR to a directory where you want to store the temporary files &gt; os.environ[‘LLMS_TEMP_DIR’] = ‘/tmp/alhazen’ &gt; # Run the chatbot &gt; python -m fire alhazen.apps single_paper_chatbot ‘/path/to/pdf/or/nxml/files/’ ‘mistral-7b-instruct’</p>
<p>Run a chatbot on a single paper - useful for testing. Execute this as follows:</p>
<p>Set environment variable LLMS_TEMP_DIR to a directory where you want to store the temporary files</p>
<p><code>os.environ['LLMS_TEMP_DIR'] = '/tmp/alhazen'</code></p>
<p>Run the chatbot</p>
<p><code>python -m fire alhazen.apps single_paper_chatbot '/path/to/pdf/or/nxml/files/' 'mistral-7b-instruct'</code></p>


</section>
13 changes: 5 additions & 8 deletions docs/index.html
@@ -231,7 +231,7 @@ <h2 class="anchored" data-anchor-id="installation">Installation</h2>
<section id="install-from-source" class="level3">
<h3 class="anchored" data-anchor-id="install-from-source">Install from source</h3>
<div class="sourceCode" id="cb1"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://github.com/chanzuckerberg/alzhazen</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="ex">conda</span> create <span class="at">-n</span> alhazen python=3.8</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="ex">conda</span> create <span class="at">-n</span> alhazen python=3.11</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="ex">conda</span> activate alhazen</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> alhazen</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install <span class="at">-e</span> .</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -259,27 +259,24 @@ <h3 class="anchored" data-anchor-id="gguf-files-from-huggingface-thebloke">GGUF
<h2 class="anchored" data-anchor-id="how-to-use">How to use</h2>
<p>We use the fire library to create a modular command line interface (CLI) for Alhazen.</p>
<p>For example to run the chatbot for the single paper QA task, execute the following command:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> <span class="at">-m</span> fire alhazen/tools/<span class="op">&lt;</span>tool_name<span class="op">&gt;</span>.py chatbot</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> <span class="at">-m</span> fire alhazen.apps <span class="op">&lt;</span>tool_name<span class="op">&gt;</span> <span class="op">&lt;</span>tool_args<span class="op">&gt;</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>for example, run the following command to chat with the single paper QA chatbot:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> <span class="at">-m</span> fire alhazen/tools/single_paper_qa.py chatbot</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> <span class="at">-m</span> fire alhazen.apps single_paper_chatbot <span class="st">'/path/to/pdf/or/nxml/files/'</span> <span class="st">'mistral-7b-instruct'</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</section>
<section id="code-status-and-capabilities" class="level2">
<h2 class="anchored" data-anchor-id="code-status-and-capabilities">Code Status and Capabilities</h2>
<p>This project is still very early, but we are attempting to provide access to the full range of capabilities of the project as we develop them. We will provide some access to each capability through <a href="https://gradio.app/">Gradio</a> as we develop them, and will eventually synthesise them into a single agent-driven interface.</p>
<ul>
<li><strong>Single Paper QA (SPQA)</strong> - retrieval augmented generation (RAG) question answering about a single paper (implemented through <a href="https://www.llamaindex.ai/">llama-index</a>)</li>
<li>… more to come</li>
</ul>
</section>
<section id="where-does-the-name-alhazen-come-from" class="level2">
<h2 class="anchored" data-anchor-id="where-does-the-name-alhazen-come-from">Where does the Name ‘Alhazen’ come from?</h2>
<p>One thousand years ago, Ḥasan Ibn al-Haytham (965-1039 AD) studied optics through experimentation and observation. He advocated that a hypothesis must be supported by experiments based on confirmable procedures or mathematical reasoning — an early pioneer in the scientific method <em>five centuries</em> before Renaissance scientists started following the same paradigm (<a href="https://www.ibnalhaytham.com/">Website</a>, <a href="https://en.wikipedia.org/wiki/Ibn_al-Haytham">Wikipedia</a>, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6074172/">Tbakhi &amp; Amir 2007</a>).</p>
<p>We use the latinized form of his name (‘Alhazen’) as the name of this project to honor his contribution (which goes largely unrecognized from within non-Islamic communities).</p>
<p>We use the latinized form of his name (‘Alhazen’) to honor his contribution (which goes largely unrecognized from within non-Islamic communities).</p>
<p>Famously, he was quoted as saying:</p>
<blockquote class="blockquote">
<p>The duty of the man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads, and, applying his mind to the core and margins of its content, attack it from every side. He should also suspect himself as he performs his critical examination of it, so that he may avoid falling into either prejudice or leniency.</p>
</blockquote>
<p>Here, we seek to develop an AI capable of applying scientific knowledge engineering to support CZI’s mission. We seek to honor Ibn al-Haytham’s critical view of published knowledge by creating an AI-powered system for scientific discovery.</p>
<p>Note - when describing our agent, we will use non-gendered pronouns (they/them/it) to refer to the agent.</p>


</section>
2 changes: 1 addition & 1 deletion docs/single_document_qa.html
@@ -223,7 +223,7 @@ <h3 class="anchored" data-anchor-id="get_ft_url_from_doi">get_ft_url_from_doi</h
<pre><code> get_ft_url_from_doi (doi, file_path)</code></pre>
</blockquote>
<hr>
<p><a href="https://github.com/chanzuckerberg/alhazen/blob/main/alhazen/tools/single_document_qa.py#L121" target="_blank" style="float:right; font-size:smaller">source</a></p>
<p><a href="https://github.com/chanzuckerberg/alhazen/blob/main/alhazen/tools/single_document_qa.py#L104" target="_blank" style="float:right; font-size:smaller">source</a></p>
</section>
<section id="localfilelangchainchatbot" class="level3">
<h3 class="anchored" data-anchor-id="localfilelangchainchatbot">LocalFileLangChainChatBot</h3>
26 changes: 13 additions & 13 deletions docs/sitemap.xml
@@ -2,54 +2,54 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://chanzuckerberg.github.io/alhazen/index.html</loc>
<lastmod>2023-10-20T06:20:41.641Z</lastmod>
<lastmod>2023-10-20T06:52:36.894Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/airtable_utils.html</loc>
<lastmod>2023-10-20T06:20:41.127Z</lastmod>
<lastmod>2023-10-20T06:52:36.368Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/tiab_corpus_qa.html</loc>
<lastmod>2023-10-20T06:20:40.683Z</lastmod>
<lastmod>2023-10-20T06:52:35.924Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/curated_data_utils.html</loc>
<lastmod>2023-10-20T06:20:40.253Z</lastmod>
<lastmod>2023-10-20T06:52:35.491Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/single_document_qa.html</loc>
<lastmod>2023-10-20T06:20:39.807Z</lastmod>
<lastmod>2023-10-20T06:52:35.034Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/core.html</loc>
<lastmod>2023-10-20T06:20:39.274Z</lastmod>
<lastmod>2023-10-20T06:52:34.508Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/query_translator.html</loc>
<lastmod>2023-10-20T06:20:38.786Z</lastmod>
<lastmod>2023-10-20T06:52:33.956Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/20_ms_nlp_utils.html</loc>
<lastmod>2023-10-20T06:20:39.004Z</lastmod>
<lastmod>2023-10-20T06:52:34.208Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/search_engine_eutils.html</loc>
<lastmod>2023-10-20T06:20:39.518Z</lastmod>
<lastmod>2023-10-20T06:52:34.768Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/pdf_text_extractor.html</loc>
<lastmod>2023-10-20T06:20:40.028Z</lastmod>
<lastmod>2023-10-20T06:52:35.269Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/vector_dbs.html</loc>
<lastmod>2023-10-20T06:20:40.463Z</lastmod>
<lastmod>2023-10-20T06:52:35.705Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/jats_text_extractor.html</loc>
<lastmod>2023-10-20T06:20:40.896Z</lastmod>
<lastmod>2023-10-20T06:52:36.142Z</lastmod>
</url>
<url>
<loc>https://chanzuckerberg.github.io/alhazen/apps.html</loc>
<lastmod>2023-10-20T06:20:41.346Z</lastmod>
<lastmod>2023-10-20T06:52:36.589Z</lastmod>
</url>
</urlset>