This project is an interactive visualization of scholars and their publications. It lets users explore the community's research and collaborations in a variety of ways, and is built with Svelte and D3.js.
The project is structured as follows:
- `src`: contains the source code for the project
  - `routes`: contains the different pages of the project
  - `types`: contains the TypeScript types
- `static`: contains the data
- `python-script`: contains the Python scripts for data processing, vectorization, 2D projection, and clustering
We use Python, FastAPI, and Uvicorn to serve and process the data for the frontend. The server runs on http://localhost:8000 and the frontend on http://localhost:5173. Make sure you include a `.env` file with:

```
OPENAI_API_KEY=YOUR_API_KEY
```
```
server/
├── main.py
├── requirements.txt
├── .env
├── data/
└── utils/
    ├── __init__.py
    ├── rag.py
    ├── data.py
    └── client_setup.py
```
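For orientation, here is a minimal sketch of what `main.py` could look like; the `/publications` endpoint name and the CSV path are assumptions for illustration, not the project's documented API:

```python
import pandas as pd
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Let the SvelteKit dev server on :5173 call the API.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Hypothetical data location; fillna keeps the JSON serializable.
df = pd.read_csv("data/combined_df.csv").fillna("")

@app.get("/publications")
def publications():
    # One JSON object per publication row.
    return df.to_dict(orient="records")
```

Run it with `uvicorn main:app --port=8000 --reload`, as shown in the setup steps below.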
`combined_df.csv`

This CSV file contains the top-cited publications of each professor at the AI Institute.
`combined_df_pubdate.csv`

This CSV file contains the latest publications of each professor at the AI Institute. Each row combines author and publication data.
| Column Name | Description |
|---|---|
| email | The email of the author |
| first_name | The first name of the author |
| last_name | The last name of the author |
| faculty | The faculty the author is affiliated with |
| department | The department the author is affiliated with |
| area_of_focus | The self-identified area of focus of the author |
| gs_link | The link to the author's Google Scholar profile |
| author_id | The author's unique ID |
| title | The title of the publication |
| abstract | The abstract of the publication |
| doi | The DOI of the publication |
| embeddings | The embeddings of the publication |
| umap_x | The x coordinate of the 2D projection of the embeddings |
| umap_y | The y coordinate of the 2D projection of the embeddings |
| cluster | The cluster the publication belongs to (KMeans) |
| kde | The kernel density estimate at the publication's 2D position |
| focus_tag | The tag of the area of focus |
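A quick sketch of loading the file; the path, and the assumption that the embeddings column round-trips through CSV as stringified lists, are mine for illustration:

```python
import ast
import pandas as pd

df = pd.read_csv("static/combined_df_pubdate.csv")  # path is an assumption

# Lists written to CSV come back as strings; parse them into real lists
# (assuming Python-list formatting in the embeddings column).
df["embeddings"] = df["embeddings"].apply(ast.literal_eval)

print(df[["last_name", "title", "umap_x", "umap_y", "cluster"]].head())
```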
- Load data (publication records from Google Scholar + faculty data from the AI Institute website)
- Clean data
- Vectorize data with all-MiniLM-L6-v2
- 2D projection of the vectorized data (UMAP)
- Clustering of the 2D projected data (KMeans)
- Kernel Density Estimation of the 2D projected data
- Extract top keywords from each cluster (TF-IDF, GPT-4)
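A condensed sketch of the vectorize / project / cluster steps; file paths, `k`, and other parameters are placeholders, not the project's actual settings:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap

# Cleaned publication records (path is a placeholder).
df = pd.read_csv("combined_df.csv")

# Vectorize the abstracts with all-MiniLM-L6-v2 (384-dim sentence embeddings).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["abstract"].fillna("").tolist())

# Project the embeddings into 2D with UMAP.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
df["umap_x"], df["umap_y"] = coords[:, 0], coords[:, 1]

# Cluster the 2D points with KMeans (k=10 is an arbitrary placeholder).
df["cluster"] = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(coords)
```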
Once you've created a project and installed dependencies with `npm install` (or `pnpm install` or `yarn`), start a development server:

```bash
npm run dev

# or start the server and open the app in a new browser tab
npm run dev -- --open
```
Now let's create a virtual environment and install the dependencies:

1. Create a new conda environment. Open your terminal and run:

   ```bash
   conda create --name myenv python=3.9
   ```

   Replace `myenv` with the name you want to give your environment.

2. Activate the conda environment and make sure pip is available:

   ```bash
   conda activate myenv
   conda install pip
   ```

3. Install the dependencies and run the server:

   ```bash
   cd server
   pip install -r requirements.txt
   uvicorn main:app --port=8000 --reload
   ```
To create a production version of your app:

```bash
npm run build
```

You can preview the production build with `npm run preview`.
To deploy your app, you may need to install an adapter for your target environment.
Why use visual embeddings?
Visual embeddings represent high-dimensional data in a 2D space, so the data can be plotted and the relationships between points explored. In this project, we use UMAP to project the vectorized abstracts into 2D, which makes the relationships between publications and authors visible.
Why use clustering?
Clustering groups data points by similarity. In this project, we run KMeans on the 2D projected data to group similar publications and authors, which helps surface patterns and relationships in the data.
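A small standalone illustration with scikit-learn; the random coordinates stand in for the (umap_x, umap_y) points and k=10 is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
coords = rng.normal(size=(500, 2))   # stand-ins for the (umap_x, umap_y) points

km = KMeans(n_clusters=10, n_init=10, random_state=42).fit(coords)
labels = km.labels_                  # what the `cluster` column stores
print(np.bincount(labels))           # publications per cluster
```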
Why use kernel density estimation?
Kernel density estimation (KDE) estimates the probability density function underlying a set of data points. We apply it to the 2D projection to visualize how the points are distributed and to highlight areas of high density.
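A minimal sketch with SciPy's Gaussian KDE; the random coordinates are stand-ins, and the project's actual bandwidth choice is not documented here:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
xy = rng.normal(size=(2, 500))   # stand-ins for umap_x / umap_y, shape (dims, points)

kde = gaussian_kde(xy)           # fit a Gaussian KDE to the 2D points
density = kde(xy)                # density at each point, like the `kde` column
```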
Why use TF-IDF?
TF-IDF scores how characteristic a term is for one document relative to the rest of a corpus, which makes it a simple keyword extractor. We use it on the publication abstracts to surface the key topics and themes in each cluster.
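A sketch with scikit-learn's TfidfVectorizer; treating each cluster's concatenated abstracts as one document is my assumption about how the per-cluster keywords are derived:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cluster -> concatenated abstracts.
cluster_docs = {
    0: "deep learning for protein structure prediction and folding",
    1: "reinforcement learning agents in simulated robotics environments",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(cluster_docs.values())
terms = vectorizer.get_feature_names_out()

for cluster_id, row in zip(cluster_docs, tfidf.toarray()):
    top = row.argsort()[::-1][:5]   # indices of the five highest-scoring terms
    print(cluster_id, [terms[i] for i in top])
```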
Why project the data into 2D space? What is UMAP?
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that maps high-dimensional data into a lower-dimensional space while preserving much of its neighborhood structure. We use it to bring the 384-dimensional abstract embeddings down to two coordinates (umap_x, umap_y) so they can be drawn on screen.
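A small UMAP illustration; the parameter values are illustrative defaults, not the project's actual settings:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 384))   # stand-ins for MiniLM embeddings

reducer = umap.UMAP(
    n_components=2,    # target dimensionality for plotting
    n_neighbors=15,    # balances local vs. global structure
    min_dist=0.1,      # how tightly points may pack together
    random_state=42,
)
coords = reducer.fit_transform(vectors)  # shape (200, 2) -> umap_x, umap_y
```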
Why use MiniLM?
MiniLM is a compact transformer language model distilled from larger models; the all-MiniLM-L6-v2 variant from sentence-transformers is tuned for fast, high-quality sentence embeddings. We use it to vectorize the publication abstracts into 384-dimensional vectors, which are then projected into 2D and clustered.
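A quick illustration of MiniLM sentence embeddings; the example abstracts are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "Graph neural networks for molecular property prediction.",
    "Message passing architectures applied to chemistry.",
], convert_to_tensor=True)

# Cosine similarity between the two 384-dimensional embeddings.
print(util.cos_sim(vectors[0], vectors[1]))
```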