Change repo name #17

Merged
merged 4 commits into from
Jun 27, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ on:

# Replace package-name with your package name
env:
PACKAGE_NAME: interest
PACKAGE_NAME: dataQuest

jobs:
build:
Expand Down
32 changes: 16 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# INTEREST
# dataQuest

The code in this repository implements a pipeline to extract specific articles from a large corpus.

Expand All @@ -10,7 +10,7 @@ Articles can be filtered based on individual or multiple features such as title,
## Getting Started
Clone this repository to your workstation to obtain the examples and Python scripts:
```
git clone https://github.com/UtrechtUniversity/historical-news-sentiment.git
git clone https://github.com/UtrechtUniversity/dataQuest.git
```

### Prerequisites
Expand All @@ -20,10 +20,10 @@ To install and run this project you need to have the following prerequisites ins
```

### Installation
#### Option 1 - Install interest package
To run the project, make sure to install the interest package that is part of this project.
#### Option 1 - Install dataQuest package
To run the project, make sure to install the dataQuest package that is part of this project.
```
pip install interest
pip install dataQuest
```
#### Option 2 - Run from source code
If you want to run the scripts without installation you need to:
Expand All @@ -42,7 +42,7 @@ pip install .
On Linux and Mac OS, you might have to set the PYTHONPATH environment variable to point to this directory.

```commandline
export PYTHONPATH="current working directory/historical-news-sentiment:${PYTHONPATH}"
export PYTHONPATH="current working directory/dataQuest:${PYTHONPATH}"
```
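A quick way to confirm that the `PYTHONPATH` setting took effect is to ask Python whether it can locate the package. The snippet below is a minimal sketch using only the standard library; the helper name `is_importable` is an assumption introduced here for illustration and is not part of the repository.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python can locate a top-level package on its search path."""
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "dataQuest" is the package name used in this README; the lookup is case-sensitive.
    print("dataQuest importable:", is_importable("dataQuest"))
```

If this prints `False`, the directory in `PYTHONPATH` most likely does not point at the folder that contains the `dataQuest` package directory.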
### Built with
These packages are automatically installed in the step above:
Expand Down Expand Up @@ -85,7 +85,7 @@ Below is a snapshot of the JSON file format:

In our use case, the harvested KB data is in XML format. We have provided the following script to transform the original data into the expected format.
```
from interest.preprocessor.parser import XMLExtractor
from dataQuest.preprocessor.parser import XMLExtractor

extractor = XMLExtractor(Path(input_dir), Path(output_dir))
extractor.extract_xml_string()
Expand All @@ -99,9 +99,9 @@ python3 convert_input_files.py --input_dir path/to/raw/xml/data --output_dir pat

In order to define a corpus with a new data format you should:

- add a new input_file_type to [INPUT_FILE_TYPES](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/__init__.py)
- implement a class that inherits from [input_file.py](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/input_file.py).
This class is customized to read a new data format. In our case-study we defined [delpher_kranten.py](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/delpher_kranten.py).
- add a new input_file_type to [INPUT_FILE_TYPES](https://github.com/UtrechtUniversity/dataQuest/blob/main/dataQuest/filter/__init__.py)
- implement a class that inherits from [input_file.py](https://github.com/UtrechtUniversity/dataQuest/blob/main/dataQuest/filter/input_file.py).
This class is customized to read a new data format. In our case-study we defined [delpher_kranten.py](https://github.com/UtrechtUniversity/dataQuest/blob/main/dataQuest/filter/delpher_kranten.py).
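
The two steps above can be sketched as follows. This is an illustrative, self-contained sketch and not the repository's code: `InputFileSketch`, `PlainTextFile`, `read_titles`, `INPUT_FILE_TYPES_SKETCH` and `open_input` are hypothetical names, and the real base class in `input_file.py` may define a different interface.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterator, Type


class InputFileSketch(ABC):
    """Hypothetical stand-in for the base class defined in input_file.py."""

    def __init__(self, filepath: Path) -> None:
        self.filepath = filepath

    @abstractmethod
    def read_titles(self) -> Iterator[str]:
        """Yield one title per article found in the file."""


class PlainTextFile(InputFileSketch):
    """Hypothetical reader for a new format: one article title per non-empty line."""

    def read_titles(self) -> Iterator[str]:
        with open(self.filepath, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield line.strip()


# Hypothetical registry that mirrors the shape of INPUT_FILE_TYPES:
# a mapping from a type name to the class that reads that format.
INPUT_FILE_TYPES_SKETCH: Dict[str, Type[InputFileSketch]] = {
    "plain_text": PlainTextFile,
}


def open_input(file_type: str, filepath: Path) -> InputFileSketch:
    """Look up the reader class by its type name and instantiate it."""
    try:
        reader_class = INPUT_FILE_TYPES_SKETCH[file_type]
    except KeyError:
        raise ValueError(f"Unknown input file type: {file_type}") from None
    return reader_class(filepath)
```

Keeping the registry as a plain dictionary means a new format only needs one new class and one new dictionary entry; the calling script does not change.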


### 2. Filtering
Expand Down Expand Up @@ -144,7 +144,7 @@ The output of this script is a JSON file for each selected article in the follow
}
```
### 3. Categorization by timestamp
The output files generated in the previous step are categorized based on a specified [period-type](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/temporal_categorization/__init__.py),
The output files generated in the previous step are categorized based on a specified [period-type](https://github.com/UtrechtUniversity/dataQuest/blob/main/dataQuest/temporal_categorization/__init__.py),
such as ```year``` or ```decade```. This categorization is essential for subsequent steps, especially if you intend to apply tf-idf or other models to specific periods. In our case, we applied tf-idf per decade.
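
To make the idea of a period type concrete, here is a minimal sketch of grouping articles by `year` or `decade`. It assumes each article is represented by a file path and a publication date; the function names `period_key` and `group_by_period` are hypothetical and are not taken from `temporal_categorization`.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple


def period_key(day: date, period_type: str) -> int:
    """Map a date to the start year of its period."""
    if period_type == "year":
        return day.year
    if period_type == "decade":
        # 1957 -> 1950, 1960 -> 1960
        return day.year - day.year % 10
    raise ValueError(f"Unsupported period type: {period_type}")


def group_by_period(articles: List[Tuple[str, date]],
                    period_type: str) -> Dict[int, List[str]]:
    """Group article file paths by the period their date falls into."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for file_path, day in articles:
        groups[period_key(day, period_type)].append(file_path)
    return dict(groups)
```

For example, articles dated 1957 and 1959 land in the same `decade` group (1950) but in separate `year` groups.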

```commandline
Expand All @@ -159,7 +159,7 @@ By utilizing tf-idf, the most relevant articles related to the specified topic (

Before applying tf-idf, articles containing any of the specified keywords in their title are selected.

From the rest of articles, to choose the most relevant ones, you can specify one of the following criteria in [config.json](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/config.json):
From the rest of articles, to choose the most relevant ones, you can specify one of the following criteria in [config.json](https://github.com/UtrechtUniversity/dataQuest/blob/main/config.json):

- Percentage of selected articles with the top scores
- Maximum number of selected articles with the top scores
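
The two criteria listed above can be illustrated with a small sketch. It assumes the relevance scores have already been computed and are available as a mapping from article name to score; the function name `select_top` and the criterion labels `"percentage"` and `"max_number"` are hypothetical and do not necessarily match the keys used in the configuration file.

```python
import math
from typing import Dict, List


def select_top(scores: Dict[str, float], criterion: str, value: float) -> List[str]:
    """Return the best-scoring article names according to one criterion."""
    # Rank article names from highest to lowest score.
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    if criterion == "percentage":
        # Keep the top `value` percent, rounded up to a whole article.
        count = math.ceil(len(ranked) * value / 100)
    elif criterion == "max_number":
        # Keep at most `value` articles.
        count = int(value)
    else:
        raise ValueError(f"Unknown criterion: {criterion}")
    return ranked[:count]
```

Rounding the percentage up is a design choice in this sketch: it guarantees that a non-empty input with a positive percentage yields at least one article.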
Expand Down Expand Up @@ -192,12 +192,12 @@ From the rest of articles, to choose the most relevant ones, you can specify one

The following script adds a new column, ```selected```, to the .csv files from the previous step.
```commandline
python3 scripts/3_select_final_articles.py --input_dir "output/output_timestamped/"
python3 scripts/step3_select_final_articles.py --input-dir "output/output_timestamped/"
```
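
As a rough illustration of what adding a `selected` column means, the sketch below marks rows of a .csv file using only the standard library. It is not the logic of `step3_select_final_articles.py`: the function name `add_selected_column` is hypothetical, the `file_path` column name is taken from the scripts in this repository, and the 0/1 encoding of the flag is an assumption.

```python
import csv
from pathlib import Path
from typing import Set


def add_selected_column(csv_path: Path, selected_paths: Set[str]) -> None:
    """Rewrite a CSV file in place with an extra `selected` column (1 or 0)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if "selected" not in fieldnames:
        fieldnames.append("selected")

    for row in rows:
        # Assumed encoding: 1 marks a selected article, 0 an unselected one.
        row["selected"] = 1 if row["file_path"] in selected_paths else 0

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```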

### 5. Generate output
As the final step of the pipeline, the text of the selected articles is saved in a .csv file, which can be used for manual labeling. You can choose whether the text is divided into paragraphs or into segments.
This feature can be set in [config.json](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/config.json).
This feature can be set in [config.json](https://github.com/UtrechtUniversity/dataQuest/blob/main/config.json).
```commandline
"output_unit": "paragraph"

Expand All @@ -211,7 +211,7 @@ OR
```

```commandline
python3 scripts/step4_generate_output.py --input_dir "output/output_timestamped/" --output-dir "output/output_results/" --glob "*.csv"
python3 scripts/step4_generate_output.py --input-dir "output/output_timestamped/" --output-dir "output/output_results/" --glob "*.csv"
```
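
The difference between the output units can be sketched as below. This is an assumption-laden illustration, not the behaviour of `TextFormatter`: the value `"paragraph"` appears in the configuration above, while `"segmented_text"`, the `max_words` parameter and all function names are hypothetical, and paragraphs are assumed to be separated by blank lines.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def split_segments(text: str, max_words: int) -> List[str]:
    """Cut text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[start:start + max_words])
            for start in range(0, len(words), max_words)]


def format_text(text: str, output_unit: str, max_words: int = 100) -> List[str]:
    """Return the text as a list of output units."""
    if output_unit == "paragraph":
        return split_paragraphs(text)
    if output_unit == "segmented_text":  # hypothetical configuration value
        return split_segments(text, max_words)
    raise ValueError(f"Unsupported output unit: {output_unit}")
```

Each returned unit would then become one row in the output .csv file, so the choice of unit determines how much text a labeler sees at a time.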
## About the Project
**Date**: February 2024
Expand Down Expand Up @@ -248,5 +248,5 @@ To contribute:

Pim Huijnen - p.huijnen@uu.nl

Project Link: [https://github.com/UtrechtUniversity/historical-news-sentiment](https://github.com/UtrechtUniversity/historical-news-sentiment)
Project Link: [https://github.com/UtrechtUniversity/dataQuest](https://github.com/UtrechtUniversity/dataQuest)

File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
import json
import logging
from typing import List, Union, Tuple
from interest.preprocessor.text_cleaner import TextCleaner
from dataQuest.preprocessor.text_cleaner import TextCleaner

text_cleaner = TextCleaner()

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,10 @@
from typing import List, Tuple, Dict, Union
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from interest.models.tfidf import TfidfEmbedder
from interest.article_final_selection.process_article import ArticleProcessor
from interest.article_final_selection.process_article import clean
from interest.article_final_selection.article_selector import ArticleSelector
from dataQuest.models.tfidf import TfidfEmbedder
from dataQuest.article_final_selection.process_article import ArticleProcessor
from dataQuest.article_final_selection.process_article import clean
from dataQuest.article_final_selection.article_selector import ArticleSelector


def process_articles(articles_filepath: str, clean_keywords: List[str]) -> (
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
"""define input-file type"""
from interest.filter.delpher_kranten import KrantenFile
from dataQuest.filter.delpher_kranten import KrantenFile

INPUT_FILE_TYPES = {
"delpher_kranten": KrantenFile
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@
import logging
import os
from typing import Optional
from interest.filter.document import Document, Article
from interest.filter.input_file import InputFile
from dataQuest.filter.document import Document, Article
from dataQuest.filter.input_file import InputFile


class KrantenFile(InputFile):
Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
"""
from abc import ABC, abstractmethod
from typing import List
from interest.filter.document import Document, Article
from dataQuest.filter.document import Document, Article


class DocumentFilter(ABC):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@
import logging
from pathlib import Path
from typing import Iterable, TextIO, cast, Optional
from interest.filter.document import Document, Article
from interest.filter.document_filter import DocumentFilter
from dataQuest.filter.document import Document, Article
from dataQuest.filter.document_filter import DocumentFilter


class InputFile(abc.ABC):
Expand Down
File renamed without changes.
6 changes: 3 additions & 3 deletions interest/models/tfidf.py → dataQuest/models/tfidf.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@
import scipy
from sklearn.feature_extraction.text import TfidfVectorizer

from interest.models.base import BaseEmbedder
from interest.utils import load_spacy_model
from interest.settings import SPACY_MODEL
from dataQuest.models.base import BaseEmbedder
from dataQuest.utils import load_spacy_model
from dataQuest.settings import SPACY_MODEL


class TfidfEmbedder(BaseEmbedder):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@
specified output units. """
from typing import List, Union
import logging
from interest.settings import SPACY_MODEL
from interest.utils import load_spacy_model
from dataQuest.settings import SPACY_MODEL
from dataQuest.utils import load_spacy_model

PARAGRAPH_FORMATTER = 'paragraph'
FULLTEXT_FORMATTER = 'full_text'
Expand Down
1 change: 1 addition & 0 deletions dataQuest/preprocessor/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# from dataQuest.preprocessor.parser import XMLExtractor
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@
"""
import re
from typing import Union, List
from interest.settings import SPACY_MODEL
from interest.utils import load_spacy_model
from dataQuest.settings import SPACY_MODEL
from dataQuest.utils import load_spacy_model


def merge_texts_list(text: Union[str, List[str]]) -> str:
Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
"""Mapping from string format descriptions to corresponding classes."""
from interest.temporal_categorization.timestamped_data \
from dataQuest.temporal_categorization.timestamped_data \
import (YearPeriodData, DecadePeriodData)

PERIOD_TYPES = {
Expand Down
14 changes: 7 additions & 7 deletions interest/utils.py → dataQuest/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@
import json
import spacy
import spacy.cli
from interest.filter.document_filter import (YearFilter,
TitleFilter,
DocumentFilter)
from interest.filter.document_filter import (CompoundFilter,
DecadeFilter,
KeywordsFilter)
from interest.settings import ENCODING
from dataQuest.filter.document_filter import (YearFilter,
TitleFilter,
DocumentFilter)
from dataQuest.filter.document_filter import (CompoundFilter,
DecadeFilter,
KeywordsFilter)
from dataQuest.settings import ENCODING


@cache
Expand Down
1 change: 0 additions & 1 deletion interest/preprocessor/__init__.py

This file was deleted.

4 changes: 2 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"

[project]
name = "interest"
name = "dataQuest"
description = "A package to extract historical news sentiments"
authors = [
{name = "Shiva Nadi", email = "s.nadi@uu.nl"},
Expand Down Expand Up @@ -31,7 +31,7 @@ lint = ["flake8"]
test = ["pytest", "mypy"]

[tool.setuptools]
packages = ["interest"]
packages = ["dataQuest"]

[tool.flake8]
max-line-length = 99
Expand Down
2 changes: 1 addition & 1 deletion scripts/convert_input_files.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
from interest.preprocessor.parser import XMLExtractor
from dataQuest.preprocessor.parser import XMLExtractor
from argparse import ArgumentParser
from pathlib import Path
import logging
Expand Down
8 changes: 4 additions & 4 deletions scripts/step1_filter_articles.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@

from tqdm import tqdm

from interest.filter import INPUT_FILE_TYPES
from interest.filter.input_file import InputFile
from interest.utils import load_filters_from_config
from interest.utils import save_filtered_articles
from dataQuest.filter import INPUT_FILE_TYPES
from dataQuest.filter.input_file import InputFile
from dataQuest.utils import load_filters_from_config
from dataQuest.utils import save_filtered_articles

if __name__ == "__main__":
parser = argparse.ArgumentParser("Filter articles from input files.")
Expand Down
4 changes: 2 additions & 2 deletions scripts/step2_categorize_by_timestamp.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@
from pathlib import Path
import pandas as pd
from tqdm import tqdm # type: ignore
from interest.temporal_categorization import PERIOD_TYPES
from interest.temporal_categorization.timestamped_data import TimestampedData
from dataQuest.temporal_categorization import PERIOD_TYPES
from dataQuest.temporal_categorization.timestamped_data import TimestampedData

OUTPUT_FILE_NAME = 'articles'
FILENAME_COLUMN = 'file_path'
Expand Down
6 changes: 3 additions & 3 deletions scripts/step3_select_final_articles.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,9 @@
from typing import List
from pathlib import Path
import pandas as pd
from interest.utils import get_keywords_from_config
from interest.utils import read_config
from interest.article_final_selection.process_articles import select_articles
from dataQuest.utils import get_keywords_from_config
from dataQuest.utils import read_config
from dataQuest.article_final_selection.process_articles import select_articles

ARTICLE_SELECTOR_FIELD = "article_selector"

Expand Down
10 changes: 5 additions & 5 deletions scripts/step4_generate_output.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@
from typing import Union
import pandas as pd
from pandas import DataFrame
from interest.settings import SPACY_MODEL
from interest.article_final_selection.process_article import ArticleProcessor
from interest.utils import read_config, get_file_name_without_extension
from interest.output_generator.text_formater import (TextFormatter,
SEGMENTED_TEXT_FORMATTER)
from dataQuest.settings import SPACY_MODEL
from dataQuest.article_final_selection.process_article import ArticleProcessor
from dataQuest.utils import read_config, get_file_name_without_extension
from dataQuest.output_generator.text_formater import (TextFormatter,
SEGMENTED_TEXT_FORMATTER)


FILE_PATH_FIELD = "file_path"
Expand Down