The preprocessing submodule implements some basic functionality for preprocessing clinical text. It currently provides:
- Normalizing
- Sentence splitting
- Tokenizing
The abstract classes `Normalizer`, `SentenceSplitter` and `Tokenizer` define the interface (see table below), with ready-to-use implementations provided in the implementation classes. It is always possible to define your own custom normalizer, sentence splitter or tokenizer by creating a new class with your own logic.
Abstract Class | Functions | Implementations |
---|---|---|
`Normalizer` | `normalize` | `BasicNormalizer` |
`SentenceSplitter` | `_process_text`, `split`, `_process_sentence` | `BasicSentenceSplitter` |
`Tokenizer` | `tokenize` | `SpacyTokenizer` |
A complete API can be found at the bottom of this readme.
The `BasicNormalizer` implements some basic normalization steps, such as removing special characters and removing double whitespaces/newlines.
```python
from psynlp.preprocessing import BasicNormalizer

bn = BasicNormalizer()
bn.normalize("Patiënt werd vannacht opgenomen.")
>> "Patient werd vannacht opgenomen."
```
Splitting text into sentences early on is very useful in later stages, as processing text sentence-by-sentence strikes a nice balance between processing whole texts and processing individual words. When using the `context` submodule, only sentences are accepted as input.
A ready-for-use implementation can be found in `BasicSentenceSplitter`. This implementation makes use of the `nltk` sentence tokenizer, which, based on some experimentation, works well and is relatively fast. The `nltk` sentence tokenizer is used in combination with some custom rules and scripts, such as always splitting on `\n`, and not splitting on the period in `2.5 milligram`.
```python
from psynlp.preprocessing import BasicSentenceSplitter

bss = BasicSentenceSplitter()
bss.split("Dit is zin één. Dit is zin twee.")
>> ['Dit is zin één.', 'Dit is zin twee.']
```
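To illustrate the custom rules mentioned above, using the same `bss` instance, a newline forces a split while the period in a dosage does not (the output shown is illustrative, not verified):

```python
bss.split("Patiënt krijgt 2.5 milligram.\nMorgen volgt controle.")
>> ['Patiënt krijgt 2.5 milligram.', 'Morgen volgt controle.']
```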
For tokenizing, the `SpacyTokenizer` is implemented, which is based on the out-of-the-box Dutch tokenizer that spaCy provides, with some additional logic for correctly tokenizing DEDUCE-tags and some other symbols such as `/` and `-`. NB: the spaCy model is created in the `nlpresourcetraining` pipeline, and is automatically loaded from the shared drive when creating the `SpacyTokenizer`.
```python
from psynlp.preprocessing import SpacyTokenizer

st = SpacyTokenizer(spacy_model="spacy_model_name")
st.tokenize("Deze tekst gaat over <PERSOON-1>.")
>>> ['Deze', 'tekst', 'gaat', 'over', '<PERSOON-1>', '.']
```
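The three components can also be chained into a single preprocessing pipeline. The sketch below is a minimal example that only uses the constructors and methods documented in the API section; `"spacy_model_name"` is a placeholder for an actual model name.

```python
from psynlp.preprocessing import BasicNormalizer, BasicSentenceSplitter, SpacyTokenizer

# Sketch of a full preprocessing pipeline: normalize -> split -> tokenize.
normalizer = BasicNormalizer()
splitter = BasicSentenceSplitter(normalizer=normalizer)
tokenizer = SpacyTokenizer(spacy_model="spacy_model_name")

# The normalizer passed to the splitter normalizes each text before splitting.
sentences = splitter.split("Patiënt werd vannacht opgenomen. Morgen volgt controle.")
tokenized = [tokenizer.tokenize(sentence) for sentence in sentences]
```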
```python
bn = BasicNormalizer()
```
Function | Description | Returns |
---|---|---|
`bn.normalize(text)` | Normalize text | `text` |
```python
from psynlp.preprocessing import Normalizer  # abstract base class (import path assumed)

class CustomNormalizer(Normalizer):
    def normalize(self, text):
        # Implement your custom normalization logic here
        return text
```
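As a more concrete illustration, a custom normalizer could for example collapse whitespace and lowercase the text. This is a minimal sketch that assumes only the `normalize` interface above; the class name is hypothetical.

```python
from psynlp.preprocessing import Normalizer  # abstract base class (import path assumed)

class LowercaseNormalizer(Normalizer):
    """Hypothetical normalizer that collapses repeated whitespace and lowercases text."""

    def normalize(self, text):
        # Collapse runs of whitespace into single spaces, then lowercase
        return " ".join(text.split()).lower()
```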
```python
bss = BasicSentenceSplitter(verbose=False, normalizer=None)
```
Field | Description |
---|---|
`verbose` | Verbosity |
`normalizer` | An instance of the `Normalizer` class, such as a `BasicNormalizer`, or `None` if texts should not be normalized |
Function | Description | Returns |
---|---|---|
`bss.split(text)` | Split a text into sentences. | `[sentence]` |
`bss.process_dataframe(dataframe, hash_var, text_columns)` | Processes a dataframe by splitting the texts in `text_columns` into separate rows; requires a unique `hash_var` | exploded dataframe |
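A minimal sketch of how `process_dataframe` might be used, assuming a pandas dataframe with a unique identifier column; the column names (`note_id`, `report_text`) and passing `text_columns` as a list are illustrative assumptions, not part of the documented API.

```python
import pandas as pd

from psynlp.preprocessing import BasicSentenceSplitter

# Illustrative dataframe; "note_id" and "report_text" are hypothetical column names.
df = pd.DataFrame({
    "note_id": [1, 2],
    "report_text": ["Dit is zin één. Dit is zin twee.", "Patiënt werd opgenomen."],
})

bss = BasicSentenceSplitter()

# Each text in the listed columns is split into sentences, one sentence per row.
exploded = bss.process_dataframe(df, hash_var="note_id", text_columns=["report_text"])
```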
```python
from psynlp.preprocessing import SentenceSplitter  # abstract base class (import path assumed)

class CustomSentenceSplitter(SentenceSplitter):
    def _process_text(self, text):
        # Pre-process the text before splitting
        return text

    def split(self, text):
        # Defines how to split; should return a list of sentences
        return text

    def _process_sentence(self, sentence):
        # Post-process each sentence after splitting
        return sentence
```
```python
st = SpacyTokenizer(spacy_model)
```
Field | Description |
---|---|
`spacy_model` | The spaCy model to use from global resources |
Function | Description | Returns |
---|---|---|
`st.tokenize(text)` | Tokenize the text | `[tokens]` |
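As an illustration of the extra logic for symbols such as `/` mentioned earlier, using an `st` instance constructed as above, a call like the following would be expected to keep the slash as a separate token (the output shown is illustrative, not verified):

```python
st.tokenize("Bloeddruk 120/80 mmHg.")
>>> ['Bloeddruk', '120', '/', '80', 'mmHg', '.']
```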
```python
from psynlp.preprocessing import Tokenizer  # abstract base class (import path assumed)

class CustomTokenizer(Tokenizer):
    def tokenize(self, text):
        # Implement your custom tokenization logic here; should return a list of tokens
        return text
```
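As a concrete sketch, a custom tokenizer could for example simply split on whitespace. This assumes only the `tokenize` interface above; the class name is hypothetical.

```python
from psynlp.preprocessing import Tokenizer  # abstract base class (import path assumed)

class WhitespaceTokenizer(Tokenizer):
    """Hypothetical tokenizer that splits on whitespace only."""

    def tokenize(self, text):
        # Return a list of tokens rather than the raw string
        return text.split()
```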