Specification of the value-to-form processing in Lexibank datasets:
The value-to-form processing is divided into two steps, implemented as methods:
FormSpec.split
: Splits a string into individual form chunks.
FormSpec.clean
: Normalizes a form chunk.
These methods use the attributes of a FormSpec
instance to configure their behaviour.
brackets
: {'’': '’', '(': ')'}
  Pairs of strings that should be recognized as brackets, specified as dict mapping opening string to closing string

separators
: /
  Iterable of single-character tokens that should be recognized as word separators

missing_data
: ('?', '-')
  Iterable of strings that are used to mark missing data

strip_inside_brackets
: True
  Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace)

replacements
: [('*', ''), (' ', '_')]
  List of pairs (source, target) used to replace occurrences of source in forms with target (before stripping content in brackets)

first_form_only
: False
  Flag signaling whether at most one form should be returned from split, effectively ignoring any spelling variants, etc.

normalize_whitespace
: True
  Flag signaling whether to normalize whitespace: stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces

normalize_unicode
: None
  UNICODE normalization form to use for input of split (None, 'NFD' or 'NFC')
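
To make the interplay of these attributes concrete, here is a minimal, self-contained sketch of the two steps. It is not the pylexibank implementation: the class name SketchFormSpec is hypothetical, the order of operations is an assumption based on the descriptions above, and the splitting is deliberately naive (separators inside brackets are not protected).

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchFormSpec:
    """Hypothetical, simplified stand-in for FormSpec (illustration only)."""
    brackets: dict = field(default_factory=lambda: {"’": "’", "(": ")"})
    separators: str = "/"
    missing_data: tuple = ("?", "-")
    strip_inside_brackets: bool = True
    replacements: list = field(default_factory=lambda: [("*", ""), (" ", "_")])
    first_form_only: bool = False
    normalize_whitespace: bool = True
    normalize_unicode: Optional[str] = None

    def clean(self, form):
        """Normalize one form chunk; return None for missing data."""
        if form in self.missing_data:
            return None
        # Replacements run before bracketed content is stripped.
        for source, target in self.replacements:
            form = form.replace(source, target)
        if self.strip_inside_brackets:
            for opening, closing in self.brackets.items():
                pattern = re.escape(opening) + ".*?" + re.escape(closing)
                form = re.sub(pattern, "", form)
            form = form.strip()
        return form or None

    def split(self, value):
        """Split a raw value into a list of cleaned form chunks."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        if self.normalize_whitespace:
            value = re.sub(r"\s+", " ", value.strip())
        if value in self.missing_data:
            return []
        if self.separators:
            # Naive split on any separator character (assumption: a real
            # implementation would not split inside bracketed content).
            chunks = re.split("[" + re.escape(self.separators) + "]", value)
        else:
            chunks = [value]
        forms = [f for f in (self.clean(c.strip()) for c in chunks) if f]
        return forms[:1] if self.first_form_only else forms


spec = SketchFormSpec()
print(spec.split("*ba ta(dial.)/wata"))  # ['ba_ta', 'wata']
print(spec.split("?"))                   # []
```

The sample values are invented for illustration. With first_form_only=True the first call would return only ['ba_ta'], dropping the variant after the separator.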
Source lexemes may be impossible to interpret correctly. 108 such lexemes are listed
in etc/lexemes.csv
and replaced as specified in this file.
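
As a rough sketch of how such a replacement table could be applied, the snippet below reads a CSV into a dictionary and looks up each raw value before any further processing. The column names LEXEME and REPLACEMENT, the helper names, and the exact-match lookup on the whole value are assumptions for illustration; check the header of etc/lexemes.csv before relying on them.

```python
import csv
import io


def load_replacements(csv_file, source_col="LEXEME", target_col="REPLACEMENT"):
    """Read a replacement table into a dict (column names are assumptions)."""
    return {row[source_col]: row[target_col] for row in csv.DictReader(csv_file)}


def apply_replacement(value, replacements):
    """Return the replacement for a listed lexeme, else the value unchanged."""
    return replacements.get(value, value)


# Invented sample rows, for illustration only (not taken from the dataset).
sample = io.StringIO("LEXEME,REPLACEMENT\nba ta ?,ba ta\n")
replacements = load_replacements(sample)

print(apply_replacement("ba ta ?", replacements))  # ba ta
print(apply_replacement("wata", replacements))     # wata
```

For the real file, the StringIO object would be swapped for a handle such as open("etc/lexemes.csv", encoding="utf-8", newline="").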