Skip to content

Latest commit

 

History

History
32 lines (25 loc) · 1.67 KB

FORMS.md

File metadata and controls

32 lines (25 loc) · 1.67 KB

Specification of form manipulation

Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:

  • FormSpec.split: Splits a string into individual form chunks.
  • FormSpec.clean: Normalizes a form chunk.

These methods use the attributes of a FormSpec instance to configure their behaviour.

  • brackets: {'’': '’', '(': ')'} Pairs of strings that should be recognized as brackets, specified as dict mapping opening string to closing string
  • separators: / Iterable of single character tokens that should be recognized as word separator
  • missing_data: ('?', '-') Iterable of strings that are used to mark missing data
  • strip_inside_brackets: True Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace)
  • replacements: [('*', ''), (' ', '_')] List of pairs (source, target) used to replace occurrences of source in formswith target (before stripping content in brackets)
  • first_form_only: False Flag signaling whether at most one form should be returned from split - effectively ignoring any spelling variants, etc.
  • normalize_whitespace: True Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
  • normalize_unicode: None UNICODE normalization form to use for input of split (None, 'NFD' or 'NFC')

Replacement of invalid lexemes

Source lexemes may be impossible to interpret correctly. 108 such lexemes are listed in etc/lexemes.csv and replaced as specified in this file.