LimpiaR is an R library of functions for cleaning and pre-processing text data. The name comes from ‘limpiar’, the Spanish verb ‘to clean’. Generally, when calling a LimpiaR function, you can think of it as ‘clean…’.
LimpiaR is primarily used for cleaning unstructured text data, such as social media posts or reviews. Its initial release focuses on the Spanish language; however, some of its functions are language-agnostic.
You can install the development version of LimpiaR from GitHub with:
# install.packages("devtools")
devtools::install_github("jpcompartir/LimpiaR")
LimpiaR provides a comprehensive suite of text cleaning and processing functions, primarily focused on preparing text data for machine learning and analytics tasks. Below you’ll find the functions organised by their primary purpose.
Functions for editing the text variable in place.
Function | Description | Language Support | Primary Use Case | Notes |
---|---|---|---|---|
limpiar_accents | Replaces accented characters with unaccented equivalents | Language-agnostic | Text normalisation | Useful for reducing token complexity |
limpiar_spaces | Removes redundant spaces | Language-agnostic | Text cleaning | Also standardises punctuation spacing |
limpiar_url | Removes URLs from text | Language-agnostic | Text cleaning | Handles various URL formats |
limpiar_repeat_chars | Normalises repeated characters | Spanish-focused | Text normalisation | Handles laugh patterns (jajaja) |
limpiar_shorthands | Expands common abbreviations | Spanish-focused | Text normalisation | e.g., “porq” → “porque” |
limpiar_tags | Normalises social media tags | Language-agnostic | Social media prep | Handles @mentions and #hashtags |
limpiar_stopwords | Removes common stopwords | Spanish-focused | Text analysis | Offers “sentiment” and “topics” modes |
limpiar_slang | Normalises dialectal variations | Spanish-focused | Text normalisation | Handles multiple Spanish dialects |
limpiar_emojis_es | Converts emojis to Spanish text | Spanish | Text normalisation | Spanish-specific emoji descriptions |
limpiar_recode_emojis | Recodes emojis to text | Language-agnostic | Text normalisation | General emoji handling |
limpiar_remove_emojis | Removes emojis completely | Language-agnostic | Text cleaning | Complete emoji removal |
limpiar_pp_products | Replaces product mentions | English/Spanish | Entity normalisation | For product analysis |
limpiar_pp_companies | Replaces company mentions | English/Spanish | Entity normalisation | For company analysis |
limpiar_non_ascii | Removes non-ASCII characters | Language-agnostic | Text cleaning | Less aggressive than alphanumeric |
limpiar_alphanumeric | Keeps only letters/numbers | Language-agnostic | Text cleaning | Most aggressive cleaning |
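
To make a few of the normalisation steps in the table above concrete, here is a minimal base R sketch of three of them: replacing accented vowels, removing URLs, and collapsing redundant spaces. The helper names (`strip_accents`, `remove_urls`, `squish_spaces`) and the regular expressions are illustrative assumptions, not LimpiaR's own code or signatures, and the package functions may handle more cases than this.

```r
# Base R sketch only: these helpers are hypothetical and are not LimpiaR's code.

strip_accents <- function(x) {
  # Map accented vowels (and u with diaeresis) to their unaccented equivalents.
  chartr("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU", x)
}

remove_urls <- function(x) {
  # Drop anything that starts with http://, https:// or www. up to the next space.
  gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
}

squish_spaces <- function(x) {
  # Collapse runs of whitespace into one space and trim both ends.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

posts <- c(
  "Qué   bueno está  https://example.com/oferta ",
  "Mañana   más información"
)

cleaned <- squish_spaces(remove_urls(strip_accents(posts)))
print(cleaned)
```

Removing URLs before collapsing spaces matters here: deleting a URL leaves a gap, so the whitespace step should run last. Note that stripping accents loses information (for example, it can merge distinct words), so it is worth applying deliberately.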
Functions for removing unwanted posts entirely (rather than cleaning).
Function | Description | Language Support | Primary Use Case | Notes |
---|---|---|---|---|
limpiar_duplicates | Removes duplicate content | Language-agnostic | Data cleaning | Also removes protected content |
limpiar_retweets | Removes retweet content | Language-agnostic | Social media cleaning | Identifies RT patterns |
limpiar_spam_grams | Removes spam-like patterns | Language-agnostic | Content filtering | Uses n-gram analysis |
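
The difference between filtering and cleaning can be sketched in base R: rows are dropped, while the text of the remaining rows is left unchanged. The example below assumes a data frame with a `text` column and a simple "RT @" prefix pattern; both are illustrative assumptions and do not reflect how LimpiaR identifies retweets or duplicates internally.

```r
# Base R sketch only: column names and the retweet pattern are assumptions.
posts <- data.frame(
  id = 1:5,
  text = c(
    "RT @user: gran oferta",
    "gran oferta",
    "gran oferta",
    "me encanta",
    "rt @otro: me encanta"
  ),
  stringsAsFactors = FALSE
)

# Flag posts that begin with an "RT @" marker, ignoring case.
is_retweet <- grepl("^RT\\s+@", posts$text, ignore.case = TRUE, perl = TRUE)
no_retweets <- posts[!is_retweet, ]

# Keep only the first occurrence of each remaining text.
filtered <- no_retweets[!duplicated(no_retweets$text), ]
print(filtered)
```

Only posts 2 and 4 survive: posts 1 and 5 match the retweet pattern, and post 3 is an exact duplicate of post 2.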
Miscellaneous functions designed to speed up aspects of cleaning text.
Function | Description | Language Support | Primary Use Case | Notes |
---|---|---|---|---|
limpiar_inspect | Viewable pane for pattern matches | Language-agnostic | Data exploration | Interactive viewing |
limpiar_na_cols | Removes NA-heavy columns | Language-agnostic | Data cleaning | Configurable threshold |
limpiar_link_click | Makes URLs clickable and short | Language-agnostic | UI enhancement | For Shiny/DataTable |
limpiar_ex_subreddits | Extracts subreddit names | Language-agnostic | Reddit analysis | URL parsing |
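
Two of the utilities above, dropping NA-heavy columns and pulling subreddit names out of URLs, are easy to illustrate in base R. The function name `drop_na_cols`, its `threshold` argument, and the URL pattern are hypothetical choices made for this sketch; they are not LimpiaR's signatures.

```r
# Base R sketch only: names and defaults here are assumptions, not LimpiaR's API.

drop_na_cols <- function(df, threshold = 0.5) {
  # Share of missing values per column.
  na_share <- colMeans(is.na(df))
  # Keep columns whose NA share does not exceed the threshold.
  df[, na_share <= threshold, drop = FALSE]
}

df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c("x", "y", "z", "w"),
  stringsAsFactors = FALSE
)
kept <- drop_na_cols(df, threshold = 0.5)
print(names(kept))

# Extract the subreddit name from Reddit URLs; non-Reddit URLs give NA.
urls <- c(
  "https://www.reddit.com/r/rstats/comments/abc123/some_title/",
  "https://example.com/page"
)
hits <- regmatches(urls, regexec("reddit\\.com/r/([^/]+)", urls))
subreddits <- vapply(
  hits,
  function(hit) if (length(hit) == 2) hit[2] else NA_character_,
  character(1)
)
print(subreddits)
```

Column `b` is dropped because three of its four values are missing, which exceeds the 0.5 threshold.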
Functions that together make up a Parts of Speech (POS) analysis workflow.
Function | Description | Language Support | Primary Use Case | Notes |
---|---|---|---|---|
limpiar_pos_import_model | Imports and caches Parts of Speech models | 65+ languages | POS analysis prep | Uses UDPipe models |
limpiar_pos_annotate | Performs POS analysis | 65+ languages | Text analysis | Includes dependency parsing |
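
Since the table notes that these functions use UDPipe models, the two-step workflow (import a model, then annotate) can be sketched with the `udpipe` package directly. This is not LimpiaR's wrapper code and says nothing about the arguments of `limpiar_pos_import_model` or `limpiar_pos_annotate`; it only shows the underlying steps, assuming `udpipe` is installed and an internet connection is available for the model download.

```r
# Sketch using the udpipe package directly; not LimpiaR's wrapper functions.
library(udpipe)

# Step 1: download a Spanish model into the working directory (needs internet).
model_info <- udpipe_download_model(language = "spanish")

# Load the downloaded model file into memory.
model <- udpipe_load_model(file = model_info$file_model)

# Step 2: annotate some text and convert the result to a data frame.
texts <- c("Me encanta este producto", "El servicio fue muy lento")
annotation <- udpipe_annotate(model, x = texts)
tokens <- as.data.frame(annotation)

# One row per token, with POS tag (upos) and dependency relation (dep_rel).
head(tokens[, c("doc_id", "token", "lemma", "upos", "dep_rel")])
```

Downloading and loading a model is the slow part, which is presumably why the import step caches the model so that annotation can be run repeatedly without fetching it again.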