From f893df5061be8f53fd586b142274b7ed669112c9 Mon Sep 17 00:00:00 2001
From: Jon Harmon
Date: Thu, 3 Mar 2022 07:59:35 -0600
Subject: [PATCH] Prepare for CRAN. (#9)

* Prepare for CRAN.

* Update win server details for rhub.
---
 DESCRIPTION      |  2 +-
 NEWS.md          | 10 +++++++---
 README.Rmd       |  2 +-
 README.md        |  7 +++----
 cran-comments.md | 25 +++++++++++++++++++++++--
 5 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 236552a..2456242 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: wordpiece.data
 Title: Data for Wordpiece-Style Tokenization
-Version: 1.0.2.9000
+Version: 2.0.0
 Authors@R: c(
     person(given = "Jonathan",
            family = "Bratt",
diff --git a/NEWS.md b/NEWS.md
index e0992a3..e021069 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,11 +1,15 @@
+# wordpiece.data 2.0.0
+
+- Breaking change: The wordpiece vocabularies are now character vectors, rather than named integer vectors. Update wordpiece to version 2.1.2 or later for compatibility.
+
 # wordpiece.data 1.0.2

-* Corrected type of loaded vocabularies from double to integer.
+- Corrected type of loaded vocabularies from double to integer.

 # wordpiece.data 1.0.1

-* Initial CRAN release.
+- Initial CRAN release.

 # wordpiece.data 1.0.0

-* Added a `NEWS.md` file to track changes to the package.
+- Added a `NEWS.md` file to track changes to the package.
diff --git a/README.Rmd b/README.Rmd
index 3d44517..e304d93 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -39,7 +39,7 @@ remotes::install_github("macmillancontentscience/wordpiece.data")
 The datasets included in this package were retrieved from huggingface (specifically, [cased](https://huggingface.co/bert-base-cased/resolve/main/vocab.txt) and [uncased](https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt)). They were then processed using the {[wordpiece](https://github.com/macmillancontentscience/wordpiece)} package.
-This is a bit circular, because this package will be used as a dependency for the wordpiece package.
+This is a bit circular, because this package is a dependency for the wordpiece package.

 ```{r process-datasets, eval = FALSE}
 vocab_txt <- tempfile(fileext = ".txt")
diff --git a/README.md b/README.md
index 4c06cf8..4d1975b 100644
--- a/README.md
+++ b/README.md
@@ -36,8 +36,8 @@ and
 [uncased](https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt)).
 They were then processed using the
 {[wordpiece](https://github.com/macmillancontentscience/wordpiece)}
-package. This is a bit circular, because this package will be used as a
-dependency for the wordpiece package.
+package. This is a bit circular, because this package is a dependency
+for the wordpiece package.

 ``` r
 vocab_txt <- tempfile(fileext = ".txt")
@@ -87,8 +87,7 @@ function to load data used by
 library(wordpiece.data)

 head(wordpiece_vocab())
-#>     [PAD] [unused0] [unused1] [unused2] [unused3] [unused4]
-#>         0         1         2         3         4         5
+#> [1] "[PAD]"     "[unused0]" "[unused1]" "[unused2]" "[unused3]" "[unused4]"
 ```

 ## Code of Conduct
diff --git a/cran-comments.md b/cran-comments.md
index e56a9ee..ff10ae5 100644
--- a/cran-comments.md
+++ b/cran-comments.md
@@ -1,12 +1,33 @@
+# Resubmission
+
+## Changes
+
+* Breaking change: The wordpiece vocabularies are now character vectors, rather than named integer vectors.
+
 ## Test environments
-* local R installation, R 4.1.1 (Windows 10)
+* local R installation, R 4.1.2 (Windows 10)
 * win-builder (devel)
-* Windows Server 2008 R2 SP1, R-devel, 32/64 bit (rhub)
+* Windows Server 2022, R-devel, 64 bit (rhub)
 * Ubuntu Linux 20.04.1 LTS, R-release, GCC (rhub)
 * Fedora Linux, R-devel, clang, gfortran (rhub)

+There is a NOTE when testing for Windows Server:
+
+```
+* checking for detritus in the temp directory ... NOTE
+Found the following files/directories:
+  'lastMiKTeXException'
+```
+
+I cannot reproduce this error on my Windows machine, and a web search indicated that it is likely nothing. This package is very simple and I can't find anything that could possibly trigger that error.
+
 ## R CMD check results

 0 errors | 0 warnings | 0 notes

 * These words in DESCRIPTION are NOT misspelled: Tokenization, tokenize, wordpiece, Wordpiece.
+
+
+## Reverse dependencies
+
+wordpiece 2.1.2 handles the difference between this version of wordpiece.data and the previous version.