From f893df5061be8f53fd586b142274b7ed669112c9 Mon Sep 17 00:00:00 2001
From: Jon Harmon <jonthegeek@gmail.com>
Date: Thu, 3 Mar 2022 07:59:35 -0600
Subject: [PATCH] Prepare for CRAN. (#9)

* Prepare for CRAN.

* Update win server details for rhub.
---
 DESCRIPTION      |  2 +-
 NEWS.md          | 10 +++++++---
 README.Rmd       |  2 +-
 README.md        |  7 +++----
 cran-comments.md | 25 +++++++++++++++++++++++--
 5 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 236552a..2456242 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: wordpiece.data
 Title: Data for Wordpiece-Style Tokenization
-Version: 1.0.2.9000
+Version: 2.0.0
 Authors@R: c(
     person(given = "Jonathan",
            family = "Bratt",
diff --git a/NEWS.md b/NEWS.md
index e0992a3..e021069 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,11 +1,15 @@
+# wordpiece.data 2.0.0
+
+- Breaking change: The wordpiece vocabularies are now character vectors, rather than named integer vectors. Update wordpiece to version 2.1.2 or later for compatibility.
+
 # wordpiece.data 1.0.2
 
-* Corrected type of loaded vocabularies from double to integer.
+- Corrected type of loaded vocabularies from double to integer.
 
 # wordpiece.data 1.0.1
 
-* Initial CRAN release.
+- Initial CRAN release.
 
 # wordpiece.data 1.0.0
 
-* Added a `NEWS.md` file to track changes to the package.
+- Added a `NEWS.md` file to track changes to the package.
diff --git a/README.Rmd b/README.Rmd
index 3d44517..e304d93 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -39,7 +39,7 @@ remotes::install_github("macmillancontentscience/wordpiece.data")
 
 The datasets included in this package were retrieved from huggingface (specifically, [cased](https://huggingface.co/bert-base-cased/resolve/main/vocab.txt) and [uncased](https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt)).
 They were then processed using the {[wordpiece](https://github.com/macmillancontentscience/wordpiece)} package.
-This is a bit circular, because this package will be used as a dependency for the wordpiece package.
+This is a bit circular, because this package is a dependency for the wordpiece package.
 
 ```{r process-datasets, eval = FALSE}
 vocab_txt <- tempfile(fileext = ".txt")
diff --git a/README.md b/README.md
index 4c06cf8..4d1975b 100644
--- a/README.md
+++ b/README.md
@@ -36,8 +36,8 @@ and
 [uncased](https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt)).
 They were then processed using the
 {[wordpiece](https://github.com/macmillancontentscience/wordpiece)}
-package. This is a bit circular, because this package will be used as a
-dependency for the wordpiece package.
+package. This is a bit circular, because this package is a dependency
+for the wordpiece package.
 
 ``` r
 vocab_txt <- tempfile(fileext = ".txt")
@@ -87,8 +87,7 @@ function to load data used by
 library(wordpiece.data)
 
 head(wordpiece_vocab())
-#>     [PAD] [unused0] [unused1] [unused2] [unused3] [unused4] 
-#>         0         1         2         3         4         5
+#> [1] "[PAD]"     "[unused0]" "[unused1]" "[unused2]" "[unused3]" "[unused4]"
 ```
 
 ## Code of Conduct
diff --git a/cran-comments.md b/cran-comments.md
index e56a9ee..ff10ae5 100644
--- a/cran-comments.md
+++ b/cran-comments.md
@@ -1,12 +1,33 @@
+# Resubmission
+
+## Changes
+
+* Breaking change: The wordpiece vocabularies are now character vectors, rather than named integer vectors.
+
 ## Test environments
-* local R installation, R 4.1.1 (Windows 10)
+* local R installation, R 4.1.2 (Windows 10)
 * win-builder (devel)
-* Windows Server 2008 R2 SP1, R-devel, 32/64 bit (rhub)
+* Windows Server 2022, R-devel, 64 bit (rhub)
 * Ubuntu Linux 20.04.1 LTS, R-release, GCC (rhub)
 * Fedora Linux, R-devel, clang, gfortran (rhub)
 
+There is a NOTE when testing on Windows Server (rhub):
+
+```
+* checking for detritus in the temp directory ... NOTE
+Found the following files/directories:
+  'lastMiKTeXException'
+```
+
+I cannot reproduce this NOTE on my Windows machine, and a web search indicated that it is likely harmless. This package is very simple, and I can't find anything in it that could trigger the NOTE.
+
 ## R CMD check results
 
 0 errors | 0 warnings | 0 notes
 
 * These words in DESCRIPTION are NOT misspelled: Tokenization, tokenize, wordpiece, Wordpiece.
+
+
+## Reverse dependencies
+
+wordpiece 2.1.2 handles the difference between this version of wordpiece.data and the previous version.