`em` is a command-line program for managing the NYU Press open access EPUBs made available online
on Open Square.
Current functions:
- Intake of EPUB files: creation of the normalized, exploded EPUBs that are stored in the nyu-press-readium-epub-content private repo
- Solr indexing of EPUB metadata from the source files generated by the `metadata` command
- Management of the legacy EPUB handles in the handle server
- Creation of the normalized metadata files used for publication, which are stored in dlts-epub-metadata
- Creation and editing of `epub_library.json` files, which the ReadiumJS viewer uses to describe a library (note that `epub_library.json` is a legacy format file that has been superseded by `epub_library.opds` -- OPDS XML)
- Writing out metadata dump files for analysis
`em` can operate in either immediate execution or interactive shell mode.
- Node.js
- Java for running the bundled Solr v6.6.5 used for `solr` tests
- yarn for installing the dependencies (`npm install` sometimes fails)
To use `em` for processing NYU Press collections:
Step 1) Clone the repo and install NPM packages using yarn (`npm` sometimes fails):
git clone https://github.com/NYULibraries/dlts-epub-manager.git epub-manager
cd epub-manager
yarn
Step 2) Clone the metadata and exploded EPUB repos:
git clone https://github.com/NYULibraries/dlts-epub-metadata.git ~/epub-metadata
# This is a private repo and can only be accessed by DLTS and technical partners.
git clone https://github.com/nyudlts/nyu-press-readium-epub-content ~/nyu-press-readium-epub-content
Step 3) Make private configuration files for dev, stage, and prod. Private
configuration files contain sensitive information that cannot be committed into
the repo in the dev.json, stage.json, and prod.json files in `config/`, such as the
usernames and passwords for our restful handle servers and the API key for the Supafolio API.
somebody@host:~/epub-manager$ cat config-private/dev.json
{
"restfulHandleServerUsername" : "[USERNAME FOR DEV RESTFUL HANDLE SERVER]",
"restfulHandleServerPassword" : "[PASSWORD FOR DEV RESTFUL HANDLE SERVER]",
"supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
somebody@host:~/epub-manager$ cat config-private/stage.json
{
"restfulHandleServerUsername" : "[USERNAME FOR STAGE RESTFUL HANDLE SERVER]",
"restfulHandleServerPassword" : "[PASSWORD FOR STAGE RESTFUL HANDLE SERVER]",
"supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
somebody@host:~/epub-manager$ cat config-private/prod.json
{
"restfulHandleServerUsername" : "[USERNAME FOR PROD RESTFUL HANDLE SERVER]",
"restfulHandleServerPassword" : "[PASSWORD FOR PROD RESTFUL HANDLE SERVER]",
"supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
Step 4) Make a local configuration if needed. The `intake` and `metadata`
commands currently require a local configuration file (see
Special note about configuration of `intake` and
Special note about configuration of `metadata`):
somebody@host:~/epub-manager$ ls config/
dev.json prod.json stage.json
somebody@host:~/epub-manager$ cat > config/local.json
{
"cacheMetadataInMemory" : true,
"intakeEpubDir" : "/home/somebody/epubs/publish/nyupress/wip",
"intakeEpubList" : null,
"intakeOutputDir" : "/home/somebody/nyu-press-readium-epub-content/",
"metadataDir" : "/home/somebody/epub-metadata/nyupress",
"metadataEpubList" : null,
"readiumJsonFile" : "/home/somebody/nyu-press-readium-epub-content/epub_library.json",
"restfulHandleServerHost" : "localhost:9002",
"restfulHandleServerPath" : "/id/handle",
"solrHost" : "localhost",
"solrPort" : 8080,
"solrPath" : "/solr"
}
Don't forget the private configuration file:
somebody@host:~/epub-manager$ cat config-private/local.json
{
"restfulHandleServerUsername" : "[USERNAME FOR CHOSEN RESTFUL HANDLE SERVER]",
"restfulHandleServerPassword" : "[PASSWORD FOR CHOSEN RESTFUL HANDLE SERVER]",
"supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
See Configuration file format for more details. Also
see Special note about configuration of `intake`.
Intake new EPUBs - local configuration (see Special note about configuration of `intake`):
# Intake EPUB files and output normalized exploded EPUB directories.
./em intake add local
Create metadata files - local configuration (see Special note about configuration of `metadata`):
# Create metadata files.
./em metadata add local
Handles processing - prod configuration:
# Add all prod handles to handle server.
./em handles add prod
# Delete prod handles from handle server.
./em handles delete prod
Solr indexing - dev configuration:
# Add all dev EPUB metadata to Solr index.
./em solr add dev
# Delete dev EPUB metadata from Solr index.
./em solr delete dev
# Delete everything from Solr index.
./em solr delete all dev
# Same as `delete all` followed by `add`.
./em solr full-replace dev
`epub_library.json` file editing - local configuration:
# Add all local EPUB metadata to file.
./em readium-json add local
# Delete local EPUB metadata from file.
./em readium-json delete local
# Delete everything from file.
./em readium-json delete all local
# Same as `delete all` followed by `add`.
./em readium-json full-replace local
Load prod configuration metadata and write to file: start interactive shell,
run `load prod` followed by `load write`.
somebody@host:~/epub-manager$ ./em
em$ load prod
Cloning into '/Users/david/Documents/programming/src/dlts/epub-manager/cache/metadataRepo'...
Already on 'master'
em$ load write
Metadata dumped to /home/somebody/epub-manager/cache/metadata.json.
em$ quit
somebody@host:~/epub-manager$ # Metadata for prod was written to JSON file in cache directory.
somebody@host:~/epub-manager$ ls cache/metadata.json
cache/metadata.json
Get help message (note that `publish` and `verify` have not been implemented yet):
somebody@host:~/epub-manager$ ./em help
Commands:
help [command...] Provides help for a given command.
exit Exits application.
handles add [configuration] Bind EPUB handles.
handles delete [configuration] Unbind EPUB handles.
intake add [configuration] Intake EPUBs and generate Readium versions.
load <configuration> Read in configuration file and load resources.
load write [file] Write metadata out to file.
load clear Clear all loaded metadata.
metadata add [configuration] Generate metadata files from Supafolio API.
publish [options] Publish EPUBs.
publish add [options] Add EPUBs.
publish delete [options] Delete EPUBs.
publish delete all [options] Delete all EPUBs.
publish full-replace [options] Replace all EPUBs.
readium-json add [configuration] Add EPUBs to `epub_library.json` file.
readium-json delete [configuration] Delete EPUBs from `epub_library.json` file.
readium-json delete all [configuration] Delete all EPUBs from `epub_library.json` file.
readium-json full-replace [configuration] Replace entire `epub_library.json` file.
solr add [configuration] Add EPUBs to Solr index.
solr delete [configuration] Delete EPUBs from Solr index.
solr delete all [configuration] Delete all EPUBs from Solr index.
solr full-replace [configuration] Replace entire Solr index.
verify Verify integrity of published collection, handles, and metadata indexes.
Get help for specific commands in interactive mode:
somebody@host:~/epub-manager$ ./em
em$ help load
Usage: load [options] <configuration>
Read in configuration file and load resources.
Options:
--help output usage information
em$ help solr
Commands:
solr add [configuration] Add EPUBs to Solr index.
solr delete [configuration] Delete EPUBs from Solr index.
solr delete all [configuration] Delete all EPUBs from Solr index.
solr full-replace [configuration] Replace entire Solr index.
em$
`em` is built using Vorpal, a Node.js framework
for building interactive CLI applications. The various EPUB management functions
are executed using specific commands: `handles`, `intake`, `load`, `metadata`, `readium-json`,
and `solr`.
Most `em` commands and subcommands can be run immediately from the command line
by passing them as arguments to the `em` script. A relatively small subset
of commands can only be run in the interactive shell because they must be run
as part of a sequence of commands.
The `help` command lists all these function commands along with information about
their subcommands and options. For help on individual commands, use `help COMMAND`.
Note that the following commands are listed in `help` but are not yet implemented:
`publish`, `verify`.
These have been set up as placeholders only (and for testing).
While in interactive shell mode, the following features are available:
- Autocompletion via the `tab` key. Commands can be autocompleted, as can their subcommands. In addition, for commands that take the `[configuration]` option, there is autocompletion for the names of the configuration files in `config/` (minus their *.json suffixes).
- Command history using the up and down arrows.
Most of the commands share a similar set of subcommands which run specific
operations whose semantics are generally the same for all commands. In each
case, EPUB-related data are first loaded by a `load [configuration]` operation
(which is performed transparently if `[configuration]` is used with the current
command). The subcommand then performs operations on the destination, which is
usually a datastore of some kind or a filesystem.
- `add`: add EPUB data to the destination, updating in place any EPUBs that already exist. Do not delete any existing EPUBs.
- `delete`: delete the EPUB data specified by `[configuration]` from the destination. Do not delete any other data for EPUBs that are already there.
- `delete all`: delete all EPUB data from the destination, regardless of whether the EPUBs are specified in `[configuration]`.
- `full-replace`: this is a `delete all` followed by an `add`.
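The subcommand semantics above can be illustrated with a minimal JavaScript sketch. This is not how `em` is implemented: the `Map`-based destination, the function names `add`, `del`, `deleteAll`, and `fullReplace`, and the placeholder titles are all hypothetical stand-ins for a real destination such as a Solr index, the handle server, or an `epub_library.json` file.

```javascript
// Hypothetical model: `destination` is a Map keyed by EPUB id, and `loaded`
// holds the EPUBs selected by `load [configuration]`.

// add: insert or update in place; never removes existing entries.
function add(destination, loaded) {
    for (const [id, metadata] of loaded) {
        destination.set(id, metadata);
    }
    return destination;
}

// delete: remove only the EPUBs named by the loaded configuration.
function del(destination, loaded) {
    for (const id of loaded.keys()) {
        destination.delete(id);
    }
    return destination;
}

// delete all: empty the destination regardless of what was loaded.
function deleteAll(destination) {
    destination.clear();
    return destination;
}

// full-replace: `delete all` followed by `add`.
function fullReplace(destination, loaded) {
    return add(deleteAll(destination), loaded);
}

// Placeholder data for illustration only.
const loaded = new Map([
    ['9780814707821', { title: 'Placeholder A' }],
    ['9780814707517', { title: 'Placeholder B' }],
]);
const destination = new Map([['9780814725078', { title: 'Already there' }]]);

add(destination, loaded);
add(destination, loaded); // A repeated `add` only updates in place.
console.log(destination.size);

del(destination, loaded); // The entry that was not loaded is left alone.
console.log([...destination.keys()]);

fullReplace(destination, loaded); // Destination now mirrors `loaded`.
console.log([...destination.keys()]);
```

Because `add` updates in place, running it twice is harmless, which matches the behavior shown in the `readium-json add local` example later in this document.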
See Quickstart for some basic usage examples. Below are some more
detailed use cases. No detailed use case is provided for `intake` because this
command has only one subcommand, `add`, and is usually run as a one-shot
(see Special note about configuration of `intake`).
Most of the examples given will employ the interactive shell. With few exceptions, the command invocations shown can also be performed in immediate execution mode. For example, the following command invocations do the same thing:
In `em` shell, using the `tab` key to get suggestions for `[configuration]`:
somebody@host:~/epub-manager$ ./em
em$ readium-json add
dev local prod stage
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
Immediately executed on the command line:
somebody@host:~/epub-manager$ ./em readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
EXAMPLE: Update Solr index and `epub_library.json` for `local`
(from Installation and setup), then add to Solr index for dev.
Note that the `local` configuration specifies `metadataDir`, while
dev specifies `metadataRepo`, `metadataRepoBranch`, and `metadataRepoSubdirectory`.
somebody@host:~/epub-manager$ ./em
em$ load local
em$ solr add
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ readium-json add
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ load dev
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
em$ solr add
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ quit
...or...
somebody@host:~/epub-manager$ ./em
em$ solr add local
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ load dev
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
em$ solr add dev
Cloning into '/Users/david/Documents/programming/src/dlts/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ quit
Note that it is not possible to rewrite the remote `epub_library.json` file sitting on
the dev server. The `epub_library.json` file rewrite is always local. Thus, in this use case, the user
presumably switched the local repo `/home/somebody/nyu-press-readium-epub-content/`
to the `develop` branch before running the `readium-json` command.
Rewriting the `epub_library.json` file for a local instance of ReadiumJS viewer
would have involved changing the `readiumJsonFile` option in `config/local.json` from the
path to the repo copy `/home/somebody/nyu-press-readium-epub-content/epub_library.json`
to the path of the library content directory of a locally installed ReadiumJS viewer:
e.g. `/var/www/html/readium-js-viewer/cloud-reader/epub_content/epub_library.json`.
EXAMPLE: Dump metadata for 3 EPUBs into file `cache/3-epubs.json`, then delete them
from the `stage` Solr index, then dump the metadata again to `/tmp/3-epubs.json`.
Note that `load write [file]` cannot be run in immediate execution mode, because
it must be preceded by `load [configuration]`.
Copy `config/stage.json` to `config/ad-hoc.json` (for example), copy
`config-private/stage.json` to `config-private/ad-hoc.json` so that the new configuration
has its required private counterpart, and in `config/ad-hoc.json` change:
"metadataEpubList" : null,
...to:
"metadataEpubList" : [ "9780814707821", "9780814707517", "9780814725078" ],
...then:
somebody@host:~/epub-manager$ ./em
em$ load ad-hoc
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'stage'
em$ load write cache/3-epubs.json
Metadata dumped to cache/3-epubs.json.
em$ quit
somebody@host:~/epub-manager$ ls cache/3-epubs.json
cache/3-epubs.json
somebody@host:~/epub-manager$ ./em
em$ solr delete ad-hoc
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'stage'
Deleted 9780814707821 from Solr index.
Deleted 9780814707517 from Solr index.
Deleted 9780814725078 from Solr index.
Deleted 3 EPUBs.
em$ quit
somebody@host:~/epub-manager$ cat cache/3-epubs.json
cat: cache/3-epubs.json: No such file or directory
somebody@host:~/epub-manager$ # Whoops, cache/ was cleared when `em` was restarted for `solr delete ad-hoc`.
somebody@host:~/epub-manager$ # Write the file again, this time to /tmp/:
somebody@host:~/epub-manager$ ./em
em$ load ad-hoc
em$ load write /tmp/3-epubs.json
Metadata dumped to /tmp/3-epubs.json.
em$ quit
somebody@host:~/epub-manager$ ls /tmp/3-epubs.json
/tmp/3-epubs.json
EXAMPLE: Add handles for `prod`, then delete handles specified in `ad-hoc`.
somebody@host:~/epub-manager$ ./em
em$ handles add prod
Cloning into '/Users/david/Documents/programming/src/dlts/epub-manager/cache/metadataRepo'...
Already on 'master'
Added 67 handles to handles server:
9780814707821: 2333.1/37pvmfhh
9780814707517: 2333.1/4tmpg641
9780814725078: 2333.1/zgmsbf5k
9780814723418: 2333.1/9s4mw88v
9780814786086: 2333.1/tqjq2dn7
9780814786123: 2333.1/ffbg7c4r
...
[SNIPPED]
em$ handles delete ad-hoc
Cloning into '/Users/david/Documents/programming/src/dlts/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
Added 3 handles to handles server:
9780814784891: 2333.1/b8gthvz5
9781479863570: 2333.1/73n5tfjs
9781479829712: 2333.1/brv15j8p
em$ quit
EXAMPLE: Delete all EPUBs in `epub_library.json` file for `local`, then add `local` EPUBs
twice, then do a full replace.
somebody@host:~/epub-manager$ ./em
em$ readium-json delete all local
Deleted all EPUBs from /home/somebody/nyu-press-readium-epub-content/epub_library.json.
em$ quit
somebody@host:~/epub-manager$ cat /home/somebody/nyu-press-readium-epub-content/epub_library.json
[]
somebody@host:~/epub-manager$ # Accidentally add `local` EPUBs twice. The second
somebody@host:~/epub-manager$ # `add` will simply update with the same content.
somebody@host:~/epub-manager$ ./em
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ quit
somebody@host:~/epub-manager$ # Verify that the file only has 67 EPUBs in it, despite
somebody@host:~/epub-manager$ # having run `readium-json add local` twice.
somebody@host:~/epub-manager$ grep '"identifier":' /home/somebody/nyu-press-readium-epub-content/epub_library.json | wc -l
67
somebody@host:~/epub-manager$ # But do a full replace anyway...
somebody@host:~/epub-manager$ ./em
em$ readium-json full-replace local
Deleted all EPUBs from /home/somebody/nyu-press-readium-epub-content/epub_library.json.
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
Fully replaced all EPUBs in Readium JSON for conf local.
em$ quit
somebody@host:~/epub-manager$ grep '"identifier":' /home/somebody/nyu-press-readium-epub-content/epub_library.json | wc -l
67
# Run all acceptance and unit tests
yarn test
# Run unit tests
yarn test:lib
# Run individual unit tests or group of tests
node_modules/.bin/jest [PATH TO *.test.js* FILE OR FILES]
# Run acceptance tests
yarn test:acceptance
# Run acceptance tests for individual commands
# Note that certain acceptance test suites cannot be run simultaneously --
# see https://jira.nyu.edu/jira/browse/NYUP-742. For this reason, `yarn test:acceptance`
# uses the --runInBand Jest option.
node_modules/.bin/jest test/acceptance/handles
node_modules/.bin/jest test/acceptance/intake
node_modules/.bin/jest test/acceptance/load
node_modules/.bin/jest test/acceptance/metadata
node_modules/.bin/jest test/acceptance/readium-json
node_modules/.bin/jest test/acceptance/solr
Note that the `solr` tests require that the `test/solr/` Solr instance be running.
If it is not running, the test will produce an error message with instructions on how to
start the test Solr:
somebody@host:~/epub-manager$ mocha test/acceptance/solr
solr command
1) "before all" hook
0 passing (184ms)
1 failing
1) solr command "before all" hook:
AssertionError:
Solr is not responding. Try running Solr setup and start script:
test/solr/start-solr-test-server.sh
Error: connect ECONNREFUSED 127.0.0.1:9001
at Context.before (test/acceptance/solr.js:41:20)
The `test/solr/start-solr-test-server.sh` script is a modified version of
django-haystack's.
Running it will do the following:
- Download the appropriate Solr archive to `test/solr/download-cache/`. This step is skipped if the archive exists already.
- Unpack the archive, install the Solr server and configure it using the files in `test/solr/config-files/`.
- Start Solr on port 9001 in the foreground. To start it in the background, set `BACKGROUND_SOLR` to a non-empty value:
BACKGROUND_SOLR=true test/solr/start-solr-test-server.sh
To stop the server, simply `kill` the process.
Configuration files are stored in `config/` and `config-private/`. Each file in
`config/` must have a corresponding, identically named file in `config-private/`
for storing sensitive information related to that configuration.
The basenames of the files in `config/` are the configuration names that can be
specified as options for various `em` commands, and are used as autocomplete
possibilities for commands that take a configuration option.
The dev, stage, and prod
configurations for NYU Press collections are already included in the repo in
`config/`. Individual clones of this repo must have local `config-private/` files
corresponding to these three configurations. See
Installation and setup, Step 3.
New configuration files can be created in `config/` and will be ignored by git.
The contents of `config-private/` are ignored by git entirely.
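Since every file in `config/` needs an identically named counterpart in `config-private/`, a quick check can catch a missing private file before running a command. The following is a minimal sketch and not part of `em`: the helper names `configNames` and `missingPrivateConfigs` are hypothetical, and the script assumes it is run from the root of the epub-manager clone.

```javascript
const fs = require('fs');
const path = require('path');

// List configuration names: the basenames of the *.json files in a directory.
// A directory that does not exist yields an empty list.
function configNames(dir) {
    if (!fs.existsSync(dir)) {
        return [];
    }
    return fs.readdirSync(dir)
        .filter((file) => path.extname(file) === '.json')
        .map((file) => path.basename(file, '.json'))
        .sort();
}

// Return the names in configDir with no identically named file in privateDir.
function missingPrivateConfigs(configDir, privateDir) {
    const privateNames = new Set(configNames(privateDir));
    return configNames(configDir).filter((name) => !privateNames.has(name));
}

// Assumption: the current working directory is the epub-manager clone.
const missing = missingPrivateConfigs('config', 'config-private');
if (missing.length > 0) {
    console.log(`Missing config-private/ files for: ${missing.join(', ')}`);
} else {
    console.log('No config/ file is missing a config-private/ counterpart.');
}
```

The names returned by `configNames('config')` are the same basenames that can be passed as `[configuration]` to `em` commands.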
`config/` file properties:
- cacheMetadataInMemory: `true` to load all metadata at once into memory for faster processing, otherwise `false`. Currently only `true` is supported.
- intakeEpubDir: directory containing the *.epub files to be processed by the intake system. The subdirectory names are also used as the ISBN list for the `metadata` command if intakeEpubList is not specified.
- intakeEpubList: array of EPUB ids specifying the EPUBs to be processed by the intake system. All other EPUBs will be ignored. If this option is not specified then the names of the subdirectories in intakeEpubDir will be used for the EPUB list. Example: [ "9780814707821", "9780814707517", "9780814725078" ]
- intakeOutputDir: directory to output the normalized, exploded EPUBs to
- metadataDir: full path to the directory containing the metadata files. For NYU Press collections, this would be the `nyupress` directory in the local clone of the dlts-epub-metadata repo. If this option is specified, metadata repo options will be ignored. Example: "/home/somebody/epub-metadata/nyupress"
- metadataEpubList: array of EPUB ids specifying the EPUBs to be processed by the metadata system. All other EPUBs will be ignored. If this option is not specified then the names of the subdirectories in metadataDir will be used for the EPUB list. Example: [ "9780814707821", "9780814707517", "9780814725078" ]
- Metadata repo options -- these will be ignored if metadataDir has been specified.
  - metadataRepo: URL for the git repo containing the metadata. The repo will be cloned locally using `git clone [metadataRepo]`. Example: "https://github.com/NYULibraries/dlts-epub-metadata.git"
  - metadataRepoBranch: branch or commit to use. Will be checked out using `git checkout [metadataRepoBranch]`. Examples:
    - "master"
    - "0c18465a5c80c056088e98d45b6dd621e6001a7b"
  - metadataRepoSubdirectory: relative path to subdirectory containing the metadata to be processed. Example: "nyupress"
- readiumJsonFile: full path to the `epub_library.json` file. Example: "/home/somebody/nyu-press-readium-epub-content/epub_library.json"
- restfulHandleServerHost: hostname of the restful handle server. Example: "devhandle.dlib.nyu.edu"
- restfulHandleServerPath: path to use for handle requests. Example: "/id/handle"
- solrHost: hostname of Solr server. Example: "localhost"
- solrPort: port that Solr is running on. Example: 8080
- solrPath: path to use for Solr requests. Example: "/solr/nyupress"
`config-private/` file properties:
- restfulHandleServerUsername: user authorized to add, update, and delete on the restful handle server.
- restfulHandleServerPassword: password for the user authorized to add, update, and delete on the restful handle server.
- supafolioApiKey: API key for the Supafolio Open Square catalog.
For example configuration files that illustrate the correct usage of all the above options, look in config/ and test/acceptance/fixtures/config/.
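The fallback described for `intakeEpubList` and `metadataEpubList` (use the explicit list if given, otherwise the subdirectory names of the corresponding directory) can be sketched as follows. This is an illustration under assumptions, not `em`'s actual code: `resolveEpubList` is a hypothetical helper, any non-array value such as `null` is treated as "not specified", and the sort is only there to make the output deterministic.

```javascript
const fs = require('fs');

// Return the EPUB ids to process: the explicit list if one is configured,
// otherwise the names of the subdirectories of `dir`.
function resolveEpubList(list, dir) {
    if (Array.isArray(list)) {
        return list;
    }
    return fs.readdirSync(dir, { withFileTypes: true })
        .filter((entry) => entry.isDirectory())
        .map((entry) => entry.name)
        .sort();
}

// With an explicit list, the directory argument is never read.
console.log(resolveEpubList(['9780814707821', '9780814707517'], '/unused'));
```

With `"metadataEpubList" : null` and `"metadataDir" : "/home/somebody/epub-metadata/nyupress"` from the example `config/local.json`, the equivalent call would be `resolveEpubList(null, '/home/somebody/epub-metadata/nyupress')`.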
`intake` configuration is not included in the dev, stage, and prod
configurations because `intake` is a one-shot process that is done when a collection
is first received by the publisher, and is never run again as part of a publication job
to dev, stage, or prod. `intake` creates the data and metadata that are consumed
by the other commands like `handles` or `solr` that are run on a more regular basis
against dev, stage, and prod servers.
The `metadata` command currently determines which ISBNs to fetch metadata for by reading the
subdirectory names in `intakeEpubDir` or the ISBN list in `intakeEpubList`.
Note that these options are not present in the dev, stage, and prod
configurations - see Special note about configuration of `intake`.
This means that there are no dev, stage, or prod configurations for the `metadata`
command either.
This is intentional, as the `metadata` command targets a local directory and does not run against
dev, stage, or prod servers or in dev, stage, or prod environments. This local directory
will usually be a local clone of dlts-epub-metadata,
with dev, stage, or prod branches checked out as needed.
## Future enhancements
- Solr indexing of EPUB full-text content
- Creation and editing of epub_library.opds files
- One-step publication of EPUBs: full processing of new EPUBs -- all functions performed in one step
- Verification of collection: check that decompressed EPUB files, Solr index, and `epub_library.json` file are in sync