Stylometry_with_R_instructions.Rmd

---
title: "Stylometry with `stylo`"
author: Maciej Eder, Joanna Byszuk, Jan Rybicki
date: 26.05.2018
output:
  rmdshower::shower_presentation:
    self_contained: false
    theme: ribbon
    ratio: 4x3
    katex: true
    fig_width: 9
    fig_height: 6
#output: 
#  revealjs::revealjs_presentation:
#    theme: white
#    highlight: pygments
#    center: false
#    transition: fade
#    self_contained: false
#    fig_width: 9
#    fig_height: 6
#    fig_caption: false
---


## Disclaimer

This crash tutorial introduces basic functions of the package `stylo`, in order to help the users start their experiments in no time. Essentially this means that no theoretical background will be provided. Also, the discussion on the functionalities of the package `stylo` will be reduced as much as possible. For more details, please refer to the following resources:

* for beginners: a concise [HOWTO](https://sites.google.com/site/computationalstylistics/stylo/stylo_howto.pdf)
* for advanced users: a paper in [R Journal](https://journal.r-project.org/archive/2016/RJ-2016-007/RJ-2016-007.pdf)
* full documentation [at CRAN](https://cran.r-project.org/web/packages/stylo/stylo.pdf)


## Installing `stylo`

* run R
* type `install.packages("stylo")`
* pick your R server
* click `OK`
* done!


## Some basic R functions, just in case

* to activate a package: `library(stylo)`
* to set working directory: `setwd("path/to/my/stuff")`
* to find your current location: `getwd()`
* to list files in your current location: `list.files()`
* to get help: `help(<function>)`, e.g. `help(stylo)`
* to quit R: `q()`


## Main functions: `stylo()`

* It computes distances (differences) between texts, ...
* ... represented as rows of frequencies of most frequent words.
* Then it plots graphs of those distances:
    * **Cluster Analysis** plots (dendrograms)
    * **Multidimensional Scaling** scatterplots
    * **Principal Components Analysis** scatterplots
    * **Bootstrap Consensus Trees** plots (for multiple parameter settings)
    * **Bootstrap Consensus Networks** (other software will be needed to take over)
* The plots can be both displayed on screen and saved to a file (e.g. PNG).


## Functions: `stylo.network()`


* It is an extended version of the function `stylo()`.
* It performs Bootstrap Consensus Networks, or a network-like generalization of the Bootstrap Consensus Trees method.
* It produces interactive visualizations in a web browser: to make it happen, you have to install an additional R package first. Type: `install.packages("networkD3")`


## Main functions: `classify()`

* It trains a model for pre-defined groups of texts, e.g. authors.
* Then it computes distances (differences) between texts, ...
* ... represented as rows of frequencies of most frequent words.
* Finally, it compares the trained models with test texts, using:
    * **Delta** classifier (lazy learner introduced by Burrows)
    * **k-NN** classifier (lazy learner relying on >1 neighbors)
    * **Suppor Vector Machines**, a high-performance non-probabilistic classifier
    * **Naive Bayes**, a classical yet slightly outdated classifier
    * **Nearest Shrunken Centroids**, a classifier for high-dimensional datasets
* A final report of the classifier’s performance is outputted.


## Main functions: `oppose()`

* Designed to compare two (groups of) texts
* It cuts input texts into equal-sized samples
* Finds words characteristic for two (groups) of texts
    * These can be reused with `stylo()` or `classify()`
* Produces a diagram of the use of each group’s words


## Functions: `rolling.classify()`

* Looks for traces of authors in a co-authored text...
* ... by sliding through this text sequentially in order to detect peculiarities.
* Produces a graph of the respective strengths of these traces.


## Functions: `imposters()`

* Performs a computation-heavy technique of authorship verification, referred to as the General Imposters method.
* It compares a disputed text against (a) some texts by a potential author of that text, and (b) several texts written by people who could not have written it (aka the imposters).
* In many random iterations it estimates if the text in question was more likely written by the candidate, or by any of the imposters. 


## Preparing a corpus

* before you launch R, ...
* in your favourite folder, create a subfolder named `corpus`
* put your raw text files there, e.g.:
    * `Shakespeare_Hamlet.txt`
    * `Kyd_Tragedy.txt`
    * etc.
* to be on the safe side - make sure your files are in UTF-8, especially if the language of your corpus contains diacritics!


## Running `stylo()`

1. activate the package 
    * type `library(stylo)`, hit ENTER
2. navigate to your favourite folder:
    * geeks: `setwd("the/path/to/my/favourite/folder")`
    * RStudio users: find your directory in the **Files** panel, then 
      use **Menu > More > Set as Working Directory**
    * Windows users: use **Menu > File > Change directory**
    * MaxOS users: use **Menu > Misc > Change working directory**
3. launch the main function: 
    * type `stylo()`, hit ENTER


## `stylo()` parameters

* INPUT: state your texts’ format
* LANGUAGE: self-evident
* Don’t press OK yet!

![title](img/stylo_1.png){width=500px}


## `stylo()` parameters


* FEATURES: things to count: words or characters
    * ngram size: say 1 for single features, 2 for 2-grams, etc. In a vast majority of cases, you’ll choose word 1-grams.
* MFW SETTINGS: how many most frequent words to use
    * in most cases, `Minimum = Maximum`

![title](img/stylo_2.png){width=500px}


## `stylo()` parameters

* CULLING: optionally, filter out some words you **don’t** want to analyze. Examples:
    * 0 – all the words survive culling
    * 20 – a given word has to appear in at least 20% texts
    * 100 – an extreme filter: all words that don’t appear in all the texts are removed
* DELETE PRONOUNS: optionally, filter out personal pronouns.
    * Attention! The set of pronouns is chosen according to the selected language


## `stylo()` parameters

* STATISTICS: Cluster Analysis, Multidimensional Scaling, etc.
* DISTANCES: choose how the similarities between texts should be measured
    * Classic Delta: perhaps a best choice to start
    * Cosine Delta (aka Wurzburg Delta): perhaps an even better choice
    * Eder’s Delta: a good choice for highly inflected languages

![title](img/stylo_3.png){width=500px}


## `stylo()` parameters

* SAMPLING: do I want to split the texts
    * no sampling: the texts will be analyzed as they are
    * normal sampling: dividing the texts into equal-sized blocks
    * random sampling: randomly harvesting _N_ words from each text
    * number of samples: random harvesting can be repeated _n_ times

![title](img/stylo_4.png){width=500px}


## `stylo()` parameters

* OUTPUT: most of the options are self-evident. Some might be useful:
  * on screen: make sure that this is switched on to see your results
  * PCA flavour: choose “loadings” to see the discrimination power of particular features (but first, choose PCA on STATISTICS tab)
  * horizontal CA tree: use it to position your dendrograms horizontally 

![title](img/stylo_5.png){width=500px}


## Bootstrap Consensus Networks

* There are two ways of computing a stylometric network:
* interactive: 
    * run the function `stylo.network()`
    * set the parameters as for Bootstrap Consensus Networks
    * a web browser will start automatically: your network is there!
* static, yet highly customizable:
    * run the function `stylo(network = TRUE)`
    * find a file named `..._C_0.5_EDGES.csv` in your working directory
    * load it into a network manipulation tool, e.g. [Gephi](https://gephi.org/)


## Running Gephi: Import

* select **GEPHI > New Project**
* **Data laboratory > Import Spreadsheet**
* Import settings:
    * Separation: Comma
    * As table: Edges table
    * Charset: don’t bother if your filenames contain no fancy characters, e.g. `Brontë_Wuthering.txt`; otherwise choose wisely
    * Next
    * Select ALL available options
    * Finish!


## Running Gephi: Labels

* We need to get authors’ names...
* Select **Create column with list of regex matching groups > ID**. 
	* Title: of your choice, e.g. Author, 
	* Regular Expression: limit what content to extract from the ID, 
		* e.g. to extract just the author from `Brontë_Wuthering.txt`: `^[A-Za-z]+`
* OK!
* **Copy data to other column > Author** to **Label**


## Running Gephi: Overview

* Go to **Overview** panel
* In **Appearance >  Nodes > Partition**
* Select **Author**
* Apply
* Show labels (T icon on the bottom panel)


## Running Gephi: Overview Layout

* Select **ForceAtlas 2**
* If you have a lot of data:
	* Dissuade Hubs
	* Prevent Overlap
* Edge Weight Influence 0.5
* Scaling: 500
* Run!


## Running Gephi: Overview Layout (cont.)

* To make labels align:
	* select **Label Adjust**
* If your nodes stick together:
	* select **Expansion**
* Run! (for Expansion - a couple of times)


## Running Gephi: Preview

* Go to **Preview** panel
* Node labels
    * Show labels
* Edges
    * Show edges
* Thickness: play with 0.1, 0.01, etc.
* Refresh!
* Reset zoom

## Running Gephi: saving a network

* File > Save
    * with `.gephi` extension
* Picture
    * File > Export > SVG/PDF*/PNG
* For PDFs **Options > Landscape** & increase margins!
* OK!


## Running `oppose()`

* Different subfolder structure:
    * `primary_set`
    * `secondary_set`
    * `test_set` (optional)
* Running the function:
    * `library(stylo)` <ENTER>
    * `oppose()` <ENTER>
* What we get:
    * `words_preferred.txt` characteristic for the `primary_set` texts
    * `words_avoided.txt` characteristic for the `secondary_set` texts
* word frequency graph


## `oppose()` parameters

* Slice length: size (in words) of the samples (5000)
* Slice overlap: (0)
* Method: (Craig’s Zeta)
* Visualization: type of graph (Markers)


## `oppose()` parameters

Most of the parameters for this somewhat underdeveloped function are not on GUI. You can switch them on as command line parameters

* when your corpus contains non-Latin diacritics:
    * `oppose(encoding = "UTF-8", corpus.lang = "Spanish")`


## Running `rolling.classify()`

* Different subfolder structure:
    * `reference_set` (individual writings)
    * `test_set` (collaborative text)
* Running the function:
    * `library(stylo)` <ENTER>
    * Example: `rolling.classify(write.png.file = TRUE, classification.method = "svm", mfw = 100, training.set.sampling = "normal.sampling", slice.size = 5000, slice.overlap = 4500)`


## Running `rolling.classify()`
How to read the output? Look at this example for "Roman de la Rose":
![title](img/rolling-svm_100-features_5000-per-slice.png){width=800px}
Sections attributed to Guillaume de Lorris are marked red, and to Jean de Meun - green. The thickness of the bottom stripe indicates certainty of classification and a vertical dashed line the commonly-accepted division.