<!DOCTYPE html>
<html>
<head>
  <title>Stylometry with stylo</title>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="Stylometry_with_R_instructions_files/shower-ribbon/package/styles/screen-4x3.css">
  <link rel="stylesheet" href="Stylometry_with_R_instructions_files/rmdshower/style-common.css">
  <link rel="stylesheet" href="Stylometry_with_R_instructions_files/rmdshower/style-ribbon.css">
  <link rel="stylesheet" href="Stylometry_with_R_instructions_files/shower-ribbon/style-override.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css">
  <script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"></script>
  <script src="Stylometry_with_R_instructions_files/rmdshower/auto-render.min.js" type="text/javascript"></script>
  
  
  
  
  
</head>

<body class="shower list">

  <header class="caption">
    <h1>Stylometry with stylo</h1>
    <p>Maciej Eder, Joanna Byszuk, Jan Rybicki</p>
  </header>

  
  
<section id="disclaimer" class="slide level2">
<h2>Disclaimer</h2>
<p>This crash course introduces the basic functions of the package <code>stylo</code>, in order to help users start their experiments in no time. This means that no theoretical background is provided, and the discussion of the package's functionalities is kept to a minimum. For more details, please refer to the following resources:</p>
<ul>
<li>for beginners: a concise <a href="https://sites.google.com/site/computationalstylistics/stylo/stylo_howto.pdf">HOWTO</a></li>
<li>for advanced users: a paper in <a href="https://journal.r-project.org/archive/2016/RJ-2016-007/RJ-2016-007.pdf">R Journal</a></li>
<li>full documentation <a href="https://cran.r-project.org/web/packages/stylo/stylo.pdf">at CRAN</a></li>
</ul>
</section>
<section id="installing-stylo" class="slide level2">
<h2>Installing <code>stylo</code></h2>
<ul>
<li>run R</li>
<li>type <code>install.packages(&quot;stylo&quot;)</code></li>
<li>pick a CRAN mirror (download server)</li>
<li>click <code>OK</code></li>
<li>done!</li>
</ul>
</section>
<section id="some-basic-r-functions-just-in-case" class="slide level2">
<h2>Some basic R functions, just in case</h2>
<ul>
<li>to activate a package: <code>library(stylo)</code></li>
<li>to set working directory: <code>setwd(&quot;path/to/my/stuff&quot;)</code></li>
<li>to find your current location: <code>getwd()</code></li>
<li>to list files in your current location: <code>list.files()</code></li>
<li>to get help: <code>help(&lt;function&gt;)</code>, e.g. <code>help(stylo)</code></li>
<li>to quit R: <code>q()</code></li>
</ul>
</section>
<section id="main-functions-stylo" class="slide level2">
<h2>Main functions: <code>stylo()</code></h2>
<ul>
<li>It computes distances (differences) between texts, …</li>
<li>… represented as rows of frequencies of most frequent words.</li>
<li>Then it plots graphs of those distances:
<ul>
<li><strong>Cluster Analysis</strong> plots (dendrograms)</li>
<li><strong>Multidimensional Scaling</strong> scatterplots</li>
<li><strong>Principal Components Analysis</strong> scatterplots</li>
<li><strong>Bootstrap Consensus Trees</strong> plots (for multiple parameter settings)</li>
<li><strong>Bootstrap Consensus Networks</strong> (other software will be needed to take over)</li>
</ul></li>
<li>The plots can be both displayed on screen and saved to a file (e.g. PNG).</li>
</ul>
</section>
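<section id="sketch-from-frequencies-to-a-dendrogram" class="slide level2">
<h2>Sketch: from frequencies to a dendrogram</h2>
<p>To make the idea on the previous slide concrete, here is a minimal base-R sketch. It is not the code of <code>stylo()</code>: the three toy texts are made up, the choice of 4 most frequent words is arbitrary, and the default linkage of <code>hclust()</code> is an assumption of this example.</p>
<pre><code class="r"># Toy corpus: three tiny made-up texts, for illustration only
texts &lt;- c(
  A_one = "the cat and the dog and the bird",
  A_two = "the dog and the cat and the fish",
  B_one = "a cat or a dog or a bird or a fish"
)

# Split each text into lower-case word tokens
tokens &lt;- lapply(tolower(texts), function(x) strsplit(x, "[^a-z]+")[[1]])
names(tokens) &lt;- names(texts)

# The most frequent words (MFW) of the whole corpus
mfw &lt;- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:4]

# One row per text, one column per MFW, cells = relative frequencies
freqs &lt;- t(vapply(tokens, function(tok) {
  as.numeric(table(factor(tok, levels = mfw))) / length(tok)
}, numeric(length(mfw))))
colnames(freqs) &lt;- mfw

# Classic Delta: mean absolute difference of z-scored frequencies
z &lt;- scale(freqs)
delta &lt;- dist(z, method = "manhattan") / ncol(z)

# Cluster analysis: draw the dendrogram
tree &lt;- hclust(delta)
plot(tree)
</code></pre>
<p>The two <code>A_*</code> texts should end up on the same branch, because their frequency profiles are much closer to each other than to <code>B_one</code>.</p>
</section>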
<section id="functions-stylo.network" class="slide level2">
<h2>Functions: <code>stylo.network()</code></h2>
<ul>
<li>It is an extended version of the function <code>stylo()</code>.</li>
<li>It performs Bootstrap Consensus Networks, or a network-like generalization of the Bootstrap Consensus Trees method.</li>
<li>It produces interactive visualizations in a web browser: to make it happen, you have to install an additional R package first. Type: <code>install.packages(&quot;networkD3&quot;)</code></li>
</ul>
</section>
<section id="main-functions-classify" class="slide level2">
<h2>Main functions: <code>classify()</code></h2>
<ul>
<li>It trains a model for pre-defined groups of texts, e.g. authors.</li>
<li>Then it computes distances (differences) between texts, …</li>
<li>… represented as rows of frequencies of most frequent words.</li>
<li>Finally, it compares the trained models with test texts, using:
<ul>
<li><strong>Delta</strong> classifier (lazy learner introduced by Burrows)</li>
<li><strong>k-NN</strong> classifier (lazy learner relying on &gt;1 neighbors)</li>
<li><strong>Support Vector Machines</strong>, a high-performance non-probabilistic classifier</li>
<li><strong>Naive Bayes</strong>, a classical yet slightly outdated classifier</li>
<li><strong>Nearest Shrunken Centroids</strong>, a classifier for high-dimensional datasets</li>
</ul></li>
<li>A final report on the classifier’s performance is produced.</li>
</ul>
</section>
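<section id="sketch-a-delta-classifier-by-hand" class="slide level2">
<h2>Sketch: a Delta classifier by hand</h2>
<p>The following base-R sketch shows what a lazy learner such as Delta does in principle. The frequencies, the author labels <code>AuthorA</code> and <code>AuthorB</code>, and the three words are all hypothetical; <code>classify()</code> itself does considerably more (feature selection, cross-validation, reporting).</p>
<pre><code class="r"># Hypothetical relative frequencies (in %) of three frequent words;
# rows are training texts, columns are words
train &lt;- rbind(
  AuthorA_1 = c(the = 4.1, and = 3.2, of = 2.9),
  AuthorA_2 = c(the = 4.0, and = 3.3, of = 3.0),
  AuthorB_1 = c(the = 5.2, and = 2.6, of = 2.1),
  AuthorB_2 = c(the = 5.0, and = 2.7, of = 2.2)
)
test_text &lt;- c(the = 5.1, and = 2.5, of = 2.3)

# Standardize with the means and standard deviations of the training set
mu    &lt;- colMeans(train)
sigma &lt;- apply(train, 2, sd)
z_train &lt;- sweep(sweep(train, 2, mu), 2, sigma, "/")
z_test  &lt;- (test_text - mu) / sigma

# Delta = mean absolute difference of z-scores
delta &lt;- apply(z_train, 1, function(row) mean(abs(row - z_test)))

# Lazy learner: nothing is fitted, we simply pick the closest training text
nearest &lt;- names(which.min(delta))
predicted_author &lt;- sub("_.*$", "", nearest)
print(predicted_author)
</code></pre>
<p>The author label is read from the part of the name before the underscore, mirroring the <code>Author_Title</code> naming used for corpus files in this tutorial.</p>
</section>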
<section id="main-functions-oppose" class="slide level2">
<h2>Main functions: <code>oppose()</code></h2>
<ul>
<li>Designed to compare two (groups of) texts</li>
<li>It cuts input texts into equal-sized samples</li>
<li>Finds words characteristic of two (groups of) texts
<ul>
<li>These can be reused with <code>stylo()</code> or <code>classify()</code></li>
</ul></li>
<li>Produces a diagram of the use of each group’s words</li>
</ul>
</section>
<section id="functions-rolling.classify" class="slide level2">
<h2>Functions: <code>rolling.classify()</code></h2>
<ul>
<li>Looks for traces of authors in a co-authored text…</li>
<li>… by sliding through this text sequentially in order to detect peculiarities.</li>
<li>Produces a graph of the respective strengths of these traces.</li>
</ul>
</section>
<section id="functions-imposters" class="slide level2">
<h2>Functions: <code>imposters()</code></h2>
<ul>
<li>Performs a computation-heavy technique of authorship verification, referred to as the General Imposters method.</li>
<li>It compares a disputed text against (a) some texts by a potential author of that text, and (b) several texts written by people who could not have written it (aka the imposters).</li>
<li>Over many random iterations, it estimates whether the text in question was more likely written by the candidate or by any of the imposters.</li>
</ul>
</section>
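<section id="sketch-the-imposters-idea" class="slide level2">
<h2>Sketch: the imposters idea</h2>
<p>A heavily simplified base-R sketch of the logic described on the previous slide. It only resamples features (the full method also resamples imposters), uses a plain mean absolute difference as the distance, and runs on made-up numbers; the names <code>imposters_sketch</code>, <code>min_dist</code> and <code>feature_share</code> are hypothetical and not part of <code>stylo</code>.</p>
<pre><code class="r">set.seed(42)  # make the random feature sampling reproducible

# Smallest distance between one text and the rows of a matrix,
# computed on a subset of features
min_dist &lt;- function(text, others, keep) {
  min(apply(others[, keep, drop = FALSE], 1,
            function(row) mean(abs(row - text[keep]))))
}

# Share of iterations in which the candidate is closer than every imposter
imposters_sketch &lt;- function(disputed, candidate, imposters,
                             iterations = 100, feature_share = 0.5) {
  n_keep &lt;- ceiling(feature_share * length(disputed))
  wins &lt;- 0
  for (i in seq_len(iterations)) {
    keep &lt;- sample(length(disputed), n_keep)  # random subset of features
    if (min_dist(disputed, candidate, keep) &lt;
        min_dist(disputed, imposters, keep)) {
      wins &lt;- wins + 1
    }
  }
  wins / iterations
}

# Made-up z-scored frequencies of eight features
disputed  &lt;- c(0.5, -1.2, 0.3, 1.1, -0.4, 0.0, 0.8, -0.9)
candidate &lt;- rbind(disputed + 0.1, disputed - 0.2)
imposters &lt;- rbind(disputed + 1.5, disputed - 2, rev(disputed) + 1)

score &lt;- imposters_sketch(disputed, candidate, imposters)
print(score)
</code></pre>
<p>A score close to 1 suggests the candidate; a score close to 0 suggests that some imposter fits better. In this toy example the candidate texts are deliberately almost identical to the disputed one.</p>
</section>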
<section id="preparing-a-corpus" class="slide level2">
<h2>Preparing a corpus</h2>
<ul>
<li>before you launch R, …</li>
<li>in your favourite folder, create a subfolder named <code>corpus</code></li>
<li>put your raw text files there, e.g.:
<ul>
<li><code>Shakespeare_Hamlet.txt</code></li>
<li><code>Kyd_Tragedy.txt</code></li>
<li>etc.</li>
</ul></li>
</ul>
</section>
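<section id="sketch-checking-the-corpus-folder" class="slide level2">
<h2>Sketch: checking the corpus folder</h2>
<p>If you prefer to check the folder layout from within R, something along these lines should do. The sketch works in a temporary directory with placeholder file contents, so nothing on your disk is touched; the folder name <code>my_project</code> is just an example.</p>
<pre><code class="r"># Work in a throw-away folder
project_dir &lt;- file.path(tempdir(), "my_project")
corpus_dir  &lt;- file.path(project_dir, "corpus")
dir.create(corpus_dir, recursive = TRUE, showWarnings = FALSE)

# Placeholder contents; real files would hold the full plain texts
writeLines("placeholder text", file.path(corpus_dir, "Shakespeare_Hamlet.txt"))
writeLines("placeholder text", file.path(corpus_dir, "Kyd_Tragedy.txt"))

# List the files and read the author labels from the file names
files   &lt;- list.files(corpus_dir)
authors &lt;- sub("_.*$", "", files)
print(data.frame(file = files, author = authors))
</code></pre>
<p>Naming the files <code>Author_Title.txt</code> keeps the author label in the part before the underscore, which is what the example file names above follow.</p>
</section>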
<section id="running-stylo" class="slide level2">
<h2>Running <code>stylo()</code></h2>
<ol type="1">
<li>activate the package
<ul>
<li>type <code>library(stylo)</code>, hit ENTER</li>
</ul></li>
<li>navigate to your favourite folder:
<ul>
<li>geeks: <code>setwd(&quot;the/path/to/my/favourite/folder&quot;)</code></li>
<li>RStudio users: find your directory in the <strong>Files</strong> panel, then use <strong>Menu &gt; More &gt; Set as Working Directory</strong></li>
<li>Windows users: use <strong>Menu &gt; File &gt; Change directory</strong></li>
<li>macOS users: use <strong>Menu &gt; Misc &gt; Change working directory</strong></li>
</ul></li>
<li>launch the main function:
<ul>
<li>type <code>stylo()</code>, hit ENTER</li>
</ul></li>
</ol>
</section>
<section id="stylo-parameters" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>INPUT: state the format of your text files</li>
<li>LANGUAGE: self-evident</li>
<li>Don’t press OK yet!</li>
</ul>
<p><img src="img/stylo_1.png" alt="title" width="500" /></p>
</section>
<section id="stylo-parameters-1" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>FEATURES: things to count: words or characters
<ul>
<li>ngram size: say 1 for single features, 2 for 2-grams, etc. In a vast majority of cases, you’ll choose word 1-grams.</li>
</ul></li>
<li>MFW SETTINGS: how many most frequent words to use
<ul>
<li>in most cases, <code>Minimum = Maximum</code></li>
</ul></li>
</ul>
<p><img src="img/stylo_2.png" alt="title" width="500" /></p>
</section>
<section id="stylo-parameters-2" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>CULLING: optionally, filter out some words you <strong>don’t</strong> want to analyze. Examples:
<ul>
<li>0 – all the words survive culling</li>
<li>20 – a given word has to appear in at least 20% of the texts</li>
<li>100 – an extreme filter: every word that does not occur in all the texts is removed</li>
</ul></li>
<li>DELETE PRONOUNS: optionally, filter out personal pronouns.
<ul>
<li>Attention! The set of pronouns is decided according to the chosen language</li>
</ul></li>
</ul>
</section>
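<section id="sketch-how-culling-works" class="slide level2">
<h2>Sketch: how culling works</h2>
<p>Culling is easy to reproduce by hand. The sketch below uses a made-up frequency table and a hypothetical helper called <code>cull()</code>; it illustrates the principle and is not the package's own implementation.</p>
<pre><code class="r"># Made-up relative frequencies: rows = texts, columns = words
freqs &lt;- rbind(
  text_1 = c(the = 5.0, and = 3.0, whale = 0.4, ship = 0.0, tea = 0.0),
  text_2 = c(the = 4.8, and = 3.1, whale = 0.0, ship = 0.2, tea = 0.0),
  text_3 = c(the = 5.1, and = 2.9, whale = 0.0, ship = 0.3, tea = 0.0),
  text_4 = c(the = 4.9, and = 3.2, whale = 0.0, ship = 0.0, tea = 0.1),
  text_5 = c(the = 5.2, and = 3.0, whale = 0.0, ship = 0.0, tea = 0.0)
)

# Keep only the words that occur in at least `culling` percent of the texts
cull &lt;- function(freqs, culling = 0) {
  share &lt;- 100 * colSums(freqs &gt; 0) / nrow(freqs)
  freqs[, share &gt;= culling, drop = FALSE]
}

print(colnames(cull(freqs, 0)))    # all five words survive
print(colnames(cull(freqs, 50)))   # only words found in half of the texts
print(colnames(cull(freqs, 100)))  # only words found in every text
</code></pre>
<p>Here <code>whale</code>, <code>ship</code> and <code>tea</code> each occur in one or two texts out of five, so they disappear at a culling level of 50, while <code>the</code> and <code>and</code> survive even at 100.</p>
</section>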
<section id="stylo-parameters-3" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>STATISTICS: Cluster Analysis, Multidimensional Scaling, etc.</li>
<li>DISTANCES: choose how the similarities between texts should be measured
<ul>
<li>Classic Delta: perhaps the best choice to start with</li>
<li>Cosine Delta (aka Würzburg Delta): perhaps an even better choice</li>
<li>Eder’s Delta: a good choice for highly inflected languages</li>
</ul></li>
</ul>
<p><img src="img/stylo_3.png" alt="title" width="500" /></p>
</section>
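<section id="sketch-classic-delta-vs-cosine-delta" class="slide level2">
<h2>Sketch: Classic Delta vs Cosine Delta</h2>
<p>The difference between two of the distances listed above can be shown in a few lines. This sketch assumes that both measures operate on z-scored word frequencies; the helper names and the numbers are made up, and the package's own functions may differ in detail.</p>
<pre><code class="r"># Classic (Burrows's) Delta: mean absolute difference of z-scores
classic_delta &lt;- function(x, y) mean(abs(x - y))

# Cosine Delta: one minus the cosine of the angle between the z-score vectors
cosine_delta &lt;- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Made-up z-scores of five frequent words
a &lt;- c(1.2, -0.5, 0.3, -1.0, 0.8)
b &lt;- 2 * a   # same profile as a, only more pronounced

print(classic_delta(a, b))  # clearly above zero
print(cosine_delta(a, b))   # (practically) zero
</code></pre>
<p>Classic Delta reacts to the size of the deviations, whereas Cosine Delta only looks at their direction: two texts that deviate from the corpus average in the same way count as similar even if one deviates more strongly.</p>
</section>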
<section id="stylo-parameters-4" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>SAMPLING: decide whether the texts should be split into samples
<ul>
<li>no sampling: the texts will be analyzed as they are</li>
<li>normal sampling: dividing the texts into equal-sized blocks</li>
<li>random sampling: randomly harvesting <em>N</em> words from each text</li>
<li>number of samples: random harvesting can be repeated <em>n</em> times</li>
</ul></li>
</ul>
<p><img src="img/stylo_4.png" alt="title" width="500" /></p>
</section>
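<section id="sketch-normal-vs-random-sampling" class="slide level2">
<h2>Sketch: normal vs random sampling</h2>
<p>The two sampling modes can be mimicked in base R as follows. The helper functions are hypothetical; dropping the leftover tail and drawing without replacement are assumptions of this sketch, not documented behaviour of <code>stylo()</code>.</p>
<pre><code class="r">set.seed(1)
words &lt;- paste0("w", 1:23)  # stand-in for a tokenized text of 23 words

# Normal sampling: consecutive blocks of equal size (leftover tail dropped)
normal_sampling &lt;- function(words, sample_size) {
  n_blocks &lt;- length(words) %/% sample_size
  lapply(seq_len(n_blocks), function(i) {
    words[((i - 1) * sample_size + 1):(i * sample_size)]
  })
}

# Random sampling: harvest sample_size words at random, repeated n times
random_sampling &lt;- function(words, sample_size, number_of_samples = 1) {
  replicate(number_of_samples, sample(words, sample_size), simplify = FALSE)
}

blocks &lt;- normal_sampling(words, sample_size = 5)
bags   &lt;- random_sampling(words, sample_size = 5, number_of_samples = 3)

print(length(blocks))  # number of equal-sized blocks
print(blocks[[1]])     # the first block keeps the original word order
print(bags[[1]])       # a random bag of words from anywhere in the text
</code></pre>
<p>With 23 words and a sample size of 5, normal sampling yields four blocks; the last three words do not fill a block and are left out in this sketch.</p>
</section>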
<section id="stylo-parameters-5" class="slide level2">
<h2><code>stylo()</code> parameters</h2>
<ul>
<li>OUTPUT: most of the options are self-evident. Some that might be useful:
<ul>
<li>on screen: make sure that this is switched on to see your results</li>
<li>PCA flavour: choose “loadings” to see the discrimination power of particular features (but first, choose PCA on the STATISTICS tab)</li>
<li>horizontal CA tree: use it to position your dendrograms horizontally</li>
</ul></li>
</ul>
<p><img src="img/stylo_5.png" alt="title" width="500" /></p>
</section>
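<section id="sketch-setting-the-parameters-without-the-gui" class="slide level2">
<h2>Sketch: setting the parameters without the GUI</h2>
<p>The choices made in the GUI tabs can also be passed as function arguments. The argument names below are quoted from the package documentation as best we can tell, and the values are merely examples; please double-check both with <code>help(stylo)</code> before relying on them. The call is only attempted if the package is installed and a <code>corpus</code> folder exists.</p>
<pre><code class="r"># GUI choices written down as a list of arguments (example values)
params &lt;- list(
  gui               = FALSE,          # skip the dialog window
  corpus.dir        = "corpus",       # folder with the text files
  corpus.lang       = "English",      # LANGUAGE
  analyzed.features = "w",            # FEATURES: words
  ngram.size        = 1,              # word 1-grams
  mfw.min           = 100,            # MFW SETTINGS: Minimum = Maximum
  mfw.max           = 100,
  culling.min       = 0,              # CULLING: all the words survive
  culling.max       = 0,
  delete.pronouns   = FALSE,
  analysis.type     = "CA",           # STATISTICS: Cluster Analysis
  sampling          = "no.sampling",  # SAMPLING
  display.on.screen = TRUE,           # OUTPUT
  write.png.file    = FALSE
)

# Run the analysis only when it can actually work
if (requireNamespace("stylo", quietly = TRUE) &amp;&amp;
    dir.exists(params$corpus.dir)) {
  results &lt;- do.call(stylo::stylo, params)
}
</code></pre>
<p>Keeping the settings in a list makes it easy to rerun exactly the same experiment later, or to change one value at a time.</p>
</section>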
<section id="bootstrap-consensus-networks" class="slide level2">
<h2>Bootstrap Consensus Networks</h2>
<ul>
<li>There are two ways of computing a stylometric network:</li>
<li>interactive:
<ul>
<li>run the function <code>stylo.network()</code></li>
<li>set the parameters as for Bootstrap Consensus Networks</li>
<li>a web browser will start automatically: your network is there!</li>
</ul></li>
<li>static, yet highly customizable:
<ul>
<li>run the function <code>stylo(network = TRUE)</code></li>
<li>find a file named <code>..._C_0.5_EDGES.csv</code> in your working directory</li>
<li>load it into a network manipulation tool, e.g. <a href="https://gephi.org/">Gephi</a></li>
</ul></li>
</ul>
</section>
<section id="running-gephi" class="slide level2">
<h2>Running Gephi</h2>
<ul>
<li>select GEPHI &gt; New Project</li>
<li>Data laboratory &gt; Import Spreadsheet</li>
<li>Import settings:
<ul>
<li>Separation: Comma</li>
<li>As table: Edges table</li>
<li>Charset: the default is fine unless your filenames contain non-ASCII characters, e.g. <code>Brontë_Wuthering.txt</code>; in that case choose the encoding your files use</li>
<li>Next</li>
<li>Select ALL available options</li>
<li>Finish!</li>
</ul></li>
</ul>
</section>
<section id="running-gephi-1" class="slide level2">
<h2>Running Gephi</h2>
<ul>
<li>We need to get authors’ names…</li>
<li>Click</li>
<li>And set:</li>
<li>OK!</li>
<li>Copy ID column to LABEL</li>
</ul>
</section>
<section id="running-gephi-2" class="slide level2">
<h2>Running Gephi</h2>
<ul>
<li>Preview</li>
<li>Partition</li>
<li>Click</li>
<li>Select, e.g. Author</li>
<li>Apply</li>
<li>Show labels</li>
</ul>
</section>
<section id="gephi-layout" class="slide level2">
<h2>Gephi Layout</h2>
<ul>
<li>ForceAtlas 2</li>
<li>Dissuade Hubs (hmmm… really?)</li>
<li>Prevent Overlap (hmmm… really?)</li>
<li>Edge Weight Influence 0.5</li>
<li>Scaling: 500</li>
<li>Run!</li>
</ul>
</section>
<section id="gephi-layout-cont." class="slide level2">
<h2>Gephi Layout (cont.)</h2>
<ul>
<li>Expansion</li>
<li>Run!</li>
</ul>
</section>
<section id="running-gephi-preview" class="slide level2">
<h2>Running Gephi: Preview</h2>
<ul>
<li>Node labels
<ul>
<li>Show labels</li>
</ul></li>
<li>Edges
<ul>
<li>Show edges</li>
</ul></li>
<li>Thickness: play with 0.1, 0.01, etc.</li>
<li>Refresh!</li>
<li>Reset zoom</li>
</ul>
</section>
<section id="running-gephi-saving-a-network" class="slide level2">
<h2>Running Gephi: saving a network</h2>
<ul>
<li>File &gt; Save
<ul>
<li>with <code>.gephi</code> extension</li>
</ul></li>
<li>Picture
<ul>
<li>File &gt; Export &gt; SVG/PDF/PNG</li>
</ul></li>
<li>Options &gt; Landscape</li>
<li>OK!</li>
</ul>
</section>
<section id="running-oppose" class="slide level2">
<h2>Running <code>oppose()</code></h2>
<ul>
<li>Different subfolder structure:
<ul>
<li><code>primary_set</code></li>
<li><code>secondary_set</code></li>
<li><code>test_set</code> (optional)</li>
</ul></li>
<li>Running the function:
<ul>
<li><code>library(stylo)</code> &lt;ENTER&gt;</li>
<li><code>oppose()</code> &lt;ENTER&gt;</li>
</ul></li>
<li>What we get:
<ul>
<li><code>words_preferred.txt</code>: words characteristic of the <code>primary_set</code> texts</li>
<li><code>words_avoided.txt</code>: words characteristic of the <code>secondary_set</code> texts</li>
<li>a word frequency graph</li>
</ul></li>
</ul>
</section>
<section id="oppose-parameters" class="slide level2">
<h2><code>oppose()</code> parameters</h2>
<ul>
<li>Slice length: size (in words) of the samples (5000)</li>
<li>Slice overlap: (0)</li>
<li>Method: (Craig’s Zeta)</li>
<li>Visualization: type of graph (Markers)</li>
</ul>
</section>
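<section id="sketch-slices-and-zeta" class="slide level2">
<h2>Sketch: slices and Zeta</h2>
<p>To see what slice length and Craig’s Zeta mean in practice, consider this toy version in base R. It follows the usual textbook definition (share of primary slices containing a word plus share of secondary slices lacking it); the helper names and the tiny word lists are made up, and <code>oppose()</code> may scale or filter its scores differently.</p>
<pre><code class="r"># Cut a vector of words into consecutive slices of equal length (no overlap)
slice_text &lt;- function(words, slice_length) {
  n &lt;- length(words) %/% slice_length
  split(words[seq_len(n * slice_length)],
        rep(seq_len(n), each = slice_length))
}

# Share of slices in which a given word occurs at least once
share_with &lt;- function(slices, word) {
  mean(vapply(slices, function(s) word %in% s, logical(1)))
}

# Zeta: high for words preferred in the primary set, low for avoided ones
zeta &lt;- function(word, primary, secondary) {
  share_with(primary, word) + (1 - share_with(secondary, word))
}

# Made-up word sequences, cut into slices of two words each
primary &lt;- slice_text(
  c("sea", "ship", "sea", "whale", "ship", "sea", "the", "sea"), 2)
secondary &lt;- slice_text(
  c("tea", "the", "garden", "tea", "the", "lady", "tea", "sea"), 2)

print(zeta("sea", primary, secondary))  # preferred in the primary set
print(zeta("tea", primary, secondary))  # avoided in the primary set
</code></pre>
<p>In this definition the score runs from 0 (the word occurs in every secondary slice and in no primary slice) to 2 (the opposite case).</p>
</section>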
<section id="oppose-parameters-1" class="slide level2">
<h2><code>oppose()</code> parameters</h2>
<p>Most of the parameters of this somewhat underdeveloped function are not available in the GUI. You can set them as function arguments instead, e.g.:</p>
<ul>
<li>when your corpus contains characters beyond plain ASCII (e.g. diacritics):
<ul>
<li><code>oppose(encoding = &quot;UTF-8&quot;, corpus.lang = &quot;Spanish&quot;)</code></li>
</ul></li>
</ul>
</section>
<section id="running-rolling.classify" class="slide level2">
<h2>Running <code>rolling.classify()</code></h2>
<ul>
<li>Different subfolder structure:
<ul>
<li><code>reference_set</code> (individual writings)</li>
<li><code>test_set</code> (collaborative text)</li>
</ul></li>
<li>Running the function:
<ul>
<li><code>library(stylo)</code> &lt;ENTER&gt;</li>
<li><code>rolling.classify(write.png.file = TRUE, classification.method = &quot;delta&quot;, mfw = 100, training.set.sampling = &quot;normal.sampling&quot;, slice.size = 5000, slice.overlap = 4500)</code></li>
</ul></li>
<li>What we get:</li>
</ul>
</section>
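<section id="sketch-what-slice-size-and-overlap-mean" class="slide level2">
<h2>Sketch: what slice size and overlap mean</h2>
<p>The values <code>slice.size = 5000</code> and <code>slice.overlap = 4500</code> used above mean that the window moves forward by 500 words at a time. The sketch below computes the window positions for a hypothetical text of 20,000 words; the helper <code>window_starts()</code> is made up, and how <code>rolling.classify()</code> treats the very end of a text is not covered here.</p>
<pre><code class="r"># Start positions of sliding windows over a text of n_words tokens
window_starts &lt;- function(n_words, slice.size, slice.overlap) {
  step &lt;- slice.size - slice.overlap  # how far the window moves each time
  stopifnot(step &gt; 0, n_words &gt;= slice.size)
  seq(1, n_words - slice.size + 1, by = step)
}

starts &lt;- window_starts(n_words = 20000, slice.size = 5000,
                        slice.overlap = 4500)
ends &lt;- starts + 5000 - 1

print(length(starts))                      # number of windows
print(head(cbind(start = starts, end = ends), 3))
</code></pre>
<p>A larger overlap gives a smoother curve but more windows to classify; an overlap of 0 reduces the procedure to cutting the text into consecutive blocks.</p>
</section>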

  <!--
  To hide progress bar from entire presentation
  just remove “progress” element.
  -->
  <!-- <div class="progress"></div> -->
  <script src="Stylometry_with_R_instructions_files/rmdshower/node_modules/shower/node_modules/shower-core/shower.min.js"></script>

    <script>renderMathInElement(document.body);</script>
  
  
</body>
</html>