Linguistic Hawkes Process Model Codebase

This Python codebase implements a discrete Hawkes process model for linguistic pattern analysis. Powered by a Dask backend, this repo is:

  1. Scalable: can be deployed on either a single machine or a cluster to model either short or long processes.
  2. Fast: highly parallelized, with extra speed-oriented fitting options.
  3. Extendable: can be used in contexts other than linguistic analysis; only proper input data formatting is required.

Run the code

  • Create your own corpus. The corpus file should be in .json format, where each key is a token string (of, the, ...) and each value is a list of integers recording the time stamps at which that token occurs in a book. A utility that automates corpus creation from a text file is provided in utils/scanner.py, with spacy-en_core_web_sm used as the default tokenizer. The default token position counting pipeline is shown below (ignored characters are marked with \):
[beginning of file] To  be  or  not  to  be , that  is  the  question .
[lower case]        to  be  or  not  to  be , that  is  the  question .
[time stamp]        0   1   2   3    4   5  \ 6     7   8    9        \
[example result]    {'to': [0, 4], 'be': [1, 5], 'or': [2], ......}     
from utils.scanner import Scanner
ignore_list = [' ', '.', ',']   # characters to exclude when counting token positions
scanner = Scanner(ignore_list)
corpus, token_cnt = scanner.count_pos('Hamlet.txt')
# corpus: Dict[str, List[int]], mapping from token string to occurrence positions
# token_cnt: int, total number of tokens (ignore_list excluded)

Feel free to create your own customized corpus with another tokenizer.
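
For reference, a corpus built this way can be written out in the expected .json layout with the standard json module (a minimal sketch reusing the Scanner call above; hamlet_corpus.json is an illustrative file name):

import json
from utils.scanner import Scanner

scanner = Scanner([' ', '.', ','])                    # same ignore_list as above
corpus, token_cnt = scanner.count_pos('Hamlet.txt')

# serialize in the format expected by main.py: {token string: [occurrence positions]}
with open('hamlet_corpus.json', 'w') as f:
    json.dump(corpus, f)
print(f'pass {token_cnt} as --total-word-num')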

  • Entry Point
$ python main.py -h       # display options help menu
$ python main.py --word <word> --corpus-path <corpus.json> --total-word-num <N> \
                 --ckp-save-path <save-path> \
                 --epoch <num-epochs> \
                 --X-bandwidth <bandwidth> --lag-bandwidth <bandwidth>
  • Args Spec
    • word: the target word in the book to analyze
    • corpus-path: path to the corpus .json file
    • total-word-num: total number of words/tokens in the book
    • ckp-save-path: save path for the fitted checkpoint
    • X-bandwidth: bandwidth used when estimating the distribution of occurrences
    • lag-bandwidth: bandwidth used when estimating the distribution of occurrence lags, i.e. the difference between any pair of occurrences
    • epoch: number of fitting epochs

Workflow

The conditional intensity function $\lambda(t)$ is defined as the probability that an event occurs at time stamp $t$, conditioned on the previous history $\mathcal{H}_t$:

$$\lambda(t) = P\{X_t = 1 |\mathcal{H}_t\}$$

In our example, we estimate $\lambda(t)$ for the univariate case with the following equation:

$$\lambda(t) = \mu_0 \mu(t) + A \sum_{i: t_i < t} g(t-t_i)$$

The objective likelihood function is $$\mathcal{L} = \sum_{i=1}^{N} X_i \log\lambda(t_i) + (1-X_i)\log\bigl(1-\lambda(t_i)\bigr)$$ where the index $i$ runs over all time stamps in the episode and $X_i$ is an indicator variable for occurrence: $X_i=1$ denotes occurrence, $X_i=0$ denotes no occurrence.
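
Written out directly, this is a Bernoulli log-likelihood over the episode; a minimal NumPy sketch (names are illustrative, not the repo's API):

import numpy as np

def log_likelihood(X, lam):
    # X:   binary occurrence indicators over the episode (0 or 1)
    # lam: conditional intensity lambda(t) at every time stamp, values in (0, 1)
    X = np.asarray(X, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return np.sum(X * np.log(lam) + (1 - X) * np.log(1 - lam))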

Denote $\phi(t) = \frac{\mu(t)}{\lambda(t)}$ and $\rho_{ij}=\frac{g(t_j-t_i)}{\lambda(t_j)}$.

E step: Update $\mu(t)$ and $g(t)$

$$\mu(t) \leftarrow \sum_{t_i \in N} \phi(t_i) \mathbb{1}[t_i \in [t-\Delta t, t+\Delta t]] \approx \sum_{t_i \in N} \phi(t_i) Z(t-t_i, \omega_{bg})$$

$$g(t) \leftarrow \sum_{t_i, t_j \in N; t_i < t_j} \rho_{ij} \mathbb{1}[t_j-t_i \in [t-\Delta t, t+\Delta t]] \approx \sum_{t_i, t_j \in N; t_i < t_j} \rho_{ij} Z(t_j-t_i, \omega_{trigger})$$

where $Z(\cdot, \omega)$ is a smoothing kernel whose bandwidths $\omega_{bg}$ and $\omega_{trigger}$ are set by X-bandwidth and lag-bandwidth respectively.

M step: Update $\mu_0$ and $A$

$$\frac{\partial \mathcal{L}}{\partial \mu_0} = \sum_{t_i \in N} \frac{\phi(t_i)}{\mu_0} - \sum_{t \notin N} \frac{\mu(t)}{1-\lambda(t)}$$

$$\frac{\partial \mathcal{L}}{\partial A} = \sum_{t_i \in N} \frac{1-\phi(t_i)}{A} - \sum_{t \notin N} \frac{\sum_{j: t_j < t} g(t-t_j)}{1-\lambda(t)}$$

Setting these derivatives to zero gives the iterative optimization approach:

$$\mu_0^{k+1} \leftarrow \frac{\sum_{t_i \in N}\phi^k(t_i)} {\sum_{t \notin N} \frac{\mu^k(t)}{1-\lambda^k(t)}}$$

$$A^{k+1} \leftarrow \frac{\sum_{t_i \in N} 1-\phi^{k+1}(t_i)}{\sum_{t \notin N} \frac{\sum_{t_j: t_j < t} g^{k+1}(t-t_j)}{1 - \lambda^{k+1}(t)}}$$
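
A simplified, single-machine sketch of one EM iteration that mirrors the updates above (it assumes a Gaussian smoothing kernel $Z$ and dense NumPy arrays, unlike the chunked Dask implementation in this repo; all names are illustrative):

import numpy as np

def em_step(X, mu0, A, mu, g, bw_bg, bw_trigger):
    # X  : binary occurrence indicators over the episode, shape (T,)
    # mu : background rate mu(t) over the episode, shape (T,)
    # g  : triggering kernel over lags 1..L, shape (L,)
    X, mu, g = np.asarray(X), np.asarray(mu), np.asarray(g)
    T, L = len(X), len(g)
    occ = np.flatnonzero(X)                        # occurrence time stamps t_i
    bg = np.flatnonzero(X == 0)                    # time stamps with no occurrence

    # conditional intensity lambda(t) = mu0 * mu(t) + A * sum_{t_i < t} g(t - t_i)
    trig = np.zeros(T)
    for ti in occ:
        end = min(ti + 1 + L, T)
        trig[ti + 1:end] += g[:end - ti - 1]
    lam = np.clip(mu0 * mu + A * trig, 1e-12, 1 - 1e-12)

    # E step: phi(t_i) = mu(t_i) / lambda(t_i),  rho_ij = g(t_j - t_i) / lambda(t_j)
    phi = mu[occ] / lam[occ]
    lags = occ[None, :] - occ[:, None]             # t_j - t_i for every ordered pair
    valid = (lags > 0) & (lags <= L)
    rho = np.where(valid, g[np.clip(lags, 1, L) - 1] / lam[occ][None, :], 0.0)

    # kernel-smoothed updates of mu(t) and g(t); Z is taken to be Gaussian here
    Z_bg = np.exp(-0.5 * ((np.arange(T)[:, None] - occ[None, :]) / bw_bg) ** 2)
    mu_new = Z_bg @ phi
    Z_tr = np.exp(-0.5 * ((np.arange(1, L + 1)[:, None] - lags[valid][None, :]) / bw_trigger) ** 2)
    g_new = Z_tr @ rho[valid]

    # M step: iterative updates for mu0 and A (the update rules above refresh
    # phi, g and lambda before updating A; this sketch reuses the current values)
    mu0_new = phi.sum() / np.sum(mu[bg] / (1 - lam[bg]))
    A_new = (1 - phi).sum() / np.sum(trig[bg] / (1 - lam[bg]))
    return mu0_new, A_new, mu_new, g_new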

Advanced Options

Parameter Init

  • init-mu0: initial value of $\mu_0$
  • init-A: initial value of $A$
  • init-exp-lambda: initial value of the rate $\lambda$ of the exponential distribution $\lambda e^{-\lambda x}$ used to initialize $g(t)$; see the sketch after this list.
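
For intuition, seeding $g(t)$ from the exponential density might look like this (a minimal sketch; the lag-grid length and rate value are illustrative, not the repo's defaults):

import numpy as np

init_exp_lambda = 0.01                   # value passed via --init-exp-lambda
lag_grid = np.arange(1, 501)             # lags on which g(t) is represented (illustrative length)
g_init = init_exp_lambda * np.exp(-init_exp_lambda * lag_grid)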

Fast Computation

  • g-window-size: when computing the self-excitation term, events that occurred beyond this window are truncated. Without truncation, the complexity of the accumulative trigger calculation is $O(N^2)$; see the sketch below.
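
The idea behind the window can be sketched directly (a hypothetical helper, not the repo's implementation): each occurrence only contributes to the time stamps within g-window-size of it.

import numpy as np

def truncated_trigger(occ, g, T, window):
    # occ: sorted occurrence time stamps; g: triggering kernel over lags 1..len(g)
    # each occurrence only touches the next `window` time stamps, so the cost is
    # O(N * window) instead of the O(N^2) full pairwise accumulation
    trig = np.zeros(T)
    reach = min(window, len(g))
    for ti in occ:
        end = min(ti + 1 + reach, T)
        trig[ti + 1:end] += g[:end - ti - 1]
    return trig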

Memory Utilization

When running on a single machine, one of the major problems when fitting a long process is running out of RAM (OOM). The following arguments should be carefully tuned to prevent it (a chunking sketch follows the list).

  • X-chunk-size: chunk size used to split the entire episode
  • kernel-chunk-size: chunk size used to split the occurrences
  • lag-chunk-size: chunk size used to split the occurrence lags
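
Roughly speaking, these arguments control how the corresponding Dask arrays are chunked; a sketch of the idea (array names and sizes are illustrative, not the repo's internals):

import numpy as np
import dask.array as da

T, n_occ, n_lag = 2_000_000, 50_000, 400_000                           # illustrative sizes
episode = da.from_array(np.zeros(T, dtype=np.int8), chunks=100_000)    # --X-chunk-size
occurrences = da.from_array(np.zeros(n_occ), chunks=5_000)             # --kernel-chunk-size
occurrence_lags = da.from_array(np.zeros(n_lag), chunks=20_000)        # --lag-chunk-size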

General Guide:

  • Bottlenecks:

    • (#kernel_chunk_size, #occur_lag/#occur): estimation of the distributions of occurrences and occurrence lags
    • (#lag_chunk_size, #g-window-size): calculation of the accumulative self-excitation term; events beyond the g window are truncated. In the single-machine scenario, the memory used collectively by multiple workers can be approximated by #num-worker $\times$ bottleneck, which should not exceed RAM capacity.
  • Monitoring memory consumption with the Dask dashboard is highly recommended; see the sketch below.
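
For example, a local cluster started through dask.distributed exposes the dashboard automatically (a minimal sketch; worker count and memory limit are illustrative):

from dask.distributed import Client, LocalCluster

# keep num-workers x per-worker bottleneck below total RAM
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)
print(client.dashboard_link)      # open this URL to watch per-worker memory usage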

Here are some results of $g(t)$ fitted on novels by Mark Twain and Dickens. $g(t)$ models how the triggering effect between words varies with the time/length of the interval. We believe this metric can be used to characterize a writer's style.

[Figure: fitted $g(t)$ on Dickens]

[Figure: fitted $g(t)$ on Mark Twain]
