This Python codebase implements a discrete Hawkes process model for linguistic pattern analysis. Powered by a Dask backend, this repo is:
- Scalable: Can be deployed on either a single machine or a cluster to model either short or long processes.
- Fast: Highly parallelized. Extra speed-oriented fitting options are provided.
- Extendable: Can be used in contexts other than linguistic analysis; only proper input data formatting is required.
- Create your own corpus
The corpus file should be in `.json` format, where each key is a token string (of, the, ...) and the value is a list of integers recording the time stamps at which that token occurs in a book. A utility that automates corpus creation from a text file is provided in `utils/scanner.py`, where `spacy-en_core_web_sm` is used as the default tokenizer. The default token position counting pipeline is:
```
[beginning of file]  To be or not to be , that is the question .
[lower case]         to be or not to be , that is the question .
[time stamp]         0  1  2  3   4  5  \ 6    7  8   9        \
[example result]     {'to': [0, 4], 'be': [1, 5], 'or': [2], ......}
```
```python
from utils.scanner import Scanner

ignore_list = [' ', '.', ',']  # characters to exclude when counting token positions
scanner = Scanner(ignore_list)
corpus, token_cnt = scanner.count_pos('Hamlet.txt')
# corpus: Dict[str, List[int]], mapping from token string to occurrence positions
# token_cnt: int, total number of tokens (ignore_list excluded)
```
Feel free to create your own custom corpus with another tokenizer.
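For instance, a minimal sketch of building a corpus with a different tokenizer and saving it in the expected `.json` layout might look like the following (the whitespace tokenizer, the `build_corpus` helper, and the file name `my_corpus.json` are only illustrative, not part of the repo):

```python
import json
from collections import defaultdict

def build_corpus(text, ignore_tokens=(' ', '.', ',')):
    """Map each token to the list of positions at which it occurs."""
    corpus, position = defaultdict(list), 0
    for token in text.lower().split():  # swap in any tokenizer here
        if token in ignore_tokens:
            continue  # ignored tokens get no time stamp and do not advance the counter
        corpus[token].append(position)
        position += 1
    return dict(corpus), position  # corpus and total token count

corpus, token_cnt = build_corpus("To be or not to be , that is the question .")
# corpus == {'to': [0, 4], 'be': [1, 5], 'or': [2], ...}, token_cnt == 10
with open('my_corpus.json', 'w') as f:
    json.dump(corpus, f)  # token -> [positions], the format main.py expects
```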
- Entry Point
```
$ python main.py -h  # display options help menu
$ python main.py --word --corpus-path --total-word-num \
                 --ckp-save-path \
                 --epoch \
                 --X-bandwidth --lag-bandwidth
```
- Args Spec
- word: target word in the book to consider
- corpus-path: path to the corpus json file
- total-word-num: total word/token count of the book
- ckp-save-path: save path of the fitted checkpoint
- X-bandwidth: bandwidth used when estimating the distribution of occurrences
- lag-bandwidth: bandwidth used when estimating the distribution of occurrence lags, i.e. the differences between pairs of occurrences
- epoch: the number of fitting epochs
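Purely as an illustration (every value below is a placeholder; run `python main.py -h` for the authoritative defaults), a full invocation could look like:

```
$ python main.py --word the \
                 --corpus-path my_corpus.json \
                 --total-word-num 32000 \
                 --ckp-save-path ckp/the.pkl \
                 --epoch 50 \
                 --X-bandwidth 100 --lag-bandwidth 100
```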
Conditional intensity function
The model combines a constant background rate $\mu_0$ with an accumulative self-excitation term: each earlier occurrence raises the intensity of later occurrences through an amplitude $A$ and a triggering kernel $g(t)$. In our example, $\mu_0$, $A$, and $g(t)$ are estimated by maximizing the objective likelihood of the observed occurrence times with an EM procedure:
- E step: update the latent attribution of each occurrence to either the background rate or an earlier occurrence.
- M step: update $\mu_0$, $A$, and $g(t)$ from those expected attributions.
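For reference, a sketch of these quantities under the standard self-exciting formulation implied by the parameters $\mu_0$, $A$, and $g(t)$ below (the repo's exact discrete-time likelihood may differ):

```latex
% Conditional intensity: background rate plus accumulated self-excitation
\lambda(t) = \mu_0 + A \sum_{t_i < t} g(t - t_i)

% Log-likelihood of occurrence times t_1, ..., t_N over horizon T
\ell(\mu_0, A, g) = \sum_{i=1}^{N} \log \lambda(t_i) - \int_0^T \lambda(t)\,dt

% E step: attribution (branching) probabilities for each occurrence
p_{i0} = \frac{\mu_0}{\lambda(t_i)}, \qquad
p_{ij} = \frac{A\, g(t_i - t_j)}{\lambda(t_i)} \quad (t_j < t_i)

% M step: re-estimate parameters from the expected attributions
\mu_0 \leftarrow \frac{1}{T}\sum_i p_{i0}, \qquad
A \leftarrow \frac{1}{N}\sum_i \sum_{j:\, t_j < t_i} p_{ij}, \qquad
g \leftarrow \text{weighted density estimate of the lags } t_i - t_j \text{ with weights } p_{ij}
```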
Parameter Init
- init-mu0: initial value of $\mu_0$
- init-A: initial value of $A$
- init-exp-lambda: initial value of $\lambda$ parameterizing the exponential distribution $\lambda e^{-\lambda x}$, which is used to initialize $g(t)$.
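As a rough sketch of what these three values seed (assuming $g(t)$ is held on a discrete grid of lags, which is an assumption about the internals rather than the repo's actual code):

```python
import numpy as np

# Example values only; pass the real ones via --init-mu0, --init-A, --init-exp-lambda
init_mu0, init_A, init_exp_lambda = 0.01, 0.5, 0.1
g_window_size = 200  # see "Fast Computation" below

mu0 = init_mu0
A = init_A
lags = np.arange(1, g_window_size + 1)
g = init_exp_lambda * np.exp(-init_exp_lambda * lags)  # g(t) initialized as lambda * exp(-lambda * t)
g /= g.sum()  # normalize so g sums to 1 over the window
```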
Fast Computation
- g-window-size: when computing the self-excitation term, events that occurred beyond this window will be truncated. Without truncation, the complexity of the accumulative trigger calculation is $O(N^2)$.
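Conceptually, the truncation turns the full pairwise sum into a windowed one; the following is a plain-NumPy sketch of that idea (function name and signature are illustrative, not the repo's Dask implementation):

```python
import numpy as np

def self_excitation(occurrences, g, window):
    """Sum of g(t_i - t_j) over earlier events t_j within `window` of each t_i.

    Only pairs with 0 < t_i - t_j <= window contribute, so each event looks
    back at a bounded number of predecessors instead of all N of them.
    """
    occurrences = np.asarray(occurrences)  # assumed sorted in increasing order
    excitation = np.zeros(len(occurrences))
    start = 0
    for i, t in enumerate(occurrences):
        while occurrences[start] < t - window:  # slide the left edge of the window
            start += 1
        lags = t - occurrences[start:i]  # lags to earlier events inside the window
        excitation[i] = g(lags).sum() if len(lags) else 0.0
    return excitation
```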
Memory Utilization
When running on a single machine, one of the major problems when fitting a long process is running out of RAM (OOM). The following arguments should be carefully tuned to prevent it:
- X-chunk-size: chunk size used to split the entire episode
- kernel-chunk-size: chunk size used to split across occurrences
- lag-chunk-size: chunk size used to split across occurrence lags
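Smaller chunks lower the peak memory each task holds, at the cost of more scheduling overhead. A toy Dask-array illustration of the idea (array names and sizes are illustrative only):

```python
import numpy as np
import dask.array as da

# Stand-ins for the episode timeline and a token's occurrence times.
X = da.from_array(np.arange(50_000), chunks=5_000)  # analogue of X-chunk-size
occ = da.from_array(np.sort(np.random.randint(0, 50_000, 2_000)), chunks=500)  # analogue of kernel-chunk-size

# The pairwise computation is built lazily; each task only ever holds one
# (5_000 x 500) block instead of the full 50_000 x 2_000 matrix.
pairwise = abs(X[:, None] - occ[None, :])
density = pairwise.mean(axis=1)
print(density.compute().shape)  # (50000,)
```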
General Guide:
- Bottlenecks:
  - (#kernel_chunk_size, #occur_lag/#occur): estimation of the distributions of occurrences and occurrence lags
  - (#lag_chunk_size, #g-window-size): calculation of the accumulative self-excitation term; events beyond the g window are truncated

  In a single-machine scenario, the memory used collectively by multiple workers can be approximated by #num-worker $\times$ the bottleneck size, which should not exceed RAM capacity.
- Monitoring memory consumption with the Dask Dashboard is highly recommended.
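For instance, when running locally, a minimal way to get the dashboard and cap per-worker memory (worker count and memory limit are illustrative):

```python
from dask.distributed import Client

# Cap each worker's memory so that num_workers x per-worker limit stays under RAM.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")
print(client.dashboard_link)  # open this URL to watch per-worker memory live
```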
Here are some of the results of