This Python codebase implements a discrete Hawkes process model for linguistic pattern analysis. Powered by a Dask backend, this repo is:
- Scalable: Can be deployed on either a single machine or a cluster to model either short or long processes.
- Fast: Highly parallelized. Extra speed-oriented fitting options are provided.
- Extendable: Can be used in contexts other than linguistic analysis; only proper input data formatting is required.
- Create your own corpus
The corpus file should be in `.json` format, where each key is a token string (of, the, ...) and the value is a list of integers recording the time stamps at which that token occurs in a book. A utility that automates corpus creation from a text file is provided in `utils/scanner.py`, where `spacy-en_core_web_sm` is used as the default tokenizer. The default token position counting pipeline is:
```
[beginning of file]  To be or not to be , that is the question .
[lower case]         to be or not to be , that is the question .
[time stamp]         0  1  2  3   4  5  \ 6    7  8   9        \
[example result]     {'to': [0, 4], 'be': [1, 5], 'or': [2], ......}
```
```python
from utils.scanner import Scanner

ignore_list = [' ', '.', ',']  # characters to exclude when counting token positions
scanner = Scanner(ignore_list)
corpus, token_cnt = scanner.count_pos('Hamlet.txt')
# corpus: Dict[str, List[int]], mapping from token string to occurrence positions
# token_cnt: int, total number of tokens (ignore_list excluded)
```
Feel free to create your own custom corpus with another tokenizer.
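For instance, a minimal sketch of building a corpus with a different tokenizer and saving it in the expected `.json` layout might look like the following (the whitespace tokenizer, the `build_corpus` helper, and the file name `my_corpus.json` are only illustrative, not part of the repo):

```python
import json
from collections import defaultdict

def build_corpus(text, ignore_tokens=(' ', '.', ',')):
    """Map each token to the list of positions at which it occurs."""
    corpus, position = defaultdict(list), 0
    for token in text.lower().split():  # swap in any tokenizer here
        if token in ignore_tokens:
            continue  # ignored tokens get no time stamp and do not advance the counter
        corpus[token].append(position)
        position += 1
    return dict(corpus), position  # corpus and total token count

corpus, token_cnt = build_corpus("To be or not to be , that is the question .")
# corpus == {'to': [0, 4], 'be': [1, 5], 'or': [2], ...}, token_cnt == 10
with open('my_corpus.json', 'w') as f:
    json.dump(corpus, f)  # token -> [positions], the format main.py expects
```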
- Entry Point
```
$ python main.py -h  # display options help menu
$ python main.py --word --corpus-path --total-word-num \
                 --ckp-save-path \
                 --epoch \
                 --X-bandwidth --lag-bandwidth
```
- Args Spec
- word: target word in the book to consider
- corpus-path: path to the corpus json file
- total-word-num: total word/token count of the book
- ckp-save-path: save path of the fitted checkpoint
- X-bandwidth: bandwidth used when estimating the distribution of occurrences
- lag-bandwidth: bandwidth used when estimating the distribution of occurrence lags, i.e. the differences between pairs of occurrences
- epoch: the number of fitting epochs
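Purely as an illustration (every value below is a placeholder; run `python main.py -h` for the authoritative defaults), a full invocation could look like:

```
$ python main.py --word the \
                 --corpus-path my_corpus.json \
                 --total-word-num 32000 \
                 --ckp-save-path ckp/the.pkl \
                 --epoch 50 \
                 --X-bandwidth 100 --lag-bandwidth 100
```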
Conditional intensity function
The model combines a constant background rate $\mu_0$ with an accumulative self-excitation term: each earlier occurrence raises the intensity of later occurrences through an amplitude $A$ and a triggering kernel $g(t)$. In our example, $\mu_0$, $A$, and $g(t)$ are estimated by maximizing the objective likelihood of the observed occurrence times with an EM procedure:
- E step: update the latent attribution of each occurrence to either the background rate or an earlier occurrence.
- M step: update $\mu_0$, $A$, and $g(t)$ from those expected attributions.
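For reference, a sketch of these quantities under the standard self-exciting formulation implied by the parameters $\mu_0$, $A$, and $g(t)$ below (the repo's exact discrete-time likelihood may differ):

```latex
% Conditional intensity: background rate plus accumulated self-excitation
\lambda(t) = \mu_0 + A \sum_{t_i < t} g(t - t_i)

% Log-likelihood of occurrence times t_1, ..., t_N over horizon T
\ell(\mu_0, A, g) = \sum_{i=1}^{N} \log \lambda(t_i) - \int_0^T \lambda(t)\,dt

% E step: attribution (branching) probabilities for each occurrence
p_{i0} = \frac{\mu_0}{\lambda(t_i)}, \qquad
p_{ij} = \frac{A\, g(t_i - t_j)}{\lambda(t_i)} \quad (t_j < t_i)

% M step: re-estimate parameters from the expected attributions
\mu_0 \leftarrow \frac{1}{T}\sum_i p_{i0}, \qquad
A \leftarrow \frac{1}{N}\sum_i \sum_{j:\, t_j < t_i} p_{ij}, \qquad
g \leftarrow \text{weighted density estimate of the lags } t_i - t_j \text{ with weights } p_{ij}
```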
Parameter Init
- init-mu0: initial value of $\mu_0$
- init-A: initial value of $A$
- init-exp-lambda: initial value of $\lambda$ parameterizing the exponential distribution $\lambda e^{-\lambda x}$, which is used to initialize $g(t)$.
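As a rough sketch of what these three values seed (assuming $g(t)$ is held on a discrete grid of lags, which is an assumption about the internals rather than the repo's actual code):

```python
import numpy as np

# Example values only; pass the real ones via --init-mu0, --init-A, --init-exp-lambda
init_mu0, init_A, init_exp_lambda = 0.01, 0.5, 0.1
g_window_size = 200  # see "Fast Computation" below

mu0 = init_mu0
A = init_A
lags = np.arange(1, g_window_size + 1)
g = init_exp_lambda * np.exp(-init_exp_lambda * lags)  # g(t) initialized as lambda * exp(-lambda * t)
g /= g.sum()  # normalize so g sums to 1 over the window
```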
Fast Computation
- g-window-size: when computing the self-excitation term, events that occurred beyond this window will be truncated. Without truncation, the complexity of the accumulative trigger calculation is $O(N^2)$.
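Conceptually, the truncation turns the full pairwise sum into a windowed one; the following is a plain-NumPy sketch of that idea (function name and signature are illustrative, not the repo's Dask implementation):

```python
import numpy as np

def self_excitation(occurrences, g, window):
    """Sum of g(t_i - t_j) over earlier events t_j within `window` of each t_i.

    Only pairs with 0 < t_i - t_j <= window contribute, so each event looks
    back at a bounded number of predecessors instead of all N of them.
    """
    occurrences = np.asarray(occurrences)  # assumed sorted in increasing order
    excitation = np.zeros(len(occurrences))
    start = 0
    for i, t in enumerate(occurrences):
        while occurrences[start] < t - window:  # slide the left edge of the window
            start += 1
        lags = t - occurrences[start:i]  # lags to earlier events inside the window
        excitation[i] = g(lags).sum() if len(lags) else 0.0
    return excitation
```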
Memory Utilization
When running on a single machine, one of the major problems when fitting a long process is running out of RAM (OOM). The following arguments should be carefully tuned to prevent it:
- X-chunk-size: chunk size used to split the entire episode
- kernel-chunk-size: chunk size used to split across occurrences
- lag-chunk-size: chunk size used to split across occurrence lags
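Smaller chunks lower the peak memory each task holds, at the cost of more scheduling overhead. A toy Dask-array illustration of the idea (array names and sizes are illustrative only):

```python
import numpy as np
import dask.array as da

# Stand-ins for the episode timeline and a token's occurrence times.
X = da.from_array(np.arange(50_000), chunks=5_000)  # analogue of X-chunk-size
occ = da.from_array(np.sort(np.random.randint(0, 50_000, 2_000)), chunks=500)  # analogue of kernel-chunk-size

# The pairwise computation is built lazily; each task only ever holds one
# (5_000 x 500) block instead of the full 50_000 x 2_000 matrix.
pairwise = abs(X[:, None] - occ[None, :])
density = pairwise.mean(axis=1)
print(density.compute().shape)  # (50000,)
```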
General Guide:
- Bottlenecks:
  - (#kernel_chunk_size, #occur_lag/#occur): estimation of the distributions of occurrences and occurrence lags
  - (#lag_chunk_size, #g-window-size): calculation of the accumulative self-excitation term; events beyond the g window are truncated

  In a single-machine scenario, the memory used collectively by multiple workers can be approximated by #num-worker $\times$ the bottleneck size, which should not exceed RAM capacity.
- Monitoring memory consumption with the Dask Dashboard is highly recommended.
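For instance, when running locally, a minimal way to get the dashboard and cap per-worker memory (worker count and memory limit are illustrative):

```python
from dask.distributed import Client

# Cap each worker's memory so that num_workers x per-worker limit stays under RAM.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")
print(client.dashboard_link)  # open this URL to watch per-worker memory live
```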
Here are some of the results of