
[WIP] Support TorchScript and graph rewrite #54

Open · wants to merge 12 commits into main

Conversation

@feihugis (Contributor)

This PR makes SelfAttention of the transformers BART model compatible with TorchScript and adds a graph rewriter/optimization for einsum.

@JiushengChen (Contributor)

Good to see this PR!

  1. I suppose PR #43 (Encoder-decoder Multihead attention cpu optimization) can be deprecated now. Please work with @NickNickGo to merge everything into this PR.
  2. Update the benchmarks in the scripts and READMEs.

@feihugis (Contributor, Author) commented Nov 11, 2020

Good to see this PR!

  1. I suppose PR #43 (Encoder-decoder Multihead attention cpu optimization) can be deprecated now. Please work with @NickNickGo to merge everything into this PR.

PR #43 (@NickNickGo) is for fairseq, and this PR currently only works for the transformers BART model, so there will be no conflicts between these two PRs.

[Jiusheng]: is it possible to cover both fairseq and transformers?

  2. Update the benchmarks in the scripts and READMEs.

There is a performance issue after the graph is rewritten. Once the issue is resolved, I will update the benchmark numbers and add docs.

@feihugis reopened this Nov 12, 2020
SelfAttention, _reorder_buffer)

from fastseq.logging import get_logger
from fastseq.utils.api_decorator import replace
from fastseq.optimizer.jit.graph_rewriter import optimize_graph
Contributor:

I see this is only for BART. With JIT, we should be able to optimize multiple models from the backend?

Contributor (Author):

Yes, the optimization can be applied to other models. The current limitation is that we need to check if other models are compatible with torch.jit.script.

Contributor:

This makes sense. Please include these in the PR. Looking forward to reviewing them.
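To make this concrete, here is a hedged usage sketch of how the rewriter could be applied to another model. It assumes optimize_graph (imported in this PR) rewrites a TorchScript graph in place; the only requirement on the model is that it scripts cleanly with torch.jit.script. TinyAttention is a hypothetical module used only for illustration.

```python
import torch
from torch import Tensor
# imported in this PR; its exact signature is an assumption here
from fastseq.optimizer.jit.graph_rewriter import optimize_graph

class TinyAttention(torch.nn.Module):
    """A hypothetical module whose einsum the rewriter is meant to pick up."""
    def forward(self, q: Tensor, k: Tensor) -> Tensor:
        return torch.einsum("bmhtd,bnhsd->bmhts", q, k)

scripted = torch.jit.script(TinyAttention())
optimize_graph(scripted.graph)  # assumed to rewrite aten::einsum nodes in place
```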

@graph_pattern
def einsum_rewrite_pattern_0(eqn: str, operands: List[Tensor]):
    # for cases like "bmhtd,bnhsd->bmhts"
    if (len(eqn) == 18 and eqn[0:3] == eqn[13:16] and eqn[0] == eqn[6] and
Contributor:

eqn[0:4] == eqn[13:17]?

Contributor:

Spaces are allowed in the equation; replace them first:

eqn = eqn.replace(' ', '')

Contributor (Author):

Yes, spaces are allowed here. One issue I'm working on is that adding the replace triggers a weird issue in IRParser.

        return r

    # for cases like "bmhts,bnhsd->bmhtd"
    if (len(eqn) == 18 and eqn[0:3] == eqn[13:16] and eqn[0] == eqn[6] and
Contributor:

same here

Contributor (Author):

Done.

def einsum_rewrite_pattern_0(eqn: str, operands: List[Tensor]):
    # for cases like "bmhtd,bnhsd->bmhts"
    if (len(eqn) == 18 and eqn[0:3] == eqn[13:16] and eqn[0] == eqn[6] and
            eqn[2] == eqn[8] and eqn[4] == eqn[10] and eqn[3] == eqn[16] and
Contributor:

eqn[3] == eqn[16] is unnecessary if eqn[0:4] == eqn[13:17] is used

Contributor (Author):

Done.
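To make the index checks above easier to follow, here is a small annotated sketch of how the positions in a length-18 equation line up (spaces already stripped):

```python
# For "bmhtd,bnhsd->bmhts": eqn[0:5] is the first operand, eqn[6:11] the
# second, and eqn[13:18] the output (eqn[5] == ',' and eqn[11:13] == '->').
eqn = "bmhtd,bnhsd->bmhts"
assert len(eqn) == 18
assert eqn[0:4] == eqn[13:17]   # output reuses the first operand's leading dims "bmht"
assert eqn[0] == eqn[6]         # both operands share the batch dim 'b'
assert eqn[2] == eqn[8]         # ... and the dim 'h'
assert eqn[4] == eqn[10]        # contracted dim 'd' is last in both operands
assert eqn[9] == eqn[17]        # 's' from the second operand is the last output dim
```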

    def test_einsum_rewriter(self):

        def run_einsum(t0: Tensor, t1: Tensor):
            r = torch.einsum("bmhtd,bnhsd->bmhts", t0, t1)
Contributor:

  1. Add some extra spaces in the equation.
  2. Use a different character set like i, j, k, etc.

Contributor (Author):

Item-2 is done.
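For reference, a minimal sketch of such a test using the alternative character set. optimize_graph is assumed here to rewrite the scripted function's graph in place, so the exact call may differ from the PR's test code.

```python
import torch
from torch import Tensor
from fastseq.optimizer.jit.graph_rewriter import optimize_graph  # signature assumed

def run_einsum(t0: Tensor, t1: Tensor) -> Tensor:
    # same structure as "bmhtd,bnhsd->bmhts", written with i/j/k/l/m instead
    return torch.einsum("ijklm,inksm->ijkls", t0, t1)

def test_einsum_rewriter_alt_chars():
    t0 = torch.randn(2, 4, 3, 5, 8)   # [i, j, k, l, m]
    t1 = torch.randn(2, 2, 3, 7, 8)   # [i, n, k, s, m]
    expected = run_einsum(t0, t1)

    scripted = torch.jit.script(run_einsum)
    optimize_graph(scripted.graph)    # assumed to rewrite the einsum node into bmm-based ops
    assert torch.allclose(expected, scripted(t0, t1), atol=1e-5)
```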

@NickNickGo (Contributor) left a comment:

Thanks @feihugis for this PR! Looks good in general.

  1. Looking forward to the speedup / profile comparison for the einsum op before/after.
  2. Can more cases/shapes be covered under the "einsum_rewrite_pattern_0" function?
  3. Could you briefly describe the changes made to make SelfAttention of the transformers BART model compatible with JIT? Maybe add a few comments in the code.

def einsum_rewrite_pattern_0(eqn: str, operands: List[Tensor]):
    # eqn = eqn.replace(' ', '')  # TODO: fix the issue: ValueError: stoll
    # for cases like "bmhtd,bnhsd->bmhts"
    if (len(eqn) == 18 and eqn[0:4] == eqn[13:17] and eqn[0] == eqn[6] and
Contributor:

Can we make this more general? The same pattern can be used for equations without a batch dimension.

Contributor (Author):

I prefer to leave it as it is. If we encounter such cases in the future, they can easily be added with a similar code block. Making it more general would end up looking like the implementation of the einsum kernel itself.

From the micro-benchmarking results, the runtime for large tensors is very similar with/without the optimization.


    # for cases like "bmhts,bnhsd->bmhtd"
    if (len(eqn) == 18 and eqn[0:4] == eqn[13:17] and eqn[0] == eqn[6] and
            eqn[2] == eqn[8] and eqn[4] == eqn[9] and eqn[10] == eqn[17]):
Contributor:

same.

        t0 = t0.reshape(b*h, m*t, s)
        t1 = t1.view(b*h, s, d)
        r = torch.bmm(t0, t1).view(b, h, m, t, d).permute(0, 2, 1, 3, 4)
        return r
Contributor:

Is the returned tensor contiguous? When comparing the speedup with einsum, please take this into account as well.

Contributor (Author):

No, the returned tensor is not contiguous, and the output of einsum is not contiguous either, so I think it is an apples-to-apples comparison.
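A quick, hedged way to check the contiguity claim on both sides of the comparison; the reshape/permute steps below are only an illustrative equivalent of the rewrite (with n fixed to 1), not the PR's exact code.

```python
import torch

b, m, h, t, s, d = 2, 4, 3, 5, 7, 8
q = torch.randn(b, m, h, t, s)
k = torch.randn(b, 1, h, s, d)   # n == 1 for simplicity

out_einsum = torch.einsum("bmhts,bnhsd->bmhtd", q, k)

# bmm-based equivalent of the same contraction
a = q.permute(0, 2, 1, 3, 4).reshape(b * h, m * t, s)
b_mat = k.squeeze(1).reshape(b * h, s, d)
out_bmm = torch.bmm(a, b_mat).view(b, h, m, t, d).permute(0, 2, 1, 3, 4)

print(out_einsum.is_contiguous(), out_bmm.is_contiguous())
print(torch.allclose(out_einsum, out_bmm, atol=1e-5))  # values should match
```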

Comment on lines 55 to 57
        time1 = timeit.Timer(
            functools.partial(script_run_einsum, eqn, t0, t1))
        s1 = time1.timeit(repeat_times)
Contributor:

Is CUDA synchronization taken care of?

Contributor (Author):

Good point. torch.cuda.synchronize() is added now.
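For reference, a minimal sketch (not the PR's exact benchmark code) of making CUDA timing reliable: synchronize before starting and after stopping the clock so queued asynchronous kernels are included in the measurement.

```python
import time
import torch

def timed(fn, *args, repeat_times: int = 100) -> float:
    """Wall-clock time of `repeat_times` calls, including pending CUDA work."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat_times):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start
```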

@feihugis force-pushed the dev_torchscript branch 2 times, most recently from c1955b7 to 5acea73 on November 17, 2020 06:40
@feihugis (Contributor, Author)

  3. Could you briefly describe the changes made to make SelfAttention of the transformers BART model compatible with JIT? Maybe add a few comments in the code.

The major change is to make the code work with the limited set of data types that TorchScript supports and to handle the behavioral differences between Python and TorchScript. For example, Python can update the values of a dictionary in place, but TorchScript cannot. The code logic is changed accordingly to handle these differences.
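As a purely illustrative (hypothetical, not the actual diff) example of this kind of change: TorchScript needs statically typed containers and is stricter about mutation patterns, so cache handling that freely updated a loosely typed dict in Python can be rewritten into explicitly typed, explicitly returned dictionaries.

```python
from typing import Dict
import torch
from torch import Tensor

@torch.jit.script
def update_cache(cache: Dict[str, Tensor], key: str, value: Tensor) -> Dict[str, Tensor]:
    # Build and return a new typed dict instead of relying on Python-style
    # in-place mutation of an untyped dict shared across modules.
    new_cache = torch.jit.annotate(Dict[str, Tensor], {})
    for k, v in cache.items():
        new_cache[k] = v
    new_cache[key] = value
    return new_cache
```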

@feihugis (Contributor, Author)

Based on the perf benchmarking below, the performance with/without the optimization is very similar.

Micro benchmark for optimized operation:

  • eqn='bmhtd,bnhsd->bmhts', shape0=[128, 4, 16, 5, 64], shape1=[128, 2, 16, 1024, 64]
    • einsum took: 3.4279239177703857;
    • optimized einsum torchscript took: 3.422758102416992;
    • optimized einsum python took: 3.422323703765869;
  • eqn='kmijd,knisd->kmijs', shape0=[128, 4, 16, 1, 64], shape1=[128, 2, 16, 1024, 64]
    • einsum took: 3.2339890003204346;
    • optimized einsum torchscript took: 3.231293201446533;
    • optimized einsum python took: 3.2313060760498047;
  • eqn='bmhts,bnhsd->bmhtd', shape0=[128, 4, 16, 5, 64], shape1=[128, 2, 16, 64, 1024]
    • einsum took: 5.048973798751831;
    • optimized einsum torchscript took: 5.0475754737854;
    • optimized einsum python took: 5.050021171569824;
  • eqn='impts,inpsw->imptw', shape0=[128, 4, 16, 3, 64], shape1=[128, 2, 16, 64, 7]
    • einsum took: 0.10066008567810059;
    • optimized einsum torchscript took: 0.08646607398986816;
    • optimized einsum python took: 0.08228182792663574;

E2E benchmark results:

  • with optimization

| Util | Model | Task | Split | BatchSize | Samples | Tokens | Bleu | Rouge | Loss | Perplexity | Runtime(seconds) | Throughput(samples/s) | Throughput(tokens/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|14.97\|25.28 | NA | NA | 156 | 6.6 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.94\|14.95\|25.26 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.96\|14.97\|25.27 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.95\|25.27 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.92\|25.26 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|14.98\|25.25 | NA | NA | 91 | 11.3 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.96\|14.98\|25.28 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.94\|25.29 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|15.01\|25.28 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|14.98\|25.26 | NA | NA | 92 | 11.1 | NA |
  • without optimization

| Util | Model | Task | Split | BatchSize | Samples | Tokens | Bleu | Rouge | Loss | Perplexity | Runtime(seconds) | Throughput(samples/s) | Throughput(tokens/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.96\|25.27 | NA | NA | 132 | 7.8 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.95\|25.30 | NA | NA | 91 | 11.3 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|14.95\|25.25 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.95\|25.30 | NA | NA | 91 | 11.3 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.96\|14.93\|25.27 | NA | NA | 93 | 11.0 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.99\|25.23 | NA | NA | 91 | 11.3 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.96\|25.25 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.94\|25.26 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.97\|14.96\|25.26 | NA | NA | 92 | 11.1 | NA |
| transformers_v3.0.2+fastseq_v0.0.4 | facebook/bart-large-cnn | cnn_dm.1k/raw | val | 128 | 1024 | NA | NA | 34.98\|14.95\|25.26 | NA | NA | 91 | 11.3 | NA |

JiushengChen and others added 12 commits November 17, 2020 21:54
* Fix prophenet dict loading.

* Use logger.

* Fix import.
* Generate the XML log file for each unit tests

* run all fastseq unit tests

* Add Nikhil's changes on pipeline to publish XML

* Just use a small unit test to test pipeline

* Change the xml folder path

* Add more tests

* Add env var for xml log dir and test the failures

* Enable all fastseq unit tests

* Enable all tests

* Generate xml files for fairseq and transformers unit tests

* Fix an issue in pytest command

* Trigger the CI pipeline
… (#59)

* Update install_requires and enable fairseq to work with torch 1.6&1.7

* Better error message and address some warnings in torch1.7

* Raise the error if fairseq/transformers are installed but the optmizations can not be applied

* Move transformers/fairseq to extra_require

* Remove the out-of-dated build files for ngram cuda op

* Run fastseq units before transformers and fairseq