BASE／BC_H13／IQL_S3 v20220301

Model

Value Network `V`(`s`)

Encoder

Type: Transformer encoder layers (the same network structure as the one used for BERT_BASE)
- Dimension: 768
- # of heads: 12
- Dimension of feedforward networks: 3072
- # of layers: 12
- Activation function: GELU
- Dropout rate in training: 0.1
- Initialization: Transferred from the trained encoder of BASE／BC_H13 v20220210
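
As a rough sanity check on the size of this encoder, the following sketch counts the trainable parameters of the Transformer layers listed above. It assumes the standard BERT layer layout (biased query, key, value and output projections, a two-layer feedforward block, and two LayerNorms per layer) and excludes the input embeddings, whose size is not stated on this page. The function name is illustrative only.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one BERT-style encoder layer (assumed layout)."""
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear maps of the position-wise feedforward block, with biases.
    feedforward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Scale and shift vectors of the two LayerNorms.
    layer_norms = 2 * (2 * d_model)
    return attention + feedforward + layer_norms


D_MODEL, D_FF, NUM_LAYERS, NUM_HEADS = 768, 3072, 12, 12

per_layer = encoder_layer_params(D_MODEL, D_FF)
total = NUM_LAYERS * per_layer
head_dim = D_MODEL // NUM_HEADS

print(per_layer)  # 7087872
print(total)      # 85054464
print(head_dim)   # 64
```

Under these assumptions the 12 layers hold about 85 million parameters, and each of the 12 heads works on a 64-dimensional slice.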

Decoder

Type: Single-layer position-wise feedforward network
- Dimension: 3072
- Activation function: GELU
- Dropout rate in training: 0.1
- Initialization: Random

Q Network `Q`(`s`, `a`)

Encoder

Type: Transformer encoder layers (the same network structure as the one used for BERT_BASE)
- Dimension: 768
- # of heads: 12
- Dimension of feedforward networks: 3072
- # of layers: 12
- Activation function: GELU
- Dropout rate in training: 0.1
- Initialization: Transferred from the trained encoder of BASE／BC_H13 v20220210

Decoder

Type: Dueling network with two single-layer position-wise feedforward networks
- Dimension: 3072
- Activation function: GELU
- Dropout rate in training: 0.1
- Initialization: Random
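
The page does not say how the two heads of the dueling decoder are combined. A minimal sketch, assuming the common aggregation in which the mean advantage is subtracted before adding the state value (the function name and the toy numbers are hypothetical):

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a scalar value head and a per-action advantage head into Q-values."""
    mean_advantage = sum(advantages) / len(advantages)
    # Centering the advantages makes the mean of the Q-values equal the value head.
    return [state_value + (a - mean_advantage) for a in advantages]


q = dueling_q_values(0.5, [1.0, -1.0, 0.0])
print(q)  # [1.5, -0.5, 0.5]
```

In a real game only legal actions would enter the mean; that masking is omitted here.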

Objective

Type: Implicit Q-learning (IQL)
- Reward: Game delta of grading points as a Saint 3 player in the Jade room

Data

Crawled Game Records

Crawled Game Records v202007_202107

Training Examples

100044800 samples randomly sampled from the crawled game records and shuffled.
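
The sample count is an exact multiple of the batch size given under Optimization. The arithmetic below assumes that one optimizer step consumes exactly one batch of 4096 samples; gradient accumulation or multi-device sharding, if used, would change the step count.

```python
NUM_SAMPLES = 100_044_800
BATCH_SIZE = 4096  # batch size listed under Optimization

steps_per_epoch, remainder = divmod(NUM_SAMPLES, BATCH_SIZE)
print(steps_per_epoch, remainder)  # 24425 0
```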

Optimization

Implicit Q-learning (IQL)

Discount factor (γ): 0.99
Expectile (τ): 0.9
Soft update (Polyak averaging) rate of target networks (α): 0.1
Optimizer: LAMB
Learning rate: 0.001
ε: 1.0e-6
Batch size: 4096
# of training epochs: 1
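
To make the roles of γ, τ, and α concrete, here is a scalar sketch of the standard IQL losses and the soft target update. It is not the training code behind this model. The function names are hypothetical, and the update convention `target ← (1 − α)·target + α·online` is an assumption about how α is defined here.

```python
from typing import List

GAMMA = 0.99  # discount factor
TAU = 0.9     # expectile
ALPHA = 0.1   # Polyak averaging rate


def expectile_loss(diff: float, tau: float = TAU) -> float:
    """Asymmetric squared error |tau - 1(diff < 0)| * diff**2."""
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff * diff


def value_loss(q_target: float, v: float) -> float:
    """Loss for V(s): expectile regression toward the target network's Q(s, a)."""
    return expectile_loss(q_target - v)


def q_loss(q: float, reward: float, v_next: float, done: bool) -> float:
    """Loss for Q(s, a): squared TD error against r + gamma * V(s')."""
    target = reward + (0.0 if done else GAMMA * v_next)
    return (target - q) ** 2


def polyak_update(target: List[float], online: List[float],
                  alpha: float = ALPHA) -> List[float]:
    """Move each target parameter a fraction alpha toward the online parameter."""
    return [(1.0 - alpha) * t + alpha * o for t, o in zip(target, online)]


# With tau = 0.9, underestimating Q costs nine times more than overestimating it.
print(value_loss(q_target=1.0, v=0.0) / value_loss(q_target=0.0, v=1.0))
print(q_loss(q=0.0, reward=1.0, v_next=0.0, done=True))  # 1.0
print(polyak_update([0.0], [1.0]))                       # [0.1]
```

Because the penalty is asymmetric, `V` is pushed toward an upper expectile of `Q` over the actions seen in the data rather than toward their mean.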

Advantage Weighted Regression (AWR)

Inverse temperature (β): 1.0
Optimizer: LAMB
Learning rate: 0.001
ε: 1.0e-6
Batch size: 4096
# of training epochs: 1 (more than 1 epoch seems to result in overfitting)
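
The inverse temperature β controls how strongly policy extraction favors actions with a positive advantage. A minimal sketch of the usual advantage-weighted objective, assuming weights of the form exp(β·(Q − V)) without clipping (the page does not state whether the weights are clipped, and the function names are hypothetical):

```python
import math

BETA = 1.0  # inverse temperature


def awr_weight(q: float, v: float, beta: float = BETA) -> float:
    """Weight exp(beta * advantage) applied to one logged action."""
    return math.exp(beta * (q - v))


def awr_loss(q: float, v: float, log_prob: float, beta: float = BETA) -> float:
    """Advantage-weighted negative log-likelihood of the logged action."""
    return -awr_weight(q, v, beta) * log_prob


print(awr_weight(1.0, 1.0))            # 1.0 (zero advantage)
print(awr_weight(2.0, 1.0))            # e, about 2.718
print(awr_weight(2.0, 1.0, beta=2.0))  # e**2, about 7.389
```

A larger β concentrates the weight on fewer high-advantage samples.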

Quantitative Comparison with BASE／BC_H13 v20220210 as the Baseline

Please refer to Methods and Metrics in Performance Comparison and Evaluation for the evaluation method and the meaning of each metric.

The comparison uses 2500 sets of duplicate mahjong for each of the 1vs3 and 3vs1 styles and 1667 sets for the 2vs2 style. All games are half-length.
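
The game counts in the tables follow from these set counts if each set is replayed once for every distinct seat assignment of the two models. That reading is an assumption inferred from the totals, not something this page states. A short check:

```python
from math import comb

SEATS = 4


def games_per_set(num_on_one_side: int) -> int:
    """Distinct ways to seat `num_on_one_side` copies of one model at a table of four."""
    return comb(SEATS, num_on_one_side)


counts = {
    "1vs3": 2500 * games_per_set(1),
    "2vs2": 1667 * games_per_set(2),
    "3vs1": 2500 * games_per_set(3),
}
print(counts)  # {'1vs3': 10000, '2vs2': 10002, '3vs1': 10000}
```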

		average	variance (unbiased)	99% CI LL	95% CI LL	95% CI UL	99% CI UL
1vs3 # of games	10000
1vs3 ranking		2.457×10⁰	1.306×10⁰	2.428×10⁰	2.435×10⁰	2.479×10⁰	2.486×10⁰
1vs3 grading point		-1.631×10¹	2.526×10⁴	-2.040×10¹	-1.943×10¹	-1.319×10¹	-1.222×10¹
1vs3 soul point		1.488×10^-2	1.508×10^-1	4.875×10^-3	7.268×10^-3	2.249×10^-2	2.488×10^-2
1vs3 top rate		2.760×10^-1	1.998×10^-5	2.645×10^-1	2.672×10^-1	2.848×10^-1	2.875×10^-1
1vs3 quinella rate		5.198×10^-1	2.496×10^-5	5.069×10^-1	5.100×10^-1	5.296×10^-1	5.327×10^-1
1vs3 ranking diff		-5.733×10^-2	2.322×10⁰	N/A (one-sided)	N/A (one-sided)	-3.226×10^-2	-2.188×10^-2

2vs2 # of games	10002
2vs2 ranking		2.428×10⁰	1.279×10⁰	2.399×10⁰	2.406×10⁰	2.450×10⁰	2.457×10⁰
2vs2 grading point		-1.125×10¹	2.442×10⁴	-1.528×10¹	-1.431×10¹	-8.187×10⁰	-7.224×10⁰
2vs2 soul point		2.461×10^-2	1.480×10^-1	1.470×10^-2	1.707×10^-2	3.215×10^-2	3.452×10^-2
2vs2 top rate		2.794×10^-1	1.007×10^-5	2.712×10^-1	2.732×10^-1	2.856×10^-1	2.876×10^-1
2vs2 quinella rate		5.302×10^-1	1.245×10^-5	5.211×10^-1	5.233×10^-1	5.371×10^-1	5.393×10^-1
2vs2 ranking diff		-1.439×10^-1	1.658×10⁰	N/A (one-sided)	N/A (one-sided)	-1.227×10^-1	-1.139×10^-1

3vs1 # of games	10000
3vs1 ranking		2.482×10⁰	1.260×10⁰	2.453×10⁰	2.460×10⁰	2.504×10⁰	2.511×10⁰
3vs1 grading point		-1.749×10¹	2.458×10⁴	-2.153×10¹	-2.056×10¹	-1.442×10¹	-1.345×10¹
3vs1 soul point		6.140×10^-3	1.461×10^-1	-3.707×10^-3	-1.352×10^-3	1.363×10^-2	1.599×10^-2
3vs1 top rate		2.572×10^-1	6.369×10^-6	2.507×10^-1	2.523×10^-1	2.621×10^-1	2.637×10^-1
3vs1 quinella rate		5.084×10^-1	8.331×10^-6	5.010×10^-1	5.027×10^-1	5.141×10^-1	5.158×10^-1
3vs1 ranking diff		-7.067×10^-2	2.162×10⁰	N/A (one-sided)	N/A (one-sided)	-4.648×10^-2	-3.646×10^-2
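
The interval columns for the ranking, grading point, soul point, and ranking diff rows can be reproduced from the average, the variance, and the number of games with a normal approximation. The sketch below is a reconstruction under that assumption, with hypothetical function names. The authoritative definitions are in the evaluation page referenced above. For the top rate and quinella rate rows, the variance column appears to already be divided by the number of games.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_sided_ci(mean: float, variance: float, n: int, level: float) -> Tuple[float, float]:
    """Normal-approximation interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(variance / n)
    return mean - half_width, mean + half_width


def one_sided_upper_limit(mean: float, variance: float, n: int, level: float) -> float:
    """Upper limit of a one-sided normal-approximation interval."""
    z = NormalDist().inv_cdf(level)
    return mean + z * sqrt(variance / n)


# 1vs3 ranking row: average 2.457, variance 1.306, 10000 games.
low, high = two_sided_ci(2.457, 1.306, 10_000, 0.95)
# 1vs3 ranking diff row: average -5.733e-2, variance 2.322, 10000 games.
upper = one_sided_upper_limit(-5.733e-2, 2.322, 10_000, 0.95)

print(f"{low:.3f} {high:.3f} {upper:.4f}")  # 2.435 2.479 -0.0323
```

A ranking diff whose one-sided upper limit is below zero, as in this row, indicates an improvement in average ranking over the baseline at that confidence level.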

Supplemental: AWR epoch = 2 (probably overfitting)

		average	variance (unbiased)	99% CI LL	95% CI LL	95% CI UL	99% CI UL
1vs3 # of games	10000
1vs3 ranking		2.618×10⁰	1.254×10⁰	2.589×10⁰	2.596×10⁰	2.640×10⁰	2.647×10⁰
1vs3 grading point		-3.647×10¹	2.588×10⁴	-4.061×10¹	-3.962×10¹	-3.332×10¹	-3.233×10¹
1vs3 soul point		-3.940×10^-2	1.453×10^-1	-4.922×10^-2	-4.687×10^-2	-3.193×10^-2	-2.958×10^-2
1vs3 top rate		2.162×10^-1	1.695×10^-5	2.056×10^-1	2.081×10^-1	2.243×10^-1	2.268×10^-1
1vs3 quinella rate		4.588×10^-1	2.483×10^-5	4.460×10^-1	4.490×10^-1	4.686×10^-1	4.716×10^-1
1vs3 ranking diff		1.568×10^-1	2.229×10⁰	N/A (one-sided)	N/A (one-sided)	1.814×10^-1	1.915×10^-1

2vs2 # of games	10002
2vs2 ranking		2.442×10⁰	1.284×10⁰	2.413×10⁰	2.420×10⁰	2.464×10⁰	2.471×10⁰
2vs2 grading point		-1.317×10¹	2.455×10⁴	-1.721×10¹	-1.624×10¹	-1.010×10¹	-9.134×10⁰
2vs2 soul point		1.969×10^-2	1.485×10^-1	9.763×10^-3	1.214×10^-2	2.724×10^-2	2.962×10^-2
2vs2 top rate		2.763×10^-1	9.996×10^-6	2.682×10^-1	2.701×10^-1	2.825×10^-1	2.844×10^-1
2vs2 quinella rate		5.236×10^-1	1.247×10^-5	5.145×10^-1	5.167×10^-1	5.305×10^-1	5.327×10^-1
2vs2 ranking diff		-1.155×10^-1	1.667×10⁰	N/A (one-sided)	N/A (one-sided)	-9.426×10^-2	-8.546×10^-2

3vs1 # of games	10000
3vs1 ranking		2.469×10⁰	1.266×10⁰	2.440×10⁰	2.447×10⁰	2.491×10⁰	2.498×10⁰
3vs1 grading point		-1.600×10¹	2.451×10⁴	-2.003×10¹	-1.907×10¹	-1.293×10¹	-1.197×10¹
3vs1 soul point		1.058×10^-2	1.467×10^-1	7.123×10^-4	3.072×10^-3	1.809×10^-2	2.045×10^-2
3vs1 top rate		2.626×10^-1	6.455×10^-6	2.561×10^-1	2.576×10^-1	2.676×10^-1	2.691×10^-1
3vs1 quinella rate		5.139×10^-1	8.327×10^-6	5.065×10^-1	5.082×10^-1	5.196×10^-1	5.213×10^-1
3vs1 ranking diff		-1.225×10^-1	2.117×10⁰	N/A (one-sided)	N/A (one-sided)	-9.857×10^-2	-8.865×10^-2

Supplemental: AWR `β` = 2.0 (worse than `β` = 1.0)

(FIXME: Buggy results as of 2022/03/26)

		average	variance (unbiased)	99% CI LL	95% CI LL	95% CI UL	99% CI UL
1vs3 # of games	10000
1vs3 ranking		2.508×10⁰	1.235×10⁰	2.479×10⁰	2.486×10⁰	2.530×10⁰	2.537×10⁰
1vs3 grading point		-2.234×10¹	2.543×10⁴	-2.645×10¹	-2.547×10¹	-1.921×10¹	-1.823×10¹
1vs3 soul point		-2.870×10^-3	1.498×10^-1	-1.284×10^-2	-1.046×10^-2	4.717×10^-3	7.101×10^-3
1vs3 top rate		2.591×10^-1	1.920×10^-5	2.478×10^-1	2.505×10^-1	2.677×10^-1	2.704×10^-1
1vs3 quinella rate		4.962×10^-1	2.500×10^-5	4.833×10^-1	4.864×10^-1	5.060×10^-1	5.091×10^-1
1vs3 ranking diff		1.107×10^-2	2.303×10⁰	N/A (one-sided)	N/A (one-sided)	3.603×10^-2	4.638×10^-2

2vs2 # of games	10002
2vs2 ranking		2.483×10⁰	1.271×10⁰	2.454×10⁰	2.461×10⁰	2.505×10⁰	2.512×10⁰
2vs2 grading point		-1.810×10¹	2.483×10⁴	-2.216×10¹	-2.119×10¹	-1.501×10¹	-1.404×10¹
2vs2 soul point		5.884×10^-3	1.472×10^-1	-3.999×10^-3	-1.636×10^-3	1.340×10^-2	1.577×10^-2
2vs2 top rate		2.595×10^-1	9.607×10^-6	2.515×10^-1	2.534×10^-1	2.656×10^-1	2.675×10^-1
2vs2 quinella rate		5.083×10^-1	1.249×10^-5	4.992×10^-1	5.014×10^-1	5.152×10^-1	5.174×10^-1
2vs2 ranking diff		-3.369×10^-2	1.624×10⁰	N/A (one-sided)	N/A (one-sided)	-1.273×10^-2	-4.042×10^-3

3vs1 # of games	10000
3vs1 ranking		2.499×10⁰	1.261×10⁰	2.470×10⁰	2.477×10⁰	2.521×10⁰	2.528×10⁰
3vs1 grading point		-1.958×10¹	2.469×10⁴	-2.363×10¹	-2.266×10¹	-1.650×10¹	-1.553×10¹
3vs1 soul point		2.733×10^-4	1.462×10^-1	-9.578×10^-3	-7.222×10^-3	7.768×10^-3	1.012×10^-2
3vs1 top rate		2.533×10^-1	6.304×10^-6	2.468×10^-1	2.484×10^-1	2.582×10^-1	2.598×10^-1
3vs1 quinella rate		5.000×10^-1	8.334×10^-6	4.926×10^-1	4.943×10^-1	5.057×10^-1	5.074×10^-1
3vs1 ranking diff		-3.600×10^-3	2.162×10⁰	N/A (one-sided)	N/A (one-sided)	2.059×10^-2	3.061×10^-2
