To play a card as a HUMAN player, double-click the card.

Windows EXE Versions


Committed Versions:

  • commit witches_0.1 Beta: shifting included in the learning process
  • commit witches_0.2 Beta: shifting also included (fixed core dump when playing with a human)
  • commit witches_0.3: multiplayer version, see the online folder
  • commit witches_0.4: multiplayer version, see the online folder (includes: marked card pressed, sleeping, simple click)
  • commit witches_0.5_beta: see the online folder (includes local play and clean exit on Windows)
# for single Player (one PC)
# for multi Player (in online folder:)


These are example options for the Linux version (not the online one); see also gui_options.json

  "names": ["Laura", "Alfons", "Frank", "Lea"],
  "type": ["NN", "RANDOM", "HUMAN", "RL0"],  "[HUMAN, RANDOM, NN, MCTS, RL0, RL1, ... RL5]"
  "expo": [500, 500, 500, 500],              "-> adjustments only for MCTS"
  "depths": [300, 300, 300, 300],            "-> adjustments only for MCTS"
  "itera": [5000, 5000, 5000, 5000],         "-> adjustments only for MCTS"
  "faceDown": [false, false, false, false],  "[true, false] whether the cards are hidden (face down) or visible"
  "sleepTime": 0.001,                        "Time to wait between 2 moves"
  "model_path_for_NN": "data/test.pth",      "Input path for the neural network"
  "nu_games": 100,                           "Number of games"
  "shifting_phase": 20,                      "TODO"
  "mcts_save_actions": false,                "-> adjustments only for MCTS"
  "mcts_actions_path": "data/actions_strong44__mcts.txt",  "-> adjustments only for MCTS"
  "automatic_mode": false,                   "[true, false]  true: play a pickle game_play"
  "save_game_play": false,                   "[true, false]  true: save a pickle game_play"
  "game_play_path": "data/game_play.pkl",    "*.pkl          path for pickle game_play"
  "onnx_path": "data/model_long_training.pth.onnx",  "[model_long_training.pth.onnx, model.pth.onnx, actions_all.pth.onnx]"
  "onnx_rl_path": ["rl_path3", "rl_path4", "rl_path5", "rl_path6"]   "in data/*.onnx [rl_path3, rl_path4, rl_path5, rl_path6, ... rl_path12_further] paths of PPO-trained models"

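The listing above mixes option values with explanatory strings, so it is not valid JSON as shown. A minimal sketch of how a cleaned-up subset could be loaded and sanity-checked follows. The helper `check_options` and the constants `FIXED_TYPES` and `PER_PLAYER_KEYS` are hypothetical names for illustration and not part of the repository.

```python
import json

# Subset of the options above with the explanatory strings removed (valid JSON).
OPTIONS_JSON = """
{
  "names": ["Laura", "Alfons", "Frank", "Lea"],
  "type": ["NN", "RANDOM", "HUMAN", "RL0"],
  "expo": [500, 500, 500, 500],
  "depths": [300, 300, 300, 300],
  "itera": [5000, 5000, 5000, 5000],
  "faceDown": [false, false, false, false],
  "sleepTime": 0.001,
  "nu_games": 100
}
"""

# Assumption: these are the non-RL player types listed in the options comment.
FIXED_TYPES = {"HUMAN", "RANDOM", "NN", "MCTS"}
PER_PLAYER_KEYS = ("type", "expo", "depths", "itera", "faceDown")


def check_options(options):
    """Hypothetical helper: return a list of problems found in an options dict."""
    problems = []
    n_players = len(options["names"])
    # Every per-player list needs exactly one entry per name.
    for key in PER_PLAYER_KEYS:
        if len(options[key]) != n_players:
            problems.append(f"{key}: expected {n_players} entries, got {len(options[key])}")
    # Player types are either one of the fixed names or "RL" followed by digits.
    for player_type in options["type"]:
        is_rl = player_type.startswith("RL") and player_type[2:].isdigit()
        if player_type not in FIXED_TYPES and not is_rl:
            problems.append(f"unknown player type: {player_type}")
    return problems


options = json.loads(OPTIONS_JSON)
print(check_options(options))  # an empty list means no problems were found
```
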
Instructions for the online version (Windows)

Go to the online folder and see the option files.

  1. Download and extract
  2. Go here and note your open IP.
  3. Enable port forwarding (port 8000, TCP, IPv4) in your router (e.g. Fritzbox).
  4. Open start_server.exe, click options and enter the noted open_ip.
  5. Open start_client.exe, click options and enter the noted open_ip.

Player Types

  • RANDOM: plays a random possible card
  • HUMAN: you choose the card to play (use double click)
  • NN: a player that was trained by classifying data generated by MCTS
  • MCTS: Monte Carlo Tree Search player
    • Chooses an action based on predictions into the future (similar to minimax)
    • depth: the depth to which the tree is expanded
    • expo: trade-off between exploration and exploitation (the higher this value, the more exploration)
    • itera: max number of iterations (the lower, the faster)
  • RL: Reinforcement Learning
    • Select RL(number) with number in range(0, len(onnx_rl_path)); the higher the number, the stronger the RL player
    • An actor-critic Proximal Policy Optimization (PPO) reinforcement learning algorithm.
    • Generate a trained model using modules/ppo_witches and modules/gym-witches
    • onnx_rl_path: path to the trained model
      • rl_path4_82.onnx (the strongest one) wins 80% of the games (against RANDOM players only)
      • trained for 2h on a single CPU (i5), no vectorized environment
      • I can still beat the AI (it is too greedy). Stats of 3 games: [Rand=-32, RL=-9, RL=-35, ME=+1]
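
A minimal sketch of how an RL type string could be mapped to an entry of onnx_rl_path, assuming the number in RL(number) is used directly as a list index. The function name `rl_model_name` is hypothetical, and the actual lookup in the repository may differ.

```python
def rl_model_name(player_type, onnx_rl_path):
    """Map a type string such as "RL2" to the matching entry of onnx_rl_path.

    Assumption: the digits after "RL" are a zero-based index into the list.
    """
    if not (player_type.startswith("RL") and player_type[2:].isdigit()):
        raise ValueError(f"not an RL player type: {player_type}")
    index = int(player_type[2:])
    if index >= len(onnx_rl_path):
        raise IndexError(f"{player_type} needs at least {index + 1} entries in onnx_rl_path")
    return onnx_rl_path[index]


paths = ["rl_path3", "rl_path4", "rl_path5", "rl_path6"]
print(rl_model_name("RL0", paths))
print(rl_model_name("RL3", paths))
```
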

Rules of witches:

  • Aim: collect as few minus points as possible!
  • 60 cards (4x Joker, 1-14 in Yellow, Green, Red, Blue)
  • Red cards give -1 point each (except Red 11)
  • Blue cards do nothing (except Blue 11: if you have it in your offhand, it deletes Green 11 and Green 12 if you have them in your offhand as well)
  • Green cards do nothing (except Green 11: -5 points and Green 12: -10 points)
  • Yellow cards do nothing (except Yellow 11: +5 points)
  • A joker can be played at any time (you do not have to follow the color of the first player)
  • If you have no joker and you are not the first player, you have to play the same color as the first player.
  • The winner of a round (highest card value) has to start the next round
  • If only jokers are played, the first one wins the round.
  • Note: number 15 is a joker (in the code)
  • Buy the real-card game e.g. here: buy-witches
  • German rules
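
The scoring rules above can be sketched as a small function. This is an illustration under assumptions, not the repository's game logic: cards are written as hypothetical (color, value) tuples with colors "Y", "G", "R", "B", and since the list does not state what Red 11 does, the sketch simply gives it 0 points.

```python
def score_offhand(cards):
    """Return the points for a list of (color, value) cards in a player's offhand."""
    has_blue_11 = ("B", 11) in cards
    points = 0
    for color, value in cards:
        if color == "R" and value != 11:
            # Every red card except Red 11 costs one point.
            # Assumption: Red 11 itself counts 0 (its effect is not stated above).
            points -= 1
        elif color == "G" and value in (11, 12) and not has_blue_11:
            # Green 11 and Green 12 only count if Blue 11 is not in the same offhand.
            points -= 5 if value == 11 else 10
        elif color == "Y" and value == 11:
            points += 5
    return points


print(score_offhand([("R", 1), ("R", 2), ("G", 11), ("Y", 11)]))
print(score_offhand([("R", 3), ("G", 11), ("G", 12), ("B", 11)]))
```

In the second call Blue 11 cancels both green penalty cards, so only the single red card counts.
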

Training an AI

Start with MCTS (similar to minimax)

To train a strong, human-competitive player I first tested Monte Carlo Tree Search (MCTS). MCTS is usually used for games like chess to select a set of promising options (instead of testing all possible options as in the minimax algorithm). I used MCTS to predict possible future rewards if a specific card is played, and recorded the state and the action that MCTS predicted. Here I assumed that MCTS knows all cards (otherwise it is very hard to predict possible future actions). The recorded data can then be used to train a neural network (a classification problem: inputs and outputs are known). The trained NN does not know the cards of the other players. However, the NN performs only slightly better than RANDOM players: mcts_witches


I tested the pytorch REINFORCE (see e.g. this example) algorithm in modules/

  • see commit is_learning
  • run with modules$ python
  • player 0 is the trained player
  • 50 games played
  • half of them won by trained player! is_learning
  • See also pytorchForum question:
    Most of the time, for most gym environments, three Linear layers are enough, maybe 64 ~ 500 neurons per layer, and I would suggest you use a pyramid-like structure. Conv2d is only necessary for visual inputs.

    You must use discount < 1, otherwise there is no guarantee of convergence, because the convergence of magic-like RL algorithms relies on a simple math principle that you must have learned in your freshman math analysis class: converging series or, a little bit more advanced, compaction.

    Because the naive REINFORCE algorithm is bad, try to use DQN, RAINBOW, DDPG, TD3, A2C, A3C, PPO, TRPO, ACKTR or whatever you like. Follow the training result reference (openai gym train reference): normally you need to let the agent interact with the environment for 100K or even 1M steps; for extremely complex real-life scenes, you need stacks of servers and massively distributed algorithms like IMPALA. There are many, many methods to learn faster, but I would recommend you start with PPO. But PPO is not a solution once and for all; it cannot solve some scenes like the hardcore bipedalwalker from openai gym.

    You will know that it is learning by looking at its behavior changing from total noise, to a fool, to an expert, and to a fool again. Reward and loss might be good indicators, but be careful with your reward design, as your networks will exploit it and take lazy countermeasures! That's called reward shaping.

    Please use batches, because they can stabilize training: pytorch will average the gradients computed from each sample in the batch. You see more, you can judge better, right? Normally 100~256 samples per batch, but some studies say monstrous batch sizes like 20000+ could be helpful.
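
The remark about discount < 1 and converging series can be illustrated with a constant reward: the discounted sum is a geometric series that approaches reward / (1 - gamma) when gamma < 1, and keeps growing with the number of steps when gamma = 1. A minimal sketch (the function name `discounted_sum` is only for illustration):

```python
def discounted_sum(reward, gamma, steps):
    """Sum reward * gamma**t for t = 0 .. steps-1."""
    total = 0.0
    weight = 1.0
    for _ in range(steps):
        total += weight * reward
        weight *= gamma
    return total


# gamma < 1: the sum levels off near 1 / (1 - 0.9) = 10, no matter how long the episode is.
print(discounted_sum(1.0, 0.9, 100), discounted_sum(1.0, 0.9, 1000))
# gamma = 1: the sum grows without bound as the number of steps grows.
print(discounted_sum(1.0, 1.0, 100), discounted_sum(1.0, 1.0, 1000))
```
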


Learning Procedure (gym interface)

  • state = env.reset()
    • with state: 240x1
    • state = on_table, on_hand, played, play_options (each 60x1, one-hot encoded)
    • get the state right before ai_player has to play!
  • action = ppo_test.policy_old.act(state, memory)
  • state, reward, done, nu_games_won = env.step(action)
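
A minimal sketch of how the 240x1 state described above could be assembled from four 60-entry vectors. The card-to-index layout (color order and 15 values per color, with 15 as the joker) and the helper names are assumptions for illustration; the layout used in modules/gym-witches may differ.

```python
COLORS = ["Y", "G", "R", "B"]  # assumption: the color order is arbitrary here
VALUES_PER_COLOR = 15          # 1-14 plus 15 for the joker
N_CARDS = len(COLORS) * VALUES_PER_COLOR  # 60


def card_index(color, value):
    """Position of a (color, value) card in a 60-entry vector."""
    return COLORS.index(color) * VALUES_PER_COLOR + (value - 1)


def encode(cards):
    """60-entry 0/1 vector with a 1 for every card in the given list."""
    vector = [0] * N_CARDS
    for color, value in cards:
        vector[card_index(color, value)] = 1
    return vector


def build_state(on_table, on_hand, played, play_options):
    """Concatenate the four 60-entry vectors into one 240-entry state."""
    return encode(on_table) + encode(on_hand) + encode(played) + encode(play_options)


state = build_state(
    on_table=[("R", 5)],
    on_hand=[("R", 7), ("B", 11), ("G", 15)],
    played=[("Y", 1), ("Y", 2)],
    play_options=[("R", 7), ("G", 15)],
)
print(len(state), sum(state))
```
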

PPO with Monte Carlo Reward Estimation

No GAE used. See the file

  • See file modules/

  • First I learned without discounted rewards:

  • is_learning-img

  • Problem: learning stopped too early (it was short-sighted learning), see also here

  • Next I included the mc rewards and played around with the hyper-parameters: is_learning

  • Results

  • Use beta=0.01 (at the beginning!)

  • Using 512 for latent layers does not improve the results (64 are already enough)

  • Using update_timestep of 2000 is advised!

  • eps clipping = 0.1 is advised!

  • 81% is a maximum

    • Can still be better!
    • Plays too greedy (always captures Blue 11)
    • Plays the joker too early!
  • PPO Hyperparameter Tuning

    • if it gets worse again at some point, lower the lr (to be adjusted first)
  • Training against trained players

    • is computationally expensive (use path instead of .onnx)
    • Does not have a big effect (correct hyperparams not found yet?)
    • Tested gamma=0.7 -> almost no effect
    • Tested update timestep = 5 -> almost no effect
    • Example 35% : Game ,0018000, rew ,-5.351, inv_mo ,0.0025, won ,[48. 79. 51. 58.], Time ,0:07:40.390401
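
For reference, a minimal sketch of the standard PPO clipped objective for a single sample, using the eps clipping value of 0.1 noted above. It assumes that beta is the weight of the entropy bonus, which is not stated explicitly here, and it leaves out the value loss term. The function name `ppo_clipped_loss` is hypothetical and this is not the code in modules/.

```python
import math


def ppo_clipped_loss(new_logprob, old_logprob, advantage, entropy,
                     eps_clip=0.1, beta=0.01):
    """Clipped PPO policy loss for one sample (value loss omitted)."""
    # Probability ratio between the new and the old policy for the taken action.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    # Pessimistic (minimum) surrogate: large policy changes earn no extra credit.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Assumption: beta scales an entropy bonus that encourages exploration.
    return -surrogate - beta * entropy


# Ratio 1.5 with positive advantage is clipped to 1.1.
print(ppo_clipped_loss(math.log(1.5), 0.0, advantage=1.0, entropy=0.0))
# Ratio 1.5 with negative advantage is not clipped (the minimum keeps the worse value).
print(ppo_clipped_loss(math.log(1.5), 0.0, advantage=-1.0, entropy=0.0))
```
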

05.04.2020 - adjustments for learning correct moves: at commit best_learning_mc the rate of invalid moves (inv moves) is at 0.01 after 270000 episodes. Rewards are -100 for a wrong move and trick reward + 21 for a correct move.

  • change input to 180 and see if it improves: NO, does not learn faster; also use the options in the input state
  • change value layer and see if it improves: NO, does not learn faster; use a separate one
  • include shifting? and see what changes
    • rewarding with the total current rewards also seems to work (changed gameClasses to the newest one)
    • So far it seems to work; however, correct moves and invalid moves have different meanings: inv_moves = 0.0455, correct moves = 16.23
    • Including some additional states has helped!
    • Found a new best player rl_path11_op with a win rate of 95% against random players! With shifting: 95_percent_winner_mc_rewarding-img; the new player is rl0
    • See commit 95_precent_mc_rewarding_winner or the nicer version shift_trained_further
      • Play 10 rounds against this player as a human!
      • Shifted cards are not always the best options!
      • Play against pretrained copies (see path 12, which does not have better stats....)
      • Train pretrained copies further.....
      • -1101 -890 in 200 games
      • -969 -935 in another 200 games
      • Test Monte Carlo options!!! as additional input to the network output?!
    • Note that the current rewarding aims to find Yellow 11 as fast as possible!
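
The reward scheme from the 05.04.2020 entry (-100 for a wrong move, trick reward + 21 for a correct move) can be written down as a short sketch. The function names and the way the invalid-move rate is computed are assumptions for illustration, not taken from the repository.

```python
INVALID_MOVE_REWARD = -100
CORRECT_MOVE_BONUS = 21


def step_reward(move_is_valid, trick_reward):
    """Reward for one step: fixed penalty for a wrong move, shifted trick reward otherwise."""
    if not move_is_valid:
        return INVALID_MOVE_REWARD
    return trick_reward + CORRECT_MOVE_BONUS


def invalid_move_rate(move_validity):
    """Fraction of invalid moves in a list of booleans (True = valid move)."""
    if not move_validity:
        return 0.0
    return move_validity.count(False) / len(move_validity)


print(step_reward(False, trick_reward=-3))
print(step_reward(True, trick_reward=-5))
print(invalid_move_rate([True, True, False, True]))
```
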

17.04.2020 Learning multi Train against each other

  • learning against trained players takes 15h; shift works better, see rl_path14_multi

  • still worse than me (human) and worse than "rl_path12_further"

  • see commit included_multi

  • Added iig folder

  • adjustable card number

  • see commit included_multi

  • Note: see also the Colab version (it is not faster, however, and trains only while active at the PC, max. 12h)

  • how to reward?

  • works also with and without step?

  • max. should be 0.001 invalid moves in 2000 games or episodes?

  • reset to best_learning_mc and see how fast it learns correct moves ....

  • other state:
    • For each opponent: is in shifting phase
    • For each opponent: has in offhand [11 and 12, 13, 14, J] -> 15 states
    • Does not have... color x
    • Control before



  • Shuffling was used
  • Final rewarding was used
  • All logic bugs were removed
  • Training used an increased batch size (breaks after some time, error unknown)
  • Learn further1 with smaller eps, higher batch size and a slightly lower lr
  • Learn further2 again with smaller eps, higher batch size and a slightly lower lr
  • Final Result:
    • Against 3 random players it achieves 16.95 correct moves out of 17
    • Best reward: 0.5 per game, whereas the mean of random is -8.1 per game

PPO with gae

See the file

PPO with LSTM and gae

See the file

Test Baselines

  • Tested baselines see
  • Need to construct an own model! (so that it also learns moves)
  • Not so easy to use (export as onnx etc.)


  • Tune Hyperparameters in current

  • How to signal that it is the shifting phase?

  • Test if after shift phase player has new cards (in his options!)

  • Monitor value, loss, entropy

  • Use vectorenv test baselines custom environment

  • Include shifting

    • First move: shift 1 card (possible cards are all cards on hand)
    • Second move: shift 1 card (possible cards are all on hand except already shifted one)
    • Third move: play a card.
    • seems not to learn that now is a shift phase!
    • using 1 in on table cards during shift phase
    • Test if after shift phase player has new cards (in his options!)
    • Test minimal ppo implementation with GAE
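
The three-move scheme under "Include shifting" (shift, shift, then play) can be sketched as a function that returns the allowed cards per move. The function and parameter names are hypothetical, and the sketch assumes that a shifted card is still listed in the hand until the shift phase is over.

```python
N_SHIFT_MOVES = 2  # first and second move shift one card each, the third move plays


def options_for_move(move_number, hand, already_shifted, playable):
    """Return the cards that may be chosen on the given move (0-based)."""
    if move_number < N_SHIFT_MOVES:
        # Shift phase: any card on hand that was not shifted before.
        return [card for card in hand if card not in already_shifted]
    # Play phase: the normal play options computed by the game logic.
    return list(playable)


hand = [("R", 7), ("B", 11), ("G", 15)]
print(options_for_move(0, hand, already_shifted=[], playable=[]))
print(options_for_move(1, hand, already_shifted=[("B", 11)], playable=[]))
print(options_for_move(2, hand, already_shifted=[], playable=[("R", 7)]))
```
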

Further Notes


  1. create VanilaMCTS object
  2. run SOLVE
    • for n_iterations = 50 do:
      • Selection: select the leaf node with the maximum uct (exploration vs. exploitation) value; out: node to expand, depth (root=0)
      • Expansion: create all possible outcomes from the leaf node; out: expanded tree
      • Simulation: simulate the game from the child node's state until it reaches the final state of the game; out: rewards
      • Backprop: assign rewards back up to the top node!

Tree consists of state([self.players, self.rewards, self.on_table_cards, self.played_cards]), current_player, n, w, q

  • n = number of iterations
  • w = summed reward at that depth
  • q = w/n

The tree looks as follows:

    tree = {root_id: {'state': state,
                      'player': player,
                      'cards_away': [],  # this is used for shifting!
                      'child': [],
                      'parent': None,
                      'n': 0,
                      'w': 0,
                      'q': None}}
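
A minimal sketch of the selection and backprop steps on a tree of this shape. It uses the common UCB1 form q + expo * sqrt(ln(parent n) / n) as the uct value; the exact formula in the VanilaMCTS code is not shown in this README, so treat the formula, the function names and the string node ids as assumptions.

```python
import math


def uct(node, parent_visits, expo):
    """UCB1-style value: mean reward plus an exploration bonus."""
    if node["n"] == 0:
        return float("inf")  # unvisited nodes are tried first
    mean_reward = node["w"] / node["n"]
    return mean_reward + expo * math.sqrt(math.log(parent_visits) / node["n"])


def select_child(tree, parent_id, expo):
    """Pick the child of parent_id with the highest uct value."""
    parent = tree[parent_id]
    return max(parent["child"], key=lambda cid: uct(tree[cid], parent["n"], expo))


def backpropagate(tree, node_id, reward):
    """Add the reward to the node and all its ancestors up to the root."""
    while node_id is not None:
        node = tree[node_id]
        node["n"] += 1
        node["w"] += reward
        node["q"] = node["w"] / node["n"]
        node_id = node["parent"]


# Tiny example tree (state, player and cards_away left out for brevity).
tree = {
    "root": {"child": ["a", "b"], "parent": None, "n": 10, "w": 0.0, "q": 0.0},
    "a": {"child": [], "parent": "root", "n": 10, "w": 5.0, "q": 0.5},
    "b": {"child": [], "parent": "root", "n": 0, "w": 0.0, "q": None},
}
first = select_child(tree, "root", expo=1.0)
backpropagate(tree, first, reward=-2.0)
print(first, tree[first]["q"], tree["root"]["n"])
```
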


  • Clipping probs = torch.clamp(probs, 0, 1) # did not work for me
  • torch.nn.utils.clip_grad_norm_(self.parameters(), 5) in updatePolicy
  • working adam see commit working_adam
  • Problem for SGD and adam (what is the reason?!)
    ERROR!!!! invalid multinomial distribution (encountering probability entry < 0)
    • Local minimum reached at a win rate of around 22?!
    • use different lr !
    • use clipping
    • use clamping?!
  • TEST PPO Pytorch from here expl
    • Return = discounted sum of rewards (gamma works like interest in finance: get money NOW! greedy or not)
    • The value function tries to estimate the final reward in this episode
    • advantage estimate = discounted reward - baseline estimate (by value function)
      • > 0: gradient is positive, increase action probabilities

    • Running gradient descent on a single batch destroys your policy (because of NOISE) -> a trust region is required!
  • Incorporate loss from here
  • Test nn.Tanh() as activation function! see here
  • Test discounted rewards (set gamma to not greedy)
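
The notes on return and advantage above can be made concrete with a small sketch: the return at each step is the discounted sum of the rewards from that step on, and the advantage is that return minus the value function's baseline estimate. Function names are for illustration only.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return can reuse the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma):
    """Advantage estimate = discounted return - baseline estimate (value function)."""
    returns = discounted_returns(rewards, gamma)
    return [ret - value for ret, value in zip(returns, values)]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical value-function outputs
print(discounted_returns(rewards, gamma=0.5))
print(advantages(rewards, values, gamma=0.5))
```

A positive advantage means the action turned out better than the baseline expected, so its probability is increased; a negative one decreases it.
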

Older Stuff:

Creating an exe

* convert torch model and params to onnx
* use pyinstaller (also on ubuntu possible!)
* in this case exe should be smaller!



Further Notes:

Problem: MCTS does not work for imperfect information games, or in other words, as soon as more than 2 players are involved everything explodes?! E.g. if everyone gets only 3 cards.

For 4 players and 3 cards one would have to train 3! = 6 MCTS with:

  1. Move: max. 81 possible states. The first player has 3 options, second player 9 possible states, third player 27 possible states, fourth player 81 possible states

  2. Move: for each of the 81 possible states

  • if player 1 wins: the first player then has 2 options (81*2 states), the second player then has 2 options (81*4 states), the third player then has 2 options (81*8), the fourth player then has 2 options (81*16); the whole thing times 4 again if another player wins, i.e. 5184 states.
  1. Move:
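
The state counts above can be reproduced with a few lines of arithmetic. The function name is hypothetical; the numbers follow directly from the text (3 options per player in the first move, 2 in the second, times 4 for the possible winners of the first trick).

```python
def states_per_player(states_before, options_per_player, n_players=4):
    """Number of states after each player has chosen one of their options."""
    counts = []
    states = states_before
    for _ in range(n_players):
        states *= options_per_player
        counts.append(states)
    return counts


first_move = states_per_player(1, options_per_player=3)
second_move = states_per_player(first_move[-1], options_per_player=2)
total = second_move[-1] * 4  # any of the 4 players may have won the first trick

print(first_move)
print(second_move)
print(total)
```
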

Other Card Games:

Example tree

Started commit "beginning_reinforce"

  • why not learning anything??
  • tested with self.rewards, and discountedRewards
  • change lr, momentum, gamma (for rewards)
  • tested with -1*loss (no effect)
  • how does the learning work in general? are the network params lost when initializing?
  • see other reinforce examples:
  • Using a different discount function does not work either!
  • use only positive rewards! does not work either!!!
  • Try using another network!
  • Does not work either!!!

  • Question on forum

    • How do I know that my algorithm learns something?
    • How to setup the network?
    • What am I missing?
    • What shape should losses have (15x15 matrix?)
    • see here
    • currently this still works best: self.optimizer = optim.SGD(self.parameters(), lr=0.1)
    • best result: game finished with::: [-295. -663. -729. -716.]
    • Should I collect batches????!!!
    • The problem is that
    • invalid multinomial distribution (encountering probability entry < 0)
    • clipping to prevent nans:

      torch.nn.utils.clip_grad_norm_(self.parameters(), 5)
  • Check Game Logic:

    • ai player plays valid cards!
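
A check that the AI only plays valid cards needs the set of playable cards. Below is a minimal sketch based on the rules listed earlier (a joker can always be played, otherwise the color of the first player has to be followed). It assumes that a player without a card of the led color may play any card, which the rules above do not state, and it uses hypothetical (color, value) tuples with value 15 for the joker.

```python
JOKER = 15  # number 15 is a joker in the code


def playable_cards(hand, lead_color):
    """Return the cards from hand that may be played; lead_color is None for the first player."""
    if lead_color is None:
        return list(hand)
    same_color = [c for c in hand if c[0] == lead_color and c[1] != JOKER]
    if not same_color:
        # Assumption: without a card of the led color, any card may be played.
        return list(hand)
    jokers = [c for c in hand if c[1] == JOKER]
    return same_color + jokers


def is_valid_move(card, hand, lead_color):
    """True if the card is one of the playable cards of this hand."""
    return card in playable_cards(hand, lead_color)


hand = [("R", 3), ("B", 5), ("G", 15)]
print(playable_cards(hand, "R"))
print(is_valid_move(("B", 5), hand, "R"))
```
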