
Homework 1 of MATH-6450I Reinforcement Learning

ZHANG Yun

Use reinforcement learning methods to study the Blackjack game.

[TOC]

  • One deck contains 52 cards. All face cards (J, Q, K) are counted as 10.

  • The objective of the Blackjack game is to obtain cards whose total numerical value is as great as possible without exceeding 21.

  • The ace is called usable if the sum of the cards at hand (counting the ace as 1) plus 10 is smaller than or equal to 21, in which case the ace counts as 11; otherwise it is not usable.

  • Suppose there are some number of decks of cards (possibly infinitely many) and several players, including the gamblers and the dealer, at the table.

  • The game starts by dealing two cards to each player.

  • If someone gets 21 points immediately (check for a Natural or a Draw):

    • If only the dealer gets 21 points, every gambler loses.
    • If the dealer and some gambler(s) get 21 points at the same time (a Draw), which is a miracle, nobody wins and nobody loses.
    • If only some gambler(s) get 21 points (a Natural), they win and the other players lose.
  • For each gambler:

    • If the total points are smaller than 12, keep drawing cards automatically until the total reaches 12 or more (no decision is needed below 12).
    • Based on his observation and his policy, the gambler chooses between two options:
      • Hit: draw one card.
      • Stick: stop drawing cards and pass the turn to the next player.
    • If his total points exceed 21 while hitting, we say he goes bust, and he loses the game immediately.
  • Then it is the dealer's turn; his actions strictly follow a fixed policy: he sticks on any sum of 17 or greater, and hits otherwise.

  • Finally, the game proceeds to the settlement part (a code sketch of the settlement rule follows this list).

  • The dealer is excluded from the settlement.
  • In practice, for easier comparison, we first set points exceeding 21 (busted) to 0.
  • If the dealer goes bust:
    • If some gamblers did not go bust, they win and the others lose.
    • If everyone goes bust, which is also a miracle, everyone loses!
  • If the dealer does not go bust:
    • If some gamblers did not go bust:
      • Subtract the dealer's points from each gambler's points; those who still have a nonnegative value win, the others lose.
    • If all gamblers go bust, the dealer wins and the gamblers lose.
  • Rewards are given only after all players finish their turns.

  • Winning gamblers get reward +1, losers get reward -1.

  • If nobody wins or loses, everyone gets reward 0.

  • The reward for a Natural is 1.5 (the usual 3:2 payout).
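The settlement and reward rules above can be summarized in a short helper. The following is a minimal sketch only; the function name and list-based interface are illustrative assumptions, not the repository's actual code, and it omits the Natural/Draw special cases handled at the start of the game.

```python
def settle(gambler_points, dealer_points):
    """Return +1 for each winning gambler and -1 for each losing one.

    Illustrative sketch of the settlement rules above: busted totals
    (greater than 21) are first set to 0, then each gambler's total is
    compared with the dealer's.
    """
    def clip(p):
        # Busted hands are treated as 0 for easier comparison.
        return 0 if p > 21 else p

    dealer = clip(dealer_points)
    gamblers = [clip(p) for p in gambler_points]

    if dealer == 0:                                    # dealer goes bust
        if all(p == 0 for p in gamblers):              # everyone busts: everyone loses
            return [-1] * len(gamblers)
        return [1 if p > 0 else -1 for p in gamblers]
    # Dealer does not bust: a nonnegative difference means the gambler wins.
    return [1 if p - dealer >= 0 else -1 for p in gamblers]
```

For example, `settle([20, 25], 18)` would return `[1, -1]`: the first gambler beats the dealer's 18 while the second goes bust.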

The Blackjack game with the settings described above is implemented in an object-oriented way: the cards, decks, and players are all designed as objects, which together with their various methods constitute the game. Brief introductions are provided below.

  • Enumeration classes
    • An action enumeration: the available actions for players (hit, stick).
    • A game-status enumeration: the possible statuses for an episode of the Blackjack game.
    • A player-status enumeration: the possible statuses for a player in an episode of the Blackjack game.
    • A usability enumeration: whether an ace is usable.
  • Normal Classes
    • A card class: the poker cards used in Blackjack games. A special value method is implemented so that all face cards are counted as 10 and an ace may be usable (a small sketch of this logic follows this class list).
    • A deck class: decks of cards. The card-dealing method is implemented here.
    • A state class: a player's state, containing the cards at hand, whether there is a usable ace, and the current points.
    • A player class: a general class for players, containing the player's state and a card-drawing method that updates the player's state automatically.
    • A table class: the decks and players on the casino table.
      • It holds the decks (Decks) and the players (List[Player]); the last player is the dealer.
      • A dealing method deals cards to all players at the beginning of the Blackjack game.
      • Some quick summaries of the players' states and statuses.
    • A game class: the Blackjack game itself.
      • Players at the table take actions in sequence; a class attribute stores the id (index) of the current player.
      • Its main methods:
        • Take one step of action and return the new observation and status.
        • Check whether the ongoing episode is over.
        • Let the current player execute a given action.
        • Perform the final game settlement and assign rewards to the gamblers.
        • Return the observation of the current player: his sum of points, the dealer's showing card, and whether he has a usable ace.
        • Play one episode of the Blackjack game: first deal cards, then let each player take actions in turn; return the rewards and the players' trajectory.
        • Update and return the present game status.
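As referenced above, a minimal sketch of the card-value and usable-ace logic is shown below; the class and function names are illustrative assumptions, not the repository's actual identifiers.

```python
from dataclasses import dataclass

@dataclass
class Card:
    """Illustrative card: rank 1 (ace) through 13 (king)."""
    rank: int

    @property
    def value(self) -> int:
        # All face cards (11, 12, 13) count as 10; the ace counts as 1 here.
        return min(self.rank, 10)


def hand_points(cards):
    """Return (points, usable_ace) for a hand of Card objects.

    An ace is usable (counted as 11) when the total counting it as 1,
    plus 10, does not exceed 21, matching the rule described earlier.
    """
    total = sum(c.value for c in cards)
    has_ace = any(c.rank == 1 for c in cards)
    if has_ace and total + 10 <= 21:
        return total + 10, True
    return total, False
```

For example, `hand_points([Card(1), Card(7)])` gives `(18, True)`, while adding a king makes the ace no longer usable and the hand stays at 18 points.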

Reinforcement learning models for studying the Blackjack game.

  • An abstract template class for the other models, defining some interface methods:
    • Return the values of the action-value function for a given state.
    • Predict an action based on a state.
    • Train the model.
    • Load/save the model from/to a file.
  • A fixed-policy model for testing the environment as well as performing Monte Carlo prediction:
    • It sticks on any sum of 20 or greater, and hits otherwise (a minimal sketch of this policy is shown after this list).
    • Its value functions after 10,000 and 50,000 episodes of Monte Carlo learning are shown here.
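A minimal sketch of that fixed test policy, assuming a hypothetical action encoding (the repository uses an enumeration class, whose actual names may differ):

```python
# Hypothetical action encoding; the repository's enumeration class may differ.
HIT, STICK = 0, 1

def fixed_policy(player_sum: int) -> int:
    """Stick on any sum of 20 or greater, hit otherwise."""
    return STICK if player_sum >= 20 else HIT
```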

Monte Carlo Control, SARSA, Q-Learning

  • Since the Blackjack game is episodic and the reward is given only at the end, the original SARSA and Q-Learning methods are not directly appropriate. Here we adopt an episodic version of them; the only change is that within one episode the agent does not take actions according to the most recently updated policy (see the sketch after this list). The results show that they also work.
  • Hyperparameters
    • The parameter ε of the ε-greedy method is updated at the beginning of each episode.
    • The discount factor γ and the learning rate α are constants.
  • Convergence criterion: we record the last 100 updates of the Q-function values in a fixed-length queue. If the average of their absolute values is lower than a pre-specified threshold, training is considered converged and stops.
  • The experiments are performed under the unit-test framework, with the same hyperparameters.
  • Visualization packages are used to produce all illustrated figures; the visualization code is included in the repository.
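The sketch below illustrates this episodic variant: an ε-greedy behaviour policy is frozen within each episode, Q is updated only after the episode ends, and convergence is checked over the last 100 Q-updates. All names, the ε schedule, and the `play_episode` interface are illustrative assumptions, not the repository's actual API.

```python
import random
from collections import defaultdict, deque

def episodic_q_learning(play_episode, n_actions=2, alpha=0.01, gamma=1.0,
                        eps_schedule=lambda k: 1.0 / (k + 1),
                        tol=1e-4, max_episodes=500_000):
    """Episodic Q-Learning sketch: Q is updated only after an episode ends.

    `play_episode(policy)` is assumed to play one full game with the given
    (frozen) behaviour policy and return a trajectory [(state, action, reward), ...].
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    recent = deque(maxlen=100)                 # magnitudes of the last 100 Q-updates

    for k in range(max_episodes):
        eps = eps_schedule(k)

        def policy(state):
            # epsilon-greedy w.r.t. the current Q, kept fixed during the episode
            if random.random() < eps:
                return random.randrange(n_actions)
            q = Q[state]
            return q.index(max(q))

        trajectory = play_episode(policy)

        # Update Q only after the episode has finished (the "episodic" variant).
        for t, (s, a, r) in enumerate(trajectory):
            if t + 1 < len(trajectory):
                s_next = trajectory[t + 1][0]
                target = r + gamma * max(Q[s_next])
            else:
                target = r                     # terminal step: no bootstrapping
            delta = alpha * (target - Q[s][a])
            Q[s][a] += delta
            recent.append(abs(delta))

        # Convergence: average magnitude of the last 100 updates below the threshold.
        if len(recent) == recent.maxlen and sum(recent) / len(recent) < tol:
            break
    return Q
```

The SARSA variant would differ only in the target: it would bootstrap from the action actually taken at the next step instead of the maximum over actions.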

0. Apply Monte Carlo Prediction on a fixed policy

Apply Monte Carlo policy evaluation to the extreme policy that sticks on any sum of 20 or greater and hits otherwise. The approximated state-value functions after 10,000 and 50,000 episodes of evaluation are drawn respectively.

The result is basically the same as Figure 5.1 of Sutton and Barto. The estimates for states with a usable ace are relatively less regular because these states are rare. In any event, after 500,000 games the value function is well approximated.
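A minimal first-visit Monte Carlo prediction sketch for this evaluation is shown below; the `play_episode` interface and names are illustrative assumptions rather than the repository's actual code.

```python
from collections import defaultdict

def mc_prediction(play_episode, n_episodes=50_000, gamma=1.0):
    """First-visit Monte Carlo evaluation of a fixed policy.

    `play_episode()` is assumed to play one game under the fixed policy and
    return a trajectory [(state, reward), ...] for the evaluated player.
    """
    V = defaultdict(float)        # state-value estimates
    counts = defaultdict(int)     # number of first visits per state

    for _ in range(n_episodes):
        trajectory = play_episode()
        G = 0.0
        first_visit_return = {}
        # Walk the episode backwards so the last assignment for a state
        # corresponds to its first visit.
        for state, reward in reversed(trajectory):
            G = gamma * G + reward
            first_visit_return[state] = G
        # Incremental average of first-visit returns over episodes.
        for state, G_first in first_visit_return.items():
            counts[state] += 1
            V[state] += (G_first - V[state]) / counts[state]
    return V
```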

1. Find the optimal policy for the Blackjack game under the specified settings.

Source: https://github.com/claude9493/Blackjack_RL