"The only real mistake is the one from which we learn nothing.” — John Powell
Before diving deep into the subject at hand, it's important to recognize that every Machine Learning algorithm is, at its core, an attempt to formalize something that humans, animals, or other living organisms naturally do in pursuit of their well-being.
So don’t get overwhelmed by the math or jargon too early. At its heart, RL is just a formal way of describing how we — as living beings — learn from experience. If you can relate it back to how you’ve personally learned things through trial and error, you’re already thinking like a reinforcement learning researcher.
Stay curious, connect concepts to your own experiences, and let intuition guide your technical understanding.
1. What is Reinforcement Learning?
Reinforcement Learning is simply a type of machine learning in which an agent learns by interacting with an environment and establishes policies based on the responses it receives from that environment.
A Policy is a deliberate system of guidelines to guide decisions and achieve rational outcomes. (Wikipedia)
An Agent is simply a representative of someone. Imagine you are sent to represent someone, but you do not really know what is right or wrong. All you can do is trial and error: whatever brings good results, do it more often, and avoid whatever brings bad results. Even then, you will spend some time exploring before you can tell good actions from bad ones.
Reinforcement Learning has evolved from three major research threads:
Trial-and-error Learning from psychology
Optimal Control Theory
Temporal-difference learning
You can learn more about these threads in this blog post: 🏃➡️👉 here
2. What are the elements of Reinforcement Learning?
- A Policy: this is like a rule that defines the action an agent is supposed to take when in a given state. Psychology summarizes it as a set of stimulus-response rules. It defines the way a learning agent will behave over time.
- A Reward signal: this defines the goal of reinforcement learning. A reward is a single number sent by the environment to the agent at each time step. The objective is to maximize the total reward earned in the long run. Rewards define good and bad events for the learning agent.
- A Value function: a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate in the future, starting from that state. This means that in an RL system, value is not the same as reward: rewards are the immediate signals from the environment, while values indicate the long-term return from a state, taking into account the states that are likely to follow and the rewards available in those states.
- A model of the environment: this is something that mimics the behavior of the environment. For example, given a state and an action, the model might predict the resulting next state and next reward.
Other important terms
- State (S): the current situation of the environment
- Action (A): what the agent can do
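To make these elements concrete, here is a minimal sketch of how each one might be represented in Python. The two-state "weather" problem, the numbers, and the transitions are all made up purely for illustration:

# A tiny, made-up two-state problem, just to show the shape of each element
toy_states = ["rainy", "sunny"]
toy_actions = ["stay_in", "go_out"]

# A policy: a mapping from state to action
policy = {"rainy": "stay_in", "sunny": "go_out"}

# A reward signal: a single number returned by the environment at each step
reward = +1  # e.g. the last action led to something good

# A value function: the long-term return the agent expects from each state
value = {"rainy": 0.3, "sunny": 2.1}

# A model of the environment: predicts the next state and reward for (state, action)
def model(state, action):
    if state == "rainy" and action == "stay_in":
        return "sunny", +1  # made-up transition and reward
    return "rainy", -1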
3. How does an Agent Learn in an Environment?
The agent tries actions and observes the results:
Takes an action
Gets a reward
Updates its behavior (policy)
This loop continues until it finds the best strategy, called an optimal policy.
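Here is a minimal, runnable sketch of that loop using Gymnasium's CartPole environment (assuming Gymnasium is installed). The "agent" here just picks random actions and does no learning; it is only meant to show the shape of the loop before we build our own example below:

# Sketch of the agent-environment loop with a random "agent" (requires gymnasium)
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()
for step in range(100):
    action = env.action_space.sample()      # the agent picks an action (randomly here)
    next_state, reward, terminated, truncated, info = env.step(action)  # the environment responds
    # a real learning agent would update its policy here using (state, action, reward, next_state)
    state = next_state
    if terminated or truncated:             # episode over: start again
        state, info = env.reset()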
Let's do a small, primitive hands-on example.
Check out the working code here.
Problem: think of a smart thermostat that learns the best time to turn the heater ON or OFF to keep the room comfortable and save energy. It gets positive rewards for comfort and negative rewards for wasting energy.
Environment states
- Cold
- Okay (comfortable)
- Hot
Agent's actions
- Turn the heater ON
- Turn the heater OFF
Reward signal
- positive if the room stays comfortable
- negative if the room is too hot, too cold, or energy is being wasted
OBJECTIVE: maximize the total reward
i. Import numpy and random
import numpy as np
import random
ii. Define the states and keep track of how many there are
# Define the states (ordered from coldest to hottest)
states = ["Cold", "Okay", "Hot"]
n_states = len(states)  # the number of states
The Python list above plays the role of a state space. In libraries such as Gymnasium, this set is called the observation space and is exposed as env.observation_space (the actions form env.action_space); you can draw random elements from either with .sample(). When you hear of a "space", think of it as a set.
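For example (assuming Gymnasium is installed), inspecting and sampling these spaces looks like this:

# Inspect and sample the spaces of a Gymnasium environment
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)           # the set of possible states (observations)
print(env.action_space)                # the set of possible actions
print(env.observation_space.sample())  # a random element of the state space
print(env.action_space.sample())       # a random element of the action space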
iii. Define the actions and keep track of how many there are
# Define the actions
actions = ["ON", "OFF"]
n_actions = len(actions)  # the number of actions
iv. Initialize a Q-table Q(s,a):
This table acts as a guide that tells you what you can expect depending on what state (s) you are in and what action (a) you take.
This Q-table tracks the expected reward (value) of taking each action in each temperature state.
State | Action: ON | Action: OFF |
---|---|---|
Cold | 0.00 | 0.00 |
Okay | 0.00 | 0.00 |
Hot | 0.00 | 0.00 |
⚠️ Note: All values start at 0.00. As the agent interacts with the environment, these values will be updated to reflect which actions are more rewarding in each state.
# Initialize Q-table: rows = states, cols = actions
r = n_states
c = n_actions
q_table = np.zeros(( r, c))
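If you print the freshly initialized table with its labels, you should see the all-zero grid shown above:

# Print the initial (all-zero) Q-table with state and action labels
print("State | ON    | OFF")
for i, s in enumerate(states):
    print(f"{s:5} | {q_table[i, 0]:5.2f} | {q_table[i, 1]:5.2f}")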
v. Reward (response signal) function
Question: should the agent be aware of this function? Why? (Hint: in model-free RL like this, the agent does not know the reward function in advance; it only observes the rewards it receives. We write the function down here only so that we can simulate the environment.)
We will represent this function in the form of a reward matrix, using a Python dictionary of dictionaries.
reward_matrix = {
    "Cold": {"ON": 0, "OFF": -1},
    "Okay": {"ON": -1, "OFF": 1},
    "Hot": {"ON": -1, "OFF": -2}
}
# Don't let this confuse you: these are simply rules,
# and you can define your own as you wish.
Reward Matrix Explanation
The thermostat has 3 possible temperature states and can take 2 actions: turning the heater ON or OFF.
State | Action | Reward | Interpretation |
---|---|---|---|
Cold | ON | 0 | Neutral: turning on heater is expected in cold. |
Cold | OFF | -1 | Bad: it's cold and you're not heating. |
Okay | ON | -1 | Bad: wasting energy when it's already comfortable. |
Okay | OFF | +1 | Good: maintaining comfort and saving energy. |
Hot | ON | -1 | Bad: worsening the situation, it's already hot. |
Hot | OFF | -2 | Very bad: it's hot and you're not cooling down. |
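You can sanity-check that the dictionary matches this table by looping over it:

# Print every (state, action) pair and its reward from the dictionary
for s, action_rewards in reward_matrix.items():
    for a, r in action_rewards.items():
        print(f"state={s:4}  action={a:3}  reward={r:+d}")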
vi. Parameters
These include the learning rate, the discount factor, and the exploration rate.
- Learning rate (α): the size of the steps you take towards the goal. If your steps are too small, you may take almost forever to get there; if they are too large, you may keep overshooting the destination (think of a swinging pendulum: that is what happens when the learning rate is too high). So choosing a learning rate is not something to do carelessly.
- Discount factor (γ): a number between 0 and 1 that tells the agent how to weigh future rewards against immediate ones. Suppose I buy something from you that costs 10,000 CFA and ask you to choose between receiving 10,000 CFA now or 20,000 CFA after 12 months. As someone who wishes to maximize profit, you would work out the present worth of the 20,000 CFA that lies 12 months in the future, and that is exactly what the discount factor is for.
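As a quick worked example (with a made-up monthly discount factor of 0.9, just for illustration):

# Present value of a future reward, assuming a hypothetical monthly discount factor of 0.9
gamma = 0.9
future_reward = 20000                          # CFA, received 12 months from now
present_value = (gamma ** 12) * future_reward
print(present_value)                           # about 5648.6 CFA, less than 10000 CFA received now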
- The reward at time step $t$ is written $R_t$, and the total expected (discounted) return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
If $\gamma = 1$: future rewards are as important as immediate rewards, so you must consider the future outcomes of your actions. If $\gamma = 0$: future rewards are not important at all, so you focus only on the present reward.
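Here is a small sketch of that sum for a made-up sequence of rewards, so you can see how γ changes the total:

# Discounted return for a made-up sequence of rewards R_{t+1}, R_{t+2}, ...
rewards = [1, 1, 1, 1, 1]

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 1.0))  # 5.0  -> future rewards count fully
print(discounted_return(rewards, 0.9))  # ~4.1 -> future rewards count, but less and less
print(discounted_return(rewards, 0.0))  # 1.0  -> only the immediate reward counts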
- Exploration rate (ε): this has to do with the question: should I just stick to what I already know brings good results (exploitation), or should I try other alternatives before settling on the best one (exploration)? This matters because there may be a method better than the one you already know, but exploring can also waste your time if what you knew was already the best. So there is this wrestling match between Exploitation and Exploration.
😊😊❤️❤️❤️I LOVE EXPLORATION
# Parameters
alpha = 0.1 # learning rate
gamma = 0.9 # discount factor
epsilon = 0.2 # exploration rate
👨⚕️ These parameters must be chosen wisely.
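Using these parameters, the explore-or-exploit decision can be written as a small ε-greedy helper. This is just a sketch of the choice; the training loop below performs the same check inline:

# ε-greedy action selection: explore with probability epsilon, otherwise exploit
def choose_action(state_idx, q_table, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, n_actions - 1)   # explore: try a random action
    return int(np.argmax(q_table[state_idx]))     # exploit: take the best known action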
vii. Simulate a training session (episode)
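The core of each training step is the standard Q-learning update rule, which the code below implements line by line:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$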
for episode in range(1000):
    # Start at a random state
    state_idx = random.randint(0, n_states - 1)
    state = states[state_idx]

    # Choose action: explore or exploit
    if random.uniform(0, 1) < epsilon:
        action_idx = random.randint(0, n_actions - 1)
    else:
        action_idx = np.argmax(q_table[state_idx])
    action = actions[action_idx]

    # Get reward
    reward = reward_matrix[state][action]

    # Simulate next state: very basic model
    if action == "ON":
        next_state_idx = min(state_idx + 1, n_states - 1)
    else:
        next_state_idx = max(state_idx - 1, 0)

    # Update Q-table using the Q-learning formula
    old_value = q_table[state_idx, action_idx]
    next_max = np.max(q_table[next_state_idx])
    q_table[state_idx, action_idx] = old_value + alpha * (reward + gamma * next_max - old_value)
# Display learned Q-values
print("Learned Q-table:")
for i, state in enumerate(states):
    print(f"{state}: ON = {q_table[i, 0]:.2f}, OFF = {q_table[i, 1]:.2f}")
I like this quote: “The only real mistake is the one from which we learn nothing.”