Reinforcement Learning: How Machines Learn from Trial and Error


Publish Date: Jun 2

"The only real mistake is the one from which we learn nothing.” — John Powell


Before diving deep into the subject at hand, it's important to recognize that every Machine Learning algorithm is, at its core, an attempt to formalize something that humans, animals, or other living organisms naturally do in pursuit of their well-being.
So don’t get overwhelmed by the math or jargon too early. At its heart, RL is just a formal way of describing how we — as living beings — learn from experience. If you can relate it back to how you’ve personally learned things through trial and error, you’re already thinking like a reinforcement learning researcher.

Stay curious, connect concepts to your own experiences, and let intuition guide your technical understanding.

1. What is Reinforcement Learning?

Reinforcement Learning is simply a type of machine learning in which an agent learns by interacting with an environment and establishes a policy based on the responses it receives from that environment.

A Policy is a deliberate system of guidelines to guide decisions and achieve rational outcomes - Wikipedia

An Agent is simply a representative: something (or someone) that acts on behalf of another.

Imagine you are sent to represent someone, but you do not really know what is right or wrong in their world. All you can do is use trial and error: whatever brings good results, do more of it; whatever brings bad results, avoid it. You will still need to spend some time exploring before you can reliably tell good actions from bad ones.

Reinforcement Learning has evolved from three major research threads:

  • Trial-and-error Learning from psychology

  • Optimal Control Theory

  • Temporal-difference learning

You can learn more about these threads in this blog post:🏃‍➡️👉here

2. What are the elements of a Reinforcement Learning system?

-A Policy: a rule that defines the action an agent is supposed to take when in a given state. Psychology summarizes it as a set of stimulus-response rules; it defines the way a learning agent behaves over time.

-A Reward signal: this defines the goal of reinforcement learning. A reward is a single number sent by the environment to the agent at each time step. The objective is to maximize the total reward earned over the long run. Rewards define good and bad events for the learning agent.

-A Value function: a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate in the future starting from that state. This means that in an RL system value is not the same as reward: rewards are the immediate signals from the environment, while values indicate the long-term return from a state, taking into account the states that are likely to follow and the rewards available in those states.

-A Model of the environment: this is something that mimics the behavior of the environment. For example, given a state and an action, the model might predict the resulting next state and next reward. (A small code sketch of all four elements follows this list.)
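
To make these four elements concrete, here is a minimal, hypothetical sketch in plain Python. The names and values are illustrative only, and they preview the thermostat example used later in this post.

# Hypothetical sketch of the four elements (illustrative names and values)

# 1. Policy: a rule mapping each state to an action
policy = {"Cold": "ON", "Okay": "OFF", "Hot": "OFF"}

# 2. Reward signal: a single number returned by the environment at each step
def reward_signal(state, action):
    return 1 if (state, action) == ("Okay", "OFF") else -1

# 3. Value function: an estimate of long-term return from each state (learned over time)
value = {"Cold": 0.0, "Okay": 0.0, "Hot": 0.0}

# 4. Model of the environment: predicts the next state and reward for a (state, action) pair
def model(state, action):
    next_state = "Hot" if action == "ON" else "Cold"  # crude guess at the dynamics
    return next_state, reward_signal(state, action)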

Other important terms

-State (S) - The current situation of the environment
-Action (A) – What the agent can do


3. How does an Agent learn in an environment?

The agent tries actions and observes the results:

Takes an action

Gets a reward

Updates its behavior (policy)

This loop continues until it finds the best strategy, called an optimal policy.
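
In code, that loop always has the same shape. Here is a tiny, self-contained toy version of it; everything in it is illustrative, and the learning update is deliberately left out, since filling it in is exactly what the Q-learning example later in this post does.

import random

# Toy agent-environment loop (illustrative only; no learning yet)
def choose_action(state):
    return random.choice(["A", "B"])                       # the agent picks one of two actions

def environment_step(state, action):
    next_state = random.choice(["bad", "okay", "good"])    # the environment moves to a new state
    reward = 1 if next_state == "good" else -1             # ...and sends back a reward
    return next_state, reward

state = "okay"
total_reward = 0
for step in range(10):
    action = choose_action(state)
    state, reward = environment_step(state, action)
    total_reward += reward       # a real agent would also update its policy here

print("total reward:", total_reward)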


Let's do a small, primitive hands-on exercise.

Check out the working code here

Problem: Think of a smart thermostat that learns the best times to turn the heater on or off to keep the room comfortable and save energy. It gets positive rewards for comfort and negative rewards for wasting energy.

Environmental states

  • too hot
  • normal
  • too cold

Agent's actions

  • Turn ON heater
  • Turn OFF heater

Reward response

  • + 1 if room is comfortable
  • -1 if room is too hot or cold

OBJECTIVE: maximize reward


i. Import numpy and random

import numpy as np
import random

ii. Define the states and keep track of how many there are (we label them "Cold", "Okay" and "Hot" so they match the keys of the reward matrix defined later)

# define the states (labels must match the reward matrix keys used later)
states = ["Cold", "Okay", "Hot"]  # ordered from coldest to hottest
n_states = len(states)  # the number of states

The Python list above is called the state space. In many RL libraries a space is an object you can sample from (something like env.state_space.sample()). When you hear "space", think of it as a set.
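
In this hand-rolled example there is no environment object, so "sampling from the state space" is just picking a random element of the list. A small illustration, using the imports from step i:

# Sampling from our "state space" is just a random choice from the list
sample_state = random.choice(states)
print(sample_state)  # prints one of "Cold", "Okay" or "Hot"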

iii. Define the actions and keep track of how many there are

# define the actions
actions = ["ON", "OFF"]
n_actions = len(actions)  # the number of actions

iv. Initialize a Q-table Q(s,a):

This table acts as a guide that shows what you can expect to get depending on which state (s) you are in and which action (a) you take.

This Q-table tracks the expected reward (value) of taking each action in each temperature state.

State    Action: ON    Action: OFF
Cold       0.00           0.00
Okay       0.00           0.00
Hot        0.00           0.00
⚠️ Note: All values start at 0.00. As the agent interacts with the environment, these values will be updated to reflect which actions are more rewarding in each state.

# Initialize Q-table: rows = states, cols = actions
r = n_states
c = n_actions
q_table = np.zeros(( r, c))

v. Reward (response signal) function

Question: should the agent be aware of this function? Why? (Hint: with a model-free method like the Q-learning used below, the agent is not given this function in advance; it only observes the reward after taking an action.)

We will represent this function in the form of a reward matrix using a Python dictionary:

reward_matrix = {
    "Cold":    {"ON": 0,   "OFF": -1},
    "Okay":    {"ON": -1,  "OFF": 1},
    "Hot":     {"ON": -1,  "OFF": -2}
}
# Don't let this confuse you: these are simply rules,
# and you can define your own as you wish

Reward Matrix Explanation

The thermostat has 3 possible temperature states and can take 2 actions: turning the heater ON or OFF.

State   Action   Reward   Interpretation
Cold    ON        0       Neutral: turning on the heater is expected when it's cold.
Cold    OFF      -1       Bad: it's cold and you're not heating.
Okay    ON       -1       Bad: wasting energy when it's already comfortable.
Okay    OFF      +1       Good: maintaining comfort and saving energy.
Hot     ON       -1       Bad: worsening the situation, it's already hot.
Hot     OFF      -2       Very bad: it's hot and you're not cooling down.

vi. Parameters

These include the learning rate, the discount factor, and the exploration rate.

-Learning rate (α): the size of the steps you take toward a goal. If your steps are too small you may take a very long time to arrive; if they are too large you may keep overshooting the destination (think of a swinging pendulum: that is what happens when the learning rate is too high). Choosing a learning rate is not something to do carelessly.

-Discount factor (γ): a number between 0 and 1 that guides the agent on how to weigh future rewards against immediate rewards.

Suppose I buy something from you that costs 10000 CFA and ask you to choose between receiving 10000 CFA now or 20000 CFA after 12 months. Which one would you prefer?

As someone who wishes to maximize profit, you would work out the present worth of the 20000 CFA that lies 12 months in the future.

That is exactly what the discount factor is for.

-The reward at time t is written R_t, and the total expected reward (the return) is written G_t:

G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \dots

If γ = 1: future rewards are as important as immediate rewards, so you must consider the future outcomes of your actions.

If γ = 0: future rewards do not matter at all; the agent focuses only on the immediate reward.
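
To make the discount factor concrete, here is a small sketch (with made-up reward values) that computes the discounted return G_t for a short sequence of rewards:

# Discounted return for an illustrative reward sequence
rewards = [1, 1, -1, 1]   # R_t, R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(round(G, 3))  # 1*1 + 0.9*1 + 0.81*(-1) + 0.729*1 = 1.819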

-Exploration rate (ε): this has to do with the question:

Should I stick to what I already know brings good results (exploitation), or should I experiment with alternative actions before settling on the best one (exploration)?

Exploring matters because there may be an action better than what you already know, but it can also waste time if what you knew was already the best.

So there is a constant wrestling match between Exploitation and Exploration.

😊😊❤️❤️❤️I LOVE EXPLORATION
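
The usual way to handle this trade-off is the epsilon-greedy rule: with probability ε pick a random action (explore), otherwise pick the best-known action (exploit). Here is a small standalone sketch of the rule, using the imports from step i; the training loop below applies the same pattern inline.

# Epsilon-greedy action selection (q_row is one row of the Q-table)
def choose_action_idx(q_row, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, len(q_row) - 1)   # explore: try a random action
    return int(np.argmax(q_row))                   # exploit: take the best-known action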

# Parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration rate

👨‍⚕️ These parameters must be chosen wisely.

vii. Simulate a training session (episode)


for episode in range(1000):
    # Start at a random state
    state_idx = random.randint(0, n_states - 1)
    state = states[state_idx]

    # Choose action: explore or exploit
    if random.uniform(0, 1) < epsilon:
        action_idx = random.randint(0, n_actions - 1)
    else:
        action_idx = np.argmax(q_table[state_idx])

    action = actions[action_idx]

    # Get reward
    reward = reward_matrix[state][action]

    # Simulate next state: very basic model
    if action == "ON":
        next_state_idx = min(state_idx + 1, n_states - 1)
    else:
        next_state_idx = max(state_idx - 1, 0)

    # Update Q-table using Q-learning formula
    old_value = q_table[state_idx, action_idx]
    next_max = np.max(q_table[next_state_idx])
    q_table[state_idx, action_idx] = old_value + alpha * (reward + gamma * next_max - old_value)

# Display learned Q-values
print("Learned Q-table:")
for i, state in enumerate(states):
    print(f"{state}: ON = {q_table[i, 0]:.2f}, OFF = {q_table[i, 1]:.2f}")

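The update inside the loop is the standard Q-learning rule:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

Once training finishes, you can read the learned (greedy) policy straight out of the table. A small optional addition, reusing the names defined above:

# Read the greedy policy out of the learned Q-table
for i, state in enumerate(states):
    best_action = actions[int(np.argmax(q_table[i]))]
    print(f"In state '{state}', prefer: {best_action}")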
