In this tutorial, we explore how exploration strategies shape intelligent decision-making through agent-based problem solving. We build and train three agents: Q-Learning with epsilon-greedy exploration, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS). Each agent learns to navigate a grid world and reach a goal efficiently while avoiding obstacles. Along the way, we experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty.
import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict


class GridWorld:
    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        # Up, down, left, right moves, indexed by action ids 0-3.
        self.moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        # Scatter obstacles anywhere except the start and goal cells.
        obstacles = set()
        while len(obstacles) < n_obstacles:
            obs = (random.randint(0, size - 1), random.randint(0, size - 1))
            if obs not in [self.start, self.goal]:
                obstacles.add(obs)
                self.grid[obs] = 1
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # Apply the chosen move; off-grid or blocked moves leave the agent in place.
        move = self.moves[action]
        new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                and self.grid[new_pos] == 0):
            self.agent_pos = new_pos
        # +100 for reaching the goal, -1 per step otherwise.
        if self.agent_pos == self.goal:
            reward, done = 100, True
        else:
            reward, done = -1, False
        return self.agent_pos, reward, done

    def get_valid_actions(self, state):
        # Actions whose target cell is inside the grid and not an obstacle.
        valid = []
        for i, move in enumerate(self.moves):
            new_pos = (state[0] + move[0], state[1] + move[1])
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                    and self.grid[new_pos] == 0):
                valid.append(i)
        return valid
We begin by creating a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and ensure realistic navigation boundaries to simulate an interactive problem-solving space. This forms the foundation where our exploration agents will operate and learn.
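Before moving on to the learning agents, it is worth sanity-checking the environment by hand. The short sketch below is an optional addition (not part of the original script); it assumes the GridWorld class and imports above have already been run, and it simply resets the world and takes a few random valid steps while printing what happens.

# Quick sanity check of the environment (assumes GridWorld and the imports above are defined).
env = GridWorld(size=8, n_obstacles=10)
state = env.reset()
print("Start:", state, "Goal:", env.goal)

for t in range(5):
    valid = env.get_valid_actions(state)   # action indices with in-bounds, obstacle-free targets
    if not valid:                           # unlikely, but the start could be boxed in by obstacles
        break
    action = random.choice(valid)
    state, reward, done = env.step(action)
    print(f"step {t}: action={action}, pos={state}, reward={reward}, done={done}")
    if done:
        break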
class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # exploration rate
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        else:
            q_values = self.q_table[state]
            valid_q = [(a, q_values[a]) for a in valid_actions]
            return max(valid_q, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        # Standard Q-learning temporal-difference update.
        current_q = self.q_table[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

    def decay_epsilon(self, decay_rate=0.995):
        # Gradually shift from exploration to exploitation, with a floor at 0.01.
        self.epsilon = max(0.01, self.epsilon * decay_rate)
We implement the Q-Learning agent that learns through experience, guided by an epsilon-greedy policy. We observe how it explores random actions early on and gradually focuses on the most rewarding paths. Through iterative updates, it learns to balance exploration and exploitation effectively.
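To make the update rule concrete, here is a small standalone sketch of one epsilon-greedy choice followed by one temporal-difference update. The Q-values, reward, and hyperparameters are illustrative numbers chosen for this example, not outputs from a trained agent.

# Illustrative epsilon-greedy step and Q-update (values are made up for demonstration).
import numpy as np

alpha, gamma, epsilon = 0.1, 0.95, 0.3
q_state = np.array([0.0, 2.0, -1.0, 0.5])   # Q-values of the 4 actions in some state
reward = -1                                  # per-step penalty, as in GridWorld.step
max_next_q = 3.0                             # best Q-value reachable from the next state

# With probability epsilon explore, otherwise exploit the current best estimate.
if np.random.rand() < epsilon:
    action = np.random.randint(len(q_state))
else:
    action = int(q_state.argmax())

# Move Q(s, a) a fraction alpha toward the bootstrapped target r + gamma * max_a' Q(s', a').
target = reward + gamma * max_next_q
q_state[action] += alpha * (target - q_state[action])
print("chose action", action, "updated Q-values:", q_state)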
class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c                  # exploration strength
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                # Try every action at least once before trusting the bonus formula.
                return action
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        # Running average of the bootstrapped target for this state-action pair.
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count
We develop the UCB agent that uses confidence bounds to guide its exploration decisions. We watch how it strategically tries less-visited actions while prioritizing those that yield higher rewards. This approach helps us understand a more mathematically grounded exploration strategy.
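The exploration bonus is easiest to appreciate with numbers. The sketch below plugs made-up visit counts and value estimates into the same UCB formula the agent uses, and shows how a rarely tried action can outrank a better-looking but well-explored one.

# Illustrative UCB scores (visit counts and values are invented for demonstration).
import math

c = 2.0
state_visits = 47                    # total visits to this state
stats = {                            # action -> (estimated value, times tried)
    0: (1.2, 30),
    1: (0.8, 15),
    2: (0.5, 2),
}

for action, (q, n) in stats.items():
    bonus = c * math.sqrt(math.log(state_visits) / n)
    print(f"action {action}: value={q:.2f}  bonus={bonus:.2f}  UCB={q + bonus:.2f}")
# Action 2 has the lowest value estimate but the largest bonus, so UCB selects it next.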
class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        # UCT score: average value plus an exploration term.
        choices = [
            (action, child.value / child.visits +
             c * math.sqrt(2 * math.log(self.visits) / child.visits))
            for action, child in self.children.items()
        ]
        return max(choices, key=lambda x: x[1])


class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            # Simulate on a copy of the real grid so planning never disturbs the environment.
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: descend through fully expanded nodes via UCT.
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: add one untried child, if any remain.
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Simulation: random rollout for up to 20 steps.
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                action = random.choice(valid)
                _, reward, done = sim_env.step(action)
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: push the rollout return up to the root.
            while node:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        # Act with the most-visited child; fall back to a random valid action.
        if root.children:
            return max(root.children.items(), key=lambda x: x[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))
We construct the Monte Carlo Tree Search (MCTS) agent to simulate and plan multiple potential future outcomes. We see how it builds a search tree, expands promising branches, and backpropagates results to refine decisions. This allows the agent to plan intelligently before acting.
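Because MCTS replans at every decision point instead of reading from a learned table, using it looks slightly different from the tabular agents. A minimal usage sketch, assuming the GridWorld and MCTSAgent classes above have already been run, could look like this:

# Plan one move with MCTS (assumes GridWorld and MCTSAgent above are defined).
env = GridWorld(size=8, n_obstacles=10)
planner = MCTSAgent(env, n_simulations=50)

state = env.reset()
action = planner.search(state)        # builds a fresh search tree rooted at the current state
print("MCTS recommends action index:", action)

state, reward, done = env.step(action)
print("After acting:", state, reward, done)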
def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)           # MCTS plans afresh at every step
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":                   # tabular agents learn from the transition
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history


if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)

    # All agents share the same environment instance so the MCTS agent plans on the layout it is evaluated on.
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts'),
    }

    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        rewards = train_agent(agent, env, episodes=300, agent_type=agent_type)
        results[name] = rewards

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20) / 20, mode='valid')
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)

    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha='right')
    plt.grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)
We train all three agents in our grid world and visualize their learning progress and performance. We analyze how each strategy (Q-Learning, UCB, and MCTS) adapts to the environment over time. Finally, we compare results and gain insights into which exploration approach leads to faster, more reliable problem-solving.
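As an optional follow-up, you can inspect what a trained tabular agent has actually learned by rolling it out greedily with exploration switched off. The helper below is our own addition rather than part of the original script; the function name and the trick of temporarily zeroing epsilon are illustrative, and it should be run on the same environment instance the agent was trained on so the obstacle layout matches.

# Optional helper: evaluate a trained agent greedily on its training environment (our own addition).
def greedy_rollout(agent, env, max_steps=100):
    state = env.reset()
    path, total = [state], 0
    saved_epsilon = getattr(agent, "epsilon", None)
    if saved_epsilon is not None:
        agent.epsilon = 0.0              # disable exploration during evaluation
    for _ in range(max_steps):
        valid = env.get_valid_actions(state)
        if not valid:
            break
        action = agent.get_action(state, valid)
        state, reward, done = env.step(action)
        path.append(state)
        total += reward
        if done:
            break
    if saved_epsilon is not None:
        agent.epsilon = saved_epsilon    # restore the original exploration rate
    return path, total

# Example usage (hypothetical names): after training a QLearningAgent called q_agent on env:
#   path, total = greedy_rollout(q_agent, env)
#   print(f"Greedy path length: {len(path)}, return: {total}")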
In conclusion, we successfully implemented and compared three exploration-driven agents, each demonstrating a unique strategy for solving the same navigation challenge. We observe how epsilon-greedy enables gradual learning through randomness, UCB balances confidence with curiosity, and MCTS leverages simulated rollouts for foresight and planning. This exercise helps us appreciate how different exploration mechanisms influence convergence, adaptability, and efficiency in reinforcement learning.