D. Reinforcement Learning Controller (Demo)
Objective
The objectives of this laboratory experiment are as follows:
Train a controller that balances the pendulum using Reinforcement Learning.
Implement the balance controller on the Quanser Rotary Pendulum system and evaluate its performance.
Equipment
QUBE-Servo 3 Rotary Inverted Pendulum Model
A picture of the QUBE-Servo 3 with the rotary pendulum module is shown in Fig. 1. The numbered components in Fig. 1 are listed in Table 1, and the numerical values of the system parameters are given in Table 2.

Table 1: QUBE-Servo 3 components (Figure 1)
1. Rotary servo
2. Rotary arm housing
3. Pendulum encoder
4. Rotary arm
5. Pendulum link
Table 2: QUBE-Servo 3 main parameters
Mass of pendulum: 0.024 kg
Total length of pendulum: 0.129 m
Pendulum moment of inertia about center of mass: 3.3282 ⋅ 10⁻⁵ kg⋅m²
Pendulum viscous damping coefficient as seen at the pivot axis: 5 ⋅ 10⁻⁵ N⋅m⋅s/rad
Mass of rotary arm: 0.095 kg
Rotary arm length from pivot to tip: 0.085 m
Rotary arm moment of inertia about its center of mass: 5.7198 ⋅ 10⁻⁵ kg⋅m²
Rotary arm viscous damping coefficient as seen at the pivot axis: 0.001 N⋅m⋅s/rad
Motor armature resistance (Rm): 8.4 Ω
Current-torque constant (kt): 0.0422 N⋅m/A
Back-emf constant (km): 0.0422 V⋅s/rad
Reinforcement Learning
Fundamentals
Although we found a sufficient model for the inverted pendulum system in Part A, there are many systems whose plant dynamics are difficult or impossible to model. In such cases, we look to model-free control.
Reinforcement learning algorithms, which include model-free algorithms, seek to train an agent. The agent receives observations of the plant system in addition to feedback from a reward function, and outputs a corresponding action according to its policy. The agent may be trained on the physical system in a series of discrete training episodes. Following each episode, the agent updates its policy according to the specific reinforcement learning algorithm in use. A summary of the reinforcement learning training loop is shown in Fig. 2.
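The observe-act-reward-update cycle described above can be sketched in Python. `ToyEnv` and `ToyAgent` below are deliberately trivial stand-ins (not the Quanser plant or a real DDPG agent); they only illustrate the shape of the episodic training loop in Fig. 2:

```python
import random

class ToyEnv:
    """Stand-in for the plant: the state is a single 'angle' value."""
    def reset(self):
        self.angle = random.uniform(-20.0, 20.0)   # random start, as in the lab
        return self.angle

    def step(self, action):
        self.angle += action                # trivial placeholder dynamics
        reward = -abs(self.angle)           # closer to upright is better
        done = abs(self.angle) > 60.0       # servo-limit style termination
        return self.angle, reward, done

class ToyAgent:
    """Stand-in for the agent: a proportional 'policy' plus an update counter."""
    def __init__(self):
        self.updates = 0

    def policy(self, obs):
        return -0.1 * obs                   # action according to current policy

    def remember(self, obs, action, reward):
        pass                                # a real agent stores transitions here

    def update_policy(self):
        self.updates += 1                   # a real agent does gradient steps here

def train(agent, env, num_episodes=10, max_steps=50):
    """Episodic training loop: act, collect feedback, then update the policy."""
    for _ in range(num_episodes):
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.policy(obs)
            obs, reward, done = env.step(action)
            agent.remember(obs, action, reward)
            if done:
                break
        agent.update_policy()               # one algorithm-specific update per episode
```

A real agent would update from a replay buffer of stored transitions; the one-update-per-episode structure here simply mirrors the loop described in the text.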

Following training, the agent may be deployed on the system, ideally achieving the desired behavior. The particular reinforcement learning algorithm used in this lab is the deep deterministic policy gradient (DDPG) algorithm, a model-free actor-critic algorithm. For more information about the DDPG algorithm, see the MathWorks documentation: https://www.mathworks.com/help/reinforcement-learning/ug/ddpg-agents.html.
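As an actor-critic method, DDPG trains an actor (the deterministic policy) and a critic (a value estimate), each with a slowly-tracking target copy used to stabilize learning. The target copies are updated by a soft (Polyak) average; a minimal illustrative sketch, with network weights represented as plain lists of floats:

```python
def soft_update(target_params, params, tau=0.005):
    """DDPG-style soft target update: target <- tau*params + (1 - tau)*target.

    Weights are flat lists of floats here for illustration; in practice they
    are neural-network parameters. A small tau makes the target networks
    track the trained networks slowly, stabilizing the critic's targets.
    """
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, params)]
```

With the default `tau = 0.005`, each target weight moves only 0.5% of the way toward the trained weight per update.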
Reinforcement Learning for the Inverted Pendulum System
To implement the DDPG algorithm, we need a reward function that evaluates the observed performance of the system. A good reward function promotes behavior we desire in the system and penalizes behavior we want to eliminate. In this case, we seek to command the system to θ = 0, α = 0, θ˙ = 0, α˙ = 0 (refer to Part A Modeling for the definition of variables). Additionally, we especially want to avoid the servo angle θ exceeding the physical limit of the system at θmax,actual = 90°; allowing a margin of error, we use θmax = 60°. We likewise want to keep the pendulum angle α within the balance control limit αmax = ϵ = 12°, and the input voltage within Vmax = 5 V. Thus, we use the following quadratic reward function
r = −(q₁₁θ² + q₂₂α² + q₃₃θ˙² + q₄₄α˙² + r₁₁V²) + B⋅𝟙(limit exceeded)

where qᵢᵢ, r₁₁, and B are weights, and 𝟙(limit exceeded) is 1 when any of the limits above is violated and 0 otherwise. This function covers all of the desired behavior we seek from the system. By default, the weights are q₃₃ = 0 < r₁₁ = 0.1 < q₄₄ = 1 < q₁₁ = 10 < q₂₂ = 20, and B = −100, reflecting the relative importance of each state.
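Assuming the reward is the quadratic form just described (state and input penalties weighted by q₁₁…q₄₄ and r₁₁, plus the penalty B when a limit is exceeded), it can be computed as follows; angle arguments are in degrees:

```python
def reward(theta, alpha, theta_dot, alpha_dot, voltage,
           q11=10.0, q22=20.0, q33=0.0, q44=1.0, r11=0.1, B=-100.0,
           theta_max=60.0, alpha_max=12.0):
    """Quadratic reward with the lab's default weights (angles in degrees)."""
    r = -(q11 * theta**2 + q22 * alpha**2
          + q33 * theta_dot**2 + q44 * alpha_dot**2
          + r11 * voltage**2)
    # Large negative bonus B when a safety or balance limit is exceeded
    if abs(theta) > theta_max or abs(alpha) > alpha_max:
        r += B
    return r
```

Note that the reward is zero only at the upright equilibrium with zero input, and strictly negative everywhere else, so the agent is driven toward θ = α = 0.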
For a model-free application, an engineer would typically train the reinforcement learning model on the physical plant system or on a digital twin of it. However, doing this during a 2.5-hour lab session is impractical, as it would require repeatedly raising the pendulum for 1000 training episodes. Consequently, we train our reinforcement learning agent in simulation, using the state-space model we developed in Part A.
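Training in simulation simply means stepping the Part A state-space model inside each episode instead of reading the hardware. The matrices below are placeholders (the real A and B come from Part A Modeling), and forward-Euler integration is used purely for illustration:

```python
import numpy as np

# Placeholder dynamics: substitute the A, B matrices derived in Part A Modeling.
A = np.array([[0.0, 1.0],
              [-1.0, -0.1]])   # hypothetical 2-state example, NOT the pendulum model
B = np.array([0.0, 1.0])       # hypothetical input matrix (stored as a vector)
dt = 0.002                     # hypothetical sample time, seconds

def sim_step(x, u):
    """One forward-Euler step of the linear model x_dot = A x + B u."""
    return x + dt * (A @ x + B * u)
```

In practice a proper ODE solver (as used by Simulink) would replace the Euler step, but the episode loop interacts with `sim_step` exactly as it would with the physical plant.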
A. Training the Reinforcement Learning Controller
Download the Lab4_Group.zip file to the lab PC and extract the folder contents.
Open the rotpen_group.mlx live script and run the Initialization section. This section generates all necessary parameters for the reinforcement learning training. Ensure that the drop-down menu is configured to 'train new agent'.
Run the Observation and Action Signals section. This section creates the learning environment for the reinforcement learning agent, configures the observation signal read by the agent, and configures the action signal generated by the agent.
Run the Create DDPG Agent section. This section creates and configures the critic and actor neural networks, and compiles these networks into a single agent.
Run the Train/Load Agent section. This section begins the training session. By default, the training session is configured to contain 1000 episodes with a random starting pendulum angle between -20 and 20 degrees.
The s_rotpen_train_rl.slx Simulink diagram should automatically open, in addition to a Reinforcement Learning Training Monitor window. Open the 'alpha' and 'theta' scopes and configure your monitor so you can clearly see all three windows, as in Fig. 3. You should now be able to watch as the reinforcement learning agent trains on your state space model.
Wait for the training to complete. This should take 20-30 minutes. Ensure that the main MATLAB window is open and visible to avoid slowdown due to CPU allocation. On training completion, your trained agent will be saved to the current directory as agent_student.mat.

B. Implementing the Reinforcement Learning Controller
Proceed to the Implement Reinforcement Learning on Hardware section of the rotpen_group.mlx script.
Ensure the 'doPolicy' drop-down box is set to 'true'.
Run the section. The q_rotpen_rl_student.slx Simulink diagram should open automatically.
To build the model, click the down arrow on Monitor & Tune under the Hardware tab, then click Build for monitoring. This generates the controller code.
Press the Connect button under Monitor & Tune, then press Start. Once the model is running, manually bring the pendulum up to its upright vertical position. You may need to try this a few times.
Observe the behavior of your controller. It is possible that your reinforcement learning controller cannot balance the pendulum, or that the control it achieves is not very robust. This is OK! These are natural consequences of reinforcement-learning-based control.
Questions for Report
Briefly summarize the RL balance controller experiment and your observations. How did the RL controller behave compared to the model-based pole placement controller?