Reinforcement Learning and Its Implications for Enterprise Artificial Intelligence


Reinforcement learning (RL) is a branch of machine learning in which an algorithm learns by exploring its environment. Unlike supervised learning, which trains on labeled datasets, an RL algorithm works toward its objective by receiving positive or negative rewards for the actions it takes. RL environments can be simulated or real-world, although real-world environments are seldom used for training.

Most of today’s machine learning uses labeled datasets: the algorithm is told what the correct answer is, or which known patterns to look for. This data-driven approach is what drives deep learning today. Deep neural networks are trained on vast amounts of labeled data and can then make probabilistic predictions on new data. With RL, one does not know the correct answer in advance, or does not have labeled data at hand. Instead, the algorithm explores its state space and learns from the rewards it receives how to achieve the given objective.

(Source: Reinforcement Learning: An Introduction 2nd Edition, Richard S. Sutton and Andrew G. Barto)
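
The reward-driven loop described above can be made concrete with a small sketch. The corridor environment, the reward values, and the hyperparameters below are all invented for illustration; the update rule is standard tabular Q-learning, chosen here only as one simple example of an RL algorithm.

```python
import random

N_STATES = 5          # corridor cells 0..4; the goal is cell 4 (toy assumption)
ACTIONS = (-1, +1)    # step left or step right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01   # small cost per move, bonus at the goal
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore occasionally, otherwise exploit the current estimates.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # No labeled answer is given; the only feedback is the reward.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy action in each non-goal cell after training.
policy = [ACTIONS[max(range(2), key=lambda i: row[i])] for row in q_table[:-1]]
print(policy)
```

Nothing in the sketch tells the agent that moving right is correct. It only ever sees rewards, which is the key difference from the labeled-data approach.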

Deep RL combines deep learning with RL to handle cases where the search space is very large, or the environment is very complicated, with multi-dimensional states, actions, and rewards. A well-known example is deep Q-learning, in which a deep neural network serves as a function approximator for the Q function, predicting the expected cumulative reward of taking an action in a given state, rather than trying to explore and store values for every state and action. In simulation environments, feeding the raw pixels of the environment through a neural network also allows the RL algorithm to learn directly from what it sees.
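
To illustrate what a function approximator buys, here is a minimal sketch of one Q-learning update in which the Q function is a parameterized model rather than a lookup table. To keep the example dependency-free, a linear model stands in for the deep network; the feature function, the state layout, and the hyperparameters are hypothetical.

```python
def features(state, action):
    """Hypothetical feature vector for a (state, action) pair."""
    position, velocity = state
    return [1.0, position, velocity, float(action), position * action]


def q_value(weights, state, action):
    """Approximate Q(state, action) as a weighted sum of features."""
    return sum(w * x for w, x in zip(weights, features(state, action)))


def td_update(weights, state, action, reward, next_state, actions,
              alpha=0.05, gamma=0.95, done=False):
    """One semi-gradient Q-learning step; returns (new_weights, td_error)."""
    if done:
        best_next = 0.0
    else:
        best_next = max(q_value(weights, next_state, a) for a in actions)
    td_error = reward + gamma * best_next - q_value(weights, state, action)
    # For a linear model, the gradient of Q with respect to the weights
    # is simply the feature vector.
    new_weights = [w + alpha * td_error * x
                   for w, x in zip(weights, features(state, action))]
    return new_weights, td_error


weights = [0.0] * 5
state, next_state = (0.5, 0.0), (0.6, 0.1)
weights, err = td_update(weights, state, action=1, reward=1.0,
                         next_state=next_state, actions=(0, 1), done=True)
print(err, q_value(weights, state, 1))
```

Because the weights are shared across all states, one update also shifts the estimates for states the agent has never visited. That generalization is what makes very large state spaces tractable, and a deep network plays the same role with far more expressive features.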

Teaching Artificial Intelligence Systems and Avoiding Real-World Consequences

For the most part, RL is being used to teach AI systems how to play games, as games provide a safe and bounded environment for learning. For example, AlphaGo uses RL in combination with other techniques, and similar approaches have been used to have AI learn Atari games or become champions at poker. RL is also being used in robotics and autonomous cars, mostly in simulation environments that teach robots and cars how to navigate. Simulation environments are great for RL, as one can afford to spend time on trial and error without any real-world consequences. Having an autonomous car perform RL in the real world, on the other hand, is not the best way to teach an AI, because every failed trial could mean a real accident.

Today, within the enterprise, we see a lot of use for supervised learning techniques, mostly using classical machine learning or deep learning. Most classical machine learning techniques address predictive analytics tasks, where an AI system learns from a body of data, identifies the patterns in it, and makes predictions. Deep learning-driven techniques mostly address human perception tasks like vision or language, where the AI is now able to understand text, speech, pictures, and video, and then perform additional analytics.

Real-Time Use Cases for Reinforcement Learning

RL allows enterprises to tackle the next level of problems, especially around optimizing, controlling, and monitoring systems or processes in real time. RL is very handy in cases where there is a lack of clean or labeled data. Any industrial or machine process that is handled by a traditional programmable logic controller (PLC) control system can, in theory, be run using an RL algorithm. PLC systems are among the most common control systems used in manufacturing, robotics, and energy management (e.g., heating, ventilation, and air conditioning (HVAC)) systems. Tools like MATLAB and Simulink are commonly used to simulate mechanical and electrical systems, and robotics frameworks like the Robot Operating System (ROS) are commonly used for robot simulation. Many other verticals where RL could be used have their own simulators that can be integrated as RL environments. Below is a preliminary list of industries where RL could be applied:

  • Manufacturing: Controlling, managing, and predicting failure of machinery
  • Building Automation: Control and management of building and HVAC systems
  • Energy: Control and management of power plant systems
  • Aerospace: Controlling, managing, and predicting failure of aircraft, drone systems, engines, etc.
  • Logistics: System-level real-time adjustment and prediction of issues in logistics network planning
  • Consumer: Consumer and toy robot management and control allowing for real-time learning
  • Transport: Management and control of traffic systems, public transport scheduling, etc.

RL platforms like Bonsai or OpenAI’s Gym and Universe can be integrated with some of these simulators so that RL algorithms learn in the simulated environment. We are already starting to see early concepts and experimentation with RL systems in verticals like manufacturing. As more platforms emerge, and it becomes easier to port environments into platforms and perform real-time learning, the use cases will increase.
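
As a rough sketch of what integrating a simulator as an RL environment can look like, the toy room-temperature model below exposes a reset/step interface loosely modeled on the convention popularized by Gym. It does not use the Gym library itself, and the class name, the thermal constants, and the reward weights are all invented for illustration.

```python
class ThermostatEnv:
    """Toy room-temperature simulator with a Gym-style reset/step interface.

    A real deployment would wrap an existing simulator instead of this
    made-up thermal model.
    """

    def __init__(self, target=21.0, outside=10.0, horizon=48):
        self.target, self.outside, self.horizon = target, outside, horizon
        self.temp, self.t = 15.0, 0

    def reset(self):
        self.temp, self.t = 15.0, 0
        return self.temp

    def step(self, action):
        """action: 0 = heater off, 1 = heater on."""
        heat = 1.5 if action == 1 else 0.0
        # The room drifts toward the outside temperature and warms when heated.
        self.temp += heat - 0.1 * (self.temp - self.outside)
        self.t += 1
        # Reward trades off comfort (distance from target) against energy use.
        reward = -abs(self.temp - self.target) - 0.2 * action
        done = self.t >= self.horizon
        return self.temp, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


def rule_based(temp):
    # Fixed on/off rule, in the spirit of a hand-programmed setpoint controller.
    return 1 if temp < 21.0 else 0


def always_off(temp):
    return 0


env = ThermostatEnv()
print(run_episode(env, rule_based), run_episode(env, always_off))
```

Any learning agent that maps an observation to an action can be passed in as the policy, so the same interface serves a hand-written rule and an RL algorithm alike. The two fixed policies here only provide baselines to compare a learned controller against.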

Technical and Safety Issues Are Significant Challenges

One of the promises of OpenAI’s Universe was the integration of websites, allowing RL algorithms to be applied to websites and the internet. This would open up a new class of AI systems that learn narrow tasks like filling out forms on websites, user behavior on e-commerce websites, chatbot behavior, and many other tasks that humans take for granted while interacting with the internet. Tractica covered the implications of OpenAI Universe in a blog post back in 2016, but unfortunately, OpenAI has not delivered on all its promises yet. There could be both a technical reason and an AI safety reason why OpenAI seems to have abandoned the Universe project so far. On the technical side, Universe integrates environments through a virtual network computing (VNC) connection, which is known to slow down feedback to the RL algorithm and prevent real-time state updates. On the safety side, OpenAI has possibly been reevaluating the ethical, privacy, and safety challenges as well. An internet-scale RL platform essentially means that agents will be running on websites, smartphones, and the internet, constantly learning user and system-level behavior. Just as we take for granted the cookies that websites use to store session data and preferences and to capture discrete user behavior, the RL agent is a smarter version of the website cookie, one that is constantly watching pixel movement on screens and learning from it.

Human Oversight Is Still a Requirement for Reinforcement Learning Platforms

If OpenAI Universe or someone else creates an RL platform that enables the “Internet as an Environment,” we are essentially entering a world of RL agents being massively deployed into industrial, robotic, and internet systems, where they learn and adapt to scenarios. One could argue that this is where AI starts to move into the next phase, where we start to get another order of magnitude of efficiency gains in systems and processes. However, we will also need human oversight of agents, making sure they do not make wrong decisions or end up destroying modern-day electrical, industrial, and internet infrastructure. This is one of the biggest hurdles of RL, as there are ethical, safety, and survival-level issues for humanity once the AI goes beyond playing Atari games and recognizing voices, cats, and faces, and enters the system-level control and management realm. A very simple example of a multi-agent RL system causing global panic would be the AI shutting down the power grid and power plants, as that would be the simplest way to save energy costs! Until then, Tractica expects to see sandbox approaches to RL, especially in industrial and building management types of systems where one can build in human oversight and install kill switches.
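
One way to picture built-in human oversight with a kill switch is a thin wrapper that sits between the agent and the system it controls. The sketch below is a hypothetical illustration, not an established pattern from any particular platform: every class, function, and parameter name is invented, and the "agent" is a stand-in that mimics the power-grid failure mode described above.

```python
class KillSwitchTripped(Exception):
    """Raised when a human operator has halted the agent."""


class OverseenController:
    """Wraps an agent's policy with a human-defined guard and a kill switch."""

    def __init__(self, policy, is_allowed, fallback):
        self.policy = policy          # the learned (or learning) agent
        self.is_allowed = is_allowed  # rule written by human operators
        self.fallback = fallback      # safe action used when a proposal is vetoed
        self.halted = False
        self.vetoed = []              # audit trail for later human review

    def halt(self):
        """Kill switch: stop accepting any further actions from the agent."""
        self.halted = True

    def act(self, observation):
        if self.halted:
            raise KillSwitchTripped("operator halted the agent")
        proposed = self.policy(observation)
        if not self.is_allowed(observation, proposed):
            self.vetoed.append((observation, proposed))
            return self.fallback(observation)
        return proposed


def cost_only_policy(demand_mw):
    # An agent rewarded only for low energy cost "discovers" that
    # producing nothing is the cheapest option.
    return 0.0


controller = OverseenController(
    policy=cost_only_policy,
    is_allowed=lambda demand_mw, output_mw: output_mw >= demand_mw,
    fallback=lambda demand_mw: demand_mw,
)
print(controller.act(80.0))   # the guard substitutes the fallback for the shutdown
```

The guard only catches failure modes that operators thought to encode in advance, which is why the sketch also keeps an audit trail and a manual halt rather than relying on the rule alone.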
