
Understanding Q-learning: How a Reward Is All You Need

· 10 min read
Corentin Risselin
Software Engineer at Spice AI

There are two general ways to train an AI to match a given expectation. We can give it the expected outputs (commonly named labels) for different inputs: this is supervised learning. Or we can provide a reward for each output as a score: this is reinforcement learning (RL).

Supervised learning works by tweaking all the parameters (weights in neural networks) to fit the desired outputs, expecting that given enough input/label pairs the AI will find common rules that generalize for any input.

Reinforcement learning's reward is often provided from a simple function that can score any output: we don't know what specific output would be best, but we can recognize how good the result is. In this latter statement there are two underlying concepts we will address in this post:

  • Can we only tell if the output is good in a binary way, or do we have to quantify the output to train our AI?
  • Do we have to give a reward for every output of the AI, or can we give a reward only at specific times?

Those questions are already mostly answered, and many algorithms deal with those topics. Our journey here will be to understand how we tackle those questions and end up with a beautiful formula that is at the core of modern approaches to RL:

Equation 1. Q estimation at the heart of many RL algorithms, also known as the Bellman equation.


The vast majority, if not all, of modern RL algorithms are based on the principles of Q-learning: the idea is to evaluate a 'reward expectation' for each possible action. If we can have a good evaluation, we could maximize the reward by choosing actions with the maximum evaluated rewards. The function giving this expected reward is named Q. For now, we will assume we can have a reward for any action.

Equation 2. Definition of the Q function.

The t indices show that the state and action aren't constant and will vary, usually with time or with each action taken. On the other hand, the Q function and the reward function r are unique functions that ideally return the 'expected reward' for any (state, action) pair.

For now, we will assume we can have a reward that gives an objective and perfect evaluation of each state/action.

Figure 1. Example of reward given for different actions at a specific state. Here a simple 2D map with a goal.


We know that actions' outcomes (rewards) will vary depending on the current state we are in, otherwise the problem would be trivial to solve. If the states that are relevant to our actions can be enumerated, a simple way would be to build a table with all the possible state/action pairs. There are different ways to build such a table depending on how we can interact with our environment. Eventually, we would have a good 'map' to guide us to the best actions.

Figure 2. Example of Q-table: we can build an exhaustive table for all the possible (state, action) pairs
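As a rough illustration of the table idea, here is a minimal sketch for a made-up environment: a 1-D corridor of five states with a goal at one end. The corridor, the `step` and `reward` helpers, and the reward values are all assumptions chosen for the example, not something taken from the figures.

```python
# Hypothetical 1-D corridor: states 0..4, the goal is state 4.
ACTIONS = ("left", "right")
N_STATES = 5
GOAL = 4


def step(state, action):
    """Return the next state after moving left or right (walls clamp the move)."""
    delta = 1 if action == "right" else -1
    return min(max(state + delta, 0), N_STATES - 1)


def reward(state, action):
    """Illustrative reward: +1 if the move gets closer to the goal, otherwise -1."""
    next_state = step(state, action)
    return 1.0 if abs(GOAL - next_state) < abs(GOAL - state) else -1.0


# Exhaustive Q-table: one entry per (state, action) pair.
q_table = {(s, a): reward(s, a) for s in range(N_STATES) for a in ACTIONS}


def best_action(state):
    """Pick the action with the highest tabulated value for this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


print(best_action(0))  # "right": moving right gets closer to the goal
```

With 5 states and 2 actions the table has only 10 entries, which is why this approach is practical only when the state space is small.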

Deep Q-Learning

When the number of variables of the environment relevant to our actions/rewards becomes too large, the number of possible states grows quickly. It doesn't take a lot of possible parameters to make the Q-table approach unfeasible. Neural networks are known to work very nicely and efficiently in high dimensionality (with many input variables). They also generalize well, so the idea in Deep Q-Learning is to use a neural network to predict the different Q values for each action given a state.

Figure 3. A neural network can predict Q values from state information

In this case, we do not need to give the state/action pairs but only the state, as the neural network would exhaustively return all the Q values associated with each action. Outputting all actions' Q value is a common method as the general cases have a complex environment but a smaller number of possible actions.
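To make the "state in, one Q value per action out" shape concrete, here is a small sketch of a forward pass written in plain Python. The layer sizes, the random untrained weights, and the helper names (`make_layer`, `dense`, `q_values`) are assumptions for illustration; a real implementation would use a deep learning framework and trained weights.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, inputs, activation=None):
    """Apply one dense layer, optionally followed by an activation function."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [activation(o) for o in outputs] if activation else outputs


def q_values(network, state):
    """Map a state vector to one estimated Q value per action."""
    hidden_layer, output_layer = network
    hidden = dense(hidden_layer, state, activation=math.tanh)
    return dense(output_layer, hidden)


rng = random.Random(0)
STATE_SIZE, HIDDEN_SIZE, N_ACTIONS = 4, 8, 3
network = (make_layer(STATE_SIZE, HIDDEN_SIZE, rng),
           make_layer(HIDDEN_SIZE, N_ACTIONS, rng))

state = [0.1, -0.3, 0.7, 0.0]          # made-up state features
qs = q_values(network, state)          # one value per action
greedy_action = max(range(N_ACTIONS), key=lambda a: qs[a])
```

The output size equals the number of actions, so a single forward pass is enough to compare every action for a given state.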

This method works very well. It is similar to supervised learning with states as inputs and rewards as labels. We assumed so far that we had a reward for each action, and we chose the next action with the best reward (called a greedy policy). In many cases this is not enough: even if an action yields the best reward at a given state, it may lead to a next state from which we can no longer optimize the reward in the long term. Also, if we can't have a reward for each action, we usually give 0 as a reward. In that case we cannot choose the right action when actions yield the same reward at the current state but lead to different later states.

The sparsity of rewards or the long-term calculation of total reward (non-greedy policies) leads us to diverge from supervised learning and learn potential future rewards.

Temporal difference: TD-Learning

TD-learning is a clever way to account for potential future values without knowing them yet. TD is a model-free class of algorithms: it does not simulate future states. The main idea is to consider all the rewards of a sequence of actions to give a better value than just the reward of the next action.

We can, for instance, sum all the future rewards:

Figure 4. Cumulating future rewards to assign values to each state.

Mathematically this can be written as:

Equation 3.

This is named TD(0): the simplest form of TD method, accumulating all the rewards.
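The accumulation described above can be sketched in a few lines. The reward sequence is made up, and the function name `cumulative_future_rewards` is only an illustrative choice.

```python
def cumulative_future_rewards(rewards):
    """For each step t, return r_t + r_(t+1) + ... + r_T (no discounting)."""
    values = [0.0] * len(rewards)
    running_total = 0.0
    # Walk the trajectory backwards so each step reuses the sum of what follows.
    for t in reversed(range(len(rewards))):
        running_total += rewards[t]
        values[t] = running_total
    return values


trajectory_rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # made-up rewards along one trajectory
values = cumulative_future_rewards(trajectory_rewards)
print(values)  # [3.0, 3.0, 3.0, 2.0, 2.0]
```

Even the steps that received a reward of 0 end up with a non-zero value, because they inherit the rewards collected later in the trajectory.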

Introducing policies

We could try different trajectories (sequence of actions) and retrospectively get the final reward for each action, but this has 2 drawbacks: the environment is usually too vast, and the sequence of actions might not even have a definite end. Also, such exhaustive methods might not be very efficient. Instead, we can evaluate the 'value' of the next state overall, like the maximum of all its possible rewards (direct reward), and add this value to the reward of a given action.

If a state can have different branches, we can select the best one, and this would be our policy, the way we choose actions. This simple form of taking the maximum is called the 'greedy' policy.

Figure 5. With a greedy policy the associated values to state come from the maximum value of the next state. Here despite the lower branch giving only half the top reward directly the overall value is greater.

This can be written down as:

Equation 4.

The expected value notation is defined as:

Equation 5.

For a greedy policy, all the probabilities p are set to 0 except the one associated with the highest return, which is set to 1 (if n actions tie for the highest return, we assign each of them a probability of 1/n, which gives the same expected value).

Equation 6.
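As a small worked example of the greedy probabilities, including the tie case, the sketch below assigns 1/n to each of the n best actions and 0 to the others. The returns are made-up numbers and the helper names are assumptions.

```python
def greedy_probabilities(returns):
    """Probability 1/n for each of the n best actions, 0 for all others."""
    best = max(returns)
    n_best = sum(1 for r in returns if r == best)
    return [1.0 / n_best if r == best else 0.0 for r in returns]


def expected_value(probabilities, returns):
    """Probability-weighted sum of the returns."""
    return sum(p * r for p, r in zip(probabilities, returns))


returns = [1.0, 3.0, 3.0]                   # two actions tie for the best return
probabilities = greedy_probabilities(returns)
value = expected_value(probabilities, returns)
print(probabilities, value)  # [0.0, 0.5, 0.5] 3.0
```

The expected value equals the maximum return whether or not there is a tie, which is why the greedy expectation can be replaced by a simple max.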

Relation with Q function

The expected reward can be replaced by the Q function we used earlier, which can now be denoted as specific to our chosen policy (named π):

Equation 7.


We previously discussed the problem of not being able to go through all the states exhaustively and that the evaluation of the Q value from a neural network could help. We want to use the TD method to have a better value estimation that will consider potential future rewards.

The TD(0) method is elegant as we can, in fact, use only the next state's expected value instead of all future ones. The idea is that with successive evaluations, we build a chain of dependencies as each state's value depends on the next one.

Equation 8.

Figure 6. Iterative propagation of state values following TD(0) method.

We can see that the greedy policy would work even with null rewards in the trajectory. We can make our greedy policy explicit by going back to the Q value instead of the state value V:

Equation 9.


We need to fix a problem: if a trajectory grows too long or never ends, a state value can potentially grow indefinitely. To counter that, we can add a discount factor (originally named lambda, usually referred to as gamma in Q-learning) for the next state's value:

Equation 10.

Notice that we simplify the reward notation for clarity.

To avoid exploding values, this discount has to be between 0 and 1 (strictly below 1). We can think about it as giving more importance to the direct reward than to the future ones. As the contribution of later rewards decreases, the chain of actions can grow without the calculated value growing. If the reward has an upper limit, the value will also be bounded.
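The boundedness claim can be checked numerically. If every reward is at most r_max, the discounted sum is at most r_max / (1 - gamma), by the geometric series. The sketch below uses made-up values (gamma = 0.9, r_max = 1.0) and a hypothetical helper `discounted_return`.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    total = 0.0
    # Fold from the end: value = reward + gamma * (value of what follows).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


gamma = 0.9
r_max = 1.0
bound = r_max / (1.0 - gamma)          # geometric-series limit, about 10

short_return = discounted_return([r_max] * 10, gamma)
long_return = discounted_return([r_max] * 1000, gamma)
print(short_return, long_return, bound)
```

Making the trajectory 100 times longer moves the return closer to the bound but never past it, whereas an undiscounted sum would have grown from 10 to 1000.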

The sparsity of rewards is also solved: giving only a positive reward after many non-rewarding steps will create smooth values for the intermediate states. Any reward, positive or negative, will diffuse its value to the neighbor states.

Figure 7. The TD(0) value propagation can allow for a smooth value distribution over the state that will help building efficient behaviour.
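To see how a single sparse reward diffuses backwards, here is a minimal sketch on a made-up chain of states with only one possible action per state (so the max over actions is trivial). The chain, the reward placement, and gamma = 0.5 are assumptions chosen so the numbers are easy to follow.

```python
def propagate_values(rewards, gamma, sweeps):
    """Repeatedly apply V(s) = r(s) + gamma * V(s + 1) along a chain of states.

    rewards[s] is the reward for stepping from state s to state s + 1.
    The state after the last step is terminal and keeps a value of 0.
    """
    n = len(rewards)
    values = [0.0] * (n + 1)
    for _ in range(sweeps):
        for s in range(n):
            values[s] = rewards[s] + gamma * values[s + 1]
    return values


rewards = [0.0, 0.0, 0.0, 1.0]          # only the last step is rewarded
print(propagate_values(rewards, gamma=0.5, sweeps=1))  # [0.0, 0.0, 0.0, 1.0, 0.0]
print(propagate_values(rewards, gamma=0.5, sweeps=4))  # [0.125, 0.25, 0.5, 1.0, 0.0]
```

After one sweep only the rewarded state has a value. Each additional sweep carries the value one state further back, discounted by gamma, which produces the smooth gradient toward the goal.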

Q-Learning algorithm

Finally, as we train a neural network to estimate the Q function, we need to update its target with successive iterations. We cannot fully trust the estimator (a neural network here) to give the correct value, so we introduce a learning rate to update the target smoothly.

Equation 11. Fully explained Bellman equation.
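The rendered equation did not survive in this copy of the post. For reference, the Q-learning update is commonly written as below; the symbols α (learning rate), γ (discount factor) and r_t (reward at step t) follow the usual textbook notation and are an assumption about the notation the original figure used.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

The bracketed term is the difference between the new target (reward plus discounted best next value) and the current estimate; the learning rate controls how far the estimate moves toward that target.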

That is it! We now understand all the parts of this formula. Over multiple training steps with different states, the training should find a good average Q function. While training, the estimator uses its own output to train itself (commonly referred to as bootstrapping): it is like it is chasing itself. Bootstrapping can lead to instability in the training process. There are many additional methods to help against such instability.
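A single update step can be sketched with a table standing in for the neural network. The state names, the starting values, and the choices alpha = 0.5 and gamma = 0.9 are made up for the example; `q_learning_update` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q(state, action) a fraction alpha toward reward + gamma * max Q(next_state, .)."""
    best_next = max(q[(next_state, a)] for a in actions)   # bootstrapped estimate
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)                  # every unseen (state, action) pair starts at 0
actions = ("left", "right")
q[("s1", "right")] = 1.0                # pretend a later state already has some value

new_value = q_learning_update(
    q, "s0", "right", reward=0.0, next_state="s1",
    actions=actions, alpha=0.5, gamma=0.9,
)
print(new_value)  # 0.45
```

Even though the reward for this step is 0, the estimate for ("s0", "right") becomes positive because it bootstraps from the estimate of the next state.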

From giving rewards, sparse or not, binary or fine-grained, we have a smooth space of values for all our states/actions so the AI can follow a greedy policy to the best outcome.

This way of training is not a silver bullet and there is no guarantee that the AI will find a correlation from the information given as state to the returned reward.


We can see how our rewards are used to train AI's policies using Q-learning. By understanding the many iterations required and the bootstrapping issues, we can help our AI by carefully giving relevant state information and reward:

  • There needs to be a correlation between the state information and the reward: the simpler the relationship, the easier/faster the AI will find it.
  • Sparse and binary rewards make the training problem long and arduous. Giving more information through the reward can tremendously increase the speed/accuracy of the learned Q-estimator.
  • The longer the chain of actions, the more complex the Q-value will be to estimate.

We didn't see here how the AI's algorithm can explore different actions in a given environment. Spice.ai's technology focuses exclusively on off-policy training where we only have past data and cannot interact with the environment. RL is a vast and quickly growing topic. Robotics is a fantastic field of application; many other areas are yet to be explored with such a technology. We hope to push forward the technology and its field of application with our platform.

If you'd like to partner with us on the mission of making new applications by leveraging RL, we invite you to discuss with us on Discord, reach out on Twitter or email us.

I hope you enjoy this post and learn new things.

Corentin

Spice.ai v0.5-alpha

· 3 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

We are excited to announce the release of Spice.ai v0.5-alpha! 🥇

Highlights include a new learning algorithm called "Soft Actor-Critic" (SAC), fixes to the behavior of spice upgrade, and a more consistent authoring experience for reward functions.

If you are new to Spice.ai, check out the getting started guide and star spiceai/spiceai on GitHub.

Highlights in v0.5-alpha

Soft Actor-Critic (Discrete) (SAC) Learning Algorithm

The addition of the Soft Actor-Critic (Discrete) (SAC) learning algorithm is a significant improvement to the power of the AI engine. It is not set as the default algorithm yet, so to start using it, pass the --learning-algorithm sacd parameter to spice train. We'd love to get your feedback on how it's working!

Consistent reward authoring experience

With the addition of the reward function files that allow you to edit your reward function in a Python file, the behavior of starting a new training session by editing the reward function code was lost. With this release, that behavior is restored.

In addition, there is a breaking change to the variables used to access the observation state and interpretations. This change was made to better reflect the purpose of the variables and make them easier to work with in Python.

Previous (Type) | New (Type)
prev_state (SimpleNamespace) | current_state (dict)
prev_state.interpretations (list) | current_state_interpretations (list)
new_state (SimpleNamespace) | next_state (dict)
new_state.interpretations (list) | next_state_interpretations (list)
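In practice, the change means moving from attribute access on a SimpleNamespace to key access on a dict. The sketch below only illustrates that difference in plain Python: the measurement name hostmetrics_cpu_usage_idle and the values are placeholders, and how the runtime actually populates these variables is not shown here.

```python
from types import SimpleNamespace

# Before v0.5-alpha: state values were read as attributes.
prev_state = SimpleNamespace(hostmetrics_cpu_usage_idle=20.0)   # placeholder measurement
new_state = SimpleNamespace(hostmetrics_cpu_usage_idle=50.0)
old_style_delta = new_state.hostmetrics_cpu_usage_idle - prev_state.hostmetrics_cpu_usage_idle

# From v0.5-alpha: state values are read by key from a dict.
current_state = {"hostmetrics_cpu_usage_idle": 20.0}
next_state = {"hostmetrics_cpu_usage_idle": 50.0}
new_style_delta = (next_state["hostmetrics_cpu_usage_idle"]
                   - current_state["hostmetrics_cpu_usage_idle"])

print(old_style_delta, new_style_delta)  # 30.0 30.0
```

Existing reward code therefore needs both the variable names and the access style updated.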

Improved spice upgrade behavior

The CLI will no longer recommend "upgrading" to an older version. An issue was also fixed where trying to upgrade the CLI using spice upgrade on Linux would return an error.

New in this release

  • Adds a new learning algorithm called "Soft Actor-Critic" (SAC).
  • Updates the reward function parameters for the YAML code blocks from prev_state and new_state to current_state and next_state to be consistent with the reward function files.
  • Fixes an issue where editing a reward function file would not automatically trigger training.
  • Fixes the normalization of values for the Deep Q-Learning algorithm to handle larger values.
  • Fixes an issue where the CLI would not upgrade on Linux with the spice upgrade command.
  • Fixes an issue where the CLI would recommend an "upgrade" to an older version.


Community

Spice.ai started with the vision to make AI easy for developers. We are building in the open and with the community. Reach out on Discord or by email to get involved. We will also be starting a community call series soon!

AI needs AI-ready data

· 5 min read
Phillip LeBlanc
Co-Founder and CTO of Spice AI

A significant challenge when developing an app powered by AI is providing the machine learning (ML) engine with data in a format that it can use to learn. To do that, you need to normalize the numerical data, one-hot encode categorical data, and decide what to do with incomplete data - among other things.

This data handling is often challenging! For example, to learn from Bitcoin price data, the prices work better when normalized to a range between -1 and 1. Values very close to 0 (usually under 1e-5) are also a problem because of the limited precision of floating-point representations.

As a developer, if you are new to AI and machine learning, a great talk that explains the basics is Machine Learning Zero to Hero. Spice.ai makes the process of getting the data into an AI-ready format easy by doing it for you!

What is AI-ready data?

You write code with if statements and functions, but your machine only understands 1s and 0s. When you write code, you leverage tools, like a compiler, to translate that human-readable code into a machine-readable format.

Similarly, data for AI needs to be translated or "compiled" to be understood by the ML engine. You may have heard of tensors before; they are simply another word for a multi-dimensional array and they are the language of ML engines. All inputs to and all outputs from the engine are in tensors. You could use the following techniques when converting (or "compiling") source data to a tensor.

  1. Normalization/standardization of the numerical input data. Many of the inputs and outputs in machine learning are interpreted as probability distributions. Much of the math that powers machine learning, such as softmax, tanh, sigmoid, etc., is meant to work in the [-1, 1] range.

Figure 1. Normalizing Bitcoin price data.

  2. Conversion of categorical data into numerical data. For categorical data (e.g., colors such as "red," "blue," or "green"), you can achieve this through a technique called "one-hot encoding." In one-hot encoding, each possible value in the category appears as a column. The values in the column are assigned a binary value of 1 or 0 depending on whether the value exists or not.

Figure 2. A visualization of one-hot encoding.

  3. Several advanced techniques exist for "compiling" this source data - this process is known in the AI world as "feature engineering." This article goes into more detail on feature engineering techniques if you are interested in learning more.
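The first two techniques in the list above can be sketched in plain Python. This is only an illustration of the ideas, not how the runtime implements them; the prices and colors are made-up sample data, and `normalize` and `one_hot` are hypothetical helper names.

```python
def normalize(values, low=-1.0, high=1.0):
    """Min-max scale numbers into the [low, high] range."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    return [low + (high - low) * (v - v_min) / (v_max - v_min) for v in values]


def one_hot(values, categories):
    """Encode each value as a row with a 1 in its category's column and 0 elsewhere."""
    return [[1 if v == c else 0 for c in categories] for v in values]


prices = [30000.0, 45000.0, 60000.0]             # made-up price samples
print(normalize(prices))                         # [-1.0, 0.0, 1.0]

colors = ["red", "blue", "green", "red"]
print(one_hot(colors, categories=["red", "blue", "green"]))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Note that min-max scaling depends on the minimum and maximum of the data seen so far, which is one reason the same transformation has to be applied consistently in training and in production.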

There are excellent tools like pandas, NumPy, SciPy, and others that make the process of data transformation easier. However, most of these tools are Python libraries and frameworks - which means having to learn Python if you don't know it already. Plus, when building intelligent apps (instead of just doing pure data analysis), this all needs to work on real-time data in production.

Building intelligent apps

The tools mentioned above are not designed for building real-time apps. They are often designed for analytics/data science.

In your app, you will need to do this data compilation in real-time - and you can't rely on a local script to help process your data. It becomes trickier if the team responsible for the initial training of the machine learning model is not the team responsible for deploying it out into production.

How data is loaded and processed in a static dataset is likely very different from how the data is loaded and processed in real-time as your app is live. The result often is two separate codebases that are maintained by different teams that are both responsible for doing the same thing! Ensuring that those codebases stay consistent and evolve together is another challenge to tackle.

Spice.ai helps developers build apps with real-time ML

Spice.ai handles the "compilation" of data for you.

You specify the data that your ML should learn from in a Spicepod. The runtime handles the logistics of gathering the data and compiling it into an AI-ready format.

It does this by using many techniques described earlier, such as normalization and one-hot encoding. And because we're continuing to evolve, our data compilation will only get better over time.

In addition, the design of the runtime naturally ensures that the data used for both the training and real-time cases are consistent. Spice.ai uses the same data-components and runtime logic to produce the data. And not only that, you can take this a step further and share your Spicepod with someone else, and they would be able to use the same AI-ready data for their applications.

Summary

Spice.ai handles the process of compiling your data into an AI-ready format in a way that is consistent both during the training and real-time stages of the ML engine. A Spicepod defines which data to get and where to get it. Sharing this Spicepod allows someone else to use the same AI-ready data format in their application.

Learn more and contribute

Building intelligent apps that leverage AI is still way too hard, even for advanced developers. Our mission is to make this as easy as creating a modern web page. If the vision resonates with you, join us!

Our Roadmap is public, and now that we have launched, the project and work are open for collaboration.

If you are interested in partnering, we'd love to talk. Try out Spice.ai, email us "hey," get in touch on Discord, or reach out on Twitter.

We are just getting started! 🚀


Spicepods: From Zero To Hero

· 8 min read
Luke Kim
Founder and CEO of Spice AI

In my previous post, Teaching Apps how to Learn with Spicepods, I introduced Spicepods as packages of configuration that describe an application's data-driven goals and how it should learn from data. To leverage Spice.ai in your application, you can author a Spicepod from scratch or build upon one fetched from the registry. In this post, we'll walk through the creation and authoring of a Spicepod step-by-step from scratch.

As a refresher, a Spicepod consists of:

  • A required YAML manifest that describes how the pod should learn from data
  • Optional seed data
  • Learned model/state
  • Performance telemetry and metrics

We'll create the Spicepod for the ServerOps Quickstart, an application that learns when to optimally run server maintenance operations based upon the CPU-usage patterns of a server machine.

We'll also use the Spice CLI, which you can install by following the Getting Started guide or Getting Started YouTube video.

Fast iterations

Modern web development workflows often include a file watcher to hot-reload so you can iteratively see the effect of your change with a live preview. Spice.ai takes inspiration from this and enables a similar Spicepod manifest authoring experience. If you first start the runtime in your application root before creating your Spicepod, it will watch for changes and apply them continuously so that you can develop in a fast, iterative workflow.

You would normally do this by opening two terminal windows side-by-side, one that runs the runtime using the command spice run and one where you enter CLI commands. In addition, developers would open the dashboard located at http://localhost:8000 to preview changes they make.

Figure 1. Spice.ai's modern development workflow

Creating a Spicepod

The easiest way to create a Spicepod is to use the CLI command: spice init <Spicepod name>. We'll make one in the ServerOps Quickstart application called serverops.

Figure 2. Creating a Spicepod.

The CLI saves the Spicepod manifest file in the spicepods directory of your application. You can see it created a new serverops.yaml file, which should be included in your application and be committed to your source repository. Let's take a look at it.

Figure 3. Spicepod manifest.

The initialized manifest file is very simple. It contains a name and three main sections:

  • dataspaces
  • actions
  • training

We'll walk through each of these in detail, and as a Spicepod author, you can always reference the documentation for the Spicepod manifest syntax.

Authoring a Spicepod manifest

You author and edit Spicepod manifest files in your favorite text editor with a combination of CLI helper commands. We eventually plan to have a VS Code extension and dashboard/portal editing abilities to make this even easier.

Adding a dataspace

To build an intelligent, data-driven application, we must first start with data.

A dataspace is a logical grouping of data with definitions of how that data should be loaded and processed, usually from a single source. A combination of its data source and its name identifies it, for example, nasdaq/msft or twitter/tweets. Read more about Dataspaces in the Core Concepts documentation.

Let's add a dataspace to the Spicepod manifest to load CPU metric data from a CSV file. This file is a snapshot of data from InfluxDB, a time-series database we like.

Figure 4. Adding a dataspace.

We can see this dataspace is identified by its source hostmetrics and name cpu. It includes a data section with a file data connector, the path to the file, and a data processor to know how to process it. In addition, it defines a single measurement usage_idle under the measurements section, which is a measurement of CPU load. In Spice.ai, measurements are the core primitive the AI engine uses to learn, and they are always numerical data. Spice.ai includes a growing library of community-contributable data connectors and data processors you can include in your Spicepod to access data. You can also contribute your own.

Finally, because the data is a snapshot of live data loaded from a file, we must set a Spicepod epoch_time that defines the data's start Unix time.

Now we have a dataspace, called hostmetrics/cpu, that loads CSV data from a file and processes the data into a usage_idle measurement. The file connector might be swapped out with the InfluxDB connector in a production application to stream real-time CPU metrics into Spice.ai. In addition, applications can always send real-time data to the runtime through its API with a simple HTTP POST (and in the future, using Web Sockets and gRPC).

Adding actions

Now that the Spicepod has data, let's define some data-driven actions so the ServerOps application can learn when is the best time to take them. We'll add three actions using the CLI helper command, spice action add.

Figure 5. Adding actions.

And in the manifest:

Figure 6. Actions added to the manifest

Adding rewards

The Spicepod now has data and possible actions, so we can now define how it should learn when to take them. Similar to how humans learn, we can set rewards or punishments for actions taken based on their effect and the data. Let's add scaffold rewards for all actions using the spice rewards add command.

Figure 7. Adding rewards

We now have rewards set for each action. The rewards are uniform (all the same), meaning the Spicepod is rewarded the same for each action. Higher rewards are better, so if we change perform_maintenance to 2, the Spicepod will learn to perform maintenance more often than the other actions. Of course, instead of setting these arbitrarily, we want to learn from data, and we can do that by referencing the state of data at each time-step in the time-series data as the AI engine trains.

Figure 8. Rewards added to the manifest

The rewards themselves are just code. We currently support Python code, either inline or in an external .py code file, and we plan to support several other languages. The reward code can access the time-step state through the prev_state and new_state variables and the dataspace name. For the full documentation, see Rewards.

Let's add this reward code to perform_maintenance, which will reward performing maintenance when there is low CPU usage.

cpu_usage_prev = 100 - prev_state.hostmetrics_cpu_usage_idle
cpu_usage_new = 100 - new_state.hostmetrics_cpu_usage_idle
cpu_usage_delta = cpu_usage_prev - cpu_usage_new
reward = cpu_usage_delta / 100

This code takes the difference in CPU usage (100 minus the idle time) between the previous time state and the new time state, and sets the reward to that delta normalized to a value between -1 and 1. When the CPU usage is moving from a higher cpu_usage_prev to a lower cpu_usage_new, it's a better time to run server maintenance, so the reward is positive. E.g. 80% - 50% = 30% = 0.3. However, if the CPU usage moves from lower to higher, 50% - 80% = -30% = -0.3, it's a bad time to run maintenance, so we provide a negative reward or "punish" the action.

Figure 9. Reward code
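To check the arithmetic outside the runtime, the same reward logic can be wrapped in a function and fed two hand-made states. The function wrapper `maintenance_reward` and the SimpleNamespace stand-ins for prev_state and new_state are assumptions for testing only; inside a Spicepod the runtime provides these variables.

```python
from types import SimpleNamespace


def maintenance_reward(prev_state, new_state):
    """Reward a drop in CPU usage, punish a rise (same logic as the snippet above)."""
    cpu_usage_prev = 100 - prev_state.hostmetrics_cpu_usage_idle
    cpu_usage_new = 100 - new_state.hostmetrics_cpu_usage_idle
    cpu_usage_delta = cpu_usage_prev - cpu_usage_new
    return cpu_usage_delta / 100


# Usage falls from 80% to 50% (idle rises from 20 to 50).
falling = maintenance_reward(
    SimpleNamespace(hostmetrics_cpu_usage_idle=20),
    SimpleNamespace(hostmetrics_cpu_usage_idle=50),
)

# Usage rises from 50% to 80% (idle falls from 50 to 20).
rising = maintenance_reward(
    SimpleNamespace(hostmetrics_cpu_usage_idle=50),
    SimpleNamespace(hostmetrics_cpu_usage_idle=20),
)

print(falling, rising)  # 0.3 -0.3
```

Because idle time ranges from 0 to 100, the delta can range from -100 to 100, so the reward always falls between -1 and 1.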

Through these rewards and punishments and the CPU metric data, the Spicepod will learn when it is a good time to perform maintenance and be the decision engine for the ServerOps application. You might be thinking you could write code without AI to do this, which is true, but handling the variety of cases, like CPU spikes, or patterns in the data, like cyclical server load, would take a lot of code and development time. Applying AI helps you build faster.

Putting it all together

The manifest now has defined data, actions, and rewards. The Spicepod can get data to learn which actions to take and when based on the rewards provided.

If the runtime is running, the Spicepod automatically trains each time the manifest file is saved. As this happens, reward performance can be monitored in the dashboard.

Once a training run completes, the application can query the Spicepod for a decision recommendation by calling the recommendations API http://localhost:8000/api/v0.1/pods/serverops/recommendation. The API returns a JSON document that provides the recommended action, the confidence of taking that action, and when that recommendation is valid.
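As a hedged sketch of calling this endpoint from application code, the snippet below builds the URL given above and fetches the JSON document with the Python standard library. The helper names are hypothetical, and the JSON is returned as-is because the exact field names of the response are not listed in this post.

```python
import json
import urllib.request


def recommendation_url(pod_name, base_url="http://localhost:8000"):
    """Build the recommendations API URL for a pod, following the pattern shown above."""
    return f"{base_url}/api/v0.1/pods/{pod_name}/recommendation"


def fetch_recommendation(pod_name, base_url="http://localhost:8000", timeout=5.0):
    """Fetch and decode the recommendation JSON (requires the runtime to be running)."""
    url = recommendation_url(pod_name, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(recommendation_url("serverops"))
# http://localhost:8000/api/v0.1/pods/serverops/recommendation
```

With the runtime running locally and a completed training run, calling fetch_recommendation("serverops") would return the decoded document for the application to inspect.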

In the ServerOps Quickstart, this API is called from the server maintenance PowerShell script to make an intelligent decision on when to run maintenance. The ServerOps Sample, which uses live data, can be continuously trained to learn and adapt even as the live data changes due to load patterns changing.

The full Spicepod manifest from this walkthrough can be added using the spice add quickstarts/serverops command.


Leveraging Spice.ai to be the decision engine for your server maintenance application helps you build smarter applications faster, applications that will continue to learn and adapt even as usage patterns change over time.

Learn more and contribute

Building intelligent apps that leverage AI is still way too hard, even for advanced developers. Our mission is to make this as easy as creating a modern web page. If the vision resonates with you, join us!

Our Roadmap is public, and now that we have launched, the project and work are open for collaboration.

If you are interested in partnering, we'd love to talk. Try out Spice.ai, email us "hey," get in touch on Discord, or reach out on Twitter.

We are just getting started! 🚀

Luke

Spice.ai v0.4.1-alpha

· 2 min read
Luke Kim
Founder and CEO of Spice AI

Announcing the release of Spice.ai v0.4.1-alpha! ✅

This point release focuses on fixes and improvements to v0.4-alpha. Highlights include AI engine performance improvements, updates to the dashboard observations data grid, notification of new CLI versions, and several bug fixes.

A special acknowledgment to @Adm28, who added the CLI upgrade detection and prompt, which notifies users of new CLI versions and prompts to upgrade.

CLI upgrade prompt

Highlights in v0.4.1-alpha

AI engine performance improvements

Overall training performance has been improved up to 13% by removing a lock in the AI engine.

In versions before v0.4.1-alpha, performance was especially impacted when streaming new data during a training run.

Dashboard Observations Datagrid

The dashboard observations datagrid now automatically resizes to the window width, and headers are easier to read, with automatic grouping into dataspaces. In addition, column widths are also resizable.

CLI version detection and upgrade prompt

When it is run, the CLI will now automatically check for new CLI versions, at most once a day.

If it detects a new version, it will print a notification to the console on spice version, spice run or spice add commands prompting the user to upgrade using the new spice upgrade command.

New in this release

  • Adds automatic resizing of the observations datagrid.
  • Adds header group by dataspace to the observations datagrid.
  • Adds CLI version detection and prompt for upgrade on version, run, and add commands.
  • Adds support for parsing hex-encoded times and measurements. Use the time_format of hex or prefix with 0x.
  • Updates AI engine with improved training performance.
  • Updates Go and NPM dependencies.
  • Fixes detection of Spicepods in the Spicepods directory, and a resulting error when loading a non-Spicepod file.
  • Fixes a potential "zip slip" security issue.
  • Fixes an issue where the AI engine may not shut down gracefully.


Community

Spice.ai started with the vision to make AI easy for developers. We are building in the open and with the community. Reach out on Discord or by email to get involved. We will also be starting a community call series soon!