Autoregressive moving-average models

Authors: Ellen L. Hamaker & Sophie W. Berkhout, Methodology & Statistics Department, Utrecht University

Published: 2025-08-28

This article has not been peer-reviewed yet and may be subject to change.

This article is about autoregressive moving-average (ARMA) models, which form a key family of models for the analysis of N=1 time series data. You can read more about pure autoregressive (AR) models and pure moving-average (MA) models in their separate articles. Here the focus is on the combination of the two.

When dealing with univariate data from a single case (e.g., a person or country), like daily diary measures of a person’s mood, you may consider using a first-order AR model: This is a time series model that can be used to account for temporal dependencies in the data, and it is a model that many researchers in psychology and related fields are familiar with. However, this model can only account for certain kinds of temporal patterns in the data. Learning more about the family of ARMA models will help you to appreciate a much wider variety of temporal patterns that may exist, how these can be accounted for, and what they may mean in terms of the underlying process.

Below you find information on: 1) mixed ARMA models; 2) how to decide on the order of the AR and the MA components in a mixed ARMA model; 3) various connections that can be made between different members of the ARMA family; and 4) estimation techniques for ARMA models.

1 Mixed ARMA models

The classes of autoregressive (AR) and moving-average (MA) models were both proposed in 1927, by Yule (1927) and Slutzky (1927; which later appeared in English as Slutzky, 1937), respectively. Both were developed to handle temporal dependencies and cycles in time series data. The combination of the two is known as the mixed autoregressive moving-average (ARMA) model, which was popularized by Box & Jenkins (1970) (according to Wikipedia, Peter Whittle was the first to introduce it).

An ARMA model is determined by the order \(p\) of its AR component and the order \(q\) of its MA component, and is referred to as an ARMA(\(p,q\)) model. In this section you can read more about the ARMA(1,1) model first, and then about the general ARMA(\(p,q\)) case.

1.1 ARMA(1,1) models

An ARMA(1,1) model combines components of two models:

  • A first-order AR model, which can be expressed as: \(y_t = c + \phi_1 y_{t-1} + \epsilon_t.\)

  • A first-order MA model, which can be expressed as: \(y_t = \mu + \theta_1 \epsilon_{t-1} + \epsilon_t.\)

An ARMA(1,1) model contains both an AR component and an MA component, and can be expressed as

\[y_t = c + \phi_1 y_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t,\] where \(c\) is the intercept, \(\phi_1\) and \(\theta_1\) are the AR and MA parameters respectively, and \(\epsilon_t\) is referred to as the innovation. The latter is also described as the random shock that is given to the system at occasion \(t\). This model is visualized in Figure 1.

Figure 1: Path diagram of the first-order autoregressive first-order moving-average (ARMA(1,1)) model.

It is important to realize that an ARMA(1,1) model is not the sum of an AR(1) model and an MA(1) model. Such a sum would imply that there are two distinct innovations \(\epsilon_t\) included—one from the AR(1) process and a different one from the MA(1) process—and that only the innovation of the MA model has a lagged effect. Such summations do exist, and you can read more about this below (in section 3). But the ARMA(1,1) model presented above should be thought of as a combination of AR and MA components, not as a sum of an AR process and MA process.

From a substantive point of view, you can think of an ARMA(1,1) as a process in which there is, on the one hand, carry-over from one occasion to the next in terms of the observed variable, which is captured by the AR component; on the other hand, there are random shocks to the system that also have a delayed effect, as captured by the MA term.

Mike is studying daily alcohol consumption in a person who is identified as a heavy drinker. He has been thinking about using an AR(1) model, because he thinks that a day on which the person drank a lot is likely to be followed by another day on which the person drinks a lot. But Mike also considered an MA(1) model as a potentially suitable way to account for the dependencies in the data, reasoning that random factors that occur on a given day and that impact the person’s drinking behavior may also have a delayed impact the following day.

Mike decides to use an ARMA(1,1) model for the data: This model has the carry-over in the drinking behavior from one day to the next, as well as the delayed effect of random shocks to the system. By considering the size and uncertainty of the estimates of \(\phi_1\) and \(\theta_1\), he can then also investigate whether each of the mechanisms (carry-over and delayed effect) seems to be supported by the data.

The mean of an ARMA(1,1) is obtained with the same expression as for an AR(1) model, that is

\[\mu=\frac{c}{1-\phi_1}.\]

In contrast, the variance of an ARMA(1,1) model is a more complicated expression that involves both \(\phi_1\) and \(\theta_1\) and the innovation variance \(\sigma_\epsilon^2\), that is

\[\sigma_y^2 = \frac{(1+\theta_1^2 + 2\theta_1\phi_1)\sigma_{\epsilon}^2}{1-\phi_1^2}.\]

To ensure the model is stationary, the same restriction applies as for an AR(1); that is, \(\phi_1\) has to lie between -1 and 1. For the model to be invertible, the same restriction applies as for an MA(1); that is, \(\theta_1\) must lie between -1 and 1. You can read more about these constraints in the articles about autoregressive models and moving-average models.
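These moment expressions can be checked by simulation. The following is a minimal Python sketch (separate from the R-based interactive tool used later in this article; the function name `simulate_arma11` is ours), assuming \(c=1\), \(\phi_1=0.5\), \(\theta_1=0.3\), and \(\sigma_\epsilon^2=1\), so that \(\mu=2\) and \(\sigma_y^2=1.39/0.75\approx 1.85\):

```python
import random

def simulate_arma11(n, c, phi, theta, sd=1.0, seed=1, burnin=500):
    """Simulate n occasions from y_t = c + phi*y_{t-1} + theta*e_{t-1} + e_t."""
    rng = random.Random(seed)
    mu = c / (1 - phi)          # stationary mean, used as a sensible start value
    y_prev, e_prev = mu, 0.0
    out = []
    for t in range(n + burnin):
        e = rng.gauss(0.0, sd)
        y = c + phi * y_prev + theta * e_prev + e
        if t >= burnin:         # discard the burn-in occasions
            out.append(y)
        y_prev, e_prev = y, e
    return out

y = simulate_arma11(200_000, c=1.0, phi=0.5, theta=0.3)
m = sum(y) / len(y)
v = sum((val - m) ** 2 for val in y) / len(y)
# theory: mu = 1/(1 - 0.5) = 2 and
# sigma_y^2 = (1 + 0.3^2 + 2*0.3*0.5) / (1 - 0.5^2) = 1.39/0.75
```

With a very long simulated series the sample mean and variance land close to these theoretical values; for series of realistic length the deviations are much larger.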

1.2 ARMA(\(p,q\)) models

The general ARMA(\(p,q\)) model combines \(p\) AR terms with \(q\) MA terms, and can be expressed as

\[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t.\] The ARMA(\(p,q\)) model can also be expressed with summation symbols, one for the AR part containing past versions of the outcome \(y_t\), and one for the MA part forming a weighted sum of past versions of the innovation \(\epsilon_t\), that is

\[y_t = c + \sum_{j=1}^p \phi_j y_{t-j} + \sum_{k=1}^q \theta_k \epsilon_{t-k} + \epsilon_t. \]

Instead of using the expressions above that contain an intercept \(c\), you can also use centered versions of past \(y\)’s, which gives

\[y_t = \mu + \sum_{j=1}^p \phi_j (y_{t-j}-\mu) + \sum_{k=1}^q \theta_k \epsilon_{t-k} + \epsilon_t,\]

in which case the parameter \(\mu\) is the mean of the series \(y\). The mean \(\mu\) and the intercept \(c\) are related through

\[\mu = \frac{c}{1 - \phi_1 - \phi_2 - \dots - \phi_p}.\] This connection between \(c\) when the past versions of \(y\) are not centered and \(\mu\) when past versions are centered, is identical to the connection within a pure AR(\(p\)) model. The fact that the MA part of the model does not affect the mean is because the innovations \(\epsilon\) have—by definition—a mean of zero.

The general expression for the variance of an ARMA(\(p,q\)) model is rather involved, and requires the use of the backshift operator (cf. Hamilton, 1994); this is beyond the scope of the current article.
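Although the general variance expression is beyond the scope here, the summation form of the model translates directly into a simulator. A minimal Python sketch (function name and defaults are ours), assuming stationarity so that the mean \(c/(1-\sum_j \phi_j)\) exists:

```python
import random

def simulate_arma(n, c, phi, theta, sd=1.0, seed=7, burnin=500):
    """Simulate an ARMA(p,q): y_t = c + sum_j phi[j]*y_{t-1-j}
    + sum_k theta[k]*e_{t-1-k} + e_t, with phi and theta given as lists."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    mu = c / (1 - sum(phi))        # stationary mean (assumes stationarity)
    y = [mu] * max(p, 1)           # start the recursion at the mean
    e = [0.0] * max(q, 1)
    out = []
    for t in range(n + burnin):
        e_t = rng.gauss(0.0, sd)
        y_t = (c
               + sum(phi[j] * y[-(j + 1)] for j in range(p))
               + sum(theta[k] * e[-(k + 1)] for k in range(q))
               + e_t)
        y.append(y_t)
        e.append(e_t)
        if t >= burnin:            # discard the burn-in occasions
            out.append(y_t)
    return out

# an ARMA(2,1): its mean should equal c / (1 - phi_1 - phi_2) = 0.9/0.3 = 3
series = simulate_arma(100_000, c=0.9, phi=[0.5, 0.2], theta=[0.4])
```

For this ARMA(2,1) the sample mean of a long series approximately reproduces the theoretical mean of 3, in line with the relation between \(c\) and \(\mu\) given above.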

2 Deciding on the order of your ARMA model

When you consider the use of ARMA models for analyzing your time series data, an important question is how you can decide what orders \(p\) and \(q\) are needed, or which values for these to consider. There are various ways to reach a decision about this.

2.1 Substantive reasons

A first way to decide on the order of your ARMA model is to consider substantive reasoning based on theory. This requires you to really have an idea about why certain lagged effects—either from past observed scores or past innovations—would characterize the process you are investigating.

A pure AR model can be thought of as a random variable that varies over time, but that is characterized by some momentum effect (Granger & Morris, 1976). For instance, when you are cycling to work, you may go somewhat faster or slower from time to time, but your current speed is dependent on your speed a moment ago.

A pure MA process can be thought of as a variable that is perturbed by random shocks (Granger & Morris, 1976). Slutzky (1927) described it as the result of the summation of random causes. For instance, while you are cycling to work, your momentary mood is affected by all kinds of random events that occur: someone cuts you off in traffic, you get a phone call from a friend, the sun starts shining, you get a flat tire, you suddenly notice the fragrance of the flowers, et cetera. These factors have an immediate effect, but also a somewhat delayed effect on your mood.

While there may be situations where you have strong substantive grounds to prefer one model over another, in general researchers do not have such explicit ideas about the substantive reasons for including—or excluding—particular lagged effects. In that case you need alternative grounds for deciding on the orders \(p\) and \(q\).

2.2 Autocorrelations and partial autocorrelations

Box & Jenkins (1970) promoted the use of the autocorrelations to determine what values to consider for \(p\) and \(q\). Specifically, their method is based on considering the patterns in both the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to distinguish between pure AR, pure MA, and mixed ARMA processes. The three categories are described below.

2.2.1 ACF and PACF of pure AR processes

Pure AR(\(p\)) processes are characterized by an ACF that shows a slow decay towards zero as the lag increases, whereas the PACF cuts off suddenly after lag \(p\). This is illustrated for two AR(1) processes in Figure 2.

Figure 2: Examples of first-order autoregressive processes with an autoregressive parameter of 0.8 (top) or -0.8 (bottom). On the left are the simulated data plotted against time; middle contains the autocorrelation function plots; right contains the partial autocorrelations function plots.

The AR(1) process at the top is characterized by a positive autoregressive parameter, which implies that deviations from the average tend to linger for a while. In contrast, the AR(1) process at the bottom is characterized by a negative autoregressive parameter, resulting in a sawtooth pattern, where high scores tend to be followed by low scores and vice versa. These features can also be seen when you consider the ACF plots: The autocorrelations tend to be high and positive for the first process, and switch between positive and negative values for the second process. In contrast, the PACF plots are characterized by a large value at lag 1, after which they are basically zero (although they may occasionally differ significantly from zero, due to chance).
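The cut-off behavior of the PACF described above can be verified numerically. The sketch below (plain Python, no plotting; the helper names are ours) computes the sample ACF and, via the Durbin–Levinson recursion, the sample PACF of a long simulated AR(1) with \(\phi_1=0.8\):

```python
import random

def acf(y, nlags):
    """Sample autocorrelations r_0..r_nlags."""
    n = len(y)
    m = sum(y) / n
    d = [v - m for v in y]
    c0 = sum(v * v for v in d)
    return [sum(d[t] * d[t + k] for t in range(n - k)) / c0
            for k in range(nlags + 1)]

def pacf(y, nlags):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(y, nlags)
    phi = [[0.0] * (nlags + 1) for _ in range(nlags + 1)]
    pac = [1.0]
    for k in range(1, nlags + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
        pac.append(phi[k][k])
    return pac

# long AR(1) series with phi_1 = 0.8 (200 burn-in occasions discarded)
rng = random.Random(8)
x, series = 0.0, []
for _ in range(50_200):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    series.append(x)
series = series[200:]

r = acf(series, 5)     # decays roughly as 0.8^k
pac = pacf(series, 5)  # large at lag 1, near zero afterwards
```

For this process the ACF decays roughly as \(0.8^k\), while the PACF shows a single large value at lag 1 and values near zero at all larger lags.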

When the order of the AR process is higher, more complex temporal patterns may arise. This is illustrated in Figure 3, which contains two AR(2) processes. You can see that while the ACF plots have autocorrelations different from zero for quite a few lags, the PACF plots are characterized by non-zero partial autocorrelations only at lags 1 and 2 (with an occasional non-zero partial autocorrelation at a larger lag, again due to chance).

Figure 3: Examples of second-order autoregressive processes with: \(\phi_1=1\) and \(\phi_2=-0.6\) (top), giving an oscillating pattern; and \(\phi_1=0.5\) and \(\phi_2=0.2\) (bottom), giving a non-oscillating pattern. On the left are the simulated data plotted against time; middle contains the autocorrelation function plots; right contains the partial autocorrelations function plots.

The AR(2) process displayed at the top is an oscillating process, which you may see from the plot of the time series itself on the left; it becomes clearer when considering the ACF though. The latter shows a change from positive autocorrelations at small lags, to negative autocorrelations at larger lags, then going back in the direction of positive autocorrelations (although there is not enough evidence to conclude that these are different from zero). This implies that observations close in time tend to deviate from the mean in the same direction, while observations that are a bit further apart (say observations at an interval of 3 to 6) tend to be negatively related to each other. This is very different from the non-oscillating pattern that characterizes the AR(2) process at the bottom: This process is characterized by positive autocorrelations across quite some lags. You can try these differences out yourself with the interactive tool for an AR(2) model in the article about autoregressive models.

2.2.2 ACF and PACF of pure MA processes

For pure MA(\(q\)) processes, the patterns in the ACF and PACF are exactly the other way around: The ACF cuts off suddenly after lag \(q\), whereas the PACF shows a slow decay towards zero. You can see several examples of this in Figure 4.

Figure 4: Examples of first-order moving-average processes with a moving-average parameter of 0.8 (top) or -0.8 (bottom). On the left are the simulated data plotted against time; middle contains the autocorrelation function plots; right contains the partial autocorrelations function plots.

When you look at the ACF and PACF plots of these two processes, you can see that indeed the autocorrelations are zero after lag 1, while the partial autocorrelations are different from zero also beyond lag 1.

When considering an MA(2) process, the expected pattern is: autocorrelations at lags 1 and 2 that differ from zero, while for larger lags they are (about) zero; partial autocorrelations that may well differ from zero beyond lag 2. Two examples of an MA(2) process are provided in Figure 5.

Figure 5: Examples of second-order moving-average processes. Process at the top has \(\theta_1=0.7\) and \(\theta_2=0.6\); process at the bottom has \(\theta_1=-0.7\) and \(\theta_2=0.6\). On the left are the simulated data plotted against time; middle contains the autocorrelation function plots; right contains the partial autocorrelations function plots.

While the ACF and PACF plots show the expected pattern to some degree, it should be noted that this pattern does not seem to be very consistent when dealing with time series of this length (i.e., 150 occasions). This means that—while the theory about autocorrelations and partial autocorrelations is clear—in practice, it is not that easy to determine what process generated the data based on the ACF and PACF.

2.2.3 ACF and PACF of mixed processes

Mixed ARMA(\(p,q\)) processes are characterized by an ACF and PACF that may both show a slow decay towards zero. You can try this out with the interactive tool below; this allows you to generate an ARMA(1,1) process, and to see how the various model parameters affect the kind of patterns that such a process can have.

#| standalone: true
#| viewerHeight: 410
library(shiny)
library(bslib)

## file: app.R
library(shiny)
library(bslib)
library(ggplot2)
library(patchwork)

simulateData <- function(n, mu, sigma, phi, theta) {
  
  burnin <- 100
  burnin_t <- n + burnin
  
  # `sigma` is the requested variance of y; derive the innovation variance
  # from the ARMA(1,1) variance formula
  # sigma_y^2 = (1 + theta^2 + 2*phi*theta) * innov_var / (1 - phi^2),
  # solved for innov_var
  innov_var <- (sigma - phi ^ 2 * sigma) / (2 * phi * theta + theta ^ 2 + 1)
  
  e <- rnorm(burnin_t, sd = sqrt(innov_var))
  y <- numeric(burnin_t)
  y[1] <- mu  # start at the mean; the burn-in removes any remaining start-up effect
  
  for (t in 2:burnin_t) {
    y[t] <- mu + phi * (y[t - 1] - mu) + theta * e[t - 1] + e[t]
  }
  
  # discard the burn-in occasions
  y <- y[(burnin + 1):burnin_t]
  
  dat <- data.frame(y = y, t = 1:n)
  
  return(dat)
  
}

plotTimeSeries <- function(dat, plot_widths = c(5, 2, 2)) {
  
  x_breaks <- pretty(1:nrow(dat))
  
  y_breaks <- pretty(dat$y)
  y_lim <- c(min(y_breaks), max(y_breaks))
  
  axis_line_width <- 0.3
  
  meany <- mean(dat$y)
  
  p_ts <- ggplot(dat) + 
    geom_segment(x = x_breaks[1], xend = max(x_breaks), y = meany, yend = meany,
                 linewidth = 0.3, colour = "gray") +
    geom_line(aes(x = t, y = y)) +
    geom_point(aes(x = t, y = y), size = 2, shape = 21, fill = "white") +
    scale_y_continuous(breaks = y_breaks) +
    scale_x_continuous(breaks = x_breaks) +
    coord_cartesian(ylim = y_lim, xlim = c(min(x_breaks), max(x_breaks))) +
    theme_void() +
    labs(x = "t", y = "y") + 
    theme(text = element_text(family = "sans", size = 16),
          axis.text = element_text(margin = margin(5, 5, 5, 5)),
          axis.text.y = element_text(hjust = 1),
          axis.title = element_text(margin = margin(5, 5, 5, 5)),
          axis.ticks = element_line(lineend = "butt",
                                    linewidth = axis_line_width),
          axis.ticks.length = unit(2.5, "pt"),
          panel.spacing = unit(5, units = "pt"),
          plot.margin = margin(0, 5, 0, 0),
          legend.position = "none") +
    geom_segment(x = -Inf, xend = -Inf, y = y_breaks[1], yend = max(y_breaks),
                 linewidth = axis_line_width, lineend = "square") +
    geom_segment(y = -Inf, yend = -Inf, x = x_breaks[1], xend = max(x_breaks),
                 linewidth = axis_line_width, lineend = "square")
  
  
  # plot = FALSE: otherwise acf()/pacf() draw base-graphics plots as a side effect
  ac <- acf(dat$y, plot = FALSE)
  pac <- pacf(dat$y, plot = FALSE)
  y_breaks_ac <- pretty(c(ac$acf, pac$acf))
  x_breaks_ac <- pretty(ac$lag)
  
  p_acf <- plotACF(ac, x_breaks_ac, y_breaks_ac)
  p_pacf <- plotACF(pac, x_breaks_ac, y_breaks_ac, y_lab = "Partial ACF")
  
  p <- p_ts + p_acf + p_pacf + plot_layout(widths = plot_widths)
  
  return(p)
  
}

plotACF <- function(ac, x_breaks, y_breaks, y_lab = "ACF") {
  clim <- qnorm((1 + 0.95) / 2) / sqrt(ac$n.used)
  df <- data.frame(y = ac$acf, x = ac$lag)
  y_lim <- c(min(y_breaks), max(y_breaks))
  
  
  axis_line_width <- 0.3
  
  df_ribbon <- data.frame(x = c(min(x_breaks) - 0.1, max(x_breaks) + 0.1),
                          upper = clim, lower = -clim)
  p_acf <- ggplot(df) +
    geom_segment(x = min(x_breaks), xend = max(x_breaks), y = 0, yend = 0,
                 linewidth = axis_line_width, lineend = "square") +
    geom_linerange(aes(ymax = y, ymin = 0, x = x)) +
    geom_ribbon(data = df_ribbon, aes(x = x, ymin = lower, ymax = upper), alpha = 0.15) +
    scale_y_continuous(breaks = y_breaks) +
    scale_x_continuous(breaks = x_breaks) +
    coord_cartesian(ylim = y_lim, xlim = c(min(x_breaks), max(x_breaks))) +
    theme_void() +
    labs(x = "Lag", y = y_lab) + 
    theme(text = element_text(family = "sans", size = 16),
          axis.text = element_text(margin = margin(5, 5, 5, 5)),
          axis.text.y = element_text(hjust = 1),
          axis.title = element_text(margin = margin(5, 5, 5, 5)),
          axis.title.y = element_text(angle = 90),
          axis.ticks = element_line(lineend = "butt",
                                    linewidth = axis_line_width),
          axis.ticks.length = unit(2.5, "pt"),
          panel.spacing = unit(5, units = "pt"),
          plot.margin = margin(0, 5, 0, 0),
          legend.position = "none") +
    geom_segment(x = -Inf, xend = -Inf, y = y_breaks[1], yend = max(y_breaks),
                 linewidth = axis_line_width, lineend = "square") +
    geom_segment(y = -Inf, yend = -Inf, x = x_breaks[1], xend = max(x_breaks),
                 linewidth = axis_line_width, lineend = "square")
  
  return(p_acf)
}

ui <- fluidPage(
  fluidRow(
    column(2,
           numericInput("n", label = h4("# occasions T"), value = 100,
                        min = 10, max = 1e5, step = 10)
    ),
    column(2,
           numericInput("mean", label = HTML("<h4>Mean &mu;</h4>"), value = 0, step = 1)
    ),
    
    column(2,
           numericInput("var", label = HTML("<h4>Variance &sigma;<sup>2</sup><sub style='position: relative; left: -0.4em;'>y</sub></h4>"),
                        value = 1, min = 0, max = 100, step = 1)
    ),
    
    column(3,
           sliderInput("autoregression", label = HTML("<h4>Autoregression \U03D5<sub>1</sub></h4>"), min = -0.999, 
                       max = 0.999, value = 0.3, step = 0.05)
    ),
    
    column(3,
           sliderInput("movingaverage", label = HTML("<h4>Moving average \U03B8<sub>1</sub></h4>"), min = -0.999, 
                       max = 0.999, value = 0.3, step = 0.05)
    )
  ),
  
  actionButton("refresh", "Refresh"),
  
  plotOutput(outputId = "main_plot", height = "250px")
)

server <- function(input, output) {
  
  dat <- reactive({
    input$refresh  # include to make it reactive to the button
    simulateData(input$n, input$mean, input$var,
                 input$autoregression, input$movingaverage)
  })
  
  output$main_plot <- renderPlot({
    
    plotTimeSeries(dat())
    
  })
  
}

shinyApp(ui = ui, server = server)

The tool shows three visualizations of the data generated from an ARMA(1,1) model with the specified parameter values and number of occasions. The left figure is a time series plot of \(y\) against time, where the horizontal grey line indicates the observed mean of the series. The middle plot contains the ACF, and the right plot the PACF. Note that you can also use this tool to generate either a pure AR(1) process, or a pure MA(1) process, by setting \(\theta_1=0\) or \(\phi_1=0\), respectively.

When you generate data from the same model (using the same parameter values, only pressing the Refresh button), you see that the time series itself changes, but that its ACF and PACF can also change quite a bit. Hence, when your time series is not that long, these are not very reliable diagnostic tools for determining the order of the underlying ARMA process.

2.2.4 Conclusion regarding the use of the ACF and PACF

While using the ACF and PACF as diagnostic tools works well when the time series are long enough, for shorter time series they tend to be less reliable. The reason for this is that when you have fewer time points, the ACF and PACF are more strongly affected by sample fluctuations (i.e., coincidences), and their appearance may easily change from sample to sample, as you may have seen when using the interactive tool provided above.

The use of the ACF and PACF to guide one’s decision on what values to use for \(p\) and \(q\) should be understood as a strategy from the pre-computer era, when Box & Jenkins (1970) proposed this approach. At the time, obtaining maximum likelihood estimates for ARMA parameters was very time consuming, as recursive equations had to be run by hand. Hence, getting an initial sense of what orders for \(p\) and \(q\) to consider was a valuable way to narrow down the number of possibilities.

After obtaining a diagnosis about what process may have generated the data, the next step of the strategy laid out by Box & Jenkins (1970) consists of estimating the model and saving the residuals from that model. If all the temporal dependencies that are present in the data are properly accounted for by the model, the residuals should behave as white noise; this implies that all autocorrelations and partial autocorrelations of the residuals should be zero.

This latter check is still commonly used after model estimation. However, deciding on the orders of \(p\) and \(q\) is nowadays typically no longer done by eyeballing the ACF and PACF; instead, people tend to simply estimate and compare multiple ARMA models.
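A common formalization of this residual check is the Ljung–Box portmanteau statistic, \(Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\), which approximately follows a \(\chi^2\) distribution with \(h\) degrees of freedom when the residuals are white noise. A minimal Python sketch (the helper name is ours):

```python
import random

def ljung_box(y, h):
    """Ljung-Box Q over lags 1..h; ~ chi-square(h) if y is white noise."""
    n = len(y)
    m = sum(y) / n
    d = [v - m for v in y]
    c0 = sum(v * v for v in d)
    q = 0.0
    for k in range(1, h + 1):
        r_k = sum(d[t] * d[t + k] for t in range(n - k)) / c0
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q

rng = random.Random(2)
wn = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # "residuals" that are white noise
ar = [0.0]                                       # "residuals" with leftover dependence
for _ in range(999):
    ar.append(0.8 * ar[-1] + rng.gauss(0.0, 1.0))

q_wn = ljung_box(wn, 10)  # should stay on the chi-square(10) scale
q_ar = ljung_box(ar, 10)  # should be far out in the tail
```

Here \(Q\) should stay on the \(\chi^2_{10}\) scale for the white-noise series, but becomes very large for the autocorrelated series, correctly signalling that dependence is left in the “residuals”.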

2.3 Comparing models

With the fast computers of today, you can easily and quickly run various models and compare their fit using maximum likelihood estimation. There are also programs that automatically search for the optimal values for \(p\) and \(q\). For instance, the Hyndman-Khandakar algorithm runs a series of models and uses the AICc (i.e., the small-sample corrected Akaike information criterion) to select the best fitting model (for details, see Section 9.7 of Hyndman & Athanasopoulos, 2021; you can find a link to an online version of their book at the bottom of this article).
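The Hyndman–Khandakar algorithm itself combines unit-root tests, a stepwise search, and full maximum likelihood ARMA fits. As a simplified illustration of the AICc idea only, the following Python sketch (all function names are ours) fits pure AR models of orders 1 to 3 by conditional least squares to a simulated AR(2) series and scores them with AICc:

```python
import math, random

def simulate_ar2(n, phi1, phi2, seed=3, burnin=300):
    """Zero-mean AR(2) series (so the fits below can skip the intercept)."""
    rng = random.Random(seed)
    y = [0.0, 0.0]
    for _ in range(n + burnin):
        y.append(phi1 * y[-1] + phi2 * y[-2] + rng.gauss(0.0, 1.0))
    return y[-n:]

def solve(a, b):
    """Gaussian elimination with partial pivoting for small linear systems."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    p = len(m)
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(i + 1, p):
            f = m[r][i] / m[i][i]
            for col in range(i, p + 1):
                m[r][col] -= f * m[i][col]
    x = [0.0] * p
    for i in reversed(range(p)):
        x[i] = (m[i][p] - sum(m[i][j] * x[j] for j in range(i + 1, p))) / m[i][i]
    return x

def ar_aicc(y, p, pmax):
    """Conditional least-squares AR(p) fit; Gaussian AICc (up to a constant).

    All candidate orders condition on the first pmax observations, so that
    every fit is scored on the same n observations."""
    n = len(y) - pmax
    X = [[y[t - j] for j in range(1, p + 1)] for t in range(pmax, len(y))]
    Y = y[pmax:]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    XtY = [sum(row[i] * yt for row, yt in zip(X, Y)) for i in range(p)]
    beta = solve(XtX, XtY)
    rss = sum((yt - sum(b * xi for b, xi in zip(beta, row))) ** 2
              for row, yt in zip(X, Y))
    k = p + 1  # AR coefficients plus the innovation variance
    return n * math.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

y = simulate_ar2(2000, 0.5, 0.3)
scores = {p: ar_aicc(y, p, 3) for p in (1, 2, 3)}
```

The AICc penalizes the extra parameter of the AR(3) fit, so the AR(2) typically wins here; with short series the selected order can easily be wrong, which is why automated selection should be sanity-checked.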

An alternative way to compare models is by determining their out-of-sample performance using a time-series cross-validation approach. This requires you to divide your observed time series into a training set (used for estimating a model), and a validation set (used for the out-of-sample prediction). This can be done in various ways (see section 5.10 of Hyndman & Athanasopoulos, 2021).
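A minimal sketch of such time-series cross-validation in Python (a rolling origin with one-step-ahead forecasts; the comparison of an AR(1) forecast against a mean-only benchmark is our own illustration):

```python
import random

# simulate an AR(1) with phi_1 = 0.8 so that the AR forecast should win
rng = random.Random(4)
y = [0.0]
for _ in range(499):
    y.append(0.8 * y[-1] + rng.gauss(0.0, 1.0))

# rolling origin: train on y[:t], forecast occasion t, then move one step ahead
err_ar, err_mean = [], []
for t in range(100, len(y)):
    train = y[:t]
    m = sum(train) / t
    num = sum((train[i] - m) * (train[i + 1] - m) for i in range(t - 1))
    den = sum((v - m) ** 2 for v in train)
    phi_hat = num / den                   # lag-1 least-squares estimate
    f_ar = m + phi_hat * (train[-1] - m)  # AR(1) one-step-ahead forecast
    err_ar.append((y[t] - f_ar) ** 2)
    err_mean.append((y[t] - m) ** 2)      # benchmark: forecast with the training mean

mse_ar = sum(err_ar) / len(err_ar)
mse_mean = sum(err_mean) / len(err_mean)
```

Because the data were generated with \(\phi_1=0.8\), the AR(1) forecasts achieve a clearly lower out-of-sample mean squared error than the mean-only benchmark.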

3 Connection between AR, MA, and ARMA models

While the insights from Box & Jenkins (1970) about the differences in behaviors of pure AR models, pure MA models, and mixed ARMA models in terms of their (partial) autocorrelations suggest that these are inherently distinct sub-classes of models, there are some interesting and important connections between them. It is good to know about these connections, because they provide a more nuanced perspective on ARMA modeling.

In this section, you can read more about how pure AR models can be represented as pure MA models and vice versa, and how mixed ARMA models can be represented as pure versions of one or the other. Another connection that you can read about here is how summing independent processes from the class of ARMA models results in specific versions of an ARMA model.

3.1 Rewriting an AR(\(p\)) model as an MA(\(\infty\)) model

Pure AR models of finite order \(p\) can be re-expressed as MA models of infinite order. This is based on recursively replacing past values of the outcome by their expression in terms of past values and innovations.

To see how this works, consider the AR(1) model

\[y_t = c + \phi_1 y_{t-1} + \epsilon_t.\] The predictor \(y_{t-1}\) can be expressed using the same AR(1) structure, that is, \(y_{t-1} = c + \phi_1 y_{t-2} + \epsilon_{t-1}\). If you use this to replace \(y_{t-1}\) in the expression above, you get

\[y_t = c + \phi_1 (c+ \phi_1 y_{t-2} + \epsilon_{t-1}) + \epsilon_t\]

\[\;\;\;\;\;\;= c + \phi_1 c + \phi_1^2 y_{t-2} + \phi_1 \epsilon_{t-1} + \epsilon_t.\] In the same way, you can make use of the fact that \(y_{t-2} = c+\phi_1 y_{t-3} + \epsilon_{t-2}\), to get

\[y_t = c + \phi_1 c + \phi_1^2 (c + \phi_1 y_{t-3} + \epsilon_{t-2}) + \phi_1 \epsilon_{t-1} + \epsilon_t\]

\[ \;\;\;\;\;= c + \phi_1 c + \phi_1^2 c + \phi_1^3 y_{t-3} + \phi_1^2 \epsilon_{t-2} + \phi_1 \epsilon_{t-1} + \epsilon_t.\] When you continue this approach, you end up with

\[y_t = (1+ \phi_1 + \phi_1^2 + \dots )c + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_1^2\epsilon_{t-2} + \dots \]

where the first term on the right-hand side is known in mathematics as a geometric series that can be rewritten as \(\frac{c}{1-\phi_1}\), and the weighted sum of past innovations can be expressed using a summation symbol. Together this results in

\[y_t = \frac{c}{1-\phi_1} + \sum_{k=1}^\infty \phi_1^k \epsilon_{t-k} + \epsilon_t.\] This shows that the AR(1) model that you started with has now been re-expressed as an MA(\(\infty\)) model, with mean \(\mu=\frac{c}{1-\phi_1}\). The MA parameter at lag \(k\) is \(\theta_k = \phi_1^k\). These parameters become increasingly smaller as \(k\) increases, because—by definition—\(\phi_1\) has to lie between -1 and 1, to ensure the process is stationary.
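You can confirm this equivalence numerically: feeding the same innovations into the AR(1) recursion and into the MA(\(\infty\)) sum truncated at a large lag \(K\) reproduces essentially the same series, since the omitted weights \(\phi_1^k\) for \(k>K\) are negligible. A Python sketch (variable names are ours):

```python
import random

rng = random.Random(11)
phi, c, n, K = 0.6, 1.0, 200, 60  # K = truncation lag for the MA(inf) sum
e = [rng.gauss(0.0, 1.0) for _ in range(n)]
mu = c / (1 - phi)

# AR(1) recursion, started at the mean
y_ar = [mu]
for t in range(1, n):
    y_ar.append(c + phi * y_ar[-1] + e[t])

# truncated MA(inf) representation: y_t = mu + sum_{k=0}^{K} phi^k e_{t-k}
y_ma = [mu + sum(phi ** k * e[t - k] for k in range(min(K, t) + 1))
        for t in range(n)]

# after the first K occasions the two series agree up to the truncation error,
# which is of order phi^K and hence numerically negligible here
gap = max(abs(a - b) for a, b in zip(y_ar[K:], y_ma[K:]))
```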

You can also use this way of rewriting in the case of higher-order AR processes, although this is much more involved than the relatively simple case of an AR(1) process. The general result of this is

\[y_t = \frac{c}{1 -\phi_1 -\dots-\phi_p} + \sum_{k=1}^\infty \theta_k \epsilon_{t-k} + \epsilon_t,\] where the moving-average parameters \(\theta_k\) are functions of the autoregressive parameters \(\phi_1\) to \(\phi_p\) (for details see section 3.7 in Hamilton, 1994).

3.2 Rewriting an MA(\(q\)) model as an AR(\(\infty\)) model

Similarly, you can rewrite a finite order MA process as an infinite order AR process. This is based on rewriting the innovation at each occasion as a function of the observation at that occasion and past innovations. To see how this works, consider an MA(1) model, that is

\[ y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1}.\] If you subtract \(\mu + \theta_1 \epsilon_{t-1}\) from both sides, you get \(\epsilon_t = (y_t - \mu) - \theta_1 \epsilon_{t-1}\). You can use the same strategy to express the previous innovation, that is, \(\epsilon_{t-1} = (y_{t-1} - \mu) - \theta_1 \epsilon_{t-2}\). If you use the latter and plug this in the expression of the MA(1) given above, you get

\[ y_t = \mu + \epsilon_t + \theta_1 \{(y_{t-1}-\mu) - \theta_1 \epsilon_{t-2} \}\]

\[ = \mu + \epsilon_t + \theta_1 (y_{t-1}-\mu) - \theta_1^2 \epsilon_{t-2}.\]

Similarly, you can replace \(\epsilon_{t-2}\), which gives

\[y_t = \mu + \epsilon_t + \theta_1 (y_{t-1}-\mu) - \theta_1^2 \{(y_{t-2}-\mu) - \theta_1 \epsilon_{t-3}\}\] \[ = \mu + \epsilon_t + \theta_1 (y_{t-1}-\mu) - \theta_1^2 (y_{t-2}-\mu) + \theta_1^3 \epsilon_{t-3}.\] If you continue this procedure, you will get

\[ y_t = \mu - \sum_{k=1}^\infty (-\theta_1)^k (y_{t-k} - \mu) + \epsilon_t. \]

This shows that the MA(1) model you started with can be rewritten as an AR(\(\infty\)) process: It is based on an infinite weighted sum of past versions of the mean-centered observed variable \(y\). The autoregressive parameter at lag \(k\) is then \(\phi_k = -(-\theta_1)^k\), that is, \(+\theta_1\), \(-\theta_1^2\), \(+\theta_1^3\), and so on.

As a minor side note: To avoid this somewhat confusing alternation between positive and negative weights, the MA model is also often expressed as \(y_t= \mu + \epsilon_t - \vartheta_1 \epsilon_{t-1}\), where \(\vartheta_1 = - \theta_1\). Then, the weights in the AR(\(\infty\)) become \(\phi_k = -\vartheta_1^k\), which all carry the same sign when \(\vartheta_1\) is positive.
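This inversion can also be checked numerically: applying the truncated AR(\(\infty\)) weights \(\phi_k = -(-\theta_1)^k\) (that is, \(+\theta_1, -\theta_1^2, +\theta_1^3, \dots\), exactly as in the substitution steps above) to a simulated MA(1) series should recover the innovations. A Python sketch (variable names are ours):

```python
import random

rng = random.Random(5)
theta, mu, n, K = 0.5, 2.0, 400, 40  # K = truncation lag for the AR(inf) sum
e = [rng.gauss(0.0, 1.0) for _ in range(n)]

# MA(1): y_t = mu + e_t + theta * e_{t-1}  (with e_{-1} taken as 0)
y = [mu + e[t] + (theta * e[t - 1] if t > 0 else 0.0) for t in range(n)]

# AR(inf) weights phi_k = -(-theta)^k: +theta, -theta^2, +theta^3, ...
phi = [-((-theta) ** k) for k in range(K + 1)]  # phi[0] is unused

# the truncated AR(inf) prediction error should recover the innovations,
# up to a truncation error of order theta^K
resid = [y[t] - mu - sum(phi[k] * (y[t - k] - mu) for k in range(1, K + 1))
         for t in range(K, n)]
gap = max(abs(r - e_t) for r, e_t in zip(resid, e[K:]))
```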

For higher-order MA processes, the same strategy of replacing past innovations by their expressions in terms of the outcome can be employed. The general result is

\[y_t = \mu + \sum_{k=1}^\infty \phi_k (y_{t-k} - \mu) + \epsilon_t,\]

where the AR parameters \(\phi_k\) are functions of the MA parameters \(\theta_1\) to \(\theta_q\).

3.3 Rewriting a mixed ARMA model as a pure AR or MA model

It is also possible to rewrite a mixed ARMA(\(p,q\)) model as either an AR(\(\infty\)) model or an MA(\(\infty\)) model. The strategy to do so is similar to what has been shown above: You recursively replace past versions of the observed variable or of the innovations by their corresponding expressions of past observations and innovations.

In the time series literature this is typically explained through the use of the backshift operator, which is beyond the scope of the current article (Chatfield, 2004; Hamilton, 1994).

3.4 Summing independent processes together

Another interesting connection between members of the ARMA family was presented by Granger & Morris (1976). They derived general results that show what happens when you add two independent time series processes from the ARMA family, showing what ARMA(\(p,q\)) model results from this, and what values \(p\) and \(q\) take on.

Again, these derivations are based on the backshift operator, which is a commonly used tool in the time series literature (Chatfield, 2004; Hamilton, 1994). While understanding the backshift operator is beyond the scope of the current article, it is nevertheless good to know the results from Granger & Morris (1976), as they help you to appreciate the relations between various ARMA models, and they provide an alternative perspective on the connections between, and interpretation of, these models. The results from Granger & Morris (1976) can be summarized in the following four rules.

First, the sum of two independent AR processes results in a mixed ARMA process, that is

\[ AR(p_1) + AR(p_2) = ARMA(p_1 + p_2, max(p_1,p_2)), \] which means that:

  • the sum of an AR(1) process and white noise results in an ARMA(1,1) process

  • the sum of an AR(\(p\)) process and white noise results in an ARMA(\(p\),\(p\)) process

  • the sum of an AR(1) and an AR(1) process results in an ARMA(2,1) process
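The first of these bullets can be checked analytically: adding white noise to an AR(1) process leaves the autocovariances at lags \(k \ge 1\) untouched, but inflates the variance at lag 0. The sketch below (with arbitrary parameter values) shows that the autocorrelations of the sum then still decay geometrically at rate \(\phi\) from lag 1 onward, while the lag-1 autocorrelation itself is pulled below \(\phi\), which is exactly the ACF pattern of an ARMA(1,1) process:

```python
import numpy as np

# Hypothetical example: a latent AR(1) with phi = 0.8 plus independent
# white noise (e.g., measurement error); parameter values are arbitrary.
phi, var_x_innov, var_noise = 0.8, 1.0, 1.0

gamma0_x = var_x_innov / (1 - phi**2)   # variance of the AR(1) part
k = np.arange(0, 6)
gamma_y = phi**k * gamma0_x             # autocovariances of the sum...
gamma_y[0] += var_noise                 # ...white noise only adds at lag 0
rho = gamma_y / gamma_y[0]

# ARMA(1,1) signature: geometric decay at rate phi from lag 1 onward,
# but a lag-1 autocorrelation below phi itself
print(rho[1], rho[2] / rho[1], rho[3] / rho[2])
```

This also illustrates the substantive reading of an ARMA(1,1) model as a noisy AR(1): the MA part absorbs the occasion-specific noise.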

Second, the sum of two independent MA processes results in an MA process, that is

\[ MA(q_1) + MA(q_2) = MA(max(q_1,q_2)),\] which means that:

  • the sum of an MA(1) process and white noise results in an MA(1) process

  • the sum of an MA(\(q\)) process and white noise results in an MA(\(q\)) process

  • the sum of an MA(1) and an MA(1) process results in an MA(1) process
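The second rule can be verified in a short simulation: the sum of two independent MA(1) processes (with arbitrarily chosen parameters below) has a sample ACF that is clearly nonzero at lag 1 but drops to sampling error around zero from lag 2 onward, just like a single MA(1) process:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100_000
ea, eb = rng.standard_normal(T + 1), rng.standard_normal(T + 1)

# Two independent MA(1) processes with hypothetical theta values
xa = ea[1:] + 0.7 * ea[:-1]
xb = eb[1:] + 0.4 * eb[:-1]
y = xa + xb

def acf(series, k):
    """Sample autocorrelation at lag k."""
    c = series - series.mean()
    return np.sum(c[k:] * c[:len(c) - k]) / np.sum(c * c)

# Lag 1 is clearly nonzero; lags 2 and beyond hover around zero
print(round(acf(y, 1), 3), round(acf(y, 2), 3), round(acf(y, 3), 3))
```

The theoretical lag-1 autocorrelation here is \((0.7+0.4)/(1.49+1.16) \approx .42\), which the sample ACF should approximate closely for a series this long.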

Third, the sum of an AR and an MA process results in a mixed ARMA process, that is

\[ AR(p) + MA(q) = ARMA(p, p+q),\] which implies that:

  • the sum of an AR(1) and an MA(1) process results in an ARMA(1,2) process

  • the sum of an AR(2) and an MA(1) process results in an ARMA(2,3) process

Fourth, the sum of a mixed ARMA process and white noise (WN) results in a mixed ARMA process, where the order of the latter depends on whether the order of the AR part is larger or smaller than that of the MA part. Specifically, when \(p>q\), the result is

\[ ARMA(p,q) + WN = ARMA(p,p),\] when \(p<q\) the result is

\[ ARMA(p,q) + WN = ARMA(p,q),\] and when \(p=q\), both of the above yield the same result.

This implies that:

  • the sum of an ARMA(1,1) process and white noise results in an ARMA(1,1) process

  • the sum of an ARMA(2,1) process and white noise results in an ARMA(2,2) process

  • the sum of an ARMA(1,2) process and white noise results in an ARMA(1,2) process

What is important to keep in mind when thinking about these rules is that they show the result of summing two independent processes; the rules do not inform you about the reverse. That is, a mixed ARMA model may be the result of a mixed ARMA process plus white noise, the sum of a pure AR and a pure MA process, or the sum of two AR processes. It is also possible that the actual data-generating mechanism is just a single mixed ARMA process, rather than a sum of independent processes. Moreover, there are specific combinations of parameter values of a mixed ARMA process that cannot result from summing two independent processes (see Granger & Morris, 1976 for details).

3.5 Conclusion

The connections discussed in this section show that the distinction between pure AR, pure MA, and mixed ARMA models is not as strict as it may have seemed based on Box & Jenkins (1970). Hence, when deciding which model(s) to consider, you may want to look beyond the patterns that can be observed in the ACF and PACF, and also consider substantive reasons. Additionally, you may want to consider how well a model can predict the next (few) occasion(s). You may also decide to treat model parsimony, in terms of the number of parameters that need to be estimated freely, as an important criterion in selecting the “best” model. Yet, what is “best” will depend on your research goal.

4 Estimation of ARMA models

There are various ways to estimate the parameters of an ARMA model. For a pure AR model, you can make use of the Yule-Walker equations, which will give you estimates that are in general very close to the maximum likelihood estimates (Box & Jenkins, 1970; Hamaker et al., 2002). These equations are based on analytical expressions that link the autocorrelations to the parameters of the model. Similarly, there are expressions that link the autocorrelations to the parameters of an MA model (Box & Jenkins, 1970); however, the resulting estimates do not have the same desirable properties as maximum likelihood estimates. Such autocorrelation-based estimators are also referred to as method-of-moments estimators. These parameter estimates for AR, MA and ARMA models can also be obtained through a technique known as the Toeplitz method in structural equation modeling (see Hamaker et al., 2002).
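To give a flavor of this method-of-moments idea, the sketch below (with arbitrary parameter values) simulates a long AR(2) series and then plugs the sample autocorrelations into the Yule-Walker equations \(\rho_1 = \phi_1 + \phi_2\rho_1\) and \(\rho_2 = \phi_1\rho_1 + \phi_2\), which form a small linear system in \(\phi_1\) and \(\phi_2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary AR(2) process with hypothetical parameters
phi_true = np.array([0.5, 0.3])
T = 20_000
y = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    y[t] = phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + eps[t]

def acf(series, k):
    """Sample autocorrelation at lag k."""
    c = series - series.mean()
    return np.sum(c[k:] * c[:len(c) - k]) / np.sum(c * c)

r = np.array([acf(y, k) for k in range(3)])   # r[0] == 1
R = np.array([[r[0], r[1]],                   # Toeplitz matrix of
              [r[1], r[0]]])                  # autocorrelations
phi_hat = np.linalg.solve(R, r[1:])           # Yule-Walker estimates

print(phi_hat)   # close to [0.5, 0.3] for a long series
```

For a pure AR(\(p\)) model the system generalizes to \(p\) equations in \(p\) unknowns; for MA and mixed models the analogous moment equations are nonlinear in the parameters, which is one reason likelihood-based methods are preferred there.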

While these moment estimates can be used by themselves, Box & Jenkins (1970) used them as initial parameter estimates in their forward-backward filtering procedure to obtain maximum likelihood estimates. Nowadays, maximum likelihood estimation of ARMA parameters is more typically done using a Kalman filter approach. An important advantage of this algorithm is that it can also handle missing data, which makes it widely applicable in the context of time series data; in addition, it provides standard errors and the log likelihood, which can be used for inference and model comparison.
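The following is a minimal, illustrative sketch (not the full Box-Jenkins or state-space machinery) of how the Kalman filter yields the log likelihood of a latent AR(1) process observed with measurement error, which, per the rules above, is observationally an ARMA(1,1). Missing observations are handled by simply skipping the update step, so prediction uncertainty accumulates across the gap. All parameter values are arbitrary:

```python
import numpy as np

def kalman_loglik(y, phi, q, r):
    """Gaussian log likelihood of a latent AR(1) plus measurement noise,
    computed with the Kalman filter; np.nan marks a missing observation."""
    a, P = 0.0, q / (1 - phi**2)        # stationary initial state
    ll = 0.0
    for obs in y:
        a, P = phi * a, phi**2 * P + q  # predict step
        if np.isnan(obs):
            continue                    # missing: skip the update step
        F = P + r                       # prediction-error variance
        v = obs - a                     # one-step-ahead prediction error
        ll += -0.5 * (np.log(2 * np.pi * F) + v**2 / F)
        K = P / F                       # Kalman gain
        a, P = a + K * v, (1 - K) * P   # update step
    return ll

# Simulate data with roughly 10% missingness (arbitrary parameter values)
rng = np.random.default_rng(1)
T, phi_true, q, r = 500, 0.7, 1.0, 0.5
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=np.sqrt(q))
y = x + rng.normal(scale=np.sqrt(r), size=T)
y[rng.random(T) < 0.1] = np.nan

# Profile the log likelihood over phi (q and r held at their true values)
grid = np.arange(0.05, 0.95, 0.05)
phi_hat = grid[np.argmax([kalman_loglik(y, p, q, r) for p in grid])]
print(phi_hat)  # the maximum should lie close to the true value of 0.7
```

In practice you would maximize the log likelihood over all parameters jointly with a numerical optimizer, as general-purpose time series software does, rather than profile over a grid.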

5 Think more about

ARMA processes are by definition [stationary]. This means that their mean is invariant over time (i.e., there is no trend upward or downward), but also that the variance remains the same over time. Furthermore, the autocorrelations should be invariant over time, and when the data are non-normal, the skewness and kurtosis should also be invariant over time (when the data are normal, these are by definition fixed to 0 and 3, respectively). There are many ways in which a process may not be stationary; for instance, there may be a [trend] over time, or there may be a unit root. The latter case is covered by a well-known extension of the ARMA family, known as the integrated ARMA or [ARIMA model].

The ARMA model can be extended in multiple ways. For instance, when you have daily diary data of a person’s mood, you can consider including an exogenous variable like the weather, or the day of the week, to account for particular fluctuations and patterns in the data. You may also consider including time (and perhaps also time squared), to account for a long-term trend over time. Another extension is based on having a multivariate version, known as a vector autoregressive moving-average (VARMA) model.

You may also want to consider latent versions of an ARMA model. This may be of interest if you have multiple variables that you consider to be indicators of a latent variable. But even if you have a single observed variable, you may want to separate measurement error from an underlying latent construct, and have an ARMA model for the latter. These kind of models can also be achieved through the use of the Kalman filter, [dynamic structural equation modeling (DSEM)], or other flexible modeling frameworks that allow you to specify latent time series models.

6 Takeaway

Autoregressive moving-average (ARMA) models offer a flexible framework for analyzing N=1 time series data by combining autoregressive (AR) carry-over effects with moving-average (MA) delayed shock effects. While AR(1) and MA(1) models capture specific temporal patterns, ARMA(\(p,q\)) models can represent a much broader range of processes and can even be rewritten as infinite-order AR or MA models. Deciding on the order of an ARMA model can be guided by theory, by diagnostic tools like the ACF and PACF (with caution for shorter series), or by modern model comparison techniques. Ultimately, the “best” ARMA model depends on your [research goal], whether that is understanding the process, achieving accurate prediction, or maintaining parsimony.

7 Further reading

We have collected various topics for you to read more about below.

Read more: Related univariate models
Read more: Related univariate models with time-varying parameters
  • [Threshold autoregressive (TAR) models]
  • [Time-varying autoregressive (TV-AR) models]
  • [Regime-switching models]
Read more: Related multivariate models
  • [Vector autoregressive (VAR) models]
  • [Vector autoregressive moving average (VARMA) models]
  • [Latent VAR models]
Read more: When N>1
  • [Multilevel AR models]
  • [Dynamic structural equation modeling]
  • [Replicated time series analysis]
Read more: Estimation of N=1 models
  • [Kalman filter and state-space modeling]
  • [Stationarity test]
  • [Dynamic structural equation modeling]
External resources: The substantive meaning of a moving-average model
  • The meaning of Slutzky: In this article by Joe Mahon and Phil Davies you can read more about the background of the development of the moving-average model, and what substantive interpretation you can attach to it from a theoretical point of view.

  • Online book plus videos by Rob J. Hyndman and George Athanasopoulos: Extensive explanation of the theory and practice of ARMA modeling and other time series techniques for forecasting using the R-package fpp3. See section 9.7 for an explanation of the R-function ARIMA that can be used to search for the orders of \(p\) and \(q\) in an ARMA(\(p,q\)) model.

References

Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-Day.
Chatfield, C. (2004). The analysis of time series: An introduction (6th ed.). Chapman & Hall/CRC. https://doi.org/10.1007/978-1-4899-2923-5
Granger, C. W. J., & Morris, M. J. (1976). Time series modelling and interpretation. Journal of the Royal Statistical Society, 139, 246–257. https://doi.org/10.2307/2345178
Hamaker, E. L., Dolan, C. V., & Molenaar, P. C. M. (2002). On the nature of SEM estimates of ARMA parameters. Structural Equation Modeling, 9, 347–368. https://doi.org/10.1207/S15328007SEM0903_3
Hamilton, J. D. (1994). Time series analysis. Princeton University Press. https://doi.org/10.2307/j.ctv14jx6sm
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. OTexts.com/fpp3. Accessed on: October 1, 2024.
Slutzky, E. (1927). The summation of random causes as the source of cyclic processes. Problems of Economic Conditions (Moscow: Conjuncture Institute), 3.
Slutzky, E. (1937). The summation of random causes as the source of cyclic processes. Econometrica, 5, 105–146. https://doi.org/10.2307/1907241
Yule, G. U. (1927). On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 226, 267–298.

Citation

BibTeX citation:
@article{hamaker2025,
  author = {Hamaker, Ellen L. and Berkhout, Sophie W.},
  title = {Autoregressive Moving-Average Models},
  journal = {MATILDA},
  number = {2025-08-28},
  date = {2025-08-28},
  url = {https://matilda.fss.uu.nl/articles/arma-model.html},
  langid = {en}
}
For attribution, please cite this work as:
Hamaker, E. L., & Berkhout, S. W. (2025). Autoregressive moving-average models. MATILDA, 2025-08-28. https://matilda.fss.uu.nl/articles/arma-model.html