This article introduces a new R package (*pspatreg*) for the
estimation of semiparametric spatial autoregressive models.
*pspatreg* fits penalized spline semiparametric spatial
autoregressive models via Restricted Maximum Likelihood or Maximum
Likelihood. These models are very flexible as they make it possible to
simultaneously control for spatial dependence, nonlinearities in the
functional form, and spatio-temporal heterogeneity. The package also
allows to estimate parametric spatial autoregressive models for both
cross-sectional and panel data (with fixed effects), thus avoiding the
use of different libraries. The official demos, vignettes, and tutorials
of the package are distributed either in CRAN or GitHub. This article
illustrates the potentials of the package by applying it to
cross-sectional data.

**Key words**: R package, Spatial dependence,
Semiparametric models, Splines

Modeling spatial and spatio-temporal data requires flexible econometric tools that allow us to control spatial and temporal dependence, spatial heterogeneity, non-linearities, and other possible model specification biases. When combined with standard parametric spatial econometric approaches, semiparametric regression models can provide an answer to this demand for flexibility. New computational methods developed within most modern statistical software (such as R) allow us to overcome all technical problems that arise in this process.

Several packages have recently been proposed to perform spatial
econometrics in R (see Bivand, Millo, and Piras
(2021) for a recent survey). Focusing on packages and methods
dealing with polygonal (or areal) spatial data, the first package was
*spdep* (Bivand, Pebesma, and Gomez-Rubio
2013; Bivand 2022). It was primarily designed for cross-sectional
spatial data and to model spatial dependence through the Maximum
Likelihood (ML) or the Generalized Method of Moments (GMM) estimation of
the spatial lag model (SAR), the spatial error model (SEM), the spatial
Durbin model (SDM), and the SARAR model. The estimation functions from
*spdep* have recently been moved to the package
*spatialreg* (Bivand, Millo, and Piras
2021). Other spatial econometric models for cross-sectional data
have been implemented in other packages: *sphet* (Piras 2010) for estimating and testing spatial
models with heteroskedastic innovations, *spfilteR* (Juhl 2021) for filtering out spatial dependence
in linear models, *spgwr* (Bivand and Yu
2022) for estimating geographycally weighted regression models,
and *spsur* (Lopez, Minguez, and Mur
2020) for estimating seemingly unrelated regression equations.
Moreover, following several theoretical contributions to the literature
on the estimation of static and dynamic spatial panel data models (see
Elhorst (2014)), other R packages for
spatial econometric analysis have recently been developed. In
particular, *splm* (Millo and Piras
2012) and *SDPDm* implement estimation methods for static
and dynamic spatial panel data. All of these R packages focus on
parametric methods (except *spgwr*, of course), leaving aside
issues related to non-linearities in functional form and the estimation
of spatio-temporal trends.

The main focus of this article is *pspatreg*, a new R package
for spatial econometric analysis. *pspatreg* fits penalized
spline (PS) semiparametric spatial autoregressive models via Restricted
Maximum Likelihood (REML) and ML. This approach combines penalized
regression spline methods^{1} (Eilers, Marx, and
Durbán 2015) with standard spatial autoregressive models (such as
SAR, SEM and SDM). These types of models (PS-SAR, PS-SEM and PS-SDM) are
thoroughly discussed in Mı́nguez, Basile, and
Durbán (2020) (see also Montero, Mı́nguez,
and Durbán (2012); Basile et al.
(2014); Hoshino (2018)).

These models are very flexible as they make it possible to include
within the same specification: *i*) spatial autoregressive terms
(i.e. spatial lags of dependent and independent variables as well as
spatial error terms) to capture spatial interaction or network effects;
*ii*) time lags of the dependent variable to capture persistence
effects; *iii*) parametric and nonparametric (smooth) terms to
identify nonlinear relationships between the response variable and the
covariates; *iv*) spatial and spatio-temporal trends, i.e. a
smooth interaction between the spatial coordinates and the time trend,
to capture site-specific nonlinear time trends.

The proposed method also allows the user to apply an ANOVA decomposition of the spatial or spatio-temporal trend into several components (spatial and temporal main effects, and second- and third-order interactions between them). This gives further insights into the dynamics of the data. Thus, we use the acronym PS-ANOVA-SAR (SEM, SDM, SLX) for the newly proposed data generating process (DGP). The use of nested B-spline bases for the interaction components of the spatio-temporal trend (Lee, Durban, and Eilers 2013) contributes to the efficiency of the fitting procedure without compromising the goodness of fit of the model. Finally, we also consider an extension of the PS-ANOVA-SAR (SEM, SDM, SLX), including a first-order time series autoregressive term process (AR1) in the noise to accommodate residual serial correlation. Further extensions to include the time lag of the dependent variable (dynamic spatial model) will be considered in the future.

The next section (Section 2) describes the availability of
*pspatreg* with documentation and examples. Section 3 presents a
general specification of the semiparametric spatial autoregressive
model. Section 4 provides basic information on *pspatreg*.
Section 5 shows an example of using *pspatreg* with
cross-sectional spatial data. The last section presents a
conclusion.

The *pspatreg* package is available on both CRAN (https://cran.r-project.org/web/packages/pspatreg/index.html)
and GitHub (https://github.com/rominsal/pspatreg)
and can be installed in the usual way^{2}.

Once the package has been installed and loaded, an overview of the
functionality of the package, including main functions, methods and
databases, can be obtained executing the command
` ?pspatreg`.

The package includes three vignettes. The first one provides a brief
description of the methodology used in the package. The second vignette
gives a detailed example of modeling pure spatial data with
semiparametric models and spatial lags using the well-known
*Ames* database included in package *AmesHousing* (Kuhn 2020). It also compares the results of
*pspatreg* with the *spatialreg* package for parametric
spatial regression models. Lastly, the third vignette provides some
insights into spatio-temporal modeling using a panel database of
unemployment in Italian provinces. First, this vignette compares the
results of spatio-temporal parametric panels with the *splm*
package, and then it shows the results of semiparametric spatio-temporal
models. Plots of spatio-temporal trends are also included in these
examples.

Of course, every function in the package includes reproducible
examples. Those included in ` pspatfit()`,

Let \(y_{it}\) be a sample of spatial panel data, where \(i\) is an index for the cross-sectional dimension (spatial units), with \(i = 1, \ldots, N\), and \(t\) is an index for the time dimension (time periods), with \(t = 1, \ldots, T\). The general model proposed is written as:

\[ y_{it}=\rho \sum_{j=1}^N w_{ij,N} y_{jt} + \tilde{f}(s_{1i},s_{2i},\tau_t) + \sum_{\delta=1}^kg_\delta(x_{\delta_{it}}) + \epsilon_{it},\]

where \((s_{1i},s_{2i})\) are the spatial coordinates (latitude and longitude) of individual \(i\) (when \(i\) refers to areal units: municipality, provinces, etc., the standard convention here is to identify representative points for areal units, the most typical being areal centroids), \(\tau_t\) is the time period, and \(x_{\delta_{it}}\) are independent variables; \(w_{ij}\) are the spatial weights, and \(\rho\) the spatial autoregressive parameter. The functions \(g_\delta(.)\) are parametric or non-parametric smooth functions of the covariates \(x_{\delta_{it}}\) (they can be linear, or can accommodate varying coefficient terms, smooth interaction between covariates, smooth by-factor curves, and so on), and \(\tilde{f}(s_{1i},s_{2i},\tau_t)\) is an unknown non-parametric spatio-temporal trend. The idiosyncratic error term is assumed to follow an AR(1) process, i.e., \(\epsilon_{it}=\phi \epsilon_{it-1}+u_{it}\) with \(u_{it}\sim N(0,\sigma^2)\).

This semiparametric SAR model turns out to be extremely useful to capture interactive spatial and temporal unobserved heterogeneity when this heterogeneity is smoothly distributed over space and time (Mı́nguez, Basile, and Durbán 2020). The dynamic extension (including \(y_{it-1}\) and \(\sum_{j=1}^N w_{ij,N} y_{it-1}\)) is also very promising and merits further theoretical investigation. Finally, the following semiparametric SAR model is very useful for modeling cross-setional spatial data taking into account non-linearities, spatial dependence, and spatial heterogeneity:

\[y_{i}=\rho \sum_{j=1}^N w_{ij,N} y_{j} + \sum_{\delta=1}^\Delta g_\delta(x_{\delta, i}) + \tilde{f}(s_{1i},s_{2i})+\epsilon_{i}\]

\[\epsilon_{i}\sim i.i.d.(0,\sigma^2_\epsilon).\]

In many situations, the spatial or the spatio-temporal trend to be estimated can be complex, and the use of a single multidimensional smooth function may not be flexible enough to capture the structure in the data. To solve this problem, an ANOVA-type decomposition of \(\tilde{ f}(s_{1i},s_{2i},\tau_t)\) can be used, where spatial and temporal main effects, and second- and third-order interactions between them can be identified:

\[\begin{eqnarray} \nonumber \tilde{ f}(s_{1i},s_{2i},\tau_t) & = & f_1(s_{1i})+f_2(s_{2i})+f_{\tau}(\tau_t)+ f_{1,2}(s_{1i},s_{2i})+\\ \nonumber & & f_{1,\tau}(s_{1i},\tau_t)+f_{2,\tau}(s_{2i},\tau_t)+f_{1,2,\tau}(s_{1i},s_{2i},\tau_t) \end{eqnarray}\]

First, the geoadditive terms given by \(f_1(s_{1i}),f_2(s_{2i}),f_{1,2}(s_{1i},s_{2i})\) work as control functions to filter the spatial trend out of the residuals, and transfer it to the mean response in a model specification. Thus, they make it possible to capture the shape of the spatial distribution of \(y_{it}\), conditional on the determinants included in the model. These control functions also isolate stochastic spatial dependence in the residuals, that is, spatially autocorrelated unobserved heterogeneity. Thus, the geoadditive terms can be regarded as an alternative to the use of individual regional dummies to capture unobserved heterogeneity, as long as such heterogeneity is smoothly distributed over space. Regional dummies peak at significantly higher and lower levels of the mean response variable. If these peaks are smoothly distributed over a two-dimensional surface (i.e., if unobserved heterogeneity is spatially autocorrelated), the smooth spatial trend is able to capture them. It is also worth noticing that, in a cross-sectional setting, the inclusion of a smooth spatial trend in the model specification is often the best way to control for unobserved spatial heterogeneity in the absence of degrees of freedom for the introduction of spatial fixed effects.

Second, the smooth time trend, \(f_{\tau}(\tau_t)\), and the smooth interactions between space and time – \(f_{1,\tau}(s_{1i},\tau_t)\), \(f_{2,\tau}, (s_{2i},\tau_t)\), \(f_{1,2,\tau}(s_{1i}, s_{2i},\tau_t)\) – work as control functions to capture the heterogeneous effect of common shocks. Thus, conditional on a smooth distribution of the spatio-temporal heterogeneity, the PS-ANOVA-SAR (SDM, SEM, SLX) model works as an alternative to the models proposed by Bai and Li (2013), Shi and Lee (2018), Pesaran and Tosetti (2011), Bailey, Holly, and Pesaran (2016) and Vega and Elhorst (2016) which are extensions of common factor models to accommodate both strong cross-sectional dependence (through the estimation of the spatio-temporal trend) and weak cross-sectional dependence (through the estimation of spatial autoregressive parameters).

Furthermore, this framework is also flexible enough to control for the linear and non-linear functional relationships between the dependent variable and the covariates, as well as the heterogeneous effects of these regressors across space. The model inherits all the positive properties of penalized regression splines, such as coping with missing observations by appropriately weighting them and straightforward interpolation of the smooth functions.

In the case of a semiparametric model without the spatial lag of the dependent variable (PS model), if all regressors are independent of the errors, \(\hat{g}_\delta(x_{\delta, it})\) can be interpreted as the conditional expectation of \(y\) given \(x_{\delta}\) (net of the effect of the other regressors). Blundell and Powell (2003) use the term Average Structural Function (ASF) with reference to these functions. In contrast, in PS-SAR, PS-SDM or in PS-SARAR model, when \(\rho\) is different from zero, the estimated smooth functions cannot be interpreted as ASF. Taking advantage of the results obtained for parametric SAR, we can compute the total smooth effect (total–ASF) of \(x_{\delta}\) as:

\[\hat{g}_{\delta}^{T}\left(x_{\delta}\right)=\Sigma_{q} \left[\textbf{I}_{n}-\hat{\rho}\textbf{W}_{n}\right]^{-1}_{ij} b_{\delta q}(x_{\delta})\hat{\beta}_{\delta q},\]

where \(b_{\delta q}(x_{\delta})\) are the B-spline basis functions used to represent the smooth function, and \(\hat{\beta}_{\delta q}\) the corresponding estimated parameters.

We can also compute direct and indirect (or spillover) effects of smooth terms in the PS-SAR case as:

\[\hat{g}_{\delta}^{D}\left(x_{\delta}\right)=\Sigma_{q} \left[\textbf{I}_{n}-\hat{\rho}\textbf{W}_{n}\right]^{-1}_{ii} b_{\delta q}(x_{k})\hat{\beta}_{\delta q}\]

\[\hat{g}_{\delta}^{I}\left(x_{\delta}\right)=\hat{g}_{\delta}^{T}\left(x_{\delta}\right)-\hat{g}_{\delta}^{D}\left(x_{\delta}\right).\]

Similar expressions can be provided for the direct, indirect, and total effects of the PS-SDM.

We are now going to introduce some basic general information about
the package. The main function in the *pspatreg* package is
` pspatfit()`, which estimates spatio-temporal pernalized
spline spatial regression models using either the REML method or the ML
method. In its generic form,

```
pspatfit(formula, data, na.action, listw = NULL, type = "sim", method = "eigen", Durbin = NULL,
zero.policy = NULL, interval = NULL,trs = NULL, cor = "none", dynamic = FALSE, control = list())
```

The function ` pspatfit()` returns a list of objects of
class

The argument ` formula` within the function

```
formula <- y ~ x1 + x2 + pspl(x3, nknots = 15) + pspl(x4, nknots = 20) +
pspt(long, lat, year, nknots = c(18,18,8), psanova = TRUE,
nest\_sp1 = c(1, 2, 3),
nest\_sp2 = c(1, 2, 3),
nest\_time = c(1, 2, 2))
```

In the example above, the model includes two parametric terms, two
non-parametric terms, and a spatio-temporal trend (with long and lat as
spatial coordinates and year as temporal coordinate). The dimension of
the basis function, both in ` pspl(.)` and

In this example we also adopt an ANOVA decomposition of the
spatio-temporal trend (choosing ` psanova = TRUE`). Each
effect has its own degree of smoothing which allows a greater
flexibility for the spatio-temporal trend. Calculating up to third-order
interactions can be computationally expensive. We can select subgroups
of interaction effects for the second- and third-order effects to
address this problem. We use three parameters available in

We must set the parameters ` f1_main`,

Using the argument ` Type`, we can choose different
spatial model specifications:

The argument ` data` must contain all the variables
included in parametric and non-parametric terms of the model. If a

Both ` data.frame` and

Plotting the estimated non-parametric smooth terms represents an
important step in semiparametric regression analyses. First, the
function ` fit_terms()` computes estimated non-parametric
smooth terms. Then, the functions

The function ` impactspar()` computes direct, indirect,
and total impacts for continuous parametric covariates using the
standard procedure for their computation (LeSage
and Pace 2009).

The function ` impactsnopar()` computes direct,
indirect, and total impacts functions for continuous non-parametric
covariates, while the function

Here, we present the use of *pspatreg* for spatial
cross-sectional data (no time dimension involved). In particular, we use
Italian province-level data for the estimation of the relationship
between labor productivity growth and net internal migration. The
standard neoclassical growth model can be specified, in its linear form,
as follows:

\[\gamma_{i}=\alpha+\beta\ln y_{i,0}+\delta m_{i}+\tau \ln (n_{i})+X^{'}_{i}\psi+\epsilon_{i},\] where \(\gamma_{i}=(\ln y_{i,T}-\ln y_{i,0})/T\) is the average annual growth rate of labor productivity (measured as gross value added per worker) computed over \(T\) periods (our sample period goes from 2002 to 2018) for each province \(i\) (107 Italian provinces), \(\ln y_{i,0}\) captures the initial conditions of each province (a negative value of \(\beta\) indicates conditional convergence), \(m_{i}\) is the average annual provincial internal net migration rate (computed as the difference between internal immigration and emigration flows of the working-age population, i.e. people aged 15-65, divided by the total working-age population), \(\ln(n_{i})\) is the average employment growth rate (the neoclassical growth model suggests a negative value of \(\tau\)), \(X_{i}\) is a vector of variables controlling for other growth determinants such as physical and human investment rates, and \(\epsilon_{i}\) is an identically and independently distributed error term.

Net population movements generally tend to be oriented towards prosperous areas which offer higher real income prospects. This is also true for the Italian case (see Figure 1), where all Southern provinces have negative net migration rates and all Northern provinces have positive rates.

According to the standard neoclassical framework, this pattern of migration should represent a mechanism for reducing spatial economic differentials. Labor migration from poor to rich areas lowers capital intensity (increases the return to capital) in the destination region and increases capital intensity (lowers the return to capital) in the region of origin. When the same technologies are used everywhere, migration speeds up per worker inter-regional convergence in capital intensity and labor productivity. Therefore, the neoclassical framework predicts a negative value of \(\delta\) (i.e. net inward migration reduces labor productivity growth). However, alternative theories point to the importance of migrants’ characteristics such as youthfulness, entrepreneurship, and skills that, together with their impact on aggregate demand, may have growth-enhancing effects. In terms of aggregate demand, regions losing population through migration may face economic contraction, whereas regions gaining population through migration may benefit from an expansionary effect on output, employment, and income. The transfer of human capital from one place to another is another critical aspect. In particular, skill-selective mobility may have deep effects on origin and destination places. All these alternative contributions predict a positive effect of net migration on growth (i.e. a positive value of \(\delta\)). Moreover, the presence of a significantly positive effect of net migration is expected to decrease the estimate of \(\beta\), the parameter associated to the initial conditions (i.e. it is expected to remove the positive omitted variable bias in estimates of \(\beta\) in regressions without the migration variable). Our empirical analysis confirms this intuition. Using our dataset and estimating the model with simple OLS, we actually find a positive effect of net migration on labor productivity growth, in line with several empirical studies:

`library(pspatreg)`

```
formlin_0 <- growth_PROD ~ lnPROD_0 + lnoccgr
linear_0 <- lm(formlin_0, data = prod_it)
summary(linear_0, vcov = function(x) vcovHC(x, type="HC1"))
```

```
##
## Call:
## lm(formula = formlin_0, data = prod_it)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0085578 -0.0022181 0.0001756 0.0020764 0.0082702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.036413 0.033147 1.099 0.2745
## lnPROD_0 -0.001892 0.003171 -0.596 0.5522
## lnoccgr -0.153926 0.077535 -1.985 0.0498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003261 on 104 degrees of freedom
## Multiple R-squared: 0.0847, Adjusted R-squared: 0.0671
## F-statistic: 4.812 on 2 and 104 DF, p-value: 0.01003
```

```
beta_conv_0 <- as.numeric(-log(linear_0$coefficients[2]*16+1)/16)
beta_conv_0
```

`## [1] 0.00192075`

```
formlin <- growth_PROD ~ lnPROD_0 + lnoccgr + net
linear <- lm(formlin, data = prod_it)
summary(linear, vcov = function(x) vcovHC(x, type="HC1"))
```

```
##
## Call:
## lm(formula = formlin, data = prod_it)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0085734 -0.0019501 -0.0000671 0.0021081 0.0089063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.131008 0.039651 3.304 0.001312 **
## lnPROD_0 -0.010650 0.003747 -2.842 0.005402 **
## lnoccgr -0.153775 0.072837 -2.111 0.037173 *
## net 0.107752 0.027962 3.854 0.000203 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003064 on 103 degrees of freedom
## Multiple R-squared: 0.2, Adjusted R-squared: 0.1767
## F-statistic: 8.585 on 3 and 103 DF, p-value: 3.854e-05
```

```
beta_conv <- as.numeric(-log(linear$coefficients[2]*16+1)/16)
beta_conv
```

`## [1] 0.01167548`

The results indicate a positive correlation between the growth rate of labor productivity and the net migration rate of working-age population. Nevertheless, this linear specification of the model is characterized by a number of potential mis-specification biases. First, there can be a reverse causality problem between migration and productivity growth, so that the net migration variable should be instrumented. A second source of endogeneity could be the presence of omitted variables (or unobserved heterogeneity) correlated with the observed covariates. Indeed, we do not control for human and physical capital accumulation rates in the estimation above, due to the lack of information on these variables at the province level in Italy. Additionally, we cannot exclude a correlation between these omitted terms and the coviarates introduced in the model. Third, substantive spatial dependence effects can emerge due to the network structure of Italian provinces, which are strongly connected via trade or other kinds of links. A wrong functional form (due to non-linearities) can represent a further source of model mis-specification. For the sake of simplicity, we disregard the reverse causality issue and focus on the other sources of bias (unobserved heterogeneity, spatial dependence, and nonlinearities) in what follows. In particular, we show that controlling for unobserved heterogeneity is a fundamental challenge in cross-sectional analysis (where we cannot include spatial fixed effects). Moreover, we should also consider that spatial dependence may simply be the consequence of (spatially correlated) omitted variables rather than being the result of spillovers. If this is the case, there are no compelling reasons for using traditional parametric models, like the SAR or SEM. As McMillen (2012) shows, a simple semiparametric model, with a smooth interaction between latitude and longitude (the so-called Geoadditive Model), can remove unobserved heterogeneity.

Following a step-by-step procedure, we first extend the linear
classical model by including a spatial autoregressive term, i.e. by
estimating a SAR model^{5}:

\[\gamma_{i}=\alpha+\rho \sum_{j=1}^N
w_{ij,N} \gamma_{j} +\beta\ln y_{i,0}+\delta m_{i}+\tau \ln
(n_{i})+\epsilon_{i}.\] We estimate this model using the function
` pspatfit()` of the package

`linsar <- pspatfit(formlin, data = prod_it, listw = lwsp_it, method = "eigen", type = "sar")`

```
##
## Fitting Model...
```

```
##
## Time to fit the model: 0.28 seconds
```

`summary(linsar)`

```
##
## Call
## pspatfit(formula = formlin, data = prod_it, listw = lwsp_it,
## type = "sar", method = "eigen")
##
## Parametric Terms
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1310379 0.0368036 3.5605 0.0005640 ***
## lnPROD_0 -0.0111188 0.0034779 -3.1970 0.0018495 **
## lnoccgr -0.1367972 0.0676062 -2.0234 0.0456430 *
## net 0.1087253 0.0259539 4.1892 5.961e-05 ***
## rho 0.3654309 0.0977673 3.7378 0.0003065 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Goodness-of-Fit
##
## EDF Total: 5
## Sigma: 0.00310498
## AIC: -1138.08
## BIC: -1124.72
```

```
imp_parvar_sar <- impactspar(linsar, list_varpar)
summary(imp_parvar_sar)
```

```
##
## Total Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0177186 0.0062356 -2.8415110 0.0045
## lnoccgr -0.2240677 0.1158853 -1.9335306 0.0532
## net 0.1748153 0.0513174 3.4065513 0.0007
##
## Direct Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0115544 0.0036091 -3.2014537 0.0014
## lnoccgr -0.1458061 0.0698432 -2.0876198 0.0368
## net 0.1138760 0.0275382 4.1352029 0.0000
##
## Indirect Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0061642 0.0032349 -1.9055239 0.0567
## lnoccgr -0.0782617 0.0533515 -1.4669061 0.1424
## net 0.0609392 0.0297445 2.0487588 0.0405
```

As already mentioned, McMillen (2012) and McMillen (2003) stress the importance of considering whether apparent spatial dependence is in fact engendered by model mis-specifications, such as the erroneous inclusion or omission of covariates and the inappropriate functional form of included covariates. Therefore, we extend the SAR model by first including a smooth spatial trend (thus estimating a semiparametric geoadditive SAR model):

\[\gamma_{i}=\alpha+\rho \sum_{j=1}^N w_{ij,N} \gamma_{j} +\beta\ln y_{i,0}+\delta m_{i}+\tau \ln (n_{i})+ \tilde{ f}(s_{1i},s_{2i})+\epsilon_{i}.\]

We use the function ` pspt()` with 10 knots for each
each variable (latitude and longitude of the centroid) to estimate the
spatial trend. A model with a smooth spatial trend can also be estimated
in R using alternative packages, such as

```
formgeo <- growth_PROD ~ lnPROD_0 + lnoccgr + net + pspt(longitude, latitude, nknots = c(10, 10), psanova = FALSE)
geosar <- pspatfit(formgeo, data = prod_it, listw = lwsp_it, method = "eigen", type = "sar")
```

```
##
## Fitting Model...
```

```
##
## Time to fit the model: 3.93 seconds
```

`summary(geosar)`

```
##
## Call
## pspatfit(formula = formgeo, data = prod_it, listw = lwsp_it,
## type = "sar", method = "eigen")
##
## Parametric Terms
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2169096 0.0473358 4.5824 1.430e-05 ***
## lnPROD_0 -0.0191931 0.0045088 -4.2568 4.958e-05 ***
## lnoccgr -0.0451729 0.0761514 -0.5932 0.55449
## net 0.0727907 0.0415870 1.7503 0.08337 .
## Xspt.2 -0.0116569 0.0155548 -0.7494 0.45551
## Xspt.3 0.0119923 0.0174543 0.6871 0.49375
## Xspt.4 -0.0149656 0.0204262 -0.7327 0.46561
## rho 0.2017864 0.1127590 1.7895 0.07679 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Non-Parametric Spatio-Temporal Trend
## EDF
## f(sp1, sp2) 6.228
##
## Goodness-of-Fit
##
## EDF Total: 14.2276
## Sigma: 0.00271168
## AIC: -1122.71
## BIC: -1084.69
```

```
list_varpar <- as.character(names(summary(geosar)$bfixed)[2:4])
eff_parvar <- impactspar(geosar, list_varpar)
summary(eff_parvar)
```

```
##
## Total Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0245228 0.0067077 -3.6559304 0.0003
## lnoccgr -0.0586806 0.1019369 -0.5756557 0.5648
## net 0.0941898 0.0551518 1.7078286 0.0877
##
## Direct Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0194804 0.0045614 -4.2707413 0.0000
## lnoccgr -0.0462291 0.0793750 -0.5824142 0.5603
## net 0.0749022 0.0417346 1.7947262 0.0727
##
## Indirect Parametric Impacts (sar)
## Estimate Std. Error t value Pr(>|t|)
## lnPROD_0 -0.0050424 0.0035509 -1.4200413 0.1556
## lnoccgr -0.0124514 0.0259634 -0.4795770 0.6315
## net 0.0192876 0.0184234 1.0469086 0.2951
```

We can plot the estimated spatial trend using the function
` plot_sp2d`.

`plot_sp2d(geosar, data=prod_it)`

As a last step in our empirical application, we extend the model by
allowing the variables \(lnPROD_0\) and
\(net\) to enter smoothly as
non-parametric terms. Specifically, we use the function
` pspl` with 9 knots for each univariate term:

\[\gamma_{i}=\alpha+\rho \sum_{j=1}^N w_{ij,N} \gamma_{j} +g_1(\ln y_{i,0})+g_2(m_{i})+\tau \ln (n_{i})+ \tilde{f}(s_{1i},s_{2i})+\epsilon_{i}.\]

```
formgam <- growth_PROD ~ pspl(lnPROD_0, nknots = 9) + lnoccgr + pspl(net, nknots = 9) + pspt(longitude, latitude, nknots = c(10,10), psanova = FALSE)
gamsar <- pspatfit(formgam, data = prod_it, listw = lwsp_it, method = "eigen", type = "sar")
```

```
##
## Fitting Model...
```

```
##
## Time to fit the model: 4.8 seconds
```

`summary(gamsar)`

```
##
## Call
## pspatfit(formula = formgam, data = prod_it, listw = lwsp_it,
## type = "sar", method = "eigen")
##
## Parametric Terms
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0108138 0.0021183 5.1050 1.754e-06 ***
## lnoccgr -0.0349342 0.0752188 -0.4644 0.6434249
## Xspt.2 -0.0137355 0.0118824 -1.1560 0.2506681
## Xspt.3 0.0172283 0.0137329 1.2545 0.2127989
## Xspt.4 -0.0136286 0.0158405 -0.8604 0.3918063
## pspl(lnPROD_0, nknots = 9).1 0.0200607 0.0054356 3.6906 0.0003772 ***
## pspl(net, nknots = 9).1 -0.0052598 0.0031396 -1.6753 0.0972432 .
## rho 0.1922019 0.1113342 1.7264 0.0876119 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Non-Parametric Terms
## EDF
## pspl(lnPROD_0, nknots = 9) 1.0295
## pspl(net, nknots = 9) 1.4016
##
## Non-Parametric Spatio-Temporal Trend
## EDF
## f(sp1, sp2) 3.769
##
## Goodness-of-Fit
##
## EDF Total: 14.2005
## Sigma: 0.00268938
## AIC: -1126.26
## BIC: -1088.31
```

```
list_varnopar <- c("lnPROD_0", "net")
terms_nopar <- fit_terms(gamsar, list_varnopar)
plot_terms(terms_nopar, prod_it, alpha = 0.10)
```

Then, we compute the direct and indirect (or spillover) effects of
the two smooth terms in the PS-SAR using the function
` impactsnopar`:

```
gamsar_impnopar <- impactsnopar(gamsar, listw = lwsp_it, viewplot = FALSE,smooth = FALSE, alpha = 0.10)
plot_impactsnopar(gamsar_impnopar, data = prod_it, smooth = TRUE)
```

This article has highlighted several advantages of using a
semiparametric approach over a purely parametric approach to space-time
data modeling. Additionally, it has provided a brief introduction to a
new R package (` psatreg`) that allows estimating this
class of models.

The article has also demonstrated the use of this package by using spatial cross-sectional data. This simple application has illustrated the existence of a strong interference between the various problems of mis-specification that characterize the models for spatial data. Specifically, it highlighted the existence of a strong trade-off between spatial dependence and spatial heterogeneity. The inclusion of a spatial trend within a simple SAR model for cross-sectional data (where the lack of degrees of freedom prevents the inclusion of spatial fixed effects) has a strong impact on the magnitude of the spatial spillover parameter (\(\rho\)), as well on the magnitude of the other model parameters (\(\beta\)). Other important examples, also for spatio-temporal (i.e. panel) data, are provided by the vignettes included in the package.

We also recognize the existence of limitations of the semiparametric approach for dealing with spatio-temporal data proposed here. An obvious limitation is the difficulty of these kinds of models to fit data that are characterized by a weak spatial pattern. In this case, while a fixed-effect approach (applied to spatial panel data) is capable of capturing spatial heterogeneity, the inclusion of a spatial trend surface on the r.h.s. of the model hardly captures the effects of omitted variables. However, we also observe that most of the standard economic and social variables show a relevant spatial trend.

We would also like to point out some practical problems associated with the imp-le-men-ta-tion of Spatial Autoregressive Semiparametric Models. In particular, it is well known that nonparametric estimates may be spurious due to outliers, although in the case of penalized splines the effect of the extreme values is often mitigated. In practice, it might be necessary to trim extreme values at the edge of the data domain.

Regarding the problem of model selection, it seems preferable to simply compare the performance of the different models in terms of some Information Criterion. We do not yet provide a battery of diagnostic tests for Spatial Autoregressive Semiparametric Models like the Lagrange Multiplier tests widely used in the traditional parametric spatial econometric literature (LM-SEM,LM-SAR,LM-SARSAR). Indeed, the use and abuse of LM tests for the spatial autocorrelation of the residuals has been largely criticized, as it may lead to a mechanical selection process.

Finally, for future considerations, it is planned to include some functionalities in the package to allow the estimation and inference of spatio-temporal regression models with varying coefficients using P-spline metodology. These models can be seen as an alternative to the usual Geographically Weighted Regression (GWR) models.

Bai, J., and K. Li. 2013. “Spatial Panel Data Models with Common
Shocks.” MPRA Paper 52786. University of Munich, Germany. https://ideas.repec.org/p/pra/mprapa/52786.html.

Bailey, Natalia, Sean Holly, and M. Hashem Pesaran. 2016. “A
Two-Stage Approach to Spatio-Temporal Analysis with Strong and Weak
Cross-Sectional Dependence.” *Journal of Applied
Econometrics* 31 (1): 249–80. https://doi.org/10.1002/jae.2468.

Basile, Roberto, María Durbán, Román Mínguez, Jose María Montero, and
Jesús Mur. 2014. “Modeling Regional Economic Dynamics: Spatial
Dependence, Spatial Heterogeneity and Nonlinearities.”
*Journal of Economic Dynamics and Control* 48: 229–45.
https://doi.org/http://dx.doi.org/10.1016/j.jedc.2014.06.011.

Bivand, Roger. 2022. “R Packages for Analyzing Spatial Data: A
Comparative Case Study with Areal Data.” *Geographical
Analysis* 54 (3): 488–518. https://doi.org/https://doi.org/10.1111/gean.12319.

Bivand, Roger, Giovanni Millo, and Gianfranco Piras. 2021. “A
Review of Software for Spatial Econometrics in R.”
*Mathematics* 9 (11): 1276. https://doi.org/https://doi.org/10.3390/math9111276.

Bivand, Roger, Edzer Pebesma, and Virgilio Gomez-Rubio. 2013.
*Applied Spatial Data Analysis with R*. Second. New
York: Springer. https://asdar-book.org/.

Bivand, Roger, and Danlin Yu. 2022. *Spgwr: Geographically Weighted
Regression*. https://CRAN.R-project.org/package=spgwr.

Blundell, Richard, and James L. Powell. 2003. “Endogeneity in
Nonparametric and Semiparametric Regression Models.” In
*Advances in Economics and Econometrics
Theory and Application*, edited by Mathias
Dewatripont, Lars Peter Hansen, and Stephen J. Turnovsky, II:312–57.
Cambridge University Press. https://doi.org/https://doi.org/10.1017/cbo9780511610257.011.

Eilers, Paul HC, Brian D Marx, and Maria Durbán. 2015. “Twenty
Years of p-Splines.” *SORT-Statistics and Operations Research
Transactions* 39 (2): 149–86.

Elhorst, J. P. 2014. *Spatial Econometrics. From Cross-Sectional Data
to Spatial Panels*. SpringerBriefs in Regional Science.
Berlin-Heidelberg-New York: Springer. https://doi.org/https://doi.org/10.1007/978-3-642-40340-8.

Hoshino, T. 2018. “Semiparametric Spatial Autoregressive Models
with Endogenous Regressors: With an Application to Crime Data.”
*Journal of Business & Economic Statistics* 36: 160–72.
https://doi.org/https://doi.org/10.1080/07350015.2016.1146145.

Juhl, Sebastian. 2021. “spfilteR:
An R Package for Semiparametric Spatial
Filtering with Eigenvectors in (Generalized) Linear Models.”
*The R Journal* 13 (2): 450–59. https://doi.org/https://doi.org/10.32614/RJ-2021-085.

Kuhn, Max. 2020. *AmesHousing: The Ames Iowa Housing Data*. https://CRAN.R-project.org/package=AmesHousing.

Lee, D. J., M. Durban, and P. Eilers. 2013. “Efficient
Two-Dimensional Smoothing with P-Spline ANOVA
Mixed Models and Nested Bases.” *Computational Statistics
& Data Analysis* 61: 22–37. https://doi.org/https://doi.org/10.1016/j.csda.2012.11.013.

LeSage, J., and K. Pace. 2009. *Introduction to Spatial
Econometrics*. Boca Raton: CRC Press.
https://doi.org/https://doi.org/10.1201/9781420064254.

Lopez, Fernando, Roman Minguez, and Jesus Mur. 2020.
“ML Versus IV Estimates of Spatial
SUR Models: Evidence from the Case of
Airbnb in Madrid Urban Area.” *The
Annals of Regional Science* 64 (2): 313–47. https://doi.org/10.1007/s00168-019-00914-1.

McMillen, Daniel P. 2003. “Spatial Autocorrelation or Model
Misspecification?” *International Regional Science Review*
26 (2): 208–17. https://doi.org/https://doi.org/10.1177/0160017602250977.

———. 2012. “PERSPECTIVES ON SPATIAL ECONOMETRICS: LINEAR SMOOTHING
WITH STRUCTURED MODELS.” *Journal of Regional Science* 52
(2): 192–209. https://doi.org/10.1111/j.1467-9787.2011.00746.x.

Millo, Giovanni, and Gianfranco Piras. 2012. “splm: Spatial Panel Data Models in
R.” *Journal of Statistical Software* 47 (1):
1–38. https://doi.org/https://www.jstatsoft.org/v47/i01/.

Mı́nguez, Román, Roberto Basile, and Marı́a Durbán. 2020. “An
Alternative Semiparametric Model for Spatial Panel Data.”
*Statistical Methods & Applications* 29 (4): 669–708.
https://doi.org/https://doi.org/10.1007/s10260-019-00492-8.

Montero, J, R Mı́nguez, and M Durbán. 2012. “SAR
Models with Nonparametric Spatial Trends. A
P-Spline Approach.” *Estadı́stica
Española* 54 (177): 89–111.

Pesaran, M Hashem, and Elisa Tosetti. 2011. “Large Panels with
Common Factors and Spatial Correlation.” *Journal of
Econometrics* 161 (2): 182–202. https://doi.org/https://doi.org/10.1016/j.jeconom.2010.12.003.

Piras, Gianfranco. 2010. “sphet:
Spatial Models with Heteroskedastic Innovations in
R.” *Journal of Statistical Software* 35 (1):
1–21. https://doi.org/10.18637/jss.v035.i01.

Shi, Wei, and Lung-fei Lee. 2018. “A Spatial Panel Data Model with
Time Varying Endogenous Weights Matrices and Common Factors.”
*Regional Science and Urban Economics* 72: 6–34.
https://doi.org/https://doi.org/10.1016/j.regsciurbeco.2017.03.007.

Vega, Solmaria Halleck, and J Paul Elhorst. 2016. “A Regional
Unemployment Model Simultaneously Accounting for Serial Dynamics,
Spatial Dependence and Common Factors.” *Regional Science and
Urban Economics* 60: 85–95. https://doi.org/https://doi.org/10.1016/j.regsciurbeco.2016.07.002.

Wood, S. N. 2017. *Generalized Additive Models: An Introduction with
R*. 2nd ed. Chapman; Hall/CRC. https://doi.org/https://doi.org/10.1201/9781315370279.

P-splines are a flexible tool for smoothing. They are based on regressions with a large number of local basis functions (called B-splines). A penalty function based on differences between adjacent coefficients is also included in the maximum likelihood function to tune the smoothness of the estimated curve.↩︎

You could install

*pspatreg*from CRAN executing. Usually the default options allow to install the package without any problems. Alternatively, to install from GitHub you could use**install.packages(“pspatreg”)***devtools*package. Once installed, execute the commandto install**devtools::install_github(“rominsal/pspatreg”)***pspatreg*package.↩︎In most empirical cases, the main effects are more flexible than interaction effects and therefore the number of knots in B-Spline bases for interaction effects do not need to be as large as the number of knots for the main effects (Lee, Durban, and Eilers 2013).↩︎

means simple features of spatial vector objects. The geographic vector data model is based on points located within a coordinate reference system (CRS). Points can represent self-standing features (e.g., the location of a house) or they can be linked together to form more complex geometries such as lines and polygons. Most point geometries contain only two dimensions**sf**and**x**(3-dimensional CRSs contain an additional**y**value, typically representing height above sea level).**z**objects provide both a**sf***geometry*information, describing where on Earth the feature is located, and*attributes*information, describing other properties (like the population of the region, the unemployment rate, etc.).objects store only attributes information.↩︎**data.frame**Preliminary diagnostic tests (using likelihood ratio statistics) work in favor of the SAR model, rather that the SDM and the SEM. Spatial autoregressive models are estimated using a standardized inverse distance \(W\) matrix combined with a binary minimum threshold distance matrix.↩︎