Comparing countries with macro-level data

Macro
Comparing countries
Economics
Political science
Sociology
Author

Carlo Knotz

Published

March 30, 2025

Why we do macro-level comparisons

Many important questions that political scientists, sociologists, and economists ask are about patterns at the macro or country level: Why do some countries have bigger welfare states than others (Esping-Andersen 1990; Allan and Scruggs 2004; Korpi and Palme 2003; Iversen and Soskice 2006)? How do a country’s political structures affect its economic or environmental performance (Roller 2005; Acemoglu, Johnson, and Robinson 2001; Scruggs 2003)? Answering these questions usually requires some form of cross-country comparison with macro-level data on the size and shape of welfare states, political institutions, environmental performance, or economic growth. This analysis can be quantitative, typically a time series cross-sectional regression analysis (see e.g., Beck 2001), but it can also be a qualitative comparative case study. Even in the latter case, a few good graphs that show relevant developments, patterns, and trends at the country level can make the case study much more convincing and easier to follow.

Luckily, we are now sitting on a mountain of (usually) freely available macro-level data on all kinds of economic, social, political, or environmental aspects for many countries and over long periods of time. To name just a few examples:

  • International institutions such as the World Bank, the OECD, the European Union, or the UN offer free data on a vast number of economic, social, environmental, and political variables for their member countries.
  • The V-Dem project provides very detailed data on how democratic many countries are, going back to the 1800s.
  • There are also many different datasets that measure countries’ political institutions, constitutional structures, party systems, election outcomes, and the representation of parties in parliaments and governments (see the Dataset of Political Datasets.)
  • Researchers in international relations and peace & conflict studies have created many datasets on countries’ military strengths, alliances, conflicts, wars, terrorist attacks, and many other aspects (see https://github.com/erikgahner/PolData?tab=readme-ov-file#international-relations.)

In addition, many datasets come with associated R packages that allow you to import the data directly (see e.g., the vdemdata, WDI, or manifestoR packages).

There are different ways to work with macro-level data. A beginner-friendly one is descriptive analysis, and that is what the rest of this post focuses on.

More specifically, we will go over some example techniques using the Comparative Political Data Set (CPDS; https://cpds-data.org/), which is a very popular and fairly easy-to-work-with dataset in political science. It is a kind of Swiss army knife of macro-level data that includes the (usually) most relevant political, economic, and social macro-level indicators for a set of wealthy democracies in Europe, North America, and Australasia for the post-World War II period in one single source (e.g., GDP growth, the partisan composition of parliaments and governments, welfare state spending, or political institutions).

Setup

If you want to follow along, make sure you have the tidyverse loaded and, if you like, pre-set the ggplot2 graph theme to save time later:

library(tidyverse)
theme_set(theme_classic())

What macro-level data (should) look like

The first important thing to understand is what a macro-level dataset should look like if you want to analyze it in R. As per Hadley Wickham’s rules for tidy data (2014), all datasets should be structured so that:

  • Every row is an observation
  • Every column is a variable

This is easy when we work with a typical micro-level survey dataset like the European Social Survey, where the unit of observation is a single person. Here, every person is a row and every aspect that is recorded about them (their gender, income, age, etc.) is a column.

In a typical macro-level dataset, the unit of observation is usually a country-year: We observe Norway in 1990, 1991, 1992, and so on, and then we observe France, Sweden, Japan, etc. in the same years.1 Here is a simple example of what this should look like, using data on GDP growth (realgdpgr) from the CPDS dataset:

Important

This is what your dataset should look like!

# A tibble: 12 × 4
    year country iso   realgdpgr
   <dbl> <chr>   <chr>     <dbl>
 1  1990 France  FRA       3.03 
 2  1991 France  FRA       0.944
 3  1992 France  FRA       1.48 
 4  1990 Japan   JPN       5.57 
 5  1991 Japan   JPN       3.32 
 6  1992 Japan   JPN       0.819
 7  1990 Norway  NOR       1.93 
 8  1991 Norway  NOR       3.08 
 9  1992 Norway  NOR       3.57 
10  1990 Sweden  SWE       0.755
11  1991 Sweden  SWE      -1.15 
12  1992 Sweden  SWE      -1.16 

You see that the individual observations for each country and year (country-years) are “stacked” on top of each other, and that we have variables telling us which year and which country a given row corresponds to. These are important: You absolutely need to keep these variables in your dataset, otherwise you no longer know what each row in your dataset corresponds to.

The table also shows countries in two different formats: the plain English name and the three-letter ISO country code. Datasets typically use one or the other (or yet another type of country code), which can sometimes be a hassle to work with. Luckily, there is the countrycode package, which allows you to convert different country codes and names to other formats with a few lines of code.
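As a minimal sketch of such a conversion, assuming the countrycode package is installed, the following turns English country names into ISO three-letter codes and back. The vector names (country_names, iso_codes) are made up for illustration:

```r
library(countrycode)

# Country names as they might appear in one dataset
country_names <- c("France", "Japan", "Norway", "Sweden")

# Convert English country names to ISO 3166-1 alpha-3 codes
iso_codes <- countrycode(country_names,
                         origin = "country.name",
                         destination = "iso3c")
iso_codes

# And back from ISO codes to English country names
countrycode(iso_codes, origin = "iso3c", destination = "country.name")
```

In a real workflow you would typically do this inside mutate() to add a new code column to your dataset.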

Sometimes, and this can happen often when you download data from international organizations, the data you get look different (e.g., each row corresponds to a country and the columns refer to variables and years):

Important

This is what your dataset should not look like!

# A tibble: 4 × 4
  country realgdpgr_1990 realgdpgr_1991 realgdpgr_1992
  <chr>            <dbl>          <dbl>          <dbl>
1 France           3.03           0.944          1.48 
2 Japan            5.57           3.32           0.819
3 Norway           1.93           3.08           3.57 
4 Sweden           0.755         -1.15          -1.16 

If you do have a dataset that looks like this, you need to learn how to pivot or reshape your dataset. Here, the pivot_longer() and pivot_wider() functions from the tidyr package (included in the tidyverse) are your best friends (see also Urdinez and Cruz 2020, chap. 2.5.1).
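For the wide table shown above, the reshaping could look roughly like this. This is only a sketch: the data are typed in by hand from the example, and the object names wide and long are made up:

```r
library(tidyr)

# The "wide" example from above, typed in by hand
wide <- tibble::tribble(
  ~country, ~realgdpgr_1990, ~realgdpgr_1991, ~realgdpgr_1992,
  "France",            3.03,           0.944,           1.48,
  "Japan",             5.57,           3.32,            0.819,
  "Norway",            1.93,           3.08,            3.57,
  "Sweden",            0.755,         -1.15,           -1.16
)

# Stack the three year columns into one row per country-year
long <- wide |>
  pivot_longer(
    cols = starts_with("realgdpgr_"),          # the columns to stack
    names_to = "year",                         # old column names go here...
    names_prefix = "realgdpgr_",               # ...with the prefix stripped...
    names_transform = list(year = as.integer), # ...and converted to numbers
    values_to = "realgdpgr"                    # the cell values go here
  )

long
```

The result has one row per country-year and the columns country, year, and realgdpgr, which is the structure we want.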

Importing the CPDS dataset

OK, enough theory — time to work with some data. If you want to follow along, you need to download the latest version of the CPDS dataset (https://cpds-data.org/data/). Ideally, download the Stata version, unzip the file, and store it in your RStudio project folder (or the folder that is your current Working Directory, which you can find out with the getwd() function). Once you have that, all you need to do is to use the haven package to import the dataset:

cpds <- haven::read_dta("CPDS_1960_2022_Update_2024.dta")

The cpds object should now pop up in your Environment tab in RStudio. If you like, you can take a brief look at the data with glimpse(). You should also download the official codebook and get familiar with the variables that are included in the dataset!

Another way to get a sense of what is contained is to look at the unique countries and years that are covered:

unique(cpds$year)
 [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
[16] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
[31] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
[46] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
[61] 2020 2021 2022
unique(cpds$country)
 [1] "Australia"      "Austria"        "Belgium"        "Bulgaria"      
 [5] "Canada"         "Croatia"        "Cyprus"         "Czech Republic"
 [9] "Denmark"        "Estonia"        "Finland"        "France"        
[13] "Germany"        "Greece"         "Hungary"        "Iceland"       
[17] "Ireland"        "Italy"          "Japan"          "Latvia"        
[21] "Lithuania"      "Luxembourg"     "Malta"          "Netherlands"   
[25] "New Zealand"    "Norway"         "Poland"         "Portugal"      
[29] "Romania"        "Slovakia"       "Slovenia"       "Spain"         
[33] "Sweden"         "Switzerland"    "United Kingdom" "USA"           

You see that we have, in principle, data for almost all of Europe and the other advanced democracies around the globe from the 1960s onwards. What this does not show, however, is that the data for the Central and Eastern European countries (Poland, Bulgaria, etc.) only start around 1990, after the collapse of the Soviet Union and the Warsaw Pact:

cpds |> 
  filter(country == "Poland") |> 
  select(year,country)
# A tibble: 32 × 2
    year country
   <dbl> <chr>  
 1  1991 Poland 
 2  1992 Poland 
 3  1993 Poland 
 4  1994 Poland 
 5  1995 Poland 
 6  1996 Poland 
 7  1997 Poland 
 8  1998 Poland 
 9  1999 Poland 
10  2000 Poland 
# ℹ 22 more rows

Because such a large batch of countries was added at this one time point, it makes sense to limit the data to the post-1990 period — otherwise, comparisons over time might not make sense.

cpds <- cpds |> 
  filter(year >= 1990)

(An alternative, if the interest is in long trends since World War II, is to leave out the post-communist countries. Here, the poco — “post-communist” — variable in the CPDS dataset is useful within filter().)
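A minimal sketch of that alternative, using a tiny hand-made stand-in for the CPDS data. It assumes that poco is coded 1 for post-communist countries and 0 otherwise; check the codebook before relying on this:

```r
library(dplyr)

# Toy stand-in for the CPDS data (the real dataset has many more columns);
# poco is ASSUMED to be 1 for post-communist countries and 0 otherwise
toy_cpds <- tibble(
  country = c("Norway", "Norway", "Poland", "Poland"),
  year    = c(1990, 1991, 1991, 1992),
  poco    = c(0, 0, 1, 1)
)

# Keep only the countries that are not post-communist
long_run <- toy_cpds |> 
  filter(poco == 0)

unique(long_run$country)
```

With the real data, you would apply the same filter() to the full cpds object before restricting the years.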

Descriptive analyses with macro-level data

There are four basic ways to look descriptively at macro-level data:

  1. You can look at general trends across countries over time
  2. You can compare average patterns between countries
  3. You can compare trends within selected countries over time
  4. You can look at average relationships (correlations) between countries

Each of them tells you a different part of the entire story that is contained in the data. We will go over each of them and see how to aggregate and visualize the data. In most cases, your two best friends are the group_by() and summarize() functions from dplyr.
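To make the first approach (general trends across countries over time) concrete, here is a rough sketch that uses only the twelve country-years from the small example table above rather than the full CPDS data. The object names gdp and yearly are made up:

```r
library(dplyr)

# The small example table from above, typed in by hand
gdp <- tibble(
  country   = rep(c("France", "Japan", "Norway", "Sweden"), each = 3),
  year      = rep(c(1990, 1991, 1992), times = 4),
  realgdpgr = c(3.03, 0.944, 1.48,
                5.57, 3.32, 0.819,
                1.93, 3.08, 3.57,
                0.755, -1.15, -1.16)
)

# One row per year: the average growth rate across all countries in that year
yearly <- gdp |> 
  group_by(year) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE))

yearly
```

The aggregated result can then be passed on to ggplot() with geom_line() to show the trend as a line graph.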

Aggregating by country to show differences

Another thing we might be interested in is which countries had, on average, the highest or lowest growth rates in the period between 1990 and today. To see this, we again use group_by() and summarize(), but we now group by country instead of year:

cpds |> 
  group_by(country) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE))
# A tibble: 36 × 2
   country        avg_growth
   <chr>               <dbl>
 1 Australia            2.89
 2 Austria              1.87
 3 Belgium              1.81
 4 Bulgaria             1.94
 5 Canada               2.13
 6 Croatia              2.34
 7 Cyprus               3.41
 8 Czech Republic       1.58
 9 Denmark              1.76
10 Estonia              4.02
# ℹ 26 more rows

As before, we now get an aggregated version of the dataset, but now it is aggregated by country, not by year. We see that, for example, Australia had an average growth rate of around 2.9% per year, while the rate in the Czech Republic was only around 1.6% per year.

We can again visualize the result, but here a bar graph makes most sense. We can also use reorder() to sort the bars according to the average growth rate:

cpds |> 
  group_by(country) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE)) |> 
  ggplot(aes(x = avg_growth, y = reorder(country, avg_growth))) +
    geom_col() +
    labs(x = "Average rate of GDP growth (%)", y = "")

You see that Ireland (the “Celtic Tiger”) had by far the highest growth rate since the 1990s, followed by Malta and Estonia. Italy, Greece, and Japan clearly had the lowest average rates of growth.

Comparing selected countries over time

Sometimes, for example when you do a comparative case study of a few selected countries, you want to show relevant developments in those countries, without any aggregation. This is obviously also possible with this type of data, and here the filter() function is your best friend.

Let’s say we want to compare the development of economic growth rates in the four largest Nordic countries (Denmark, Finland, Norway, Sweden) since the 1990s. In that case, we just need to use filter() to subset the data to those countries:

cpds |> 
  select(country,year,realgdpgr) |> # this is technically not necessary, but 
  # sometimes useful to avoid losing overview over the data
  filter(country %in% c("Denmark","Finland","Sweden","Norway"))
# A tibble: 132 × 3
   country  year realgdpgr
   <chr>   <dbl>     <dbl>
 1 Denmark  1990    1.48  
 2 Denmark  1991    1.39  
 3 Denmark  1992    1.96  
 4 Denmark  1993    0.0107
 5 Denmark  1994    5.33  
 6 Denmark  1995    3.03  
 7 Denmark  1996    2.90  
 8 Denmark  1997    3.26  
 9 Denmark  1998    2.21  
10 Denmark  1999    2.95  
# ℹ 122 more rows

This is all there is to it — we now have limited the dataset to the four Nordic countries. Obviously, the result is not very informative, but we can again visualize the result in a line graph with ggplot():

cpds |> 
  select(country,year,realgdpgr) |> # this is technically not necessary, but 
  # sometimes useful to avoid losing overview over the data
  filter(country %in% c("Denmark","Finland","Sweden","Norway")) |> 
  ggplot(aes(x = year, y = realgdpgr, group = country, color = country)) +
    geom_line(linewidth = 1) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "grey") +
    scale_color_brewer(palette = "Paired") + # color-blind friendly palette
    scale_x_continuous(breaks = seq(1990,2020,5)) +
    labs(y = "GDP growth rate (%)", x = "", color = "") +
    theme(legend.position = "bottom")

In general, growth rates in the four countries are (unsurprisingly) behaving quite similarly — when Denmark experiences a crisis, Sweden, Norway, and Finland do as well — but it does seem that Finland tends a bit more toward the extremes than the other countries. Crises tend to hit hardest in Finland, but the following recoveries are also stronger.

An alternative way to visualize the result is to use facet_wrap() to create a separate graph for each country. This helps if, as is the case here, the lines overlap strongly:

cpds |> 
  filter(country %in% c("Denmark","Finland","Sweden","Norway")) |> 
  ggplot(aes(x = year, y = realgdpgr)) +
    geom_line() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "grey") +
    facet_wrap(~ country) +
    scale_x_continuous(breaks = seq(1990,2020,5)) +
    labs(y = "GDP growth rate (%)", x = "") # no color aesthetic here, so no legend settings needed

The more extreme up- and downswings in Finland are still visible.

Showing bivariate relationships across countries

Although the development of a single variable over time or its variation between countries is often relevant to look at, we are in most cases primarily interested in relationships between variables: Does one variable affect the other, or are they at least correlated with each other?

One way to check for a relationship between two variables is to aggregate both variables by country and then use a scatterplot to visualize the result.

Let’s say we wanted to test the hypothesis that a stronger presence of left parties in government is bad for economic growth (as some people claim). The gov_left1 variable from the CPDS dataset gives us the share of cabinet posts that are held by left-of-center parties in a given year and country, and we can simply aggregate this variable along with the one measuring economic growth within summarize():

cpds |> 
  group_by(country) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE),
            avg_leftgov = mean(gov_left1, na.rm = TRUE)) 
# A tibble: 36 × 3
   country        avg_growth avg_leftgov
   <chr>               <dbl>       <dbl>
 1 Australia            2.89        38.1
 2 Austria              1.87        33.8
 3 Belgium              1.81        38.2
 4 Bulgaria             1.94        16.2
 5 Canada               2.13         0  
 6 Croatia              2.34        20.7
 7 Cyprus               3.41        13.8
 8 Czech Republic       1.58        28.3
 9 Denmark              1.76        38.8
10 Estonia              4.02        21.7
# ℹ 26 more rows

The numbers give us the average growth rate and the average share of cabinet posts held by left parties in each country in the period since 1990. We can now use geom_point() to visualize the result in a scatterplot, and add geom_smooth() to get a fitted line that highlights the relationship between the variables:

cpds |> 
  group_by(country) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE),
            avg_leftgov = mean(gov_left1, na.rm = TRUE)) |> 
  ggplot(aes(x = avg_leftgov, y = avg_growth)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "grey", 
                linetype = "dashed") +
    labs(x = "Avg. share of left parties in government (%)",
         y = "Average rate of economic growth (%)")
`geom_smooth()` using formula = 'y ~ x'

There is indeed a negative, but quite weak, relationship between the two variables. However, it is important not to forget that the arrow of causality might run the other way: Maybe left parties get elected more often in times of economic crisis? (What happens when you look at the relationship between economic growth and the share of right parties in government using gov_right1?)
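If you want a number to go with the visual impression, one option is a simple correlation coefficient. The sketch below uses only the ten countries printed in the table above (typed in by hand), so its result will differ from the one for all 36 countries. The object names agg and r are made up:

```r
# The first ten rows of the aggregated table from above, typed in by hand;
# this is NOT the full set of 36 countries
agg <- data.frame(
  country     = c("Australia", "Austria", "Belgium", "Bulgaria", "Canada",
                  "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia"),
  avg_growth  = c(2.89, 1.87, 1.81, 1.94, 2.13, 2.34, 3.41, 1.58, 1.76, 4.02),
  avg_leftgov = c(38.1, 33.8, 38.2, 16.2, 0, 20.7, 13.8, 28.3, 38.8, 21.7)
)

# Pearson correlation between left government participation and growth;
# use = "complete.obs" drops countries with missing values on either variable
r <- cor(agg$avg_leftgov, agg$avg_growth, use = "complete.obs")
r
```

With the real data, you would compute this on the full aggregated dataset rather than on a hand-typed subset.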

One way to improve the graph further is to replace the anonymous black dots with country labels, so that we can see where the different countries are located. To do that, we replace geom_point() with geom_text(), and we aggregate the data by the iso variable (which is equivalent to the country variable). By using iso, we get handy short labels that we can use in the graph:

cpds |> 
  group_by(iso) |> 
  summarise(avg_growth = mean(realgdpgr, na.rm = TRUE),
            avg_leftgov = mean(gov_left1, na.rm = TRUE)) |> 
  ggplot(aes(x = avg_leftgov, y = avg_growth)) +
    geom_text(aes(label = iso)) +
    geom_smooth(method = "lm", se = FALSE, color = "grey", 
                linetype = "dashed") +
    labs(x = "Avg. share of left parties in government (%)",
         y = "Average rate of economic growth (%)")
`geom_smooth()` using formula = 'y ~ x'

This clarifies matters. It almost seems as if the slight negative relationship between left government participation and economic growth is mainly driven by the outlying case of Ireland…
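One way to probe a suspicion like this is to fit the line once with and once without the suspected outlier and compare the slopes. The sketch below shows the mechanics on made-up numbers. These are not CPDS values, and the labels (AAA to EEE, OUT) are placeholders:

```r
# Made-up illustrative values, NOT CPDS data: five similar cases and one outlier
toy <- data.frame(
  iso         = c("AAA", "BBB", "CCC", "DDD", "EEE", "OUT"),
  avg_leftgov = c(20, 30, 40, 50, 60, 10),
  avg_growth  = c(2.0, 2.1, 1.9, 2.0, 2.1, 6.0)
)

# Slope of the fitted line using all cases
slope_all <- coef(lm(avg_growth ~ avg_leftgov, data = toy))[["avg_leftgov"]]

# Slope of the fitted line after dropping the outlying case
slope_without <- coef(lm(avg_growth ~ avg_leftgov,
                         data = subset(toy, iso != "OUT")))[["avg_leftgov"]]

c(all_cases = slope_all, without_outlier = slope_without)
```

In this toy example, the negative slope all but disappears once the outlier is dropped. With the real data, you could apply the same idea by adding filter(iso != "IRL") before plotting, assuming that "IRL" is the code used for Ireland in the iso variable.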

What next?

This post showed you how you can do descriptive analyses of cross-country macro-level (or time series cross-sectional) datasets. This can help you spice up a comparative case study with descriptive statistics of relevant macro-level indicators, and it can be a stepping stone toward learning how to do regression analyses with this type of data.

More concrete steps you can take to advance further are:

  1. Get more of an overview over what macro-level datasets there are out there (see also the sources above).
  2. Learn how to combine (“merge”) different datasets. This is not as difficult as it might sound. Since all these datasets have the same underlying country-year structure, you just need to figure out how to work with the left_join() function to merge datasets (see also the post on how to measure globalization exposure), and probably also how to convert between different country codes and names with the countrycode package (see also Urdinez and Cruz 2020, chap. 11).
  3. Explore other types of macro-level datasets. Relevant examples are the Manifesto Project Database, which provides quantitative estimates of the ideological positions of political parties in different countries (here, the unit of observation is a party at a given election or “party-election”) or different peace & conflict datasets (e.g., Raleigh et al. 2010; Uppsala Conflict Data Program 2014; Gibler and Miller 2023; Vogt et al. 2015).
  4. Learn how to do multivariate regression analyses with these datasets. Relevant works to read are Beck and Katz (1995, 1996, 2011), Beck, Katz, and Tucker (1998), Beck (2001), De Boef and Keele (2008), Wilson and Butler (2007), Carter and Signorino (2010), Honaker and King (2010), Birkel (2014), and for more advanced methods see Blackwell and Glynn (2018). Urdinez and Cruz (2020, chap. 7) and Croissant and Millo (2008) show how to implement the main techniques in R.
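The merging step from point 2 can be sketched as follows. The growth figures are taken from the small example table earlier in this post. The second dataset and its other_var column are hypothetical placeholders for whatever indicator you want to add:

```r
library(dplyr)

# A few country-years from the example table earlier in this post
growth <- tibble(
  iso       = c("NOR", "NOR", "SWE", "SWE"),
  year      = c(1990, 1991, 1990, 1991),
  realgdpgr = c(1.93, 3.08, 0.755, -1.15)
)

# Hypothetical second dataset; other_var holds placeholder values only
other <- tibble(
  iso       = c("NOR", "NOR", "SWE"),
  year      = c(1990, 1991, 1990),
  other_var = c(10, 11, 20)
)

# Keep every row of growth and add matching columns from other;
# country-years without a match (here SWE 1991) get NA
merged <- growth |> 
  left_join(other, by = c("iso", "year"))

merged
```

The key point is that both datasets are matched on country and year together. If the two sources identify countries differently, convert one of them first (for example with the countrycode package) so that the key columns line up.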

References

Acemoglu, Daron, Simon Johnson, and James A. Robinson. 2001. “The Colonial Origins of Comparative Development: An Empirical Investigation.” American Economic Review 91 (5): 1369–1401.
Allan, James B., and Lyle Scruggs. 2004. “Political Partisanship and Welfare State Reform in Advanced Industrial Societies.” American Journal of Political Science 48 (3): 496–512.
Beck, Nathaniel. 2001. “Time-Series-Cross-Section Data: What Have We Learned in the Past Few Years?” Annual Review of Political Science 4: 271–93.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What to Do (and Not to Do) with Time-Series Cross-Section Data.” American Political Science Review 89 (3): 634–47.
———. 1996. “Nuisance Vs. Substance: Specifying and Estimating Time-Series-Cross-Section Models.” Political Analysis 6 (1): 1–36.
———. 2011. “Modeling Dynamics in Time-Series Cross-Section Political Data.” Annual Review of Political Science 14: 331–52.
Beck, Nathaniel, Jonathan N. Katz, and Richard Tucker. 1998. “Taking Time Seriously: Time-Series-Cross-Section Analysis with a Binary Dependent Variable.” American Journal of Political Science 42 (4): 1260–88.
Birkel, Christoph. 2014. “The Analysis of Non-Stationary Pooled Time Series Cross-Section Data.” International Journal of Conflict and Violence 8 (2): 223.
Blackwell, Matthew, and Adam N Glynn. 2018. “How to Make Causal Inferences with Time-Series Cross-Sectional Data Under Selection on Observables.” American Political Science Review 112 (4): 1067–82.
Carter, David B., and Curtis S. Signorino. 2010. “Back to the Future: Modeling Time Dependence in Binary Data.” Political Analysis 18 (3): 271–92.
Croissant, Yves, and Giovanni Millo. 2008. “Panel Data Econometrics in R: The plm Package.” Journal of Statistical Software 27 (2): 1–43.
De Boef, Suzanna, and Luke Keele. 2008. “Taking Time Seriously.” American Journal of Political Science 52 (1): 184–200.
Esping-Andersen, Gøsta. 1990. The Three Worlds of Welfare Capitalism. Cambridge: Polity Press.
Gibler, Douglas M, and Steven V Miller. 2023. “The Militarized Interstate Events (MIE) Dataset, 1816–2014.” Conflict Management and Peace Science forth.
Honaker, James, and Gary King. 2010. “What to Do about Missing Values in Time-Series Cross-Section Data.” American Journal of Political Science 54 (2): 561–81.
Iversen, Torben, and David Soskice. 2006. “Electoral Institutions and the Politics of Coalitions: Why Some Democracies Redistribute More Than Others.” American Political Science Review 100 (2): 165–81.
Korpi, Walter, and Joakim Palme. 2003. “New Politics and Class Politics in the Context of Austerity and Globalization: Welfare State Regress in 18 Countries, 1975-95.” American Political Science Review 97 (3): 425–45.
Raleigh, Clionadh, Andrew Linke, Håvard Hegre, and Joakim Karlsen. 2010. “Introducing ACLED: An Armed Conflict Location and Event Dataset.” Journal of Peace Research 47 (5): 651–60.
Roller, Edeltraud. 2005. The Performance of Democracies: Political Institutions and Public Policies. Oxford: Oxford University Press.
Scruggs, Lyle. 2003. Sustaining Abundance: Environmental Performance in Industrial Democracies. Cambridge: Cambridge University Press.
Uppsala Conflict Data Program. 2014. UCDP/PRIO Armed Conflict Dataset. Uppsala: The Uppsala Conflict Data Program.
Urdinez, Francisco, and Andres Cruz. 2020. R for Political Data Science: A Practical Guide. Boca Raton; others: CRC Press.
Vogt, Manuel, Nils-Christian Bormann, Seraina Rüegger, Lars-Erik Cederman, Philipp Hunziker, and Luc Girardin. 2015. “Integrating Data on Ethnicity, Geography, and Conflict: The Ethnic Power Relations Data Set Family.” Journal of Conflict Resolution 59 (7): 1327–42.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
Wilson, Sven E., and Daniel M. Butler. 2007. “A Lot More to Do: The Sensitivity of Time-Series Cross-Section Analyses to Simple Alternative Specifications.” Political Analysis 15 (2): 101–23.

Footnotes

  1. Many peace & conflict datasets are an exception: there, the unit of observation is a country-pair (dyad) or a conflict-year. This can make these datasets a bit more difficult to work with.↩︎

  2. Obviously, you should always make sure that you do not have large changes in the composition of your dataset — e.g., where many new countries are added from a given year on — because that can lead to sudden jumps or dips in the average values.↩︎