Sunday, July 23, 2017

Administrative Boundaries

There are several GIS datasets on administrative boundaries.

GADM is the most popular among economists. In my experience of using both GADM and GAUL, GADM is indeed the more trustworthy of the two:
  • GAUL's parish (fourth-level administrative boundary) data for Uganda show multiple polygons for the same parish name. This is not the case for GADM.
  • GAUL's coastline south of Baku in Azerbaijan is drawn where elevation (according to SRTM30: see this post) is above zero meters, while GADM's coastline is drawn where elevation changes from positive to negative values.
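
The first check above (several polygons sharing one parish name) can be sketched in a few lines. This is only a minimal illustration, assuming the attribute table has already been read into a list of records, one per polygon. The field names `district` and `parish` and the sample rows are hypothetical and are not taken from GAUL or GADM.

```python
from collections import Counter

def names_with_multiple_polygons(records, keys=("district", "parish")):
    """Return {key tuple: polygon count} for keys that appear more than once."""
    counts = Counter(tuple(rec[k] for k in keys) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical attribute rows, one per polygon (not real data).
sample = [
    {"district": "D1", "parish": "P1"},
    {"district": "D1", "parish": "P1"},  # second polygon with the same name
    {"district": "D1", "parish": "P2"},
    {"district": "D2", "parish": "P1"},  # same parish name, different district
]

duplicates = names_with_multiple_polygons(sample)
print(duplicates)
```

Keying on the parent unit as well as the parish name avoids flagging parishes that merely share a name across different districts.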

National Boundaries

the World Vector Shoreline (WVS)
  • If you're interested in the historical national boundaries since 1945

Sub-national Boundaries

Global Administrative Areas (GADM)
  • an alternative to GAUL. Whether it is better or worse is not clear. 
  • mentioned by Gleditsch and Weidmann (2012) in their review of spatial data analysis in political science.
  • Used by the Gridded Population of the World Version 4 (see here).
  • Used by Dreher et al. (2015).
    • They mention that GADM does not include the second level administrative boundaries (counties/districts) for Egypt, Equatorial Guinea, Lesotho, Libya, and Swaziland.
  • Also used by Alesina et al. (2016) to measure inequality across subnational administrative regions (which turns out to be negatively correlated with per capita GDP).
Global Administrative Unit Layers (GAUL)
  • Supposedly annual panel data from 1990 onward, but district boundary changes are properly tracked only for some countries.
  • Used by Briggs (2015).

The Second Administrative Level Boundaries (SALB) dataset

  • compiled by the United Nations
  • provides GIS data on second-tier subnational administrative boundaries (i.e., district boundaries).
  • I'm not sure whether the GAUL dataset mentioned above incorporates this or the SALB dataset has its original data.
  • For subnational boundary changes during early years
  • This is the online updated version of the book Administrative Subdivisions of Countries by Gwillim Law (Jefferson, North Carolina: McFarland & Company, 1999).
  • provides the list of administrative regions for every country, past and present. Very useful if you need to match different sub-national or micro datasets based on sub-national regions, especially when a country of your interest has changed the boundaries of sub-national regions quite frequently such as Nigeria and Uganda.

Sunday, July 16, 2017

Raw Material Data (RMD)

Annual panel data on mines around the world since 1980, compiled by IntierraRMG, now part of S&P Global.

According to Berman et al. (2017), the dataset includes variables such as
  • whether a mine is active, 
  • the year production started, 
  • the specific minerals produced, 
  • the total production for each of them
  • ownership structure and characteristics of the mines
  • extraction methods

  • small-scale mines, and those that are illegally operated, are not included.

    Used by Berman et al. (2017) (and see their footnote 12).

    Monday, July 3, 2017

    Cross-country education datasets

    Penn World Tables 9.0

    See this document, which reviews the academic debate on the quality of the Barro-Lee dataset.

    Barro-Lee dataset

    A well-known dataset on average years of schooling (i.e. stock of human capital).

    The 2010 updated version is now available at

    For details on the data construction, read Robert J. Barro and Jong-Wha Lee, "International Data on Educational Attainment: Updates and Implications" (CID Working Paper No. 42, April 2000). This 2000 paper is an updated version of Barro and Lee (1993). Both papers compare various measures of human capital.

    Average years of schooling are available for six population groups: males over 25, females over 25, all over 25, males over 15, females over 15, and all over 15.

    Population over the age of 15 "corresponds better to the labor force for many developing countries." (Barro and Lee 2000, p.2)

    Percentages of those who attained/completed each level of schooling in the total/male/female population are also available. Note that the variables LU, LP, LS, and LH sum to 100, and that Lx - LxC, where x is P, S, or H, is the percentage of those who dropped out before completing primary, secondary, or higher education, respectively. In other words, the percentage for "... school attained" includes the percentage for "... school complete".

    Downloadable from this page by the Center for International Development at Harvard University (CID).
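
    The relationship between the attained and completed variables can be illustrated with a short sketch. The variable names LU, LP, LS, LH and the LxC pattern follow the description above; I take LU to be the no-schooling share. The numbers are made up for illustration and are not actual Barro-Lee figures, and the tolerance `tol` is an assumption to allow for rounding in the published percentages.

```python
def dropout_shares(row):
    """Percentage who attained but did not complete each level (Lx - LxC)."""
    return {level: row["L" + level] - row["L" + level + "C"] for level in ("P", "S", "H")}

def attainment_sums_to_100(row, tol=0.5):
    """Check that LU + LP + LS + LH adds up to 100, within a rounding tolerance."""
    total = row["LU"] + row["LP"] + row["LS"] + row["LH"]
    return abs(total - 100.0) <= tol

# Made-up percentages for one country-year (illustration only).
example = {
    "LU": 20.0,
    "LP": 40.0, "LPC": 25.0,
    "LS": 30.0, "LSC": 18.0,
    "LH": 10.0, "LHC": 6.0,
}

dropouts = dropout_shares(example)
print(dropouts, attainment_sums_to_100(example))
```

    A row that fails the sum check, or that yields a negative dropout share, is worth inspecting before use.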

    The data file in the panel dataset format is best avoided because it excludes countries not in Penn World Table 5.0 (e.g. former socialist countries).

    Note that variable SHCODE (numerical country code in Penn World Table 5.0) is different from the one in Penn World Table 5.6.

    A very minor point, but the data entries for USSR/Russia in 1990 seem unreliable. Population seems to refer to USSR while educational attainment figures seem to refer to Russia.

    Papers using this dataset include Acemoglu et al. (2005) and Glaeser et al. (2007).

    For other datasets on average schooling years, see Kyriacou (1991), which is used by Benhabib and Spiegel (1994, JME), and Nehru et al. (1995), which is used by Pritchett (2000).

    See Krueger and Lindahl (2001, JEL) for critical reviews on average schooling year data.

    Infant and Child Mortality

    Many researchers use infant and child mortality data compiled by the World Bank's World Development Indicators, by UNICEF's State of the World's Children, or, for child mortality rates only, by Ahmad, Lopez, and Inoue (2000). According to Ross (2006), the most transparent is UNICEF's (see page 866).

    These international organizations now coordinate in producing infant and child mortality statistics, under the name of the UN Inter-agency Group for Child Mortality Estimation (IGME). See their collection of papers on child mortality estimation.

    Abouharb and Kimball (2007) introduce a dataset on annual infant mortality rates in each country for 1816-2002, filling as many country-year cells as possible with infant mortality data from a variety of sources (I am not sure whether this sacrifices comparability across countries and years). The dataset and the codebook are available at (look for the last link for 2007 (vol. 44), no. 6). They avoid using the UN Demographic Yearbooks (which actually do provide annual data in their printed version, but not online) as much as possible. They keep a record of which data source is used for each country-year observation. It turns out that 41 percent of observations after 1950, mainly for developing countries, come from the US Census Bureau's International Data Base. I am not sure why we should trust the US Census Bureau more than the United Nations.

    For poor countries, however, these data may be created by the interpolation of very few data points. See Qian (2015: 303).
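
    The cell-filling approach described above for Abouharb and Kimball (2007) can be sketched as follows. This is not their actual procedure: the priority ordering, the source names, and the values are all hypothetical. The sketch only shows how one might fill country-year cells from several sources while recording where each value came from.

```python
def fill_with_provenance(sources, priority):
    """Fill each (country, year) cell from the first source in `priority` that has it.

    Returns {(country, year): (value, source_name)}.
    """
    filled = {}
    for name in priority:
        for cell, value in sources[name].items():
            # Earlier sources in `priority` win; missing values are skipped.
            if cell not in filled and value is not None:
                filled[cell] = (value, name)
    return filled

# Hypothetical source tables and values (not real mortality figures).
sources = {
    "source_a": {("X", 1950): 120.0, ("X", 1951): None},
    "source_b": {("X", 1950): 118.0, ("X", 1951): 115.0, ("Y", 1950): 90.0},
}

panel = fill_with_provenance(sources, priority=["source_a", "source_b"])
share_from_b = sum(1 for _, src in panel.values() if src == "source_b") / len(panel)
print(panel, share_from_b)
```

    Keeping the source name next to each value makes it possible to compute shares by source, as in the 41 percent figure above, and to test whether results are sensitive to any single source.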

    Tuesday, April 4, 2017


    Tucker et al. (2005) compile global NDVI data, based on the AVHRR satellite sensors, for each 8 km square cell at a bimonthly frequency between July 1981 and December 2004, improving on the methodology of previous attempts.

    • The data is supposed to be available here, but as of April 2017, it's out of service.

    The NDVI data for 2001-2006, based on MODIS satellite sensors (better data quality than AVHRR but only for more recent periods), is available here.

    Historical land use datasets

    One of the first attempts to compile historical land use datasets is Ramankutty and Foley (1999), who focus on the fraction of area used for agricultural cultivation in each 5 x 5 arc-minute cell across the world.

    • Downloadable here.
    • The sample period: 1700-1992
    • The 1992 data is based on their own 1992 Croplands Dataset, with a few revisions (see section 2 of the paper)
    • Using historical cropland area statistics at the national level (or at the subnational level for 8 large countries) from FAOSTAT and other sources, the 1992 data is then extrapolated backwards.
    • The extrapolation assumes that the spatial distribution of cropland within a country (or a sub-national region where historical data is available) has remained the same throughout the sample period.
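
    The fixed-spatial-pattern assumption in the last bullet can be illustrated with a simplified sketch: each cell's base-year cropland fraction is scaled by the ratio of historical to base-year national cropland area. This is my own simplification and not the authors' code. The function name, the numbers, and the cap at 1.0 are assumptions of the sketch.

```python
def backcast_cell_fractions(fractions_base, national_area_by_year, base_year=1992):
    """Scale base-year cell cropland fractions by national cropland relative to the base year.

    The within-country spatial pattern is held fixed. Fractions are capped at 1.0,
    which is an assumption of this sketch, not a documented step.
    """
    base_total = national_area_by_year[base_year]
    result = {}
    for year, total in national_area_by_year.items():
        ratio = total / base_total
        result[year] = [min(1.0, f * ratio) for f in fractions_base]
    return result

# Hypothetical country with three cells and made-up national cropland totals.
fractions_1992 = [0.8, 0.4, 0.0]
national_area = {1992: 100.0, 1900: 50.0, 1700: 25.0}

history = backcast_cell_fractions(fractions_1992, national_area)
print(history[1900], history[1700])
```

    Under this assumption a cell with no cropland in 1992 has none in any earlier year, which is one reason later studies try to relax it.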

    There are several subsequent attempts to improve historical land use data. Below are a few examples:

    • Pongratz et al. (2008) extend the analysis back to the year 800 by using historical population data.
    • Meiyappan and Jain (2012) start with the construction of the land cover map for the year 1765 and then estimate land use change in subsequent years, with satellite data used for validation over the past few decades.
    • HYDE 3.1 (Click the link to jump to another post in this blog)

    1992 Croplands Dataset

    Compiled by Ramankutty and Foley (1998).

    • They start with the DISCover land cover dataset, which classifies each of the 1 x 1 km cells into one of several land use types, based on monthly NDVI data from March of 1992 through February of 1993.
    • These classifications are then regrouped into six categories: (0) other vegetation, (1) other vegetation with crops, (2) other vegetation/crop mosaic, (3) crop/other vegetation mosaic, (4) crops with other vegetation, and (5) crops.
    • The fraction of a cell used for agriculture is calibrated for each of these six labels, to match with the national-level total cropland area from FAOSTAT and other sources.
    • The 1 x 1 km cells are aggregated into the 5 x 5 arc-minute cells.
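
    The final aggregation step can be sketched as a block average of fine-cell cropland fractions. This is a simplified illustration that assumes equal-area fine cells and a coarse cell made of exactly k x k fine cells. Real 1 km cells do not nest neatly into 5 arc-minute cells, so an actual implementation would need reprojection or area weights. The grid values and `k` are hypothetical.

```python
def aggregate_blocks(grid, k):
    """Average a fine grid of cropland fractions over non-overlapping k x k blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % k or cols % k:
        raise ValueError("grid dimensions must be multiples of k")
    coarse = []
    for r in range(0, rows, k):
        coarse_row = []
        for c in range(0, cols, k):
            # Collect the k*k fine cells that fall inside this coarse cell.
            block = [grid[i][j] for i in range(r, r + k) for j in range(c, c + k)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse

# Hypothetical 4 x 4 grid of fine-cell cropland fractions.
fine = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 1.0, 1.0],
    [0.5, 0.5, 1.0, 1.0],
]

coarse = aggregate_blocks(fine, k=2)
print(coarse)
```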

    Downloadable here.

    Ramankutty and Foley (1999) update this dataset by using alternative data sources etc., as part of their "Historic Croplands Dataset, 1700-1992."