Saturday, September 2, 2017

Ultra-violet Radiation across the world

"NASA produces daily satellite-based data for ambient UV-R. The UV index captures the
strength of radiation at a particular location, and it is available in the form of geographic grids
and daily rasters with pixel size of 1° latitude by 1° longitude." (Andersen et al. (2016), p. 1339)

Used by Andersen et al. (2016), who show that the country-level average of the 1990 and 2000 values correlates with per capita income in 2004, conditional on latitude and many other controls.
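
A minimal sketch of this kind of country-level aggregation, assuming the gridded UV values have already been read into dictionaries keyed by (latitude, longitude) and each cell has been assigned to a country. The cell values, the country codes, and the unweighted mean over cells are all illustrative assumptions, not the procedure of the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 1-degree cells: (lat, lon) -> UV index, for two years.
uv_1990 = {(0, 30): 10.0, (1, 30): 12.0, (50, 10): 4.0}
uv_2000 = {(0, 30): 11.0, (1, 30): 13.0, (50, 10): 5.0}

# Hypothetical assignment of each cell to a country code.
cell_country = {(0, 30): "AAA", (1, 30): "AAA", (50, 10): "BBB"}


def country_mean_uv(year_a, year_b, cell_to_country):
    """Average two years cell by cell, then average cells within a country."""
    by_country = defaultdict(list)
    for cell, country in cell_to_country.items():
        by_country[country].append((year_a[cell] + year_b[cell]) / 2)
    return {country: mean(values) for country, values in by_country.items()}


uv_by_country = country_mean_uv(uv_1990, uv_2000, cell_country)
```

Note that 1° cells cover less area at higher latitudes, so a real aggregation would probably weight cells by area (or population) rather than take a simple mean.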

Friday, August 11, 2017

World Migration Matrix, 1500-2000

Constructed by Louis Putterman and his colleagues.

For each of 165 countries, the data provides the shares of the current (as of 2000) population's ancestors living in the area of each of today's 172 countries back in around 1500.

See this page for details. See also Putterman and Weil (2010).

Used by Andersen et al. (2016), among others.

Michalopoulos (2012) uses this data to identify countries in which more than 40% of the current population can trace its ancestry to within the same country boundaries in 1500 AD. For these countries, variation in agricultural suitability and elevation is found to be a strong predictor of ethnic diversity.
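
The 40% rule amounts to reading off the own-country entry of the matrix for each country. A minimal sketch, assuming the matrix has been loaded as a nested dictionary; the country codes and shares below are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical excerpt of the matrix: for each current country, the share of
# its year-2000 population's ancestors who lived in each source country in 1500.
migration_matrix = {
    "AAA": {"AAA": 0.90, "BBB": 0.10},
    "BBB": {"AAA": 0.25, "BBB": 0.35, "CCC": 0.40},
    "CCC": {"CCC": 0.41, "AAA": 0.59},
}


def indigenous_share(matrix, country):
    """Share of ancestors who lived within the country's own present borders."""
    return matrix[country].get(country, 0.0)


def mostly_indigenous(matrix, threshold=0.40):
    """Countries whose own-country ancestry share exceeds the threshold."""
    return sorted(c for c in matrix if indigenous_share(matrix, c) > threshold)


selected = mostly_indigenous(migration_matrix)
```

Looking entries up by country code rather than by position matters here, because the matrix is not square (165 rows by 172 source countries).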

Sunday, July 23, 2017

Administrative Boundaries

There are several GIS datasets on administrative boundaries.

GADM is the most popular among economists. My experience with both GADM and GAUL suggests that GADM is indeed more trustworthy than GAUL.
  • GAUL's parish (fourth-level administrative boundary) data for Uganda shows multiple polygons for the same parish name. This is not the case for GADM.
  • GAUL's coastline south of Baku in Azerbaijan is drawn where elevation (according to SRTM30: see this post) is above zero meters, while GADM's coastline is drawn where elevation changes from positive to negative values.
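
The duplicate-polygon problem in the first point can be checked with a simple count over the attribute table. A minimal sketch, assuming the attribute rows have already been exported from the GIS layer as dictionaries; the field names and values are hypothetical, not the actual GAUL or GADM field names.

```python
from collections import Counter

# Hypothetical attribute rows from a parish-level layer: one row per polygon.
polygons = [
    {"district": "D1", "parish": "P1"},
    {"district": "D1", "parish": "P2"},
    {"district": "D1", "parish": "P1"},  # same parish name, second polygon
    {"district": "D2", "parish": "P1"},  # same name, different district
]


def duplicated_units(rows, keys=("district", "parish")):
    """Return each key combination that labels more than one polygon."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {unit: n for unit, n in counts.items() if n > 1}


duplicates = duplicated_units(polygons)
```

Counting on the full hierarchy of names (here district plus parish) avoids flagging parishes that merely share a name across districts. A flagged unit still needs to be inspected by eye, since a genuine multi-part unit (for example, one with islands) may be stored as several rows.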

National Boundaries

the World Vector Shoreline (WVS)
  • If you're interested in the historical national boundaries since 1945

Sub-national Boundaries

Global Administrative Areas (GADM)
  • an alternative to GAUL. Whether it is better or worse is not clear. 
  • mentioned by Gleditsch and Weidmann (2012) in their review of spatial data analysis in political science.
  • Used by the Gridded Population of the World Version 4 (see here).
  • Used by Dreher et al. (2015).
    • They mention that GADM does not include the second level administrative boundaries (counties/districts) for Egypt, Equatorial Guinea, Lesotho, Libya, and Swaziland.
  • Also used by Alesina et al. (2016) to measure inequality across subnational administrative regions (which turns out to be negatively correlated with per capita GDP).
Global Administrative Unit Layers (GAUL)
  • Supposedly annual panel data from 1990 onward, but district boundary changes are properly tracked only for some countries.
  • Used by Briggs (2015).

The Second Administrative Level Boundaries (SALB) dataset

  • compiled by the United Nations
  • provides the GIS data on second-tier subnational administrative boundaries (i.e., district boundaries).
  • I'm not sure whether the GAUL dataset mentioned above incorporates this, or whether the SALB dataset relies on its own original data.
  • For subnational boundary changes during early years
  • This is the online updated version of the book Administrative Subdivisions of Countries by Gwillim Law (Jefferson, North Carolina: McFarland & Company, 1999).
  • provides the list of administrative regions for every country, past and present. Very useful if you need to match different sub-national or micro datasets based on sub-national regions, especially when a country of your interest has changed the boundaries of sub-national regions quite frequently such as Nigeria and Uganda.

Sunday, July 16, 2017

Raw Material Data (RMD)

Annual panel data on mines around the world since 1980, compiled by IntierraRMG, now part of S&P Global. See

According to Berman et al. (2017), the dataset includes variables such as
  • whether a mine is active, 
  • the year production started, 
  • the specific minerals produced, 
  • the total production for each of them
  • ownership structure and characteristics of the mines
  • extraction methods

  • small-scale mines, and those that are illegally operated, are not included.

    Used by Berman et al. (2017) (and see their footnote 12).

    Monday, July 3, 2017

    Cross-country education datasets

    Penn World Tables 9.0

    See this document, which reviews the academic debate on the quality of the Barro-Lee dataset.

    Barro-Lee dataset

    A well-known dataset on average years of schooling (i.e. stock of human capital).

    The 2010 updated version is now available at

    For details on the data construction, read Robert J. Barro and Jong-Wha Lee, "International Data on Educational Attainment: Updates and Implications" (CID Working Paper No. 42, April 2000). This 2000 paper is an updated version of Barro and Lee (1993). Both papers compare various measures of human capital.

    Average years of schooling are available for six population groups: male over 25, female over 25, all over 25, male over 15, female over 15, all over 15.

    Population over the age of 15 "corresponds better to the labor force for many developing countries." (Barro and Lee 2000, p.2)

    Percentages of those who attained/completed each level of school in the total/male/female population are also available. Note that the sum of variables LU, LP, LS, and LH is 100; Lx-LxC, where x is either P, S, or H, is the percentage of those dropping out before completing primary, secondary, or higher school, respectively. In other words, the percentage for "... school attained" includes the percentage for "... school complete".
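
These two identities are easy to check once the data is loaded. A minimal sketch, assuming one country-year record has been read into a dictionary keyed by the variable names; the numbers are made up, the function names are hypothetical, and the reading of LU as "no schooling" is my understanding of the codebook rather than something stated here.

```python
# Hypothetical record for one country-year; values are percentages of the
# population. Variable names follow the text: LU, LP, LS, LH and LPC, LSC, LHC.
record = {
    "LU": 20.0,               # no schooling (assumed meaning of LU)
    "LP": 40.0, "LPC": 25.0,  # primary attained / completed
    "LS": 30.0, "LSC": 18.0,  # secondary attained / completed
    "LH": 10.0, "LHC": 6.0,   # higher attained / completed
}


def attainment_sums_to_100(rec, tol=0.5):
    """Check LU + LP + LS + LH == 100, allowing for rounding in the data."""
    return abs(rec["LU"] + rec["LP"] + rec["LS"] + rec["LH"] - 100.0) <= tol


def dropout_shares(rec):
    """Lx - LxC: share who attained level x but did not complete it."""
    return {x: rec["L" + x] - rec["L" + x + "C"] for x in ("P", "S", "H")}


is_consistent = attainment_sums_to_100(record)
dropouts = dropout_shares(record)
```

The tolerance is there because published percentages are rounded, so the four shares may not add up to exactly 100.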

    Downloadable at this page by Center for International Development at Harvard University (CID).

    The data file in the panel dataset format is best avoided because it excludes countries not in Penn World Table 5.0 (e.g. former socialist countries).

    Note that variable SHCODE (numerical country code in Penn World Table 5.0) is different from the one in Penn World Table 5.6.

    A very minor point, but the data entries for USSR/Russia in 1990 seem unreliable. Population seems to refer to USSR while educational attainment figures seem to refer to Russia.

    Papers using this dataset include Acemoglu et al. (2005) and Glaeser et al. (2007).

    For other datasets on average schooling years, see Kyriacou (1991), which is used by Benhabib and Spiegel (1994, JME), and Nehru et al. (1995), which is used by Pritchett (2000).

    See Krueger and Lindahl (2001, JEL) for a critical review of data on average years of schooling.

    Infant and Child Mortality

    Many researchers use infant and child mortality data compiled by the World Bank's World Development Indicators, by UNICEF's State of the World's Children, or, for child mortality rates only, by Ahmad, Lopez, and Inoue (2000). According to Ross (2006), the most transparent is UNICEF's (see page 866).

    These international organizations now coordinate in producing infant and child mortality statistics, under the name of the UN Inter-agency Group for Child Mortality Estimation (IGME). See their collection of papers on child mortality estimation.

    Abouharb and Kimball (2007) introduce a dataset on annual infant mortality rates in each country for 1816-2002, filling as many country-year cells as possible with infant mortality data from a variety of sources (I am not sure whether this sacrifices comparability across countries and years). The dataset and the codebook are available at (look for the last link for 2007 (vol. 44), no. 6). They avoid using the UN Demographic Yearbooks (which do provide annual data in their printed version, but not online) as much as possible. For each country-year observation, they record which data source is used. It turns out that 41 percent of observations after 1950, mainly for developing countries, come from the US Census Bureau's International Data Base. I am not sure why we should trust the US Census Bureau more than the United Nations.
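
To make the "fill each cell from several sources, and record the source" idea concrete, here is a minimal sketch. The source names, their ordering, and the values are hypothetical; this is an illustration of the general approach, not Abouharb and Kimball's actual procedure.

```python
# Hypothetical source names, in order of preference, and hypothetical values.
SOURCE_PRIORITY = ["source_a", "source_b", "source_c"]

raw = {
    "source_a": {("AAA", 1950): 120.0},
    "source_b": {("AAA", 1950): 125.0, ("AAA", 1951): 118.0},
    "source_c": {("BBB", 1950): 90.0},
}


def fill_cells(sources, priority):
    """For each country-year, keep the first available value and its source."""
    filled = {}
    for name in priority:
        for cell, value in sources.get(name, {}).items():
            # setdefault keeps the value from the higher-priority source.
            filled.setdefault(cell, (value, name))
    return filled


def source_share(panel_data, name):
    """Fraction of filled cells that come from the given source."""
    return sum(1 for _, src in panel_data.values() if src == name) / len(panel_data)


panel = fill_cells(raw, SOURCE_PRIORITY)
```

Keeping the source next to each value is what makes a figure like the 41 percent above computable, and it lets a user drop or flag observations from a source they do not trust.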

    For poor countries, however, these data may be created by the interpolation of very few data points. See Qian (2015: 303).
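
To see why this matters, here is a minimal sketch of linear interpolation between two observed years. The numbers are made up, and the agencies' actual estimation methods are more elaborate than this; the point is only that an "annual" series can rest on very few real observations.

```python
def linear_interpolate(known, years):
    """Fill each year by linear interpolation between the nearest known points."""
    points = sorted(known.items())
    series = {}
    for year in years:
        if year in known:
            series[year] = known[year]
            continue
        # Find the pair of observed years that brackets this year.
        for (y0, v0), (y1, v1) in zip(points, points[1:]):
            if y0 < year < y1:
                series[year] = v0 + (v1 - v0) * (year - y0) / (y1 - y0)
                break
    return series


# Hypothetical: only two survey-based observations, ten years apart.
observed = {1990: 100.0, 2000: 80.0}
annual = linear_interpolate(observed, range(1990, 2001))
```

Here eleven annual values are produced from two data points, so year-to-year variation in the series carries no information beyond the two endpoints.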

    Tuesday, April 4, 2017


    Tucker et al. (2005) compile global NDVI data, based on the AVHRR satellite sensors, for each 8 km square cell at a bimonthly frequency between July 1981 and December 2004, improving on the methodology of previous attempts.

    • The data is supposed to be available here, but as of April 2017, it's out of service.

    The NDVI data for 2001-2006, based on MODIS satellite sensors (better data quality than AVHRR but only for more recent periods), is available here.