Census Demo powered by R and Shiny

2022-12-20

Introduction

Click here for all the source code for this demo

The American Community Survey (ACS) is a demographic program of the United States Census Bureau. It is designed to help government officials, businesses, and community leaders understand population changes taking place across the American territory. The ACS includes social, economic, housing, and demographic characteristics organized in 1-year or 5-year datasets (U.S. Census Bureau, 2017). A sample of ACS data, known as the Public Use Microdata Sample (PUMS), provides information from individual people at a fine level of geographic detail. The smallest geographic area for PUMS is the Public Use Microdata Area (PUMA).

Here I present the Census Demo, a web app for the visualization of Ohio age data taken from ACS-5 PUMS datasets. The app was developed on the Shiny platform, for which I wrote code in R, JavaScript, and CSS. The source code for the app is available on my GitHub repository and the live demo is deployed on my personal laptop, available to the public through the link at the top of this page; here it is again.

In this article I describe the general architecture of the web app, the data model, aggregation, and deployment.

Presentation of the web app

The web app is inspired by the work of the Scripps Gerontology Center (Mehri et al., 2020). In line with that work, this app is a demonstration tool that visualizes and summarizes demographic trends across the territory of Ohio, which can inform legislation and executive work related to topics like aging and migration. The app consists of two main sections: an Interactive map tab and a Data explorer tab. The interactive map presents PUMS age data for Ohio distributed over PUMAs, which are colored according to user-selected ages, statistics, and the year of the dataset. The Data explorer presents the same selection of data organized in a table.

The interactive map can be zoomed and panned. It comes with a movable panel that succinctly explains the type of data being displayed, then shows controls to indicate the desired age range, dataset year, and statistic to be visualized. The final control is a checkbox that adds the full names of the PUMAs to the map. These names, together with the yearly statistics, are shown as tooltips when the user hovers over a PUMA on the map.

At the bottom of the panel, a histogram helps summarize the statistic shown in the map. The unit of analysis of the histogram is the PUMA, as can be seen in the vertical axis of the histogram. In the horizontal axis, the histogram shows the values of the selected statistic, partitioned into the same bins that are shown in the map legend. In other words, the histogram counts the number of PUMAs that fall into each bin.
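The binning idea can be sketched in a few lines of base R. This is only an illustration, not the code used in the app: the values in `stat_by_puma` and the edges in `legend_breaks` are made up, and the choice of left-closed bins is an assumption.

```r
# Hypothetical statistic values, one per PUMA (for example, mean age).
stat_by_puma <- c(36.2, 38.9, 41.5, 39.7, 44.1, 37.3, 42.8, 40.0)

# Hypothetical bin edges, assumed to be the same breaks used by the map legend.
legend_breaks <- c(35, 38, 41, 44, 47)

# Assign each PUMA to a left-closed bin, e.g. [35,38), then count PUMAs per bin.
bins <- cut(stat_by_puma, breaks = legend_breaks, right = FALSE)
puma_counts <- table(bins)

print(puma_counts)
```

Because the histogram and the legend share the same breaks, every bar in the histogram corresponds to exactly one color on the map.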

As an additional control, the dataset year slider comes with a Play button to the right, which starts an animation of the evolution of the map over the years.

The Data explorer tab comes with a brief caption that reminds the user of the selections that were made in the map panel, followed by the same data presented in the form of a searchable table. Each row in the table is a PUMA, indicated by number and name, and each column represents a version of the PUMS dataset, indicated by the publication year. The cell data is the statistic for the given PUMA and year.

Platform selection: Why R?

A wide variety of tools are available for the interactive visualization of demographic data on a map. The selection of tools may depend on technical aspects like speed, memory consumption, and online availability, but it may also depend on human factors like the familiarity of the developer with the tools, or the interoperability among the tools. This section discusses some advantages and disadvantages that the selected software may have for the purposes of my web app.

At the core of the app is the R statistical software. R is one of the most commonly used programming languages for data science. R relies on community-contributed packages to mine data and present it on a wide variety of platforms. The web app combines R’s statistical features with the tidycensus package to download the PUMS data, the data.table package for data processing, and the Shiny package for web publishing, among others.

An advantage of R for the purposes of the web app is the integration of workflows through contributed packages. For example, preparation of PUMS datasets could be accomplished with a separate workflow, such as downloading the files from the Census Bureau website and preparing them in spreadsheet software such as Excel. A better alternative, which is used here, is the fetch-dataset.R script, which fetches the datasets, curates them, and saves them into a file to be used by the app. This integrated approach keeps data preparation and app development in a single, scriptable workflow.

Most importantly, the integrated approach removes the need to develop translational steps from one platform to the next. In our Excel example, we would need to save the curated data into an intermediate format such as CSV, then load it into R, and then save it into a format suitable for the mapping platform. These intermediate, manual steps are eliminated by R’s integration of technologies; the developer only needs to think in terms of R objects when preparing the datasets and writing the app.

Also, thanks to R’s integration, we can use the same data.table package both in the fetch-dataset.R script and in the app. In the script, the PUMS datasets are combined into a data.table object called facd$dataset. In the app, this object is filtered according to the user’s desired age and publication year. Both the combination and the filtering are accomplished with functions from the data.table package, which facilitates development: the combination step and the filtering step share one data model and one set of library functions.

Another advantage of R is the use of the same language for most steps of the development process: downloading the PUMS datasets is done with R code; curating the data is also done with R code; the webpage is written in R code; same goes for creating the map, the histogram, and the data explorer table. Having one language cover most of the process helps keep the technologies out of the way so that the developer can focus on telling the story through the data.

Another advantage of R is the availability of multiple alternative platforms for an application. For example, the map of my web app is implemented using the Leaflet library for JavaScript (Agafonkin et al., 2011), which is available in R using the leaflet package (Cheng et al., 2022). That is only one of the multiple platforms that R can use to build maps. An alternative is the mapboxapi package, which implements an interface between R and the Mapbox platform (Walker et al., 2022). In another example, an alternative to the data.table package would be plyr. I personally prefer data.table because it is written in the C programming language at its core, which gives it consistently higher speeds and efficient memory usage. See this question on Stack Overflow for a more detailed discussion.

However, R and the selection of tools for this web app come with disadvantages. One disadvantage is the speed of animation and, more generally, the fluidity of the app interface. There is a delay on the order of a second between the time a selection is made in a slider and the corresponding map update. A similar delay is seen when the animation is activated. The delay comes from the combination of multiple bottlenecks that might be difficult to address; one bottleneck could be from portions of code that are interpreted in the R language, and another could come from portions of code that are interpreted in the JavaScript language. Even though the data.table package speeds up the data aggregation and organization processes, this is not enough to compensate for the bottlenecks.

In a related disadvantage, contributed packages often do not expose everything that the underlying libraries offer. For example, the Leaflet library supports plugins and animations. However, R’s leaflet package does not offer functionality to incorporate arbitrary plugins into an R-leaflet project, or ready-made animation functions that can be invoked from R code. A proposal to address this limitation was made by Edwin de Jonge in a 2018 contribution to the leaflet package. Even though the package is managed by a reputable company that is well known in data science, the contribution has not been reviewed, commented on, or accepted by the company yet. Nevertheless, the contribution has already been adopted by users of the package, and it is also used for the animations in this web app.

Finally, a disadvantage of the R-shiny platform is the absence of a first-class debugging solution. This is because the web app uses tools from multiple technologies and languages. More generally, the company behind Shiny acknowledges the difficulty in debugging any kind of web application developed in R for web publication.

This section has discussed some advantages and disadvantages of the R platform while mentioning some of the technologies that are involved in the web app. In the next few sections, I try to paint a more complete picture of the app and the packages it uses.

Data preparation

I initially intended the app to be a fully integrated solution. It would take charge of fetching data from the U.S. Census Bureau in real time while filtering and presenting it in the map. Full integration may have showcased the potential of community-contributed packages for live data science. However, during development I found out that fetching even a small subset of data takes on the order of tens of minutes, which would make live fetching or caching census data impractical. Instead, the finished app implements a separate step of data preparation through the fetch-dataset.R script, and saves the prepared data to the facd.RData file.

The script uses the tidycensus package to fetch PUMS data from the Census website; the data.table package for census data organization, integration, and aggregation; the sf package for map data organization; and the stringr package for textual processing of the PUMA identifiers.

The fetch-dataset.R script starts with a few lines of debugging and house-keeping code (cat(), stopifna()). Then, the script opens the geographic files (shapefiles) for the PUMAs and transforms them into the appropriate coordinate system (WGS84). Next, the script builds helper objects (puma_helper, year_puma_helper) to handle corner cases such as PUMAs that have no respondents after filtering, and to explicitly avoid mixing up PUMAs when they are being colored.
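As a rough illustration of the coordinate transformation step, the sketch below reads a shapefile with the sf package and reprojects it to WGS84 (EPSG code 4326). It is not the code from fetch-dataset.R: it uses the North Carolina sample shapefile that ships with sf as a stand-in for the Ohio PUMA files, and the variable names are hypothetical.

```r
library(sf)

# Stand-in shapefile bundled with the sf package (North Carolina counties).
# The real script opens the PUMA shapefiles for Ohio instead.
shape_path <- system.file("shape/nc.shp", package = "sf")
areas <- st_read(shape_path, quiet = TRUE)

# Reproject to WGS84 (EPSG:4326), the longitude/latitude system
# that Leaflet expects for its input data.
areas_wgs84 <- st_transform(areas, 4326)

# The attribute table is untouched; only the geometry coordinates change.
print(st_crs(areas_wgs84)$epsg)
```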

The code then proceeds to the most important, time-consuming task: downloading the PUMS data from the Census website, for which it uses the get_pums() function from the tidycensus package. The function must be invoked separately for each PUMS publication year. Even though only age data is being requested through the function, the whole process can take on the order of tens of minutes. After all the downloads are finished, the rbindlist() function from the data.table package is used to combine all the downloads into one PUMS dataset.
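The download-then-combine pattern might look roughly like the sketch below. To keep it runnable offline and without a Census API key, the download is replaced by a hypothetical stub, `fetch_year()`, that fabricates a tiny table; the years, column names, and values are assumptions for illustration only.

```r
library(data.table)

# Hypothetical stand-in for the per-year download. In the real script this
# role is played by tidycensus::get_pums(), called once per publication year.
fetch_year <- function(year) {
  data.table(
    Year   = year,
    AreaId = c(100L, 200L, 200L),
    Age    = c(34L, 71L, 8L)
  )
}

years <- 2017:2019  # hypothetical publication years

# One download per year, collected in a list...
downloads <- lapply(years, fetch_year)

# ...then stacked into a single table with rbindlist().
dataset <- rbindlist(downloads)

# Sort the combined table by area and year, as setkey() does.
setkey(dataset, AreaId, Year)

print(dataset)
```

Collecting the per-year tables in a list and binding them once at the end avoids growing a table inside a loop, which is usually slower.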

The script then proceeds to sort the dataset (setkey()), check that all data is valid (stopifna()), perform a couple of consistency tests (dataset_test0, dataset_test1), and save a quick summary of the age data (age_range). The script ends with a table (properties) that organizes the labels for the different statistics offered in the app, and then saves the completed data to facd.RData.

Data model

The fetch-dataset.R script is focused on building the facd object, which consists of the map data (sf), the helper objects (puma_helper, year_puma_helper), the PUMS dataset (dataset), a summary of the age data (age_range), and a properties table. This section outlines the structure of these variables.

The map data sf passes from the file system to the leaflet package almost unchanged; the only curation work is the translation of the coordinates into the WGS84 system. The app does not look into any of its variables; the only data fields that are needed by my code are the PUMA ID, which is matched to the PUMS dataset, and the name of the PUMA, which is used in tooltips on the map.

The helper objects ensure that the filtered, aggregated data produced by the app are well organized and valid. For example, a user may move the sliders in a way that causes some PUMAs to be devoid of respondents after filtering. At the aggregation step, operations like counts, means, or medians will skip empty PUMAs, which are then missing from the result. The helpers make sure empty PUMAs are marked with NAs instead, to avoid a mixup in the coloring step. The concrete instruction is a data join, in app.R, which is marked by an on = parameter that can be easily found with a code editor.
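The effect of such a join can be illustrated with a small, made-up example. The table contents and the MeanAge column are hypothetical, and the helper here is reduced to a single AreaId column; the actual join in app.R may differ in its details.

```r
library(data.table)

# Hypothetical helper: one row per PUMA that must appear in the output.
puma_helper <- data.table(AreaId = c(100L, 200L, 300L))

# Hypothetical responses; PUMA 300 only has a young respondent.
responses <- data.table(
  AreaId = c(100L, 100L, 200L, 300L),
  Age    = c(67L, 72L, 80L, 12L)
)

# Filtering to ages 65+ leaves PUMA 300 with no rows, so the grouped
# aggregation drops it from the result without any warning.
agg <- responses[Age >= 65, .(MeanAge = mean(Age)), keyby = AreaId]

# Joining onto the helper returns one row per helper row,
# so the missing PUMA comes back with MeanAge = NA.
complete <- agg[puma_helper, on = "AreaId"]

print(complete)
```

With one row guaranteed per PUMA, the colors computed from the result line up with the polygons on the map, and the NA can be given its own neutral color.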

The dataset object contains approximately 3 million rows. Each row is a response to the ACS. There is an Area column of character type, which is used for matching the data to the sf object. There is an AreaId column with a numerical version of Area, which is used for sorting, organizing and aggregating the data (by =). There is a Year column, with the year of publication of the data point, and the Age column.

Data aggregation and processing

Data aggregation in the app is accomplished by data.table through the functionality of the by = and keyby = parameters in app.R. All aggregation work takes place in the age_sel_reactive() function, which is triggered whenever the user chooses a statistic or an age range on the app. The function makes sure that the chosen range is valid, then performs the aggregation based on the selected statistic. Then, it calculates the range of the data, histograms for all publication years, and the colors that should be assigned to the PUMAs on the map. Finally, the function organizes the tooltips, renders the data explorer (renderDataTable()), and creates the description for the map.
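A simplified version of the aggregation step could look like the sketch below. The function `aggregate_ages()`, the lookup list `stat_funs`, and the sample data are all hypothetical; the sketch only illustrates validating the range and then grouping with keyby =, not the actual body of age_sel_reactive().

```r
library(data.table)

# Tiny made-up dataset with the columns described in the data model.
dataset <- data.table(
  AreaId = c(100L, 100L, 100L, 200L, 200L, 200L),
  Year   = c(2018L, 2018L, 2019L, 2018L, 2019L, 2019L),
  Age    = c(30L, 50L, 40L, 70L, 20L, 60L)
)

# Hypothetical lookup from the name of a statistic to the function behind it.
stat_funs <- list(count = length, mean = mean, median = median)

aggregate_ages <- function(dt, age_min, age_max, stat) {
  # Make sure the chosen range and statistic are valid before aggregating.
  stopifnot(age_min <= age_max, stat %in% names(stat_funs))
  f <- stat_funs[[stat]]

  # Filter by age, then compute the statistic per PUMA and year.
  # keyby = groups the rows and also sorts the result by the grouping columns.
  dt[Age >= age_min & Age <= age_max,
     .(Value = as.numeric(f(Age))),
     keyby = .(AreaId, Year)]
}

result <- aggregate_ages(dataset, 25, 75, "mean")
print(result)
```

Computing the statistic for all years in one pass means that moving the year slider only requires filtering an already aggregated table.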

The histograms are implemented in ggplot2 (Wickham et al., 2022). They are saved as a grammar object in memory, and are only rendered as the final step, which makes them reactive to any changes in UI size.
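The separation between building a plot and drawing it can be shown with a small sketch. The data, breaks, and axis labels are made up, and the left-closed bins are an assumption; in the app, the drawing step is done by Shiny rather than by an explicit print().

```r
library(ggplot2)

# Hypothetical statistic values, one row per PUMA.
puma_stats <- data.frame(
  MeanAge = c(36.2, 38.9, 41.5, 39.7, 44.1, 37.3, 42.8, 40.0)
)

# Hypothetical bin edges, assumed to match the map legend.
legend_breaks <- c(35, 38, 41, 44, 47)

# Building the plot only creates a description of it; nothing is drawn yet.
hist_plot <- ggplot(puma_stats, aes(x = MeanAge)) +
  geom_histogram(breaks = legend_breaks, closed = "left") +
  labs(x = "Mean age", y = "Number of PUMAs")

# Drawing happens only when the object is printed. Because the size is
# decided at this point, the same object can be redrawn at any size.
print(hist_plot)
```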

The data explorer is implemented using the DataTables library for JavaScript (SpryMedia, 2008), which is available in R using the DT package (Xie et al., 2022).

The age_sel_reactive() function is called once in the app code, inside an observe() block. While age_sel_reactive() performs all of its work across publication years, the block is in charge of filtering such data according to the year that is selected by the app user. With the finalized selection, the block updates the map color, legends, tooltips, and description. At the end of the block, a renderPlot() instruction from the shiny package renders the histogram that corresponds to the selected year.
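The division of labor between the reactive function and the observe() block can be sketched with a toy Shiny app. Everything in it is hypothetical (the input names, the fake Count column, the text output instead of a map and histogram); it only illustrates the pattern of one reactive that covers all years and one observer that narrows the result to the selected year.

```r
library(shiny)

ui <- fluidPage(
  sliderInput("age", "Age range", min = 0, max = 99, value = c(65, 99)),
  sliderInput("year", "Dataset year", min = 2017, max = 2019,
              value = 2017, step = 1, sep = ""),
  verbatimTextOutput("summary")
)

server <- function(input, output, session) {
  # Recomputed only when the age range changes; covers all years at once.
  # The Count values are fabricated stand-ins for a real aggregation.
  age_sel <- reactive({
    data.frame(
      Year  = 2017:2019,
      Count = (input$age[2] - input$age[1] + 1) * 1:3
    )
  })

  # Re-runs when the reactive result or the selected year changes,
  # and only then narrows the data down to one year.
  observe({
    all_years <- age_sel()
    selected  <- all_years[all_years$Year == input$year, ]
    output$summary <- renderText(paste("Count:", selected$Count))
  })
}

app <- shinyApp(ui, server)
```

Running `runApp(app)` in an interactive session starts the toy app. Moving the year slider re-runs only the observer, while the cached result of the reactive is reused.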

Installation and deployment

Installation of the web app took place on my laptop, where I am hosting a few platforms. I installed the RStudio Server using the official instructions and then cloned the web app directly from the GitHub repository. I proxied the connection through my Apache server to make sure the web app is available to the public via HTTPS.

Conclusion

The web app showcases the integration of technologies that R-shiny offers for the deployment of web apps for data science. The app also exhibits some of the advantages and disadvantages of this approach. While the app demonstrates the power offered by R for data organization, aggregation, and visualization, it also shows an obvious performance bottleneck. Fluidity in a Shiny app requires clever design and significant optimization efforts.

References

Agafonkin, V., et al. (2011). Leaflet, a JavaScript library for mobile-friendly interactive maps. Available here and here.

Cheng, J., et al. (2022). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. The Comprehensive R Archive Network. Available here and here.

Mehri, N., Cummins, P. A., Sun, N., Nelson, I. M., Wilson, T. L., and Kunkel, S. (2020). Ohio Population Interactive Data Center, Scripps Gerontology Center, Miami University, Oxford, OH. Available here and here.

SpryMedia (2008). DataTables, a plug-in for the jQuery JavaScript library. Available here.

U.S. Census Bureau (2017). American Community Survey Information Guide, Washington, DC. Available here.

Walker, K., et al. (2022). mapboxapi: R Interface to Mapbox Web Services. The Comprehensive R Archive Network. Available here and here.

Wickham, H., et al. (2022). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. The Comprehensive R Archive Network. Available here, here, and here.

Xie, Y., et al. (2022). DT: A Wrapper of the JavaScript Library ‘DataTables’. The Comprehensive R Archive Network. Available here and here.

Credits

Featured photo: Bird flock, cloud, sky by Mohamed Hassan, available here. Licensed under CC0 public domain license, free for personal & commercial use, no attribution required.