NESW Workshop
Telling Science Stories with Code and Data
Project Showcase
The focus of the workshop was a set of story ideas suggested by the participants. We broke up into four groups, each of which tackled one of these stories, searching for online data sources, wrangling the data into a form suitable for analysis, then working to understand and visualize its meaning. Tools included IPython notebooks as well as spreadsheets and a variety of online mapping and graphics services. At the end of the day, each group reported back on what they had discovered and accomplished.
Videography by Lars-Erik Sirén.
The Geography of Solar Energy
Group members: Kat Friedrich, Cara Giaimo, David Holzman, and Kim Krieger
In which states is public opinion most likely to support expansion of solar power – and in which states is that expansion likely to occur? To seek answers to these questions, we created an Excel spreadsheet combining state data from the following two sources:
- Yale’s new map showing public sentiment supporting 20 percent renewable electricity mandates.
- Data from DSIRE about favorable policies for renewable energy (net metering, power purchase agreements, and renewable portfolio standards).
Because the 20 percent renewable electricity mandate covers resources beyond solar power, the opinion data likely reflects support for other technologies as well. We experimented with creating maps from this composite spreadsheet using Python. Our final product for the workshop was a map of Yale’s public opinion data, color-coded in orange.
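The notebook linked below has the full workflow; for a taste of the mapping step, here is a minimal sketch using Plotly Express, which is one of several Python options and not necessarily what our notebook uses. The file and column names are assumptions:

```python
import pandas as pd
import plotly.express as px

# Hypothetical file and column names; check the CSV linked below
# for the real ones.
df = pd.read_csv("solar_energy_states.csv")

# Choropleth of public support for a 20% renewable mandate,
# color-coded in orange to match our final map.
fig = px.choropleth(
    df,
    locations="state",            # two-letter state abbreviations
    locationmode="USA-states",
    color="pct_support_mandate",  # Yale public-opinion estimate
    scope="usa",
    color_continuous_scale="Oranges",
)
fig.show()
```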
Online data sources
- http://environment.yale.edu/poe/v2014
- http://www.dsireusa.org (we manually input the data from several maps)
Code and data
- Solar energy data for U.S. states in CSV format
- IPython notebook: View online via NBViewer or download and run on your computer.
Presentation
Whither Water?
Group members: Marty Downs, Diana Hwang, Noelle Swan, and Jack Vaughan
Water—Issues and Questions
California water is in the headlines. That leads us to ask:
- How much groundwater goes to agriculture versus other uses in California?
- We also wondered: how much water goes to grow feed for livestock overseas?
- We didn’t get there, but we had an interesting day.
Code and data
- California water consumption by county. Data file in tab-separated values (TSV) format.
- IPython notebook for data wrangling. View online via NBViewer or download and run on your computer. (A minimal sketch of this step follows the list.)
- IPython notebook for mapping the data. View online via NBViewer or download and run on your computer.
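For readers without the notebooks handy, here is a minimal pandas sketch of the kind of wrangling involved. The file and column names are assumptions, not the actual headers in our TSV:

```python
import pandas as pd

# "county" and "acre_feet" are hypothetical names; the real headers
# are in the TSV linked above.
water = pd.read_csv("ca_water_by_county.tsv", sep="\t")

# Total consumption per county, largest users first: a quick way to
# see where agricultural counties dominate the ranking.
totals = (
    water.groupby("county")["acre_feet"]
         .sum()
         .sort_values(ascending=False)
)
print(totals.head(10))
```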
Presentation
Download the PowerPoint slides.
In a Talking Data podcast Diana Hwang describes her group’s work on the project.
Secret Studies
Group members: Karen Miller, Carol Morton, Anna Nowogrodzki, and Tom Ulrich
How safe and effective is any given drug? Often, physicians and patients do not really know the answer. Much evidence shows that information is routinely concealed or selectively reported in academic journals. Physicians, patients, policy makers, and health care payers may decide on certain drugs based on incomplete or inaccurate information. Millions of dollars and the health of millions of people are at stake.
That may be changing. On April 14, 2015, the World Health Organization (WHO) called for all clinical trials of new drugs to publicly post their methods and summary results within one year. For such mandatory reporting, the U.S. ClinicalTrials.gov database has been proposed as a quicker and more reliable registry than academic journals.
The first step is to explore the extent to which results from registered clinical trials are publicly disclosed. To carve out a manageable scope, we focused on multiple sclerosis (MS), an inflammatory and neurodegenerative disease of the brain and spinal cord with no cure but with at least a dozen new drugs in the last 20 years.
The original idea was to compare the studies and available results using two major U.S. databases: ClinicalTrials.gov, an extensive list of clinical trials, and PubMed, an extensive collection of published biomedical studies.
Our key questions:
- How many completed clinical trials have publicly available results?
- How are the results made public – summary results on ClinicalTrials.gov, a published study on PubMed, or both?
- How do funding sources affect which results are made public?
- When are results made public?
Online data sources
- https://clinicaltrials.gov
- https://www.ncbi.nlm.nih.gov/pubmed
Code and data
- Excel spreadsheet with data on published and unpublished clinical trials.
Searching ClinicalTrials.gov for “multiple sclerosis” turned up more than 1,321 studies in various stages of being open or closed. Of those, 158 were closed studies with posted results, and 740 were closed with no results. A PubMed search for “multiple sclerosis,” narrowed to the “clinical trial” article type, yielded 3,582 records.
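The PubMed count can be reproduced programmatically through NCBI’s E-utilities API. This is not the method we used at the workshop, just a short sketch of one way to get the same figure today (live counts will have drifted from the 3,582 we saw):

```python
import requests

# Count PubMed records for "multiple sclerosis" restricted to the
# Clinical Trial publication type.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": '"multiple sclerosis" AND clinical trial[pt]',
        "retmode": "json",
    },
)
print(resp.json()["esearchresult"]["count"])
```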
An attempt to use Python to find publications with clinical-trial numbers ran into a snag: the script could not handle non-ASCII characters, such as the trademark symbol, so the data needed cleaning first. It was also difficult to keep track of the data as the coding progressed. Other questions required more reporting, such as whether clinical-trial numbers are included in all PubMed listings of results and whether the analysis should include all categories of closed trials.
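In hindsight, that snag has a standard workaround in Python: decode the input explicitly, then strip anything outside ASCII. A minimal sketch, with a hypothetical file name:

```python
# Read the exported records while tolerating stray byte sequences,
# then drop anything outside ASCII (the trademark symbol, accented
# names, etc.). "pubmed_records.txt" is a hypothetical file name.
with open("pubmed_records.txt", encoding="utf-8", errors="replace") as f:
    raw = f.read()

clean = raw.encode("ascii", errors="ignore").decode("ascii")
```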
Moving to a spreadsheet to visualize answers to other questions, we looked at how many trials reported results, broken down by whether they were industry-funded. Comparative pie charts proved misleading: they showed that a higher percentage of industry-funded studies reported results, but they hid the absolute numbers. A stacked bar graph made the fuller picture clear: most unreported results come from industry-funded studies, simply because industry funds the majority of clinical trials.
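A stacked bar of that kind takes only a few lines of matplotlib. The sketch below uses placeholder counts, emphatically not our real tallies (those are in the spreadsheet above), just to show the construction:

```python
import matplotlib.pyplot as plt

# Placeholder counts for illustration only; see the spreadsheet
# above for the actual numbers.
labels = ["Industry-funded", "Other funding"]
reported = [140, 20]      # closed trials with posted results
unreported = [480, 160]   # closed trials with no results

plt.bar(labels, reported, label="Results reported")
plt.bar(labels, unreported, bottom=reported, label="No results posted")
plt.ylabel("Closed trials")
plt.legend()
plt.show()
```

Unlike side-by-side pies, the stacked bars show the rates and the absolute counts at once, which is what changed our read of the data.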
Presentation
Download the PowerPoint slides.
Code and Cod
Group members: Kelsey Calhoun, Diana Kenney, Molly Murray, and Andreas von Bubnoff
The story inspiration: most cod sold on Cape Cod today is imported from Iceland. Local stocks are in very poor shape, in waters where cod were once so plentiful that they gave Cape Cod its name.
We started looking for data before our problem and story were clearly defined, which turned into a tour of the many available data sets. All of our group members had little to no coding experience and none with Python. While we were all interested in telling a good cod story, our main goal was to become more comfortable with Python and with using data sets as part of stories. We never settled on a pitchable core for a cod story, but we found many possible stories that would take a lot more work, and many possible data representations that would work if we developed the Python/JavaScript/Tableau skills to make them. Here are some of the data sets we worked with and the approaches we took toward visualizing them.
Our original data set, from the National Marine Fisheries Service, listed pounds of Atlantic cod caught commercially in the U.S. each year from 1954 to 2013. We used it to learn the basics of cleaning data and creating histograms (a sketch of this step appears below). From there, we looked at international cod export data and catch data. We also looked at the controversy over methods of evaluating the health of cod populations off Massachusetts and Maine, but could not find non-governmental evaluation methods or data for comparison, only anecdotal accounts from fishermen.

We did find multi-species trawl data published by URI, which evaluates populations through series of yearly trawls in Narragansett Bay. Since this data set didn’t include cod, we chose six species to visualize, using fifty years of trawl data from two locations, Whale Rock and Fox Island. We spent a good chunk of time googling existing code that could help us create animated bubble visualizations to superimpose on a map of the bay. It turns out there is code for bubble charts, à la Gapminder, but it typically charts three variables rather than simply populations over time. With presentation time approaching, we recognized the limits of our new skills, switched to working in Tableau, and came up with this visualization.
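For anyone retracing that first step, here is a minimal sketch of plotting the yearly landings series. The file and column names are assumptions; the NMFS download uses its own headers:

```python
import pandas as pd
import matplotlib.pyplot as plt

# "year" and "pounds" are hypothetical column names.
landings = pd.read_csv("atlantic_cod_landings.csv")

# One bar per year of commercial Atlantic cod landings.
plt.bar(landings["year"], landings["pounds"])
plt.xlabel("Year")
plt.ylabel("Pounds landed")
plt.title("U.S. commercial Atlantic cod landings, 1954-2013")
plt.show()
```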