Tuesday, May 19, 2020

COVID 19 Testing, Ay, There's the Rub

One of the most important things about data is that it has to be collected. As mentioned in the last post, the data about COVID 19 infections comes from a number of different agencies across the county, state, nation and world. The story we wish to tell from this data is where this thing is headed and how to stop it. The CDC has outlined several specific goals for data collection:
  • Monitor spread and intensity
  • Determine disease severity and characteristics
  • Identify risks increasing severity and transmission
  • Track changes in the virus itself
  • Estimate impact on health system
  • Forecast the spread and magnitude
Much of the data about the virus' behavior comes from researchers and medical staff working the front lines of care. Data about the spread and societal impact of the disease depends on a comprehensive testing regime. Both of these data streams require a rigorous reporting infrastructure as well as a detailed inspection process to ensure data is consistent between disparate sources.

Combining the various testing reports results in a broad and many-layered repository. From it, analysts can extract information about the spread of the virus and about the pattern of testing itself. An example of the latter is a study from April 1st that produced two related maps. They illustrate differences in testing between the states as a way of evaluating the variability and extent of testing. I selected just the northeast region of the country from each map and show them side by side below.

The map on the left indicates that, per 100,000 people, New York has a higher rate of testing than Vermont, and Vermont has a higher rate than Pennsylvania. The map on the right tells a different story: Vermont and Pennsylvania have a relatively low percentage of positive tests out of all tests performed, while New York has 37% positive. This could mean that many more people are infected in New York, or that much more testing needs to be done. We know from the map in the previous post that New York on April 26 was reporting more than 400 cases per 100,000. While this is higher than elsewhere, it represents only 0.4 percent of the total population.
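Both figures in this paragraph are simple ratios. Here is a minimal Python sketch of the arithmetic. The function names and the sample test counts are made up for illustration; only the 37% and the 400 per 100,000 come from the maps.

```python
def percent_positive(positive_tests, total_tests):
    """Share of all tests that came back positive, as a percentage."""
    if total_tests <= 0:
        raise ValueError("total_tests must be greater than zero")
    return 100.0 * positive_tests / total_tests


def share_of_population(cases_per_100k):
    """Convert a 'cases per 100,000 people' rate into a percentage."""
    return 100.0 * cases_per_100k / 100_000


# Hypothetical counts, chosen only so that the result is 37%.
print(percent_positive(370, 1_000))  # 37.0

# 400 cases per 100,000 people, expressed as a percentage of the population.
print(share_of_population(400))  # 0.4
```

Note that the two numbers have different denominators: the first divides by the number of tests, the second by the number of people. That is why a state can show a high percentage of positive tests while only a small fraction of its population has a confirmed case.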

If most of the testing took place in New York City, where so many infected people were overburdening the health care system, this could skew the rate of positives per total. Perhaps if more testing were done throughout the state, the percentage of positives would come down slightly. Interpreting this result is difficult, since the data was collected from a relatively small sample of the people in the state. This is an example of a data quality issue that was mentioned in the previous post: gaps in the data. According to the CDC and other groups studying the pandemic, testing needs to increase substantially in order to gauge the effectiveness of any policy or treatment changes.

We cannot simply ignore the data we have, even if it is not as complete as we would like. There is a wildfire burning and we have to act to put it out using any means we can. In a LinkedIn interview, Tom Lawson, CEO of FM Global, was asked how his leadership style has changed in this crisis. He responded that "Speed beats perfection in a crisis because if you wait until you get all the information, it may be too late. You have to make do with the best information you have."

Thursday, May 14, 2020

Using Data to Tell a Story

Every day we see graphs and maps in the news to illustrate how things are going with the Coronavirus. We look to these visualizations (views of data) as signposts to guide us through this nightmare pandemic. Already in this blog I have shown several visualizations and talked about the stories they are telling and what they mean for us in the path of COVID 19.

We are all familiar with the map of the United States; we see it during election time when news reports want to show which states favor one party or the other. It is a useful framework for showing the distribution of something across the nation. In today's post, I am highlighting two similar maps that show different views of COVID 19 cases in the United States. The first map shows the "Total Cases" for each state and was published in the Brazil Times using data from the CDC (Centers for Disease Control and Prevention). The map was designed to show where the most and the fewest people are infected.

The data needed to make this map is fairly simple: the outcome of each test (positive or negative), the state where the test took place, and the date of testing. For this map, the most recent test results were used. Rather than show the individual totals as a number on the map, the counts were assigned a color based on which range of values they fell into (1 to 100, 101 to 1,000, etc.). By graduating the colors from lighter to darker, the map presents a quickly understood picture of where the virus is most active.
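Assigning a count to a color range can be sketched in a few lines of Python. This is only an illustration: the first two ranges are the ones named above, while the 10,000 cutoff, the color names, and the function name are my assumptions, not the settings used by the published map.

```python
from bisect import bisect_left

# Upper bound of each range. 100 and 1,000 come from the ranges named in
# the text; the 10,000 cutoff is an assumption for illustration.
UPPER_BOUNDS = [100, 1_000, 10_000]
LABELS = ["1 to 100", "101 to 1,000", "1,001 to 10,000", "more than 10,000"]

# Hypothetical palette, graduated from lighter to darker.
COLORS = ["light yellow", "yellow", "orange", "red"]


def classify(count):
    """Return (range label, color) for a state's total case count."""
    if count < 1:
        return ("no cases", "white")
    # bisect_left finds the first range whose upper bound is >= count.
    index = bisect_left(UPPER_BOUNDS, count)
    return (LABELS[index], COLORS[index])


print(classify(100))     # ('1 to 100', 'light yellow')
print(classify(101))     # ('101 to 1,000', 'yellow')
print(classify(50_000))  # ('more than 10,000', 'red')
```

Where the range boundaries fall is a design choice, and moving them can change which states appear in the darkest color even though the underlying counts are the same.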

Most people know that more people live in California than live in North Dakota. It is not surprising then that the map of "Total Cases" shows the state with the higher population in the red range (the highest virus counts) and the more sparsely populated state in the yellow range (fewer cases).

The second map shows "Cases Per 100,000" and was published in the Daily Independent on April 30. It uses slightly more recent testing data and calculates the cases per capita from each state's total population (obtained from the US Census Bureau).

What the first map does not tell us is how each state is doing when you compare the number of cases to the number of people in the state. The "cases per 100,000" values represent the total cases that have occurred for each 100,000 people. This approach offers a way to compare states on an even playing field. Are they doing better or worse at controlling the virus across an equal sample of people? According to this view, California and North Dakota are doing equally well at handling the outbreak for their respective population sizes.
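The per capita calculation itself is short. Here is a sketch in Python, using two invented states with invented numbers (they are not the real figures for California or North Dakota) to show how very different totals can produce the same rate.

```python
def cases_per_100k(total_cases, population):
    """Total cases scaled to a common base of 100,000 people."""
    if population <= 0:
        raise ValueError("population must be greater than zero")
    return total_cases * 100_000 / population


# Hypothetical states: one large, one small. All numbers are made up.
states = {
    "Large State": {"cases": 40_000, "population": 40_000_000},
    "Small State": {"cases": 750, "population": 750_000},
}

rates = {
    name: cases_per_100k(data["cases"], data["population"])
    for name, data in states.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} cases per 100,000")
# Large State: 100 cases per 100,000
# Small State: 100 cases per 100,000
```

On a "Total Cases" map these two states would land in very different color ranges, yet on a per capita map they would share the same one.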

Each of these graphic narratives depends on data that has been collected by government agencies, health care organizations, and other official sources. Often the data is combined in order to show a more complete description of events or to show how the virus is progressing across broader geographic extents such as states or nations. In order to make the best use of data and tell a reliable story, we need to have a clearly defined goal as well as a detailed understanding of the data itself. We need to know how it was collected, where it was collected, and, in the case of the virus, who was selected for testing.

Whether data comes from multiple sources or a single source, we always want to check if it meets some basic criteria. This "fitness for analysis" (Data Quality) is determined by asking several questions:
  • Is the data relevant to the story you wish to tell?
  • Is the data consistently measured and accurate to a similar standard?
  • Is the data complete for the appropriate time period and geographic area, without gaps?
  • Is the data clean, without duplicates and with a well-documented structure, so we know what each column or field refers to?
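Some of these questions can be checked automatically. The Python sketch below covers duplicates, empty fields, and geographic gaps. The field names ("test_id", "state", "date", "result") and the sample records are hypothetical; a real data set would need its own documented columns.

```python
from collections import Counter

# Hypothetical column names for a table of individual test records.
REQUIRED_FIELDS = ("test_id", "state", "date", "result")


def find_duplicate_ids(rows):
    """Return test IDs that appear on more than one row."""
    counts = Counter(row.get("test_id") for row in rows)
    return sorted(tid for tid, n in counts.items() if tid and n > 1)


def find_incomplete(rows):
    """Return rows where any required field is missing or empty."""
    return [row for row in rows if any(not row.get(f) for f in REQUIRED_FIELDS)]


def find_missing_states(rows, expected_states):
    """Return expected states that have no rows at all (a geographic gap)."""
    seen = {row.get("state") for row in rows}
    return sorted(set(expected_states) - seen)


# Made-up records: one duplicate, one row with no date, one state absent.
rows = [
    {"test_id": "t1", "state": "State A", "date": "2020-04-30", "result": "positive"},
    {"test_id": "t1", "state": "State A", "date": "2020-04-30", "result": "positive"},
    {"test_id": "t2", "state": "State B", "date": "", "result": "negative"},
]

print(find_duplicate_ids(rows))  # ['t1']
print(len(find_incomplete(rows)))  # 1
print(find_missing_states(rows, ["State A", "State B", "State C"]))  # ['State C']
```

Checks like these only flag problems. Whether the data is relevant to the story, or measured to a similar standard across sources, still takes a person who understands how it was collected.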
During the pandemic, more and more data is being collected every day. The story you want to tell will help determine what sets of data are relevant. If you want to create a map from your data, you also need a geographic representation of the area with the name of the state tied to each state's boundary. This provides a way to link your tabular data to the map.
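Linking a table to boundaries by state name amounts to a lookup. Here is a minimal Python sketch, assuming the boundaries are stored as a dictionary keyed by state name. The state names, coordinates, and function names are invented for illustration; mapping software has its own tools for this join.

```python
def normalize(name):
    """Make names comparable: collapse extra spaces and ignore case."""
    return " ".join(name.split()).casefold()


def join_by_state(table_rows, boundaries):
    """Attach each table row to the boundary with the same state name.

    Returns (joined_rows, unmatched_names) so that rows with no matching
    boundary are reported instead of silently dropped.
    """
    lookup = {normalize(name): shape for name, shape in boundaries.items()}
    joined, unmatched = [], []
    for row in table_rows:
        shape = lookup.get(normalize(row["state"]))
        if shape is None:
            unmatched.append(row["state"])
        else:
            joined.append({**row, "boundary": shape})
    return joined, unmatched


# Hypothetical boundaries: each is a list of (x, y) corner points.
boundaries = {
    "State A": [(0, 0), (0, 1), (1, 1), (1, 0)],
    "State B": [(1, 0), (1, 1), (2, 1), (2, 0)],
}

# Hypothetical table rows; note the inconsistent spelling of "state a".
table_rows = [
    {"state": "state a ", "cases_per_100k": 100},
    {"state": "State C", "cases_per_100k": 250},
]

joined, unmatched = join_by_state(table_rows, boundaries)
print(len(joined))  # 1
print(unmatched)    # ['State C']
```

The list of unmatched names is worth reviewing before drawing anything, since a misspelled or abbreviated state name would otherwise show up as a blank area on the map.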