Monday, June 1, 2020

COVID 19 Testing - the Elephant In the Room

The biggest problem with COVID 19 testing, as mentioned in the last post, is the lack of comprehensive testing. If a broader portion of the population were tested, a more accurate assessment of the infection could be made, leading to more useful recommendations for action. But, as also mentioned, you have to go with what you have in order to make any headway in fighting the virus. Even if testing were reaching more people, however, there would still be obstacles to understanding the true scope of the pandemic. We might call these impediments the "Hippos" in the room.

Built into any data set is the inherent tendency for errors to creep in. This particular "hippo" is not often mentioned when data is presented in graphs and maps - it is assumed by the presenters, and not well understood by the audience. The path of data from collection to loading into a database consists of a number of steps, each of which can be a source of data ambiguity, alteration, and misrepresentation. The cumulative effect of errors on data and, consequently, on conclusions drawn from data, can be minimal and of little concern, or it can be substantial and lead to significant harm when actions based on data analysis are invoked (testing bias).

For examples of two types of errors, we can look at Texas and its experience with testing and the reporting of results. Other states are having similar problems, so Texas is not unique. The map below indicates the probability of an outbreak, calculated for each county.


The map was included in a report from April 5, 2020. The authors referenced a paper by researchers who had developed a tool to estimate the risk associated with the spread of infections. The results depended on the extent of infection reporting, the level of information about the transmission rate, and the possibility of "super-spreading events". As in most states early on, testing in Texas was minimal. The number of tests that returned positive probably did not represent the true count of cases at the time. The group estimated that if only one case was reported in a county, there was a 51% chance that an outbreak was already taking place. An outbreak was defined as a "sustained local transmission that will continue to spread".

The second example deals with the ambiguity of the reporting process itself. The graph below shows the daily number of cases reported as of May 20, 2020.

The values for daily cases vary widely, so I added a rolling average, calculating the average number of cases over each seven-day period (blue line). The line smooths out the graph so that the trend becomes visible. The values rose quickly from mid-March to the early part of April, then plateaued until the first week of May. The average cases per day then climbed quickly once more, until by May 20 they had more than doubled from the first week of April. The governor of Texas began reopening retail establishments at the beginning of May. Many states that have started reopening businesses are now experiencing higher numbers of cases per day.
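The seven-day rolling average can be sketched in a few lines of Python. This is only an illustration: the daily counts are made-up numbers, not the Texas data, the function name rolling_average is my own, and I am assuming a trailing window (the average of each day and the six days before it), which may not be exactly how the line in the graph was computed.

```python
def rolling_average(values, window=7):
    """Return the trailing average over `window` days for each day
    that has a full window of data behind it."""
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        averages.append(sum(chunk) / window)
    return averages


# Hypothetical daily case counts (illustrative only, not real data).
daily_cases = [10, 40, 25, 60, 30, 80, 35, 90, 50, 110]

smoothed = rolling_average(daily_cases)
for day, value in enumerate(smoothed, start=7):
    print(f"Day {day}: 7-day average = {value:.1f}")
```

Even with counts that jump around from day to day, the averaged values rise steadily, which is why the smoothed line makes the trend easier to see.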

This graph came from an article that reported testing in Texas was also increasing to more than 20,000 tests per day. Most states are now using the percentage of new cases relative to total tests per day as an indication of the improving or worsening status of the infection. This value was chosen as an acknowledgement that as testing increases, more cases will be found, but the percentage of new cases to tests would better represent the trend of the infection in the population as a whole.
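A small sketch shows why the percentage is more informative than the raw count. The figures below are hypothetical, chosen only to make the arithmetic easy to follow, and percent_positive is a name I made up for the illustration.

```python
def percent_positive(new_cases, tests):
    """Share of the day's tests that came back positive, as a percentage."""
    if tests <= 0:
        raise ValueError("tests must be a positive number")
    return 100.0 * new_cases / tests


# Hypothetical figures: testing doubles, and so does the case count.
early = percent_positive(new_cases=1000, tests=10000)
later = percent_positive(new_cases=2000, tests=20000)

print(f"Early: {early:.1f}% positive")
print(f"Later: {later:.1f}% positive")
```

In this made-up case the daily count doubles, yet the percent positive is unchanged, which suggests the extra cases came from extra testing rather than from a worsening outbreak.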

Unfortunately, reporting guidelines set forward by the CDC at the beginning of the pandemic were not specific enough to distinguish between viral diagnostic testing and antibody testing, which gauges the level of immunity in the population. It was not until after April 5th that the CDC revised its reporting form and clarified the definition of a confirmed case versus a probable case. As a result of the change, a case was considered "probable" if an antibody test was positive, but was not counted as a "confirmed" case. Confirmed cases are most often used in graphs and maps to indicate the spread of the disease.

Initially, some states, including Texas, had been reporting the results of viral and antibody tests without distinguishing between the two. According to the referenced article, the state continued to lump the two test results together even after the reporting form was revised to separate the two. There can be accuracy problems with both tests (true and false results), but a positive viral test is usually considered to be reliable. Antibody tests, on the other hand, have been reported to have less reliability in determining the occurrence of infections and result in misleading test statistics when included with viral test results.

The two types of errors described in this post, under-testing and ambiguous reporting, can lead to confusion when presented to the public and to an inadequate basis for policy decisions regarding social restrictions. This puts people's lives at risk as we learn the hard way that data is critical for fighting this pandemic. The hope is that, in time, we will more clearly and carefully bring robust data to bear on the effort.

Tuesday, May 19, 2020

COVID 19 Testing, Ay, There's the Rub

One of the most important things about data is that it has to be collected. As mentioned in the last post, the data about COVID 19 infections comes from a number of different agencies across the county, state, nation and world. The story we wish to tell from this data is where this thing is headed and how to stop it. The CDC has outlined several specific goals for data collection:
  • Monitor spread and intensity
  • Determine disease severity and characteristics
  • Identify risks increasing severity and transmission
  • Track changes in the virus itself
  • Estimate impact on health system
  • Forecast the spread and magnitude
Much of the data about the virus' behavior comes from researchers and medical staff working the front lines of care. Data about the spread and societal impact of the disease depends on a comprehensive testing regime. Both of these data streams require a rigorous reporting infrastructure as well as a detailed inspection process to ensure data is consistent between disparate sources.

Combining the various testing reports results in a broad and many-layered repository. From it analysts can extract information about the spread of the virus and about the pattern of testing itself. An example of the latter is a study from April 1st that produced two related maps. They illustrate differences in testing between the states as a way of evaluating the variability and extent of testing. I selected just the northeast region of the country from each map and show them side by side below.

The map on the left indicates that for each 100,000 people, New York has a higher rate of testing than Vermont, and Vermont has a higher rate than Pennsylvania. The map on the right shows a different story, where Vermont and Pennsylvania have a relatively low percentage of positive tests per total testing. New York, on the other hand, has 37% positive cases. This could mean that many more people are infected in New York, or that much more testing needs to be done. We know from the map in the previous post that New York on April 26 was reporting more than 400 cases per 100,000. While this is higher than elsewhere, it represents only 0.4 percent of the total population.

If most of the testing took place in New York City, where so many infected people were overburdening the health care system, this could skew the rate of positives per total. Perhaps if more testing were done throughout the state, the percentage of positives would come down slightly. Interpreting this result is difficult, since the data was collected from a relatively small sample of the people in the state. This is an example of a data quality issue that was mentioned in the previous post - gaps in the data. According to the CDC and other groups studying the pandemic, testing needs to increase substantially in order to gauge the effectiveness of any policy or treatment changes.

We cannot simply ignore the data we have, even if it is not as complete as we would like. There is a wildfire burning and we have to act to put it out using any means we can. In a LinkedIn interview with Tom Lawson, CEO of FM Global, Mr. Lawson was asked how his leadership style has changed in this crisis. He responded that "Speed beats perfection in a crisis because if you wait until you get all the information, it may be too late. You have to make do with the best information you have."


Thursday, May 14, 2020

Using Data to Tell a Story

Every day we see graphs and maps in the news to illustrate how things are going with the Coronavirus. We look to these visualizations (views of data) as signposts to guide us through this nightmare pandemic. Already in this blog I have shown several visualizations and talked about the stories they are telling and what they mean to us in the path of COVID 19.

We are all familiar with the map of the United States; we see it during election time when news reports want to show which states favor one party or the other. It is a useful framework for showing the distribution of something across the nation. In today's post, I am highlighting two similar maps that show different views of COVID 19 cases in the United States. The first map shows the "Total Cases" for each state and was published in the Brazil Times using data from the CDC (Centers for Disease Control and Prevention). The map was designed to show where the most and the fewest people are infected.



The data needed to make this map is fairly simple: the outcome of each test (positive or negative), the state where the test took place, and the date of testing. For this map, the most recent test results were used. Rather than show the individual totals as a number on the map, the counts were assigned a color based on which range of values they fell into (1 to 100, 101 to 1,000, etc.). By graduating the colors from lighter to darker, the map presents a quickly understood picture of where the virus is most active.
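Assigning a color to a range of values can be sketched as below. Only the first two ranges (1 to 100 and 101 to 1,000) come from the description above; the higher ranges, the color names, and the state counts are my own assumptions for the sake of illustration and are not taken from the published map.

```python
# Upper bound of each class, paired with an assumed color name.
# Only the first two upper bounds come from the text; the rest is hypothetical.
CLASSES = [
    (100, "pale yellow"),
    (1_000, "yellow"),
    (10_000, "orange"),
    (100_000, "red"),
]
TOP_COLOR = "dark red"   # anything above the last upper bound
NO_CASES_COLOR = "gray"  # assumed color for a state with no cases


def color_for(count):
    """Return the color of the class that `count` falls into."""
    if count < 1:
        return NO_CASES_COLOR
    for upper, color in CLASSES:
        if count <= upper:
            return color
    return TOP_COLOR


# Hypothetical totals (illustrative only, not real data).
totals = {"State A": 85, "State B": 4_200, "State C": 250_000}
for state, count in totals.items():
    print(f"{state}: {count:,} cases -> {color_for(count)}")
```

Grouping the counts this way trades precision for readability: two states in the same class get the same color even if their totals differ several times over.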

Most people know that more people live in California than live in North Dakota. It is not surprising then that the map of "Total Cases" shows the state with the higher population in the red range (the highest virus counts) and the more sparsely populated state in the yellow range (fewer cases).

The second map shows "Cases Per 100,000" and was published in the Daily Independent on April 30. It uses slightly more recent testing data and calculates the cases per capita from each state's total population (obtained from the US Census Bureau).

What the first map does not tell us is how each state is doing when you compare the number of cases to the number of people in the state. The "cases per 100,000" values represent the total cases that have occurred for each 100,000 people. This approach offers a way to compare states on an even playing field. Are they doing better or worse at controlling the virus across an equal sample of people? According to this view, California and North Dakota are doing equally well at handling the outbreak for their respective population sizes.
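The per-capita calculation itself is simple, and a short sketch makes the "even playing field" idea concrete. The two states and their numbers are hypothetical, picked so the arithmetic is easy to check; they are not the figures behind either map.

```python
def cases_per_100k(total_cases, population):
    """Scale a raw case count to the number of cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be a positive number")
    return total_cases * 100_000 / population


# Hypothetical states (illustrative only, not real data).
large_state = cases_per_100k(total_cases=40_000, population=40_000_000)
small_state = cases_per_100k(total_cases=750, population=750_000)

print(f"Large state: {large_state:.0f} cases per 100,000")
print(f"Small state: {small_state:.0f} cases per 100,000")
```

The large state has more than fifty times as many cases in total, yet both come out at the same rate once population is taken into account.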

Each of these graphic narratives depends on data that has been collected by government agencies, health care organizations, and other official sources. Often the data is combined in order to show a more complete description of events or to show how the virus is progressing across broader geographic extents such as states or nations. In order to make the best use of data and tell a reliable story, we need to have a clearly defined goal as well as a detailed understanding of the data itself. We need to know how it was collected, where it was collected, and, in the case of the virus, who was selected for testing.

Whether data comes from multiple sources or a single source, we always want to check if it meets some basic criteria. This "fitness for analysis" (Data Quality) is determined by asking several questions:
  • Is the data relevant to the story you wish to tell?
  • Is the data consistently measured and accurate to a similar standard?
  • Is the data complete for the appropriate time period and geographic area, without gaps?
  • Is the data clean - without duplicates and with a well documented structure, so we know what each column or field refers to?
During the pandemic, more and more data is being collected every day. The story you want to tell will help determine what sets of data are relevant. If you want to create a map from your data, you also need a geographic representation of the area with the name of the state tied to each state's boundary. This provides a way to link your tabular data to the map.
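Linking a table to a map by state name, and checking for duplicates and gaps along the way, might look something like the sketch below. Everything in it is hypothetical: the rows, the placeholder "outlines" standing in for real boundary shapes, and the function name link_to_map. Mapping software does this with real geometry, but the matching idea is the same.

```python
# Hypothetical tabular data: one row per state (illustrative numbers).
rows = [
    {"state": "Wisconsin", "cases": 120},
    {"state": "Illinois", "cases": 340},
    {"state": "Illinois", "cases": 340},  # duplicate row
    {"state": "Iowa", "cases": 75},       # no matching boundary below
]

# Hypothetical boundary layer: state name -> placeholder for its outline.
boundaries = {
    "Wisconsin": "<outline of Wisconsin>",
    "Illinois": "<outline of Illinois>",
    "Minnesota": "<outline of Minnesota>",  # no data row above
}


def link_to_map(rows, boundaries):
    """Join rows to boundaries on state name and report any problems found."""
    linked, duplicates, unmatched = {}, [], []
    for row in rows:
        name = row["state"]
        if name in linked:
            duplicates.append(name)   # same state listed more than once
        elif name in boundaries:
            linked[name] = {"cases": row["cases"], "outline": boundaries[name]}
        else:
            unmatched.append(name)    # data with no place on the map
    # Boundaries that never received any data are gaps in the map.
    missing = sorted(set(boundaries) - set(linked))
    return linked, duplicates, unmatched, missing


linked, duplicates, unmatched, missing = link_to_map(rows, boundaries)
print("Linked states:", sorted(linked))
print("Duplicate rows:", duplicates)
print("Rows with no boundary:", unmatched)
print("Boundaries with no data:", missing)
```

The last three lists correspond to the questions above about clean and complete data: each one flags something to fix before the map can be trusted.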


Thursday, April 30, 2020

A Tale of Two States

Besides the big bailout programs passed by Congress and the coordination of vaccine development by the National Institutes of Health, most of the hands-on work of fighting COVID 19 is happening at the state and local levels. This includes the procurement of medical supplies (ventilators, personal protective equipment, ICU beds, and testing kits), the management of hospital capacity and testing strategies, and the coordination of public information distribution. Another important responsibility of state and local leaders is to provide guidelines and limitations for how we need to change individual behaviors to help curtail the spread of the disease.

Without viable treatments beyond ventilators and without a vaccine yet available, there is little left that can stop the virus from rampaging around the world claiming thousands of lives. The only remaining preventative measure we have is social distancing. By limiting human to human contact and proximity, the spread of the virus can be slowed. There is no way to stop it completely, since we need to obtain food and care for our sick. By staying apart as much as possible we can at least lessen the stress on the health care system and gain valuable time in which to develop medical interventions.

As an example of how our social practices have aided in the fight against the Coronavirus, I combined two states' graphs of deaths over time with the timing of government restrictions on social interaction. The figure below compares Massachusetts's response to that of Illinois. These two states have similar population characteristics that allow us to compare the effects of other variables. Both have one very large metropolitan area (Boston - 7 m pop; Chicago - 9.5 m pop) which represents over 70% of the population in the entire state (Wikipedia; US Census). Both of the states also have a significant regional and national impact based on their industries and trade (Wikipedia).

I added markers to the graphs of deaths per 100k per day (NPR): a star for the first case identified in each state and a triangle for the date on which major social behavior recommendations were issued. In both of the states, the first case was reported well over a month before significant deaths began to occur. The implementation of social regulations, while similar, differs by 3 to 4 days, with Massachusetts lagging behind Illinois.

Both Massachusetts and Illinois took steps to begin limiting social interaction on the 11th and 13th of March. In the case of Illinois (Executive orders), this included limiting large gatherings of people and school closures. In Massachusetts (Executive orders), only nursing home restrictions were invoked. School closures happened 6 days after the first order. On March 20, Illinois instituted "Stay at home" regulations; Massachusetts began its stay-at-home policy on March 24, four days later. When you look at the lines representing deaths per 100k per day, it is clear that both states enacted restrictions over a week before deaths began to rise. The projected outcomes for the two states, however, show strikingly different levels of total deaths: Illinois - 2,269; Massachusetts - 3,326.

Researchers have pointed out relationships between delayed shutdowns and deaths when comparing California and New York (NPR). As mentioned before, there are many factors influencing the virus' behavior. One that may also contribute to the differences in death rate between the two states just mentioned is population density; New York City is more dense than L.A. and Boston is 30% more dense than Chicago (US Census). Staying away from each other seems to be a useful strategy in limiting the spread of the disease until approved treatments and/or vaccines become available. The question is, how do you stay away from people in tight living spaces and packed subways?


Wednesday, April 22, 2020

Data Versus Reality

Today in Wisconsin, as in many places in America, we are still under lockdown (we call it "Safer at Home"). In fact, our Governor, Tony Evers, just extended the lockdown until the end of May. Since I am in the at-risk group, being older with an underlying condition, I am thinking that caution moving forward will help me stay well. I am also retired, so, as long as the stock market doesn't completely bottom out, my savings will carry me through.

I know that there are many who would rather get back to work sooner, so they can start making a living again, caring for their kids, paying off their car, and covering all the other monthly bills we all have. This is where we reach the end of what we know and what we think will happen. The data only goes as far as today; tomorrow we have to guess. The good thing is, we can use the past to project into the future; based on what has happened before, from the data, we can extrapolate what would happen if we changed one variable, like, say, opening up businesses.

Let's take another look at the "flattening the curve" graph. The chart shows a dashed line crossing the center of the diagram which represents health care capacity. This might be the capacity under normal circumstances. Up until now, hospitals have not maintained warehouses of medical supplies just in case the worst should happen. This is why there has been a herculean effort to ramp up production of protective equipment and build up patient care facilities in sports stadiums. The reality is that the line on the graph representing health care capacity is actually rising over time as suppliers respond to the emergency.


In my modified graph, the new line for number of cases (green) represents a growth in patients that was at first not completely anticipated. It happened so fast that supplies initially fell behind - this is where the green line passes above the orange line (capacity). For a time, there were not enough beds, ventilators, testing kits, or health care workers to handle the influx of patients.

Eventually, supplies have begun to catch up with demand, and due to social distancing in all its forms, the incidence of new patients has begun to level off. This is where the green line (patients) finally goes under the available health care line (orange). Some of the actual data about the spread of the disease is worse than shown in my mocked up graph.


Source: NPR

The graphs above represent deaths per 100k per day for three states, and each is different. There are many factors that can alter how the disease progresses. One of the major influences, however, is whether the response to the virus has been swift and vigilant or delayed and relaxed. This virus is a wildfire and without control it will burn the place down.