Wednesday, August 5, 2020

Contact Tracing: Tracking the Virus To Clusters of Infection

The purpose of Contact Tracing is to find possible infections and stop the spread before it goes further. When someone is tested and the result is positive for COVID-19, the case is turned over to the tracers and the investigation begins to find those contacts who may be infected, but do not know it yet. If the cases are found soon enough and the patients are quarantined quickly, they either get better or are provided more intensive care (Tracing Principles). The important thing is that they would not be able to transmit the virus to one of their contacts. If not enough of these cases are found, the virus continues to spread and, at some point, the hospital beds fill up, treatments become depleted, more health workers get sick, and the system breaks down.

In the last post, I introduced the idea of Contact Networks as a tool that can be used to visualize an infection and from which certain conclusions can be drawn that might help mitigate the spread of COVID-19. The networks are built using data from the tracing investigations. Groups of individuals can be assembled, branching out from the original patient. The map of Winnipeg in the last post showed a way to add a geographic dimension to the network in order to highlight certain influences that are spatially coincident with patients. Sometimes, however, it is not the physical distance between these features and patients that is important. By incorporating sites as nodes in a Contact Network along with contacts, a clearer picture of their relationships becomes apparent.

An example of this sort of graph was created by researchers at the University of Arizona, Tucson. They used data from the SARS virus outbreak in Taiwan in 2003 (SNA For Tracing).


The network focuses on patients who have had contact with hospitals where outbreaks of the virus were active. In the graph above, major clusters of patients surrounding each hospital have been removed. What remains are those individuals who act as bridges between clusters. This additional interaction adds to the possible spread of the infection.
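The idea of a "bridge" can be sketched in a few lines of Python. This is only an illustration, not the Arizona researchers' method: the patient and hospital names are made up, and a bridge is assumed here to be any patient linked to more than one hospital.

```python
from collections import defaultdict

# Hypothetical tracing records: (patient, hospital the patient had contact with)
visits = [
    ("P1", "Hospital A"),
    ("P2", "Hospital A"),
    ("P2", "Hospital B"),
    ("P3", "Hospital B"),
    ("P4", "Hospital B"),
    ("P4", "Hospital C"),
]

def find_bridges(visits):
    """Return patients linked to more than one site, with the sites they connect."""
    sites_by_patient = defaultdict(set)
    for patient, site in visits:
        sites_by_patient[patient].add(site)
    return {
        patient: sorted(sites)
        for patient, sites in sites_by_patient.items()
        if len(sites) > 1
    }

bridges = find_bridges(visits)
print(bridges)
```

In this made-up data, P2 and P4 are the only patients who tie two hospital clusters together.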

For COVID-19, as contact tracing data is analyzed, it is becoming increasingly clear that most transmission of the disease occurs at places where people gather. This includes workplaces, special events, recreation areas, bars and other social gatherings. The exception to this trend is when the majority of the attendees are wearing masks, social distancing or outside with free air flow.

The evidence for these conclusions comes from a number of reports made by health departments regarding the status of the disease in their area. On July 1, after a month of starting to reopen, the director of Public Health Dane County in Wisconsin announced new restrictions on bars in response to an increase in cases (Public Health). Data from contact tracing of 614 new cases for two weeks in June indicated that 45% of patients had attended parties outside their home. The Tavern League of Wisconsin criticized the new order, saying it was unfairly penalizing bars over other activities like protesting. It was during this time that nightly protests in Madison were drawing hundreds of people in response to the police custody death of George Floyd.

The county released more detailed data, showing that 21% of the patients said they had been to bars during that time, but only 2% had attended the protests.


Many of the protesters were wearing masks in news footage and the crowds were outside in the street with a free flow of air. This is just another indication that personal preventive measures can help bring the positive cases down. Gatherings without protection and inside allow the virus to spread quickly.

In July, the Lincoln County, Oregon, public health director presented data to the Board of Commissioners showing that most local transmission was due to outbreaks, rather than out-of-county visitors (sources of infection). An outbreak was defined as two or more cases in separate households linked to a single event or location. To illustrate the spread of the disease, the health department produced a chart using contact tracing data for four individuals. These original cases were responsible for the infection of 58 additional people over several weeks.
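The outbreak definition used here (two or more cases in separate households linked to a single event or location) is simple enough to express as a rule. The sketch below applies it to made-up case records; the record layout and the names are assumptions for illustration, not the county's actual data format.

```python
from collections import defaultdict

# Hypothetical case records: (case id, household id, linked event or location)
# A location of None means no link to an event or location was found.
cases = [
    ("c1", "house-1", "Workplace X"),
    ("c2", "house-2", "Workplace X"),
    ("c3", "house-3", "Party Y"),
    ("c4", "house-3", "Party Y"),
    ("c5", "house-4", None),
]

def find_outbreaks(cases):
    """Return locations linked to cases from at least two separate households."""
    households_by_location = defaultdict(set)
    for _case_id, household, location in cases:
        if location is not None:
            households_by_location[location].add(household)
    return sorted(
        location
        for location, households in households_by_location.items()
        if len(households) >= 2
    )

outbreaks = find_outbreaks(cases)
print(outbreaks)
```

Party Y does not qualify in this example because both of its cases come from the same household, which would count as household transmission instead.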

It was not known how the original patients got the virus, but three of them were responsible for 10 workplace outbreaks affecting 39 people. These and other outbreaks resulted in 73% of the positive cases in the county. Additional cases came from household transmission or sporadic instances of community spread. The position of nodes in the diagram above has no relation to their actual location relative to each other within the county. This helps visualize how the links between cases and gathering places are related.

When restrictions are lifted on residents of different jurisdictions, it is important for people to remember that it does not mean the virus is gone. The virus, according to most experts, is here to stay for quite a while, even after the vaccine becomes available. No matter how young or old you are, or how unafraid you are, if you catch the disease you may be lucky and not suffer a long illness. You will, however, spread the virus to anyone you come in contact with unless you are careful and protect them from possible infection: wear a mask, stand apart, wash your hands.

Sunday, August 2, 2020

COVID-19: Connecting the Dots Between Spreaders and the Vulnerable

In the last post I introduced the idea of connections and how they can affect the spread of COVID-19. In this post I will go a little further down the COVID data rabbit hole to where the abstract is real and relationships can be fatal.

You are probably all familiar with the old saying, "It's not what you know, but who you know". For those who don't, it means if you want to move up the ladder, your knowledge and skills are less important than your network of personal contacts (Wiktionary). In this time of pandemic, there is a downside to connections if you test positive. The virus will know who you know and can use that ladder to move on to the next host.

No matter how reclusive you are, we are all connected to other people: friends, relatives, your spouse and kids, work mates, bar buddies, pickup basketball teams. And, of course, the people we know have their own set of acquaintances, and so on down the line. Before delving deeper into the intricacies of "social networks", as they are called, it is important to know about the background conditions that exist which determine where the most vulnerable live. It is these people, at risk of severe complications from COVID-19, who need to guard against connections to those already infected.

The CDC has been alerting us all, over and over, that certain individuals are more likely to have negative outcomes from the virus. In Austin, Texas, researchers have compiled a set of measures that can be used to identify those populations and locate them on a map (Houston Map). The measures they used included a number of economic, environmental, and health care factors that can influence vulnerability.


The idea was that if you know where these areas are, it can help in the allocation of testing and health care efforts. The data for the study came from national and local databases that tied records of the following statistics to census tract areas:
  • access to hospitals and medical care
  • underlying medical conditions
  • exposure to pollutants
  • areas prone to disasters and flooding
  • other lifestyle choices like smoking and drinking
But how does the virus find these people? That is where the concept of networks can help to uncover the "invisible threads" tying us all together.

Networks are defined as a set of nodes connected to each other by links or "edges". Networks can be used to describe many phenomena, including computer connectivity, electrical systems, biological interactions, and financial transactions (Network Theory). Social Networks describe connections between people and between people and other entities they may interact with. In epidemiology, the study of disease within a population, Social Networks can be used to visualize the spread of disease and possible interventions to control it. This type of Social Network is referred to as a Contact Network.
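A minimal sketch of such a network in Python, assuming a handful of made-up people and one gathering place ("Gym") as nodes. The edge list is hypothetical, and the breadth-first search is just one common way to follow links outward from a starting node.

```python
from collections import defaultdict, deque

# Hypothetical edges: each pair is a reported contact between two nodes.
# A node can be a person or a place where people gathered.
edges = [("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Gym"), ("Dev", "Gym")]

def build_network(edges):
    """Store the network as a mapping from each node to its direct contacts."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return network

def reachable_from(network, start):
    """Follow links outward from one node (breadth-first) and list every node reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(seen)

network = build_network(edges)
print(reachable_from(network, "Ann"))
```

Even though Ann and Dev never met in this example, a path exists between them through Bob, Cara and the gym.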

The data required to construct a Contact Network is produced by Contact Tracing. In an outbreak, many investigators are needed to interview people who are infected, tracing back along the individual's set of contacts to determine who they may have come in contact with and when. This list of contacts might contain people who will become infected through the contact or who infected the individual. Tracing is very time consuming and depends on the cooperation and recollection of the patients. It therefore has the most effect on disease control when the rate of infection is low (SNA For Tracing). Automated tracing through cell phone proximity logging can speed up the process of identifying contacts, as long as security concerns are addressed. In the end, though, manual interviews are still needed to provide health and quarantine support (Practical Application).

By adding a geographical value to network nodes, researchers at Penn State were able to locate individuals associated with nodes on a map while maintaining the links between them (Where You Go).


The networks they mapped above are referred to as components. The nodes in each network represent groups of at risk street youths whose residences are shown linked to each other. The networks are overlain on a density heat map of locations where at risk behaviors, like drug use, were performed by individuals in the network. The map shows a high level of overlap between the various networks relative to the risk sites, indicating a more cohesive interaction between networks. It seems possible that in a disease investigation, similar mappings of contacts along with areas of vulnerable populations might provide clues to transmission sources within those communities.

In the next post, I will look more closely at typical Contact Network analyses and how they help uncover gathering places that accelerate disease transmission.


Wednesday, July 15, 2020

The Path of Infection - No Masks, No Recovery

Much of public policy regarding COVID-19 has to do with limiting the spread of the virus. While it may seem like a totally random occurrence, popping up unexpectedly here and there, the spread can be calculated by a model based on the virus biology and the structural landscape of potential hosts (you and me). The predicted spread, however, cannot account for random variation and depends on the accuracy of the information fed into the model. What makes this process difficult for COVID-19 is that little is known yet about the infectious nature of the virus.

Below are a group of maps that allow us to visualize the spread of the virus.



A. Predicted Spread

On March 12, 2020, Time Magazine produced a dynamic map of the predicted spread of the virus. The data came from models produced for the CDC based on knowledge of the virus at that point and on population density, mobility, commuting patterns and air travel. The user could select one of three surveillance levels: low - indicating minimal testing; moderate - greater testing coverage; and high - comprehensive testing of individuals.

What is pictured in map "A" is the lowest testing scenario, where little can be determined about the status of the outbreak. The virus would then spread undetected among the population and unmitigated by government or health official policies. The number of cases in an area is represented by colored pixels, ranging from high density - orange, to medium density - yellow, to low density - green. The actual level of testing in the US has been at best in the moderate range, but many states have either chosen not to test in sufficient numbers or have not been able to obtain the necessary testing supplies and lab evaluation resources.

B. Recent Increases - 7/9/2020
 
Map "B" is a snapshot of a dynamic map produced by USA Today based on data from Johns Hopkins University. The original map is updated regularly and shows total cases and deaths per county since the infection began or just for the last seven days. The number of cases is indicated by colored circles whose size is relative to the case totals. In the map above, total increase in cases for the last seven days is shown, as of July 9, 2020.

The overall appearance of map "A" and map "B" is strikingly similar. While the symbolization of the two maps differs, both maps are still designed to show the extent of the spread of the virus over time. The areas that are most active are the same in both maps, but when the size differences between areas are compared on one map to the same areas in the other, some differences are apparent.

If we look at the relative sizes of case areas within map "A" (predicted spread) for California, Texas and Florida, they seem to correspond to the relative sizes of case circles between the same states in map "B" (current increases). Arizona and New York case areas and case circles, however, do not seem to match the scale of other area extents, Arizona being relatively larger in map "B" and New York being smaller. This would seem to correspond to the level of surveillance (testing) that has been done in New York versus Arizona, where New York practiced very strict mitigation and Arizona less so.

C. Global Path of the Virus

The map of global paths of the virus as it spread is based on genetic sequencing of the virus in samples taken from patients all over the world. The circle symbols in the map represent the relative size of the outbreaks and the color of the paths relates to where the virus originated from. Genetic sequencing identifies the molecular building blocks of the virus and the order in which they are assembled into chains. As with most viruses, the Coronavirus that causes COVID-19 (SARS-CoV-2) makes mistakes occasionally when it creates copies of the genome (genetic blueprint). These mistakes are referred to as mutations and are passed on to future generations of the virus. As the copies of the virus infect new hosts, they can then be carried to other locations where they might infect additional people.

The researchers who constructed the map analyzed the genetic sequences to create inheritance trees. This made it possible to link later generations of the virus in one country to ancestors in source countries. Besides the map above, they also created one of the time period from December 3, 2019, to February 3, 2020. That map indicated the source for all initial cases in other countries as being China. In the period shown in the map above (Feb. 3, 2020 to April 21, 2020), however, later generations of the virus spread from these secondary infection sites to other countries. As an outbreak began at a location with little or no travel restrictions, the virus was free to spread back and forth on the airways by tagging along with human hosts.

D. Highways and Urban Areas

I built map "D" by overlaying maps from two different sources. The base layer is a map of the National Highway System, including US Highways and the Interstate System produced by the US Department of Transportation. The overlay is based on US Census Urban Areas defined as "clusters of development that meet a minimum population density" (Urban Areas). The size of the circles is relative to the total population within the area.

This map also bears a strong resemblance to maps "A" and "B". Which is to say that COVID-19 goes where the people are - and it gets there riding along with infected people traveling our highways and byways. People go everywhere in their cars and trucks, and, eventually, they carry the virus to your home town, or close to it. COVID-19 spreads by certain physical mechanisms (coughing, sneezing, talking) and as long as nothing gets in the way of those processes, it will find the next host. Until we have a vaccine to stop the spread, there are only a few physical barriers that can slow the infection: masks, hand washing, hand sanitizer, and social distancing. These counter measures will work, but only if the majority of people practice them. Where people ignore these guidelines, the disease has a wide open path to their front door.


How Many Degrees of Separation Are You From the Infection?

In early April of 2020, a team of network epidemiologists, who study the prevalence of disease, put together a thought project at the University of Washington designed to simulate various levels of social distancing and the resulting infection rate. The initial population was represented by a group of equidistant dots. As connections were made, the dots became networked beyond the point of initial contact, represented by linked lines. Three scenarios are shown below.


The first scenario represents households (dots) that are all perfectly self-isolating. Under this scenario, no one would catch the disease, even if some people were infected. Without close personal contact there is no transmission. This is not a realistic option, though: people in the same household would find it very difficult to isolate from each other and without some interaction with others, it would not be possible to obtain food, medicine, healthcare and other public services.

At the other end of the scale is scenario 3, where interactions are the same as before the pandemic. People would be very closely connected and if only a few people were infected, eventually all would contract the disease. Somewhere in between 1 and 3 is scenario 2. It represents the situation where slightly more than essential contacts are taking place, perhaps allowing for some on site work and socializing, but still without social distancing practiced. Even in this case, 90% of all households would be connected and that is a path the virus can use to infect more individuals. If you get sick, you may not have symptoms, but you will continue to be contagious. If you visit a friend, a neighbor or a relative without precautions, you may infect them. What if they also have an existing health condition that puts them more at risk of experiencing a life threatening infection? One that you could have prevented.
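The share of households that end up connected can be measured by finding the largest linked group in a network. The sketch below is not the University of Washington team's model; it is a generic union-find calculation on made-up links, with households numbered from 0 and at least one household assumed.

```python
def largest_component_fraction(n_households, links):
    """Fraction of households in the largest connected group (union-find)."""
    parent = list(range(n_households))

    def find(x):
        # Walk up to the group's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every pair of households that are in contact.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Count the households in each group and report the largest share.
    sizes = {}
    for household in range(n_households):
        root = find(household)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n_households

# Hypothetical example: 10 households, 9 of them linked in a chain.
chain = [(i, i + 1) for i in range(8)]
print(largest_component_fraction(10, chain))  # 0.9
print(largest_component_fraction(10, []))     # 0.1 (everyone isolated)
```

With no links at all, each household is its own group, which corresponds to the perfectly self-isolating first scenario.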

The longer people stay away from work, the more precarious their economic well-being will become. On the other hand, the more that people reengage with others, the more important it will be to use masks and separation to keep their distance from infected individuals. The solution seems to be to take it slow, monitor things closely and go no faster than the numbers allow.

Thursday, June 25, 2020

Who's Getting Sick - Race Matters

On April 10, 2020, the CDC posted a report that discussed the geographic variations in the spread and mortality rate of COVID 19. These included the following differences in location that might be influencing the pattern of disease incidence occurring across the United States:
  • the timing of COVID 19 introduction into an area
  • the relative population density of cities compared to rural areas
  • demographic values such as prevalence of different age groups and those with existing conditions
  • the timing and extent of government recommendations to diminish public interaction
  • diagnostic testing capacity in different jurisdictions
  • the level of public health reporting consistency and prioritization.
I have written about several of these in my posts, but there is one that now intersects with current events in a significant way beyond the issues of health and economic upheaval: the demographics of race. In the midst of this global pandemic, an event occurred that burst into the forefront of the daily news cycles, moving COVID 19 updates into the background. George Floyd was living in Minneapolis, Minnesota, when on May 25, 2020, he was arrested by police after having been identified by a store clerk as having paid for his purchase with counterfeit money. Seventeen minutes later, Mr. Floyd was handcuffed and on the ground, held down by three police officers, one with a knee on Mr. Floyd's neck. At that point, after 8 minutes and 46 seconds of being held in that position, George Floyd had become unresponsive. An ambulance arrived a few minutes later and took Mr. Floyd to the hospital where he was declared dead.

It was not just because George Floyd was black that this story resonated so strongly around the world. As Sherrilyn Ifill, president of the NAACP Legal Defense fund, said in an interview with CBS's Bill Whitaker, "one of the reasons why the George Floyd video set us off so much was the realization that it's not different. We've-- we've seen the videos. And the videos seem not to make a difference. And that's why that officer could look like that. He wasn't afraid of being videotaped. He wasn't trying to hide what he was doing."

As Ms. Ifill said, we have seen this all before, many times. If we look at the numbers of men at risk of being killed by police, the imbalance between ethnic groups is overwhelming.

Adding insult to injury, George Floyd was tested for COVID 19 after his death and was found to be positive, though asymptomatic. That African Americans are victims of police brutality is bad enough; they are also almost five times more likely than white people to be hospitalized for COVID 19.

As can be seen in the two graphs above, it is not just blacks who are being treated more harshly by the police and the pandemic - all minorities are suffering at a greater rate than whites. While density of the population in urban places plays a role in increased rates of infection, it really comes down to whether you are rich enough to "shelter in place", or, if you are not, being forced to go out to work at frontline service jobs in close proximity to others. The maps below show that, even before the pandemic in 2015, minorities were less likely to find work than whites.

The values that are represented in these maps are based on the ratio between the rate of unemployment for the minority and the rate of unemployment for whites. The rate of unemployment is calculated by dividing the number of a group who are unemployed by the total labor force of that group. The "labor force" is defined as those currently employed or who are not working, but who are actively looking for work. The areas for which the values are aggregated are congressional districts.
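The calculation described above can be written out as a short worked example. The counts below are invented for illustration; only the two formulas (unemployed divided by labor force, and minority rate divided by white rate) come from the description.

```python
def unemployment_rate(unemployed, employed):
    """Unemployed people divided by the labor force (employed plus job seekers)."""
    labor_force = employed + unemployed
    return unemployed / labor_force

def unemployment_ratio(minority_rate, white_rate):
    """How many times higher the minority rate is than the white rate."""
    return minority_rate / white_rate

# Hypothetical counts for one congressional district
minority_rate = unemployment_rate(unemployed=100, employed=900)  # 0.10
white_rate = unemployment_rate(unemployed=50, employed=950)      # 0.05

ratio = unemployment_ratio(minority_rate, white_rate)
print(round(ratio, 2))  # 2.0
```

A ratio of 1 would mean both groups are unemployed at the same rate; in this made-up district the minority rate is twice the white rate.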

When the rates of COVID-19 deaths for different races are compared to each group's proportion in a state and then combined, the result can be used to show how far variances in racial deaths diverge from the entire state population death rate. A map of these divergences was developed by the University of California, Berkeley.

A cursory examination of the map above and the ones of unemployment shows a probable cause-effect relationship between deaths and unemployment rates for certain states: Arizona, Georgia, Nevada, Michigan, Florida, and Missouri. In others, however, there seems to be no relationship: California, Texas, Oregon, and Wyoming. The unemployment rate itself is partly a function of a racial bias, which also reinforces several other circumstances that increase susceptibility to infection.

The CDC lists a number of race-related influences that affect health:
  • residential segregation that creates denser populations and greater distances to groceries and health care;
  • higher employment in essential industries requiring working outside the home and less paid sick leave;
  • poorer underlying health conditions like lack of health insurance and serious pre-existing illness
The inequality that exists in our society has made a difficult situation even worse for those who, for no reason other than the color of their skin, face so many injustices already.

Monday, June 1, 2020

COVID 19 Testing - Learning From Our Mistakes

There is considerable evidence that our handling of the onset of COVID 19 was less than adequate. If a broader portion of the population had been tested early on, a more accurate assessment of the infection could have been made leading to more useful recommendations for action. Several scientists fear that the current pandemic is the leading edge of what may be a wave of more frequent infectious diseases. They suggest that this increase is in part due to changes in climate brought about by human activity. We need to pay attention to what works and remember those lessons. Insufficient testing is just one of the teachable moments; there are many other missteps that should be noted and corrected.

Even if testing had reached more people in the beginning, there were other practices that delayed our understanding of the true scope of the pandemic. Built into any data set is the inherent tendency for errors to creep in. The path of data from collecting it to loading it into a database is comprised of a number of steps, each one of which can be a possible source of data ambiguity, alteration, and misrepresentation. The cumulative effect of errors on data and, consequently, on conclusions drawn from data, can be minimal and of little concern, or it can be substantial and lead to significant harm when actions based on data analysis are invoked (testing bias).

For examples of two types of errors, we can look at Texas and its experience with testing and the reporting of results. Other states are having similar problems, so Texas is not unique. The map below indicates the probability of an outbreak, calculated for each county.


The map was included in a report from April 5, 2020. The authors referenced a paper by researchers who had developed a tool to estimate the risk associated with the spread of infections. The results depended on the extent of infection reporting, level of information about the transmission rate and the possibility of "super-spreading events". As in most states early on, testing in Texas was minimal. The number of tests that returned positive probably did not represent the true count of cases at the time. The group estimated that if only one case was reported in a county, there was a 51% chance that an outbreak was already taking place. An outbreak was defined as a "sustained local transmission that will continue to spread".

The second example deals with the ambiguity of the reporting process itself. The graph below shows the daily number of cases reported as of May 20, 2020.

The values for daily cases vary widely, so I added a rolling average, calculating the average number of cases for each seven days (blue line). The line smooths out the graph so that the trend becomes visible. The values rose quickly from mid-March to the early part of April, then plateaued till the first week of May. The average cases per day once more climbed quickly until by May 20 they had more than doubled from the first week of April. The governor of Texas began reopening retail establishments at the beginning of May. Many states that have started reopening businesses are now experiencing higher numbers of cases per day.
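For readers who want to reproduce this kind of smoothing, here is a minimal sketch of a seven-day rolling average. It assumes a trailing window (each value averages that day and the six days before it), which may differ from how the line in the graph was drawn, and the daily counts are invented.

```python
def rolling_average(values, window=7):
    """Average of each day and the (window - 1) days before it."""
    averages = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        averages.append(sum(chunk) / window)
    return averages

# Hypothetical daily case counts for eight days
daily_cases = [10, 20, 30, 40, 50, 60, 70, 80]
smoothed = rolling_average(daily_cases)
print(smoothed)  # [40.0, 50.0]
```

The first six days have no value because a full week of data is not yet available for them.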

This graph came from an article that reported testing in Texas was also increasing to more than 20,000 tests per day. Most states are now using the percent new cases relative to total tests per day as an indication of the improving or worsening status of the infection. This value was chosen as an acknowledgement that as testing increases, more cases will be found, but the percentage of new cases to tests would better represent the trend of the infection in the population as a whole.
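A quick worked example shows why the percentage is more informative than the raw count. The numbers below are invented, not Texas data.

```python
def percent_positive(new_cases, total_tests):
    """New cases as a percentage of the tests performed that day."""
    if total_tests == 0:
        raise ValueError("no tests were performed")
    return 100 * new_cases / total_tests

# Hypothetical days: testing doubles, and more cases are found...
day_1 = percent_positive(new_cases=1000, total_tests=10000)
day_2 = percent_positive(new_cases=1500, total_tests=20000)

# ...but the share of tests that come back positive goes down.
print(day_1, day_2)  # 10.0 7.5
```

Judged by case counts alone, the second day looks worse; judged by percent positive, it looks better.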

Unfortunately, reporting guidelines set forward by the CDC at the beginning of the pandemic were not specific enough to distinguish between viral diagnostic testing and antibody testing which gauges the level of immunity in the population. It was not until after April 5th that the CDC revised their reporting form and clarified the definition of a confirmed case versus a probable case. As a result of the change, a case was considered "probable" if an antibody test was positive, but was not counted as a "confirmed" case. Confirmed cases are most often used in graphs and maps to indicate the spread of the disease.

Initially, some states, including Texas, had been reporting the results of viral and antibody tests without distinguishing between the two. According to the referenced article, the state continued to lump the two test results together even after the reporting form was revised to separate the two. There can be accuracy problems with both tests (true and false results), but a positive viral test is usually considered to be reliable. Antibody tests, on the other hand, have been reported to have less reliability in determining the occurrence of infections and result in misleading test statistics when included with viral test results.

The two types of errors described in this post, under-testing and ambiguous reporting, can lead to confusion when presented to the public and to an inadequate basis for policy decisions regarding social restrictions. This puts people's lives at risk as we learn the hard way that data is critical for fighting this pandemic. The hope is that we will, in time, more clearly and carefully bring robust data to bear on the effort.

Tuesday, May 19, 2020

COVID 19 Testing, Ay, There's the Rub

One of the most important things about data is that it has to be collected. As mentioned in the last post, the data about COVID 19 infections comes from a number of different agencies across the county, state, nation and world. The story we wish to tell from this data is where this thing is headed and how to stop it. The CDC has outlined several specific goals for data collection:
  • Monitor spread and intensity
  • Determine disease severity and characteristics
  • Identify risks increasing severity and transmission
  • Track changes in the virus itself
  • Estimate impact on health system
  • Forecast the spread and magnitude
Much of the data about the virus' behavior comes from researchers and medical staff working the front lines of care. Data about the spread and societal impact of the disease depends on a comprehensive testing regime. Both of these data streams require a rigorous reporting infrastructure as well as a detailed inspection process to ensure data is consistent between disparate sources.

Combining the various testing reports results in a broad and many-layered repository. From it analysts can extract information about the spread of the virus and about the pattern of testing itself. An example of the latter is a study from April 1st that produced two related maps. They illustrate differences in testing between the states as a way of evaluating the variability and extent of testing. I selected just the northeast region of the country from each map and show them side by side below.

The map on the left indicates that for each 100,000 people, New York has a higher rate of testing than Vermont, and Vermont has a higher rate than Pennsylvania. The map on the right shows a different story, where Vermont and Pennsylvania have a relatively low percentage of positive tests per total testing. New York on the other hand has 37% positive cases. This could mean that many more people are infected in New York, or that much more testing needs to be done. We know from the map in the previous post that New York on April 26 was reporting more than 400 cases per 100,000. While this is higher than elsewhere, it represents only 0.4 percent of the total population.
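The conversion between "cases per 100,000" and a percentage of the population is simple arithmetic, sketched here. The population figure in the second example is hypothetical.

```python
def per_100k_to_percent(rate_per_100k):
    """Convert a rate per 100,000 people to a percentage of the population."""
    # Dividing by 100,000 and multiplying by 100 is the same as dividing by 1,000.
    return rate_per_100k / 1000

def cases_per_100k(cases, population):
    """Convert a raw case count to a rate per 100,000 people."""
    return cases / population * 100_000

share = per_100k_to_percent(400)
print(share)  # 0.4

# Hypothetical area: 2,000 cases among 500,000 residents
rate = cases_per_100k(2000, 500_000)
print(rate)
```

Rates per 100,000 make areas of different sizes comparable, while the percentage shows how small a share of the whole population the confirmed cases represent.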

If most of the testing took place in New York City, where so many infected people were overburdening the health care system, this could skew the rate of positives per total. Perhaps if more testing was done throughout the state, the percentage of positives would come down slightly. Interpreting this result is difficult, since the data was collected from a relatively small sample of the people in the state. This is an example of a data quality issue that was mentioned in the previous post - gaps in the data. According to the CDC and other groups studying the pandemic, testing needs to increase substantially in order to gauge the effectiveness of any policy or treatment changes.

We cannot simply ignore the data we have, even if it is not as complete as we would like. There is a wildfire burning and we have to act to put it out using any means we can. In a LinkedIn interview, Tom Lawson, CEO of FM Global, was asked how his leadership style has changed in this crisis. He responded that "Speed beats perfection in a crisis because if you wait until you get all the information, it may be too late. You have to make do with the best information you have."


Thursday, May 14, 2020

Using Data to Tell a Story

Every day we see graphs and maps in the news to illustrate how things are going with the Coronavirus. We look to these visualizations (views of data) as signposts to guide us through this nightmare pandemic. Already in this blog I have shown several visualizations and talked about the stories they are telling and what they mean to us in the path of COVID-19.

We are all familiar with the map of the United States; we see it during election time when news reports want to show which states favor one party or the other. It is a useful framework for showing the distribution of something across the nation. In today's post, I am highlighting two similar maps that show different views of COVID-19 cases in the United States. The first map shows the "Total Cases" for each state and was published in the Brazil Times using data from the CDC (Centers for Disease Control and Prevention). The map was designed to show where the most and the fewest people are infected.



The data needed to make this map is fairly simple: the outcome of each test (positive or negative), the state where the test took place, and the date of testing. For this map, the most recent test results were used. Rather than show the individual totals as a number on the map, the counts were assigned a color based on which range of values they fell into (1 to 100, 101 to 1,000, etc). By graduating the colors from lighter to darker, the map presents a quickly understood picture of where the virus is most active.
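Assigning a count to a color range is a small lookup, and it can be sketched in a few lines of Python. Only the first two ranges (1 to 100, 101 to 1,000) come from the map's legend as described above; the higher break points and the color names are assumptions I added for illustration.

```python
from bisect import bisect_left

# Upper bound of each range. The first two match the ranges described above;
# the remaining two are assumed for illustration.
UPPER_BOUNDS = [100, 1_000, 10_000, 100_000]

# One color per range, lightest to darkest, plus one for anything above the
# last bound. The color names are hypothetical.
COLORS = ["pale yellow", "yellow", "orange", "dark orange", "red"]


def color_for(case_count):
    """Return the color of the range that a state's case count falls into."""
    return COLORS[bisect_left(UPPER_BOUNDS, case_count)]


print(color_for(100))      # pale yellow (still in the 1 to 100 range)
print(color_for(101))      # yellow (first value in the 101 to 1,000 range)
print(color_for(250_000))  # red (above the highest bound)
```

Where the break points are placed has a large effect on how the finished map reads, so they are worth choosing with the story in mind.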

Most people know that more people live in California than live in North Dakota. It is not surprising then that the map of "Total Cases" shows the state with the higher population in the red range (the highest virus counts) and the more sparsely populated state in the yellow range (fewer cases).

The second map shows "Cases Per 100,000" and was published in the Daily Independent on April 30. It uses slightly more recent testing data and calculates the cases per capita from each state's total population (obtained from the US Census Bureau).

What the first map does not tell us is how each state is doing when you compare the number of cases to the number of people in the state. The "cases per 100,000" values represent the total cases that have occurred for each 100,000 people. This approach offers a way to compare states on an even playing field. Are they doing better or worse at controlling the virus across an equal sample of people? According to this view, California and North Dakota are doing equally well at handling the outbreak for their respective population sizes.
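To see how two states with very different totals can land in the same per capita range, here is a minimal Python sketch. The two states and all of their numbers are invented for illustration; they are not the actual figures for California, North Dakota, or any other state.

```python
def cases_per_100k(total_cases, population):
    """Total cases scaled to a common base of 100,000 people."""
    return total_cases / population * 100_000


# Hypothetical states: one large, one small (made-up numbers).
states = {
    "Large State": {"cases": 48_000, "population": 40_000_000},
    "Small State": {"cases": 900, "population": 750_000},
}

for name, figures in states.items():
    rate = cases_per_100k(figures["cases"], figures["population"])
    print(f"{name}: {figures['cases']:,} total cases, {rate:.0f} per 100,000")
```

The large state has more than fifty times as many total cases, yet both work out to 120 cases per 100,000 people, so they would share a color on the second map while sitting far apart on the first.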

Each of these graphic narratives depends on data that has been collected by government agencies, health care organizations, and other official sources. Often the data is combined in order to show a more complete description of events or to show how the virus is progressing across broader geographic extents such as states or nations. In order to make the best use of data and tell a reliable story, we need to have a clearly defined goal as well as a detailed understanding of the data itself. We need to know how it was collected, where it was collected, and, in the case of the virus, who was selected for testing.

Whether data comes from multiple sources or a single source, we always want to check if it meets some basic criteria. This "fitness for analysis" (Data Quality) is determined by asking several questions:
  • Is the data relevant to the story you wish to tell?
  • Is the data consistently measured and accurate to a similar standard?
  • Is the data complete for the appropriate time period and geographic area, without gaps?
  • Is the data clean - without duplicates and with a well documented structure, so we know what each column or field refers to?
During the pandemic, more and more data is being collected every day. The story you want to tell will help determine what sets of data are relevant. If you want to create a map from your data, you also need a geographic representation of the area with the name of the state tied to each state's boundary. This provides a way to link your tabular data to the map.
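A few of the checks above, together with the link between the table and the map, can be sketched in Python. This is only an illustration under my own assumptions: the record layout (state, date, cases), the function name, and the sample rows are all hypothetical, and a real project would more likely rely on GIS software or a data library.

```python
def check_and_link(records, boundary_names):
    """Drop duplicate rows, report gaps, and group rows by state name."""
    seen = set()
    clean = []
    duplicates = []
    for row in records:
        key = (row["state"], row["date"])
        if key in seen:
            duplicates.append(key)  # same state and date reported twice
            continue
        seen.add(key)
        clean.append(row)

    states_with_data = {row["state"] for row in clean}
    # States on the map that have no rows at all: gaps in the data.
    gaps = sorted(set(boundary_names) - states_with_data)
    # Names in the table that match no boundary, e.g. misspellings.
    unmatched = sorted(states_with_data - set(boundary_names))
    # The link itself: each boundary name points at its rows.
    linked = {name: [r for r in clean if r["state"] == name]
              for name in boundary_names}
    return linked, duplicates, gaps, unmatched


# Made-up sample rows: one duplicate and one misspelled state name.
sample = [
    {"state": "Vermont", "date": "2020-04-30", "cases": 10},
    {"state": "Vermont", "date": "2020-04-30", "cases": 10},
    {"state": "Vermomt", "date": "2020-04-30", "cases": 5},
]
linked, duplicates, gaps, unmatched = check_and_link(sample, ["Vermont", "New York"])
print(duplicates, gaps, unmatched)
```

Because the link depends on the state name matching exactly, a single misspelling shows up both as an unmatched row and, potentially, as a gap on the map.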