# Go Deep: Using DataThief to Rebuild Misleading Figures

May 8, 2013

By Jon Fisher, spatial scientist

Have you ever looked at a difficult-to-read graph and wished there were a way to figure out the precise values of the data?

Or maybe you wanted to extract the data so that you could do your own analysis (or at least produce a clearer graph)? You’re in luck!

DataThief is a program that lets you take an image of a graph or chart and extract the underlying values.

To show how useful this can be, I’m starting with a misleading graph I recently found and recreating it to be more informative and honest. (I’m using a figure unrelated to conservation for demonstration purposes.)

This graph — by having an absolute value on one y-axis, and a percentage on the other y-axis — creates the false impression that unemployment and lack of insurance are both sharply increasing (and that the rate of unemployment has surpassed the rate of lacking insurance). I used DataThief to extract the underlying data (see my blog Science Jon for detailed instructions on how I did this), which looks something like this:

In Excel I multiplied the values for “Uninsured Americans” by a million to get the true number. I then got some estimates of US population for January 2008 and 2009, and used those to calculate an average growth per month, the baseline population in March 2007, and the population for each of our data points.

This let me calculate the percentage of Americans who are uninsured, so that we can compare it directly to the percentage of Americans who are unemployed.
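For readers who would rather script this step than use Excel, here is a minimal Python sketch of the same arithmetic: scale the extracted values from millions to people, estimate the population for each month by assuming constant growth between two reference estimates, and divide. The function names and every number below are illustrative placeholders, not the values extracted from the figure or the population estimates used for this post.

```python
from datetime import date


def months_between(start, end):
    """Whole months from start to end (negative if end is earlier)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def population_on(when, ref_a, pop_a, ref_b, pop_b):
    """Estimate population at `when`, assuming constant monthly growth
    between two reference estimates (pop_a at ref_a, pop_b at ref_b)."""
    growth_per_month = (pop_b - pop_a) / months_between(ref_a, ref_b)
    return pop_a + growth_per_month * months_between(ref_a, when)


def percent_uninsured(uninsured_millions, population):
    """Convert a count given in millions into a percentage of the population."""
    return 100.0 * uninsured_millions * 1_000_000 / population


# Placeholder inputs (assumptions, NOT the real extracted data).
JAN_2008, POP_JAN_2008 = date(2008, 1, 1), 303_000_000
JAN_2009, POP_JAN_2009 = date(2009, 1, 1), 306_000_000
extracted_points = [
    (date(2007, 3, 1), 45.0),  # (month, uninsured in millions)
    (date(2008, 3, 1), 46.5),
]

for month, uninsured_millions in extracted_points:
    population = population_on(
        month, JAN_2008, POP_JAN_2008, JAN_2009, POP_JAN_2009
    )
    share = percent_uninsured(uninsured_millions, population)
    print(f"{month:%b %Y}: {share:.1f}% uninsured")
```

Because the growth is applied as a straight line, the same function works for months before the first reference date (such as the March 2007 baseline) as well as for months between the two estimates.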

A graph of the resulting data reveals a different pattern from what we saw before: lack of insurance is increasing very slightly (from ~15.2 percent to ~16.1 percent) as unemployment increases more rapidly (from ~4.4 percent to 7.6 percent). Note that the unemployment percentage never surpasses the percentage of people who are uninsured (contrary to how this appeared in the original graph):

Comparing side-by-side:

There are two important considerations before using this software. First, these values will only be approximate, so if possible it’s always better to get the underlying data from the person who created the first figure. Second, it is possible that the data you are extracting is copyrighted, and that your reuse of their data may violate the data license. Use at your own risk!

Note that despite the name, DataThief is shareware; if you find it useful, please put your thieving on hold long enough to buy a $25 license.

Opinions expressed on Cool Green Science and in any corresponding comments are the personal opinions of the original authors and do not necessarily reflect the views of The Nature Conservancy.

Photo Credit: Flickr users Phil and Pam under a Creative Commons license.

Jon Fisher is a senior conservation scientist for the new Center for Sustainability Science at The Nature Conservancy. He is leading efforts to put rigorous science front and center in our sustainable agriculture work, and finding ways to improve sustainability through corporate practices and public policy.