In many situations, the main problem faced by information managers is not a lack of data, but the fact that data is stored in too many conflicting formats and full of inconsistencies and errors. This week I discovered a few free Google tools that can help to turn messy data into clean data.
1. Google Fusion Tables
Google Fusion Tables allows you to merge your own data with other tables or spreadsheets that are available on the web or that you have on your computer.
The reverse is also true: you can select parts of an existing Fusion Table and delegate these parts to somebody else. If you need to collaborate on data, this is extremely useful.
Merging and visualizing
One of the most useful ways to use Google Fusion tables is probably to merge tables containing programme data (number of tents distributed, percentage of people infected by HIV) with tables containing geographical boundaries. The Fusion Table tutorial shows you how to create intensity maps for different kinds of data from a community survey. The ease and speed with which this is possible is really impressive! You can also create various types of charts.
In addition, the Fusion table is automatically updated every time the underlying tables are being updated.
Divide and collaborate
Fusion Tables also allow you to go the opposite way: you can create a selection of a part of your table, called a “view”, and allow others to update this part of the table. Of course, anytime a view is updated, the whole table is being updated as well.
This is an extremely powerful feature: You can create a view that only contains data from a specific county. Or you can create a view showing only HIV data from a specific county. You can then assign responsibility for that data to a staff member who will only see his “view”. This way, complex tables appear less intimidating and there is no risk that the staff member will accidentally corrupt the data by working in a part of the table that he is not supposed to work in.
From what I can see, there is no limit to the number of views you can create, meaning you could simply create a column with the names of your data-entry people or team names and then create a dynamic view just for them. Obviously, this is also very useful when crowdsourcing data collection.
2. Google Refine
Any kind of data-output or visualization engine depends on the data being accurate and consistent. Unfortunately, that is rarely the case in humanitarian contexts where data is being entered by many people of different skill levels, frequently even from different organizations and in a great hurry.
As a result, fixing spelling-errors and identifying data that simply doesn’t make sense, takes up a lot of time for many information managers. Google Refine is a free software that you can download and install locally (and which doesn’t require admin rights) and which helps you to clean up your data, be it in Excel, XML or on a GoogleDoc.
One of the most easily accessible and impressive features is Google Refine’s ability to cluster similar-looking data, allowing you to harmonize them with just a click. This is particularly useful for tables that contain a lot of text strings that might have been spelled differently in different cells. For example “Disaster preparedness” “Disaster-preparedness” and “DP”. With Google Refine you can quickly find groups of data that should belong together and make them consistent.
Google refine is very powerful, but I found that I quickly reached the end of my ability to use it since a lot of the advanced functions require you to be able to write JSON or regular expressions. Fortunately, most of things I need it to do are rather basic jobs and don’t require these skills. I also found that the Google Refine forums contain a lot of code snippets you can re-use. And of course there is an undo-button.
3. My experiment: Presidential election results in Liberia, by county
In this visualization I merged election results with shape files. I then filtered the data for the name of a county before displaying everything on a map.
It’s always easy to do amazing things with the sample data that is being provided as part of tutorials.
I was curious to see whether I could do the same with data that was bound to be more messy and fragmented. So I decided to try to show the results from the final round of last year’s presidential election in Liberia on a map. To make a long story short: it worked, but it was not easy.
While I was able to find a table with shape files for all counties in Liberia easily, merging them with the data of the polling stations and the results proved more difficult:
- The names of counties had been spelled differently and had to be harmonized with Refine
- The election results themselves were not available in spreadsheet-format and I had to import them from a web-page into a Google spreadsheet with “ImportHtml” before exporting them to Google Refine for transposition.
- I had some problems deleting rows.
- Some things like a simple “search and replace” were much easier to do in Excel than in Google Refine.
- In Fusion Tables, I wasn’t able to overlay the locations of the polling stations on the county shape files.
- Updates I made in the original spreadsheets were not automatically visible on the merged table and I wasn’t able to find out what I had done wrong. Since this is one of the very attractive feature of Fusion Tables, this was quite disappointing.
- Originally I had planned to use Fusion Tables and/or Google Chart Tools to show the percentages of votes for each county in a pie chart, but in the end I decided that it was enough for me to know that it was possible.
All in all it was a good exercise since it showed that the constraints of the tools as well as the strengths. I’ll definitely continue to use both tools.
Google Refine and Google Fusion Tables are very versatile and powerful tools that can make data capture and analysis much easier. But like any tool, you need to learn how to use it. While neither tool is incredibly difficult to use, they are also not beginner-friendly and require that you are willing to dig through documentation, tutorials and forums to get the most out of them.
The big disadvantage of Fusion Tables is that you need to have a stable internet connection to use it, which will not be the case in many contexts. Google Refine however works offline and even of a USB stick in case you can’t get your IT department’s approval.
Have you used Google Refine or Fusion Tables in a humanitarian or development context? If so, please share your projects in the comments section below!