Why PostGIS?

(Note- Originally posted here: http://digitalarchaeology.msu.edu/why-postgis/ on April 8, 2016)

Benjamin Carter, Muhlenberg College

This will be a relatively quick post. As promised in a previous post, I will discuss PostGIS, a spatial extension to the PostgreSQL relational database management system.

First, what is a relational database and why should archaeologists use one (for a fuller explanation and discussion, see Keller 2009 and Jones and Hurley 2011)? Although many archaeologists already use relational databases (especially at larger contract archaeology firms), too many of us avoid them. Indeed, even in graduate school, I never discussed data organization (presumably this is a common experience). However, the way that you organize your data can reduce time spent wrangling it and support richer analysis; it also both enables and limits certain types of analysis.

Let’s contrast a relational database with the “flat” file (often in the form of an Excel spreadsheet) that is all too common in archaeology. Anyone who has used a spreadsheet knows how frustrating they can be: Have you ever sorted by a column and then found that you hadn’t highlighted all of the columns? Now one column is disconnected from its rightful data. No problem, right? That’s why there is an undo button. What if you accidentally saved it? No problem, you have that archived copy, right? Where was it?

Analyses of data in flat files are constrained by the contents of the spreadsheet. Even if you have multiple sheets in separate tabs (e.g., one for ceramics and one for lithics), they are not linked (yes, you can link them through formulas, etc., but that is laborious as well). What if you need to derive a new piece of information? Say you have a context code that combines site, unit and level, but you want to analyze by unit; you would need to create a new column and either manually enter the unit or digitally separate it from the context code. All of this takes time and creates poorly organized files that are difficult to reuse (frequently because the data are disconnected from their metadata). Such files often lack the metadata that would allow them to be shared and archived; they are largely designed for, and with the interests in mind of, a single researcher (or perhaps a small team). Frequently, specialists have disparate spreadsheets that cannot “talk” to each other.
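To make the “digitally separate” step concrete, here is a minimal sketch in Python. The context code format (site, unit and level joined by hyphens) and the codes themselves are hypothetical; the point is only that a derived column like “unit” can be computed rather than retyped.

```python
# Hypothetical context code format: "SITE-UNIT-LEVEL", e.g. "36LE0015-N5E10-03".
context_codes = ["36LE0015-N5E10-03", "36LE0015-N5E10-04", "36LE0015-S2W4-01"]

rows = []
for code in context_codes:
    # Split the combined code into its three component pieces of information.
    site, unit, level = code.split("-")
    rows.append({"context": code, "site": site, "unit": unit, "level": level})

# Records can now be grouped by unit without any manual re-entry.
units = {row["unit"] for row in rows}
print(sorted(units))  # → ['N5E10', 'S2W4']
```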

While no database is perfect, relational databases can alleviate many of these issues. The essential ideas behind a database are to disaggregate data, limit busy work and standardize your data (note that this never means losing the qualitative narrative). This saves time and improves quality control. To conceptualize a relational database, think of multiple tables linked together. For example, I may have an excavation table with a wide array of data, including a column with the site number. Each “record” (i.e., row) includes all the nitty-gritty information from a single layer of a single unit. If I use the trinomial system, there are three pieces of information buried in a single number/column (state, county, site number). If I wanted to disaggregate these pieces of information in a spreadsheet, I would need to make new columns and do a great deal of copying and pasting, all the while risking separating a piece of data from its original record. In a relational database, the original table can be easily connected to a small table that includes one column each for trinomial, state, county and site number, but only ONE record for each unique trinomial. Then you create a “relationship” between the trinomial column in the original table and the trinomial column in the new table. In other words, each record (row) of your original data is directly linked to state, county and site number with no insertion of columns or copying and pasting of data. Imagine your original table includes three sites in two counties and a total of 1000 records (levels of units). To associate state, county and site number with the trinomial in a spreadsheet, you would need to insert three columns and copy and paste data into the right cells for all 1000 records (that is, you would create 3000 additional pieces of data; I hope you didn’t waste field time writing your state on each form!).

With a relational database, you only need to create three records (12 pieces of data). However, because of the relationship, you have effectively created the same 3000 data points. Sounds a bit more efficient, no?
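The trinomial lookup described above can be sketched in a few lines of SQL. This example uses SQLite via Python so it is self-contained, but the same idea applies directly in PostgreSQL/PostGIS; the trinomials, counties and records are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Small lookup table: ONE record per unique trinomial.
con.execute("""CREATE TABLE sites (
    trinomial TEXT PRIMARY KEY, state TEXT, county TEXT, site_num INTEGER)""")
# Main excavation table: many records, each carrying only the trinomial.
con.execute("""CREATE TABLE excavation (
    record_id INTEGER PRIMARY KEY,
    trinomial TEXT REFERENCES sites(trinomial),
    unit TEXT, level INTEGER)""")

# Three lookup records (12 pieces of data) stand in for thousands of
# copied-and-pasted cells. All values here are hypothetical.
con.executemany("INSERT INTO sites VALUES (?,?,?,?)", [
    ("36LE0015", "PA", "Lehigh", 15),
    ("36NH0022", "PA", "Northampton", 22),
    ("36NH0031", "PA", "Northampton", 31)])
con.executemany("INSERT INTO excavation VALUES (?,?,?,?)", [
    (1, "36LE0015", "N5E10", 1),
    (2, "36NH0022", "N5E10", 2),
    (3, "36NH0031", "S2W4", 1)])

# The "relationship" is expressed as a JOIN at query time: every excavation
# record is linked to its state, county and site number.
for row in con.execute("""SELECT e.record_id, s.state, s.county, s.site_num
                          FROM excavation e JOIN sites s USING (trinomial)"""):
    print(row)
```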

I recently worked with census data from the North Atlantic Population Project (NAPP). Much of the data is coded: the downloaded data includes numbers that mean nothing to me, but those codes can be linked to text. A 336 in the IND50US column (industry categories in 1950) means “Blast furnaces, steel works, and rolling mills”. The original data table is linked to a small table (indeed, many of them) that converts apparently meaningless codes into understandable text. This means that I entered the words “Blast furnaces, steel works, and rolling mills” only once, but they are now associated with all 600 records in the original NAPP table that included the “336” code in the IND50US column.
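The decoding step works the same way in miniature. In this sketch, only the 336 code and its label are taken from the text above; the person IDs and CSV layout are invented stand-ins for the NAPP download.

```python
import csv, io

# Lookup table entered once: code -> human-readable label.
# Only the 336 entry is real (quoted in the text above).
ind50us_labels = {336: "Blast furnaces, steel works, and rolling mills"}

# A made-up fragment of a NAPP-style export.
napp_csv = io.StringIO("person_id,IND50US\n1,336\n2,336\n3,336\n")

decoded = []
for rec in csv.DictReader(napp_csv):
    code = int(rec["IND50US"])
    # Every record carrying the code picks up the label automatically.
    decoded.append((rec["person_id"], ind50us_labels.get(code, "unknown")))
print(decoded[0])
```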

Why PostGIS? PostGIS is simply a spatial extension of PostgreSQL, an “object-relational database management system,” that is, a system for creating and organizing relational databases. The main reason for choosing it is that it is incredibly popular and widely used in both industry and academia. It is open source and works on all major computer platforms; it is even the default database on Mac OS X Server. It can be stored on a server or on your own computer. I prefer a graphical user interface, and pgAdmin, the “native” client for accessing and editing your database, is not intuitive to me. However, I am in the process of switching my word processing and spreadsheets to the open source LibreOffice suite, and LibreOffice Base, its answer to MS Access/FileMaker Pro, has native support for PostgreSQL. Other database management programs, including the two just mentioned, also have native support for PostgreSQL (i.e., you do not need to use LibreOffice). Similarly, PostgreSQL/PostGIS is supported by GRASS and QGIS, both open source GIS programs (this is a huge plus: most data in GIS programs are stored in the flat files ridiculed above). While PostgreSQL/PostGIS is certainly not the only option available to do these things, it appears to be the most widely supported.

Finally, I will openly admit that I have only begun to work with PostGIS/LibreOffice Base and I am having some difficulties. I will refrain from being too critical yet, because it may simply be part of the learning curve.

Kobo Toolbox (a field data collection web app discussed here) yields tables that can be exported as CSV files and imported into such a database.

To open a can of worms that I am still struggling with, I will suggest that relational databases will allow field data to be easily converted into (or perhaps collected as) linked open data.

Inexpensive, easy-to-use data collection: Kobo Toolbox as an example.

(Note- Originally posted here: http://digitalarchaeology.msu.edu/inexpensive-easy-to-use-data-collection-kobo-toolbox-as-an-example/ on Feb 29, 2016.)

As promised in my previous post, this post is about using Kobo Toolbox as a data collection tool for archaeology. I also address important issues that have materialized as I design and use forms in Kobo Toolbox.

What is Kobo Toolbox? Kobo Toolbox is an open source suite of tools for the collection and analysis of humanitarian data, especially in remote places or after a disaster. It comprises three different tools: one for form design, one for data collection and another for data analysis (the latter is rather simple and won’t be discussed here). Kobo Toolbox can be used from their website or installed on a local server. It was designed by a fairly large collective organized by the Harvard Humanitarian Initiative and supported by some heavy hitters, including the United Nations, the International Rescue Committee and the US government (through USAID). It is very well supported, has a very active and dynamic community of users and collaborators, and appears to be sustainable. Similarly, the data collection form is based upon Enketo, another open source, well-supported and actively growing tool.

Why Kobo Toolbox? First and most importantly, Kobo Toolbox is device independent (though there is an Android app as well). Instead of operating via an app, it runs as a web form that can be accessed through a browser on any device that can run one (the browser does need to be current enough to handle HTML5; Google Chrome seems to be the best at this point; see this). With the expansion of smartphone use throughout the world, this is incredibly significant: anyone with a device with a browser can collect data. Personally, in trying to devise ways to collect archaeological data digitally, I purchased software and hardware into the $1000s, and I was doing it “on the cheap”. As my two iPads aged substantially between field seasons, I became increasingly frustrated by the expectation that I would need to purchase new iPads in the near future; I was stuck on the technology treadmill. The system was unsustainable. With Kobo Toolbox, an inexpensive smartphone (yes, they do exist) is all one needs. Data collection becomes much less restricted, and the opportunities for collaboration with interested communities, especially in remote places, are much, much greater.

Second, web forms can be used OFF-LINE! That’s right, a WEB form that can be used OFF-LINE. This is essential for nearly all situations in archaeology. Even if you have access to the internet (through nearby WiFi or a cellular data connection), it is likely that your connection will be cut at some point, usually the most inconvenient one, leaving you unable to collect data. Not with Kobo Toolbox. You can continue to collect data; it will upload and synchronize once a data connection has been reestablished. How much data can be stored offline depends upon the browser and its settings; here are the details for Google Chrome. Important update: questions are stored in question banks that can now be shared!

Third, Kobo Toolbox includes an intuitive, web-based form design tool. Most importantly, this means that the “learning curve” or threshold is very low: I was able to create basic forms in minutes the very first time I tried the tool. However, for more advanced data collection, the design tool can be nearly as sophisticated as one desires. Data types are varied and include everything from alphanumeric to GPS location to images and audio. (There are some qualifiers here; for example, barcodes can be collected using the Android app, but not the web form.) Data can be collected via radio buttons, check boxes, drop-down lists and many other ways. Options include skip logic (i.e., showing certain questions based upon the response to previous questions) and validation criteria, both of which increase ease of use and the reliability of the data. Form construction can be even more complex if XLSForms (http://xlsform.org/), based upon the open standard XForms (https://www.w3.org/MarkUp/Forms/), are used. XLSForms can be designed using the ubiquitous Microsoft Excel (LibreOffice Calc, https://www.libreoffice.org/discover/calc/, could be used as well, with the file saved as .xls). The tool, therefore, is incredibly easy to use from the very beginning, but can be as complex as the user demands.
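To give a flavor of what an XLSForm looks like under the hood, here is a tiny “survey” and “choices” sheet sketched as rows (a real XLSForm keeps these as two tabs of a spreadsheet). The column names and question types (`geopoint`, `select_one`, `integer`) follow the XLSForm standard; the specific questions and plant list are invented for this sketch.

```python
import csv, io

# Minimal XLSForm "survey" sheet: one row per question.
survey = [
    ("type", "name", "label"),
    ("geopoint", "location", "Record the GPS location"),
    ("select_one plant_list", "plant", "Which plant is present?"),
    ("integer", "count", "How many individuals?"),
]
# Matching "choices" sheet: options for the select_one question.
choices = [
    ("list_name", "name", "label"),
    ("plant_list", "sweet_birch", "Sweet birch"),
    ("plant_list", "sassafras", "Sassafras"),
]

# Write the survey sheet out as CSV just to show its shape.
buf = io.StringIO()
csv.writer(buf).writerows(survey)
print(buf.getvalue())
```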

[Image: A portion of the data collection form in Kobo Toolbox]

How about a quick example? In the fall of 2015, I taught a course entitled “Historical Ecology of the Lehigh Gap” at Muhlenberg College, which was “clustered” with another course, “Degradation and Restoration” taught by my colleague Kimberly Heiman. As a component of this course, we mapped plant distributions along a transect through the Lehigh Gap Nature Center. The Lehigh Gap is a Superfund site contaminated by heavy metals from a zinc factory. The purpose of the transect was to identify different plants that represented degraded or restored communities. I created the form (see example) in an evening and shared the link with the students.


In the field, students pulled up the link on their smartphones. Although there were some hiccups, the form worked like a charm. We had one phone that could not use the forms (we still do not know why), some students found that one browser was preferable to others (Chrome seemed to be the best), and we had significant issues collecting photos. With a few exceptions, GPS coordinates were within their known error (c. 10 meters). Those points that fell outside the expected 10 meters appear to have been random; that is, there was no apparent pattern suggesting that particular phone brands, individual phones, or users were less accurate than others. The data were exported (via CSV) and imported into CartoDB, which students used to analyze the spatial distribution of plants. The map below shows one day of collection and only one plant, sweet birch, which in this case actually shows the effectiveness of prescribed burns on the northeastern portion of the transect (birch tend to pull heavy metals from the soil and reintroduce them into the ecosystem; controlled burns limit the growth of birch).
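A check like the 10-meter accuracy screen described above is easy to script. This sketch flags any recorded fix more than 10 m from its expected transect point using the standard haversine great-circle formula; the coordinates are invented, not from the Lehigh Gap data.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius in meters

expected = (40.7812, -75.6093)  # hypothetical transect point
recorded = [
    (40.78121, -75.60931),  # ~1-2 m away: within expected GPS error
    (40.78150, -75.60930),  # ~33 m away: outside expected error
]

# Flag fixes that fall outside the c. 10 m known error.
flags = [haversine_m(*expected, *pt) > 10 for pt in recorded]
print(flags)
```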

[Image: A sample of data collected in Kobo Toolbox being displayed in CartoDB]

This was a particularly effective exercise. We were able to collect approximately 75 data points on 10 plants at 10-meter intervals (i.e., a distance of about 3/4 of a kilometer) in approximately 2 hours with 17 students. Data collection required no special tools, yet students were able to collect and analyze a rich data set with relatively simple, intuitive ones.

OK, it may appear that I am trying to “sell” Kobo Toolbox. However, let’s face it, there are some drawbacks.

First, data entry via a small screen virtually requires that typing be minimized; drop-down boxes, radio buttons, etc. are far superior. This means that it is preferable to design these types of data entry into the form, which has the positive effect of standardizing data, but can also reduce the recording of important narrative data. Really, this is not a drawback of Kobo Toolbox but of the device. Narrative data would be best collected via audio or through voice-recognition text entry, but neither of these is ideal either.

Second, because Kobo Toolbox is designed around location collected through the device’s GPS, it is wonderful for survey, but the location aspect is less useful for excavation. If a sub-centimeter RTK GPS were connected to the device (via Bluetooth?), the locational aspect of the form would be much more useful for archaeology, but that requires serious expense (or, perhaps not). The device could also be connected via Bluetooth to a total station, etc., for increased locational control. Excavation data are likely better collected via a tablet rather than a phone.

Third, I initially wanted a tool that could be connected directly to a relational database; at this point, this cannot be easily done with Kobo Toolbox. However, because the Kobo export is a CSV file, the data can be synchronized with any database, relational or otherwise, with relatively little effort. I am now convinced that this sort of compartmentalization is actually preferable. With such a format, the user can decide how to store, analyze and archive their data. While I now prefer a PostGIS database accessed through LibreOffice Base, others may prefer different database types or GUIs. Compartmentalization means that no one tool is reliant upon any other, but it does mean that standardized (and preferably open) data formats are required to move between tools. It also means that forms must be designed with the database in mind and, vice versa, that the database must be designed with the intention that all data will arrive in CSV files (alternatives include KML and XLS; images, audio, etc. must be entered into the database manually).
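The CSV-to-database step amounts to a short load script. This sketch uses SQLite so it is self-contained; the same pattern works against PostGIS through an adapter such as psycopg2. The column names and observation data imitate a Kobo-style export but are invented.

```python
import csv, io, sqlite3

# Stand-in for a Kobo CSV export (columns and values are hypothetical).
export = io.StringIO(
    "point_id,plant,latitude,longitude\n"
    "1,sweet_birch,40.7812,-75.6093\n"
    "2,sassafras,40.7813,-75.6091\n")

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observations (
    point_id INTEGER PRIMARY KEY, plant TEXT,
    latitude REAL, longitude REAL)""")

# Load each CSV row into the database, converting types as we go.
with con:  # commit on success
    con.executemany(
        "INSERT INTO observations VALUES (?,?,?,?)",
        ((int(r["point_id"]), r["plant"],
          float(r["latitude"]), float(r["longitude"]))
         for r in csv.DictReader(export)))

print(con.execute("SELECT COUNT(*) FROM observations").fetchone()[0])
```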

Fourth, I also initially wanted the ability to modify forms in the field. I was able to do this with my initial FileMaker Pro database, but only because I carried the server into the field, and I adjusted forms largely during field testing. Adjusting forms in Kobo Toolbox once they have been deployed is difficult: it requires an internet connection, and new “projects” must be created with new URLs, etc. (side note: you can install Kobo Toolbox on your own server and take it into the field). However, once past the testing phase this may not be important; once data collection has begun in earnest, changing forms is usually a poor idea because it reduces consistency and makes synchronization more difficult.

Although Kobo Toolbox has some important limitations, I now consider these limitations to be beneficial. Open source tools that do one thing extremely well and that use open standards and open file formats are preferable because their output is useful in a wide variety of other tools (and in ways that I cannot even imagine).

Robust, Open, Flexible and Offline Digital Data Collection in the Field.

(Note- originally published here: http://digitalarchaeology.msu.edu/robust-open-flexible-and-offline-digital-data-collection-in-the-field/ on Sept. 25, 2015).

First, a little background… My name is Ben Carter and I am currently an assistant professor of anthropology at Muhlenberg College in Allentown, PA. However, I came to the project described below long before I was lucky enough to get my current job. My essential perspective was forged in the fires of many years in “in-between” states: as a graduate student, an adjunct and a non-tenure-track faculty member. At the same time, I was trying to run a field school, because I believe in the fundamental pedagogical value of field school both for those going into archaeology and for everyone else. Because I operated in these in-between states, I often faced severe budget and time constraints. Initially, I had hoped to employ digital tools in the field in order to save time, allowing me to spend more time with my students discussing important anthropological issues and less time on data entry. Although I have field-tested a range of options, none have done what I want them to do (and most cost me MORE time). Indeed, my students will tell you that these attempts, while perhaps educational, were quite frustrating. Therefore, broadly speaking, my goal has been to develop a field data collection system that is inexpensive, easy to deploy and use, and that yields data in a format that can be openly shared. These characteristics would not only ensure my own ease of use, but enable other archaeologists to give it a whirl. The project should yield a product useful to graduate students, faculty with severely limited resources, community projects, small CRM firms and anyone else.

Beyond my own experience, archaeologists have been shifting towards collecting data digitally in the field for some time. These systems can be as simple as an app on a smartphone or as complex as a field-deployed WiFi network, including a server (laptop) and multiple data collection devices (such as iPads). We have done this largely to reduce both the time spent converting paper-based data to digital form and the errors introduced in that process. A digital collection system is also simply the first stage of a digital pipeline that funnels data to online publishing and archival platforms. However, many of the systems developed within archaeology have been hampered by five major concerns:

First, many employ proprietary software (FileMaker Pro is one of the best known) and/or hardware that limit the integration, analysis and open dissemination of data and restrict flexibility. Moreover, once entangled in proprietary software and hardware, it is extremely difficult to leave, at least partially because these systems have a steep learning curve.

Second, the cost of proprietary software and the devices needed to deploy it limits its applicability to larger, well-funded academic projects or CRM companies. Smaller groups and individuals, including small academic teams, faculty at less-well-funded schools, graduate students, and small CRM firms, cannot sustainably afford these systems, even if they may be able to afford a one-time purchase.

Third, while proprietary software often supplies a wide array of options that make it attractive, it can provide neither all present requirements nor those that may become necessary in the future. An open, community-based tool can be relatively quickly repurposed and modified to account for nearly any contingency, present or future. All this should be possible with minimal time investment and with assistance from an open source community.

Fourth, most of these data collection systems require either a local wireless network or a connection to a cellular network so that the collection tool, such as a tablet, can communicate with the database on a server. While this is logistically feasible for many (especially small, localized) excavations, it is not practical for field survey, where a cellular connection would be required. The tool we hope to design can be used offline and then synchronized with a database once a connection is established.

Fifth, while there are some serviceable open source projects, these are platform specific (e.g., FAIMS is designed for Android). We prefer a browser-based system in order to avoid the problems associated with restricting tools to a particular platform. We would like to build a field data collection system that is open, inexpensive, highly flexible, browser-based, and usable offline. We believe that this exists in a set of open source software known as wq.io.

The goal of my project- or rather OUR project for I am working with a fellow institute participant who will have a separate post and other partners are being courted- is to develop a tool that remedies these issues.  We hope that the end product of this project is a tool that:
1. Is browser-based (and, therefore, device independent)
2. Can be used offline (now possible with HTML5)
3. Employs open source code
4. Can be connected to a database of choice (proprietary or open source)
5. Is inexpensive and simple (at least on the user end, once it has been built)
6. Promotes the open use of resulting data

We realize that this is a bit of a pipe dream, for we will never be able to do all of this in one year; we plan to have a working prototype. We are feverishly working out the details; more on that in coming posts.