(Note- Originally posted here: http://digitalarchaeology.msu.edu/why-postgis/ on April 8, 2016)
Benjamin Carter, Muhlenberg College
This will be a relatively quick post. As promised in this post, I will discuss PostGIS, a relational database management system.
First, what is a relational database and why should archaeologists use them (for a fuller explanation and discussion see Keller 2009 and Jones and Hurley 2011)? Of course, many archaeologists already use these (especially at larger contract archaeology firms), too many of us avoid them. Indeed, even in graduate school, I never discussed data organization (presumably this is common). However, the way that you organize your data can reduce time spent data wrangling and promote richer analysis. It also promotes and limits certain types of analysis.
Let’s contrast a relational database to the “flat” file (often in the form of an Excel spreadsheet) that is all too common in archaeology. Anyone who has used a spreadsheet knows that they are incredibly frustrating to use: Have you ever sorted by a column and then found that you didn’t highlight all of the columns. Now, one column is disconnected from its rightful data. No problem, right? That’s why there is an undo button. What if you accidentally saved it? No problem, you have that archived copy, right? Where was it?
Analyses of data in flat files are constrained by the contents of the spreadsheet. Even if you have multiple sheets in separate tabs (e.g., one for ceramics and one for lithics), they are not linked (yes, you can link through formulas, etc., but that is laborious as well). What if you need to input a new set of information? Let’s say you have a context code that includes site, unit and level, but you want to analyze by unit, you would need to create a new column and either manually enter the unit or digitally separate the unit from your context code. All of this takes time, creates poorly organized files that are difficult to reuse (frequently because data is disconnected from its metadata). Similarly, these frequently lack the appropriate metadata that allows them to be shared and archived. They are largely designed for and with the interests in mid of a single researcher (or perhaps a small team). Frequently, specialists have disparate spreadsheets that cannot “talk” to each other.
While no database is perfect, relational databases can alleviate many of these issues. The essential concepts behind a database are to disaggregate data, limit busy work and standardize your data (Note that this never means that you would lose the qualitative narrative). This reduces time and increases quality control. To conceptualize a relational database, think of multiple tables linked together. For example, I may have an excavation table with a wide array of data, including a column with site number. Each “record” (i.e., row) includes all the nitty-gritty information from a single layer from a single unit. If I use the trinomial system, there are three pieces of information buried in a single number/ column (state, county, site number). However, if I wanted to disaggregate these pieces of information in a spreadsheet, I would need to make new columns and do a great deal of copying and pasting all the while risking separating a piece of data from its original record. In a relational database, the original table can be easily connected to a small table that includes one column each for trinomial number, state, county and site number, but only ONE record for each unique trinomial. Then you create a “relationship” between the trinomial column in the original table and the trinomial column in the new table. In other words, each record (row in a table) of your original data is directly linked to state, county and site number with no insertion of columns or copying and pasting of data. Imagine your original table includes three sites in two counties and a total of 1000 records (levels of units). To associate state, county and site number with the trinomial, you would need to insert three columns and copy and paste data into the right cells for all 1000 records (that is, you have created 3000 additional pieces of data; I hope you didn’t waste field time writing your state on each form! With a relational database, you only need to create three records (12 pieces of data). However, because of the relationship created, you have actually created the same 3000 data points. Sounds a bit more efficient, no?
I recently worked with census data from North Atlantic Population Project . Much of the data is coded. The downloaded data includes numbers that mean nothing to me, but those codes can be linked to text; a 336 in the IND50US column (Industry categories in 1950) means “Blast furnaces, steel works, and rolling mills”. The original data table is linked to a small table (indeed many of them) that convert apparently meaningless codes into understandable text. This means that I entered the words “Blast furnaces, steel works, and rolling mills” only once, but they are now associated with all 600 records in the original table from NAPP that included the “336” code in the IND50US column.
Why Post GIS? PostGIS is simply a spatial extension of PostgreSQL, an “object relational database management system.” That is, it is simply a language for creating and organizing relational databases. The main reason for choosing this system is that it is incredibly popular, widely used in industry and academia. It is open source and works on all computer platforms; it is now the native on Mac OSX servers. It can be stored a server or on your computer. I prefer a graphical user interface and pgadmin, the “native” client for accessing and editing your database, is not intuitive to me. However, I am in the process of switching word processing and spreadsheets to the open source LibreOffice Suite. LibreOffice Base, their answer to MS Access/ Filmaker Pro has native support for PostgreSQL. Other database management programs, such as the two mentioned in the previous sentence, also have native support for PostgreSQL (i.e., you do not need to use LibreOffice). Similarly, PostgreSQL/ PostGIS is supported by GRASS/QGIS, an open source GIS programs (This is a huge plus. Most data in GIS programs are in the flat files ridiculed above). While PostgreSQL/ PostGIS is certainly not the only option available to do these things, it appeared to be the most widely supported.
Finally, I will openly admit that I have only begun to work with PostGIS/ LibreOffice Base and I am having some difficulties. I will refrain from being too critical yet because it may simply be part of the learning curve.
Kobo Toolbox ( a field data collection web app discussed here) yields tables that
To open a can of worms that I am still struggling with, I will suggest that relational databases will allow field data to be easily converted into (or perhaps collected as) linked open data.