This notebook stems from 'A course on Spatial Thinking and Methods for Archaeologists' given to Honours students in Archaeology at the University of Glasgow by Dr Rachel Opitz. The course enabled students to think about how to use spatial analysis to analyse archaeological data.
This notebook has two goals:
By reading this notebook, we want you to be able to create your own notebooks and experiment with real archaeological data.
More specifically, the learning outcomes are to:
This means thinking critically about space and place in archaeology to answer questions such as:
Why do we care about the 'where' in archaeology? What is the relationship between space and place? How do you know where something is located? How do we define location? How do we recognise a spatial relationship? Why are spatial patterns important?
Science is based on gathering evidence and interpreting that evidence to draw logical conclusions. Data science is a branch of computer science dealing with capturing, processing, and analysing data to gain new insights about the systems being studied (ref. to open source).
From an archaeology point of view, this is all about learning how to:
This data has to be digital (or made digital) and spatial!
Spatial data for ARCHAEOLOGY!
Which tools and which platform?
Because data science is a broad field that's expanding rapidly, we need techniques and methods to systematically question and analyse data. Data can be untangled and analysed with multiple tools:
Adapted from Modern Data Scientist skill set by Krzysztof Zawadzki
To capture, observe, collect, store, and manage geographic data, and to analyse and visualise this spatial data, you need a system that deals with geographic information, called a GIS (Longley et al., 2001). A GIS comprises hardware (computer, mobile device, GPS), software (ArcGIS, QGIS or online mapping) and data (geographic information). The process of developing software is called programming, where a program instructs the computer to accomplish a task based on its orders (Yang, 2017). There are many different levels of programming: CPU instructions for hardware; C for developing software; C++ for program organisation; Java and JavaScript for manipulating objects in a web browser; and R and Python, which are both object-oriented, interactive programming languages for data science. It is not the intention of this book to teach you to develop programs; rather, it uses practical exercises to introduce you gradually to programming so you can question and analyse real archaeological data using one of these languages: Python.
Python was originally developed by a Dutch programmer, Guido van Rossum, in 1990. Van Rossum was reportedly a fan of the British comedy series Monty Python's Flying Circus, from which he borrowed the name 'Python' when he developed the open-source programming language (and its non-profit institution, the Python Software Foundation). Python is described as a 'fully object-oriented language, simple yet elegant, and stable and mature' (Yang, 2017).
These notebooks are your first step in experimenting with spatial data using a programming language: Python.
There are several steps to learning Python, and to ease your way into these practical exercises a few fundamental concepts are introduced succinctly first:
Lock and Pouncett (2017) remind us "that thinking spatially entails knowing about:
Spatial analysis will help you to formulate problems, think creatively about solutions, and express a solution clearly and accurately, as schematised in the figure below:
Spatial analysis
A ‘feature-oriented’ approach. Spatial data stored as points (0-dimensional), lines (1-dimensional), or areas (2-dimensional) is vector data:
Data stored as vector | Data stored as raster
---|---
Vector files can be Shapefiles, WKT, GeoJSON | Raster files can be GeoTIFF, HDF, BSQ, ASC, etc.
A database is an integrated set of attributes on a particular subject (a structured collection of data files), and
a geographic database / geospatial database / geodatabase is a set of attributes on a particular subject for a particular area (a structured collection of data files including spatial locations).
A geodatabase stores geometry information and the relationships between different geometry layers.
Each object/data item will have attributes that define the values of its properties, and these will be stored differently depending on whether it is raster or vector. Attributes can be broadly split between numerical and categorical (see image below), and the difference between them is crucial to grasp, as it defines which tests (spatial analyses) and commands (from script) you can perform (see the code sketch after this list). These attributes can be:
* Nominal – qualitative data defining the kind or type of object
* Ordinal – ranking or place in a hierarchy
* Interval – quantitative measurement with an arbitrary unit and zero
  * Mathematical operators limited
* Ratio – quantitative measurement with a non-arbitrary zero
  * Full range of mathematical operators applicable.
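As a minimal sketch (with entirely hypothetical data), here is how these attribute levels map onto pandas data types, which you will meet later in this chapter: nominal and ordinal attributes become categorical columns, while interval and ratio attributes become numerical ones.

import pandas as pd
sites = pd.DataFrame({
    'SiteType': ['barrow', 'enclosure', 'barrow'],               #nominal (categorical)
    'SizeRank': pd.Categorical(['A', 'C', 'B'], ordered=True),   #ordinal (ordered categorical)
    'YearExcavated': [1986, 1995, 2001],                         #interval (arbitrary zero)
    'AreaHa': [0.5, 2.1, 1.3],                                   #ratio (true zero)
})
print(sites.dtypes)
print(sites['AreaHa'].mean())   #ratio data supports the full range of mathematical operators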
Material for this series of practical exercises is built on individual [Jupyter Notebook]s using Python. The code has all been written for you; you just need to execute it. Most code cells have extra explanatory information to detail how the code works. So, when confused by the Python language, look for the 'Learning a new language' cells.
The Jupyter notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualisations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more. Here are the user interfaces you can work with.
Jupyter notebooks have been used in this practical course because they ease the creation and sharing of documents containing live code (such as Python or R). You can explore a gallery of interesting Jupyter notebooks and read why Jupyter is data scientists' computational notebook of choice (i and ii).
GitHub is widely known as one of the most popular code-hosting platforms for version control and collaboration. This is where these notebooks will be saved and accessed publicly as an open-source project. All working material can be found at https://github.com/Francoz-Charlotte/Spatial_teaching_CFediting (NEED THE FINAL LINK HERE!)
Jupyter Notebooks are made of a list of CELLS. Cells contain either explanatory text or executable code (here we are using Python 3) and its output (once you run the code).
This cell box is a text cell. Text cells use a language called Markdown. To learn more, you can follow the markdown guide, which provides examples for editing Markdown cells; there is also an excellent markdown cheatsheet, or you could follow this brief tutorial.
We will use code cells to make things happen. They contain executable code, written in a programming language, commonly Python or R. In this course we'll mostly be using Python. As you know, stereotypically archaeologists hate snakes. Python is the exception to this.
You can run the entirety of these notebooks here or make your own copies and run the script with your own data.
To run the notebooks, you can install Python and work with these Jupyter notebooks locally; for simplicity, Anaconda is an easy way to get everything in place on your computer.
To make your own copy, you'll need to:
In this introductory exercise, you will learn how to get some tools to work with simple spatial data and you will also learn how to open up some data.
In these notebooks, to make things happen we need to speak the Python language. Python's functionality is organised into different libraries that need to be imported before use. These tools are open-source Python libraries that enable powerful data visualisation. Importing these libraries is in fact the first bit of Python code you will be using.
In this first exercise, we will be using the Pandas and GeoPandas libraries. Pandas is a powerful and rich library designed for data analysis in Python. GeoPandas is a library that makes working with geospatial data in Python easier: it extends the datatypes used by pandas to allow spatial operations on geometric types.
#codecell_introduction_importyourlibraries
#start by importing a tool you want.
#Pandas is a basic tool for working with text and spreadsheet type data.
#Just say 'import' and the name of the tool
import pandas as pd
WHAT IS THIS? YOUR FIRST PROGRAMMING CODE CELL!
Add another code cell below this one and try to import geopandas and give it the abbreviation 'gpd'
#codecell_introduction_importyourlibraries
import geopandas as gpd
# Odds are that didn't work... Some tools need an additional 'install' step.
# This happens when the tool isn't in the cloud environment of the software you are working on.
# If something doesn't import, I pretty much always try to '!pip install' it.
# pip is python's tool for getting tools. Very meta.
!pip install geopandas
import geopandas as gpd
You now have two tools, Pandas and GeoPandas, and Matplotlib will be the third. Whenever you need more tools, use import or !pip install; pip is a package management system used to install and manage software packages.
GeoPandas is notoriously difficult to install, especially on Windows. If you intend to work directly in the Jupyter notebooks and pip install does not work for you, this video will help you get started with installation. If you are running out of luck, I can only recommend following this, and as a last resort this link.
Add another code cell below this one and figure out how to import matplotlib.pyplot. Matplotlib.pyplot is part of an extensive plotting library for publication-quality 2D figures, and provides a MATLAB-like way of plotting.
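If you get stuck, a minimal sketch of one conventional way to do it (plt is the standard abbreviation) looks like this:

#codecell_introduction_importyourlibraries
import matplotlib.pyplot as plt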
You can add text or code cells to your version of the notebook: use the + button in your toolbar and choose which type of cell you need (e.g. Markdown, Raw NBConvert, Heading), as the default is a Code cell. By hitting the up or down arrow button, you can move the cell to wherever you want.
Editing a cell: click the cell to select it, and double-click to edit it.
To make things happen in a Code cell, click on the cell and then type 'ctrl + enter' and your command will be executed. An asterisk in square brackets In [*]: will appear while the code is being executed, and this will change to a number In [1]: when the code is finished. *The order in which you execute the code blocks matters: they must be run in sequence.*
Inside blocks of Python code there are comments, indicated by lines that start with #. These lines are not computer code but rather comments providing information about what the code is doing, to help you follow along.
You are ready to go! And remember to save regularly.
Exploring the data you already know
You can see some geojson (vector for the internet) data from Canmore that represents locations of sites on Shetland as points with descriptive attributes here.
Have a look and identify the coordinates (spatial data), attributes, and coordinate system EPSG code.
Write an example of each one in a new text field below this one in your notebook.
What's a GeoJSON? GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on JavaScript Object Notation (JSON). GeoJSON embeds attributes and the coordinate projection information (we'll talk a bit more about them in Chapter 2). It is vector data.
GeoJSON example:
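For illustration, here is a minimal, entirely hypothetical GeoJSON file containing a single point feature; the real Canmore data linked above follows the same structure, pairing a geometry (the coordinates) with non-spatial properties (the attributes):

{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "geometry": { "type": "Point", "coordinates": [-1.2926, 60.3572] },
    "properties": { "SITENAME": "Example site", "PERIOD": "Iron Age" }
  }]
}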
What's WKT? Well-known text (WKT) is a text markup language for representing vector geometry objects on a map; a binary equivalent is known as well-known binary (WKB). It was designed and approved by the Open Geospatial Consortium (OGC), and the current standard definition is in ISO/IEC 13249-3:2016.
WKT examples include:
GEOMETRYCOLLECTION (POINT (4 6), LINESTRING (4 6, 7 10))
POINT ZM (1 1 5 60)
POINT M (1 1 80)
POINT EMPTY
MULTIPOLYGON EMPTY
TRIANGLE ((0 0 0, 0 1 0, 1 1 0, 0 0 0))
TIN (((0 0 0, 0 0 1, 0 1 0, 0 0 0)), ((0 0 0, 0 1 0, 1 1 0, 0 0 0)))
POLYHEDRALSURFACE Z (((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)), ((1 1 1, 1 0 1, 0 0 1, 0 1 1, 1 1 1)), ((1 1 1, 1 0 1, 1 0 0, 1 1 0, 1 1 1)), ((1 1 1, 1 1 0, 0 1 0, 0 1 1, 1 1 1)))
Well-known text (WKT) details | WKT example
Let's import some real data
We want to bring some data into this notebook. This is done by telling the notebook where the data is located, using the data's URL (web address), and asking it to read that file with the geopandas tool. So this data is digital. Instead of repeating this long web URL, we can give the file a name within the notebook. Here I've called it 'df'.
#codecell_introduction_getyourdata
url = "http://ropitz.github.io/digitalantiquity/data/CanmoreShetlandPoints.geojson"
df = gpd.read_file(url)
The command .head() will show you the first part of any file in a notebook.
#codecell_introduction_takeapeek@yourdata
# go ahead and look at the first few lines of the file.
#Note this is slightly easier to read than the raw geojson
df.head()
Plotting the data is usually fairly easy. For any spatial dataset, just call its name and add .plot()
#codecell_introduction_plot@yourdata
df.plot()
Not very pretty... Next week we'll learn about making pretty maps.
Go find some other archaeological data, read it into the notebook, and plot it.
Like the scheduled monuments from York: https://opendata.arcgis.com/datasets/26857683681542b798ef97708cbdaf61_14.geojson
Or known wrecks off the coast of Northern Ireland: https://www.opendatani.gov.uk/dataset/c5fceed7-b07a-4bc4-a0a0-8c45b9033083/resource/2d8da801-39f7-48b7-81ad-8db58d107962/download/protected_wrecks.geojson
Or any other open data you find in the geojson format...
Howey, M. C., & Burg, M. B. (2017). Assessing the state of archaeological GIS research: Unbinding analyses of past landscapes. Journal of Archaeological Science, 84, 1-9.
Lock, G., & Pouncett, J. (2017). Spatial thinking in archaeology: Is GIS the answer?. Journal of Archaeological Science, 84, 129-135.
Longley, P. A., Goodchild, M. F., Maguire, D. J. & Rhind, D. W. (2001) Geographic Information Systems and Science. Wiley, New York.
Yang, C. (2017). Introduction to GIS Programming and Fundamentals with Python and ArcGIS. Boca Raton, FL: CRC Press, Taylor & Francis Group.
Verhagen, P. (2018). Spatial analysis in archaeology: moving into new territories. In Digital Geoarchaeology (pp. 11-25). Springer, Cham.
In this practical exercise, you will make some basic maps. You're making them interactively in this notebook.
A basic map allows the viewer to:
So, even a basic map is centred around a research question. As such, this exercise sets some goals as to:
To get started, you will need the required tools. In this lab, we are using the Pandas, Folium and Branca libraries.
#codecell_makeabasicmap_importyourlibraries
import folium
import branca
import pandas as pd
print(folium.__file__)
print(folium.__version__)
We've discussed how to come up with a good, explicitly spatial question. We've also discussed how to design a good map. Here you're going to start putting all this into practice.
The data in the Palmisano et al. (2018) article provides information about settlement and population patterns in central Italy and how they change over time. Where people were living and working at various times in the past is a basic archaeological question.
Let's say your question is about how many Iron Age sites are present in the region, and what their distribution is like in space - that is, where they are located and, relatively speaking, how many are in each area.
How would you go about answering these questions with a map?
This data must be explicitly spatial - that is, have spatial coordinates that tell us about locations in a way that machines can understand. Spatial coordinates are pairs of numbers organised on an x (horizontal) and y (vertical) axis - this is the bit of geometry you will be using the most as an archaeologist!
Projections: Distances and directions to all places are true only from the centre point of projection.
Projections and coordinate systems: coordinates (x, y) can also be called eastings & northings or latitudes & longitudes, depending on the type of projection used to transform the map. The most common coordinate format for maps on the internet is latitude and longitude.
Transformations: when working with different datasets, ALL YOUR DATA MUST BE IN THE SAME COORDINATE SYSTEM. If the systems differ, you will need to transform your data... below are illustrations of the relationships between different projections.
How to unfold a cylindrical projection of the world to a Mercator grid projection? | Or from geographic (left image) to Cartesian coordinates (right image)? |
South America in many projections | Ways to transform (with a bit of math too!) |
Whilst the conversion from geodetic to Cartesian is fairly straightforward, converting Cartesian to geodetic is a complex problem (and if you are interested in how this is done mathematically, have a look at this).
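As a taste of what's to come, here is a minimal sketch of how GeoPandas (which you met in Chapter 1) can transform a dataset between coordinate systems with .to_crs(). The file name here is hypothetical; the pattern is what matters:

import geopandas as gpd
#read a vector file; its coordinate system is recorded in the .crs attribute
sites = gpd.read_file("my_sites.geojson")
print(sites.crs)                      #e.g. EPSG:4326 - latitude/longitude
#transform every geometry into the British National Grid (EPSG:27700)
sites_bng = sites.to_crs(epsg=27700)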
Here are some useful links to use with your dataset:
In this notebook, we'll experiment with the spatial data from:
Palmisano, A., Bevan, A. and Shennan, S., 2018. Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates. Journal of Open Archaeology Data, 6(1), p.2. DOI: http://doi.org/10.5334/joad.43
Spend some time exploring and reading about it before jumping into this practical...
These researchers have been very thoughtful and provided the data behind their analysis so we can re-use it. Their old school shapefile has been converted to a CSV for you to make things easier. If you ever have to do this yourself, there are lots of online converters like this one.
Later in the course, you will also learn to read data from shapefiles directly into your notebook. This is a little more complicated, so we are skipping it for now.
Reminders:
A CSV (comma-separated values) file is a simple tabular file format using commas to separate the elements within it.
A shapefile is a vector data storage file format that can be used in spatial software such as QGIS.
Mandatory files:
* .shp — shape format; the feature geometry itself. A variable-record-length file in which each record describes a shape with a list of its vertices.
* .shx — shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly. The index file (.shx) contains a 100-byte header followed by 8-byte, fixed-length records.
* .dbf — attribute format; the one-to-one relationship between geometry and attributes is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.

Optional files:
* .prj — provides projection and coordinate system information; a plain-text file describing the projection using well-known text format
* .sbn and .sbx — a spatial index of the features
* .fbn and .fbx — a spatial index of the features for shapefiles that are read-only
* .ain and .aih — an attribute index of the active fields in a table
* .ixs — a geocoding index for read-write shapefiles
* .mxs — a geocoding index for read-write shapefiles (ODB format)
* .atx — an attribute index for the .dbf file in the form of shapefile.columnname.atx (ArcGIS 8 and later)
* .shp.xml — contains geospatial metadata in XML format
* .cpg — used to specify the code page (only for .dbf) identifying the character encoding to be used
#codecell_makeabasicmap_GetUrdataReady
#Get the data by reading it into the notebook
palmisano_tuscany_sites = pd.read_csv("https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/site_centriods_tuscany.txt")
palmisano_tuscany_sites.head()
In #codecell_makeabasicmap_GetUrdataReady, we have a simple piece of code which opens the CSV file containing the archaeological site information. The image below will help you understand the code better and reuse it easily.
The original data file contains over 10,970 entries for all archaeological settlements in Central Italy. So you need to filter, or SUBSELECT, your big data file to get just the Iron Age sites.
#codecell_makeabasicmap_SubSelectData
#tell the notebook you only want to see stuff where the period is the iron age.
palmisano_tuscany_sites_iron_age = palmisano_tuscany_sites[(palmisano_tuscany_sites['Period']=="Iron Age") ]
palmisano_tuscany_sites_iron_age.head()
In #codecell_makeabasicmap_SubSelectData, we have a simple piece of code which filters the dataset down to just the Iron Age sites. The image below decomposes this codeline, so you can understand how the Python language works and re-use it easily.
Now we want to add this data to a simple map.
You need to start by getting the coordinates of all the points so you can centre the map - that is, focus on the area where your data is. It is probably a good idea to put the middle of your map roughly where the middle of your dataset is located.
Think about it... if your data is in Italy, putting Antarctica as the center of your map is not going to be very effective.
Now, you want to add a marker for each Iron Age site. A marker is a graphic icon placed at the coordinates on which a site is centred.
#codecell_makeabasicmap_BringingUrData2theMap
#location is the mean of every lat and long point to centre the map.
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()
#A basemap is then created using the location to centre on and the zoom level to start.
m = folium.Map(location=location,zoom_start=10)
#Each location in the DataFrame is then added as a marker to the basemap points are then added to the map
for i in range(0,len(palmisano_tuscany_sites_iron_age)):
    folium.Marker([palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]]).add_to(m)
#To display the map simply ask for the object (called 'm') representation
m
You have a map.
The #codecell_makeabasicmap_BringingUrData2theMap contains four codelines that:
* centre your map by calculating the mean of the latitude coordinates and the mean of the longitude coordinates;
* zoom your map to a suitable level - e.g. world, country, region or street level;
* place a marker for each archaeological site location;
* show the results interactively.
Below are the four codelines of the script decomposed for you, so you can adapt them easily in the future.
Reminder: the mean or average of a set is the sum of all the numbers divided by the number of items in the set. So, for instance, the mean of the latitude coordinates = the sum of all latitude coordinates divided by the number of latitude coordinates in the dataframe.
This codeline shows the powerful capabilities of the Folium library you've imported at the start of this practical lab. You can find more detailed guidance about it here.
This codeline introduces you to two important concepts in coding, the range and the loop, via the iloc command. By using range(), len() and iloc you can implement a loop in Python that iterates over the rows of your filtered pandas dataframe (palmisano_tuscany_sites_iron_age).
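To see what range(), len() and .iloc are each doing, here is a minimal sketch of the same loop pattern on a tiny made-up dataframe:

import pandas as pd
toy = pd.DataFrame({'Latitude': [43.1, 43.2, 43.3], 'Longitude': [11.5, 11.6, 11.7]})
#len(toy) is the number of rows (3), so range(0, len(toy)) counts 0, 1, 2
for i in range(0, len(toy)):
    #.iloc[i] picks out the value at integer position i
    print(toy['Latitude'].iloc[i], toy['Longitude'].iloc[i])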
Now, a design question: what is a good starting zoom level?
Do you want your map zoomed further out or further in given the extent of your data?
In the code cell above, play around with the 'zoom_start' parameter and find a good zoom level that makes you happy.
You just modified the code to suit your own needs.
Rather than just executing code by pushing play, you've edited the code by changing a variable.
Easy, right?
This is where many people get started with scripting in archaeology. You find a bit of open source code that seems to do what you need, and then you modify it. There's no sense in writing code from scratch when there's lots of useful open source code available that is meant to be shared and modified.
So far, we have seen how the data is spread visually, but we want to look at this data quantitatively. How do these sites compare?
Let's look at how to represent larger and smaller sites. The archaeologists in this project have a qualitative ranking of site size from large (A) to small (F). You can see the categories they've used, and that there are more small sites than large ones. Say you want to give the two largest categories of sites different colours. Why do this? Maybe larger sites have more people, so this provides insights into demographics.
In the step above, #codecell_makeabasicmap_SubSelectData, you've set up your new dataset by filtering the iron age settlements from the original dataset. It's what we would call a 'subset' of your original data.
Now you are ready to make a new map.
#codecell_makeabasicmap_ManipulatingyourData_ValueCounting
# Size surely matters... group the sites by size category and get the number of sites in each size category.
# SizeQual is the name of the 'attribute', or column in the table, that has information on site size.
# An attribute is anything attached to spatial data other than the coordinates, such as period, type, size in hectare,etc.
palmisano_tuscany_sites_iron_age['SizeQual'].value_counts()
.value_counts() is a function that counts the occurrences of each unique value in a column, whether numerical or categorical (e.g. numbers, objects or labels such as attributes). The results are returned as a series of counts in descending order, so that you can evaluate the frequency and distribution of these values.
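A quick illustration on a made-up series of size categories:

import pandas as pd
sizes = pd.Series(['F', 'F', 'F', 'B', 'C', 'C'])
print(sizes.value_counts())
#F    3
#C    2
#B    1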
#codecell_makeabasicmap_ManipulatingyourData_UsingSymbology
#now make a map just like you did before. Note that this time we're adding a scale bar with 'control_scale'
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()
m = folium.Map(location=location,zoom_start=10,control_scale = True)
#Assign different colours to the two large site categories - B and C in this case
for i in range(0,len(palmisano_tuscany_sites_iron_age)):
    site_size = palmisano_tuscany_sites_iron_age['SizeQual'].iloc[i]
    if site_size == 'C':
        color = 'blue'
    elif site_size == 'B':
        color = 'green'
    else:
        color = 'red'
    # add the markers to the map, using the locations and colours
    folium.Marker([palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]],icon=folium.Icon(color=color)).add_to(m)
#type 'm' for map (the variable you set above) to tell the notebook to display your map
m
The #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology can be decomposed like this:
Why does design matter? Why is it important? Good map design observes some principles that follow a simple equation: Aesthetics + Information + Communication + Argumentation. In other words, your map should be beautiful, well designed, and tell a clear, true story.
Research, design, revision, and evaluation are the standard process for making a good map, following a simple workflow: careful planning -> data analysis -> presentation -> critique = feedback -> production.
For starters, you can look at this document to familiarise yourself with some of the concepts.
Changing Symbol Colour
Now, go back into #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology and experiment a bit. Try out some different colours. Do certain combinations work well? Try adding different colours for each of the smaller categories of sites. Does that make the map clearer or more confusing?
Changing Symbol size
Maybe it makes more sense to show size by changing the size of the icon than the colour. Let's make another map that varies the size of the icon for each site based on its size in hectares. This is making a proportional symbol map.
Here are two example maps from Kenneth Field illustrating the power of map symbology:
Basic geocoded structure of the dataset | Proportional symbol map of poverty |
#codecell_makeabasicmap_ManipulatingyourData_UsingSymbologyExperimenting
#now make a map just like you did before.
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()
m = folium.Map(location=location,zoom_start=8,control_scale = True)
# Set the size for each circle icon by defining the 'radius' which is the radius of the circle
# Here we are multiplying the size in hectares (the SizeHa attribute) by 15. Try different values here to get icons a size you like
for i in range(0,len(palmisano_tuscany_sites_iron_age)):
    folium.Circle(
        location=[palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]],
        popup=palmisano_tuscany_sites_iron_age.iloc[i]['Toponyms'],
        radius=palmisano_tuscany_sites_iron_age.iloc[i]['SizeHa']*15, #this is where we set the value of 15 - change this variable to get different size icons
        color='crimson',
        fill=True,
        fill_color='crimson'
    ).add_to(m)
#type 'm' for map (the variable you set above) to tell the notebook to display your map
m
The #codecell_makeabasicmap_ManipulatingyourData_UsingSymbologyExperimenting can be decomposed like this:
Many things in life naturally tend to cluster. Patterns of higher density, or clusters, can be clearly identified in urban and rural landscapes and populations. Forming a general understanding of the processes at particular points in time (e.g. the socioeconomic, demographic and environmental structures of different places) that give rise to the geographical patterning of landscapes and populations is essential to predict and simulate how settlements have evolved in the past and might evolve in the future.
When might this be important? What kinds of archaeological questions involve finding the spatial centre of a distribution of sites or finds or features?
Research question: now, what if you want to make a map that shows the concentration of sites in the Iron Age - that is, areas with more and fewer sites?
In order to show areas of higher density of activity, a 'heatmap' is often created using a colour gradient. A heatmap represents the geographic density of point features on a map by using coloured areas to represent those points. The colour is most intense where the points are most concentrated and least intense where they are least concentrated. Essentially, it shows areas with more sites as 'hotter' (generally red) and areas with fewer sites as cooler.
For our dataset, in this kind of map surely larger sites should count more - we call this 'weighting'. We want to weight our Iron Age settlement map by the size of each settlement: a site that is twice as large should count for twice as much in our heatmap.
For this, you need to think about how the points are distributed geographically. This entails obtaining basic spatial statistics about the dataset, a first step in the analysis of geographic data.
Just a reminder of some basic terminologies in data statistics:
The overarching aim is to determine the default search radius or bandwidth which will define the area of influence and how the colour gradient will be applied as illustrated in the image below.
Finding the (default) search radius involves five steps (Yang, 2017) in which you calculate:
Here is an example of population density map to illustrate this concept:
Density surfaces show where point or line features are concentrated. For example, you might have a point value for each town representing the total number of people in the town, but you want to learn more about the spread of population over the region. Since all the people in each town do not live at the population point, by calculating density you can create a surface showing the predicted distribution of the population throughout the landscape. The following graphic gives an example of a density surface. When added together, the population values of the cells equal the sum of the population of the original point layer. (ref)
Folium is the library we will be using to create our heatmap; you can see more examples on Anthony Ivan's page or Aly Sivji's.
#codecell_makeabasicmap_ManipulatingyourData_Heatmap
#first make a list of the coordinates of each site and its size in hectares, which we will use for the size-based weighting
data_heat = palmisano_tuscany_sites_iron_age[['Latitude','Longitude','SizeHa']].values.tolist()
#look at the first line of your list to check it seems to have worked
data_heat[0]
#to make the heatmap, we need to get an extra tool, so...
import folium.plugins as plugins
# now make a map a bit like you did before, set your zoom level and your centre location. Then use the plugin to make the heatmap.
m = folium.Map(location=location, zoom_start=10)
#tiles='stamentoner'
plugins.HeatMap(data_heat).add_to(m)
#type 'm' for map (the variable you set above) to tell the notebook to display your map
m
You should have a heatmap showing the concentrations of Iron Age sites in the region.
You could repeat the exercise with sites from any period in which you are interested.
Hopefully you've learned to make some basic maps and are starting to understand how to put into practice some of the theory of map design and spatial visualisation we've been discussing.
That's it for today... Remember to save your notebook (under a new name) so you can come back to it and practice making basic static maps.
In this practical you have learned new commands that you can now reuse for your own datasets:
Palmisano, A., Bevan, A. and Shennan, S. (2018) Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates. Journal of Open Archaeology Data, 6(1), p.2. DOI: http://doi.org/10.5334/joad.43
Pimpler, E. (2017) Spatial Analytics with ArcGIS. Packt Publishing.
Yang, C. (2017). Introduction to GIS Programming and Fundamentals with Python and ArcGIS. Boca Raton, FL : CRC Press, Taylor & Francis Group.
~ déjà vu ~ In Chapter 2, we focused on making "mostly static" maps, that is maps where you mostly just expect your user to look at the product you've prepared. We looked at a research question about the distribution of Iron Age sites in Central Italy, and we focused the practical exercise on using the map for:
~ new ~ This week we'll continue exploring different types of maps. Specifically, we'll look at making maps where interactivity is a key part of the design. Beyond the question of how to present spatial data (i.e. designing the map), we will spend more time on an important topic in archaeology: spatial distributions. Central to investigating spatial distributions (patterns) is our ability to manipulate and rearrange spatial data to answer spatially explicit questions. As such, this practical exercise aims to help you with:
Why does this data need to be presented in an interactive map? What spatial relationships are important here? What kinds of questions might a user ask?
Interactive maps are for exploring spatial data, where people can:
When should you use interactive maps?
Interactive maps are spatial and descriptive data interfaces
Visual design *(Front end)* | Data for design *(Back end)*
---|---
Symbols | Data Quality (accuracy and resolution)
Fonts | Data Cleanliness
Labels | Data Consistency
Scale | Data Accessibility
North Arrow | Metadata
Generalisation | Data Architecture
Visual Interface & Query Interface | Data architecture & Spatial Data resource
Open source software: in Chapter 1, when introducing Jupyter notebooks, we mentioned open source software. Open data operates under the same broad ethos and follows many of the same principles. Sharing, reuse, and attribution are key. If you continue to reuse the Antikythera data, be sure to continue to link back to and cite the source.
Perhaps the most relevant example in the UK is OS (Ordnance Survey) OpenData, which was made freely available for the first time in 2010. The last review (2018) counted 1.9 million open data downloads - the equivalent of 150 people downloading OS OpenData every day!
2015 saw the Environment Agency make lidar (light detection and ranging) data available to the public, for free, as open data. Within the first year of release, 500,000 lidar downloads were made, equating to nearly 13 million km2 of data!
Working with legacy data: the case of West Heslerton, North Yorkshire, an excavation of the Early Anglo-Saxon (Anglian) settlement carried out between 1986 and 1995.
Interactive Map | Settlement Archive |
The West Heslerton Interactive Map: This is the map | The West Heslerton Settlement Archive: This is the data that drives the map |
Front end | Back end |
Start by getting tools, always. In this chapter, you will still be working with pandas and folium, and you will be replacing the branca library with numpy.
#codecell_Webmaps&Distributions_ImportUrLibraries
#Like last time, get the tools we need at the start
import pandas as pd
import folium
import numpy as np
This time, we are working with the data from the Antikythera survey project.
Its citation is: Bevan, A. and Conolly, J. 2012. Intensive Survey Data from Antikythera, Greece. Journal of Open Archaeology Data 1(1), DOI: http://dx.doi.org/10.5334/4f3bcb3f7f21d.
This data has been made available at the Archaeological Data Service (ADS) archive at https://archaeologydataservice.ac.uk/archives/view/antikythera_ahrc_2012/ and is completely open for re-use. You can see how many times their database has been viewed and downloaded.
Working with other people's data
Have a quick look around the Antikythera dataset as it's described on the ADS site. You'll notice that they've split up their dataset in ways that made sense to them at the time. Specifically, they've divided the artefact data into three discrete elements - 'pottery', 'lithics' and 'other' - in separate files (very much like we did last week by filtering and creating a subset). This is a pretty normal archaeological-data thing to do.
Here's the trick: we want to focus on both ceramics and small finds and to look at these datasets together. This means you'll have to grab both of them and combine them. When you combine them, you will also need to re-organise the attribute data in order to reuse it for something new. This follows the model of a relational database.
What’s a database vs a graph database?
How do we query this data? Query languages are the languages used to ask questions of databases. Python acts as a query language for pandas data structures; SQL and SPARQL are specific database query languages.
Querying Python | Querying SQL |
Example python script (top) & Your script in 2.3.2 (bottom) | Example of SQL query language |
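To make the comparison concrete, here is a hedged sketch of the same question asked in pandas and in SQL; the table and column names are hypothetical:

import pandas as pd
finds = pd.DataFrame({'Type': ['pottery', 'lithic', 'pottery'], 'Count': [12, 3, 7]})
#querying with pandas: keep only the pottery rows
pottery_only = finds[finds['Type'] == 'pottery']
#the equivalent query in SQL would be:
#  SELECT * FROM finds WHERE Type = 'pottery';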
#codecell_Webmaps&Distributions_OrganisingUrdata
#Like last time, get the data we'll need at the start. I've been nice again and converted their coordinates to latitude and longitude for you.
#You'll learn to do this yourself later in the course.
# we label the first dataset 'pottery'
pottery = pd.read_csv('https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/antikythera_survey_pottery.csv')
# we label the second dataset 'small finds'
small_finds = pd.read_csv('https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/antikythera_survey_small_finds.csv')
#codecell_Webmaps&Distributions_CheckingZeData
#let's check the individual pottery file.
pottery.head()
#codecell_Webmaps&Distributions_CheckingZeData
#let's check the individual pottery file to see
small_finds.head()
In #codecell_Webmaps&Distributions_CheckingZeData:
* = allows you to assign a new name, often more convenient than the link pathname
* pd.read_csv allows you to open a CSV-format file
* the .head() function lets you take a peek at the first part of the dataset
Spend some time understanding your data
Now have a look at these two datasets and how they are structured. Note that the coordinates have been transformed for you from Cartesian coordinates (a grid system, shown in the columns 'Xsugg' & 'Ysugg') back to latitude/longitude (a geodetic system, shown in the columns 'DDLat' & 'DDLon'). (We talked about projections, coordinate systems and transformations in Chapter 2, section 2.3.)
#codecell_Webmaps&Distributions_Concatenation
# then we combine the two datasets together to make a big dataset we call 'survey data'
survey_data = pd.concat([pottery,small_finds], sort=False, ignore_index=True)
#codecell_Webmaps&Distributions_Concatenation can be decomposed like this:
An array is a group of elements/objects/items organised as a set of columns (of equal size) by rows (of equal size). Below are two illustrations showing what happens with an array and how it differs from a list:
The advantage of using arrays is that all elements can be accessed at any time, randomly. In Chapter 2, you did something similar when calling items by their position [i] (in #codecell_makeabasicmap_BringingUrData2theMap). This time, you are linking two arrays together. The pandas library provides various ways to combine two tables, such as merge, join and concatenate (see the user guide).
In this example, df1 and df4 are concatenated in the same way we have concatenated pottery and small finds.
The result shows that the rows have been appended, but the original overlapping indexes have been ignored; to do this, we used the argument ignore_index=True.
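Here is a minimal sketch, in the style of the pandas documentation example referenced above, with df1 and df4 as toy tables:

import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df4 = pd.DataFrame({'B': ['B2', 'B3'], 'D': ['D2', 'D3']})
#stack the two tables; ignore_index=True renumbers the rows from 0,
#and cells with no matching column are filled with NaN
result = pd.concat([df1, df4], sort=False, ignore_index=True)
print(result)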
Let's make sure nothing went wrong...
#codecell_Webmaps&Distributions_CheckingZeData
#check things loaded in and combined OK
survey_data.head()
Like we said last week, we are using maps and spatial analysis to pose, explore and respond to spatial questions.
My question is a bit like last chapter's question. I want to know about how many sites are in each period, so I can try and understand changing patterns over time.
However, you may have noticed when you read in the data that its structure is a bit different from last week's data. Instead of each site belonging to one period, it's assigned with varying probability to several different periods.
This is a totally legit archaeological thing to do. Many sites have activity from multiple periods, and depending on the available evidence, you might be more or less confident about the presence or absence of activity in a specific period.
We might start simply by assigning each site primarily to its 'most likely period'.
Data cleaning is the most time-consuming and important part of any analysis - and not only in archaeology but for all scientific analyses. This takes a few steps....
Therefore we will walk through the steps of data cleaning in this exercise.
...you have to re-organize the dataset to answer our question: 'how many sites are in each period?'.
#codecell_Webmaps&Distributions_Subselect
# first we create a subset of our data that only contains the columns with information about time
# this is in part because we want to do some operations where everything has to be a number, and some of the other fields contain text
# it's also just to make things simpler when we look at them
survey_data_time = survey_data[['MNLN', 'FNEB1', 'EB2', 'LPrePal', 'FPal', 'SPal', 'TPal', 'PPalPG', 'Geom', 'Arch', 'Class', 'Hell', 'ERom', 'MRom', 'LRom', 'EByz', 'MByz', 'EVen', 'MVen', 'LVen', 'Recent', 'Other']]
survey_data_time.head()
#codecell_Webmaps&Distributions_ChangingDataType
# look at the types of data in each column of this table
survey_data_time.dtypes
In #codecell_Webmaps&Distributions_ChangingDataType, the .dtypes attribute allows you to see which types of data are in our 'survey_data_time' dataframe. Three types of data are returned: int64, which are integers (whole numbers - numerical data); float64, which are floating-point numbers (containing decimal points - numerical data); and object, which is a string (not a number - categorical data).
At the very start, in Chapter 1.1, we said that the type of data you are using matters, because tests and commands differ depending on whether the data is numerical or categorical. The image below is here as a reminder:
Looking further into this dataset, 'survey_data_time', we can also see that some fields contain missing values (NaN). Let's see what we can do about that and why it matters.
#codecell_Webmaps&Distributions_AllNumbers
# We can get rid of NaN (null) values by 'filling' them.
# This is important because null values can break number-based operations.
# Let's get rid of missing values, make sure everything is a number, and store the result back.
survey_data_time = survey_data_time.astype('float64').fillna(0)
survey_data_time.head()
In #codecell_Webmaps&Distributions_AllNumbers, you are making sure all the missing values, NaN, become zeros (0), and storing the cleaned table back in 'survey_data_time'.
Why do missing values need to be removed? Simply because you cannot apply maths to them (e.g. adding or multiplying columns together, reordering the values of a column from the maximum to the minimum value, etc.).
And this is how it can be done:
Now that you've removed the null values (NaN), you can move to the next step in data cleaning, which is applying a transformation to arrange the data the way you want it for your analysis. Data transformation is an important thing to learn to do.
Right now we have a bunch of columns with information about time. What we want is one single column that contains the most likely period - which is represented in each row by the column with the greatest value.
think about that for a moment
Right now the 'most likely period' is represented by a number in each row, but that's not the piece of information we want in our new column - we want the name of the column that contains that number. In other words, we want to know the item in the table that satisfies the condition 'the most likely period', and to be able to do something with this item.
Once you have identified the maximum values, you can extract them...
#codecell_Webmaps&Distributions_MaximumValue
#here we take the columns from all the different periods, get the one with the maximum value, and write that column's name to our new 'colmax' field
def returncolname(row, colnames):
    return colnames[np.argmax(row.values)]
survey_data_time['colmax'] = survey_data_time.apply(lambda x: returncolname(x, survey_data_time.columns), axis=1)
In #codecell_Webmaps&Distributions_MaximumValue you are creating a function. A function is a small script with multiple steps that you can run over your whole dataset. In this function, for each row you return the name of the column where the maximum (greatest) value in that row is found. You then create a new column, 'colmax', and store that name in it.
This is applied using a lambda function. Let's break down the code into steps...
*How does this really work?*
Remember that we are working with arrays, so, getting maximum or minimum values can be schematised like this:
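Here is a minimal sketch of what np.argmax does for a single, made-up row of period probabilities:

import numpy as np
row = np.array([0.0, 10.0, 80.0, 10.0])
cols = ['EB2', 'Hell', 'ERom', 'LRom']
print(np.argmax(row))         #2 - the integer position of the greatest value
print(cols[np.argmax(row)])   #'ERom' - the column name at that position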
#we can check it has all gone well
survey_data_time.head()
OK now we have a single column with the information we need - the most likely date. To create this column, we broke off some of our data (the columns with numbers) from the rest of the data (important descriptive text). We might well want to stick these two datasets back together before proceeding.
splitting and merging tables is another basic skill when working with data
#codecell_Webmaps&Distributions_MergingZeData
#now we can also add our new column back to our original data table by doing a 'merge'
#create a new table 'survey_data_maxtime' by merging our original 'survey_data' with ONLY the 'colmax' column from our new table
survey_data_maxtime = pd.merge(survey_data, survey_data_time['colmax'], how='inner', left_index=True, right_index=True)
survey_data_maxtime.head()
In #codecell_Webmaps&Distributions_MergingZeData, you learned how to merge using pandas; if you want to further explore and apply pd.merge() to your data, check which parameters apply.
Have a look at the resulting table. What do all those column names mean? Right now you are probably justifiably confused. We'll be talking more about the mess that is 'other people's data' next week. For now, have a look at the documentation for these datasets at: https://archaeologydataservice.ac.uk/catalogue/adsdata/arch-1115-2/dissemination/csv/pottery/documentation/pottery.txt
You'll see they explain that many of those weird abbreviations are periods, and that the number in each one represents the chance that a given find belongs to that period. Sometimes I wish people wouldn't use abbreviations like this, but they've defined them in their metadata file, so we can't complain too much.
Broadly speaking, there are two ways to approach interpreting spatial patterns. There's visualisation and interpretation, where you might visually compare the distributions, densities or locations of two or more datasets by plotting them on a map and interpreting what you see. Then there's statistical analysis.
We'll start with the tools to do the first one, and introduce statistical analysis later in the course. We'll also discuss the value of each approach, and when to apply it.
Here are illustrative examples of data distributions and how we can interpolate this data:
Types of Distribution: Data Sampling
Bonnier et al. (2019)
Examples of Applications: Interpolating the data
Orengo et al. (2019)
As we said at the beginning of today, we're interested in change over time.
How does the distribution of finds change between different periods?
#codecell_Webmaps&Distributions_ImportUrLibraries
# we're going to get geopandas, another tool for making maps
!pip install geopandas
#codecell_Webmaps&Distributions_ImportUrLibraries
# get some more tools for making maps (and other things)
%matplotlib inline
import geopandas as gpd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
We're going to use geopandas for the next few steps. You've used geopandas before, in the first chapter.
It can do many of the same things as folium, which we were using last time.
Geopandas is particularly useful for showing categorical data.
Categorical data is data about groups
Categories can be anything where you are defining groups. Archaeological periods are categories. Types of lithics or ceramics are categories. As above, types of cars are categories. The name of each category is sometimes referred to as a label.
We saw above in #codecell_Webmaps&Distributions_ChangingDataType that the .dtypes attribute allows you to see which type of data is in your columns.
We could start trying to see and understand the distributions of our sites by period simply by mapping the period labels with different colours.
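As a small illustration with made-up labels, pandas even has a dedicated 'category' dtype for this kind of data:

import pandas as pd
periods = pd.Series(['EB2', 'Hell', 'LRom', 'Hell'], dtype='category')
print(periods.dtype)            #category
print(periods.cat.categories)   #the group labels: EB2, Hell, LRom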
#codecell_Webmaps&Distributions_dataframesINTOgeodataframes
# take our big dataset from above and turn it from a 'dataframe' which is the data that folium uses to make maps into a 'geodataframe' which is the data geopandas uses to make maps
gdf_survey = gpd.GeoDataFrame(
survey_data_maxtime, geometry=gpd.points_from_xy(survey_data_maxtime.DDLon, survey_data_maxtime.DDLat))
print(gdf_survey.head())
In #codecell_Webmaps&Distributions_dataframesINTOgeodataframes, gpd.GeoDataFrame() creates a geodataframe from 'survey_data_maxtime' by defining its geometry (geometry=), so it can be mapped.
Here, the location of the points is defined by the x and y centroids (the exact middles of the original polygon shapes) via gpd.points_from_xy, using their respective geodetic coordinates, which are found in the column 'survey_data_maxtime.DDLon' for the x values and 'survey_data_maxtime.DDLat' for the y values.
print() allows you to see the results.
#Codecell_Webmaps&Distributions_PlotGeodataframes
#plot your data colouring the points by the period to which they belong. You are grouping your sites by their category label when you do this.
#the plot requires you to define which type of data it is (categorical or numerical)
#figsize =(width, height)
gdf_survey.plot(column='colmax', categorical=True, legend=True, figsize=(15,15))
Is it useful? Why or why not?
Can you see the distribution of sites from individual periods easily?
Can you easily discern change over time?
I'm not overly convinced by the result here. If you think about the map design principles we discussed last week, you will probably also conclude that this is not a successful map.
What other approach might we take?
Let's try something else.
#codecell_Webmaps&Distributions_SplittingUrData
#Maybe it would be better to only look at two or three periods at a time.
# Recall last week's discussion about selecting appropriate data that matches our question, and also about appropriate levels of generalisation.
#let's select a subset of our periods to see change from the early bronze age to hellenistic to late roman periods
# note that some crazy abbreviations have been used for the names of periods...
# list the types (periods) of ceramics we are interested in seeing.
types = ['EB2','Hell','LRom']
#define 'classic' (as in classical archaeology) as this group of periods. Create a subset of the data to contain only the types you just listed
classic = gdf_survey.loc[gdf_survey['colmax'].isin(types)]
# check you get the expected result - only sites from the periods you just defined as belonging to the 'classic' group
classic.head()
In #codecell_Webmaps&Distributions_SplittingUrData, the function gdf_survey.loc is similar to the .loc[] & .iloc[] used with pandas last chapter (in #codecell_makeabasicmap_BringingUrData2theMap). Note that .loc indexes data by label, whereas .iloc indexes by integer position.
The .isin() command allows you to keep only the rows whose value appears in a given list (here, the periods in 'types').
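A minimal sketch on made-up data shows .loc and .isin() working together:

import pandas as pd
toy = pd.DataFrame({'colmax': ['EB2', 'Hell', 'MRom', 'LRom']})
types = ['EB2', 'Hell', 'LRom']
#keep only the rows whose 'colmax' value appears in the list 'types'
print(toy.loc[toy['colmax'].isin(types)])   #the MRom row is dropped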
#codecell_Webmaps&Distributions_PlotUrData
#plot your data colouring the points by the period to which they belong
# Do this by setting 'categorical' to 'true' so that the plot is coloured by category
classic.plot(column='colmax', categorical=True, legend=True, figsize=(15,15))
That's a bit better perhaps. The map is less crowded.
Recall our discussions about how to design a map well. Clearly too much data introduces design problems.
Well, now we can see the distributions a bit, and maybe say something about change over time, but there are still a lot of dots, and it's pretty clear dots from some periods are hidden under dots from other periods and we have no way to separate them.
What we need here is:
To address our design problem, we will make interactive maps with layers that can be toggled on and off
#codecell_Webmaps&Distributions_ImportUrLibraries
#Like last time, we'll use folium and one of its plugins. Import the tools you'll need, as usual.
from folium.plugins import HeatMapWithTime
To see the survey data in context and build our interactive maps, we'll start by generating the base map that will be used throughout this notebook.
'Basemaps' are generic background maps, like a satellite image or a street map. You know the different backgrounds you can show on Google Maps? Those are 'basemaps' - very much like the basemap created with folium.Map (#codecell_makeabasicmap_BringingUrData2theMap).
Have a look around the web, and you'll see that most modern online maps use a basemap, so we're going to do so as well.
#codecell_Webmaps&Distributions_BringingUrData2theMap
#get the survey area centre, like you did last week, so you can centre the map where the data is located
location_survey=survey_data_maxtime['DDLat'].mean(), survey_data_maxtime['DDLon'].mean()
print(location_survey)
#define a basemap we can reuse. Use the coordinates for the centre you generated just above to centre the basemap
#This is a variant on how we did things last time...
def generateBaseMap(default_location=[35.870086207930626, 23.301798820980512], default_zoom_start=11):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map
#codecell_Webmaps&Distributions_CheckingZeData
#check the basemap is working
base_map = generateBaseMap()
base_map
MORE TOOLS
#codecell_Webmaps&Distributions_ImportUrLibraries
#lets get the heatmap tool, like last time, let's also get a measure control so we can measure distance
from folium import plugins
from folium.plugins import HeatMap
from folium.plugins import MeasureControl
# Measure controls are pretty neat. Rather than just having a scale bar, as you would in a static map, and needing to visually estimate the size of features, you can measure them.
# The ability to measure is a benefit of moving up the 'interactivity spectrum'.
Let's start by visually comparing MRom to ERom, that is, Middle Roman to Early Roman sites, by putting their data in separate layers.
#codecell_Webmaps&Distributions_SplittingUrData
# make a layer for when each period is more than 50% likely, so you have all the sites that are probably in that period
survey_data_MRom = survey_data_maxtime.loc[(survey_data_maxtime['MRom'] > 50)]
survey_data_ERom = survey_data_maxtime.loc[(survey_data_maxtime['ERom'] > 50)]
# Yes, I know choosing a 50% cut-off is arbitrary. You could choose a different cut-off and change the values listed above.
# If you do so, all your maps (and all your conclusions about change over time) will be affected.
In the last practical lab, in #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology, you used the symbol ==. This is similar: you are using a mathematical comparison symbol, > (greater than) or < (less than), to filter your dataframe (in this case splitting it at the 50% mark).
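A minimal sketch of this kind of comparison filtering on made-up data:

import pandas as pd
toy = pd.DataFrame({'MRom': [80, 30, 55]})
#keep only the rows where the probability is greater than 50
print(toy.loc[toy['MRom'] > 50])   #keeps the rows with 80 and 55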
We've introduced the concept of 'Spatial Data' in Chapter 1.1, and presented them as layers. Similarly, in our dataset, each layer contains information that can be turned on and off. Think of this like a stack of transparent paper. Each sheet of paper is a layer, and can be added to or taken away from the stack. Their order can also be changed.
#codecell_Webmaps&Distributions_PrepareUrBasemaps_CreateLayers
# like last time, make heatmaps, but one for each period, put them in different layers.
# give your map a name
base_map = generateBaseMap()
# add two layers, one for each period
mrom = HeatMap(data=survey_data_MRom[['DDLat', 'DDLon', 'MRom']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
erom = HeatMap(data=survey_data_ERom[['DDLat', 'DDLon', 'ERom']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
#give the layers sensible names that humans can read
mrom.layer_name = 'Middle Roman Distribution'
erom.layer_name = 'Early Roman Distribution'
# add the layer control. This is the tool that lets you turn different layers in your map on and off
folium.LayerControl().add_to(base_map)
In #codecell_Webmaps&Distributions_PrepareUrBasemaps_CreateLayers, you should refer to #codecell_makeabasicmap_ManipulatingyourData_Heatmap to review the steps taken.
Last time, to create your heatmap, you used a size-based weighting (making larger sites have larger symbols). This time, you are creating two heatmaps named mrom ('Middle Roman Distribution') and erom ('Early Roman Distribution'). Because the data is located within the same island, the maps will overlap.
The layer control plugin allows you to switch the visibility of one or the other (or both) on and off. You can find some documentation on LayerControl() here.
#codecell_Webmaps&Distributions_GenerateUrBasemap
#Now generate your map by calling it by its name
base_map
Now try and add some more layers to the map to show other periods! What other periods might it be relevant to consider if you are trying to understand change over time? Edit the cell above to add more layers, or add a new cell below and follow the steps above to make a new map, and add your extra layers to it.
#codecell_makeabasicmap_ManipulatingyourData
# Here I'm doing the same thing as before but with different periods
# make a layer for when the max period is LRom or MRom to compare these periods
survey_data_lrommax = survey_data_maxtime.loc[(survey_data_maxtime['colmax'] =='LRom')]
survey_data_mrommax = survey_data_maxtime.loc[(survey_data_maxtime['colmax'] =='MRom')]
#codecell_Webmaps&Distributions_SplittingUrData_CreateLayers
# like last time, make heatmaps, but one for each period, put them in different layers
base_map = generateBaseMap()
lrommax = HeatMap(data=survey_data_lrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
mrommax = HeatMap(data=survey_data_mrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
#give the layers sensible names
lrommax.layer_name = 'Late Roman Distribution'
mrommax.layer_name = 'Middle Roman Distribution'
# add the layer control
folium.LayerControl().add_to(base_map)
base_map
# Adds a measure tool to the top right
base_map.add_child(MeasureControl())
Stacking two layers on top of one another is one way to visually compare distributions. Do you think it is effective in this case?
Perhaps it will be more useful to be able to view our distributions and explore them side by side, to help us compare what is happening in these two periods.
We'll import a new tool to allow us to see two maps where the views are synced - that is identical and moving together - to have another way to compare distributions.
# get another plugin for side by side maps
from folium.plugins import DualMap
# declare you are making a new map "m"
# set the location to the location of your survey, set your starting zoom
m = plugins.DualMap(location=location_survey, tiles=None, zoom_start=13)
# the dual maps plugin automatically defines two buddy maps, "m1" and "m2" which pan and zoom together
# give yourself some options in life for your base layers, add them to both maps 'm1' and 'm2' by just using "m"
folium.TileLayer('cartodbpositron').add_to(m)
folium.TileLayer('openstreetmap').add_to(m)
# like last time, make heatmaps, one for each period, put them in different layers
# put one layer in the left hand map 'm1' and the other in the right hand map 'm2'
lrommax = HeatMap(data=survey_data_lrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(m.m1)
mrommax = HeatMap(data=survey_data_mrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(m.m2)
#give the layers sensible names
lrommax.layer_name = 'Late Roman Distribution'
mrommax.layer_name = 'Middle Roman Distribution'
# layer control time
folium.LayerControl(collapsed=False).add_to(m)
# Adds a measure tool to the top right
m.add_child(MeasureControl())
#draw your side by side maps
m
Once you have done this exercise, I recommend diving into the NLS website to see the differences between overlays (remember to slide the transparency cursor) and side-by-side maps.
Thought exercise: The results of these two maps should be similar but slightly different. What is making the difference?
How good are you at interpreting these distributions and comparing them visually? This is me hinting that you are going to end up wanting to use statistics eventually.
The principles of what we've done this week are the same as the principles of what we did last week.
I think it's important to learn to do things more than one way, and to adapt to slightly different tools. The software and code packages used for modern spatial analysis and mapping are pretty diverse and are always developing as people improve things. It doesn't make much sense to just learn one way of making maps mechanistically. The important thing is to understand the principles and aims of what you're doing.
Key principles from this week are:
In any code package that is meant to be used for making maps, odds are good you will find a way to set the zoom level, set the centre starting location, and set the initial scale.
You will be able to set up colour schemes, map attributes, and make layers. Knowing keywords and principles is the important thing.
In the past two practical labs, you have learned and experimented with programming commands -
To re-use these code snippets, you will first need to load their respective libraries. So far, you have used:
libraries | plugins (from folium.plugins)
--- | ---
folium | HeatMapWithTime
branca | HeatMap
pandas | MeasureControl
geopandas | DualMap
Your lexicode is non-exhaustive; keep expanding it and find your own 'best way' to re-use these code snippets...
Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions
--- | ---
== () [] | pd.concat()
.head() | .dtype
.read_csv() | .astype()
mean() | fillna()
folium.Map | def ... return
range() | .apply(lambda x: function, axis=)
len() | pd.merge()
iloc[] | how=, left_index=, right_index=
.value_counts() | gpd.GeoDataFrame()
if ==: | geometry=gpd.points_from_xy
elif ==: | print()
else: | .isin()
folium.Marker() | classic.plot()
folium.Icon() | generateBaseMap()
folium.Circle | .groupby(['', ''])
popup= | .reset_index()
radius= | max_zoom=
.values.tolist() | folium.TileLayer()
.add_to() | plugins.DualMap(location=, tiles=, zoom_start=)
That's all for this chapter. Be sure to save your copy of the notebook in your own repo so you can re-use it!
Understanding the meanings behind patterns of finds recovered through excavation is a tricky problem. We hope to distinguish activity areas, places devoted to domestic and industrial use, or inhabited places that are distinct from liminal ones. We often want to discern change over time, identifying areas with finds associated with different temporal periods.
To successfully unravel these patterns, we must look not only at the distributions of different types of finds, but how they correlate with one another, the character of the contexts in which they were recovered, and their own physical and social characteristics. Are they likely to be curated? Are they light and likely to be moved from one area to another by post-depositional processes? It's all a bit of a mess.
Importantly, all these processes are spatial. Alignments or proximity between areas with similar (or quite different) finds is potentially meaningful.
The aims of this exercise are for you to:
You'll do this using data collected by the Gabii Project, a 10+ year excavation in central Italy.
##codecell_SpatialPatterns_ImportUrLibraries
# as usual, start by getting your tools: your prerequisites.
%matplotlib inline
# Matplotlib is your tool for drawing graphs and basic maps. You need this!
!pip install fiona
!pip install geopandas
import pandas as pd
import requests
import fiona
import geopandas as gpd
import ipywidgets as widgets
In #codecell_SpatialPatterns_ImportUrLibraries, these are what we call prerequisites. You know by now that they are the basic tools you need to get started. Let's review them broadly:
##codecell_SpatialPatterns_ImportUrData
# then get your data
# This is where I put the data. It's in a format called geojson, used to represent spatial geometry (shapes) and attributes (text).
url = 'http://ropitz.github.io/digitalantiquity/data/gabii_SU.geojson'
# Please get me the data at that web address (url):
# use requests.get to retrieve data from any destination
request = requests.get(url)
# I will use the letter 'b' to refer to the data, like a nickname
#we can use requests to read the response content in bytes
b = bytes(request.content)
#So we will use fiona.BytesCollection, referred to by the letter 'f':
# to read the raw data (as single-file formats or zipped shapefiles)
# to wrap up all the data from 'b'
# check the coordinate reference system (crs) listed in the features
with fiona.BytesCollection(b) as f:
    crs = f.crs
    #by also using GeoDataFrame.from_features you can read geospatial data that's at the url without saving that data to disk (your PC) first
    gabii_su_poly = gpd.GeoDataFrame.from_features(f, crs=crs)
# and print out the first few lines of the file, so I can check everything looks ok: you know this by now...we will call .head()
print(gabii_su_poly.head())
In #codecell_SpatialPatterns_ImportUrData:
GeoJSON is used to store the excavation data. The GeoJSON format encodes a variety of geographic data structures: features with spatial attributes (e.g. points, line strings, polygons, multipart geometries) and non-spatial attributes (text). This is a really useful format when creating a GIS.
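To make that concrete, here is a minimal, made-up GeoJSON feature (not a real Gabii record), written out as a Python dictionary: a point geometry plus a non-spatial property:

# a minimal, made-up GeoJSON feature: one spatial geometry + one non-spatial attribute
example_feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [12.72, 41.89]},  # lon, lat (made up)
    "properties": {"DESCRIPTIO": "example context"}                # non-spatial attribute
}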
The bytes() method returns a bytes object, an immutable (cannot be modified) sequence of integers. We tend to use this to compress data, and to save or send it.
requests.get is used to retrieve data from any destination, and request.content reads all of the response content in bytes (for non-text requests, use the .content property).
BytesCollection() takes a buffer of bytes and maps it to a virtual file that can then be opened by fiona. By using fiona.BytesCollection together with GeoDataFrame.from_features, you can read geospatial data straight from a URL without saving it to disk first.
Open source
So far, we have abundantly benefitted from open-source software, data, tools, code, design documents, and content. It is only natural to open, share and use the results of 10 years of excavation...
So you know what's in this dataset... Maybe you want to see it before you start querying or analysing it? ... Start by visualising the spatial data for all the contexts (stratigraphic units) from the excavation we'll be exploring.
So far you have dealt with survey data where all data has been logged with coordinates (x, y and sometimes z for the elevation height). However, it is not always possible, or even meaningful, to record the coordinates of all artefacts retrieved during an excavation, especially if it lasts over several years. It is then more appropriate to use stratigraphical units (SU). The spatial analysis of these finds is more challenging as they don't have a spatial location per se. We need to think carefully about organising and presenting them. Colour can be a great help for this!
##codecell_SpatialPatterns_PlottingUrData
# Now we have polygons, the shapes of our contexts. Let's visualise the data to double check that all is well
# We'll use again the function .plot (see lab_Webmaps&Distributions)
# 'plot' means draw me an image showing the geometry of each feature in my data.
# We want to control things like the color of different types of features on our map.
# I used the 'Blues' colorscale command (cmap stands for 'colour map')
# and asked it to draw the polygons differently based on the type of feature.
gabii_map1 = gabii_su_poly.plot(column='DESCRIPTIO', cmap='Blues', edgecolor='grey', figsize=(15, 15));
In #codecell_SpatialPatterns_PlottingUrData, when using .plot(), if column='' is specified the plot colouring is based on that column's values. The colour scale is defined using cmap='', the edges of the features (our polygons) with edgecolor='', and the size of the plot with figsize=(width, height).
You can of course add further parameters/symbologies to your plots. The choice of parameters is largely dictated by your data analysis. Here is some documentation on plot generation.
The colorscale options are: Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, inferno, inferno_r, jet, jet_r, magma, magma_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, winter, winter_r
Swap out 'Blues' in the cell above for any of these options...
Like many excavations, not every special find in this dataset has spatial coordinates associated with it (because in real archaeological life things are found in the sieve, the wheelbarrow, and during washing).
###codecell_SpatialPatterns_WhichTypeOfSpecialFinds&fromWhere?
# Now I'm going to bring in all the basic Gabii special finds data - descriptions, object types, IDs and the contexts from which they come.
# We've had a few special finds over the years.
sf_su = pd.read_csv("https://raw.githubusercontent.com/ropitz/gabii_experiments/master/spf_SU.csv")
sf_su
###codecell_SpatialPatterns_WhichTypeOfSpecialFinds?
#the set() function returns all values (here our special find types) without duplicates
#this is a useful tool when you need to standardise your finds labels (and check your metadata for spelling!)
sf_su_desc = sf_su['SF_OBJECT_TYPE']
set(sf_su_desc)
#however, this is a really long list to deal with, so we need to find ways to prepare this dataset to really see what has been happening on this site.
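As a tiny illustration of what set() is doing here (with made-up labels rather than the real finds list):

# made-up labels with duplicates, for illustration only
labels = ['Loom Weight', 'Spool', 'Loom Weight', 'Spindle Whorl', 'Spool']
print(set(labels))
# prints {'Loom Weight', 'Spool', 'Spindle Whorl'} (order may vary): each value appears once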
One of our area supervisors, Troy, is super excited about tools related to textile production. They're a great example of how we think about special finds at Gabii. Multiple types of finds are related to textile production. Do we find all types everywhere? Are certain types of tools more concentrated in one type of context or one area than others? Troy has lots of questions about the patterns of places where we find these tools. Do they provide evidence for early textile production? Are they a major factor in the city's early wealth? Do we find the same things in later periods? After all, people under the Republic and Empire wore clothes... Loom Weights, spools, and spindle whorls are the most common weaving tools at Gabii.
###codecell_SpatialPatterns_SpecialFindsSelection
#Let's pull all those find types out of the big list.
#We're selecting the finds data we want to work with before merging with the spatial data. We could do these operations in reverse if we wanted to.
#here, very much like in lab1 (#codecell_makeabasicmap_BringingUrData2theMap) and lab2 (#codecell_Webmaps&Distributions_SplittingUrData), we are using the .loc and .isin functions
types = ['Loom Weight','Spool','Spindle Whorl']
textile_tools = sf_su.loc[sf_su['SF_OBJECT_TYPE'].isin(types)]
textile_tools
#we now have a new dataframe containing only textile_tools
Presence or absence isn't everything. You may want to know how many of a certain type of find is present in a given area.
###codecell_SpatialPatterns_TextileToolsListing
# Now let's count up how many of these tools appear in each context (SU).
# the pd.value_counts() function returns a series containing counts of unique values.
# So we can print out a list of the number of textile_tools in each SU next to that SU number.
pd.value_counts(textile_tools['SU'].values, sort=True)
###codecell_SpatialPatterns_TextileToolsBecomesSpatial
#Then let's combine the special finds data with our polygons representing context with shape and a spatial location
# We do this with a command called 'merge'. In lab2,#codecell_Webmaps&Distributions_MergingZeData, you have used pandas pd.merge()
gabii_textools = gabii_su_poly.merge(textile_tools, on='SU')
# very much like pd.merge(), you have now created a new dataframe ('gabii_textools') by merging the dataframe 'textile_tools' on='SU', the stratigraphical unit
# let's have a look at the new dataframe using .head() to print out just the first few rows.
gabii_textools.head()
###codecell_SpatialPatterns_SeeingTextileToolsinContext
# If we want to see this result as a map, we just add the .plot command to the end of the dataframe's name
# here the .plot() symbology is expanded with transparency using 'alpha=', where a value of 1 is complete opacity and 0 complete transparency
gabii_textools.plot(column='SF_OBJECT_TYPE', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5)
OK, what do you see here? Compare the distribution of each type of textile tool. Do some types seem to be concentrated in certain areas? How might you check? What factors might contribute to this pattern? Do big layers simply aggregate lots of stuff? Do late dumps contain early materials? Why would one type of tool appear where the others don't?
###codecell_SpatialPatterns_SortingDataTextileTools
# We can try and see the relationship between layer size and count by sorting
#our list of finds by the surface area of each layer.
# We use the command 'sort_values'
gabii_textools.sort_values(by=['Shape_Area'],ascending=False)
# the '.sort_values' function sorts along a given column (here 'Shape_Area')
# the default sort order is ascending (smallest to largest, ascending=True); we want to see them in descending order, so we set ascending=False
The Gabii excavations have revealed enormous colluvial layers. This is an important consideration, as these large contexts will bias the distribution of artefacts across the site. Therefore, very large areas should probably be excluded.
###codecell_SpatialPatterns_RefiningDataSorting
# Outliers will mess with any analysis. Here, large stratigraphical layers are our outliers
# By cutting out these layers, i.e. excluding SUs with a surface area greater than 800, we can deal with these outliers
gabii_textools2 = gabii_textools.loc[gabii_textools['Shape_Area']<800]
###codecell_SpatialPatterns_VisualisingDataSorting
# If we want to see this result as a map, we just add the .plot command to the end again.
gabii_textools2.plot(column='SF_OBJECT_TYPE', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5)
# That's better. Plot the results to see that you've removed the big colluvial layers.
To answer the question 'how many of each tool type appears in each SU?', you will need to further group and merge your data.
###codecell_SpatialPatterns_GroupingData
# OK, count up how many of each tool type appears in each SU using the 'groupby' command.
# You have used this command before in #codecell_Webmaps&Distributions_SplittingUrData_CreateLayers
## and .fillna() was explained in #codecell_Webmaps&Distributions_AllNumbers
textools_counts = gabii_textools2.groupby('SU')['SF_OBJECT_TYPE'].value_counts().unstack().fillna(0)
# Sort the list so that the SUs with the most stuff end up at the top.
textools_counts.sort_values(by=['Loom Weight','Spindle Whorl','Spool'], ascending=False)
###codecell_SpatialPatterns_MergingData
# Merge your textile tool counts with your spatial data for the contexts
# Because both dataframes have a 'SU' column, you can use this to match up the rows.
# so the merger will occur on='SU'
gabii_textools_counts = gabii_su_poly.merge(textools_counts, on='SU')
gabii_textools_counts.head()
Side by side plots of different variables can help you visualise the differences between the spatial patterns you're exploring, very much like in the Webmaps&Distributions lab, when you compared Late Roman and Middle Roman artefact distributions using two heatmaps side by side.
###codecell_SpatialPatterns_AssessingFindType
# Let's start by looking at each class of textile tool individually.
# Plot the counts of each type of find spatially
gabii_textools_counts.plot(column='Loom Weight', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Loom weight",'orientation': "vertical"})
gabii_textools_counts.plot(column='Spindle Whorl', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Spindle Whorl",'orientation': "vertical"})
gabii_textools_counts.plot(column='Spool', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Spool",'orientation': "vertical"})
###codecell_SpatialPatterns_PlottingAllFindType
# here's another visualisation. I've chosen a single-hue colour scale for each type - shades of blue, shades of red, shades of green - to show the quantity of each type of find in a single map.
base = gabii_textools_counts.plot(column='Loom Weight', cmap='Blues', figsize=(15, 15), alpha=0.7)
gabii_textools_counts.plot(ax=base, column='Spindle Whorl', cmap='Reds', alpha=0.7)
gabii_textools_counts.plot(ax=base, column='Spool', cmap='Greens', alpha=0.7);
So far, it has been difficult to see what's happening, to identify activities between the buildings and to compare the maps when we have to scroll.
##codecell_SpatialPatterns_ImportUrLibraries
import matplotlib.pyplot as plt
###codecell_SpatialPatterns_AllFindTypeSidebySide
# Let's put the maps side by side to help with comparative visualisation.
fig, axes = plt.subplots(ncols=3,figsize=(15, 5))
gabii_textools_counts.plot(column='Loom Weight', cmap='autumn', ax=axes[0], legend=True, legend_kwds={'label': "Number of Loom weight",'orientation': "vertical"}).axis('equal')
gabii_textools_counts.plot(column='Spindle Whorl', cmap='autumn', ax=axes[1], legend=True, legend_kwds={'label': "Number of Spindle Whorl",'orientation': "vertical"}).axis('equal')
gabii_textools_counts.plot(column='Spool', cmap='autumn',ax=axes[2], legend=True, legend_kwds={'label': "Number of Spool",'orientation': "vertical"}).axis('equal')
Can you see any patterns here? Do the different types of tools concentrate in the same parts of the site? Why might different types of tools have different distributions?
OK, this next cell is a big scary cell because something broke after I drafted this exercise. Push run to fix the thing that's broken (hopefully).
#@title
!apt-get install -qq curl g++ make
#@title
!curl -L http://download.osgeo.org/libspatialindex/spatialindex-src-1.8.5.tar.gz | tar xz
#@title
import os
os.chdir('spatialindex-src-1.8.5')
#@title
!./configure
#@title
!make
#@title
!make install
#@title
!pip install rtree
#@title
!ldconfig
#Working through the example at http://toblerity.org/rtree/examples.html
#@title
from rtree import index
from rtree.index import Rtree
#@title
p = index.Property()
idx = index.Index(properties=p)
idx
There is a human limitation here: we struggle to recognise more than about three attributes/variables across 100 data points, as in this four-dimensional tic-tac-toe grid:
That's why we have to use statistics, and spatial statistics in particular, to recognise and, more to the point, quantify patterns.
##codecell_SpatialPatterns_ImportUrLibraries
# I think the distributions of different weaving tools vary.
# To investigate further, we are going to need more tools. Specifically we need statistical tools.
# pysal, numpy and sklearn are all useful for statistics. Seaborn is useful for visualisation.
!pip install pysal
import pysal
from sklearn import cluster
import seaborn as sns
import numpy as np
We're going to use cluster analysis to try and better understand our patterns. Clustering is a broad set of techniques for finding groups within a data set. Cluster analysis has as its objective grouping together similar observations (unlike factor analysis, which works by searching for similar variables).
M. Fortin & M. Dale (2005) explain that:
"Arising from time series statistics and the more familiar parametric statistics, spatial statistics quantify the degree of self-similarity of a variable as a function of distance. These spatial statistics assume that, within the study area, the parameters of the function defining the underlying process, such as the mean and the variance, are constant regardless of the distance and direction between the sampling locations. Then the goal of spatial statistics is to test the null hypothesis of absence of 'spatial pattern'. The null hypothesis implies that nearby locations (or attributes, measures) do not affect one another, such that there is independence and spatial randomness (Figure 6.2(b)). The alternatives are that there is clustering, and thus positive spatial autocorrelation (Figure 6.2(a)), or repulsion, and negative spatial autocorrelation (Figure 6.2(c))."
K-Means clustering is probably the most well-known clustering algorithm. It solves the clustering problem by splitting a dataset into a set of k groups (k being an arbitrary number you get to choose).
We can only recommend Ben Alex Keen's blog to see how Python works through clustering! Here is a simple version of K-Means Clustering explained in 5 visual steps (a minimal code sketch of the same loop follows the table):
Step1 | Step2 | Step3 | Step4 | Step5 |
---|---|---|---|---|
Assign each point to a similar centre: we need to identify the number of classes to use by looking at the data and identifying any discrete groupings. | Identify the cluster centroids (3 coloured symbols on the graph). | Reassign the points based on the minimum distance from the cluster centroids. Here we should emphasise that there are many possible definitions of 'closest' (e.g. nearest neighbour, Ward's method). | Identify the new centroids by taking the average of all points in the cluster. | Reassign the groupings (points and assignments) in a loop until points stop changing clusters. |
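To make the loop above concrete, here is a minimal from-scratch numpy sketch of those five steps. It is for illustration only, under the assumption that every cluster keeps at least one point; in this practical we use sklearn's implementation instead:

import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # steps 1-2: choose k and pick k random points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # step 5: a fuller version would stop once the labels no longer change
    return labels, centroids

# two made-up, well-separated blobs of points
pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(kmeans_sketch(pts, k=2)[0])  # one cluster label per point, e.g. [0 0 1 1]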
Because clustering allows us to identify which things are alike on the basis of multiple characteristics, we will do just that. We will use cluster analysis to see which SUs are similar in terms of all three types of textile tools: loom weights; spools; spindle whorls. In #codecell_SpatialPatterns_ImportUrLibraries, we import a cluster tool from sklearn to help us with it.
Let's use Python to run the K-means function.
First, we want to cluster together contexts where the pattern of the three types of textile tools are similar.
##codecell_SpatialPatterns_ClusterTextileTools
# Next step: cluster together contexts where the pattern of the three types of textile tools are similar,
# with and without respect to the size of the SU.
# In this cell we are including information on the size of the SU
# we will use three functions: cluster.KMeans(), .fit() and .drop():
km5 = cluster.KMeans(n_clusters=5)
km5cls = km5.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU'], axis=1).values)
km5cls
In #codecell_SpatialPatterns_ClusterTextileTools, we have given k (n_clusters) an arbitrary value of 5 groups.
Each cluster produced should contain the SUs that are similar to one another on the basis of the number of each type of textile tool and the size of the surface area of the SU.
##codecell_SpatialPatterns_SubPlotClusterTextileTools
# Plot the clusters, groups of contexts that have similar textile tool assemblages.
# Give a different colour to the SUs that belong to each cluster.
#creating a single chart
f1, ax = plt.subplots(1, figsize=(15,15))
#plt.subplots() is a function that returns a tuple (an immutable sequence of Python objects, similar to a list that cannot be changed)
#this tuple contains a figure and axes object(s)
#so, when using fig, ax = plt.subplots() you unpack this tuple into the variables fig and ax.
#here f1 is your figure
#here we want to assign() our new clustering labels 'km5cls.labels_' to the dataframe
#and also plot the cluster column ('cl'). (nb: ax=ax ties the plot to the axes object created above)
gabii_textools_counts.assign(cl=km5cls.labels_)\
.plot(column='cl', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax)
ax.set_axis_off()
#turn x and y-axis off which will affect the axis lines, ticks, ticklabels, grid and axis labels
#let's see the plot using command plt.show()
plt.show()
##codecell_SpatialPatterns_ClusterTextileTools_2
#Do the same as in ##codecell_SpatialPatterns_ClusterTextileTools
#but let's ignore the size of the context so we also .drop('Shape_Area').
km5 = cluster.KMeans(n_clusters=5)
km5cls2 = km5.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU','Shape_Area'], axis=1).values)
#we plot as in ##codecell_SpatialPatterns_SubPlotClusterTextileTools
#this time as 'f2' and with our new cluster column 'cl2'
f2, ax = plt.subplots(1, figsize=(15,15))
gabii_textools_counts.assign(cl2=km5cls2.labels_)\
.plot(column='cl2', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax)
ax.set_axis_off()
plt.show()
The patterns are definitely different. How can we interpret the fact that context size affects the pattern of the distribution of textile tools? Do big units, which perhaps represent dumps or colluvial mashups, have a fundamentally different character than the varied small contexts?
##codecell_SpatialPatterns_SubPlotClustersSidebySide
# Look at the difference with and without context size taken into account.
#we are plotting just like in##codecell_SpatialPatterns_SubPlotClusterTextileTools
#however we want to see them side by side to see effect of our selection.
#To do so, we define our plt.subplots as having 2 columns (ncols=2)
fig, axes = plt.subplots(ncols=2,figsize=(15, 5))
gabii_textools_counts.assign(cl2=km5cls2.labels_)\
.plot(column='cl2', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=axes[0]).axis('equal')
gabii_textools_counts.assign(cl=km5cls.labels_)\
.plot(column='cl', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=axes[1]).axis('equal')
# note here that we have two axes for our two plots:
# ax=axes[0] for the first subplot and ax=axes[1] for the second subplot
##codecell_SpatialPatterns_AddLabelClusterstoDataframe
# assign the cluster IDs to each context permanently
gabiitextools_clas = gabii_textools_counts.assign(cl=km5cls.labels_)
gabiitextools_class = gabiitextools_clas.assign(cl2=km5cls2.labels_)
#and let's check what your dataframe looks like
gabiitextools_class.head()
Data exploration can help us to understand how well the cluster analysis performed.
##codecell_SpatialPatterns_SubselectClustergroupN.0
# Now let's look at some individual classes, with and without context size accounted for in the analyses.
# We choose the cluster group/class 0
# we use the .loc[] command (see ##codecell_makeabasicmap_BringingUrData2theMap & #codecell_Webmaps&Distributions_SplittingUrData)
# and create 2 dataframes:
# With area => gabiitextools_class['cl'], where we only select (==) group 0
# Without area => gabiitextools_class['cl2'], where we only select (==) group 0
gabiitextools_class0=gabiitextools_class.loc[gabiitextools_class['cl']==0]
gabiitextools_class0noarea=gabiitextools_class.loc[gabiitextools_class['cl2']==0]
#we are plotting just like in##codecell_SpatialPatterns_SubPlotClustersSidebySide
fig, axes = plt.subplots(ncols=2,figsize=(15, 5))
gabiitextools_class0.plot(ax=axes[0], legend=True).axis('equal')
gabiitextools_class0noarea.plot(ax=axes[1]).axis('equal')
##codecell_SpatialPatterns_ClusterTextileTools_7types_WithContextSize
# What happens when we change the number of clusters (groups)?
# let's repeat our first analysis (##codecell_SpatialPatterns_ClusterTextileTools)
# and let's keep the context size ('Shape_Area')
km7 = cluster.KMeans(n_clusters=7)
km7cls3 = km7.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU'], axis=1).values)
f3, ax = plt.subplots(1, figsize=(15,15))
gabii_textools_counts.assign(cl3=km7cls3.labels_)\
.plot(column='cl3', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax)
ax.set_axis_off()
plt.show()
That also changes things. Without going into too much detail, finding the ideal number of clusters is challenging. There are many statistical techniques to help you choose the best number of clusters. However, this is for a more advanced stats course! For now, just try playing around with the number of clusters in the notebook, or the size cut-off for inclusion.
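If you'd like one common heuristic to experiment with, here is a hedged sketch of the 'elbow method': run K-Means for a range of k values and plot the inertia (the within-cluster sum of squared distances, exposed by sklearn as the inertia_ attribute of a fitted KMeans); a 'kink' where the curve flattens suggests a reasonable k. This reuses gabii_textools_counts and the imports from earlier cells; the range of candidate k values here is an arbitrary choice.

# sketch: the 'elbow method' for choosing k (reusing earlier imports)
data = gabii_textools_counts.drop(['geometry', 'OBJECTID', 'DESCRIPTIO', 'Shape_Length', 'SU'], axis=1).values
inertias = []
ks = range(2, 11)  # an arbitrary range of candidate k values
for k in ks:
    # inertia_ is the within-cluster sum of squared distances of the fitted model
    inertias.append(cluster.KMeans(n_clusters=k).fit(data).inertia_)
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')
plt.show()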
##codecell_SpatialPatterns_ClusterTextileTools_7types_WithoutContextSize
# Use 7 clusters, following the same procedure as in ##codecell_SpatialPatterns_ClusterTextileTools_7types_WithContextSize
# and let's drop the context size ('Shape_Area')
km7 = cluster.KMeans(n_clusters=7)
km7cls4 = km7.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU','Shape_Area'], axis=1).values)
f4, ax = plt.subplots(1, figsize=(15,15))
gabii_textools_counts.assign(cl4=km7cls4.labels_)\
.plot(column='cl4', categorical=True, legend=True, \
linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax)
ax.set_axis_off()
plt.show()
##codecell_SpatialPatterns_AddLabelto7ClusterGroups
# assign the cluster IDs to each context permanently just like in ##codecell_SpatialPatterns_AddLabelClusterstoDataframe
# Let's set up to investigate some of the individual clusters
gabiitextools_class3=gabiitextools_class.assign(cl3=km7cls3.labels_)
gabiitextools_class4=gabiitextools_class3.assign(cl4=km7cls4.labels_)
gabiitextools_class4.head()
##codecell_SpatialPatterns_EffectofClusterNumber&AreaSizetoClustergroupN.0
#we are following the same steps as in ##codecell_SpatialPatterns_SubselectClustergroupN.0, but with our 7 cluster groups/classes
# set up variables to store several classes, with and without context size taken into account.
gabiitextools_class0=gabiitextools_class4.loc[gabiitextools_class4['cl']==0]
gabiitextools_class0noarea=gabiitextools_class4.loc[gabiitextools_class4['cl2']==0]
gabiitextools_k7_class0=gabiitextools_class4.loc[gabiitextools_class4['cl3']==0]
gabiitextools_k7_class0noarea=gabiitextools_class4.loc[gabiitextools_class4['cl4']==0]
# setting the subplots to the four dataframes, we have just identified above
# this can be done by arranging the subplots like a 2X2 table/array with 'ncols=2' & 'nrows=2'
fig, axes = plt.subplots(ncols=2,nrows=2,figsize=(15, 10))
gabiitextools_class0.plot(ax=axes[0,0]).axis('equal')
#note here that ax=axes[row, col] defines the position of the figure in this 2X2 array (in this case row 0, column 0)
axes[0,0].set_title('5 clusters - area')
#for ease, we have added titles to the subplots
gabiitextools_class0noarea.plot(ax=axes[0,1]).axis('equal')
axes[0,1].set_title('5 clusters - no area')
gabiitextools_k7_class0.plot(ax=axes[1,0]).axis('equal')
axes[1,0].set_title('7 clusters - area')
gabiitextools_k7_class0noarea.plot(ax=axes[1,1]).axis('equal')
axes[1,1].set_title('7 clusters - no area')
##codecell_SpatialPatterns_EffectofClusterNumber&AreaSizetoClustergroupN.3
# now try some different cluster groups
# we are choosing cluster no.3 this time
#we are following our previous steps
gabiitextools_class3=gabiitextools_class4.loc[gabiitextools_class4['cl']==3]
gabiitextools_class3noarea=gabiitextools_class4.loc[gabiitextools_class4['cl2']==3]
gabiitextools_k7_class3=gabiitextools_class4.loc[gabiitextools_class4['cl3']==3]
gabiitextools_k7_class3noarea=gabiitextools_class4.loc[gabiitextools_class4['cl4']==3]
fig, axes = plt.subplots(ncols=2,nrows=2,figsize=(15, 10))
gabiitextools_class3.plot(ax=axes[0,0]).axis('equal')
axes[0,0].set_title('5 clusters - area')
gabiitextools_class3noarea.plot(ax=axes[0,1]).axis('equal')
axes[0,1].set_title('5 clusters - no area')
gabiitextools_k7_class3.plot(ax=axes[1,0]).axis('equal')
axes[1,0].set_title('7 clusters - area')
gabiitextools_k7_class3noarea.plot(ax=axes[1,1]).axis('equal')
axes[1,1].set_title('7 clusters - no area')
Cluster analysis is an important statistical technique. While not the main focus of this course, it's worth learning more about it.
We have used K-Means clustering because it has the advantage of being fast, easy to understand, and fairly robust and efficient. However, it has disadvantages too, such as leaving the choice of the number of groups/classes to you (which, as you have seen, isn't always inconsequential; ideally, you'd want the algorithm to figure this out). Furthermore, K-Means clustering cannot deal with non-linear datasets, and if the data overlap heavily then K-Means will not be able to resolve the differentiation between clusters. The best results are achieved when the clusters are distinct and well separated from each other.
Reminder: linear versus non-linear datapoints:
Assessing spatial patterns: Degree of Clustering. The degree of clustering is a measure of the extent to which nodes in a graph tend to cluster together. You can follow this video to understand the concept behind the clustering coefficient; below are various degrees of clustering results.
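If you want to try computing a clustering coefficient yourself, here is a hedged sketch using the networkx library (an extra dependency, not one of this lab's usual imports), on a tiny made-up graph:

import networkx as nx
# a made-up graph: a triangle (nodes 1-2-3, fully clustered) plus one dangling node 4
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
print(nx.clustering(G))          # the clustering coefficient of each node
print(nx.average_clustering(G))  # the average clustering coefficient of the graph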
K-means is not the only clustering technique. I encourage you to do some independent reading on different approaches to clustering. You may want to have a look at this to start off.
Although we can see differences, it is difficult to:
Let's turn to statistics to do this for us.
In #codecell_SpatialPatterns_ImportUrLibraries, you imported the seaborn library to do just that.
Correlations are statistical associations between two variables. When we say that two variables are associated, we are suggesting that they vary (change) together. This implies that the processes that affect these variables may be related. We try to detect correlation to understand the relationship between two variables.
For example: should we expect the pattern of loom weights to be similar to the pattern of spindle whorls? should we expect them to be correlated?
Note that: Correlation does not imply causation! That is, the pattern of loom weights does not cause or create the pattern of spindle whorls.
The most frequently used Parametric correlation statistic is the Pearson’s correlation r. Pearson’s r tests if there is a linear relationship between two variables, and can only range between -1 and 1, where -1 indicates perfect negative linear (one to one) correlation and +1 indicates the perfect positive linear correlation (see Figure below).
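As a quick illustration (with made-up numbers, not our Gabii counts), you can compute Pearson's r with numpy; np.corrcoef returns a correlation matrix, and the off-diagonal entry is r:

import numpy as np
loom_weights = np.array([5, 3, 8, 1, 6])    # made-up counts per SU, for illustration only
spindle_whorls = np.array([4, 2, 9, 0, 5])  # made-up counts per SU
r = np.corrcoef(loom_weights, spindle_whorls)[0, 1]
print(r)  # close to +1: these made-up counts rise and fall together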
Examples of Pearson's correlation coefficient (r) for different degrees of linearity |
---|
Examples of positive and negative covariance |
---|
We will be using seaborn.pairplot to visualise the relationships between our variables. A pairs plot allows us to see both distribution of single variables and relationships between two variables. You can learn a bit more about it in Will Koehrsen's blog.
##codecell_SpatialPatterns_EffectofClusterNumber
# Do 7 clusters vs. 5 result in more correlation between the spatial distribution patterns of pairs of find types?
# we are simply using sns.pairplot function to visualize relationships (correlations) between the find types and the clusters they appear in.
# We are asking: do certain pairs of variables appear in the same clusters most of the time?
sns.pairplot(gabiitextools_k7_class0.drop(['OBJECTID','DESCRIPTIO','Shape_Length','Shape_Area','SU','geometry','cl','cl2','cl3','cl4'], axis=1), kind="reg")
plt.show()
##codecell_SpatialPatterns_CorrelationByClusterGroup
# Are some cluster classes/groups more correlated than others?
sns.pairplot(gabiitextools_class0.drop(['OBJECTID','DESCRIPTIO','Shape_Length','Shape_Area','SU','geometry','cl','cl2','cl3','cl4'], axis=1), kind="reg")
plt.show()
Is there strong positive or negative correlation between pairs of variables (textile tools and the clusters they end up in)? What might this say about the processes behind these patterns?
Statistically assessing correlation is a good starting point for understanding patterns in your data.
Because this isn't a stats course, we're not taking this forward any further today. But it's worth noting that if you want to do spatial analysis properly at some stage you will need to learn a bit about statistics.
That concludes this practical.
Hopefully you have:
To re-use these code snippets, you will first need to load their respective libraries. So far, you have used:
libraries | |
--- | --- | ---
folium | numpy | branca
rtree | pandas | osmnx
geopandas | requests | seaborn
fiona | matplotlib.pyplot | ipywidgets
pysal | |

plugins | |
--- | ---
HeatMapWithTime (folium.plugins) | HeatMap (folium.plugins)
MeasureControl (folium.plugins) | DualMap (folium.plugins)
cluster (from sklearn) |
Your lexicode is non-exhaustive; keep expanding it and find your own 'best way' to re-use these code snippets...
Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions | Lexicode_StreetGridOrientations | Lexicode_SpatialPatterns
--- | --- | --- | ---
== () [] | pd.concat() | { } subselection from list | requests.get()
.head() | .dtype | ox.gdf_from_places() | request.content
.read_csv() | .astype() | ox.plot_shape() | bytes()
mean() | fillna() | network_type='' | gpd.GeoDataFrame.from_features()
folium.Map | def ... return | ox.add_edge_bearings(ox.get_undirected()) | set()
range() | .apply(lambda x: function, axis=) | count_and_merge() | pd.value_counts()
len() | pd.merge() | np.arange() | .merge()
iloc[] | how=, left_index=, right_index= | np.histogram() | .sort_values()
.value_counts() | gpd.GeoDataFrame() | ax.set_theta_location() | cluster.KMeans()
if ==: | geometry=gpd.points_from_xy | ax.set_ylim() | .fit()
elif ==: | print() | ax.set_title() | .drop()
else: | .isin() | ax.set_yticks() | .assign()
folium.Marker() | classic.plot() | ax.set_xlabels() & ax.set_yticklabels() | plt.subplots()
folium.Icon() | generateBaseMap() | plt.subplots() | plt.show()
folium.Circle | .groupby(['', '']) | .dropna() | .set_title()
popup= | .reset_index() | polar_plot() | sns.pairplot()
radius= | max_zoom= | pd.Series() |
.values.tolist() | folium.TileLayer() | np.pi |
.add_to() | plugins.DualMap(location=, tiles=, zoom_start=) | |
Fortin, M., & Dale, M. (2005). Spatial Analysis: A Guide for Ecologists. Cambridge: Cambridge University Press.
In this exercise, we'll be learning to access information about the environment or landscape in a raster format. We'll then relate the locations of visible sites to various environmental factors.
So far in this course you've been working mostly with vector data (points, lines, polygons) but now you're going to work with rasters as well. Environmental data is often held in a raster format because the data is continuous (see Raster versus Features below for more details). Elevation is everywhere, right? So if you have a big image file (a raster) you can have a value for elevation in pretty much every pixel of that file.
Therefore, this week's practical lab aims to use rasters to explore (and investigate, of course):
In #codecell_StreetGridOrientations_geodataframe, we started to talk about vector feature data and why topology characterisation is important in geographical information systems. Vectors are not the only way to represent the world and what's in it; raster is another format used ... because:
There are different types of raster data, image rasters and attribute rasters, as illustrated below:
Image raster (geology.com) | Attribute raster (Met Office)
--- | ---
A raster format file is composed of cells. Each cell has a pixel value (= picture element, which relates to the resolution and to its position on the surface) and a grid cell value (its size in units), and each cell has attributes stored in the array. For instance, an attribute can relate to spectral reflectance (ranging from 0-255), colour (RGB is 3 attributes) or a numeric value (which can be associated with a full description through a join with a Look-Up Code -LU_code- such as a reference table for a Land Use map).
Raster Characteristics | Example of a Raster Array composition
--- | ---
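In code, a raster band is simply a 2D array of cell values. A tiny made-up example (a 3x3 'elevation' grid) makes the idea concrete:

import numpy as np
# a made-up 3x3 raster band: each cell stores one value (say, elevation in metres)
band = np.array([[12, 14, 15],
                 [11, 13, 16],
                 [10, 12, 18]])
print(band.shape)  # (rows, cols) = (3, 3)
print(band[0, 2])  # the value of the cell in row 0, column 2 -> 15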
##codecell_RasterLandscape_ImportUrLibraries
!pip install rasterio
!pip install elevation
!pip install richdem
In #codecell_RasterLandscape_ImportUrLibraries, we install the libraries used for this practical, which include:
##codecell_RasterLandscape_ImportUrLibraries
import rasterio
# import the main rasterio library
# it provides read, read-write, and write access to raster data files
import matplotlib
# We used this library in lab4_SpatialPatterns
# matplotlib is the primary python plotting and viz library
%matplotlib inline
# this bit of magic allows matplotlib to plot inline in a jupyter notebook
# We can check which version we're running by printing the "__version__" variable
print("rasterio's version is: " + rasterio.__version__)
print(rasterio)
Implications of spatial detail of land use / land cover? Implications of temporal detail of land use / land cover mapping episodes? | Land cover indicates the physical land type, such as forest or open water, whereas land use documents how people are using the land
--- | ---
##codecell_RasterLandscape_OpenImage
# filepath to our image
img_fp = 'http://ropitz.github.io/digitalantiquity/data/LE70220492002106EDC00_stack.gtif'
# Open a geospatial dataset
# open() handles your file
# this function allows you to open a file and return it as a file object
dataset = rasterio.open(img_fp)
print(dataset)
Let's learn some basic things about the image data.
##codecell_RasterLandscape_ImageCharacteristics
# what is the name of this image
img_name = dataset.name
print('Image filename: {n}\n'.format(n=img_name))
# How many bands does this image have?
num_bands = dataset.count
print('Number of bands in image: {n}\n'.format(n=num_bands))
# How many rows and columns?
rows, cols = dataset.shape
print('Image size is: {r} rows x {c} columns\n'.format(r=rows, c=cols))
# Does the raster have a description or metadata?
desc = dataset.descriptions
print('Raster description: {desc}\n'.format(desc=desc))
# What driver was used to open the raster?
driver = dataset.driver
print('Raster driver: {d}\n'.format(d=driver))
# What is the raster's projection?
proj = dataset.crs
print('Image projection:')
print(proj, '\n')
# What is the raster's "geo-transform"
gt = dataset.transform
print('Image geo-transform:\n{gt}\n'.format(gt=gt))
# Does the raster have a metadata?
metadata = dataset.meta
print('All raster metadata:')
print(metadata)
print('\n')
In #codecell_RasterLandscape_OpenImage, rasterio lets you open the raster via the .open() function, which returns it as a Dataset object.
In #codecell_RasterLandscape_ImageCharacteristics, you wanted to know a bit more about this raster, in particular its characteristics & metadata, by calling information on:
what does this mean?
That's the fun part...
But what are image bands?
A raster dataset contains one or more layers called bands. For instance, a colour image can have three bands using RGB: 1 band Red, 1 band Green, and 1 band Blue:
An image can also be a multispectral image with many bands, such as remote sensing data, which recombines the electromagnetic spectrum into multiband raster data. Here every cell location has more than one value associated with it. You might have heard of Near Infrared (NIR); these raster images have 8 bands, and they come with each band as a separate file (bands 2, 3, 4 and 8 are Blue, Green, Red, and NIR respectively) (see more here). Infrared (and ultraviolet) are the bits of the spectrum we cannot see with our own eyes.
The rasterio dataset object we've created contains a lot of useful information, but it has not yet read in the raster image itself. Instead, we need to access the raster's bands using the read() method:
##codecell_RasterLandscape_CheckShapeOfBands
# Let's open the fourth band in our image
# the read() function allows us to specify a subset of the raster bands in the brackets ()
nir = dataset.read(4)
# Using the .shape attribute, we want to check out the dimensions of the image
nir.shape
# same idea as in ##codecell_RasterLandscape_ImageCharacteristics
# if we had used .read() for all bands, the answer would have been (8, 250, 250)
# the .shape answer provides you with (bands, rows, cols)
# but images can be very, very large files, so it is better to do this bit by bit
##codecell_RasterLandscape_ImportUrLibraries
#we get numpy again
# because raster tends to be large(heavy) files,
#numpy allows us to represent the data very efficiently without using a lot of computer memory
import numpy as np
##codecell_RasterLandscape_BandsDataType
# What are the band's datatypes?
# remember, we used the dtype function in lab2_Webmaps&Distributions
datatype = dataset.dtypes
print('Band datatypes: {dt}'.format(dt=datatype))
# How about some band statistics?
band_mean = np.mean(nir)
band_min = np.amin(nir)
band_max = np.amax(nir)
band_stddev = np.std(nir)
print('Band range: {minimum} - {maximum}'.format(maximum=band_max,
minimum=band_min))
print('Band mean, stddev: {m}, {s}\n'.format(m=band_mean, s=band_stddev))
In #codecell_RasterLandscape_BandsDataType, we are asking numpy (np) to:
tell us the bands' datatype (dtypes). The answer is 'int16', a 16-bit integer. Each type of integer has a different storage range:
Type | Capacity
--- | ---
int16 | -32,768 to +32,767
int32 | -2,147,483,648 to +2,147,483,647
int64 | -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
provide us with the mean ( np.mean() ), minimum ( np.amin() ), maximum ( np.amax() ) and the standard deviation ( np.std() ). So this is just like asking how the data (band no. 4) is spread:
and how the data is distributed:
##codecell_RasterLandscape_ImageDisplay
#read in the image
full_img = dataset.read()
#we are checking the .shape of the image (do I have all my bands? See ##codecell_RasterLandscape_CheckShapeOfBands)
full_img.shape
# we are adding a wee bit of rasterio library here
# import the show function which allows us to display the image
from rasterio.plot import show
#then print the nir band with the whole image shape
#we would like to have a grayscale map (cmap see also ##codecell_SpatialPatterns_PlottingUrData)
print("Image dimensions: ", full_img.shape)
show(nir, transform=dataset.transform, cmap='gray')
##codecell_RasterLandscape_ImportUrLibraries
#more tools
import matplotlib # matplotlib is the primary python plotting and viz library
import matplotlib.pyplot as plt
# this bit of magic allows matplotlib to plot inline in a jupyter notebook
#see also lab4_spatial patterns
%matplotlib inline
import folium # folium is an interactive mapping library
Some raster datasets will only have one band. These behave a bit like a black and white image. Rasters can contain as many bands as you like, and single band images can be combined into multiple band images (see above in But what are image bands?). Multiple band images are useful because you can calculate ratios between the different bands to enhance the visibility of certain features in your image.
##codecell_RasterLandscape_RasterCombiningBands
# to make a multiple band raster add all the file paths to a list that we call s2_bands
s2_bands = [
"http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B02.tiff",
"http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B03.tiff",
"http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B04.tiff",
"http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B08.tiff"
]
# open these files and add all the bands together
# we use a 'for' loop with a 'with ... as' statement
# arrs [] is a new, empty list
# into which we append each file's first band
arrs = []
for band in s2_bands:
    with rasterio.open(band) as f:
        arrs.append(f.read(1))
# convert the list of tiffs (our multiple bands appended into one list = arrs) to a numpy array
# if you forgot what an array is, simply revisit the code of #codecell_Webmaps&Distributions_AllNumbers
sentinel_img = np.array(arrs, dtype=arrs[0].dtype)
# let's check the shape of this array
sentinel_img.shape
##codecell_RasterLandscape_MultibandRasterClipped&View
#clip to get a smaller area and show the multiple band image data
#this is done by setting the extents of the y and x axes to the desired (here smaller) extent
#using our subset special symbol []
clipped_img = sentinel_img[:, 0:750, 0:1500]
clipped_img.shape
show(clipped_img[[2,1,0], :, :])
#we asked for only three bands, reordered as [2,1,0] so they display as red, green, blue
In #codecell_RasterLandscape_MultibandRasterClipped&View, you clipped the image using the subset [ ] notation, then asked rasterio's show() to plot only three of the raster's bands, reordered so it displays as an RGB image.
You can see the different information held in each band by plotting a histogram of the raster dataset pixel values.
##codecell_RasterLandscape_MultibandRasterHistogram
# plot a histogram of the data in each band
#revisit your lab SpatialPatterns for further info on how to plot with matplotlib
rasterio.plot.show_hist(clipped_img, bins=50, histtype='stepfilled', lw=0.0, stacked=False, alpha=0.3)
#note here that bins= corresponds to the number of intervals of pixel values used
Even from simple visualisations, we can see the contrast between the red and the near-infrared (NIR) bands. In the histogram, the peak at 255 is simply from areas with no data (labelled 255 in all bands).
##codecell_RasterLandscape_MultibandRasterNodata
#we can check where on the image the cells with no value are
# we use the clipped image and subset to [:, 0, 0], i.e. the pixel at row 0, column 0 in every band
clipped_img[:,0,0]
Different band ratios will highlight different features in a raster dataset. We might be interested in finding places where cropmarks are located, for example.
There are lots of band ratios that will highlight more green and more stressed vegetation. A common ratio that does this is called NDVI- the normalized difference vegetation index.
How does this work?
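In case it helps to see it written out, NDVI is computed per pixel as:

NDVI = (NIR - Red) / (NIR + Red)

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so values close to +1 indicate green vegetation, values around 0 suggest bare ground, and negative values typically indicate water. The result always lies between -1 and +1. The cell below implements exactly this formula.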
##codecell_RasterLandscape_NDVI
# let's calculate NDVI
# but we want to make sure we ignore errors due to NaN pixels
#you remember these... revisit in #codecell__Webmaps&Distributions_ChangingDataType
np.seterr(divide='ignore', invalid='ignore')
bandNIR = clipped_img[3] # fourth item in the band list (B08, NIR)
bandRed = clipped_img[2] # third item in the band list (B04, red)
#Let's now apply the NDVI formula
# note that in python division of integers leads to integers
# so we need to specify (.astype) floats in order to get floats for NDVI results
# if you forgot what this means, revisit #codecell__Webmaps&Distributions_ChangingDataType
ndvi = (bandNIR.astype(float)-bandRed.astype(float))/(bandNIR.astype(float)+bandRed.astype(float))
#we now have a plain 2D numpy array (a matrix) of NDVI values
##codecell_RasterLandscape_showNDVI
#show NDVI
#here we are using matplotlib instead of rasterio
# because we are now dealing with a plain array rather than a raster dataset
plt.imshow(ndvi, cmap="RdYlGn")
plt.colorbar()
plt.show()
There is a lot of red for the water and green for the vegetation, so the formula appears to have been applied correctly.
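If you want to experiment further, here is a minimal sketch that masks out everything except strongly vegetated pixels; it assumes the ndvi array from the cell above, and the 0.3 threshold is purely illustrative, so tune it for your scene:

##codecell_RasterLandscape_NDVIThresholdSketch
# a minimal sketch: keep only pixels whose NDVI exceeds an illustrative threshold
veg_mask = np.where(ndvi > 0.3, ndvi, np.nan) # 0.3 is an assumed, illustrative cut-off
plt.imshow(veg_mask, cmap="Greens")
plt.colorbar()
plt.show()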
In this second part of the exercise, we'll be calculating some properties of the landscape like slope and aspect that are commonly used in analyses and models of where sites are likely to be visible.
##codecell_RasterLandscape_ImportUrLibraries
#let's work with geopandas again
!pip install geopandas
##codecell_RasterLandscape_ImportUrLibraries
#more tools. all the tools really!
from shapely.geometry import mapping
from shapely.geometry import Point
from rasterio.mask import mask
import geopandas as gpd
import zipfile
import io
import requests
import elevation
import richdem as rd
import gdal
import os
And now get some data to start the second part of the exercise: a raster dataset representing elevation, referred to as a DEM or 'digital elevation model'.
You may want to revisit Chap 3-3 showing how data can be interpolated. There are many ways (more than 40!) to interpolate elevation points to create a surface; the most common are IDW, Spline, Trend Surface, Natural Neighbour and Kriging.
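To give a flavour of how one of these works: IDW (inverse distance weighting) estimates the value at an unsampled location $x$ as a weighted average of the $n$ measured points, with weights that decay with distance:

$$\hat{z}(x) = \frac{\sum_{i=1}^{n} w_i z_i}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{d(x, x_i)^{p}}$$

where $z_i$ are the measured elevations, $d(x, x_i)$ the distances from $x$ to the sample points, and $p$ a power parameter (commonly 2) controlling how quickly a point's influence fades. We will use IDW again when we interpolate finds data in the next exercise.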
(Figure: from point cloud to terrain model.) The raw point cloud contains terrain, vegetation and surface objects; the filtered point cloud contains only selected classes of points, such as terrain points; the bare-earth terrain model is made from the classified point cloud; visualisations are created based on the bare-earth terrain model.
(Images: Opitz, Nuninger, and Fruchart. Data: MSHE C.N. Ledoux)
##codecell_RasterLandscape_GetaDEM&clipIt
#fetch elevation data from the SRTM server and clip to our area of interest
dem_path = os.path.join(os.getcwd(),'areaDEM.tif')
elevation.clip(bounds=(5.1, 43.65, 5.5, 43.95), output=dem_path)
##codecell_RasterLandscape_LetsPlotDEM
#check the DEM has loaded nicely by plotting it
areaDEM = rd.LoadGDAL(dem_path)
plt.imshow(areaDEM, interpolation='none')
plt.colorbar()
plt.show()
DEMs are useful for 3D visualisation, mapping purposes and various other applications; you can also derive additional types of topographical models from a DEM, such as slope, aspect, hillshade, 3D surface views and visibility analysis.
DEMs can then be analysed in two broad ways: DEM analysis (e.g. slope, aspect, contours) and 3D analysis (e.g. surface views, visibility).
In this practical lab we will compute and plot slope and aspect.
*Why use a Slope model?*
*What can we use an Aspect Model for?*
##codecell_RasterLandscape_DEMSlope
slope = rd.TerrainAttribute(areaDEM, attrib='slope_riserun')
rd.rdShow(slope, axes=False, cmap='magma', figsize=(8, 5.5))
plt.show()
In #codecell_RasterLandscape_DEMSlope, you used the rd.TerrainAttribute() function to compute the slope for each pixel containing an elevation value, and the rdShow() function to visualise the slope map.
##codecell_RasterLandscape_DEMAspect
# now do the same thing to calculate and plot the aspect data
aspect = rd.TerrainAttribute(areaDEM, attrib='aspect')
rd.rdShow(aspect, axes=False, cmap='jet', figsize=(8, 5.5))
plt.show()
In #codecell_RasterLandscape_DEMAspect, you used the rd.TerrainAttribute() function to compute the aspect for each pixel containing an elevation value, and the rdShow() function to visualise the aspect map.
Contours, a common way to visualize elevation data, can also be derived from a raster.
visual representation of Iceland’s elevation levels with contours published by CartoDB
##codecell_RasterLandscape_DEMrastertoArray
#this script uses the gdal library (see lexicode)
#to avoid having to read your raster in block by block
# we open the raster file using the gdal library
#and essentially let gdal read what is in our single-band DEM
gdal_data = gdal.Open(dem_path)
gdal_band = gdal_data.GetRasterBand(1)
nodataval = gdal_band.GetNoDataValue()
# let's convert to a numpy array
data_array = gdal_data.ReadAsArray().astype(float) # plain float: np.float is deprecated in recent numpy
data_array
# replace missing values if necessary
# you know this by now ... we are talking about NaN values (the very annoying ones)
if np.any(data_array == nodataval):
data_array[data_array == nodataval] = np.nan
##codecell_RasterLandscape_DEMrasterPlotCountoursLines
#Plot our data using Matplotlib tool 'contour'
#the script layout is very similar to last week's practical on spatialpatterns
fig = plt.figure(figsize = (12, 8))
ax = fig.add_subplot(111)
plt.contour(data_array, cmap = "viridis",
levels = list(range(0, 5000, 100)))
plt.title("Elevation Contours")
cbar = plt.colorbar()
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
# and here we go!
##codecell_RasterLandscape_DEMrasterPlotCountoursFilled
#Plot our data using Matplotlib tool 'contourf' to get filled contours
fig = plt.figure(figsize = (12, 8))
ax = fig.add_subplot(111)
plt.contourf(data_array, cmap = "viridis",
levels = list(range(0, 5000, 500)))
plt.title("Elevation Contours")
cbar = plt.colorbar()
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
That's it for this lesson. Hopefully you've seen how you can access and manipulate raster data to represent the landscape or environmental factors.
To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:
libraries | | | |
--- | --- | --- | --- |
folium | numpy | rasterio | branca |
rtree | richdem | pandas | osmnx |
elevation | geopandas | requests | zipfile |
seaborn | fiona | io | matplotlib.pyplot |
ipywidgets | requests | pysal | seaborn |
gdal | os | | |
plugins |
--- |
HeatMapWithTime |
HeatMap |
MeasureControl |
PrepareUrBasemaps_CreateLayers from [folium.plugins] |
cluster (from sklearn) |
rasterio.plot |
mapping (from shapely.geometry) |
Point (from shapely.geometry) |
mask (from rasterio.mask) |
your lexicode is non-exhaustive, keep expanding, find your own 'best way' to reuse these code/scripts...
Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions | Lexicode_StreetGridOrientations | Lexicode_SpatialPatterns | Lexicode_RasterLandscape
--- | --- | --- | --- | ---
== () [] | pd.concat() | { } subselection from list | %matplotlib inline | .open()
.head_csv() | .dtype() | ox.gdf_from_places() | requests.get() | .print()
.read_csv() | astype() | ox.plot_shape() | request.content() | dataset.name
mean() | fillna() | network_type= '' | .bytes() | dataset.count
folium.Map | def return | ox.add_edge_bearings(ox.get_undirected()) | gpd.GeoDataFrame.from_features() | dataset.shape
range() | .apply(lambda x:function,axis=) | count_and_merge() | Set() | dataset.descriptions
len() | pd.merge() | np.arrange() | pd.value_counts() | dataset.meta
iloc[] | how= , left_index= ,left_index= | np.histogram() | .merge() | dataset.driver
.value_counts() | gpd.GeoDataFrame() | ax.set_theta_location() | .sort_values | dataset.read
if =: | geometry=gpd.points_from_xy | ax.set_ylim() | cluster.KMeans() | .shape
elif =: | print() | ax.set_title() | .fit() | np.amean()
else =: | .isin() | ax.set_yticks() | .drop() | np.amin ()
folium.Marker() | classic.plot() | ax.set_xlabels() & ax.set_yticklabels | .assign() | np.amax()
folium.Icon() | generateBaseMap() | plt.subplots() | plt.show() | np.std()
folium.Circle | .groupby(['', '']) | .dropna() | .set_title | show()
popup= | .reset_index() | polar_plot() | sns.pairplot() | cmap=
radius= | max_zoom= | pd.Series() | arrs.append() | np.seterr()
.values.tolist() | folium.TileLayer() | np.pi | .show_hist() | plt.imshow
.add_to() | plugins.DualMap(location= , tiles= , zoom_start= ) | | | clipped_img
 | | | | rd.TerrainAttribute()
 | | | | rdShow()
 | | | | gdal_data.GetRasterBand()
 | | | | gdal_band.GetNoDataValue()
 | | | | gdal_data.ReadAsArray
 | | | | gdal_data.ReadAsArray().astype(np.float)
 | | | | plt.contour()
 | | | | plt.contourf()
In this exercise we're going to get into some key spatial statistics. So far in this course we've mostly been visualising spatial distributions and patterns. Here we will run statistical tests to determine whether or not a pattern or spatial structure exists, and to test what kind of pattern (dispersed vs. random vs. clustered) is present.
This kind of analysis is what we would call spatial statistical analysis. This is different from **deterministic analysis**, which is when you analyse things like which polygons intersect one another in a dataset, and which polygons entirely contain other polygons. Spatial statistical analysis is a subset of probabilistic analysis, which is simply analysis of how likely it is that something happened. In this exercise we will start by working with vector-based data (points, lines and polygons). Then we will briefly look at working with raster datasets.
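To make the contrast concrete, here is a minimal sketch of a deterministic question using shapely (the geometry library that sits underneath geopandas); the two squares are invented purely for illustration:

##codecell_Spatial_Statistical_Analysis_DeterministicSketch
# a minimal sketch: deterministic analysis gives exact yes/no answers about geometry
from shapely.geometry import Polygon
a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]) # a 4x4 square (illustrative)
b = Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]) # a 1x1 square inside it (illustrative)
print(a.intersects(b)) # True - the polygons share space
print(a.contains(b)) # True - b lies entirely within a

No probability is involved: each answer is simply true or false. Spatial statistics, by contrast, asks how likely an observed pattern would be if things were arranged at random.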
##codecell_Spatial_Statistical_Analysis_ImportUrLibraries
# start by installing tools as usual
!pip install geopandas
!pip install descartes
!pip install mapclassify
!pip install pysal
#@title
!apt-get install -qq curl g++ make
#@title
!curl -L http://download.osgeo.org/libspatialindex/spatialindex-src-1.8.5.tar.gz | tar xz
#@title
import os
os.chdir('spatialindex-src-1.8.5')
#@title
!./configure
#@title
!make
#@title
!make install
#@title
!pip install rtree
#@title
!ldconfig
#Working through the example at http://toblerity.org/rtree/examples.html
#@title
from rtree import index
from rtree.index import Rtree
#@title
p = index.Property()
idx = index.Index(properties=p)
idx
##codecell_Spatial_Statistical_Analysis_ImportUrLibraries
#and importing tools...
import geopandas as gpd
import requests
import zipfile
import io
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import pysal as ps
import numpy as np
In #codecell_Spatial_Statistical_Analysis_ImportUrLibraries you load what we call the prerequisites. You know by now that these are the basic tools you need to get started.
You must have realised by now that the most time consuming part of your work is preparing your data ... Data preparation is important, and your ability to use or re-use a dataset depends on how well it has been prepared. When you are selecting and preparing your data, remember to think about your research aims and the questions you are trying to answer, and make sure your data is set up to respond to those aims and questions.
The following scripts repeat what you have done in previous practicals, in particular Lab4_spatial_patterns_in_excavation, where you completed the following basic steps:
##codecell_Spatial_Statistical_Analysis_ImportOurData
# And now, as usual, get the data
url = 'https://github.com/ropitz/spatialarchaeology/blob/master/gabii_spatial.zip?raw=true'
local_path = 'temp/'
print('Downloading shapefile...')
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
print("Done")
#this is very much like extracting a zip archive on your PC's file system
z.extractall(path=local_path) # extract to folder
#here we want the shapefile components
# so we filter the extracted filenames and sort them using sorted()
#giving a comprehensive list of the files available in our dataset
# print them
filenames = [y for y in sorted(z.namelist()) for ending in ['dbf', 'prj', 'shp', 'shx'] if y.endswith(ending)]
print(filenames)
#here you know it ...
#we use geopandas' read_file() to access the data from this notebook and work with it
#(the .dbf, .prj and .shx files extracted alongside the .shp are picked up automatically,
# so we don't need to unpack the filenames list)
gabii = gpd.read_file(local_path + 'gabii_SU_poly.shp')
##codecell_Spatial_Statistical_Analysis_Peek@OurData
# As you've done before, print out some information on the data
# To check the number of records in the file, so it has loaded in ok
# And preview the data
print("Shape of the dataframe: {}".format(gabii.shape))
print("Projection of dataframe: {}".format(gabii.crs))
gabii.tail() #last 5 records in dataframe
##codecell_Spatial_Statistical_Analysis_ObjectType&SU
# As we've done before (returning to the Gabii finds data - see ###codecell_SpatialPatterns_WhichTypeOfSpecialFinds&fromWhere?)
# Get the non-spatial special finds data
# And they are archived per SU (Stratigraphic Unit)
sf_su = pd.read_csv("https://raw.githubusercontent.com/ropitz/gabii_experiments/master/spf_SU.csv")
sf_su.head()
##codecell_Spatial_Statistical_Analysis_FindsBecomesSpatial
#Then let's combine our polygons representing context shape and location
# with the special finds data
# We can use known command 'merge()' for this (see especially explanation for this in #codecell_Webmaps&Distributions_MergingZeData and also ###codecell_SpatialPatterns_TextileToolsBecomesSpatial)
gabii_textools = gabii.merge(sf_su, on='SU')
gabii_textools.head()
##codecell_WorkingWithPysal_DataPreparation
#Let's pull all those find types out of the big list.
#These commands should look familiar because you've done them before (###codecell_SpatialPatterns_SpecialFindsSelection)
types = ['Loom Weight','Spool','Spindle Whorl']
textile_tools = gabii_textools.loc[gabii_textools['SF_OBJECT_TYPE'].isin(types)]
# Now let's count up how many of these tools appear in each context (SU).
# This command will print out a list of the number of textile tools in each SU next to that SU number.
textile_tool_counts = textile_tools.groupby('SU')['SF_OBJECT_TYPE'].value_counts().unstack().fillna(0)
gts = gabii_textools.merge(textile_tool_counts, on='SU')
gts_new = gts.drop_duplicates(subset="SU")
gts_new.head()
##codecell_Spatial_Statistical_Analysis_PlotZeData
# Set up figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot SUs
#gabii.plot(ax=ax, facecolor='0.85', linewidth=0)
# Choropleth of spool counts per SU, classified with fisher_jenks
gts_new.plot(column='Spool', scheme='fisher_jenks', ax=ax, \
cmap='YlGn', legend=True, linewidth=3)
# Remove the axis frame
ax.set_axis_off()
# Change background color of the figure
f.set_facecolor('0.75')
# Keep axes (your x and y) proportionate
plt.axis('equal')
# Title
# f.suptitle()command allows you to add a centered title to the figure
f.suptitle('Spool Distribution', size=30)
# Draw
plt.show()
There are different ways to explore your data.
So far, you've (rapidly) repeated the steps you've done in a previous exercise to visualise a spatial pattern - this time we are exploring the spools artefacts discovered while excavating at Gabii.
Now we want to test statistically whether there is a pattern. Statistical tests are needed when it is not obvious from just looking at the distribution whether or not a pattern exists.
We can start with some of the more basic tests: Moran's I and local Moran's I, which are tests for spatial autocorrelation. Moran's I statistic (Moran 1948, 1950) is one of the classic (as well as one of the most common) ways of measuring the degree of spatial autocorrelation in data.
But how does this work?
You have seen that objects in space are rarely randomly distributed. In fact, they usually have some degree of patchiness (i.e., they are spatially clustered) which can produce a variety of distinct spatial patterns.
These patterns can be quantified according to the degree of similarity between the attributes values of adjacent spatial objects (polygons, lines, points, raster cells).
We will start by testing whether nearby objects tend to have more similar attribute values than would be expected from a random distribution. The degree to which similar values cluster together spatially is the degree of autocorrelation. This value is reported as Moran's I.
In a nutshell, the Moran’s I statistic provides a correlation coefficient (a measurement of similarity) for the relationship between a measured value and its surrounding measured values.
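Written out, for $n$ observations with values $x_i$, mean $\bar{x}$ and spatial weights $w_{ij}$:

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$$

The double sum in the numerator compares each observation's deviation from the mean with its neighbours' deviations, so $I$ is positive when neighbouring deviations tend to agree (clustering) and negative when they tend to disagree (dispersion).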
To understand the meaning of this measurement of similarity, the Moran's I value, you need to understand how "statistical significance" works. Basically, a measurement is considered "statistically significant" if it is very different from what you would get by chance. A spatial pattern is "statistically significant" if it is very different from a random spatial pattern.
In summary: Moran's I ranges from roughly -1 to +1. A positive value means similar values tend to sit next to each other (clustering), a negative value means dissimilar values tend to sit next to each other (dispersion), and a value near zero suggests a random pattern.
Conditions: the test is run against a null hypothesis of complete spatial randomness, i.e. that the observed arrangement of values could have arisen by chance.
Z-score and p-value are used to evaluate the significance of the Index value. The p-value is the probability that the observed spatial pattern was created by some random process. When the p-value is very small, it means it is very unlikely (small probability) that the observed spatial pattern is the result of random processes. There are two possible scenarios: a statistically significant clustered pattern (positive z-score) or a statistically significant dispersed pattern (negative z-score).
You can read about Moran's I. If you prefer a video, this one is very clear on how to interpret the results of a Moran's test.
As you may have suspected, we need to do something first before running the Moran's I. And, as said above, the emphasis of the Moran's I spatial analysis is on the role of the weights. So, one of the vital steps of spatial autocorrelation modelling is to construct a spatial weights matrix. The spatial weights can be based on: contiguity (shared borders, e.g. rook or queen neighbours), distance (e.g. fixed distance bands or inverse distance), or k nearest neighbours,
which can be illustrated as such:
*where 0 = no weight and 1 = full weight.* As we've discussed in class, designing a good weights matrix is part of the interpretive process. It's an important step because if we fail to design the weights matrix well, the spatial analysis will likely not produce meaningful results. It is important, therefore, to understand the kinds of spatial relationships listed above, and to use the one that best represents the kind of spatial relationship present in your data.
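As a taster of what is coming, here is a minimal sketch of two common ways to build a weights matrix with pysal; gdf here stands for any GeoDataFrame with polygon geometry (we will build KNN weights on our real data below):

##codecell_Spatial_Statistical_Analysis_WeightsSketch
# a minimal sketch, assuming a GeoDataFrame called gdf with polygon geometry
# contiguity: polygons sharing a border (queen also counts shared corners)
w_queen = ps.lib.weights.Queen.from_dataframe(gdf)
# k nearest neighbours: each observation gets exactly k neighbours,
# measured between polygon centroids
w_knn = ps.lib.weights.KNN.from_dataframe(gdf, k=5)
print(w_queen.n, w_knn.n) # number of observations in each weights object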
Let's create some weights for a real dataset that you know: finds from the Gabii excavations.
##codecell_Spatial_Statistical_Analysis_MoranStatistics_defineweights
# To start your Moran's statistical test, you need to create weights that define how strongly you think things near to one another influence one another.
# see the types of weights available to you by looking in pysals help file
help(ps.lib.weights)
##codecell_Spatial_Statistical_Analysis_MoranStatistics_KNN_weights
#create some weights. I've gone with KNN weights. Read about this in the pysal documentation linked above...
#we want to see what is happening with the Spools
#let's first subselect them and create a gts_spool
gts_spool = gts_new[['SU','Spool']].copy()
#let's build the nearest neighbours weights from the geometry
#(KNN.from_dataframe measures distances between the polygon centroids)
#we choose k=5 nearest neighbours
gts_spool_weights = ps.lib.weights.KNN.from_dataframe(gts_new, k=5)
#here we ignore the warnings because
#we know that not all the SU areas connect up physically
Here we create a matrix, or a grid of values, which contains the weights assigned to each spatial entity.
##codecell_Spatial_Statistical_Analysis_MoranStatistics_DefineMatrix
# Remap the weights' IDs to match our dataframe index
gts_spool_weights.remap_ids(gts_spool.index)
# Row standardise the matrix
gts_spool_weights.transform = 'R'
Often there is a need to apply a transformation such as .transform = 'R' to the spatial weights, as in the case of row standardisation. You can use .transform = 'b' for binary and .transform = 'v' for variance stabilising (see the PySAL weights documentation).
~ Behind the scenes ~ the transform property is updating all other characteristics of the spatial weights that are a function of the values and these standardisation operations, freeing the user from having to keep these other attributes updated.
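You can see this on a toy example: a minimal sketch using lat2W, which builds rook-contiguity weights for a regular grid (the 3x3 grid is purely illustrative):

##codecell_Spatial_Statistical_Analysis_TransformSketch
# a minimal sketch: the effect of row standardisation on a 3x3 lattice
w = ps.lib.weights.lat2W(3, 3) # rook contiguity on a 3x3 grid
print(w.weights[4]) # the centre cell's four neighbours: [1.0, 1.0, 1.0, 1.0]
w.transform = 'R' # row standardise
print(w.weights[4]) # now [0.25, 0.25, 0.25, 0.25] - each row sums to 1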
Now you have the data and the spatial weights matrix ready.
Let's start by computing the spatial lag of the spool distribution. The spatial lag is the product (multiplication) of the spatial weights matrix and a given variable. If gts_spool_weights is row-standardised, the result is the average value of the variable in the neighbourhood of each observation.
Multiplication of two matrices looks like this:
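For a row-standardised $W$, each row holds one unit's neighbour weights, so multiplying $W$ by the vector of values $y$ averages each unit's neighbours. A tiny three-unit example:

$$Wy = \begin{pmatrix} 0 & 0.5 & 0.5 \\ 1 & 0 & 0 \\ 0.5 & 0.5 & 0 \end{pmatrix}\begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix} = \begin{pmatrix} 0.5 \cdot 4 + 0.5 \cdot 6 \\ 1 \cdot 2 \\ 0.5 \cdot 2 + 0.5 \cdot 4 \end{pmatrix} = \begin{pmatrix} 5 \\ 2 \\ 3 \end{pmatrix}$$

Each entry of the result is the average value among that unit's neighbours: exactly the spatial lag.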
We can calculate the spatial lag for the variable 'Spool' and store it directly in the main table with .lag_spatial() :
##codecell_Spatial_Statistical_Analysis_MoranStatistics_SpatialLag
#compute the spatial lag of 'Spool' (the weighted average of each SU's neighbours) and add it to the attribute table
gts_spool['gts_spool_weights'] = ps.lib.weights.lag_spatial(gts_spool_weights, gts_spool['Spool'])
gts_spool.head()
Whenever there is a risk that the distribution of your features (see decomposing the code of #codecell_RasterLandscape_BandsDataType as a reminder) is potentially biased due to sampling design or an imposed aggregation scheme, it is important to row standardize the data.
Read about standardisation in spatial modelling.
So now we reduce potentially biased distributions in our dataset by standardising the counts (subtracting the mean and dividing by the standard deviation) and computing the spatial lag of the standardised values.
##codecell_Spatial_Statistical_Analysis_MoranStatistics_Standardisation&SpatialLag
# standardise the counts of the number of spools in each context and the weights
#we are applying some math here
gts_spool['spool_std'] = (gts_spool['Spool'] - gts_spool['Spool'].mean()) / gts_spool['Spool'].std()
gts_spool['w_spool_std'] = ps.lib.weights.lag_spatial(gts_spool_weights, gts_spool['spool_std'])
gts_spool.head()
##codecell_Spatial_Statistical_Analysis_MoransI_ImportUrLibraries
#get some more tools for the Moran test
from pysal.explore.esda.moran import Moran
##codecell_Spatial_Statistical_Analysis_RUNMoransI
# Run the Moran test
mi = Moran(gts_spool['Spool'], gts_spool_weights)
mi.I
What does the result of ##codecell_Spatial_Statistical_Analysis_RUNMoransI mean?
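On its own, mi.I is just a number. To judge it you also want the expected value under randomness and the pseudo p-value that PySAL computes by permutation; a minimal sketch, assuming the mi object from the cell above:

##codecell_Spatial_Statistical_Analysis_InspectMoransI
# a minimal sketch: inspect the Moran's I result
print(mi.I) # the observed statistic
print(mi.EI) # expected value under spatial randomness, roughly -1/(n-1)
print(mi.p_sim) # pseudo p-value from permutations; below 0.05 suggests a significant pattern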
Now let's plot the results.
The cluster/outlier type (COType) field distinguishes between a statistically significant cluster of high values (HH), a cluster of low values (LL), an outlier in which a high value is surrounded mainly by low values (HL), and an outlier in which a low value is surrounded mainly by high values (LH).
Statistical significance is set at the 95 percent confidence level (really quite confident, we could say).
##codecell_Spatial_Statistical_Analysis_PLOTMoransI
#Here we are using matplotlib commands similar to those in Lab3_StreetOrientation & Lab5_RasterLandscape
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot values
#command sns.regplot() allows you to plot the data/results and a linear regression model fit
#such as our results of Moran's I
#
sns.regplot(x='spool_std', y='w_spool_std', data=gts_spool)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
ax.set_xlim(-2, 7)
ax.set_ylim(-2.5, 2.5)
#add text within each quadrant
plt.text(3, 1.5, "HH", fontsize=25)
plt.text(3, -1.5, "HL", fontsize=25)
plt.text(-1, 1.5, "LH", fontsize=25)
plt.text(-1, -1.5, "LL", fontsize=25)
# Display
plt.show()
What does the figure generated from ##codecell_Spatial_Statistical_Analysis_PLOTMoransI mean?
In order to guide the interpretation of the plot, a linear fit is also included in the plot, together with confidence intervals. This line represents the best linear fit to the scatter plot or, in other words, what is the best way to represent the relationship between the two variables as a straight line. Because the line comes from a regression, we can also include a measure of the uncertainty about the fit in the form of confidence intervals (the shaded blue area around the line).
The plot displays a positive relationship between both variables. This is associated with the presence of positive spatial autocorrelation: similar values tend to be located close to each other. If we had to summarise the main pattern of the data in terms of how clustered similar values are, the best way would be to say they are positively correlated and, hence, clustered over space.
At the core of this is a classification of the observations in a dataset into four groups derived from the Moran Plot: high values surrounded by high values (HH), low values nearby other low values (LL), high values among low values (HL), and low values among high values (LH). Each of these groups is typically called a "quadrant".
Actually, as you may have realised, Moran's I describes the complete spatial pattern: it tells you about the clustering BUT it tells us nothing about where the clusters might be! It only tells you that the pattern is more clustered than it would be under spatial randomness.
So why do we need Moran's I then? Well, you need to test the significance of your spatial analysis, i.e. that your regression is not violating your assumptions... so Moran's I is a global test.
And, now, what do we do to test the location of spatial clusters? We need a local statistic... To test this, we use the local variant of the Moran's test.
##codecell_Spatial_Statistical_Analysis_ImportUrLibraries
# get the tools for the local test
from pysal.explore.esda.moran import Moran_Local
The local test breaks the global pattern down to test for the presence of local clusters. You can check at each SU whether or not it is likely (in a statistical significance sense) for it to participate in a local cluster.
How does this work? Local measures consider each single observation in a dataset and operate on them individually (as opposed to operating on the overall data, as global measures do).
LISAs (Local Indicators of Spatial Association) are widely used in many fields to identify clusters of values in space. They are a very useful tool that can quickly return areas in which values are concentrated and provide suggestive evidence about the processes that might be at work. Therefore, they are a key part of the exploratory toolset. In Python, we can calculate LISAs in a very streamlined way thanks to PySAL:
##codecell_Spatial_Statistical_Analysis_MoransLocal
# run the local test
lisa = Moran_Local(gts_spool['Spool'].values, gts_spool_weights)
All we need to pass is the .values of the variable of interest "gts_spool['Spool']" and the spatial weights "gts_spool_weights" that describe the neighbourhood relations between the different observations that make up the dataset.
Looking at the numerical result of LISAs is not always the most useful way to exploit all the information they can provide. Remember that we are calculating a statistic for every single observation in the data so, if we have many of them, it will be difficult to extract any meaningful pattern. Instead, what is typically done is to create a cluster map that extracts the significant observations (those that are highly unlikely to have come from pure chance) and plots them with a specific colour depending on their quadrant category.
All of the information needed is contained in the lisa object you have just created in ##codecell_Spatial_Statistical_Analysis_MoransLocal.
It is convenient to pull them out and insert them in the main data frame that we will call "gts_spool" as demonstrated in ##codecell_Spatial_Statistical_Analysis_MoransLocal_Significance.
##codecell_Spatial_Statistical_Analysis_MoransLocal_Significance
# Break observations into significant or not
# using lisa.p_sim we set the threshold at the 95% confidence level
# so significant observations are all the ones whose p-value is less than 5% (0.05)
gts_spool['significant'] = lisa.p_sim < 0.05
# By using .q command
# you can store the quadrant each observation belongs to:
# - high-high (1), low-high (2), low-low (3), high-low (4)
# the same four quadrants as in ##codecell_Spatial_Statistical_Analysis_PLOTMoransI
gts_spool['quadrant'] = lisa.q
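The quadrant codes stored in lisa.q are integers: in esda's convention 1 = HH, 2 = LH, 3 = LL and 4 = HL. A minimal sketch of turning them into readable labels (the label strings are our own choice):

##codecell_Spatial_Statistical_Analysis_QuadrantLabelsSketch
# a minimal sketch: map esda's quadrant codes to readable labels
quad_labels = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}
gts_spool['quadrant_label'] = gts_spool['quadrant'].map(quad_labels)
gts_spool[['SU', 'Spool', 'significant', 'quadrant_label']].head()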
First, the 'significant' column. As with global Moran's I, PySAL automatically computes a p-value for each LISA. Because not every observation is statistically significant, we want to identify those with a p-value small enough to rule out the possibility of obtaining a similar situation from pure chance. Since we have been working with a 95% confidence interval, we select 5% (lisa.p_sim < 0.05) as the threshold for statistical significance (just like with global Moran's I).
To identify these values, we created above a variable, significant, that contains True where the p-value falls below the 5% threshold and False otherwise. You can read the first values out in ##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificancereadingUrResults below:
##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificancereadingUrResults
gts_spool['significant'][:20]
# true means it is in a cluster, false means it is not
##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificancereadingUrResults
# You can read out the calculated p values for each
lisa.p_sim[:20]
##codecell_Spatial_Statistical_Analysis_AddingUrMoransLocal2Gabiidataset
#add this info back onto the spatial data
gabii_spool_lisa = gabii.merge(gts_spool, on='SU')
gabii_spool_lisa.head()
Now we have these elements, the significance and the quadrant, we can make a map showing which quadrant each SU belongs to: essentially a display of where the local clusters are located.
The plot parameters set which column is mapped (the quadrant), the colour map, the legend, and the line width:
##codecell_Spatial_Statistical_Analysis_MappingGabiiMoransLocal
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot baseline su poly
gabii_spool_lisa.plot(column='quadrant', ax=ax, \
cmap='Accent', legend=True, linewidth=3)
ax.set_axis_off()
plt.axis('equal')
plt.show()
How would you interpret the results of this analysis?
Remember: The z-scores and p-values are measures of statistical significance which tell you whether or not the local pattern is random, feature by feature. In effect, they indicate whether the apparent similarity (a spatial clustering of either high or low values) or dissimilarity (a spatial outlier) is more pronounced than one would expect in a random distribution.
You can run similar spatial statistics on raster data to understand trends and distributions of data.
It is possible to convert point vector data to raster data via the process of interpolation.
~ déjà vu ~ We have talked about this before:
There are many ways to interpolate data, which you can read about on this site.
Whitebox is full of hundreds (412 for version 1.0.2!) of useful spatial tools.
You can download it and run it standalone (without installing!) or you can call its functions from the jupyter notebook.
It will be well worth checking out for your own independent projects: Read more about Whitebox Tools here.
##codecell_Spatial_Statistical_Analysis_InstallUrLibraries
# we're going to use the whitebox tools to interpolate our data and imageio to view it
!pip install whitebox # acts as a spatial toolbox
!pip install imageio
!pip install imageio tifffile # stores numpy arrays in TIFF format files & read image&metadata from TIFFs
##codecell_Spatial_Statistical_Analysis_ImportUrLibraries
# import your tools and print the version to check it worked
import pkg_resources # provides runtime facilities for finding, introspecting, activating and using installed Python distributions
import whitebox # advanced geospatial data analysis platform. Well, you need its toolbox for that (see wbt = whitebox.WhiteboxTools() a few lines down)!
import imageio # provides an easy interface to read and write a wide range of image data, including animated images, volumetric data, and scientific formats
import IPython # for displaying html outputs
import os
wbt = whitebox.WhiteboxTools()
print(wbt.version())
print(wbt.help())
##codecell_Spatial_Statistical_Analysis_CheckUrNewSpatialToolbox
# when there are lots of tools, it's useful to call their help info to find out what the parameters are
print(wbt.tool_help("IdwInterpolation"))
##codecell_Spatial_Statistical_Analysis_PrepareUrData
# To interpolate, as you may have noticed in your reading, we need points rather than polygons.
# To get points, we take the centre of each polygon, called its centroid.
# In these lines we replace the polygon geometry with the centroid point geometry.
spool_centroids = gabii_spool_lisa.geometry.centroid
spoolgdf = gpd.GeoDataFrame(geometry=gpd.GeoSeries(spool_centroids))
#Then we 'merge' to glue the attributes we are interested in interpolating back on to our data.
spool_centroids_attr = spoolgdf.merge(gabii_spool_lisa.spool_std,right_index=True,left_index=True)
#Let's have a sneaky peek at the dataframe
spool_centroids_attr.head()
##codecell_Spatial_Statistical_Analysis_SetUrWorkingDirectory
# to work in whitebox effectively, we put this data out to a temporary shapefile
# we also set a data directory where we can put this file
data_dir = '/usr/local/lib/python3.6/dist-packages/whitebox/testdata/'
wbt.set_working_dir(data_dir) # pass the variable, not the string 'data_dir'
print(data_dir)
#wbt.verbose command allows you to decide if you want output messages such as warnings, progress updates and other notifications
wbt.verbose = True
#Let's write our data out to a shapefile
gabii_spool_shapes = spool_centroids_attr.to_file("/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.shp")
##codecell_Spatial_Statistical_Analysis_InterpolateRaster
# now we interpolate our centroids using the standardised spool values (spool_std) and show the result
# note you can play around with the interpolation parameters to have different kernel sizes and cell sizes...
wbt.idw_interpolation(
i="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.shp",
field="spool_std",
output="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif",
use_z=False,
weight=2.0,
radius=5,
min_points=3,
cell_size=1,
base=None
)
raster = imageio.imread("/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif")
plt.imshow(raster)
plt.show()
#List data under the data folder
print(os.listdir(data_dir))
Now you can calculate statistics on this dataset, for example using rasterio like we did last week, or using whitebox's tools.
For example, you can check to see if the data forms a normal distribution. This lets you know what kinds of statistical tests will be valid. As we discussed last week, often archaeological distributions are not normal.
##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_normality
wbt.ks_test_for_normality(
"/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif",
"/usr/local/lib/python3.6/dist-packages/whitebox/testdata/normal.html",
num_samples=None
)
##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_ViewNormality
# view your results
IPython.display.HTML(filename="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/normal.html")
You can also run a similar autocorrelation test to the LISA one we ran above with the vector data.
##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_Autocorrelation
wbt.image_autocorrelation(
" /usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif",
"/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_autocorr.html",
contiguity="Rook"
)
##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_ViewAutocorrelation
# view your results
IPython.display.HTML(filename='/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_autocorr.html')
This exercise ends here. Hopefully you've learned that there are statistical tests for spatial patterns and that these let us go beyond 'just visualising' to look for patterns. These tests can be run on either vector or raster data.
Clearly there's a lot to learn and explore in the world of spatial stats...
To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:
libraries | | |
--- | --- | --- |
folium | numpy | rasterio |
branca | rtree | richdem |
pandas | osmnx | elevation |
geopandas | requests | zipfile |
seaborn | fiona | io |
matplotlib.pyplot | ipywidgets | seaborn |
pysal | os | gdal |
pkg_resources | whitebox | imageio |
IPython | imageio tifffile | |
plugins |
--- |
HeatMapWithTime |
HeatMap |
MeasureControl |
PrepareUrBasemaps_CreateLayers from [folium.plugins] |
cluster (from sklearn) |
rasterio.plot |
mapping (from shapely.geometry) |
Point (from shapely.geometry) |
mask (from rasterio.mask) |
Moran (from pysal.explore.esda.moran) |
Moran_Local (from pysal.explore.esda.moran) |
At this point you have learned so many things that the lexicode is non-exhaustive. It might be useful to make your own custom 'lexicode' with the commands you use most often.
Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions | Lexicode_StreetGridOrientations | Lexicode_SpatialPatterns | Lexicode_RasterLandscape | Lexicode_Pysal_weights_Moran_Lisa
--- | --- | --- | --- | --- | ---
== () [] | pd.concat() | { } subselection from list | %matplotlib inline | .open() | .tail()
.head_csv() | .dtype() | ox.gdf_from_places() | requests.get() | .print() | ax.set_axis_off()
.read_csv() | astype() | ox.plot_shape() | request.content() | dataset.name | f.set_facecolor
mean() | fillna() | network_type= '' | .bytes() | dataset.count | plt.axis
folium.Map | def return | ox.add_edge_bearings(ox.get_undirected()) | gpd.GeoDataFrame.from_features() | dataset.shape | f.suptitle()
range() | .apply(lambda x:function,axis=) | count_and_merge() | Set() | dataset.descriptions | f.set_facecolor
len() | pd.merge() | np.arrange() | pd.value_counts() | dataset.meta | plt.axis
iloc[] | how= , left_index= ,left_index= | np.histogram() | .merge() | dataset.driver | f.suptitle()
.value_counts() | gpd.GeoDataFrame() | ax.set_theta_location() | .sort_values | dataset.read | .remap_ids
if =: | geometry=gpd.points_from_xy | ax.set_ylim() | cluster.KMeans() | .shape | .transform
elif =: | print() | ax.set_title() | .fit() | np.amean() | lag_spatial()
else =: | .isin() | ax.set_yticks() | .drop() | np.amin () | .mean()
folium.Marker() | classic.plot() | ax.set_xlabels() & ax.set_yticklabels | .assign() | np.amax() | .std()
folium.Icon() | generateBaseMap() | plt.subplots() | plt.show() | np.std() | sns.regplot()
folium.Circle | .groupby(['', '']) | .dropna() | .set_title | show() | Moran_local
popup= | .reset_index() | polar_plot() | sns.pairplot() | cmap= | p.sim()
radius= | max_zoom= | pd.Series() | arrs.append() | np.seterr() | Lisa.q
.values.tolist() | folium.TileLayer() | np.pi | .show_hist() | plt.imshow | wbt.set_working_dir()
.add_to() | plugins.DualMap(location= , tiles= , zoom_start= ) | | | clipped_img | wbt.verbose()
 | | | | rd.TerrainAttribute() | .to_file(" ")
 | | | | rdShow() | wbt.idw_interpolation()
 | | | | gdal_data.GetRasterBand() | imageio.imread(os.path.join())
 | | | | gdal_band.GetNoDataValue() | wbt.ks_test_for_normality()
 | | | | gdal_data.ReadAsArray | IPython.display.HTML()
 | | | | gdal_data.ReadAsArray().astype(np.float) | wbt.image_autocorrelation()
 | | | | plt.contour() | 
 | | | | plt.contourf() | 