Learning a new language ¶

WHAT IS THIS? YOUR FIRST PROGRAMMING CODE CELL!

Let's import another tool¶

Add another code cell below this one and try to import geopandas and give it the abbreviation 'gpd'

In [3]:

#codecell_introduction_importyourlibraries

import geopandas as gpd

In [ ]:

# Odds are that didn't work... Some tools need an additional 'install' step. 
# This happens when the tool isn't in the cloud environment of the software you are working on.
# If something doesn't import, I pretty much always try to '!pip install' it.
# pip is python's tool for getting tools. Very meta.

!pip install geopandas
import geopandas as gpd

Learning a new language ¶

You now have three tools: Pandas, Geopandas and Matplotlib. Whenever you need more tools, use import or !pip install which is a package management system to install manage software packages.

Geopandas is notoriously difficult to install, especially on Windows. If you intend to work directly on the Jupyter notebooks, and pip install does not work for you, this video will help you to get started with installation. if you are running out of luck I can only recommend to follow this and as a last resort this link.

Now try to import another tool¶

Add another code cell below this one and figure out how to import matplotlib.pyplot. Matplotlib.pyplot lets you plot in 2D with an extensive plotting library for quality publication & provides a MATLAB-like way of plotting.

Learning a new language ¶

You can add text or code cells to your version of the notebook: use the + button in your toolbar and choose which type of cell you need (e.g. Markdown, Raw NB convert, Heading) as the default is Code cell. By hitting the up or down arrow button, you can move the cell to wherever you want.

Editing the cell: click the cell to select it, and, double-click to edit it.

To make things happen in a Code cell you can click on the cell and then type 'ctrl + enter' and your command can be executed. An asterisk in square brackets In [*]: will appear while the code is being executed, and this will change to a number In [1]: when the code is finished. *The order in which you execute the code blocks matters, they must be run in sequence.* Inside blocks of python code there are comments indicated by lines that start with #. These lines are not computer code but rather comments providing information about what the code is doing to help you follow along.

You are ready to go! and remember to save regularly.

Now let's add data.¶

Exploring the data you already know
You can see some geojson (vector for the internet) data from Canmore that represents locations of sites on Shetland as points with descriptive attributes here. Have a look and identify the coordinates (spatial data), attributes, and coordinate system EPSG code. Write an example of each one in a new text field below this one in your notebook.

What's a GeoJSON? is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JavaScript Object Notation. GeoJSON embeds attributes and the coordinates projection information (we'll talk a bity more about them in chapter 2). It is vector data.



	GEoJson example

What's a WTK or known as Well-known text (WKT). It is a text markup language for representing vector geometry objects on a map. A binary equivalent is known as well-known binary (WKB). Open Geospatial Consortium (OGC) designed and approved. The current standard definition is in the ISO/IEC 13249-3:2016 standard.

GEOMETRYCOLLECTION(POINT(4 6) ~ LINESTRING(4 6,7 10))~ POINT ZM (1 1 5 60) ~ POINT M (1 1 80) ~ POINT EMPTY MULTIPOLYGON ~ EMPTY TRIANGLE((0 0 0,0 1 0,1 1 0,0 0 0))~ TIN (((0 0 0, 0 0 1, 0 1 0, 0 0 0)), ((0 0 0, 0 1 0, 1 1 0, 0 0 0))) ~ POLYHEDRALSURFACE Z ( PATCHES ((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)), ((1 1 1, 1 0 1, 0 0 1, 0 1 1, 1 1 1)), ((1 1 1, 1 0 1, 1 0 0, 1 1 0, 1 1 1)),((1 1 1, 1 1 0, 0 1 0, 0 1 1, 1 1 1)))

Well-known text (WKT)details

WKT example

Let's import some real data

We want to bring some data into this notebook. It is done by telling the notebook where the data is located and asking it to read that file using the geopandas tool using the data's URL (web address). So this data is digital. Instead of using this long web url, we can give the file a name within the notebook. Here I've called it 'df'

In [ ]:

#codecell_introduction_getyourdata

url = "http://ropitz.github.io/digitalantiquity/data/CanmoreShetlandPoints.geojson"
df = gpd.read_file(url)

How do you see what's in your data?¶

The command .head() will show you the first part of any file in a notebook.

In [ ]:

#codecell_introduction_takeapick@yourdata

# go ahead and look at the first few lines of the file. 
#Note this is slightly easier to read than the raw geojson

df.head()

Well, if it's spatial data, you should be able to plot it¶

This is usually fairly easy. For any spatial dataset, just call its name and say .plot()

In [ ]:

#codecell_introduction_plot@yourdata

df.plot()

Not very pretty... Next week we'll learn about making pretty maps.

1.4 Challenge¶

Go find some other archaeological, read it into the notebook, and plot it.

Like the scheduled monuments from York: https://opendata.arcgis.com/datasets/26857683681542b798ef97708cbdaf61_14.geojson

Or know wrecks off the coast of Northern Ireland: https://www.opendatani.gov.uk/dataset/c5fceed7-b07a-4bc4-a0a0-8c45b9033083/resource/2d8da801-39f7-48b7-81ad-8db58d107962/download/protected_wrecks.geojson

Or any other open data you find in the geojson format...

1.5 References¶

Howey, M. C., & Burg, M. B. (2017). Assessing the state of archaeological GIS research: Unbinding analyses of past landscapes. Journal of Archaeological Science, 84, 1-9.

Lock, G., & Pouncett, J. (2017). Spatial thinking in archaeology: Is GIS the answer?. Journal of Archaeological Science, 84, 129-135.

Longley, P. A., Goodchild, M. F., Maguire, D. J. & Rhind, D. W. (2001) Geographic Information Systems and Science. Wiley, New York.

Yang, C. (2017). Introduction to GIS Programming and Fundamentals with Python and ArcGIS. Boca Raton, FL : CRC Press, Taylor & Francis Group, 2017.

Verhagen, P. (2018). Spatial analysis in archaeology: moving into new territories. In Digital Geoarchaeology (pp. 11-25). Springer, Cham.

Chapter 2: Make A Basic Map¶

2.1 Introduction: exercise motivations ?¶

In this practical exercise, you will make some basic maps. You're making them interactively in this notebook.

A basic map allows the viewer to :

to think visually for
- data exploration
- confirming idea
visual presentation
- synthesis
- presentaion

So, even a basic map is centred around a research question. As such, this exercise sets some goals as to:

answer the question visually (and quantitatively)
choose a specific design/symbology and show why it matters
understand a first type of spatial analysis using spatial density


	=

To get started, you will get the tools required. In this lab, we are using Panda, Folium and Branca libraries.

In [ ]:

#codecell_makeabasicmap_importyourlibraries
import folium
import branca
import pandas as pd
print(folium.__file__)
print(folium.__version__)

2.2 Maps should answer questions¶

We've discussed how to come up with a good, explicitly spatial question. We've also discussed how to design a good map. Here you're going to start putting all this into practice.

The data in Palmisano et al. (2018) article provides information about settlement and population patterns in central Italy and how they change over time. Where people were living and working at various times in the past is a basic archaeological question.

Let's say your question is about how many iron age sites are present in the region, and what their distribution is like in space - that is where they are located and how many relatively speaking are in each area.

How would you go about answering these questions with a map?

2.3 To make a map we need to get some data.¶

This data must be explicitly spatial - that is have spatial coordinates that tell us about locations in a way that machines can understand. Spatial coordinates are pairs of numbers organised on a x-horizontal and y-vertical axis – this is the bit of geometry you will be using the most as an archaeologist!

Projections: Distances and directions to all places are true only from the centre point of projection.

Projections and coordinate systems: Coordinates (x,y) can also be called eastings & northings or latitude & longitudes depending on the type projection used to transform the map. The most common coordinate format for maps on the internet are latitude and longitude coordinates.

Transformations: When working with different dataset ALL YOUR DATA MUST BE IN THE SAME COORDINATE SYSTEM. If they are different you will need to transform it... below are illustrations of the relationships between different projections.



How to unfold a cylindrical projection of the world to a Mercator grid projection?	Or from Geographic (left image)to Cartesian coordinates (right image)?

South America many projections	Ways to transform(with a bit of math too!)

Whilst the conversion from geodetic to cartesian is fairly straightforward, converting cartesian to geodetic is a complex problem (& if you are interested on how this is done mathematically have a look at this).

here are some useful links to use with your dataset:

OS
Some python tools : Rasterio tool and Pyproj tool.

In this notebook, we'll experiment with the spatial data from:

Palmisano, A., Bevan, A. and Shennan, S., 2018. Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates. Journal of Open Archaeology Data, 6(1), p.2. DOI: http://doi.org/10.5334/joad.43

Spend some time exploring read about it before jumping in this practical...

These researchers have been very thoughtful and provided the data behind their analysis so we can re-use it. Their old school shapefile has been converted to a CSV for you to make things easier. If you ever have to do this yourself, there are lots of online converters like this one.

Later in the course, you will also learn to read data from shapefiles directly into your notebook. This is a little more complicated, so we are skipping it for now.

Reminders:

A CSV file is simple tabular file format using commas or spaces to distinguish elements within it.

A shapefile is a vector data storage file format that can be used in spatial software such as Q-GIS .

Mandatory files: • .shp — shape format; the feature geometry itself. Variable-record-length file in which each record describes a shape with a list of its vertices. • .shx — shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly. The index file (.shx) contains a 100-byte header followed by 8-byte, fixed-length records. • .dbf — attribute format; one-to-one relationship between geometry and attributes is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.

Optional files: • .prj — provides projection information and coordinate system;, a plain text file describing the projection using well-known text format • .sbn and .sbx — a spatial index of the features • .fbnand .fbx — a spatial index of the features for shapefiles that are read-only • .ain and .aih — an attribute index of the active fields in a table • .ixs — a geocoding index for read-write shapefiles • .mxs — a geocoding index for read-write shapefiles (ODB format) • .atx — an attribute index for the .dbf file in the form of shapefile.columnname.atx (ArcGIS 8 and later) • .shp.xml — contain geospatial metadata in XML format. • .cpg — used to specify the code page (only for .dbf) for identifying the character encoding to be used.

2.3.1 Let's get your data¶

In [ ]:

#codecell_makeabasicmap_GetUrdataReady

#Get the data by reading it into the notebook
palmisano_tuscany_sites = pd.read_csv("https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/site_centriods_tuscany.txt")

palmisano_tuscany_sites.head()

Learning a new language – decomposing the code¶

In #codecell_makeabasicmap-GetUrdataReady, we have a simple piece of code which allows you to open the CSV file containing the archaeological sites information. The image below will help you to understand the code better and reuse it easily.

2.3.2 Let's filter this big dataset to get only the data you need¶

The original data file contains over 10970 entries for all archaeological settlement in Central Italy. So you need to filter or SUBSELECT your big data file to get just the iron age sites.

In [ ]:

#codecell_makeabasicmap_SubSelectData

#tell the notebook you only want to see stuff where the period is the iron age.
palmisano_tuscany_sites_iron_age = palmisano_tuscany_sites[(palmisano_tuscany_sites['Period']=="Iron Age") ]

palmisano_tuscany_sites_iron_age.head()

Learning a new language – decomposing the code¶

In the #code_makeabasicmap_SubSelectData, we have a simple piece of code which allows you to open the CSV file containing the archaeological sites information. The image below is decomposing this codeline, so you can understand how python language works and re-use it easily.

2.3.3 Let's put the data onto a map¶

Now we want to add this data to a simple map.

You need to start by getting the coordinates of all the points so you can center the map - that is focus on the area where your data is. Probably it will be a good idea to put the middle of your map roughly where the middle of your dataset is located.

Think about it... if your data is in Italy, putting Antarctica as the center of your map is not going to be very effective.

Now, you want to add a marker for each Iron Age site. A marker, also called recall, is a graphic icon to represent the coordinates where a site is centred.

In [ ]:

#codecell_makeabasicmap_BringingUrData2theMap

#location is the mean of every lat and long point to centre the map.
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()

#A basemap is then created using the location to centre on and the zoom level to start.
m = folium.Map(location=location,zoom_start=10)

#Each location in the DataFrame is then added as a marker to the basemap points are then added to the map
for i in range(0,len(palmisano_tuscany_sites_iron_age)):
    folium.Marker([palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]]).add_to(m)
        
#To display the map simply ask for the object (called 'm') representation
m

Yay!¶

You have a map.

Learning a new language – decomposing the code¶

The #codecell_makeabasicmap_BringingUrData2theMap contains four codelines aimed to:

centre your map by calculating the mean of the latitude coordinates and mean of longitude coordinates;
zoom your map to a suitable level - e.g. world, country, region or street level;
place a marker for each archaeological site location.
show the results interactively.

Below are the three codelines of the script decomposed for you, so you can adapt it easily in the future.

Reminder: the mean or average of a set is the total sum of all numbers divided by the number of items in the set. So, for instance, the mean of latitude coordinates = sum of all latitude coordinates/ divided by the number of latitude coordinates in the dataframe.

This codeline shows the powerful capabilities of the Folium library you've imported at the start of this practical lab. You can find more detailed guidance about it here.

This codeline introduces you to two important concepts in coding: the range and the loop by using iloc command. Furthermore, by using range(), len() and iloc() you could implement a loop in Python for iterating rows and columns of your filtered pandas dataframe (palminsano_tuscany_sites_iron_age).

2.3.4 Let's play with this code¶

Now, a design question: what is a good starting zoom level?

Do you want your map zoomed further out or further in given the extent of your data?

In the code cell above, play around with the 'zoom_start' parameter and find a good zoom level that makes you happy.

You just modified the code to suit your own needs

Rather than just executing code by pushing play, you've edited the code by changing a variable.

Easy, right?

This is where many people get started with scripting in archaeology. You find a bit of open source code that seems to do what you need, and then you modify it. There's no sense in writing code from scratch when there's lots of useful open source code available that is meant to be shared and modified.

2.4 Representing information and relative values or quantities with graphics¶

So far, we have seen how the data is spread visually, but we want to look at this data quantitatively. How do these sites compare?

Let's look at how to represent larger and smaller sites. The archaeologists in this project have a qualitative ranking of site size from large (A) to small (F). You can see the categories they've used, and that there are more small sites than larger ones. Say you want to make the two categories of sites that are the largest different colours. Why do this? Maybe larger sites have more people, so this provides insights into demographics.

In the step above, #codecell_makeabasicmap_SubSelectData, you've set up your new dataset by filtering the iron age settlements from the original dataset. It's what we would call a 'subset' of your original data.

Now you are ready to make a new map.

In [ ]:

#codecell_makeabasicmap_ManipulatingyourData_ValueCounting

# Size surely matters... group the sites by size category and get the number of sites in each size category.
# SizeQual is the name of the 'attribute', or column in the table, that has information on site size. 
# An attribute is anything attached to spatial data other than the coordinates, such as period, type, size in hectare,etc. 

palmisano_tuscany_sites_iron_age['SizeQual'].value_counts()

Learning a new language – decomposing the code ¶

.value_counts() is a function that allows you to count numerical and categorical data (e.g. numbers, objects or labels such as attributes). The results are returned as a series of numbers in descending order, so, that you can further evaluate the frequency and distribution of these values.

In [ ]:

#codecell_makeabasicmap_ManipulatingyourData_UsingSymbology

#now make a map just like you did before. Note that this time we're adding a scale bar with 'control_scale'
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()
m = folium.Map(location=location,zoom_start=10,control_scale = True)

#Assign different colours to the two large site categories - B and C in this case
for i in range(0,len(palmisano_tuscany_sites_iron_age)):


    site_size = palmisano_tuscany_sites_iron_age['SizeQual'].iloc[i]
    if site_size == 'C':
        color = 'blue'
    elif site_size == 'B':
        color = 'green'
    else:
        color = 'red'
    
# add the markers to the map, using the locations and colours    
    folium.Marker([palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]],icon=folium.Icon(color=color)).add_to(m)

#type 'm' for map (the variable you set above) to tell the notebook to display your map
m

Learning a new language – decomposing the code ¶

for #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology can be decomposed like this:

![]( https://github.com/ropitz/spatial-thinking-archaeology-notebook/master/Chap2_makeabasicmap_ManipulatingyourData_UsingSymbology_.jpg? raw=1)

2.5 Designing your map - Symbology¶

Why design matters? Why it is important? Some map design can observe some principles following a simple equation: Aesthetics + Information Communication + Argumentation. In other words, your map should be beautiful, well designed, and tell a clear true story.

Research, design, revision, and evaluation are the standard process for making a good map following a simple workflow : Careful planning -> Data analysis -> Presentation -> Critique = Feedback -> Production.

For starters, you can look at this document to familiarise yourself with some of the concepts

2.5.1 What symbol scheme is best for your map?¶

Changing Symbol Colour

Now, go back into #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology and experiment a bit. Try out some different colours. Do certain combinations work well? Try adding different colours for each of the smaller categories of sites. Does that make the map clearer or more confusing?

Changing Symbol size

Maybe it makes more sense to show size by changing the size of the icon than the colour. Let's make another map that varies the size of the icon for each site based on its size in hectares. This is making a proportional symbol map.

Here are two example maps from Kenneth Field illustrating the power of map symbology:



Basic geocoded structure of the dataset	Proportional symbol map of poverty

In [ ]:

#codecell_makeabasicmap_ManipulatingyourData_UsingSymbologyExperimenting

#now make a map just like you did before. 
location = palmisano_tuscany_sites_iron_age['Latitude'].mean(), palmisano_tuscany_sites_iron_age['Longitude'].mean()
m = folium.Map(location=location,zoom_start=8,control_scale = True)


# Set the size for each circle icon by defining the 'radius' which is the radius of the circle
# Here we are multiplying the size in hectares (the SizeHa attribute) by 15. Try different values here to get icons a size you like
for i in range(0,len(palmisano_tuscany_sites_iron_age)):
   folium.Circle(
      location=[palmisano_tuscany_sites_iron_age['Latitude'].iloc[i],palmisano_tuscany_sites_iron_age['Longitude'].iloc[i]],
      popup=palmisano_tuscany_sites_iron_age.iloc[i]['Toponyms'],
      radius=palmisano_tuscany_sites_iron_age.iloc[i]['SizeHa']*15, #this is where we set the value of 15 - change this variable to get differrent size icons
      color='crimson',
      fill=True,
      fill_color='crimson'
   ).add_to(m)
    
#type 'm' for map (the variable you set above) to tell the notebook to display your map
m

Learning a new language – decomposing the code ¶

for #codecell_makeabasicmap_ManipulatingyourData_UsingSymbologyExperimenting can be decomposed like this:

2.5.2 Thinking about spatial density or 'clustering' or concentration¶

Many things in life naturally tend to cluster. Patterns of higher density or cluster can be clearly identified in urban and rural landscapes and populations. Forming general understandings of the processes at particular points in time (e.g. socioeconomic, demographic and environmental structures of different places) that give rise to the geographical patterning of landscapes and populations are essential to predict and simulate how they settlements have evolved in the past and might evolve in the future.

When might this be important? What kinds of archaeological questions involve finding the spatial centre of a distribution of sites or finds or features?

Research question Now what if you want to make a map that shows the concentration of sites in the iron age, that is areas with more and fewer sites?

In order to show areas of higher density of activity, a 'heatmap' is often created by using a colour gradient. A heatmap represents the geographic density of point features on a map by using colored areas to represent those points. The areas will be of high intensity colour where the most points are concentrated and least intensity where they are less concentrated. Essentially it shows areas with more sites as 'hotter' (generally red in colour) and areas with fewer sites as cooler.

For our dataset, in this kind of map surely larger sites should count more - we call this 'weighting'. We want to add the weight of the size of the settlement for our Iron Age settlement map. A site that is twice as large should count for twice as much in our heatmap.

For this need to think how the points are distributed geographically. This entails obtaining basic spatial statistics about the dataset, a first step in the analysis of a geographic data.

Just a reminder of some basic terminologies in data statistics:

The overarching aim is to determine the default search radius or bandwidth which will define the area of influence and how the colour gradient will be applied as illustrated in the image below.

To find the (default) search radius consists in 5 steps (Yang,2017) where you calculate:

the mean center of the number of the input. This is the geographic center for a set of features (remember that this center can change a lot depending if there are outliers or not). We want to weight this centre to our attribute field size of the settlement.
the distance from the weighted mean center for all points.
the weighted median of the distances (D).
calculate the weighted standard distance (SD). This is the degree to which features are concentrated or dispersed around the geometric mean center. This shows the concentration or dispersal of the data.
calculate the bandwidth based on this data statistics. You can learn a bit more how density formula equation is working in this [Esri webpage] (https://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/how-kernel-density-works.htm).

Here is an example of population density map to illustrate this concept:

Density surfaces show where point or line features are concentrated. For example, you might have a point value for each town representing the total number of people in the town, but you want to learn more about the spread of population over the region. Since all the people in each town do not live at the population point, by calculating density, you can create a surface showing the predicted distribution of the population throughout the landscape. The following graphic gives an example of a density surface. When added together, the population values of the cells equal the sum of the population of the original point layer. ref

Folium is the library will be using to create our heatmap, you can see more example in Anthony Ivan page or Aly Sivji.

In [ ]:

#codecell_makeabasicmap_ManipulatingyourData_Heatmap

#first make a list of the coordinates of each site and its size in hectares, which we will use for the size-based weighting
data_heat = palmisano_tuscany_sites_iron_age[['Latitude','Longitude','SizeHa']].values.tolist()

In [ ]:

#look at the first line of your list to check it seems to have worked
data_heat[0]

In [ ]:

#to make the heatmap, we need to get an extra tool, so...
import folium.plugins as plugins

In [ ]:

# now make a map a bit like you did before, set your zoom level and your centre location. Then use the plugin to make the heatmap. 

m = folium.Map(location=location, zoom_start=10)

#tiles='stamentoner'

plugins.HeatMap(data_heat).add_to(m)

#type 'm' for map (the variable you set above) to tell the notebook to display your map
m

Learning a new language – decomposing the code ¶

for #codecell_makeabasicmap_ManipulatingyourData_Heatmap can be decomposed like this:

For further documentation on folium plugins follow link.

Success!¶

You should have a heatmap showing the concentrations of Iron Age sites in the region.

You could repeat the exercise with sites from any period in which you are interested.

Hopefully you've learned to make some basic maps and are starting to understand how to put into practice some of the theory of map design and spatial visualisation we've been discussing.

That's it for today... Remember to save your notebook (under a new name) so you can come back to it and practice making basic static maps.

2.6 LexiCode¶

In this practical you have learned new commands that you can now reuse for your own datasets:

== , () []
.head_csv()
.read_csv()
mean()
folium.Map
range()
len()
iloc[]
.value_counts()
if =:
elif =:
else =:
folium.Marker()
folium.Icon()
folium.Circle
popup=
radius=
.values.tolist()
plugins.HeatMap()

2.7 References¶

Palmisano, A., Bevan, A. and Shennan, S. (2018) Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates. Journal of Open Archaeology Data, 6(1), p.2. DOI: http://doi.org/10.5334/joad.43

Pimpler, E. (2017) Spatial Analytics with ArcGIS. Pakt Publishing.

Yang, C. (2017). Introduction to GIS Programming and Fundamentals with Python and ArcGIS. Boca Raton, FL : CRC Press, Taylor & Francis Group.

Chapter 3: Interactive maps¶

3.1 Introduction: exercise motivations¶

~ déjà vu ~ In Chapter 2, we focused on making "mostly static" maps, that is maps where you mostly just expect your user to look at the product you've prepared. We looked at a research question about the distribution of Iron Age sites in Central Italy, and we focused the practical exercise on using the map for:

data exploration => filtering and creating subsets of data to show different aspects of the overall dataset;
visualisation => zoom level (visual balance), visual variables (symbols, fonts, generalisation, color);
organisation => Balancing geospatial and attribute (descriptive) data, and which attributes you will highlight.

~ new ~ This week we'll continue exploring different types of maps. Specifically, we'll look at making maps where interactivity is a key part of the design. Beyond the question of how to present spatial data (i.e. designing the map), we will spend more time on an important topic in archaeology: spatial distributions. Central to investigating spatial distributions (patterns) is our ability to manipulate and rearrange spatial data to answer spatially explicit questions.As such, this practical exercise aims to help you with:

Data preparation, data cleaning and data manipulation: by transforming database, merging database and creating data layers;
Presenting your data interactively.

3.1.1 Why interactive maps?¶

Why does this data need to be presented in an interactive map? What spatial relationships are important here? What kinds of questions might a user ask?

Interactive maps are for exploring spatial data where poeple can:

ask their questions about the data
ask their questions and control how the response is represented
create and add their own data

When should you use interactive maps?

When the ability to explore data and query to make different maps is important.
When the ability to manipulate the visual aspects of the map to show different combinations of data is important.
Archaeological motivations/example:
- How do we understand change over time and space in archaeology?
- How can we explore this using (interactive at different levels) maps?

Interactive maps are spatial and descriptive data interfaces


*Visual design (Front end)***	*Data for design (Back end)***
Symbols	Data Quality (accuracy and resolution)
Fonts	Data Cleanliness
Labels	Data Consistency
Scale	Data Accessibility
North Arrow	Metadata
Generalisation	Data Architecture
Visual Interface & Query Interface	Data architecture & Spatial Data resource

3.1.1 Open Data and Data Legacy¶

Open source software In Chapter 1, when introducing Jupyter notebooks, we mentioned open source software. Open data operates under the same broad ethos, and follows many of the same principles. Sharing, reuse, and attribution are key. If you continue to reuse the Antikythera data, be sure to continue to link back to and cite the source.

Perhaps the most relevant example in the UK is OS (Ordnance survey) OpenData which was made freely available for the first time in 2010. Last review (2018) counted 1.9 million open data downloads, the equivalent of 150 people download OS OpenData every day!

2015 has seen Environmental Agency made lidar (light detection and ranging) data available to the public, for free, as open data. Within the first year of release 500,000 lidar downloads were made equating to nearly 13 million km2 of data!

Working with Legacy data: the case of West Heslerton, North Yorkshire, excavation of the Early Anglo-Saxon or Anglian Settlement between 1986 and 1995,


Interactive Map	Settlement Archive
The West Heslerton Interactive Map: This is the map	The West Heslerton Settlement Archive: This is the data that drives the map
Front end	Back end

3.2 Let's start¶


	=

Start by getting tools, always. In this Chapter, you still be working with panda and folium and you will be replacing the branca library with numpy.

In [ ]:

#codecell_Webmaps&Distributions_ImportUrLibraries

#Like last time, get the tools we need at the start
import pandas as pd
import folium
import numpy as np

3.3 Let's get the data: the case of Antikythera¶

This time, we are working with the data from the Antikythera survey project.

It's citation is: Bevan, A. and Conolly, J. 2012. Intensive Survey Data from Antikythera, Greece. Journal of Open Archaeology Data 1(1), DOI: http://dx.doi.org/10.5334/4f3bcb3f7f21d.

This data has been made available at the Archaeological Data Service (ADS) archive at https://archaeologydataservice.ac.uk/archives/view/antikythera_ahrc_2012/ and completely open for re-use. You can see how many times their database has been viewed and downloaded.

Working with other people's data

Have a quick look around Antikythera dataset as it's described on the ADS site. You'll notice that they've split up their dataset in ways that made sense to them at the time. Specifically they've divided up the artefact data into three discrete elements: 'pottery', 'lithics' and 'other' into separate files (very much like we did last week by filtering and creating a subset). This is a pretty normal archaeological data thing to do.

Here's the trick, we want to focus on both ceramics and small finds and to look at these datasets together. This means you'll have to grab both of them and combine them. When you are combining them, you will also need to re-organize the attribute data in order to reuse it for something new. This follows the model of a relational database

What’s a database vs a graph database?

How to query this data? Query languages are the languages used to ask questions of databases. Python acts as a query language for pandas data structures. SQL and SPARQL are specific database query languages


Querying Python	Querying SQL

Example python script (top) & Your script in 2.3.2 (bottom)	Example of SQL query language

3.2.1 It is all about data preparation###¶

In [ ]:

#codecell_Webmaps&Distributions_OrganisingUrdata

#Like last time, get the data we'll need at the start. I've been nice again and converted their coordinates to latitude and longitude for you. 
#You'll learn to do this yourself later in the course.

# we label the first dataset 'pottery'
pottery = pd.read_csv('https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/antikythera_survey_pottery.csv')

# we label the second dataset 'small finds'
small_finds = pd.read_csv('https://raw.githubusercontent.com/ropitz/spatialarchaeology/master/data/antikythera_survey_small_finds.csv')

In [ ]:

#codecell_Webmaps&Distributions_CheckingZeData

#let's check the individual pottery file.  
pottery.head()

In [ ]:

#codecell_Webmaps&Distributions_CheckingZeData

#let's check the individual pottery file to see 
small_finds.head()

Learning a new language – decomposing the code ¶

in #codecell_Webmaps & Distributions_CheckingZeData:

= allows you to provide a new name, often more convenient than the link pathname
pd.read_csv allows you to open a CSV format file
.head() function take a peek at the first part of the dataset

Spend some time understanding your data

Now have a look at these two datasets and how they are structured. Also, the coordinates have been transformed for you from cartesian co-ordinates : a grid system - shown in column 'Xsugg' & 'Ysugg' - back to latitude/longitude, a geodetic system - shown in column 'DDlat' & 'DDlong' -. (we've talked about projections, coordinates systems and transformations in Chapter 2, section 2.3).

3.2.2 Re-organising your data¶

In [ ]:

#codecell_Webmaps&Distributions_Concatenation

# then we combine the two datasets together to make a big dataset we call 'survey data'
survey_data = pd.concat([pottery,small_finds], sort=False, ignore_index=True)

Learning a new language – decomposing the code ¶

Codecell__Webmaps&Distributions_Concatenation can be decomposed like this:

An array is a group of elements/objects/items organised as a set where columns (of equal size) are multiplied by rows (of equal size). Below are two illustration showing what is happening with an array and how it differs from a list:

The advantage of using arrays is that all elements can be accessed at any time randomly. In Chapter 2, you had done something similar to call all items from the range [i] ( in #codecell_makeabasicmap_BringingUrData2theMap ). This time, you are linking together 2 arrays. Pandas library provides various ways to combine together two arrays such as merge, join and concatenate (see user guide).


In this example, df1 and df4 are concatenated in the same way we have concatenated pottery and small finds

The result shows that the axis have been appended them but the overlapping indexes have been ignored. To do this, we have used the argument ignore_index=True)

Let's make sure nothing went wrong...

In [ ]:

#codecell_Webmaps&Distributions_CheckingZeData

#check things loaded in and combined OK
survey_data.head()

3.2.3 Now let's ask a research question¶

Like we said last week, we are using maps and spatial analysis to pose, explore and respond to spatial questions.

My question is a bit like last chapter's question. I want to know about how many sites are in each period, so I can try and understand changing patterns over time.

However, you may have noticed when you read in the data that it's structure is a bit different from last week's data. Instead of each site belonging to one period, it's assigned with varying probability to several different periods.

This is a totally legit archaeological thing to do. Many sites have activity from multiple periods, and depending on the available evidence, you might be more or less confident about the presence or absence of activity in a specific period.

The most likely period/date?¶

We might start simply by assigning each site primarily to its 'most likely period'.

Data cleaning is the most time-consuming and important part of any analysis - and not only in archaeology but for all scientific analyses. This takes a few steps....

Therefore we will walk through the steps of data cleaning in this exercise.

step 1 - prepare the data¶

...you have to re-organize the dataset to answer our question: 'how many sites are in each period?'.

In [ ]:

#codecell_Webmaps&Distributions_Subselect

# first we create a subset of our data that only contains the columns with information about time
# this is in part because we want to do some operations where everything has to be a number, and some of the other fields contain text
# it's also just to make things simpler when we look at them

survey_data_time = survey_data[['MNLN', 'FNEB1',	'EB2',	'LPrePal', 'FPal', 'SPal', 'TPal', 'PPalPG', 'Geom', 'Arch', 'Class', 'Hell', 'ERom', 'MRom', 'LRom', 'EByz', 'MByz', 'EVen', 'MVen', 'LVen', 'Recent', 'Other']]
survey_data_time.head()

In [ ]:

#codecell_Webmaps&Distributions_ChangingDataType

# if you were to look through this data
survey_data_time.dtypes

Learning a new language – decomposing the code ¶

In #codecell__Webmaps&Distributions_ChangingDataType, the ~ .dtypes ~ function allows you to see which type of data is in our 'survey_data_time' dataframe. Three types of data are returned int64 which are integers (whole numbers -numerical data), float64 which are floating data (contains floating decimal points -numerical data) and object which is a string (not a number -categorical data).

At the very start, in Chapter 1.1, we said that the type of data you are using matters, and this matters because tests and commands will differ depending the data is numerical or categorical. The image below is here as a reminder:

Looking further in this dataset, 'survey_data_time', we can also see that some that some fields contain missing values (NaN). Let's see what we can do about that and why it matters.

In [ ]:

#codecell_Webmaps&Distributions_AllNumbers

# We can get rid of NaN (null) values by 'filling' them.
# This is important because null values can break number-based operations.
# Let's get rid of missing values and make sure everything is a number.

survey_data_time.astype('float64').fillna(0)

Learning a new language – decomposing the code ¶

In the #codecell_Webmaps&Distributions_AllNumbers, you are making sure all these missing values, NaN, becomes null numbers (0=zero).

Why do missing values need to be removed? Simply because you cannot apply maths to them (e.g. adding or multiplying columns together, reordering values of a column from max to min value, etc.).

And this is how it can be done:

step 2 - Reshaping your data¶

Now that you've removed the null values (NaN), you can move to the next step in data cleaning, which is applying a transformation to arrange the data the way you want it for your analysis. Data transformation is an important thing to learn to do.

Right now we have a bunch of columns with information about time. What we want is one single column that contains the most likely period - which is represented in each row by the column with the greatest value.

think about that for a moment

Right now the 'most likely period' is represented by a number in each row, but that's not the piece of information we want in our new column - we want the name of the column that contains that number.In other words, we want to know the the item in the table that satisfies the condition 'the most likely period', and to be able to do something with this item.

Once you have identified the maximum values, you can extract them...

In [ ]:

#codecell_Webmaps&Distributions_MaximumValue

#here we take the columns from all the different periods, get the one with the maximum value, and write that column's name to our new 'colmax' field
def returncolname(row, colnames):
    return colnames[np.argmax(row.values)]

survey_data_time['colmax'] = survey_data_time.apply(lambda x: returncolname(x, survey_data_time.columns), axis=1)

Learning a new language – decomposing the code ¶

In the #codecell_Webmaps&Distributions_MaximumValue you are creating a function. A function is a small script with multiple steps that you can run over your whole dataset. In this function, for each your you return the name of the column where the maximum (greatest) value in that row is found. You then create a new column "colmax" and store that value.

This can be done using a lambda function. Let's break down the code the steps...

1. define a function :

pass/apply this function as an argument:

*How does this really work?*

Remember that we are working with arrays, so, getting maximum or minimum values can be schematised like this:

In [ ]:

#we can check it has all gone well survey_data_time.head()

step 3 - Merging tables¶
OK now we have a single column with the information we need - the most likely date. To create this column, we broke off some of our data (the columns with numbers) from the rest of the data (important descriptive text). We might well want to stick these two datasets back together before proceeding.

splitting and merging tables is another basic skill when working with data

In [ ]:

#codecell_Webmaps&Distributions_MergingZeData #now we can also add our new column back to our original data table by doing a 'merge' #create a new table 'survey_data_maxtime' by merging our original 'survey_data' with ONLY the 'colmax' column from our new table survey_data_maxtime = pd.merge(survey_data, survey_data_time['colmax'], how='inner', left_index=True, right_index=True) survey_data_maxtime.head()

Learning a new language – decomposing the code ¶

In the #codecell_Webmaps&Distributions_AllNumbers, you learned how to merge using pandas, and, if you want to further explore and apply pd.merge() to your data, check which parameters apply.

The curse of abbreviations¶
Have a look at the resulting table. What do all those column names mean? Right now you are probably justifiably confused. We'll be talking more about the mess that is 'other people's data' next week. For now, have a look at the documentation for these datasets at: https://archaeologydataservice.ac.uk/catalogue/adsdata/arch-1115-2/dissemination/csv/pottery/documentation/pottery.txt

You'll see they explain that many of those weird abbreviations are periods and that the number in each one represents the chance that a given find belongs to that period. Sometimes I wish people wouldn't use abbreviations like this, but they've defined them in their metadata file, so we can't compain too much.

3.3 Finally, we make maps!¶
Broadly speaking, there are two ways to approach interpreting spatial patterns. There's visualisation and interpretation, where you might visually compare distributions or densities or locations of two or more datasets by plotting them on a map and intepreting what you see. Then there's statistical analysis.

We'll start with the tools to do the first one, and introduce statistical analysis later in the course. We'll also discuss the value of each approach, and when to apply it.

3.3.1 Maps for visualisation and interpretation¶
here are illustrative examples of data distribution and how we can interpolate this data:

Types of Distribution: Data Sampling

Bonnier et al. (2019)

Examples of Applications: Interpolating the data

Orengo et al. (2019)

3.3.2 As always, start with a question¶
As we said at the beginning of today, we're interested in change over time.

How does the distribution of finds change between different periods?

=

In [ ]:

#codecell_Webmaps&Distributions_ImportUrLibraries # we're going to get geopandas, another tool for making maps !pip install geopandas

In [ ]:

#codecell_Webmaps&Distributions_ImportUrLibraries # get some more tools for making maps (and other things) %matplotlib inline import geopandas as gpd import seaborn as sns import numpy as np import matplotlib.pyplot as plt

Geopandas¶
We're going to use geopandas for the next few steps. You've used geopandas before, in the first chapter.

It can do many of the same things as folium, which we were using last time.

Geopandas is particularly useful for showing categorical data.

Categorical data is data about groups

Categories can be anything where you are defining groups. Archaeological periods are categories. Types of lithics or ceramics are categories. As above, types of cars are categories. The name of each category is sometimes referred to as a label.

We have seen above in #codecell_Webmaps&Distributions_ChangingDataType that the function .dtype() allows you to see which type of data is your columns.

We could start tryping to see and understand the distributions of our sites by periods simply by mapping the period labels with different colors.

In [ ]:

#codecell_Webmaps&Distributions_dataframesINTOgeodataframes # take our big dataset from above and turn it from a 'dataframe' which is the data that folium uses to make maps into a 'geodataframe' which is the data geopandas uses to make maps gdf_survey = gpd.GeoDataFrame( survey_data_maxtime, geometry=gpd.points_from_xy(survey_data_maxtime.DDLon, survey_data_maxtime.DDLat)) print(gdf_survey.head())

Learning a new language – decomposing the code ¶

In the #codecell_Webmaps&Distributions_dataframesINTOgeodataframes, gpd.GeoDataFrame() allows you to define the geometry (geometry=) of the new dataframe 'survey_data_maxtime', so it can be mapped.

Here, the location of the points is defined by the x and y centroids (exact middles of polygon shapes)(gpd.points_from_xy ) following their respective geodetic coordinates, which can be found in the column 'survey_data_maxtime.DDLon' for the Xs and 'survey_data_maxtime.DDLat' for Ys.

print()~ allows you to see the results.

3.3.3 Visualising the results¶

In [ ]:

#Codecell_Webmaps&Distributions_PlotGeodataframes #plot your data colouring the points by the period to which they belong. You are grouping your sites by their category label when you do this. #the plot requires you to define which type of data it is (categorical or numerical) #figsize =(width, height) gdf_survey.plot(column='colmax', categorical=True, legend=True, figsize=(15,15))

3.3.4 Assess the resulting map¶
Is it useful? Why or why not?

Can you see the distribution of sites from individual periods easily?

Can you easily discern change over time?

I'm not overly convinced by the result here. If you think about the map design principles we discused last week, you will probably also conclude that this is not a successful map.

What other approach might we take?

Let's try something else.

In [ ]:

#codecell_Webmaps&Distributions_SplittingUrData #Maybe it would be better to only look at two or three periods at a time. # Recall last week's discussion about selecting appropriate data that matches our question, and also about appropriate levels of generalisation. #let's select a subset of our periods to see change from the early bronze age to hellenistic to late roman periods # note that some crazy abbreviations have been used for the names of periods... # list the types (periods) of ceramics we are interested in seeing. types = ['EB2','Hell','LRom'] #define 'classic' (as in classical archaeology) as this group of periods. Create a subset of the data to contain only the types you just listed classic = gdf_survey.loc[gdf_survey['colmax'].isin(types)] # check you get the expected result - only sites from the periods you just defined as belonging to the 'classic' group classic.head()

Learning a new language – decomposing the code ¶

in #codecell__Webmaps&Distributions_SplittingUrData, the function gdf_survey.loc is similar to .loc[] & .iloc[] used in pandas last chapter (in #codecell_makeabasicmap_BringingUrData2theMap). Note that .loc is using a label to index data whereas .iloc uses an integer position. This is the difference between working by columns or by rows.

The .isin()~ command allows you to exclude data.

In [ ]:

#codecell_Webmaps&Distributions_PlotUrData #plot your data colouring the points by the period to which they belong # Do this by setting 'categorical' to 'true' so that the plot is coloured by category classic.plot(column='colmax', categorical=True, legend=True, figsize=(15,15))

3.4 Thinking about data visualisation and map design¶
That's a bit better perhaps. The map is less crowded.

Recall our discussions about how to design a map well. Clearly too much data introduces design problems.

Well, now we can see the distributions a bit, and maybe say something about change over time, but there are still a lot of dots, and it's pretty clear dots from some periods are hidden under dots from other periods and we have no way to separate them.

What we need here is:

layers, so we can group our data and work with it interactively; and,

some context for all those dots

To address our design problem, we will make interactive maps with layers that can be toggled on and off

=

In [ ]:

#codecell_Webmaps&Distributions_ImportUrLibraries #Like last time, we'll use folium and one of it's plugins. Import the tools you'll need, as usual. from folium.plugins import HeatMapWithTime

3.4.1 Map Visualisations with Folium¶

To see the survey data in context and build our interactive maps, we'll start by generating the base map that will be used throughout this notebook.

'Basemaps' are generic background maps, like a satellite image or an image of the street map. You know the different backgrounds you can show on google maps? Those are 'basemaps'. Very much like the basemap imported folium.Map (#codecell_makeabasicmap_BringingUrData2theMap).

Have a look around the web, and you'll see that most modern online maps use a basemap, so we're going to do so as well.

In [ ]:

#codecell_Webmaps&Distributions_BringingUrData2theMap #get the survey area centre, like you did last week, so you can centre the map where the data is located location_survey=survey_data_maxtime['DDLat'].mean(), survey_data_maxtime['DDLon'].mean() print(location_survey)

In [ ]:

#define a basemap we can reuse. Use the coordiantes for the centre you generated just above to centre the basemap #This is a variant on how we did things last time... def generateBaseMap(default_location=[35.870086207930626, 23.301798820980512], default_zoom_start=11): base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start) return base_map

Review - basic map controls ¶

generateBaseMap(default_location=[],default_zoom_start=)

location=: Define the default location to zoom at when rendering the map

zoom_start=: The zoom level that the map will default to when rendering the map

control_scale=: Shows the map scale for a given zoom level

In [ ]:

#codecell_Webmaps&Distributions_CheckingZeData #check the basemap is working base_map = generateBaseMap() base_map

MORE TOOLS

In [ ]:

#codecell_Webmaps&Distributions_ImportUrLibraries #lets get the heatmap tool, like last time, let's also get a measure control so we can measure distance from folium import plugins from folium.plugins import HeatMap from folium.plugins import MeasureControl # Measure controls are pretty neat. Rather than just having a scale bar, like you would in a static map, and needing to visually estimate the size of features, you can mesure them. # The ability to measure is a benefit of moving up the 'interactivity spectrum'.

3.4.2 Visualise with a Heatmap¶

Let's start by visually comparing MRom to LRom, that is middle roman to late roman sites by putting their data in separate layers.

In [ ]:

#codecell_Webmaps&Distributions_Splitting your data # make a layer for when each period is more than 50% likely, so you have all the sites that are probably in that period survey_data_MRom = survey_data_maxtime.loc[(survey_data_maxtime['MRom'] > 50)] survey_data_ERom = survey_data_maxtime.loc[(survey_data_maxtime['ERom'] > 50)] # Yes, I know choosing a 50% cut-off is arbitrary. You could choose a different cut-off and change the values listed above. # If you do so, all your maps (and all your conclusions about change over time) will be affected.

Learning a new language – decomposing the code ¶

In the last practical lab in #codecell_makeabasicmap_ManipulatingyourData_UsingSymbology, you used the symbol == . This is similar, you are using a mathematical symbol to filter (in this case split it in two using 50%)your dataframe: > greater than and < less than.

3.4.3 Practicing the concept of layers¶
We've introduced the concept of 'Spatial Data' in Chapter 1.1, and presented them as layers. Similarly, in our dataset, each layer contains information that can be turned on and off. Think of this like a stack of transparent paper. Each sheet of paper is a layer, and can be added to or taken away from the stack. Their order can also be changed.

In [ ]:

#codecell_Webmaps&Distributions_PrepareUrBasemaps_CreateLayers # like last time, make heatmaps, but one for each period, put them in different layers. # give your map a name base_map = generateBaseMap() # add two layers, one for each period mrom = HeatMap(data=survey_data_MRom[['DDLat', 'DDLon', 'MRom']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map) erom = HeatMap(data=survey_data_ERom[['DDLat', 'DDLon', 'ERom']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map) #give the layers sensible names that human can read mrom.layer_name = 'Middle Roman Distribution' erom.layer_name = 'Early Roman Distribution' # add the layer control. This is the tool that lets you turn different layers in your map on and off folium.LayerControl().add_to(base_map)

Learning a new language – decomposing the code ¶

In #codecell__Webmaps&Distributions_PrepareUrBasemaps_CreateLayers, you should refer to #codecell_makeabasicmap_ManipulatingyourData_Heatmap to review the steps taken.

Last time, to create your heatmap you've used a size-based weighting (making larger sites have larger symbols). This time, you are creating two heatmaps named mrom ('Middle Roman Distribution')and erom ('Early Roman Distribution'). Because the data is located within the same island, the maps will overlap.

The layer control plugin allows you to switch on and off the visibility of one or the other (or both). you can find some documentation on LayerControl() here.

In [ ]:

#codecell__Webmaps&Distributions_GenerateUrBasemap #Now generate your map by calling it by its name base_map

3.4.4 An exercise¶
Now try and add some more layers to the map to show other periods! What other periods might it be relevant to consider if you are trying to understand change over time? Edit the cell above to add more layers, or add a new cell below and follow the steps above to make a new map, and add your extra layers to it.

3.5 Back end to Front end: Tidying, manipulating, rearranging again and again for a better design¶

3.5.1 Tidying¶

In [ ]:

#codecell_makeabasicmap_ManipulatingyourData # Here I'm doing the same thing as before but with different periods # make a layer for when the max period is LRom or MRom to compare these periods survey_data_lrommax = survey_data_maxtime.loc[(survey_data_maxtime['colmax'] =='LRom')] survey_data_mrommax = survey_data_maxtime.loc[(survey_data_maxtime['colmax'] =='MRom')]

In [ ]:

#codecell_Webmaps&Distributions_SplittingUrData_CreateLayers # like last time, make heatmaps, but one for each period, put them in different layers base_map = generateBaseMap() lrommax = HeatMap(data=survey_data_lrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map) mrommax = HeatMap(data=survey_data_mrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map) #give the layers sensible names lrommax.layer_name = 'Late Roman Distribution' mrommax.layer_name = 'Middle Roman Distribution' # add the layer control folium.LayerControl().add_to(base_map) base_map # Adds a measure tool to the top right base_map.add_child(MeasureControl())

3.5.2 Making visual comparisons¶
Stacking two layers on top of one another is one way to visually compare distributions. Do you think it is effective in this case?

Perhaps it will be more useful to be able to view our distributions and explore them side by side, to help us compare what is happening in these two periods.

We'll import a new tool to allow us to see two maps where the views are synced - that is identical and moving together - to have another way to compare distributions.

In [ ]:

# get another plugin for side by side maps from folium.plugins import DualMap

In [ ]:

# declare you are making a new map "m" # set the location to the location of your survey, set your starting zoom m = plugins.DualMap(location=location_survey, tiles=None, zoom_start=13) # the dual maps plugin automatically defines two buddy maps, "m1" and "m2" which pan and zoom together # give yourself some options in life for your base layers, add them to both maps 'm1' and 'm2' by just using "m" folium.TileLayer('cartodbpositron').add_to(m) folium.TileLayer('openstreetmap').add_to(m) # like last time, make heatmaps, one for each period, put them in different layers # put one layer in the left hand map 'm' and the other in the right hand map 'm2' lrommax = HeatMap(data=survey_data_lrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(m.m1) mrommax = HeatMap(data=survey_data_mrommax[['DDLat', 'DDLon']].groupby(['DDLat', 'DDLon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(m.m2) #give the layers sensible names lrommax.layer_name = 'Late Roman Distribution' mrommax.layer_name = 'Middle Roman Distribution' # layer control time folium.LayerControl(collapsed=False).add_to(m) # Adds a measure tool to the top right m.add_child(MeasureControl()) #draw your side by side maps m

Once you have done this exercise, I recommend you diving in NLS website and see the differences between overlays (remember to slide the transparency cursor) and side by side maps.

3.6 Visualisation and interpretation¶
Thought exercise: The results of these two maps should be similar but slightly different. What is making the difference?

How good are you at interpreting these distributions and comparing them visually? This is me hinting at you that you are going to end up wanting to use statistics eventually

3.7 Think about basic principles¶
The principles of what we've done this week are the same as the principles of what we did last week.

I think it's important to learn to do things more than one way, and to adapt to slightly different tools. The software and code packages used for modern spatial analysis and mapping are pretty diverse and are always developing as people improve things. It doesn't make much sense to just learn one way of making maps mechanistically. The important thing is to understand the principles and aims of what you're doing.

Key principles from this week are:

transforming, or re-shaping data to prepare it for mapping (or analysis)

visualizing categorical data

making layers in maps

adding tools to toggle layer visibility

adding tools to interactively measure things on maps

In any code package that is meant to be used for making maps, odds are good you will find a way to set the zoom level, set the centre starting location, and set the initial scale.

You will be able to set up colour schemes, map attributes, and make layers. Knowing keywords and princples is the important thing.

3.8 LexiCode¶
In the past two practical labs, you have learned and experimented with programming commands -

To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:

libraries |plugins | --- |--- | folium | HeatMapWithTime branca| HeatMap pandas| MeasureControl geopandas| PrepareUrBasemaps_CreateLayers from [folium.plugins]

your lexicode is non-exhaustive, keep expanding, find your own 'best way' to reuse these code/scripts...

Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions --- | --- == () [] | pd.concat() .head_csv() | .dtype() .read_csv() | astype() mean() | fillna() folium.Map | def return range() | .apply(lambda x:function,axis=) len() | pd.merge() iloc[]| how= , left_index= ,left_index= .value_counts()| gpd.GeoDataFrame() if =:| geometry=gpd.points_from_xy elif =: |print() else =:| .isin() folium.Marker()| classic.plot() folium.Icon()| generateBaseMap() folium.Circle| .groupby(['', '']) popup= | .reset_index() radius= | max_zoom= .values.tolist() |folium.TileLayer() .add_to()| plugins.DualMap(location= , tiles= , zoom_start= ) |

3.9 References¶
Orengo, H.A. Garcia-Molsosa, A. (2019) A brave new world for archaeological survey: Automated machine learning-based potsherd detection using high-resolution drone imagery. Journal of Archaeological Science,Volume 112.

Bonnier, A., Finné, M., & Weiberg, E. (2019). Examining land-use through GIS-based kernel density estimation: A re-evaluation of legacy data from the berbati-limnes survey. Journal of Field Archaeology, 44(2), 70-83.

3.10 The End¶
That's all for this chapter. Be sure to save your copy of the notebook in your own repo so you can re-use it!

Chapter 4: More about spatial patterns: The case of Gabii¶

4.1 Introduction: exercise motivations ?¶
Understanding the meanings behind patterns of finds recovered through excavation is a tricky problem. We hope to distinguish activity areas, places devoted to domestic and industrial use, or inhabited places that are distinct from liminal ones. We often want to discern change over time, identifying areas with finds associated with different temporal periods.

To successfully unravel these patterns, we must look not only at the distributions of different types of finds, but how they correlate with one another, the character of the contexts in which they were recovered, and their own physical and social characteristics. Are they likely to be curated? Are they light and likely to be moved from one area to another by post-depositional processes? It's all a bit of a mess.

Importantly, all these processes are spatial. Alignments or proximity between areas with similar (or quite different) finds is potentially meaningful.

The aims of this exercise is for you to:

learn to work real special finds data from an excavation, in all its messiness, to look for spatial patterns and relationships. -> helping you to better identify spatial patterns

start thinking about quantitative and spatial approaches to finds data from excavations and how they can help us better understand the patterns we see. -> thelping you to better quantifying spatial patterns

You'll do this using data collected by the Gabii Project, a 10+ year excavation in central Italy.

=

In [ ]:

##codecell_SpatialPatterns_ImportUrLibraries # as usual, start by getting your tools: your prerequisites. %matplotlib inline # Matplotlib is your tool for drawing graphs and basic maps. You need this! !pip install fiona !pip install geopandas import pandas as pd import requests import fiona import geopandas as gpd import ipywidgets as widgets

In #codecell_SpatialPatterns_ImportUrLibraries. These are what we call prerequisites. You know by now that they are basic tools so you can get started. Let's review them broadly:

Pandas manipulate data.

Geo-pandas manipulate geographic data. They're also black and white and like to eat bamboo... You need these to manipulate your data!

Fiona helps with geographic data (find more about fiona).

Requests are for asking for things. It's good to be able to ask for things.

ipywidgets supports interactivity.

4.2 Getting to know your Data...##¶

In [ ]:

##codecell_SpatialPatterns_ImportUrData # then get your data # This is where I put the data. It's in a format called geojson, used to represent spatial geometry (shapes) and attributes (text). url = 'http://ropitz.github.io/digitalantiquity/data/gabii_SU.geojson' # Please get me the data at that web address (url): # use requests.get to retrieve data from any destination request = requests.get(url) # I will use the letter 'b' to refer to the data, like a nickname #we can use requests to read the response content in bytes b = bytes(request.content) #So we will use fiona.BytesCollection referred by the letter 'f': # to read the raw data (as single-file formats or zipped shapefiles) # to wrap up all the data from 'b' # check the coordinate refereence system (crs) listed in the features with fiona.BytesCollection(b) as f: crs = f.crs #by using also GeoDataFrame.from_features you can read geospatial data that's in the url without saving that data to disk (your PC) first gabii_su_poly = gpd.GeoDataFrame.from_features(f, crs=crs) # and print out the first few lines of the file, so I can check everything looks ok: you know this by now...we will call .head() print(gabii_su_poly.head())

Learning a new language – decomposing the code ¶

In #codecell_SpatialPatterns_ImportUrData:

GeoJSON is used to store the excavation data. GeoJSON format allows to encode a variety of geographic data structures which contains features with spatial attributes (e.g. points, line strings, polygons, multiparts geometries) and non-spatial attributes (text). This is a really useful format to use when creating a GIS.
bytes() method returns bytes object which is an immmutable (cannot be modified) sequence of integers. We tend to use this to compress data, save or send it.
requests.get is used to retrieve data from any destination & requests.content to read all content which is in bytes (for non-text requests using the .content property).
BytesCollection() takes a buffer of bytes and maps to a virtual file that can then be opened by fiona. By using both fiona.BytesCollection and GeoDataFrame.from_features you can:

to read the raw data (as single-file formats or zipped shapefiles)

to wrap up all the data from 'b'

check the coordinate refereence system (crs) listed in the features

Open source

So far, we have abundantly benefitted from open-source software, data, tools, code, design documents, or content. It is only natural to open, share and use the results of 10 years excavation...

4.3 Assessing your data visually...¶
So you know what's in this dataset... Maybe you want to see it before you start querying or analysing it? ... Start by visualising the spatial data for all the contexts (stratigraphic units) from the excavation we'll be exploring.

So far you have dealt with survey data where all data has been logged with coordinates (x, y and sometimes z for the elevation height). However, it is not always possible, or even meaningful, to record the coordinates of all artefacts retrieved during an excavation, especially if it lasts over several years. It is then more appropriate to use stratigraphical units (SU). The spatial analysis of these finds is more challenging as they don't have a spatial location per se. We need to think carefully about organising and presenting them. Colour can be a great help for this!

In [ ]:

##codecell_SpatialPatterns_PlottingUrData # Now we have polygons, the shapes of our contexts. Let's visualise the data to double check that all is well # We'll use again the function .plot (see lab_Webmaps&Distributions) # 'plot' means draw me an image showing the geometry of each feature in my data. # We want to control things like the color of different types of features on our map. # I used the 'Blues' colorscale command (cmap stands for 'colour map') # and asked it to draw the polygons differently based on the type of feature. gabii_map1 = gabii_su_poly.plot(column='DESCRIPTIO', cmap='Blues', edgecolor='grey', figsize=(15, 15));

Learning a new language – decomposing the code ¶

In #codecell_SpatialPatterns_PlottingUrData, when using .plot(), when column=' ' is specifed, the plot colouring is based on its values. The colour scale is defined by using cmap='' , the edge of the features (our polygons) are defined by edgecolor='' and the size of the plot by figsize=(width, height)) .

you can of course add parameters/symbologies to your plots. The choice of parameters is largely dictated by your data analysis. Here is some documentation on plot generation.

The colorscale options are: Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, inferno, inferno_r, jet, jet_r, magma, magma_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, winter, winter_r

Swap out 'Blues' in the cell above for any of these options...

4.4 Loading the special finds¶
Like many excavations, not every special finds in this datset has spatial coordinates associated with it (because in real archaeology life things are found in the sieve, the wheelbarrow, and during washing).

In [ ]:

###codecell_SpatialPatterns_WhichTypeOfSpecialFinds&fromWhere? # Now I'm going to bring in all the basic Gabii special finds data - descriptions, object types, IDs and the contexts from which they come. # We've had a few special finds over the years. sf_su = pd.read_csv("https://raw.githubusercontent.com/ropitz/gabii_experiments/master/spf_SU.csv") sf_su

In [ ]:

###codecell_SpatialPatterns_WhichTypeOfSpecialFinds? #the set()function allows to return all values (here our special finds type) without duplicates #this is a useful tool when you need to standardise your finds labels (and check your metadata for spelling!) sf_su_desc = sf_su['SF_OBJECT_TYPE'] set(sf_su_desc) #however, this is a really long list to deal with, so we need to find ways to prepare this dataset to really see what has been happening on this site.

One of our area supervisors, Troy, is super excited about tools related to textile production. They're a great example of how we think about special finds at Gabii. Multiple types of finds are related to textile production. Do we find all types everywhere? Are certain types of tools more concentrated in one type of context or one area than others? Troy has lots of questions about the patterns of places where we find these tools. Do they provide evidence for early textile production? Are they a major factor in the city's early wealth? Do we find the same things in later periods? After all, people under the Republic and Empire wore clothes... Loom Weights, spools, and spindle whorls are the most common weaving tools at Gabii.

4.5 Preparing your data, a prerequisite to all analysis¶
4.5.1 Selection¶
As this data is not yet spatial and only associated with a stratigraphic unit, logically, we can merge our non-spatial special finds data with our spatial stratigraphic units data to make all our data spatial.

In [ ]:

###codecell_SpatialPatterns_SpecialFindsSelection #Let's pull all those find types out of the big list. #We're selecting the finds data we want to work with before merging with the spatial data. We could do these operations in reverse if we wanted to. #here very much like in lab1,#codecell_makeabasicmap_BringingUrData2theMap, & lab2, #codecell_Webmaps&Distributions_SplittingUrData, we are using iloc and isin functions types = ['Loom Weight','Spool','Spindle Whorl'] textile_tools = sf_su.loc[sf_su['SF_OBJECT_TYPE'].isin(types)] textile_tools #we now have a new dataframe containing only textile_tools

4.5.2 Listing and merging to become spatial¶
Presence or absence isn't everything. You may want to know how many of a certain type of find is present in a given area.

In [ ]:

###codecell_SpatialPatterns_TextileToolsListing # Now let's count up how many of these tools appear in each context (SU). # pd.value_counts() functioon returns a series containing counts of unique values. # So we can print out a list of the number of textile_tools in each SU next to that SU number. pd.value_counts(textile_tools['SU'].values, sort=True)

In [ ]:

###codecell_SpatialPatterns_TextileToolsBecomesSpatial #Then let's combine the special finds data with our polygons representing context with shape and a spatial location # We do this with a command called 'merge'. In lab2,#codecell_Webmaps&Distributions_MergingZeData, you have used pandas pd.merge() gabii_textools = gabii_su_poly.merge(textile_tools, on='SU') # very much like p.merge(), you have now created a new dataframe ('gabii_textools') by merging dataframe 'textile_tools' on= SU, the stratigraphical unit # let's have a look at the new dataframe using .head() to print out just the first few rows. gabii_textools.head()

4.5.3 Visual assessment¶

In [ ]:

###codecell_SpatialPatterns_SeeingTextileToolsinContext # If we want to see this result as a map, we just add the .plot command to the end of the dataframe's name # here .plot() symbology is expanded to transparency with 'alpha=' where value of 1 is complete opacity and 0 complete transparency gabii_textools.plot(column='SF_OBJECT_TYPE', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5)

But what do you see really?¶
OK, what do you see here? Compare the distribution of each type of textile tool. Do some types seem to be concentrated in certain areas? How might you check? What factors might contribute to this pattern? Do big layer simply aggregate lots of stuff? Do late dumps contain early materials? Why would one type of tool appear where the others don't?

4.5.4 Sorting and tidying¶

In [ ]:

###codecell_SpatialPatterns_SortingDataTextileTools # We can try and see the relationship between layer size and count by sorting #our list of finds by the surface area of each layer. # We use the command 'sort_values' gabii_textools.sort_values(by=['Shape_Area'],ascending=False) # '.sort_values' function sort along their axis (here the axis is defined by 'Shape_Area' ). # the default sorting is on ascending values (smallest to largest) if you are happy with this =True, however, we want to see them in descending order, so we select =False

Knowing your site and refining your analysis¶
Gabii excavations have revealed that there are enormous colluvial layers. This is an important consideration as these large areas will contribute to a bias distribution of the artefacts across the site. Therefore, very large areas should probably be excluded.

In [ ]:

###codecell_SpatialPatterns_RefiningDataSorting # Outliers will mess with any analysis. Here large stratigraphical layer are our outliers # By cutting out these layers i.e. excluding SUs with a surface area greater than 800 we can deal with these outliers gabii_textools2 = gabii_textools.loc[gabii_textools['Shape_Area']<800]

In [ ]:

###codecell_SpatialPatterns_VisualisingDataSorting # If we want to see this result as a map, we just add the .plot command to the end again. gabii_textools2.plot(column='SF_OBJECT_TYPE', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5) # That's better. Plot the results to see that you've removed the big colluvial layers.

Grouping and merging further¶
to answer to question: how many of each tool type appears in each SU? You will need to further group and merge your data.

In [ ]:

###codecell_SpatialPatterns_GroupingData # OK, count up how many of each tool type appears in each SU using the 'groupby' command. # You have used this command before in #codecell_Webmaps&Distributions_SplittingUrData_CreateLayers ## and .fillna() was explained in codecell_Webmaps&Distributions_AllNumbers textools_counts = gabii_textools2.groupby('SU')['SF_OBJECT_TYPE'].value_counts().unstack().fillna(0) # Sort the list so that the SUs with the most stuff end up at the top. textools_counts.sort_values(by=['Loom Weight','Spindle Whorl','Spool'], ascending=False)

In [ ]:

###codecell_SpatialPatterns_MergingData # Merge your textile tool counts with your spatial data for the contexts # Because both dataframes have a 'SU' column, you can use this to match up the rows. # so the merger will occur on='SU' gabii_textools_counts = gabii_su_poly.merge(textools_counts, on='SU') gabii_textools_counts.head()

4.5.5 Visual assessment: exploring your data¶
Side by side plots of different variables can help you to visualize the differences between the spatial patterns you're exploring. Very much like in lab2_MakeaBasicMap, when you compared Late Roman and Middle Roman artefact distributions using two heatmaps side-by-side.

In [ ]:

###codecell_SpatialPatterns_AssessingFindType # Let's start by looking at each class of textile tool individually. # Plot the counts of each type of find spatially gabii_textools_counts.plot(column='Loom Weight', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Loom weight",'orientation': "vertical"}) gabii_textools_counts.plot(column='Spindle Whorl', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Spindle Whorl",'orientation': "vertical"}) gabii_textools_counts.plot(column='Spool', cmap='Accent', figsize=(15, 15), legend=True, alpha=0.5, legend_kwds={'label': "Number of Spool",'orientation': "vertical"})

In [ ]:

###codecell_SpatialPatterns_PlottingAllFindType # here's another visualisation. I've chosen a single colour scale - so shades of red, shades of blue, shades of green, to show the quantity of each type of find in a single map. base = gabii_textools_counts.plot(column='Loom Weight', cmap='Blues', figsize=(15, 15), alpha=0.7) gabii_textools_counts.plot(ax=base, column='Spindle Whorl', cmap='Reds', alpha=0.7) gabii_textools_counts.plot(ax=base, column='Spool', cmap='Greens', alpha=0.7);

Let's get another library to help visualisation¶
So far, it has been difficult to see what's happening, to identify activities between the buildings and to compare the maps when we have to scroll.

In [ ]:

##codecell_SpatialPatterns_ImportUrLibraries import matplotlib.pyplot as plt

In [ ]:

###codecell_SpatialPatterns_AllFindTypeSidebySide # Let's put the maps side by side to help with comparative visualisation. fig, axes = plt.subplots(ncols=3,figsize=(15, 5)) gabii_textools_counts.plot(column='Loom Weight', cmap='autumn', ax=axes[0], legend=True, legend_kwds={'label': "Number of Loom weight",'orientation': "vertical"}).axis('equal') gabii_textools_counts.plot(column='Spindle Whorl', cmap='autumn', ax=axes[1], legend=True, legend_kwds={'label': "Number of Spindle Whorl",'orientation': "vertical"}).axis('equal') gabii_textools_counts.plot(column='Spool', cmap='autumn',ax=axes[2], legend=True, legend_kwds={'label': "Number of Spool",'orientation': "vertical"}).axis('equal')

4.5.6 Questioning your maps¶
Can you see any patterns here? Do the different types of tools concentrate in the same parts of the site? Why might different types of tools have different distributions?

~ warning ~ ¶

OK, this next cell is a big scary cell is because something has broken after I drafted this exercise. Push run to fix the thing they've broken (hopefully).

In [ ]:

#@title !apt-get install -qq curl g++ make #@title !curl -L http://download.osgeo.org/libspatialindex/spatialindex-src-1.8.5.tar.gz | tar xz #@title import os os.chdir('spatialindex-src-1.8.5') #@title !./configure #@title !make #@title !make install #@title !pip install rtree #@title !ldconfig #Working through the example at http://toblerity.org/rtree/examples.html #@title from rtree import index from rtree.index import Rtree #@title p = index.Property() idx = index.Index(properties=p) idx

4.6 Quantifying these patterns¶
4.6.1 Using statistics to explore, characterize and quantify spatial patterns¶
There is a human limitation to recognise more than 3+ attributes/variables out of 100 point data such as this four-dimensional tic-tac-toe grid:

That's why we have to use statistics, spatial statistics to recognise and more to the point quantify patterns.

=

In [ ]:

##codecell_SpatialPatterns_ImportUrLibraries # I think the distributions of different weaving tools vary. # To investigate further, we are going to need more tools. Specifically we need statistical tools. # pysal, numpy and sklearn are all useful for statistics. Seaborn is useful for visualisation. !pip install pysal import pysal from sklearn import cluster import seaborn as sns import numpy as np

4.6.2 Data Clustering¶
We're going to use cluster analysis to try and better understand our patterns. Clustering is a broad set of techniques for finding groups within a data set. Cluster analysis has as its objective grouping together similar observations (unlike factor analysis works by searching for similar variables).

Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. The clustering algorithm tests the hypothesis that data points of the same group have similar properties, and that data points in different groups have dissimilar properties. So, when we cluster observations, we want items in the same group to be similar and items in different groups to be dissimilar.

M.Fortin & M. Dale (2005) explain that

" Arising from time series statistics and the more familiar parametric statistics, spatial statistics quantify the degree of self-similarity of a variable as a function of distance. These spatial statistics assume that, within the study area, the parameters of the function defining the underlying process, such as the mean and the variance, are constant regardless of the distance and direction between the sampling locations. Then the goal of spatial statistics is to test the null hypothesis of absence of ‘spatial pattern’. The null hypothesis implies that nearby locations (or attributes, measures) do not affect one another such that there is independence and spatial randomness (Figure 6.2(b)). The alternatives are that there is clustering and thus positive spatial autocorrelation (Figure 6.2(a)) or repulsion and negative spatial auotocorrelation (Figure 6.2(c))

K-Means clustering is probably the most well-known clustering algorithm that solve clustering problem by splitting a dataset into a set of k (k being an arbitrary number you get to choose) groups.

We can only recommand Ben Alex Keen blog to see how Python works through clustering! Here is a simple version of K-Means Clustering explained in 5 visual steps:

Step1 Step2 Step3 Step4 Step5

Assign each points to similar centre: we need to identify the number of classes to use by looking at the data and identifying any discrete groupings. Identify the cluster centroids (3 coloured symbols on graph). Reassign the points based on the minimum distance or closest from cluster centroids. Here we should emphasise that there many possible definitions that may be used for “closest” (e.g.nearest neighbour, Ward's method). Identify the new centroids by taking the average of all points in the cluster. Reassign the groupings -points and assignment- until points stop changing clusters in a loop.

Because clustering allows us to identify which things are alike on the basis of multiple characteristics, we will do just that. We will use cluster analysis to see which SUs are similar in terms of all three types of textile tools: loom weights; spools; spindle whorls. In #codecell_SpatialPatterns_ImportUrLibraries, we import a cluster tool from sklearn to help us with it.

Clustering types of textile tools¶

Let's use python to run K-means function

First, we want to cluster together contexts where the pattern of the three types of textile tools are similar.

In [ ]:

##codecell_SpatialPatterns_ClusterTextileTools # Next step: cluster together contexts where the pattern of the three types of textile tools are similar, # with and without respect to the size of the SU. # In this cell we are including information on the size of the SU # we will use three functions .cluster.Kmeans, .fit() and .drop(): km5 = cluster.KMeans(n_clusters=5) km5cls = km5.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU'], axis=1).values) km5cls

Learning a new language – decomposing the code ¶

In #codecell_SpatialPatterns_ClusterTextileTools, by:

using .cluster.Kmeans() function,

we have arbitrarily given k (n_clusters) an arbitrary value of 5 groups

* using .fit() function, we wanted to estimate the best representative function for the data points on account for the size of the context (*'Shape_Area'*) and counts of different types of tools(*Loom Weight, Spindle Whorl, Spool*). * using .drop() function Drop all the other fields (*'geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU'*).

Let's visualise this first step¶
Each cluster produced should contain the SUs that are similar to one another on the basis of the number of each type of textile tool and the size of the surface area of the SU.

In [ ]:

##codecell_SpatialPatterns_SubPlotClusterTextileTools # Plot the clusters, groups of contexts that have similar textile tool assemblages. # Give a different colour to the SUs that belong to each cluster. #creating a single chart f1, ax = plt.subplots(1, figsize=(15,15)) #plt.subplots() is a function that returns a tuple (=sequence of immutable Python objects like a list) #this tuple contains a figure and axes object(s) #so, when using fig, ax = plt.subplots() you unpack this tuple into the variables fig and ax. #here f1 is your figure #here we want to assign() our new clustering labels 'km5cls.labels_'to the dataframe #and also plot the cluster ('cl'). (nb: that ax=ax are the features/objects here) gabii_textools_counts.assign(cl=km5cls.labels_)\ .plot(column='cl', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax) ax.set_axis_off() #turn x and y-axis off which will affect the axis lines, ticks, ticklabels, grid and axis labels #let's see the plot using command plt.show() plt.show()

In [ ]:

##codecell_SpatialPatterns_ClusterTextileTools_2 #Do the same as in ##codecell_SpatialPatterns_ClusterTextileTools #but let's ignore the size of the context so we also .drop('Shape_Area'). km5 = cluster.KMeans(n_clusters=5) km5cls2 = km5.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU','Shape_Area'], axis=1).values) #we plot as in ##codecell_SpatialPatterns_SubPlotClusterTextileTools #this time a'f2' and our new cluster 'cl2' f2, ax = plt.subplots(1, figsize=(15,15)) gabii_textools_counts.assign(cl2=km5cls2.labels_)\ .plot(column='cl2', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax) ax.set_axis_off() plt.show()

4.7 Interpreting the results: the cluster way¶

4.7.1 Visualising clustering analysis differences¶
The patterns are definitely different. How can we interpret the fact that context size affects the pattern of the distribution of textile tools? Do big units, which perhaps represent dumps or colluvial mashups, have a fundamentally different character than the varied small contexts?

In [ ]:

##codecell_SpatialPatterns_SubPlotClustersSidebySide # Look at the difference with and without context size taken into account. #we are plotting just like in##codecell_SpatialPatterns_SubPlotClusterTextileTools #however we want to see them side by side to see effect of our selection. #To do so, we define our plt.subplots as having 2 columns (ncols=2) fig, axes = plt.subplots(ncols=2,figsize=(15, 5)) gabii_textools_counts.assign(cl2=km5cls2.labels_)\ .plot(column='cl2', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=axes[0]).axis('equal') gabii_textools_counts.assign(cl=km5cls.labels_)\ .plot(column='cl', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=axes[1]).axis('equal') # note here that we have to 2 axes for our to feature classes: # ax=axes[0] for the first subplot and ax=axes[1] for the second subplot

4.7.1 Characterising our cluster analyses¶
There are many ways to label your data in matplolib, here are some examples.

In [ ]:

##codecell_SpatialPatterns_AddLabelClusterstoDataframe # assign the cluster IDs to each context permanently gabiitextools_clas = gabii_textools_counts.assign(cl=km5cls.labels_) gabiitextools_class = gabiitextools_clas.assign(cl2=km5cls2.labels_) #and let's check how your dataframe looks like gabiitextools_class.head()

4.7.2 Data interpretation: push your cluster analyses further¶
Data exploration can help us to understand how well the cluster analysis performed.

Let's compare the effect of area size to one cluster group/class¶

In [ ]:

##codecell_SpatialPatterns_SubselectClustergroupN.0 # Now let's look at some individual classes, with and without context size accounted for in the analyses. # We choose the cluster group/class 0 # we use .loc[] command (see##codecell_makeabasicmap_BringingUrData2theMap & #codecell__Webmaps&Distributions_SplittingUrData) # and create 2 dataframes: # With=> [gabiitextools_class['cl']where we only select (==) group 0 # Without=> [gabiitextools_class['cl2']where we only select (==) group 0 gabiitextools_class0=gabiitextools_class.loc[gabiitextools_class['cl']==0] gabiitextools_class0noarea=gabiitextools_class.loc[gabiitextools_class['cl2']==0] #we are plotting just like in##codecell_SpatialPatterns_SubPlotClustersSidebySide fig, axes = plt.subplots(ncols=2,figsize=(15, 5)) gabiitextools_class0.plot(ax=axes[0], legend=True).axis('equal') gabiitextools_class0noarea.plot(ax=axes[1]).axis('equal')

Effect of changing cluster numbers¶

In [ ]:

##codecell_SpatialPatterns_ClusterTextileTools_7types_WithContextSize # What happens when we change the number of clusters (groups)? # let's repeat our first analysis (##codecell_SpatialPatterns_ClusterTextileTools) # and let's keep the context size ('Shape_area') km7 = cluster.KMeans(n_clusters=7) km7cls3 = km7.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU'], axis=1).values) f3, ax = plt.subplots(1, figsize=(15,15)) gabii_textools_counts.assign(cl3=km7cls3.labels_)\ .plot(column='cl3', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax) ax.set_axis_off() plt.show()

That also changes things. Without going into too much detail, finding the ideal number of clusters is challenging. There are many statistical techniques to help you choose the best number of clusters. However, this is for a more advanced stats course! For now, just try playing around with the number of clusters in the notebook, or the size cut-off for inclusion.

In [ ]:

##codecell_SpatialPatterns_ClusterTextileTools_7types_WithoutContextSize # Use 7 clusters folowing the same procedure than in ##codecell_SpatialPatterns_ClusterTextileTools_7types_WithContextSize # and let's drop the context size ('Shape_area') km7 = cluster.KMeans(n_clusters=7) km7cls4 = km7.fit(gabii_textools_counts.drop(['geometry', 'OBJECTID','DESCRIPTIO','Shape_Length','SU','Shape_Area'], axis=1).values) f4, ax = plt.subplots(1, figsize=(15,15)) gabii_textools_counts.assign(cl4=km7cls4.labels_)\ .plot(column='cl4', categorical=True, legend=True, \ linewidth=0.1, cmap='Accent', edgecolor='white', ax=ax) ax.set_axis_off() plt.show()

In [ ]:

##codecell_SpatialPatterns_AddLabelto7ClusterGroups # assign the cluster IDs to each context permanently just like in ##codecell_SpatialPatterns_AddLabelClusterstoDataframe # Let's set up to investigate some of the individual clusters gabiitextools_class3=gabiitextools_class.assign(cl3=km7cls3.labels_) gabiitextools_class4=gabiitextools_class3.assign(cl4=km7cls4.labels_) gabiitextools_class4.head()

Let's compare the effect of changing cluster number (from 5 to 7) and area size to one cluster group/class¶

In [ ]:

##codecell_SpatialPatterns_EffectofClusterNumber&AreaSizetoClustergroupN.0 #we are following the same step than in ##codecell_SpatialPatterns_SubselectClustergroupN.0 but with our 7 cluster groups/classes # set up variables to store several classes, with and without context size taken into account. gabiitextools_class0=gabiitextools_class4.loc[gabiitextools_class4['cl']==0] gabiitextools_class0noarea=gabiitextools_class4.loc[gabiitextools_class4['cl2']==0] gabiitextools_k7_class0=gabiitextools_class4.loc[gabiitextools_class4['cl3']==0] gabiitextools_k7_class0noarea=gabiitextools_class4.loc[gabiitextools_class4['cl4']==0] # setting the subplots to the four dataframes, we have just identified above # this can be done by arranging the subplots like a 2X2 table/array with 'ncols=2' & 'nrows=2' fig, axes = plt.subplots(ncols=2,nrows=2,figsize=(15, 10)) gabiitextools_class0.plot(ax=axes[0,0]).axis('equal') #note here the ax=axes defines the position of the figure in this 2X2 array (in this case is cell column 0, row0) axes[0,0].set_title(' 5 clusters - area') #for ease, we have added a title to the subplots gabiitextools_class0noarea.plot(ax=axes[0,1]).axis('equal') axes[1,0].set_title(' 5 clusters - no area') gabiitextools_k7_class0.plot(ax=axes[1,0]).axis('equal') axes[0,1].set_title('7 clusters - area') gabiitextools_k7_class0noarea.plot(ax=axes[1,1]).axis('equal') axes[1,1].set_title('7 clusters - no area')

Let's compare the effect of changing cluster number (from 5 to 7) and area size to other cluster groups/classes¶

In [ ]:

##codecell_SpatialPatterns_EffectofClusterNumber&AreaSizetoClustergroupN.3 # now try some different cluster groups # we are choosing cluster no.3 this time #we are following our previous steps gabiitextools_class3=gabiitextools_class4.loc[gabiitextools_class4['cl']==3] gabiitextools_class3noarea=gabiitextools_class4.loc[gabiitextools_class4['cl2']==3] gabiitextools_k7_class3=gabiitextools_class4.loc[gabiitextools_class4['cl3']==3] gabiitextools_k7_class3noarea=gabiitextools_class4.loc[gabiitextools_class4['cl4']==3] fig, axes = plt.subplots(ncols=2,nrows=2,figsize=(15, 10)) gabiitextools_class0.plot(ax=axes[0,0]).axis('equal') axes[0,0].set_title('5 clusters - area') gabiitextools_class0noarea.plot(ax=axes[0,1]).axis('equal') axes[1,0].set_title(' 5 clusters - no area') gabiitextools_k7_class0.plot(ax=axes[1,0]).axis('equal') axes[0,1].set_title('7 clusters - area') gabiitextools_k7_class0noarea.plot(ax=axes[1,1]).axis('equal') axes[1,1].set_title('7 clusters - no area')

Learning a bit more about Cluster analysis ¶

Cluster analysis is an important statistical technique. While not the main focus of this course, it's worth learning more about it.

We have used K-Means clustering because it presents the advantage to be fast, easy to understand and fairly robust and efficient; however, it has disadvantages too, such as the number of groups/classes selection (and as you have seen which isn’t always inconsequential; ideally, you'd want the algorithm to figure this out). Furthermore, K-Means clustering cannot deal with non-linear dataset and if there are highly overlapping data then k-means will not be able to resolve the differentiation between clusters. Best results can be achieved when datasets are distinct or well separated from each other.

Reminder: linear versus non-linear datapoints:

Assessing spatial patterns: Degree of Clustering Degree of Clustering is a measure of the degree to which nodes in a graph tend to cluster together. You can follow this video to understand the concept behind clustering coefficient and below presents various degree of clustering results.

**Assessing spatial patterns: Cluster Validity**

K-means is not the only clustering technique. I encourage you to do some independent reading on different approaches to clustering. You may want to have a look at this to start off.

4.8 Interpreting the results: Using association and relationships¶
Although we can see differences, it is difficult to:

see them all

quantify them

express them

Let's turn to Statitics to do this for us.
In #codecell_SpatialPatterns_ImportUrLibraries, you have imported seaborn library to do just that.

Learning a bit more about Correlation ¶

Correlations are statistical associations between two variables. When we say that two variables are associated, we are suggesting that they vary (change) together. This implies that the processes that affect these variables may be related. We try to detect correlation to understand the relationship between two variables.

For example: should we expect the pattern of loom weights to be similar to the pattern of spindle whorls? should we expect them to be correlated?

Note that: Correlation does not imply causation! That is, the pattern of loom weights does not cause or create the pattern of spindle whorls.

* Correlation tests describe the statistical relationship between two different variables, in contrast to comparison tests which evaluate the differences between variables (e.g. t-test, sign test, Mann-Whitney). * Another important characteristic of the correlation statistics is that they do not only identify relationships but they characterise both the strength and the form of these relationships. * The robustness of the correlation calculations is, again, evaluated in terms of statistical significance using the p-value.
The most frequently used Parametric correlation statistic is the Pearson’s correlation r. Pearson’s r tests if there is a linear relationship between two variables, and can only range between -1 and 1, where -1 indicates perfect negative linear (one to one) correlation and +1 indicates the perfect positive linear correlation (see Figure below).

Examples of Pearson's correlation coefficient (r) for different degrees of linearity

Note that besides the sign of the correlation (positive or negative) the key term here is the term linear. We can only argue about linear relationships when we apply **parametric** correlation. A general **Non-Parametric** correlation statistic is the Spearman’s ρ (rho) correlation.

Examples of Pearson's correlation coefficient (r) for different degrees of linearity

**Spatial covariation and implied correlation**

Examples of positive and negative covariance

As you can see, a fundamental concept is hidden behind these types of statitistical tests, and, in fact behind all statistics, which is parametric vs. nonparametric distributions... you can start finding more about this important set of statistical concepts [here]( https://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-correspondence-table-for-non-parametric-and-parametric-tests) and [here]( https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/).

4.8.1 Effect of Cluster numbers¶
We will be using seaborn.pairplot to visualise the relationships between our variables. A pairs plot allows us to see both distribution of single variables and relationships between two variables. You can learn a bit more about it in Will Koehrsen's blog.

In [ ]:

##codecell_SpatialPatterns_EffectofClusterNumber # Do 7 clusters vs. 5 result in more correlation between the spatial distribution patterns of pairs of find types? # we are simply using sns.pairplot function to visualize relationships (correlations) between the find types and the clusters they appear in. # We are asking: do certain pairs of variables appear in the same clusters most of the time? sns.pairplot(gabiitextools_k7_class0.drop(['OBJECTID','DESCRIPTIO','Shape_Length','Shape_Area','SU','geometry','cl','cl2','cl3','cl4'], axis=1), kind="reg") plt.show()

4.8.2 Effect of Cluster numbers-II¶

In [ ]:

##codecell_SpatialPatterns_EffectofClusterNumber # Are some cluster classes/groups more correlated than others? sns.pairplot(gabiitextools_class0.drop(['OBJECTID','DESCRIPTIO','Shape_Length','Shape_Area','SU','geometry','cl','cl2','cl3','cl4'], axis=1), kind="reg") plt.show()

4.9 What can you conclude?¶
Is there strong positive or negative correlation between pairs of variables (textile tools and the clusters they end up in)? What might this say about the processes behind these patterns?

Statistically assessing correlation is a good starting point for understanding patterns in your data.

Because this isn't a stats course, we're not taking this forward any further today. But it's worth noting that if you want to do spatial analysis properly at some stage you will need to learn a bit about statistics.

That concludes this practical

Hopefully you have:

started thinking (and perhaps are a bit confused) about how spatial patterns of different types of finds are created, and how we can interpret them when studying data from an excavation.

learned to combine spatial data and descriptive tables.

learned to use some basic clustering tools, and reinforced your knowledge about how to make charts and maps.

4.10 LexiCode¶
To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:

libraries | | | --- |--- | --- | folium | numpy | branca| rtree | pandas| osmnx| geopandas| requests | seaborn | fiona| matplotlib.pyplot | ipywidgets| pysal |seaborn |

plugins

HeatMapWithTime HeatMap MeasureControl PrepareUrBasemaps_CreateLayers from [folium.plugins] cluster (from sklearn)

your lexicode is non-exhaustive, keep expanding, find your own 'best way' to reuse these code/scripts...

Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions |Lexicode_StreetGridOrientations | Lexicode_SpatialPatterns --- | --- | ---|---| == () [] | pd.concat() | { } subselection from list| .head_csv() | .dtype() | ox.gdf_from_places()|requests.get()|requests.get() .read_csv() | astype() | ox.plot_shape()|request.content() mean() | fillna()|network_type= ''|.bytes() folium.Map | def return |ox.add_edge_bearings(ox.get_undirected())|gpd.GeoDataFrame.from_features() range() | .apply(lambda x:function,axis=) |count_and_merge()|Set() len() | pd.merge() |np.arrange()|pd.value_counts() iloc[]| how= , left_index= ,left_index= |np.histogram()|.merge() .value_counts()| gpd.GeoDataFrame()| ax.set_theta_location()|.sort_values if =:| geometry=gpd.points_from_xy |ax.set_ylim()|cluster.KMeans() elif =: |print() |ax.set_title()|.fit() else =:| .isin()|ax.set_yticks()|.drop() folium.Marker()| classic.plot()|ax.set_xlabels() & ax.set_yticklabels|.assign() folium.Icon()| generateBaseMap()|plt.subplots()|plt.show() folium.Circle| .groupby(['', ''])|.dropna()|.set_title popup= | .reset_index() |polar_plot()|sns.pairplot() radius= | max_zoom= |pd.Series()| .values.tolist() |folium.TileLayer()|np.pi| .add_to()| plugins.DualMap(location= , tiles= , zoom_start= )| |

4.11 References¶
Fortin, M., & Dale, M. (2005). Spatial Analysis: A Guide for Ecologists. Cambridge: Cambridge University Press.

Chapter 5: Land use and archaeological visibility¶
5.1 Introduction: Exercise motivations¶
In this exercise, we'll be relating learning to access information about the environment or landscape in a raster format. We'll then be relating the locations of visible sites to various environmental factors.
So far in this course you've been working mostly with vector data (points, lines, polygons) but now you're going to work with rasters as well. Environmental data is often held in a raster format because the data is continuous (see Raster versus Features below for more details). Elevation is everywhere, right? So if you have a big image file (a raster) you can have a value for elevation in pretty much every pixel of that file.

Therefore this week practical lab aims to use rasters to explore (and investigate of course):

environmental properties of landscapes using image rasters

physical properties of landscapes using attribute rasters

5.2 Rasters versus Features¶
In #codecell_StreetGridOrientations_geodataframe, we have started to talk about vector feature data and why topology characterisation was important in geographical information systems. Vectors are not the only way to represent the world and what's in it, raster is another format used ... because:

Different representation is more suitable for different types of data

Different representation is more suitable for different types of data processing and analysis

|| --- |--- | |
|
|
|

There are different types of raster data Image rasters and attributes rasters as illustrated below:

|| --- |--- | Image raster (geology.com)|Attributes rasters (met.office)|
|
|

A raster format file is composed of cells where each each cell have a pixel value (=picture element and relates to resolution & to their position on the surface) and a grid cell value (size in units) and each cell has attributes stored in the array. For instance, this attribute can relate to spectral reflectance (ranging from 0-255), colour (RGB is 3 attributes) or numeric values (which can be associated with a full description through a join with a Look-Up_Code -LU_code- such as a reference tables for Land Use map).

|Raster Characteristics|Example of a Raster Array composition| |--- |--- | |

|

Raster Chararcteristics

Space divided into cells of the same size (tessellation)- usually square

Size of cell determines resolution

Feature location defined by position within grid (row, column) - usually implicit

Cells homogenous with respect to attribute

Grids represent single attribute

=

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries !pip install rasterio !pip install elevation !pip install richdem

In #codecell_RasterLandscape_ImportUrLibraries, we are using reference data for this practical which include:

rasterio reads and writes raster formats to manipulate them and creates new ones (rasterio ref);

richdem are sets of digital elevation model or digital representaion of elevation heights in a three-dimensional space/landscape (richdem ref);

elevation provides the height data (pypi ref & elevation intensive tutorials).

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries import rasterio # import the main rasterio function #read, read-write, and write access to raster data files import matplotlib # We have use this library lab4SpatialPatterns # matplotlib is the primary python plotting and viz library %matplotlib inline # this bit of magic allows matplotlib to plot inline in a jupyter notebook # We can check which version we're running by printing the "__version__" variable print("rasterio's version is: " + rasterio.__version__) print(rasterio)

5.3 Raster dataset to inform on the landscape environmental properties: Soil Type¶

|Implications of spatial detail of land use / land cover? Implications of temporal detail of land use / land cover mapping episodes?|Land cover indicates the physical land type such as forest or open water whereas land use documents how people are using the land| |--- |--- | |

| Why do archaeologists care about land cover and land use? - Links past and present human activity to vegetation and soil conditions Supports spatially explicit modelling of past environments - Provides insights into archaeological preservation and visibility|

5.3.1 Let's open a raster image¶

In [ ]:

##codecell_RasterLandscape_OpenImage # filepath to our image img_fp = 'http://ropitz.github.io/digitalantiquity/data/LE70220492002106EDC00_stack.gtif' # Open a geospatial dataset # open() handles your file # this function allows you to open a file and return it as a file object dataset = rasterio.open(img_fp) print(dataset)

5.3.2 Which kind of Raster it is?¶

Let's learn some basic things about the image data.

In [ ]:

##codecell_RasterLandscape_ImageCharacteristics # what is the name of this image img_name = dataset.name print('Image filename: {n}\n'.format(n=img_name)) # How many bands does this image have? num_bands = dataset.count print('Number of bands in image: {n}\n'.format(n=num_bands)) # How many rows and columns? rows, cols = dataset.shape print('Image size is: {r} rows x {c} columns\n'.format(r=rows, c=cols)) # Does the raster have a description or metadata? desc = dataset.descriptions print('Raster description: {desc}\n'.format(desc=desc)) # What driver was used to open the raster? driver = dataset.driver print('Raster driver: {d}\n'.format(d=driver)) # What is the raster's projection? proj = dataset.crs print('Image projection:') print(proj, '\n') # What is the raster's "geo-transform" gt = dataset.transform print('Image geo-transform:\n{gt}\n'.format(gt=gt)) # Does the raster have a metadata? metadata = dataset.meta print('All raster metadata:') print(metadata) print('\n')

Learning a new language – decomposing the code ¶

In #codecell_RasterLandscape_OpenImage, rasterio allows you to open the raster via the function .open() and returns it as Dataset object.

In #codecell_RasterLandscape_Characteristics, you wanted to know a bit more about this raster, in particular its characteristics & metadata by calling information on:

the image name, the number of bands and the image size which is a description of the array and its number of rows and columns:

the description of raster's file format. here, this is a description of the 8 bands (see above from the command dataset.count):

the image's projection:

EPSG:32615 - is world projection used principally for the north Hemisphere - 96oW to 90oW (red box in image below) & in large and medium scale topographic mapping and engineering survey. This [webpage](https://spatialreference.org/ref/epsg/wgs-84-utm-zone-15n/) provides you with other possible formatting/coding option to use this projection in other software.

by calling .dataset.transform, you are asking how the orignal image pixels have been transformed to the projected coordinates:

what does this mean?

You can see in the answer from .dataset.transform that is very much like what you've experienced in lab3_StreetGrid_Orientation with the north-up transformation when we needed to rearrange the street axes to run northward (*##codecell_StreetGridOrientations_RearrangeAxes*) and what you have seen in lab2 especially in *#codecell__Webmaps& Distributions_CheckingZeData*). In our case, the results are showing an affine transformation:

...but, there are other way to transform an image, this could be by using conformal or polynomial transformation for a 2D projection(x&y axes) and you see below what 2D transformation can do to your image (if you want a more detailed explanation on how and why coordinates are transformed [here]( https://kartoweb.itc.nl/geometrics/Coordinate%20transformations/coordtrans.html ) :

Finally, we are calling for the raster metadata, which is the important part of your data that you will need to provide:

5.3.3 Image raster bands: let's visualise¶
That's is the fun part...

Learning a bit more about image bands ¶

But what are image bands?

Raster dataset contains one or more layers called bands. For instance, a colour image can have three bands using RBG, with 1 band Red,1 band Green, and 1 band Blue:

An image can also be a multispectral image with many bands such as remote sensing data which is a recombination of the electromagnetic spectrum to create a multiband raster data. This is where every cell location has more than one value associated with it. You might have heard of Near Infrared (NIR), well these raster images have 8 bands and they come with each band as a separate file (bands 2,3,4 and 8 which are Blue, green, red, and NIR respectively) (see more here). Infrared (and ultraviolet) are that bit of the spectrum we cannot see with our own eyes.

*You can learn more how band combinations works in [this](http://biodiversityinformatics.amnh.org/interactives/bandcombination.php) interactive platform using 3 types of remotely sensed images (MODIS, ASTER and Landsat)*.

And, an image can also have one band can be a digital elevation model (DEM) where the cell holds elevation values.

The rasterio dataset object we've created contains a lot of useful information but it is not ready to be read in the raster image. Instead, we will need to access the raster's bands using the read() ) method:

In [ ]:

##codecell_RasterLandscape_CheckShapOfBands # Let's open fourth band in our image # read() function that allow us to specify a subset of the raster band in the brackets () nir = dataset.read(4) # Using function .shape we want to check out the dimensions of the image nir.shape # #same as in ##codecell_RasterLandscape_ImageMetadata # by using .read() for all bands, the answer should have been (8, 250, 250) # .shape answer provides you with # bands, rows, cols # but images can be very very large files, so it is better to do this bit by bit

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries #we get numpy again # because raster tends to be large(heavy) files, #numpy allows to represent the data very efficiently without using a lot of computer memory import numpy as np

In [ ]:

##codecell_RasterLandscape_BandsDataType # What are the band's datatypes? # remember we have used the dtype function in lab2Webmaps&Distribution datatype = dataset.dtypes print('Band datatypes: {dt}'.format(dt=datatype)) # How about some band statistics? band_mean = np.mean(nir) band_min = np.amin(nir) band_max = np.amax(nir) band_stddev = np.std(nir) print('Band range: {minimum} - {maximum}'.format(maximum=band_max, minimum=band_min)) print('Band mean, stddev: {m}, {s}\n'.format(m=band_mean, s=band_stddev))

Learning a new language – decomposing the code ¶

In #codecell_RasterLandscape_BandsDataType, we are asking numpy (np) to:

tell us which ( datatype ) are the bands. the answer is 'int16' which is an integer of a storage capacity of 16 bit. Each type of integer has a different range of storage capacity:

Type -- Capacity

Int16 -- (-32,768 to +32,767)

Int32 -- (-2,147,483,648 to +2,147,483,647)

Int64 -- (-9,223,372,036,854,775,808 to +9,223,372,036,854,775,807)

provide us with the mean ( np.mean() ), minimum ( np.amin() ), maximum ( np.amax() ) and its standard deviation ( np.std() ). So, this is just like asking how the data (band no.4) is spread really:

and how the data is distributed:

In [ ]:

##codecell_RasterLandscape_ImageDisplay #read in the image full_img = dataset.read() #we are checking the .shape of the image (do I have all my bands? See ##codecell_RasterLandscape_CheckShapOfBands) full_img.shape # we are adding a wee bit of rasterio library here # import the show function which allows us to display the image from rasterio.plot import show #then print the nir band with the whole image shape #we would like to have a grayscale map (cmap see also ##codecell_SpatialPatterns_PlottingUrData) print("Image dimensions: ", full_img.shape) show(nir, transform=dataset.transform, cmap='gray')

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries #more tools import matplotlib # matplotlib is the primary python plotting and viz library import matplotlib.pyplot as plt # this bit of magic allows matplotlib to plot inline in a jupyter notebook #see also lab4_spatial patterns %matplotlib inline import folium # folium is an interactive mapping library

5.4 Raster dataset & physical properties of landscape: Vegetation¶
Some raster datasets will only have one band. These behave a bit like a black and white image. Rasters can contain as many bands as you like, and single band images can be combined into multiple band images (see above in But what are image bands?). Multiple band images are useful because you can calculate ratios between the different bands to enhance the visibility of certain features in your image.

5.4.1 Let's visualise all bands¶

In [ ]:

##codecell_RasterLandscape_RasterCombiningBands # to make a multiple band raster add all the file paths to a list that we call s2_bands s2_bands = [ "http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B02.tiff", "http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B03.tiff", "http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B04.tiff", "http://ropitz.github.io/digitalantiquity/data/sentinel-2/2018-10-13_Sentinel-2BL1C_B08.tiff" ] # open these files and add all bands together # and you are setting an argument 'for' 'in' 'with' 'as' # in new dataframe name for this list => arrs [] # in which we are appending the files arrs = [] for band in s2_bands: with rasterio.open(band) as f: arrs.append(f.read(1)) # convert the list of tiffs(our multiuple bands appended into 1 list = arrs) to a numpy array # if you forgot what was an array you simply revisit decomposing the code of #codecell_Webmaps&Distributions_AllNumbers sentinel_img = np.array(arrs, dtype=arrs[0].dtype) # let's check the shape of this array sentinel_img.shape

In [ ]:

##codecell_RasterLandscape_MultibandRasterClipped&View #clip to get a smaller area and show the multiple band image data #this is done by setting the extents of the y and x axis to the desired (here smaller)extent #using our subset special symbol [] clipped_img = sentinel_img[:, 0:750:, 0:1500] clipped_img.shape show(clipped_img[[2,1,0], :, :]) #we only asked to plot the first 3 bands RGB [3,2,1]

~ Learning a new language – decomposing the code ¶

In #codecell_RasterLandscape_MultibandRasterClipped&View, you have clip the image by using the subset [ ] , you then asked rasterio to show() and plot() only the raster's first three bands so it will be an RGB image.

5.4.2 Let's check what's going on in each band by plotting an histogram¶
You can see the different information held in each band by plotting a histogram of the raster dataset pixel values.

In [ ]:

##codecell_RasterLandscape_MultibandRasterHistogram # plot a histogram of the data in each band #revisit your lab SpatialPatterns for further info on how to plot with matplotlib rasterio.plot.show_hist(clipped_img, bins=50, histtype='stepfilled', lw=0.0, stacked=False, alpha=0.3) #here note bin= corresponds to the interval used between pixel values

Even from simple visualisations, we can see the contrast between the red and the near-infared (NIR) bands. In the histogramm, the peak at 255 is simply from areas with no data (labelled 255 on all bands).

In [ ]:

##codecell_RasterLandscape_MultibandRasterNodata #we can ckeck where on the image are the cell with no value # we use the clipped image and subset to only [:0,0] data clipped_img[:,0,0]

5.4.3 Let's interpret what's going on in each band¶
Different band ratios will highlight different features in a raster dataset. We might be interested in finding places where cropmarks are located, for example.

There are lots of band ratios that will highlight more green and more stressed vegetation. A common ratio that does this is called NDVI- the normalized difference vegetation index.

How does this work?

The density of vegetation (NDVI) at a certain point of the image is equal to the difference in the intensities of reflected light in the red and infrared range divided by the sum of these intensities. NDVI formula is based on how much red and NRI is present in the plant leaves, and can be formulate as follows:

In [ ]:

##codecell_RasterLandscape_NDVI # let's calculate NDVI # but we want to make sure that to ignore errors due to NaN pixels #you remember these... revisit in #codecell__Webmaps&Distributions_ChangingDataType np.seterr(divide='ignore', invalid='ignore') bandNIR = clipped_img[3] # fourth band bandRed = clipped_img[2] # second band #Let's now apply the NDVI formula # note that in python division of integers leads to integers # so we need to specify (.astype) floats in order to get floats for NDVI results # if you forgot what this means, revisit #codecell__Webmaps&Distributions_ChangingDataType ndvi = (bandNIR.astype(float)-bandRed.astype(float))/(bandNIR.astype(float)+bandRed.astype(float)) #we now have a simple matrix

In [ ]:

##codecell_RasterLandscape_showNDVI #show NDVI #here we are using matplotlib instead of the rasterio # because we are now dealing with a matrix plt.imshow(ndvi, cmap="RdYlGn") plt.colorbar() plt.show()

There is a lot of red for the water and green for the vegetation, so the formula looks like to apply correctly.

5.5 Raster dataset & physical properties of landscapes: Topography¶

In this second part of the exercise, we'll be calculating some properties of the landscape like slope and aspect that are commonly used in analyses and models of where sites are likely to be visible.

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries #let's work with geopandas again !pip install geopandas

In [ ]:

##codecell_RasterLandscape_ImportUrLibraries #more tools. all the tools really! from shapely.geometry import mapping from shapely.geometry import Point from rasterio.mask import mask import geopandas as gpd import zipfile import io import requests import elevation import richdem as rd import gdal import os

5.5.1 Attribute raster: elevation dataset¶
And now get some data to start the second part of the exercise- a raster dataset representing elevation referred to as a DEM or 'digital elevation model'.

You may want to revisit Chap 3-3 showing how data can be interpolated. However, there are many ways (>40! with the most common IDW, Spline, Trend Surface, Natural Neighbour, Kriging) to interpolate elevation points to create a surface.

The raw point cloud contains terrain and vegetation and surface objects The filtered point cloud contains only selected classes of points, like terrain points The bare earth terrain model is made from the classified point cloud visualizations are created based on the bare earth terrain model

(Images: Opitz, Nuninger, and Fruchart. Data: MSHE C.N. Ledoux)

In [ ]:

##codecell_RasterLandscape_GetaDEM&clipIt #fetch elevation data from the SRTM server and clip to our area of interest dem_path = os.path.join(os.getcwd(),'areaDEM.tif') elevation.clip(bounds=(5.1, 43.65, 5.5, 43.95), output=dem_path)

In [ ]:

##codecell_RasterLandscape_LetsPlotDEM #check the DEM has loaded nicely by plotting it areaDEM = rd.LoadGDAL(dem_path) plt.imshow(areaDEM, interpolation='none') plt.colorbar() plt.show()

5.5.2 What can we do with a DEM?¶
DEMS are useful for 3D visualisation, mapping purposes and various other applications, and, can also derive additional types of topographical models from a DEM such as slope, aspect, hillshape, 3d surface views, visibility analysis.

DEMs can be then analysed:

DEM analysis 3D Analysis

We will compute and plot in this practical lab slope and aspect.

*Why use a Slope model?*

Predicting landslide risk

Choosing optimal alignment for new roads

Designing drainage networks

Land suitability (e.g. residential construction, commercial forest plantation)

Assessing vulnerability to coastal erosion

Optimising route for hiking, cycling, walking.

*What can we use an Aspect Model for?*

Choosing optimal location for a wind farm

Designing layout of new housing developments

Defining slopes suitable for commercial crops

Inferring soil moisture content and land slip risk (e.g. north and west facing slopes in Scotland generally wetter)

5.5.3 Slope mapping###¶

In [ ]:

##codecell_RasterLandscape_DEMSlope slope = rd.TerrainAttribute(areaDEM, attrib='slope_riserun') rd.rdShow(slope, axes=False, cmap='magma', figsize=(8, 5.5)) plt.show()

Learning a new language – decomposing the code ¶

In #codecell_RasterLandscape_DEMSlope, you have used rd.TerrainAttribute function to compute the slope for each pixel containing the levation value and used rdShow function to visualise the slope map.

There are multiple ways to represent the slope values, you read richdem docs (see lexicode for a link) for more options!

5.5.4 Aspect mapping¶

In [ ]:

##codecell_RasterLandscape_DEMAspect # now do the same thing to calculate and plot the aspect data aspect = rd.TerrainAttribute(areaDEM, attrib='aspect') rd.rdShow(aspect, axes=False, cmap='jet', figsize=(8, 5.5)) plt.show()

Learning a new language – decomposing the code ¶

In #codecell_RasterLandscape_DEMAspect, you have used rd.TerrainAttribute() function to compute the aspect for each pixel containing the levation value and used rdShow() function to visualise the aspect map.

5.5.4 Contours mapping¶
Contours, a common way to visualize elevation data, can also be derived from a raster.

visual representation of Iceland’s elevation levels with contours published by CartoDB

In [ ]:

##codecell_RasterLandscape_DEMrastertoArray #this script is using gdal library (see lexicode) #to avoid having to bring your raster block by block # we need to open the raster file using gdal library #and essentially let gdal read what is in our 1 band DEM gdal_data = gdal.Open(dem_path) gdal_band = gdal_data.GetRasterBand(1) nodataval = gdal_band.GetNoDataValue() # let's convert to a numpy array data_array = gdal_data.ReadAsArray().astype(np.float) data_array # replace missing values if necessary # you knopw this by now ... we are talking about NaN values (the very annoying ones) if np.any(data_array == nodataval): data_array[data_array == nodataval] = np.nan

In [ ]:

##codecell_RasterLandscape_DEMrasterPlotCountoursLines #Plot out data using Matplotlib tool 'contour' #the script layout is very similar to last week's practical on spatialpatterns fig = plt.figure(figsize = (12, 8)) ax = fig.add_subplot(111) plt.contour(data_array, cmap = "viridis", levels = list(range(0, 5000, 100))) plt.title("Elevation Contours") cbar = plt.colorbar() plt.gca().set_aspect('equal', adjustable='box') plt.show() # and here we go!

In [ ]:

##codecell_RasterLandscape_DEMrasterPlotCountoursFilled #Plot our data using Matplotlib tool 'contourf' to get filled contours fig = plt.figure(figsize = (12, 8)) ax = fig.add_subplot(111) plt.contourf(data_array, cmap = "viridis", levels = list(range(0, 5000, 500))) plt.title("Elevation Contours") cbar = plt.colorbar() plt.gca().set_aspect('equal', adjustable='box') plt.show()

That's it for this lesson. Hopefully you've seen how you can access and manipulate raster data to represent the landscape or environmental factors.

5.6 LexiCode¶
To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:

libraries | | | | --- |--- | --- | --- | folium | numpy | rasterio| branca| rtree | richdem| pandas| osmnx| elevation| geopandas| requests | zipfile | seaborn | fiona| io | matplotlib.pyplot | ipywidgets|requests | pysal |seaborn |gdal| os |

plugins

HeatMapWithTime HeatMap MeasureControl PrepareUrBasemaps_CreateLayers from [folium.plugins] cluster (from sklearn) rasterio.plot mapping (from shapely.geometry) Point(from shapely.geometry)
mask (from rasterio.mask)

your lexicode is non-exhaustive, keep expanding, find your own 'best way' to reuse these code/scripts...

Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions |Lexicode_StreetGridOrientations | Lexicode_SpatialPatterns | Lexicode_RasterLandscape --- | --- | ---|---|----| == () [] | pd.concat() | { } subselection from list|%matplotlib inline | .open() .head_csv() | .dtype() | ox.gdf_from_places()|requests.get()|.print() .read_csv() | astype() | ox.plot_shape()|request.content()|dataset.name| mean() | fillna()|network_type= ''|.bytes()|dataset.count| folium.Map | def return |ox.add_edge_bearings(ox.get_undirected())|gpd.GeoDataFrame.from_features()|dataset.shape| range() | .apply(lambda x:function,axis=) |count_and_merge()|Set()|dataset.descriptions| len() | pd.merge() |np.arrange()|pd.value_counts() |dataset.meta| iloc[]| how= , left_index= ,left_index= |np.histogram()|.merge()|dataset.driver| .value_counts()| gpd.GeoDataFrame()| ax.set_theta_location()|.sort_values|dataset.read if =:| geometry=gpd.points_from_xy |ax.set_ylim()|cluster.KMeans()|.shape elif =: |print() |ax.set_title()|.fit()|np.amean() else =:| .isin()|ax.set_yticks()|.drop() |np.amin () folium.Marker()| classic.plot()|ax.set_xlabels() & ax.set_yticklabels|.assign()|np.amax() folium.Icon()| generateBaseMap()|plt.subplots()|plt.show()|np.std() folium.Circle| .groupby(['', ''])|.dropna()|.set_title|show() popup= | .reset_index() |polar_plot()|sns.pairplot()|cmap= radius= | max_zoom= |pd.Series()|arrs.append()|np.seterr() .values.tolist() |folium.TileLayer()|np.pi|.show_hist()|plt.imshow .add_to()| plugins.DualMap(location= , tiles= , zoom_start= )| | |clipped_img| | | | |rd.TerrainAttribute() | | | |rdShow() | | | |gdal_data.GetRasterBand() | | | |gdal_band.GetNoDataValue() | | | |gdal_data.ReadAsArray | | | |gdal_data.ReadAsArray().astype(np.float) | | | |plt.contour() | | | |plt.contourf()

Chapter 6: Working with Pysal on Weights, Moran and Lisa: Spatial Statistical Analysis¶
6.1 Introduction: Exercise motivations¶
In this exercise we're going to get into some key spatial statistics. So far in this course we've mostly been visualising spatial distributions and patterns. Here we will run statistical tests to determine whether nor not a pattern or spatial structure exists, and to test what kind of pattern (dispersed vs. random vs. clustered) is present.

This kind of analysis is what we would call spatial statistical analysis. This is different from ** deterministic analysis** -which is when you analyse things like which polygons intersect one another in a dataset, and which polygons entirely contain other polygons. Spatial statistical analysis is a subset of probabilistic analysis - which is simply analysis of how likely it is that something happened. In this exercise we will start by working with vector-based data (points, lines and polygons). Then we will briefly look at working with raster datasets.

=

In [ ]:

##codecell_Spatial_Statistical_Analysis_ImportUrLibraries # start by installing tools as usual !pip install geopandas !pip install descartes !pip install mapclassify !pip install pysal #@title !apt-get install -qq curl g++ make #@title !curl -L http://download.osgeo.org/libspatialindex/spatialindex-src-1.8.5.tar.gz | tar xz #@title import os os.chdir('spatialindex-src-1.8.5') #@title !./configure #@title !make #@title !make install #@title !pip install rtree #@title !ldconfig #Working through the example at http://toblerity.org/rtree/examples.html #@title from rtree import index from rtree.index import Rtree #@title p = index.Property() idx = index.Index(properties=p) idx

In [ ]:

##codecell_Spatial_Statistical_Analysis_ImportUrLibraries #and importing tools... import geopandas as gpd import requests import zipfile import io import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns import pandas as pd import pysal as ps import numpy as np

Learning a new language – decomposing the code ¶

in #codecell_Spatial_Statistical_Analysis_ImportUrLibraries. These are what we call prerequisites. You know by now that they are basic tools so you can get started.

Pandas lets you manipulate your data. ~ déjà vu ~

Geo-pandas lets you manipulate your geographic data. ~ déjà vu ~

requests lets you access easily Hypertext Transfer Protocol or HTTP library (doc.).

zipfile lets you reate, read, write, append, and list a ZIP file. ~ déjà vu ~

io (input/open) lets you to access files and streams (a stream is sequence of data elements made available over time - processed 1 at a time) ~ déjà vu ~

matplotlib lets you plot in 2D which extensive plotting library for quality publication & matplotlib.pyplot provides a MATLAB-like way of plotting ~ déjà vu ~

seaborn lets you do statitistical data visualisation. ~ déjà vu ~

numpy lets you do statistics especially statistics using numpy. ~ déjà vu ~

pysal ~ déjà vu ~ we have use it before in lab4_SpatialPatterns. This is the one I'd like you to spend a bit more time with...Why? because this library is used to conduct exploratory spatial data analysis. You can have a look at its documentation, a video of one of its creator and high level applications for spatial analysis (see image below).

6.2 Preparing your Data¶
You must have realised by now that the most time consuming part of your work is preparing your data ... Data preparation is important, and your ability to use or re-use a dataset depends on how well it has been prepared. When you are selecting and preparing your data, remember to think about your research aims and the questions you are trying to answer, and make sure your data is set up to respond to those aims and questions.

The following scripts repeat what you have done in previous practicals, in particular Lab4_spatial_patterns_in_excavation where you complete the following basic steps:

import

read

select/extract

merge

In [ ]:

##codecell_Spatial_Statistical_Analysis_ImportOurData # And now, as usual, get the data url = 'https://github.com/ropitz/spatialarchaeology/blob/master/gabii_spatial.zip?raw=true' local_path = 'temp/' print('Downloading shapefile...') r = requests.get(url) z = zipfile.ZipFile(io.BytesIO(r.content)) print("Done") #here this is very much like what we are doing on your PC file system z.extractall(path=local_path) # extract to folder #here we want to download the shapefiles # so we sort them using sorted() #so we have a comprehensive list of all variables available in our dataset # print them filenames = [y for y in sorted(z.namelist()) for ending in ['dbf', 'prj', 'shp', 'shx'] if y.endswith(ending)] print(filenames) #here you know it ... #we use function read() to access the data from this notebook and work with it dbf, shp, shx = [filename for filename in filenames] gabii = gpd.read_file(local_path + 'gabii_SU_poly.shp')

In [ ]:

##codecell_Spatial_Statistical_Analysis_Peek@OurData # As you've done before, print out some information on the data # To check the number of records in the file, so it has loaded in ok # And preview the data print("Shape of the dataframe: {}".format(gabii.shape)) print("Projection of dataframe: {}".format(gabii.crs)) gabii.tail() #last 5 records in dataframe

In [ ]:

##codecell_Spatial_Statistical_Analysis_ObjectType&SU # As we've done before (returning to the Gabii finds data - see ###codecell_SpatialPatterns_WhichTypeOfSpecialFinds&fromWhere?) # Get the non-spatial special finds data # And they are archived per SU /Stratigraphical Unit sf_su = pd.read_csv("https://raw.githubusercontent.com/ropitz/gabii_experiments/master/spf_SU.csv") sf_su.head()

In [ ]:

##codecell_Spatial_Statistical_Analysis_FindsBecomesSpatial #Then let's combine our polygons representing context shape and location # with the special finds data # We can use known command 'merge()' for this (see especially explanation for this in #codecell_Webmaps&Distributions_MergingZeData and also ###codecell_SpatialPatterns_TextileToolsBecomesSpatial) gabii_textools = gabii.merge(sf_su, on='SU') gabii_textools.head()

In [ ]:

##codecell_WorkingWithPysal_DataPreparation #Let's pull all those find types out of the big list. #These commands should look familiar because you've done them before (###codecell_SpatialPatterns_SpecialFindsSelection) types = ['Loom Weight','Spool','Spindle Whorl'] textile_tools = gabii_textools.loc[gabii_textools['SF_OBJECT_TYPE'].isin(types)] # Now let's count up how many of these tools appear in each context (SU). # This command will print out a list of the number of textile tools in each SU next to that SU number. textile_tool_counts = textile_tools.groupby('SU')['SF_OBJECT_TYPE'].value_counts().unstack().fillna(0) gts = gabii_textools.merge(textile_tool_counts, on='SU') gts_new = gts.drop_duplicates(subset="SU") gts_new.head()

6.3 Visualising your Data¶
6.3.1 Let's plot our data¶
Now plot your data to visualise it with a focus on the Spool Distribution of Gabii excavations.

In [ ]:

##codecell_Spatial_Statistical_Analysis_PlotZeData # Set up figure and axis f, ax = plt.subplots(1, figsize=(9, 9)) # Plot SUs #gabii.plot(ax=ax, facecolor='0.85', linewidth=0) # Quantile choropleth of deaths at the street level gts_new.plot(column='Spool', scheme='fisher_jenks', ax=ax, \ cmap='YlGn', legend=True, linewidth=3) # Plot pumps #xys = np.array([(pt.x, pt.y) for pt in pumps.geometry]) #ax.scatter(xys[:, 0], xys[:, 1], marker='^', color='k', s=50) # Remove axis frame, also called graticule ax.set_axis_off() # Change background color of the figure f.set_facecolor('0.75') # Keep axes (your x and y) proportionate plt.axis('equal') # Title # f.suptitle()command allows you to add a centered title to the figure f.suptitle('Spool Distribution', size=30) # Draw plt.show()

6.3.2 Exploring your data: Descriptive Statistics Vs Visualisation¶
They are different ways to explore your data.

6.4 Spatial Statistical Analysis¶
So far, you've (rapidly) repeated the steps you've done in a previous exercise to visualise a spatial pattern - this time we are exploring the spools artefacts discovered while excavating at Gabii.

Now we want to test statistically if there is a pattern? Statistical tests are needed when it is not obvious from just looking at the distribution whether or not a pattern exists.

We can start with some of the more basic tests: Moran's and local Moran's, which are tests for spatial autocorrelation. Moran's I statistic (1948, 1950) is one of the classic (as well as one of the most common) ways of measuring the degree of spatial autocorrelation in data.

6.4.1 Moran's I test¶
But how does this work?

You have seen that objects in space are rarely randomly distributed. In fact, they usually have some degree of patchiness (i.e., they are spatially clustered) which can produce a variety of distinct spatial patterns.
These patterns can be quantified according to the degree of similarity between the attributes values of adjacent spatial objects (polygons, lines, points, raster cells).
We will start by testing if nearby objects tend to have similar attributes than expected from a random distribution . The degree to which similar values cluster together spatially is the degree of autocorrelation. This value is reported as Moran's I.

In a nutshell, the Moran’s I statistic provides a correlation coefficient (a measurement of similarity) for the relationship between a measured value and its surrounding measured values.

To understand the meaning of this measurement of similarity, the Moran's I value, you need to understand how "statistical significance" works. Basically, a measurement is considered "statistically significant" if it very different from what you would get by chance. A spatial pattern is "statistically significant' if it is very different from a random spatial pattern.

In summary:

null hypothesis = the data is randomly disbursed.

alternative hypothesis = the data is more spatially clustered than you would expect by chance alone.

Conditions:

All features should have at least one neighbor (8 is good).

No feature should have all other features as a neighbor.

At least 30 points.

Global statistics assess the overall pattern. They are most effective when the spatial pattern is consistent across the study area.

Z-score and p-value are used to evaluate the significance of the Index value. The p-value is the probability that the observed spatial pattern was created by some random process. When the p-value is very small, it means it is very unlikely (small probability) that the observed spatial pattern is the result of random processes. There are two possible scenarios:

Very high or very low (negative) z-scores, associated with very small p-values, are found in the tails of the normal distribution.

Small p-values and either a very high or a very low z-score indicates it is unlikely that the observed spatial pattern reflects a random pattern.

You can read about Moran's. If you prefer a video, this one this one is very clear on how to interpret the resulst of a Moran's test.

To start your Moran's I test¶
As you may have suspected, we need to do something first before running the Moran's I. And, as said above, the emphasis of the Moran's I spatial analysis is on the role of the weights. So, one of the vital steps of spatial autocorrelation modelling is to construct a spatial weights matrix. The spatial weights can be based on:

Contiguity Pysal-notes

Distance Pysal-notes

Distance band Pysal-notes

Kernel Pysal-notes

K-nearest neighbours Pysal-notes

which can be illustrated as such:

*where 0= no weights and 1
As we've discussed in class, designing a good weights matrix is part of the interpretive process. It's an important step because if we fail to design the weights matrix well, the resulting of spatial analysis will likely not produce results that are meaningful. It is important, therefore, to understand the kinds of spatial relationships listed above, and to use the one that best represents the kind of spatial relationship presesnt in your data.

Let's define the weights¶
Let's create some weights for a real dataset that you know: finds from the Gabii excavations.

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoranStatistics_defineweights # To start your Moran's statistical test, you need to create weights that define how strongly you think things near to one another influence one another. # see the types of weights available to you by looking in pysals help file help(ps.lib.weights)

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoranStatistics_KNN_weights #create some weights. I've gone with KNN weights. Read about this in the pysal documentation linked above... #we want to see what is happening with the Spools #let's first subselect them and create a gts_spool gts_spool = gts_new[['SU','Spool']] #let's have the nearest neighbours weights #we choose a k-distance of 5 gts_spool_weights = ps.lib.weights.KNN(gts_spool,5) #here we Ignore the warnings because #we know that not all the SU areas connect up physically

Let's add the knn-weights as an attribute to your data¶
Here we create a matrix, or a grid of values, which contains the weights assigned to each spatial entity: .

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoranStatistics_DefineMatrix # Rename IDs to match those in the `segIdStr` column gts_spool_weights.remap_ids(gts_spool.index) # Row standardise the matrix gts_spool_weights.transform = 'R'

Learning a new language – decomposing the code ¶

in *##codecell_Spatial_Statistical_Analysis_MoranStatistics_DefineMatrix*, to keep things aligned, you use .remap_ids() to rename the IDs of the matrix to match those in the table.
Often there is a need to apply a transformation .transform = 'R' to the spatial weights, such as in the case of row standardisation. You can use .transform = 'b' for binary and .transform = 'v' for variance stabilising doc.

~ Behind the scenes ~ the transform property is updating all other characteristics of the spatial weights that are a function of the values and these standardisation operations, freeing the user from having to keep these other attributes updated.

Let's add Spatial lag¶
Now you have the data and the spatial weights matrix ready.

Let's start by computing the spatial lag of the spool distribution. The spatial lag is the product (multiplication) of the spatial weights matrix and a given variable. If gts_spool_weight is row-standardised, the result is the average value of the variable in the neighborhood of each observation.

Multiplication of two matrices looks like this:

We can calculate the spatial lag for the variable 'Spool' and store it directly in the main table with .lag_spatial() :

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoranStatistics_SpatialLag #add the weights you've created to the attribute table gts_spool['gts_spool_weights'] = ps.lib.weights.lag_spatial(gts_spool_weights, gts_spool['Spool']) gts_spool.head()

Standardisation against biaised dataset¶
Whenever there is a risk that the distribution of your features (see decomposing the code of #codecell_RasterLandscape_BandsDataType as a reminder) is potentially biased due to sampling design or an imposed aggregation scheme, it is important to row standardize the data.

Read about standardisation in spatial modelling.

So now we need to remove potentially biased distribution in our dataset by standardising the counts and standardising the means.

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoranStatistics_Standardisation&SpatialLag # standardise the counts of the number of spools in each context and the weights #we are applying some math here gts_spool['spool_std'] = (gts_spool['Spool'] - gts_spool['Spool'].mean()) / gts_spool['Spool'].std() gts_spool['w_spool_std'] = ps.lib.weights.lag_spatial(gts_spool_weights, gts_spool['spool_std']) gts_spool.head()

Learning a new language – decomposing the code ¶

in *##codecell_Spatial_Statistical_Analysis_MoranStatisticstandardisation&SpatialLag*, to make sure our varaibles distribution is not biaised, we apply some basic algebra:

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoransI_ImportUrLibraries #get some more tools for the Moran test from pysal.explore.esda.moran import Moran

In [ ]:

##codecell_Spatial_Statistical_Analysis_RUNMoransI # Run the Moran test mi = Moran(gts_spool['Spool'], gts_spool_weights) mi.I

Interpreting your Moran's I¶
##codecell_Spatial_Statistical_Analysis__RUNMoransI mean?

Read how to interpret the results.

Are your spools actually clustered?

Let's plot your Moran's I¶
Now let's plot the results.

The cluster/outlier type (COType) field distinguishes between a statistically significant cluster of:

high values (HH),

cluster of low values (LL),

outlier in which a high value is surrounded primarily by low values (HL)

outlier in which a low value is surrounded primarily by high values (LH).

Statistical significance is set at the 95 percent confidence level (really quite confident, we could say).

In [ ]:

##codecell_Spatial_Statistical_Analysis_PLOTMoransI #Here we are using similar matplotlib commands than in Lab3_StreetOrientation & Lab5_RasterLandscape # Setup the figure and axis f, ax = plt.subplots(1, figsize=(9, 9)) # Plot values #command sns.regplot()allows you to plot the data/results and a linear regression model fit #such as our results of Moran's I # sns.regplot(x='spool_std', y='w_spool_std', data=gts_spool) # Add vertical and horizontal lines plt.axvline(0, c='k', alpha=0.5) plt.axhline(0, c='k', alpha=0.5) ax.set_xlim(-2, 7) ax.set_ylim(-2.5, 2.5) #add text within each quandrant plt.text(3, 1.5, "HH", fontsize=25) plt.text(3, -1.5, "HL", fontsize=25) plt.text(-1, 1.5, "LH", fontsize=25) plt.text(-1, -1.5, "LL", fontsize=25) # Display plt.show()

Learning a new language – Interpreting the results ¶

What's the figure generated from ##codecell_Spatial_Statistical_Analysis__PLOTMoransI mean?

In order to guide the interpretation of the plot, a linear fit is also included in the plot, together with confidence intervals. This line represents the best linear fit to the scatter plot or, in other words, what is the best way to represent the relationship between the two variables as a straight line. Because the line comes from a regression, we can also include a measure of the uncertainty about the fit in the form of confidence intervals (the shaded blue area around the line).

The plot displays a positive relationship between both variables. This is associated with the presence of positive spatial autocorrelation: similar values tend to be located close to each other. If we had to summarise the main pattern of the data in terms of how clustered similar values are, the best way would be to say they are positively correlated and, hence, clustered over space.

At the core of this is a classification of the observations in a dataset into four groups derived from the Moran Plot: high values surrounded by high values (HH), low values nearby other low values (LL), high values among low values (HL), and viceversa (LH). Each of these groups are typically called "quadrants"

6.4.2 Moran's I to Moran's local: LISA¶

Actually, as you may have realised, Moran's I can tell you about the complete spatial pattern so it tells you about the clustering BUT it tells us nothing about where the clusters might be ! It only tells you that the pattern is more clustered than it would be if it would be under spatial randomness.

So why do we need Moran's I then? well you need to test the significance of your spatial analysis, i.e. that your regression is not violating your assumptions... so Moran's is a global test.

And, now, what do we do to get test the location of spatial clusters? We need a local statistic...To test this, we use the local variant of the Moran's test.

But are there local patterns inside the global one?¶

In [ ]:

##codecell_Spatial_Statistical_Analysis_ImportUrLibraries # get the tools for the local test from pysal.explore.esda.moran import Moran_Local

Using your Moran's local¶
The local test breaks the global pattern down to test for the presence of local clusters. You can check at each SU whether or not it is likely (in a statistical significance sense) for it to participate in a local cluster.

How does this work? Local measures consider each single observation in a dataset and operate on them (as oposed to on the overall data) as global measures do.

LISAs (Local Indicators of Spatial Association) are widely used in many fields to identify clusters of values in space. They are a very useful tool that can quickly return areas in which values are concentrated and provide suggestive evidence about the processes that might be at work. Therefore, they are a key part of the exploratory toolset. In Python, we can calculate LISAs in a very streamlined way thanks to PySAL:

A positive value for I indicates that a feature has neighboring features with similarly high or low attribute values; this feature is part of a cluster.

A negative value for I indicates that a feature has neighboring features with dissimilar values; this feature is an outlier.

In either instance, the p-value for the feature must be small enough for the cluster or outlier to be considered statistically significant.

Note that the local Moran's I index (I) is a relative measure and can only be interpreted within the context of its computed z-score or p-value.

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoransLocal # run the local test lisa = Moran_Local(gts_spool['Spool'].values, gts_spool_weights)

Applying your Moran's local¶
All we need to pass is the .values variable of interest "gts_spool['Spool']" and the spatial weights "gts_spool_weights" that describes the neighborhood relations between the different observation that make up the dataset.

Looking at the numerical result of LISAs is not always the most useful way to exploit all the information they can provide. Remember that we are calculating a statistic for every sigle observation in the data so, if we have many of them, it will be difficult to extract any meaningful pattern. Instead, what is typically done is to create a cluster map that extracts the significant observations (those that are highly unlikely to have come from pure chance) and plots them with a specific color depending on their quadrant category.

All of the information needed is contained in the lisa object you have just created in ##codecell_Spatial_Statistical_Analysis_MoransLocal.

It is convenient to pull them out and insert them in the main data frame that we will call "gts_spool" as demonstrated in ##codecell_Spatial_Statistical_Analysis_MoransLocal_Significance.

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoransLocal_Significance # Break observations into significant or not # using .o_sim < we will set it at 95% confidence interval # so significant values are all the ones which are less than 5% gts_spool['significant'] = lisa.p_sim < 0.05 # By using .q command # you can store the quadrant each observation belongs to: # - the high-high, high-low, low-high, low-low # just as before (in ##codecell_Spatial_Statistical_Analysis_PLOTMoransI) are quads 1-4 gts_spool['quadrant'] = lisa.q

Learning a new language – Interpreting the results ¶

First, the 'significance' column. As with global Moran's I, PySAL is automatically computing a p-value for each LISA. Because not every observation represents a statistically significant one, we want to identify those with a p-value small enough that rules out the possibility of obtaining a similar situation from pure chance. Since we have been working with a 95% confidence interval, we select 5% using p.sim< as the threshold for statistical significance (just like with global Moran's I).

To identify these values, we create in ##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificanceResults below a variable, significant, that contains :

True if the p-value of the observation is satisfies the condition, and

False otherwise.

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificancereadingUrResults gts_spool['significant'][:20] # true means it is in a cluster, false means it is not

In [ ]:

##codecell_Spatial_Statistical_Analysis_MoransLocal_SignificancereadingUrResults # You can read out the calculated p values for each lisa.p_sim[:20]

Bringing your Moran's local to your dataset¶

In [ ]:

##codecell_Spatial_Statistical_Analysis_AddingUrMoransLocal2Gabiidataset #add this info back onto the spatial data gabii_spool_lisa = gabii.merge(gts_spool, on='SU') gabii_spool_lisa.head()

Let's plot your Moran's Local¶
Now we have these elements, the signifigance and quadrant , we can make a map showing which quadrant each SU belongs to, essentially a display of where the local clusters are located.

Parameters are for:

High-high = 4

Low-low = 2

In [ ]:

##codecell_Spatial_Statistical_Analysis_MappingGabiiMoransLocal # Setup the figure and axis f, ax = plt.subplots(1, figsize=(9, 9)) # Plot baseline su poly gabii_spool_lisa.plot(column='quadrant', ax=ax, \ cmap='Accent', legend=True, linewidth=3) ax.set_axis_off() plt.axis('equal') plt.show()

Interpreting your Moran's Local¶
How would you interpret the results of this analysis?

Remember: The z-scores and p-values are measures of statistical significance which tell you whether or not the local pattern is random, feature by feature. In effect, they indicate whether the apparent similarity (a spatial clustering of either high or low values) or dissimilarity (a spatial outlier) is more pronounced than one would expect in a random distribution.

A high positive z-score for a feature indicates that the surrounding features have similar values (either high values or low values).

A low negative z-score (for example, less than -3.96) for a feature indicates a statistically significant spatial data outlier.

6.5 From Vector to Raster¶
You can run similar spatial statistics on raster data to understand trends and distributions of data.

It is possible to convert point vector data to raster data via the process of interpolation.

~ déjà vu ~ We have talked about this before:

in Chapter 3.3 when presenting type of data distribution and example of applications; and,

in Chapter 5 when working with DEMs, raster with an elevation attribute which was interpolated to create a surface.

*IDW interpolation calculates the value of an unknown cell center value (a query point) using surrounding points with the assumption that closest points will be the most similar to the value of interest*
There are many ways to interpolate data, which you can read about on this site.

6.5.1 Whitebox is an open source GIS¶
Whitebox is full of hundreds (412 for version 1.0.2!) of useful spatial tools.

You can download it and run it standalone (without installing!) or you can call its functions from the jupyter notebook.

It will be well worth checking out for your own independent projects: Read more about Whitebox Tools here.

=

In [ ]:

##codecell_Spatial_Statistical_Analysis_InstallUrLibraries # we're going to use the whitebox tools to interpolate our data and imageio to view it !pip install whitebox # acts as a spatial toolbox !pip install imageio !pip install imageio tifffile # stores numpy arrays in TIFF format files & read image&metadata from TIFFs

In [ ]:

##codecell_Spatial_Statistical_Analysis_ImportUrLibraries # import your tools and print the version to check it worked import pkg_resources # provides runtime facilities for finding, introspecting, activating and using installed Python distributions import whitebox #advanced geospatial data analysis platform. Well you need its toolbox for that (see 3 codeline down)! import imageio # provides an easy interface to read and write a wide range of image data, including animated images, volumetric data, and scientific formats import IPython # for displaying html outputs import os wbt = whitebox.WhiteboxTools() print(wbt.version()) print(wbt.help())

In [ ]:

##codecell_Spatial_Statistical_Analysis_CheckUrNewSpatialToolbox # when there are lots of tools, it's useful to call their help info to find out what the parameters are print(wbt.tool_help("IdwInterpolation"))

6.5.2 Let's interpolate¶

In [ ]:

##codecell_Spatial_Statistical_Analysis_PrepareUrData # To interpolate, as you may have noticed in your reading, we need points rather than polygons. # To have points, we get the center of each polygon, called centroids. # In these lines we essentially strip out the geometry information when we get the center of each polygon. spool_centroids = gabii_spool_lisa.geometry.centroid spoolgdf = gpd.GeoDataFrame(geometry=gpd.GeoSeries(spool_centroids)) #Then we 'merge' to glue the attributes we are interested in interpolating back on to our data. spool_centroids_attr = spoolgdf.merge(gabii_spool_lisa.spool_std,right_index=True,left_index=True) #Let's have a sneaky peek at the dataframe spool_centroids_attr.head()

6.5.3 Let's get our working directory sorted¶

In [ ]:

##codecell_Spatial_Statistical_Analysis_SetUrWorkingDirectory # to work in whitebox effectively, we put this data out to a temporary shapefile # we also set a data directory where we can put this file data_dir = '/usr/local/lib/python3.6/dist-packages/whitebox/testdata/' wbt.set_working_dir('data_dir') print(data_dir) #wbt.verbose command allows you to decide if you want output messages such as warnings, progress updates and other notifications wbt.verbose = True #Let's place our shapefile in a file gabii_spool_shapes = spool_centroids_attr.to_file("/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.shp")

In [ ]:

##codecell_Spatial_Statistical_Analysis_InterpolateRaster # now we interpolate our centriods using the standardised spool weights and show the result # note you can play around with the interpolation parameters to have different kernel sizes and cell sizes... wbt.idw_interpolation( i="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.shp", field="spool_std", output="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif", use_z=False, weight=2.0, radius=5, min_points=3, cell_size=1, base=None ) raster = imageio.imread("/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif") plt.imshow(raster) plt.show()

In [ ]:

#List data under the data folder print(os.listdir('testdata'))

6.5.4 Let's get the spatial statistical Analysis running¶
Now you can calculate statistics on this dataset, for example using rasterio like we did last week, or using whitebox's tools.

6.5.5 Let's check distribution¶
For example, you can check to see if the data forms a normal distribution. This lets you know what kinds of statistical tests will be valid. As we discussed last week, often archaeological distributions are not normal.

In [ ]:

##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_normality wbt.ks_test_for_normality( "/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif", "/usr/local/lib/python3.6/dist-packages/whitebox/testdata/normal.html", num_samples=None )

In [ ]:

##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_ViewNormality # view your results IPython.display.HTML(filename="/usr/local/lib/python3.6/dist-packages/whitebox/testdata/normal.html")

6.5.6 Let's check for autocorrelation¶
You can also run a similar autocorrelation test to the LISA one we ran above with the vector data.

In [ ]:

##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_Autocorrelation wbt.image_autocorrelation( " /usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_spool.tif", "/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_autocorr.html", contiguity="Rook" )

In [ ]:

##codecell_Spatial_Statistical_Analysis_RasterSpatialStats_ViewAutocorrelation # view your results IPython.display.HTML(filename='/usr/local/lib/python3.6/dist-packages/whitebox/testdata/gabii_autocorr.html')

This exercise ends here. Hopefully you've learned that there are statistical tests for spatial patterns and that these let us go beyond 'just visualising' to look for patterns. These tests can be run on either vector or raster data.

Clearly there's a lot to learn and explore in the world of spatial stats...

6.6. LexiCode¶
To re-use the codes - you will need to first load their respective libraries. So far, you have used ...:

libraries | | | --- | --- | --- | folium | numpy | rasterio | branca | rtree | richdem | pandas | osmnx | elevation | geopandas | requests | zipfile | seaborn | fiona | io | matplotlib.pyplot | ipywidgets | seaborn | pysal | os | gdal | pkg_resources | whitebox | imageio | IPython | imageio tifffile | |

plugins

HeatMapWithTime HeatMap MeasureControl PrepareUrBasemaps_CreateLayers from [folium.plugins] cluster (from sklearn) rasterio.plot mapping (from shapely.geometry) Point(from shapely.geometry)
mask (from rasterio.mask) Moran (from pysal.explore.esda.moran) Moran_Local (from pysal.explore.esda.moran)

At this point you have learned so many thigs that the lexicode is non-exhaustive. It might be useful to make your own custum 'lexicode' with the commands you use the most often.

Lexicode_MakingaBasicMap | Lexicode_Webmaps&Distributions | | Lexicode_SpatialPatterns | Lexicode_RasterLandscape | Lexicode_Pysal_weights_Moran_Lisa | --- | --- | --- | --- | ---- | --- | == () [] | pd.concat() | { } subselection from list | %matplotlib inline | .open() | .tail() | .head_csv() | .dtype() | ox.gdf_from_places() | requests.get() | .print() | ax.set_axis_off() | .read_csv() | astype() | ox.plot_shape() | request.content() | dataset.name | f.set_facecolor | mean() | fillna() | network_type= '' | .bytes() | dataset.count | plt.axis | folium.Map | def return | ox.add_edge_bearings(ox.get_undirected()) | gpd.GeoDataFrame.from_features() | dataset.shape | f.suptitle() | range() | .apply(lambda x:function,axis=) | count_and_merge() | Set() | dataset.descriptions | f.set_facecolor | len() | pd.merge() | np.arrange() | pd.value_counts() | dataset.meta | plt.axis | iloc[] | how= , left_index= ,left_index= | np.histogram() | .merge() | dataset.driver | f.suptitle() | .value_counts() | gpd.GeoDataFrame() | ax.set_theta_location() | .sort_values | dataset.read | .remap_ids | if =: | geometry=gpd.points_from_xy | ax.set_ylim() | cluster.KMeans() | .shape | .transform | elif =: | print() | ax.set_title() | .fit() | np.amean() | lag_spatial() | else =: | .isin() | ax.set_yticks() | .drop() | np.amin () | .mean() | folium.Marker() | classic.plot() | ax.set_xlabels() & ax.set_yticklabels | .assign() | np.amax() | .std() | folium.Icon() | generateBaseMap() | plt.subplots() | plt.show() | np.std() | sns.regplot() | folium.Circle | .groupby(['', '']) | .dropna() | .set_title | show() | Moran_local | popup= | .reset_index() | polar_plot() | sns.pairplot() | cmap= | p.sim() | radius= | max_zoom= | pd.Series() | arrs.append() | np.seterr() | Lisa.q | .values.tolist() | folium.TileLayer() | np.pi | .show_hist() | plt.imshow | wbt.set_working_dir() | .add_to() | plugins.DualMap(location= , tiles= , zoom_start= ) | | | clipped_img | wbt.verbose() | | | | | rd.TerrainAttribute() | .to_file(" ") | | | | | rdShow() | wbt.idw_interpolation() | | | | | gdal_data.GetRasterBand() | imageio.imread(os.path.join()) | | | | | gdal_band.GetNoDataValue() | wbt.ks_test_for_normality() | | | | | gdal_data.ReadAsArray | IPython.display.HTML() | | | | | gdal_data.ReadAsArray().astype(np.float) | wbt.image_autocorrelation() | | | | | plt.contour() | | | | | | plt.contourf() | |

This website does not host notebooks, it only renders notebooks available on other websites.

Delivered by Fastly, Rendered by OVHcloud

nbviewer GitHub repository.

nbviewer version: 8b013f7

nbconvert version: 7.2.3

Rendered (Mon, 06 May 2024 03:08:22 UTC)

Step1	Step2	Step3	Step4	Step5
Assign each points to similar centre: we need to identify the number of classes to use by looking at the data and identifying any discrete groupings.	Identify the cluster centroids (3 coloured symbols on graph).	Reassign the points based on the minimum distance or closest from cluster centroids. Here we should emphasise that there many possible definitions that may be used for “closest” (e.g.nearest neighbour, Ward's method).	Identify the new centroids by taking the average of all points in the cluster.	Reassign the groupings -points and assignment- until points stop changing clusters in a loop.



The raw point cloud contains terrain and vegetation and surface objects	The filtered point cloud contains only selected classes of points, like terrain points	The bare earth terrain model is made from the classified point cloud	visualizations are created based on the bare earth terrain model

Spatial Thinking and Skills for Archaeology¶

Table of Contents

Chapter 1: Introduction: notebooks motivations¶

1.1 What do we mean by spatial thinking?¶

1.2 Getting started¶

What's a notebook? And why Jupyter notebooks?¶

How does it work?###¶

Text cells: Markdowns###¶

The code cells: Working with python####¶

What do you need?###¶

1.3 Introductory exercise: learning rudimentaries working steps¶

Let's get the tools we need - We will always start practical exercises by getting some tools like this.###¶

Let's import our first tool¶

Learning a new language ¶

Let's import another tool¶

Learning a new language ¶

Now try to import another tool¶

Learning a new language ¶

Now let's add data.¶

How do you see what's in your data?¶

Well, if it's spatial data, you should be able to plot it¶

1.4 Challenge¶

1.5 References¶

Chapter 2: Make A Basic Map¶

2.1 Introduction: exercise motivations ?¶

2.2 Maps should answer questions¶

2.3 To make a map we need to get some data.¶

2.3.1 Let's get your data¶

Learning a new language – decomposing the code¶

2.3.2 Let's filter this big dataset to get only the data you need¶

Learning a new language – decomposing the code¶

2.3.3 Let's put the data onto a map¶

Yay!¶

Learning a new language – decomposing the code¶

2.3.4 Let's play with this code¶

2.4 Representing information and relative values or quantities with graphics¶

Learning a new language – decomposing the code ¶

Learning a new language – decomposing the code ¶

2.5 Designing your map - Symbology¶

2.5.1 What symbol scheme is best for your map?¶

Learning a new language – decomposing the code ¶

2.5.2 Thinking about spatial density or 'clustering' or concentration¶

Learning a new language – decomposing the code ¶

Success!¶

2.6 LexiCode¶

2.7 References¶

Chapter 3: Interactive maps¶

3.1 Introduction: exercise motivations¶

3.1.1 Why interactive maps?¶

3.1.1 Open Data and Data Legacy¶

3.2 Let's start¶

3.3 Let's get the data: the case of Antikythera¶

3.2.1 It is all about data preparation###¶

Learning a new language – decomposing the code ¶

3.2.2 Re-organising your data¶

Learning a new language – decomposing the code ¶

3.2.3 Now let's ask a research question¶

The most likely period/date?¶

step 1 - prepare the data¶

Learning a new language – decomposing the code ¶

Learning a new language – decomposing the code ¶

step 2 - Reshaping your data¶

Learning a new language – decomposing the code ¶

step 3 - Merging tables¶

Learning a new language – decomposing the code ¶

The curse of abbreviations¶

3.3 Finally, we make maps!¶

3.3.1 Maps for visualisation and interpretation¶

3.3.2 As always, start with a question¶

Geopandas¶

Learning a new language – decomposing the code ¶

3.3.3 Visualising the results¶

3.3.4 Assess the resulting map¶

Learning a new language – decomposing the code ¶

3.4 Thinking about data visualisation and map design¶

3.4.1 Map Visualisations with Folium¶

Review - basic map controls ¶

3.4.2 Visualise with a Heatmap¶

Learning a new language – decomposing the code ¶

3.4.3 Practicing the concept of layers¶