# The Data Farm: Science Week Presentation

## Learning from Data

### Neil D. Lawrence and the Sheffield Machine Learning Research Group

#### 5th March 2014

This notebook has been made available as part of our Open Data Science agenda. If you want to read more about this agenda there is a position paper/blog post available on it here.

This session is about 'learning from data'. How do we take the information on the internet and make sense of it? The answer, as you might expect, is by using computers and mathematics. Luckily we also have a suite of tools to help. The first tool is a way of programming in Python that really facilitates interaction with data. It is known as the "IPython Notebook", or more recently as the "Jupyter Project".

### Welcome to the IPython Notebook

The notebook is a great way of interacting with computers. In particular it allows me to integrate text descriptions, maths and code all together in the same place. For me, that's what my research is all about. I try to take concepts that people can describe, then I try to capture the essence of the concept in a mathematical model. Then I try to implement the model on a computer, often combining it with data, to do something fun, useful or, ideally, both.

For the Science Week lecture on "The Data Farm" we are looking at recommender systems.

## Is Our Map Enough? Are Our Data Enough?

Is two dimensions really enough to capture the complexity of humans and their artforms? Is that how shallow we are? On second thoughts, don't answer that. We would certainly like to think that we need more than two dimensions to capture our complexity.

Let's extend our books and libraries analogy further: consider how we should place books that have a historical timeframe as well as some geographical location. Do we really want books from the Second World War to sit alongside books from the Roman Empire? Books on the American invasion of Sicily in 1943 are perhaps less related to books about Carthage than those books that study the Jewish Revolt from 66-70 (in the Roman Province of Judaea; more history!). So books that relate to subjects which are closer in time should probably be stored together. However, a student of 'rebellion against empire' may also be interested in the relationship between the Jewish Revolt of 66-70 and the Indian Rebellion of 1857 (against the British Empire), nearly 1800 years later (they might also like the Star Wars movies ...). Whilst the technologies involved in these revolts would be different, they still involve people (who we argued could all be summarised by sets of numbers) and the psychology of those people is shared: a rebellious nation against its imperial masters, triggered by misrule with a religious and cultural background.

To capture such nuances we would need further dimensions in our latent representation. But are further dimensions justified by the amount of data we have? Can we really understand the facets of a film that only has at most three or four ratings? One answer is to collect more data to justify extra dimensions.
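The trade-off between dimensions and data can be made concrete with a rough count. The sketch below assumes that every user and every film is placed at a point in the same latent space, so each one needs one number per dimension. The function names and the toy sizes are hypothetical, chosen only for illustration, and the "at least as many ratings as dimensions" rule is a crude rule of thumb, not a formal result.

```python
def latent_parameter_count(num_users, num_films, num_dims):
    """Numbers to be learned if each user and each film gets num_dims coordinates."""
    return (num_users + num_films) * num_dims


def film_is_pinned_down(num_ratings_for_film, num_dims):
    """Rule of thumb (an assumption): a film needs at least one rating per dimension."""
    return num_ratings_for_film >= num_dims


# Hypothetical toy sizes, not taken from any real data set.
num_users, num_films, num_ratings = 100, 50, 400

for num_dims in (2, 5, 10):
    params = latent_parameter_count(num_users, num_films, num_dims)
    print(
        f"{num_dims} dimensions: {params} numbers to learn "
        f"from {num_ratings} ratings; "
        f"film with 3 ratings pinned down? {film_is_pinned_down(3, num_dims)}"
    )
```

With only three ratings, a film can plausibly be located on a two-dimensional map, but not in a ten-dimensional space: there are more unknown coordinates than observations to fix them.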