Glossary

Argument

A value passed to a function or method when it is called. Arguments are assigned to named local variables in the function body. Arguments can be further classified as either keyword or positional. In the simplest terms, keyword arguments are named (preceded by an identifier) and positional arguments are unnamed (matched by their position in the argument list). Further information.
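
The difference between positional and keyword arguments can be sketched as follows. The function name `describe` and its parameters are hypothetical, chosen only for illustration.

```python
def describe(word, count, separator=": "):
    """Return a short label built from a word and a number."""
    return f"{word}{separator}{count}"

# Positional arguments are matched to parameters by their order.
positional = describe("whale", 3)

# Keyword arguments are matched by name, so their order does not matter.
keyword = describe(count=3, word="whale", separator=" = ")

print(positional)
print(keyword)
```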


Array

A data structure consisting of an ordered collection of items of a single type i.e. an indexed list.


Bag of words

A model where text is represented as a multiset (bag) of its words. This simplification disregards features such as word order and grammar and instead focuses on term frequency.
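
A minimal sketch of a bag of words, assuming that splitting on whitespace is an acceptable way to find words. The sample sentence is made up for illustration.

```python
from collections import Counter

text = "the cat sat on the mat"

# Counter stores each distinct word with the number of times it occurs;
# the original word order is not kept.
bag = Counter(text.split())

print(bag["the"])
print(bag["cat"])
```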


Cartesian Graph

Also known as a Cartesian coordinate system. It plots points on a plane using an x-axis and a y-axis.


Cell

An input structure in a Notebook which holds either Markdown text (which is rendered) or Python code (which is executed).


Classifier

A machine-learning algorithm that determines the class of an input element based on a set of features.


Concatenation

The process of combining strings, e.g. "This string is " + "concatenating".
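
As a small illustration, strings can be concatenated with the + operator. The variable names here are hypothetical.

```python
first = "This string is "
second = "concatenated"

# The + operator joins two strings end to end.
combined = first + second

# str.join is a common alternative when combining many strings at once.
joined = " ".join(["This", "string", "is", "joined"])

print(combined)
print(joined)
```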


Concordance

A list of all words within a text and their frequency of occurrence.


Conditional Block

A section of a program where a decision is made between a series of options using conditional statements such as if, elif and else.
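
A minimal sketch of a conditional block. The function `describe_length` and its length thresholds are arbitrary choices for illustration.

```python
def describe_length(word):
    """Classify a word by its number of characters."""
    if len(word) < 4:
        return "short"
    elif len(word) < 8:
        return "medium"
    else:
        return "long"

print(describe_length("cat"))
print(describe_length("concordance"))
```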


Debug

The process of identifying and removing errors from a program.


Delimiter

A character (most typically a comma) used to specify boundaries between words or regions in plain text.
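
For example, a comma-delimited line can be split into its fields. The sample line below is made up for illustration.

```python
line = "Moby Dick,Herman Melville,1851"

# split() breaks the string wherever the delimiter occurs.
fields = line.split(",")

print(fields)
```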


Directory Tree

A tree-like structure which represents the organization and hierarchy of files within a directory. Terms such as parent and child are used to describe relationships between files and folders within this system.


Dispersion Plot

Also known as a scatter plot. A graph which uses Cartesian coordinates to display values for multiple variables of a set of data. Particularly useful for displaying positional information for words within a text.


Fork

A cloned copy of a project which is set up independently of, and separate from, the original. Often used as a development tool in open-source software, where anyone can create a fork of the program and work on it as a distinct piece of software. GitHub is an example of a tool which facilitates this sharing and development process.


Function

Put simply, functions provide functionality to a program. They are blocks of organized code whose definition begins with the keyword def, followed by the name of the function you wish to define and a pair of parentheses (which may contain parameters). The definition line ends with a colon and the code block beneath it must be indented. Further Information.

  • Function Chaining - Also known as method chaining. The practice of calling multiple methods (functions) in a single statement, where each call operates on the result of the previous one.
  • Recursive Function - A function which calls itself one or many times until it fulfils the condition that ends its recursion.
  • Calling a Function - Telling the program to execute a function.
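
The ideas above can be sketched in a few lines. The function names `shout` and `factorial` are hypothetical examples, not part of any library.

```python
def shout(text):
    """Return the text in upper case with an exclamation mark."""
    return text.upper() + "!"

# Calling a function: the name followed by parentheses and arguments.
result = shout("hello")

def factorial(n):
    """Recursive function: calls itself until n reaches 1."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

# Method chaining: each method is called on the result of the previous one.
chained = "  Hello World  ".strip().lower().split()

print(result)
print(factorial(5))
print(chained)
```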

Indentation

Empty spaces used as a formatting tool to designate blocks of code in programming. In Python, indentation is used to indicate a block of code, and four spaces are typically used. Each line of code in the block must be indented by the same number of spaces, otherwise an error may occur.


Iteration

The repetition of a procedure in the form of a loop, for example to process each item in a collection in turn or to obtain successively closer approximations to the solution of a problem.


Kernel

The core computer program of the operating system which can control all system processes. The IPython kernel runs the code in the background for Jupyter notebooks.


Lemmatization

A lemma is the canonical form of a word. Lemmatization is the process of grouping together inflected forms of a word to be analysed as a single item, i.e. determining the original lemma for the words.


List Comprehension

A method for defining and constructing lists. Particularly useful for creating a new list from an existing list using an expression with a for / in statement within a set of square brackets. Further Information.
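
A short sketch of list comprehensions. The list `words` is sample data made up for illustration.

```python
words = ["Call", "me", "Ishmael"]

# Build a new list by applying an expression to every item.
lengths = [len(w) for w in words]

# An optional if clause keeps only the items that pass a test.
long_words = [w.lower() for w in words if len(w) > 2]

print(lengths)
print(long_words)
```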


Nest

Placing objects or elements in a hierarchical arrangement, one inside another, within a collection (for example a list inside a list, or a tuple inside a tuple).


N-gram

A contiguous sequence of n units (letters, words etc.) taken from a given sequence of text in a corpus, used in language modelling. Further information
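
One way to build word n-grams is sketched below, assuming the text has already been split into tokens. The helper name `ngrams` is hypothetical and written here only for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()

bigrams = ngrams(tokens, 2)   # n = 2
trigrams = ngrams(tokens, 3)  # n = 3

print(bigrams)
```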


Normalization

A process of transforming text into a single canonical form, thereby facilitating data consistency for further processing. Examples include removing non-alphanumeric characters or changing to lower case.
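
A minimal sketch of the two examples mentioned above (lower-casing and removing non-alphanumeric characters). The function name `normalize` is hypothetical, and the pattern assumes plain English text; other languages would need a broader character class.

```python
import re

def normalize(text):
    """Lower-case the text and drop anything that is not a-z, 0-9 or whitespace."""
    lowered = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", lowered)

normalized = normalize("Hello, World! It's 2024.")

print(normalized)
```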


Object

Data which has attributes or values AND a defined behaviour.

  • Response Object - An object which holds the response returned by an HTTP request, for example when collecting data from a website or URL.

Operator

Symbols which perform arithmetic or logical computation. Some basic types of operators used in Python are arithmetic (addition +, modulus % etc), comparison (greater than >, not equal to !=, etc) or logical (and, or, not). Further Information
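
The three kinds of operator named above can be illustrated as follows. The variable names are arbitrary.

```python
total = 7 + 3        # arithmetic: addition
remainder = 17 % 5   # arithmetic: modulus (remainder after division)

is_greater = total > remainder      # comparison: greater than
is_different = total != remainder   # comparison: not equal to

both = is_greater and is_different           # logical: and
neither = not (is_greater or is_different)   # logical: not, or

print(total, remainder, both, neither)
```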


Parse

Parsing, or syntactic analysis, is a process whereby sentences or strings of words are analysed by a computer into their constituents. The result is often represented as a parse tree which illustrates the syntactic structure.


Plain Texts

Text which includes only data related to the readable material. That is, without data related to graphical presentation, formatting or other objects such as images. Plain texts are encoded using Unicode standards and are typically created in a text editor such as TextEdit on Mac or WordPad on PC. Plain texts are particularly useful for archival storage as they are not confined to proprietary software and can be opened and edited on many systems, thereby ensuring a more universal accessibility and preservation.


Regular Expressions

A sequence of characters which defines a search pattern. These patterns are useful for performing string operations such as find or find and replace.
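
A short sketch of find and find-and-replace using Python's built-in re module. The sample text and patterns are chosen only for illustration.

```python
import re

text = "Call me Ishmael. Some years ago, never mind how long."

# Find: every word that starts with a capital letter followed by lower-case letters.
capitalized = re.findall(r"\b[A-Z][a-z]+\b", text)

# Find and replace: collapse any run of whitespace into a single space.
replaced = re.sub(r"\s+", " ", "too    many   spaces")

print(capitalized)
print(replaced)
```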


Regularize

The replacement of irregular forms in syntax with regular forms.


Repository

A central location where data is stored and managed. More specifically, in revision control systems a repository stores metadata for sets of files or directory structure.


Sequence

An ordered collection of items which can be accessed by position. In Python, lists, tuples and strings are all sequences.


Sparse Matrix

Also known as a sparse array. It is a matrix (an array of data arranged in a rectangular structure of columns and rows) in which most of the elements are zero. If most of the elements were populated by values other than zero, then the matrix could be considered dense.
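
One simple way to picture a sparse matrix is to store only the non-zero elements, keyed by their row and column. This dictionary layout is an illustrative assumption; numerical libraries use their own dedicated formats.

```python
dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]

# Keep only the non-zero values, indexed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

print(sparse)
```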


Stemming

The process of reducing a word to its base form or word stem, e.g. added/adding would reduce to add.


Stop Words

A list of words which are programmed to be ignored or filtered in analysis and search queries. Lists of stop words often contain high-frequency function words such as "the", "of" and "and".
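
Filtering stop words can be sketched as below. The tiny `stop_words` set is a made-up example; real stop-word lists are much longer.

```python
stop_words = {"the", "of", "and", "a"}

tokens = "the history of the whale and the sea".split()

# Keep only the tokens that are not in the stop-word list.
filtered = [t for t in tokens if t not in stop_words]

print(filtered)
```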


String

A string is a sequence of characters (letters, numbers or symbols) which is treated as text.

  • Zero padded strings - Strings (usually representing an integer) which have been padded with leading zeros to make up a specified length.
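
Two common ways to zero pad a number in Python are shown here. The variable `chapter` and the width of three are arbitrary choices.

```python
chapter = 7

# str.zfill pads a string on the left with zeros up to the given width.
padded = str(chapter).zfill(3)

# A format specification does the same inside an f-string.
formatted = f"{chapter:03d}"

print(padded, formatted)
```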

Synset

A set of synonyms.


Training Set

A data set used to train a model in machine learning. Specific examples are chosen to fit the parameters of the model for training and the subsequent results are compared with a testing dataset.


Tuple

An immutable (fixed) sequence of objects. Tuples are created by separating values using commas within a set of parentheses, e.g. (1, 2, 3, 4, 5).
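
A brief sketch of creating a tuple and of what immutability means in practice. The variable names are hypothetical.

```python
numbers = (1, 2, 3, 4, 5)

# Items are read by position, just like a list.
first = numbers[0]

# Trying to change an item raises a TypeError because tuples are immutable.
try:
    numbers[0] = 10
except TypeError:
    changed = False
else:
    changed = True

print(first, changed)
```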


Variable

A variable stores a piece of data and gives it a specific name. Common data types which are stored in variables in Python include numbers and Boolean values.


Unicode

An industry standard in computing for encoding (representing) text. Letters, numbers and symbols are assigned unique numeric values which facilitate universal application across different programs and platforms. A fun example of the utility of Unicode is the emoji keyboard used on smartphones when sending messages. The universal nature of Unicode allows emoji to be accurately represented on most modern phones regardless of their differing operating systems (such as Android, iOS and BlackBerry). Further information
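
The idea of each character having a unique numeric value can be seen with Python's built-in functions. The character used here is an arbitrary example.

```python
letter = "\u00e9"  # the letter e with an acute accent

# ord() gives the numeric value (code point) assigned to a character.
code_point = ord(letter)

# chr() goes the other way, from a code point back to the character.
same_letter = chr(code_point)

# Encoding turns the text into bytes for storage; decoding reverses it.
utf8_bytes = letter.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(code_point, utf8_bytes)
```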