+1 |
A vote in favor of something.
|
|
|
accuracy |
Accuracy measures how close results are to the true or known value. For a classification model, it is the proportion of correct predictions among the total number of cases, calculated as Accuracy = (TP+TN)/(TP+TN+FP+FN), where:
- TP = True Positive
- FP = False Positive
- FN = False Negative
- TN = True Negative
Accuracy and precision are both ways to measure results. While accuracy measures closeness to the true value, precision measures how close results are to one another.
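As an illustration, the accuracy formula can be sketched in Python. The function name `accuracy` and the counts are hypothetical, chosen only for this example.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of correct predictions among all cases."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one case is required")
    return (tp + tn) / total


# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, and 10 false negatives.
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
print(example_accuracy)  # 0.85
```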
|
|
|
actual result (of test) |
The outcome or value of performing a statistical test. If this matches the expected result, the test passes; if the two are different, the test fails.
|
|
|
aggregation |
To combine many values into one, e.g., by summing a set of numbers or linking a set of strings.
|
|
|
aggregation function |
A function that combines many values into one, such as sum or max .
|
|
|
agile development |
A software development methodology that emphasizes lots of small steps and continuous feedback instead of up-front planning and long-term scheduling. Exploratory programming is often agile.
|
|
|
algorithm |
An algorithm is a set of steps, instructions, or rules followed to accomplish a specific task. In computer science, an algorithm is a set of instructions in a computer program that solves a computational problem.
|
|
|
aliasing |
To have two or more references to the same thing, such as a data structure in memory or a file on disk.
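A minimal Python sketch of aliasing, using made-up variable names: `alias` refers to the same list as `first`, while `independent` is a separate copy.

```python
first = [1, 2, 3]
alias = first              # a second reference to the same list
independent = list(first)  # a separate copy with the same contents

alias.append(4)            # the change is visible through both names

print(first)           # [1, 2, 3, 4]
print(independent)     # [1, 2, 3]
print(alias is first)  # True
```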
|
|
|
Anaconda |
Anaconda is a [software distribution](#software_distribution) of R and Python. It is also a [repository](#repository) of open-source Python and R programs for data science, packaged using the conda [package manager](#package_manager). Anaconda also provides Anaconda Navigator, a suite of desktop tools including an [IDE](#ide) and the Jupyter Notebook application.
|
|
|
anchor |
In a regular expression, a symbol that fixes a position without matching characters. ^ matches the start of the line, while $ matches the end of the line and \b matches a break between word and non-word characters.
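The three anchors can be tried out with Python's `re` module. The sample text is invented for illustration.

```python
import re

text = "cat concat cat"

all_matches = re.findall(r"cat", text)       # every "cat", including the one inside "concat"
start_matches = re.findall(r"^cat", text)    # only a "cat" at the start of the line
end_matches = re.findall(r"cat$", text)      # only a "cat" at the end of the line
word_matches = re.findall(r"\bcat\b", text)  # only "cat" as a whole word

print(len(all_matches), len(start_matches), len(end_matches), len(word_matches))  # 3 1 1 2
```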
|
|
|
artificial intelligence (AI) |
Intelligence demonstrated by machines, as opposed to humans or other animals. AI can be exhibited through perceiving, synthesizing, and inferring information. Example tasks include natural language processing, computer vision, and [machine learning](#machine_learning).
|
|
|
autocorrelation |
The degree of similarity between observations in the same series but separated by a time interval (known as the “lag”). Autocorrelation analysis can be used to gain insight into time series datasets by detecting repeating patterns that may be partially concealed by random noise, among other uses.
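One common way to estimate autocorrelation at a given lag is sketched below. This is a simple sample estimator under the assumption of evenly spaced observations; the function name and the alternating test signal are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("series is constant")
    # Compare each observation with the one `lag` steps later.
    covariance_sum = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return covariance_sum / variance_sum


signal = [1, 0, 1, 0, 1, 0, 1, 0]  # repeats every two steps
print(autocorrelation(signal, 1))  # -0.875
print(autocorrelation(signal, 2))  # 0.75
```

The negative value at lag 1 and the positive value at lag 2 reflect the two-step repeating pattern.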
|
|
|
Bayes' Theorem |
An equation for calculating the probability that something is true if something related to it is true. If P(X) is the probability that X is true and P(X|Y) is the probability that X is true given Y is true, then P(X|Y) = P(Y|X) * P(X) / P(Y).
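A worked example may help. The probabilities below are invented: X stands for "has a condition" and Y for "tests positive". P(Y) is obtained from the law of total probability, which is an extra step not stated in the definition.

```python
def posterior(p_y_given_x: float, p_x: float, p_y: float) -> float:
    """Bayes' Theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    if p_y <= 0:
        raise ValueError("P(Y) must be positive")
    return p_y_given_x * p_x / p_y


# Hypothetical inputs.
p_x = 0.01              # P(X): 1% of people have the condition
p_y_given_x = 0.9       # P(Y|X): the test detects 90% of real cases
p_y_given_not_x = 0.05  # P(Y|not X): 5% false-positive rate

# P(Y) = P(Y|X) P(X) + P(Y|not X) P(not X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

p_x_given_y = posterior(p_y_given_x, p_x, p_y)
print(round(p_x_given_y, 3))  # 0.154
```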
|
|
|
binary large object |
Data that is stored in a database without being interpreted in any way, such as an audio file. The term is also now used to refer to data transferred over a network or stored in a version control repository as uninterpreted bits.
|
|
|
breadcrumbs |
A set of supplementary navigational links included in many websites, usually placed at the top of the page. Breadcrumbs show the users where the current page lies in the website; the term comes from a fairy tale in which children left a trail of breadcrumbs behind themselves so that they could find their way home.
|
|
|
browser cache |
A place where web browsers keep copies of previously retrieved files (web pages, data files) in order to save time when they’re requested again. Sometimes, issues may arise if there is a newer version of the file online, but the browser doesn’t notice it.
|
|
|
byte |
A unit of digital information that typically consists of eight binary digits, or bits.
|
|
|
caching |
To save a copy of some data in a local cache to make future access faster.
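In Python, one way to cache function results is the standard library's `functools.lru_cache`. The function `slow_square` is a hypothetical stand-in for an expensive computation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    """Stand-in for an expensive computation."""
    return n * n


slow_square(12)  # computed and stored in the cache
slow_square(12)  # answered from the cache

info = slow_square.cache_info()
print(info.hits, info.misses)  # 1 1
```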
|
|
|
camel case |
A style of writing code that involves naming variables and objects with no space, underscore (_), dot (.), or dash (-) characters, and capitalizing the start of each word. The first word may be capitalized (CalculateSum) or left in lower case (findPattern).
|
|
|
class imbalance |
Class imbalance refers to the problem in machine learning where there is an unequal distribution of classes in the dataset.
|
|
|
collection |
An abstract data type that groups an arbitrary, variable number of data items (possibly zero), to allow processing them in a uniform fashion. Common examples of collections are lists, variable-size arrays and sets. Fixed-size arrays are usually not considered collections.
|
|
|
compute shader |
A general-purpose shader program for use in parallel processing. Often used for machine learning, simulations, and other fields which benefit from parallel computation.
|
|
|
Concatenate |
In general programming, to join two strings or collections together. In terms of data tables (for example, a Python pandas DataFrame or an R tibble), to append or stack two tables end to end, either by columns (axis=1) or by rows (axis=0).
|
|
|
conda |
A [package manager](#package_manager) and environment management system, particularly popular for Python programs.
|
|
|
confusion matrix |
An N×N matrix that describes the performance of a classification model, where N is the number of classes or outputs. Each row in the matrix represents the instances of an actual class and each column represents a predicted class. For a binary classification model, the confusion matrix gives the True Positives (TP), False Negatives (FN), False Positives (FP) and True Negatives (TN) in the top-left, top-right, bottom-left and bottom-right cells, respectively. The table can be used to calculate Accuracy, Sensitivity and Specificity, amongst other measures of the model.
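A binary confusion matrix can be sketched as follows, with rows for actual classes and columns for predicted classes. The labels are invented, and the sensitivity and specificity formulas (TP/(TP+FN) and TN/(TN+FP)) are the standard ones rather than something stated in this entry.

```python
def confusion_matrix(actual, predicted):
    """Return [[TP, FN], [FP, TN]] for boolean labels."""
    tp = fn = fp = tn = 0
    for is_positive, predicted_positive in zip(actual, predicted):
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


# Hypothetical labels for eight cases.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

matrix = confusion_matrix(actual, predicted)
(tp, fn), (fp, tn) = matrix
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(matrix)                    # [[3, 1], [1, 3]]
print(sensitivity, specificity)  # 0.75 0.75
```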
|
|
|
constant |
A constant in programming is a name associated with a value that never changes during the execution of a program. You can access a constant's value but cannot change it, as opposed to a variable.
|
|
|
continuous integration |
A software development practice in which changes are automatically merged as soon as they become available.
|
|
|
control flow |
The logical flow through a program’s code. May be linear (i.e. just a series of commands), but may also include loops or conditional execution (i.e. if a condition is met).
|
|
|
convolutional neural network (CNN) |
A class of artificial neural network that is primarily used to analyze images. A CNN has layers that perform convolutions, where a filter is shifted over the data, instead of the general matrix multiplications that we see in fully connected neural network layers.
|
|
|
data collision |
Occurs when two or more devices or nodes try to transmit signals at the same time on the same network. Similarly, a data collision can occur when hashing if two distinct pieces of data have the same hash value.
|
|
|
data frame |
A two-dimensional data structure for storing tabular data in memory. Rows represent records and columns represent variables.
|
|
|
data structure |
A format for the organisation, management, and efficient access of data. Typically it will characterise a set of data values and their representation (or encoding), the relationships between values, and ways to access or manipulate those data, such as reading, altering, or writing.
|
|
|
debug |
In a computer environment ‘debug’ refers to the process of finding and resolving errors (also known as ‘bugs’) within computer programs or systems.
|
|
|
decrement |
A unary operation that decreases the value of a variable, usually by 1.
|
|
|
degrees of freedom |
In statistics, the degrees of freedom (often “DF”) is a measure of how much independent information, in the form of data and calculations, has been combined to produce a given statistical parameter. Put another way, the DF is the number of values that are free to vary in the calculation of a given statistical parameter. For a statistic calculated from data which are independent (i.e., the values are uncorrelated), the DF can generally be estimated as the sample size minus the number of individual parameters calculated to obtain the final statistic.
|
|
|
directory |
An item within a filesystem that can contain files and other directories. Also known as a folder.
|
|
|
Electronic mail |
Electronic mail is a method for delivering messages between people over a computer network. Messages are sent via an SMTP server and retrieved using either an IMAP or POP server.
|
|
|
Emacs (editor) |
A text editor that is popular among Unix programmers.
|
|
|
encoding |
The process of putting a sequence of characters such as letters, numbers, punctuation, and certain symbols, into a specialized format for efficient transmission or storage.
|
|
|
epoch (deep learning) |
In deep learning, an epoch is one cycle in the training process in which all the training data has been fed to the algorithm once. Training a deep neural network usually consists of multiple epochs.
|
|
|
false negative |
Data points which are actually true but incorrectly predicted as false.
|
|
|
false positive |
Data points which are actually false but incorrectly predicted as true.
|
|
|
FASTA |
A file format for storing amino acid or genomic sequence information. Information for each sequence is broken up into a block of 2 lines. Line 1 contains information about the sequence and begins with a greater than symbol, ‘>’. Line 2 contains the actual amino acid or genomic sequence using single-letter codes.
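A minimal parser for the two-line layout described above might look like this. The function name and sample records are hypothetical, and the sketch assumes each sequence sits on a single line; files that wrap a sequence over several lines would need extra handling.

```python
def parse_fasta(text):
    """Map each header (without the leading '>') to its sequence."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = {}
    # Pair line 1 with line 2, line 3 with line 4, and so on.
    for header, sequence in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header!r}")
        records[header[1:]] = sequence
    return records


sample = ">seq1 example\nACGT\n>seq2\nMKV\n"
records = parse_fasta(sample)
print(records)  # {'seq1 example': 'ACGT', 'seq2': 'MKV'}
```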
|
|
|
FASTQ |
A file format for storing genomic sequence information and the corresponding quality scores. Information for each sequence is broken up into a block of four lines. Line 1 contains information about the sequence and begins with ‘@’. Line 2 contains the actual genomic sequence using single-letter codes to represent nucleotides. Line 3 is a separator that begins with a + . Line 4 has a string of quality characters for each base in the genomic sequence.
|
|
|
Feature |
An individual characteristic or property of a phenomenon that is measurable (e.g. length, height, number of petals) and used as the input to a model. Finding or selecting features that are highly independent and discriminatory is a fundamental part of classification.
|
|
|
fragment shader |
The shader stage in the rendering pipeline designated towards calculating colours for each fragment on the screen. For each pixel covered by a primitive, a fragment is generated. All fragments for each pixel will have their colours combined based on depth and opacity after the fragment shader stage is complete.
|
|
|
functional programming |
A style of programming in which data is transformed through successive application of functions, rather than by using control structures such as loops. In functional programming, there must be a direct relationship between the input to a function and the output produced by the function, meaning the result should not be affected by the current values of global variables or other parts of the global program state. It also requires that functions do not produce side effects, meaning they do not modify the global program state, or do anything other than computing the return value, such as writing output to a log file, or printing to the console.
|
|
|
Geometric Mean |
Calculated from a set of n numbers by first computing the product of those numbers, and then computing the n-th root of the result. In contrast to the arithmetic mean, which measures central tendency in an “additive” way, the geometric mean measures central tendency in a “multiplicative” way, and hence is often appropriate when estimating average rates of change or some other multiplicative constant.
|
|
|
geometry shader |
The shader stage in the rendering pipeline designated towards processing primitives. Not to be confused with tessellation shaders, geometry shaders are focused on modifying the shape of primitives to create new results. For example, pixels may be converted into particles using a geometry shader.
|
|
|
ggplot2 |
A package in R that implements a layered grammar of graphics for generating plots. It is a popular alternative to plotting with base R and part of the tidyverse.
|
|
|
GNU Operating System |
“GNU” is an operating system that is free software. GNU is a recursive acronym for “GNU's Not Unix!”. The GNU operating system consists of GNU packages as well as free software released by third parties.
|
|
|
graphical user interface |
A user interface that relies on windows, menus, pointers, and other graphical elements, as opposed to a command-line interface or voice-driven interface.
|
|
|
Graphics Processing Unit |
A specialized processor designed to run many instances of a single program in parallel. Originally designed for use in graphics, but also used for general computation in the form of compute shaders.
|
|
|
Harmonic Mean |
Calculated from a set of n numbers by first computing the sum of the reciprocals of those numbers, and then dividing n by the resulting sum. Alternatively, it can be computed as the reciprocal of the arithmetic mean of the reciprocal values. Similarly to the geometric mean, the harmonic mean is often used as an alternative measure of central tendency to the usual arithmetic mean when estimating average rates of change or some other multiplicative constant. For a set of positive numbers that are not all equal, min < HM < GM < AM < max, where min is the minimum value, max is the maximum value, and HM, GM, and AM are the harmonic, geometric, and arithmetic means respectively.
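The ordering of the three means can be checked with Python's `statistics` module (`geometric_mean` requires Python 3.8 or later). The values are made up for illustration.

```python
from statistics import geometric_mean, harmonic_mean, mean

values = [1, 2, 4]  # positive and not all equal

am = mean(values)            # (1 + 2 + 4) / 3
gm = geometric_mean(values)  # cube root of 1 * 2 * 4
hm = harmonic_mean(values)   # 3 / (1/1 + 1/2 + 1/4)

print(round(hm, 3), round(gm, 3), round(am, 3))  # 1.714 2.0 2.333
print(min(values) < hm < gm < am < max(values))  # True
```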
|
|
|
hidden layer (deep learning) |
A hidden layer in a neural network refers to the layers of neurons that are not directly connected to input or output. The layers are “hidden” because you do not directly observe their input and output values.
|
|
|
high performance computing |
When computing power is drawn from multiple powerful processors that work together in parallel, rather than from a single desktop computer, laptop, or work station. This significantly speeds up analysis and reduces computing time, which allows people to work with big data.
|
|
|
icon |
In computing, an icon is a graphic symbol that is displayed on a computer screen to help a user navigate the computer system.
|
|
|
immutable type |
A type whose values cannot change over time. An object of this type cannot be changed and its state cannot be modified after it is created.
|
|
|
increment |
A unary operation that increases the value of a variable, usually by 1.
|
|
|
independent variable |
The factor that you purposely change or control in order to see what effect it has on the dependent variable.
|
|
|
index |
A number that identifies the position of an element in an array or other sequence. Each element of an array can be accessed through its index.
|
|
|
infinite loop |
A loop where the exit condition is never met, so the loop continues to repeat itself. Often a programming error.
|
|
|
interface |
A ubiquitously used phrase in computing that describes a point of contact. This could be a user interface (e.g. graphical user interface or command line), the interface of an object with the rest of the code or how a program can interact with web services through an API.
|
|
|
Internet Message Access Protocol |
A standard internet protocol used by email clients to retrieve messages from an email server. Messages are left on the server so that they can be accessed from multiple email clients.
|
|
|
invariant |
Something that must be true at all times inside of a program or during the lifecycle of an object. Invariants are often expressed using assertions. If an invariant expression is not true, this is indicative of a problem, and may result in failure or early termination of the program.
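An invariant expressed with an assertion might look like the sketch below. The `Account` class is hypothetical, and Python skips `assert` statements when run with the `-O` flag, so this is a debugging aid rather than input validation.

```python
class Account:
    """Hypothetical account whose balance must never be negative."""

    def __init__(self, balance: int = 0):
        self.balance = balance
        self._check_invariant()

    def _check_invariant(self):
        # The invariant: true after construction and after every operation.
        assert self.balance >= 0, "invariant violated: negative balance"

    def withdraw(self, amount: int):
        self.balance -= amount
        self._check_invariant()


account = Account(10)
account.withdraw(4)
print(account.balance)  # 6
```

Calling `account.withdraw(100)` at this point would raise an `AssertionError`, signalling that the program has reached a state it should never be in.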
|
|
|
Java |
Java is a high-level, cross-platform, object-oriented and general-purpose programming language. Programs written in Java will run on any platform that supports the Java software platform without having to be recompiled. This feature gave rise to the slogan “Write Once Run Anywhere”. Java syntax is similar to that of C and C++.
|
|
|
JupyterLab |
A next-generation interface to Jupyter Notebooks. JupyterLab is open-source, web-based and has a multiple-document interface which supports working with multiple notebooks and Markdown files in a single browser tab. JupyterLab also supports opening terminal/console windows in the browser.
|
|
|
learning rate (deep learning) |
In artificial neural networks, the learning rate is a hyper-parameter that determines the pace at which the network adjusts the weights to move down the loss gradient. A large learning rate can speed up training, but the network might overshoot and miss the minimum. A small learning rate will overshoot less, but will be slower. It can also get more easily stuck in local minima.
|
|
|
loop |
A structure that repeatedly executes a section of code until a specific exit condition is met.
|
|
|
lsof |
UNIX command to see the list of open files being used by processes.
|
|
|
Masking |
[TODO] to be defined
|
|
|
minimum spanning tree |
A minimum spanning tree is a set of edges that connects all of the nodes in a graph while minimizing the total weight of the included edges. It is unique when all edge weights are distinct; otherwise a graph may have several minimum spanning trees. The term may refer either to an algorithm that calculates the structure or to the resulting structure itself.
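One well-known way to compute a minimum spanning tree is Kruskal's algorithm, sketched here with a small union-find structure. The entry itself does not name an algorithm, so this choice, the function name, and the example graph are all assumptions for illustration.

```python
def minimum_spanning_tree(node_count, edges):
    """Kruskal's algorithm; edges are (weight, u, v) with nodes 0..node_count-1."""
    parent = list(range(node_count))

    def find(node):
        # Follow parent links to the representative of the node's component.
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical graph with four nodes.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 2, 3), (3, 1, 3)]
tree = minimum_spanning_tree(4, edges)
print(tree)                       # [(1, 0, 1), (2, 1, 2), (3, 1, 3)]
print(sum(w for w, _, _ in tree))  # 6
```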
|
|
|
mode |
The value, or values, that occurs most frequently in a dataset.
|
|
|
model |
A specification of the mathematical relationship between different variables.
|
|
|
module |
A reusable software package, also often called a library.
|
|
|
Monte Carlo method |
Any method or algorithm that relies on artificially-injected randomness.
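A classic illustration is estimating π by sampling random points in the unit square and counting how many fall inside the quarter circle. The function name, sample count, and seed are arbitrary choices for this sketch.

```python
import random


def estimate_pi(samples: int, seed: int = 42) -> float:
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / square area = pi / 4.
    return 4 * inside / samples


pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

The estimate gets closer to π as the number of samples grows, but only slowly: the error shrinks roughly with the square root of the sample count.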
|
|
|
moving average |
The mean of each set of several consecutive values from time series data.
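A simple (unweighted) moving average can be sketched as follows; the function name and the series are hypothetical.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


series = [2, 4, 6, 8, 10]
smoothed = moving_average(series, 3)
print(smoothed)  # [4.0, 6.0, 8.0]
```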
|
|
|
multi-threaded |
Capable of performing several operations simultaneously. Multi-threaded programs are usually more efficient than single-threaded ones, but also harder to understand and debug.
|
|
|
mutable type |
An object of this type may be changed and its state can be modified after it is created.
|
|
|
mutation |
Changing data in place, such as modifying an element of an array or adding a record to a database.
|
|
|
n-gram |
A sequence of $N$ items, typically words in natural language. For example, a trigram is a sequence of three words. N-grams are often used as input in computational linguistics.
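Extracting n-grams from a list of words takes only a sliding window. The helper name and sentence below are invented for illustration.

```python
def ngrams(items, n):
    """All runs of n consecutive items, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "the quick brown fox jumps".split()
trigrams = ngrams(words, 3)
print(trigrams)
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```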
|
|
|
n-th root |
The n-th root of a positive number x is the number that when multiplied by itself n times produces x. This can commonly be calculated by raising x to the power of the reciprocal of n.
|
|
|
NA |
A special value used to represent data that is not available.
|
|
|
naive Bayes classifier |
Any classification algorithm based on Bayes’ Theorem that assumes every feature being classified is independent of every other feature.
|
|
|
name collision |
The ambiguity that arises when two or more things in a program that have the same name are active at the same time. Most languages use namespaces to prevent such collisions.
|
|
|
named argument |
A function parameter that is given a value by explicitly naming it in a function call.
|
|
|
namespace |
A collection of names in a program that exists in isolation from other namespaces. Each function, object, class, or module in a program typically has its own namespace so that references to “X” in one part of a program do not accidentally refer to something called “X” in another part of the program. Scope is a distinct, but related, concept.
|
|
|
Nano (editor) |
A very simple text editor found on most Unix systems.
|
|
|
natural language processing |
See computational linguistics.
|
|
|
negative selection |
To specify the elements of a vector or other data structure that are not desired by negating their indices.
|
|
|
neural network |
One of a large family of algorithms for identifying patterns in data by mimicking the way neurons interact. A neural network consists of one or more layers of nodes, each of which is connected to nodes in the preceding and subsequent layer. If enough of a node’s inputs are active, that node activates as well.
|
|
|
node |
An element of a graph that is connected to other nodes by edges. Nodes typically have data associated with them, such as names or weights.
|
|
|
non-blocking execution |
To allow a program to continue running while an operation is in progress. For example, many systems support non-blocking execution for file I/O so that the program can continue doing work while it waits for data to be read from or written to the filesystem (which is typically much slower than the CPU).
|
|
|
non-parametric (statistics) |
A branch of statistical tests which do not assume a known distribution of the population which the samples were taken from (Kruskal-Wallis and Dunn test are examples of non-parametric tests).
|
|
|
normal distribution |
A continuous random distribution with a symmetric bell-curve shape. As datasets get larger, some of their most important statistical properties can be modeled using a normal distribution.
|
|
|
NoSQL database |
Any database that does not use the relational model. The name comes from the fact that such databases do not use SQL as a query language.
|
|
|
null |
A special value used to represent a missing object. Null is not the same as NA, and neither is it the same as an empty vector.
|
|
|
null hypothesis |
The claim that any patterns seen in data are entirely due to chance. Other claims (e.g., “X causes Y”) must be much more likely than the null hypothesis in order to be substantiated.
|
|
|
nullary expression |
An “expression” with no arguments, such as the value 3.
|
|
|
numpy |
An open source Python package for working with N-dimensional arrays, vectors, and matrices, with an approach and syntax similar to MATLAB. It provides functions and sophisticated operations focused on multidimensional arrays, linear algebra, Fourier transforms, and random number generation.
|
|
|
object |
In object-oriented programming, a structure that contains the data for a specific instance of a class. The operations the object is capable of are defined by the class’s methods.
|
|
|
object-oriented programming |
A style of programming in which functions and data are bound together in objects that only interact with each other through well-defined interfaces.
|
|
|
objective function |
A function of one or more variables used to measure or compare the goodness of different solutions in an optimization problem.
|
|
|
observation |
A value or property of a specific member of a population.
|
|
|
off-by-one error |
A common error in programming in which the program refers to element i of a structure when it should refer to element i-1 or i+1 , or processes N elements when it should process N-1 or N+1 .
|
|
|
open license |
A license that permits general re-use, such as the MIT License or GPL for software and CC-BY or CC-0 for data, prose, or other creative outputs.
|
|
|
open science |
A generic term for making scientific software, data, and publications generally available.
|
|
|
OpenRefine |
A standalone, open source desktop application for data cleanup and transformation, also known as data wrangling.
|
|
|
operating system |
A program that provides a standard interface to whatever hardware it is running on. Theoretically, any program that only interacts with the operating system should run on any computer that operating system runs on.
|
|
|
optional parameter |
A parameter that does not have to be given a value when a function is called. Most programming languages require programmers to define default values for optional parameters, or assign them a special value automatically. Arguments passed to optional parameters will often be specified using keyword arguments.
|
|
|
ORCID |
An Open Researcher and Contributor ID that uniquely and persistently identifies an author of scholarly works. ORCIDs are for people what DOIs are for documents.
|
|
|
orthogonality |
The ability to use various features of software in any order or combination. Orthogonal systems tend to be easier to understand, since features can be combined without worrying about unexpected interactions.
|
|
|
outlier |
Extreme values that might be measurement or recording errors, or might actually be rare events. Outliers are sometimes ignored when doing statistics, or handled or visualized separately.
|
|
|
overfitting |
Fitting a model so closely to one dataset that it does not generalize to others.
|
|
|
p value |
The probability of obtaining a result at least as strong as the one observed if the null hypothesis is true (i.e., if variation is purely due to chance). The lower the p-value, the more likely it is that something other than chance is having an effect.
|
|
|
package |
A collection of code, data, and documentation that can be distributed and re-used. Also referred to in some languages as a library or module.
|
|
|
package manager |
A program that does its best to keep track of the different software installed on a computer and their dependencies on one another.
|
|
|
pager |
A program that displays a few lines of text at a time.
|
|
|
pandas |
An open source Python package that offers fast, flexible, and expressive data structures to make working with structured data and time series easy and intuitive. It is a powerful tool for data analysis and data manipulation.
|
|
|
parameter |
A variable specified in a function definition whose value is passed to the function when the function is called. Parameters and arguments are distinct, but related concepts. Parameters are variables and arguments are the values assigned to those variables.
|
|
|
parametric (statistics) |
A branch of statistical tests which assume a known distribution of the population which the samples were taken from (ANOVA and Student’s t-tests are examples of parametric tests).
|
|
|
parent (in a tree) |
A node in a tree that is above another node (called a child). Every node in a tree except the root node has a single parent.
|
|
|
parent class |
In object-oriented programming, the class from which a subclass (called the child class) is derived.
|
|
|
parent directory |
The directory that contains another directory of interest. Going from a directory to its parent, then its parent, and so on eventually leads to the root directory of the filesystem.
|
|
|
parse |
To translate the text of a program or web page into a data structure in memory that the program can then manipulate.
|
|
|
pass (a test) |
A test passes if the actual result matches the expected result.
|
|
|
patch |
A single file containing a set of changes to a set of files, separated by markers that indicate where each individual change should be applied.
|
|
|
path (in filesystem) |
A string that specifies a location in a filesystem. In Unix, the directories in a path are joined using / .
|
|
|
pattern rule |
A generic build rule that describes how to update any file whose name matches a pattern. Pattern rules often use automatic variables to represent the actual filenames.
|
|
|
Peanuts |
An American comic strip by Charles M. Schulz which has inspired the names of R versions.
|
|
|
perceptron |
The simplest kind of [neural network](#neural_network), which approximates a single neuron with N binary inputs by computing a weighted sum of its inputs and firing if that value is zero or greater.
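The firing rule can be sketched in a few lines. The `bias` term is an addition not mentioned in the definition (it shifts the threshold away from zero), and the weights below were picked by hand so that the unit behaves like a logical AND.

```python
def perceptron(inputs, weights, bias=0.0):
    """Fire (return 1) if the weighted sum plus bias is zero or greater."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


# Hand-picked, hypothetical parameters that implement AND on two binary inputs.
weights = [1.0, 1.0]
bias = -1.5

outputs = [perceptron([a, b], weights, bias) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```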
|
|
|
permalink |
Short for “permanent link”, a URL that is intended to last forever.
|
|
|
phony target |
A build target that does not correspond to an actual file. Phony targets are often used to store commonly used commands in a Makefile.
|
|
|
Pip Install Packages |
The standard package manager for Python. pip enables the download and installation of Python packages not included in the standard library.
|
|
|
pipe (in the Unix shell) |
The | used to make the output of one command the input of the next.
|
|
|
pipe operator |
The %>% used to make the output of one function the input of the next.
|
|
|
pivot table |
A technique for summarizing tabular data in which each cell represents the sum, average, or other function of the subset of the original data identified by the cell’s row and column heading.
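A pivot table that sums values by row and column heading can be sketched with nested dictionaries. The sales records are invented; in practice a data frame library would usually do this.

```python
from collections import defaultdict

# Hypothetical records: (region, product, units sold).
sales = [
    ("north", "apples", 3),
    ("north", "pears", 2),
    ("south", "apples", 5),
    ("north", "apples", 4),
]

# pivot[row heading][column heading] holds the sum for that cell.
pivot = defaultdict(lambda: defaultdict(int))
for region, product, units in sales:
    pivot[region][product] += units

print(pivot["north"]["apples"])  # 7
print(pivot["south"]["apples"])  # 5
```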
|
|
|
pointcloud |
A set of discrete data points in three-dimensional space.
|
|
|
Poisson distribution |
A discrete random distribution that expresses the probability of $N$ events occurring in a fixed time interval if the events occur at a constant rate, independent of the time since the last event.
|
|
|
positional argument |
An argument to a function that gets its value according to its place in the function’s definition, as opposed to a named argument that is explicitly matched by name.
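The difference is easy to see in a Python call. The function `scale` is hypothetical; its `offset` parameter has a default value, so it is also an optional parameter.

```python
def scale(value, factor, offset=0):
    """Multiply value by factor, then add offset."""
    return value * factor + offset


by_position = scale(2, 3)            # value=2, factor=3 matched by position
by_name = scale(factor=3, value=2)   # same call, matched by name in any order
mixed = scale(2, 3, offset=1)        # positional first, then a named argument

print(by_position, by_name, mixed)  # 6 6 7
```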
|
|
|
Post Office Protocol |
A standard internet protocol used by email clients to retrieve messages from an email server. Messages are generally downloaded and deleted from the server, making it difficult to access messages from multiple email clients. POP3 (version 3) is the version of POP in common use.
|
|
|
posterior distribution |
Probability distribution summarizing the prior distribution and the likelihood function.
|
|
|
pothole case |
A naming style that separates the parts of a name with underscores, as in first_second_third .
|
|
|
preamble |
A series of commands, either placed in the main document, or kept in a separate document, that are included prior to the \begin{document} command. The preamble defines the type of the document, along with other formatting attributes and parameters. This is also the section of the document where packages are added using the command \usepackage{} to enable additional functionalities, and where custom commands can be defined.
|
|
|
prerequisite |
Something that a build target depends on.
|
|
|
principal component analysis |
An algorithm that finds the axis along which data varies most, then the axis that accounts for the largest part of the remaining variation, and so on.
|
|
|
prior distribution |
The probability distribution that is assumed as a starting point when using Bayes’ Theorem and used to construct a more accurate posterior distribution.
|
|
|
probability distribution |
A mathematical description of all possible outcomes of a random event, and the probability of each occurring.
|
|
|
procedural generation |
A method of generating data algorithmically rather than manually. Typically this is done to reduce file sizes, increase the overall amount of content, and/or incorporate randomness at the expense of processing power.
|
|
|
procedural programming |
A style of programming in which functions operate on data that is passed into them. The term is used in contrast to other programming styles, such as object-oriented programming and functional programming.
|
|
|
process |
An operating system’s representation of a running program. A process typically has some memory, the identity of the user who is running it, and a set of connections to open files.
|
|
|
product manager |
The person responsible for defining what features a product should have.
|
|
|
production code |
Software that is delivered to an end user. The term is used to distinguish such code from test code, deployment infrastructure, and everything else that programmers write along the way.
|
|
|
project manager |
The person responsible for ensuring that a project moves forward.
|
|
|
prompt |
The text printed by an REPL or shell that indicates it is ready to accept another command. The default prompt in the Unix shell is usually $ , while in Python it is >>> , and in R it is > .
|
|
|
protocol |
Any standard specifying how two pieces of software interact. A network protocol such as HTTP defines the messages that clients and servers exchange on the World-Wide Web; object-oriented programs often define protocols for interactions between objects of different classes.
|
|
|
provenance |
A record of where data originally came from and what was done to process it.
|
|
|
pseudo-random number |
A value generated in a repeatable way that resembles the true randomness of the universe well enough to fool observers.
|
|
|
pseudo-random number generator |
A function that can generate pseudo-random numbers.
|
|
|
pull indexing |
Vectorized indexing in which the value at location i in the index vector specifies which element of the source vector is being pulled into that location in the result vector, i.e., result[i] = source[index[i]] .
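The rule result[i] = source[index[i]] can be sketched in plain Python. The names `source`, `index`, and `pulled` and the sample values are illustrative assumptions, not part of any particular library.

```python
# Hypothetical example data.
source = ["a", "b", "c", "d"]
index = [3, 0, 0, 2]

# Pull indexing: location i of the result is filled from source[index[i]].
pulled = [source[j] for j in index]
print(pulled)  # ['d', 'a', 'a', 'c']
```

Every location in the result is filled exactly once, so pulling cannot leave gaps.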
|
|
|
pull request |
The request to merge a new feature or correction created on a user’s fork of a Git repository into the upstream repository. The developer will be notified of the change, review it, make or suggest changes, and potentially merge it.
|
|
|
push indexing |
Vectorized indexing in which the value at location i in the index vector specifies an element of the result vector that gets the corresponding element of the source vector, i.e., result[index[i]] = source[i] . Push indexing can easily produce gaps and collisions.
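A minimal sketch of result[index[i]] = source[i] in plain Python, showing one gap and one collision. The names and values are illustrative assumptions, and `None` is used here to mark result slots that were never written.

```python
# Hypothetical example data.
source = ["a", "b", "c"]
index = [2, 0, 2]        # slot 2 is targeted twice; slots 1 and 3 never are
pushed = [None] * 4      # result vector with four empty slots

# Push indexing: element i of the source is written to slot index[i].
for i, target in enumerate(index):
    pushed[target] = source[i]   # a later write overwrites an earlier one

print(pushed)  # ['b', None, 'c', None]
```

Slots 1 and 3 are gaps, and the collision at slot 2 means "a" was silently overwritten by "c".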
|
|
|
Python |
A popular interpreted open-source programming language that relies on indentation to define control structure.
|
|
|
Python Package Index |
The official third-party software repository for Python. Anyone can upload a package to PyPI. PyPI packages may install via executed scripts or pre-compiled, system-specific wheels.
|
|
|
Python Software Foundation |
A non-profit organization that oversees and promotes the development and use of Python.
|
|
|
quantile |
If a set of sorted values is divided into groups of equal size, each group is called a quantile. For example, if there are five groups, each is called a quintile; the bottom quintile contains the lowest 20% of the values, while the top quintile contains the highest 20%.
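The quintile example can be sketched by slicing a sorted list into five groups. This is a simplified illustration that assumes the number of values divides evenly by the number of groups; the variable names are hypothetical.

```python
# Hypothetical data: ten values, so each of the five groups holds two.
values = sorted([7, 1, 9, 3, 5, 10, 2, 8, 4, 6])
n_groups = 5
size = len(values) // n_groups   # assumes the length divides evenly

# Slice the sorted values into consecutive groups of equal size.
groups = [values[i * size:(i + 1) * size] for i in range(n_groups)]

bottom_quintile = groups[0]    # lowest 20% of the values
top_quintile = groups[-1]      # highest 20% of the values
print(bottom_quintile, top_quintile)  # [1, 2] [9, 10]
```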
|
|
|
query string |
The portion of a URL after the question mark ? that specifies extra parameters for the HTTP request as name-value pairs.
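As an illustration, Python's standard `urllib.parse` module can split a query string into its name-value pairs. The URL below is a made-up example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL used only for illustration.
url = "https://example.org/search?term=python&page=2"

query = urlsplit(url).query   # everything after the question mark
params = parse_qs(query)      # each name maps to a list of values

print(query)   # term=python&page=2
print(params)  # {'term': ['python'], 'page': ['2']}
```

Values come back as lists because a name may appear more than once in a query string.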
|
|
|
quosure |
A data structure containing an unevaluated expression and its environment.
|
|
|
quoting function |
A function that is passed expressions rather than the values of those expressions.
|
|
|
R (programming language) |
A popular open-source programming language used primarily for data science.
|
|
|
R Consortium |
A group that supports the worldwide community of users, maintainers and developers of R. Its members include leading institutions and companies dedicated to the use, development, and growth of R.
|
|
|
R Foundation |
A non-profit founded by the R development core team providing support for R. It is a member of the R Consortium.
|
|
|
R Hub |
A free platform available to check an R package on several different platforms in preparation for the CRAN submission process.
|
|
|
R Markdown |
A dialect of Markdown that allows authors to mix prose and code (usually written in R) in a single document.
|
|
|
raise (an exception) |
To signal that something unexpected or unusual has happened in a program by creating an exception and handing it to the error-handling system, which then tries to find a point in the program that will catch it.
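A short Python sketch of raising and then catching an exception. The function name `safe_reciprocal` and the message text are illustrative assumptions.

```python
def safe_reciprocal(x):
    """Return 1 / x, raising an exception for an input that has no reciprocal."""
    if x == 0:
        # Create the exception and hand it to the error-handling system.
        raise ValueError("cannot take the reciprocal of zero")
    return 1 / x

try:
    outcome = safe_reciprocal(0)
except ValueError as error:
    # Control jumps here, the nearest point that catches ValueError.
    outcome = f"caught: {error}"

print(outcome)  # caught: cannot take the reciprocal of zero
```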
|
|
|
random forests |
An algorithm used for regression or classification that uses a collection of decision trees, called a forest. Each tree votes for a classification, and the algorithm chooses the classification having the most votes over all the trees in the forest.
|
|
|
raster image |
An image stored as a matrix of pixels.
|
|
|
reactive programming |
A style of programming in which actions are triggered by external events.
|
|
|
reactive variable |
A variable whose value is automatically updated when some other value or values change. Reactive variables are used extensively in Shiny.
|
|
|
read-eval-print loop |
An interactive program that reads a command typed in by a user, executes it, prints the result, and then waits patiently for the next command. REPLs are often used to explore new ideas, or for debugging.
|
|
|
README |
A plain text file containing important information about a project or software package.
|
|
|
reciprocal |
The reciprocal of a number x is 1 / x, or alternatively x raised to the power of -1.
|
|
|
record |
A group of related values that are stored together. A record may be represented as a tuple or as a row in a table; in the latter case, every record in the table has the same fields.
|
|
|
recurrent neural network |
A class of artificial neural networks where connections between nodes can create a cycle. This allows the network to exhibit behavior that is dynamic over time. This type of network is applicable to tasks like speech and handwriting recognition.
|
|
|
recursion |
Calling a function from within a call to that function, or defining a term using a simpler version of the same term.
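The factorial function is a common illustration: it is defined in terms of a simpler version of itself and implemented by a function that calls itself. The function name is an illustrative choice.

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    if n <= 1:
        return 1                     # base case stops the recursion
    return n * factorial(n - 1)      # recursive call on a simpler problem

print(factorial(5))  # 120
```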
|
|
|
recycle |
To re-use values from a shorter vector in order to generate a sequence of the same length as a longer one. In Python NumPy, this is called broadcasting.
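Plain Python lists do not recycle automatically, but the idea can be imitated with `itertools.cycle` from the standard library. This is a sketch of the concept with made-up values, not a description of how R or NumPy implement it.

```python
from itertools import cycle

# Hypothetical vectors of different lengths.
longer = [1, 2, 3, 4, 5, 6]
shorter = [10, 20]

# cycle() repeats the shorter vector; zip() stops when the longer one ends.
recycled_sum = [a + b for a, b in zip(longer, cycle(shorter))]
print(recycled_sum)  # [11, 22, 13, 24, 15, 26]
```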
|
|
|
redirection |
To send a request for a web page or web service to a different page or service.
|
|
|
refactoring |
Reorganizing software without changing its behavior.
|
|
|
regression testing |
Testing software to ensure that things which used to work have not been broken.
|
|
|
regular expression |
A pattern for matching text, written as text itself. Regular expressions are sometimes called “regexp”, “regex”, or “RE”, and are powerful tools for working with text.
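For example, a pattern for ISO-style dates can be written as text and applied with Python's standard `re` module. The sample sentence is made up, and the pattern checks only the digit layout, not whether the date is valid.

```python
import re

# Hypothetical text to search.
text = "Samples were taken on 2023-01-15 and 2023-02-03."

# Four digits, a dash, two digits, a dash, two digits.
pattern = r"\d{4}-\d{2}-\d{2}"

dates = re.findall(pattern, text)
print(dates)  # ['2023-01-15', '2023-02-03']
```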
|
|
|
reinforcement learning |
Any machine learning algorithm which is not given specific goals to meet, but instead is given feedback on whether or not it is making progress.
|
|
|
relational database |
A database that organizes information into tables, each of which has a fixed set of named fields (shown as columns) and a variable number of records (shown as rows).
|
|
|
relative error |
The absolute value of the difference between the actual and correct value divided by the correct value. For example, if the actual value is 9 and the correct value is 10, the relative error is 0.1. Relative error is usually more useful than absolute error.
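The worked example above can be checked with a few lines of Python. The function name is hypothetical, and dividing by the absolute value of the correct value (so the result is never negative) is an assumption of this sketch.

```python
def relative_error(actual, correct):
    """Return |actual - correct| / |correct|; undefined when correct is 0."""
    return abs(actual - correct) / abs(correct)

print(relative_error(9, 10))  # 0.1
```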
|
|
|
relative path |
A path whose destination is interpreted relative to some other location, such as the current working directory. A relative path is the equivalent of giving directions using terms like “straight” and “left”.
|
|
|
relative row number |
The index of a row in a displayed portion of a table, which may or may not be the same as the absolute row number within the table.
|
|
|
remote login |
Starting an interactive session on one computer from another computer, e.g., by using SSH.
|
|
|
remote repository |
A repository located on another computer. Tools such as Git are designed to synchronize changes between local and remote repositories in order to share work.
|
|
|
repository |
A place where a version control system stores the files that make up a project and the metadata that describes their history.
|
|
|
reprex |
A reproducible example. When asking questions about coding problems online or filing issues on GitHub, you should always include a reprex so others can reproduce your problem and help. The reprex package can help!
|
|
|
reproducible example |
See reprex.
|
|
|
reproducible research |
The practice of describing and documenting research results in such a way that another researcher or person can re-run the analysis code on the same data to obtain the same result.
|
|
|
research software engineer |
Someone whose primary responsibility is to build the specialized software that other researchers depend on.
|
|
|
research software engineering |
The practice of and methods for building the specialized software that other researchers depend on.
|
|
|
reserved word |
A word (character string) with a distinct meaning for a programming or scripting language. Typically, reserved words cannot be used as names for variables or constants, as this would confuse the compiler or interpreter.
|
|
|
reStructured Text |
A plaintext markup format used primarily in Python documentation.
|
|
|
revision |
See commit.
|
|
|
right join |
A join that combines data from two tables, A and B. Where keys in table A match keys in table B, fields are concatenated. Where a key in table B does not match a key in table A, columns from table A are filled with null, NA, or some other missing value signifier. Keys from table A that do not exist in table B are dropped.
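A minimal sketch of these rules using dictionaries as stand-ins for tables, with each key mapped to a single field. The table contents are made up, and `None` plays the role of the missing value signifier.

```python
# Hypothetical tables: key -> one field.
table_a = {1: "alice", 2: "bob"}
table_b = {2: "biology", 3: "chemistry"}

# Keep every key from B; look up the matching field from A if there is one.
right_join = {
    key: (table_a.get(key), b_value)   # get() returns None when A lacks the key
    for key, b_value in table_b.items()
}

print(right_join)  # {2: ('bob', 'biology'), 3: (None, 'chemistry')}
```

Key 1 exists only in table A, so it is dropped from the result.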
|
|
|
ROC Curve |
A ROC curve (Receiver Operating Characteristic curve) is a graph that displays the performance of a binary classifier at different classification thresholds. The curve is obtained by plotting the True Positive Rate (also known as Recall or Sensitivity) along the vertical axis and the False Positive Rate along the horizontal axis.
|
|
|
root (in a tree) |
The node in a tree of which all other nodes are direct or indirect children, or equivalently the only node in the tree that has no parent.
|
|
|
root directory |
The directory that contains everything else, either directly or indirectly. The root directory is written / (a bare forward slash).
|
|
|
root mean squared error |
The square root of the mean squared error. Like the standard deviation, it is in the same units as the original data.
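As a sketch, the calculation can be written directly from its name: square the errors, take their mean, then take the square root. The function name and sample values are illustrative.

```python
import math

def rmse(predicted, observed):
    """Return the root mean squared error of two equal-length sequences."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data in which every prediction is off by exactly 2.
print(rmse([3, 5, 7, 9], [1, 7, 5, 11]))  # 2.0
```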
|
|
|
rotating file |
A set of files used to store recent information. For example, there might be one file with results for each day of the week, so that results from last Tuesday are overwritten this Tuesday.
|
|
|
S |
A language originally developed in Bell Labs for data analysis, statistical modeling, and graphics. R is a dialect of S.
|
|
|
S3 |
A framework for object-oriented programming in R.
|
|
|
S4 |
A framework for object-oriented programming in R.
|
|
|
sandbox |
A testing environment that is separate from the production system, or an environment that is only allowed to perform a restricted set of operations for security reasons.
|
|
|
sanity check |
A basic test to see if the outcome of a calculation, script or analysis makes sense or is true. This can be performed by visualisation or by simply inspecting the outcome.
|
|
|
scalar |
A single value of a particular type, such as 1 or “a”. Scalars exist in most languages, but do not really exist in R; in R, values that appear to be scalars are actually vectors of unit length.
|
|
|
schema |
A specification of the format of a dataset, including the name, format, and content of each table.
|
|
|
scope |
The portion of a program within which a definition can be seen and used. See closure, global variable, and local variable.
|
|
|
script |
Originally, a program written in a language too user-friendly for “real” programmers to take seriously; the term is now synonymous with program.
|
|
|
search path |
The list of directories that a program searches to find something. For example, the Unix shell uses the search path stored in the PATH variable when trying to find a program whose name it has been given.
|
|
|
Secure Shell |
A protocol and the program that implements it which allows remote access to a server through a secure channel where all information is encrypted.
|
|
|
seed |
A value used to initialize a pseudo-random number generator.
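Re-using a seed reproduces the same sequence, as this sketch with Python's standard `random` module shows. The seed value 42 is an arbitrary choice.

```python
import random

random.seed(42)                                  # initialize the generator
first_run = [random.random() for _ in range(3)]

random.seed(42)                                  # same seed again
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```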
|
|
|
select |
To choose entire columns or rows from a table by name or location.
|
|
|
selecting on the dependent variable bias |
A study that only includes cases where the dependent variable shows the same value, instead of cases with different values in the dependent variable, is a study affected by selecting on the dependent variable bias.
|
|
|
self join |
A join that combines a table with itself.
|
|
|
semantic versioning |
A standard for identifying software releases. In the version identifier major.minor.patch , major changes when a new version of software is incompatible with old versions, minor changes when new features are added to an existing version, and patch changes when small bugs are fixed.
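Version identifiers should be compared part by part as numbers rather than as text. The sketch below assumes plain major.minor.patch identifiers with no pre-release or build suffixes; the function name is hypothetical.

```python
def parse_version(text):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)

# Tuples compare element by element, so 10 correctly outranks 9.
newer = parse_version("1.10.0") > parse_version("1.9.3")
as_text = "1.10.0" > "1.9.3"    # comparing the raw strings gets this wrong

print(newer, as_text)  # True False
```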
|
|
|
sense vote |
A preliminary vote used to determine whether further discussion is needed in a meeting.
|
|
|
sensitivity |
Statistical measure of a classification model which gives the True Positive rate. For example, the proportion of people who have a disease that test positive. Calculated as Sensitivity = TP/(TP+FN).
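The formula can be checked with a small Python function. The function name and the counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """Return the true positive rate TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical counts: of 50 people with the disease, 40 test positive.
print(sensitivity(tp=40, fn=10))  # 0.8
```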
|
|
|
sequential data |
Any list of data items where the order is an inherent property of the list. Often the next item in the list is dependent on the previous item or items.
|
|
|
server |
Typically, a program such as a database manager or web server that provides data to a client upon request.
|
|
|
shader |
A program designed to run on the GPU. Generally used in graphics to calculate lighting or position vertices in a scene, though it can be used for more general programming through the use of compute shaders.
|
|
|
shebang |
In Unix, a character sequence such as #!/usr/bin/python in the first line of an executable file that tells the shell what program to use to run that file.
|
|
|
shell |
A command-line interface that allows a user to interact with the operating system, such as Bash (for Unix and macOS) or PowerShell (for Windows).
|
|
|
shell script |
A set of commands for the shell stored in a file so that they can be re-executed. A shell script is effectively a program.
|
|
|
shell variable |
A variable set and used in the Unix shell. Commonly used shell variables include HOME (the user’s home directory) and PATH (their search path).
|
|
|
Shiny |
An R package that makes it simple to build web applications to interactively visualise and manipulate data. Often used to make interactive graphs and tables straight from R without having to know HTML, CSS or JavaScript.
|
|
|
short circuit test |
A logical test that only evaluates as many arguments as it needs to. For example, if A is false, then most languages never evaluate B in the expression A and B .
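Short-circuiting can be observed in Python by recording which operands actually run. The helper `check` and the list `calls` are hypothetical names introduced for this sketch.

```python
calls = []

def check(name, value):
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return value

# A is false, so `and` never evaluates B.
result = check("A", False) and check("B", True)

print(result, calls)  # False ['A']
```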
|
|
|
short identifier (of commit) |
The first few characters of a full identifier. Short identifiers are easy for people to type and say aloud, and are usually unique within a repository’s recent history.
|
|
|
short option |
A single-letter identifier for a command-line argument. Most common flags are a single letter preceded by a dash, such as -v .
|
|
|
side effect |
A change made by a function while it runs that is visible after the function finishes, such as modifying a global variable or writing to a file. Side effects make programs harder for people to understand, since the effects are not necessarily clear at the point in the program where the function is called.
|
|
|
signal (a condition) |
A way of indicating that something has gone wrong in a program, or that some other unexpected event has occurred. R prefers “signalling a condition” to “raising an exception”. Python, on the other hand, encourages raising and catching exceptions, and in some situations, requires it.
|
|
|
Simple Mail Transfer Protocol |
A standard internet communication protocol for transmitting email.
|
|
|
Simple Mail Transfer Protocol Secure |
A method for securing SMTP using TLS.
|
|
|
single square brackets |
One set of square brackets [ ] , used to select a structure from another structure based on an index value, or range of values, inside the square brackets.
|
|
|
single-threaded |
A model of program execution in which only one thing can happen at a time. Single-threaded execution is easier for people to understand, but less efficient than multi-threaded execution.
|
|
|
singleton |
A set with only one element, or a class with only one instance.
|
|
|
Singleton pattern |
A design pattern that creates a singleton object to manage some resource or service, such as a database or cache. In object-oriented programming, the pattern is usually implemented by hiding the constructor of the class in some way so that it can only be called once.
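Python cannot hide a constructor, so one common approximation is to override `__new__` so that every call returns the same object. The class name `Cache` and its `entries` attribute are illustrative assumptions, and this is only one of several ways to write the pattern.

```python
class Cache:
    """A cache that is shared by every part of the program."""

    _instance = None

    def __new__(cls):
        # Create the object on the first call; hand back the same one afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.entries = {}
        return cls._instance

first = Cache()
second = Cache()
first.entries["answer"] = 42

print(first is second, second.entries)  # True {'answer': 42}
```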
|
|
|
slug |
An abbreviated portion of a page’s URL that uniquely identifies it. In the example https://www.mysite.com/category/post-name , the slug is post-name .
|
|
|
snake case |
See pothole case.
|
|
|
software distribution |
A set of programs that are built, tested, and distributed as a collection so that they can run together.
|
|
|
source code |
Source code, or simply code, is the text from which an executable program is produced, either by an interpreter or by a compiler. It is the series of commands, written primarily by people, that make up a program, although automatic code generators exist for some applications.
|
|
|
source distribution |
A software distribution that includes the source code, typically so that programs can be recompiled on the target computer when they are installed.
|
|
|
specificity |
Statistical measure of a classification model which gives the True Negative rate. For example, the proportion of people who do not have a disease that test negative. Calculated as Specificity = TN/(TN+FP).
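The formula can be checked with a small Python function. The function name and the counts are made up for illustration.

```python
def specificity(tn, fp):
    """Return the true negative rate TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts: of 100 people without the disease, 90 test negative.
print(specificity(tn=90, fp=10))  # 0.9
```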
|
|
|
spectral analysis |
Estimating, from a finite record of a stationary data sequence, how the total power is distributed over frequency. See also “spectrum analysis problem”.
|
|
|
sprint |
A short, intense period of work on a project.
|
|
|
SQL |
The language used for writing queries for a relational database. The term is an acronym for Structured Query Language.
|
|
|
Square root |
A special case of the n-th root for which n = 2; the 2nd root of a number has the special name “square root”.
|
|
|
SSH key |
A string of random bits stored in a file that is used to identify a user for SSH. Each SSH key has separate public and private parts; the public part can safely be shared, but if the private part becomes known, the key is compromised.
|
|
|
stack frame |
A section of the call stack that records details of a single call to a specific function.
|
|
|
Stack Overflow |
A question-and-answer site popular among programmers.
|
|
|
standard deviation |
How widely values in a dataset differ from the mean. It is calculated as the square root of the variance.
|
|
|
standard error |
A predefined communication channel for a process, typically used for error messages.
|
|
|
standard input |
A predefined communication channel for a process, typically used to read input from the keyboard or from the previous process in a pipe.
|
|
|
standard normal distribution |
A normal distribution with a mean of 0 and a standard deviation of 1. Values from normal distributions with other parameters can easily be rescaled to be on a standard normal distribution.
|
|
|
standard output |
A predefined communication channel for a process, typically used to send output to the screen or to the next process in a pipe.
|
|
|
stratified sampling |
Selecting values by dividing the overall population into homogeneous groups and then taking a random sample from each group.
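A minimal sketch using Python's standard `random` module. The group names, group sizes, the 10% sampling fraction, and the fixed seed are all assumptions made for illustration.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Hypothetical population already divided into homogeneous groups (strata).
population = {
    "urban": list(range(100)),        # 100 members
    "rural": list(range(100, 140)),   # 40 members
}

# Take a random 10% sample from each group separately.
sample = {
    group: rng.sample(members, k=len(members) // 10)
    for group, members in population.items()
}

print({group: len(chosen) for group, chosen in sample.items()})
# {'urban': 10, 'rural': 4}
```

Sampling each group separately guarantees that the smaller group is represented, which a single random sample of the whole population does not.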
|
|
|
stream |
A sequential flow of data, such as the bits arriving across a network connection or the bytes read from a file.
|
|
|
string |
A block of text in a program. The term is short for “character string”.
|
|
|
string interpolation |
The process of inserting text corresponding to specified values into a string, usually to make output human-readable.
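In Python, formatted string literals (f-strings) are one way to do this. The variable names and values are illustrative.

```python
# Hypothetical values to report.
name = "Ada"
passed = 3

# Expressions inside the braces are converted to text and inserted.
message = f"{name} ran {passed} tests"
print(message)  # Ada ran 3 tests
```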
|
|
|
student's t-distribution |
See t-distribution.
|
|
|
subcommand |
A command that is part of a larger family of commands. For example, git commit is a subcommand of Git.
|
|
|
subdirectory |
A directory that is below another directory.
|
|
|
supervised learning |
A machine learning algorithm in which a system is taught to classify values given training data containing previously-classified values.
|
|
|
support vector machine |
A supervised learning algorithm that seeks to divide points in a dataset so that the empty space between the resultant sets is as wide as possible.
|
|
|
synchronous |
To happen at the same time. In programming, synchronous operations are ones that have to run simultaneously, or complete at the same time.
|
|
|
systematic error |
See bias.
|
|
|
t-distribution |
A variation on the normal distribution that is adjusted to account for estimating variance from the sample instead of knowing it in advance.
|
|
|
tab completion |
A technique implemented by most REPLs, shells, and programming editors that completes a command, variable name, filename, or other text when the TAB key is pressed.
|
|
|
table |
A set of records in a relational database or observations in a data frame. Tables are usually displayed as rows (each of which represents one record or observation) and columns (each of which represents a field or variable).
|
|
|
tag (in version control) |
A readable label attached to a specific commit so that it can easily be referenced later.
|
|
|
Template Method pattern |
A design pattern in which a parent class defines an overall sequence of operations by calling abstract methods that child classes must then implement. Each child class then behaves in the same general way, but implements the steps differently.
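A minimal Python sketch using the standard `abc` module. The class and method names (`Report`, `render`, `header`, `body`) are hypothetical.

```python
from abc import ABC, abstractmethod

class Report(ABC):
    def render(self):
        # The parent fixes the overall sequence of steps.
        return "\n".join([self.header(), self.body()])

    @abstractmethod
    def header(self):
        """Return the first line of the report."""

    @abstractmethod
    def body(self):
        """Return the main text of the report."""

class PlainReport(Report):
    def header(self):
        return "REPORT"

    def body(self):
        return "all tests passed"

class MarkdownReport(Report):
    def header(self):
        return "# Report"

    def body(self):
        return "*all tests passed*"

print(PlainReport().render())
print(MarkdownReport().render())
```

Both child classes produce a header followed by a body, but each implements the two steps differently.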
|
|
|
ternary expression |
An expression that has three parts. Conditional expressions are the only ternary expressions in most languages.
|
|
|
tessellation shader |
The shader stage in the rendering pipeline designated towards subdividing primitives to increase the resolution of a mesh without impacting memory. Not to be confused with geometry shaders which change the overall shape.
|
|
|
test data |
Test data is a portion of a dataset used to evaluate the correctness of a machine learning algorithm after it has been trained. It should always be separated from the training data to ensure that the model is properly tested with unseen data.
|
|
|
test runner |
A program that finds and runs software tests and reports their results.
|
|
|
test-driven development |
A programming practice in which tests are written before a new feature is added or a bug is fixed in order to clarify the goal.
|
|
|
three Vs |
The volume, velocity, and variety that distinguish big data.
|
|
|
throw (exception) |
Another term for raising an exception.
|
|
|
tibble |
A modern replacement for R’s data frame, which stores tabular data in columns and rows, defined and used in the tidyverse.
|
|
|
ticket |
See issue.
|
|
|
ticketing system |
See issue tracking system.
|
|
|
tidy data |
Tabular data that satisfies three conditions that facilitate initial cleaning, and later exploration and analysis—(1) each variable forms a column, (2) each observation forms a row, and (3) each type of observation unit forms a table.
|
|
|
Tidymodels |
A collection of R packages for modeling and statistical analysis designed with a shared philosophy.
|
|
|
Tidyverse |
A collection of R packages for operating on tabular data in consistent ways.
|
|
|
time series |
A set of measurements taken at different times, which may or may not be regular intervals.
|
|
|
timestamp |
A digital identifier showing the time at which something was created or accessed. Timestamps should use ISO date format for portability.
|
|
|
tolerance |
How closely the actual result of a test must agree with the expected result in order for the test to pass. Tolerances are usually expressed in terms of relative error.
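Python's standard `math.isclose` function accepts a relative tolerance, which makes it a convenient way to sketch the idea. The 1% tolerance and the sample values are arbitrary choices.

```python
import math

expected = 10.0

# Within 1% of the expected result: the test passes.
close_enough = math.isclose(9.95, expected, rel_tol=0.01)

# Off by 10%: the test fails.
too_far = math.isclose(9.0, expected, rel_tol=0.01)

print(close_enough, too_far)  # True False
```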
|
|
|
training data |
Training data is a portion of a dataset used to train a machine learning algorithm to recognise similar data. It should always be separated from the test data to ensure that the model is properly tested with data it has never seen before.
|
|
|
transitive dependency |
If A depends on B and B depends on C, C is a transitive dependency of A.
|
|
|
Transport Layer Security |
A cryptographic protocol for securing communications over a computer network.
|
|
|
tree |
A graph in which every node except the root has exactly one parent.
|
|
|
triage |
To go through the issues associated with a project and decide which are currently priorities. Triage is one of the key responsibilities of a project manager.
|
|
|
true |
The logical (Boolean) state opposite of “false”. Used in logic and programming to represent a binary state of something.
|
|
|
true negative |
Data points which are actually false and correctly predicted as false.
|
|
|
true positive |
Data points which are actually true and correctly predicted as true.
|
|
|
truthy |
Evaluating to true in a Boolean context.
|
|
|
tuple |
A data type that has a fixed number of parts, such as the three color components of a red-green-blue color specification. In Python, tuples are immutable (their values cannot be reset).
|
|
|
two hard problems in computer science |
Refers to a quote by Phil Karlton: “There are only two hard problems in computer science: cache invalidation and naming things.” Many variations add a third problem (most often “off-by-one errors”).
|
|
|
type coercion |
To convert data from one type to another, e.g., from the integer 4 to the equivalent floating point number 4.0 .
|
|
|
unary expression |
An expression with one argument, such as log 5 .
|
|
|
Unicode |
A standard that defines numeric codes for many thousands of characters and symbols. Unicode does not define how those numbers are stored; that is done by standards like UTF-8.
|
|
|
Uniform Resource Locator |
A unique address on the World-Wide Web. URLs originally identified web pages, but may also represent datasets or database queries, particularly if they include a query string.
|
|
|
unit test |
A test that exercises one function or feature of a piece of software and produces pass, fail, or error.
|
|
|
UNIX |
UNIX is a family of operating systems whose development began in 1969 at AT&T Bell Labs. Its main features are simple tools, well-defined functionality, and portability.
|
|
|
unsupervised learning |
Algorithms that cluster data without knowing in advance what the groups will be.
|
|
|
up-vote |
A vote in favor of something.
|
|
|
update operator |
See in-place operator.
|
|
|
upstream repository |
The remote repository from which this repository was derived. Programmers typically save changes in their own repository and then submit a pull request to the upstream repository where changes from other programmers are also collected.
|
|
|
user interface |
Platform for interaction between a user and a machine. The interaction may occur via text (a command line interface), graphics and windows (a graphical user interface), or other methods such as voice-driven interfaces.
|
|
|
UTF-8 |
A way to store the numeric codes representing Unicode characters in memory that is backward-compatible with the older ASCII standard.
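A short Python sketch of both points: an ASCII character is stored as the same single byte, while a character outside ASCII takes more than one byte. The sample characters are arbitrary.

```python
# An ASCII character encodes to the same single byte in UTF-8.
ascii_bytes = "A".encode("utf-8")

# é has Unicode code 233, which UTF-8 stores as two bytes.
accented = "é"
code_point = ord(accented)
accented_bytes = accented.encode("utf-8")

print(ascii_bytes, code_point, accented_bytes)  # b'A' 233 b'\xc3\xa9'
```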
|
|
|
variable (data) |
Some attribute of a population that can be measured or observed.
|
|
|
variable (program) |
A name in a program that has some data associated with it. A variable’s value can be changed after definition.
|
|
|
variable arguments |
In a function, the ability to take any number of arguments. R uses ... to capture the “extra” arguments. Python uses *args and **kwargs to capture unnamed, and named, “extra” arguments, respectively.
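A minimal Python sketch. The function name `describe` and the arguments passed to it are made up for illustration.

```python
def describe(*args, **kwargs):
    """Return the unnamed arguments as a list and the sorted names of the named ones."""
    # args is a tuple of the unnamed extras; kwargs is a dict of the named ones.
    return list(args), sorted(kwargs)

unnamed, named = describe(1, 2, 3, sep=",", end="")
print(unnamed, named)  # [1, 2, 3] ['end', 'sep']
```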
|
|
|
variance |
How widely values in a dataset differ from the mean. It is calculated as the average of the squared differences between the values and the mean. The standard deviation is often used instead, since it has the same units as the data, while the variance is expressed in units squared.
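The calculation can be sketched step by step in Python. This follows the definition above, dividing by the number of values (the population variance); the sample values are made up.

```python
import math

# Hypothetical dataset.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Average of the squared differences between each value and the mean.
variance = sum((v - mean) ** 2 for v in values) / len(values)

# The standard deviation is back in the units of the original data.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```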
|
|
|
vector |
A sequence of values, usually of homogeneous type. Vectors are the fundamental data structure in R; a scalar is just a vector with exactly one element.
|
|
|
vectorize |
To write code so that operations are performed on entire vectors, rather than element-by-element within loops.
|
|
|
version control system |
A system for managing changes made to software during its development.
|
|
|
vertex shader |
The shader stage in the rendering pipeline designated towards handling operations on individual vertices in a scene. A vertex shader can be used to calculate properties of a single vertex, such as position and per-vertex lighting. Not to be confused with fragment shaders which are used to determine the actual colour being rendered to each pixel of the screen.
|
|
|
vignette |
A long-form guide used to provide details of a package beyond the README.md or function documentation.
|
|
|
Vim (editor) |
A text editor installed by default on many Unix systems. Vim is a very powerful text editor, with a steeper learning curve than nano, but it allows the user to execute shell commands and use regular expressions to alter files programmatically.
|
|
|
virtual environment |
In Python, the virtualenv package allows you to create virtual, disposable, Python software environments containing only the packages and versions of packages you want to use for a particular project or task, and to install new packages into the environment without affecting other virtual environments, or the system-wide default environment.
|
|
|
virtual machine |
A program that pretends to be a computer. This may seem a bit redundant, but VMs are quick to create and start up, and changes made inside the virtual machine are contained within that VM so we can install new packages or run a completely different operating system without affecting the underlying computer.
|
|
|
Visitor pattern |
A design pattern in which the operation to be done is taken to each element of a data structure in turn. It is usually implemented by having a generator “visitor” that knows how to reach the structure’s elements, which is given a function or method to call for each in turn, and that carries out the specific operation.
|
|
|
walk (a tree) |
To visit each node in a tree in some order, typically depth-first or breadth-first.
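The two orders can be sketched in Python for a small tree stored as a dictionary that maps each node to its children. The tree and the function names are hypothetical.

```python
from collections import deque

# Hypothetical tree: each node maps to a list of its children.
tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}

def depth_first(node):
    """Visit a node, then fully explore each child before moving to the next."""
    yield node
    for child in tree[node]:
        yield from depth_first(child)

def breadth_first(start):
    """Visit nodes level by level using a first-in, first-out queue."""
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(tree[node])

print(list(depth_first("root")))    # ['root', 'a', 'c', 'b']
print(list(breadth_first("root")))  # ['root', 'a', 'b', 'c']
```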
|
|
|
while loop |
A statement in a program that repeats one or more other statements (the loop body) as long as a condition is true.
|
|
|
whitespace |
The space, newline, carriage return, and horizontal and vertical tab characters that take up space but do not create a visible mark. The name comes from their appearance on a printed page in the era of typewriters.
|
|
|
wildcard |
A character expression that can match text, such as the * in *.csv (which matches any filename whose name ends with .csv ).
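Python's standard `fnmatch` module applies shell-style wildcards to a list of names, which makes the *.csv example easy to try. The filenames are made up.

```python
import fnmatch

# Hypothetical filenames.
names = ["a.csv", "b.txt", "notes.csv.bak", "c.csv"]

# * matches any run of characters, so only names ending in .csv are kept.
csv_files = fnmatch.filter(names, "*.csv")
print(csv_files)  # ['a.csv', 'c.csv']
```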
|
|
|
workflow |
A way of describing work to be done as a set of tasks, typically with dependencies on external inputs or the outputs of other tasks, which can later be executed by a program. An example is a Makefile which can be executed by the make Unix command.
|
|
|
XML |
A set of rules for defining HTML-like tags and using them to format documents (typically data). XML was popular in the early 2000s, but its complexity led many programmers to adopt JSON instead.
|
|
|
YAML |
Short for “YAML Ain’t Markup Language”, a way to represent nested data using indentation rather than the parentheses and commas of JSON. YAML is often used in configuration files and to define parameters for various flavours of Markdown documents.
|
|
|