62 terms

Inf1 - Data and Analysis

Key
A minimal set of attributes whose values uniquely identify an item in an entity set
Composite Key
A key that includes multiple attributes.
Super Key
Any set of attributes whose values uniquely identify an item in an entity set. Unlike a key, it need not be minimal.
Primary Key
A key chosen among all candidate keys to be used as a unique identifier.
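The difference between a super key and a (minimal) key can be checked mechanically on sample data. This is a rough sketch only: the rows and the helper names `is_superkey` and `is_key` are made up for illustration, and a real key is a property of every possible instance of the entity set, not of one sample.

```python
from itertools import combinations

# Hypothetical sample rows, not taken from the course material.
rows = [
    {"uun": "s01", "name": "Ann", "email": "ann@example.org"},
    {"uun": "s02", "name": "Bob", "email": "bob@example.org"},
    {"uun": "s03", "name": "Ann", "email": "ann2@example.org"},
]

def is_superkey(attrs, rows):
    """True if the values of attrs are distinct for every row."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)

def is_key(attrs, rows):
    """True if attrs is a super key and no proper subset of it is one."""
    if not is_superkey(attrs, rows):
        return False
    return not any(
        is_superkey(subset, rows)
        for size in range(len(attrs))
        for subset in combinations(attrs, size)
    )

uun_name_is_superkey = is_superkey(("uun", "name"), rows)  # unique, but not minimal
uun_name_is_key = is_key(("uun", "name"), rows)
uun_is_key = is_key(("uun",), rows)
name_is_superkey = is_superkey(("name",), rows)  # two rows share the name "Ann"
```

Here (uun, name) identifies every row but is not minimal, so it is a super key without being a key.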
Key Constraint
When each entity instance can appear in at most one instance of a particular relationship.
Total Participation
When every entity instance must appear in at least one relationship instance.
Partial Participation
When some entity instances might not appear in any relationship instance.
Weak Entities
An entity whose instances cannot be uniquely identified by their own attributes, and so must rely on a related entity. The weak entity must therefore have a key constraint and total participation in the identifying relationship.
ER diagrams | Key Constraints
An arrowhead on the line joining an entity to a relationship

The line can be thick or thin.
ER diagrams | Partial Participation
Single line
Example: A---B
ER diagrams | Total Participation
Thick or double line
Example: A===B
ER diagrams | Weak Entities
All of the following:
> A double or thick border on the rectangle of the weak entity;
> A double or thick border on the diamond of the identifying relationship;
> A double or thick line and arrow linking these;
> A double or dashed underline for the attributes of the weak entity that contribute to the composite key.

Example: A==>R*
ER Diagrams | Inheritance
A triangle whose point connects to the parent entity, with lines from the opposite side connecting to the child entities.
ER Diagrams | Entity Attributes
A line connecting the attribute string to the entity.
ER Diagrams | Primary Key
Underline beneath the attribute (a double or dashed underline is reserved for the key attributes of a weak entity)
Relational Algebra
A mathematical language of bulk operations on relational tables. Each operation *takes one
or more tables, and returns another*.
Relational Algebra | Selection
Picks out the rows of a table satisfying a logical predicate

σ[sub: logical predicate](table name)
Relational Algebra | Projection
Picks out the columns of a table by their field name

π[sub: column names](table name)
Relational Algebra | Renaming
Changes the name of a column of a table

ρ[sub: columnName → newColumnName](table name)
Relational Algebra | Union
Combines rows of two tables

Symbol: [Table 1] ∪ [Table 2]
Relational Algebra | Difference
Takes all the rows of one table which do not appear in another

Symbol: [Table 1] − [Table 2]
Relational Algebra | Intersection
Takes all the rows of one table which do appear in another.

Symbol: S1 ∩ S2 = S1 − (S1 − S2)
Relational Algebra | Cross Product
Combines every row of one table with every row of another, giving every possible pairing of rows between the two tables.

Symbol: [Table 1] × [Table 2]
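The relational algebra operations can be tried out by modelling a table as a set of tuples. The sketch below is illustrative only: the `student` and `alumni` tables and their contents are invented, and columns are addressed by position rather than by field name.

```python
# Each table is a set of rows; each row is a tuple (uun, name, age).
# Hypothetical sample data, not taken from the course material.
student = {("s01", "Ann", 19), ("s02", "Bob", 21), ("s03", "Cat", 20)}
alumni = {("s02", "Bob", 21), ("s04", "Dan", 25)}

# Selection: keep the rows satisfying a predicate (here age > 19).
selected = {row for row in student if row[2] > 19}

# Projection: keep only the name column; duplicates collapse in a set.
names = {(row[1],) for row in student}

# Union and difference are the ordinary set operations.
union = student | alumni
difference = student - alumni

# Intersection expressed through difference: S1 - (S1 - S2).
intersection = student - (student - alumni)

# Cross product: every row of one table joined to every row of the other.
cross = {a + b for a in student for b in alumni}
```

With 3 rows in one table and 2 in the other, the cross product has 3 × 2 = 6 rows.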
Tuple Relational Calculus (TRC)
A declarative mathematical notation for writing queries: specifying information to be drawn
from the linked tables of a relational model.
TRC | Query format
{ [VARIABLE] | [VARIABLE] ∈ [TABLE] ∧ [CONDITION] }

Example: { S | S ∈ Student ∧ S.age > 19 }
Read as: "The set of tuples S such that S is in the table "Student" and has component "age" greater than 19."
TRC | The "∃T ." Binding
{ [VARIABLE] ∈ [TABLE] | ∃[VARIABLE2] ∈ [TABLE2] . [CONDITION] }

Example: { S ∈ Student | ∃T ∈ Takes . T.code = "Inf1" ∧ S.uun = T.uun }

Read as:
> ∃T ∈ Takes → There exists a tuple T in the table Takes such that . . .
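A tuple relational calculus query reads much like a Python comprehension, with `any(...)` playing the role of ∃. The sketch below mirrors the example query; the table contents are invented for illustration.

```python
# Hypothetical sample tables as lists of records.
student = [{"uun": "s01", "name": "Ann"}, {"uun": "s02", "name": "Bob"}]
takes = [{"uun": "s01", "code": "Inf1"}, {"uun": "s02", "code": "Math1"}]

# { S ∈ Student | ∃T ∈ Takes . T.code = "Inf1" ∧ S.uun = T.uun }
result = [
    s
    for s in student
    if any(t["code"] == "Inf1" and s["uun"] == t["uun"] for t in takes)
]
```

Like the calculus query, the comprehension states which tuples are wanted rather than how to search for them.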
Structured Query Language (SQL)
A mostly-declarative programming language for interacting with relational database management systems (RDBMS): defining tables, changing data, writing queries.
SQL | Creating a table
CREATE TABLE Student (
uun VARCHAR(8),
name VARCHAR(20),
age INTEGER,
email VARCHAR(25),
PRIMARY KEY (uun),
FOREIGN KEY (column) REFERENCES Table(column) )
SQL | Adding a record to the database
INSERT
INTO Student (uun, name, age, email)
VALUES ('s1428751', 'Bob', 19, 'bob@sms.ed.ac.uk')
SQL | Updating a record in a database
UPDATE Student
SET name = 'Bobby'
WHERE uun = 's1428751'
SQL | Deleting a record in a database
DELETE
FROM Student
WHERE name = 'Bobby'
SQL | Querying a database
SELECT field-list
FROM table-list
[ WHERE qualification ]

> Use `field-list = *` to return every column of the result
> Use the 'AS' keyword to give a table an alias
> Example:
SELECT Student.name, Student.email
FROM Student, Takes, Course
WHERE Student.uun = Takes.uun
AND Takes.code = Course.code
AND Course.title = 'Mathematics 1'
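The example query can be run end to end with Python's built-in `sqlite3` module. This is only a sketch: the `Course` and `Takes` table definitions, the course code 'math1', and the second student are assumptions made up so that the join has something to work on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE Student (
        uun VARCHAR(8), name VARCHAR(20), age INTEGER, email VARCHAR(25),
        PRIMARY KEY (uun));
    -- Course and Takes are assumed schemas, for illustration only.
    CREATE TABLE Course (
        code VARCHAR(8), title VARCHAR(30),
        PRIMARY KEY (code));
    CREATE TABLE Takes (
        uun VARCHAR(8), code VARCHAR(8),
        PRIMARY KEY (uun, code),
        FOREIGN KEY (uun) REFERENCES Student(uun),
        FOREIGN KEY (code) REFERENCES Course(code));
    INSERT INTO Student VALUES ('s1428751', 'Bob', 19, 'bob@sms.ed.ac.uk');
    INSERT INTO Student VALUES ('s0000001', 'Ann', 20, 'ann@sms.ed.ac.uk');
    INSERT INTO Course VALUES ('math1', 'Mathematics 1');
    INSERT INTO Takes VALUES ('s1428751', 'math1');
""")

# The query from the card: students taking 'Mathematics 1'.
rows = conn.execute("""
    SELECT Student.name, Student.email
    FROM Student, Takes, Course
    WHERE Student.uun = Takes.uun
      AND Takes.code = Course.code
      AND Course.title = 'Mathematics 1'
""").fetchall()
```

Only Bob has a matching row in Takes, so Ann does not appear in the result.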
SQL | Table Combinations
You can combine queries with set operations using: UNION, INTERSECT, and EXCEPT.

Example: returns the rows produced by either query 1 or query 2
[Query 1]
UNION
[Query 2]
Transaction (in a DB)
A single coherent operation on a database. This might involve substantial amounts of data, or take considerable computation, but is meant to be an all-or-nothing action.
SQL | Nested Queries
A query can be nested inside the FROM clause, where its result is used like any other table. For example:

SELECT Student.name, Student.email
FROM Student, Takes,
(SELECT code FROM Course WHERE title='Mathematics 1') AS C
WHERE Student.uun=Takes.uun AND Takes.code=C.code
ACID Properties
Properties that are supposed to characterise a reliable implementation of a transaction.

A - Atomicity
C - Consistency
I - Isolation
D - Durability
A in ACID
Atomicity

> All-or-nothing: *a transaction either runs to completion, or fails and leaves the
database unchanged.*

> May involve a rollback mechanism to undo a partially-complete transaction.
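Rollback can be observed with Python's built-in `sqlite3` module. In this sketch the `Account` table and its CHECK constraint are invented for illustration; the second update breaks the constraint, so the whole transaction is undone, including the first update that had already succeeded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: balances may never go negative.
conn.execute(
    "CREATE TABLE Account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO Account VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute("UPDATE Account SET balance = balance + 150 WHERE id = 2")
        # This would leave account 1 at -50, violating the CHECK constraint.
        conn.execute("UPDATE Account SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the transfer failed as a whole

balances = conn.execute("SELECT balance FROM Account ORDER BY id").fetchall()
```

After the failed transfer both balances are back at their original values, as if the transaction had never started.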
C in ACID
Consistency

> Applying a transaction in a valid state of the database will always give a valid result state.

> This requires maintaining constraints and cascades; and rolling back a
transaction if it will break any of these
I in ACID
Isolation

> Concurrent transactions have the same effect as sequential ones: the outcome is as if they were done in order.

> Transactions can run at the same time but should not see each other's intermediate state. A.K.A. sequential consistency.
D in ACID
Durability

> Once a transaction is committed, its changes are permanent: it will not be rolled back, even after a system failure.
XPath | Element Nodes
These are labelled with element names, categorising the data below them.

In the XPath data model, internal nodes other than the root are always element nodes.

The root node must have exactly one element node as child, called the root element.
XPath | Text Nodes
Leaves of the tree storing textual information.
XPath | Attribute Nodes
Leaves assigning a value to some attribute of an element node

> We use "@ATTR_NAME" to distinguish an attribute from a text node.
XPath | Traversing a tree
Extensible Markup Language (XML)
XML is a formal language for presenting semistructured data. It is a markup language in that it provides a way to mark up ordinary text with additional information.
XML | Elements (Tags)
> There is a start tag and an end tag
> Text strings can be placed between the start and end tags
Example: <Capital>Canberra</Capital>
> Attributes can be declared in the start tag
Example: <Feature type="Mountain" color="Pink"> ... </Feature>
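Elements, text and attributes can be explored with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The document below is a made-up example: the `Gazetteer`, `Country` and `Name` elements and the feature names are assumptions built around the `Capital` and `Feature` tags above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, for illustration only.
doc = """<Gazetteer>
  <Country>
    <Name>Australia</Name>
    <Capital>Canberra</Capital>
    <Feature type="Mountain">Kosciuszko</Feature>
    <Feature type="Lake">Eyre</Feature>
  </Country>
</Gazetteer>"""

root = ET.fromstring(doc)  # the root element of the tree

# Text held between a start tag and its end tag.
capital = root.find("./Country/Capital").text

# Attribute values are read from the element that declares them.
types = [feature.get("type") for feature in root.iter("Feature")]

# A path expression with an attribute test, in XPath style.
mountains = [f.text for f in root.findall(".//Feature[@type='Mountain']")]
```

The path `.//Feature[@type='Mountain']` selects Feature elements anywhere below the root whose `type` attribute equals "Mountain".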
Schema Language
Any language for specifying similar kinds of structure in XML documents.

A formal schema language should:
> Be precise and unambiguous;
> Be machine-checkable, so that a program can validate whether or not a document satisfies a given schema.
XML | well-formed XML
Must follow these conditions:
> Proper use of syntax;
> Correctly nested tags around elements;
> The tree structure can always be extracted from textual nesting;
> Elements are always given with their complete name;
> Attributes are all named;
> Everything else is unstructured text.
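Well-formedness can be checked by simply trying to parse the text. A minimal sketch using Python's standard `xml.etree.ElementTree` parser follows; the helper name `is_well_formed` is made up, and the check says nothing about validity against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Correctly nested tags.
good = is_well_formed("<Country><Capital>Canberra</Capital></Country>")

# The end tags are in the wrong order, so the nesting is broken.
bad = is_well_formed("<Country><Capital>Canberra</Country></Capital>")
```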
XML | Valid XML
A document that satisfies a specific schema is valid with respect to that schema.
Document Type Definition (DTD)
A list of individual element declarations and attribute declarations, in any order.
DTD | Element Syntax
<!ELEMENT elementName contentType >
> Content Type can be:
1. EMPTY indicating that the element has no content.
2. ANY meaning that any content is allowed
(Elements nested within this still need their own declarations).
3. Mixed content where the element contains text, and possibly also child elements.
4. A child declaration using a regular expression.

<!ELEMENT elementName ( nestedElement1, nestedElement2+ ) >
> Means that an element with a given name will have a nestedElement1 followed by 1 or more 'nestedElement2'

<!ELEMENT Country ( Name, Population, State* ) >
> Means that a country element will have a name element nested, a population element, and 0 or more State elements.

<!ELEMENT book ( #PCDATA ) >
> Means that the element will have plain text inside
DTD | Attribute Syntax
<!ATTLIST elementName attName attType attDefault ... >

<!ATTLIST Country code CDATA #IMPLIED >
> Means that the "Country" element may have an attribute code and that it will be a text string

<!ATTLIST Feature type CDATA #REQUIRED >
> Means that the "Feature" element must have an attribute type and that it will be a text string
Corpus
A widely available fixed-sized body of machine-readable text, appropriately sampled to properly represent a certain language variety
Corpora | Two key tasks for building a corpus
1. Collect Data - Involves balancing and sampling data.
> Balancing - ensures that the linguistic content represents the full variety of the language sources for which the corpus is intended to provide
a reference
> Sampling - ensures that the material is representative of the types of sources

2. Add Information - perform textual annotation
Corpora | Applications
1. Identifying collocations for linguistic purposes.
2. Engineering NLP systems, as well as speech processing.
3. Machine Translation
Stats | Mode, Median, and Mean
Mode : most frequent number in the dataset. There can be multiple.

Median : the middle value of the sorted dataset.

Mean : Sum of all values divided by number of values
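All three measures are available in Python's standard `statistics` module. The data set below is invented for illustration and deliberately has two modes.

```python
import statistics

data = [1, 3, 3, 5, 7, 7, 9]  # hypothetical sample

modes = statistics.multimode(data)  # every most-frequent value
median = statistics.median(data)    # middle value of the sorted data
mean = statistics.mean(data)        # sum of values / number of values
```

Both 3 and 7 occur twice, so `multimode` returns both of them.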
Stats | Standard Deviation and Variance
Variance : sum((data(i) - mean)^2)/N;

StandardDev : sqrt(variance);
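The two formulas translate directly into code. This sketch uses an invented data set and divides by N (the population form), then compares the result with the standard library's `statistics.pvariance`.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = sum(data) / len(data)

# Variance: mean of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# The library's population variance should agree with the hand computation.
matches_library = math.isclose(variance, statistics.pvariance(data))
```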
Stats | Correlation Coefficient
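r = sum((x(i) - x-bar) * (y(i) - y-bar)) / (n * s(x) * s(y))

Note: If estimating for a larger population from a sample, use n-1 in place of n (both here and in the standard deviations), otherwise just use n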

Variables
> s(x) and s(y) are the standard deviations of x and y respectively;
> x-bar and y-bar are the respective means;

Outcomes
> If r is close to 0 then this suggests there is no correlation.
> If r is nearer +1 then this suggests x and y are positively correlated.
> If r is closer to −1 then this suggests x and y are negatively correlated.
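A correlation coefficient can be computed in a few lines. The sketch below assumes the usual definition, the average product of deviations from the means divided by s(x) times s(y), and uses the divide-by-n form throughout; the function name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired data, dividing by n throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    return covariance / (s_x * s_y)

x = [1, 2, 3, 4, 5]
rising = pearson_r(x, [2, 4, 6, 8, 10])   # y grows exactly in step with x
falling = pearson_r(x, [10, 8, 6, 4, 2])  # y shrinks exactly in step with x
```

Perfectly linear data gives r equal to +1 or −1, up to floating-point rounding.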
Stats | Null Hypothesis
The hypothesis that there is nothing out of the ordinary in the data: no correlation, no effect, nothing to see.
Stats | p-test
The probability of seeing data at least as extreme as that observed purely by chance, assuming the null hypothesis is true.

Thus a low p-value means that we reject the null hypothesis. This is described as statistically significant.
Stats | The χ2 test (chi-square test)
This is a statistical tool for assessing correlation in qualitative data.

It gives the p-value in the p-test.
Stats | Performing the χ2 test
1. Create a table of observed frequencies (counts, not probabilities)
> O(1,1) = # of A AND B
> O(1,2) = # of A AND (NOT B)
> O(2,1) = # of (NOT A) AND B
> O(2,2) = # of (NOT A) AND (NOT B)
2. From this deduce the expected frequencies E(i,j)
> Assume that there is no relationship between A and B, so E(i,j) = (row i total × column j total) / overall total
3. Sum the squared differences between the two, each divided by the expected frequency: χ2 = sum of (O(i,j) - E(i,j))^2 / E(i,j)
4. Compare χ2 against the critical value for the chosen significance level (a 2×2 table has 1 degree of freedom); a larger χ2 means a smaller p-value
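The four steps can be followed through on a small 2×2 table. This is a sketch under stated assumptions: the observed counts are invented, the expected counts come from the usual independence assumption (row total × column total / overall total), and 3.841 is the standard critical value for 1 degree of freedom at the 5% level.

```python
# Step 1: observed frequencies (hypothetical counts).
#              B    NOT B
observed = [[30, 10],   # A
            [20, 40]]   # NOT A

# Step 2: expected frequencies if A and B were unrelated.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Step 3: chi-square statistic, sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Step 4: compare with the critical value (1 degree of freedom, 5% level).
CRITICAL_5_PERCENT_DF1 = 3.841
significant = chi2 > CRITICAL_5_PERCENT_DF1
```

For these counts the statistic is well above the critical value, so the null hypothesis of no relationship would be rejected at the 5% level.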