73 terms

How do we represent knowledge?

• Procedurally (HOW):

- Write methods that encode how to handle specific situations in the world

• Declaratively (WHAT):

- Specify facts about the world

Logic for Knowledge Representation

Logic is a declarative language to:

• Assert sentences representing facts that hold in a world W (these sentences are given the value true)

• Deduce the true/false values of sentences representing other aspects of W

Entailment

one thing follows from another

model

Assignment of a truth value - true or false - to every atomic sentence

A model m is a model of KB iff it is a model of all sentences in KB, that is, all sentences in KB are true in m

Satisfiability of a KB

A KB is satisfiable iff it admits at least one model; otherwise it is unsatisfiable

Propositional logic

the simplest logic - illustrates basic ideas

Inference

the process of generating sentences entailed by the KB

inference rule

consists of two sentence patterns ξ and ψ, called the conditions, and one sentence pattern ϕ, called the conclusion

If ξ and ψ match two sentences of KB then the corresponding ϕ can be inferred according to the rule

Soundness

An inference rule is sound if it generates only entailed sentences

Completeness

A set of inference rules is complete if every entailed sentence can be obtained by applying some finite succession of these rules

Proof

The proof of a sentence α from a set of sentences KB is the derivation of α by applying a series of sound (legal) inference rules

Inference by enumeration

We can enumerate all possible models and test whether every model of KB is also a model of α
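The enumeration idea above can be sketched in Python. The function name `tt_entails`, the sentence encoding as truth functions over a model, and the tiny KB {P, P ⇒ Q} are all illustrative assumptions, not from the source:

```python
from itertools import product

def tt_entails(symbols, kb, alpha):
    """Check KB |= alpha by enumerating every model (truth assignment).

    kb and alpha are functions mapping a model (dict symbol -> bool)
    to True/False. Always exponential in the number of symbols.
    """
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not alpha(model):
            return False  # a model of KB in which alpha is false
    return True

# Tiny example: KB = {P, P => Q}; does KB entail Q?
kb = lambda m: m["P"] and ((not m["P"]) or m["Q"])
entails_q = tt_entails(["P", "Q"], kb, lambda m: m["Q"])
```

Every model of this KB assigns Q true, so the check succeeds; it fails for ¬Q.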

Application of inference rules

• Legitimate (sound) generation of new sentences from old

• Proof = a sequence of inference rule applications

Can use inference rules as operators in a standard search algorithm

• Typically require transformation of sentences into a normal form

Model checking

• truth table enumeration (always exponential in n)

• improved backtracking, e.g., Davis–Putnam–Logemann–Loveland (DPLL)

• heuristic search in model space (sound but incomplete) e.g., min-conflicts-like hill-climbing algorithms

Conjunctive Normal Form (CNF)

conjunction of disjunctions of literals; the disjunctions are called clauses

Horn Form (restricted)

KB = conjunction of Horn clauses

Horn clause =

• proposition symbol; or

• (conjunction of symbols) ⇒ symbol

- E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B)

Modus Ponens (for Horn Form): complete for Horn KBs

• Can be used with forward chaining or backward chaining.

• These algorithms are very natural and run in linear time

Forward chaining

• Idea: fire any rule whose premises are satisfied in the KB,

- add its conclusion to the KB, until query is found

data-driven, automatic, unconscious processing,

- e.g., object recognition, routine decisions

May do lots of work that is irrelevant to the goal
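A minimal sketch of forward chaining over a Horn KB in Python. The name `fc_entails` and the KB encoding as (premises, conclusion) pairs are assumptions; the example extends the earlier Horn KB C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B) with the extra fact D so that a derivation exists:

```python
def fc_entails(kb, query):
    """Forward chaining over a Horn KB.

    kb: list of (premises, conclusion) pairs; facts have empty premises.
    Fire any rule whose premises are all known, add its conclusion,
    until the query is derived or nothing new can be inferred.
    """
    known = {concl for prem, concl in kb if not prem}  # facts
    changed = True
    while changed:
        changed = False
        for prem, concl in kb:
            if concl not in known and all(p in known for p in prem):
                known.add(concl)
                if concl == query:
                    return True
                changed = True
    return query in known

# C, B => A, C ^ D => B, plus the assumed extra fact D
kb = [([], "C"), (["B"], "A"), (["C", "D"], "B"), ([], "D")]
```

From C and D the rule C ∧ D ⇒ B fires, then B ⇒ A fires; without D, A is not derivable.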

Backward chaining

Idea: work backwards from the query q:

to prove q by BC,

check if q is known already, or

prove by BC all premises of some rule concluding q

Avoid loops: check if new subgoal is already on the goal stack

Avoid repeated work: check if new subgoal

1. has already been proved true, or

2. has already failed

goal-driven, appropriate for problem-solving,

- e.g., Where are my keys? How do I get into a PhD program?

Complexity of BC can be much less than linear in size of KB
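The backward-chaining steps above can be sketched as a recursive Python function; `bc_entails` and the (premises, conclusion) KB encoding are assumptions. Loop avoidance is the goal-stack check from the notes; the "avoid repeated work" memoization is omitted to keep the sketch short:

```python
def bc_entails(kb, query, stack=frozenset()):
    """Backward chaining over a Horn KB (list of (premises, conclusion)).

    To prove query: succeed if it is a known fact; otherwise try each
    rule concluding it and recursively prove all of that rule's premises.
    The stack of open goals avoids infinite loops.
    """
    if query in stack:
        return False  # query is already an open subgoal: avoid looping
    for prem, concl in kb:
        if concl == query:
            if not prem:
                return True  # known fact
            if all(bc_entails(kb, p, stack | {query}) for p in prem):
                return True
    return False

kb = [([], "C"), (["B"], "A"), (["C", "D"], "B"), ([], "D")]
```

Note that only rules concluding the current goal are ever touched, which is why BC can run in much less than linear time in the size of the KB.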

Logic for reasoning

Deduce new facts (α) using KB:

- The rules about the world

- Information we gather through perception

Pros and cons of propositional logic

Pros: declarative, allows partial/disjunctive/negated information, compositional (meaning of B1,1 ∧ P1,2 is derived from meaning of B1,1 and of P1,2), context-independent

Con: very limited expressive power

experiment

has a set of potential outcomes,

sample space

the set of all possible outcomes

random variable

can take on any value in the sample space

event

a subset of the sample space.

Principle of Indifference

Alternatives are always to be judged equi-probable if we have no reason to expect or prefer one over the other.

• Each outcome in the sample space is assigned equal probability.

Law of Large Numbers

As the number of experiments increases the relative frequency of an event more closely approximates the theoretical probability of the event.

Prior or unconditional probabilities of propositions

correspond to belief prior to arrival of any (new) evidence

Probability distribution

gives values for all possible assignments

Joint probability distribution

gives the probability of every atomic event on those random variables

Conditional Probability

The probability of an event may change after knowing another event.

Independence

Events are independent if one has nothing whatever to do with others. Therefore, for two independent events, knowing one happening does not change the probability of the other event happening.

Events A and B are independent iff:

P(A, B) = P(A) × P(B), which is equivalent to P(A|B) = P(A) and P(B|A) = P(B) when P(A, B) > 0.
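The product definition can be checked directly on a small uniform sample space. The example (two fair coin flips, with the principle of indifference giving each outcome equal probability) and the helper name `prob` are assumptions:

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips; each outcome is equi-probable.
space = list(product("HT", repeat=2))

def prob(event):
    """P(event) under the uniform distribution on the sample space."""
    hits = sum(1 for o in space if event(o))
    return Fraction(hits, len(space))

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
both = lambda o: first_heads(o) and second_heads(o)

# Independence: P(A, B) = P(A) * P(B)
independent = prob(both) == prob(first_heads) * prob(second_heads)
```

Here P(A, B) = 1/4 = (1/2)(1/2), so the two flips are independent, as expected.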

Conditional Independence

Dependent events can become independent given certain other events.

disjoint

no members in common

Disjoint({animals, vegetables})

exhaustive decomposition

A set of categories s constitutes an exhaustive decomposition of a category c if all members of category c are covered by the categories in s

ExhaustiveDecomposition({Americans, Canadians, Mexicans}, NorthAmericans).

partition

disjoint exhaustive decomposition

Partition({Males, Females}, Persons)

Fluents

functions and predicates that vary from one situation to the next

Atemporal functions and predicates:

true in any situation

Actions

change a situation, described by stating their effects

Frame Problem

Actions don't specify what happens to objects not involved in the action, but the logic framework requires that information

Frame Axioms

Inform the system about preserved relations

Axioms of probability

• For any propositions A, B

- 0 ≤ P(A) ≤ 1

- P(true) = 1 and P(false) = 0

- P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
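The axioms, in particular the inclusion–exclusion rule P(A ∨ B) = P(A) + P(B) − P(A ∧ B), can be verified on a small example. The die roll and the events chosen here are illustrative assumptions:

```python
from fractions import Fraction

# One roll of a fair six-sided die; uniform probabilities.
outcomes = range(1, 7)

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), 6)

even = lambda o: o % 2 == 0      # {2, 4, 6}
low = lambda o: o <= 2           # {1, 2}
either = lambda o: even(o) or low(o)
both = lambda o: even(o) and low(o)

# P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(either)
rhs = prob(even) + prob(low) - prob(both)
```

Here P(even) = 3/6, P(low) = 2/6, and their overlap {2} has probability 1/6, so both sides equal 4/6.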

Properties of Probability

1. P(¬E) = 1- P(E)

2. If E1 and E2 are logically equivalent, then P(E1)=P(E2).

3. P(E1, E2)≤P(E1).

Naïve Bayes Conditional Independence Assumption

Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj)
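A minimal sketch of a multinomial naïve Bayes classifier built on this assumption, with add-one (Laplace) smoothing. The function names, the smoothing choice, and the toy training set are all assumptions, not the source's implementation:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train naive Bayes on (words, label) pairs; returns the model."""
    labels = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in labels}
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return labels, word_counts, vocab, len(docs)

def classify_nb(model, words):
    labels, word_counts, vocab, n = model
    def score(c):
        total = sum(word_counts[c].values())
        s = math.log(labels[c] / n)  # log prior P(c)
        for w in words:
            # Conditional independence assumption:
            # P(w1,...,wk | c) = prod_i P(wi | c), with add-one smoothing.
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(labels, key=score)

docs = [(["good", "great"], "pos"), (["great", "fun"], "pos"),
        (["bad", "boring"], "neg")]
model = train_nb(docs)
```

Log probabilities are summed instead of multiplying raw probabilities, to avoid underflow on longer documents.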

challenges to Sentiment Analysis

• domain specificity

• "thwarted expectations"

• sarcasm and subtle nature of sentiment

• sufficient, high quality training data

• "thwarted expectations"

• sarcasm and subtle nature of sentiment

• sufficient, high quality training data

Recall

Fraction of docs in class i classified correctly, i.e. Correctly classified positive over all positive

Precision

Fraction of docs assigned class i that are actually about class i, i.e. Correctly classified positive over all classified as positive

F Measure (F1)

2PR/(P + R)

Accuracy

Fraction of docs classified correctly, i.e. correctly classified over all docs

Macroaveraging

Compute performance (Fmeasure, Precision, Recall) for each class, then average.

Microaveraging

Collect decisions for all classes, compute contingency table, evaluate
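The evaluation measures above, and the macro/micro distinction, can be computed from per-class counts. The helper name `prf` and the example counts are assumptions:

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # F1 = 2PR/(P + R)
    return p, r, f

# Hypothetical per-class counts: (tp, fp, fn)
counts = {"pos": (8, 2, 4), "neg": (50, 4, 2)}

# Macroaveraging: compute F1 per class, then average.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Microaveraging: pool the contingency tables, then compute once.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]
```

With these counts the micro average exceeds the macro average because pooling lets the large, well-classified class dominate; macroaveraging weights every class equally.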

Conditional Independence Assumption

features are independent of each other given the class

Naïve Bayes Method

Makes independence assumptions that are often not true

Bayesian Network

- Explicitly models the independence relationships in the data.

- Use these independence relationships to make probabilistic inferences.

- Also known as: Belief Net, Bayes Net, Causal Net, ...

- Bayesian networks are directed acyclic graphs (DAGs).

- Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values.

- The links of the network represent direct probabilistic influence.

- The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.

- The nodes and links are quantified with probability distributions.

- The root nodes (those with no ancestors) are assigned prior probability distributions.

- The other nodes are assigned with the conditional probability distribution of the node given its parents.

Constructing Bayesian networks

1. Choose an ordering of variables X1, ... ,Xn

2. For i = 1 to n

- add Xi to the network

- select parents from X1, ... ,Xi-1 such that

P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)

This choice of parents guarantees:

P(X1, ..., Xn) = ∏i=1..n P(Xi | X1, ..., Xi-1) (chain rule)

= ∏i=1..n P(Xi | Parents(Xi)) (by construction)
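The guarantee above means the full joint is just the product of the local CPTs. A sketch for a three-node chain A → B → C, with hypothetical CPT numbers chosen only so each distribution sums to one:

```python
from itertools import product

# Hypothetical CPTs for the chain A -> B -> C.
p_a = {True: 0.3, False: 0.7}                       # prior on root A
p_b_given_a = {True: {True: 0.9, False: 0.2},       # p_b_given_a[b][a] = P(B=b | A=a)
               False: {True: 0.1, False: 0.8}}
p_c_given_b = {True: {True: 0.5, False: 0.4},       # p_c_given_b[c][b] = P(C=c | B=b)
               False: {True: 0.5, False: 0.6}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b)."""
    return p_a[a] * p_b_given_a[b][a] * p_c_given_b[c][b]

# Sanity check: the product of CPTs defines a proper joint distribution.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
```

Summing the product over all eight assignments gives 1, confirming the factorization yields a valid joint distribution.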

Linear connection

The two end variables are dependent on each other. The middle variable renders them independent.

Converging connection

The two end variables are independent of each other. The middle variable renders them dependent.

Divergent connection

The two end variables are dependent on each other. The middle variable renders them independent.

inputs to a Bayesian Network evaluation algorithm

a set of evidences: e.g.,

E = { hear-bark=true, lights-on=true }

outputs of Bayesian Network evaluation algorithm

- Simple queries

P(Xi | E)

where Xi is a variable in the network.

- conjunctive queries:

P(Xi,Xj | E)

Enumeration

simple query

Variable Elimination

- Eliminate repeated calculations

- Throw away irrelevant calculations/terms

is VE any better than Enumeration?

- For singly-connected networks (poly-trees), YES

- In general, NO

Stochastic Simulation

- Draw N samples from a sampling distribution S

- Compute an approximate posterior (conditional) probability P

- Show this converges to the true probability P

• Four techniques

- Direct Sampling

- Rejection Sampling

- Likelihood weighting

- Markov chain Monte Carlo (MCMC)
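A sketch of one of these techniques, rejection sampling, on an assumed two-node network Rain → WetGrass with made-up CPT numbers; samples that disagree with the evidence are discarded and the query probability is estimated from the rest:

```python
import random

# Hypothetical two-node network Rain -> WetGrass.
P_RAIN = 0.2
P_WET_GIVEN = {True: 0.9, False: 0.1}  # P(WetGrass=true | Rain)

def sample():
    """Direct sampling: sample each node given its parents, in order."""
    rain = random.random() < P_RAIN
    wet = random.random() < P_WET_GIVEN[rain]
    return rain, wet

def rejection_sample(n, seed=0):
    """Estimate P(Rain=true | WetGrass=true) from n direct samples,
    rejecting every sample inconsistent with the evidence."""
    random.seed(seed)
    kept = rain_count = 0
    for _ in range(n):
        rain, wet = sample()
        if wet:             # evidence check: reject otherwise
            kept += 1
            rain_count += rain
    return rain_count / kept if kept else 0.0

estimate = rejection_sample(100_000)
# Exact answer: 0.2*0.9 / (0.2*0.9 + 0.8*0.1) = 0.18/0.26
```

Rejection sampling converges to the true posterior but wastes the rejected samples, which is the motivation for likelihood weighting and MCMC.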

Markov chain Monte Carlo

Rather than generate individual samples, transition between "states" of the network

Markov Chains

In a Markov chain a variable Xt depends on a bounded subset of X0:t-1

• First order Markov Process: P(Xt|X0:t-1) = P(Xt|Xt-1)

• Second order Markov Process: P(Xt|X0:t-1) = P(Xt|Xt-1,Xt-2)

Markov Models involve two things:

- transition model: P(Xt|Xt-1)

- evidence model: P(Et|Xt)

• Sensor Markov assumption: P(Et|X0:t,E0:t-1)=P(Et|Xt)
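One step of filtering with a first-order transition model and an evidence model can be sketched as predict-then-update. The dictionary encoding and the umbrella-world-style numbers are assumptions:

```python
def filter_step(prior, transition, evidence_model, evidence):
    """One step of filtering: P(X_t | e_1:t) from P(X_{t-1} | e_1:t-1).

    transition[prev][cur] = P(X_t=cur | X_{t-1}=prev)   (transition model)
    evidence_model[e][s]  = P(e | X_t=s)                (evidence model)
    """
    # Predict: push the prior through the transition model.
    predicted = {s: sum(prior[p] * transition[p][s] for p in prior)
                 for s in transition}
    # Update: weight by the evidence likelihood, then normalize.
    unnorm = {s: predicted[s] * evidence_model[evidence][s] for s in predicted}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# Hypothetical umbrella-world-style numbers.
T = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
E = {"umbrella": {"rain": 0.9, "sun": 0.2},
     "no_umbrella": {"rain": 0.1, "sun": 0.8}}
belief = filter_step({"rain": 0.5, "sun": 0.5}, T, E, "umbrella")
```

Starting from a uniform prior, seeing an umbrella raises the belief in rain to 0.45/0.55 ≈ 0.82.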

Filtering

P(Xt|e0:t)

Decision making in the here and now

Prediction

P(Xt+k|e0:t)

Trying to plan the future

Smoothing

P(Xk|e0:t) for 0<=k<t

"Revisionist history" (essential for learning)

"Revisionist history" (essential for learning)

Most Likely Explanation (MLE)

argmax_{x1:t} P(x1:t | e1:t)

- e.g., speech recognition, sketch recognition
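The MLE query is standardly computed by the Viterbi algorithm; a sketch under the same umbrella-world-style assumptions as before (the dictionary encodings and numbers are made up for illustration):

```python
def viterbi(states, prior, transition, emission, observations):
    """Most likely state sequence argmax_{x1:t} P(x1:t | e1:t).

    transition[p][s] = P(X_t=s | X_{t-1}=p); emission[s][e] = P(e | X_t=s).
    """
    # best[s] = probability of the best path ending in state s
    best = {s: prior[s] * emission[s][observations[0]] for s in states}
    back = []  # backpointers: one dict per time step
    for obs in observations[1:]:
        ptr, new = {}, {}
        for s in states:
            p, prev = max((best[q] * transition[q][s], q) for q in states)
            new[s] = p * emission[s][obs]
            ptr[s] = prev
        back.append(ptr)
        best = new
    # Reconstruct the argmax path by following the backpointers.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

T = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
Em = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
      "sun": {"umbrella": 0.2, "no_umbrella": 0.8}}
path = viterbi(["rain", "sun"], {"rain": 0.5, "sun": 0.5}, T, Em,
               ["umbrella", "umbrella", "no_umbrella"])
```

Unlike filtering, Viterbi maximizes over paths instead of summing, which is exactly what MLE applications such as speech recognition need.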

Variable Elimination and Belief Propagation save time by

- Exploiting independence

- Storing intermediate results

- Eliminating irrelevant variables

Hidden Markov Models

• simplest dynamic Bayesian Network

• state is represented by a single "megavariable"

• evidence is represented by a single evidence variable

• applications in speech recognition, handwriting recognition, gesture recognition, musical score following, and bioinformatics

Speech recognition as probabilistic inference

Goal: find the most likely word sequence, given a sound signal

P(words | signal) = α P(signal | words) P(words)
Chain Rule

P(A, B | C) = P(A | B,C) P(B | C)
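The identity can be checked mechanically on a small uniform sample space; the two-dice space and the events A, B, C here are illustrative assumptions:

```python
from fractions import Fraction
from itertools import product

# Verify P(A, B | C) = P(A | B, C) * P(B | C) on two fair die rolls.
space = list(product(range(1, 7), repeat=2))

def p(event, given=lambda o: True):
    """Conditional probability P(event | given) on the uniform space."""
    num = sum(1 for o in space if given(o) and event(o))
    den = sum(1 for o in space if given(o))
    return Fraction(num, den)

A = lambda o: o[0] % 2 == 0        # first roll even
B = lambda o: o[1] > 3             # second roll above 3
C = lambda o: (o[0] + o[1]) % 2 == 0   # sum is even

lhs = p(lambda o: A(o) and B(o), given=C)
rhs = p(A, given=lambda o: B(o) and C(o)) * p(B, given=C)
```

Both sides come out to 1/3 on this space, as the chain rule requires.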