EECS 348 Midterm Knowledge

How do we represent knowledge?
• Procedurally (HOW):
- Write methods that encode how to handle specific situations in the world
• Declaratively (WHAT):
- Specify facts about the world
Logic for Knowledge Representation
Logic is a declarative language to:
• Assert sentences representing facts that hold in a world W (these sentences are given the value true)
• Deduce the true/false values of sentences representing other aspects of W
Entailment
one thing follows from another
Model
an assignment of a truth value (true or false) to every atomic sentence
A model m is a model of KB iff it is a model of all sentences in KB, that is, all sentences in KB are true in m
Satisfiability of a KB
A KB is satisfiable iff it admits at least one model; otherwise it is unsatisfiable
Propositional logic
the simplest logic - illustrates basic ideas
Inference
the process of generating sentences entailed by the KB
inference rule
consists of 2 sentence patterns ξ and ψ called the conditions and one sentence pattern ϕ called the conclusion
If ξ and ψ match two sentences of KB then the corresponding ϕ can be inferred according to the rule
An inference rule is sound if it generates only entailed sentences
A set of inference rules is complete if every entailed sentence can be obtained by applying some finite succession of these rules
The proof of a sentence α from a set of sentences KB is the derivation of α by applying a series of sound (legal) inference rules
Inference by enumeration
We can enumerate all possible models and test whether every model of KB is also a model of α
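A minimal sketch of inference by enumeration, assuming sentences are represented as Python functions from a model (a dict of truth values) to a Boolean. The function name `tt_entails` and the two-symbol KB are hypothetical, chosen only for illustration.

```python
from itertools import product

def tt_entails(kb, alpha, symbols):
    """Return True if every model of kb is also a model of alpha."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not alpha(model):
            return False  # a model of KB in which alpha is false
    return True

# Hypothetical KB: (A or B) and (not A)
kb = lambda m: (m["A"] or m["B"]) and not m["A"]
alpha = lambda m: m["B"]

print(tt_entails(kb, alpha, ["A", "B"]))  # True
```

The loop visits all 2^n assignments, which is why this approach is exponential in the number of symbols.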
Application of inference rules
• Legitimate (sound) generation of new sentences from old
• Proof = a sequence of inference rule applications
Can use inference rules as operators in a standard search algorithm
• Typically require transformation of sentences into a normal form
Model checking
• truth table enumeration (always exponential in n)
• improved backtracking, e.g., Davis-Putnam-Logemann-Loveland (DPLL)
• heuristic search in model space (sound but incomplete) e.g., min-conflicts-like hill-climbing algorithms
Conjunctive Normal Form (CNF)
conjunction of clauses, where each clause is a disjunction of literals
Horn Form (restricted)
KB = conjunction of Horn clauses
Horn clause =
• proposition symbol; or
• (conjunction of symbols) ⇒ symbol
- E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B)
Modus Ponens (for Horn Form): complete for Horn KBs
• Can be used with forward chaining or backward chaining.
• These algorithms are very natural and run in linear time
Forward chaining
• Idea: fire any rule whose premises are satisfied in the KB and add its conclusion to the KB, until the query is found
data-driven, automatic, unconscious processing,
- e.g., object recognition, routine decisions
May do lots of work that is irrelevant to the goal
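A sketch of forward chaining over the Horn KB from the example above, C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B). The representation (a list of facts plus `(premises, conclusion)` pairs) and the name `forward_chain` are assumptions made for this illustration.

```python
from collections import deque

def forward_chain(facts, rules, query):
    """rules: list of (premises, conclusion), where premises is a set of symbols."""
    count = [len(premises) for premises, _ in rules]  # premises not yet satisfied
    inferred = set()
    agenda = deque(facts)
    while agenda:
        p = agenda.popleft()
        if p == query:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, conclusion) in enumerate(rules):
            if p in premises:
                count[i] -= 1
                if count[i] == 0:  # all premises satisfied: fire the rule
                    agenda.append(conclusion)
    return False

facts = ["C"]
rules = [({"B"}, "A"), ({"C", "D"}, "B")]
print(forward_chain(facts, rules, "A"))          # False: D is never established
print(forward_chain(facts + ["D"], rules, "A"))  # True: C and D give B, then A
```

This version scans every rule for each symbol taken off the agenda. A truly linear-time version would index the rules by premise symbol.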
Backward chaining
Idea: work backwards from the query q:
to prove q by BC,
check if q is known already, or
prove by BC all premises of some rule concluding q
Avoid loops: check if new subgoal is already on the goal stack
Avoid repeated work: check if new subgoal
1. has already been proved true, or
2. has already failed
goal-driven, appropriate for problem-solving,
- e.g., Where are my keys? How do I get into a PhD program?
Complexity of BC can be much less than linear in size of KB
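A sketch of backward chaining with the loop check described above, using the same hypothetical `(premises, conclusion)` representation. The extra rule A ⇒ B is added only to show the goal-stack check. Caching of proved and failed subgoals is left out to keep it short.

```python
def backward_chain(facts, rules, query, stack=frozenset()):
    """Prove query from known facts, or from all premises of some rule concluding it."""
    if query in facts:
        return True
    if query in stack:  # subgoal is already on the goal stack: avoid looping
        return False
    for premises, conclusion in rules:
        if conclusion == query and all(
            backward_chain(facts, rules, p, stack | {query}) for p in premises
        ):
            return True
    return False

rules = [({"B"}, "A"), ({"C", "D"}, "B"), ({"A"}, "B")]
print(backward_chain({"C"}, rules, "A"))       # False
print(backward_chain({"C", "D"}, rules, "A"))  # True
```

Only rules that conclude the current subgoal are examined, which is how BC can touch far less than the whole KB.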
Logic for reasoning
Deduce new facts (α) using KB:
- The rules about the world
- Information we gather through perception
Pros and cons of propositional logic
Pros: declarative, allows partial/disjunctive/negated information, compositional (meaning of B1,1 ∧ P1,2 is derived from meaning of B1,1 and of P1,2), context-independent
Con: very limited expressive power
has a set of potential outcomes,
sample space
the set of all possible outcomes
random variable
can take on any value in the sample space
event
a subset of the sample space
Principle of Indifference
Alternatives are always to be judged equiprobable if we have no reason to expect or prefer one over the other.
• Each outcome in the sample space is assigned equal probability.
Law of Large Numbers
As the number of experiments increases the relative frequency of an event more closely approximates the theoretical probability of the event.
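A small simulation can illustrate the idea. The sketch below assumes a fair six-sided die and the event "roll a 6" (theoretical probability 1/6). The function name and the seed are arbitrary choices.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Relative frequency of rolling a 6 in n_trials simulated fair-die rolls."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == 6)
    return hits / n_trials

for n in (10, 1_000, 100_000):
    # the gap to 1/6 tends to shrink as n grows
    print(n, relative_frequency(n), abs(relative_frequency(n) - 1 / 6))
```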
Prior or unconditional probabilities of propositions
correspond to belief prior to arrival of any (new) evidence
Probability distribution
gives values for all possible assignments
Joint probability distribution
gives the probability of every atomic event on those random variables
Conditional Probability
The probability of an event may change after knowing another event.
Events are independent if the occurrence of one gives no information about the others. For two independent events, knowing that one occurred does not change the probability of the other.
Events A and B are independent iff:
P(A, B) = P(A) × P(B), which is equivalent to P(A|B) = P(A) and P(B|A) = P(B) when P(A, B) > 0.
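The product test can be checked exactly on a small sample space. The sketch below assumes two fair dice with every outcome equally likely (the Principle of Indifference). The events are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(a, b):
    """A and B are independent iff P(A, B) = P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

first_even = lambda o: o[0] % 2 == 0
sum_is_7 = lambda o: o[0] + o[1] == 7
sum_is_8 = lambda o: o[0] + o[1] == 8

print(independent(first_even, sum_is_7))  # True:  3/36 == 1/2 * 1/6
print(independent(first_even, sum_is_8))  # False: 3/36 != 1/2 * 5/36
```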
Conditional Independence
Dependent events can become independent given certain other events.
no members in common
Disjoint({animals, vegetables})
exhaustive decomposition
A set of categories s constitutes an exhaustive decomposition of a category c if all members of the set c are covered by categories in s
ExhaustiveDecomposition({Americans, Canadians, Mexicans}, NorthAmericans).
disjoint exhaustive decomposition
functions and predicates that vary from one situation to the next
Atemporal functions and predicates:
true in any situation
change a situation, described by stating their effects
Frame Problem
Actions don't specify what happens to objects not involved in the action, but the logic framework requires that information
Frame Axioms
Inform the system about preserved relations
Axioms of probability
• For any propositions A, B
- 0 ≤ P(A) ≤ 1
- P(true) = 1 and P(false) = 0
- P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Properties of Probability
1. P(¬E) = 1 - P(E)
2. If E1 and E2 are logically equivalent, then P(E1) = P(E2).
3. P(E1, E2) ≤ P(E1).
Naïve Bayes Conditional Independence Assumption
Assume that, given the class cj, the probability of observing the conjunction of attributes x1, ..., xn equals the product of the individual probabilities P(xi|cj)
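A sketch of how the assumption is used for classification: score each class by P(cj) times the product of P(xi|cj), then normalize. The classes, words, and every probability below are made up for illustration.

```python
from math import prod

# Hypothetical parameters, not estimated from any real data
priors = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.6, "boring": 0.1},
    "neg": {"great": 0.2, "boring": 0.7},
}

def posterior(words):
    """P(class | words) under the naive Bayes independence assumption."""
    scores = {c: priors[c] * prod(likelihood[c][w] for w in words) for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior(["great"]))            # pos is about 0.75
print(posterior(["great", "boring"]))  # pos is about 0.30
```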
challenges to Sentiment Analysis
• domain specificity
• "thwarted expectations"
• sarcasm and subtle nature of sentiment
• sufficient, high quality training data
Recall
Fraction of docs in class i classified correctly, i.e., correctly classified positives over all actual positives
Precision
Fraction of docs assigned class i that are actually about class i, i.e., correctly classified positives over all docs classified as positive
F Measure (F1)
2PR/(P + R)
Accuracy
Fraction of docs classified correctly, i.e., correctly classified over all docs
Macroaveraging
Compute performance (F measure, Precision, Recall) for each class, then average.
Microaveraging
Collect decisions for all classes, compute one contingency table, evaluate
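The two averaging schemes can be compared on a toy example. The sketch below assumes per-class counts given as (true positives, false positives, false negatives). The counts themselves are invented.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from one contingency table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(counts):
    """Compute the metrics per class, then average them."""
    per_class = [prf(*c) for c in counts.values()]
    return tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))

def micro_average(counts):
    """Pool the counts of all classes into one table, then compute the metrics."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return prf(tp, fp, fn)

counts = {"pos": (8, 2, 4), "neg": (1, 1, 1)}  # hypothetical (tp, fp, fn)
print(macro_average(counts))  # precision 0.65: mean of 0.8 and 0.5
print(micro_average(counts))  # precision 0.75: 9 / (9 + 3)
```

Macroaveraging weights every class equally, while microaveraging is dominated by the larger classes.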
Conditional Independence Assumption
features are independent of each other given the class
Naïve Bayes Method
Makes independence assumptions that are often not true
Bayesian Network
- Explicitly models the independence relationships in the data.
- Use these independence relationships to make probabilistic inferences.
- Also known as: Belief Net, Bayes Net, Causal Net, ...
- Bayesian networks are directed acyclic graphs (DAGs).
- Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values.
- The links of the network represent direct probabilistic influence.
- The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.
- The nodes and links are quantified with probability distributions.
- The root nodes (those with no ancestors) are assigned prior probability distributions.
- The other nodes are assigned the conditional probability distribution of the node given its parents.
Constructing Bayesian networks
1. Choose an ordering of variables X1, ... ,Xn
2. For i = 1 to n
- add Xi to the network
- select parents from X1, ... ,Xi-1 such that
P (Xi | Parents(Xi)) = P (Xi | X1, ... Xi-1)
This choice of parents guarantees:
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | X1, ..., Xi-1) (chain rule)
= ∏_{i=1}^{n} P(Xi | Parents(Xi)) (by construction)
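A sketch of the factorization in code, assuming Boolean variables and a hypothetical chain network A → B → C. The dictionaries `parents` and `cpt` and all numbers are invented for illustration.

```python
from itertools import product

# Hypothetical network A -> B -> C; every number is made up.
parents = {"A": [], "B": ["A"], "C": ["B"]}
# cpt[X][parent values] = P(X = True | parents)
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True,): 0.5, (False,): 0.2},
}

def joint(assignment):
    """P(x1, ..., xn) as the product of P(xi | Parents(Xi))."""
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

total = sum(
    joint(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(joint({"A": True, "B": True, "C": False}))  # 0.3 * 0.8 * 0.5
print(total)  # 1 up to floating-point rounding
```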
Linear connection
The two end variables are dependent on each other. The middle variable renders them independent.
Converging connection
The two end variables are independent of each other. The middle variable renders them dependent.
Divergent connection
The two end variables are dependent on each other. The middle variable renders them independent.
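The converging case is the least intuitive of the three, so here is a numeric sketch for a hypothetical network A → C ← B with invented probabilities. Observing C raises the belief in A, and then also observing B lowers it again.

```python
from itertools import product

# Hypothetical converging connection A -> C <- B; all numbers are made up.
P_A, P_B = 0.1, 0.2
P_C = {(True, True): 0.95, (True, False): 0.9,
       (False, True): 0.8, (False, False): 0.05}  # P(C = True | A, B)

def joint(a, b, c):
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C[(a, b)] if c else 1 - P_C[(a, b)]
    return pa * pb * pc

def prob_a(**evidence):
    """P(A = True | evidence), summing the joint over consistent worlds."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            num += p if a else 0.0
    return num / den

print(prob_a())                # prior on A
print(prob_a(b=True))          # unchanged: A and B are independent
print(prob_a(c=True))          # higher: C is evidence for A
print(prob_a(c=True, b=True))  # lower again: B accounts for C
```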
inputs to a Bayesian Network evaluation algorithm
a set of evidence, e.g.,
E = { hear-bark=true, lights-on=true }
outputs of Bayesian Network evaluation algorithm
- Simple queries
P(Xi | E)
where Xi is a variable in the network.
- conjunctive queries:
P(Xi,Xj | E)
simple query
Variable Elimination
- Eliminate repeated calculations
- Throw away irrelevant calculations/terms
is VE any better than Enumeration?
- For singly-connected networks (poly-trees), YES
- In general, NO
Stochastic Simulation
- Draw N samples from a sampling distribution S
- Compute an approximate posterior (conditional) probability P̂
- Show this converges to the true probability P
• Four techniques
- Direct Sampling
- Rejection Sampling
- Likelihood weighting
- Markov chain Monte Carlo (MCMC)
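As one concrete instance, here is a sketch of rejection sampling on a hypothetical two-node network Rain → WetGrass. The probabilities, names, and seed are assumptions for illustration. Samples are drawn from the prior, those that disagree with the evidence are discarded, and the rest are counted.

```python
import random

# Hypothetical network Rain -> WetGrass; all numbers are made up.
P_RAIN = 0.2
P_WET = {True: 0.9, False: 0.1}  # P(WetGrass = True | Rain)

def prior_sample(rng):
    """Sample (rain, wet) in topological order."""
    rain = rng.random() < P_RAIN
    wet = rng.random() < P_WET[rain]
    return rain, wet

def rejection_sampling(n, seed=0):
    """Estimate P(Rain = True | WetGrass = True) from n prior samples."""
    rng = random.Random(seed)
    kept = rainy = 0
    for _ in range(n):
        rain, wet = prior_sample(rng)
        if not wet:
            continue  # reject samples inconsistent with the evidence
        kept += 1
        rainy += rain
    return rainy / kept if kept else float("nan")

exact = (0.2 * 0.9) / (0.2 * 0.9 + 0.8 * 0.1)
print(exact, rejection_sampling(100_000))
```

Most samples are thrown away when the evidence is unlikely, which is the inefficiency that likelihood weighting is meant to address.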
Markov chain Monte Carlo
Rather than generate individual samples, transition between "states" of the network
Markov Chains
In a Markov chain a variable Xt depends on a bounded subset of X0:t-1
• First order Markov Process: P(Xt|X0:t-1) = P(Xt|Xt-1)
• Second order Markov Process: P(Xt|X0:t-1) = P(Xt|Xt-1,Xt-2)
Markov Models involve two things:
- transition model: P(Xt|Xt-1)
- evidence model: P(Et|Xt)
• Sensor Markov assumption: P(Et|X0:t,E0:t-1)=P(Et|Xt)
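The transition and evidence models combine into a recursive update for tracking the current state (filtering): predict one step with the transition model, weight by the evidence model, then normalize. The sketch below assumes a two-state rain/umbrella model with textbook-style numbers that do not appear in these notes.

```python
def forward_step(belief, evidence, transition, sensor):
    """One update: predict with P(Xt|Xt-1), weight by P(Et|Xt), normalize."""
    states = list(belief)
    predicted = {s: sum(belief[p] * transition[p][s] for p in states) for s in states}
    unnorm = {s: sensor[s][evidence] * predicted[s] for s in states}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# Assumed model parameters (hypothetical for these notes)
transition = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
sensor = {"rain": {"umbrella": 0.9, "none": 0.1}, "dry": {"umbrella": 0.2, "none": 0.8}}

belief = {"rain": 0.5, "dry": 0.5}
for e in ["umbrella", "umbrella"]:
    belief = forward_step(belief, e, transition, sensor)
    print(round(belief["rain"], 3))  # 0.818, then 0.883
```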
Filtering: P(Xt|e0:t)
Decision making in the here and now
Prediction: P(Xt+k|e0:t) for k > 0
Trying to plan the future
Smoothing: P(Xk|e0:t) for 0 ≤ k < t
"Revisionist history" (essential for learning)
Most Likely Explanation (MLE)
- e.g., speech recognition, sketch recognition
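The most likely state sequence is usually computed with the Viterbi algorithm. The sketch below assumes the same kind of two-state rain/umbrella model as a toy stand-in for a recognizer. All parameters are hypothetical.

```python
def viterbi(observations, states, prior, transition, sensor):
    """Most likely state sequence given the observations."""
    # delta[s]: probability of the best sequence that ends in state s
    delta = {s: prior[s] * sensor[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, back, delta = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * transition[p][s])
            back[s] = best
            delta[s] = prev[best] * transition[best][s] * sensor[s][obs]
        backpointers.append(back)
    # follow the back pointers from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return path[::-1]

states = ["rain", "dry"]
prior = {"rain": 0.5, "dry": 0.5}
transition = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
sensor = {"rain": {"umbrella": 0.9, "none": 0.1}, "dry": {"umbrella": 0.2, "none": 0.8}}

obs = ["umbrella", "umbrella", "none", "umbrella", "umbrella"]
print(viterbi(obs, states, prior, transition, sensor))
# ['rain', 'rain', 'dry', 'rain', 'rain']
```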
Variable Elimination and Belief Propagation save time by
- Exploiting independence
- Storing intermediate results
- Eliminating irrelevant variables
Hidden Markov Models
• simplest dynamic Bayesian Network
• state is represented by a single "megavariable"
• evidence is represented by a single evidence variable
• applications in speech recognition, handwriting recognition, gesture recognition, musical score following, and bioinformatics
Speech recognition as probabilistic inference
Goal: find the most likely word sequence, given a sound signal
P(words | signal) = α P(signal | words) P(words)
Chain Rule
P(A, B | C) = P(A | B,C) P(B | C)