BIOL 202 Lecture 31 Bioinformatics


Identification of proteins that promote or prevent the spread of heterochromatin
We understand quite a bit about the mechanism underlying position-effect variegation (PEV) because of genetic analysis of PEV
Researchers exploited the phenomenon to identify the necessary factors
Histone methyltransferase (HMT) makes sense - histone methylation was already mentioned as being associated with heterochromatin
Mutant flies were used
Can find mutations that enhance heterochromatin spreading and mutations that suppress it
Two proteins that promote heterochromatin spreading are HP1 and histone methyltransferase
Histone acetyltransferase (HAT) is important for suppressing heterochromatin spreading
Loss of the HAT gene leads to heterochromatin spreading
Mechanism for chromatin spreading
Histone methylation promotes conversion to heterochromatin
HP-1 binds methylated histones
Recruits HMTase
HMTase methylates neighbouring nucleosomes
etc. etc.
HP-1 recruits HMTase, which methylates nearby histones - spread

Barrier insulator binds HAT, provides anchor point for a boundary beyond which HP-1/HMTase/heterochromatin cannot spread
Heterochromatin can spread
It does this using HP1 protein
HP1 binds methylated histones and recruits HMTases
HMTase methylates neighboring nucleosomes, which bind more HP1, and the cycle repeats along the chromosome
What is stopping these HMTases?
Barrier insulators
Barrier insulators stop the spread of heterochromatin
Barrier insulator binds HAT, provides anchor point for a boundary beyond which HP-1/HMTase/heterochromatin cannot spread
Barrier insulators act as a wall and prevent heterochromatin from spreading
Chromatin spreading is counteracted by HATs bound to barrier insulators
Whole genome "shotgun sequencing"
Nowadays, using what's called next generation sequencing technology, we can do massively parallel sequencing - 1,000,000 sequencing reads at once.
A technique used today is called shotgun sequencing
Genomic DNA is cut into random fragments
Can sequence over a million DNA molecules using massively parallel sequencing!
You then sequence these fragments and look for overlaps between the sequence reads
Overlapping reads are joined into contigs, and overlaps between contigs extend them toward the complete sequence
Sequences are assembled all together to form a consensus sequence
The consensus sequence is an adequate representation of the sequence of each DNA molecule in that genome
However, not all reads are successful, so multiple techniques need to be combined to get the right sequence
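The overlap-and-merge idea can be sketched as a toy greedy assembler. This is only an illustration, not how real assemblers work: it assumes error-free reads from one strand, the function names and the `min_len` cutoff are made up for this example, and it ignores repeats and reads contained inside other reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

If two groups of reads share no overlap, the function returns more than one contig, which mirrors the gaps left when some reads fail.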
Bioinformatics and genome annotation
Extraction of information content of genome, identification of the full complement of encoded genes
Comparative genomics:
Are genes conserved between species? What information beyond coding sequences is encoded in the genome?
Comparative genomics is the study of genome structure, including the numbers, types, and locations of genes, compared between chromosomes or, more usually, between organisms
Attempt to extract information about genome function and evolutionary processes
Within an organism
Gene families & gene duplication
Between organisms
Types of genes
Ultraconserved elements
Functional genomics
Functional genomics:
Studies of gene and protein function at a genome-wide rather than gene-by-gene level
Gene identification or Gene annotation
The problem - genes are surprisingly rare in the human genome
Less than 2% of the human genome sequence encodes proteins
Even including intronic and regulatory sequences, only around 25% of the genome sequence makes up protein coding genes
What's the rest?
At least 50% of the genome is made up of repetitive sequences:
transposable elements
pseudogenes (non-functional duplicated copies of human genes - almost as numerous as functional genes)
simple sequence repeats (as short as 2 bp long)
blocks of tandemly repeated sequences - centromeres, telomeres, ribosomal gene clusters
Genome annotation is the idea that we can have the entire genome and be able to mark every gene in that sequence
All the genes will be mapped onto a scaffold
Take advantage of previous knowledge and can:
-Compare against known cDNAs (mRNAs)
-Look for similar sequences to known genes or proteins
-Look for Open Reading Frames (ORFs)
-Look for potential protein binding sites
Gene annotation - Identifying ORFs
Open Reading Frame (ORF) - contiguous stretch of DNA that codes for amino acids without being interrupted by stop codons
A protein coding gene will have large ORFs, while non-protein coding sequence will have only short stretches with random interruptions
Computer programs scan all 6 possible reading frames & mark those with large ORFs (say, greater than 50 amino acids)
Problem - introns in eukaryotes can also interrupt reading frames, so looking for ORFs alone is not enough
Average gene: 10 exons, 170 nt/exon; average intron size: ~5.4 kb
You can do a computational analysis of the DNA to identify open reading frames
It has 3 different reading frames on one strand and 3 other reading frames on the other strand
6 possible reading frames in total
Computer program can scan all 6 possible reading frames and mark those with large ORFs
A protein coding gene will have large ORFs
Genes however are organized into exons and introns
Average intron size is often larger than average exon size
Often scientists find many possible reading frames
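A six-frame ORF scan can be sketched in a few lines. This is a minimal illustration under stated assumptions: an ORF is taken to run from an ATG to the next in-frame stop codon (the start-codon requirement is an assumption, not part of the definition above), `find_orfs` and `min_codons` are made-up names, and coordinates are reported on whichever strand was scanned. It does not handle introns, which is exactly why ORF scanning alone is not enough in eukaryotes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 50) -> list[tuple[str, int, int, int, int]]:
    """Scan all 6 reading frames; return (strand, frame, start, end, codons) per ORF.

    codons counts the ATG and every codon after it, excluding the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # close it and keep scanning
    return orfs


demo = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(demo, min_codons=3))  # [('+', 2, 2, 20, 5)]
```

The default of 50 codons follows the "greater than 50 amino acids" rule of thumb in the notes; the demo lowers it only because the test sequence is tiny.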
Gene annotation - Comparing against known cDNAs
cDNAs are DNA copies of polyadenylated mRNAs from a particular tissue
Data from cDNA sequence databases can be used in comparison with genome sequence to
Identify what are likely genes
Identify what are exons & what are introns
Identify patterns of splicing (alternative splicing)
Compare nucleotide sequence & predicted protein sequence with database of known sequences
Look for "homology" - similar genes/proteins in other organisms that have already been characterized
Basic Local Alignment Search Tool (BLAST) is most commonly used
BLAST (Basic Local Alignment Search Tool) is a program that looks for homology between DNA or protein sequences
Sequences can be from any organism
BLAST reports the best match and also gives a total score
An identity of 100% means the sequences are identical over the aligned region
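To make "percent identity" concrete, here is a small sketch that slides a query along a subject sequence and reports the best ungapped match. This is not the BLAST algorithm (BLAST uses word seeding, gapped extension, and statistical scoring); the function names are made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_ungapped_hit(query: str, subject: str) -> tuple[float, int]:
    """Return (best percent identity, offset in subject); offset is -1 if nothing matches."""
    best = (0.0, -1)
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        pid = percent_identity(query, window)
        if pid > best[0]:
            best = (pid, offset)
    return best


print(percent_identity("ACGT", "ACGA"))                # 75.0
print(best_ungapped_hit("GATTACA", "CCGATTACACC"))     # (100.0, 2)
```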
Consensus sequences identification
Investigations have identified consensus sequences for:
RNA polymerase binding (e.g. TATA box)
Transcription factor binding
Ribosome binding (Shine-Dalgarno sequence)
5' & 3' borders of introns where spliceosome binds
In the absence of cDNA data, these can be used to predict gene structure
Problem - consensus sequences mostly short, often variable... poor predictors on their own
Shine-Dalgarno - ribosome entry sequence in bacteria and archaea
Sequences short - occur frequently by chance alone!!!

DNA sequence should also have splice sites, translational initiation sites, etc.
RNA Pol binds to TATA box and we can figure out where this is
In bacteria, we can find the Shine-Dalgarno sequence
We can simply annotate the genome to find these things
However, consensus sequences are often very short and variable
They often occur in the genome many times by chance
They are poor predictors by themselves
The main point is that these analyses on their own aren't very accurate
However, when these analyses are done together, you can make reasonably good predictions
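A quick calculation shows why short consensus sequences turn up by chance. The sketch assumes all four bases are equally frequent and independent (real genomes are not), and uses a round 3 billion bp genome purely for illustration; `expected_chance_hits` is a made-up name.

```python
def expected_chance_hits(motif_len: int, genome_len: int, both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in random sequence."""
    p_match = 0.25 ** motif_len            # chance that a given position matches
    positions = genome_len - motif_len + 1
    strands = 2 if both_strands else 1
    return p_match * positions * strands


# A 6 bp motif is expected about once every 4**6 = 4096 bp on one strand,
# which works out to over a million chance hits in a 3 billion bp genome.
print(round(expected_chance_hits(6, 3_000_000_000)))
```

Allowing variable positions in the motif raises the count further, which is why these sites are only useful in combination with other evidence.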
Gene families
Gene families - related genes within an organism that encode proteins of similar amino acid sequence ("sequence conservation")
Can contain two to >100 members (called PARALOGS)
Members can be functionally redundant or have independent functions
A large proportion of genomic sequences are present in at least two copies - "segmental duplication"
Gene families arise from gene duplication
Arabidopsis has only 5 chromosomes...
There is a lot of duplication in our genome
A large portion of our genome consists of segments of chromosomes that have been duplicated
Arabidopsis genome has 5 chromosomes
Can see its duplicated regions
Segments of the chromosome exist in at least 2 copies
Synteny is the conservation of gene order.
Is gene order conserved between related species?
Depending on evolutionary distance, can be quite similar across all or parts of the genome - SYNTENY
Over time, large scale chromosomal rearrangements have "shuffled" chromosomes
However, at the local level, genes often can have the same neighbours in related organisms (microsynteny)
Can be used to infer common ancestry of species
Synteny is the idea that because chunks of chromosomes have been inherited from a common ancestor, gene order within them is also conserved over time
Humans and mice descend from a common ancestor rather than from each other, so gene order is conserved b/w human and mouse in many regions
Human chromosomes are color-coded by comparison to mouse chromosomes
The genes of chromosome 19 are found in the same order on chromosome 10 in humans, as shown in yellow
This is because both came from a common ancestral organism
On the X chromosome, gene order is almost completely the same as in the mouse!
This is useful for determining ancestry
Transposable elements have played a prominent role in genome evolution
Can induce chromosome-scale rearrangements
Homologous recombination between multiple copies of transposons can cause duplications, deletions, and inversions
Contributed to gene duplication and shuffling of chromosome sequences
Homologous recombination occurs b/w transposons
Recombination b/w TEs in a chromosome can result in loss of fragments
Don't need to worry about details
This is one way in which genomic fragments have been shuffled over time
Whole genome shotgun (WGS) sequencing
Whole genome shotgun (WGS) sequencing: determines the sequences of many short segments of genomic DNA, generated by breaking the long chromosomal DNA into many short pieces
Traditional WGS sequencing uses Sanger sequencing
WGS sequencing first starts with collecting short segments of DNA in the form of genomic libraries
These segments are inserted into vector DNA
To make a genomic library, the researcher uses restriction enzymes, which cleave DNA at specific sequences
Resulting fragments have short single strands of DNA at both ends
Each fragment is then joined to the DNA of an accessory chromosome (the vector), which has also been cut with a restriction enzyme
Fragments from multiple copies of the genome are inserted into these vectors
These recombinant DNA molecules are then introduced into bacterial cells, which take up one vector each
They are amplified, forming cells that are clones
The resulting library of clones is called a shotgun library b/c sequence reads are obtained from clones randomly selected from the whole genome library
The fragments in these clones are then sequenced
Afterward, overlapping reads are assembled into sequence contigs
Next Generation or 454 Pyrosequencing
This next type of sequencing is similar to traditional WGS in that it obtains contigs
However, it is done in different systems
Can be done in cell-free reactions
Can also be done so that millions of fragments are isolated and sequenced in parallel
454 Pyrosequencing:
Step 1: Single DNA strands are immobilized on single beads. A DNA template library is generated
Step 2: These molecules are amplified by PCR so there are many DNA molecules attached to each bead. Each bead contains identical DNA fragments
Step 3: Each bead is deposited into a tiny well. In pyrosequencing, DNA on each bead is sequenced
DNA polymerase and primers are added to wells along with deoxyribonucleotides: dATP, dGTP, dCTP and dTTP
The nucleotides flow through in a specific order: dATP, dGTP, dTTP, and dCTP
Each time a nucleotide is incorporated, a pyrophosphate molecule is released
Two enzymes, sulfurylase and luciferase, act to convert this pyrophosphate into a visible light signal
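The flow-and-flash logic can be sketched as a toy simulation. It assumes an idealized read in which the light signal is exactly proportional to the number of bases incorporated during one flow (so a run of identical bases gives one brighter flash), and it uses the flow order listed in these notes; `flowgram` and `decode` are made-up names.

```python
FLOW_ORDER = "AGTC"  # dATP, dGTP, dTTP, dCTP, as listed in the notes


def flowgram(new_strand: str, flow_order: str = FLOW_ORDER) -> list[tuple[str, int]]:
    """For each nucleotide flow, record how many bases were added to the growing strand."""
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    signals = []
    pos = 0
    while pos < len(new_strand):
        for base in flow_order:
            count = 0
            # incorporate every consecutive matching base during this flow
            while pos < len(new_strand) and new_strand[pos] == base:
                count += 1
                pos += 1
            signals.append((base, count))  # count 0 means no light for this flow
            if pos == len(new_strand):
                break
    return signals


def decode(signals: list[tuple[str, int]]) -> str:
    """Rebuild the synthesized sequence from the recorded light signals."""
    return "".join(base * count for base, count in signals)


print(flowgram("AAGC"))          # [('A', 2), ('G', 1), ('T', 0), ('C', 1)]
print(decode(flowgram("AAGC")))  # AAGC
```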
Whole genome sequence assembly
Bacterial genomes are relatively easy to assemble b/c they are mostly single-copy DNA with few repeated sequences
Eukaryotic genomes have many repeated sequences, which makes sequence reads difficult to align
Paired end reads: sequences found at opposite ends of genomic inserts in the same clone. Used to find the correct order of contigs in eukaryotic genomes
Contigs joined together in order are called scaffolds
Paired end reads can be used to join 2 sequence contigs into a single ordered scaffold
Paired end reads can be produced by circularization, where genomic DNA is sheared and circularization adapters containing linker sequences are added to the ends of each segment
Next the DNA is circularized, and the circularized DNA and linker sequences are isolated
Additional adapters A and B are added to facilitate amplification and sequencing
Resulting library consists of paired end reads with two end tags
Bioinformatics is the study of the information content of the genome
Annotation is the process of finding all the functional elements of the genome
Proteome refers to the inventory of all the polypeptides encoded by an organism's genome
Difficult to detect this in eukaryotes but there are some ways
One way is ORF or open reading frame detection
This uses computational analysis of the genome sequence to predict mRNA and polypeptide sequences
Looks for sequences that have characteristics of genes
Sequences with characteristics of genes are called ORFs
Computer scans DNA sequence on both strands in each reading frame and there are 6 reading frames overall
mRNA can also be converted to cDNA, which can also be used to identify exons and ORFs
Expressed sequence tags or ESTs
There are also large sets of cDNAs where only the 5' end, the 3' end, or both have been sequenced
These short cDNA sequence reads are called expressed sequence tags (ESTs)
ESTs can be aligned with genomic DNA and used to identify 5' and 3' ends of a transcript
Different forms of gene product evidence (e.g. cDNAs, ESTs, BLAST similarity hits, codon bias, and motif hits) are used to make gene predictions
Processed pseudogenes
Processed pseudogenes are DNA sequences that have been reverse transcribed from RNA and randomly inserted into the genome
Transcriptome: sequence and expression patterns of all transcripts
Proteome: sequence and expression patterns of all proteins
Interactome: complete set of physical interactions b/w proteins and DNA segments, b/w proteins and RNA, and b/w proteins
DNA microarrays
DNA microarrays are used to study transcriptome
We can see which genes are active in a particular cell at a certain time by looking at microarrays
mRNA is extracted, then reverse transcribed into cDNAs, which are fluorescently labeled
They are then hybridized to the microarray, laser detection is used, and a computer calculates relative levels of hybridized probe
Can see where gene expression increased, decreased, etc.
In Microarray, you use control and experiment cDNA in order to analyze relative levels of hybridized probes
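The comparison of experiment and control signals is often expressed as a log2 ratio, sketched below. The gene names and intensities are invented, and the pseudocount (added so a zero control signal does not cause division by zero) is an assumption of this sketch, not something from the lecture.

```python
import math


def log2_ratios(experiment: dict[str, float], control: dict[str, float],
                pseudocount: float = 1.0) -> dict[str, float]:
    """log2(experiment / control) per gene; > 0 means increased, < 0 means decreased."""
    shared = experiment.keys() & control.keys()
    return {
        gene: math.log2((experiment[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in sorted(shared)
    }


# Hypothetical spot intensities for three genes
experiment = {"geneA": 400.0, "geneB": 100.0, "geneC": 50.0}
control = {"geneA": 100.0, "geneB": 100.0, "geneC": 200.0}
print(log2_ratios(experiment, control, pseudocount=0.0))
# {'geneA': 2.0, 'geneB': 0.0, 'geneC': -2.0}
```

A value of 2.0 means a fourfold increase relative to control, 0.0 means no change, and -2.0 means a fourfold decrease.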
Two hybrid test
Test used to detect physical interactions b/w 2 proteins
Used for interactome analysis to understand interactions b/w 2 proteins
Basis for this test is the transcriptional activator encoded by yeast Gal4 gene
This protein has 2 domains, a DNA binding and activation domain
In the two-hybrid system, these 2 domains of Gal4 are separated, which on its own makes activation of the reporter gene impossible
Each domain is connected to a different protein
If the 2 proteins interact, then the domains are brought together to start transcription of the reporter gene
The GAL4 gene is divided between 2 plasmids, each containing one of these domains
On one plasmid, a gene for one protein under investigation is spliced next to the DNA binding domain and this fusion protein acts as "bait"
On the other plasmid, the gene for the other protein is spliced next to the activation domain and this fusion protein is said to be the "target"
Chromatin Immunoprecipitation assay (ChIP)
This is used to study protein-DNA interactome
Important as proteins often bind and regulate DNA
Say you have a gene and suspect it encodes protein that binds DNA
To test this, you can treat cells with a chemical that will cross-link proteins to the DNA
This way proteins bound to DNA at the time of chromatin isolation will remain bound through subsequent treatments
Next step is to break chromatin into smaller pieces
To separate fragments containing your protein-DNA complex from others, you use an antibody that reacts specifically with the encoded protein
Antibody is added to the mixture so that it forms an immune complex that can be purified
DNA bound by protein may be sequenced directly or be amplified into many copies by PCR