male speaker: please join me in welcoming today's speaker, doctor julie segre. [applause] julie segre: hi, good morning, and thank you to all of you who have joined here in the auditorium, and also online, now and in the future. this is very much a developing field, but i think we're also at a moment in which many of the methods that we use for interrogating the microbiome are finally hardening, so that i can give you a talk where, if you were to use these approaches, your paper would still be current when you try to submit the data in a year or two. because this really
has been in the last five to ten years something that has evolved with the sequencing technology. when i first started looking at microbial communities, i used sanger sequencing to look at the 16s. then we bought the roche 454 instrument to use that. that instrument has been discontinued, so we're all now really very much in sync using the illumina miseq and hiseq, which means that there's more ability to do cross-study comparisons. but i wanted to sort of talk through today really what are the standards that our community is using, and how do we set up these studies? so i have no disclosures, and i'll start really with the introduction. and i'm going to give a very human-microbiome-focused study, because that's really my field of expertise, but i would say that the types of analysis that
we use, especially from the sequencing realm, would be generally applicable if you were looking at ocean communities or human communities. we are all looking to understand the myriad microbes, the bacteria, fungi, archaea, and from the human perspective, while we're interested in this, there is some variation in the human genome. but if you think about the orders of magnitude here, your body also is covered in these bacteria, so we are estimating trillions of microbes, and their genetic potential is quite diverse at the strain level, at the species level, at the genus level. and so this is another way in which there is a second genome that is associated with all ecosystems. these microbes are diverse and dynamic and that is really some of what you will see now as our society
grapples with the fact that the microbial communities may go through times where they bottleneck, and then what is going to come out the other side, either with humans in terms of antibiotics or in oceans in terms of oil spills. so just some general terms that i'll use: the humans of course are hosts to many of these microbes; the microbiome is the microbial community, the totality of all that dna. you'll see many of these things in [unintelligible] the microbial cells outnumber the human cells. some people say that there is an equal number if you're only talking about bacteria and fungi. i choose to include viruses in that, so i would say that the microbial cells will then outnumber them -- as much work as we've done to understand the functions of human
genes, the microbial dna is really understudied. so if we sequence a human genome, we may know the function of 75 percent of the genes, and we may understand that they can play multiple roles. when we annotate bacterial genomes, even things like e. coli, we haven't had that same rigorous testing of sort of what are the multiple functions of this protein, and we have a hard time even trying to predict protein structure -- i mean, protein function based on structure. so we're really entering a new period of discovery in which we need to, you know, have some assays where we can see 'how does this genetic potential read out in terms of function?' so while we've really focused for the last hundred years on the pathogenic potential of these microbes, and thinking about ebola,
thinking about tuberculosis, thinking about staph aureus, there are also many beneficial microbes. and that's part of the reason why we are so interested, is the role of these commensal beneficial microbes who aid in vitamin synthesis, digestion -- really important is the education and activation of the immune system. this is a way in which perhaps the microbial communities are being read out systemically, and over the lifespan of a human. one of the major goals also of these beneficial microbes is -- you know, nature abhors a vacuum. so if a pathogen enters a system, and the microbial community isn't stable or isn't present, maybe you've taken antibiotics, then you are more able to be colonized by a pathogen.
so now i'll just sort of launch into how we have been studying microbial communities. and the original way, of course, which is still done for many reasons also, is culturing microbes on these; this is blood agar. and it was recognized early on, of course, that the majority of these bacterial species don't grow in culture, and this has been called the great plate count anomaly, where there are microbes that are really hardy and easy to grow in culture. so i could grow staph epi -- staphylococcus epidermidis -- i could grow it from every body site, but it might not be that common. there is a bottleneck, or a distortion of the microbial community, that you read out by culturing it. and also, our systems have been set up to sort of grow these microbes in isolation when we know that many microbes really rely
upon the community in order to flourish. so the sequencing came about with the idea that maybe we could get a different perspective on what the microbial communities are, and really just to bring it full circle, we do a lot of culturing now too, but we usually do it informed by sequencing: we know what we are trying to culture, and we pick culturing conditions that will then allow those bacteria or fungi to flourish. so it's quite different, i mean, if i want to capture the fungi that live on the skin, i put olive oil on my plates to capture the malassezia, or i could inhibit the growth of the bacteria with certain antimicrobials. so this was really the first experiment that we did, where we were saying, you know, does sequencing give us a certain answer, because
-- i'm sorry, i've kind of gone out of cycle and said of course it does -- but i think from first principles you have to understand that, because we're about to make a big investment in sequencing all these microbes: does it tell us anything different? so this was the first experiment that i set up with patrick murray, who was the head of clinical micro. he put the skin swabs on many different culture plates -- the blood agar, the chocolate agar -- you can see the different morphologies that we recovered, and then we sequenced every isolate we got. we took that same parallel swab, we put it straight into lysis buffer and we sequenced what we got, and the results did not astonish patrick. what you can see from here is that the orange is the staphylococcus, so this is the comparing
the alar creases, the side of the nose, and the umbilicus, the belly button. and what you can see here is that if i just do a survey, which is the dna sequencing, i get a community that has mostly propionibacterium, that dark blue, it has some cornflower blue, the corynebacterium, it has some staph also, the orange, and firmicutes in red. when i put it into culture, i lose the diversity that you're seeing up at the top, including that little green proteobacteria, and what i get is basically propionibacterium and staph, which we know how to culture. the umbilicus, or the belly button, is even more extreme, where you're seeing the sequencing would say that there's a lot of corynebacterium there. we can culture those corynebacteria, but they're mostly being outgrown by the staph and the firmicutes.
so we're using this now to say there's a similarity between the two communities of what you get from dna sequencing and what you get from culturing, but that there is a reproducibility and an accuracy in representing what is the community based on the sequencing. sorry, i should have put this into the slides: i would say that for all of our experiments moving forward, we standardize to the human microbiome project mock community, which can be ordered from bei, which is sort of like atcc, it's a not-for-profit repository. and we sequence the mock community, we actually do it on every single plate of sequencing that we do, to standardize our experiments. and i have shared it with people, but i also would recommend that people just order it if you're starting to do sequencing experiments
because it has a known answer. that's something that you're seeing here, where i'm giving you two different results and you're saying 'but what is the truth?' and that's where the mock community, which is a mixture of 20 bacteria that have been put in all at the same concentration, is very beneficial, because it allows you to standardize across sequencing lanes if you change protocols. anything that you change, we always standardize back to that mock community, and we run it with every plate that we do of sequencing so that we know if a plate has failed. it also helps us if we do a study and then we collect more samples maybe two years later, and we think 'well, have we changed things in the laboratory in those two years without even perhaps realizing it?' we always go back and compare that exact same
sample again. so topics for today's -- well, first of all there will be the random things where i go off and realize that i should have put something like bei in, the mock community in -- but i'm first going to talk about bacterial diversity studies, fungal diversity studies, bacterial genomes, metagenomics, and then finish with where's the technology going. bacterial diversity studies are typically based on the 16s gene, which is part of the -- the 16s is a ribosomal rna, so i'm sure you're all aware that the ribosome is where proteins are synthesized. the ribosome is a mixture of ribosomal rnas and also of proteins. these ribosomal rnas are in high copy in the genome, and they also have a structure where there
are regions of them that are more conserved because they are necessary for structure and some are variable. and this 16s gene has really been used as the signature phylogenetic marker for decades now that allows you to identify bacteria and archaea. and you see it here, where on the left is the ribosomal rna gene and you can see these stems and the loops. and the stems are of course more highly conserved because they have a structure where you're going to have to have a double-stranded rna there. but we use these regions where you can see on the right-hand side is the variability, so the variability across the gene, where each of the variable regions where you might get more information is marked. and then you sink primers in the highly conserved region, and you sequence
across the more variable regions, which helps you to identify then what are the genus, sometimes to the species, level. so this is sort of the basic workflow, where from a microbiome sample you can have multiple members of the community, you do one dna extraction directly from the sample, we don't do culturing beforehand, you amplify the 16s gene and you can use that for taxonomic classification. you also can use that for doing population-based analyses, where you talk about alpha-diversity and beta-diversity; that basically means how many different species are there in this community, and how does this community compare with another community, so you can compare two different communities. okay, so i put this in, i thought, you know, really even just in the handouts, you know,
just to kind of put this there. would -- you know -- okay, so pretty much i've said people are using 16s, but before that, what are the things that you need to consider when you're setting up a study? so first of all i think it's really useful to define the question as precisely as possible. here's one question: i want to compare wild-type with knockout mice. it turns out if you come talk to me, i'll have a lot of questions about that: are these mice littermates? because what we've seen is that there can be variation even due to just cages and how they've been breeding, and you'll see one example of that. but i'll also ask you what controls do you need? so i think it is important to try to really be as clear as possible about the study design, and that's not the focus of today's
talk. i'm going to really talk about more of these other questions here: what sequencing platform will you use, what region of the 16s gene will you amplify, how many reads per sample do you need, what are the hidden technical issues -- i'll focus here on chimeras -- what analysis tool will you use, how will you display your data, how will you compare your results with other published studies, and how much information do you really need from these studies to yield a testable hypothesis. so i want to just take you through my sort of, my cookbook, how you would follow this recipe. from the very beginning, one of the things that we do struggle with is calculating the bacterial load, and so here i would say that typically people are using a qpcr approach, to say how many copies of the bacterial gene.
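A rough sketch of that load calculation, in Python. The function names and all of the numbers are hypothetical. The two inputs the calculation depends on, the number of 16S copies per genome and the denominator you normalize to (grams of stool, square centimeters of skin), are choices you have to supply; the qPCR itself only gives you a gene copy count.

```python
def genome_equivalents(copies_16s, copies_per_genome=1.0):
    """Convert a qPCR 16S copy count into estimated genome equivalents.

    copies_per_genome is an assumed average 16S copy number for the
    community; 1.0 means no correction is applied.
    """
    if copies_per_genome <= 0:
        raise ValueError("copies_per_genome must be positive")
    return copies_16s / copies_per_genome


def load_per_unit(copies_16s, amount, copies_per_genome=1.0):
    """Bacterial load per unit of sample (per gram, per cm^2, ...)."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return genome_equivalents(copies_16s, copies_per_genome) / amount


# Hypothetical stool sample: 2e9 copies measured in 0.5 g,
# assuming an average of 4 copies of the 16S gene per genome.
stool_load = load_per_unit(2.0e9, amount=0.5, copies_per_genome=4.0)

# Hypothetical skin swab: 3e5 copies from a 4 cm^2 area, no copy correction.
skin_load = load_per_unit(3.0e5, amount=4.0)

print(f"stool: {stool_load:.2e} genome equivalents per gram")
print(f"skin:  {skin_load:.2e} genome equivalents per cm^2")
```

Changing either the copy-number assumption or the denominator changes the reported load, which is why it helps to state both alongside any load figure.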
most people are still using 16s rrna. i would say that also there has been some effort from [unintelligible] and others to identify genes that are single-copy, to get an even more accurate assessment. i did say that the 16s gene can have multiple copies in a genome that you may have to control for, but i think you all understand that a qpcr could tell you, you know, how many copies of bacterial genome do i have in this sample. i think the hard part is really, what are you going to normalize your sample to? maybe if most of you do gut studies, maybe you can normalize that to the grams of stool. and i guess it's just something that you have to consider, that maybe there's more undigested food material, so maybe the grams of stool
isn't always the right measure. you know, jeff gordon has done this and he's sort of measured it versus how many calories are being excreted. but i think that's, for us with the skin, we sometimes think about it where we're trying to say per square centimeter how many bacteria do i get, and we're comparing when i swab the skin with when i scrape the skin with when i do a full-thickness punch biopsy. and there the difficulty is where we can normalize to square centimeters, but we do wonder is there variability in the user who has collected the sample. so probably what most of you did come here to talk about is the dna sequencing. the method that will give you, at this point, the most information of sequencing the 16s
is to use an illumina miseq with an amplicon. and so what you're doing down here, you're putting in these primers, this is amplifying the v1-v3 region. there are other primers that are very standard that will amplify v4, v6. and you're amplifying the 16s gene, putting it on the illumina miseq, and that's the sequencing platform that we have sort of standardized to. i would say for a small study what i've seen is that the sequencing is limiting, because really there is an investment here for these primers. the way that we do it in a production -- we're a boutique production lab -- is we have these primers, but we then have a stutter linked to an illumina barcode. and the stutter helps us so that -- and you can read more about it in all three of these papers -- the stutter
means that when you load it onto the illumina, if you have an amplicon sequencing, what is hard when you go onto an illumina instrument is that they all will read an a and a c and a g if you're amplifying a pcr product, because they're all going to read the primers that you used to amplify that pcr product. so the first 20 base pairs is all going to look the same, and it's hard for the illumina instrument to get the register if every cell is lighting up the same base pair. that's why the stutter means that we put it in where there's sort of between 0 and 4 base pairs on the different primers, and that gives us where now everything is off-register from each other, and then we can actually go in and detect much better the amplicons. if you
don't do that then people often load phix, just so that there is not everything reading a c c a g, something like that, that would be the primer sequence. but to get to this point, i would say that scale is the issue, so in a small study the sequencing is limiting, and i still think that that's a lot of why at the nih we are still really trying to create a microbiome initiative, where there would be someplace where you could load a hundred samples and he would have 50 samples and she would have 25 and we could really sort of do this together rather than all having to set up the different reagents, set up the same platform, and have a few samples every once in a while. because right now we do multiplex 400 samples all together in one lane, but even we who are a microbiome lab have trouble
finding 400 samples. there are other means of sequence data acquisition: some people talk about oligotyping or phylochips, where you send it and they put it on a microarray. i think that the analysis of that data can be more straightforward because it's more like looking at microarray data. it probably is more expensive, but these things are always hard to cost out, and i guess the limitation is if your goal is to find a unique or novel species, you can't find that on something that has defined material. the other really good method is the illumina hiseq, and that is what, you know, sort of these big studies like the earth microbiome are doing, and they're pretty much analyzing there the v4 region. so it's a shorter read,
it will give you less phylogenetic information, but that's certainly what a lot of the larger studies you will see are doing. okay, so, you get these 16s reads back, how do you figure out what they are? and if you think the answer is that you would go and blast it, you will unfortunately blast your sequence, it will match tons of things, and probably the majority of what it matches are things that say 'uncultured' from a 16s rrna sequencing study. and that won't help you very much, because unfortunately people like me have just littered genbank, because we had to deposit all of our data from all of our studies, and we just annotated it as an uncultured 16s, so that really doesn't help you very much.
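As a small illustration of that problem, the Python sketch below takes a list of BLAST hit descriptions and reports what fraction are labelled "uncultured". The hit strings are made up for the example, and matching on the word "uncultured" is an assumption about how such records are titled, not a rule of any database.

```python
def is_uncultured(description):
    """True if a hit description carries the 'uncultured' label."""
    return "uncultured" in description.lower()


def uncultured_fraction(descriptions):
    """Fraction of hit descriptions labelled 'uncultured' (0.0 if no hits)."""
    if not descriptions:
        return 0.0
    flagged = sum(1 for d in descriptions if is_uncultured(d))
    return flagged / len(descriptions)


def named_hits(descriptions):
    """Hits that carry some taxonomic name rather than 'uncultured'."""
    return [d for d in descriptions if not is_uncultured(d)]


# Hypothetical hit descriptions for one 16S read.
hits = [
    "Uncultured bacterium clone 16S ribosomal RNA gene, partial sequence",
    "Uncultured Staphylococcus sp. 16S ribosomal RNA gene, partial sequence",
    "Staphylococcus epidermidis 16S ribosomal RNA gene, partial sequence",
    "Uncultured bacterium 16S ribosomal RNA gene, partial sequence",
]

print(f"{uncultured_fraction(hits):.0%} of hits are uncultured")
print(named_hits(hits))
```

Even after filtering, the remaining named hits are only as good as their annotations, which is the motivation for classifying against a curated reference set instead.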
i'm going to talk about the tools that we use: mothur, qiime, and clovr; i'm going to really focus on mothur and qiime because they're really the workhorses. and built into these are a lot of tools that i'll try to unpack some of, but i have to say, in the olden days, underlying mothur is [unintelligible], you won't see that anymore. they were all kind of built as separate tools, but they have all kind of been brought together, where it's sort of one-stop shopping now, either at mothur or qiime. and it's also been a place now where the community adds additional resources, so at one point my lab did a fungal study and we built this fungal database. well, what we did was then we loaded it into mothur and into qiime, so it's kind of gotten to be a place where we really bring together
tools. so the 16s sequences. we use pretty much a reference-dependent database. so if you want to classify a sequence, within mothur, within qiime, you can go in and use -- we've all kind of standardized to the ribosomal database project, which is very similar to silva, very similar to greengenes -- and it will give you an assignment for a bacterium, where it's a curated reference dataset, so it actually has sort of brought into play what are the high-confidence differences between two different genera, and between two different phyla -- well, not phyla, but between two different families, orders. so it'll give you that kind of resolution, and you can feed into these databases any of the different
regions of the 16s gene. there are some differences, like if you want to get beyond the genus level, then there are some regions that are better for getting to the species level, so for staphylococcus, you would want to use the v1-v3 region, but for lactobacillus you'd want to use the v4-v5. so it is important to think about what is really the genus that you care most about, or what is the tissue that you care most about, and you may want to tailor your sequencing to that. i also should say that each of these primer sets has its own bias, and that has been documented. that's where, again, the mock community comes in really useful, because you definitely want to test out your primers, your sequencing, on the mock community, because
if there are signature taxa for your body site, you want to make sure that you're actually recovering them. so from this it will return something that you can make into a bar chart that basically says what are all the sequences and the genus. if you get a sequence that has no reference, you may think that you have identified a novel bacterium, but there are other explanations, and i'll get to that in a second. you know, not even to say more about it, but i just wanted to give you at least the basic facts: this is the rdp database. as i said, it's based on aligned, curated, annotated 16s genes, where a lot of work has gone into classifying, and i'm just sort of giving the, you know, the real specifics to it, because there were choices that had to be made,
i mean, for example, for rdp they used bergey's taxonomy. there are some types of bacteria that change from one name to another, and you know, this can be frustrating to people, but you know, we continue to discover more microbes and more distinctions, and there is a community that determines when something gets reclassified. from the rdp classifier you can also generate things, like probe match and seqmatch. this is the silva database. they're really, you know, quite similar, i don't know that you'd get a different answer from using silva than using the rdp, but i wanted to at least make you all aware of this, and there are some things you can do in these tools. pretty much they'll all do more or less the same things, but it may be that one visually
appeals more to you than the other; that's a constant challenge i'm sure everyone has talked about in this lecture series. genomic information is so rich that a lot of times it's the display of the information that really is important in terms of understanding the depth of it. so it may be that because of what you're trying to pull out and identify, the visualization tools that are built into one of these programs will appeal to you more than the other. okay, if you get a novel sequence, you may think you have something truly novel. i would say that probably the first thing you should think about is 'do i have a chimeric sequence?' and what happens here, you think, how could i have a chimera,
these things are just a few hundred base pairs long. well, that's another thing that the hmp, the human microbiome project, really took a close look at, and i have to say, i think that everyone who served on that committee was shocked at how many chimeras we had. so let me tell you about the test we did. how do chimeras occur? well, it's incomplete extension of a pcr, so basically what happens is you start amplifying on one strand, and then that cycle of pcr ends. and the next round of pcr when it starts, in fact you've ended in the middle of a very conserved region and now you can amplify anything, so your query sequence would end up being something that started as a green and ended up as a blue. and when you go into the database you can't assign that, so it'll say that this
is a novel species. well, how often does that happen? this is the use of the mock community, where what we were doing here were two things: so first you will see here with the mock community we are trying all these different primer sequences, and you can see that these are the twenty different bacteria that we had in the community, and you can see that some of them ended up being overrepresented by certain primers, some of them end up being underrepresented by primer sequences, and each set of primers has its own bias. not great, at least it's been documented. but then along with that, every set also has this percent of observed chimeras. so remember we put 20 bacteria in, and how many species do we get out again? well, it
turns out that depending on, like, how you cluster and what are your criteria for pulling out chimeras, you could end up in this, you know -- you end up at least having forty species in here, but you could end up thinking that you had 350. so now built into things like mothur and qiime are these things like chimeraslayer, that will identify these kinds of sequences and remove them from your run. and you know, you may say 'but what about if this really is a novel bacterium and this is what i want?' you can certainly go back and look through those sequences, they're not removed from your dataset, but you know you'd need to use that data with caution. so with this kind of sequencing data, i just wanted to sort of show you some of the results
and how we can use this data. this is the data from the nih common fund human microbiome project, where 250 healthy subjects were surveyed at 5 major body sites, and in some of those sites, like in the oral cavity, there were multiple samples taken, and we then asked what are the bacterial communities, using 16s amplification. and you can see that the major determinant here is what is the body site; so you'll see in the gut or in the stool, there's a lot of these bacteroidetes and a lot of these firmicutes, the yellows and the browns. this is actually the average of the data, i'll show it again in a minute. whereas in the nares you're seeing a lot more of these blues, the actinobacteria, and the vagina is going to have the lactobacillus, that red. so the
major finding here was that the body site is more determinative than the individual. and in fact it goes even to the body site, so that the bend of my right elbow is most similar to the bend of my left elbow, but after that the bend of my elbow is more similar to andy's than it would even be to inside my nose, because this is moist epithelium and this is sort of a drier crease. so again this is showing more of the individuality, so you're seeing the same features as i was saying. the lactobacillus is really dominating in the vagina, the gut is again these bacteroidetes and firmicutes, the mouth is going to have this high representation of streptococcus, and you can see that this is again showing just the determination by the body site, so you can use this as a way of sort of guiding
what are the bacterial communities that you can expect to find. and when you set up a study, if you can recruit a small number of healthy volunteers, then you could sequence those and assess whether you got data that was similar to the larger human microbiome project, and that would allow you to sort of leverage the larger data set. just to show one example from our own work of sort of how you think about these changes in bacterial communities, this is a study that we did where we looked at the skin microbial communities as children transition through puberty. and i think you can see here -- this actually for me i'm just putting up because i think it's a fairly obvious study. the kids here on the left are all prepubescent, the kids here on the right are all postpubescent.
and what you can see is that these kids before they go through puberty have a lot more of the reds, which end up being all of these proteobacteria, they also have a lot more of these streptococcus, which also makes sense if you think kids get impetigo, a strep infection which adults don't get. and we always thought it's because they were icky kids or something, but maybe it's because there is more strep that naturally colonizes their skin. the changes here that we are seeing is that postpubescent there's more of these corynebacteria and these propionibacteria, the greens, and that also would make sense in that these are bacteria that require lipids for their growth, and when you transition through puberty your skin becomes oilier, so it would make sense that these bacteria could
become more prominent. so that's an example where even in a healthy state you could see very clearly a transition. and we can sort of -- we can lay it out down here where you're seeing which bacterial genera go up and down. so obviously, as i was saying, in the later kids it's these corynebacterium and the propionibacterium. i guess this also does make the point that you have to think about some of these things. for us, we did the study because we were wondering, when we have kids, do we have to age match them. and from this the answer is clearly yes. so okay, so you've got the 16s data and you can plot it as rdp, as what is the bacterial genus and species, but you've probably also seen other types of analyses, typically when people are looking at them
at the community level. and there some people will just use them at the genus level based on the 16s, but within both mothur and qiime, there is this other way in which a lot of the studies end up being done, on what we call operational taxonomic units. so let me take a minute to explain that to you. so you could say that these bacteria all belong to staphylococcus or streptococcus, or you may say that these are all firmicutes, but really the sequence data is related to the phylogenetics, and we sort of have these definitions that typically species would have to mean that you would have to be at least 97 percent identical at the 16s level. so we have the sequence data, so what we do is then we try to take the sequence data, and cluster them based on sequences that have
97 percent identity, because that kind of gives us -- it's a computational, mathematical way of talking about sequences that have the similarity, without having to go through the loop of identifying what every bacterial genus is, and some bacteria don't have that proper specification down to the species level. for example, the corynebacterium, we just haven't sequenced that many of them, so i can't just assign things and say that this is a corynebacterium accolens, this is a corynebacterium simulans; i don't have enough sequenced reference genomes. but i can see in the sequences that these are all corynebacterium, and these sequences are much more similar to each other and these are much more similar, and so i want to be able to retain that level of resolution, but i don't have the reference genomes always
to make an assessment and say to the species level what is this sequence. so this allows us to really capitalize on the sequencing data, and say that i have these operational taxonomic units and i can assign them based on 97 percent identity or 99 percent identity. there also are differences here in terms of whether you are a lumper or a splitter, and you can use the furthest neighbor or you can use the nearest neighbor as your joining methods. and by that i mean you can have a centroid sequence and you can say that anything that is 97 percent identical to it i will put together into the same otu. that could mean that two sequences are really only 95 percent identical to each other. we actually require that every sequence within an otu is at least 97 percent identical to every other sequence. so i don't
know if that's a little bit too much of a nuance, but you can see how these otus -- think about it in the general way, you can either be a lumper or a splitter, and you have to make those kinds of decisions. okay, so then you have these otus, what are you going to do with them? and i think the two most common things that people do is they look at community membership and they look at community structure. so let me just distinguish for you in a toy way what i mean by that: let's say i have two groups and i'm making two kinds of fruit salad, and my group a i'm going to use mostly apples and oranges, but i'm going to put in some bananas, some pears, and some grapes. the second group i only have apples and oranges. so if i think about community membership,
where i say how many categories of fruit are shared between them, then it's only two of the five. if i think about community structure and i say, if i pick a piece of fruit out of a and i pick a piece of fruit out of b, how common is it that i would find the same piece of fruit in a and b, then the communities look much more similar to each other, because 94 percent of the pieces of fruit in group a are the apple and orange and that's 100 percent of group b. and both of these are accurate comparisons of what is the community. and that's where we have both of these measures, and so we really do assess it, because you could imagine that in terms of -- when i'm thinking about how a bacterial community transitions, that concept of community membership is going to be important, if rare species end up blooming
and causing disease. so if you have c. diff in your gut community and you take antibiotics, you could end up having a much greater colonization of c. diff, whereas if you don't have c. diff in your original community and you're not exposed to it, it can't bloom. so in that case the rare species are important, but if you want to talk about a community that maybe provides colonization resistance, then it may be more important what are the dominant community members. so i would say most of the time we calculate both of these and we look for if there are discrepancies. community membership, i'm giving you here one example and i'm sorry i seem to have forgotten the reference for this, it was from a pnas paper that was done very early from
jeff gordon's lab where they are looking at mice that are from a cross where they're typing obese mice, ob/ob mice that have the mutation in the leptin gene, and they're looking to see how they cluster. and what they're finding here, this is community membership. so they're looking to see do they share microbes. and it's not about the relative abundance, it's just do they share microbes, and here what you'll see is that the code here, these are the pups, m33 means it's the third pup, the first and the second, and what you see here is that the pups will end up looking most like their mother, and here again you're seeing they cluster based on who was the mother of this litter. and in this case here's mother 2. so at the level of where are you inheriting your microbes
from, in this case where it's an experiment with mice where the father might have even been taken out of the cage, i don't know why they didn't analyze it, you inherit microbes from the mother, and then there's sharing amongst the siblings. another study that we were looking at though, these are littermates, and what we're looking at here is what is the community structure. so here we're looking more at the enrichments in certain bacteria that are shared by the genotype of the mice as compared to the wild type, because they were born to the same mother. because in this example where the mice have a defect, there are certain bacteria that are more commonly colonizing because the skin is impaired in those mice. so that's kind of the two different measures and why they might
give you different readouts. one of the questions of course is how many reads do you need, and i would give you a ballpark estimate of like 1000 sequences for a first-pass analysis. you typically will overgenerate, so in a miseq run it's hard not to generate over 10000 reads. but it is still probably important to think about how diverse is the community, especially if you're doing ecologic studies. so i would say also it depends on how you're clustering them, that's why i talked about the otus first. but for some sites we see very low diversity, so if you look even at the y axis here this is like a very low diversity sample where we really think there's just four species, whereas this person's bellybutton, we really
are still accumulating new sequences. so it's just worth checking to see with a rarefaction curve, how diverse is your community. and then as i was saying there are these different ecologic measures, richness, diversity, they all are telling you something different about the community, and they all are easily calculated within mothur, or in qiime, and there are very good tutorials within both mothur and qiime, they were both written by ecologists who really are trying to translate this for people who may not have the full background. in addition, i wanted to just highlight these two papers that i think really try to talk you through what are the factors you should consider when setting up a microbiome study.
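As a rough illustration of the membership versus structure distinction in the fruit salad example, the sketch below scores membership with a Jaccard index and structure with a Bray-Curtis style overlap of relative abundances. Both index choices are assumptions made here, and so are the exact counts in group A (47 apples, 47 oranges, 2 each of the rest), which were picked only to reproduce the "two of the five" and "94 percent" figures. The talk does not say which indices were used on the slide.

```python
def relative_abundance(counts):
    """Convert raw counts per category into fractions that sum to 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def membership_similarity(a, b):
    """Jaccard index: shared categories divided by all categories seen.

    Only presence or absence matters, abundance is ignored.
    """
    present_a = {name for name, n in a.items() if n > 0}
    present_b = {name for name, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)


def structure_similarity(a, b):
    """Bray-Curtis style similarity on relative abundances.

    For each category take the smaller of the two fractions and sum them,
    so dominant shared members drive the score.
    """
    frac_a = relative_abundance(a)
    frac_b = relative_abundance(b)
    categories = set(frac_a) | set(frac_b)
    return sum(min(frac_a.get(c, 0.0), frac_b.get(c, 0.0)) for c in categories)


# Hypothetical counts chosen to match the numbers quoted in the talk.
group_a = {"apple": 47, "orange": 47, "banana": 2, "pear": 2, "grape": 2}
group_b = {"apple": 50, "orange": 50}

print("membership:", membership_similarity(group_a, group_b))
print("structure: ", round(structure_similarity(group_a, group_b), 2))
```

The two scores disagree on purpose: membership penalizes group B for lacking the three rare fruits, while structure is dominated by the apples and oranges that both groups share.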
okay i'm going to -- i mean for those of you who are looking at the time and wondering how i'm going to get through all of this, topic one was the major topic because that's what most people are interested in, i'm going to kind of give a flavor of more of the rest of the work. i'm going to talk for a minute about fungal diversity because it's very similar to bacteria, but it does require a different sequencing method and a different database, so i've talked a lot about the 16-s amplification. in fungi there are ribosomal rna genes, the 5.8-s and the 28-s and the 18-s, for those of you who ever run eukaryotic rna gels you know that those are the bands you're looking for when you're running a northern. and some people do sequence the 18-s, it gets harder, especially if you are looking in human
samples to deplete and find primers that are specific for fungi rather than humans. and the primers that have worked best for us are the primers that are actually amplifying the its1 region, the internal transcribed spacer that's between the 18-s and the 5.8-s, this is also the region that is used by most clinical micro labs to identify a fungus, and so the databases for these are also just the most well developed. this has some difficulties in that i was talking to you about how the 16-s sequence has structure to it and has those more conserved and variable regions because it is a functional rna. here you are working with a non-coding rna, and even a non -- just like a spacer sequence, so you don't have that fixed-width alignment to do your classifications, you can have 20
base pairs coming in and out and obviously it's not affecting the structure, so really the way that we then align, we don't penalize for these kinds of large insertions/deletions. we do have custom its databases that have been resolved at the different phylogenetic levels, so it is similar to how we do the rdp classification for this. and we get different results, so in our skin bacterial communities we're going to see what is on the skin and we're going to say that it's mostly corynebacterium and we talked about how the left elbow is different than the chest and the forehead; there are totally different communities when you look at the fungi. and so in this paper with our skin, we looked at what are the different communities of fungi, and we really, we are here trying
to develop data sets where you can then say, maybe are there fungal-bacterial interactions. for the skin we really found that it was mostly malassezia, but we could find tremendous fungal diversity on the feet, which probably wouldn't surprise you if you think about the fact that this is where you see many of the fungal infections amongst healthy volunteers. so that would be toenail infections and athlete's foot. but just to say that the fungal community is not as robust with their tools, but they certainly exist if you want to do those kinds of studies. and you can then see -- for us we looked at what is the fungal diversity versus the bacterial diversity and saw the discrepancies. okay, now i'm going to move on and talk about bacterial genome sequencing. so again
i've come up with what are the things you should ask yourself before you embark on this study, because i think many of the times, you might be thinking about sequencing a microbial isolate and then wanting to annotate it or use it in your studies. so first just defining what is the study objective. really for us a lot of our next question is what reference genomes exist, because if there is a very good high-quality reference then you can often take your reads and scaffold them onto an existing reference. but for the most part we're going to talk today about what sequencing platform you'll use, what depth of sequencing do you need, what assembly tool do you use, how are you wanting to display your data, what are you going to compare to other published studies, and how will this information yield a testable hypothesis. but
i do put forward those first two questions because i think they can often drive the decisions you'll make later on. so how to assemble a bacterial genome. a staphylococcal genome is 2.5 megabases, i'm going to talk here about gram-negatives, which are more like 6 million base pairs. and our typical way of sequencing these is still on the illumina miseq where we're getting 30 to 50-fold redundancy. you can also do these on the hiseq, i'm trying to kind of give the examples of a miseq because i think that's probably still more accessible to people as an instrument. so what happens is that you take your bacteria, you lyse them to get the dna. probably most people right now are going to make a nextera library, which is where you insert the transposon with the illumina barcode
right into the dna. probably previously people had sheared the dna and made these libraries, but this is really a one-hour easy dna prep to get these kinds of reads fed straight onto an illumina instrument. so you end up with these reads that are a hundred, or three hundred base pairs and sometimes they are paired-end reads, and what you get is that one read then leads into another and you can assemble these into contigs. so i say it like 'and then you just assemble them into contigs,' well it turns out that this really actually is something that we spent an enormous amount of time trying to quality control. how are you going to really assemble these sequences? because there are choices that the assemblers are making about when to break a contig versus when to bring it together,
underlying most of the assembly programs are still phrap, velvet, a lot of people are still using the guts of the celera assembler. probably right now most people for bacterial genomes are using spades, mira, masurca, and i can tell you we just recently did a reanalysis of this in our lab; they do give you different results. and i don't really know what to tell you on that. so we have ways in which we then benchmark these assemblers to each other. we often in our lab have gold-standard genomes, in our case, i'll get to this, we generate a fully assembled genome, and we're benchmarking to that. but it is difficult because some of these assemblers will give you longer contigs, but maybe some of them have less support and i don't really have an answer of like, this
is the path forward. ncbi is working very hard on this too, and richa agarwala and bill klimke are also looking at this issue, and i think it's going to be something where it depends what kind of data you want and what kind of genome you have. right now we've defaulted to spades, which we error correct with pilon, but i'll try to highlight where those differences might come into play. i just wanted to sort of even explain to you quantitatively how these assemblers even work and why you might have differences, and it's really in these decisions, how are they simplifying. you know they go into these hashing methods and try to build a [unintelligible] graphs and it really is in these kinds of simplification of the linear stretches and in the error removal that different programs make different decisions.
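To make the "hashing" and "simplification of the linear stretches" steps a little more concrete, here is a minimal toy sketch. It assumes the graph in question is a k-mer overlap graph of the de Bruijn kind, which the transcript does not confirm, and it is not how SPAdes, Velvet, or any other named assembler is actually implemented. The function names `kmer_graph` and `unitigs` and the example reads are hypothetical. There is no error removal, no reverse complement handling, and pure cycles are ignored.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Hash every k-mer in the reads into edges between (k-1)-mers."""
    succ = defaultdict(set)  # node -> nodes that can follow it
    pred = defaultdict(set)  # node -> nodes that can precede it
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def unitigs(succ, pred):
    """Collapse unbranched (linear) stretches of the graph into contigs.

    A contig starts at any node that is not linear (a dead end or a branch)
    and is extended until the next non-linear node is reached.
    """
    def is_linear(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(set(succ) | set(pred)):
        if is_linear(node):
            continue
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            while is_linear(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return contigs


# Two overlapping reads from a sequence with no repeats: one contig.
unique_contigs = unitigs(*kmer_graph(["ATGGCGT", "GCGTACC"], k=4))
print(unique_contigs)

# A sequence where GCTTG occurs twice: the graph branches and contigs break.
repeat_contigs = unitigs(*kmer_graph(["AAGCTTGACCGCTTGTT"], k=4))
print(repeat_contigs)
```

In the second example the repeated stretch is longer than the (k-1)-mer overlap can resolve, so the graph cannot tell which flank follows which copy and the sketch emits four short contigs instead of one. Real assemblers differ in how they resolve or break at such branches, which is one place their outputs can diverge.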
and we still so far don't have like a true method. so as i was saying, evaluating these assemblies is something that is still, within genomics, something that people are really working on. i mean meetings that i go to will have the assemblathon, where everyone takes these genomes and people compare 'what did yours say, what did mine say' and oftentimes we come back to that's why it's really important to deposit your reads into ncbi and into the sra because the assemblies that you deposit can have biases in them, and i often, if i want to compare my study with someone else's study, i will just grab their reads. so we get these contigs back, many of them are quite large, we do look at coverage, that's one of the things we're looking at, so like plasmids can be at higher coverage, this is genome
coverage, and some of the plasmids can be at higher coverage, the ribosomal rna operons will be at higher coverage, other plasmids will be at higher coverage. and these are also the kinds of things that you have to know, these ribosomal operons because they are, as i was saying, there are 5 copies of them in a genome, that's what breaks assemblies. you're going to break every time a sequence in a short-read library enters a ribosomal rna operon; those operons are large and there's no way for a short-read technology to know where to come out on the other side. that is where we have turned to pacbio genomes for creating references because the pacbio, which is this single-molecule real-time sequencing technology, can read these very long reads
that are 10 kb, 17 kb, so that's long enough that it actually can read through all of these ribosomal operons, and from a pacbio genome, we can actually generate a fully assembled reference genome that will give us the chromosome and all of the plasmids, and then we can scaffold short reads onto that. those have ended up being very valuable for us, and those are the genomes that when i'm looking for a reference, if there is a pacbio genome, it tends to be more complete and i will use that as a reference. genome aligners, if you want to then find what are the changes, this is often what people are trying to do, looking for single nucleotide variations or insertions/deletions, again there are options that you can use. for genome annotation, ncbi offers an annotation pipeline, pgap, and also the joint
genome institute has an annotation pipeline. for some organisms now, i should have included it i'm sorry, platypus from university of maryland is another one that we're using for genome annotation. so these are typically you submit your genome sequence and they will return to you an annotation quite rapidly. not real time, but rapidly, like days. and the reason you'd want to do that is that a lot of this, within a bacterial species and certainly within a bacterial genus, there's going to be a variable region. so in staph epidermidis, every staphylococcus epidermidis has 80 percent core genes, but there are 20 percent of genes that are in this variable region, which is also called the pan genome. so that would be that as you sequence more bacterial genomes,
you will continue to get more genes that are in that species, and so you'd want to annotate what are the particular genes in this strain that you've sequenced. i can say based on experience that it often is true that the differences that are in this pan genome are the least annotated, they often do come back as open reading frame, function unknown, but you still have to sort of know what is the basic annotation of this genome. so now i wanted to just talk about some examples where you're saying i want to compare two genomes, find snps, find mutations, find deletions/insertions, and i distinguish here between snps and mutations because those are two different things, often we're using single nucleotide variants when we're talking about wanting to build a phylogenetic
tree, and oftentimes those are markers or signatures of the evolutionary tree, but they don't necessarily change an amino acid or cause any change in the function. and so you can identify a single nucleotide variant. if you wanted to say that it is a mutation you would of course need some functional studies to support that. so just as an example this is a study that we did looking at three different multidrug-resistant acinetobacter baumannii and our question was whether these three strains that were all seen at the nih clinical center, whether these evolved from a single origin or whether they had all come into the clinical center with an independent origin. so we're looking here every time there is a snp relative to the reference genome, we're
going to code it and we can use circos here to make these very nice colorful plots. what you're seeing here are snps relative to the reference, and we had these three strains, a, b, and c, and we're looking to see is there a relationship between a, b, and c. you can find that there are these regions, obviously, that are unique to each of our three strains, and they're in these clusters and that's actually why i wanted to say if we had just looked at this without having sort of stitched together the context, we might have called that there were thousands of snps different. but actually what you see is a clustering of the snps in the red regions, the blue regions, this green region here, and again here blue red green, that's a recombination, that can't be counted as 100 different snps. that's
one event that caused that and i'll show that to you here, where what we've had is the o-antigen biosynthetic locus has really come in and recombined, that's right here near the origin. so when we're talking about how to build a phylogenetic tree, we want to look at these snps that are each independent of each other. and when we have hundreds of snps clustered together we have to be able to distinguish that's a recombination, you know it could be a single genetic event rather than hundreds of independent events. i'll just talk for a minute about how we did use snps when we had a clonal outbreak. and here we had a cluster of patients who all had the same carbapenem-resistant klebsiella pneumoniae. when we sequenced all of these isolates, they
were much more similar to each other, and we did find these clusters of snps, these are all across the genome, and these are all independent snps, and this clustering of snps did help us to identify that there was a closer relationship between patients 1, 2, 3 and 5, than from this other cluster, and you can even see that we could narrow it down once we have these snps up here, 12, 13, 18, that there would be a closer relationship because they share these common snps that must have evolved during the spread of the outbreak. so that we can use to reconstruct transmission. i have to say this was an example of how the genetic information really is very clear. i would say many of the questions that
we get though, if things are a hundred snps apart, is it clonal, is it not clonal. it's very hard for us still to make that judgement without having more references. and that's often where i just want to be very clear: genomics is powerful but it doesn't point the direction of the arrow. and there are times all i can say simply is it's this many snps apart. i can't make a judgement of whether that means, you know, i can tilt one way or the other whether this is more or less likely, but we really can't say without this combination of the epidemiologic information and the genomic information, because we certainly have had examples where there are clonal strains circulating in the us, we have received two patients that clearly have no epidemiologic link, and their isolates will be 10 snps apart. maybe they
three hospitals ago were someplace similar, or maybe these are just dominant strains that really have locked their genomes. so i just want to be clear that there is some minimal information that if two isolates are within 10 snps then they're clonal and if they're more they're not; this is really something that the global healthcare system is struggling with, to incorporate genomics into it. so i'm going to talk now about metagenomics, which i have to say is probably the topic in which there is the greatest change coming on now. we've talked about sort of using these markers, and this is basically like you take a sample from someone's stool, someone's skin, you just feed it straight onto a sequencer, and then wow do you have a bioinformatics challenge, so it's
a very complex mixture and it's very complex computationally. so what do we do? i think shotgun metagenomics analysis, you do this when you want to know really who's there and their abundance and you want to know their function and you want to know what genes are present and you want to identify pathways and you want to identify strains and you want to recover genomes and you want to find all the pathogenic organisms, but you get just so much data that you probably do need an analysis plan before you start getting these sequences, because they're overwhelming. so i've been talking about on the left where you're doing these sort of marker gene studies, and now as i said we're just going to get fragments of dna back, so what do you do with them. well the reason that you would
do this is if you were trying to think about differences, and you may even have, as i sort of talked about, where there are these pan genomes, open genomes, different strains, different species can have different genes, so you could get something here, which is a phil hugenholtz example, where you have sort of a similar bacterial community but within this the open part of the genome or the flexible part of the genome might encode different genes, so you end up having what look like two similar bacterial or fungal communities, but they actually have very different genes that they're encoding, and that's when you need to get to metagenomics. this is the other reason why you want to get to metagenomics, which i have to say i find these studies totally cool, so they're trying to find out how does a termite digest wood.
so that's actually a function we might want to know because we might want to use that to find new metabolic enzymes. how does the rumen of a cow degrade this biomass, how do you create energy from biomass? and so these are metagenomic studies where they're then looking to see how are we going to find it, we don't know what we're looking for, they're looking for metabolic enzymes. so there are two ways that you can -- you get this large data set, and the first thing that you probably say is wow this is a lot more than i was expecting, so i'm really trying to break it down here that you can either do read-based methods or you can do assembly-based methods, so you can try to assemble your reads and then use these larger contigs
to identify genomes and clusters and do gene calling, or you can just do read-based mapping. so i'm going to talk about those two strategies. if you're looking for function, you can use kegg, cog, these kinds of tools that leverage functional databases, and this is really as good as it gets and the only issue here is that they tend to be more focused on metabolic core functions and they're not going to return as much of the unannotated dark matter of the microbes. there are some pathway tools that you can use, this is curtis huttenhower's humann, where you feed in your reads and you'll get out pathway coverage, pathway abundance. this is certainly a good place to start, and curtis keeps all these tools available and they're all available through the biobakery
and he does continue to improve them. it's a fairly solid generic look at your data. this would be an example of the kind of output you would get, on the top i'm showing you the great differences that we're seeing and we talked about at the beginning, where the stool has all of these bacteroidetes and these firmicutes, whereas the oral community is going to have more of the streptococcus and so on. when you look at them in terms of their functional output, they all look much more generic, right? and we know that there are differences in what these communities do, but as i was saying these are the functions that are most often annotated, every bacteroidete is going to have to go through cell cycle division, so those
functions are going to be better known. so it kind of gives you this sort of blurred view where everything looks sort of much more similar than maybe if we incorporated what the unannotated functions would tell us, but these are certainly as good as it gets. some people are trying to call genomes out of metagenomics, i think if you have hard-to-culture organisms, this is one of the things that you can do, is you can just shotgun metagenomic sequence and then try to bin them. this actually was pretty cool where they're binning them here both on -- sorry that should say tetranucleotide, it sort of does -- tetranucleotide frequencies, and from a metagenomic sample, they're like pulling apart the reads
into different genomes. if you can culture the organisms, it's much easier to culture them and match the isolates to the metagenomic reads, and i realize i'm not giving you a path forward here, but this is kind of the state of it: if you set up a metagenomic sequencing study, you should prepare to spend at least a year analyzing your data, at least we do. the sort of idea about how to form these linkage groups, because that could make it more powerful and sort of an intermediate between single reads and having full genomes, is to sort of try to bin your reads into clusters, and the way this has been leveraged originally in this paper from carlson is that if you have multiple samples, you think well if these two reads are from the same genome, then i would expect that they would be found at the same frequency
within the different samples, so you form as many contigs as you can but they are often quite small, and then you cluster your contigs based on their frequency in multiple samples and that can get you to reconstruct larger metagenomic structures, and that's kind of where the state of the art is moving. i wanted to still talk about something that is another way that we leverage metagenomic data, and for us that's called strain tracking, where i've been talking about how there is this pan genome. so my favorite, we talked about staph epidermidis, how it's 80 percent core and then the 20 percent are often these more diverse mechanisms, so evan johnson at bu wrote this program called clinical pathoscope, where if reads come
from the core then they're going to map to every genome. you have to have a set of reference strains; you have to already have sequenced genomes for phylogenetically diverse strains. there can be snps that distinguish these, and in the pan genome you're going to have reads that map to some strains but not to others, and then evan's program pathoscope takes both the information from the snps and also from the pan genome and will reassign so that now you would assign all these reads to strain a. that's obviously been done with a lot of simulated data, but also we've done that with our human data where we then are looking at -- from a single individual, since they could have all of these strains, we look to see on their body sites what strains of p. acnes do they have or -- i don't
know why, i have to redo -- sorry, these slides are cut off but they should be full in your handouts, i don't know what i've done wrong to them, but this happens to me every time i present on a pc and i don't know what it is. so we've used this data to sort of look and see, this is one healthy volunteer and they have different strains, different individuals have different strains. well you can see some of it here, so here for the p. acnes you can see that individual c will have those brown strains but individual a only has these blue and green, and the purples are between the two. so you can start to use this to then say what strains are carried by the different individuals and you may from this see strains that are particularly enriched
in a disease state; that's what we're looking for, that's why we're going all the way to the strain level, because it might be that some strains of p. acnes are more associated with the development of acne than the commensal beneficial ones. strain tracking is also able to be done with read assignment to find the core and the accessory genomes, this is if you don't have reference genomes but you have many more reads, so these are two different ways of leveraging the kinds of data you're going to get out of metagenomics. so why are strains important? i think it's really just to find the accessory genes, to determine whether prebiotics or probiotics can have a lasting effect, could you get a new strain in, how stable are these strains, and to underlie what is happening with diseases.
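The read reassignment idea described for strain tracking, where core reads map to every reference strain and pan-genome reads map to only some, can be sketched with a small expectation-maximization loop. This is an illustrative toy under stated assumptions, not PathoScope's actual model or code: the function name `estimate_strain_proportions`, the input format (one set of candidate strains per read), and the example data are all hypothetical, and alignment scores and SNP information are ignored.

```python
def estimate_strain_proportions(read_hits, strains, n_iter=50):
    """Estimate strain proportions from ambiguously mapped reads.

    read_hits: list of sets, each holding the strains one read maps to.
    strains:   list of all reference strain names.
    Each iteration splits every read across its candidate strains in
    proportion to the current estimates, then re-estimates the proportions.
    """
    props = {s: 1.0 / len(strains) for s in strains}
    for _ in range(n_iter):
        counts = {s: 0.0 for s in strains}
        for hits in read_hits:
            total = sum(props[s] for s in hits)
            if total == 0:
                continue  # none of this read's candidates has any support
            for s in hits:
                counts[s] += props[s] / total
        assigned = sum(counts.values())
        if assigned == 0:
            return props
        props = {s: counts[s] / assigned for s in strains}
    return props


# Hypothetical sample: six core reads map to both strains,
# three pan-genome reads map only to strain A.
strains = ["A", "B"]
read_hits = [{"A", "B"}] * 6 + [{"A"}] * 3
props = estimate_strain_proportions(read_hits, strains)
print({s: round(p, 3) for s, p in props.items()})
```

Because the only strain-specific evidence points to strain A, the shared core reads are progressively pulled toward A as well, which mirrors the statement that "you would assign all these reads to strain a".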
really as my last topic here i'm going to talk about -- within the context of metagenomics, some people are now trying to use this where you have a patient that presents with fever of unknown origin, and you want to know in a clinical setting could you identify what is the pathogen. so one way of doing that is clinical pathoscope that i've just discussed from evan johnson, and there is also surpi from charles chiu, which is going through the same kind of analysis where you're taking raw sequences and you're saying what does this match to. i think both of these are very powerful, and i will just caution you that you will get an answer, and it is often related to just how many times
has a sequence been deposited in the database, if it will make a match for you. this has been used very successfully by charles where he was trying to identify a patient who had exactly that; recurrent illness and they couldn't identify what it was, they used this sequencing to identify this leptospiraceae that they could then validate in a clinical test and define the best treatment for this patient. for all of these studies, i've talked all about the microbial dna, i would just caution you that with the genomic data sharing policy, you will also get human dna and you really have to think about, if your studies involve the microbial dna, being very careful what you're doing with the human dna. especially if your goal is to sequence
a microbial community, you will most likely recover human dna, and you shouldn't just deposit it in the database in an open way without filtering it out. and also, because you will recover human dna, i consent all of my patients or all of our subjects for whole genome, whole exome sequencing because i do want them to be aware that even if i am trying to sequence their microbial dna i will recover human dna and i just think that's something that patients should be aware of. in the last two seconds i'll just close because this is actually a smaller part of my talk than it's ever been before, where is the sequencing technology now and where is it going? a lot of the stuff is, right now, just going on the illumina miseq and
the hiseq. i think the pacbio has a role for us right now for looking at long reads to get these good reference genomes, and is there any new technology on the horizon before i give this talk again in two years? the only one that i'm aware of is this [unintelligible] which you can see is a small handheld device. it's a portable small cell, could be used for fast diagnosis, like think ebola, and so that's probably the only new thing on the horizon. and i'll just finish by saying i sort of talk at the beginning and talk at the end about how sequencing is the start, really you're trying to generate a testable hypothesis with the sequencing data, maybe you're trying to identify a novel pathogen, but then you still have to think about how would i test this and what do i do with that.
so with the sequencing, what i really tried to talk about here is coming back to koch's postulates, where you're trying to assess that there is a microbe that causes a disease, but our more nuanced view now is where there's a microbe causing a disease, but it may be causing a disease only in the context of a certain microbial community, so you need to understand what is that microbe and probably down to the sequence level, because different strains may or may not be able to do that function, and it may or may not be able to do that in the context of what is the microbial community. so with that i'll close, we're really trying to understand what is the role of possible pathogens within the context of a microbial community. so thank you all very much.
[end of transcript]