Statistics – A Full University Course on Data Science Basics

Good day, everyone. This is your lecturer, Monica Wahi, and we’re going to start now with section 1.1: What is statistics? So here are our learning objectives for this lecture. At the end of this lecture, the student should be able to state at least one definition of statistics (yes, there’s more than one), give one example of a population parameter, and give one example of a sample statistic. Also, the student should be able to classify a variable as quantitative or qualitative, and as nominal, ordinal, interval, or ratio.

So what we’re going to cover in this lecture is, first, I’m going to go over some definitions of statistics. Like I said, there’s more than one. But they all sort of relate to the basic concept of why you’re doing statistics, and especially not math. So what’s the difference, right? Then we’re going to go over population parameter and sample statistic, and you’ll know what those mean at the end of the lecture. And finally, we’re going to go over classifying levels of measurement.

So let’s start with the definition of statistics. We’re going to go over what it is, and I’m also going to define for you the concept of individuals versus variables. You may know definitions for those words already, but I’m going to give you them in statistics-ese. And then I’m going to give you examples of statistics, individuals, and variables in healthcare.

So here are the definitions. What is statistics? Statistics is the study of how to collect, organize, analyze, and interpret numerical information and data. Well, that sounds pretty esoteric, right? But if you actually think about it, even if you did a simple survey, like you just look on Yelp, right? You look on Yelp, and you see the restaurant you want to go to. Some people say five stars or four stars, but there are a few two stars and one star. Well, do you go?
I mean, there’s a whole bunch of different answers. So how do you do that? You kind of have to analyze it, you kind of have to interpret it. So it’s not that easy.

So statistics is both the science of uncertainty and the technology of extracting information from data. In other words, if you’ve got a bunch of data about, like, a restaurant, you don’t know how it’s going to be if you actually go there, right? You don’t know for sure. So it’s the science of uncertainty. If you look on Yelp and you’re seeing almost everybody giving it four or five stars, maybe it’s going to be good for you, right? But you don’t know; maybe there’s new management. That’s the uncertainty.

So statistics is used to help us make decisions, not just whether to go to the restaurant or not, but important decisions, such as in healthcare and public health. Well, I guess if it’s an expensive restaurant, maybe it’s important. But anyway, in healthcare and public health, you really need these statistics, because they really guide you.

For example, let’s think of the Centers for Disease Control and Prevention in the United States. So what do they do? They spend the whole year studying the different flu viruses that go around, because there’s more than one. They organize, analyze, and interpret numerical information and data about the different influenza viruses that are going around. They extract that information. And you know what decisions they make? They make the decisions about which viruses to include in the next year’s vaccine. Are they always right? Sure enough, they’re not. I mean, have you ever had a year where you’re like, oh my gosh, everybody I know got vaccinated, and they’re still getting sick? Well, you know, give them a break. It’s the science of uncertainty; it just didn’t work out that time. However, this is probably better than just randomly guessing, right? So that’s statistics for you.
Now, I promised you I’d tell you the statistics-ese version of individuals and variables. If you’re outside statistics, you know that individuals are people, right? And you know that a variable is a factor, like a factor that can vary. You know, like, “the only variable is I don’t know what time something’s going to happen.” But when you enter the land of statistics, there are specific meanings to these two words.

Individuals are people or objects included in a study. So if you’re going to do an animal study with some mice in it, those would be the individuals. If you do a randomized clinical trial and you include people who have Alzheimer’s in it, then patients are your individuals. But we do a lot of different things in healthcare. We sometimes study hospitals, like the rate of nosocomial infections, in which case, if you’re looking at a whole bunch of hospitals, those would be the individuals. Sometimes we look at states, rates of infant mortality, for example, in different states; in that case, states would be the individuals.

As you can see at the bottom of the slide, a variable, then, is a characteristic of the individual to be measured or observed. I give some examples on the slide. But like I was saying, if you wanted to study hospitals, for example, I gave you the example of a variable of the rate of nosocomial infections. You could also have other variables about that individual, or hospital, like the rate of in-hospital mortality. And so, as you can see, one of the things we do in statistics is we sit down and we decide: who are going to be the individuals that we’re going to measure, and what variables are we going to measure?

So I just threw up here a few examples of different kinds of individuals that we use a lot in healthcare and public health, and an example of just one variable about those example individuals. But there would theoretically be many variables about them.
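One way to picture the individuals-versus-variables distinction described above is as a small table: each row is an individual and each column is a variable. The sketch below is only an illustration. The hospital names, the numbers, and the units in the comments are all made up, and the `Hospital` class is a hypothetical structure, not something from the lecture slides.

```python
from dataclasses import dataclass, fields


@dataclass
class Hospital:
    """One individual in a hypothetical hospital-level study."""
    name: str                          # label that identifies the individual
    nosocomial_infection_rate: float   # variable (made-up values and units)
    in_hospital_mortality_rate: float  # a second variable about the same individual


# Each element of this list is one individual; the values are invented.
hospitals = [
    Hospital("Hospital A", 2.1, 1.8),
    Hospital("Hospital B", 3.4, 2.2),
    Hospital("Hospital C", 1.7, 1.5),
]

# The individuals are the things being studied.
individuals = [h.name for h in hospitals]

# The variables are the characteristics measured on every individual.
variables = [f.name for f in fields(Hospital) if f.name != "name"]

print("Individuals:", individuals)
print("Variables:", variables)
```

The same layout would work if the individuals were states, patients, or mice; only the row labels and the measured characteristics would change.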
And I just want you to notice, a lot of times the individuals are geographic locations. Other times they might be institutions, like I said, like hospitals, or clinics, or programs. There are other things they can be, but these are just kind of the big ones.

So, as I was describing, and just to review what I went over: statistics is used in healthcare and other disciplines to aid in decision making, like the example I gave of the CDC and their vaccine for influenza. And so it’s really important to understand statistics, because you need to understand these processes in healthcare. Not only what do we do, but how do we figure out what to do? And that’s really important, because we use statistics a lot in healthcare.

Now we’re going to move on to talk about what a population parameter is and what a sample statistic is. We’re going to go over first the definition of a population and the definition of a sample, so you’re sure about what those mean. And we’re going to talk about the data about a population and the data about a sample, and how those are different. And then we’re going to get into what I was just describing, parameters and statistics, and I’ll give you a few examples.

So let’s start with what a population is. Again, this is another case where you have a normal word, but it has a special meaning in statistics. Well, it’s a group of people or objects with a common theme, and when every member of that group is considered, that’s the population. So here’s just one example. The theme could be nurses who work at Massachusetts General Hospital. The population, then, if that was your theme, would be the list from human resources of every nurse currently employed at MGH. Now, it really does depend on how you define that theme. I could have said nurses who belong to the American Nurses Association, right? And then we’d be looking at a different list.
I could say nurses who live in New Orleans, within the city limits of New Orleans, right? Then we’d be looking at a different population. So it really has to do with the details of how you describe the theme around that population. But the point is, once you describe that theme, the population is every single individual in there.

So then, what is a sample? Well, it’s a small portion of that population. It can be a representative sample, but it can also be a biased sample, and we’re going to get into that. So let’s go back to MGH. Let’s say we were going to survey a sample of the population of nurses at MGH, and let’s say we only surveyed nurses in the intensive care unit. That would be a sample, but not a representative sample. It would be a small portion of that population, but not a representative one. Probably more representative would be if we asked at least one nurse from each department. I just want to get in your head that the whole concept of a sample is that it’s just a small portion of the population. And it’s not a portion of some other population; it’s just that one. But the problem is you can get a biased one or a representative one, so you have to think about that.

When you think about it, if you’ve got a whole population, then you would get variables about each individual in that population, and those variables would be your data. But if you chose a sample, just a portion, that would be a lot less work, right? You’d still have to get variables about those individuals, but there are way fewer individuals, so it would probably be easier.

So in population data, data from every single individual in the population are available, and that’s called a census. I knew a person who decided to do a survey of every single professor at a college. She didn’t take just some professors from each department; she sent the survey to every single professor. So she did not use a sample, she used a census.
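To make the census, biased sample, and more representative sample ideas above concrete, here is a minimal sketch. It assumes a made-up population of 40 nurses spread evenly over four hypothetical departments. The department names, the population size, and the helper `departments_covered` are illustrative assumptions, not anything from the lecture.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same picks every run

departments = ["ICU", "Emergency", "Pediatrics", "Oncology"]

# Hypothetical population: 40 nurses, 10 in each department.
population = [
    {"nurse_id": i, "department": departments[i % len(departments)]}
    for i in range(40)
]

# A census uses every individual in the population.
census = list(population)

# A biased sample: only nurses from the intensive care unit.
icu_only = [n for n in population if n["department"] == "ICU"]

# Probably more representative: at least one nurse from each department.
one_per_department = [
    random.choice([n for n in population if n["department"] == d])
    for d in departments
]


def departments_covered(group):
    """Return the sorted list of departments that appear in a group of nurses."""
    return sorted({n["department"] for n in group})


print("Census size:", len(census))
print("ICU-only sample covers:", departments_covered(icu_only))
print("One-per-department sample covers:", departments_covered(one_per_department))
```

Both samples are small portions of the same population; the difference is that the ICU-only sample leaves three departments with no voice at all.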
But in sample data, the data are only available from some of the individuals in the population. So if we go back to the researcher I described, if she had only taken some of the email list of the professors at that college, then she would have been surveying a sample. And that’s actually very commonly done in research studies, especially of patients. Why would you need to go get every, for example, kidney dialysis patient and study every single one? You only need a sample. And why is that? Because we have statistics.

So I’m going to give you a few examples of real population data in healthcare. You’re probably familiar with Medicare. Medicare is the public insurance program in the United States for elders. Even my grandma was on Medicare when she was alive, and she was not a US citizen; she was from India. So we really do a good job of covering our elders in the US with Medicare. In fact, I even read a statistic that said almost 100% of people aged 65 and over are in Medicare. If you download data from Medicare, they make it confidential; they just replace all the personal identifiers. But there’s this thing called the Medicare claims data set, which covers every single transaction that happens. Like, if you’re in Medicare and you go get some treatment, that’s in there. So it has all the insurance claims filed by the Medicare population. Because it has everybody and everything, that is population data.

Also, in the United States, every 10 years, the government hires a bunch of people to go out and survey a bunch of people, and they also send out a bunch of surveys. The idea is to try to get every single person in the United States to fill out that survey. And that’s called the United States Census.

So now I’m going to give you sort of a mirror image with sample data. Okay, remember how I was just talking about Medicare? People who are enrolled in Medicare are called Medicare beneficiaries, and Medicare cares what they think.
So they do a survey of a sample of individuals on Medicare. And they do this kind of often; I think they do it once a year, I’m not sure. It’s a phone survey. They only do a sample because they’re going to use statistics to try to extrapolate that knowledge back to the population of Medicare beneficiaries.

Also, in case you noticed, the United States Census only takes place every 10 years. Do you think changes happen in between? Yep, lots of changes. Just think about Hurricane Katrina. That was very sad. It changed the population distribution in Louisiana very, very dramatically, and also in other states around there. So how did they keep up? Well, they used the American Community Survey. The government does this, the United States Census Bureau, and that, again, is done by phone. It’s conducted yearly, and it’s a sample. So the US doesn’t know exactly how many people would be in Louisiana or anywhere else, but they can use statistics to extrapolate that from the sample in the American Community Survey.

I want to just do a shout-out to statistical notation. From now on, when we see a capital N, like, let’s say you saw capital N equals 25, then you can assume that 25 means a population. That’s just kind of a secret code we use in statistics. However, if you saw a lowercase n, n equals 25 and it was lowercase, then you could assume that this was a sample of the population. And again, it’s just kind of like a secret code, and you have to pay attention. When I’m talking and I say “n,” you can’t see uppercase and lowercase, so you don’t know if I’m talking about a population or a sample.

Now I’m going to get into the concept of parameter versus statistic. I want you to notice that the word parameter starts with P, like population. So a parameter is a measure that describes the entire population. For instance, anything that would come out of that whole Medicare claims data set, or that whole United States Census, would be a parameter.
On the other hand, statistic starts with S, and a statistic is a measure that describes only a sample of a population. Here, again, we have a situation where the word statistic is used daily on the news. In fact, sometimes I hear on the news something like, “Oh, look at the rate of HIV in Africa, it’s going up. That’s a terrible statistic.” I agree, it’s terrible. But they mean parameter, because they’re talking about all of Africa, every single person in Africa. If the rate of HIV is going up in Africa, they mean a parameter; they don’t mean a statistic.

So here’s an example of a parameter and a statistic that are based on the same population. The mean age of every American on Medicare is a parameter; that’s every single person. However, remember the Medicare beneficiary survey? That’s just a sample. So if we took the mean age of those people, we would just have a statistic. And again, you have to pay attention, because if you listen to the news, you’ll hear them use the word statistic to mean both parameter and statistic. But when you’re practicing in the field of statistics, it’s very important to point out whether the number you’re talking about comes from a population or comes from a sample. So you should really use the terms: this is a parameter if it’s from a population, or this is a statistic if it’s from a sample. And so again, don’t get confused.
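The mean-age example above can be sketched in a few lines. This is only an illustration under stated assumptions: the ages are randomly generated stand-ins, not Medicare data, and the sizes N = 1000 and n = 50 are arbitrary choices. The names `population_ages`, `sample_ages`, `parameter`, and `statistic` are made up for the sketch.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of N = 1000 beneficiaries (invented values).
population_ages = [random.randint(65, 100) for _ in range(1000)]
N = len(population_ages)  # capital N: size of the population

# A sample of n = 50 individuals drawn from that same population.
sample_ages = random.sample(population_ages, 50)
n = len(sample_ages)      # lowercase n: size of the sample

# The same measure (mean age) gets a different name depending on its source.
parameter = mean(population_ages)  # describes the entire population
statistic = mean(sample_ages)      # describes only the sample

difference = statistic - parameter

print(f"N = {N}, parameter (population mean age) = {parameter:.2f}")
print(f"n = {n}, statistic (sample mean age) = {statistic:.2f}")
print(f"statistic minus parameter = {difference:.2f}")
```

Changing the seed gives a different sample and therefore a different statistic, while the parameter stays fixed as long as the population does.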
If you’re listening to someone talk in a lecture or in a video, you might want to look for clues that a number is a population parameter or a sample statistic. One clue is if you hear that the data set they used encompasses an entire population. Usually that’s the kind of stuff done by governments. Like, remember when I was talking about the rate of HIV in Africa? That would probably be done by governments, or the United Nations, or the World Health Organization. So when you’re talking about numbers that might have come out of an entire population, usually collected by the government, that’s probably a population parameter.

Clues that someone’s talking about a sample statistic are if you hear them talking about a study that recruited volunteers. Well, if it’s volunteers, they didn’t get everybody in the population, so it’s going to be a sample. Also, surveys, for instance, surveys about who people are going to vote for, public opinion surveys. They’re never going to ask every single person in the state, “Who are you going to vote for?” They’ll just ask a sample. So if you hear about a survey, they might even tell you n equals maybe a few thousand people, because that’s all they surveyed. And that’s a clue that we’re talking about a sample statistic rather than a population parameter.

Now I’m going to talk about the difference between descriptive statistics and inferential statistics. But first I’m going to remind you what the word infer means. To infer means to kind of get a hint from something indirectly. It’s kind of the complement to imply. So if my friend implied that I should not call after 9 p.m., and I figured that out, I would say I inferred that I should not call my friend after 9 p.m. Okay, so inferential is what I’m going to talk about next. But first I’m going to talk about descriptive. Descriptive statistics is pretty easy, because you can do it to samples and you can do it to populations. Well, to variables from samples and populations, right?
And so, descriptive statistics involve methods of organizing, picturing, and summarizing information from samples and populations. It’s basically just making pictures of it, right? Like, look at that bar chart. That’s just a simple picture, and that can be made with just about any data. You can get data from surveying people at work, or from surveying your friends about what they’re going to bring to the potluck. Any of that can be used. You can go download the census data, and you can make descriptive statistics out of that.

But there’s something very special about inferential statistics. That involves methods of using information from a sample to draw conclusions regarding the population. Therefore, inferential statistics can only be done on a sample. And that’s why it’s called inferential, right? Because the sample is going to give a hint about what the population is. It’s not going to say it directly, which is annoying, right? But that’s that uncertainty thing I was telling you about. So the sample is going to imply something, and we’re going to infer something from the sample about the population. That’s what inferential statistics is: you take a sample, and you infer something about the population. Whereas descriptive statistics is more loosey-goosey. You can just do that to samples and populations, kind of like making pictures out of them.

So in statistics, it’s really important to properly identify measures as either population parameters or sample statistics. Because, as you can see, you can only do inferential statistics on samples. And so you have to really know what you’re doing when you’re doing statistics, and what you’re talking about, because different types of data are used for parameters versus statistics.

Alrighty, now we’re going to get into classifying variables into different levels of measurement. So remember our variables, right? We have individuals, and then we have variables about them.
And those variables actually can only fall into two groups: quantitative versus qualitative. And then, depending on which group they fall into, you can further classify them as interval versus ratio, or nominal versus ordinal. And I’m going to give you some examples of how to classify a few healthcare variables.

Alrighty, so I like to draw this picture. It’s a four-level data classification, and I’ll draw it slowly here for you. We start with human research data; that’s what I like to start with. Alright, so we’re going to split that into two. Remember, I said that we’re going to start by talking about quantitative. Another word that’s often used for that is continuous, but we’re going to use the word quantitative. So what does that mean? That is a numerical measurement of something. So, like, this gives an example of temperature. Something with a number in it. I always think, if I can make a mean out of it, it must be a quantitative variable, right?

And so here are some examples of quantitative variables. Time of admit, right? Imagine that you work a shift in the ER from maybe 8 p.m. to 12, like midnight. So you have these four hours. And you could say what the average time of admit would be for those who got admitted to the hospital. You know, somebody got admitted at, like, eight o’clock, and then somebody at 8:15, and whatever. You could put that together and say what the average time was. Also, if you were doing a study and you were studying patients with a particular condition, like Alzheimer’s disease, you could ask them their year of diagnosis, and then you could make an average of that. And so you know that that is quantitative. Systolic blood pressure is also numerical, and so is platelet count. These are variables we run into all the time in healthcare. So we’ve said that this is quantitative.

Now we’ll get back to our picture. So that’s one side. What if it’s not quantitative? What else could it be?
Well, the only other category it could be is categorical, or qualitative. I use the term qualitative, but some people use the term categorical. That’s kind of what it is: it’s a quality of something, or a characteristic of something, like sex or race.

So here are some qualitative variables in healthcare. You can have type of health insurance, like whether you’re on Medicare or Medicaid or different types of private insurance. Those are all just categorical, right? You can’t make a mean out of that. Also country of origin. If you’re in a group of students and there are international students in there, well, what countries are they from? You can’t make a mean out of that.

You also have situations where you do have numbers involved, like the stage of cancer, right? That’s depressing. Stage one cancer, stage two cancer, stage three. Well, you never can make a mean out of the stage of cancer. You wouldn’t say, well, the mean stage is 1.4, or something like that. It’s just a category. And of course, stage four is a lot worse than stage one. They’re not just equal categories, but they’re categories. Same with trauma center level, like a level four trauma center. You wouldn’t make a mean out of the number after the term trauma center, like what level it is. But you could say, well, in the state, maybe so many percent of our trauma centers are level four trauma centers. So it’s really just a categorical variable, even though there’s a number involved.

Alright, so let’s get back to our diagram. We figured out how to take any variable and first split it into one of two categories: it’s either quantitative, if it’s numerical, or qualitative, if it’s a characteristic. Now we’re going to concentrate on quantitative, because we’re going to separate those variables into two categories. The first one we’re going to look at is interval, and the second one we’re going to look at is ratio.
So if you happen to classify a variable as quantitative, then it could be interval or ratio, but not if it’s qualitative. Okay, if it’s qualitative, it doesn’t get to do that. So let’s look at interval versus ratio.

On the left side of the slide, we have interval, which is where it’s quantitative and the differences between data values are meaningful. And ratio has the same thing: the differences between the data values are meaningful. What do I mean by that? Well, remember how I was talking before about how level one trauma center and level two trauma center are really categories, and not quantitative variables, because the difference between them is actually not equal. Especially if you think of job classifications that might go 1, 2, 3, 4, like Nurse 1, Nurse 2, Nurse 3, Nurse 4. Or I worked at a job where we had Office Specialist 1, Office Specialist 2, Office Specialist 3. And you know what? Going from Office Specialist 2 to Office Specialist 3 was really hard; you really had to do a lot there. But going from 1 to 2 wasn’t that hard. So that was a categorical variable, right? Because the differences between the values were meaningless. Like, the difference between OS 1 and OS 2 versus OS 2 and OS 3: they weren’t equal. Whereas when you’re dealing with a quantitative variable, regardless of whether it’s interval or ratio, you’re talking, like, years, or systolic blood pressure. One year for you is one year for me. So that’s fine, right?

But here’s where the difference comes in between interval and ratio. All quantitative variables have meaningful differences between their data values, but the hairsplitting thing here is that in interval, there is no true zero, and in ratio, there is a true zero. And this is how I try to think about it. An interval means kind of like a space between two things. Like, if you think of the word intermission, that’s kind of like an interval.
It’s like an interval of time during a show where you get to get up and go to the bathroom and get some coffee. So that’s interval. And if you have something that’s a space in between, that’s not going to have a zero. It doesn’t really start anywhere or end anywhere; it’s in between. Whereas ratio, I don’t know if you remember from, like, high school, but you can’t have a zero on the bottom of a ratio or a fraction. So that’s what I use as a mnemonic: ratio means that you can have a true zero.

But how does this work out literally? Well, I’ll show you. Let’s go back to those examples I showed you of quantitative variables, because those are the only ones where we have to make this decision about whether they are interval or ratio. So these are the examples. Now I’m going to remind you that ratio has a true zero. Remember that little mnemonic I said, like, don’t divide by zero. So ratio variables have a true zero. Well, let’s think about it. It’s not very pleasant to have a zero systolic blood pressure, because you’d be dead. Same with the platelet count. But it is possible, right? But when we go on to interval, we can’t have, like, zero time. Time of admit, or year of diagnosis: there’s no, like, year zero.

So, as you probably just guessed, ratio is where it’s at in healthcare. There aren’t a whole lot of times when we have interval data, but we do, you know, anytime you have a time. So you’ve got to keep in mind that if you want to split your quantitative variables into either interval or ratio, the difference is between the true zero and the no true zero.

Okay, here’s our handy-dandy diagram. We’ve just gone through the tree, classifying quantitative data into interval versus ratio. Now let’s pay attention to the other side of the tree, qualitative. So how do we split those? Well, we can split those into nominal versus ordinal. All right.
So nominal applies to categories, labels, or names that cannot be ordered from smallest to largest. I kind of think of when they have an advertisement and they say, “For a nominal fee, you can do this.” It means it’s small; there’s almost no difference. And so that’s why I say there’s no difference. It’s not smallest to largest; it means they must be equal. That’s how I remember it in my mind.

But then ordinal applies to data that can be arranged in order in categories. But remember that thing I was saying about quantitative? It’s not quantitative, right? Because the differences between the data values either cannot be determined or are meaningless, like I was talking about with cancer especially. If you go from stage three to stage four, that’s materially different than going from stage one to stage two. So you really can’t determine those things. So this is where we get into ordinal: it’s arranged in categories that can be ordered from smallest to largest.

So remember our old friends that I threw up there before, these examples of qualitative variables in healthcare? Well, let’s reflect on this. Nominal cannot be ordered, right? So that would be more like type of health insurance and country of origin, because they could all be equal. Whereas ordinal is going to have a natural order, even though the differences between the levels are meaningless, which is what makes it so different from a quantitative variable, and which is why it stays on the qualitative side of the tree. It just gets labeled ordinal. So what you want to do is, if you think you have a qualitative variable on your hands, look for a natural order. If there is one, it’s ordinal. If not, it’s nominal.

So all data can be classified as quantitative or qualitative. If you have a variable, that’s the first split you can make, the difference between quantitative and qualitative. But once you do that, you can further classify it as interval, ratio, nominal, or ordinal.
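The classification tree walked through above can be written down as a small decision function. This is a sketch, not a standard routine: the function `classify` and its three yes/no parameters are hypothetical names, and the answers supplied for each example variable simply follow what the lecture says about them (a mean makes sense or not, there is a true zero or not, there is a natural order or not).

```python
from enum import Enum
from typing import Tuple


class Level(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


def classify(mean_is_meaningful: bool,
             has_true_zero: bool = False,
             has_natural_order: bool = False) -> Tuple[str, Level]:
    """Walk the two-step tree: quantitative vs. qualitative, then the level."""
    if mean_is_meaningful:
        # Quantitative side: the only remaining question is the true zero.
        return ("quantitative", Level.RATIO if has_true_zero else Level.INTERVAL)
    # Qualitative side: the only remaining question is the natural order.
    return ("qualitative", Level.ORDINAL if has_natural_order else Level.NOMINAL)


# Example variables from the lecture, with the answers the lecture gives.
examples = {
    "systolic blood pressure": classify(True, has_true_zero=True),
    "platelet count": classify(True, has_true_zero=True),
    "year of diagnosis": classify(True, has_true_zero=False),
    "stage of cancer": classify(False, has_natural_order=True),
    "trauma center level": classify(False, has_natural_order=True),
    "type of health insurance": classify(False, has_natural_order=False),
    "country of origin": classify(False, has_natural_order=False),
}

for variable, (group, level) in examples.items():
    print(f"{variable}: {group}, {level.value}")
```

Note that the first question is deliberately "does a mean make sense?" rather than "does it contain a number?", since stage of cancer and trauma center level contain numbers but stay on the qualitative side.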
And it’s really important to know how to classify data in healthcare, as you’ll find out later, because depending on how you classify it, you might be able to do different things with it in statistics.

Alrighty, so what we went over was the definition of statistics, and we talked a little about why you use it and how you use it, especially in healthcare. We went over what it means to talk about a population parameter and a sample statistic, and we went over some examples of them. And then we talked about classifying variables into the different levels of measurement, and even talked about a few examples there. So I hope you enjoyed my lecture.

Greetings, this is Monica Wahi, lecturer at Labouré College, bringing you your lecture on section 1.2, on the topic of sampling. So here are your learning objectives for this particular lecture. At the end of this lecture, the student should be able to define sampling frame and sampling error. The student should also be able to give one example of how to do simple random sampling and one example of how to do systematic sampling. The student should be able to explain one reason to choose stratified sampling over other approaches, state two differences between cluster sampling and convenience sampling, and give an example of a national survey that uses multistage sampling.

So let’s jump right into it. We’re going to go over, in this lecture, sampling definitions, and then those different types of sampling I mentioned in the learning objectives: simple random sampling, stratified sampling, systematic sampling, and then convenience and multistage sampling.

So let’s start with some sampling definitions. What is a sample? Okay, so we’re going to revisit that concept from the previous lecture. We’re also going to talk about sampling frames, and what errors mean, and errors of sampling frames. And then we’re going to go right back over that and make sure you understand before we go on and talk about the different types of sampling.
So we take a sample of a population because we want to do inferential statistics. Remember that? We want to infer from the sample to the population. And it’s just not necessary to measure the whole population; it would be impractical, and it would cost a lot. And actually, what you’ll find is, if you ever do an experiment where you actually do measure the whole population, you’ll find that if you get a pretty good proportion of the population and just take that, that’s all you really needed to talk to. So ultimately, we save resources, especially in healthcare, when we do a good job of sampling and use that to infer to the population, rather than having to take a census of the whole population all the time.

So that brings us to the concept of sampling frame. The sampling frame is the list of individuals from which a sample is actually selected. The list may be a physical, concrete list. Like, you could have a list of students enrolled at a nursing college, or, in my other lecture, I gave an example of a list of nurses who work at Massachusetts General Hospital. That could be your list; you’d go to human resources and get that. Or it could be a theoretical list. It could be, like, the list of patients who present to the emergency department today. Obviously, when you go into work at the beginning of the shift, you’re not going to know who’s on that list yet. But it could be a theoretical list. Whatever that list is, that is your sampling frame. Those are the people who actually could be selected for your study.

So the sampling frame is the part of the population from which you want to draw the sample. And you want to work it such that everybody from your sampling frame has a chance of being selected for your sample. In other words, you don’t want to leave anyone who should be in your sampling frame out in the cold. That leads us to the concept of undercoverage. So what is it? It’s omitting population members from the sampling frame.
They’re supposed to be on the list, but they’re not there. So how can this happen? Well, let’s say you did what I was suggesting in the previous slide, and you got a list of nursing students from a college. Let’s say somebody signed up that day, or somebody was just admitted that day. Maybe they didn’t make it into the database in time and you’re missing them. Or even with that HR list I talked about at MGH, well, I know how nurses are. Sometimes they’ll temp in different places, and maybe they’re not on the payroll, maybe they’re through a temp agency. And so then we would miss those nurses from the sampling frame. And then, you know, people who present at the emergency department at night might be different from those in the day. So if you’re really trying to sample from people who present to the emergency department, you can’t just look at some small period of time. You’d have to look at the whole 24-hour cycle. So if you omit population members from your sampling frame, they don’t even get a chance to be in the sample. And that’s called undercoverage.
Now, I’m going to shift around; we’re jumping around with a few different definitions. We’re going to talk about errors. This is something that took me a while to get used to in statistics: there are actually two kinds of errors in statistics. The first kind I call, and this is my own terminology, a fact-of-life error. It’s just an error that happens when you do statistics. It’s not bad or good. It’s just what happens. And I’m going to describe one of those. It’s called sampling error. Sampling error simply says the population mean will be different from your sample mean, and the population percentage will be different from your sample percentage. So what does that mean? That means I can cut corners, like I said, and just take a sample to infer to the population.
If I actually do one of those experiments I was telling you about, where I have the population data and I just take a sample and compare the means, they will be different. Okay, I mean, there might be this huge coincidence where they’re the same, but they’re typically different. Same if you do percentages. And we just know this is going to happen. In statistics we account for it; we have ways of dealing with it. But we know that there’s always going to be sampling error. Whenever you take a sample from a population to try to make a mean or percentage in the sample, it’s just not going to be exactly what’s in the population. That’s fine.
But then there are other errors in statistics which are actually bad. It means you made a mistake, literally a mistake. And so as you go through learning about statistics, it’s almost like you have to sit down and ask somebody, is this one of those fact-of-life errors, or is this one of those errors you want to avoid? Well, we just talked about sampling error. That’s just a fact-of-life error. But an error you want to avoid is non-sampling error. That’s basically using a bad list. I had an example in my life where I wanted to study a whole bunch of providers. My friend gave me this list of providers and said, this is the entire list of all the providers in this particular professional society. But when I sent the email to that list, I found there were not only duplicates on the list, but a lot of people emailed me back and said, Why are you sending this to me? I’m not a provider. I’m not part of this professional society. And also, some people who were in that professional society, who had heard about the survey, emailed me and said, Why didn’t I get the survey? So this was a bad list. Some people had been left out of the sampling frame. Some people who were in the society somehow weren’t on my email list. And that’s a problem, right? So you have to pay careful attention.
This was actually a mistake I made. You have to pay careful attention that everyone in the population who is supposed to be represented in your sampling frame is actually there. So I really should have done a better job of calling the professional society and making sure that this list was a good list. So sampling error is caused by the fact that, regardless of what you do, your sample will not perfectly represent the population. Whereas non-sampling error, yeah, I was sloppy. It’s poor sample design, sloppy data collection, inaccurate measurement instruments. You can have bias in data collection and other problems introduced by the researcher. So it’s your fault if there’s non-sampling error, but sampling error is just a fact of life.
A little whiplash here: we’re now going to move on to the concept of simulations. A simulation is defined technically as a numerical facsimile, or representation, of a real-world phenomenon. So it’s like working through a pretend situation to see how it would come out in the case that it was real. And when you study statistics, you end up doing a lot of simulations. Remember how I’ve been talking about an experiment you could do? If you somehow did a census and had a whole bunch of data on a population, you could do an experiment where you just took a sample from that population and looked at its mean to see the sampling error. That’s an example of a simulation. So to conclude this little section, it’s really important to do your best to avoid non-sampling error. And this is achieved by making sure you do not have undercoverage when sampling from your sampling frame. So this puts together some of our vocabulary. But just remember, sampling error is a fact of life. Okay, now we’re going to specifically talk about different types of sampling.
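The simulation described above, where you hold the whole population's data, draw one sample, and compare the two means, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is made-up numbers from a normal distribution, and the function name `sampling_error_demo` and its parameters are hypothetical, not something from the lecture.

```python
import random
import statistics


def sampling_error_demo(population_size=10_000, sample_size=100, seed=1):
    """Draw one sample from a made-up population and compare the two means."""
    rng = random.Random(seed)
    # Hypothetical "census": 10,000 invented measurements (mean 50, sd 15).
    population = [rng.gauss(50, 15) for _ in range(population_size)]
    # Take a sample without replacement from the full population list.
    sample = rng.sample(population, sample_size)
    population_mean = statistics.mean(population)
    sample_mean = statistics.mean(sample)
    # Sampling error: how far the sample mean lands from the population mean.
    return population_mean, sample_mean, sample_mean - population_mean


pop_mean, samp_mean, error = sampling_error_demo()
print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {samp_mean:.2f}")
print(f"sampling error:  {error:+.2f}")
```

Changing `seed` gives a different sample and a different error each time, which is the "fact of life" point: the two means are close, but you should not expect them to match exactly.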
And we’re going to start with simple random sampling. Okay, so first, we’re going to start by explaining what is meant by simple random sampling. Then we’re going to talk about two different methods of doing simple random sampling. They work the same way and they achieve the same thing. It’s just that depending on how you’re doing your research, one might be more convenient for you than the other. Finally, we will go over the limits of simple random sampling, because all these sampling methods seem perfect, but then you’ve got to take a look at their limitations.
So let’s first define simple random sampling. Here’s a definition. A simple random sample of n measurements from a population is a subset of the population selected in such a manner that every sample of size n from the population has an equal chance of being selected. Well, it’s kind of complicated, but what it means is that if you use the proper approach for simple random sampling, whatever sample you get, you could just as easily have gotten another batch, another group of people from that population. In other words, let’s say you have a list of the population of students in a class. So I’m going to define a class as a population. And you want to take a sample of five students from this bigger class. If you take a simple random sample, it means that all the different groups of five students you could pick from the list have an equal chance of being the sample group you actually pick.
Now, you can just imagine that if you race into the class right at the beginning and you take your sample of five and not everybody’s in the class, what does that sound like? Right, a sampling frame problem, maybe an undercoverage problem, maybe biases creeping in there. So you’ve just got to be careful, if you’re going to do simple random sampling, that you start with a list with everybody in your sampling frame, because every single sample that you could possibly take should have an equal chance of ending up being your sample. And I’ll explain it by explaining the two different methods that can be used to obtain that sample.
So one of the best things that you can do is just start with a really good list of all the people in your population. I used to work at the Army. So let’s say I was going to study all the people who are active duty in the US Army. I would like to get a list of all of those people from an accurate place at the Army. And I would like them to have a unique ID. And that’s true in the Army: everybody in the Army has a unique numerical ID. If you were looking at students, you’d maybe take a student ID. So then you take the IDs from everybody on the list, and you print them out, and you cut them up, and you put them in a hat, or a bag where you can’t see in it. And you mix them all up where you can’t see, and you draw five of them out. Like in the picture, what they did was mix up all those papers, and now they’re not looking, and they’re drawing a few out.
Okay, so what did you just do? You made sure, first of all, that everybody in the population had an ID number. And when you printed it out and cut it up, you didn’t lose any of them. If you drop them on the floor or something, that’s not a simple random sample. You’ve got to make sure you keep all of them, that you put them all in the hat, and that you didn’t look when you drew five or whatever, because then any five of those slips of paper could have been drawn, and therefore you’re meeting the definition of simple random sampling. Okay, that method will work, right?
Another method that might work better, if you can’t do this ID thing where you cut up paper, is where you simply make your own list of unique random numbers. You just make your own list, and then you assign those to the population. A great example is if you’re teaching kids and you want to put them in a random order, maybe you’re going to do a game or something. Let’s say you have 10 kids. You number slips one to 10, you put them in the hat, and then you pull out the first number. Let’s say it’s five. You give it to the first kid. And then you just keep pulling out numbers and giving them to the kids, and then tell them to stand in order. So you generate a list of random numbers as long as the list of the population. I said, what if you have 10 kids? Well, if you have 500 names, then you get 500 numbers, and they don’t have to be one through 500. They just have to be unique. I like smaller numbers, so I’d say keep them small, but you can do what you want. And then, in any case, you randomly assign these numbers to this population. You can use the hat; I’m big on hats.
And then you ask them to stand in order, or somehow you figure it out. It’s kind of like a raffle. You call out, who’s got number one? And whoever says yes, you’re like, you’re lucky, you get to be in my study. So you can take the first five numbers in the order. And that’ll achieve the same thing as the last method. You’ll get a simple random sample. It’s just two different ways of doing it. So ultimately, being in a simple random sample means that the sample has an equal chance of being selected out of the hat, that this group of people, or group of whatever, has an equal chance of being selected. And you’ll see this picture on the left here is bingo. Some of you may play bingo. They pull balls out of there and they call off the names of the balls. Well, each ball has a unique letter and number on it. And that’s how they make them random. They take a simple random sample of these bingo balls each time they do a bingo game.
So I described to you the first method of doing that, using an old-fashioned hat. The second method, where you generate your own numbers, make sure they’re unique, assign them to things, and put them in order, well, that’s my electronic hat. That’s how I handle it. If, for example, somebody sends me an Excel sheet with a list of hospitals on it, I’ll just assign each hospital a random number and sort them in order. And I’ll sample the top few hospitals. That’s how I get a simple random sample of hospitals. That way, I’m not biased, picking out my favorite hospitals where all my friends work. If I do it that way, by the first method or the second method, all members of the population have an equal probability of being selected in the sample. And more importantly, all possible samples, all possible groups, have an equal chance of being selected. Of course, I only did it once, so I only got one of them.
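Both methods described above, the old-fashioned hat and the "electronic hat", can be sketched in Python. This is a minimal illustration, not the lecturer's actual spreadsheet workflow: the hospital names are invented, and the function names `draw_from_hat` and `random_number_and_sort` are hypothetical.

```python
import random


def draw_from_hat(ids, n, rng):
    """Method 1: every ID goes in the hat, then n are drawn without looking."""
    return rng.sample(ids, n)


def random_number_and_sort(names, n, rng):
    """Method 2: assign each item a unique random number, sort, take the top n."""
    # The numbers only need to be unique; they do not have to be 1..len(names).
    numbers = rng.sample(range(1, 10 * len(names) + 1), len(names))
    ordered = sorted(zip(numbers, names))
    return [name for _, name in ordered[:n]]


rng = random.Random(2024)
hospitals = [f"Hospital {i}" for i in range(1, 21)]  # made-up sampling frame

hat_sample = draw_from_hat(hospitals, 5, rng)
sorted_sample = random_number_and_sort(hospitals, 5, rng)
print("hat method:       ", hat_sample)
print("electronic method:", sorted_sample)
```

Either way, the list has to be complete before you start: a hospital that never made it onto `hospitals` has no chance of being drawn, which is the undercoverage problem from earlier.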
But the other ones that weren’t selected had an equal chance of being selected.
All right, you probably saw the limits: it’s this whole list thing. Even if I’m sampling hospitals, I still need a list of hospitals to sample from. And you may not know who’s going to show up in the emergency department that day. If you do, well, you’re psychic, because most people are not. So how would you sample from them using simple random sampling? So simple random sampling is okay when you’ve got a list, like hospitals, but it’s not so good when you don’t know who’s going to show up that day. And even if you do simple random sampling, you need a good list. I made a mistake once, where I did a survey with a bunch of professionals using a professional society list. And when I sent out the survey, I learned that there were people on the list who were no longer part of the society, that it was an old list. And more importantly, there were people who had joined the society who had not made it onto that list. So I was getting undercoverage. Like, if you were doing a study with students, what if they just left off the part-time students? Then you’d be missing them. So this is a great example of non-sampling error. And so if you’re going to do simple random sampling, you do need a list, and you really want to research it and make sure it’s the best list possible. So I just went over the characteristics of simple random sampling, and two different methods you can use to sample from a list. And I also mentioned the limits of it.
Now we’ll talk about a different kind of sampling, stratified sampling. So we’re going to go over what it is. And then, just like simple random sampling had all these steps to it, there are different steps in stratified sampling. And I’ll give you some examples.
And then of course, just like simple random sampling, stratified sampling has limitations, and I’ll talk about those. So I first wanted to remind you what the word stratified means, or what strata are. The single word is stratum, and more than one is strata. Now you see that rock on the slide, and you see those big, horizontal lines across it. That’s a stratum, and those are strata, right? Those are strata of rock. If you study geology, the geologists will explain that where those breaks are, it means something happened, often in the weather or the environment or whatever. But the reason why I put this picture up there is I want you to imagine those layers. Because what we do in stratified sampling is first, we divide our list, and of course you need a list, into layers.
Okay, so remember how I was just talking about simple random sampling, like, what if I sample from hospitals? Well, I could take this hospital list and divide it into layers by, for example, how close they are to the city. I could say urban, suburban, and rural. I could first put them into those strata. And if I was doing that, I’d be doing stratified sampling. Same with students. I could put them into first-year nursing students, second-year students, and I’d have them divided into strata first. So why would you do that? Why not just do simple random sampling? Well, if you think about it, let’s say that you’ve got a class like statistics, and there are not that many first-year students in it. So let’s say a very small proportion is that way. If you do simple random sampling, you might just by luck miss all of them. And so, if you’re really concerned about what a minority thinks, then you can make sure to get representation from that stratum by doing stratified sampling, because the first thing you do is put that list into groups.
And then you take a simple random sample from each of the strata. So here are the steps. Step one, divide the entire population, the whole list you have, into distinct subgroups called strata. And remember, each individual has to fit into one of those categories. So if you have somebody who’s halfway between first year and second year, or you’ve got a hospital that’s kind of on the border, you’ve got to choose. You’ve got to put it in one of those groups. Step two, well, it’s not really a step, but you’ve got to think about what the strata are based on. It’s got to be based on one specific characteristic, such as age, income, or education level. A great example is you could take people of all different incomes. That’s a quantitative variable, but you can put them in strata by less than a certain amount, and then this amount to that amount, and so on. You can make four or five strata. And then you just want to make sure that all members of each stratum share the same characteristic. And then you can do the last step, step four, which is draw a simple random sample from each stratum.
So in the case I was describing, where maybe you have a class with very few first-year students, if you take a random sample of five from each stratum, then you’re kind of getting almost like extra votes from a small minority, right? You’re treating them equally, even though there’s a way bigger group of the other people that you’re also taking exactly five from. But that’s the risk you take, because you want to make sure you hear from that small group. Because if you just do simple random sampling with a group so small, you might just accidentally miss it. So here are some examples of stratified sampling.
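Before those examples, the steps just described (split the list into strata on one characteristic, then draw a simple random sample from each stratum) can be sketched in Python. This is an illustration under assumptions: the class roster is invented, and `stratified_sample` and its parameters are hypothetical names, not part of the lecture.

```python
import random
from collections import defaultdict


def stratified_sample(individuals, stratum_of, n_per_stratum, rng):
    """Split the list into strata, then take a simple random sample from each."""
    strata = defaultdict(list)
    for person in individuals:
        # Every individual must land in exactly one stratum.
        strata[stratum_of(person)].append(person)
    sample = {}
    for name, members in strata.items():
        # Guard against a stratum smaller than the requested sample size.
        k = min(n_per_stratum, len(members))
        sample[name] = rng.sample(members, k)
    return sample


# Made-up class: only 6 first-year students among 100.
students = [("first year", f"F{i}") for i in range(6)]
students += [("second year", f"S{i}") for i in range(94)]

rng = random.Random(0)
picked = stratified_sample(students, lambda s: s[0], 5, rng)
for stratum, members in picked.items():
    print(stratum, [student_id for _, student_id in members])
```

Taking five from each stratum guarantees the first-year students are heard from, but it also means 5 of 6 first-years are sampled versus 5 of 94 second-years, which is the "extra votes" issue raised above.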
And you’ll see this in the Youth Behavioral Risk Factor Surveillance surveys that they do in high schools. They’ll stratify by grade, because a lot of students drop out in junior and senior year, so if they did a simple random sample, they’d probably get too many freshmen and sophomores. And so they’re going to want to get a certain amount of freshman classes, a certain amount of sophomore classes, a certain amount of junior classes, and a certain amount of senior classes, so they can have enough of each to make good estimates. And in hospitals, they often sample providers from each department. They don’t just do a simple random sample of providers if they’re asking about provider satisfaction, or about a policy, because they might, for example, miss everybody in the ICU. Or if you’re studying ICUs and you have multiple ICUs there, then you might want to stratify by ICU, just to make sure, even if one of them is smaller, that you have good, solid representation from each ICU. So those are the reasons that push you to do stratified sampling. It’s not always necessary. But when you have these situations where you have distinct groups, especially when there’s a little one involved, and you want to hear from everybody, you really want to consider stratified sampling.
So of course, there are limitations. And I’ve been leading up to this. What you end up doing is oversampling one of the groups, usually the smallest group. If you take the same number of people from that stratum as you take from the big stratum, it’s like the smallest group is having all these powerful votes and the biggest group is weaker. They’re treated as equal when they’re not technically equal in the population.
But that’s the way it goes, right? And in higher-level statistics, there are ways to adjust back for that, to take a penalty for it and extrapolate back to the population proportions. It’s possible, but it takes some post-processing. That’s just the issue. And, like simple random sampling, it’s also not really possible to do without a list beforehand. And it’s also hard to do because you actually have to split the list into these strata. So let’s say I had these hospitals and I didn’t know where they were. I didn’t know exactly if they were urban or rural or suburban. Well, that adds another level of complexity to this whole stratified sampling. So, in summary, I just went over what stratified means. It means putting things in groups and then sampling from those groups, and I described the steps involved in taking a stratified sample. It goes a lot more easily if the strata happen to be equal in size to begin with. I gave the example of high schools. Usually there are maybe slightly fewer people in junior and senior year, but it’s kind of close. And it’s always nice, if you’re comparing ICUs, for example, if the ICUs are roughly the same size, because then you don’t have to worry about this whole thing where one of them is smaller but is getting an equal vote.
All right, now we are going to move on to talk about systematic sampling. Okay, well, systematic sampling actually can be done with or without a list. So it’s a little more flexible than the kinds of sampling we’ve been talking about. With systematic sampling, it’s easier for me to define it by describing the steps you go through to do it. So I’m just going to explain how to do it, and then you’ll understand. In fact, you’ll understand why it’s called systematic. So whether you have a list or not, what you have to do for step one is arrange all the individuals of the population in a particular order.
Now, if it’s a list, you just put it in whatever order you want. But if we’re talking about, for example, patients coming into the ER, well, they come in in the order that they want to. So they already are arranged in a list, right? You just don’t know what that list is. Okay, then step two is pick a random individual as a start. So let’s say I had a list of hospitals, and let’s say it was just sorted by state. Let’s say I picked a random individual. Maybe I went down seven on the list, and I picked that hospital. Or maybe you’re at the ER and you start your shift, and the seventh patient who is admitted to the ER, you pick that person. I just picked seven. You could have picked five, you could have picked 20. You just pick a random person.
Then the next step, step three, is take every kth member of the population into the sample. Now, don’t try this in Scrabble. Kth is not a word in Scrabble, okay? It’s just a word in statistics-ese. What kth means, spelled k-t-h, is every so many. So let’s pick a number and fill it in for k. Let’s pick the number three. So after you pick your first hospital from the list, or the first patient from the ER, and it doesn’t matter what number you chose for that, then you take every third after that. So every third patient that comes in after that, you ask them if they want to be in a study. Or every third hospital after that original random one, I pick, and I say, okay, this is going to be part of my systematic sample. So as you can see, it’s pretty simple to do. It’s easy to do if you have a list, and it’s easy if you don’t have a list. The deal is you first pick a random place to start, then you pick k, and then you just keep going every so many. So you could do this with classes. You could take out a list of classes available at your college next semester, sorted some way, and you pick a random number like three.
So you go to the third class and you circle that. Then you pick another random number, like five, and after that you pick every fifth class. So after the third one, you go 4, 5, 6, 7, 8, and then 9, 10, 11, 12, 13. And you keep picking classes. Okay, this is not career advice. Do not pick your classes that way. This was just an example.
All right, so as you probably guessed, I’m going to be Negative Nelly again. There are problems with systematic sampling. If things are already set up boy, girl, boy, girl, for example, and you pick an even number, you’re going to get all boys or all girls. And I noticed this, actually, when I was doing a study in the lab. We thought some of the assays weren’t running right when they put them through the machines, so we wanted to take a sample. And I wanted to take a systematic sample, like every seven days, and that’s a week. And so I asked my colleague, does the lab vary day by day in what assays it runs? Because if it always saves up the sexually transmitted disease assays and runs them all on Friday, and I’m sampling every Friday, that’s all I’m going to get, right? That’s actually called periodicity. You don’t have to remember that. I don’t think I’ve ever even seen that written. I just remember my lecturer in my class telling us that that’s what you have to worry about with systematic sampling. It’s not a real common problem, though. But what’s awesome about systematic sampling is you can do it in a clinical setting. So you can sample patients that way, coming into a clinic, or coming to a central lab, or in the emergency room.
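The systematic sampling steps, and the periodicity problem with the weekly lab schedule, can both be sketched in Python. This is an illustration under assumptions: the function name `systematic_sample` is hypothetical, and the random start is drawn from the first k positions, which is a common textbook convention rather than something the lecture requires (the lecture allows any random starting individual).

```python
import random


def systematic_sample(ordered, k, rng):
    """Pick a random start among the first k positions, then take every kth item."""
    start = rng.randrange(k)  # assumption: start within the first k positions
    return ordered[start::k]


rng = random.Random(7)

# Patients in the order they arrive (made-up IDs 1..30), taking every 3rd one.
patients = list(range(1, 31))
picked = systematic_sample(patients, 3, rng)
print("patients sampled:", picked)

# Periodicity: if the ordering cycles weekly and k = 7, every pick lands on
# the same weekday, so only that day's kind of work is ever seen.
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
lab_days = [weekdays[i % 7] for i in range(28)]
lab_sample = systematic_sample(lab_days, 7, rng)
print("lab days sampled:", lab_sample)
```

Nothing here needs the full list in advance: in a clinic you could apply the same rule to patients as they arrive, counting off every kth person who qualifies.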
And that’s why this is a particularly powerful way to sample. If you have an ongoing patient influx, then when you design your research, once you decide how many people you need to recruit for your sample, you could simply say that you will use systematic sampling, and just have somebody in the clinic inviting every kth patient who qualifies into your study. So systematic sampling is easy to do, and it’s easy to do with or without a list. You just pick a random starting point, and then you pick every kth individual.
Next, we’re going to move on to cluster sampling. So what is up with cluster sampling? Why do we even need other kinds of sampling? I just went over so many kinds. I mean, you could use stratified, systematic, or simple random sampling. Why would you even need another kind? Well, cluster is very special. It’s special because it’s the kind of sampling you use when you think there’s a problem at a particular geographic location. Typically, that’s how cluster sampling is used. And I’ll explain it further. Imagine, for example, there’s a particular factory that is believed to emit fumes that cause problems with people’s health. Well, you can’t do simple random sampling all over the nation, or you won’t even get people by that factory. You can’t easily do stratified or systematic sampling there. Cluster sampling is what’s designed for when you want to study something that’s coming from a geographic location. So when you do cluster sampling, you start by dividing a map into geographic areas. I’m from Minnesota, and I know that there was a mine there with vermiculite in it. It was contaminated, and a lot of people got sick from it, but they didn’t know that’s what was going on. So they first, I think, divided Minnesota into different geographic areas.
After dividing the area into these different geographic areas, some with the bad thing in it and some without the bad thing in it, you randomly pick clusters, or areas, from the map. So on the map, if you’ll see there on the screen, there’s a map of the state of Virginia, and it’s all been divided into different groups. And then this one cluster is highlighted. You usually pick more than one cluster; sometimes it’s only four or five. But the idea is you try to enroll all of the individuals in the cluster. It’s usually people, although you can do it with animals. If there’s a disease going around among animals, you would divide the area up into clusters, and then you try to measure all the animals in the cluster.
So as you can imagine, not only is this practically difficult, but there are reasons why people live together, right? People live in communities. People don’t just randomly scatter themselves. Cultural communities grow. Communities grow around art. Affluent communities have different people in them than communities that have less money. So sometimes the people located in the cluster are all similar in a way that makes the problem hard to study. And this is especially true if you’re studying some geographic thing, like maybe a factory or a sewage plant that you think might be causing cancer. If you’re in an area where there’s a lot of pollution anyway from other things, and a lot of low-income people live there, because if you’re high income you can afford not to, well, they’re already being exposed to higher rates of carcinogens and probably have a higher cancer rate. It’s hard to tell what the independent effect might be of that thing in that geographic location, because of the other similarities of the people around. And so this is why cancer ends up being a really tough nut to crack.
Because where we see high rates, there are often a lot of different geographic issues going on there, and cluster sampling doesn’t really help tease that out.
So to wrap this up, cluster sampling is used when geography is important. If there is something geographically located in a certain spot and you can’t move it, then you’re kind of stuck doing cluster sampling. So briefly, the map around that area is divided into different sub-areas. Not all the areas are picked; just a few are randomly picked. And then all of the people in that particular area are sampled. And of course, it’s biased towards the people living in the area. If you pick an area with a bunch of affluent people, you’ll get affluent people. Pick an area with a bunch of immigrants, and you’ll get immigrants. And so cluster sampling is not perfect, but you’re kind of stuck with it when there’s a situation with geography. How I always remember it is, when I used to live in Florida, we’d like to drive up to Georgia because they had the best pecan clusters. That’s a type of dessert with pecans and caramel and stuff. So when I think of cluster sampling, I think of those pecan clusters that are only really good in Georgia. That’s my way of remembering that cluster sampling has to do with geography.
Now I’m finally going to talk about the last two types of sampling that I’m going to cover in this lecture, convenience sampling and multistage sampling. They’re both a little quick, so I’m just going to cover them quickly. First, we’re going to start by talking about convenience sampling. And we like that name, right? It’s convenient.
Convenience sampling can be used under low-risk circumstances, like if the findings of what you’re doing aren’t really that important. For instance, let’s say that you wanted to know what ice cream is the best at the restaurant next to the hospital. Let’s say a new restaurant opens up, and you’re going to go off your diet and get some ice cream, but you don’t want to waste it, right? So you want to ask people what’s the best one. You might ask your coworkers, you might ask the people at the restaurant, hey, what’s the best ice cream? But the results are not so reliable, because you might end up on Yelp and see that other people disagree. So convenience sampling is basically using results or data that are conveniently or readily obtained. In my master’s degree, one of the things I did was survey people anonymously who were coming to a health fair. I sat at a booth, and I gave them the survey, with a few questions in it. That was definitely a convenience sample, just people showing up for the health fair. And this can be useful when there are not a lot of resources allocated to the study. I was a starving master’s student; I didn’t have any money. So convenience sampling was perfect for me. And also, the questions I was asking them were just about characteristics of whether or not they had risk for diabetes. Well, I’m not a doctor, and I wasn’t going to do anything about it. But it was interesting. So it wasn’t a very high-risk survey to fill out. And convenience sampling is convenient because it uses an already assembled group for surveys, like I was doing at the health fair. An example might be to ask patients in the waiting room to fill out a survey, or ask students in a class. Sometimes when I’m teaching, I’ll do a convenience sample of whoever’s sitting there. I’ll say, Hey, is the homework that I assigned you this week too hard? Well, it’s always too hard. I don’t even know why I do the survey.
But anyway, sometimes as a teacher, you’ll just want to do a convenience sample just to get a gauge on where the class is. But there are problems with it, right? You can’t just use it for everything, even though it’s nice and convenient. There’s bias in every group, right? So if I let everybody go on break, and then I ask whoever’s still sitting there if the homework’s too hard, I might get a totally different answer than if I waited for everybody to come back. Right. And just about any time you just waltz into a room, like when I went to the health fair, who do you think is there? A bunch of sick people? No, there’s a bunch of health-minded people there. And so I’m gonna get a bunch of bias, right? And also, more importantly, when you do convenience sampling, you often miss important subpopulations. So remember stratified sampling, how sometimes people don’t group evenly into the different strata? Maybe they kind of do in high schools, but especially when it comes to job classifications, workplaces usually have fewer bigwigs than they do lackeys, right? And if they just have a few bigwigs, if you do a simple random sample, you might miss all of them. So maybe you try a stratified sample. On the other hand, if you walk into the break room that is used by the lackeys and you say, hey, I want you to fill out my work satisfaction survey, all of the ones you’re going to get are going to be from the lackeys. You’re not going to get any representation from the upper job classes because they don’t go in that lounge, so you’d be missing them. So that’s the main problem with a convenience sample: the results can be severely biased because you’re only asking a small, biased group of people that probably are all alike in some way. It’s not a very representative sample. Next, I’m going to talk about multistage sampling. So, you know, if you have a kid and the kid’s crying and somebody’s like, what’s up? You say, well, the kid’s going through stages. Well,
that’s exactly what you’re doing when you’re doing multistage sampling: you’re going through stages. It’s basically like mixing and matching the different sampling I just talked about, only you do one stage, and then two stages, and then three stages, and then four stages, or maybe even more. And that’s how you get your sample. So if you’re imagining, wow, I’ve got to start with a lot of people, you’re probably right. Here’s an example I made up of a way that you could do multistage sampling. You could start with stage one as a cluster sample, right? Remember, where you take out a map, and then you divide it into areas? Well, let’s divide into states and take two census regions of states, like about 10 states from those clumps. Okay, now we’ve limited it to that. Now let’s go to stage two of our multistage sampling. From each of those states, we could take a simple random sample of counties, right? So we go and look at all the counties and then take that random sample. Then after we get those counties, in stage three, we could take a stratified sample of schools from each county. Some of the counties will be totally rural, some will be totally urban, but most will have some mix. So we’ll take a few schools from the urban and a few schools from the rural in stage three. That gives us a stratified sample of schools from the simple random sample of counties from this cluster sample of states. Okay, now we’ve got our schools. Stage four could be a stratified sample of classrooms. So once we’ve figured out our urban schools and rural schools, we could go in there and look at all the classrooms, freshman, sophomore, junior, senior, and take a stratified sample of those. So it’s basically mixing and matching. But you’re right, you’ve got to start with a lot to begin with if you’re gonna whittle it down in a whole bunch of stages. It doesn’t have to be four; I just gave you four. Now I’m going to give you a real-life example. This is the National Health and Nutrition Examination Survey.
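Before getting into that survey, here is a rough sketch of the made-up design just described, covering the first three stages: pick states, then a simple random sample of counties in each state, then a stratified sample of schools (urban and rural) in each county. The toy sampling frame, every state, county, and school name, and the function names are invented for illustration. Stage four (classrooms) is left out to keep it short, and stage one here just picks a few states at random rather than working through census regions.

```python
import random

def build_toy_frame():
    """Build a hypothetical frame: state -> county -> stratum -> schools."""
    frame = {}
    for s in range(1, 6):
        state = f"state{s}"
        frame[state] = {}
        for c in range(1, 5):
            county = f"{state}-county{c}"
            frame[state][county] = {
                "urban": [f"{county}-urban-school{i}" for i in range(1, 4)],
                "rural": [f"{county}-rural-school{i}" for i in range(1, 4)],
            }
    return frame

def multistage_sample(frame, n_states, n_counties, n_schools_per_stratum, seed=None):
    rng = random.Random(seed)
    picked = []
    # Stage 1: pick a few states off the "map".
    for state in rng.sample(sorted(frame), n_states):
        counties = frame[state]
        # Stage 2: simple random sample of counties within each picked state.
        k_counties = min(n_counties, len(counties))
        for county in rng.sample(sorted(counties), k_counties):
            # Stage 3: stratified sample, some schools from EACH stratum.
            for stratum in sorted(counties[county]):
                schools = counties[county][stratum]
                k_schools = min(n_schools_per_stratum, len(schools))
                for school in rng.sample(schools, k_schools):
                    picked.append((state, county, stratum, school))
    return picked

sample = multistage_sample(
    build_toy_frame(), n_states=2, n_counties=2, n_schools_per_stratum=1, seed=0
)
for row in sample:
    print(row)
```

With 2 states, 2 counties per state, and 1 school from each of the 2 strata, you end up with 8 schools out of the 120 in the toy frame, which shows how quickly the stages whittle things down.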
NHANES is definitely not a master’s project. This is done by the Centers for Disease Control and Prevention in the United States, right? So what I’m kind of hinting towards is that the kinds of places doing multistage sampling are governments. Not only do you have to start with a whole bunch of people and things and individuals, states and schools and what have you, but it’s also a lot of work to do all the sampling, and it had better be for a good reason. And the National Health and Nutrition Examination Survey is a good reason. That’s a survey that’s done by the CDC to try and measure America’s health. Of course, it’s doing inferential statistics, right? It’s taking a sample and trying to extrapolate that information back to the population. And so it’s got to be really careful about how it does its sampling; you can’t just waltz in and do a bunch of convenience sampling. So this is how it does it, just briefly. They start in stage one by sampling counties. Then from those counties, they sample something called segments, which are defined in the census; they’re different areas. From those segments, those areas, they sample households. And what they mean by that is, wherever you live is a household. Even if you live in a dorm, that’s a household, or if you live in assisted living, that’s a household. An apartment in a building is a household. So they sample those, and once they knock on the door of your household, they sample individuals from the household. So they use four stages of sampling. And that’s a real-life example of multistage sampling. So in summary, convenience and multistage sampling. With respect to convenience sampling, you want to avoid it unless it’s really a low-risk question you’re asking about. And you also want to avoid it unless it’s really the only type of sampling possible under the circumstances. When you have situations where you have patients with a very rare disease, probably convenience sampling from your rare disease clinic is reasonable.
Convenience sampling is also used when resources are low. And so those are a few good reasons to try to use convenience sampling. It’s really something that you want to use only if it’s the thing you’re stuck with. It’s much better to look towards these other sampling approaches I described. And then finally, multistage sampling is usually used in large governmental studies. So don’t expect to actually design anything alone with multistage sampling. I showed you those four stages for that survey that the CDC does; hundreds of people work on that. Even just on the sampling, tons of people work to try and set that up. It’s very difficult. But I wanted you to know about that kind of sampling, because it’s important in healthcare, and it happens a lot. So in conclusion, we made it through the sampling lecture, didn’t we? I first started by describing some definitions you needed to be able to understand all these different types of sampling. Then I went into simple random sampling, and showed you how to do it two different ways, and what it achieves, and also its limitations. We next talked about stratified sampling, why you do that and how you do that, and the limitations of that one, too. Then we got into systematic sampling, which is a little more flexible, and pretty easy to explain. Next, we talked about cluster sampling, and why you might need to pull that tool out of your sampling toolbox. And then finally, we covered convenience sampling and multistage sampling. All right. Well, I hope you better understand sampling now and can keep all of these different types of sampling straight in your mind.
Hello, everybody, it’s Monica Wahi, Labouré College lecturer for statistics. We’re on to Section 1.3, Introduction to Experimental Design. And here are your learning objectives.
So at the end of this lecture, you should be able to first state the steps of conducting a statistical study, and then select one step of developing a statistical study and state the reason for the step. You should be able to name one common mistake that can introduce bias into a survey and give an example. You should be able to explain what a lurking variable is, and give an example of that. And you should be able to define what a completely randomized experiment is. So let’s get started. This lecture is going to cover four basic topics. First, we’re going to look at the steps to conducting a statistical study. You may think there are a lot of steps to conducting a study; this is from the point of view of the statistician. Okay? Then we’re gonna go over basic terms and definitions. And by now, you’re probably used to the fact that in statistics, certain words are reappropriated, and they mean something specific in statistics. So we’ll talk about that. Then we’ll talk about bias, what that is, and how to avoid it when designing your studies. Finally, we’ll talk about randomization, in particular topics you need to think about when thinking about randomization. So let’s get started. We’re going to start with, of course, basic terms and definitions. And so first, we’re going to review these steps that I keep talking about to conducting a statistical study. But there’s some vocabulary that comes up, and so we’re going to talk about those vocabulary terms. And then also, I’m going to give you a few examples from healthcare. So here are the steps I keep talking about. These are the basic guidelines for planning a statistical study. So the first thing you want to do is state your hypothesis. And you know, I’ve been a scientist a while now, and I can’t tell you how many times I get in a group of us, and people are all curious, and they start thinking about, let’s do a study.
And it’s only halfway through our conversation that I suddenly say, hey, wait a second, we don’t have a hypothesis. What’s our hypothesis? So it’s easy, even for scientists, to forget that step one is really that you have to have a hypothesis. And whatever hypothesis you pick, the hypothesis is about some individuals. If I have a hypothesis about hospitals, those are the individuals. If I have a hypothesis about patients, those are the individuals. But it’s important, actually, to nail that down. Because am I talking about patients in the hospitals, or am I talking about the hospitals? So step two is to make sure that you understand, after you percolate and decide on your hypothesis, who the actual individuals of interest are. And that’s because you’re going to have to measure variables about these individuals. So step three is to specify all the variables you’re going to need to measure about these individuals. And of course, they relate to the hypothesis. So it’s a good thing that was step one, right? Step four is to determine whether you want to use the entire population in your study or a sample. If you already have a bunch of data, like you have the census data, you might as well use the entire population. But typically, if you don’t have the data, you’re going to want to sit down and think about using a sample. And if you do that, while you’re sitting down, you should probably also choose the sampling method on the basis of what I talked about in the sampling lecture.
Now that you’ve figured out your hypothesis, you’ve got your individuals, you’ve figured out your variables, and you’ve figured out whether you’re going to do a census or a sample, and if you’re going to do a sample, what type of sample, step five is to think about the ethical concerns before data collection. If you’re going to be asking some sensitive questions, you think about privacy. If you’re going to be doing some invasive procedures, you think about how painful that would be, and how hard that would be on somebody, especially if they’re just healthy and you’re just doing an experiment on healthy people to better understand biology. So you have to really sit down and think about these ethical concerns, and they may slightly change your study design. Finally, after steps one through five are taken care of, that’s when you actually jump in and collect the data. And like I was saying, when I meet with my scientist friends, we get all excited about an idea. We’re often talking about step six. We’re like, oh, we should do a survey, we should do this, we should do that. And I end up saying, hey, we actually have to go back to step one and start talking about a hypothesis, because I suddenly realize I don’t even know what data to collect, right? If you don’t go through the steps in order, you really aren’t doing it right. Step seven is, after you get the data, you finally use either descriptive or inferential statistics to answer your hypothesis. And that’s what statistics is about. It’s here for that. And then finally, after you use the statistics, you have to write up what you find, even if you’re at a workplace and they ask you to do a little survey. That happened once when I was working somewhere, and they wanted us to do a survey.
Their hypothesis was that they didn’t have enough leadership programs, and they weren’t building good leaders they could promote. And so I was on a team that did the survey. We didn’t really publish it, like, everywhere, but we made an internal report, right? And in that internal report, we had to do step eight, which is that we had to note any concerns about data collection or analysis that came up when we were doing the report. And we also had to make recommendations for future studies, in case they wanted to study this in future groups of employees. In science, what it usually ends up being is a peer-reviewed literature report, right? You do a scientific study, maybe you get a grant, and then you do all these steps. And then step eight is where you actually prepare a journal publication. And in that, you have to note any concerns about your data collection or analysis, anything that might have gone wrong, or not gone exactly the way you planned, or something you need to take into account to really properly interpret what the study found. You also want to make recommendations for future studies, especially if you screwed something up, or especially if you answered a really good question. No reason to perseverate on that question; why don’t we move forward and ask the next one? Now, these are a lot of steps to remember. So I’m going to help you try to remember them in sort of clumps. Let’s look at the first clump, which is steps one through three: state the hypothesis, identify the individuals of interest, and specify the variables to measure. So let’s give an example of that. Let’s say our hypothesis was that air pollution causes asthma in children who live in urban settings. That’s how we’d state it, or we could say that as a research question, like, does air pollution cause asthma in children who live in urban settings?
And so in that case, the individuals would be children in urban settings, and the variables we’d have to measure are air pollution, at least, and asthma, at least. And of course, we’d want to know more things about these individuals, these children. We’d probably measure their income and where exactly they were living, and how old they were, and if they’re male or female, and these kinds of things. But that just kind of helps you think about the first three steps together. Now let’s think about the second three steps together, four, five, and six, which are: determine if you’re going to use a population or a sample, and if it’s a sample, pick the sampling method; look at the ethical concerns; and then actually collect the data. So, when you do that, you can either, quote unquote, collect data by using existing data, like by downloading data from the census, or from Medicare, which has data sets available that are de-identified, so you don’t know who exactly is in there. Or you can collect data yourself, like do a survey, or get a bunch of patients that will allow you to measure them. When you use a government data set, often you can make population measures out of it. And so you don’t really have to go through a lot of sampling or ethics, because they’ve already provided it for you, and it’s confidential. And that’s kind of your data collection. But most of the time, what you’ll see, especially for studying patients and treatments and cures and things like that, is that those studies are on a smaller scale. So you end up collecting data from a sample for those estimates. And again, you need to choose a sampling approach. And then you need consent, if it’s legally found to be human research.
So I just want to share with you, in case you didn’t know, that if you want to go do research on humans, whether you’re a nursing student, or a medical student, or a dental student, any student, or you’re a dentist, a physician, a nurse, whatever, you can’t just make up a survey or study design and go out and do it. You have to get approval from an ethical board. And that ethical board will tell you if what you’re doing is considered legally human research, and that you need to get consent from the patients or the participants in your study if they’re humans. And if you’re collecting data about children, for example, you have to get the consent of their parents and the assent of the children. And in the United States, the way we have it set up, it’s called an institutional review board for the protection of human subjects in research, or IRB for short. And so I just want to make sure that if you ever do design a study, you know about this IRB thing, and you realize you have to go through this ethical board and make sure that they’re cool with it before you can move on to the next step of designing a statistical study. All right, finally, we’re on to the last clump of steps, which is seven and eight, right? So that’s using descriptive or inferential statistics to answer your hypothesis. In six, you collected the data. Now we’re going to do the statistics. And then step eight is noting any concerns about your data collection or analysis and making recommendations for future studies. So you can kind of imagine this is where we’re sitting in our offices and writing up our research, whether we’re writing an internal report to our bosses or writing for the scientific literature to publish for everybody. So at this point, I just want to remind you that it matters whether you picked a census or a sample for your study design. Because if you pick the census, you’re going to do a certain kind of analysis.
And if you pick the sample, you’re going to do a different kind of analysis and statistics. So again, that all kind of cycles back to your study design. And what’s important here is I want to talk to you about the two different main types of studies. Now, within these two categories, you have different subtypes. But these are the two main types that you can have. The first is called an experiment. An experiment is where a treatment or intervention is deliberately assigned to the individuals. So you can kind of imagine that if you enter a study, and they assign you to take a drug in the study that you weren’t taking before, that would be an experiment. But another thing could happen. I mean, you could do this to individuals, you could do it to animals, but you could also do it to, and I keep giving the example of, hospitals. We could choose some hospitals and say, hey, you need to try a new policy as the intervention, and that was assigned by the researcher. So that makes this an experiment. And the reason why we have experiments is that sometimes you need them. The purpose is to study the possible effect of the treatment or the intervention on the variables measured. And so that’s one option: you can have an experimental study, where the researcher assigns the individuals to do certain things in the study. There’s another kind of study, which is called observational. And the way you can think about it is, in experiments, the researcher does something. They intervene, they give a treatment, right? But in an observational study, the researcher doesn’t do that. The researcher just observes. So if you enroll in the study, and you say, do I have to take a drug? Am I supposed to eat something? What am I supposed to do?
And the researcher just says, no, we’re just going to measure you, we’re just going to ask you questions, and we’re going to measure things about you, we’re not going to tell you to do anything different, then you’re in an observational study. So no treatment or intervention is assigned by the researcher in an observational study. Now, let’s say you’re taking a drug, maybe because you have migraines and you’re taking a migraine drug. Well, you just keep taking it, or you can stop taking it; they don’t care. They might ask you about taking the drug, but they’re not going to assign you to take it. It’s an observational study. I wanted to give you a couple of real-life examples. So the Women’s Health Initiative, up on the slide, was mainly an experiment, okay? This was run by the United States government, but of course had the cooperation of many, many universities and health care centers, and most importantly, women. So women in America who were postmenopausal volunteered to be in the study. And the study actually had two separate sections, the experiment section and the observational study section. They really wanted women to qualify for the experiment, and the purpose of the experiment was to study hormone replacement therapy, which is a therapy for unpleasant symptoms that women can get if they’re postmenopausal. The question was whether that therapy is good for women or bad for women, because they thought maybe it helps with the postmenopausal symptoms, but they thought maybe it causes cancer, right? So they didn’t know. So what they had to do was get a bunch of women who were agreeing that they would take whatever was assigned to them. And they had to assign the drug to some of these women. So that’s what made it an experiment. The problem is not all the women qualified for the study.
So they had a separate observational study. If the woman did not qualify to get the experimental drug assigned to her, then she could be in the observational study. And because these are big government studies, why not? If somebody wants to be in a study, why not study them? Just put them in the observational section. There’s also a very huge, popular, long, ongoing study that’s an observational study. This one actually started out of Harvard, and it’s called the Nurses’ Health Study. Some really smart person figured out a long time ago that nurses are smart people. They understand their own health, they understand other people’s health, and they’re good at filling out surveys about health. So they started studying nurses and regularly sending them surveys. Of course, they didn’t tell the nurses what to do. They didn’t assign the nurses any sort of drug to take or any diet or intervention or anything. They just observe the nurses. They send the nurses a survey about the nurse’s health, and then the nurse fills out that information. I think it’s every two years that they do that, and they’re still doing it. Also, at this point, I do want to point out the concept of replication. So just the word replication, right, in regular speaking, means to copy. Like, if you ever have a new roommate, you might need to replicate your key, so you have a copy of the key for the new roommate. Well, part of the whole science thing is that studies must be done rigorously enough to be replicated. So those are little keywords in there. A rigorous study means one that’s done really carefully, like thinking about sampling very carefully. You know, like avoiding, for example, non-sampling error, not being sloppy, not getting a lot of undercoverage, using a good sampling frame. I’m just giving you examples that you might know about. But there are a lot of things that have to be done in research to do it properly. It’s just like driving or anything else.
You really have to keep your eye on a lot of different things, and you want to try to do them perfectly. And the main reason why you want to do that is so that somebody can try to do the same experiment you did, or roughly the same experiment you did. Because you can’t do exactly the same, right? If I study this hospital over here, and somebody wants to study that hospital over there, well, they’re going to get different people in there, right? But even so, if that person decides that they want to study that hospital over there, if I did my study rigorously, then it won’t be so hard for that person to replicate how I did the study. And then we can see if that person’s study and my study get the same thing, or if there’s something slightly off, or what’s going on. And so replicating the results of both observational studies and experiments is necessary for science to progress. You’ll notice that a lot of experiments are done on drugs before they can be approved to be given to everybody, because they can’t just do one study. They have to replicate it, to make sure that the findings are all sort of coming in about the same and that we can deduce some information from them. You really just don’t want to rely on one study for your findings. So I just went over several steps that we need to follow when we’re doing a statistical study, and we actually have to follow them in order. And you also have to determine the type of study you’re doing: is it an experiment, or an observational study? And there’s a ton of study decisions you have to make. So you’ve got to keep that in mind. Now, we’re going to talk about avoiding bias, specifically in survey design. Now, you can do a lot of different kinds of studies. But let’s just talk about surveys, because that happens a lot in nursing. Nurses interact a lot with patients, with the community, and with each other. And often they gather information about those interactions or attitudes, or how the healthcare system functions, by using a survey.
So surveys can provide a lot of information, and useful information. But it’s important in all aspects of survey design and administration, when you’re giving it, that you think about minimizing bias, and try to get a representative sample, and try to get accurate measurements. And so several considerations should be made. First, you want to think about non-response and also voluntary response, okay? I talked a lot about sampling in the previous lecture. But just because you invite someone to participate in your study, like maybe you’re doing systematic sampling, and every third patient, you ask, would you like to fill out a survey? That doesn’t mean they’re going to, right? And so if that person says no, thank you, even though they were sampled, that’s called non-response. So if I was helping you with a survey, and you said, hey, I was getting a lot of non-response, I would look at the proportion. If you approached 100 people, and 80 said no, that’s only a 20% response rate and an 80% non-response rate. If many people are refusing your survey, the few who actually complete it are likely to have a biased opinion. I’ve noticed this in situations where things are really bad, okay? Like, I remember going to a subway station and it was flooded, and it was really in a bad situation. And there was a man handing out surveys from the Transportation Authority. And he was like, please take my survey, please take my survey. And everybody was waving past him. They didn’t want to grab a survey. Well, you know me, I’ve got a bleeding heart for surveys. So I took his survey, and I filled it out. You know, I think the Transportation Authority’s not so bad, right? I lived in Florida; there’s no transportation there, right? And here in Massachusetts, we’ve got a great transportation system, even if it’s flooded or doesn’t work half the time, right? It’s way better than not having one.
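As a quick aside, the response-rate arithmetic from a moment ago can be written down as a tiny sketch. The function name `response_rates` and the counts used (100 people approached, 80 refusals) are just illustrative numbers, not data from a real survey.

```python
def response_rates(approached, refused):
    """Return (response rate, non-response rate) as fractions of those approached."""
    if approached <= 0 or not 0 <= refused <= approached:
        raise ValueError("need approached > 0 and 0 <= refused <= approached")
    completed = approached - refused
    return completed / approached, refused / approached

# Illustrative counts: 100 people approached, 80 of them said no.
response, non_response = response_rates(approached=100, refused=80)
print(f"response rate: {response:.0%}")          # prints "response rate: 20%"
print(f"non-response rate: {non_response:.0%}")  # prints "non-response rate: 80%"
```

The two rates always add up to 100%, so a high non-response rate is the same warning sign as a low response rate: most of the people you sampled are not in your data.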
Well, I’m not the only one who grabbed a survey. A bunch of nice Pollyannas like me grabbed a survey. So probably the Transit Authority thinks that everybody loves the subway, when everybody was waving past this poor guy because they were so disgusted, because the station was flooded. Right? So if so many people are refusing your survey, a high proportion, the few who will actually fill it out are going to be kind of weird, probably like me. You’re gonna get a bunch of happy people, when most of the people who said no might be sad people. And so the reason they may not be completing your survey may have to do with how they feel about your topic. This is not just in terms of satisfaction. Let’s say you want to ask about how many drinks per night somebody has. Okay? Do you think a lot of people who are struggling with alcoholism are gonna want to fill out that survey? How about illegal drugs or other illegal activity? People who are into that don’t always feel so good about talking about it. And so you might get a few people to fill out your survey, but those are not necessarily the people who are engaging in the behaviors. So the fact that we have the freedom to choose whether or not we want to be in a survey is great. But from a researcher’s standpoint, you have to be careful. If you get low response rates, you need to ask yourself, who was not responding? And am I missing a good share of opinion there? And then, when you get people who do respond, you’ve got to be careful with that, too. Respondents may lie on purpose. If you’ve got a pretty cool survey, but you suddenly ask a question that’s too personal, people might just lie. Maybe you’re surveying students, you’re doing a satisfaction survey on how the front desk runs at a dorm or something. If you ask a question like, have you ever cheated on a test?
You know, everybody’s probably gonna say no. Also, if you ask a question where people don’t really know the answer offhand, they’re not gonna put it. Like if you ask a kid who’s been living in the house forever, when your parents bought the house, how much did it cost? I mean, they’re not gonna know. Maybe they’ll know, but probably not. And so you want to be careful when you design your questions that you’re not asking anything that’s so personal everybody’s going to lie about it, and that you’re not asking a question where, even if people try to be accurate, they’re probably not even going to give you the right answer, because it’s just too hard to think about. Respondents to surveys may also lie without meaning to, like, inadvertently. Again, if you ask a question about something that happened a really long time ago, they’re probably not going to get it right. This is called recall bias. You know how you can look back at a time in your life, especially if you went through something really harsh, like if you were a part of a sports team, and you went to state, and it was really tough, and you don’t remember the tough part, right? You sit around singing your sports songs, and you say, hey, that was awesome. Well, that’s recall bias, right? Because after winning state, everything looks rosy. But on the bus, it really wasn’t that easy. So people tend to have recall bias; it’s influenced by events that have happened since the original event. So if you’re giving people a survey, and you’re saying, well, before you applied for nursing school, did you think this, or did you think that? They might just tell you and think they’re telling you the truth, but they’re actually lying.
If you actually managed to go back in time and ask them, then they’d tell you something different. So again, you can kind of screw up your own data by screwing up your own questions. So you want to think about how you word your questions. You can also screw up your questions by introducing a hidden bias. Something happened to me recently, where a company sent me a free app. And they said, try our free app, and I downloaded it, and it was awful. Okay. And then about a month later, they sent me a survey. And these were the questions they asked. When do you use the app? What time of day do you use it? How do you use it? Do you read scientific literature? Do you read news? And the problem was, I couldn’t really answer any of this, because from the day I downloaded it, I never used it. It was so bad. Right? So question wording may induce a certain response. They were asking me, how do you use this, but they didn’t give me a choice of, I don’t. So I had to say something. I don’t even know what I said. I mean, there was nothing I could say to be honest, because of that bias. So you have to be careful that you aren’t too rosy about whatever your topic is, and assume everybody loves everything. I mean, you’ve got to put out questions like, are you even using the software? Did you have any problems with the software? Right? Just assuming they’re using it and liking it and using it like it’s supposed to be used is a big assumption. Order of questions and other wording may induce a certain response, and you’ll see this a lot if you take a public opinion poll. I used to do a lot of polling. We’d ask questions like, how likely are you to vote for candidate X? Very likely, somewhat likely, somewhat unlikely, or not at all likely? And people would say, I don’t know, not likely. And then you’d say, well, what if you knew that candidate X supported this new proposition, Proposition 69? Then would you be more likely to vote for candidate X?
And so that’s why order of questions, other wording, and stuff like that matter. They’re trying to see: if I add this fact or that fact, is that going to make the person like the candidate better? And so you do have to think about the order you put the questions in. And if you want to ask about two different subjects, kind of think about which subject should come first, because it might color the respondent’s answering of the subsequent subject. And also on the slide, I wanted to point out that the scales of questions may not accurately measure responses. Do your feelings always fit on a scale from one to five? Well, you know, Yelp’s kind of figured out that people’s feelings about restaurants tend to fit on a scale of one to five. I’d have a lot of trouble filling that out if they gave me a scale of one to 17. Right. But sometimes people have more granular feelings about things; maybe they need a longer scale, one to seven. You’ll see a lot of pain scales where they offer more than just five choices, because pain can maybe go from one to seven or one to 10. So think about your scales when you’re creating these questions, because that’s your choice if you’re designing the study. Another point to be made is the influence of the interviewer. Now, we don’t have as much interviewing going on these days, because we have the internet, where we can do anonymous surveys, and people just fill them out by self-report. We have robo phones, so you can robocall, and using an automated voice that’s obviously not a person, you can get survey data. But there are always situations where you actually have to interview people, especially if somebody is really sick in bed, and you have to show up there and talk to them. And even on the phone, you have to interview people, and they can hear your voice, right. 
So you’ve got to think about it when you’re pairing up whoever’s being interviewed with whoever’s interviewing. I’ve found that it’s best to have the interviewer come from the same population as the research participant. In general, the only time that can be a problem is if they’re both from the same community, and there’s a privacy issue. But it can be very helpful, for the most part, not always, to have your interviewers actually be from the population that you would be studying, you know, from the individuals that you would be studying. So for instance, say you need to interview some young African American men, some African American teenagers. I recently saw a study on how health care in the United States really isn’t suited for them, and it needs to improve and better cater to this population. Well, let’s say you wanted to better understand that. The best thing would be to hire a young African American male and train him on how to be a good interviewer and a good data collector, because you’d probably get the best data that way. On the other hand, let’s think of different ways that could go. You could take a person who was older, who is maybe of a different race, and maybe that would change how this young African American male would respond to the interviewer. I mean, the interviewer could be, in many ways, like the respondent, but the respondent’s perception might then change how they answer. All verbal and nonverbal influences matter: clothing, the setting that the person’s being interviewed in. And so I’m not saying there’s really a solution to all this. I’m just saying, make some good decisions. Like, I remember working on a data set where there were some questions that had been asked of some older men about their sexual function. And the data looked funny to me, and the statistician who was there during data collection told me that they had chosen young, female nursing students to interview these elderly men about their sexual habits. And I just said, you know, that might be subject to interviewer influence. And then you of course have to worry about vague wording. Just because it looks clear to you doesn’t mean it looks clear to everyone. There are simple ways of avoiding vague terms in the survey, when you can just put a number on it. So instead of asking a person if they’ve waited a long time in the waiting room, you can say, more than 10 minutes. You can say exactly: within the last month, have you done a certain activity? Or, within the next year, do you expect to change schools? Or whatever. And so wherever you can, try to use numbers or something very specific. You know, instead of “go to the clinic,” say “go to the public health clinic at this particular corner,” or whatever. And then you’re going to get some pretty accurate information. But sometimes you’re stuck using vague terms, because you’re studying vague terms, right? I was doing a study of attitudes towards controllable lifestyle in medical students. So we asked this question: how important is having a controllable lifestyle to you in your future career? Well, what does that mean? That’s pretty vague. So what we did is we used this grounding, this anchoring language. We added the sentence, a controllable lifestyle is defined as one that allows the physician to control the number of hours devoted to practicing his or her specialty. So even though we’re talking about something kind of waffly and watery, loosey-goosey, like control of a lifestyle, who knows what that means, we anchored it. And that’s not to say that that sentence couldn’t be interpreted differently by people; it certainly could. But if you’re stuck with vague wording, try to put some grounding language in it, so everybody’s at least sort of led in the same direction with their thoughts before they answer the question. 
Now, I want to also point out, as you probably have noticed, there are all these issues you have to think about when doing surveys. There’s this other issue called the lurking variable. Well, you know, lurk means to sneak around behind the scenes, right? A lurking variable is a variable that’s associated with a condition, but it may not actually cause it. I remember when I was studying epidemiology, they talked about how a lot of people who unfortunately got in motorcycle accidents had tattoos. So therefore, they said, nobody should get a tattoo, because you might get in a motorcycle accident? Well, that’s a great example of a lurking variable. Yeah, a lot of people who do get into motorcycle accidents have tattoos, but the tattoos don’t cause that. We also know that having more education increases income, but people who have the same education level do not all make the same income. There’s this thing called sexism, and this thing called racism. So it matters whether you’re a woman or a man; it matters, the color of your skin. If you’ve got darker skin, it doesn’t matter that you have the same education as somebody with lighter skin; on average, you’re still going to make less money. And so you have these lurking variables behind the scenes. So when people are looking at, well, why are people making less income, is it because they’re less educated, or whatever? Well, you’ve also got to look for the lurking variables. So current studies show that the reason women and African Americans make less money on the whole is not explained by fewer of them working or fewer of them getting degrees. It’s really these lurking variables. And so you’ve got to think critically. And I guess what I would say is, whenever you do a survey, if you’re studying something that has a lot of lurking variables associated with it, make sure you measure those variables. 
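To see why measuring a lurking variable matters, here is a small sketch. The counts are entirely made up for illustration (they are not from any study), and the function name `risk_ratio` is just a label chosen here. In this invented data, smoking raises lung cancer risk and smokers are more likely to drink, but drinking itself does nothing. Looking at drinking alone makes it look harmful; splitting the data by the measured lurking variable (smoking) makes the apparent effect disappear.

```python
# Hypothetical counts: (smoker?, drinker?) -> (number of people, lung cancer cases)
counts = {
    (True, True): (800, 80),    # smokers who drink: 10% risk
    (True, False): (200, 20),   # smokers who don't drink: 10% risk
    (False, True): (200, 2),    # non-smokers who drink: 1% risk
    (False, False): (800, 8),   # non-smokers who don't drink: 1% risk
}


def risk_ratio(table, smoker=None):
    """Risk among drinkers divided by risk among non-drinkers.

    If `smoker` is True or False, only that smoking group is used;
    if it is None, everyone is pooled together (the "crude" comparison).
    """
    totals = {True: [0, 0], False: [0, 0]}  # drinker? -> [people, cases]
    for (is_smoker, is_drinker), (people, cases) in table.items():
        if smoker is None or is_smoker == smoker:
            totals[is_drinker][0] += people
            totals[is_drinker][1] += cases
    risk_drinkers = totals[True][1] / totals[True][0]
    risk_nondrinkers = totals[False][1] / totals[False][0]
    return risk_drinkers / risk_nondrinkers


crude = risk_ratio(counts)                        # ignores smoking
among_smokers = risk_ratio(counts, smoker=True)   # smokers only
among_nonsmokers = risk_ratio(counts, smoker=False)

print(f"Ignoring smoking:  {crude:.2f}")
print(f"Smokers only:      {among_smokers:.2f}")
print(f"Non-smokers only:  {among_nonsmokers:.2f}")
```

If smoking had never been measured, only the first number could be computed, and there would be no way to tell that it was misleading.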
For example, early studies were looking to see if drinking a lot of alcohol causes lung cancer. Some of them forgot to really measure how much these people smoked. Because we know smoking causes lung cancer. And we know if you’re hanging out in a place with a lot of drinking and they allow smoking, you’ll see a lot of people smoking too. They seem to go hand in hand. So you don’t want to miss measuring variables that you think might be lurking variables. It’s no problem to measure them and not use them later, but just make sure they’re included. So, as a final note on bias, I just want to point out that survey results are so important for healthcare, and for the progression of science, that you really owe it to even the simplest survey to think about all of these possible things that could go wrong, just with the wording of questions or with how you’re approaching things, and really consider how you can improve it. It’s really important to pay attention to avoiding bias when you’re designing and conducting your survey. So think about all these things at the design phase. Finally, I’ll get into the last section of this lecture, which is about randomization, which I think a lot of us have heard about. So I’m going to explain the steps to a completely randomized experiment. And after I go through all that, I’m going to also talk about the concept of a placebo and the placebo effect. Then we’re going to briefly touch on blocked randomization, and also define for you what is meant by blinding. So why ever randomize, right? Randomizing is when you take a bunch of respondents or participants in your study, and you randomly choose what group they go in. And if you remember, like I was talking about with experiment versus observational study, we can’t do that in an observational study. This is definitely an experiment, because you’re telling them what group to go in, right. So randomization is used to assign individuals to treatment groups. 
And when you do that, when you randomly assign them, you’re not only assigning them, you’re randomly assigning them. You’re not picking; you’re using dice or some sort of random method, and it helps prevent bias in selecting members for each group. It tends to distribute the lurking variables evenly, even if you don’t know about the lurking variables, even if you aren’t measuring them. By using this randomization method, they get roughly equally allocated in each group. So just to remind you how you actually do that: first, remember the steps to a statistical study. You have to follow those. And after you get to the point where you have ethical approval, that’s when you start doing the data collection step. And that’s where you start recruiting your sample, you know, hanging up signs and saying, be in my study, and people come in, and you see if they qualify, and if they qualify, you’ve got this group, your sample, right. And what you do with those people is you say, thank you for being in my study. And you measure the confounders, which is another word for lurking variables. You also measure the outcome, whatever you’re trying to study, if you’re doing a randomized experiment. I’ve been involved in a lot of these where they’re studying drugs for lowering blood pressure. So they’ll often have maybe two groups or three groups that they’re randomizing people into, but they don’t do that first. The first thing to do is get everybody in there and measure their blood pressure, right? The outcome, because they want to know that before; they’re going to take a picture of the “before.” And they also measure confounders, like smoking. Remember, smoking is not good for your blood pressure. Other things are not good for your blood pressure, like not exercising. We’ll measure all of those things. Okay, now, here’s where we get into things. That’s when the whole randomization happens. So I showed this picture of a die, but we usually use a computer for it. 
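As a rough sketch of what “using a computer for it” can look like, the snippet below shuffles a list of enrolled participants and splits it in half. This is just one simple way to do it, not necessarily what any particular trial uses; the participant IDs, the function name `randomize`, and the seed value are all made up for illustration.

```python
import random


def randomize(participants, seed=None):
    """Shuffle the participants and split them into two groups, A and B.

    `seed` is optional; fixing it makes the assignment reproducible,
    which helps when the allocation needs to be documented or audited.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the enrollment list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


# Hypothetical participant IDs P01 .. P20, already screened and measured.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participants, seed=2024)

print("Group A:", sorted(groups["A"]))
print("Group B:", sorted(groups["B"]))
```

Because the computer, and not a person, decides who lands in which group, nobody’s preferences can creep into the assignment.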
So we’ve got all these people together. And now, randomly, we put them in different groups. And in this example, on the slide, we’re just going to pretend that there are two groups. And in fact, we can’t really study blood pressure the way it’s shown on the slide, because we’re going to give one group treatment and the other group placebo, which is an inactive treatment. It’s fake; it doesn’t work. Of course, the treatment and the placebo are going to look the same to the people taking them. You know, we’re going to fool them. They won’t know. But the reason why, in real life, you can’t do that with a blood pressure study today is we know that high blood pressure is really bad for you. So it’s really unethical to give someone a placebo; you’ve got to give them some sort of drug to lower the blood pressure. So usually when we do studies like this now on new blood pressure drugs, Group A is new treatment and Group B is old treatment. They usually take a new treatment and give it to Group A, and an old treatment to Group B, to see if they can find a better treatment. But if we were talking about something like Alzheimer’s, especially late-stage Alzheimer’s, there’s no treatment. Okay? And so what’s on the slide here, Group A that gets treatment and Group B that gets this sham pill, this placebo, would be ethical then. But let’s just cross our fingers that someday that’s not ethical anymore, and that we do get a treatment, right. Okay. So after you put them in the two groups, what’s sort of missing from the slide is that time passes. People in Group A take whatever they’re supposed to take, their treatment. And in this example on the slide, people in Group B take the fake treatment, the placebo, and neither of them usually knows what’s happening. But it takes a while, right. And in the olden days, before we knew high blood pressure was bad, these were the study designs. 
And this is what ended up happening: you would see, at the beginning, where they measured the confounders and the outcome, everybody had high blood pressure. They all looked the same. But after treatment, Group A would go down, whereas Group B would go down a little bit from the placebo effect, which I’ll explain in the next slide. But that’s how we learned that you can make blood pressure go down with these different pills. Finally, after that time passed, it could be six weeks, it could be years, however long that took, when it was over, we’d measure again the confounders, because they could have changed, and the outcome, which in my example was blood pressure, or, you know, how serious someone’s Alzheimer’s disease was, if we were doing that. So I promised you on the last slide that I’d talk to you more about what a placebo is, and the placebo effect. I found this great picture of old placebos from the National Institutes of Health. So a placebo is this fake drug that’s given, and it’s actually kind of hard to make placebos. Just imagine a drug you may have taken, Excedrin or something like that. Imagine we had to study Excedrin. We’d have to make a fake Excedrin that tasted like it and looked like it. Because otherwise, the people who are randomized to the placebo group would be able to totally tell that they were in the placebo group, and that’s not good. So the reason why you need a placebo is there’s this thing called the placebo effect. 
And that occurs when there is no treatment, but the participant assumes he or she is receiving treatment and responds favorably. Now, sometimes I talk about one of my favorite epidemiologist-comedians, Ben Goldacre. He reported, I think in one of his TED talks, about a study where everybody they enrolled, um, they didn’t have a disease, right, or I guess they had a mild disease. And they told everybody either they were going to give them nothing, or they were going to give them a pill that’s a placebo, it doesn’t do anything, or they were going to give them an injection that’s a placebo injection, it doesn’t do anything. And what they found is, of the three groups, the people who got the injection, the fake injection, did the best. The people who got the fake pill, the placebo pill, did second best. And the people who didn’t get anything did the worst. And his point is, that’s what the placebo effect is. For some reason, when we’re getting injected, even with just saline, we think we’re getting some sort of drug, and it psychologically, or however, affects our bodies. The same thing when we’re taking a pill. I don’t know if you’ve ever seen kids saying, oh, I need medicine, I need medicine. And then the parent gives them an M&M, right, and they think it’s a pill, and they’re happy with it. But actually, the placebo effect can cause real effects on your health. It can make you feel better just because you think you’re taking a drug. And so that’s why it’s super important in all your studies to include a placebo group, or, if you can’t, a comparison group like I described with blood pressure. Because if you just have one group where they’re taking it, they’ll all say it’s good. They would say it’s good if it was water, right. So the placebo is given to what’s called a control group; they receive the placebo. Now, if you’re studying something like acupuncture, you can’t really give placebo acupuncture. 
So what they’ll do is they’ll sort of hang up a little curtain and kind of tap you, and you don’t know whether you’re getting the real thing or what’s called sham acupuncture. Other things like that have to happen when you’re studying these interventions that aren’t pills. Those are called attention controls, right? Where we have, like, a sham acupuncture. So in any case, you’ve got to think about this, because you need a control or comparison group that’s fair whenever you’re testing a new thing in a randomized experiment. I promised you I’d talk a little bit about blocked randomization. I won’t get much into it. But sometimes when you go to randomize, right, you get this whole group of people, they’re all about the same, and you’re going to split them into a Group A and a Group B. One’s going to get maybe a drug, and the other’s maybe going to get the placebo. Sometimes you get worried that the groups are going to be unbalanced with respect to a particular lurking variable. In blood pressure, we’d always care about smoking; we want an equal amount of smokers in each group. A lot of times we care about gender; we want equal amounts of men and women in each group. So if you’re worried about that, with randomization, you can’t just do it one at a time, because you might just randomly put too many men in one group. So what you have to do is block randomization. So see, I drew all these blocks on the screen, and you’ll see that there’s nobody in them. They’re just blank; I just put XXX. So this is before you do your study; you have these blank blocks. And what you do is, as you enroll those people, and remember you have to measure them and make sure that they qualify for the study, as you get them in, you can just write them in the blocks, right. So here, I just put their fake initials. So let’s say that XYZ came in first, that’s a woman, and then maybe NSW came in, and that’s another woman. You just keep putting the women there. 
And then when the men come in, you put them in, and you fill up the blocks. Then here’s the trick: you actually randomize the entire blocks, right? So block one and block three ended up in Group A, and, like magic, you’ve got equal men and women there. And then Group B, equal men and women. And so that’s how you do it with blocks. But, you know, there’s some limitation to this. Like, if you get multiple races in your study, maybe four or five racial groups, and you make a five-block, you’ve got to fill up the whole block before you randomize it. And sometimes you’re in an area where certain racial groups are rare, and you might have trouble filling up your blocks. So there are some limitations to this too. Now, I had mentioned the situation where, if you’re going to do an experiment, right, not an observational study, an experiment, and you’re going to randomize people either to a drug or some sort of intervention versus placebo, or a drug versus another drug, an old drug, you really don’t want them to know what group they’re in. I mean, you have to be ethical. Before they enter the study, you have to tell them you’re going to put them in one of two groups, but you’ve also got to tell them, you’re not going to know what group you’re in while it’s going on. So blinding is where any person is deliberately not told of the treatment assignment, so he or she is not biased in reporting study information. And it actually doesn’t have to be just the participant in the study; it can be researchers. The most common one is a participant is blinded to treatment or placebo. But I’ve worked on studies of, like, Alzheimer’s disease, right? Well, the participants in the study might have Alzheimer’s disease, and they’ll want to look at their imaging, the MRI of their head. 
And often, they’ll also have a neurologist interview them, and they’ll also see a neuropsychologist. And they often want those three different groups, the imaging group, the neuropsychology group, and the neurology group, not to know about each other’s opinion of this particular patient. So they’ll blind them to each other’s opinion. So blinding is much more complicated than just blinding the participant to whether they’re in the placebo group or the drug group. But double blind is a really important concept. And that means that both the participant and the study staff do not know the treatment assignment. So everybody who’s working with the patient doesn’t know it. So you’re probably thinking that’s pretty serious, right? Like, what if that person gets sick and goes to the emergency room, and they’re taking an experimental drug, or they could be taking placebo? Who knows what they’re taking? Well, in that case, what happens is there’s an unblinding procedure. There just has to be, as part of ethics. It’s already set up in the study. If somebody goes to the emergency room, there’s a person that can be called to unblind the participant, who’s now a patient. And once they’re unblinded, they learn what they were taking. Even if they were taking placebo, the whole thing’s over, right? Even the study staff learn. It’s just a fact of life. It has to happen sometimes. But for the most part, what we try to do is keep studies double blind, because it makes things the least biased and the most fair. So, to end the section on randomization: the purpose of randomization, why we go through all this, especially when we’re testing treatments, is that it’s used to reduce bias. And especially if you have a particular variable you’re concerned about, like gender, like we were talking about, or race, or smoking status, you can use block randomization to even out each group. And then blinding further prevents bias, right? 
Because people don’t know what they’re taking, and the study staff don’t know what they’re giving them. And the reason why you have to really think about blinding is that the placebo effect is necessary to take into account. You’re always going to get the placebo effect every time you give somebody something. So you’ve got to account for that in your study design. So in conclusion, I went over the steps to conducting a statistical study in order, and kind of gave you tips on how to remember them. We looked at some basic terms and definitions. And we talked about how to avoid bias in survey design, because there are a lot of different considerations. And finally, we talked more in depth, specifically about randomization in experiments. All right. Now you know a lot, maybe too much. I hope you enjoyed my lecture.

Hi, whoa, it’s me again, Monica Wahi, your statistics lecturer from Labouré College. Now we’re going to go back and cover what I didn’t cover in the last lecture about chapter 2.1, which is frequency histograms and distributions. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state the steps for drawing a frequency histogram. You should also be able to name two types of distributions and explain how they look. You should be able to define what an outlier is, and say one reason why you would make a frequency histogram. Finally, you should be able to define what a relative frequency is and what a cumulative frequency is. Okay, so let’s get started. First, we’re going to review frequency histograms and relative frequency histograms, so you’ll figure out what I’m talking about there. Then we’re going to go over five common distributions in statistics, so you know what that’s all about. And then I’m going to talk about outliers. Now, you’ll notice I have a lot of pictures in this presentation of skylines. And the reason why is they remind me of histograms. So let’s talk about what a frequency histogram is. 
So a frequency histogram is important in statistics because, as you’ll see, you need to make one in order to see what the distribution is. So I’m going to first explain what one is, like, show you what one looks like. And then I’ll explain how to make one. And then I’ll explain the relative frequency histogram. And then we’ll move on to looking at why we need that for distributions. So here’s another skyline, because it looks like a histogram to me. So what is a frequency histogram? Well, it’s actually a specific type of bar chart, and it’s made from data in a frequency table. So you might see a frequency histogram and go, well, that looks like a boring old bar graph. Well, it’s not just any old bar graph. It’s got specific properties that I’m going to talk to you about in this lecture. Okay. Both frequency histograms and relative frequency histograms are bar charts, but they’re special bar charts that have to be done a certain way. And why? Because if they’re done that way, and they’re histograms, they will reveal the distribution of the data, which I’ll explain later. So here is a frequency table. We had this before. This was of those fake patient transport miles, right. So you’ll notice here were the class limits, and then we put in the frequency, and we even threw in this relative frequency. Okay, so this is the frequency table I’m going to use as a demonstration for how you make a frequency histogram. You first need a frequency table. Okay, now, here’s the histogram version of what’s in that frequency table. So I’m going to annotate this one image to explain the order in which you draw it, basically by hand. So the first thing you do is draw this vertical line for the y axis. Okay, you just draw a line. Next, you write words next to the line, and you always start with “frequency of,” and then whatever. In our example, it was patients, okay. And I’m telling you, you need to do it in this order, or you’ll get confused. So you start with that first line, and then you write this frequency label. 
Okay. Next, you draw the whole horizontal line for the x axis, okay. And then after that you write the classes below. Remember, the lowest class is one to eight; that’s the lower class limit and the upper class limit of the lowest class. You literally write those labels in. And why am I freaking out so much about this order? Because I totally get confused if I do not do the y axis first. Because then there are all these numbers, and it’s totally confusing. So just try to do it in this order. Okay. Now, number six. I had to flip the slide here. Okay, at step six, you’ve drawn, like, the basic background. You’ve got the x and y axes and those labels. So now you have to start drawing in the bars. So for your first bar, you look at the first class, and you find the frequency on the table, which I think was 14 or something. And so you look for it on the y axis, and you want to label the y axis so that the maximum one is incorporated in it. Like, you see our maximum is above 20. So we wouldn’t want to end our y axis at 20, or 15, or something. You have to make it bigger, so you can put everybody on there. But our first one was, what, 14? So we draw this horizontal line around the 14, right there, that horizontal line, because we’re going to make that first bar. Then you draw the two vertical lines down, and you position it over where you labeled the class. And that makes the bar. And then you actually color in the bars, and you repeat this for each class, right? So that’s why I labeled the classes first on the x axis, just to make sure everything is even. And then I go through and I make all the bars. And again, this is why you need to prepare your frequency table first, so you know how to graph it; you know what to put on this graph. Okay, this is the relative frequency histogram. You already understand what relative frequency is, right? 
It’s that proportion, the proportion of your sample that’s in each class. And so, if you’re going to do a relative frequency histogram, you basically go through the same steps. It’s just that you’re changing what’s on the y axis; you change what you label it, okay? But the x axis stays the same. And even though you’re charting the relative frequencies, and you’ll be like, okay, this is a totally different number, what you’ll see is the pattern ends up being the same. So it takes on the same pattern, and the pattern is actually what we’re going after; that’s the thing I’m going to talk about with distributions. And so, since the pattern is going to come out the same, I tend to prefer using a relative frequency histogram versus a frequency histogram. Because if I have two different groups, like let’s say there were two hospitals, and I gathered two sets of data, and I wanted to compare the miles transported, then I could use this relative frequency histogram, and not only would the patterns be evident, but I could compare them fairly. Like, whatever’s 0.35, or 35%, in this one, even if the other hospital maybe had tons more transports, I could see it as 35%, and I could really compare the percents, right. So that’s why I lean towards the relative frequency histogram. But ultimately, you’re going to get the same pattern on your histogram whether you use frequency or relative frequency. So again, another picture of a skyline. So you can see why I think of skylines, because they look like histograms, right? So after making a frequency table, which is what you do with quantitative data, because you’re trying to organize it, it’s also important to then make a frequency histogram and/or relative frequency histogram. And why? Because it reveals a distribution. And now, that’s what we’re going to talk about. We’re going to talk about distributions. So first, I’m going to define what I’m talking about with the distribution. 
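To make the “same pattern” point concrete, here is a small sketch that turns a frequency table into relative frequencies and prints both as sideways text histograms. The class limits and counts are invented for illustration (only the first class, 1 to 8 with a frequency of about 14, echoes the lecture’s example), and the helper name `text_histogram` is just a label chosen here.

```python
# Hypothetical frequency table: class limits and how many patients fall in each.
classes = ["1-8", "9-16", "17-24", "25-32", "33-40", "41-48"]
frequencies = [14, 21, 11, 8, 4, 2]

n = sum(frequencies)                         # total sample size
relative = [f / n for f in frequencies]      # proportion of the sample per class


def text_histogram(labels, heights, unit):
    """Draw one row of '#' marks per class; each '#' stands for `unit` on the axis."""
    rows = []
    for label, height in zip(labels, heights):
        bar = "#" * round(height / unit)
        rows.append(f"{label:>6} | {bar}")
    return "\n".join(rows)


freq_chart = text_histogram(classes, frequencies, unit=1)
rel_chart = text_histogram(classes, relative, unit=1 / n)

print("Frequency of patients")
print(freq_chart)
print()
print("Relative frequency of patients")
print(rel_chart)
for label, f, r in zip(classes, frequencies, relative):
    print(f"{label:>6}: frequency {f:>2}, relative frequency {r:.2f}")
```

The two charts come out identical in shape; only the scale on the axis changes, which is why relative frequencies are convenient for comparing groups of different sizes.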
And now you’re going to see a lot of other kinds of pictures like this on the right. See that shape? That’s one of our distributions, okay. And so that’s a little prequel to what I’m going to say. So first, we’re going to talk about what these distributions are. Then I’m going to describe what an outlier is, and how you can detect them by using histograms. Finally, I’m going to wrap it up by explaining what cumulative frequency is and what an ogive is. Okay, so what is this distribution thing I keep talking about? Well, it’s actually just a shape. It’s the shape that is made if you draw a line along the edges of the histogram’s bars. So on the left, you see I drew the scribbly shape. But you’ll notice you can do it with a stem and leaf too. This is not the same data graphed on the right in the stem and leaf; I’m just recycling the old picture that I used before. But you see, you can do the same thing, drawing that squiggly line. And that’s actually the distribution. I mean, they don’t all look exactly like that. But that’s what you do: you draw this line thing. I know, it’s kind of odd that that’s what a distribution is, just a shape. But there are actually five of them that we use a lot. There are way more than five, actually, in statistics, but you have to get into kind of higher level statistics to care about those. We’re only going to concentrate on these five. Okay. So the first one is called the normal distribution. And it’s called that everywhere, except I noticed the book calls it a mound-shaped symmetrical distribution, but I’m going to call it a normal distribution. And there’s nothing really normal about it; it’s just named that for some reason. And then there’s the uniform distribution, the skewed left distribution, the skewed right distribution, and the bimodal distribution. So those are the five we’re going to cover. So let’s start here with the normal distribution. So as you can see, on the right, somebody made a histogram. And then they drew that squiggly line. 
Well, actually, it was me who made this histogram and drew the squiggly line. And notice what the squiggly line looks like. It kind of looks like what the book called it: it’s mound-shaped and symmetrical. That’s the shape of the normal distribution. It’s kind of low on the sides, with a mound in the middle. And if that’s what your histogram ends up looking like, where it’s kind of like a little mountain, then you’ve got a normal distribution. Okay, let’s look at a different histogram. In this histogram, you’ll notice that each of the bars, each of the frequencies, is almost the same, right? It’s either five or six, and it doesn’t matter what class we’re talking about. When it’s like that, the little line you draw across is not squiggly at all; it’s straight. I don’t see this very often in healthcare data, but it does happen more frequently in other kinds of data. This is called the uniform distribution, which makes sense, because almost all of these bars are a uniform height. So that’s what a uniform distribution is. Okay, now, this one is kind of like the one we were looking at before, where it looks like a slide at a playground, where you climb up the right side and then you slide down to the left side. Whenever it’s like that, where it’s low on one side and high on the other, it’s called skewed. The problem is, which way is it skewed? How I remember which way to say it’s skewed is this: it’s skewed where it’s light, or short. So here, I would say it’s light on the left, so it’s skewed left, because on the left side the bars are all short. And then you can just imagine what’s going to come next. Well, look at this: this is skewed right, because it’s light on the right. It’s short on the right. So technically, both of them are just skewed distributions. I just like to explain them separately.
Because sometimes people don’t know which way to call it, left or right. And this is how I remember: light on the left, light on the right. Finally, we have bimodal. Now, the word mode, in some areas of statistics and in engineering, often means a high point. And bimodal means two high points. So as you can see, it looks like a camel with two humps. It’s a little hard sometimes to tell bimodal from normal. Because let’s say you have a normal distribution, but you have just one little short bar kind of in the middle, and you’re like, is this bimodal, or is this normal? How I coach people to see if it’s bimodal is to check whether there’s a really big space between the two humps. That’s not so apparent on this image here, but you’ll see class three and class four are both short. If only one of them was short, I might have called it a normal distribution. I’ve really seen bimodal distributions when it comes to lab data, because my best friend is a pathologist, and he’ll show me situations where people have really super high platelet counts, and then practically no platelets, and there’s nothing in the middle. That’s where you’ll see a bimodal distribution. Now we’re gonna talk about outliers. Outliers are data values that are, quote, very different from other measurements in the data. What’s very different, right? It’s an opinion. But people in statistics come up with different formulas to try and figure out if something is very different from the other measurements. We’ll talk about that in later chapters in the class, not so much for identifying outliers, but just to better understand our distributions. But as a quick and dirty representation of an obvious outlier, one nobody would disagree on, there is this histogram here. You’ll notice I just threw down nine classes. I made up this data.
But you’ll see that at class two and class three there’s just nothing, and there’s nothing in class eight. And then suddenly, there’s something in class one and something in class nine. When you have these big gaps, this is kind of like the platelets I was telling you about, only you might say this one is trimodal, like there are three modes. But there aren’t really three modes, right? There’s a wacky low one and a wacky high one, and everything else is in the middle. So because that one in class one and that one in class nine are so far away from what’s in the middle, just about every statistician would agree these are both outliers. But you can just imagine how much we argue about what actually is an outlier. It’s especially hard when you’re getting data on people’s weight. Some people really do weigh 400, 500, maybe even 600 pounds, and you don’t know if they’re really outliers or data mistakes, or what to do with them. They’re real people, and maybe they have really high weights. And unfortunately, some of them have really low weights too. So one of the main points of doing the histogram is not only to look for these distributions, but also to see if you’ve got any super obvious outliers that you’re just gonna have to think about before you proceed with your analysis. Now, I’m going to talk to you about what cumulative frequency means. The word accumulate means to keep accumulating things. If you have a gutter on your house, it will accumulate leaves: old leaves will sit there and new leaves will keep coming, and the old ones will still be there, until it totally clogs your gutter and you have to clean it. So that’s what cumulative frequency is: it accumulates all the frequencies. So you see on the slide, in the first class, one to eight, we had a frequency of 14.
So your cumulative frequency, those are like the leaves at the beginning of the season; all you’ve got is 14. But then you add on the next class, 21. Now you add to the cumulative frequency. It accumulates: you add that 21 to the 14, and now you’ve got 35. And if you keep going as you walk up all these classes, eventually you get to the total, right? So that’s what you’ve got. The first cumulative frequency is always the same as the first class’s frequency, and each cumulative frequency is equal to or higher than the last one. I’ll have to say, in healthcare we don’t really use cumulative frequency a whole lot. You’ll see it, but we are really into relative frequency, I’ll just tell you that. But some groups are into cumulative frequency, and those who are like to plot it in a plot called an ogive. And again, I’ll be honest: in healthcare, I’ve never seen an ogive in the scientific literature, which is why you’ll see this one is about NFL team salaries, because I think they use it a lot more in economics. But at any rate, what you’ll see is that the classes are along the x axis. You’re used to that, because that’s what we do in a frequency histogram. But along the y axis, you see these numbers called cumulative frequency. And you just graph it. One of the things you’ll notice is that it’s going to go up with each class. If you have a class with zero in it, it’s going to stay the same for that one, but otherwise it’s just going to keep going up. So you’ll always see some sort of shape like this, where it’s always going up, and at the end it hits the top, the total cumulative frequency. So, just to review, there are five main types of distributions used in statistics. And I emphasize main; there are other ones, but these are the ones we’re going to look at. And that’s why we were doing our histograms and our stem-and-leaf displays: we were looking for these distributions.
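The accumulating step described above can be sketched in a few lines of Python. The first two frequencies (14 and 21) come from the lecture's example; the remaining four are hypothetical values added only so the running total has somewhere to go. The sketch uses `itertools.accumulate` from the standard library.

```python
from itertools import accumulate

# Class frequencies: 14 and 21 are from the lecture's example;
# the other four values are hypothetical, for illustration only.
frequencies = [14, 21, 11, 6, 4, 4]

# Running total: each entry adds the current class to everything before it.
cumulative = list(accumulate(frequencies))

for class_number, (freq, cum) in enumerate(zip(frequencies, cumulative), start=1):
    print(f"class {class_number}: frequency = {freq}, cumulative = {cum}")
```

These cumulative values are what would go on the y axis of an ogive, with the classes along the x axis. The last value always equals the total number of observations.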
And also we were looking for outliers. And then finally, I just quickly did a shout-out for your ogive here and your cumulative frequency, so you know what’s up with that. So in conclusion, the purpose of the histogram is to reveal the distribution, and stem-and-leaf displays also reveal the distribution. And then you look for outliers. You’re probably wondering, well, why do we do all this work to reveal the distribution? Well, you’ll find in later chapters that what kind of distribution you have matters for what kind of statistics you can do. You know how I went on and on about the normal distribution? Well, we all really like that in statistics. We’re all really partial to it, because it allows you to do a whole bunch of different statistics pretty easily. However, what often happens in healthcare, because I’ve done it, is you get a skewed distribution, left skewed or right skewed, and then you have to make some decisions, and that makes it a little harder. Also, I’ve had a bimodal distribution before. I’m remembering that one; that was kind of an issue, and I had to figure that one out. So that’s roughly why we have to go through this chapter and figure out how to do these distributions. And then later, I’ll explain to you what you do with that knowledge.
Hello there, it’s Monica Wahi, Laboure College statistics lecturer. We’re going to circle back now to chapter 2.2 and talk about these other graphs. I’m doing things a little out of order, because it makes sense to me. I hope it makes sense to you too. Well, for this lecture, we’re going to have these learning objectives. When you’re done with this lecture, you should be able to describe a case in which a time series graph would be appropriate, you should be able to explain the difference between what would be graphed on a bar graph versus a time series graph, and you should be able to describe the type of data graphed in a pie chart.
And you should also be able to list two considerations to make when choosing what type of chart to develop. Alright, so let’s get started here. What I’m going to be doing in this lecture is, first, I’m going to explain what a time series graph is. Then I’m going to talk about a bar graph. And of course, I’m going to show you roughly how to make these. I’m gonna explain a pie chart and how to make that. And then I’m going to go over a review of all the graphs I’ve talked about for chapter two, and just summarize when to use what type of graph. So let’s start with the time series graph. And actually, the word time is the key. We’re going to talk about this time series graph and what time series data are. As you can see by this little example, time is across the x axis, and that’s kind of a hint for where we’re going. Then I’ll show you roughly how to plot one. And I’ll explain why we have these time series graphs, how you interpret them and why you even make them. So, of course, I’m an epidemiologist. So what am I into? M&M: mortality and morbidity. So here’s a nice time series graph, a wonderful graph of the percentage of visits for influenza-like illness reported by the US Outpatient Influenza-like Illness Surveillance Network, by surveillance week, from October 1, 2006, through May 1, 2010. And you’re like, oh, time? Yeah, that’s the deal. Time series data are made of measurements for the same variable, for the same individual, taken at intervals over a period of time. Only, in the example here, the individual is not a person, right? Because remember, individuals are just what you’re measuring variables about. Here, the individuals are actually weeks, because every week they’re making a measurement. So like I said, time series data are made of measurements for the same variable, which here is the percentage of visits for influenza-like illness.
So every week they went to, well, I don’t know what clinics are in this Outpatient Influenza-like Illness Surveillance Network, but let’s just pretend there are 10 clinics in there. Each week, these clinics have to go in and say, yeah, I had, for example, 100 visits this week, and 10 of them were for influenza-like illness. So then that would be 10% that week for that clinic. Well, they got all the clinics together, and they found out what the percents were. And you can see on the y axis, there’s the percentage, and then you see on the x axis all the weeks in the year. So you’ve seen these before, right? You especially see it with the stock market. You go on Yahoo and look at your favorite stock, right? You know, we’re all so rich, we own so much stock, and so you track your favorite stock that way. Personally, I spend more time looking at mortality and morbidity, things like influenza, but hey, after I get some money, I’ll be looking at stock market prices. So when we see time series data graphed in these time series graphs, it’s often about things like influenza rates. Other rates you’ll see are life expectancy and rates of heart attack. That’s usually what we see, because we’re trying to affect those rates, and we’re trying to see if they’re going up or down. So I’m going to roughly go through how you make one, if you ever wanted to. The first thing you need is a table, kind of like the one on the right. I just made up these data; they don’t mean anything. But roughly what you need is a column that says, in this case, year. The influenza people put week, but you have to put regular time increments in the first column. And then you have to put the variable measured at that time in the next column. So let’s say it’s today, and you’re like, oh, I want to measure how many times I went to the gym each week over the last few months. Well, you’re gonna have to reconstruct that data, right?
Like maybe from your memory or your calendar. So normally, when you’re going to do time series stuff, you start and you collect the data as you go along, and then it’s nice and accurate. Okay, so let’s say you did that, and you managed to get some time series data together. Then how do you plot it? Well, the first thing you do, and I’m using this influenza thing as an example, is you draw a horizontal line and you make that your x axis. Now, you gathered your data based on years or weeks or something, so you can label those time periods there, because you already know them. So you just label that x axis. Then you draw the vertical line for your y axis. And again, you’ve done all your measurements, right? So if you were measuring how many times you went to the gym per week, maybe once a day, then seven would be the maximum, right? So you’d want to make sure your y axis is tall enough to get that seven in, if you had a good week there. That’s really what you’re looking for in the y axis. You don’t want it too tall. You see the highest point they have, ooh, in 2009 they had an outbreak there, and they needed to make sure that the y axis was tall enough so that they could graph that. But other than that, you don’t want it much taller. And then make sure you label it. I’m big on labeling, because otherwise people get confused. Okay, now we’re going on to the next step, and this is where you get into actually putting in your data. Now, because there were so many weeks, like if you look at 2007, that stretch of the x axis is only about two inches wide, and all 52 weeks of 2007 were plotted in there. So it literally looks like a super smooth line. But honestly, what they did was they put each point in separately, and then they connected the dots. And that’s why it looks so smooth.
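The table-then-plot steps above can be sketched in Python without any plotting library. Everything here is a hypothetical illustration: the weekly counts are made up, the week labels (such as `2009-W38`) are an assumed format, and the helper names `to_time_series` and `y_axis_top` are not from any library. The sketch turns pooled clinic counts into a percentage per week, puts the points in time order so they can be connected left to right, and picks a y-axis top that is just tall enough for the highest point.

```python
# Hypothetical weekly reports pooled over all clinics:
# (week label, total visits, visits for influenza-like illness)
weekly_reports = [
    ("2009-W40", 1000, 40),
    ("2009-W38", 1000, 25),
    ("2009-W39", 1200, 36),
    ("2009-W41", 800, 60),
]

def to_time_series(reports):
    """Return (week, percent) points sorted by week, ready to connect in order."""
    points = [(week, 100 * ili / total) for week, total, ili in reports]
    return sorted(points)  # zero-padded week labels sort in time order

def y_axis_top(values, step=1):
    """Smallest multiple of `step` at or above the largest value."""
    return step * -(-max(values) // step)  # ceiling division

series = to_time_series(weekly_reports)
y_top = y_axis_top([percent for _, percent in series])

for week, percent in series:
    print(f"{week}: {percent:.1f}%")
print(f"y axis runs from 0 to {y_top:g}")
```

Sorting first matters because a time series line is drawn by connecting points in time order; the axis top is kept close to the maximum so the line is not squashed into the bottom of the plot.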
If you only have a few points, and you have a wider x axis, it’ll be choppier. It’ll look a little bit more like the stock market graphs that go up and down, up and down, and kind of look like a roller coaster and not so smooth. But if you have a lot of points and you mush them together, it ends up looking really smooth. I also just wanted to point out that you can have more than one line on the graph, for more than one set of data values. Like here, they’re comparing, I don’t know, some sort of book performance, how much it sold in the US versus Canada. You just have to make sure that you have a legend if you do that, so people can tell the lines apart. So to summarize, time series graphs are useful for understanding trends over time, like whether things go up or down. Like you saw on that influenza chart, we could see when there apparently was kind of an epidemic or an outbreak. And graphing more than one set of time series data on one graph, like you saw in the last graph, can help in comparing the differences between the datasets. I worked for the US Army, and there are a lot of problems with people getting injured in the army. So I made a lot of time series graphs of rates of injury over the years, because we were trying to do things to make the rates of injury go down. And that way we could see if the trend was there, that we were actually making them go down. So that’s the main goal of these time series graphs. Now, I’m going to move on to talk about a bar graph, which can display quantitative or qualitative data. I’m going to start with the features of the bar graph; here’s just an example on the right. I’m going to talk about how to make one, and then we’re going to talk about what happens when you change the scale, meaning the y axis, like how tall the y axis is, on a bar chart, because it really changes things. I call it a bar chart sometimes, or a bar graph. They’re really the same thing.
I don’t know why they chose graph in the book. But then finally, I want to do a little shout-out to what Pareto charts are. We don’t really use them much in healthcare, but I still wanted you to know about them. Alright, so let’s look at the features of a bar graph. The first thing you want to know is that the bars can be vertical or horizontal. So even though I’m showing you this vertical example, don’t be thrown off if you see a horizontal example. Regardless of whether they’re vertical or horizontal, the bars are supposed to have a uniform width and uniform spacing. They can’t be wider or skinnier, and they have to be spaced apart evenly. I’m gonna use, like I said, this big one here as an example to talk about bar graphs. I just want you to notice what is being graphed here. This is the percentage of people in the US not covered by health insurance. It’s split up by race and ethnicity, and it’s looking at the years 2008 through 2012. Which is, like, bad, right? You want people to have health insurance. Okay, so item three here says the length of the bars represents either the variable’s frequency or its percentage of occurrence. I’ve circled percentage, because that’s what we’re looking at in this one. We could have looked at, you know, number of visits at a healthcare clinic, and that would be frequency, right? But we happen to be looking at percentage here, so I just wanted to call that out. So you’ll see then, on the y axis, we have the measurement scale. And as long as we write it there, and we use that same measurement scale for graphing each of the bars, we will be fulfilling item four, which is that the same measurement scale is used for each bar. I don’t know why anybody would do it any other way. But that’s part of the features of the bar graph.
Now, this is a feature that really is my pet peeve. I get so irritated when I find a bar graph, or any other graph, where things are not labeled. I get totally confused. So you really want to put on a title, and you need to put the bar labels at least on the x axis, right? See how it says white alone, black alone? You wouldn’t even know what those bars were unless somebody put something there. And some people also add the actual values for each bar. I’ll do that if there’s space, like there was space here. If it gets too busy, I don’t do that, because you can kind of see the values from the graph. Now, you’re probably kind of having a flashback. You’re like, this looks totally like a histogram. What is the difference? Well, I started by talking to you about histograms, but they’re actually a special case of a bar graph, right? So bar graphs are more general, and the histogram is a specific type of bar graph. Histograms are bar graphs that must have classes of a quantitative variable on the x axis. So you can already see that the bar graph I’m showing you is not a histogram, because it has categorical, qualitative things on the x axis. It doesn’t have classes, right? Also, histograms must have frequency or relative frequency on the y axis, and as you can see, this is a percentage of something. So it’s not that either. So this isn’t a histogram. But whenever you make a histogram, you’re just making kind of a special bar graph. I just wanted to point that out so you weren’t confused. Now, I said I was going to warn you about what goes wrong when you change the scale. What I mean by changing the scale is this: when you look at that y axis, notice how the top of it, the way this person made it, is at 35, or 35%. But notice that the highest group without health insurance, which is unfortunately those of Hispanic origin, is close to 30. It’s not all the way up to 35.
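Here is a minimal sketch, in Python, of why the top of the y axis matters. The three percentages are hypothetical stand-ins (they are not the values from the slide), and `fraction_of_axis` is just an illustrative helper name. It works out how much of the axis each bar fills when the same data are drawn with the axis topping out at 30, 35, or 70.

```python
# Hypothetical percentages uninsured by group (not the values from the slide).
uninsured = {"Group A": 29.0, "Group B": 15.0, "Group C": 11.0}

def fraction_of_axis(values, axis_top):
    """Share of the y axis each bar fills when the axis runs from 0 to axis_top."""
    return {label: value / axis_top for label, value in values.items()}

# Visual gap between the tallest and shortest bar, per choice of axis top.
gaps = {}
for axis_top in (30, 35, 70):
    filled = fraction_of_axis(uninsured, axis_top)
    gaps[axis_top] = filled["Group A"] - filled["Group C"]
    print(f"top = {axis_top}: tallest bar fills {filled['Group A']:.0%} of the axis, "
          f"gap between tallest and shortest = {gaps[axis_top]:.0%} of the axis")
```

The data never change, but the shorter axis makes every bar, and every difference between bars, take up more of the picture, and the taller axis shrinks them.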
So I’m not exactly sure why they made it so high. So I wanted to see how the shape of these bars would change if I actually made the top 30. So I regenerated this, and you’ll see what happens. See, it’s the same data. I just made the top 30. It’s kind of subtle, but suddenly all the bars look bigger, right? So if I were some advocate running around saying this is terrible, these people don’t have insurance, I’d like to look at the one on the left more than the one on the right. But, you know, in a way, that’s a little misleading, right? It’s the same data. So the differences between bars are a little bit more dramatic when we change the scale to be shorter. But let’s go the other way, and this is where I see people do things a lot. See how the top of the y axis is 35 right now? Let’s double that. Let’s just make it 70, and then let’s see what happens. As you can see, the differences between the bars look small, right? The difference between that big Hispanic origin one and the lower white alone and Asian alone ones isn’t really that big anymore. So my opponents would rather look at that graph. In fact, everything looks kind of small on that graph. It’s like, oh, there’s no problem with insurance. And that’s, you know, when people talk about lying with statistics, so to speak. These are the kind of tricks people do to try and change how things appear. And the best way to do it is to just do what I suggested: look at the next one up from your tallest one, and use that as the top of your y axis. What I would have to do with the army is this: I was looking at rate of knee injury, and also rate of ankle injury, but knee injury was way more common. And so if I wanted to compare the two, I always used the same scale, because otherwise, people wouldn’t be able to see that the ankle injury was really, really low.
Compared to the knee injury, even though they’re both important. So, all in all, with a taller y axis, the differences between the bars look less dramatic, and also, the taller you make your y axis, the smaller all of the bars look. So you’ve got to be really careful. I don’t think you would do that, but other people do that when they’re trying to make their points. So just be careful about that. Also, a term that was mentioned in the book is clustered, as in clustered bar graph. It’s not that complicated. It just means more than one bar is graphed for each category. You’ll see that the last one I did was just on one topic. And here, if you look at this one on the right, and of course I mixed it up a little and did the horizontal version, this is life expectancy at birth. You’ll see that there are three sets of bars, right? There’s both sexes together, and there’s a bunch of bars for that. And you see the legend: Hispanic, non-Hispanic black, non-Hispanic white, and then they mix them all together as all races and origins. And then they also have separate sets of bars for male and female. So this would be clustered. And if you do that, you really need a legend so people can tell what’s going on. You’ll also notice that, you know, life expectancy is good if it’s high, right? You want to live to be 80, 90, 100. But if you look at the bottom of the slide where we have the x axis, I mean, if we started at zero and just made it all long, it would not even fit on the slide. So what they’ll do is they’ll make these little hash marks with this little squiggle, to indicate that they just skipped ahead. But like I said in the first part of this, if they skip ahead on the female one, they have to skip ahead on all of them. Right, so everything is skipped ahead there. This is a fair comparison. It’s like we’re fast-forwarding through the movie up to about 50.
And then looking at the differences there, because everything’s the same up to that. So that’s just another thing about scale: notice whether it’s clustered, whether you’ve got a legend, and also look for the squiggle. Okay, now I’m going to give a shout-out to the Pareto chart. And you probably already noticed we don’t really use these much in healthcare, because this example is about causes of an engine overheating. Well, we don’t do that a lot in healthcare. And you’ll see I kind of slapped a label on the y axis, the word frequency, okay? So, you remember how I was saying the histogram is a special bar chart, or bar graph? Well, the Pareto chart is a different kind of special bar graph. In that one, the height of the bar indicates the frequency of an event. If you look at these events here, like damaged radiator core, that happened 31 times, right? And that happened more often than faulty fans, which only happened 20 times. So what they do is they figure out what happened the most, the second most, the least, whatever, and they deliberately arrange the bars in order, left to right, according to decreasing height. It’s a way of sort of zeroing in on what is the most important problem you’re finding. So it’s really meant to graph frequencies of problems. I’ve actually only ever seen one Pareto chart in healthcare so far, and I really looked for one. It was about bad things that can happen in a nursing home. I remember the tallest bar was for falls, right? People fall in a nursing home. And then there was a smaller bar for medication errors; that happens. I think the reason why we don’t use these a lot in healthcare is this: let’s pretend this 31, instead of damaged radiator cores, was 31 falls. Well, the first thing you’d probably ask is, well, how many people are in that nursing home?
You know, and how long did you collect data for, right? 31 falls is pretty bad. But it’s not bad if you have hundreds of people over 10 years and all you get is 31 falls. Then you’re doing pretty well. So I would say that the reason why we don’t use Pareto charts a lot in healthcare is that they sort of leave out some important information about these serious events. And so we like to look at things in different ways. So just to summarize about bar graphs: bar graphs must be made following a few rules. I talked to you about how you have to keep the width the same, and how you have to label the axes so we know what you’re talking about. Because you can visualize both quantitative and qualitative data using a bar chart, these labels become really important, as do scales, right? I showed you how you can change the scale and make things look different. So you want to be careful and be cognizant of that. And also, I did a shout-out to Pareto charts, and I explained why I think they’re not used that much in healthcare. Now we’re going to jump into pie charts. You know, just even the thought of a pie chart makes me hungry. Doesn’t it make you hungry? So here’s what a pie chart is. They’re also called circle graphs. They’re used with counts or frequencies that are mutually exclusive. That sounds really fancy, but all it means is that every individual can only fall in one category. So I’m going to give you the example on the right, which is actually from a real report you should probably read. It was a survey that was done by the Massachusetts Nurses Association, and they got 339 nurses to fill out the survey. One of the questions was, do you receive annual bloodborne pathogen training? Now, the answer is only going to be yes or no. They can’t say yes and no. That is what mutually exclusive means: you can only give one answer. So as you can see, 234 people said yes, which is good.
And 105 said no, which is bad. I’m worried about that. But these pie charts are often made in graphing programs, because they’re a little difficult to do by hand, and I’ll explain to you why. And unlike Pareto charts, these are super common in healthcare, as you can see right there on the slide. So let’s look at the features of a pie chart. I actually just made up this fake pie chart. I pretended I had a class where I gave a five-point quiz, right? And the reason why I did that is I wanted to show you how to do it with a quantitative variable. Because remember, the last one was yes or no, and that’s qualitative. Those are the answers that the nurses could give to that survey question. Well, this is a different one. This is where I actually put fake students into classes by their points on this quiz. You see zero points, one to two points, three to four points, and five points, right? So regardless of whether you’re doing yes/no, you know, qualitative categories like that, or you’re doing classes like this, every individual in your data must be in only one of the categories, only one of the classes, kind of like frequency tables and histograms. Everybody gets one vote. And that’s really important in a pie chart, even though it can be used with qualitative or quantitative variables. You’ll see later what I mean by that. And so here is just a fake example I made of how you would make a pie chart out of a quantitative variable. So I’m just gonna briefly go over how you would do this by hand, and I’m realizing I’ve never done this by hand. I always use Excel, as you probably recognize from that lovely purple color, which comes out of Excel. But if you were going to do it by hand, I guess you’d have to go buy one of those things in the lower left, which is a protractor, because that helps you see the degrees of a circle. Remember, a whole circle has 360 degrees, right?
I don’t know if you remember all this from trigonometry. But then a half circle would be 180 degrees. And so that’s how you figure out how big a piece of the pie you need, using this protractor. So if you’re going to make a pie chart by hand, you first have to make a table. You’ll see we make tables constantly in statistics. I put class in the first column, because I was doing one that required classes, since it’s quantitative. If you were doing that one with the nurses saying yes or no, you would put category, and you’d just say yes, no, and then total. Then of course, next, you put the frequency. And I always put a total to add it up. My fake class apparently had 37 people in it, so I just want to make sure everything adds up. The next step will remind you of relative frequency. It’s where you figure out the proportion of the circle that each class is going to take up, right? So see the seven people who got five points? Well, if you divide seven by 37, you’re going to get 0.19. Well, I like percents, so that’s 19%. So that tells you what proportion of the circle they get, right? And then finally, in the last column, remember how I was telling you the whole circle is 360 degrees? You take that proportion you got, and you multiply it by 360 to figure out how many degrees of the circle that slice takes up. And that’s why you need the protractor. And that’s also why I always use Excel for this, because it makes it so you don’t have to worry about those things. All you would need for Excel is actually just the classes or the categories, and the frequency. And then if you use their automatic pie graph function, you can get all this other stuff out very quickly. So I just wanted to make a few notes about pie charts. The thing I’m coming back to is this idea of mutually exclusive categories. So I want you to imagine that I do a survey, right.
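The by-hand pie chart table described above (frequency, then proportion, then degrees) can be sketched in Python as follows. Only two numbers come from the lecture: the 7 students who got five points and the class total of 37. The other three counts are hypothetical, chosen so the total is 37, and `pie_table` is just an illustrative name.

```python
# Quiz-score classes and counts. Only the 7 students with five points and
# the total of 37 come from the lecture; the other counts are hypothetical.
frequencies = {"0 points": 2, "1-2 points": 10, "3-4 points": 18, "5 points": 7}

def pie_table(freqs):
    """Build rows of (class, frequency, proportion of circle, degrees of circle)."""
    total = sum(freqs.values())
    rows = []
    for label, count in freqs.items():
        proportion = count / total          # same idea as relative frequency
        degrees = proportion * 360          # a whole circle is 360 degrees
        rows.append((label, count, proportion, degrees))
    return rows

table = pie_table(frequencies)
for label, count, proportion, degrees in table:
    print(f"{label:>10}: n = {count:2d}, proportion = {proportion:.2f}, "
          f"slice = {degrees:.1f} degrees")
```

The slices add up to the full 360 degrees only because each student is counted in exactly one class, which is the mutually exclusive requirement for a pie chart.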
And I ask the question, what is your favorite color? And I give some choices like red, green, blue, whatever. There’s only going to be one answer from everybody to that question, right? Because you can only have one favorite, right? And that then is eligible to be used in a pie chart, because everybody gets one vote. But a lot of times, I’ll see people who do a different survey question. They’ll say, check off all of the colors you like. So if I get that, I’m like, oh, I love red, I like orange, I like green; I’m checking off a bunch. There are some people I know who don’t really like color, like they just wear gray and black, so they probably wouldn’t check off anything. And then there are the people who just check off one or two. Well, as you can see, people can have multiple votes or no votes or whatever. And if you have that situation, like I was telling you, where people can say multiple things, you’ve got to go into bar graph land, okay? Because a whole bunch of people can like red, a whole bunch of people can like green, a whole bunch of people can like blue, and you won’t get a circle out of that. If everybody gives just one answer, and so therefore everybody’s in a mutually exclusive category, then you can use the pie chart. I also wanted to let you know that I find it, and I think a lot of people do, more informative to put the percentage on the actual chart than the frequency. Some people put both the frequency and the percentage, which is good. It’s not so helpful to just put the frequency, as you see that the nursing report did on the left. And it’s because you really don’t know; you know, 234 seems like a lot, but what proportion is that of the circle? That’s what you would kind of want to know. Whereas if you look on the right at mine, you can see, for instance, only 5% got zero points. That’s a small amount, right? You know what 5% means. It’s just hard to tell if you look at that one on the left. It looks a little like two thirds, which would be roughly 66%.
But we don’t know what the percent is, right. And so it’s really helpful to have that percent. And always include a title and a legend. Because if you’re graphing a pie chart, you’re gonna have more than one category, and so people are gonna want to know what each color means.
This looks so good. Doesn’t it look good? Um, pie charts are common in healthcare, and they graph mutually exclusive categories. Okay, so you’ll see this all the time. And like I said, it’s easier to make one using software. I use Excel. It can come out of other software, but I just like Excel because you can really put fancy labels on and you can do that squiggle thing. But choosing a graph requires some consideration. Whether or not you actually want to make a pie chart or a bar chart or whatever requires some thought. And also, regardless of the chart you make, you should follow these rules. You should always provide a title, okay? Even if it’s just for your private use. Trust me, I’ve done this: I go back and I’m like, I don’t even know what I graphed. So take your time, sit down, write a little title, so you remember what you graphed. Also, label the axes. Because, again, you think you’re going to remember, or maybe you think it’s obvious and everybody in the audience is going to be able to tell. Don’t leave anything to be assumed; just be absolutely clear about what’s on each axis. Always identify your units of measure. So if you’re talking about a rate per 10,000 people or a percentage, or maybe you’re talking about an average, or you’re talking about a frequency, it doesn’t matter; just make sure you’re clear about what you’re talking about. The units of measure usually end up on the y axis. So the thought is to make the graph as clear as possible, thinking about font size, thinking about the number of items graphed. You know, I’ve sometimes seen a bunch of time series graphs where they put so many lines on there, I can’t even see anything. Or they’ll have these really tiny font sizes.
Or they’ll just try to put too much on one graph, and it’s hard to read. So if you have trouble reading it, probably everybody else will, and you want to modify it. So I just threw this on the right. Can you tell what’s missing from the above graph? The above graph is really missing a lot of information. I mean, we don’t even know what it’s about. We can kind of guess it’s a time series graph because of the time at the bottom. But what else, right? So the person who made this really knew what they were talking about, but we don’t, and you don’t want that to happen to your graph.
Okay, so here, what I’m going to do is review all the different graphs I’ve talked about in chapter two, and talk about the cases where each graph is useful, so you can keep them straight in your heads, why we have all these graphs, right. So first, there’s the frequency histogram. Remember that that was only for quantitative data. And that’s what you make when you want to see the distribution, right? Remember, the distribution was a shape. And a frequency histogram is a particular type of bar graph that is meant for showing these distributions. I also showed you how to make a relative frequency histogram, which is almost the same thing, only it graphs the relative frequency instead of the frequency. And that also will show you the distribution, right, because the pattern will be the same. But this one’s specifically good for comparing to other data. So if you have two sets of data, maybe from two different locations or two different groups, then you want to use the relative frequency histogram, because then it’s easier to compare distributions, right. I also showed you how to make a stem and leaf display. I explained what the leaves are, and what the stem is. And that’s also for quantitative data. And that’s also for when you want to see the distribution. It’s also good for organizing the data, and it’s a little easier to make by hand than a histogram.
Because a histogram makes you make a frequency table first, and with a stem and leaf display, you can kind of skip that step. So again, these first three were just about trying to take quantitative data and visualize it so you can look at distributions and also look for outliers. Next, we went into the time series graph. And that is really about time, right? That’s for graphing a variable that changes over time and is measured at regular intervals, mainly to see trends, like: is it going up? Is it going down? Was there an epidemic? That’s what a time series graph is for. Then there’s the bar graph. Now, this is the generic bar graph, not the specific histogram like I described. The generic bar graph can be used for qualitative data or for quantitative data. And it can be used for displaying frequency or percentage, and we went over some examples. Then I gave a shout-out to the Pareto chart, which is a special bar graph, right. And that special bar graph graphs frequencies of rare events in descending order, usually bad things, you know, rare bad things. And again, we don’t really use this much in healthcare. Finally, I went over the pie graph. And that’s for mutually exclusive categories, quantitative or qualitative. And we use those a lot in healthcare.
So in conclusion, in this particular lecture, I first went over time series graphs and explained how they show changes over time. And then I went over bar graphs and showed you how they can display quantitative and qualitative data. They can be up and down or horizontal. I showed you some different examples. And then we went through pie charts, looking at mutually exclusive categories, which I think are my favorite. Like, look at this pie. This makes me so hungry. Um, but in the end, it’s important to pick the right chart, because you want to have a useful visualization of your data. If you’re trying to look for a distribution, or if you want to instead look for trends over time, choose the right kind of visualization, the right kind of graph.
You’ve got to choose the right kind of chart for the job. So I gave you some pointers on how to do that. And now my mouth is watering, so I’m gonna go eat some pie.
Yoo hoo, it’s Monica Wahi again, your statistics lecturer from Labouré College. I decided to chop up chapter two and reconfigure it. So this first lecture is going to be on part of chapter 2.1, frequency tables, and the entire chapter 2.3, which is stem and leaf displays. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state the steps for making a frequency table and define class, upper class limit, and lower class limit. You should be able to explain what relative frequency is and why it’s useful for comparing groups. Also, you should be able to state the steps for making a stem and leaf display. And finally, you should be able to describe the difference between an ordered and an unordered leaf. And if all that sounds foreign to you, don’t worry, you’ll understand it all at the end of this lecture. So just to introduce what I’m going to cover: first, I’m going to define for you what a frequency table actually is. And then I’ll explain to you how to make one, which will help you understand even better what it is. After that, I’m jumping right into what a stem and leaf display is and how to make one of those. And the main reason why I combined these is because I feel like stem and leaf displays can help you make frequency tables. That connection was not really made in the book, so I’m making it here. So let’s just start with the frequency table. So what is one of those? Well, you know, when I think of frequency, I think of the radio, right? Like, I think of R.E.M., “What’s the Frequency, Kenneth?” I think that was their last hit. Okay, that’s not what we’re talking about. We’re talking about frequency like the word frequently, like, how frequently do you go to work per week, right? And you would count how many times you go to work or go to class per week.
Well, frequency is like frequently; it’s how frequently something happens. So first, I’m going to explain to you what a frequency table is and why you make them. Then I’m going to define some more terms. I just defined frequency; I’m going to define some more that you’re going to need to know. And then I’m going to explain the steps for making a frequency table and a relative frequency table. So, I’ll just remind you: qualitative data are categorical. So that’s like gender, race, diagnosis, where you put individuals into categories. And quantitative data are numerical. Remember, like age, heart rate, blood pressure. Now, I just want to calibrate you to the idea that this whole frequency table thing is about quantitative data. And so this entire lecture actually is focusing only on quantitative data and not qualitative data, all right? So when you have quantitative data, as you’ve probably noticed if you’ve ever had it, right, like, let’s say you go on Yelp, you know, I always give that example, and you try to decide whether to go to a restaurant or not. You have a bunch of five, four, three, two, and one stars. How do you know? You just have a pile of numbers. So how do you organize them? I’m going to give you a totally fake example I made up. Okay, so I’m pretending that 60 patients were studied for the distance they needed to be transported in an ambulance. So, how far they needed to be transported from where they called the ambulance and were picked up to where they actually got to the hospital. So the shortest transport in my fake data, the minimum, was one mile, which is awesome. That’s kind of what would happen to me, because I live right near a hospital. Hopefully, I don’t need to be in an ambulance very often. But that’s what happens in urban centers. The longest transport, the maximum, was 47 miles, which would really suck.
And I just want to point out that that happens to people in rural areas because of lack of access. So this is kind of realistic, even though it’s fake data. But anyway, it’s hard to just look at a pile of numbers. So how do we understand these data? Well, now I’m going to start those definitions. The word class means an interval in the data. And remember, we’re talking quantitative data. So let’s say I just asked, well, how many people got transported between 30 and 40 miles, okay? That would be a class of 30 to 40, right. And the class limits are the lowest and highest values that can fit in the class. So carrying on with my example of a class, I just randomly picked 30 to 40. If we made that a class, we would say 30 would be the lower class limit, and 40 would be the upper class limit. Make sense? Alrighty. So then, of course, you have the width of the class, or the class width. So that’s how wide the class is. So carrying on with the example, if the upper class limit was 40 and the lower class limit was 30, what you do is you subtract 30 from 40, which gets you 10. And then you add one, and it equals 11. That’s a little formula. But if you’re like me, and you count on your fingers, you would go 30, 31, 32, 33, 34, blah, blah, blah, and you’d realize that there are 11 numbers in that. Now we get to frequency, like I sort of quickly explained, and that is how many values from the data fall in the class. So, how many patients were transported 30 to 40 miles? Or another way of saying it is, if you look in all the data you have, and you find every single person that got 30, 31, 32, 33, blah, blah, blah, up to 40, and count all those people up, then you will get the frequency for that class. Okay, but you probably realize you do need to decide on classes before you go counting frequencies, because you need to know the lower and upper class limits. So let’s talk about some rules about classes.
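Before getting to those rules, here is a minimal Python sketch of the three definitions so far: class limits, class width, and frequency. The list of distances is hypothetical (the lecture does not list the 60 fake transport values), so only the 30 to 40 class and the width of 11 come from the lecture itself.

```python
# Hypothetical transport distances in miles, made up purely for illustration.
distances = [1, 12, 30, 31, 35, 40, 40, 41, 47]

# The class "30 to 40" used as the running example in the lecture.
lower_limit, upper_limit = 30, 40

# Class width: subtract the lower limit from the upper limit, then add one.
class_width = upper_limit - lower_limit + 1

# Frequency: how many values fall in the class, counting both limits.
frequency = sum(lower_limit <= d <= upper_limit for d in distances)

print(class_width)  # 11
print(frequency)    # 5
```

With these made-up values, five distances (30, 31, 35, 40, 40) land in the class, and the width matches the finger-counting answer of 11.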
First of all, classes have to be the same width. You can’t have 30 to 40, and then 40 to 42, right, or 41 to 42, right? You can’t have skinny class, fat class; they have to have the same width. But, um, there are different ways to pick it, right? So, class width can be determined empirically. Isn’t that a fancy word? Empirically just means you just choose it because you like it, right. And if you ever look at survey data about just about anything, when they look at the quantitative variable of age, they often put that in classes. And as you’ll see on the slide, these are the classes we often see: 18 to 24, 25 to 34, 35 to 44. And you can go on, right; that’s what you normally see. And that means, empirically, you just picked it out of a hat. And already, you’re probably noticing, well, 18 to 24, and 65 and older, those classes aren’t really equal to the ones in the middle, right? Like, what’s the upper class limit for 65 and older? Okay, well, that’s just normally what happens in the world, and especially in healthcare. In healthcare, when you pick classes, even though the classes are technically supposed to be the same width, you really should be guided by the scientific literature. And you’ll see why later, when I show you the other videos in this chapter. It’s because you really want to be able to compare whatever you find to whatever other people have found before you. And therefore you don’t want to cut up your classes in different ways, or it’s hard to compare them. However, in the book, they teach this class width formula, so I thought I should really show you that, too. So here’s the class width formula that I don’t really see used much in healthcare statistics, but I’m going to teach it to you anyway. So this is the formula.
First you calculate this number: you find the maximum in your data and the minimum in your data, and you subtract the minimum from the maximum. So in the example I was giving from the fake data about the transport, 47 was the maximum, and the minimum was one. So I did the first step and got 46. Okay, looking back at the formula, you divide whatever you got there by the number of classes desired. In other words, however many, you know, categories you want, right. So, you never want too many, like you don’t want 10 or something. You know, 3, 4, 5, 6, 7, usually something in that range is a good number of classes. So let’s pick six just for fun. So we’ll take that 46 we got, we divide it by six, and we get about 7.7. Then, back to the formula: how you decide your class width is you increase this number to the next whole number. Now, a lot of people are confused by that, because even if I’d gotten something low, like 7.1, I’d still go up to eight. You have to increase it up to the next whole number, so you have an integer, you know, a number without any decimals after it. So you have this integer for your class width. So our class width in this example would be eight. So, um, now I’ve described to you that whole class width formula, but I’m not going to use it in the example, because we don’t really do that much in healthcare, and it makes it actually kind of hard to understand. You want something that’s a little intuitive. Like, if you look on the slide right now, you know, less than 20 miles, 21 to 29, 30 to 39, and then 40 or more, that makes a little more sense in your head. You know, that’s how we think of miles. If I had put like 18 to 24, and 25 to 29, you know, we don’t really think that way. So it’s helpful in healthcare to boil it down to something like this.
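The class width formula described above can be sketched in a few lines of Python. The function name `class_width` is just a label chosen here, and using `math.ceil` is an assumption about what "increase to the next whole number" means; the lecture does not say what to do when the division already comes out to a whole number, and `math.ceil` leaves such a value unchanged.

```python
import math


def class_width(minimum, maximum, n_classes):
    """Range divided by the number of classes, increased to a whole number."""
    raw_width = (maximum - minimum) / n_classes
    # Assumption: "increase to the next whole number" is treated as rounding up.
    return math.ceil(raw_width)


# Fake ambulance transport data from the lecture: minimum 1, maximum 47, six classes.
width = class_width(1, 47, 6)
print(width)  # 46 / 6 is about 7.7, which goes up to 8
```

The same rule sends 7.1 up to 8 as well, which is the point the lecture makes about always going up rather than rounding to the nearest number.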
And by the way, if I was writing a real paper with real data, I’d be looking at the papers before this that talked about transport times and looking at those class limits. Okay, so a frequency table displays each class along with its frequency, the number of data points in each class. As you can see, the class limits are on the left side of the simple frequency table, you know, the classes, and then the frequencies are on the right side, right. And you’ll notice that they all add up to 60, because we measured 60 fake patients, and it’s really good to do that little check. Because you don’t want to double count people and put them in two classes; they only get to be in one, etc. So selecting arbitrary class limits can make the frequency table unbalanced. In other words, doing this empirical thing can make it sort of weird, because less than 20 is big, and 40 or more miles is big, and they’re bigger than the other classes. So it does kind of break the rules of class width. But not following the scientific literature can make your results not comparable, and can make the science less useful. And so that’s why I sort of rail against the book with this class width formula thing.
So I’m going to just give you another example of a frequency table. Okay. This one is also healthcare-y. You know, glucose is measured in the blood and expressed in milligrams per 100 milliliters, right? So glucose is a huge molecule, and it should be cleared from the blood, especially when fasting. So if you’re not eating anything, you’re not putting any glucose in, and your body’s supposed to be metabolizing it. The problem is some people don’t metabolize glucose very well; you know, that’s what diabetes is. So you care about how much glucose is sitting around in people’s blood. So blood glucose levels for a random sample of 70 women were recorded after a 12-hour fast. And this is what they got: the minimum was 45, the maximum was 109. And they picked six classes.
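As a rough sketch of how six classes could be laid out from those numbers, the Python below starts at the minimum and steps forward one class width at a time. The function name `class_limits` is hypothetical, and it assumes the rounded-up class width from the formula together with the lecture’s "upper minus lower plus one" definition of width.

```python
import math


def class_limits(minimum, maximum, n_classes):
    """Return (lower, upper) limits for equal-width classes starting at the minimum."""
    # Assumption: class width is the range over the class count, rounded up.
    width = math.ceil((maximum - minimum) / n_classes)
    limits = []
    lower = minimum
    for _ in range(n_classes):
        upper = lower + width - 1  # so that upper - lower + 1 equals the width
        limits.append((lower, upper))
        lower = upper + 1          # the next class starts right after this one ends
    return limits


# Glucose example from the lecture: minimum 45, maximum 109, six classes.
glucose_classes = class_limits(45, 109, 6)
for lower, upper in glucose_classes:
    print(f"{lower} to {upper}")
```

Under these assumptions the first class is 45 to 55 and the last runs up to 110, so the maximum of 109 is covered.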
So this is how they set up their class limits. And again, this is using the class width formula. And just to demonstrate, you know, it sort of comes out a little weird here. But then they got these frequencies, okay. And this is, again, just another example, using this time the class width formula to get our six classes and to make sure that they covered everybody. Now, you’ll notice in this, we start with the minimum, like 45 to 55, and we end with the maximum, which is up to 110. And that’s really the clearest way to do it. It’s just not typically done that way. If you read the scientific literature in healthcare, you just don’t see frequency tables labeled like that. So, just to wrap up this part: make sure all of your data points are accounted for only once, in one of the classes. So whether you use the class width formula, or you use empirical or arbitrarily picked classes, every single data point only gets one vote; it can only be in one of the classes. And also, you don’t want to leave any of the data points out. So you want to make sure that you account for all of them. And also you need to make sure your classes cover all the data, right. In healthcare, when we do that thing with “up to 20” and “65 and over” and all that stuff, we cause that to happen. However, if you’re going to use the class width formula, you really have to pay attention to where your minimum and your maximum are, because you want to make sure your classes cover all of your data. And like I mentioned, make sure the frequencies in your classes add up to the total number of data points. It’s just a little check to make sure you didn’t do something wrong.
Now I’m going to talk about what a relative frequency table is. And that builds on what you already just learned about frequency. So we all know what our relatives are. They’re like our family, right? We have relationships with them.
And so what relative means is in relationship to the rest of the data, okay? So in statistics, they often use this fancy f to stand for frequency. And, as I’ve mentioned before, for the sample size, if you have a sample, they use a lowercase n. So what they use as the formula for relative frequency is f divided by n. And if you’re clever with math, you realize what that means is, if you take the frequency of any of the classes, you know, it’s just a portion of the whole sample, and you divide it by the total sample, which is that n, you’ll get the proportion of values that are in that class. It’s not really that fancy. So relative frequency is something very useful to put in a frequency table. So you’ll see that I kind of crammed it in on the right side. This is the old frequency table I just showed you with glucose, but I crammed in this relative frequency next to it. So it’s super easy to calculate. Like, for example, for the first one, see, 45 to 55, the frequency is three. What did I do? Pull out the old calculator? Well, actually, I used Excel. And I did three divided by 70, because that was the total, and I got 0.04. And for those of you who don’t really like proportions, you can do that thing where you move the decimal two places to the right and then put a percent sign. So that would be like 4% of those 70 people are in that first class. And then the same thing happened with the next one. I took, you know, the 56 to 66, and I took seven divided by 70, which came out to 0.10. And for those of you into percents, I’m really into percents, I like moving that decimal over, so I think of it as 10%. But whatever. As you can see at the bottom, it all has to equal 1.0, if you like proportion land, or 100%, if you’re like me and you like percent land. But in any case, this is all you have to do to make the relative frequency table. You just make another column and do all those calculations. And it’s super easy to calculate, and it’s very helpful.
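The f divided by n calculation can be sketched in Python as below. Only the two frequencies quoted in the lecture (3 and 7, out of n = 70) are used; the rest of the glucose table is not given in the transcript, so it is left out rather than guessed. The function name `relative_frequency` is just a label chosen here.

```python
def relative_frequency(f, n):
    """Proportion of the sample that falls in one class: f divided by n."""
    return f / n


n = 70  # total number of women in the glucose sample

# The two classes whose frequencies are quoted in the lecture.
quoted_classes = [("45 to 55", 3), ("56 to 66", 7)]

for label, f in quoted_classes:
    rel = relative_frequency(f, n)
    # Show the proportion and the same number as a percent.
    print(f"{label}: f = {f}, relative frequency = {rel:.2f} ({rel:.0%})")
```

This prints 0.04 (4%) for the first class and 0.10 (10%) for the second, matching the values read off the slide. With the full table, the relative frequencies would add up to 1.0.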
So why did we even do this? Because we had a pile of quantitative data, and it was really hard to organize, right. And the first thing we had to do was select the class width. And I talked about the politics behind that. But ultimately, whatever you do, the lower and the upper class limits need to be determined and put in the first column of your frequency table. Then in your second column, which holds the frequencies, you count up how many are in each class, and you fill it in. And then if you make that third column, you can do that dividing thing and get your relative frequencies. And that’s great. That’s how you build your frequency table. And as I go through future lectures, you’ll see even more why you would make that table, like how useful that can be. Given that you have quantitative data, and it kind of gets all over the place, it’s very helpful to organize it in that table.
Now I’m going to move on to talk about the stem and leaf. And the reason why I picked talking about it now is because it’s on the theme of organizing quantitative data. So I’m going to talk to you about what the stem and leaf plot actually is. Here’s just an example on the slide. Then I’ll talk about how you make one, and why you might make one of these. You’ll find it feels a lot like making a frequency table. But why do you make these instead of a frequency table? That’s just more food for thought. So first, one of the things that I got hung up on when I took biostatistics is I could not get over the fact that it was called a stem and leaf. So I had to understand that. So this is an example of a stem and leaf there. So why is it called a stem and leaf? Well, there’s always the stem. So see these corn stalks? I’m from Minnesota; I’m used to seeing them. You’ll notice that there’s a stem, right, like this big corn stalk has the stem. That thing you see, that vertical line and a bunch of numbers on the left, that part of the stem and leaf plot is called the stem.
And then leaves are added onto the stem as we tally up the length of the leaves. And that may not make much sense right now, but I’ll show you how to make one. But essentially, what you end up doing is adding these leaves. Like, you see under two, there’s a little leaf that just has a zero on it. But if you see under five, there’s this big long leaf with a whole bunch of numbers off of it. So making one will help you understand this terminology. But I first wanted to just show you this picture, because it’s actually kind of hard to understand what’s going on with a stem and leaf unless you understand that that vertical line and the numbers to the left of it are considered the stem. And then each one of these things we build off of each of those numbers is called a leaf. So people talk about the four leaf and the five leaf, all right? Okay, so again, I’m just so into making up data, right? So I decided to make up data from 42 patients who visited a primary care clinic and were referred to mental health. Now, the reason why I made up data on this subject is I’m very upset about this subject. I think people are waiting too long to get mental health treatment. Especially if you’ve been following the news about the Veterans Administration in the US, a lot of people are put on hold even for primary care. You know, they’re put on waiting lists, and I don’t like that. So I made fake data about that as a demonstration, just to highlight these issues. Okay, so what data did I make up? I made up the number of days between the referral and their first mental health appointment. That was what was collected. So let’s say you go in on January 1, and you get a referral. And then 10 days later, you actually show up at the clinic. Then that would be 10, right? That would be your value. So that’s quantitative. So let’s take a look at it.
So on the right side of the slide, you see just this pile of numbers from all these people that came in and then got a referral. So, like, you look at the first one: that person had to wait a month to go see a mental health professional. But if you look at, you know, the third one, that person only needed 12 days. So that’s how you sort of consume this fake data I made. And then you’ll see over on the left side, I already made a stem. It’s blank; it doesn’t have any numbers on it, but I knew I’d need that vertical line, so I just made that in preparation. Okay, so let’s build our stem and leaf. So what we do is we start with the first number, and that’s what’s awesome about this: you just start with the first number. And if you want, you can kind of cross them out as you go along to keep track. So we start with this first number, 30. And you’ll see what I did: I went over to the stem, and I put the three on the left side of the stem and the zero on the right. This begins the three leaf, okay. Here’s the next number, 27. Now, I put the two above the three because it comes right before it, and you can kind of imagine we’re gonna walk down like 2, 3, 4, 5, 6. And then I put the seven on the right side to start the two leaf. Alrighty, here we are with the next number, which is 12. And as you’ll see, I started the one leaf. You’re starting to see the pattern, right? And you can probably guess what’s going to happen next with 42: we start the four leaf and put the two there. Okay, our next leaf we’ve already started, right, for 35. So what do we do there? Well, we just add the five onto the three leaf. The three leaf was already started with that 30 at the beginning, so we just pile a five on there. Here’s 47; we just pile a seven on there. Now, you’ll notice I tried to line up that seven on the four leaf with the five on the three leaf. When you’re doing this by hand, well, even when you’re not doing it by hand, you really have to keep those things lined up, or you won’t have a good stem and leaf.
Okay. Now I’m going to just fast forward a little, because you can probably imagine how to do the next row, the 38, the 36. You just keep piling it on. But I want to show you what happens when you get to the special case here. Okay, well, we’ll go with this 29. This is the last thing before the special case. So you’ll notice that 38 got put in there, see that eight on the three leaf, and that 36 got put in there, you know, from the second row. See, we put everything in there. And now we put in the 29. Look at that: we’ve got a three after that. That’s our next one. So where are we gonna put that three? And, you know, you might think on the three leaf, but that’s not right, right? Because that’s 30-something. So where do you put the three? Well, some of you figured this out: you have to add a zero onto your stem. So look at that, I put that zero there, and then we put the three in. And then you can already guess how to do the 21 next; we’ll just tack a one onto the two leaf. But then when we get to the next one, zero, we just add a zero onto the zero leaf. So you can probably figure out how to pile up all of these. But I did want to talk to you about something else that happens with these stem and leafs. As you go on adding to the leaves, you’ve got to be careful, because you might end up with a situation where you get something big. Now, I really feel sorry for this fake person: 51 days for a mental health appointment, that’s too long, right? But it causes us to have to add a five to the stem. Now, this can cause real estate problems, especially on a piece of paper. You know, what if the four was right at the bottom of the paper, right? It’s kind of hard; maybe you have to tape some paper at the bottom. I have this problem a lot. Um, you’ll see here, I even had to move this up on the slide when we got later to the 70 and I had to add the seven leaf. Now, I just want to show you: for some reason, this data didn’t have any 60s. But you still have to put that six leaf placeholder in there; that’s got to be there.
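The tallying walked through so far can be sketched in Python. This is only an illustration: `waits` holds a handful of the wait times mentioned in the lecture, in the order they came up, not the full list of 42 values, and the function name `stem_and_leaf` is a label chosen here. The stem is everything except the last digit and the leaf is the last digit, so single-digit values land on a zero stem and any stem with no data still gets an empty placeholder.

```python
def stem_and_leaf(values):
    """Build an unordered stem and leaf display as a dict of stem -> leaves."""
    leaves = {}
    for value in values:
        # The last digit is the leaf; everything before it is the stem.
        stem, leaf = divmod(value, 10)
        leaves.setdefault(stem, []).append(leaf)
    # Keep a placeholder for every stem between the smallest and largest,
    # even if no value landed on it.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}


# A subset of the fake wait times (in days) mentioned in the lecture.
waits = [30, 27, 12, 42, 35, 47, 38, 36, 29, 3, 21, 0, 51, 70]

plot = stem_and_leaf(waits)
for stem, leaf_list in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaf_list)}")
```

The leaves come out in the order the values were tallied, and the 6 stem prints with nothing after the bar, which is the placeholder the lecture insists on. The same `divmod` split would put a value of 105 on a stem of 10.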
So as we go on, if we’re missing any leaves in between, we just need the placeholder there, because that space has to be there. And here’s an outlier. We’re gonna learn about outliers pretty soon. This is a really long time: 105 days. This is kind of like VA status, right? And of course, this is fake data, but unfortunately it reflects real data. You’ll see when we get to 105, not only did we skip the eight leaf and the nine leaf, and we need to leave a space for them, but 10 becomes part of the stem. So if we went on to 200 or 300, I mean, that would be awful to wait that long, though, the first two digits would be the stem. Like, if we had 365, the 36 of the 365 would be the part on the stem. Alright, so I just did a little demonstration to explain certain nuances of the stem and leaf that you might encounter in your life.
So now, I’m going to just reflect back on the two ways that I’ve described in this lecture for you to organize quantitative data. First, I showed you how to make a frequency table. But what you need to do with that one is you need to set up classes and class width, and count the frequencies. There’s a lot of pre-processing, a lot of pre-calculation. You really want to think when you’re doing this, and you don’t want to be distracted. However, if you’re trying to do a stem and leaf, you really can do that on the fly. You don’t need to set up classes or class width. As you noticed, we just went through that pile of numbers and crossed them off as we put them onto the stem and leaf. And there was really no need to count; you can tally the data as you go through the list, you know, cross it off. And it’s just really quicker to do. Of course, those of you who are pretty clever are saying, well, basically, in a stem and leaf you’re forcing everything to be in the class of, you know, the 10s, right, you know, the 20s and the 30s and the 40s.
That’s like the two leaf, the three leaf, and the four leaf. And yeah, it is kind of like a simplified way of making those kinds of classes. But in any case, I just wanted to alert you to this because you might see some similarities between the two, and I wanted to highlight those as well as the differences. Now I’m going to give you a few tricks here. I want to tell you about the concept of an unordered leaf. So an unordered leaf is what we were making before when I was demonstrating. It’s just where the numbers are out of order in the leaf. Like, you’ll see this two leaf, it’s 7, 7, 2, 9. Well, if they were in order it would say 2, 7, 7, 9, right? The two would come first, before the sevens and the nine. And the same with the three leaf. That’s out of order, because you can see that zero and five is fine, but eight doesn’t come before six and five, right? It’s no problem to make an unordered leaf. However, after making an unordered version, you can rewrite the stem and leaf in an ordered way. So you see how I did that: I rewrote the two leaf and the three leaf, and now all the leaves are in order. Okay, they don’t have to be, but you can do that. And if you do that, if you make your stem and leaf first unordered, the way I was demonstrating, and then you rewrite it into ordered, it is way easier to count it up to make a frequency table, no matter what classes you choose. Or you can just make each leaf a class, and then it’s super easy to make the frequency table. So that’s why I combined these two pieces of the chapter together: I wanted to show you how you can use a stem and leaf to help you make a frequency table. So a stem and leaf is just another way to organize quantitative data. And it’s easier to make on the fly than a frequency table because it requires less preparation. And it can help you put data in order in preparation for a frequency table, as a first step to make sure that you can organize everything.
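If you wanted to try the stem and leaf on a computer instead of on paper, here is a minimal Python sketch of the steps described above. The function name `stem_and_leaf` and the wait times are made up for illustration (they are not the data from the slides), and it assumes the values are non-negative whole numbers. It keeps a placeholder row for any stem with no data, lets a value like 105 use 10 as its stem, and can show the leaves unordered or ordered.

```python
from collections import defaultdict


def stem_and_leaf(values, ordered=True):
    """Return a list of (stem, leaves) rows for non-negative whole numbers.

    The last digit of each value is its leaf and the digits before it are
    the stem, so 38 -> stem 3, leaf 8; 3 -> stem 0, leaf 3; 105 -> stem 10, leaf 5.
    """
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)  # leaves pile up in the order they are read

    display = []
    # Walk every stem from lowest to highest so empty stems keep a placeholder row.
    for stem in range(min(rows), max(rows) + 1):
        leaves = rows.get(stem, [])
        display.append((stem, sorted(leaves) if ordered else list(leaves)))
    return display


# Hypothetical wait times in days (an assumption, not the lecture's data set).
wait_days = [38, 36, 29, 3, 21, 0, 27, 51, 35, 70, 105]

for stem, leaves in stem_and_leaf(wait_days):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")

# Treating each stem as its own class gives a quick frequency table.
frequency = {stem: len(leaves) for stem, leaves in stem_and_leaf(wait_days)}
print(frequency)
```

Passing `ordered=False` shows the leaves in the order they were read off the list, which is the unordered version; the default rewrites each row in order, which is the version that makes counting up a frequency table easier.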
Remember, I keep emphasizing that your frequency table has to reflect all your data points, and they can only be in one class, blah, blah. Well, one way to make sure that happens is to first do this pre-organization using an ordered stem and leaf. So in conclusion, frequency tables and stem and leaf displays organize data; they organize quantitative data. And the stem and leaf may help you make a frequency table, so you might want to start with that. And the purpose of both of these things is to reveal a thing called a distribution. And I’m going to explain that in the next lecture.
Hello, it’s Monica Wahi again, your lecturer from Library College, and we are moving on to chapter 3.1, which is measures of central tendency. And here are your learning objectives. So at the end of this lecture, you should be able to explain how to calculate the mean. You should also be able to describe what a mode is and say how many modes a data set can have. You should be able to demonstrate how to find the median in a set of data with an odd number of values, as well as in a set of data with an even number of values. And you should also be able to define trimmed mean and weighted average. All right, so what’s this measures of central tendency? I’m going to explain why we call it that. And then I’m going to talk about them. The three biggies are mode, median, and mean, so I’m going to talk about those and explain how to get those. Then, towards the end of the lecture, I’m going to go into some special situations. One is called the trimmed mean, and the second is a weighted average. So let’s get started. What is this central tendency thing? Well, think about quantitative data. You can only do this with quantitative data, not qualitative data. When you think of having a pile of numbers like this, one of the things you want to know is how much they tend towards the center. Now, of course, you don’t know where the center is until you start looking at the data.
Some data are kind of high up in the hundreds, like systolic blood pressure. I give a five-point quiz in one of my classes, so those numbers are low, like 1, 2, 3, 4, 5. But then the question becomes, do they group towards the center of whatever list of data they’re in, or don’t they? How central are they? You see these distributions on the slide? On the left, you’d probably say, well, that looks more central than what’s on the right, you know, this normal distribution on the left and the skewed right distribution on the right. And so intuitively, you kind of know what I’m talking about. But what this lecture is going to be about is how to actually put numbers on the difference between what you see on the left and what you see on the right. So these are the numbers, these are the measures of central tendency. We’re going to go over mode, median, and mean. And the median is a little different depending on whether you have an odd number of values or an even number of values. I mean, it means the same thing, but you calculate it slightly differently. So I’ll go over that. And then the mean, a lot of you already know what a mean is, but there’s a couple of special means we can make. One is called a trimmed mean, and another is called a weighted average, which is a weighted mean. I don’t know why they chose the word average for that one, because mean and average mean the same thing. But I’m going to go over these things. Okay, well, let’s start with the mode. The mode is the number in the data set that occurs the most frequently. So I put up this little tiny data set here of just five numbers. And it’s obvious that five is the mode, right, because it repeats once; there are two fives there. But look, I just changed one of them, I changed it to a six, and now there’s no mode. So I just want you to know that a lot of data sets don’t even have a mode. There’s just no repeat at all in them.
And that usually happens when you have a broad range of numbers, like systolic blood pressure. I mean, it would be kind of lucky if you got two people with the exact same one. But that can happen. So don’t think there’s always going to be a mode; there might not be one. It’s also possible to have more than one mode. Like, look at that. I’ve got six numbers up there, and the two repeats once and the three repeats once. So you’ve got two modes, right? But let’s say that the three actually showed up three times. Then there would only be one mode, because the three threes would trump the two twos, right? So you can just imagine how confusing this gets when you’ve got a ton of numbers. What’s a little less confusing is, like I said, if you have a broad range of numbers, it would be kind of a coincidence if two patients had the exact same systolic blood pressure or platelet count, you know, so that you get a repeat in there. And then that would be the mode. Of course, if you measure a whole bunch of people, then eventually you’re probably going to get one. And also, if you look at all those numbers on the slide, you’d really have to go through and organize them and count them up to see if there is a mode. There probably is one, because we see a lot of repeats. But then which is the one that wins, that’s repeated the most? Or are there two that are repeated the most? It becomes kind of political when you really do it. And it’s not worth a lot of work, because what does the mode tell you? It doesn’t really tell you much. It does tell you the most popular answer. The word mode in French means fashion. So like I put on the slide, you know, à la mode, it’s in fashion. So it’s the one that’s most popular, or the most common result, but it’s not used a lot in healthcare. It’s actually not used very often. Once in a while I’ll say, oh, the mode in the class for my five-point quiz was five, meaning everybody did pretty well; they mostly got a five.
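Counting up repeats by hand gets tedious, so here is a small Python sketch of the mode rules described above: no mode when nothing repeats, one mode when a single value wins, and more than one mode when values tie for the most repeats. The helper name `find_modes` and the little data sets are hypothetical, chosen only to mirror the kinds of examples on the slides.

```python
from collections import Counter


def find_modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no repeats at all, so no mode in the sense used here
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical small data sets, not the ones on the slides.
print(find_modes([1, 5, 5, 8, 9]))     # [5]     one value repeats
print(find_modes([1, 5, 6, 8, 9]))     # []      nothing repeats, so no mode
print(find_modes([2, 2, 3, 3, 7, 9]))  # [2, 3]  two values tie
print(find_modes([2, 2, 3, 3, 3, 9]))  # [3]     three 3s beat two 2s
```

One design note: Python’s built-in `statistics.multimode` would hand back every value when nothing repeats, which is why this sketch checks for that case and returns an empty list instead.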
That was the most popular result. But you hardly ever have to say that. And remember, we learned the word resistant. If a measure is resistant, you can’t whack it out very easily. Well, you can change things pretty easily with the mode. The mode’s not resistant. I even just demonstrated that on those slides: by just changing one number, you can erase the mode or add a mode or whatever. And so it’s not stable, it’s not resistant. And those are the kinds of things we don’t really like in healthcare, so we don’t really use them. So I’ll move on to some cooler measures of central tendency. And here’s a really cool one, which is called the median. It’s the middle of the data, and I’ll explain a little bit more what we mean by the center of the data. Okay, so remember, we’re talking about quantitative data. So you’ve got some pile of numbers. It doesn’t matter what they are, you can always sort them in order from lowest to highest. And I keep talking about this five-point quiz I give in my class. It’s an easy quiz, and most people get fives. But even so, somebody usually gets a four, or somebody doesn’t show up for the quiz and they get a zero. And so it doesn’t matter, I can have 100 people in the class, I still could put all of those numbers in order from lowest to highest, even if most of them were fives. Because you’ll get repeats in your data sometimes, right? And also, sometimes you’ll get outliers. Like if one person didn’t take the quiz and they get a zero, but everybody else gets four and five because it’s an easy quiz, well, then that zero would be an outlier. So you don’t have to worry about that. And like I said, the data values sometimes are almost the same, like almost everybody gets a five on my quiz, because it’s so easy. So it doesn’t matter. Even if you have these weirdnesses in your data, you can still just arrange them in order. And that’s what we mean by the median: it’s the number that is halfway up, or halfway down, right?
So if I’ve got 100 people in my class, and I’ve got the zero over here on the left, and I put all the, you know, fours, and then the fives, I have to count up about 50, right, to see where the middle is. And it’s probably going to be in the five range, right? But that’s all we mean. You take however many values you have, put them in order, even if there are repeats and outliers or whatever, just put them in order, and then count up halfway. And that’s where the median is going to be. So I’ll demonstrate this here. So how to find the median: the first step is to order the data from smallest to largest. I’m giving you two demonstrations. And I don’t even know what these data mean, I just totally made them up. The data set at the top, the one that starts with 42, only has five numbers in it. So I’m going to demonstrate the odd version with them. The set at the bottom has six numbers in it, so I’m going to demonstrate the even version with it, because remember, it goes a little differently depending on whether you have an odd number of numbers or an even number of numbers. Okay, so those are the numbers. And we still have to do the first step, which is order the data from smallest to largest, because you can see they’re not in order. So I’m going to do that here. Okay, there it is. So those are the same numbers, they’re just in order from smallest to largest. Okay. So we’re going to get rid of those numbers on the top, and instead put the position they’re in. So let’s look at the top data set, which is the odd one. This is how you find the median: you number the positions, you know, 1, 2, 3, 4, 5, and it’s the middle position. So you can imagine, if we had had seven data points, we’d go out 1, 2, 3, 4, and we’d circle that one, and that would be the median. So that’s what you have to do: if you have an odd number of values, you just put them in order, and see, I numbered them for you. And then you take the middle number, and that’s the median.
That’s what it is. It’s 42 in this one. Okay, we’ll do the downstairs data set there that has six. As you can see, the positions are numbered. And then what do you do? You go to the third and fourth positions, which are kind of the middle, right, and you literally make an average of them. You add the two, and they happen to be seven and eight, right next to each other. But if they had been like eight and 10, then the average would have been nine, and that would have been the median. But because this is seven and eight, you do seven plus eight, divided by two, and it’s 7.5. So when you do the median with an odd number of values, you’re going to be taking one of the values in there. If you’re doing the median on an even number of values, you might get something with a decimal, because you’re looking for the two values that straddle the middle, and you’re going to be making an average of them. And so you might get kind of a wacky number like 7.5 that’s not in the underlying data set. So this is fine if you have five or six numbers, or seven. What if you have like 150 numbers? I mean, you do still have to put them all in order to begin with. You know, I use Excel, I’d probably just sort. But you have to know how many numbers to go up, and it’s not obvious. So this is how you find the middle number. They have a little formula for it. So let’s say we have an odd number of values, and I’m giving you the example of 21. Let’s say I had 21 students in my class, and that’s how many values I have. And I wanted to make a median of their grades. What I would do is put them all in order. And I’d say, well, I have to go up so many, and that’s the median. But I don’t know how many to go up. So I would use this calculation. I take the n, which in our case is 21, and I’d add one to it, and then we get 22. And then I divide by two. So that’s just how it works. So if you had 41, you would do 41 plus one, it would be 42, divided by two.
Or if you had, like, I don’t know why I’m picking ones like 27, you’d do 27 plus one, and that would be 28, and 28 divided by two is 14. And so you see, adding the one just forces it to be an even number, so you come out with a whole number when you divide by two. And then that’s the position you’ve got to go up to. So if I had 21 students in my class, and I took the grades and arranged them in order from lowest to highest, like if they were those quiz grades, you know, most of them would probably be four and five, but it wouldn’t matter. What I would do is just start with the lowest and count up to the 11th position, and then that would be my median. Now, you also have to find the middle number if you have an even number of values. So I took an example of 14. Now you’ll notice we use the same formula, but if you use this formula, you get 7.5. And that’s not the median. That’s just how many positions you have to go up. Right? And so remember, on the earlier slide, we had to go between the third and fourth positions, and we had to average those two numbers. Well, this is basically saying, if you get 7.5, you have to go to the seventh and the eighth, the ones that straddle it, and those are the two that you average. So say my n is 100, which is a nice even number. If you have 100 plus one, you get 101, and then you’ve got, you know, 50.5, right? And that just is a secret message that when you line up all your data, you take the 50th one in the row and the 51st one in the row, add them together, divide by two, and that’s going to be your median. So I just wanted to share with you this little formula, just in case you get a large number of numbers thrown at you and putting them in order is a big pain. And then when you have to figure out how many to count up, you can use this formula to get the middle number. So what does a median tell you? We have a lot more to talk about here.
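To see the counting-up rule in action, here is a minimal Python sketch that sorts the data, computes the position (n + 1) / 2, and either picks the middle value or averages the two values that straddle a half position. The function name `median_by_position` and both little data sets are made up for this sketch; they are only chosen so that the answers come out to 42 and 7.5 like in the demonstration.

```python
def median_by_position(values):
    """Find the median by sorting and counting up to position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position: 11.0 when n is 21, 7.5 when n is 14

    if position.is_integer():
        # Odd n: the middle position holds the median itself.
        return ordered[int(position) - 1]

    # Even n: average the two values that straddle the half position.
    lower = ordered[int(position) - 1]  # the 7th value when position is 7.5
    upper = ordered[int(position)]      # the 8th value when position is 7.5
    return (lower + upper) / 2


# Hypothetical data sets, not copied from the slides.
odd_data = [42, 17, 88, 5, 63]    # five values, so take position 3
even_data = [8, 2, 10, 7, 5, 8]   # six values, so average positions 3 and 4

print(median_by_position(odd_data))   # 42
print(median_by_position(even_data))  # 7.5
```

Python’s built-in `statistics.median` gives the same answers; the sketch just spells out the position arithmetic so you can see where the 3.5 or 7.5 comes from.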
First of all, it’s called the 50th percentile of the data. What it means is 50%, or half, of the data points are below the median, and the other half are above. And that intuitively makes sense, because we just created this median together, and we could see that half of the points are in the bottom and half on the top. And so it’s also known as the middle rank of the data. And what’s nice about the median is it doesn’t really care much about the ends of the data. Like if I gave extra credit to a few people on my five-point quiz, and they got a few sixes, probably the median won’t even change, because the median is in the middle, where all the action is. And outliers don’t really bother it, because if one or two people get a zero on the quiz, and there’s 21 people in there, or 100 people in there, these things happening at the ends really aren’t gonna affect it. So we like the median because it’s very resistant, and it’s very stable. You can’t really whack it out by throwing some outliers on the ends. Now I’m moving on to the third measure of central tendency, which is the mean. But I also threw in here trimmed mean and weighted average, because those are other kinds of means. And we’re going to talk a little bit also about resistant measures, because I just mentioned that. But I’m gonna step back and talk a little bit about the Greek letter sigma here. That’s actually capital sigma. I do not speak Greek. And I actually have trouble speaking statistics, because a lot of it’s in Greek. So I try to avoid that in my lectures, but sometimes you can’t get away from it. So I have to really introduce you to this capital sigma. So in English, or statistics-ese, I guess, whenever you see this, you say “sum of” what? Like, you expect something to be right after it. Okay. So if you see the sigma and then x, you would say “sum of x.” That’s how you say it. So what is x? Well, remember how we were just making medians?
And we were looking at modes? Well, each value there is considered an x. Okay, so each of the values in those data sets is an x. So sum of x would mean add these all up, or add up all the x’s. And then I just threw on another example. Let’s say somebody came to you and said sum of xy. It would mean you must have some xy’s lying around, and you have to add them together. Or somebody came up to you and said, you know, sum of the prices of the food in your basket at the grocery store, right? If somebody said sum of that, you’d be like, okay, I have to go through all these prices and add them up. Right? So that’s what sum of means. Okay, and it’s used a lot in statistics, and we’re going to use sum of all the time. So I just want you to get in your head that whenever you see sum of, there’s probably going to be this thing next to it, and it’s gonna be a batch of numbers that you have to add up. And if it’s numbers from our data set, it will be called x. If it’s other numbers from something else, they will be called whatever they’re called. But just know that this means sum of. And I see on the slide, the upper one is Times New Roman and the lower one’s Arial, so they look kind of different. But I just wanted you to get ready to deal with this sum of a lot. Okay, so here we are, I’m hitting you with a sum of. This is the formula for the mean. And a lot of you already know how to calculate the mean, and you just kind of do it, and you didn’t know this is how you say it in statistics. But basically, it’s this ratio. So this is like a fraction. On the top of the fraction is sum of x: you add up all your x’s. And on the bottom of the fraction is n, which is however many you have. So you add them all up and divide by however many you have. And you’ve probably been doing this your whole life. But this is actually the formula.
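If it helps to see the fraction as code, here is a tiny Python sketch of sum of x divided by n. The six values are hypothetical; they are just picked so that they add up to 40, and the names `sum_of_x`, `n`, and `mean` are labels for this sketch only.

```python
# Hypothetical data set of six values (an assumption for illustration).
x_values = [2, 5, 7, 8, 8, 10]

sum_of_x = sum(x_values)  # top of the fraction: add up all the x's
n = len(x_values)         # bottom of the fraction: however many values there are
mean = sum_of_x / n

print(f"sum of x = {sum_of_x}, n = {n}, mean = {mean:.1f}")
```

The `:.1f` in the print line only rounds the display to one decimal place; the stored `mean` keeps its full precision.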
So I just thought I’d demonstrate it. See, remember those six data points I was using for the median? I just copied them over here. I add them all up, and so I got a sum of x of 40, right? And then I counted them, and that was six. Well, I made them be six. And so 40 divided by six is about 6.7. So that would be the mean for these data. And you probably already knew how to do that, but I wanted to sort of cross-check it with the actual formula. Okay, now I’m again going to take a little break here to just talk about means, because remember, we talked about sample statistics and population parameters. If somebody just talks about a mean to you, and they say, look, the mean of such and such is six or something, unless you really get into it with them, it’s not going to be obvious if they did a sample mean or a population mean. But when we write this down, it becomes obvious. If I say x bar, see that x with a line above it? That’s pronounced x bar, and you’ll see I write it on the slides as “x bar,” because it’s so hard to put that little line up there. But that means the same thing. Whenever you hear x bar, or you see that x with a line over it, it means that it’s the sample statistic. So if you ever saw x bar equals six, not only do you know the mean is six, but the secret code says this mean comes from a sample, because x bar is being stated. But if you look on the right side, you’ll see there’s this μ, and it’s pronounced mu. It’s a Greek letter again, and you’ll see on the left I put it in Arial, and on the right it’s in Times New Roman, so it looks a little different. But it’s pronounced mu. And so if you saw mu equals six, you’d be like, whoa, that was a population they measured.
And you’d probably say that, too, because you don’t see mu a lot. People usually don’t measure the population, it’s a lot of work. You often see x bar. But even so, I want you to be cognizant of whether it says mu or whether it says x bar, because it’s still going to be a mean. But if it’s mu, they’re talking about the population, and if it’s x bar, they’re talking about a sample. And that might be more important later, but just keep this in mind. Also, when we talk about samples, we use a lowercase n to mean the number of numbers we have. Whereas if we’re talking about populations, we use an uppercase N, a capital N. So you’ll see that the sample mean formula on the left side, this x bar equals sum of x divided by n, changes if you’re talking about the population mean. And you’re like, come on, you add it up the same way. Mu is basically the population mean, and capital N is just the number in the population, so that’s almost the same formula. But the issue is you really are supposed to label things what they are. So if you’re doing a population mean, you’re supposed to call it mu, and you’re supposed to write it like that on the right side of the slide. And if you’re doing a sample mean, you’re supposed to call it x bar, and you’re supposed to do it like on the left side of the slide. So I just wanted to make that clear to you as you go through the rest of these lectures. Because when I say mu, I’m gonna mean a mean, but it’s gonna be from a population. And when I say x bar, I’m gonna mean the mean, but it’s gonna be from a sample. Alright, so now we’ve talked about several measures of central tendency, but I wanted to put means and medians together in kind of a cage match, because I wanted you to look at them and see what their differences are. Now, I’ve been sort of giving accolades to the median, right, because it is very resistant to outliers, and it’s very stable.
Remember how I pointed out that if you throw some outliers on either side, it doesn’t really affect it much? Unfortunately, means are not resistant to outliers. Like, if I took my five-point quiz, and I just felt like favoring a student and giving them 10 points, it would totally screw up the mean for that class. So it’s not very stable. So one of the things we can do if we’ve got outliers in our data is to just use the median. But sometimes we want to use the mean, so we’ve got to do different things with it. One of the things we can do to try and make a more stable mean, or a more honest mean, is to trim it. So I’m going to talk about how you do that. As you can see on the left side of the slide, a very high value or a very low value, like an outlier, or more than one outlier, can really throw off the mean. And it’s not a problem with the median. So if you want to make the mean a little more resistant, what you can do is trim data off of each end, so the outliers get cut off, okay? The problem is, you can’t really look at the data when you’re doing that. You just have to make a rule when you’re not looking and say, okay, I’m going to trim X amount off the top and X amount off the bottom, and it has to be equal, and you just have to look away when you’re doing it. Okay, so what some people do is a 5% trimmed mean, which means you take 5% of the data at the top and cut it off, and 5% at the bottom and cut it off. So you basically lose 10% of your data. And in health care, a lot of people get mad about that. They don’t want to lose any data. So they don’t like to use this way of fixing the problem of outliers; they use other ways. But I wanted to show you this as a simple way to fix it. So I’m going to imagine we have 100 data points, because it just makes it easier for you to see what’s going on. So if you had 100 data points, 5% of them would be five. So basically, you’d be trimming five off of the top and five off the bottom.
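Here is a rough Python sketch of that 5% trim, using 100 made-up data points with five low outliers and five high outliers. The function name `trimmed_mean`, its `percent` parameter, and the data are all assumptions for illustration. It also prints the plain mean and the median next to the trimmed mean, so you can see how much the outliers drag the plain mean around.

```python
import statistics


def trimmed_mean(values, percent=5):
    """Mean after cutting `percent` percent of the points off each end of the sorted data."""
    ordered = sorted(values)
    cut = len(ordered) * percent // 100  # points dropped per end, rounded down
    kept = ordered[cut:len(ordered) - cut]
    return statistics.mean(kept)


# Hypothetical data: 90 ordinary values between 30 and 39,
# plus five low outliers (0) and five high outliers (300).
ordinary = [30 + (i % 10) for i in range(90)]
data = [0] * 5 + ordinary + [300] * 5

print("n            =", len(data))
print("mean         =", statistics.mean(data))    # pulled up by the 300s
print("median       =", statistics.median(data))  # barely notices the outliers
print("5% trim mean =", trimmed_mean(data))       # outliers cut off first
```

With 100 points, `cut` works out to five, so five values come off the bottom, five come off the top, and the mean is made from the 90 in the middle. If the count times the percent is not a whole number, this sketch rounds down, which is one possible convention and not the only one.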
So the first step is, you probably already made the mean out of this 100, and you didn’t like it because you saw outliers at the top and bottom. So what you have to do is put the data in order, just like you do for the median. You sort them from the lowest to the highest. Take all of your 100 and do that. Then what you would do is circle the five bottom-most ones, and they’re going to get cut off, and you’d circle the five top-most ones, and they’re going to get cut off. They get thrown out. And then you’ve got the 90 values left in the middle. Now you make a mean out of those, and that’s a 5% trimmed mean. And you’ve got to tell people if you do that. You can say, here’s the original mean, and here’s the 5% trimmed mean, because then people get an idea that there must have been some outliers and some of your data got hacked off. But this might give you a more stable estimate of the mean. Now I’m going to move to something else entirely. It’s not about trying to make the mean stable, it’s just about trying to make the mean a little different. Sometimes certain values should count more than others towards the mean. And that sounds really esoteric, but the way we see it all the time is in school. So you might get a great grade on your homework, you might get A’s on your homework, right? But if homework’s only worth 10% of your final grade, it doesn’t help you much. And that 10%, when you have a class like that, is called a weight. When you move into statistics, you say, well, I as the teacher am going to weight your homework grade at 10% of your final grade. So it doesn’t matter how awesome your homework grade is, or how bad it is, it’s really only going to count for 10% of your final grade. And that’s why we do weighted averages. You know, I don’t think your homework should be worth like 50% of your grade, right? That doesn’t make any sense.
And so you might want to have different things contribute different amounts of weight to that final mean. So this is a way of messing around with the mean, and making certain things going into it count for more, or have kind of a bigger vote than the other ones. And again, I’m just gonna stick with school to give examples, because this is where we normally see it. So imagine this example, where homework is worth 10% of your final grade, quizzes are worth 20%, and the final is worth 70%. And I just want to point out, I’ve actually seen people do this, because I tutor, and this is horrible, making your final worth over 50% of your grade. So this is just a shout-out to any professors watching this: don’t do this. Okay. But anyway, let’s say I was mean and I did it. And let’s say you were a pretty good student and you got an A on the homework, right? And so we’re gonna say that’s a 4.0, because a lot of schools would say an A is 4.0. Then let’s say you got a B plus on the quizzes, maybe because the lectures weren’t very good, right? Haha. So you got a B plus on the quizzes, and that would translate to the number 3.5 on that four-point scale. And let’s say you got a B on the final. That’s too bad, but that’s 3.0. So why do I say that’s too bad? Well, you’d probably want an A, because the final counts for greater weight, right? It counts for 70%, and you’d want that to be really high. Now, I first wanted to show you the non-weighted average, like the normal mean. For the normal mean, you just add the 4.0 to the 3.5 to the 3.0 and then divide by three, because you have three grades in there, and you’d get 3.5. You’d get a B plus in the class, right? But let’s look up at that formula. So this is the weighted average formula. It’s the sum of x times the weights, divided by the sum of the weights. And remember what I said about sum of xy as an example.
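The weighted average formula just described can be sketched in a few lines of Python, using the grades and weights from this example (4.0, 3.5, and 3.0, weighted 10%, 20%, and 70%). The function name `weighted_average` is made up for this sketch. It always divides by the sum of the weights, so it also works when the weights do not add up to one.

```python
def weighted_average(values, weights):
    """Sum of x times w, divided by the sum of the weights."""
    sum_of_xw = sum(x * w for x, w in zip(values, weights))
    sum_of_w = sum(weights)
    return sum_of_xw / sum_of_w


grades = [4.0, 3.5, 3.0]      # homework A, quizzes B plus, final B
weights = [0.10, 0.20, 0.70]  # 10%, 20%, 70% written as decimals

unweighted = sum(grades) / len(grades)
weighted = weighted_average(grades, weights)

print(f"unweighted mean  = {unweighted:.1f}")
print(f"weighted average = {weighted:.1f}")

# Weights that do not add up to one (here 1, 2, 7) give the same answer,
# because the formula divides by the sum of the weights.
print(f"same thing with weights 1, 2, 7 = {weighted_average(grades, [1, 2, 7]):.1f}")
```

The results are rounded for display because decimal weights like 0.1 are not stored exactly in floating point, so the raw number can be off in the last few digits.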
So instead of just summing x like we did in the non-weighted average, we have to do x times w on all of them and sum it. And you’re like, what’s w? Well, remember, I told you the homework was worth 10%? That’s the weight for it, right? And so instead of using the percent when we do the weighted average, you use the decimal version. So you’ll see under the weighted average, I’m doing that sum of xw thing by taking the 4.0 and multiplying it by 0.1 for that 10% first. And then see that B plus, that 3.5? That gets multiplied by 0.2, because that’s worth 20%. And then there’s that B you got on the final, right? That gets multiplied by 0.7. So that’s the sum of xw thing going. And what do you get? You get 3.2. Now, I don’t even bother to divide this by sum of w, because sum of w is one in this case. If you add up 0.7 plus 0.2 plus 0.1, you get one. And that often happens; you just make the weights add up to one. But I just wanted to let you know, if for some reason you had goofy weights that didn’t add up to one, the last thing you have to do is divide by their sum. So as you can see in the lower part of the slide, the sum of xw is 3.2, and if we divide it by one, we get 3.2. And now you don’t get a B plus in the class, now you get like a B. And that’s the difference between the non-weighted and the weighted average: the weighted average weighted this final B extra, and that caused the final grade to be lower. And that’s what weighting is. Now, I just want to say a few things. I’ve gone through all our measures of central tendency, but I wanted to talk about how they relate to the distributions we learned recently. So I just put up an example of a normal distribution, and I color coded these lines. So see, there’s a blue mean, and then there’s a green median, and then there’s a purple mode. Technically, they should all be right on top of each other.
But you couldn’t see them if I did that, so I just squished them up next to each other. The point is, if you have data with a normal distribution, all three of these things are on top of each other. And the magic of this is you don’t even need a histogram to know. So I use statistical software, and I’ll feed in the data, like a quantitative variable, and I’ll say, tell me the mean, median, and mode. And it’ll tell me the mean, median, and mode. And even if I don’t look at the histogram, if it says almost the same number for mean, median, and mode, I have a pretty good idea it’s a normal distribution, or at least a symmetric one. Well, that’s not the case with skewed distributions. With skewed distributions, the measures of central tendency are not right on top of each other. In fact, they’re in a different order depending on whether we have right skewed or left skewed. So at the top of the slide, I’ve got an example of a right skewed distribution, right? Because it’s light on the right. Alright, so what’s happening here? Well, the mean is getting dragged around by that tail, that big tail. So you can see that the blue mean is on the right side of the median. The median is more resistant, so it’s sort of hanging out closer to the bottom of the data. But the tail, that right tail, is pulling the mean up. And then the mode is the lowest one. So if I get this printout, and I see that the mode is the lowest, the median’s in the middle, and the mean’s the highest, I can say without even looking at the histogram, this is probably right skewed. Now let’s look at the bottom of the slide, where we have the left skewed distribution, you know, because it’s light on the left. And you see the same phenomenon, but it’s going the other direction. That tail, that’s towards the low end of the data, is dragging the mean down now. And notice the median is more resistant; it doesn’t get dragged down as much. And of course, the mode stays at the high part of the data, where there’s more data, right?
So if I get the printout, and I see that the mean is the lowest, the median's in the middle, and the mode's the highest, I'm like, okay, I don't have to look at the histogram, I know this is left skewed. So this is basically what I wanted to tell you about the distributions and these actual numbers and how they relate.

So in conclusion, what this lecture was mainly about was the measures of central tendency, right? Mode, median, and mean, and how to calculate those. And, you know, I've been kind of bagging on the mean, I'm sorry, but the mean is just not resistant, it's totally not stable. And the median is, so you want to remember these things. Yeah, you can kind of fix things by doing the trimmed mean, but we don't really like to do that in healthcare, because we lose some of our data. We find other ways of fixing the fact that our mean may be kind of goofy, but how we do that is outside of this lecture. I also showed you the weighted average, you know, just in case you have to hand calculate your grade. Actually, I had a student in my class once, and this was back when we had Blackboard, and there was something wrong with Blackboard. She was really upset because she thought she was getting a really bad grade. But she only thought so because she hadn't done a good job of learning the weighted average: she did an unweighted average, and she was crying in my office. Then I showed her how to do the weighted average, her grade turned out to be a B, and she stopped crying. So just don't cry. Try the weighted average first, okay? And then finally, I went over distributions and measures of central tendency, and showed you how we can put the numbers we get from the measures of central tendency on a distribution and see some information about it.
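For anyone who wants to check their own grade, the weighted average from this lecture can be sketched in Python as follows. The grade points and weights are the ones from the example (4.0 at 10%, 3.5 at 20%, and the final at 70%); taking the B on the final as 3.0 grade points is an assumption, though it is the value that makes the lecture's 3.2 come out. The function name `weighted_average` is just a label chosen here.

```python
def weighted_average(values, weights):
    """Sum of x*w divided by the sum of w."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

grade_points = [4.0, 3.5, 3.0]   # homework, the B plus, the B on the final (3.0 assumed)
weights = [0.10, 0.20, 0.70]     # percents written as decimals; they add up to one

unweighted = sum(grade_points) / len(grade_points)
weighted = weighted_average(grade_points, weights)

print(f"unweighted average: {unweighted:.2f}")   # 3.50, a B plus
print(f"weighted average:   {weighted:.2f}")     # 3.20, more like a B

# "Goofy" weights that do not add up to one give the same answer,
# because the function divides by their sum at the end.
print(f"{weighted_average(grade_points, [10, 20, 70]):.2f}")   # 3.20
```

Dividing by the sum of the weights every time costs nothing when the weights already add up to one, and it protects you when they do not.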
All right, well, you made it through the measures of central tendency. Get ready for 3.2, measures of variation.

Hello, and welcome to chapter 3.2. It's Monica Wahi, Labouré College lecturer, and I'm here to go over measures of variation with you. Alright, here are your learning objectives. At the end of this lecture, the student should be able to state three different measures of variation used in statistics. You should also be able to explain how to calculate variance and standard deviation, which, I'll give you a hint, are two of the measures. You should also be able to calculate the coefficient of variation and explain its interpretation. And finally, you should be able to state Chebyshev's theorem. So now we're going to be concentrating on measures of variation. The first one I'm going to talk about is range. Then I'm going to talk about variance and standard deviation, which are two different ones, but I'm going to talk about them together, and you'll see why. Then we're going to go over the coefficient of variation, which is abbreviated CV. Then we're going to talk about Chebyshev. Chebyshev came up with a theorem, so we're going to talk about his theorem. And his theorem leads us to calculate these intervals. Remember, intervals have a lower limit and an upper limit. I'll remind you of that, and then we'll calculate Chebyshev intervals together. Alright, let's get started. So let's think about variation. What does variation even mean? Well, it means how much the data vary. So imagine I taught two classes, which isn't too hard, because I do teach two classes, two different sections of the same class. And imagine that I gave a quiz, and the mean grade was the same in each class. Could we tell how internally consistent those grades were? For instance, let's say that I gave a five point quiz, and the mean in each class was three.
Do we really know how many people got something far from three? Maybe in one class, people got a lot of fives and ones, and that's how we got the average of three. And maybe in the other class, everybody just got three. We really can't tell from a measure of central tendency like median, or mean, or even mode, how internally consistent the data are. Two different classes can have the same mean and a totally different kind of variation behind the scenes. So when you're talking about quantitative data, and you have a whole data set, and you do the measures of central tendency, like mean, median, and mode, it doesn't tell the whole story. You have to also add on the information about variation. And the calculations that we're going to learn in this lecture are ways to express how much the data vary in the data set. It's just separate from central tendency: central tendency is about central tendency, and variation is about variation. And you need to know both before you can really evaluate your data set. So we'll get started on ways to calculate these measures of variation.

So, as I said, I'm going to go through range first. Then I'm going to talk about variance and standard deviation. And I just want to remind you, you know how I'm always going on about sample statistics versus population parameters? Well, that starts playing in here, in that the formulas for the sample variance and standard deviation are slightly different from the ones for the population variance and standard deviation. So we'll go over those separate formulas. Finally, we're going to talk about the coefficient of variation, or CV, but we'll do that after these other ones. Okay, so we're going to start with the range, because it's the simplest to calculate. So here's how you do it. You'll notice on the right, I just made up five numbers. I just totally made them up.
I don't know what they are. I just did that for a demonstration, because the range is the difference between the maximum and minimum value. So it's pretty easy to calculate. You have to first search around for the highest, or the maximum. This little data set is so cute, it's only got five numbers, so it was obvious that 78 was the highest, right? And it's sort of obvious that 21 is the lowest. So how you calculate the range is you take the highest minus the lowest, and then you get a number, and that's the range. Sometimes my students actually take the highest, and then they put minus, and then the lowest, and then they tell me that's the range. And I'm like, no, you actually have to subtract it out. So you'll see here, it says 78 minus 21 equals 57. So it's 57. That's the range. So all it's telling you is the distance between the top and the bottom. And I'll just say that that's not very useful. In fact, I had a problem with that when I was working on an Army database. I looked at the range of ages of soldiers when they started, and the range was age 4 through 107. Alright, obviously there was a problem with the data, right? For some reason, there was a screwed up record that said somebody got in when they were four, and there was another screwed up record that said somebody got in when they were over 100. They were just screwed up data, okay? And that caused me to have this ridiculous range. So the range is not very stable or resistant, right? If we just fixed that record that said somebody was four when they got in the Army, then we might have a normal range. For a minimum, we might see 18, or 17, or 19, or something. But, as you can see on the right side of the slide, I just picked out the minimum and the maximum. We could arbitrarily change those numbers, and suddenly we'd have something totally different from 57.
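The range calculation, and how easily one bad record changes it, can be sketched in a few lines of Python. Only the minimum of 21 and the maximum of 78 come from the lecture; the three middle values and the bad record of 178 are invented for illustration, and `data_range` is just a name chosen here.

```python
def data_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

# 21 and 78 are from the slide; 34, 45, and 60 are made-up fillers.
numbers = [21, 34, 45, 60, 78]
print(data_range(numbers))           # 78 - 21 = 57

# One screwed up record is enough to give a totally different range.
with_bad_record = numbers + [178]    # hypothetical data-entry error
print(data_range(with_bad_record))   # 178 - 21 = 157
```

Notice that the three middle values never enter the calculation, which is part of why the range says so little about the data set as a whole.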
So as you can see, even though the range is a measure of variation, it's not stable and resistant. And it actually kind of doesn't tell you much. If I say we've got a range of 57, you don't know if the minimum is like zero, or negative, or like 105. You really don't know where that range sits. So it's not very useful. But it's a place to start, because that's our first measure of variation. You'll sometimes see ranges stated in articles, but they usually don't state the actual number I told you to calculate. They actually state the minimum and the maximum, and sometimes that's interesting.

Now we're going to get into what we really use in statistics a lot: variance and standard deviation. That's what we really live on in statistics for measures of variation. And you're probably wondering why I'm talking about them together when they're different calculations. Well, it's because they're friends. Okay? And how are they friends? Well, the variance calculation is kind of a big formula. So you get through that, and then you have the variance. And then all you have to do to get the standard deviation is take the square root of the variance. So that's why they're friends: you go through all this trouble to get the variance, and then the next step is just take the square root of that, and you get the standard deviation. So before I actually talk about those formulas, I wanted to just set in your head what these words mean. I remember I worked in a mental health place, and we didn't have enough licensed people there. So our leader said, oh, I'm applying to the state for a variance, right? Meaning that the state would allow us to vary from the rules. Well, that's what variance is: how the data vary. So you think of the spread of the data, and how well does the mean ever represent that spread? It doesn't, right? So variance is a way of representing how the data vary around the mean.
Now, you're probably wondering, well, then why do you even have standard deviation if it's just the square root of variance? But let's just think about what the words mean. You know, standard means sort of following a standard, or the same. So it's just the amount of variation that's standard in the data set. And you know what the word deviation means. Like, you can say, oh, that person is a social deviant because they do crimes or something. Or like this guy with a healthy nose, he does not have a deviated septum. But, you know, some people do have a deviated septum, where it's crooked, and they have trouble sneezing and blowing their nose and sometimes even breathing. Well, a standard deviation would simply mean that everybody's deviation is about the same. So variance is a calculation that says how much things vary, and so is the standard deviation, because it's just the square root of the variance. But I just want you to imagine in your head: standard deviation, that means how much the data deviate around the mean. Because a lot of times students get confused about the measures of central tendency and try to apply them to variation, but variation is a totally different thing. So just remember what variance literally means, and what standard deviation literally means, and that might help you get through these formulas and understand the interpretation. So as I mentioned earlier, the formulas for variance and standard deviation are different depending on whether you're talking about a sample or a population. And, admittedly, we don't use the population variance or population standard deviation calculation very often, because we don't measure the whole population that often. So we tend to use the sample variance and sample standard deviation all the time. So I'm going to demonstrate those.
But you'll notice conceptually, they're really similar. Like, you know, population parameters like mus and population standard deviations tend to behave similarly in formulas to their sample versions. It's just that in statistics, we always want to be really clear about what we're talking about. So we always want to use the right symbol to signal whether we're analyzing a sample or a population, even though conceptually a mean is a mean, right? But whenever you write out the formula, you want to show which mean you're talking about, one that's a parameter or one that's a statistic. So I'm just being picky about that. And then there's something else you want to know: there are two different ways of actually writing each of these formulas. You know how in algebra you can have a big equation, and you can express it more than one way? So that's all they do. They put the formula one way, called the defining formula. And then they put the same formula, but rearranged by algebra, into the computational formula. Now, I always think it's kind of funny that they call it computational, right? I mean, both formulas give you the same result. It's just plugging in numbers and getting out the answer, and the answer is going to be the same whether you use the defining formula or the computational formula. But what I think is so funny is they call it the computational formula, but I cannot compute with it. I always get confused when I use it. So I pretty much ignore the computational formula in my entire life, and I just teach the defining formula. And I find my students always remember the defining formula. They always can get through it.
Although people who are into the computational formula tell me that I'm doing things the hard way, I'm going the long way around. But you know what? Going the long way around helps you not get confused, and helps you convince yourself you actually got the right answer. So let's just do the defining formula. All right. You can look up the computational formula, but this is the defining formula, so let's just get our minds wrapped around that. Remember, I told you that variance is great, because you calculate that, and then you just take the square root of it, and you get the standard deviation. So as you can see on the left side of the slide, we abbreviate the sample variance by just saying s, which is the standard deviation, to the second. I know that sounds ridiculous, right? Like, why don't we have a special symbol just for the variance? Why do we just say it's s to the second, and then say the sample standard deviation is just s? Well, actually, to be honest with you, people use different notation. I'm just using this because it matches the textbook we're using. But people will often say var for variance, and other textbooks and statistical software will do that. But they'll also say s to the second like this, and it's maybe a good way of remembering that the standard deviation is just the square root of the variance, right? So if you ever see s to the second, remember, s is the sample standard deviation, and s to the second is the sample variance. And I'll show you the population one in a minute. But if you see those, that's what they're talking about. Okay. Now, let's look upstairs at the top formula. See this thing on the top? It's really kind of scary, but we're going to work through this, and you're not going to be scared of it. Okay. I know you know that there's a little sum sign there, that capital sigma, so you know something's going to get summed up.
But that x minus x bar to the second thing looks kind of scary. We'll handle that, okay? The n minus one on the bottom, that's not so scary, and we'll handle that one too. And then you'll notice, all I did for the bottom formula is put this huge square root sign over that whole thing. So that's the only difference between the upstairs and the downstairs. And then I also wanted to show you a picture of a calculator, because a lot of times, if you haven't really done math or statistics for a while, you forget the whole concept of square root. And I'll just remind you: the square root of a number is the value that, if you times it by itself, gives you that number. So remember, if you put 25 in your calculator and hit that square root button, you'll get five, right? Because five times five is 25. However, if you put in 24, you're going to get something with decimals, right? But whatever you get, if you times it by itself, you'll get 24. So I just want to remind you of that, because sometimes people forget if they haven't been doing statistics or math for a while, or they haven't used the calculator for a while. All right, I told you I'd talk to you about this numerator. The top is the numerator in a fraction, and the bottom is the denominator. So the sum of x minus x bar squared, that's how I would say it, this little piece of the formula is actually called the sum of squares. So from now on, when I say sum of squares, I literally mean the top half of this equation. So what you do when you do the defining formula is you just kind of relax and say, the first thing I'm going to do is figure out the sum of squares. I'm going to figure out the top part.
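Since the slide itself is not visible in a transcript, here are the two defining formulas as described above, written out in LaTeX. This is a reconstruction from the spoken description (sum of squares on top, n minus one on the bottom, and a square root over the whole thing for the standard deviation), not a copy of the slide.

```latex
% Sample variance (defining formula): sum of squares over n - 1
\[
  s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
\]

% Sample standard deviation: the square root of the sample variance
\[
  s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
\]
```

Here x stands for each value in the sample, x bar is the sample mean, and n is the number of values in the sample.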
And then I'm going to just write that down, and later I'm going to come back to this formula and enter it. So this next part is, how do we figure out that top part of the equation? How do we get the sum of squares? I'll show you. Okay, so let's just look at the slide. On the left, there's this blank table, and that's usually what I do first: I make this blank table. You don't need to write column one, column two, column three. I just put that there so I could talk about the columns and you'd know what I was talking about. But usually what I put is x in the first column, and then x minus x bar in the second column. I wrote out the word minus, but you can just use a dash. And then in the third column I put, in parentheses, x minus x bar, to the second, like that. Remember, when you have parentheses, you have to do what's inside the parentheses first. So this means you literally have to do x minus x bar before you square it. And I'm just walking you through this to get you ready for what we're going to do with this table. On the right of this slide, I'm just reminding you that the sum of x minus x bar to the second, in other words, the sum of whatever is going to be in column three, is another way of saying the sum of squares. Okay. So an easy way to explain what the squares are is to just show you how to calculate them. So I just pulled out a data set. Imagine a sample of six patients presented to a central lab. This happens to me when I go to my doctor. Sometimes she'll say, you know, it's time to do a lab panel for you. So she gives me this slip of paper, and I go downstairs to the central lab, and I give them the slip of paper, and they say, okay, sit down, and we'll call you up and draw your blood or whatever. So we're imagining six people did that. And when they got up to have their blood drawn, we asked them, how long did you wait? Okay. And at my central lab, I literally do wait two minutes. That's a really good lab.
But sometimes it's really busy, like if I go during lunch, and I'll wait something like 10 minutes. So here are six patients. One of them waited two minutes, a couple of them waited three minutes, and probably the other three came in during lunch, because they waited eight minutes, 10 minutes, and 10 minutes. Okay, so that's our data. It's a little tiny data set, but I just wanted to use something small to show you how to calculate the variance, and then the standard deviation. Okay. So what's the first step? After making the blank table, you fill in the first column, which is called x. So what is x? Each of these patients' waiting times is an x. Remember sum of x? If we said sum of x, we would mean add all these x's together, right? So that's all I did. I just put each x in the column. You'll see 2, 3, 3, 8, 10, 10. It's identical to these x's. And then at the bottom, I put that little fancy sum of x, and it says 36. Okay, so that's just the first thing you do: put them all in and do the sum of x. All right. Now for the next step, don't look at the left side of the slide yet, look at the right side. Before you go and fill in column two, you have to do x bar. In other words, you have to figure out the mean. Now, you can kind of cheat, because you just figured out the sum of x. And if you remember the formula, the mean, or the x bar of the sample, is the sum of x divided by n. And remember, I told you we had six patients, so you just take 36 divided by six, and you get six. Now you just hold that number. So between column one and column two, you've got to calculate x bar and hold it, right? And while you're holding that off to the side, you realize how we're going to fill in column two. In x minus x bar, the x bar is just six.
But we have to go through each x and subtract x bar from it. It's helpful to order the x's before you do this. Notice I put them in order: 2, 3, 3, 8, 10, 10. It's a good idea to do that, because it helps your brain check whether or not you're doing the right thing. So let's start with the two. We do two minus six, which is the x bar. Now you can look at column two: two minus six equals negative four. I hate negative numbers, but you just have to deal with them sometimes. Okay, so it's negative four. Then you go to the next x, and it's three minus six, which is negative three. So we're still under water here with the negatives. But you'll notice that the next x is also three, so you can kind of copy what you just did, and you get negative three again. The slide shows the equation, but what you're technically filling into this column is negative four in the first row, negative three in the second, and negative three in the third. And now, finally, the fourth x is eight. So eight minus six: we got above water, now we're at two, right? And then we have 10 minus six, which is four, and 10 minus six again, which is four. And when you order them like that, that's always what happens: you end up with a bunch of negative ones at the beginning and a bunch of positive ones later. That's totally normal. Don't worry about that. But you've got to be careful. You've got to make sure you get the right mean. I've had people on tests actually screw up this mean, and you can just imagine the train wreck that happens after that: you do not get anything right after that. So make sure your mean's right. And then make sure you subtract it from every single x and put the right answer in column two. That's the next step. All right. Okay, so we're done with that step. What do we do next? Now we just take whatever we got in column two and square it. The first one was negative four.
Remember, a square is just the number times itself. So if you don't like to use the x to the second button on your calculator, you can just do negative four times negative four. Same thing. And you'll notice when we do negative four times negative four, we get 16. Now, the rest is pretty easy: negative three times negative three is nine, and two times two is four. But what I want you to really look at is the 10s. Notice that they get a 16 too, just like the two did. And that's the trick here. Remember, I said I hate negative numbers? Well, a lot of statisticians feel the same way I do. So they often fix it by squaring the number, because it erases the negative. Just remember, negative times negative is positive, and positive times positive is also positive. That's a little trick when it comes to multiplying. And so when we do that, we are squaring each entry in column two, and they're called squares, right? So we've got 16, 9, 9, 4, 16, 16. Each of these is a square. So what do you think we do? We add up that entire column, and we get the sum of squares. So look at that: we add up that entire column, and we get that super complicated looking thing at the bottom, which is the numerator for our variance equation, right? This wasn't really that hard, was it? Okay, so we sum that up, and as it turns out, we get the number 70. So 70 is our sum of squares. All right. Now we're back at the sample variance formula. And I'm so excited, because look at the top of the formula: we have the answer. It's 70. Okay, so we got that 70. But we still have to deal with the bottom of the formula. Remember, n was six, right? We had six patients, and the bottom of the formula is n minus one. So the bottom of the formula is going to be five, right? So let's fill this in. I was kind of running out of room, so I just filled it in upstairs. So you see 70 divided by five. Suddenly this looks super easy, right? So 70 divided by five is 14. Okay? That's the variance. Totally easy, right?
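The three-column table and the variance calculation can be reproduced with a short Python sketch. The six wait times are the ones from the lecture; the variable names are chosen here for readability, and the cross-check uses `statistics.variance` from Python's standard library, which computes the sample variance (dividing by n minus one).

```python
from statistics import variance

wait_minutes = [2, 3, 3, 8, 10, 10]       # column one: each x, already in order

n = len(wait_minutes)                      # 6 patients
x_bar = sum(wait_minutes) / n              # 36 / 6 = 6

deviations = [x - x_bar for x in wait_minutes]   # column two: x minus x bar
squares = [d ** 2 for d in deviations]           # column three: (x minus x bar) squared

sum_of_squares = sum(squares)                    # numerator of the variance formula
sample_variance = sum_of_squares / (n - 1)       # divide by n minus one

print("x   x - x_bar   (x - x_bar)^2")
for x, d, sq in zip(wait_minutes, deviations, squares):
    print(f"{x:<3} {d:>9.0f}   {sq:>13.0f}")

print("sum of squares:", sum_of_squares)         # 70.0
print("sample variance:", sample_variance)       # 14.0

# Cross-check against the standard library's sample variance.
print(variance(wait_minutes) == sample_variance) # True
```

Building the columns as separate lists mirrors the hand calculation, so a wrong mean or a missed negative sign shows up in the printed table before it reaches the final answer.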
I mean, it's tedious, right? You have to make that whole table and add things up and stuff. But it's not really that hard. Now, guess how we're going to get the standard deviation. You've probably guessed it: we're just going to take the square root of 14. So remember that button on your calculator? You put in 14, hit that button, and you get 3.74 and a bunch of other digits, but I just chopped it off at 3.74. So that is your sample standard deviation.

Now, I promised you I would talk about the population formulas for standard deviation and variance, as well as the sample ones. And I told you they wouldn't really be conceptually much different. As you can see on the slide, I made things red so you can see what the differences are. On the left, the sample variance is expressed as s to the second, but the population variance uses another Greek letter. Remember, I told you that the sum sign was capital sigma? You know, Greek is like English, in the sense that they have capital and lowercase letters. Well, that thing that I always think looks like a jelly roll is actually lowercase sigma. I'm never going to say lowercase sigma, except for now. I'm going to say population variance and population standard deviation. So you'll see at the bottom of the slide, the lowercase sigma alone is the population standard deviation, and the lowercase sigma to the second is the population variance. So just remember, if you see that jelly roll thing, we're talking about a population version of the standard deviation or variance, and not the sample one. Also, you already know about mu versus x bar, right? We have x bar on the left, and that's the sample mean, and mu on the right, which is the population mean. And you also already know about n, which is the number in your sample.
And this is where there's actually a big difference. In the sample, you have to do n minus one on the bottom, and in the population, you just use capital N, the whole population. And if you think about it, it makes kind of sense, because populations are huge, so it won't even matter if you subtracted one. Whereas, you know, samples are small, so you have to adjust, so you have to minus one. But with a population it wouldn't even matter: if people make a mistake and accidentally minus one in the population formula, they don't get much of a different answer. And so that's why I'm concentrating on the sample ones. That's what we normally do. But I wanted to give a shout out, just so you know, if you ever see the formulas on the right side of the slide, they're population level formulas.

Alright, now we're going to move on. We made it through range, variance, and standard deviation. So now we're going to talk about the coefficient of variation. And this is used a lot for comparisons, often for comparing between two different labs. I say that because my friend's a pathologist, and the first time I actually used this in medicine was when we were comparing lab values on the same assay from two different labs. I just wanted to explain to you, this might be the first time you've heard the word coefficient. And that gets a little confusing for people who are new to statistics, because the word coefficient is actually just kind of a generic term for certain kinds of numbers. So you'll hear somebody say coefficient of variation, and you'll hear somebody say coefficient of something else. And most people haven't even heard the word coefficient. It just means a certain kind of number. So if somebody says, oh, the coefficient is not good, or it's high, or whatever, you need to ask them, what coefficient are you talking about? In other words, coefficient doesn't mean a specific thing.
It just means a number that comes out of statistics. And so you have to know which coefficient they're talking about. So this may be the first time you've heard the word coefficient, and the first specific coefficient I'm going to talk to you about is the coefficient of variation. Now, as we go through this textbook, there are other coefficients in it. So please remember, this one is the coefficient of variation, right? And a way to remember it is CV for short. Other coefficients have different abbreviations, but the coefficient of variation is CV. So I put the formulas on the right side of the slide, and nobody seems to have any trouble doing the formula, right? Because once you calculate the standard deviation, the sample one or the population one, as you can see in the formulas, and once you calculate x bar, which is the mean for the sample, it's pretty easy to do the division. And then they like it when you do it in percent. You'll notice that about statistics: certain things they prefer as proportions, and certain things they prefer as percents. It's just like our culture, in a way. And so the coefficient of variation is always expressed as a percent. So you have to times that by 100 and then put a percent sign after it. But really, that's pretty easy to do. You take the standard deviation. You'll see I did it for our patients: 3.74. It took us all that work to get there, right? Remember, square root of 14. And then remember, our x bar was six. We needed that earlier for column two. So I just dumpster dove those numbers, and then did this calculation out, and I got 62%. And students generally don't have trouble getting that number. But the problem is, what does the number even mean? Right? Like, what does it mean if you divide the standard deviation by the x bar and times by 100%?
And how do you interpret that percent? So the easiest way to talk about it is to actually compare it with something. Because one thing you'll also notice in statistics is if you make ratios of things measured in the same units, they don't have any units. So if I take your blood pressure, like your systolic blood pressure, and I say it's whatever, 130 mmHg, and I divide that by your diastolic blood pressure, which is also in mmHg, suddenly I get a ratio, and that doesn't have units, right? It doesn't have mmHg or anything like that. It's the same with the CV: the standard deviation and the mean are in the same units, so the units cancel. And if I do that for a bunch of people, none of those ratios have any units, and so they technically could be compared to each other. So you'll see that that's a strategy in statistics: they'll make ratios of things, and those don't have any units. So it's, you know, sort of lacking in that way. But the power is you can compare these ratios. So I decided to just pull out other patients. I just made up other patients, right? I pretended we went back to the lab the next day and gathered some data, and we came up with, I just made this up, an x bar of eight and a standard deviation of four. It's a little close to what we had before, right? An x bar of six and a standard deviation of 3.74. But anyway, in this next sample of patients, the s of four divided by the x bar of eight, times 100, equals 50%, and not 62% like the other one. So how do you interpret that? Well, the CV is a measure of the spread of the data relative to the average of the data. So in the purple sample, the one with the 50%, the standard deviation is only 50% of the mean. But in the red sample, the one with the 62%, the standard deviation is 62% of the mean. So what I would say is that the red sample has more standard deviation compared to its mean. And that means it's less stable, right? It's got more variation compared to its mean. So it's less stable. It moves around a lot.
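The two CV calculations can be checked with a small Python sketch. The numbers are the ones from the lecture: a standard deviation of the square root of 14 (about 3.74) with a mean of 6 for the six wait times, and a made-up standard deviation of 4 with a mean of 8 for the next day's sample. The function name `coefficient_of_variation` is a label chosen here.

```python
import math

def coefficient_of_variation(std_dev, mean_value):
    """CV = standard deviation divided by the mean, expressed as a percent."""
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean_value * 100

cv_six_patients = coefficient_of_variation(math.sqrt(14), 6)   # s is about 3.74, x bar is 6
cv_next_day = coefficient_of_variation(4, 8)                   # s is 4, x bar is 8

print(f"six patients: {cv_six_patients:.0f}%")   # 62%
print(f"next day:     {cv_next_day:.0f}%")       # 50%
```

The sample with the smaller CV has less spread relative to its own mean, which is what makes the two samples comparable even though their means differ.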
So if you said to me, what if these were actually two different labs, I would say, you know, I prefer the purple lab, the one with the 50%, because it’s more predictable. I know it’s gonna have less variation, because it’s 50%. And the 62% means that that one is less predictable. It’s a little hard to see in this example. But what happens is, if you have two different labs, and you’re looking at this, like maybe you split a blood sample or a bunch of blood samples and send half to one lab and half to the other, well, you’re supposed to get the same mean and the same standard deviation, right? They’re the same blood, you just split it. But sometimes you don’t. Sometimes you get something like this, in which case, if you’re comparing labs, you would go with the purple lab and not the red lab, because they produce a more predictable result. So CV is a little hard to interpret. But it’s easy to calculate. So that’s one awesome thing about it. Now we’re gonna move on to Chebyshev and his theorem. So Chebyshev figured something out a long time ago. And this is how he started thinking about it. He first started thinking, well, let’s say you have an x bar and an s, like we just did with the CV. He noticed something else about it. He didn’t notice the CV; he noticed that you can create a lower and upper limit by subtracting the s from the x bar and adding the s to the x bar. So remember back when we were making frequency tables, and I said, well, we need to make class limits, we need to make a lower class limit and an upper class limit? Well, we use that terminology a lot, like lower limits and upper limits. Well, Chebyshev was like, wait a second, I got an idea. Let’s say I take a mean. And, you know, this will force the mean to be in the middle of this. I can subtract one standard deviation from it, and I’ll get some sort of lower limit, and I’ll add a standard deviation to that mean and get some sort of upper limit.
And of course, let’s pretend the standard deviation was one; you’d subtract one from the mean and add one to it. And so this would be totally symmetrical, right? The x bar would be in the middle, and then it’d be surrounded equally by these two standard deviations. And I’m just saying standard deviation generically, because you could do this with a mu and the population standard deviation, too; you can do the population version. So he just sort of figured out that’s a thing that can happen: you can add and subtract a standard deviation from the mean, and you can get these limits. And so, as an example, let’s say I have a mu. So I’m gonna pretend I have a population with a mu of 100. I don’t know what I measured, but I got 100, and a population standard deviation of five. So Chebyshev was thinking, you know what I could do, I could take that 100 and subtract that five from it, and I get 95. I could take that 100 and add five to it, and I get 105. And so he just started working with this concept, like I could subtract and add a standard deviation. And then he thought, wait a second, I could even do this with two standard deviations, right? So if it was five, I could take that times two, that’s 10. And so I could do 100, subtract 10, and I get 90 for the lower limit, and 100, add 10, and I get 110 for the upper limit. And so I can make this range or this interval, right? From the lower limit to the upper limit, we call it an interval. And so he just sort of conceptually realized that if he used some rules along with this, there might be some useful interpretation of these limits, right? There might be some way to use these limits to mean something. So we’re going to look at how he figured out how to use, you know, one standard deviation on either side of the mean, or two, or three, or four multiples of these standard deviations on either side of the mean, to actually come up with some lower and upper limits that meant something.
So he realized that what these lower and upper limits would mean is that at least some percent of the data would be between these limits. So in other words, some percent of the x’s would be between the lower and the upper limit. But that percent would depend on how many standard deviations you’re going out, right? Like, is it one, is it two, is it three? The more you go out, obviously, the more of your data are covered by the limits, because they’re just huge, get it? The interval is so big it almost covers the whole thing. So you would expect that percentage to go up as the number of standard deviations you use goes up. So he was working this out, and he came up with this formula, right? And he wanted this to work for all distributions, like normal, but also skewed, and also uniform and bimodal. So this was the formula he came up with. Now, in this formula, see at the bottom, k stands for the number of standard deviations, or the number of population standard deviations, that he’s going to use, right? So let’s pretend that he made k be two, like two standard deviations, right? Then you’d see this: it says one minus one divided by k to the second, which would be two to the second, and two to the second is what, four. So one divided by four is 0.25. And so one minus 0.25 is 0.75. Well, you make that a percent, and it’s 75%. So he’s like, okay, that’s what I’m going to say. If you go out two standard deviations up and down, and you make those upper and lower limits, at least 75% of the data, of the x’s, are going to be there. At least; there might be more, but it’ll be at least that. So he did this: he used two, and he used three, and he used four. So two standard deviations either way, three standard deviations either way, or four standard deviations either way. Now, students in my class often think that they have to memorize this one minus one over k to the second. You don’t have to memorize it.
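The formula described above, one minus one over k squared, can be sketched in a few lines of Python. The function name chebyshev_min_fraction is made up for this illustration; the k values are the two, three, and four used in the lecture.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of the data within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives zero or less, so it says nothing useful.
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4):
    # Prints the guaranteed minimum as a percent with one decimal place.
    print(k, f"{chebyshev_min_fraction(k):.1%}")
```

Running it shows the guaranteed percentage going up as k goes up, which is the pattern the lecture predicts.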
This was just a story of how Chebyshev did this proof. So you can memorize it for fun, but nobody memorizes it. I mean, you know, Chebyshev did the work. I’m just showing you the proof, right? So he figured this all out. So as you can see, you can do this with two, three, and four, and you’ll get the same answers Chebyshev does. So it’s kind of a waste of time, but you can do it just for fun. So he did the two one; I showed you that on the top. I even talked you through it. So if you plug two into the equation, you’ll get 75%. So in that thing I was just talking about, like imagine I had 100, right? And that was my mean, and my standard deviation was five, right? And then two times that is 10. So I go, well, my lower limit then would be 90, and my upper limit would be 110. And I would be able to confidently say at least 75% of my x’s are between 90 and 110. So if I’d measured maybe 100 people, right, I’d say at least 75 of them are going to be between these limits. In fact, it could be 80, could be more, but at least 75. So then remember, I told you to predict that as we made this number bigger, you know, as we go out more standard deviations, we’re going to cover more of the data, right? So when we did three, it didn’t come out as even; it came out to 88.9% of the data. So at least 88.9%, almost 89%, will be covered if you go out three. And if you go out four standard deviations, it’s at least 93.8%. Right? And just to remind you, you know, when you have upper and lower limits, you have an interval, right? We just call it that. But this particular interval, if you get it this way, is Chebyshev’s interval, because everybody’s so happy he did all this work, right? Because I wouldn’t have figured it out. So I just wanted to demonstrate an example of Chebyshev’s interval, because then you can know how to interpret them or why anybody does them. Okay, so remember our patient sample? They’re in the waiting room at the lab, right?
So they waited, on average, six minutes, and then the standard deviation of them waiting was 3.74. Right? Now, when I gave you this demonstration of how to calculate the standard deviation, I used this patient sample, and I only had a few patients in the sample on purpose, because otherwise the table that we made with the defining formula would be huge, and I’d never finish this video. So what I’m gonna ask you to do is pretend that instead we had 100 patients in there, right? Instead, I measured 100, and I got my x bar of six and my standard deviation of 3.74. Okay, so if we measured 100 patients, and we got that, I just put these Chebyshev rules in that table. So if we go out two standard deviations from the mean, from the x bar, on either side, whatever limits we get, whatever interval we get, we know, because I made it so we studied 100 patients, that by Chebyshev’s theorem at least 75 of those patients will be between those lower and upper limits. And if I go out three standard deviations, at least 88.9 patients will be in there. Okay, I know that doesn’t make any sense, like 88.9 patients, how do you have point nine of a patient? But what they’re saying is, I guess it would be 89. All right, yeah, 89% of the patients, or in other words, at least 89 patients, would be in that interval. And of course, if I went out four, I wouldn’t have to say 93.8 of a patient, but at least 94 patients would fit in that interval. And if you’re thinking about it, if we only start with 100 patients, that’s almost all of them. So the four one isn’t so useful, right? So you’ll see me on the left side of the slide calculating the intervals, right? So let’s start with the first one. The first one is two standard deviations on either side of the mean. So the Chebyshev interval we get is negative 1.48 to 13.48. And you probably notice you can’t wait negative time. So already, this is kind of weird, right?
But what this is saying is, of our 100 patients, at least 75 of them, because this is the 75% Chebyshev interval, waited between negative 1.48 minutes, which you might as well round to zero, so between zero minutes and 13.48 minutes, right? And so at least 75% of them fell in that range. Now, 13.48 minutes is kind of long. So we would be happy, I guess, if 75% of them fell in that range, because then that means that they were probably not waiting that long. But if you go out, then you widen this interval, like to 88.9. If you do that, then you say at least, well, round it to 89, 89% of the patients waited between negative 5.22 minutes, which you might as well make zero, and 17.22 minutes. So as you see, if we widen the interval, we’re going to get some later waiters in there. And so then we’ll say, well, at least 89% were between there, and that means the interval is bigger, right? And then again, we go out one more, and we get 93.8%. So let’s just round it to 94. So at least 94% of the patients, or if we have 100 patients, at least 94 of them, waited between negative 8.96 minutes, which again is nonsensical, up to 20.96. But then we’re starting to get where we’ll have almost all the patients somewhere between zero and 20 minutes, and we really don’t know how long they waited. So this is just kind of to show you what happens when you widen that interval: you maybe have less certainty about what individuals did, but sort of a better idea of what the range is. So again, I just put this at the bottom. If we had 100 patients, this is how you would interpret it: at least seventy-five would have waited between the lower and upper limit for the 75% Chebyshev interval. And then at least 88.9 patients, I know, nonsensical. And then the 93.8. So you see that interpretation on the lower part of the slide. So this is a really difficult concept for a lot of students. And so I’ll just give you this take-home message.
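To make the three waiting-room intervals concrete, here is a small Python sketch that recomputes them from the x bar of 6 and the standard deviation of 3.74. The helper name chebyshev_interval is an assumption for illustration only.

```python
def chebyshev_interval(mean, std_dev, k):
    """Return (lower, upper) limits k standard deviations either side of the mean."""
    return mean - k * std_dev, mean + k * std_dev

x_bar, s = 6, 3.74  # waiting-room sample from the lecture, in minutes

for k in (2, 3, 4):
    lower, upper = chebyshev_interval(x_bar, s, k)
    min_percent = (1 - 1 / k**2) * 100  # Chebyshev's guaranteed minimum
    print(f"k={k}: {lower:.2f} to {upper:.2f} minutes, at least {min_percent:.1f}%")
```

Since nobody can wait a negative amount of time, in practice you might report the lower limit as max(lower, 0), which matches rounding the negative limits up to zero minutes.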
First of all, Chebyshev’s interval works for any distribution: normal, skewed, whatever. The reason that’s part of the take-home message is that later, we’re going to learn about intervals that only work with normal distributions. Okay? So this one is loosey-goosey. It works with all distributions. So that’s one of the take-home messages for Chebyshev’s interval. Also, Chebyshev’s interval tells you that at least a certain percent of the data are in the interval. Later, we’re going to learn about intervals where exactly a certain amount of data are in that interval. And so Chebyshev is again a little loosey-goosey, right? He says at least. Next, Chebyshev intervals are sometimes nonsensical, as we just talked about. Negative time doesn’t work, right? Sometimes you’ll have very high limits, especially with a four. And so ultimately, they’re not very useful. And they’re not used in health care. I literally had never heard of Chebyshev’s interval until I started teaching this class. So what is the purpose of teaching you Chebyshev’s interval? The purpose of teaching this is to point out that in statistics, we often use the s or the population standard deviation, you know, just standard deviation, and adding and subtracting it from the mean is a good way of making lower and upper limits that have special significance. That’s really the main take-home message: you’ll see this pattern as we go through this class, where we get a mean, either population or sample, you know, x bar or population mean. And then we have a standard deviation, right, either from a sample or a population. And then we take either one standard deviation, and we add and subtract it, or two, or multiples. And those intervals then have certain significance. I only taught you in this one about Chebyshev, but you’ll learn about other intervals later that are made similarly.
So in conclusion, what did we learn? We learned how to calculate the range, and we learned how to calculate the variance and standard deviation. We learned how to calculate the coefficient of variation and how to interpret it. And we talked about the difference in the formulas for sample versus population. And we learned about Chebyshev and his theorem, how he figured it out, how we calculate his intervals, and how you interpret them. Now I just thought I’d show you this picture of Chebyshev here. He’s a Russian guy. Well, the stamp was from the USSR, before the Iron Curtain fell. But I just thought I’d show it to you, so you knew who figured all this out. Good job, you’ve made it through the measures of variation. And now you’re ready to do what? The quiz, the homework, whatever, right? You’re totally knowledgeable. Good job. Well, I’m back. And so are you. Welcome to Chapter 3.3, percentiles and box-and-whisker plots. It’s Monica Wahi, Library college lecturer. And this is what we’re going to talk about, and this is what you’re going to learn. At the end of this lecture, the students should be able to explain what a percentile means, describe what the interquartile range is and how to calculate it, explain the steps to making a box-and-whisker plot, and also state how a box-and-whisker plot helps a person evaluate the distribution of the data. So let’s get started. You know, whenever we talk about a box-and-whisker plot, I think of some cute little animal with all those whiskers. I’ll explain what the whiskers really are, I mean, not on the animal, but on the box-and-whisker plot, later. So what are we going to go over? We’re going to go over percentiles, and we’re going to explain what those are. Then we’re going to talk about quartiles. Sounds a little similar, it’s got the “tiles,” and you’ll understand why they’re similar. Then we’re going to compute quartiles. And then finally, we’re going to do the box-and-whisker plot. All right. So let’s go.
So percentiles. We’re going to have a flashback, okay? You’re not going to like this little part, because it’s going to remind you of standardized tests. So maybe not all of you have been subjected to this, but most of us have. If you’ve gone to high school in the US, you probably got to deal with these standardized tests. So just remember, we’re only talking about quantitative data. All right. So if you take a standardized test or a non-standardized test, you usually get points. And points are numerical. So that’s quantitative data. So I remember I used to take the standardized tests, and I’d be, you know, showing my friends what I got, right, because they’d send you that thing in the mail. Now, I learned pretty early on that it mattered who all was in the pool of people taking the test with you, right? So if you’re taking the test with a lot of stupid people, it’s easier to get a higher percentile. Because what percentile means is, for example, if you test at the 77th percentile, it means you did better than 77% of the people taking the test. And a lot of those standardized tests, they didn’t care how many points you got; what they cared about is what percentile you were at. So different batches of people would have different scores. And if you got lucky, got a lot of stupid people, then your percentile would be higher. So it didn’t really matter what your absolute score was; it just mattered what your percentile was. So just to sort of remind you, if somebody had come up to me in high school and said, I got the 77th percentile, what I’d say is, okay, if only 100 people had taken the test, you’d have done better than seventy-seven of them. Of course, we were all braggy, braggy; you know, I was always in like the 95th, or the 97th, or the 98th. And it happened so often, I wondered if it was really true.
But what I realized is that there were so many people in the pool, because, you know, I was in public high school in Minnesota. Well, they were pooling together all the public high schools in Minnesota; in ninth grade, you know, I was pooled with them, in 10th grade or whatever. And when you’re taking, like, nursing examinations, sometimes they’ll do that; they’ll put you on a percentile. So I try to tell people, you know, strategize, try to take it when only stupid people are taking it, which of course makes no sense. How can you tell when stupid people are taking it, right? You don’t even know who’s taking it. But really, that’s what a percentile is: it’s the percentage of people that you did better than. If you’re at the 77th percentile, then you did better than 77%. Okay, so here are just some rules about percentiles. First of all, you know, I gave the example of the 77th percentile. Well, the rule is it has to be between 1 and 99. Like, you can’t have the negative second percentile, or the 105th percentile. So that’s the first rule. Then whatever number you pick, like I was saying, that percent of the values would fall below that number, and 100 minus that number percent of the values fall above it. So here, we’ll give an example. Twenty people take a test, just 20, right? Let’s say there’s a maximum score of five on the test. The 25th percentile means that 25% of the scores will fall below whatever score that is, and 75% will fall above that score. So let’s say it’s an easy test. And let’s say out of my 20 people, 12 get a four, which is almost the total, right, and the remaining eight get a five. So everybody gets either a four or a five. Well, then, you know, the 25th percentile, or the score that cuts off the bottom five tests, right, will be a four, just because this was an easy test. You know, the first 12 people got a four and then the remaining eight got a five. So even the 50th percentile, then, would technically be at a four, right?
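The 20-person test example can be checked with a short Python sketch. This uses the nearest-rank rule for picking the score at a percentile, which is one common convention and an assumption here; the lecture does not name a specific rule, and the function name is made up for illustration.

```python
import math

def nearest_rank_percentile(values, p):
    """Score at the p-th percentile using the nearest-rank rule (1 <= p <= 99)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the sorted list
    return ordered[rank - 1]

scores = [4] * 12 + [5] * 8  # 20 test takers: twelve 4s, eight 5s

print(nearest_rank_percentile(scores, 25))  # 4
print(nearest_rank_percentile(scores, 50))  # 4
print(nearest_rank_percentile(scores, 75))  # 5
```

With so many tied scores, the 25th and 50th percentiles land on the same score, which is the point of the easy-test example: a percentile is a position in the pool, not a score.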
Now, this would all come out differently if it were a hard test, and most people got a score below three, right? And so the percentiles would be shifted down. I just tell you that so you can keep in mind the difference between the actual score and the percentile. So the percentile just happens to mean that this percent of people got a score lower than whatever your score is. It doesn’t actually say what your score was, right? So that’s what you just want to remember as we’re going through percentiles. Okay, now we’re going to talk about quartiles, and also the interquartile range. Remember the “tile” thing; this relates to percentiles. So I put a little quarter up there. So quartiles are a specific set of percentiles. And you’ll see why I put the little quarter up there. It’s because there are technically four quartiles; it’s just that the top quartile doesn’t count, because it’s like the 100% one. And remember, it can only go up to 99, like I was just showing you. So we calculate the first, second, and third quartile. So the 25th percentile is the first quartile. The 50th percentile, which is also known as the median, which you’re already good at, right? That’s known as the second quartile. And then the third quartile is the 75th percentile. So those are your quartiles: 25th, 50th, and 75th. And technically a 100th. But we never say that, right? Because it only goes up to 99. So you have the first quartile at the 25th percentile, the second quartile at the 50th percentile, and the third quartile at the 75th percentile. And these are actually not that hard to calculate by hand. So here’s how you do it, sort of an overview. So first you order the data from smallest to largest, because remember, we have quantitative data, so you can sort them. So you sort them smallest to largest. And this is feeling very median-y, right?
Well, guess what, step two is you find the median, because the median is also the second quartile, which is also the 50th percentile. So already, you know how to do this, right? Because you could already do steps one and two. Now, this is the harder part, this is the new part. Step three is where you find the median of the lower half of the data. Right. And so wherever you put your median, you pretend that’s the end, and you look at the smaller values, and you find the median of those. And that would be the first quartile, or the 25th percentile. Then finally, step four, which you probably guessed, is you find where your median was, and then you look at the upper half of the data, between the median and the maximum, and you make a median out of that part of the data, and then that’s your 75th percentile. Okay, and I’ll show you an example of us doing that. But this is an overview of the steps. Now, remember the range from before? Yeah, you remember it; that’s where we had the maximum minus the minimum, right? And I told you, you have to actually do out the equation and tell me what number you get. And that’s the range. Well, we have something new and improved in this lecture. Here, we have the interquartile range. Okay, so you already know about quartiles; we were just talking about them. But “inter” sort of means, like, within, right? So once you have the third quartile, and you have the first quartile, you can calculate the interquartile range, or IQR for short. So if you see IQR on here, just remember, that’s interquartile range. So that’s the third quartile minus the first quartile. And again, I’ll show you an example. This is just an overview. Okay, here’s the example I promised. On the right side of the slide, you will see a sample of data I collected. I went to AHD.com, that’s American Hospital Directory dot com, and that provides publicly available information about American hospitals.
So I went in, and I took a random sample of 11 Massachusetts hospitals. There are a lot more, so I took a random sample. And what I did was I wrote down how many beds each of those hospitals had. Because if a hospital has several hundred beds, they’re considered kind of a big hospital. And if they have less than 100 beds, they’re considered a smaller hospital. So I wrote all those numbers down. And then I already did step one of making our quartiles, which is to order the data from smallest to largest. So you’ll see on the right side of the slide, my smallest hospital had only 41 beds, and my largest hospital had 364 beds, and see, I put all of them in order; they’re on the right. And so we already did step one. So let’s go on to step two. So step two is to find the median, and that’s quartile two, or the 50th percentile. Now, you’re already good at that, right? And so we have 11 hospitals. So we know that the sixth one in the row is going to be the median, you know, because it’s an odd number of hospitals that I drew. And so the sixth one, we’ll circle it; that’s the 50th percentile, or the median. So we already got quartile two. It’s funny that you have to start with quartile two, but that’s what you have to do. Now, I just color coded these, so you could kind of remember what’s going on as we do the other steps. 126 is the median. That’s kind of not on anybody’s side; it’s not on the low side, and it’s not on the high side. The orange ones, then, are considered below the median. And the blue ones are considered above the median. And so I just color coded them so you can keep track of what’s going on in the next slides. Okay, now we’re going to do the 25th percentile for step three. So the goal is to find the median of the lower half of the data. So now you see why I color coded it: because now we’re pretending just the orange ones exist. And we are just finding the median of that.
And we’re not counting that 126, because that’s already been used. And so now we find that 90 is the 25th percentile. How you remember that it’s not the 75th, it’s not the third one, is because it’s the low one; like, 25 is a low number, and 75 is a higher number. So you go to the lower part of the data, you find the median of that, and that’s going to be your 25th percentile. And so in our case, that’s 90. Then, you probably guessed it, you go to the blue ones, right, the upper half, and you go get the median out of that. And so of course ours is 254. So that’s our 75th percentile. So what we just did is we calculated our quartiles. We have our 50th percentile, our 25th percentile, and our 75th percentile. So that’s what I meant by that overview slide. This is an example of how you would do that. And of course, I have to give a shout-out to the IQR, which is the interquartile range. Remember, you just learned that. So that’s the 75th percentile minus the 25th percentile. So in our case, that’s going to be 254 minus 90, which equals 164. So that is your IQR. So if I gave you a test, and I asked you what the IQR is for these data, you can’t just put 254 minus 90; you actually have to work it out and put 164. So there you go. So that’s our quartile example. So I just wanted to step back and give you some philosophical points on what happens with Q1 and Q3, depending on how many data points you have. Okay, so remember, the first step of this is always to put them in order from smallest to largest. So let’s pretend I had only drawn the first six values of my hospitals. See how on the slide I put the position of the number, which is 1, 2, 3, 4, 5, 6, and I put the example numbers above. So let’s say I was going to do the median on that. You know what I’d have to do: I’d have to take 90 plus 97, divided by two. But then the next question is, what do we do for Q1 and Q3?
Well, given that in the example of having six values, the 90 and 97 are mushed together for the median, they do get reused when looking at the bottom and the top half of the data. So when we go to do Q1 in this, we would actually count that 90 in there. In fact, Q1 would be 74, because that’s the median of the three numbers below the median, right below that line. And then Q3 would actually be 121, because we actually count the 97 in there. So in other words, when you have, like, six values, and the median is made out of mushing together two values, like taking the average of those two values, those two values get to double dip: the bottom one gets to be in the bottom half, and the top one gets to be in the top half, when calculating Q1 and Q3. Now, well, what if we had seven values instead of six? Okay, so I just expanded and pretended we had seven hospitals. And you’ll see that I have seven positions there. Well, this is a little like the one we did together with the 11 values, where the median was clearly one number. Here, in this case, it’s 97. So that 97 does not get reused in the bottom or in the top. So you’ll notice that Q1 is the middle number of the three bottom ones, and Q3 is the middle number of the top three ones. And so that’s what happens when you have seven values. And it also happens when you have 11 values, like I demonstrated with those hospitals. But it’s not super predictable. Because what if you had eight values? We suddenly see it gets a little complicated. So how would we do this? Well, see, the first four are between 41 and 97, and the top four are between 121 and 155. Well, to make our median, we’d have to take the mean of 97 and 121. But remember, they don’t get used up. The 97 then gets to double dip and be part of the calculation for Q1, and the 121 gets to double dip and be part of the calculation for Q3.
But even with this double dipping, if you go down, you’ll see that there are then four numbers to contend with for Q1. So of course, to get Q1, you actually have to mush together, or take an average of, 74 and 90. And if you go up to the upper part of the data, in order to get Q3, you’re going to have to make an average of 126 and 142, the ones in position six and position seven. So if you’re unlucky enough to get, like, eight values, then you realize you’re going to have to make your median by making an average of two numbers, your Q1 by making an average of two numbers, and your Q3 like that too. So it’s not super predictable what’s going to happen. You just have to pay a lot of attention. Just remember, if your median is made out of two numbers averaged, those numbers get to double dip in the downstairs and the upstairs when calculating Q1 and Q3. If instead your median is just one number, like because you have an odd number of values, then that guy has to just stay there and does not double dip in the Q1 and Q3 calculations. So we can just see another example of this. So this is nine values, right? Now remember, when I had 11 values, it was like having seven values. I had this median, and it was really clear, like we have here. But even the medians of the top of the data and the bottom of the data, they were just, you know, it was an odd number, and so it was easy to figure that out. Well, you see here, in this case, our median is the fifth value, and that’s 121. So 121 does not double dip anywhere, right? So when we go to calculate Q1, we only have four values, because we’re not counting the 121. And then we’re stuck with taking an average of the second and third value to get Q1. And then same thing upstairs here, between, you know, 142 and 155. You know, those are the two middle numbers of our four numbers at the top. And then we have to take an average of those to get Q3. So I guess this is just my long way of saying you’ve got to be really careful what you’re doing.
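The double-dipping rule above boils down to splitting the sorted data at the median and taking the median of each half. Here is a minimal Python sketch of that method, using the six-, seven-, and eight-value examples quoted in the lecture. The function name quartiles is an illustrative choice, and note that this is only one of several quartile conventions; statistical software may use a different rule and give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method from the lecture."""
    ordered = sorted(values)
    n = len(ordered)
    # With an odd n, the middle value belongs to neither half.
    # With an even n, the two values averaged for the median each stay in a half.
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2 :]
    return median(lower_half), median(ordered), median(upper_half)

six = [41, 74, 90, 97, 121, 126]
seven = six + [142]
eight = seven + [155]

for sample in (six, seven, eight):
    q1, q2, q3 = quartiles(sample)
    print(len(sample), "values:", q1, q2, q3, "IQR =", q3 - q1)
```

The sketch assumes at least two values; with fewer, one of the halves is empty and median raises an error.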
First, make sure you’ve gotten the median. Then figure out if that median is the kind of median where you’re just circling one number, or if it’s a median that came out of an average. Because if it’s a median that came out of an average, just know that those numbers are going to double dip in Q1 and Q3. And if it’s a median that was there because you had an odd number of data, it was just in the middle, that one doesn’t get to double dip. Okay, enough double dipping, I’m getting hungry. When I go to that roller coaster, I’m going to get a double dip ice cream cone. Okay, we’re gonna move on to the box-and-whisker plot, which is kind of like your percentiles getting graphed, right? So let’s go back to our ingredients. We already created our box plot ingredients. In fact, that’s why I trickily went through those quartiles first, because now we’ve created our ingredients to make a box plot. So I just sort of summarized what we have on the left side of the slide, say that 50 times. Hospital beds was what we were counting. The smallest hospital had only 41 beds. I put them in order to make it a little easier: Q1 was 90, the median, Q2, was 126. You know what I mean, quartile, right, by these Qs. Then Q3 is 254. And then the maximum was 364. Okay, so let’s make a box plot. And then you remember what the data look like on the right side of the slide. Okay, well, now I’m going to walk you through how you would make this box plot. So first, you draw this thing. Well, how do you know what to draw? Well, I usually just draw a vertical line, and then put a zero at the bottom, and then I cheat: I go look at the maximum and go, oh, I wonder where that is. And see, our maximum was 364. So I just made it 400 at the top. If our maximum had been something like, you know, I think Massachusetts General Hospital has something like 600 or 800 beds.
If we had gotten that one in there, and that was our maximum, I would maybe go up to 900, you know; whatever is a little bit above the maximum, that’s what I put at the top. So this was 364, so I put 400. Then what I did was I divided it in half, like, see where the 200 is? I just kind of threw that in there. And then I divided between the 200 and the 400 in half and put the 300. And so you can just kind of eyeball this and draw it out that way if you want. Okay, so I got this thing set up. And then here we go, we’re going to do the first thing. Okay, here’s the first thing: we’re going to draw in Q1, or quartile one. So on the left side of the slide, you’ll see a circle around the 90. On the right side of the slide, I made this horizontal line. Now how wide do you make that line? Well, look at how it’s proportioned to that up-and-down axis thing I made, you know, with the numbers. You probably don’t want it too wide, but you don’t want it too skinny. This is just about right, like Goldilocks, just right. Okay, so you just make this horizontal line at Q1. So that’s the first. Now you make a copy of that same line, parallel, and you make it at Q3. So if you look at that (I hope you’re not lost), if you look at that, you know, 100, 200, 300, 400: Q1 is 90, so it’s about 10 under 100. So that’s how I knew where to position that lower one. And then 254, that’s about, you know, a little bit higher than halfway between 200 and 300. So that’s where I roughly knew how to position this one. It’s not perfect. If you do it in statistical software, it puts it out and it’s perfect. But for demonstration purposes, that’s what I’m doing. Okay, so now what we’ve done is we put in Q1 and Q3, and we put in these horizontal lines that are parallel. Alright, here’s the next step. We connect them, hence the box. So the box gets made, right, when you just connect them. Alright, now I put a little circle on the right side of the slide because I wanted to make sure you saw what’s going on there.
Okay. That’s where we put in Q2, or the median, right? So the median is 126. See where 100 is? It’s up a little bit, and we make that parallel. But you see how I made Q1 and Q3, connected the box, and then did the median? I think this is the easiest order to do it in when you’re drawing it by hand and you’re not the statistical software, because that way, you know, this box is all nice, and then your median fits and everything looks nice. But we’re not done yet. We’ve got the whiskers. So you’re probably wondering this whole time, what is this whisker thing? Well, you just figured out what the box is. The whiskers are the markers for the minimum and the maximum. So you’ll see the minimum’s at 41, and so we have a whisker at 41. So why is it called a whisker? Well, it’s smaller. I don’t know why it’s called a whisker, but it’s different from the other ones because it’s smaller. I guess that’s a reason, maybe. But notice how it’s like half the size, almost half the size. Sometimes they’re really, really small, but it’s tiny. And you want to position it, like, centered on the vertical line, like you don’t want it off to the side or anything. And you also want these parallel. You’ll notice the maximum’s up there, way high, at 364. So I just did both of these on the same slide. So you draw in the whiskers. And then you probably can guess the last step. Yeah, connect the whiskers to the box. So good job. There, you went and did it. You made a box plot. And now let’s look at the interquartile range. Remember how you calculated this? You took Q3 minus Q1. Well, that means this boxy thing is 164 beds long, right? So that’s your IQR. This is a visual pictorial of your IQR. So very good. We did our box plot, we did our interquartile range. And you’re probably wondering, why do we do this? I’ll explain. So why do we do this? Well, one of the main things that we do is we look at the distribution in the data.
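As a quick check on the arithmetic, here is a small sketch that takes the five numbers quoted in the lecture and works out the length of the box and of each connecting line. The dictionary layout and variable names are just illustrative choices.

```python
# Five-number summary quoted in the lecture (regional hospital bed counts).
summary = {"min": 41, "q1": 90, "median": 126, "q3": 254, "max": 364}

iqr = summary["q3"] - summary["q1"]             # length of the box: Q3 minus Q1
lower_whisker = summary["q1"] - summary["min"]  # line from the box down to the minimum
upper_whisker = summary["max"] - summary["q3"]  # line from the box up to the maximum

print(iqr, lower_whisker, upper_whisker)  # 164 49 110
```

The 164 matches the interquartile range stated above; the other two numbers are simply how long each connecting line is on the hand-drawn plot.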
I know, I know, you guys learned how to do a histogram already, and you’re good at a stem-and-leaf. Those are other ways of looking at the distribution. And if you make a histogram of these data, you’ll find that, well, I mean, these are only 11 values. But you know, if you get a pile of data, and you make a histogram and a stem-and-leaf, you’ll find that those images agree with the box plot. And you’re probably thinking, well, how do they agree? Well, if you look on the right side of the slide, I’m just giving you an example. So, skewed right: if you had skewed right data, and you knew it because you made a histogram and you saw a skewed right distribution, then if you took the same data and made a box plot, it would be kind of like that skewed right one that we just did, where the top whisker would be really high, and that thing connecting the whisker to the box would be really long, whereas the one on the bottom is short. As you can see, skewed left is the opposite, right? The bottom one is long, and the top one is short. If you have a normal distribution, remember that that’s symmetrical, that’s that mound-shaped distribution, and you have a larger spread, in other words, you have a bigger standard deviation, a bigger variance, right, then you’re going to see a box that’s really big like that. But if you have a smaller spread, and it’s a normal distribution, you’re going to see a box that looks like this. And you’re probably wondering, where are you getting these shapes? Well, I’ll show you, kind of, on the last slide here as we wrap up with the conclusion. It’s because if you fly over a roller coaster, like, see this roller coaster? This roller coaster is skewed right. That would make sense, right? Because you want to go up steeply, and then go down really fast. And see how the box plot for the roller coaster looks? You’ve got sort of the part where you start going up really fast.
That’s kind of near the median and kind of near the 25th percentile. And the part where you’re just getting on and it’s slowly going there, that’s like the bottom whisker. And then you go up and you come down, and it’s a long tail, which is good, I guess, if you design roller coasters. And that long tail, then, is that right skew. So, I mean, if in your mind you’re going, how is she getting this box plot from this histogram, this is kind of how I’m doing it. I’m saying, well, if you flew over the histogram, or the roller coaster, you might see, like, the shape of a box plot. So in conclusion, we talked about percentiles in general, like the 77th percentile, and what that all means. And then we focused in on quartiles, which are a specific set of percentiles. And then we calculated the quartiles. And the reason why we did that is because we first needed to do that in order to make the interquartile range. And then finally, we needed those quartiles in order to make and interpret a box-and-whisker plot. Okay, this isn’t the roller coaster I’m going to, but I’m going to one, and I guarantee you it is skewed right.

Greetings and salutations. Hi, this is Monica Wahi, your Labouré College lecturer, bringing to you chapter 4.1, scatter diagrams and linear correlation. So here’s what you’re gonna learn. At the end of this lecture, you should be able to explain what a scattergram is and how to make one, state what strength and direction mean with respect to correlations, and compute the correlation coefficient r using the computational formula. And finally, you should be able to describe why correlation is not necessarily causation. So let’s jump right into it. First, we’re going to talk about making a scatter diagram. And the thing on the right side of the screen is not a scatter diagram, but it’s kind of scattered, so I put it there. It’s kind of pretty. And then next, we’re going to talk about the correlation coefficient, r, and how to make it.
And then finally, we’re gonna do a shout-out to causation and lurking variables, which, remember, we talked about before, but we’re going to talk about them again in relationship to r. So let’s start with the scattergram. And I also call it a scatter plot, because, like everything in statistics, there’s got to be about eight names for everything. So scattergram and scatter plot mean the same thing. So let’s just get on with the setup here. So scattergrams, or scatter plots, are graphs of x, y pairs. So what’s an x, y pair? x, y pairs are measurements, two measurements made of the same individual or the same unit. So if you measure my height and my weight, that’s an x, y pair. If you measure my height and my friend’s weight, that’s not an x, y pair, because that’s two different people, right? So in these x, y pairs, the x part is called the explanatory or independent variable. And it’s always graphed on the x axis. So remember, in algebra, you would do these graphs where you have this vertical line, and that was the y axis, and you have this horizontal line, which was the x axis. And I always had trouble remembering which is which, but that’s how it is. And so whichever of the pair is x, expect that to be graphed along the x axis. And it’s also called the explanatory or independent variable. Remember, there’s got to be a million names for everything. So if I talk to you and say, here’s an x, y pair, and this one is the independent variable, or this one is the explanatory variable, you need to, like, just secretly know I’m talking about the x of the two. And then, surprise, here’s the y of the two, and the y is also called the response variable. It’s also called the dependent variable. And that is graphed on the y axis.
So again, like I said, I used to have trouble remembering whether the vertical one is the y axis or the horizontal one. But what I did was I remembered: if you take a capital Y, and you go grab onto its tail, and you go pull it straight down, you’ll see that it’s vertical. And that’s how I remember that’s the y axis. It doesn’t hurt the Y. It’s used to that. So if you can stretch the Y’s tail down, and you get vertical, remember, that’s the y axis. And then the other one is the x axis. Okay? And then also, you have to find a way to remember which one means what, like, does x mean explanatory and independent, or does it mean response and dependent? So how I do it is, you know how we sing the ABCs, A B C D E F G? Well, if you fast-forward, the end is W, X, Y, Z, right? So the x comes before the y, you know, in the alphabet. So I do x and then an arrow to y. And then I imagine in my head that it’s saying x causes y, even though it doesn’t necessarily cause y, as you’ll see at the end of this lecture. But I think about it that way, because if that happens, then y is dependent on x, and x is independent; it can do whatever it wants, but y is dependent. So that’s my way of remembering x is the independent variable, and y is the dependent variable. So anyway, that’s a long way of saying the scattergram is a graph of these x, y pairs. And that’s what we’re going to do, is make that graph. So we needed some x, y pairs, right? So I asked the question: does the number of diagnoses a patient has correlate with the number of medications she or he takes? So if you don’t have that many diagnoses, you probably aren’t on that many meds, right? But if you have a lot of diagnoses, you should be on a lot of meds. But we all know people in real life can sort of violate that, just depending. I mean, you could have one really bad diagnosis with a lot of meds. Or you can have a bunch of diagnoses that are all taken care of with one med. So it’s not perfect, but this is kind of a reasonable thing to think.
So what I did was I put up here just four x, y pairs, as you can see. So I’ve got four pretend patients. And you can see, here’s the first patient: that person has an x sub one of one, because they have only one diagnosis, but like I was saying, it must be a bad diagnosis, because that person has a y of three, or is on three meds for it. Right? So that’s how you read this table. So let’s start making our scattergram out of these data. Okay, so here we go. So I labeled the x axis number of diagnoses, right, just to keep things straight, and the y axis number of medications, and then you’ll see where I put the dot, right? Because x is one, I went over to one on number of diagnoses, right, the one diagnosis, and then, because y was three, I went up three, to this three, right, and there goes the dot. That’s where that first person gets a dot. Okay, you put it there. And that’s what you’re going to do with these other ones too; there’ll be four dots. Okay, I just threw all the dots down, so you can kind of see what was going on. But here’s the second person, right? So that person had an x of three. So I went over three. And I just put those green arrows in just so you can see what was going on; they’re really not part of the scatter plot. It’s just more like, like cheating, you know, to show you, because we’re just practicing, right? And then that person had an x of three and then a y of five, and you see where the dot goes, right? And then here, you can see where the third dot goes, because there’s a four and a four. And then here we have the fourth dot. So this is the scattergram of these four patients. Of course, a lot of times you have, like, hundreds of patients in there. But I just showed you the simple example. Okay, now, because we did that, I can talk about linear correlation, and you’ll kind of get it, right? Linear correlation: that term means that when you make a scatter plot of x, y pairs, it kind of looks like a line. Now, over here on the right is not like biology. That’s not like statistics.
That’s like algebra, right? Because back in algebra, you’d have these perfect lines where the dot was right on the line, and see the x and y? Notice there’s no diagnoses, nothing. That’s algebra, right? So perfect linear correlation looks like graphing points in algebra. And if you actually make a scatter plot of, like, people’s x, y pairs, and you see that, you should suspect there’s something wrong. It actually happened to me once. One of our statisticians came to me and said, Monica, look at this, you won’t believe this. And I said, well, I don’t believe this. What are you graphing? And he said, on the x axis, he had put the weight of the person’s liver. And on the y axis, he put the weight of the whole person. And I’m like, how do you weigh people’s livers? Like, that sounds painful. And he goes, oh, let me go see. And what he learned was that you don’t weigh people’s livers; you use an equation to estimate the weight of their liver, and guess what’s in the equation? Their actual weight. So I’m like, that’s why it came out, like, on a line: because you were using the y to calculate the x. And he was like, oh, you’re so smart for a secretary. So then I became an epidemiologist. But anyway, if you ever see this in biology, just suspect something’s fishy, because really, things just don’t end up right on the line. But if they get really close, you can say it’s close to perfect linear correlation. I just wanted to let you know that’s what’s going on here with this linear correlation. Okay, so let’s talk about facts about linear correlation. So things can be linearly correlated without being perfectly on the line; obviously, our little thing was. So when you make those dots, your scattergram, if you imagine a line going through it, and it kind of looks like the line is going up, this is called a positive correlation. But you don’t always have a line going up. So I want you to look at this. And I made up these data too.
But on the x axis is the number of patient complaints. So as we go on, the patients are madder and madder. They’re grouchy and cross, and they’re making more complaints. On the y axis, we have the number of nurses staffed on the shift, right? And so as you go up, there are more nurses. Well, sure enough, when you’ve got a lot of nurses, you don’t have as many patient complaints, right? Because they’re being attended to. So this is what some people would call an inverse correlation. But in this presentation, I’m calling it a negative correlation, because as one goes up, the other goes down, and as one goes down, the other goes up. And that’s depicted visually with this line going down. So you see, you can imagine a line going down. That’s a negative correlation. Neither is better, you know, positive versus negative; it just explains how these things are behaving together, how x and y behave together. But then you can have situations where there’s really no correlation, like x and y really don’t have anything to do with each other. So as you’ve seen, you know, when you have patients in the hospital, some of them have really big families, and those families come a lot. And some of them don’t really have that many loved ones. So as you can see, along x here are total unique visitors, meaning you just count each person once. So there’s a patient who only has one unique visitor. But if you look at y, days spent in the hospital, that person has been there seven days, and that visitor keeps coming, right? And then you have maybe a patient here, the second one, with two unique visitors. And that person’s only been in one day, but both those people have been there. Then you have people like a person with three unique visitors, and they’ve been in the hospital for days, right? And those are probably the same three people coming back.
So it really doesn’t matter how long a person’s in the hospital; if they’ve got a lot of loved ones who keep coming, they’ll keep coming, or not, right? According to this correlation. So you end up imagining a straight, flat line. And that’s no correlation. That’s fine too. Nothing is better or worse; it’s just that you make the scattergram to try and understand how x and y are related. This is always fun. Like, in books, they always make some sort of goofy picture. I don’t know why they do this. I would never get a goofy picture like they show in books. So, you know, I made up this correlation: this is the number of games in the lobby and the number of books in the lobby. They should really have nothing to do with each other. But if you see something just way goofy like this, just say it’s no correlation. I don’t even know how I got this.

Hi, there. Alright, so we’ve been talking about correlation. And it actually has two attributes. So far, we’ve only talked about one, and that is direction: we talked about positive, negative, and no correlation. So whenever you’re talking about a correlation, you have to say what direction it is. But you also have to say the other thing, which is what strength it is. So now we’re going to talk about how you figure out what the strength is. So strength refers to how close to the line the dots fall. If they fall really close to the line, it is considered strong. If they fall kind of close to the line, it’s called moderate. And if they aren’t very close to the line, it’s weak. Now remember, that’s totally different from what direction it is. It could be positive strong or negative strong, right, or positive moderate or negative moderate. So the strength is a statement of how close the dots you make in your scattergram fall to the line that you end up drawing. So I thought I’d just give you a few examples. So look at this. I just made this up. This is what a strong negative one would look like.
Notice how those pink dots are almost on the line. And this is a strong positive. Again, even one of the dots is on the line. Not all of them, you know, or it’d be perfect, and it’s never perfect. So this is really close. It’s strong, positive. So strong just refers to the fact that the dots are almost on the line. Now, this is almost the same correlation, but the dots are not really almost on the line. The line has to be fair and kind of go between them, but they’re kind of far away. And so just eyeballing it, you would say this is moderate. And here, it gets weak. And mainly it’s because the dots are more all over the place. But you’ll notice there’s one that’s, like, right on the x axis. And then, hey, look up there, like in the title, there’s one up there, like, way up there. And that’s like an outlier. And sometimes, when you get outliers, they can really whack things out. So even though this is a weak correlation, that line looks so powerful, because it’s basically connecting these two outliers. So you’ve just got to be careful, and that’s part of why you make a scattergram first: outliers can have a really powerful effect on the correlation, especially if they’re in any of the four corners of the plot. Like, if you get a weird outlier kind of in the middle, it’s not going to do as much as if it’s in the upper right, upper left, lower right, or lower left. It can really affect the direction. You know, it’s like a seesaw, or a teeter-totter: an outlier can get on and really change the direction of it. And it can also mess with how strong or weak the correlation is. So that’s why you really want to start with a scatter plot. And that’s why this chapter is organized to start with the scatter plot. You just want to look for outliers, and also just see how x and y look when you plot them. Now we’re going to get on to the correlation coefficient, r. We’re going to get on to computation and actually making a number.
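To see how much one corner outlier can move things, here is a small sketch with made-up numbers (they are not from the lecture slides). The helper `pearson_r` is a hypothetical name, and it uses the deviation-from-the-mean form of the correlation coefficient, which gives the same number as the computational formula covered later in this lecture.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient, written in deviation-from-the-mean form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up x, y pairs (assumption) that have essentially no linear correlation.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 1, 3]

r_without = pearson_r(xs, ys)
# Add one extra point far out in the upper-right corner of the plot.
r_with = pearson_r(xs + [20], ys + [20])

print(round(r_without, 2), round(r_with, 2))
```

A single point in the corner is enough to turn what looks like no correlation into what looks like a strong positive one, which is the reason for checking the scatter plot before trusting the number.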
So you can not just use watery terms like direction, you know, positive, negative, or moderate, strong, weak to explain it, but you can actually put a number on how correlated x and y are. So remember the word coefficient? We did it with coefficient of variation, which is different. So the CV, you know, is one kind of coefficient. But what we’re going to talk about is a different kind. This time, our coefficient is called r. And coefficient just means a number; we just like to use that word in statistics. Now, it seems kind of weird, because, like, I’m talking about correlation, and people are like, well, why is it r? Why isn’t it, like, c for correlation? And I’m like, I don’t know, I didn’t invent it. But this is how you can remember: you can go correlation, correlation. So correlation coefficient, r. So just remember, r means correlation. And technically r means sample correlation coefficient. For the population correlation coefficient, right, like, you know, imagine you’re correlating height and weight in the population, like everybody in a particular state, you actually need a Greek letter for that. And I showed it on the screen. It’s this fancy p, I don’t know the right name of it [the Greek letter rho, ρ]. But we don’t actually cover it in this class. So I just wanted to show it to you. We’re only going to focus on r, which is the sample correlation coefficient. So what is r? Well, like I said, it’s the numerical quantification of how correlated a set of x, y pairs are. And it’s actually calculated by plugging all of the x, y pairs into the equation. I’ll show you how to do it. And you can see that if you do it by hand, if you have a lot of x, y pairs, that will take forever. So I tried to limit that. And, like, remember, with standard deviation and variance, there was a defining formula and a computational formula? This time, I’m only going to show you the computational formula. It’s, in my opinion, way easier to do, and it gets you the same number. Alright.
So that’s what we’re going to do: we’re going to take a set of x, y pairs, and we’re going to calculate r. But then how do you interpret r? Well, let me just prepare you mentally for what we’re going to get out of this calculation. The r calculation produces a number, and the lowest number possible is negative 1.0. So that’s perfect negative correlation. So if we were, like, in algebra, and we had a line going down, and all the dots were on it, then the r would be negative 1.0. But that never happens, right? So the way you want to think about it is, if you have a negative correlation, and you get an r that’s like negative 0.95, or something really close to negative 1.0, then it’s close to perfect negative correlation. That’s how you want to think about it. And then the opposite is, the highest possible number you can get for r is 1.0. But most people never get that, except for that one mistake I was telling you about. And that would be perfect positive correlation. So if you calculate an r, and it gets really close, like 0.95, like I said, or 0.98 or whatever, then you’re thinking, whoa, this is really close to perfect positive correlation, right? And then everything else is in between. So, like, you know, 0.5 or negative 0.3 or 0.02 or negative 0.09, all of those are between negative 1.0 and 1.0. And that’s where r should be. So let’s say you calculate r and you get eight. Okay, you did it wrong, right? Or you calculate r and you get negative 2.3. Like, that’s not right; it’s got to be between negative 1.0 and 1.0. And if you make a scattergram, you should know whether it should be on the negative side or the positive side, or at least it should give you a hint. So this is just more to calibrate what to expect from r, because it’s kind of a big calculation. So I’m just going to give you some pictorial examples.
Because remember, every single time we make r, right, we also have a scatter plot behind it. And I just thought, you know, it would be helpful to see some real-life examples of r. These are real-life examples, okay, real life; you don’t get this from just anything, right? I’m just teasing. But anyway, I started with some negative r’s, because I’m feeling negative today. I went into the literature and I found this article about, oh, it’s not MIT and Harvard. It’s about the evolutionary principles of modular gene regulation in yeasts, and all I know is I’m supposed to cut down on eating bread. So that’s all I know about this. But they had these really nice scatter plots, and they calculated r for them, and they had a little line on them. So I thought I’d show them to you. So if you look at the one that’s labeled D, see where the dots are, right, and see where the line is. And this looks kind of like a moderate to strong negative correlation, right? Because the dots are kind of close to the line. And then when the group calculated r, they got negative 0.7. And so that kind of makes sense. And then I put my opinion in the lower right. These aren’t official cut points or anything, but I usually use these as a guide. See how I said negative 0.4 to negative 0.7 is moderate? So I would call the D one moderate. Now let’s look at E. So see how the dots don’t cluster so close to the line as they do with the D one? That’s going to make it a weaker correlation. It’s still negative, right? So it’s negative 0.44. And when you look at my little opinion, I still call that moderate, but it’s on the low end, see that? And then if you look at F, see how many of them are, like, way far away from that line, and they’re dragging it down? So now it’s an even weaker correlation, negative 0.25, right? And so that’s weak. And so this is just some examples to give you a pictorial.
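The rules of thumb above (r has to land between negative 1.0 and 1.0, the sign gives the direction, and roughly 0.4 to 0.7 in size counts as moderate) can be sketched as a small helper. The function name `describe_r` is made up, and treating everything above 0.7 as strong and everything below 0.4 as weak is an assumption filled in around the one range the lecture states; as noted, these are a personal guide, not official cut points.

```python
def describe_r(r):
    """Label a correlation coefficient with a direction and a rough strength."""
    if not -1.0 <= r <= 1.0:
        # An r outside this range means the calculation went wrong somewhere.
        raise ValueError(f"r must be between -1.0 and 1.0, got {r}")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size > 0.7:       # assumed cut point for strong
        strength = "strong"
    elif size >= 0.4:    # the lecture's moderate range: 0.4 to 0.7
        strength = "moderate"
    else:                # assumed cut point for weak
        strength = "weak"
    return f"{strength} {direction}"


# The three r values read off the article's panels D, E, and F.
for r in (-0.7, -0.44, -0.25):
    print(r, describe_r(r))
```

Calling `describe_r(8)` or `describe_r(-2.3)` raises an error, which mirrors the point that an r outside the range means the arithmetic needs to be redone.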
And now I promise to be more positive. Here’s some positive r’s. They didn’t draw a line on this one. This is a different article, right? It says obesity is associated with macrophage accumulation in adipose tissue. So again, try to cut down on bread. But anyway, if you look on the left side, you’ll see all of these x, y pairs plotted on the scattergram. And even though we don’t have a line there, we can imagine it’s going up. So we would expect this to be positive. But we also would imagine they’re not really clustering around the line very tightly. So when we see that the r is 0.6, we’re not surprised. I mean, it’s on the high side of moderate in my world, which makes sense. But go look at the right one, you know, the B one. Look at how you could almost connect the dots and get a line out of that. So that’s really tightly hugging the line. And so we’re not surprised to see that the r is 0.92. So that’s pretty strong. So I just wanted to give you these pictorials before we actually went forth and calculated r, because that’s one thing you can do: do the scatter plot and have an expectation of what r should look like. And then if you calculate r and it’s totally wacky, you know that you did something wrong. Okay, let’s calculate r, and let’s use the computational formula. Okay, I threw the formula up in the upper left, and don’t feel overwhelmed by it; we’re going to take that apart very carefully, right? But before we even do that, I just want you to have a flashback to chapter 3.2. See all those sums, those capital sigmas, in the equation? So we’re going to handle calculating r a lot like we handled calculating variance and standard deviation. We’re going to make, like, a table with columns. And then we’re going to fill in those columns with calculations. And then we’re going to add up the columns to get all those numbers. So you were already good at that in 3.2; you’ll be good at this too.
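The slide with the formula is not reproduced in this transcript. For reference, the standard computational formula for the sample correlation coefficient, which matches the terms described in the rest of this lecture (n, the sum of xy, the sums of x and y, and the sums of x squared and y squared), is:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;\sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```

Here n is the number of x, y pairs. Each sum without parentheses is the total at the bottom of one column of the table, and each sum in parentheses is a column total that then gets multiplied or squared.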
And then I made up a story, because it’s a lot easier to check your work if there’s some story behind the statistics. So pretend we have seven patients that have been going to your clinic for a year. They’re good patients, they keep coming. So they came to the clinic over the year. And at the last visit of the year, you measured their diastolic blood pressure, and what you predicted, or what you thought would make sense, was that those with a higher diastolic blood pressure would have had more appointments over the year, because probably they’re trying to stabilize their blood pressure, or maybe they have other problems that are driving it up. This makes perfect sense, right? So what you wanted to do is see if you are right. So you’re going to take the diastolic blood pressure at the last appointment as your x, you know, because you think that that’s maybe the explanatory variable, or, you know, that would be the independent variable that would have something to do with whether or not they had a lot of appointments. And then you take y as the number of appointments over the last year, because you’d say, okay, high DBP probably means they have more appointments. That’s just your idea. Maybe you’re wrong, but we’re gonna do that. Okay. So I put in the title just a reminder, x is DBP and y is number of appointments, so you don’t forget. And then we made up this table. So look at the first column. It’s just the patient number. It’s nothing, you know, exciting; we just want to keep track of which patient is which, right? And then notice under x, we just have all of their DBPs. So patient one at the last appointment had 70 mmHg, and patient two had 115 mmHg. That’s kind of alarming. But these are fake data, so don’t get worried about these patients. But anyway, we just fill in x.
And then also, when you have their chart out, you can look up how many appointments they had over the last year. And patient one only had three, whereas patient two had, like, 45, which you can believe, because sometimes they’re coming in all the time to get stuff adjusted. But then, you know, patient three only had 21, and patient four had seven. So you can see these are the x, y pairs for each of these patients, right? And it’s pretty simple to go to the bottom and sum up each of the columns. We have the sum of x is 678 and the sum of y is 166. And also, I’m reminding you of the r calculation; I put that in the upper right, just so we can see what we’re doing. I just want to call your attention to one of the terms in there, which is sum of x, which I put in the parentheses here. And that we already know, just from making the first part of this table and adding it up. So we already have that thing. And now I just wanted to point out, if you saw the sum of x over here, it’s not exactly the sum of x; it’s the sum of xy. So the y is mushed right next to it. That’s not sum of x, that’s sum of xy. And later in the game, we’re gonna put the sum of xy at the bottom of the last column. So that first term there, that’s not sum of x, that’s sum of xy. Okay, now downstairs, we see the sum of x to the second, right? And that looks an awful lot like the one next to it on the left, which says sum of x to the second, right? And so how do you tell the difference between the kind without the parentheses and the kind with the parentheses? So this is how I do it. The rule is always, regardless of what’s going on, do what’s in the parentheses first. So that’s easy to do if you have parentheses. If you’ve got the parentheses version, you know that the sum of x to the second with the parentheses in it is, you just do the sum of x, and you take the sum of x and the sum of x and times them by each other. Right? But what if you don’t have any? Well, what I do is I say, well, if I did have some, I’d do it this way.
But if I don’t have any, then I know I have to do the sum of the x squared column, right? So that’s where you take x times x, x times x, x times x on each line, put it there, and sum that. So that’s how I go through it no matter where I am in statistics or algebra. If I see that sum symbol and then the x squared, I first look for the parentheses. If they’re there, I know what to do. If they’re not there, then I know you don’t do the thing where you just take the sum of x and square it; you have to go and look at the bottom of the x to the second column and take the sum of that. I hope this is helpful. All right, so as you can see, I’ve shown you that on the top of the equation is where you just take the sum of x and the sum of y. And on the bottom, I’m showing you where you take those and you take the square of them. And then the other term is the one where you just take the sum of the column. All right, so there you go. So what happened here? Well, we filled in x to the second. So if you go to patient one, 70 times 70 is 4,900. That’s where we’re getting that number. And then patient two, 115 times 115 is 13,225. So you go through all those, and then you sum those up, and that’s what goes in that first term. And then I’ll bet you can guess what the next slide is. Surprise! Now we do the y one. Don’t get confused, because you kind of have to skip a column there. So three times three is nine, and that’s what goes in the y squared column; 45 times 45 is 2,025. That’s how we’re doing those. You sum all that up, and then go look up at the equation; that’s where you put that sum of y squared. Now we have xy. And this reminds me of a student I had before. She was really confused. She’s like, “Monica, I don’t know what to do with xy, the xy quantity.” And I go, “What do you mean? It’s pretty obvious. You just take x times y. Like here, 70 times three is 210.” She goes, “x times y? Where’s the times?
Like, how do you know it’s supposed to be times? I don’t see any times.” Right? I don’t see any times either. Like, how do you know to do that? Well, anyway, I’ll just tell you: imagine a little multiplication symbol between x and y. That’s what’s supposed to be there. I guess I was so used to looking at it. I was like, you’re right, I guess you’re just supposed to assume that. So take x times y. For patient two, we just took 115 times 45, and that’s how we got 5,175. So you go through each of those; it’s a lot of processing. And then you sum it up at the bottom. Whoo, that’s a big number. And then you see I circled it in the r equation. So I think we figured out where to put everything. Obviously, n is seven, right, because we have seven patients, and you see a bunch of n’s in there. So I think we have all our ingredients. So let’s move forward. All I did here was rewrite the exact same equation with all the ingredients in it, right? So like I said, n is seven, and wherever you see n, you’ll see a seven. See that sum of xy on the top? You see where that goes. See the sum of x and the sum of y, and then downstairs you’ll see I filled in all those numbers too. Now, let me just talk to you a little bit about both levels, the numerator and the denominator. In the numerator, because we have order of operations, you need to do out the n times the sum of xy, that’s seven times 18,458; you need to do that out first. And then you need to do the other one, the 678 times 166. And then after you’re done with those two things, you have to subtract the second one from the first one. That’s the order you have to do that in to get the numerator right. Now for the denominator, it’s a little bit the same, but a little more complicated. You see on the left side, you have that seven times 67,892; you have to do that out.
And then you have 678 squared; you have to do that out. Then you subtract it from the first one. And after you have that, you take the square root of all of that, and that’s your first term. And then you still have to go over to the other one. You have to take seven times 6,768; keep that. Then take 166 times 166; that term you subtract from the first one. And after you’re done with all that, you take the square root of that, and then those two things you have to multiply together. So that’s a lot of work, and you have to do it in the right order. So here, I just wanted you to see that you probably want to work out this term separately first, and then work out this other term separately. And just like that thing I was telling you about xy, those two terms, once you work them out, you take the square root of the left one and the square root of the right one, and you have to multiply them together to get the denominator. So this slide is to help you see. I threw the numerator on; that was relatively easy. But these are the two different numbers you should get from the left side of the denominator and the right side of the denominator, just to check your work. And then of course, once you multiply them by each other, you get this number, 17,561.3. So ultimately, what the calculation for r comes down to is you’re trying to calculate the numerator and you’re trying to calculate the denominator, and at the end, you divide the numerator by the denominator and you get the answer, which is r. So we’re going to do that now. And here’s what we got: we got this 0.949. And because we see that it’s positive, we know it’s a positive correlation. And then remember my opinion, and also probably everyone’s opinion, because if you round that, you get 0.95. Well, that’s getting really close to 1.0. So most people would agree that that’s pretty strong. So how you would diagnose this correlation is you would say it’s positive and it’s strong.
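
For anyone who wants to check the hand calculation, here is a minimal Python sketch of the same arithmetic. It uses only the column sums quoted in the lecture (n = 7, sum of x = 678, sum of y = 166, sum of x squared = 67,892, sum of y squared = 6,768, sum of xy = 18,458). The variable names are made up for illustration and are not from the slides.

```python
import math

# Column sums read off the lecture's table for the seven patients.
n = 7
sum_x = 678       # sum of the DBP values
sum_y = 166       # sum of the appointment counts
sum_x2 = 67_892   # no parentheses: square each x first, then add
sum_y2 = 6_768    # no parentheses: square each y first, then add
sum_xy = 18_458   # multiply each x by its y, then add

# Numerator: do out both products first, then subtract.
numerator = n * sum_xy - sum_x * sum_y

# Denominator: sum_x ** 2 is the "with parentheses" kind,
# meaning the column total multiplied by itself.
left_term = math.sqrt(n * sum_x2 - sum_x ** 2)
right_term = math.sqrt(n * sum_y2 - sum_y ** 2)
denominator = left_term * right_term

r = numerator / denominator

print(f"numerator   = {numerator}")
print(f"denominator = {denominator:.1f}")
print(f"r           = {r:.3f}")
```

Run as written, this should land on the same numbers as the slide: a denominator of about 17,561.3 and an r of about 0.949.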
Okay, I just want to wrap this up by giving you a few facts about r that I may not have covered yet. First, r requires data with a bivariate normal distribution, which is something we didn’t check before doing our r in this class, because I just don’t cover that. But please know, if you take another statistics class and they bring up r, they might talk about checking for the bivariate normal distribution. So just know about that. Next, please know that r does not have any units. Remember other things that don’t have units: the coefficient of variation didn’t have any units. Some things just don’t have units, and r is one of them. Also, we did talk about how perfect linear correlation is where r equals negative 1.0, if it’s a negative correlation, or r equals 1.0, which is a positive correlation. But I might not have mentioned that no linear correlation is r equals zero. Now, you probably won’t see exactly that in real life. But sometimes I’ll make an r, and the r is either positive or negative, but it’s 0.00000-something, right? Regardless of whether it’s positive or negative, if it’s 0.00000-something, it’s really close to zero. So that means there’s probably no linear correlation. And then we learned about positive and negative r, but I just wanted to remind you of the behavior of x and y when you get those circumstances. So if you have a positive r, it means as x goes up, y goes up. But it also means as x goes down, y goes down. So they travel together. When you get a negative r, it means as x goes up, y goes down. But it also means the opposite: as x goes down, y goes up. So they travel in opposite directions. Now, here’s another little factoid about r. If you choose to switch the axes, like let’s say you give me xy pairs, and I designate a certain variable as x and the other one as y, and you designate them the opposite, it really doesn’t matter, even in the equation, because you’ll end up with the same r value.
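
The switching claim is easy to try out. The sketch below is only an illustration: the five height and weight values are hypothetical (they are not from the lecture), and `pearson_r` is a made-up helper name that applies the same table-based formula for r to whatever two lists it is given.

```python
import math

def pearson_r(xs, ys):
    """Compute r with the table-based formula (column sums), as done by hand."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data, invented purely for this demonstration.
heights = [60, 62, 65, 68, 70]
weights = [120, 130, 128, 150, 160]

r_height_as_x = pearson_r(heights, weights)
r_weight_as_x = pearson_r(weights, heights)

print(f"height as x: r = {r_height_as_x:.4f}")
print(f"weight as x: r = {r_weight_as_x:.4f}")
```

The two printed values should match, because swapping the lists only swaps the two square-root terms in the denominator and leaves the numerator unchanged.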
So it doesn’t matter if you call my x your y, and I call your y my x. We can switch them, but you’ll still end up with the same r from the calculation. Then finally, even if you converted x and y to different units, you’d get the same r. So let’s say that you were in England, and you were doing the correlation between height and weight, and you were using the metric system on the same patients that I was using the US system on. Even though we’d have different numbers, because obviously you have to convert them, we’d still get the same r when we’re done.
So finally, we get to the last subject of this lecture, which is lurking variables, which you’ve heard about before. The main point I want to make is that correlation is not causation. So you don’t want to be misled by correlations. Beware of lurking variables. Remember, lurking variables are things lurking behind the scenes that cause things, right? And you may have realized that if you have xy pairs, designating which one is x and which one is y is kind of political, because you’re implying that x could cause y. So let’s say that you’re correlating height and weight. Taller people are heavier, so you would call x height and y weight. You know, people don’t go, “Oh, I’m too short, I should gain weight so I can grow taller.” That’s just not the way things work. So you have to put x as the height and y as the weight. But there are, in reality, other causes of weight besides height. In fact, there are things that cause both height and weight, like genetics, right? So a genetic profile that leads to tallness and also obesity could be a lurking variable in the relationship between height and weight. So there could be some tall people that are also obese, and it’s not really just because they’re tall. It could be because they have the genetics that programmed them to be tall and also obese, right? And so here’s an example where you’ve got to be real careful with correlation.
So there’s been this claim that eating ice cream causes murders, because they noticed that in areas where ice cream sales go up, murder rates rise. And I don’t know about you, but when I have some really good ice cream, it just makes me so mad. I’m just kidding. I mean, why would this happen, right? Well, the reality is summer and warm weather are lurking variables, because we sell more ice cream in the summer. You know, ice cream consumption goes up. But also people are outside more, and more murders occur. And you know, I’m from Minnesota, where it gets really cold for periods of the winter, and oh my gosh, there are totally no murders then. People just don’t commit murders when it’s really frigid out; it’s just really inconvenient. So that’s a situation where there’s a lurking variable. And you don’t want to start screwing up our ice cream laws and making it so we can’t have ice cream, just because you mistakenly conclude that ice cream causes murders, right? There’s a lurking variable behind it that’s having something to do with both. Here’s another one. This was from my professor in my biostatistics class. They used to put up a time series chart over a long time, like since the 1900s, and they pointed out that as onion consumption goes up and down over time, the stock market rises and falls with it, right? So when the stock market’s low, people aren’t eating as many onions. And this is just true over generations in the US. So, yeah, we’ve had some problems with our economy in the US. Do you think we should all start eating a bunch of onions? Right? So the healthy economy is a lurking variable. In a healthy economy, people buy more food, including onions, and a healthy economy also boosts the stock market. So you’ve got to be careful about this: correlation is not causation. And so if you want to make the stock market go up, don’t make everybody eat onions.
And definitely don’t make us stop eating ice cream; that would make me very upset. So at the end of the day, you’re not going to be able to affect the murder rate by bringing down the ice cream consumption rate, and you’re not going to be able to fix the stock market by making people eat onions. That’s the whole concept behind lurking variables, and correlation is not necessarily causation. So in conclusion, when you’re doing your correlations, first make a scattergram, because you want to get a visual idea of the strength and direction. And you also want to look for outliers. Then go on and calculate r by hand, but be really careful, because it’s a big, hairy calculation and you don’t want to make any mistakes. And then finally, when you go to interpret r, be careful of lurking variables, and remember that correlation is not necessarily causation. And now, time for some ice cream.
Hello, it’s Monica Wahi, your Labouré College lecturer, here to ruin your day with chapter 4.2, linear regression and the coefficient of determination. So at the end of this probably painstaking lecture, the student should be able to at least explain what the least squares line is, identify and describe the components of the least squares line equation, explain how to calculate the residuals, and calculate and interpret the coefficient of determination, or CD for short. All right, so it’s really cool if you have a crystal ball, because then you can make predictions, right? You just look into the crystal ball. It’s some nice equipment. I’ve had friends who have them; they’re very nice to put out on your dining room table as the centerpiece. Unfortunately, though, they don’t really play much into statistical prediction. So what I’m going to show you in this lecture is how we use statistics for prediction instead of this beautiful crystal ball. So we’re going to start by talking about what the least squares line is.
And then we’re going to talk about the least squares line equation, which is the crystal ball thing we use, only in statistics, okay? And then we’re going to talk about prediction using the least squares line. And finally, we’re going to talk about the coefficient of determination. So let’s get started, and let’s get started with the term least squares criterion, right? So remember, criteria is plural and criterion is singular. Criteria are stuff you need to meet to be eligible, like you have to meet the criteria for registration for college, right? Well, the least squares criterion is just one, which is awesome, because then you only have to meet one thing. So one of the things you probably wondered when you were watching the last lecture is how do you know exactly where to draw this line when you have a scatterplot? Like, how do you know where to make the line the most fair? In the last chapter, when we plotted the scattergrams, I just drew a line there for demonstration. But there actually is an official rule as to where the line goes. Basically, the rule is it has to meet the least squares criterion. Okay? If it meets that criterion, and there’s only one line that does, then that is where the line goes. So how do we get to that? Well, this is roughly what it looks like. When you draw the line, there is a vertical distance from each of the dots to the line. Now, as you can see by the slide, sometimes the dots are below the line, and sometimes they’re above the line. And so the word “squares” indicates that whether it’s up or down, you’re going to square it, so it’s not going to be negative anymore, because whenever you square a negative, it becomes positive. So first, you’re going to have to square all of these distances. Okay? So imagine you were just going to try it out: maybe draw this line, and then you calculate the squares, and you’d be like, okay, that’s how many. And then maybe you tilt the line a little and calculate the squares again.
And your goal would be, when you added up all the squares, to have the least. So the line belongs wherever it causes the smallest sum of squares for the whole data set. So if you’re software, which you’re not, you’re a person, right, but if you were software, you’d be figuring that out using your software brain: how exactly to tilt this line, and where exactly to put it, to minimize these squares. But we’re people, so I’m going to go on and explain how people do this. So the trick is, if you can figure out where the line goes, you can draw it on the scatterplot and be right. But there is the challenge of knowing exactly where it belongs on the graph. And then also, you’re probably realizing you don’t always have a graph to draw it on. Like maybe you need to talk to somebody about where the line goes, and you can’t draw a picture. So how you explain where the line goes is you use an equation. Some of you may remember this, and some of you may not, so I thought I’d do a quick little review of how lines and equations relate. Okay, so we’re going to get into the least squares line equation. But first, I’m going to give you a little flashback about algebra, and I’m sorry if this is painful. This is hard for me, because I wasn’t really that good at algebra. This isn’t statistics, this is algebra, but I just wanted you to remember this part. Okay. So back in algebra, there was a chapter where you were given these xy pairs, and it was different from statistics, because they all lined up on a line. See, these pink dots are just perfectly in a line, okay, and these are the xy pairs. And remember, you had to graph this, kind of like we had to do scatterplots. And then you were given this equation, y equals bx plus a, right? And that was the linear equation to describe this line. And you were like, okay, I don’t get how to put this equation together with this line. And so first, the teacher would say, well, b stands for the slope of the line, right?
Because you have to know the slope. I mean, the line can be tilted any which way, and so if you know the slope, you already know something about the line. And in algebra, how you would get the slope is you calculate the rise over the run, right? So b in algebra was rise over run, and you’d get the slope. And then you’d be like, great. But you always needed another thing in order to define the line. Because if you imagine this line is in an elevator, it could still have the same slope but go up or down, right? So we need to anchor it on the y axis somewhere. So a stands for the y-intercept, or where it spears through the y axis. And as you can see by the drawing, it looks like a is at zero comma zero, right? But you don’t have to eyeball it. What you can do in algebra to get a is, since you’d filled in b, you just go grab an xy pair, plug in the x, plug in the y, plug in the b you just got, and back-calculate the y-intercept, right? And that’s how you would get the whole linear equation. So that’s how you would do it in algebra. And I just wanted to remind you of that because we do some similar things in statistics. It’s a little different, but I wanted to remind you how to connect what a line looks like with how this equation works. All right. Well, welcome to statistics. Look, those pink things are not on a line. So we want to make a line, but now you know about the least squares criterion. What you’re trying to do is make a line that minimizes the sum of squares, right? So here we go. Remember how I was just talking about this linear equation back in algebra? Well, notice the difference. The main difference here is the hat, right? The y is wearing a hat. And that’s universal in statistics: whenever you see a letter or a number wearing a hat, it means it’s an estimate. Okay?
So of course we’re estimating y, because if you look at that line, none of these dots actually falls on that line. And we don’t really expect the dots to fall on that line, just close to it, right? You know, because of the least squares, okay. And so we almost have, in a way, the same goal we did back in algebra: we have to get that b, that slope, and then we have to use that to back-calculate our a. Okay, so let’s go on with that. So like I said, in the software approach, you just feed all the xy pairs in, and then the software just prints out the b and the a; it just prints out the slope and the y-intercept, which is why I love the software. But we don’t get to use that in our class. In our class, we have to do the manual approach, just because it’s painful, and I had to do it too, so now I’m making you do it, right? Okay, what we’ll do is plug all the xy pairs into an equation to get the slope, the b. And I promise you, I won’t give you a ton of xy pairs, or you’ll be there forever. But this next step is one we didn’t have to do in algebra. And that is, we’re going to have to go back to all of our x’s and calculate x bar, and go back to all of our y’s and calculate y bar. Remember, that’s the mean of the x’s and the mean of the y’s. And you’re probably wondering, well, why do we have to do that? I’ll show you again, but in case you didn’t notice, those dots really didn’t fall on the least squares line; they fell around it, and you need at least one point on that line to help back-calculate that y-intercept. And one of the rules of the least squares line is that x bar comma y bar is on that least squares line. So you can know, if you calculate that out, that that’s actually on the least squares line. Okay. And so finally, after you do x bar and y bar, you plug in b, you plug in x bar for the x, and you plug in y bar for the y hat to back-calculate the a. So it’s a similar, but different, process from algebra.
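
Written out as formulas, the manual steps described above look roughly like this. It is a sketch in standard notation rather than a copy of the slide: the formula for b is the table-based one that the lecture fills in with numbers a little later, and n is the number of xy pairs.

```latex
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
\bar{x} = \frac{\sum x}{n},
\qquad
\bar{y} = \frac{\sum y}{n},
\qquad
a = \bar{y} - b\,\bar{x},
\qquad
\hat{y} = bx + a
```

The expression for a is just the line equation with the point x bar comma y bar plugged in and solved for a, which is why b has to be calculated first.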
So the moral of the story is you need to recycle, right? We’ve got to be good to the environment. So what has happened? Well, you wouldn’t be at this point in your life of making a least squares line if you hadn’t already started out by making a scatterplot, and then deciding you wanted to do r, and then making r. And when you make r, you end up with that big table, remember, and you end up with all these calculations, like the sum of x, the sum of y, the sum of x squared, and the sum of xy. Now you want to recycle those. You want to save those calculations from r because they also fit into the equation for b. Also, you want to save the r you made, because you’re going to recycle that into the coefficient of determination, which I’ll explain later. And then this is not about recycling, you’ll actually have to make this anew: you need to calculate x bar and y bar. You never needed to do that before, but now you need this. So get together your old r calculations, and then put your x bar and y bar together, and you’ll be ready to do the least squares line equation. All right, so here’s a flashback. Remember this big table? Remember our story? We had seven patients, right? And x was their diastolic blood pressure at the last visit they had of the year, and y was the number of appointments they had over the year. And we thought, well, if your diastolic blood pressure goes up, then maybe you need more appointments, because it’s a marker of being sick. I don’t know. That was my little story. Okay, so over on the right now we have the formula we’re using for b. The text gives you two formulas; again, I’ve always got my favorite, it’s the one with the table, right? So here’s the formula for b. And then after you calculate b, you’ll notice b is in the formula for a, so you’ve got to do b first, right?
So a lot of times students are a little confused about what the goal is here. If you look at the bottom of the slide, the goal is to come up with what b is and what a is, and then fill them in. And that’s your least squares line equation. So your least squares line equation is always going to have a y hat in it; that’s a variable that just gets to stay there. It’s always going to have that equals, and then after that, whatever your b is is going to be mushed up next to that x, so it’s always going to have that x there. And then plus, and then whatever you get for a. And just as a trick, if a turns out to be negative, then it ends up being minus instead of plus, right? But that’s the generic equation, and our goal is to calculate b and a and fill them in. And then we will say this is our least squares line equation. Oh, remember how I was saying you actually need to make some new calculations? You need to make y bar and you need to make x bar. And it’s a little easier to show when I’ve got the columns up. If you look at the bottom of the slide, remember how the sum of x was 678, and remember how our n is seven, and remember how the sum of x divided by n is your x bar? And the same goes for y, right? We have the sum of y divided by seven. I just wanted to quickly remind you of this, that you need to generate these things before you can actually finish the least squares line equation. I cut to the chase, basically: I just summarized the actual numbers you’re going to need and put them over here, so we don’t have to look at that whole big table anymore. All right, and you’ll notice that I grayed out the sum of y squared, because I realized later we don’t really use that. Okay, so let’s look on the left side under the big list of numbers we have, and you’ll see the b equation that I filled in, right? And if you compare that to the formula on the right side, you’ll see what’s going on. You know that n is seven, right?
So wherever you see that seven, that’s where n is. Okay, then on the top of the equation, remember the sum of xy? Let’s just look that up. Yeah, that’s that big number, 18,458. I wanted to just be clear: you have to do out that left side, the seven times 18,458; you have to do that one out. And then do out the right side, which is the sum of x times the sum of y, which is 678 times 166; you have to do that one out. And then after that, you have to subtract the right one from the left one, because of order of operations. Okay, so that’s how you make the numerator. Now let’s just look downstairs. Again, we have an n, so we know that’s seven, and then that sum of x squared. And remember, it doesn’t have the parentheses around the sum of x. If it had the parentheses around it, you’d be taking 678 and squaring that, but it doesn’t have the parentheses, so you have to use that big number, 67,892. Okay. And again, like with the upstairs, you’ve got to do out that side of the equation, right? That term, you’ve got to multiply that out before even looking at the rest of the equation. And then, oh, here we go. On the right side of the denominator, we have the sum of x, squared, with the parentheses. That’s exactly the example I was giving earlier. So you say 678 times 678, and you have to do that one out, right? And then after you do that one out, and you do the first one out, you subtract the second one from the first one. Remember order of operations. And if you do it right, see below on the left side of the slide: you should get that for the numerator and that for the denominator, and then you divide them out and you get 1.1. And that’s your b, right? So there you go. That’s how you do it. And so now we’ve got to worry about a. So what I did was I just wrote b at the top there, so b is 1.1. And so now we can use b to try and figure out a. So remember, look at my list. I did x bar and y bar for you, just so we had that ready.
So now we’re going to calculate a by putting in y bar minus, and remember order of operations again, we’ve got to do the b, which is 1.1, times x bar. So we do that one out first, and then subtract it from 23.7. And remember I was saying sometimes you get a negative a? Well, we got negative 80 for a. All right, so we got our b, we got our a, and let’s go. Now, oh, if you want to check your work, this should work out, right? You should be able to take the b times the x bar, which is 1.1 times 96.9, minus 80, you know, the a, and you should get 23.7. (The match is only exact if you carry extra decimals in b; with b rounded to 1.1 you land nearer 26.6.) So if that works out, then you know you did everything right. But remember what the goal was: the goal was to actually fill in that least squares line equation. So if you look over on the right, that’s what we did. We still have our y hat, we still have our equals, and now we have a 1.1 where the b belongs. We still have that x, because the y hat and the x are variables, and then we do minus 80, because we came out with a negative one. If it had been just plain 80, we would say plus 80, okay? All right, at the beginning of this presentation, I teased you that we were going to do prediction with the least squares line equation. We weren’t going to use a crystal ball; we were going to use this equation. Well, I finally get to that exciting part of this presentation. But, and there’s always a big but, I first have to warm you up with some rules, right? First of all, I just want you to reflect on what we just did, and realize that we can draw the least squares line, but unlike in algebra, our xy pairs probably aren’t on it, right? Like in this example, none of the xy pairs are on it. So you need to be sure about at least one xy pair that’s actually going to land on the least squares line.
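
As a cross-check on the slope and intercept, here is a small Python sketch that redoes the arithmetic from the recycled column sums. It assumes only the sums quoted in the lecture; the variable names are invented for illustration. It also keeps the slope unrounded, which matters for the intercept.

```python
# Column sums recycled from the r table (seven patients).
n = 7
sum_x = 678       # sum of DBP values
sum_y = 166       # sum of appointment counts
sum_x2 = 67_892   # sum of x squared, the "no parentheses" kind
sum_xy = 18_458   # sum of x times y

# Slope: do out each product, subtract, then divide numerator by denominator.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# The two new calculations: the means of x and y.
x_bar = sum_x / n
y_bar = sum_y / n

# Intercept: back-calculated from the point (x_bar, y_bar), which is on the line.
a = y_bar - b * x_bar

print(f"b     = {b:.2f}")
print(f"x_bar = {x_bar:.1f}, y_bar = {y_bar:.1f}")
print(f"a     = {a:.1f}")

# Check: plugging x_bar into the line should give back y_bar.
print(f"b * x_bar + a = {b * x_bar + a:.1f}")
```

With the slope carried at full precision (about 1.07, which the lecture rounds to 1.1), the intercept comes out at about negative 80 and the check returns y bar. If the rounded 1.1 is used instead, 23.7 minus 1.1 times 96.9 gives roughly negative 82.9, so small differences from the slide are a rounding effect rather than a mistake in the method.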
And the only one that you can be sure is going to land on the least squares line is x bar comma y bar. And if you reflect on it, that’s why we had to calculate that, right? Because we had to use x bar and y bar in the calculation to back-calculate a, the y-intercept. Now, you may be lucky and get a data set where there is an xy pair that just happens to fall on the least squares line, or maybe even a couple, or maybe more. But you can’t trust that. So if you need to trust that there’s a point on the least squares line, you know it’s always going to be x bar comma y bar. All right. And now I want to focus more on the slope, or b, right? So remember, in our example we just calculated b and we got 1.1 for b, and that’s a slope. So I want to point out that the slope b of the least squares line tells us how many units the response variable, or y, is expected to change for each one unit of change in the explanatory variable, or x. So that’s kind of a tongue twister. But if you think of our example, it’s a little easier to understand. So given that the slope was 1.1 in our example, and that we were having x be DBP and y be number of appointments over the last year, what we’re essentially saying is: for each increase of one mmHg of DBP, the x, there is a 1.1 increase in the number of appointments the patient had over the past year. So as DBP goes up by one, the appointments go up by 1.1. Well, I don’t know what one tenth of an appointment is, but you get what I’m saying, because it’s just a y, okay? And so the number of units of change in the y for each unit of change in x is called the marginal change in the y. So if you think about it, that’s 1.1. So 1.1 is the slope.
But 1.1 is also the marginal change in the y for each unit of change in the x. Now, I also want to recall for you this concept of influential points, right? Remember, we should have done a scatterplot and everything before we got to this point, because we need our r, we need all those sums of x’s and sums of y’s and whatever, right? And so like with r, if a point is an outlier, and you can see it on the scatterplot, it can really drastically influence the least squares line equation, just like it can screw up r, right? An extremely high x or an extremely low x can do this. And I was just pointing out a culprit we have here on the scatterplot. So always check your scattergram first for outliers, because you could end up in a situation where you’re making a least squares line and there’s a bunch of outliers whacking it out. Okay, now, you’re probably like, when do we get to the prediction part? I’m like, you just have to relax; I have to get through a few of these issues first, right? So one of them is the residual. And you know, the word residual kind of sounds like residue, right? Like somebody comes over and sets their cup on your coffee table without using a coaster, and that leaves some residue, and you get all mad. Well, that’s kind of what a residual is. It’s kind of like residue; it’s something left over, right? So once you make the least squares line equation, there’s something I just want you to notice. And that is, you can take each x, remember how we had seven patients and they each had an x, you can take each x, plug it into the equation, and get the y hat out, right? So I want to just demonstrate doing that. We have our equation in the upper right here. So for patient one, I took patient one’s x, which was 70, and I plugged it in: 70 times 1.1, minus 80.
You know, I put it in the equation and I got negative three. Now that’s y hat. The real y, I put it on the screen here, is actually three. So as you can see, you know, it’s not the same answer, right? And then patient two, I did it with patient two also. I did 1.1 times 115, because that’s the x, and then minus 80, you know, because that’s the rest of the equation. And I got 46.5. Now that was a little closer, because look at patient two’s y. That was 45. It’s really close to this 46.5, so that’s a little bit better.

But the reason I was doing all that is I just wanted to tell you the residual is y minus y hat. So in the first case, y hat was negative three and y was three. So for patient one, we did three minus negative three, and we got six. So that’s the residual. It’s kind of like residue, right? It’s like the residue left over between y hat and y, right? And then for patient two we did it again. We took y, which is 45, minus y hat, which was bigger, it was 46.5. So we got negative 1.5. So that’s the residual. So this is how you calculate the residual. And this is what it is, this is how you get it. But the bottom line is, you don’t want big residuals, right? Because that would mean the line didn’t fit very well. So you’ll find that if you have a really good fitting line, you have very small residuals. And so you’re probably like, well, what’s a good fitting line? Well, we’ll get to the coefficient of determination, and that’ll help you see what constitutes a good fitting line.

But first, I will get to the prediction part, okay? So you’re done with your least squares line equation, and you want to use it for prediction. So let’s say you knew someone’s DBP, and you wanted to predict how many appointments she or he would have in the next year. Now, what you’re not doing is reusing your x’s from your data. We just did that to make the residuals. What you’re doing is actually imagining a new thing out there. And you’re gonna use this equation for prediction.
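
The residual arithmetic for patient one and patient two can be sketched in a few lines of Python. This is only an illustration, assuming the least squares line from the example is y hat = 1.1x minus 80. The function names are made up for this sketch, and only the two patients quoted in the lecture are used.

```python
# Slope and intercept as quoted in the lecture's example (y hat = 1.1x - 80).
B = 1.1
A = -80.0


def predict(x):
    """Return y hat, the value the least squares line gives for x."""
    return A + B * x


def residual(x, y):
    """Return y minus y hat: what is left over after the line's estimate."""
    return y - predict(x)


# (DBP, appointments) for patient 1 and patient 2, as quoted in the lecture.
patients = [(70, 3), (115, 45)]

for x, y in patients:
    print(f"x={x}  y={y}  y_hat={predict(x):.1f}  residual={residual(x, y):.1f}")
```

Up to floating-point rounding, the residuals come out to about 6 and about negative 1.5, the same numbers worked out by hand.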
So you could plug in the DBP as an x, and get the y hat out, and say that’s your prediction, right? But you gotta use some caution. Say you use an x within the range of the x’s that made the original equation. As you can see, I put the x’s up here, and the range of the original x’s was like 70 to 125, right? Those were, you know, the areas covered by x, right? If you do that, if you pick an x somewhere in there, this type of prediction is called interpolation. And people feel pretty good about it. But if you use an x from outside the range, like one that’s really smaller, like 65, or one that’s bigger, like 130, then it’s called extrapolation. And then it’s not such a good idea, because you don’t know if it’s really going to work, right?

So here, I’m going to give you an example of interpolation. The patient in your study has a DBP of 80. Okay, so 80 is right in there, it’s in that range. So let’s use it, right? So we do it. Now, this looks familiar to you, because we just did this when we did residuals, but we’re using a new person now. So 1.1 times 80, minus 80, equals eight. So what we would do is predict that this patient would come to eight appointments next year. So there, that’s how we use our least squares line equation, like a crystal ball where we can predict, right?

So is it really this easy, right? Is this all you have to do to predict the future? Well, it’s not really that easy. You can’t make a useful linear equation out of any old set of x, y pairs. So remember this from our last lecture? See the scatterplot. It looks like what? A cloud. That’s right. It doesn’t have a linear relationship, you know, it doesn’t look like it should make a line. But you know what, you feed that stuff into the software, or you feed that stuff into your b formula and your a formula, and you’ll get a line out of it, even if there’s no linear correlation. And so if you get that line out of some scatterplot that looks like this, then it’s not a very good line, right?
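
Here is a small Python sketch of the interpolation versus extrapolation check described above. It assumes the line y hat = 1.1x minus 80 and treats 70 to 125 as the range of the original x’s, which the lecture only gives approximately. The function name is hypothetical.

```python
B = 1.1                  # slope from the lecture's example
A = -80.0                # intercept from the lecture's example
X_MIN, X_MAX = 70, 125   # approximate range of DBP in the original data


def predict_with_check(x):
    """Return (y_hat, kind), where kind is 'interpolation' or 'extrapolation'."""
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return A + B * x, kind


for dbp in (80, 65, 130):
    y_hat, kind = predict_with_check(dbp)
    print(f"DBP={dbp}: predicted appointments={y_hat:.1f} ({kind})")
```

The sketch still returns a number for 65 and 130, which is the point: the equation will always hand back a prediction, so the label is what warns you not to trust it outside the original range.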
And it wouldn’t work very well for prediction, right? Because this looks pretty unpredictable. So for that reason, we can’t just accept any line that is handed to us. To evaluate if our least squares line equation should be used for prediction, we need the coefficient of determination.

So here we are at the coefficient of determination. And so remember how I said you have to recycle, recycle, recycle in this? Well, get out your r, it’s time to recycle. So the coefficient of determination is also called r squared. And it literally means r times r. And I just have to add this on: just like the coefficient of variation, remember that one? We always turn r squared into a percent, right? And so you times it by 100 to make it a percent. So in this example that we did, remember, early on in the last lecture, we did the r for this, not the scatterplot I just showed you, but the one of DBP and the appointments, right? And we got an r that was a really, really strong positive correlation, right? We got 0.95. Well, if we want to calculate r squared, which is the coefficient of determination, we take 0.95 times 0.95 and we get about 0.90. But we gotta do that percent thing. So we end up with 90%.

So this is how you say it. You say that 90% of the variation in y is explained by the linear equation, right? So, you know, y varies, right? Like how many appointments they had, you know, it was different for each person. Well, 90% of that variation is explained by the equation. And of course, if you take 100% minus 90%, there’s 10% unexplained variation. So there’s still some variation that could be explained by other variables, but not a lot. And how you actually state it is, you know, when you’re done with this, if you were writing a paper, you’d say, 90% of the variation in the number of appointments is explained by DBP. And I know people are like, explain? Like, it doesn’t have a mouth, like, what is it talking about?
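
The r squared arithmetic can be sketched like this in Python. It is a minimal illustration using the r of 0.95 quoted in the lecture. The function name is made up for the sketch.

```python
def coefficient_of_determination(r):
    """Return r times r, expressed as a percent."""
    return r * r * 100


r = 0.95  # correlation quoted in the lecture for DBP and appointments
explained = coefficient_of_determination(r)
unexplained = 100 - explained

print(f"explained: {explained:.0f}%  unexplained: {unexplained:.0f}%")

# The sign of r does not change the result, because r is multiplied by itself.
assert abs(coefficient_of_determination(-0.95) - explained) < 1e-12
```

Unrounded, 0.95 times 0.95 is 0.9025, so the 90% in the lecture is the rounded figure.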
You just have to say it this way. It’s statistics-ese, this is how you say it. And by contrast, or as the complement, what you would say is 10% of the variation in the number of appointments is not explained by DBP, right? It could be explained by other things.

Well, we happened to get a nice CD. I say CD for coefficient of determination. You know, we got a nice high one. But what if it’s low? Well, let’s just think about it. CD should be better than at least 50%, because that would be random, right? And the higher the better. So if you’re on a test, nobody’s going to give you a CD of like 60% and say, is this any good? Because I don’t know, you’d be very conflicted. In real life, what I use it for is to compare models. If one is 60% and the other’s 55%, of course I’m going to go with the 60% one, but it’s still not very good, right? And the higher the better, basically. And if it’s low, it means that you probably need other variables to help the x you used explain more of the variation, because that x is not doing it.

Okay, in summary, I just wanted to go over chapter four, so you realize where we’ve been. Okay. So we started out with a set of quantitative x, y pairs. First thing we did was we made a scatterplot. We wanted to look at the linear relationship between x and y, and we wanted to look at outliers. If we’d seen a lot of outliers, or no linear relationship, we would have stopped there. But because this is a class and we had to learn, I forced there to be a scatterplot with a linear relationship and not too many outliers, so we could move forward and do r. So we calculated r to see if our correlation was positive or negative, and weak, moderate, or strong. So that’s what you do if you find a linear relationship. Next, in addition, in this lecture, we calculated b and a to come up with the least squares line equation. And I just wanted you to notice that the sign on b will always match the sign on r.
So if you have a positive r, you’ll have a positive slope. If you have a negative r, you’ll have a negative slope. But otherwise, the numbers won’t match, just the sign. And then also, I wanted you to notice that strong correlations will give you a high coefficient of determination, even if they’re negative correlations, because remember, it’s r times r. And so negative times negative is still positive, right? So if you have a strong correlation, like negative 0.9 or 0.9, it really doesn’t matter what direction. If it’s strong, then you’re going to get a high coefficient of determination.

So after we did this b and a thing, we used that linear equation to calculate residuals, right? Like we took the x’s from the original data and put them in, got the y hat, and calculated the residuals. After that, we used r to calculate the coefficient of determination, or CD, to decide if we wanted to use the linear equation for prediction. Because if it was bad, we weren’t going to do that. But we decided it was good for prediction at 90%, and we decided to use it. So that was our journey through these x, y pairs all the way down to the coefficient of determination. Good job, you made it.

So in conclusion, the least squares criterion and calculating the least squares line was the first thing we went over, how to do that and what it all means. And then I reviewed some issues with prediction using the least squares line, because it looks kind of easy. It looks kind of, you know, better than sliced bread, but there are some things you have to think about. Finally, we went over the coefficient of determination so that you could figure out how good your least squares line equation was. And I just wanted to point out that CD kind of looks like CDs, you know, like we used to have CDs. They were so pretty and rainbowy like that. But now all CD means is coefficient of determination.

Hello, and welcome back to statistics.
It’s Monica Wahi, your Laboure College lecturer, and you’ve made it to chapter seven. I broke up chapter seven into bite-sized pieces. And we’re going to start with chapter 7.1, talking about the normal distribution and the empirical rule. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state two properties of the normal curve, state two differences between Chebyshev intervals and the empirical rule, and explain how to apply the empirical rule to a normal distribution.

So, remember distributions? We learned about them a while back, but I’ll remind you a little bit about them. And then we’re going to talk about properties of the normal distribution, or specifically the normal curve, that shape that comes out of making a histogram of normally distributed data. Then we’re going to remember Chebyshev intervals. We’re going to talk about what Chebyshev did for us, and what Chebyshev really didn’t do for us. And then we’re gonna move on to the empirical rule, which works very well, better than Chebyshev intervals, when you have normally distributed data. And then I’m going to show you an example of how to apply the empirical rule to that normally distributed data.

So remember the normal distribution? In fact, remember distributions at all, right? So to get a distribution, and a lot of people sort of forget this by the time we get to chapter seven, but I just wanted to remind you, this is from an earlier lecture, we had a quantitative variable, which was how far patients had been transported. And we determined classes, and we made a frequency table. So remember that? And then after that, we made a frequency histogram, and that made a shape. And as you could see, that shape, which is the distribution, in this one was skewed right. See that tail on the right? Okay, but that’s an example of something we cannot apply the empirical rule to, because the empirical rule only applies to normally distributed data.
So I had to give you an example of that. And here’s my example. So when I was in my undergraduate in costume design at the University of Minnesota, they made us take a chemistry class in one of those big lecture halls. So I was in a very large class that probably had about 100 people. And we were given this really difficult test. It was a 100 point test, and I was used to getting, like, A’s. And so when they were done with the test, the TAs were handing the tests back to everybody so they could see their grade, while the professor was writing on the board the frequency of all the different scores. And I remember the TA handed me my test, and it said 73 on it. And I’m used to getting like 90s, up to 100. And I remember stating out loud, saying, 73, that is an awful score, I can’t believe I did so badly. I was talking like that. But at the same time, the professor was writing the frequencies on the board. And what I realized is the top score was in the 80s, and I had the third top score with 73. That’s how hard the test was. And that’s when I shut up, because I noticed everybody giving me dirty looks, because they had actually scored below me.

So I wanted you to imagine that class. And I imagined what the distribution of the scores would look like for that class. And the reason why I thought it would be normal is because we all did badly, right? And so nobody got 100. So we were all below the 100. So I imagined this curve here for you. And I imagined my class had 100 people, just to make it easy. Of course, the test was difficult, and nobody got 100 points. And the mode, the median, and the mean were all near a C grade, because you remember how, when you have a normal distribution, the mode, median, and mean are all on top of each other. So we all did pretty badly. So I’m going to use this example of the fake chemistry test scores to exemplify these properties of the normal curve. So there’s five I’m going to talk about.
The first is that the curve is bell shaped, with the highest point over the mean. And so you can see I drew a scribbly little curve and put a little arrow there to show you that that’s where the mean of the scores was. And then I also wanted you to notice that the curve is symmetrical around a vertical line through the mean. So there’s like a mirror image of the curve on either side. Now, it’s not perfect, obviously. But it should be roughly like that. And you know, this is not true of skewed or bimodal or these other things we’ve been talking about. Okay, and the third property is that the curve approaches the horizontal axis but never touches it. You don’t have to memorize this, but remember asymptote, or asymptotically close? That’s when a line gets really close to another line, but they never touch. It’s so romantic. But anyway, that’s a very Bollywood thing to say, by the way. But, uh, so the curve approaches the horizontal axis and never touches or crosses it.

And then also, there are these inflection points, or transition points, between cupping upward and cupping downward. And these transition points occur at about the mean plus one standard deviation and about the mean minus one standard deviation. And this is a little hard to explain. But imagine you’re on a roller coaster and you’re going up this normal curve. There’s this part where you’re just mainly going up. Well, the part where it seems to kind of level out, and you’re getting to the top of the curve and you start relaxing, that’s that inflection point. And so as you’re going over in the roller coaster, and you’re in that flat part, and then you start kind of going down, that’s the second inflection. So what it’s saying is that the property of this curve is that you have these inflection points like that. And they roughly occur at one standard deviation above and below the mean.
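
For anyone who wants to see the cupping change in numbers, here is a rough Python sketch. It assumes the standard formula for the normal curve and, purely for illustration, a mean of 0 and a standard deviation of 1. Neither the formula nor these function names come from the lecture.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def curvature(x, mu=0.0, sigma=1.0):
    """Second derivative of the normal curve at x.

    Negative means the curve is cupping downward, positive means upward.
    """
    z = (x - mu) / sigma
    return normal_pdf(x, mu, sigma) * (z * z - 1) / (sigma * sigma)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  height={normal_pdf(x):.4f}  curvature={curvature(x):+.4f}")
```

The curvature is negative near the mean, positive far out in the tails, and zero at one standard deviation on either side, which is where the transition points sit. The heights at minus 1 and plus 1 also match, which is the mirror-image property.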
Then finally, and I colored it in like this just so you could see it, the area under the entire curve is one, so think 100%. So it would be nice if that were a square or rectangle, or even a triangle, something that we’re used to in geometry, but it’s not. It’s this goofy shape, right? But still, you need to get it in your head that that shape is worth 1.0 in proportion land, or 100% in percent land. And what I mean by that is, let’s say we cut that shape in half. Each side would have 50%, or 0.5, on it. Then let’s cut it a different way. So the part of the curve on the right side of that line is a fourth of the curve, or 25% of the curve, even though it’s goofy shaped, and the part on the left side is 75%. So what we’re trying to get you to think is that, yeah, you can just declare that all the area under the curve equals one, or 100%. But the reason why we’re declaring that is because we’re gonna cut it up and talk about different percents of the curve.

Now we get to the empirical rule, since we reviewed this whole curve thing. And I’m going to make you remember Chebyshev, I’m sorry. But you know, let’s talk about Chebyshev. Chebyshev helped us get some intervals, right? And intervals have boundaries, or limits. They have a lower limit and an upper limit. That’s how you know what bounds the interval. So when we were doing Chebyshev intervals, what we would do is we’d figure out a lower limit and an upper limit, and we’d say at least so much percent of the data falls in the interval, right? So when we would choose a lower limit of mu minus two times the standard deviation, and an upper limit of mu plus two times the standard deviation, we would say at least 75% of the data were in the interval. So I wanted to just show you a demonstration using my fake class. So remember, there were 100 students in the class. I actually came up with a mu for them, and their mu on the test was 65.5. So my 73 was better than the mean, but not much better, right?
So the mu for that class was 65.5, and the standard deviation was 14.5. So I calculated the Chebyshev interval for 75% of the data. So I took 65.5 minus two times 14.5, and I got 36.5, which is a pretty bad grade. And then the upper limit was pretty good, right? 65.5 plus two times 14.5 equals 94.5. On a 100 point test, that’s a pretty good grade, right? So if you had 100 data points, or 100 students, at least 75 would have scored between 36.5 and 94.5. So you’re probably already realizing, okay, that doesn’t really help Monica, who scored 73. And this is a really wide range. We say at least 75% of people score there, and you could probably guess that without even knowing about Chebyshev intervals, right? So it didn’t really help me narrow down how well this class is doing. If I had had the mu and the standard deviation, I could have calculated this and said, okay, I’m no better off.

So Chebyshev’s theorem is on the left side, and it applies to any distribution. You don’t need a normal distribution, you can use that skewed distribution. Also, you’ll notice it says at least. So like, this was at least 75% of the data fell in there. Maybe even 100% fell in there. So it doesn’t really help us. And you start with two standard deviations. If you go out three, it’s at least 88.9%, and at four, it’s at least 93.8%. You know, you might as well start at the beginning and say almost 100% of the data falls in this interval. And if you’re saying that, it’s not very useful, right? But it kind of gets stuck doing that, because Chebyshev’s theorem applies to any distribution. The empirical rule is much more elite. It only applies to the normal distribution. And you’ll see why, if you are lucky enough to get the normal distribution, you want to use the empirical rule instead of Chebyshev. Okay? Because secondly, the empirical rule says approximately. It doesn’t say at least. So it’s basically saying, not at least, it’s saying about exactly this. So you can trust it.
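
The Chebyshev numbers above can be reproduced with a short Python sketch. It assumes the usual statement of Chebyshev’s theorem, that at least 1 minus 1 over k squared of the data lie within k standard deviations of the mean. The lecture quotes the resulting percents but not the formula itself, and the function names are made up.

```python
def chebyshev_min_fraction(k):
    """At least this fraction of the data lies within k standard deviations."""
    return 1 - 1 / (k * k)


def interval(mu, sigma, k):
    """Lower and upper limits: mu - k*sigma and mu + k*sigma."""
    return mu - k * sigma, mu + k * sigma


mu, sigma = 65.5, 14.5  # the made-up chemistry class from the lecture

low, high = interval(mu, sigma, 2)
print(f"k=2: at least {chebyshev_min_fraction(2):.1%} between {low} and {high}")

for k in (3, 4):
    print(f"k={k}: at least {chebyshev_min_fraction(k):.1%}")
```

With k equal to 2 this gives the 36.5 to 94.5 interval, and the guaranteed percent climbs toward 100% as k grows, which is why the intervals stop being informative.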
Okay, you don’t have this unknown, like maybe it’s 100%. So this is what it says, and I’ll show you a diagram of it. It says that about 68% of the data are in the interval mu plus or minus one standard deviation. So from mu minus one standard deviation all the way up to mu plus one standard deviation, 68% of the data are in there. And you’ll notice that Chebyshev didn’t even say anything about one standard deviation. And so already, we’ve got something way more useful if we apply the empirical rule, right? So next we go to 95% of the data are in the interval mu plus or minus two standard deviations. Approximately 95% are in there. Now, if we had brought in Chebyshev, we’d be saying something about this too. We’d be saying 75%. Okay, we’d be saying at least 75%, which could be 95%. But here, if we’re using the empirical rule, we’re relatively sure that it’s 95% between mu plus or minus two standard deviations. You can like that better, right? Finally, if you get out to three standard deviations, you’re kind of running out of data, because 99.7%, almost all of them, fall in that interval. So as you can see, the empirical rule is going to give you a more specific answer. But again, you can only use it if you have a normal distribution, which we do. So let’s go look at that.

Okay, this is a diagram that I made myself, actually, because I thought the other diagrams I saw were not pretty. And this one is very pretty in my mind. But let me unpack this diagram for you, because there’s a lot going on. And first of all, I want you to notice the shape of it. It’s a normal distribution, okay? And then I want you to notice that I put this black line down the middle, and I put a little arrow that says mu. So this is where we want to imagine mu, no matter what your actual numbers are for mu. Like in our case, this is 65.5 for our points.
Just imagine whatever your mu is, and whatever your standard deviation is, this is where you would put the mu, right? Then you’ll notice that each of these sections that’s colored has a little standard deviation symbol in it, because that’s representing that the width of that section is one standard deviation. So if your standard deviation was like five, then the green one would go up to mu plus one standard deviation, so it’d be the mean plus five. And then you draw that parallel line there, and see that arrow that says mu plus one standard deviation? That would be there. And of course, I just had to use the symbols, because I don’t know how big the standard deviation really would be, or what the mean really would be. But whatever it was, if you go up to mu plus one standard deviation, you would see that that green area represents 34% of the data. And if you’re lucky enough to have exactly 100 people, like I did in my demonstration, that would mean that between mu and mu plus one standard deviation of these test scores would be 34 people’s scores, right? So you can really figure that out. Same with the yellow section, only that’s mu minus one standard deviation, and 34% of the scores would be between those two numbers.

Now you’ll see, as you get up into the blue, that’s between one and two standard deviations above the mu, you’ll see that because the roller coaster’s a lot lower to the ground there, that section is really small. It’s only 13.5% of the data. And the same with the orange one that’s on the other side of the mu. So that’s below the mean, and that’s only 13.5%. And then you’ll notice that between two and three standard deviations, there’s a little tiny piece, right? The purple piece and the red piece, those are each only worth 2.35% of this shape.
And then I wanted to point out there is some stuff at the end. In the little black part beyond three standard deviations on either side, there’s 0.15%. And a lot of times people forget that. But one way you can make sure to remember that it’s there is that if you add up all these percents on the slide, you’ll get 100%, because remember, I promised you that the whole curve is worth 100%. And this is how we split it up. I also want you to notice that there’s kind of a cheat, right? If you just add up the green, blue, purple, and then the little black part at the end, if you just add up those percents, you’ll get 50%, right? Because that’s half the curve. And you’ll get the same thing if you do the yellow, orange, red, and the little black part at the bottom. If you add those up, you’ll get 50%. So that’s how you want to conceptualize this whole empirical rule diagram.

But now we’ll apply it. So I put the empirical rule diagram on the left, and then I put our class frequency histogram on the right. And look, I put the mu and I put the standard deviation so we could have them there. Now, in the first part of this section, I’m just going to show you how to fill in the numbers under the diagram. Okay, and then after we fill in the numbers, I’m going to talk to you about how to interpret those numbers. So let’s start with the easy one. Let’s write the mu underneath the symbol for mu, which was 65.5. So we just wrote that, that was simple, okay? Now let’s do the plus or minus one standard deviation. So you’ll see 65.5, which is our mu, minus, and I put one times 14.5. I know, I just did that for demonstration purposes, so you see we’re doing one times the standard deviation. So if you subtract that from the mu, you get 51. And so I wrote that 51 underneath the mu minus one standard deviation. And if you go the opposite way, and you add on 14.5, you get 80. So I put that up there.
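
The check that the colored pieces add up to 100%, and that each half adds up to 50%, can be written out in Python. This is a small sketch using the percents quoted for the diagram. The band labels are my own wording, not the slide’s.

```python
# Percent of the area in each band of the empirical rule diagram,
# listed from the far left tail to the far right tail.
BANDS = [
    ("below mu - 3 sd", 0.15),
    ("mu - 3 sd to mu - 2 sd", 2.35),
    ("mu - 2 sd to mu - 1 sd", 13.5),
    ("mu - 1 sd to mu", 34.0),
    ("mu to mu + 1 sd", 34.0),
    ("mu + 1 sd to mu + 2 sd", 13.5),
    ("mu + 2 sd to mu + 3 sd", 2.35),
    ("above mu + 3 sd", 0.15),
]

total = sum(percent for _, percent in BANDS)
lower_half = sum(percent for _, percent in BANDS[:4])
upper_half = sum(percent for _, percent in BANDS[4:])

print(f"whole curve: {total:.2f}%")
print(f"below the mean: {lower_half:.2f}%  above the mean: {upper_half:.2f}%")
```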
So I just labeled those two. You can kind of guess what we’re going to do on the next slide. Surprise, we’re going to do almost the same thing. Now we’re doing mu minus two times the standard deviation to get the 36.5, and mu plus two times the standard deviation to get that 94.5. And you probably already were ahead of me with this one. This is where we do 65.5 minus three standard deviations, and we get 22. And then we add three standard deviations, and we get 109. And now we’re all labeled.

So what does this all mean? Well, remember, our n equals 100, just out of convenience. So what does this mean? It means that 34% of the scores are between 51 and 65.5. So that’s the yellow bar, right? So 34 scores were in there, because I had 100 people in the class. So I’m standing there in that class, and I’ve got a 73. But I know 34 of those people I’m looking at have a score between 51 and 65.5. I also know that another 34%, or another 34 in this class, because there’s 100, have a score between 65.5 and 80. And my 73 is somewhere in there, right? So already, I’m getting an idea that 68 people, or 68% of the scores, are going to be between 51 and 80, right? And so I’m right there with 68% of the class.

So I’m going to go through some fake test questions for you, just to show you how to come up with the answer. So let’s say the question was, what percent of the data, the student scores, are between 36.5 and 80? So think about how you would answer that question. So see where 36.5 is? It’s on the lower limit of the orange part. And see where the 80 is? It’s on the upper limit of the green part. So what you would do is you would add up the percents in between, right? 13.5 plus 34 plus 34. And the answer to what percent of the data are between 36.5 and 80 would be 81.5%. Here’s another question. What cut point marks the top 16% of the scores? So already, you know you’re up in that area, probably where the purple or the blue are, right?
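
Labeling the cut points and adding up the bands between two of them can be sketched in Python as follows. It assumes the made-up class values of mu 65.5 and standard deviation 14.5. The helper only works when both limits are exact cut points, which is all the empirical rule covers, and the names are hypothetical.

```python
MU, SD = 65.5, 14.5  # made-up class mean and standard deviation from the lecture

# Percent of the data in each band between neighbouring cut points,
# from (mu - 3 sd, mu - 2 sd) up to (mu + 2 sd, mu + 3 sd).
BAND_PERCENTS = [2.35, 13.5, 34.0, 34.0, 13.5, 2.35]

# Cut points at mu - 3 sd, mu - 2 sd, ..., mu + 3 sd.
CUT_POINTS = [MU + k * SD for k in range(-3, 4)]


def percent_between(low, high):
    """Add up the bands lying between two cut points (both must be cut points)."""
    i, j = CUT_POINTS.index(low), CUT_POINTS.index(high)
    return sum(BAND_PERCENTS[i:j])


print(CUT_POINTS)
print(percent_between(36.5, 80.0))
```

The cut points come out as 22, 36.5, 51, 65.5, 80, 94.5 and 109, and the percent between 36.5 and 80 is the 13.5 plus 34 plus 34 from the first test question.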
And so what would make the top 16%? Well, if you actually add together that 0.15% from the little black part, the 2.35% from the purple, and the 13.5% from the blue, you’ll get 16%. So the cut point for that is 80. All the scores above 80 would constitute the top 16% of the scores.

Here’s another quiz question: what percent of the scores are below 94.5? So we see 94.5 is at the upper limit of the blue section. So you could kind of say, well, let’s just add up everything below it, and that percent of the scores will be below 94.5. And so we do that, we add up everything below it. But remember how I said that the yellow, orange, red, and the little black part there equal 50%? If you just wanted to say, okay, that’s 50% plus the green part plus the blue part, you could do that, and then you get the same answer, 97.5%.

So what are the cut points for the middle 68% of the data? I just wanted to show you an example. What if they say middle, right? Well, you’re gonna have to be centered around mu, right? So the middle 68% means 34% above the mean and 34% below the mean. So the cut points would be 51 and 80.

Okay, now I’m going to ask a similar question, but I’m going to use different words. Okay. What is the probability that if I select one student from this class, that student will have a score less than 80? Okay, so notice, I’m using totally different terminology. I’m saying, what is the probability? Yet the actual answer is what you would probably guess, which is where you add up all the percents below 80. So the point of me giving you these quiz questions is to point out that percent and probability mean the same thing here. So either I’m gonna say, what percent of the data are below the score of 80? Or, what is the probability that if I select one student, that student will have scored less than 80? That is actually the same question. So for the answer, I’m going to use that 50% trick here.
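
The quiz questions in this stretch all come down to adding up bands below or above a cut point. Here is a hedged Python sketch of that, again assuming mu 65.5 and standard deviation 14.5, with made-up function names. It only gives sensible answers at the labeled cut points.

```python
MU, SD = 65.5, 14.5

# (upper limit of band, percent of data in band), from the far left tail upward.
# The last band has no upper limit, so it uses infinity.
BANDS = [
    (MU - 3 * SD, 0.15),
    (MU - 2 * SD, 2.35),
    (MU - 1 * SD, 13.5),
    (MU, 34.0),
    (MU + 1 * SD, 34.0),
    (MU + 2 * SD, 13.5),
    (MU + 3 * SD, 2.35),
    (float("inf"), 0.15),
]


def percent_below(cut_point):
    """Percent of the data below a cut point (must be one of the band limits)."""
    return sum(pct for upper, pct in BANDS if upper <= cut_point)


def percent_above(cut_point):
    """Percent of the data above a cut point."""
    return 100 - percent_below(cut_point)


print(f"below 94.5: {percent_below(94.5):.2f}%")
print(f"above 80:   {percent_above(80.0):.2f}%")
print(f"between 51 and 80: {percent_below(80.0) - percent_below(51.0):.2f}%")
```

Because percent and probability are the same thing here, the same functions answer the probability wording of each question.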
That answer is 50%, which is the whole bottom half of that curve, plus 34%, which gets up to 84%, right? So the probability that if I select one student, that student will have a score less than 80 is 84%. And that’s the same as, what percent of the data is below 80? It’s 84%. Okay.

Here’s another probability question: what is the probability I will select a student with a score between 36.5 and 51? Well, that’s the same question as, what percent of the data are between 36.5 and 51? And you would know the answer: that would be 13.5%. That’s the orange part, right? So even if I say, what is the probability I will select a student with a score between 36.5 and 51, it’s 13.5%. So let’s say that we were at a casino, and we were betting, right? And I’m saying, okay, there’s 100 students, I’m going to just grab a score out, and I’m betting a lot of money that I’m going to grab somebody between 36.5 and 51. And you’d probably be like, you don’t want to bet on that, because you only have a 13.5% probability of selecting one. If you’re going to bet, you probably want to bet on something in the yellow section or something in the green section, because they have higher probability. So that’s how you would think about probability and percent. They’re kind of the same thing. I just wanted to show you how they word the questions differently, but it means the same thing.

So now I want you to just sit back and think for a second. So think about what would happen in a different class taking the same hard test, meaning nobody’s getting 100%. What if the mu was the same, meaning everybody’s doing badly, but the standard deviation was larger than 14.5? What would that do to the intervals? So let’s just stare at this for a second. Let’s say the mu was still 65.5, but the standard deviation was like 30. Okay, there was a lot of variation in the class. That would already mean that where the 80 is right now, that would actually be 95.5, right? And look at where that 51 is there.
Now, if we have a standard deviation of 30, that would actually be 35.5. I mean, that’d be a way bigger interval, right? And so the class I was in for chemistry was an undergraduate class, and I was in costume design. This was a whole bunch of different kinds of people in chemistry. And that’s probably why we even had kind of a big standard deviation of 14.5, even though I made that up. I mean, in reality, we probably did have a big standard deviation. I knew in the chemical engineering department, they had chemistry classes for chemical engineering majors. I’ll tell you, their standard deviation was probably a lot smaller, because they were probably more alike and got more similar grades to each other. But with this diverse class, we probably had a pretty big standard deviation.

So that gets to my last question: what if the standard deviation was actually smaller than 14.5? So if we were in the chemical engineering class, and they were taking chemistry, and they had a smaller standard deviation, maybe they might have had the same mean, 65.5. But let’s say their standard deviation was like five. Then where the 80 is now would be 70.5, and where the 51 is would be 60.5. And we’d have way more confidence about where the scores fell. Like as I was standing there with my 73, I would be saying, oh, you know, my 73 is pretty high, if everybody has a small standard deviation, right? Whereas it’s not that high here, because we have kind of a big standard deviation. It’s in that first section, the green part. So the reason why I want you to think about that is, that’s why this shape goes by mu and standard deviation, because how big the standard deviation is really determines how wide each of those areas with the different colors is.

So I just wanted to remind you that percent, area, and probability are all related. The percents literally refer to the percent of the area of the shape, okay? And imagine the whole thing is 100%.
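
To see how the one standard deviation cut points move when the standard deviation changes, here is a tiny Python sketch. It uses the same mean of 65.5 and the three standard deviations mentioned in the lecture. The function name is made up.

```python
def one_sd_interval(mu, sd):
    """Lower and upper cut points one standard deviation from the mean."""
    return mu - sd, mu + sd


mu = 65.5
for sd in (5, 14.5, 30):
    low, high = one_sd_interval(mu, sd)
    print(f"sd={sd}: about 68% of scores between {low} and {high}")
```

Same mean, same 68%, but the interval runs from 60.5 to 70.5 with a standard deviation of 5 and from 35.5 to 95.5 with a standard deviation of 30.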
So just to remind you, the orange part is 13.5% of the area of the whole shape, but it also is the probability that an x, like a student’s score, falls between mu minus one standard deviation and mu minus two standard deviations. And if I select one x from this group, 13.5% is the probability that I will get an x in that range. And so it means both things. So in conclusion, the empirical rule helps establish intervals that apply to normally distributed data. And it’s more useful than Chebyshev, because it’s more specific. These intervals have a certain percentage of the data points in them, and they also refer to the probability of selecting an x in that interval. And these intervals depend on the mean and the standard deviation of the data distribution. So if those change, then exactly where the numbers are on those intervals changes. Well, I hope you enjoyed my explanation of the empirical rule. And now you can practice doing it yourself at home.

Good morning, good day, and good afternoon. This is Monica Wahi, your library college lecturer, here moving you through chapters 7.2 and 7.3, z scores and probabilities. I decided to merge these two chapters together, because I thought they actually kind of belong together; I didn’t really understand why they were separated. So at the end of this lecture, you should be able to explain how to convert an x to a z score, show how to look up a z score in a Z table, explain how to find the probability of an x falling between two values on a normal distribution, describe how to use the Z table to look up a z corresponding to a percentage, and describe how to use the formula to calculate x from a z score. Well, that sounds like a lot, but you’ll understand that at the end of this lecture. First, I’m going to go over what a z score is and what the standard normal distribution is. Then I’m going to talk about z score probabilities.
And what those are. I’m going to show you how to use the Z table to answer some harder questions besides the ones I talk about during the z score probabilities section. Then I’m going to show you how to use a slightly different formula to calculate x from z. Finally, I’m going to just remind you of some tips and tricks about using z scores and probabilities correctly. So, all this talk about z scores. What is the z score? And what is the standard normal distribution? Well, let’s take a look at this very pretty thing I made. You may recognize it from the last lecture; it was my little empirical rule diagram. So remember the empirical rule, remember how it required a normal distribution? Well, that worked well for the cut points available, right? Like mu, mu plus or minus one standard deviation, mu plus or minus two standard deviations. If we asked questions that were right on those cut points, we had good answers. But what about in between those cut points? So I wanted you to notice, in this empirical rule diagram, these numbers at the bottom. I just circled them: negative three, negative two, negative one, and then mu doesn’t have a number, so pretend there’s a zero there. And then there’s one, two, and three, okay? That is the standard normal distribution, and that is also called z. So these things on the right, those are z scores. So see the green area: zero is the z score that’s on the lower limit of that, and one is the z score at the upper limit of the green area. So you can see that for this whole curve, the standard normal distribution on the right, the mean of the whole curve is zero, and the standard deviation of the whole curve is one. And that is what a z score is: it tells you how many standard deviations an x is from the mean. So I just want you to notice the concept of standard. I’m in the US, and in the US we use, you know, the US dollar. But one of the things I’ve noticed is that a lot of countries see it as a standard, so they’ll map their currency to the US dollar.
So maybe the euro will map its currency to the US dollar, maybe the Egyptian pound will also map its currency to the US dollar. And once it does that, it’s a lot easier to compare them, right? And so that’s the main reason for the standard normal distribution: it helps you compare x’s from different distributions, different normal distributions that have different means and different standard deviations from each other. It helps you map them to this standard normal distribution here, that standard, so you can compare them. So let’s talk about z scores. Every value on a normal distribution, so every x, can be converted to a z score. Just like I was saying how you can convert any currency to dollars, there’s some formula for that. You can convert every x on a normal distribution to a z score. But you have to know how to use the formula, right, and what goes into that formula. Well, first, you need the x that you want to convert to a z score, so you need to pick one. Then you need to know the mu of your distribution, your normal distribution, and the standard deviation of your distribution. And here are the two formulas that are used. The one I was just talking about, on the left, is the formula for calculating the z score. And we’ll go over the one on the right later in this lecture. So remember in the last lecture, I was talking about a class that had 100 people in it, who all took a really hard test. It was so hard, nobody got 100%. And it was a 100-point test, so nobody got 100. The top score was in the 90s. And remember, in the upper right, there’s the mu. The mu was 65.5, which is a pretty bad score on a 100-point test, and the standard deviation was 14.5. So I’m going to give you an example of calculating a z score on that particular distribution. So let’s say you’ve got a friend, a smart friend, and that smart friend got a 90 in the face of all this. Well, let’s calculate the z score for 90 on this particular distribution.
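The slide with the two formulas is not reproduced in this transcript. Going by how they are described in the lecture (x minus mu, divided by the standard deviation; and z times the standard deviation, plus mu), they presumably read as follows. The symbol $\sigma$ for the standard deviation is an assumption here; the lecture only ever says "standard deviation".

```latex
z = \frac{x - \mu}{\sigma}
\qquad\qquad
x = z\,\sigma + \mu
```

The second formula is just the first one solved for x, so the two always agree with each other.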
Okay, so here’s what we’re going to do. First we’re going to remind ourselves what our empirical rule stuff looked like. You don’t have to do this in real life when you’re doing it; I’m just doing this for demonstration purposes. Remember, mu plus one standard deviation was 80, and mu plus two standard deviations was 94.5. So already you know, whatever your answer is going to be for 90, it’s going to be between one and two, right? But we just don’t know exactly what it’s going to be. So I’m just showing you this for demonstration purposes, to relate it to the last lecture. But you don’t have to do this in real life when you calculate. Okay, so we know that the z we’re going to calculate is going to be somewhere between one and two. And as you’ll see on the slide here, I labeled over on the z curve where z equals zero, which is the mu, that’s 65.5. So we’re going to anticipate we’re going to get a z score that’s somewhere between one and two. And you’ll see in blue I listed the ingredients, right? So we have the smart friend’s score, 90, we have the mu, 65.5, and we have the standard deviation, 14.5. And then we have our z formula. So let’s do it. Okay, so x minus mu is going to be 90, which is our x, minus 65.5. You do that out first, and then you divide it by 14.5. And look, our z score is 1.69. And that’s exactly where we thought it would be: somewhere between one and two. And so as you can see, you can take any x and convert it to z. Here we’ll do another example, only this friend is not so smart. This friend actually got a score that was kind of low. It was so low, it was below the mu of 65.5; this poor friend only got a 50. So let’s try it again, let’s do a z score for 50. So again, you know, this is just for demonstration purposes. But remember, in empirical rule land, 51 was that mu minus one standard deviation.
So we’re going to expect that, again, between negative one and negative two is where our x of 50 is going to land if we calculate the z score. And so here we are, we calculate the z score: we have 50 minus 65.5, divided by 14.5, and we get negative 1.07. And the reason why it’s negative is, as you can see, it’s on the left of the mu, so the z score is going to be negative. And so as you can see, it’s exactly where we thought it would be: a little bit to the left of negative one. So now we’re going to get into something that’s a little bit harder, which is the z score probability. So you’re feeling pretty good about the z score. But now let’s talk about the probabilities. Okay, so remember the probability from the empirical rule; this is just old empirical rule stuff. So remember, I gave you a question at the end of that lecture. I said, what is the probability I will select a student with a score between 36.5 and 51? And remember, the answer was this orange area, which is 13.5%. But what if you have z scores like 1.69, the smart friend, and negative 1.07, the not so smart friend? You know, in other words, you have x’s of 90 and 50, which are not on the empirical rule cut points. How do you figure out the percent or the probability? That’s the next step with your z scores. Okay, so now let’s ask this question: what is the probability that students scored above the smart friend? Now, we could also ask for below, but I’m just choosing to ask for above this time. So in other words, what is the area under the curve from z equals 1.69 all the way up? So see, it’s a little ways through that blue part. We wish we knew the area for everything up from z equals 1.69, through the purple area, through the little black thing at the top. We wish we knew that area. We only know from the empirical rule what’s on the cut points, like one and two, but we don’t know these in-between things. So how do we figure that out? Well, this is another problem here.
What is the probability that students scored below the not so smart friend, right? And in that case, see the diagram: we’d have to figure out what is the part of the orange that that friend gets, plus the red, plus a little black part at the bottom. What is the percent or the proportion of the curve that represents that? So that’s what we’re getting into now. And what we do is we look these up in a Z table. So what the Z table is, is basically, they figured out every single z score you could have between negative 3.49 (and I’ll go into why negative 3.49) and positive 3.49. And they went like every hundredth. So they figured out, for every single one of those z scores, what the probability is, and they actually fit that all on a table. And so now, what I’m going to show you is how to use that table to look up the probabilities. And by the way, if you look up a probability that happens to be on one of those empirical rule cut points, you’ll get just about what the empirical rule says. That said, the empirical rule is nice, because you don’t have to pull out the table. But if you have something that’s not on the empirical rule cut points, get out your Z table. So how do you use the Z table? Well, the first thing is you want to figure out what area you want, right? So we’re going to start with the not so smart friend, because that’s a little bit easier, actually, to demonstrate. Okay, so what is the probability that students scored below the not so smart friend? Which is a secret way of saying, what is the area under the curve that makes up most of that orange part, all the red, and the little black part at the bottom? What is that proportion? And so for areas to the left of a specified z value, you’re supposed to use the table directly. So I’m going to show you how to use that table to look up negative 1.07. And then I’m going to come back and tell you what they mean by use it directly. Hi, there. So here we are at the Z table.
And if you have the book, you can look it up in the appendix, on page eight. But there’s also a lot of Z tables on the internet. Sometimes they’re arranged a little differently, so I’m using this one because it’s from the book. So remember, the z that we’re looking up is negative 1.07. So remember, I said they had to somehow calculate all the different probabilities for every single z between negative 3.49 and positive 3.49. Every hundredth, they had to come up with that. Well, how did they fit it all on their table? Well, this is what they did. See, this is the beginning of the Z table. Remember I said negative 3.49? Well, this is negative 3.4. And then to find the z of negative 3.49, you have to imagine that the nine is here, but it’s going to be the last one here. So see this nine here, this is what it would be. So just for pretend, if we had a z score of negative 2.58, I go to 2.5, and then I have to go over to the eight, right here. Okay. Or if I had one that was negative 2.10, right, or just plain negative 2.1, then I’d go over just one, to this zero column. And see these little tiny things in here? Those are all probabilities. In fact, let’s go look up our probability, which is for negative 1.07. So we’re going to go down here; here we are at negative 1.0. And then we have to go over to the seven column, right? So where’s the seven? Here’s the seven, it’s three from the left, I guess I could have guessed that. So we have negative 1.07, and this is 0.1423, otherwise known as 14.23%. So that’s actually what you get out of the Z table. That’s the probability, that’s the percent you’re looking for. And just in case you’re wondering, these aren’t all negative. The first page is negative; the second page is all the positive z scores, all the way up to 3.49. But what I want you to hold in your head is what we just looked at, which was negative 1.07, which is 0.1423. Okay, hold that thought.
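The printed table is what the lecture uses. As a cross-check, the same two steps (convert x to z, then find the area to the left of z) can be sketched in Python. This assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `z_score` is made up for this illustration, and rounding z to two decimals is only there to mimic the table, which goes by hundredths.

```python
from statistics import NormalDist

MU, SD = 65.5, 14.5             # class mean and standard deviation from the lecture
STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_score(x, mu=MU, sd=SD):
    """Convert a raw score x to a z score, rounded to hundredths like the table."""
    return round((x - mu) / sd, 2)


z = z_score(50)                     # the not so smart friend
area_left = STANDARD_NORMAL.cdf(z)  # area under the curve to the left of z
print(z, round(area_left, 4))
```

Rounded to four places, the computed area should agree with the table entry of 0.1423 found above.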
Okay, here we are back at our slides. And so look at that green part where it says, for areas to the left of a specified z value, which is what we’re doing with the not so smart friend, use the table entry directly. So here was our table entry: it was 0.1423. So we’re just going to use that number that we found, and we’re going to say the probability, then, is 14.23%. And that kind of makes logical sense, knowing the empirical rule. Now, I’m going to show you an example of why I was saying use it directly. In this next example, we’re going to look at the smart friend’s probability. In fact, we’re going to ask, what is the probability that students scored above the smart friend, and the smart friend sat at z equals 1.69. So I’m going to demonstrate now: for areas to the right of a specified z value, you either look them up in the table, then subtract the result from one, or you use the opposite z, which in this case would be negative 1.69. And you’ll get the same answer whether you do it the first way or the second way, but I’m going to demonstrate both, okay? So first, I’m going to demonstrate what happens when you look up the probability in the table for that z, and then you subtract that probability from one. So let’s go look up z equals 1.69. All right, here we are back at our Z table, only this time, we’re looking up a positive z. So we don’t want this first page, we want the second one. So remember, we’re looking up z equals 1.69. So we’re looking under here for 1.6, and that’s right here. And now we have to go over to the nine column. So that’s going to be 0.9545. So hold that thought, 0.9545. Okay, we’re back with our probability that we looked up in the Z table. Now remember, we were supposed to look it up in the table and subtract the result from one. So that’s what we’re going to do now. So we found 0.9545 in the table; we’re going to take one minus 0.9545.
And we get 0.0455, or 4.55%, this little tiny piece, which kind of makes sense, because it’s right at the top of the distribution: just a little piece of the blue, and the purple, and then the little black at the top. All right, and so what you want to imagine is that 0.9545, which is 95.45%, is the whole piece below z equals 1.69. That’s most of the blue, the green, the yellow, the orange, the red, and the little black at the bottom; that’s all in the 0.9545. Okay, so again, we were looking for the area to the right of the specified z value, and I showed you the first way of doing it. There’s another way of doing it, and that’s where you just use the opposite z from the get-go. So we’re going to now use the opposite z; we’re going to look up negative 1.69. All right, here we are back at the Z table, only this time we’re looking at negative 1.69. So negative 1.6 is the first thing we need to find in this column. So here we are, negative 1.6. And then we know nine is the last column; I’m learning that. So we’ll go over here. And that looks familiar, right? 0.0455. Okay, hold that thought. All right, we’re back. And so as you know, if you look it up in the table directly, like the 1.69 directly, and you take that probability and you subtract it from one, which is what we did last time, you get the same answer we got now, right? 0.0455, or 4.55%. So it is kind of more efficient to just use the opposite z if you’re looking for areas to the right of the specified z value. But I always say, when you’re done looking it up, compare it to the picture. And I always say draw a picture, too. You know, I don’t mind if you have normal curves drawn all over your homework, or all over the wall, I guess, or maybe a whiteboard, that’s probably more efficient. But it’s best to draw it out, label on there where your z and your x are, and then just look at it. Because we know that the little piece above z equals 1.69 is not 95% of that curve.
It’s just not; that’s over 50%, and we can tell that little tiny piece is under 50%. So if you accidentally do it the first way and forget to subtract from one, you know, maybe if you check it against your normal curve drawing, you’ll realize, oh, I made a mistake. So even though there’s two different ways to find the probability if it’s to the right of the z value, just try to make sure, no matter which way you use, that you finally do a reality check against the drawing you make, just to make sure you got the right piece. Because there’s only two pieces: there’s a big piece and a little piece of the curve, and we got 4.55%. We know that’s a little piece, and we know from our drawing that we were looking for the little piece. So that’s how you do your reality check. Okay, you thought that there weren’t any harder questions? Well, here are some harder questions. So this is a little bit more on probabilities in the Z table. So here’s another question we haven’t handled yet: what if you were looking at a probability between two scores, such as the probability that students will score between 50 and 90? So it’s somewhere in the middle, okay. Note that in that case, when you have a between one, you actually have two x’s, and we’ll label them x one and x two. So the not so smart friend is going to be x one, and the smarter friend is going to be x two, just to keep these x’s straight. Okay. So the next step is you’re going to calculate z one and z two. And I’m kind of cheating, because we already did these: we already knew the z one for the not so smart friend was negative 1.07, and we already knew the z two for the smarter friend was 1.69. So I just put them on the diagram. Okay, and then here’s the beginning of the strategy. I’ll just explain the strategy, and then I’ll do the strategy. So for z one, you find the probability to the left of the z, so you find the little piece to the left. And remember, you can take the direct probability from the Z table.
So that’s what direct means: you just get to copy it directly out of this table. Then for z two, you find the probability to the right of, or above, z. So you find the little piece there, and you use one of those two methods I showed you, which we did together. And then finally, imagine the whole curve: you’re subtracting the piece at the bottom, the z one probability, and you’re subtracting the piece at the top. So you’re trimming off those two pieces to get the between probability. So that’s the strategy: basically, you find the size, the probability, of each of the little pieces on the sides, you subtract both of those from one, and that traps whatever’s left in the middle. So I’ll demonstrate this. So remember, for z one, the probability to the left of z one was 0.1423. We did that together. And then we used both of those methods, and they got the same answer, to find the probability to the right of z two, which was 0.0455. Okay, so that’s the little piece at the top, and then we’ve got the little piece at the bottom. And now we’ll take one minus the piece at the bottom minus the piece at the top, and the total is 0.8122, or 81.22%, which kind of makes sense; that’s a big piece in the middle. So it wouldn’t be surprising if it was about 80% of the curve. So this is how you do a between one. Here’s another question I haven’t really handled: what if you’re looking at a probability of more than 50%, such as the probability that students will score greater than 50? Right? Like the big side? Okay. Well, actually, you just do what you normally would do. You say, for areas to the right of the specified z value, either look it up in the table and subtract the result from one, or use the opposite z, which in this case would be 1.07. So if we did method one, we’d end up going one minus 0.1423, which we already looked up, and we get 0.8577. If we use method two, we’d take the z of 1.07: not negative 1.07, but 1.07.
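The between strategy and the greater-than-50 question can be checked the same way. This is a sketch, not the lecture’s method: it assumes Python 3.8 or later and uses `statistics.NormalDist` in place of the printed table, and the helper names are made up for the illustration. The z values are rounded to hundredths so that the results line up with the table.

```python
from statistics import NormalDist

MU, SD = 65.5, 14.5             # class mean and standard deviation from the lecture
STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def area_left_of(x, mu=MU, sd=SD):
    """Area to the left of x, using z rounded to hundredths like the Z table."""
    z = round((x - mu) / sd, 2)
    return STANDARD_NORMAL.cdf(z)


def prob_between(x1, x2):
    """One minus the lower tail (left of x1) minus the upper tail (right of x2)."""
    lower_tail = area_left_of(x1)
    upper_tail = 1 - area_left_of(x2)
    return 1 - lower_tail - upper_tail


def prob_above(x):
    """Area to the right of x: one minus the area to the left."""
    return 1 - area_left_of(x)


print(round(prob_between(50, 90), 4))  # the between question
print(round(prob_above(50), 4))        # the greater-than-50 question
```

Rounded to four places, the two printed values should match the 0.8122 and 0.8577 worked out from the table.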
And we could go look it up in the Z table, and we get 0.8577 again, 85.77%. So this isn’t actually a harder question; I just wanted to show you how it works when you’re getting a bigger piece, a bigger than 50% piece, of the distribution. And here’s another sort of similar example, where we’re looking at the probability that students will score less than 90, okay? So that’s easy, right? For the areas to the left of the specified z value, just use the table directly. So when we went and looked up z equals 1.69, we got 0.9545. So that’s the answer: 95.45% of the curve is below z equals 1.69, or below x equals 90. So as I mentioned before, but I’ll just mention again, you’re supposed to treat all probabilities to the left of z equals negative 3.49 as P equals zero. So I showed you what negative 3.49 looks like in the Z table; it’s like 0.0002. Well, there’s not much smaller than that. So if you actually calculate z and you get like negative four, just say the P is zero, okay? Then the second thing is, for any z to the right of 3.49, treat the area below it as P equals one, or 100%. So as you can imagine, you know, 3.49, that’s at the top of the curve. So if you calculate a z and you get like a five, you can just assume that’s 100%, right, or one. Okay, so we’ve gone through how to calculate z, and we’ve talked about looking up probabilities in the Z table, and we’ve even talked about manipulating those probabilities to get certain probabilities. But we haven’t talked about calculating x when z is given. So sometimes you’re actually given a z, and you have to calculate the x back from the z. In fact, sometimes it’s even harder: sometimes you’re given a probability, and the probability is not as easy.
But you can use the probability. Remember those little percents in the middle of the table? You can go find it in the middle of the table and look up the z that keys to it, and then put it into this equation. And so I’m going to just give you examples of some questions that you might see, like on homework or on a test, probably not in real, real life, where you need to calculate x, and you need to use that formula in the red circle. So let’s say I was just bored, and I was wondering, what is the test score on this distribution that is at z equals 1.5? Okay, so see where z equals 1.5? We never asked that question before. So let’s say I just, out of curiosity, wanted to know what the test score would be of a student who was at z equals 1.5. So what I would do is I would take 1.5 times 14.5, because that’s what the formula says: it’s z times the standard deviation. And I do that first, because of order of operations. And then after doing that, I’d add the mu, which is 65.5, and I get 87.3. So the student who got 87.3, that student got a score that’s at z equals 1.5. Now, as you can probably imagine, people don’t go around asking so much, well, I wonder what that person’s score is at z equals negative 2.3, or whatever. They don’t usually phrase it like that. Usually, you see more a question like this: what is the score that marks the top 7% of scores? And that’s a secret way of saying, we are looking for the z at P equals 0.0700. So it’s like we turn that 7% backwards into a probability, and we say, we’re actually looking for the z at P equals 0.0700. So how do you do that? Well, I’m going to show you. Okay, so we’re on the hunt for probability 0.0700. Okay, so let’s start at the top of the table here. You’ll see we’re digging around in the middle of the table, right? And you’ll see like 0.00-something; that’s nowhere near the ballpark, because we’re looking for 0.0700. So let’s scroll up here.
Or scroll down, actually. So now we’re in the 0.04 neighborhood. Here’s 0.06. Okay, we’re getting close. Well, here we have 0.0708, and that’s 0.0008 more than we want it to be. Well, here next door, we have 0.0694, and that’s only 0.0006 less than we want it to be, right? Because if it had 0.0006 more, it would be 0.0700. So this is technically closer than this one, because this one is 0.0008 off, and this one is only off by 0.0006. So we’re going to choose 0.0694 as the probability of record for the top 7%. Only, we’re not going to just choose this; we’re going to figure out what z is at that probability. So what are we going to do? We’re going to map back here, negative 1.4. And then we’ve got to go all the way up, which we can guess is eight. So it’s negative 1.48. So hold that thought. Okay, we started out looking for the z at P equals 0.0700, but the closest we got was 0.0694, and that mapped to z equals negative 1.48. Now, what I want you to notice is negative 1.48 is actually on the left side of mu. Okay, so that is the z score at the bottom 7% of the scores. So we’re going to use the positive version of that z, since we want the top 7%; we’re going to use 1.48, the opposite z. And now we’re going to plug it into the equation. So 1.48 times 14.5, which is the standard deviation, plus 65.5 equals 87. So 87 is the score that marks the top 7% of the scores. I’m going to do another exercise for you, this time the bottom 3% of the scores, because this is often kind of challenging for students. So I’ll just give you a second demonstration. So as you can imagine, we’re going on the hunt now for z at P equals 0.0300. So let’s go over to the Z table. All right, now we’re getting a little good at this, right? So we’re digging around in the middle, and we’re looking for 0.0300. Okay, and starting at the top, we’re in the 0.00 department. Oh, here’s 0.01-something, 0.02.
Okay, we’re getting close to the 0.0300. So, 0.0301. Could you ask for anything closer? Totally perfect. Okay, so that’s what we’re going to use for our z: the z at 0.0301. So let’s look up that z. That z is negative 1.8, and then we look up, and it’s eight, so it’s negative 1.88. Hold that thought. All right. Well, we were on the hunt for P equals 0.0300, and we didn’t find that. But we did find P equals 0.0301 in the table, and that mapped back to z equals negative 1.88. Right. And now we go back to the question. We see that we want the bottom 3%, so we keep the negative. Now, if I’d asked about the top 3%, we’d lose the negative and use 1.88 in the equation. But since we want the bottom 3%, we’re going to keep the negative. Okay, so now let’s do the equation. So x equals, and then in the parentheses, negative 1.88 times 14.5, which is our standard deviation, then plus our mu, which is 65.5. And the score we get is 38.2. So 38.2 is the score that marks the bottom 3% of scores, and just be happy your score is not in there. Okay, now, here’s another challenging, hard question. What if the question on the test (again, probably not in real life, but on a test) says, what scores mark the middle 20% of the data? And so I put little arrows on there just to point out, well, when they say middle, they mean it’s hugging the mu. It’s actually assuming that there’s going to be 10% on the right side of the mu and 10% on the left side of the mu. And so how you start to do this is you figure out the z score for one minus 0.2, which is the 20%, divided by two, which equals 0.4, right? Because one minus 0.2 is 0.8, and 0.8 divided by two is 0.4. So we get this 0.4. That’s the area to the left of the lower limit. So we go find the z score at 0.4, which you’re good at now, using the Z table.
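Going from a percent back to a score, as in the top 7% and bottom 3% examples, can be sketched as below. Again this is an illustration under assumptions, not the lecture’s procedure: it uses `statistics.NormalDist.inv_cdf` (Python 3.8 or later) in place of digging through the middle of the table, the function names are made up, and z is rounded to hundredths to mimic the table. Looking up 1 minus 0.07 directly gives the same z as finding 0.07 and flipping the sign.

```python
from statistics import NormalDist

MU, SD = 65.5, 14.5             # class mean and standard deviation from the lecture
STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_for_area_left(p):
    """z whose area to the left is p, rounded to hundredths like the Z table."""
    return round(STANDARD_NORMAL.inv_cdf(p), 2)


def x_from_z(z, mu=MU, sd=SD):
    """The x formula from the lecture: z times the standard deviation, plus mu."""
    return z * sd + mu


top_7 = x_from_z(z_for_area_left(1 - 0.07))  # top 7% means 93% lies to the left
bottom_3 = x_from_z(z_for_area_left(0.03))   # bottom 3% means 3% lies to the left
print(round(top_7, 1), round(bottom_3, 1))
```

Rounded to one decimal, the two results should agree with the 87 and 38.2 worked out by hand above.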
So, you know, I looked around, and I found 0.4013 digging around in the middle of the Z table, and that mapped back to z equals negative 0.25, right? And so that is then what I would put for the lower limit on that one, and then z equals 0.25, the positive version, goes on the other side. So once you’ve figured out both of the z’s, the z on the left and the z on the right, you just have to put them through the equation. So for the left side, we use the negative z, and for the right side, we use the positive z. And that’s how we get our limits. So what scores mark the middle 20% of the data? 61.9 and 69.1. Isn’t it weird how that worked out? But anyway, 61.9 and 69.1 mark the middle 20% of the data. I totally didn’t do that on purpose; it just worked out that way. All right, I can’t believe you made it through all this. I’ll bet your brain is ready to explode. So now is a good time for just a little review, just to help me come down a little bit from this whole really intense lecture. Okay. So first, I’m going to do a little z score quiz, game show style stuff here, right? So if you ever get the question when you’re on the test, and you’re like, oh my gosh, where is x? Where’s x? Well, if you can’t find x, it’s usually in the question. So usually, the way these questions go is somebody, like maybe me, will put a mu and a standard deviation at the top of the question. And then there’ll be, like, maybe five questions that pertain to that mu and that standard deviation, but they ask about different x’s. And when I would teach this class in person, you know, people would come running up to me in the middle of a test, which you probably shouldn’t do, and they would say, where’s the x? Where’s the x? You gave me, you know, these pieces of the equation, but I can’t find the x. And I’d be like, look in the question.
Look in the question, you know, because I don’t want to give it away. And then they’d all run back to their seats and find it. So if you’re panicking and wondering, where’s x? Look in the question; it’s usually in the question. Okay, so let’s say you find an x. What do you do with an x? Okay, you’re stuck with an x, what do you do? Well, usually, what you have to do is calculate a z score. So remember, if you’ve got an x, you probably have a mu and a standard deviation, and you can calculate a z score from that. So if you’re panicking on a test, and you have an x, a mu, and a standard deviation, just for fun, calculate a z score and see if it gets you anywhere. Okay, well, let’s say you have a z score. What do you do with a z score? Well, you always look it up, right? I mean, if you’re going in this direction, if you started with an x and you got a z, you’ve got to go to the Z table with it. Okay, so that’s your next step. So if you’re doing all this work, you calculate a z score, and then you’re done, and you’re like, oh my gosh, what’s my next step? Go look at the Z table. Well, what if the question asks for an x, right? Well, remember, we have a whole formula for that. So use the x formula. So if there’s no x anywhere, and it’s asking for an x, then use the other formula, use the x formula. And what if the question gives you a P? I just said P for probability, but it could be a percentage, like, remember, the top 7% and the bottom 3%? Well, if they give you a percent, just start digging around in the middle of the Z table, looking for that percent. Because once you start digging around, you realize that maps back to a z. And then you can get into the groove of using the x formula, and you’ll probably get yourself out of this panic. So here are some final tips and tricks for getting z scores and probabilities right. And I’ve said this one before: draw a picture.
And what do I mean by that? Graph out the question. Draw the curve, draw the line for mu, which goes in the middle, and mark where the x goes, above or below the mu. Just start with that. It doesn't have to be to scale, but mainly, you want to get those elements in there. If there's one x, shade the part of the curve wanted, either above the x or below the x. Just color it in, so that you get an idea of whether you want the big part, the one that's greater than 50%, or the little part, the one that's less than 50%. If there are two x's, then shade in the area wanted, which is usually in between them. If it's a calculate-the-x question, mark where the z or the p is. So if it was the top 7%, you could shade in the top little part of the curve. If it was the bottom 3%, you could shade in the bottom little part of the curve. So make this picture, and do it at the beginning. Okay, then, note that x is usually in the question. If you can't find x, and you're trying to do the z formula, and you're saying, "Okay, I'm trying to make a z-score, that's what it asks for," or "I'm trying to find a probability, that's what it asks for," look in the question, and you'll probably find the x in there. A big problem that I see is people mistaking little z's for p's. Now, obviously, if you've got a z that's negative, well, a p can't be negative. A probability can't be negative. So you won't make that mistake, even if it's like negative 0.25, right? And if the z is bigger than one, you won't make that mistake either. If you see z equals 2.5, you're like, obviously, that's not a probability. But when you have a little baby z-score that's between zero and one, like 0.023, it looks a lot like a p, but it's still a z. A lot of times people get a little lazy, like they hate using the Z table, and then they calculate the z-score, and it's really little, so they don't look it up. Don't be fooled. You still have to look it up.
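To make the "baby z" warning concrete, here is a minimal sketch that looks up the lecture's example value of 0.023. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for the printed Z table; the variable names are illustrative only and are not from the lecture.

```python
from statistics import NormalDist

# Standard normal curve (mean 0, standard deviation 1), standing in for the Z table.
Z = NormalDist()

z = 0.023  # a "baby" z-score that looks like a probability but is not one

p_below = Z.cdf(z)      # area to the left of z, which is what the Z table lists
p_above = 1 - p_below   # area to the right of z

print(f"z = {z}: P(below) = {p_below:.4f}, P(above) = {p_above:.4f}")
```

The area to the left comes out a little above one half, nowhere near 0.023, which is why the small z still has to be looked up.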
So if you're calculating z, and you get a little baby z like that, it still is a z. Still go look it up. Okay. Then finally, remember how step one was draw a picture, and I went on and on about that? Step 99, or the last step before you're done with the question, is check your logic against that picture. If you shaded a big part of your picture, your probability should be bigger than 0.5, or 50%. If you shaded a little tiny part of your picture, and you're getting like 0.95 something, you know that that's wrong. So please check your logic against the picture before you say that you're done with your question. Okay. So you made it through this long lecture about z and about probabilities. I gave you an introduction to the standard normal curve and to those two z-score formulas. I showed you how to calculate z-scores and how to look up probabilities. And I also showed you at the end how to calculate x if given a z-score or a probability. And all I want to say is, unfortunately, those pretend students on that distribution, none of them got 100%. Okay? That's not the case in our class. A lot of times people get 100% on the quizzes. That's why I can't use your grades as examples. Okay, so good luck on the quiz.
Well, hello, it's time for statistics. It's Monica Wahi, your Labouré College lecturer, back with chapters 7.4 and 7.5, sampling distributions and the central limit theorem. At the end of this lecture, you should be able to state the statistical notation for parameters and statistics for two measures of variation, name one type of inference and describe it, explain the difference between a frequency distribution and a sampling distribution, describe the central limit theorem in either words or formulas, and also describe how to calculate the standard error. So, here's your introduction to this lecture. And as you can see, I mushed 7.4 and 7.5 together. Again, they felt like a natural fit.
First, we're going to do a review, and maybe an overview, of parameters, statistics, and also inferences. We're going to just talk about those ideas, because that will sort of ease us into the next part, which is where we start talking about the sampling distribution, which is the new concept here. Then we'll go on to talk about the central limit theorem. And finally, I'll do a little demonstration of how to find probabilities regarding x bar. So if you're not really sure what that means, don't worry, you should be able to understand it by the end of this lecture. All right, here's the first part: parameters, statistics, and inferences. This is the review and overview I promised you. If you remember from a long time ago, a statistic is a numerical measure describing a sample, and a parameter is a numerical measure describing a population. Remember, s-s, sample statistic; p-p, population parameter. You probably remember that. Okay, so we have different ways of notating these. If you look under measure, you see mean, right? If it's a statistic, it's x bar, and I write "x bar" on the slides sometimes because it's hard to make that little line always be positioned above the x. So I'm just lazy and say x bar. And then under parameter, it's that mu symbol. It's pronounced "mu," but it looks like that thing on the slide. All right, the next two are variance and standard deviation. Remember how they're friends? So the statistic version is s. For variance, it's the s with the little two up there, the exponent, because, you know, standard deviation to the second is variance, and the square root of variance is standard deviation. That's why they have s, and then s to the second, for the statistic. For the parameter, it's that lowercase sigma symbol.
And it's that to the second when it's variance, and it's just without the exponent when it's the regular parameter of standard deviation, right? And you're used to seeing these on the slides. This is just review. Also mentioned in the book is proportion: the statistic is p hat, and the parameter is p. But I don't really go into that. I just wanted to do a little shout-out to it. Okay, let's think about the word inference. Like infer: if somebody implies something, maybe you'll infer it. Like, he implied it would be hard if I came over late that night, so I inferred that I shouldn't come over late then. So here, you may have heard the saying, where there's smoke, there's fire. And you see on the slide, there's a lot of smoke. Is there fire, though? Is that smoke coming from fire? Because if you look at it, it probably could be coming from fire. But there's sort of this outside chance it's not what we think it is. Like, if you've ever used a fire extinguisher, they make all this foam come out. Maybe it's that. Or if you've ever had dry ice, that makes a bunch of smoke. Maybe it's not fire, right? So where there's smoke, there's fire: that's an inference. Well, let's see if it's actually fire. We thought it was likely to be fire, but we weren't sure. And so inference is something that you do in statistics, where you use probability to make these inferences, because you can't see the fire. You can just see the smoke, and you're not sure, right? So there are three different kinds I'm going to talk about. The first kind is estimation, where we estimate the value of a parameter using a sample. So the sample is kind of like the smoke, and the parameter is the fire we can't see. So we estimate. Okay, and we're going to talk about that more in chapter eight.
A second type of inference we do is testing, where we do a test to help us make a decision about a population parameter. In other words, we don't know it, but we want to make a decision about it. So we do a statistical test. We're not going to get into that; that's in chapter nine. Finally, there's regression, where we make predictions or forecasts about a statistic. That's a third kind of inference, and we actually already did this in chapter 4.2. The reason why I bring up all of this is estimation, which is going to be in chapter eight, and testing, which is in chapter nine. We're not going over chapter nine in this class, but if we were, you'd have to know this, because in this lecture, I'm going to talk about sampling distributions and the central limit theorem. And you need to grasp those things in order to do the two things on the slide with the box around them, estimation and testing. So that's why I'm bringing this up now. Okay, so now we're going to move on to talking about the sampling distribution, and how it's different from a frequency distribution. All right, so let's just remind ourselves what a frequency distribution actually is. Remember, from a long time ago, what you would have is a quantitative variable, and you'd make a frequency table. And then you'd use that to graph the histogram, right? Here, I made an example down there of a frequency histogram that shows a normal distribution. So that's what you would do. You know, step two would be draw it.
And then you'd see the shape and figure out what the distribution was of that quantitative variable, or that x. Because each one of these is an x. Like the middle one, it's almost 30 x's that are in that frequency. Okay, now we're going to talk about the sampling distribution. It's a little more complicated. In a sampling distribution, you start out with a population. That's the first thing: you're dealing with a population. Then you pick an n of a certain size, like you pick the number that you're going to have your sample size be. Then you take as many samples of that size as possible from the population. And then you make an x bar from each of the samples. So there's a ton of samples, right? I'll show you a little demonstration, so you can really wrap your mind around how many different samples there can be. But each one is going to have an x bar. And then you make a histogram of all those x bars. So like I said, I'm going to show you what I'm talking about. We're going to imagine this is a population of people, and we're going to talk about BMI, or body mass index, just so you can wrap your mind around this. You start with this population, and let's decide on an n. How about five? Five is good, right? So now the deal is, I'm trying to take as many samples of n as possible from all of these people on the slide. Here's our first sample, and we got an x bar for BMI of 23 from these five people. Well, let's try these five people. Now, look, we double-dipped with that first one, okay, but we get this x bar of 21. And we can keep going. Actually, there's going to be a ton of these, a ton of different ones. But it's finite. I mean, at the end of the day, there's only so many groups of five I can get out of this population on the slide, and each group of five is going to have its own x bar. So I could write down every single one of those x bars I get for every single group of five I can make out of this.
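The procedure just described (pick an n, take every possible group, record each x bar) can be sketched in a few lines. This is only an illustration under stated assumptions: the eight BMI values are made up, since the people on the slide are not listed in the transcript, and the population is kept tiny so that every group of five can actually be enumerated.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import mean

# Hypothetical BMI values for a tiny population of 8 people (made up for illustration).
population = [19, 21, 22, 23, 24, 26, 28, 31]
n = 5  # sample size chosen in the lecture's demonstration

# Every distinct group of n people, drawn without replacement.
samples = list(combinations(population, n))
x_bars = [mean(sample) for sample in samples]

# Frequency table of the x bars, rounded to the nearest whole BMI unit.
freq = Counter(round(x_bar) for x_bar in x_bars)

print(f"{len(samples)} possible samples of size {n} "
      f"(math.comb agrees: {comb(len(population), n)})")
for value in sorted(freq):
    print(f"x bar ~ {value}: {'#' * freq[value]}")
```

Even eight people give dozens of groups of five, and the count grows very quickly: a population of 100 has 75,287,520 distinct groups of five, which is why the frequencies on the slide are so large.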
And then I can make a histogram of all the x bars. And, of course, I'd start with a frequency table. But look at the frequencies, they're huge. That's because you can get just a ton of samples out of one population. And so what you'll see is, if you make a histogram out of that, it looks normally distributed. It's just that the frequencies are really high, because there's a whole bunch of different samples you can take. And remember, this is a frequency histogram of x bars. Each one of these frequencies counts x bars that you got out of a group of five you could take. And so that's what the sampling distribution is. It ends up looking like a histogram, but it's a histogram of all the possible x bars you could get from all the possible samples of whatever n size you picked from the population that you have. So the fancy way, the official statistical way of saying it, is: a sampling distribution is a probability distribution of a sample statistic, in this case x bar, based on all possible simple random samples of the same size from the same population. That's what makes it a sampling distribution and not a frequency distribution. So you're probably like, okay, great, that's wonderful, you just explained that. But in the next section, we're going to talk about the central limit theorem. Here comes a theorem, right? And there's a proof for the theorem. And you need to understand this concept of the sampling distribution for inference in order to understand this proof, so I just had to go through this. Okay, now we're on to the central limit theorem, and how it's used for statistical inference.
So I'm going to start by explaining it in words, and you see that sampling distribution over there. So these are the words around the central limit theorem. It says, for any normal distribution, and remember, we're talking about a normal distribution here, the sampling distribution, meaning the distribution of the x bars from all possible samples, like we just talked about, is a normal distribution, meaning it's not skewed, it's not bimodal, whatever. It looks kind of like what is on the slide. Okay. And then two, this is important, the mean of the x bars is actually mu. I had a student who would say, "Oh, the x bar of the x bars is mu." And that's actually true. If you actually did the thing I described, which, don't try it at home, because you'll be up all night taking samples, okay? But if you did, if you actually got all samples of five from a population, and got all their x bars, and you made a mean of all those x bars, you'd get mu. And how you could check it is, of course, just taking a mean of the entire population. That would have been the easy way to do it. But no, if you do it this way, where you get every possible x bar for a particular sample size, and then you make an x bar of those x bars, you'll get mu. So that's, you know, it's a proof. That sounds like a thing that would be in a proof, right? Now, here's the next part. Three, the standard deviation of all those x bars is actually the population standard deviation divided by the square root of whatever n you picked. So in other words, if you have the whole population's data, and you just found the standard deviation, you just have the standard deviation. But if you did this thing with the x bar, where you took all those x bars, and you found the standard deviation of those x bars, that would equal the population standard deviation divided by the square root of whatever n you used to get all those x bars. Again, it sounds really proofy, in theory. But that's the third part of the central limit theorem in words.
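Parts two and three of the theorem can be checked by brute force on a toy population. The sketch below is not from the lecture: the six values and n = 2 are made up, and samples are drawn with replacement, which is the setting in which the standard deviation of the x bars equals the population standard deviation over the square root of n exactly. (If a small population is sampled without replacement, the spread of the x bars comes out somewhat smaller.) The normal-shape part of the theorem is not checked here.

```python
from itertools import product
from math import isclose, sqrt
from statistics import mean, pstdev

# Hypothetical population values (made up); mu and sigma are computed from all of them.
population = [19, 21, 23, 25, 27, 30]
n = 2  # kept tiny so every possible sample can be listed

mu = mean(population)
sigma = pstdev(population)  # population standard deviation (divides by N)

# All ordered samples of size n drawn WITH replacement (an assumption of this sketch).
x_bars = [mean(sample) for sample in product(population, repeat=n)]

print("mean of x bars:", mean(x_bars), "vs mu:", mu)
print("sd of x bars:  ", pstdev(x_bars), "vs sigma/sqrt(n):", sigma / sqrt(n))
print("match:", isclose(mean(x_bars), mu), isclose(pstdev(x_bars), sigma / sqrt(n)))
```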
And so some people like to look at it from a formula standpoint. You'll see on the right side of the slide, in these little formulas, that n means the sample size. And remember, I picked five; you could pick a different one, right? And mu is the mean of the x distribution, meaning the population mean. And then that population standard deviation symbol, sigma, is the standard deviation of the x distribution, meaning the population standard deviation. So we look on the left. Now this is just a formula version of what I just said: the mu of all the x bars that you could get from a particular sample size in a particular population is going to equal the mean of the population. And the standard deviation of all those x bars is going to equal the population standard deviation divided by the square root of whatever n you picked. So now, I just want to point out the z thing. We've been doing this z thing, right, but we've been doing it with one x. Now, if you imagine grabbing a bunch of x's, in other words, a sample, this is the formula you're going to be using, which is x bar minus mu, over the standard deviation divided by the square root of n. And so that's kind of what we're moving into here: what happens if you get a sample and you're looking at x bar, not if you just grab one x and you're looking at that. So I wanted to point out, first of all, that this whole thing is only supposed to happen if your n is greater than 30. Otherwise, you shouldn't really be doing this. Then the second thing I wanted to point out is that this piece underneath, the lower part of the equation, is called the standard error. They named that piece. And part of the reason why I like that they named that piece separately is I usually make that piece before I even do the equation. So I just have that number sitting around, because, you know, there's a square root underneath this standard deviation, and that whole thing is underneath another thing, so it's hard to do all that dividing.
So I usually just make that standard error first, by taking the population standard deviation divided by the square root of n, and just have that number, and then later I use it in this z equation. So those are two things I wanted you to notice, so I brought them out on the slide. Okay, here's more on the central limit theorem. If the distribution of x is normal, then the distribution of x bar is also normal. So if we look at the top, that's an example of just an x distribution. And then if you go do that thing, where you take all those samples and get all those x bars, and then you make the histogram, you'll see the pink one down lower, the x bar distribution. This is just a pictorial example. But even if the distribution of x is not normal, as long as n is more than 30, the central limit theorem says that the x bar distribution is approximately normal. So remember a lot of that hospital data we've been looking at, like hospital beds in a state? Often you'll see a skewed distribution. But if you have more than 30 hospitals, then what you could do is pick an n bigger than 30, and take a bunch of samples and get a bunch of x bars. It's not just a bunch; get all of them, all of the possible ones. And then if you made that x bar distribution, even though the hospital beds would be skewed as just an x distribution, their x bar distribution would be approximately normal. And that's one other important piece of the central limit theorem. That's one important piece of that proof: all of those x bars that you get will end up on a normal distribution, even if your underlying distribution is not normal, so long as the n you're picking is greater than 30. And finally, you know, proofs build on each other, so that leads us to the concept that a sample statistic is considered unbiased. Just unbiased, right? It's not perfect, but it's unbiased.
It's unbiased if the mean of its sampling distribution equals the parameter being estimated. In other words, the fact that the x bar of the x bars is mu means that an x bar is going to be unbiased. It might not be mu, it might not be exactly the same as the population mean, but it will be unbiased. It's not a biased representative of mu. All right, now let's move on to finding probabilities regarding x bar. So for those of you who want to actually do something and apply something and stop thinking about theory, let's go. Okay, but let's remind ourselves, what are we doing? Well, what were we doing in chapters 7.1 through 7.3? We were looking at having a normally distributed x. So we had this population of quantitative values that were normally distributed, and we had a population mean, a mu, and the population standard deviation. And we kept doing these exercises where we were finding the probability of selecting a value from that population, an x from that population, above or below a certain value of x. So we were looking at the probabilities, and we'd look up the z-score in the Z table of probabilities. Basically, what we would be doing is converting an x to z, right? We used this formula here to convert x to z. So whenever we had an x, we could put it on the z distribution, and we could figure out the probability. So here's what's different. Now, you'll notice the first thing has not changed. We're still talking about normally distributed x's. We're still talking about a population where we have a mu and a population standard deviation. But now we're not just grabbing one x from that population; we're grabbing a sample. And because we're grabbing a sample, we have to pick an n. So the n can be different each time, right? So we're grabbing a sample of the population. Well, how do we boil that down to one number? We're taking the x bar, or the mean value, from that sample. And that's what we're doing.
The z-score uses that x bar instead of the x, because we're taking a sample. So when you see the formula below, you'll notice that the other one just had x in it, because we only had one; this one has x bar, because we have a sample. You'll also notice that downstairs, what we had before was the population standard deviation, but now we have the standard error. Remember I talked about that? The population standard deviation divided by the square root of n. That's where n comes in, because it's going to matter what n you have to make the z come out right. All right, so now that we're reminded of what we're doing, I'll just explain how to do it. Let's say you do have an n, and you have an x bar, like you grabbed your n and you got an x bar. You can convert that x bar to a z-score using this formula, where, of course, you have to be told the population mean and the population standard deviation, but then you'll have your x bar and you'll have your n. So you can do the whole equation, and then you'll get a z. And guess what you do? What do you do with a z? You look it up. So you look up the probability for the z-score in the Z table, like in chapters 7.2 and 7.3. Only, this is about x bar, basically. And then I thought what I would do is walk you through two examples. You're already kind of good at this, because this is not too different from 7.2 and 7.3. But I just want to walk you through it, because it is a little different when you have a sample versus just one x. Okay, so remember our poor chemistry class that I was in, when I got a 73? Well, remember, we were assuming it was a 100-student class. So there were 100 students in the class, N equals 100, capital N, right, because they're the population. And then if you look on the slide, you'll see the mu of their scores was pretty bad. It was 65.5 on a 100-point test, and the population standard deviation was 14.5. So this was the population of this 100-student class.
So I'm going to do some exercises here. We have to pick an n bigger than 30, so we're going to pick an n of 49. Now, I'm coming up with a little scenario here. To pass the class, students have to get at least 70, which is a C. So let's pretend this is the question: what is the probability of me selecting a sample of 49 students with an x bar greater than 70? Notice how we ask the question a little bit differently. What's the probability of me getting a set of 49 students such that their x bar is greater than 70? Doesn't that kind of remind you of the central limit theorem, where we had to go back and get, like, an n of five, and we got different groups of five? What's the probability of me getting one of those samples that has an x bar greater than 70? That's the question. And I drew this out here. Remember our old z distribution, along with our x distribution? I kind of drew where 70 is. But I wanted to point out for you, the probability for an x bar is going to be smaller than for x, because you're going to have to do a lot of work to get that x bar to be above 70. So here we go. I'm just going to remind you that the equation at the top and the equation at the bottom are the same equation. I'm just using the term SE for the standard error. And I like to calculate that separately, like I told you, so I like to do that first. How do we do that? Well, the n was 49, and the population standard deviation is 14.5. So 14.5 divided by the square root of 49, which is 7, is where we get this number, the standard error of 2.1 (2.07 before rounding). So now, let's calculate the z. z is our x bar, which is 70, minus 65.5, which is our mu, divided by our pre-cooked standard error, which is 2.1, or 2.07 unrounded. And we get a z of 2.17. So we're tempted to look that up. But let's look at our picture. Here's our z distribution, and what we're going for is this little piece at the top, right above 2.17.
So that's a little piece. So we've got to look for that, right? Let's go look. Because we're going for the piece at the top, we're going to use the opposite z. Remember, there are two ways of doing this, but everybody seems to prefer the way where you use the opposite z if you're looking for something to the right. So we're going to use negative 2.17 to get the little piece. Because when you look that up, and I'm not going to demonstrate, you guys are good at this now, you get p equals 0.0150. If you were to look up 2.17, then you'd get the big piece. So that's why we do this. And so then the answer is, remember, the question was, what is the probability of me selecting a sample, or a set, of 49 students with an x bar that's greater than 70? And remember how this test really sucked. I mean, that mu was 65.5, so it was pretty hard to get a high score. So the probability was pretty low: 0.0150, or if you do the percent version, 1.5%. Okay, now we're going to try a different one. That one was asking, what is the probability of me selecting a sample with an x bar greater than a certain number? Now we're going to talk about the probability of selecting a sample with the x bar between two numbers. So again, we're back with our poor class of students with this terrible chemistry test. This time I decided to choose an n of 36. You'll notice that I always choose perfect squares for n's, because you have to take the square root, and I'm just lazy. So okay, here's our question: what is the probability of me selecting a sample of 36 students with an x bar between 60 and 65? And I drew this picture up here to remind you that that's going to be on the left side of mu. You know, we're going to be dealing with negative z's, right? And so we have to remember when we would have two x's, back in 7.2 and 7.3. Well, this is now a situation where we have two x bars, so you've just got to name them x bar one and x bar two.
And, again, I show you this demonstration, you know, with these red arrows, but the probability for x bar will be smaller than for x, because it's harder to get a whole group of people together to give you an x bar in between a certain place. All right, so this is not new. These are the same formulas I showed you before. I just want to emphasize that making your standard error first can really help you as you move along through these problems. It just makes it a little easier to calculate, especially in this case, where we're going to use the standard error twice. So again, this would look exactly like the last standard error, but it's different because our n is different. This time, our standard error comes out as 2.4. And what I just want to remind you is that the more n you get, the bigger that square root of n gets. I mean, n gets bigger, the square root of n gets bigger, and then the smaller the standard error gets. So you can make the standard error really small if you just get a lot of n, right? So here's z one and z two. I put them both up there. But we can just walk through this. x bar one is 60, and x bar two is 65, because it's between 60 and 65. So you see what's going on in the slide. And like I told you, both of these x bars are below the mu, so they're both negative z's. So we've got our negative z's, and now we have to just remind ourselves, well, what are we doing? You see, z one is at negative 2.28. So that's a little piece at the bottom we're going to want to trim off. And then there's the big piece at the top for z two, which starts at negative 0.21. So just remember, the picture is really helpful. Now we're going to go deal with the probabilities. For z one, we're looking at something to the left, so we just leave the z alone and go look it up. And that's p equals 0.0113.
For z two, we've got to flip the sign, because we have to use the opposite z, because we're going for the right. So that was probability two, and we can check that, see, because we can see that's more than 50% of that shape. So it's 0.5832. Okay, so we've got our probabilities now. And just like last time, we've got to take one minus both of those pieces, and then we get the probability in the middle. That's the probability of drawing a sample of 36 students with an x bar between 60 and 65. And just to translate that to the answer, the probability is 0.4055. Or if you rounded it, you know, if you like percents, you could say 41%. So in conclusion, we reviewed the parameters and the statistics and those notations. We talked about inferences and what we're doing with inference. Next, we talked about what a sampling distribution is, and how that's different from a frequency distribution, so you can tell what's going on with that. Then I presented to you the central limit theorem, which may have been kind of confusing, because, you know, theorems always are. They're always about different principles and about different things equaling each other. But because of the central limit theorem, we then have permission to do the operations we're doing after that, which is finding probabilities regarding x bar. The central limit theorem says that, you know, this is how the world works. So you get to use the standard error, and you get to do these kinds of calculations. So now, in addition to finding probabilities regarding x, you can find probabilities regarding x bar. Don't you feel smart?
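As a check on the two worked examples, here is a minimal sketch that reproduces them. It assumes Python's standard-library `statistics.NormalDist` in place of the printed Z table; the helper names `standard_error` and `z_for_x_bar` are made up for illustration. The z-scores are rounded to two decimals to mimic a table lookup, and the "between" probability is computed as the difference of two left-tail areas, which is equivalent to subtracting both outer pieces from one.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 65.5, 14.5  # population mean and standard deviation from the lecture
Z = NormalDist()        # standard normal curve, standing in for the Z table


def standard_error(sigma: float, n: int) -> float:
    """Population standard deviation divided by the square root of n."""
    return sigma / sqrt(n)


def z_for_x_bar(x_bar: float, mu: float, sigma: float, n: int) -> float:
    """z-score of a sample mean, rounded to two decimals like a Z table lookup."""
    return round((x_bar - mu) / standard_error(sigma, n), 2)


# Example 1: P(x bar > 70) with n = 49
z = z_for_x_bar(70, MU, SIGMA, 49)
p_above_70 = 1 - Z.cdf(z)

# Example 2: P(60 < x bar < 65) with n = 36
z1 = z_for_x_bar(60, MU, SIGMA, 36)
z2 = z_for_x_bar(65, MU, SIGMA, 36)
p_between = Z.cdf(z2) - Z.cdf(z1)

print(f"n=49: SE={standard_error(SIGMA, 49):.2f}, z={z}, P={p_above_70:.4f}")
print(f"n=36: SE={standard_error(SIGMA, 36):.2f}, z1={z1}, z2={z2}, P={p_between:.4f}")
```

Computing the standard error in its own function mirrors the advice to make that piece first, and it also shows the standard error shrinking as n grows from 36 to 49.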

As found on YouTube
