Introduction to Probability and Statistics Using R

If instead it shows UseMethod("something"), then you will need to choose the class of the object to be inputted and next look at the method that will be dispatched to the object. A generic function may have, for example, one method for dendrogram objects and a default method for everything else.

Simply type the name of a method to see what it does. Some functions are hidden by a namespace (see An Introduction to R [85]) and are not visible on the first try. For example, if we try to look at the code for wilcox.test, we see only UseMethod("wilcox.test"), and the methods themselves are hidden in the namespace of the stats package. In cases like these we prefix the package name to the front of the function name with three colons, as in a command of the form stats:::name. If it shows .Internal(something) or .Primitive("something"), then it will be necessary to download the source code of R (as opposed to a binary version) to read the underlying code.
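The lookup steps above can be sketched in a short session. The generic rev is used here only as an illustration (it is an assumption that it suits the discussion; it happens to have a default method and a dendrogram method), and getAnywhere is an extra helper not mentioned in the text.

```r
# The generic only shows the dispatch call, not the real work.
rev                       # body is UseMethod("rev")

# List the methods that the generic can dispatch to.
methods(rev)

# A visible method can be printed by typing its name ...
rev.default

# ... while a method hidden in a namespace needs the ::: prefix.
stats:::rev.dendrogram

# getAnywhere() finds an object without knowing which package hides it.
getAnywhere("rev.dendrogram")

# Primitive functions have no R code to display.
exp                       # shows .Primitive("exp")
```

Typing the hidden method without the prefix would normally fail with an "object not found" error, which is the situation the three colons are meant to address.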

See Ligges [60] for more discussion on this. An example is exp, whose definition shows only .Primitive("exp"). Be warned that most of the .Internal functions are written in other computer languages which the beginner may not understand, at least initially.

Fortunately, R has extensive help resources and you should immediately become familiar with them. Begin by clicking Help on Rgui. Several options are available there. One of them looks up a function by its exact name: typing mean in the window is equivalent to typing help("mean") at the command line, or more simply, ?mean. Note that this method only works if the function of interest is contained in a package that is already loaded into the search path with library.

The HTML help is possibly the best help method for beginners. It can be started from the command line with the command help.start(). Another option searches the help pages. The advantage is that you do not need to know the exact name of the function; the disadvantage is that you cannot point-and-click the results.
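As a rough sketch, the same help routes can be tried from the command line. The search term "median" is an arbitrary choice, and the results are stored in variables here only so that they can be inspected; at the prompt one would usually just type the call.

```r
# Look up a help page by exact name; the function must live in a loaded package.
h <- help("mean")          # at the prompt, ?mean is shorthand for the same thing
class(h)

# Search the help pages when the exact function name is unknown.
hits <- help.search("median")
head(hits$matches)

# Browse the HTML help with point-and-click links (opens a browser window).
if (interactive()) help.start()
```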

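An equivalent way is to type ?? followed by the search term at the command line. Searching the mailing list archives can be very useful for finding other questions that other users have asked.

Help pages contain example code. The example function will run the code automatically, skipping the intermediate step of copying it by hand. For instance, we may try example(mean) to see a few examples of how the mean function works.

The R mailing lists are another resource. Particularly pay attention to the bottom of the mailing lists page, which lists several special interest groups (SIGs) related to R.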

Bear in mind that R is free software, which means that it was written by volunteers, and the people that frequent the mailing lists are also volunteers who are not paid by customer support fees.

Consequently, if you want to use the mailing lists for free advice then you must adhere to some basic etiquette, or else you may not get a reply, or even worse, you may receive a reply which is a bit less cordial than you are used to.

Below are a few considerations:

1. Read the FAQ. Note that there are different FAQs for different operating systems. You should read these now, even without a question at the moment, to learn a lot about the idiosyncrasies of R.
2. Search the archives. Even if your question is not a FAQ, there is a very high likelihood that your question has been asked before on the mailing list. If you want to know about topic foo, then you can do RSiteSearch("foo") to search the mailing list archives and the online help for it.
3. Do a Google search and an RSeek search.

If your question is not a FAQ, has not been asked on R-help before, and does not yield to a Google or alternative search, then, and only then, should you even consider writing to R-help.

Below are a few additional considerations; following them will save you a lot of trouble and pain. Readers of your message will take the text from your mail and copy-paste it into an R session, so the code you send should run as pasted. Questions are often related to a specific data set, and the best way to communicate the data is with a dump command. For instance, if your question involves data stored in a vector x, you can type dump("x", "") at the command prompt and copy-paste the output into the body of your email message.
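A minimal sketch of the dump step, assuming a small made-up vector in place of real data:

```r
# A small vector standing in for the data behind a question (values are made up).
x <- c(3.2, 1.5, 4.8, 1.5, 5.9)

# Passing "" as the file name writes the R code that recreates x to the console.
dump("x", "")
```

The printed text is itself R code, so a reader who pastes it into a session ends up with an identical copy of x.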

Sometimes the answer to the question is related to the operating system used, the attached packages, or the exact version of R being used. The sessionInfo command collects all of this information to be copy-pasted into an email, and the Posting Guide requests this information. See Appendix A for an example.

There are also many resources for R outside of the program itself. Below are a few of the important ones. CRAN holds loads of contributed information (books, tutorials, etc.), and there are mirrors all over the world with duplicate information. Other sites host development code which has not yet been released to CRAN.

If you find a trick of your own, log in and share it with the world.

R remembers what you have recently entered at the command line. More generally, the command history() will show a whole list of recently entered commands. The commands objects() and ls() list all available objects in the workspace. A long list of numbers that has already been typed somewhere else can be pasted into R in one step, and all of the numbers will automatically be entered into the vector x.

Commands that should run every time R starts may be saved in a site-wide profile file. Alternatively, you can make a profile file for an individual user or project directory. This allows for multiple configurations for different projects or users.

If the workspace is saved when R exits, the resulting file is automatically loaded the next time R starts, in which case R will say [previously saved workspace restored]. This is a valuable feature for experienced users of R, but I find that it causes more trouble than it saves with beginners.
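The workspace tips above might look like this in practice. It is an assumption that scan is the function behind the pasted-numbers tip; the text argument is used so that the sketch runs without anything on the clipboard, and the numbers are made up.

```r
# Read numbers separated by spaces or newlines into a vector.
# scan() normally reads from the console; text = makes the sketch self-contained.
x <- scan(text = "12 7 3\n15 8")
x

# List the objects currently in the workspace.
ls()

# Remove an object that is no longer needed.
rm(x)
```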

Once we see how to display data distributions, we next introduce the basic properties of data distributions. We qualitatively explore several data sets. Once we have intuitive properties of data sets, we next discuss how we may numerically measure and describe those properties with descriptive statistics.

In each subsection we look at some examples of the type in question and introduce methods to display them. Quantitative data invariably assume numerical values. Quantitative data can be further subdivided into two categories: discrete and continuous.

Examples of discrete data include: counts, number of arrivals, or number of successes. They are often represented by integers, say, 0, 1, 2, etc.

Continuous data are also known as scale data, interval data, or measurement data. Examples include: height, weight, length, time, etc. Continuous data are often characterized by fractions or decimals. Note that the distinction between discrete and continuous data is not always clear-cut. Sometimes it is convenient to treat data as if they were continuous, even though strictly speaking they are not continuous.

See the examples.

Example: Annual Precipitation in US Cities. The vector precip contains the average amount of rainfall (in inches) for each of 70 cities in the United States and Puerto Rico. These are quantitative continuous data.

Example: Lengths of Major North American Rivers. The U.S. Geological Survey recorded the lengths (in miles) of several rivers in North America. They are stored in the vector rivers in the datasets package, which ships with base R. Let us take a look at the data with the str function. These data are definitely quantitative, and it appears that the measurements have been rounded to the nearest mile. Thus, strictly speaking, these are discrete data. But we will find it convenient later to take data like these to be continuous for some of our statistical procedures.
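A quick look at the two data sets just described can be sketched as follows; the final check is an addition for illustration and is not part of the original discussion.

```r
# Both vectors ship with R in the datasets package, so no loading is needed.
str(precip)    # named numeric vector, one entry per city
str(rivers)    # numeric vector of river lengths

# Check whether every recorded river length is a whole number.
all(rivers == round(rivers))
```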

Example: Yearly Numbers of Important Discoveries. The entries of the discoveries data are integers, and since they represent counts this is a good example of discrete quantitative data. We will take a closer look in the following sections.

There are almost as many display types from which to choose as there are data sets to plot. We describe some of the more popular alternatives.

Strip charts (also known as dot plots). These can be used for discrete or continuous data, and usually look best when the data set is not too large. Along the horizontal axis is a numerical scale above which the data values are plotted. We can do it in R with a call to the stripchart function. There are three available methods.

overplot: plots ties on top of one another. This method is good to display only the distinct values assumed by the data set.
jitter: adds a small amount of random noise so that tied values are not covered up.
stack: plots repeated values stacked on top of one another. This method is best used for discrete data with a lot of ties; if there are no repeats then this method is identical to overplot.

See Figure 3. The leftmost graph is a strip chart of the precip data. The graph shows tightly clustered values in the middle with some others falling balanced on either side, with perhaps slightly more falling to the left.

Later we will call this a symmetric distribution, see Section 3. The middle graph is of the rivers data. There are several repeated values in the rivers data, and if we were to use the overplot method we would lose some of them in the display.

This plot shows what we will later call a right-skewed shape, with perhaps some extreme values on the far right of the display. The third graph is a strip chart of the discoveries data, which are literally a textbook example of a right-skewed distribution.

Histograms. These are typically used for continuous data. A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. The scale on the y axis can be frequency, percentage, or density (relative frequency).

The term histogram was coined by Karl Pearson; see [66]. We are going to take another look at the precip data that we investigated earlier with a strip chart. There are many ways to plot histograms in R, and one of the easiest is with the hist function. Please be careful regarding the biggest weakness of histograms: the graph obtained strongly depends on the bins chosen.

Choose another set of bins, and you will get a different histogram. Luckily for us there are algorithms to automatically choose bins that are likely to display well, and more often than not the default bins do a good job. This is not always the case, however, and a responsible statistician will investigate many bin choices to test the stability of the display. Watch what happens when we change the bins slightly with the breaks argument to hist.
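A sketch of the bin-sensitivity check described above. The value breaks = 20 is an arbitrary choice for illustration, not necessarily the setting behind the original figure.

```r
op <- par(mfrow = c(1, 3))

# Default bins, with frequencies on the y axis.
h_default <- hist(precip, main = "")

# Same bins, but with density on the y axis.
hist(precip, freq = FALSE, main = "")

# A different suggested number of bins; hist treats a single number as a hint.
h_fine <- hist(precip, breaks = 20, main = "")

par(op)

# Every observation lands in exactly one bin, whatever the binning.
sum(h_default$counts)
sum(h_fine$counts)
```

Trying several values of breaks in this way is one inexpensive method to see whether a feature such as a second hump is stable or an artifact of the binning.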

There are two humps: a big one in the middle and a smaller one to the left. Graphs like this often indicate some underlying group structure to the data; we could now investigate whether the cities for which rainfall was measured were similar in some way, with respect to geographic region, for example.

The rightmost graph in the figure uses a larger number of bins. If we were to continue increasing the number of bins we would eventually get all observed bins to have exactly one element, which is nothing more than a glorified strip chart.

Stemplots. More is said about these in Section 3. Stemplots have two basic parts: stems and leaves. The final digit of the data values is taken to be a leaf, and the leading digit(s) is (are) taken to be stems.

We draw a vertical line, and to the left of the line we list the stems. To the right of the line, we list the leaves beside their corresponding stem. There will typically be several leaves for each stem, in which case the leaves accumulate to the right. It is sometimes necessary to round the data values, especially for larger data sets.

UKDriverDeaths is a time series that contains the monthly totals of car drivers killed or seriously injured in Great Britain. Compulsory seat belt use was introduced during the period covered by the series. We construct a stem and leaf diagram of these data in R. Note that the data have been rounded to the tens place so that each datum gets only one leaf to the right of the dividing line.
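As a stand-in for the display discussed here, base R's stem function gives a comparable stem and leaf diagram. This is an assumption about tooling: the text may have used a different function, and base stem prints no depth column.

```r
# Base R stem-and-leaf display; scale stretches or shrinks the plot.
stem(UKDriverDeaths)

# A longer display with more stems for the same data.
stem(UKDriverDeaths, scale = 2)
```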

Notice that the depths have been suppressed. To learn more about this option and many others, see Section 3. Unlike a histogram, the original data values may be recovered from the stemplot display (modulo the rounding); that is, starting from the top and working down we can read off the data values one by one.

Index plots. These are done with the plot function. They are good for plotting data which are ordered, for example, when the data are measured over time. That is, the first observation was measured at time 1, the second at time 2, etc. It is a two dimensional plot, in which the index (or time) is the x variable and the measured value is the y variable.

Example: Level of Lake Huron. Brockwell and Davis [11] give annual measurements of the level (in feet) of Lake Huron. The data are stored in the time series LakeHuron.
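An index plot of the Lake Huron series can be sketched as below. The three plot types shown are illustrative choices and may not match the original figure.

```r
op <- par(mfrow = c(3, 1))
plot(LakeHuron)                 # default for a time series: connected line
plot(LakeHuron, type = "p")     # points only
plot(LakeHuron, type = "h")     # vertical spikes from the axis to each value
par(op)
```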

Please bear in mind that some data look to be quantitative but are not, because they do not represent numerical quantities and do not obey mathematical rules. Shoe size, for example, is written with numbers. Shoe size is not quantitative, however, because if we take a size 8 and combine with a size 9 we do not get a size 17. Data that serve merely to identify an observation do not usually play much of a role in statistics.

But other qualitative variables serve to subdivide the data set into categories; we call these factors. In the above examples, gender, race, political party, and socioeconomic status would be considered factors (shoe size would be another one). The possible values of a factor are called its levels. For instance, the factor gender would have two levels, namely, male and female.

Socioeconomic status typically has three levels: high, middle, and low. Factors may be of two types: nominal and ordinal. Nominal factors have levels that correspond to names of the categories, with no implied ordering.

Examples of nominal factors would be hair color, gender, race, or political party. In contrast, ordinal factors have some sort of ordered structure to the underlying factor levels. For instance, socioeconomic status would be an ordinal categorical variable because the levels correspond to ranks associated with income, education, and occupation.

Another example of ordinal categorical data would be class rank. Factors have special status in R. The state data sets that ship with R illustrate both kinds of qualitative data. Some of them merely identify each state; these would be ID data. Example: U.S. State Facts and Features. Another of the state data sets records a category for each state and is stored as a factor. To see all of the levels we printed the first five entries of the vector.
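A sketch of inspecting a factor, assuming state.region is the factor in question (the region names mentioned later in the text suggest it, but the name is missing from this copy) and state.abb as an example of ID data:

```r
# state.region records the region of each of the 50 states as a factor.
str(state.region)

# Printing a few entries also lists every level of the factor.
state.region[1:5]

# The levels can be extracted directly, too.
levels(state.region)

# state.abb holds postal abbreviations: identifiers, not categories.
str(state.abb)
```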

We may count frequencies with the table function or list proportions with the prop.table function.

Bar graphs. A bar is displayed for each level of a factor, with the heights of the bars proportional to the frequencies of observations falling in the respective categories. A disadvantage of bar graphs is that the levels are ordered alphabetically (by default), which may sometimes obscure patterns in the display. The state region data are already stored internally as a factor. In the figure, the display on the left is a frequency bar graph because the y axis shows counts, while the display on the right is a relative frequency bar graph.

The only difference between the two is the scale. Looking at the graph we see that more of the fifty states are in the South than in any other region, followed by West, North Central, and finally Northeast.

Pareto diagrams. A pareto diagram is a lot like a bar graph except the bars are rearranged such that they decrease in height going from left to right.
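The table, the two bar graphs, and the Pareto-style ordering can be sketched with base R alone. Using state.region is an assumption based on the regions named above, and sorting the table before plotting is a simple substitute for a dedicated Pareto chart function.

```r
# Frequencies and proportions for each region.
Tbl <- table(state.region)
Tbl
prop.table(Tbl)

op <- par(mfrow = c(1, 3))
barplot(Tbl, cex.names = 0.8)                 # frequency bar graph
barplot(prop.table(Tbl), cex.names = 0.8)     # relative frequency bar graph

# Pareto-style ordering: sort the counts so the bars decrease left to right.
barplot(sort(Tbl, decreasing = TRUE), cex.names = 0.8)
par(op)
```

The cex.names argument only shrinks the axis labels so that all four region names fit under their bars.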

The rearrangement is handy because it can visually reveal structure (if any) in how fast the bars decrease; this is much more difficult when the bars are jumbled. We can make a pareto diagram with the RcmdrPlugin.IPSUR package, among other options.

Dot charts. These do not convey any more or less information than the associated bar graph, but the strength lies in the economy of the display. Dot charts are so compact that it is easy to graph very complicated multi-variable interactions together in one graph. See Section 3. We will give an example here using the same data as above for comparison.

Pie charts. Pie charts are a very bad way of displaying information. A bar chart or dot chart is a preferable way of displaying qualitative data.
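A dot chart of the same kind of table can be sketched as follows, again assuming state.region as the underlying factor:

```r
# One dot per region; the position along the x axis gives the count.
x <- table(state.region)
dotchart(as.vector(x), labels = names(x))
```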

We are not going to do any examples of a pie graph and discourage their use elsewhere.

R reserves the special symbol NA to represent missing data. Some functions return NA or an error when their argument contains NAs. In those cases we can find the locations of any NAs with the is.na function.
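A small sketch of working around missing values, using a made-up vector. The na.rm argument is an addition here; it is a standard option of many summary functions but is not mentioned in the surviving text.

```r
x <- c(3, 7, NA, 4, 7)     # made-up values with one missing entry

sum(x)                     # NA: the missing value propagates
sum(x, na.rm = TRUE)       # 21: many summary functions can skip NAs

is.na(x)                   # logical vector marking the missing positions
which(is.na(x))            # 3: index of the missing entry
z <- x[!is.na(x)]          # keep only the observed values
sum(z)                     # 21
```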

See Appendix D.

The acronym to remember is CUSS: Center, Unusual features, Spread, and Shape. Loosely speaking, the center of a data set is associated with a number that represents a middle or general tendency of the data.

Of course, there are usually several values that would serve as a center, and our later tasks will be focused on choosing an appropriate one for the data at hand.

The shape can tell us a lot about any underlying structure to the data, and can help us decide which statistical procedure we should use to analyze them. A right-skewed (or positively skewed) distribution is stretched to the right side. A left-skewed (or negatively skewed) distribution is stretched to the left side. A symmetric distribution has a graph that is balanced about its center, in the sense that half of the graph may be reflected about a central line of symmetry to match the other half.

We have already encountered skewed distributions: the discoveries data, for example, are right-skewed. Some distributions tend to have a flat shape with thin tails. These are called platykurtic, and an example of a platykurtic distribution is the uniform distribution; see Section 6.

On the other end of the spectrum are distributions with a steep peak, or spike, accompanied by heavy tails; these are called leptokurtic. Examples of leptokurtic distributions are the Laplace distribution and the logistic distribution. See Section 6. In between are distributions called mesokurtic with a rounded peak and moderately sized tails.

The standard example of a mesokurtic distribution is the famous bell-shaped curve, also known as the Gaussian, or normal, distribution, and the binomial distribution can be mesokurtic for specific choices of p. See Sections 5.

Clusters indicate clumping of the data about distinct values, and gaps may exist between clusters.

Clusters often suggest an underlying grouping to the data. For example, take a look at the faithful data which contains the duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park.
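One way to look for clusters in the eruption durations is sketched below. Base R's stem function is an assumed stand-in for whatever display the text used, and the histogram is an extra view added for comparison.

```r
# Stem-and-leaf display of eruption durations (in minutes).
stem(faithful$eruptions)

# A histogram of the same variable is another way to look for clusters.
hist(faithful$eruptions, main = "", xlab = "eruption duration")
```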

Do not be frightened by the complicated information at the left of the display for now; we will learn how to interpret it in Section 3.

Extreme observations fall far from the rest of the data. Such observations are troublesome to many statistical procedures; they cause exaggerated estimates and instability. It is important to identify extreme observations and examine the source of the data more closely.

Mistakes are a common problem, especially with large data sets becoming more prevalent, many of which are recorded by hand. After closer scrutiny, these can often be fixed. In other cases the observation was never meant for the study. For example, in medical research some subjects may have relevant complications in their genealogical history that would rule out their participation in the experiment. Or when a manufacturing company investigates the properties of one of its devices, perhaps a particular product is malfunctioning and is not representative of the majority of the items. Finally, an extreme observation may point to something genuinely new. Many of the most influential scientific discoveries were made when the investigator noticed an unexpected result, a value that was not predicted by the classical theory.

Albert Einstein, Louis Pasteur, and others built their careers on exactly this circumstance.

Frequencies and relative frequencies are used for categorical data. The idea is that there are a number of different categories, and we would like to get some idea about how the categories are represented in the population.

To calculate the value of the sample median, first sort the data into an increasing sequence of numbers. The median is then the middle value of the sorted list, or the average of the two middle values when the number of observations is even. Because the median depends only on the values in the middle of the list, it is resistant to extreme observations. The same cannot be said for the sample mean. Any significant change in the magnitude of an observation xk results in a corresponding change in the value of the mean.

Hence, the sample mean is said to be sensitive to extreme observations. The trimmed mean is a measure designed to address the sensitivity of the sample mean to extreme observations.

Given a data set x1, x2, ..., xn, the order statistics are the same values sorted into increasing order. The sample quantiles are related to the order statistics.
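The contrast between the mean, the median, and the trimmed mean can be seen on a tiny made-up data set. The numbers and the 20% trimming fraction are arbitrary choices for illustration.

```r
x  <- c(2, 3, 5, 7, 11)       # made-up data
x2 <- c(2, 3, 5, 7, 1100)     # same data with the largest value inflated

sort(x2)                      # the order statistics

mean(x);   mean(x2)           # 5.6 versus 223.4: the mean chases the extreme value
median(x); median(x2)         # 5 in both cases: the median does not move

# Trimming 20% from each end drops the smallest and largest of the five values.
mean(x2, trim = 0.2)          # 5
```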

Unfortunately, there is not a universally accepted definition of them. Indeed, R is equipped to calculate quantiles using nine distinct definitions! Keep in mind that there is not a unique definition of percentiles, quartiles, etc. The difference is small and seldom plays a role except in small data sets with repeated values. In fact, most people do not even notice in common use.

In the Expression to compute dialog simply type sort(varname), where varname is the variable that it is desired to sort.
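To see how much the nine definitions differ in practice, a sketch along these lines can be run on the precip data (the choice of data set and of the first quartile is arbitrary):

```r
# The type argument of quantile() selects one of the nine definitions.
q25 <- sapply(1:9, function(type) quantile(precip, probs = 0.25, type = type))
names(q25) <- paste0("type", 1:9)
q25

# The default definition in R is type 7.
quantile(precip, probs = c(0.25, 0.5, 0.75))
```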

Intuitively, the sample variance is approximately the average squared distance of the observations from the sample mean. The sample standard deviation is used to scale the estimate back to the measurement units of the original data. In the meantime, the following two rules give some meaning to the standard deviation, in that there are bounds on how much of the data can fall past a certain distance from the mean. The first rule holds for every data set. The price for such generality is that the bounds are not very tight; if we know more about how the data are shaped then we can say more about how much of the data can fall a given distance from the mean.
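The sketch below computes the sample variance by hand and then checks a distribution-free bound on the precip data. The statements of the two rules are missing from this copy, so it is an assumption that the general rule is Chebyshev's inequality, which says that at least 1 - 1/k^2 of the data lie within k standard deviations of the mean.

```r
x <- precip
n <- length(x)

# Sample variance: squared distances from the mean, averaged with n - 1.
v <- sum((x - mean(x))^2) / (n - 1)
all.equal(v, var(x))
all.equal(sqrt(v), sd(x))

# Proportion of observations within k standard deviations of the mean,
# next to the Chebyshev lower bound 1 - 1/k^2.
k <- c(2, 3, 4)
within <- sapply(k, function(kk) mean(abs(x - mean(x)) <= kk * sd(x)))
rbind(k = k, observed = within, bound = 1 - 1 / k^2)
```

The observed proportions will typically sit well above the bound, which is the looseness the text refers to.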

Interquartile range. Just as the sample mean is sensitive to extreme values, so the associated measure of spread is similarly sensitive to extremes. Further, the problem is exacerbated by the fact that the extreme distances are squared.

Comparing apples to apples. We have seen three different measures of spread which, for a given data set, will give three different answers. Which one should we use? It depends on the data set. If the data are well behaved, with an approximate bell-shaped distribution, then the sample mean and sample standard deviation are natural choices with nice mathematical properties.

However, if the data have an unusual or skewed shape with several extreme values, perhaps the more resistant choices among the IQR or MAD would be more appropriate. Once we are looking at the three numbers, however, it is important to understand that the estimators are not all measuring the same quantity, on the average. See 8 for more details.
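The three measures of spread can be compared on a tiny made-up data set, before and after one value is inflated. The numbers are arbitrary and chosen only to make the effect visible.

```r
x  <- c(2, 3, 5, 7, 11)       # made-up data
x2 <- c(2, 3, 5, 7, 1100)     # same data with the largest value inflated

# The standard deviation reacts strongly to the extreme value ...
c(sd(x), sd(x2))

# ... while the IQR and the MAD are unchanged here.
c(IQR(x), IQR(x2))
c(mad(x), mad(x2))
```

Note that the three functions return different numbers even on the well-behaved vector, which illustrates the point that they are not all measuring the same quantity.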

The Third Edition features material on descriptive statistics. Science at the University of Dhaka, Bangladesh. Currently Dr. Islam is Pro-Vice.. Jan 7, — Introduction to probability and statistics for engineers and scientists, 5th edition, An introduction to statistics and probability by nurul islam pdf, The security of conventional cryptography Sep 24, — Islam probablity by m nurul islam that we will extremely offer.

It is not on the Eventually, you will utterly discover An introduction to statistics and probability. By: Islam, M. Dispersion 1 - Free download as Powerpoint Presentation. Nurul Islam ,M. An Introduction to Statistics and Probability.. Edition: 3rd ed. An Introduction to Statistics and Probability.

Nurul Islam, 6, Revised, Research. Nurul Islam, 2, Statistic.. Introduction to Statistics: An Intuitive Guide for This course provides an introduction planar engineering dynamics for particles and rigid bodies. On now. Free to learn.. Rafiqul Islam, M.

Korban Ali, Md. Freund Publisher: Dhaka : Book World, Edition: 3rd ed. Maybe you have knowledge that, people have search Probability By Nurul Islam The advantage is that you do not need to know the exact name of the function; the disadvantage is that you cannot point-and-click the results. An equivalent way is?? It can be very useful for finding other questions that other users have asked.

The example function will run the code automatically, skipping the intermediate step. For instance, we may try example mean to see a few examples of how the mean function works. Particularly pay attention to the bottom of the page which lists several special interest groups SIGs related to R. Bear in mind that R is free software, which means that it was written by volunteers, and the people that frequent the mailing lists are also volunteers who are not paid by customer support fees.

Consequently, if you want to use the mailing lists for free advice then you must adhere to some basic etiquette, or else you may not get a reply, or even worse, you may receive a reply which is a bit less cordial than you are used to.

Below are a few considerations: 1. Note that there are different FAQs for different operating systems. You should read these now, even without a question at the moment, to learn a lot about the idiosyncrasies of R.

Search the archives. Even if your question is not a FAQ, there is a very high likelihood that your question has been asked before on the mailing list. If you want to know about topic foo, then you can do RSiteSearch "foo" to search the mailing list archives and the online help for it.

Do a Google search and an RSeek. If your question is not a FAQ, has not been asked on R-help before, and does not yield to a Google or alternative search, then, and only then, should you even consider writing to R-help. Below are a few additional considerations.

This will save you a lot of trouble and pain. Readers of your message will take the text from your mail and copy-paste into an R session. Questions are often related to a specific data set, and the best way to communicate the data is with a dump command. For instance, if your question involves data stored in a vector x, you can type dump "x","" at the command prompt and copy-paste the output into the body of your email message. Sometimes the answer the question is related to the operating system used, the attached packages, or the exact version of R being used.

The sessionInfo command collects all of this information to be copy-pasted into an email and the Posting Guide requests this information. See Appendix A for an example. Below are a few of the important ones. There are also loads of contributed information books, tutorials, etc. There are mirrors all over the world with duplicate infor- mation. Here you can find development code which has not yet been released to CRAN. If you find a trick of your own, login and share it with the world.

More generally, the command history will show a whole list of recently entered commands. These list all available objects in the workspace. All of the numbers will automatically be entered into the vector x. These commands may be saved in a file called Rprofile.

Alternatively, you can make a file. This allows for multiple configurations for different projects or users. This file is then automatically loaded the next time R starts in which case R will say [previously saved workspace restored]. This is a valuable feature for experienced users of R, but I find that it causes more trouble than it saves with beginners.

Once we see how to display data distributions, we next introduce the basic properties of data distributions. We qualitatively explore several data sets. Once that we have intuitive properties of data sets, we next discuss how we may numerically measure and describe those properties with descriptive statistics. What do I want them to know? In each subsection we look at some examples of the type in question and introduce methods to display them. They invariably assume numerical values.

Quantitative data can be further subdivided into two categories. Examples include: counts, number of arrivals, or number of successes.

They are often represented by integers, say, 0, 1, 2, etc.. These are also known as scale data, interval data, or measurement data.

Examples include: height, weight, length, time, etc. Continuous data are often characterized by fractions or decimals: 3. Note that the distinction between discrete and continuous data is not always clear-cut. Sometimes it is convenient to treat data as if they were continuous, even though strictly speaking they are not continuous. See the examples. Example 3. Annual Precipitation in US Cities. The vector precip contains average amount of rainfall in inches for each of 70 cities in the United States and Puerto Rico.

These are quantitative continuous data. Lengths of Major North American Rivers. The U. Geological Survey recorded the lengths in miles of several rivers in North America. They are stored in the vector rivers in the datasets package which ships with base R. Let us take a look at the data with the str function. These data are definitely quantitative and it appears that the measurements have been rounded to the nearest mile.

Thus, strictly speaking, these are discrete data. But we will find it convenient later to take data like these to be continuous for some of our statistical procedures. Yearly Numbers of Important Discoveries. The entries are integers, and since they represent counts this is a good example of discrete quantitative data. We will take a closer look in the following sections.

There are almost as many display types from which to choose as there are data sets to plot. We describe some of the more popular alternatives. Strip charts also known as Dot plots These can be used for discrete or continuous data, and usually look best when the data set is not too large. Along the horizontal axis is a numerical scale above which the data values are plotted. We can do it in R with a call to the stripchart function. There are three available methods. This method is good to display only the distinct values assumed by the data set.

This method is best used for discrete data with a lot of ties; if there are no repeats then this method is identical to overplot.

See Figure 3. The leftmost graph is a strip chart of the precip data. The graph shows tightly clustered values in the middle with some others falling balanced on either side, with perhaps slightly more falling to the left. Later we will call this a symmetric distribution, see Section 3.

The middle graph is of the rivers data, a vector of length There are several repeated values in the rivers data, and if we were to use the overplot method we would lose some of them in the display. This plot shows a what we will later call a right-skewed shape with perhaps some extreme values on the far right of the display. The third graph strip charts discoveries data which are literally a textbook example of a right skewed distribution.

Histogram These are typically used for continuous data. A histogram is constructed by first deciding on a set of classes, or bins, which partition the real line into a set of boxes into which the data values fall. The scale on the y axis can be frequency, percentage, or density relative frequency.

The term histogram was coined by Karl Pearson in , see [66]. We are going to take another look at the precip data that we investigated earlier. The strip chart in Figure 3. There are many ways to plot histograms in R, and one of the easiest is with the hist function. The following code produces the plots in Figure 3. Please be careful regarding the biggest weakness of histograms: the graph obtained strongly depends on the bins chosen.

Choose another set of bins, and you will get a different histogram. Luckily for us there are algorithms to automatically choose bins that are likely to display well, and more often than not the default bins do a good job. This is not always the case, however, and a responsible statistician will investigate many bin choices to test the stability of the display.

Recall that the strip chart in Figure 3. Watch what happens when we change the bins slightly with the breaks argument to hist. There are two humps: a big one in the middle and a smaller one to the left. Graphs like this often indicate some underlying group structure to the data; we could now investigate whether the cities for which rainfall was measured were similar in some way, with respect to geographic region, for example.

The rightmost graph in Figure 3. If we were to continue increasing the number of bins we would eventually get all observed bins to have exactly one element, which is nothing more than a glorified strip chart. Stemplots more to be said in Section 3. The final digit of the data values is taken to be a leaf, and the leading digit s is are taken to be stems.

We draw a vertical line, and to the left of the line we list the stems. To the right of the line, we list the leaves beside their corresponding stem. There will typically be several leaves for each stem, in which case the leaves accumulate to the right. It is sometimes necessary to round the data values, especially for larger data sets. UKDriverDeaths is a time series that contains the total car drivers killed or se- riously injured in Great Britain monthly from Jan to Dec Compulsory seat belt use was introduced on January 31, We construct a stem and leaf dia- gram in R with the stem.

Note that the data have been rounded to the tens place so that each datum gets only one leaf to the right of the dividing line. Notice that the depths have been suppressed. To learn more about this option and many others, see Section 3. Unlike a histogram, the original data values may be recovered from the stemplot display — modulo the rounding — that is, starting from the top and working down we can read off the data values , , , , etc.
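As a rough stand-in for the display described above, base R's stem function can draw a stemplot of the same data. This is an assumption on our part: stem is not necessarily the function the text uses, it does not print depths, and its rounding may differ from the display discussed here.

```r
# UKDriverDeaths ships with R (datasets package) as a monthly time series.
# Convert to a plain numeric vector before plotting.
deaths <- as.numeric(UKDriverDeaths)

# Base R stemplot: stems left of the bar, one leaf per (rounded) datum on the right.
stem(deaths)
```

The header line that stem prints states where the decimal point sits relative to the bar, which is what allows the rounded data values to be read back off the display.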

Index plots. Done with the plot function. These are good for plotting data which are ordered, for example, when the data are measured over time. That is, the first observation was measured at time 1, the second at time 2, etc. It is a two dimensional plot, in which the index (or time) is the x variable and the measured value is the y variable.
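A minimal sketch of an index plot, using the LakeHuron series that ships with R and is introduced just below. The choice of type = "h" (vertical spikes) and the helper name draw_index_plot are assumptions made for illustration; the original figure may use different settings.

```r
# Observation order on the x axis, measured value on the y axis.
huron <- as.numeric(LakeHuron)   # drop the time-series attributes
index <- seq_along(huron)        # 1, 2, ..., n

# Hypothetical helper that draws the index plot
draw_index_plot <- function() {
  plot(index, huron, type = "h",
       xlab = "Index (observation order)", ylab = "Level (feet)")
  invisible(NULL)
}
```

Calling draw_index_plot() draws one spike per observation. Passing type = "p" or type = "l" to plot instead gives points or a connected line.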

Level of Lake Huron. Brockwell and Davis [11] give the annual measurements of the level in feet of Lake Huron from — The data are stored in the time series LakeHuron. Figure 3. Density estimates 3.

Please bear in mind that some data look to be quantitative but are not, because they do not represent numerical quantities and do not obey mathematical rules. Shoe size is not quantitative, however, because if we take a size 8 and combine with a size 9 we do not get a size 17. This type of data does not usually play much of a role in statistics.

But other qualitative variables serve to subdivide the data set into categories; we call these factors. In the above examples, gender, race, political party, and socioeconomic status would be considered factors (shoe size would be another one).

The possible values of a factor are called its levels. For instance, the factor gender would have two levels, namely, male and female. Socioeconomic status typically has three levels: high, middle, and low.

Factors may be of two types: nominal and ordinal. Nominal factors have levels that correspond to names of the categories, with no implied ordering. Examples of nominal factors would be hair color, gender, race, or political party. In contrast, ordinal factors have some sort of ordered structure to the underlying factor levels. For instance, socioeconomic status would be an ordinal categorical variable because the levels correspond to ranks associated with income, education, and occupation.

Another example of ordinal categorical data would be class rank. Factors have special status in R. The state. These would be ID data. State Facts and Features. Department of Commerce of the U. To see all of the levels we printed the first five entries of the vector in the second line. We may count frequencies with the table function or list proportions with the prop.
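The counting step can be sketched with the state.region factor that ships with R. Using state.region here is an assumption based on the discussion of regions that follows; the garbled text above does not name the object explicitly.

```r
# state.region is a factor with one entry per US state.
head(state.region, 5)        # printing a factor also lists all of its levels
levels(state.region)

counts <- table(state.region)   # frequency of each level
props  <- prop.table(counts)    # relative frequency of each level

print(counts)
print(round(props, 2))
```

prop.table simply divides each count by the total, so the proportions sum to one.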

A bar is displayed for each level of a factor, with the heights of the bars proportional to the frequencies of observations falling in the respective categories. A disadvantage of bar graphs is that the levels are ordered alphabetically (by default), which may sometimes obscure patterns in the display.
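A sketch of a frequency bar graph next to a relative frequency bar graph, assuming the state.region data. The helper name draw_bar_graphs and the value cex.names = 0.5 are illustrative choices, not taken from the original code.

```r
counts <- table(state.region)   # bar heights for the frequency graph
props  <- prop.table(counts)    # bar heights for the relative frequency graph

# Hypothetical helper: draw both displays side by side
draw_bar_graphs <- function() {
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))
  # cex.names shrinks the axis labels so long level names fit under the bars
  barplot(counts, cex.names = 0.5, main = "Frequency")
  barplot(props,  cex.names = 0.5, main = "Relative frequency")
  invisible(NULL)
}
```

Calling draw_bar_graphs() gives two graphs with identical shapes that differ only in the scale of the y axis.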

It is already stored internally as a factor. The display on the left is a frequency bar graph because the y axis shows counts, while the display on the right is a relative frequency bar graph. The only difference between the two is the scale. Looking at the graph we see that the largest share of the fifty states is in the South, followed by West, North Central, and finally Northeast. Notice the cex.

Pareto Diagrams. A pareto diagram is a lot like a bar graph except the bars are rearranged such that they decrease in height going from left to right.

The rearrangement is handy because it can visually reveal structure (if any) in how fast the bars decrease; this is much more difficult when the bars are jumbled. We can make a pareto diagram with either the RcmdrPlugin.IPSUR package or with the pareto. The code follows.

They do not convey any more or less information than the associated bar graph, but the strength lies in the economy of the display.
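Returning to the pareto diagram: its essential step, sorting the bars by decreasing height, can be sketched in base R alone. This is not the packaged function referred to above, only an approximation of what such a function computes; the state.region data and the helper name draw_pareto are assumptions.

```r
counts <- table(state.region)

# Rearrange the levels so that frequencies decrease from left to right
ordered_counts <- sort(counts, decreasing = TRUE)

# Cumulative percentage, often reported alongside a pareto diagram
cum_pct <- 100 * cumsum(as.vector(ordered_counts)) / sum(ordered_counts)
names(cum_pct) <- names(ordered_counts)
print(round(cum_pct, 1))

# Hypothetical helper: bar graph of the reordered counts
draw_pareto <- function() {
  barplot(ordered_counts, cex.names = 0.6, ylab = "Frequency")
  invisible(NULL)
}
```

Because the bars are sorted, the rate at which they shrink is visible at a glance when draw_pareto() is called.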

Dot charts are so compact that it is easy to graph very complicated multi-variable interactions together in one graph. See Section 3. We will give an example here using the same data as above for comparison. The graph was produced by the following code. Compare it to Figure 3. Pareto chart analysis for table state. Percentage Cum. Pie charts are consequently a very bad way of displaying information. A bar chart or dot chart is a preferable way of displaying qualitative data. We are not going to do any examples of a pie graph and discourage their use elsewhere.
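For the dot chart discussed above, a minimal base R sketch is the following. It assumes the state.region counts as the data; the helper name draw_dot_chart is hypothetical, and the original code may pass different arguments.

```r
counts <- table(state.region)

# Hypothetical helper: one dot per level, positioned at its frequency.
# dotchart expects a plain vector or matrix, so the table is converted first.
draw_dot_chart <- function() {
  dotchart(as.vector(counts), labels = names(counts), xlab = "Frequency")
  invisible(NULL)
}
```

Calling draw_dot_chart() shows the same information as the bar graph, with a single point in place of each bar.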

For example, the stem. We saw in Section 3. R reserves the special symbol NA to represent missing data. In those cases we can find the locations of any NAs with the is. See Appendix D. The acronym to remember is Center, Unusual features, Spread, and Shape. Loosely speaking, the center of a data set is associated with a number that represents a middle or general tendency of the data. Of course, there are usually several values that would serve as a center, and our later tasks will be focused on choosing an appropriate one for the data at hand.

Judging from the histogram that we saw in Figure 3. The shape can tell us a lot about any underlying structure to the data, and can help us decide which statistical procedure we should use to analyze them.

A left-skewed or negatively skewed distribution is stretched to the left side. A symmetric distribution has a graph that is balanced about its center, in the sense that half of the graph may be reflected about a central line of symmetry to match the other half.

We have already encountered skewed distributions: both the discoveries data in Figure 3. Some distributions tend to have a flat shape with thin tails. These are called platykurtic, and an example of a platykurtic distribution is the uniform distribution; see Section 6. On the other end of the spectrum are distributions with a steep peak, or spike, accompanied by heavy tails; these are called leptokurtic.

Examples of leptokurtic distributions are the Laplace distribution and the logistic distribution. See Section 6. In between are distributions called mesokurtic with a rounded peak and moderately sized tails.

The standard example of a mesokurtic distribution is the famous bell-shaped curve, also known as the Gaussian, or normal, distribution, and the binomial distribution can be mesokurtic for specific choices of p. See Sections 5. They indicate clumping of the data about distinct values, and gaps may exist between clusters.

Clusters often suggest an underlying grouping to the data. For example, take a look at the faithful data which contains the duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park. Do not be frightened by the complicated information at the left of the display for now; we will learn how to interpret it in Section 3. Such observations are troublesome to many statistical procedures; they cause exaggerated estimates and instability.

It is important to identify extreme observations and examine the source of the data more closely. Especially with large data sets becoming more prevalent, many of which are recorded by hand, mistakes are a common problem. After closer scrutiny, these can often be fixed. For example, in medical research some subjects may have relevant complications in their genealogical history that would rule out their participation in the experiment. Or when a manufacturing company investigates the properties of one of its devices, perhaps a particular product is malfunctioning and is not representative of the majority of the items.

Many of the most influential scientific discoveries were made when the investigator noticed an unexpected result, a value that was not predicted by the classical theory. Albert Einstein, Louis Pasteur, and others built their careers on exactly this circumstance. The idea is that there are a number of different categories, and we would like to get some idea about how the categories are represented in the population. For example, we may want to see how the 3. To calculate its value, first sort the data into an increasing sequence of numbers.

The same cannot be said for the sample mean. Any significant change in the magnitude of an observation x_k results in a corresponding change in the value of the mean. Hence, the sample mean is said to be sensitive to extreme observations. The trimmed mean is a measure designed to address the sensitivity of the sample mean to extreme observations.
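The contrast can be seen on a small made-up data set. The values below are illustrative only; the trimmed mean uses the trim argument of mean, which drops the stated fraction of observations from each end of the sorted data before averaging.

```r
x     <- 1:10
x_out <- replace(x, 10, 1000)   # corrupt the largest observation

mean(x)                   # 5.5
mean(x_out)               # 104.5: the mean follows the extreme value
median(x)                 # 5.5
median(x_out)             # 5.5: the median is unaffected
mean(x_out, trim = 0.1)   # 5.5: one value is dropped from each end
```

With trim = 0.1 and ten observations, exactly one value is removed from each tail, so the corrupted observation never enters the average.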

Given a data set x1 , x2 ,. The sample quantiles are related to the order statistics. Unfortunately, there is not a universally accepted definition of them.

Indeed, R is equipped to calculate quantiles using nine distinct definitions! Suppose the data set has n observations. Keep in mind that there is not a unique definition of percentiles, quartiles, etc.

The difference is small and seldom plays a role except in small data sets with repeated values. In fact, most people do not even notice in common use.
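The nine definitions are selected with the type argument of quantile. The sketch below computes the first quartile of a small made-up sample under each definition; the data values are illustrative only.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# First quartile under each of the nine definitions (type = 1, ..., 9)
q_by_type <- sapply(1:9, function(t) {
  quantile(x, probs = 0.25, type = t, names = FALSE)
})
names(q_by_type) <- paste0("type", 1:9)
print(q_by_type)

# quantile() uses type = 7 when no type is given
quantile(x, probs = 0.25)
```

On a sample this small the definitions can disagree visibly, while on large samples without many ties the answers usually agree closely.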

In the Expression to compute dialog simply type sort(varname), where varname is the variable that it is desired to sort. Intuitively, the sample variance is approximately the average squared distance of the observations from the sample mean. The sample standard deviation is used to scale the estimate back to the measurement units of the original data. In the meantime, the following two rules give some meaning to the standard deviation, in that there are bounds on how much of the data can fall past a certain distance from the mean.

Fact 3. The price for such generality is that the bounds are not very tight; if we know more about how the data are shaped then we can say more about how much of the data can fall a given distance from the mean. Interquartile Range Just as the sample mean is sensitive to extreme values, so the associated measure of spread is similarly sensitive to extremes.

Further, the problem is exacerbated by the fact that the extreme distances are squared. Comparing Apples to Apples We have seen three different measures of spread which, for a given data set, will give three different answers. Which one should we use? It depends on the data set. If the data are well behaved, with an approximate bell-shaped distribution, then the sample mean and sample standard deviation are natural choices with nice mathematical properties.

However, if the data have an unusual or skewed shape with several extreme values, perhaps the more resistant choices among the IQR or MAD would be more appropriate.
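The difference in resistance can be sketched with the base R functions sd, IQR, and mad on a small made-up data set. The data are illustrative only. Note that mad multiplies the raw median absolute deviation by a constant (1.4826 by default).

```r
x     <- 1:10
x_out <- replace(x, 10, 1000)   # one extreme observation

# Three measures of spread, computed side by side
spread <- function(v) c(sd = sd(v), IQR = IQR(v), MAD = mad(v))

rbind(clean = spread(x), with_outlier = spread(x_out))
```

The standard deviation grows by roughly two orders of magnitude, while the IQR and MAD do not move at all for this particular data set.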

However, once we are looking at the three numbers it is important to understand that the estimators are not all measuring the same quantity, on the average. See 8 for more details. The sample standard deviation is sqrt(var(x)) or just sd(x). The sign of g1 indicates the direction of skewness of the distribution. Values of g1 near zero indicate a symmetric distribution. These are not hard and fast rules, however. The value of g1 is subject to sampling variability and thus only provides a suggestion to the skewness of the underlying distribution.

The subtraction of 3 may seem mysterious but it is done so that mound shaped samples have values of g2 near zero. Notice that both the sample skewness and the sample kurtosis are invariant with respect to location and scale, that is, the values of g1 and g2 do not depend on the measurement units of the data.
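The invariance claim can be checked numerically. The formulas for g1 and g2 are not legible in this copy, so the sketch assumes a common moment-based form: the average third (or fourth) power of the deviations from the mean, divided by the matching power of the sample standard deviation, with 3 subtracted for g2. The function names are hypothetical, and other normalizations exist.

```r
# Assumed form: g1 = (1/n) * sum((x - mean)^3) / s^3, with s = sd(x)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  sum((x - mean(x))^3) / (length(x) * sd(x)^3)
}

# Assumed form: g2 = (1/n) * sum((x - mean)^4) / s^4 - 3
sample_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  sum((x - mean(x))^4) / (length(x) * sd(x)^4) - 3
}

# Changing units (scale) and origin (location) leaves both values unchanged
rescaled <- 2.54 * precip + 10
c(g1 = sample_skewness(precip), g1_rescaled = sample_skewness(rescaled))
c(g2 = sample_kurtosis(precip), g2_rescaled = sample_kurtosis(rescaled))
```

The invariance holds because the shift cancels in the deviations and a positive scale factor cancels between numerator and denominator.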

Both functions have a na. Its tools are useful when not much is known regarding the underlying causes associated with the data set, and are often used for checking assumptions. For example, suppose we perform an experiment and collect some data. We look at the data using exploratory visual tools.

Trim Outliers: Some data sets have observations that fall far from the bulk of the other data in a sense made more precise in Section 3. These extreme observations often obscure the underlying structure to the data and are best left out of the data display. The trim. Split Stems: The standard stemplot has only one line per stem, which means that all observations with first digit 3 are plotted on the same line, regardless of the value of the second digit.

We can often fix the display by increasing the number of lines available for a given stem. Observations with second digit 0 through 4 would go on the upper line, while observations with second digit 5 through 9 would go on the lower line. We could do a similar thing with five lines per stem, or even ten lines per stem.
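With base R's stem function, a comparable effect is obtained through the scale argument, which lengthens the display. This is a sketch using the precip data and base R only; it is not the stemplot function with the trim and split options that the surrounding text describes, and how stem splits the stems depends on the data.

```r
# Default display
stem(precip)

# scale = 2 asks for a display roughly twice as long,
# which stem achieves by using more lines per stem
stem(precip, scale = 2)
```

Comparing the two printouts shows the same leaves spread over more lines in the second display.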

The end result is a more spread out stemplot which often looks better. A good example of this was shown on page



