Discrete Probability Distributions | Real Statistics Using Excel [PDF]

Definition 1: The (probability) frequency function, also called the probability density function (abbreviated pdf), of a

5 downloads 25 Views 240KB Size

Recommend Stories


Discrete Probability Distributions
The wound is the place where the Light enters you. Rumi

[PDF] Statistics for Managers Using Microsoft Excel
Learn to light a candle in the darkest moments of someone’s life. Be the light that helps others see; i

[PDF] Statistics for Managers Using Microsoft Excel
Do not seek to follow in the footsteps of the wise. Seek what they sought. Matsuo Basho

[PDF] Statistics for Managers Using Microsoft Excel
Don't count the days, make the days count. Muhammad Ali

STATISTICS FOR MANAGERS USING Microsoft Excel [PDF]
Custom Edition for. UMASS-Amherst. Professor Robert Nakosteen. Taken from: Statistics for Managers: Using Microsoft Excel, Fifth Edition by David M. Levine, David F. Stephan, Timothy C. Krehbiel, and Mark L. Berenson. Business Data Analysis. SCH-MGMT

CHAPTER 5 SOME DISCRETE PROBABILITY DISTRIBUTIONS 5.1 Uniform Discrete Distribution
In the end only three things matter: how much you loved, how gently you lived, and how gracefully you

Probability and Probability Distributions
Every block of stone has a statue inside it and it is the task of the sculptor to discover it. Mich

Probability Distributions
Everything in the universe is within you. Ask all from yourself. Rumi

Probability Distributions
Never let your sense of morals prevent you from doing what is right. Isaac Asimov

Probability & Statistics
If you are irritated by every rub, how will your mirror be polished? Rumi

Idea Transcript


Real Statistics Using Excel Everything you need to do real statistical analysis using Excel

Discrete Probability Distributions We now define the concept of probability distributions for discrete random variables, i.e. random variables that take a discrete set of values. Such random variables generally take a finite set of values (heads or tails, people who live in London, scores on an IQ test), but they can also include random variables that take a countable set of values (0, 1, 2, 3, …). Definition 1: The (probability) frequency function, also called the probability density function (abbreviated pdf), of a discrete random variable x is defined so that for any value t in the domain of the random variable (i.e. in its sample space):

i.e. the probability that x assumes the value t. The corresponding (cumulative) distribution function F(x) is defined at value t by

Property 1: For any discrete random variable defined over the range S with frequency function f and distribution function F

for all t in S. Proof: These are characteristics of the probability function P(E) per Property 1 of Basic Probability Concepts. Observation: If f is the frequency function of a discrete random x with distribution function F, then f(t) is the probability that x takes the value t and F(t) is the probability that x takes a value less than or equal to t. Thus, the probability that x takes a value t such that t1 < t ≤ t2 is F(t2) – F(t1).

Here u 1 is the first value in the domain of f which is larger than t1. Such a u 1 exists since x is a discrete random variable (usually u 1 = t1+1). A frequency function can be expressed as a table or a bar chart, as described in the following example. Example 1: Find the distribution function for the frequency function given in columns A and B below. Also show the graph of the frequency and distribution functions.

Figure 1 – Table of frequency and distribution functions

Given the frequency function defined by the table in the range B4:B11, we can define the distribution function in the range C4:C11 by putting the formula =B4 in cell C4 and the formula =B5+C4 in cell C5 and then copying this formula into cells C6 to C11 (e.g. by highlighting the range C5:C11 and pressing Ctrl-D). Using the approach described in Example 2.1, we can generate the graphs of the frequency and distribution functions as follows:

Figure 2 – Charts of frequency and distribution functions

Excel Function: Excel provides the function PROB, which is defined as follows: Where R1 is the range defining the discrete values of the random variable x (e.g. A4:A11 in Figure 1) and R2 is the range consisting of the frequency values f(x) corresponding to the x values in R1 (e.g. B4:B11 in Figure 1), the Excel function PROB is defined as follows: PROB(R1, R2, c) = the frequency value f(c) PROB(R1, R2,, c) = the cumulative distribution value F(c) PROB(R1, R2, a, b) = the probability that x takes a value t between a and b, inclusive, i.e.

Thus in Example 1, we can put the formula =PROB(A4:A11,B4:B11,,A8) in cell C8, and similarly for the other values in column C. Also for the frequency function in Example 1, P(3 ≤ x ≤ 5) = PROB(A4:A11,B4:B11,A6,A8) For Example 1 it also follows that , P(3 ≤ x ≤ 5) = f(3) + f(4) + f(5) = F(5) – F(2) = 0.31. Example 2: Determine the frequency function for the data in column A of Figure 3.

Figure 3 – Constructing a frequency function

First create a list of unique data values. This can be obtained by first copying the raw data scores in column A to a new place in the worksheet (e.g. in column C in the example above) and selecting Data > Data Tools|Remove Duplicates. The highlighted data can then optionally be sorted via Data > Sort & Filter|Sort. The result appears in cell range C4:C8 above. Alternatively use the Real Statistics QSORT and NoDupes functions as described in Supplemental Functions. Then use the COUNTIF function (see Built-in Functions) to count how many times each score appears in the sample data. E.g. cell D4 contains the formula =COUNTIF($A$3:$A$15,C4), which has value 2 since the data element 12 (the value in cell C4) appears twice in the raw data. Since there are 12 data elements, the correct value of the frequency function for data element 2 is 2/12 = 0.167, which can be calculated via the formula D4/D$9 in cell E4 where D9 contains the formula SUM(D4:D8). Real Statistics Function: The Real Statistics Resource Pack supplies the following supplemental array function to create the frequency function. FREQTABLE(R1) = an n × 3 array which contains the frequency table for the data in range R1, where n = the number of unique values in R1 (i.e. the number of data elements in R1 without duplicates) To use the function you must highlight an array with 3 columns and at least as many rows as unique elements in R1. You can highlight more rows than you need; any extra rows will take value #N/A. Example 3: Repeat Example 2 using the FREQTABLE function.

Figure 4 – Using the FREQTABLE function

The output from =FREQTABLE(A3:A14) (where A3:A14 is as in Figure 3) is shown in range M4:O8 of Figure 4 (the headings in row 3 have been added manually). Real Statistics Data Analysis Tool: The resource pack also contains a supplemental data analysis tool called Frequency Table. This works just like the FREQTABLE function except that you don’t need to specify the size of the frequency table. The analysis tool sizes the output automatically. Observation: See Histograms for examples of the use of the FREQTABLE function and Frequency Table data analysis tool. Observation: The notion of probability function can be extended to multiple random variables. We now give the definition for two random variables. Definition 2: f(x, y) is a joint probability density function (pdf) of random variables x, y if for any values of a and b in the domains of x and y respectively

In this case the cumulative distribution function is given by

Property 2: If x is a random variable with pdf f and y is a random variable with pdf g, then x and y are independent if and only if the function f(x) · g(y) is a joint pdf for x, y. Proof: Follows from Definition 3 of Basic Probability Concepts. Observation: We will study a number of discrete distributions in this website such as the binomial distribution and Poisson distribution.

30 Responses to Discrete Probability Distributions James says: November 10, 2013 at 11:00 pm

Charles Once again, another good article. On this page http://www.real-statistics.com/probability-functions/discrete-probability-distributions/ I wonder if you could clarify why we are changing variable at certain points. There isn’t obviously any problem with the article itself but it seems like a good opportunity to understand what is happening in the mind of a statistician. I’m an engineer and having read other stats’ papers where algebraic identifiers get routinely substituted without explanation I often wonder if this happens for an important reason that I am missing. Just to be clear, in your article the following points are really quite trivial and following the thread is not difficult, its the application of the ideas in the wider world I’m interested in. Just so you can see where I’m coming from I can see in Definition 1 that there is a different role given to the letters “x” and “t” though even here it seems a little abstract. I presume the intention is to be rigourous so as to distinguish “x” as a discrete random variable from “t” as simply a value. Otherwise it would be enough for me to simply say that f(t) = P(t), i.e. the probability of arriving at the value t. In figure 1 I would intuitively assume that column A is actually “t” rather than “x” given how you had defined them. Why have you used x therefore when you had already gone out of your way to define x and t separately? The main point of this comment is actually aimed at the Example for the Excel function PROB. Here we introduce a third letter “c”. Once again, I would have assumed that we’d put “t” here, or “x = t” following the earlier convention. Have we changed simply because we are on a different example (which is what it appears) or for some other reason? Thanks again. James Reply

Charles says: November 12, 2013 at 12:35 pm

James, The type of notation I have used is quite common in mathematics. The only deviation from the norm is that it is common to use capital letters at the end of the alphabet (X, Y, etc.) for random variables. I have chosen to use small letters (x, y, etc.) since I use capital letters for vectors and matrices. In any case, I understand that this can sometimes be confusing. I have read a lot of mathematical papers and often adjusting to someone else’s notation is half the battle in understanding their paper. In the webpage you referenced I made things more complicated by inadvertently omitting the letters x and t in Definition 1 (I have now corrected this). Sometimes choosing to use the letters t and u instead of a and b has some logic behind it, and sometimes it is pretty arbitrary. Mathematicians also “abuse” the notation either out of laziness or to keep things simple, especially where their intentions are clear (at least they think it is clear). These traits carry over to statisticians, even those who are not particularly mathematical. In any case, your observations are valid, and I have revised the referenced webpage to make things clearer (at least I hope so). I can’t promise that I will do this on every webpage, but I will do my best not to confuse people too much with the notation I use. Charles Reply

Gaylord Lussac says: February 12, 2015 at 3:59 am

PROB doesn’t seem to like -ve c values (I was using it in the context of portfolio returns) Also, it seems that the c values must be in increasing order in the R arrays, else PROB is not happy. This is assumed in every example I’ve seen but never formally stated, even on Microsoft sites. Nice site. Reply

Charles says: February 12, 2015 at 11:09 am

Gaylord, I just tried some examples where the R1 array was not in increasing order, including the case where the order of the x values were 1,3,2,4,5,6,6,8 (in that order), and the results were correct. Can you give an example where PROB did not calculate the correct values? Charles Reply

Aatish says: February 27, 2015 at 10:59 am

Sir, thanks for the excel add in. Sir, there is one error ‘Compile error in hidden module: Analysis’ comes. How can I solve it? Regards A K Sahu Reply

Charles says: February 27, 2015 at 4:48 pm

Aatish, Try entering the formula =VER() in any blank cell in a worksheet. If you get an error, then the Real Statistics software was not installed properly. If you get the release number (e.g. 3.6.2) of the Real Statistics add-in then the cause is different and we will need to diagnose the problem in a different way. In this case, let me know which version of Excel are you using? Charles Reply

Rob Asonza says: April 20, 2015 at 9:29 am

Sir, is there a way on how I can determine the distribution of a certain set of data in Excel? (if it’s normal, poisson, etc) Reply

Charles says: April 20, 2015 at 12:43 pm

Rob, There are quite a number of ways to determine whether data is normally distributed. These methods are described on the webpage: Testing for Normality For other distributions you can use the Chi-square Goodness of Fit or Kolmogorov-Smirnov test. These are described on the following webpages: Chi-square Goodness of Fit test One Sample Kolmogorov-Smirnov test Charles Reply

Marian says: October 19, 2015 at 7:22 am

Hi, Thank you for your articles, they have really been enlightening. I have a question regarding fitting data to a distribution. I am conducting a study on copulas and the process requires me to have that my data follows a uniform distribution. Do you know how I can do this with the data I have? And is there a way to do this through Excel? Reply

Charles says: October 21, 2015 at 10:01 am

Marian, Are you trying to (1) test whether your data does follow a uniform distribution or (2) create data that follows a uniform distribution? If (1) you can use the Kolmogorov-Smirnov test and/or test whether your data is random using a Runs test. See the following webpages: Kolmogorov-Smirnov test Runs Test If (2) you can create random values between a and b with a < b using the formula =a+(b-a)*RAND() Charles Reply

Kay says: December 4, 2015 at 8:25 pm

Hello, Does the data that I use in excel have to follow a normal distribution or any other distribution to be able to use the probability frequency function and cumulative distribution function on my data as described above? Also, Are frequencies output with these functions or are they probabilities and what is the distinction if any? Are these functions used merely to analyze the data, or is it used to output probabilities/ frequencies that you are assume will stay true when more time passes and you get more and new data? Reply

Charles says: December 5, 2015 at 12:01 pm

Hello Kay, 1. You data does not have to follow a normal or any other distribution. Of course, the data itself will define a distribution, but it may not be any of the commonly used distributions. 2. The output from the PROB function are probabilities. The output from FREQTABLE includes both frequencies and probabilities. If I toss a coin 10,000 times and it comes up heads 5,010 times, then the frequency is 5,010 and the probability id 5,010/10,000. A frequency table is generally used to describe data and may be used for subsequent analysis. There is no assumption that probabilities/frequencies will remain the same as you get more data. Charles Reply

Kay says: December 5, 2015 at 10:45 pm

Charles, Say you have two columns of data instead of one. The first column, as shown below, has number of books checked out at a library and the second column has the duration in days it took for a person to return that number of books: # days 1 3 2 5 7 17 7 23 3 5 4 6 5 15 4 8 6 16 5 12 Assume that this is just sample data and that when I calculate, I will have data showing number of days it took for all counts of books (1,2,3,4 ect.) Could a frequency table still be applied to this above data to show that when a given number of books are checked out, a frequency and probability can be computed in terms of duration in time. In other words, something that says if 7 books are checked out, what is probability that in a given time span 7 books are checked out or less or something like the probability that between 5 and 7 books inclusive are checked out. I was thinking for each count of books checked out, sum up the total duration they were checkout for and divide that number by the total number of days amongst all counts of books checked out. So in the above example, for 7 books checked out, I would sum up 17 and 23 and divide by the total number of days, 110 to give a probability or F(x) of 0.363, and then I could use the PROB function in excel and do a CDF to see the probability 7 book or less are checked out. I don’t know how frequency would play in here, however. Reply

Charles says: December 6, 2015 at 10:02 am

Kay, Before trying to organize this information in the form of a frequency table, it would be helpful to understand what sort of questions you are seeking to answer. You seem to say that you want to understand the probability that 7 books or fewer are checked out. Based on the data this seems to be 100% since no person checked out more than 7 books. Please clarify what your objectives are. Charles Reply

Kay says: December 6, 2015 at 6:03 pm

Charles , I agree what I wrote was not clear. First, assume the sample data is from books were checked out in a time span of a month: I would like to find : 1)During the month of which this data was collected, what range of number books checked out resulted in the books being returned the fastest and what range resulted in books being checked out the slowest amount of time.The choices of ranges of number of books are: 1 book to 3 books inclusive, 4 books to 5 books inclusive, or 6 books to 7 books inclusive. I want to express this in frequencies and probabilities if possible. But if that does not fit what I am looking for, please let me know. Charles says: December 7, 2015 at 9:11 am

Kay, When you say, for example, that 7 books were checked out and it took 23 days to return the books, does this mean that all 7 books were returned after 23 days or that the last of the 7 books were returned after 23 days or the sum of the number of days it took to return each book was 23_ Charles

Kay says: December 7, 2015 at 8:29 pm

Charles, It means all 7 books were returned after 23 days all at once. Thanks, Kay Reply

Charles says: December 8, 2015 at 12:02 am

Kay, If x is the random variable = number of days it takes to return a book, then the frequency table can be represented as days freq pdf cum cdf 3 1 2.3% 1 2.3% 5 5 11.4% 6 13.6% 6 4 9.1% 10 22.7% 8 4 9.1% 14 31.8% 12 5 11.4% 19 43.2% 15 5 11.4% 24 54.5% 16 6 13.6% 30 68.2% 17 7 15.9% 37 84.1% 23 7 15.9% 44 100.0% Charles Reply

Artem says: March 13, 2016 at 12:53 pm

Charles, Thank you very much for your effort making this great website. I enjoy learning here very much. I’ve got a problem with understanding the first observation: P(t1

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.