Random Variable Distribution

Introduction to Random Variables

A random variable is a variable whose value is determined by the outcome of a random event. It serves as a function that maps outcomes from a sample space to numerical values.

For example, if we roll a fair six-sided die, we can define a random variable X as the number that appears face up. In this case, X can take the values 1, 2, 3, 4, 5, or 6, each with probability 1/6.
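The die example can be explored with a short simulation. This is a minimal sketch that assumes a fair die; the helper name `roll_die`, the seed, and the number of rolls are illustrative choices, not part of the original text.

```python
import random
from collections import Counter

def roll_die(rng):
    """One realization of X: the face shown by a fair six-sided die."""
    return rng.randint(1, 6)

rng = random.Random(0)   # fixed seed so the run is repeatable
n_rolls = 60_000         # illustrative sample size
counts = Counter(roll_die(rng) for _ in range(n_rolls))

# Relative frequency of each face; each should be close to 1/6.
frequencies = {face: counts[face] / n_rolls for face in range(1, 7)}
for face, freq in frequencies.items():
    print(face, round(freq, 4))
```

With more rolls, the relative frequencies tend to settle closer to 1/6, which is the sense in which the probabilities describe long-run behavior.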

Mathematical Definition

\[ X: \Omega \rightarrow \mathbb{R} \]

Where \(\Omega\) is the sample space and \(\mathbb{R}\) is the set of real numbers.

The study of random variables involves understanding their distributions, which describe how probability is allocated across the possible values of the random variable.

Case Study: Customer Service Calls

A call center receives a random number of calls each hour. The manager wants to understand this randomness to better staff the center.

Problem:

A small business receives customer service calls throughout the day. Define a suitable random variable to model the number of calls received in a 1-hour period.

Solution:

Let's define a random variable \(X\) as "the number of customer service calls received in a 1-hour period."

This random variable has these characteristics:

  • It maps from the sample space (all possible call scenarios) to numerical values
  • It can only take non-negative integer values: 0, 1, 2, 3, ...
  • Each value has some probability of occurring

For practical use, the business might collect data over several weeks to estimate the probabilities:

\(x\) (number of calls) | 0    | 1    | 2    | 3    | 4    | 5+
P(X = x)                | 0.15 | 0.30 | 0.25 | 0.15 | 0.10 | 0.05

With this probability distribution, the business can make informed decisions:

  • The most likely scenario is receiving 1 call per hour (30% probability)
  • There is a 55% chance of receiving 2 or more calls (0.25 + 0.15 + 0.10 + 0.05)
  • They can plan staffing to ensure adequate coverage based on these probabilities
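Summary figures like these can be recomputed directly from the estimated probabilities. The sketch below stores the table as a dictionary; using the key 5 to stand for the "5+" bucket is an assumption made only for this illustration.

```python
# Estimated PMF from the table; the key 5 stands for the "5+" bucket.
pmf = {0: 0.15, 1: 0.30, 2: 0.25, 3: 0.15, 4: 0.10, 5: 0.05}

total = sum(pmf.values())                 # a valid PMF sums to 1
most_likely = max(pmf, key=pmf.get)       # value with the highest probability
p_two_or_more = sum(p for x, p in pmf.items() if x >= 2)
p_at_most_one = 1 - p_two_or_more         # complement: 0 or 1 calls

print(total, most_likely, p_two_or_more, p_at_most_one)
```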

Types of Random Variables

There are two main types of random variables:

Discrete Random Variables

A discrete random variable can take on a countable number of distinct values. Examples include:

  • Number of students in a class
  • Number of heads in 10 coin flips
  • Number of customers arriving at a store in an hour

The probabilities of a discrete random variable satisfy:

\[ P(X = x) \geq 0 \text{ for all } x \] \[ \sum_{x} P(X = x) = 1 \]

Continuous Random Variables

A continuous random variable can take on any value in a continuous range. Examples include:

  • Height of a randomly selected person
  • Time required to complete a task
  • Temperature at a specific location

For a continuous random variable with probability density function \(f\):

\[ P(X = x) = 0 \text{ for all } x \] \[ \int_{-\infty}^{\infty} f(x) dx = 1 \]

Key Differences

Feature      | Discrete Random Variables                          | Continuous Random Variables
Values       | Countable distinct values                          | Uncountable values in a range
Probability  | Can have positive probability at individual points | Zero probability at any individual point
Distribution | Probability mass function (PMF)                    | Probability density function (PDF)

Case Study: Comparing Discrete and Continuous Variables

Case 1: Manufacturing Defects (Discrete)

A quality control engineer inspects 20 randomly selected products from a production line and counts how many of them are defective.

Problem: Model the number of defective products as a random variable.

Solution:

Let X = number of defective products in the sample of 20 items.

  • X is discrete because it can only take whole number values: 0, 1, 2, ..., 20
  • Each value has a specific probability P(X = x)
  • If each product is defective independently with probability 5%, we can model X using a binomial distribution with n = 20 and p = 0.05

For example, the probability of finding exactly 2 defective products would be:

\[ P(X = 2) = \binom{20}{2} (0.05)^2 (0.95)^{18} \approx 0.189 \]

This gives the quality control team a quantitative basis for monitoring production quality.
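The binomial calculation can be reproduced with the standard library. This is a sketch under the same assumptions as the case (n = 20 independent items, p = 0.05); the function name `binomial_pmf` is an illustrative helper, not an API from the original text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.05
p_two = binomial_pmf(2, n, p)                        # exactly 2 defective items
p_at_most_two = sum(binomial_pmf(k, n, p) for k in range(3))

print(round(p_two, 4))          # about 0.1887
print(round(p_at_most_two, 4))  # about 0.9245
```

A monitoring rule could, for instance, flag a sample with many more defective items than the cumulative probability makes plausible; the threshold itself would be a business decision.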

Case 2: Customer Wait Time (Continuous)

A bank is analyzing the time customers spend waiting in line before being served by a teller.

Problem: Model the waiting time as a random variable.

Solution:

Let Y = waiting time (in minutes) for a customer.

  • Y is continuous because time can take any positive real value (2.37 minutes, 3.14159 minutes, etc.)
  • For any exact value y, P(Y = y) = 0
  • We use a probability density function (PDF) to model the distribution
  • For example, if Y follows an exponential distribution with mean wait time of 5 minutes:
\[ f(y) = \frac{1}{5}e^{-y/5} \text{ for } y \geq 0 \]

To find the probability of waiting between 2 and 4 minutes:

\[ P(2 \leq Y \leq 4) = \int_{2}^{4} \frac{1}{5}e^{-y/5} dy = e^{-2/5} - e^{-4/5} \approx 0.221 \]

This allows the bank to make predictions like "about 22.1% of customers will wait between 2 and 4 minutes" and staff accordingly.
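For the exponential distribution the integral has a closed form, so the interval probability is a difference of two CDF values. The sketch below assumes the mean wait of 5 minutes from the case; `exponential_cdf` is an illustrative helper name.

```python
from math import exp

def exponential_cdf(y, mean):
    """P(Y <= y) for an exponential distribution with the given mean."""
    return 1 - exp(-y / mean) if y >= 0 else 0.0

mean_wait = 5.0  # minutes, as in the case study
p_2_to_4 = exponential_cdf(4, mean_wait) - exponential_cdf(2, mean_wait)

print(round(p_2_to_4, 4))  # about 0.221
```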

Probability Distribution

A probability distribution describes how the probabilities are distributed across the values of a random variable.

Probability Mass Function (PMF)

For discrete random variables, we use a probability mass function (PMF).

\[ p(x) = P(X = x) \]

Properties of a PMF:

  • \(p(x) \geq 0\) for all \(x\)
  • \(\sum_{x} p(x) = 1\)

Probability Density Function (PDF)

For continuous random variables, we use a probability density function (PDF).

\[ f(x) = \lim_{\Delta x \to 0} \frac{P(x < X \leq x + \Delta x)}{\Delta x} \]

Properties of a PDF:

  • \(f(x) \geq 0\) for all \(x\)
  • \(\int_{-\infty}^{\infty} f(x) dx = 1\)
  • \(P(a \leq X \leq b) = \int_{a}^{b} f(x) dx\)
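These properties can be checked numerically for a specific density. The sketch below uses an exponential density with mean 5 as the example and a simple midpoint rule; the upper limit of 200 and the step count are assumptions chosen so that the neglected tail and the discretization error are both tiny.

```python
from math import exp

def pdf(x, mean=5.0):
    """Exponential density with the given mean (zero for negative x)."""
    return exp(-x / mean) / mean if x >= 0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = integrate(pdf, 0.0, 200.0)  # should be very close to 1
p_2_to_4 = integrate(pdf, 2.0, 4.0)      # P(2 <= X <= 4) as an area

print(round(total_area, 6), round(p_2_to_4, 4))
```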

Common Probability Distributions

Discrete Distributions

  • Bernoulli: Models a single success/failure experiment
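  • Binomial: Number of successes in n independent Bernoulli trials with the same success probability
  • Poisson: Number of events in a fixed interval when events occur independently at a constant average rate
  • Geometric: Number of trials up to and including the first success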

Continuous Distributions

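  • Uniform: Constant density over an interval
  • Normal (Gaussian): Symmetric bell-shaped curve determined by its mean and standard deviation
  • Exponential: Models the time between events that occur at a constant average rate
  • Beta: Models a random proportion or probability on the interval [0, 1]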

Case Study: Using Probability Distributions in Real Scenarios

Case 1: Email Spam Detection (Binomial Distribution)

A spam filter is being tested on incoming emails. Each email is classified as either spam or not spam.

Problem: If the filter is 95% accurate and you receive 20 emails in a day, what is the probability that exactly 18 emails are correctly classified?

Solution:

This is a classic application of the binomial distribution:

  • Each email classification is a "trial"
  • Each trial has two possible outcomes: correct or incorrect classification
  • The probability of success (correct classification) is p = 0.95
  • We have n = 20 independent trials
  • We want to find P(X = 18) where X is the number of correct classifications
\begin{align*} P(X = 18) &= \binom{20}{18} (0.95)^{18} (0.05)^{2} \\ &= \frac{20!}{18!(20-18)!} (0.95)^{18} (0.05)^{2} \\ &= \frac{20 \cdot 19}{2 \cdot 1} (0.95)^{18} (0.05)^{2} \\ &= 190 \cdot (0.95)^{18} \cdot (0.05)^{2} \\ &\approx 190 \cdot 0.3972 \cdot 0.0025 \\ &\approx 0.1887 \end{align*}

So the probability of exactly 18 correct classifications out of 20 emails is about 18.87%. This information helps the email service provider understand the expected performance of their spam filter.
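One way to sanity-check a binomial probability is to compare the formula with a simulation. The sketch below treats each classification as an independent trial with success probability 0.95, as the case assumes; the seed and the number of simulated days are illustrative.

```python
import random
from math import comb

n, p, k = 20, 0.95, 18

# Exact value from the binomial formula.
exact = comb(n, k) * p**k * (1 - p)**(n - k)

# Monte Carlo estimate: simulate many days of n emails each.
rng = random.Random(42)
days = 50_000
hits = 0
for _ in range(days):
    correct = sum(rng.random() < p for _ in range(n))
    hits += (correct == k)
estimate = hits / days

print(round(exact, 4), round(estimate, 4))
```

The two numbers should agree to roughly two decimal places; the simulated value fluctuates from seed to seed.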

Case 2: Call Center Response Time (Exponential Distribution)

A call center tracks the time customers spend on hold before speaking with a representative.

Problem: If hold times follow an exponential distribution with a mean of 3 minutes, what is the probability that a customer will wait less than 2 minutes?

Solution:

The exponential distribution is often used to model waiting times. For an exponential random variable X with rate parameter λ:

  • The PDF is f(x) = λe^(-λx) for x ≥ 0
  • The mean (average) is E[X] = 1/λ
  • In our case, E[X] = 3 minutes, so λ = 1/3
  • We want to find P(X < 2)

For the exponential distribution:

\begin{align*} P(X < 2) &= 1 - P(X \geq 2) \\ &= 1 - e^{-\lambda \cdot 2} \\ &= 1 - e^{-(1/3) \cdot 2} \\ &= 1 - e^{-2/3} \\ &\approx 1 - 0.5134 \\ &\approx 0.4866 \end{align*}

So the probability that a customer will wait less than 2 minutes is approximately 48.66%. This helps the call center set realistic expectations for customers and adjust staffing to meet service level targets.
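The same question can be approached by simulating hold times. This sketch uses `random.expovariate`, which takes the rate λ (here 1/3 per minute) rather than the mean; the seed and sample size are illustrative assumptions.

```python
import random
from math import exp

rate = 1 / 3                      # lambda = 1 / mean, with mean = 3 minutes
exact = 1 - exp(-rate * 2)        # P(X < 2) from the closed form

rng = random.Random(7)
samples = [rng.expovariate(rate) for _ in range(100_000)]
share_under_2 = sum(t < 2 for t in samples) / len(samples)
mean_hold = sum(samples) / len(samples)

print(round(exact, 4), round(share_under_2, 4), round(mean_hold, 2))
```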

Distribution Function

The Cumulative Distribution Function (CDF) gives the probability that a random variable X is less than or equal to a particular value x.

\[ F(x) = P(X \leq x) \]

CDF for Discrete Random Variables

For a discrete random variable, the CDF is calculated as:

\[ F(x) = \sum_{t \leq x} p(t) \]

Properties:

  • \(0 \leq F(x) \leq 1\) for all \(x\)
  • \(F(-\infty) = 0\) and \(F(\infty) = 1\)
  • \(F(x)\) is non-decreasing
  • \(F(x)\) is right-continuous
  • \(P(a < X \leq b) = F(b) - F(a)\)
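A discrete CDF is a running sum of the PMF. The sketch below uses a fair six-sided die as the example and exact fractions to avoid rounding; the names `values`, `pmf`, and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# Running totals F(1), F(2), ..., F(6).
cdf_steps = list(accumulate(pmf))

def cdf(x):
    """F(x) = sum of p(t) over all t <= x; a right-continuous step function."""
    return sum((p for v, p in zip(values, pmf) if v <= x), Fraction(0))

# P(2 < X <= 5) = F(5) - F(2), i.e. the outcomes 3, 4 and 5.
p_2_to_5 = cdf(5) - cdf(2)

print(cdf_steps, cdf(3.5), p_2_to_5)
```

Note that `cdf(3.5)` equals `cdf(3)`: between the possible values, the function stays flat and only jumps at the points that carry probability.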

CDF for Continuous Random Variables

For a continuous random variable, the CDF is calculated as:

\[ F(x) = \int_{-\infty}^{x} f(t) dt \]

Properties:

  • \(0 \leq F(x) \leq 1\) for all \(x\)
  • \(F(-\infty) = 0\) and \(F(\infty) = 1\)
  • \(F(x)\) is non-decreasing
  • \(F(x)\) is continuous
  • \(f(x) = \frac{dF(x)}{dx}\) where \(f(x)\) is the PDF
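The relationship between the PDF and the CDF can be illustrated numerically: the slope of the CDF at a point should match the density there. The sketch below assumes an exponential distribution with rate 0.2 as the example and uses a central difference; the step size `h` is an illustrative choice.

```python
from math import exp

def cdf(x, rate=0.2):
    """Exponential CDF: F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1 - exp(-rate * x) if x >= 0 else 0.0

def pdf(x, rate=0.2):
    """Exponential PDF: f(x) = rate * exp(-rate * x) for x >= 0."""
    return rate * exp(-rate * x) if x >= 0 else 0.0

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 3.0
slope = numerical_derivative(cdf, x)
print(slope, pdf(x))  # the two values should nearly coincide
```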

Uses of the Cumulative Distribution Function

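  • Finding probabilities of ranges: \(P(a < X \leq b) = F(b) - F(a)\); for a continuous random variable this also equals \(P(a \leq X \leq b)\)
  • Determining percentiles: If \(F(x) = p\), then x is the 100p-th percentile (for example, \(F(x) = 0.90\) gives the 90th percentile)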
  • Generating random samples from a distribution
  • Comparing distributions (e.g., stochastic dominance)
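Percentiles and sample generation both rely on inverting the CDF. The sketch below shows inverse-transform sampling for an exponential distribution, whose CDF can be inverted in closed form; the mean of 5, the seed, and the helper name `exponential_quantile` are assumptions for illustration.

```python
import random
from math import log

def exponential_quantile(p, mean):
    """Inverse CDF: the x with F(x) = p, where F(x) = 1 - exp(-x / mean)."""
    return -mean * log(1 - p)

mean = 5.0
median = exponential_quantile(0.5, mean)   # 50th percentile

# Inverse-transform sampling: feed uniform(0, 1) draws through the inverse CDF.
rng = random.Random(1)
samples = [exponential_quantile(rng.random(), mean) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(round(median, 4), round(sample_mean, 2))
```

For distributions without a closed-form inverse, the same idea is usually applied with a numerical root finder or a lookup table in place of the formula.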

Case Study: Using CDFs in Risk Assessment

Insurance Claim Amounts

An insurance company models claim amounts using a probability distribution to assess risk and set premiums.

Problem: Historical data shows that auto insurance claims (in dollars) follow a lognormal distribution with parameters μ = 7.5 and σ = 1.2, the mean and standard deviation of the natural logarithm of the claim amount. The company wants to determine:

  1. The probability that a claim will be less than $2,000
  2. The 90th percentile of claim amounts (to set reserves)

Solution:

For a lognormal distribution with parameters μ and σ, the CDF is:

\[ F(x) = \Phi\left(\frac{\ln(x) - \mu}{\sigma}\right) \]

where Φ is the standard normal CDF.

1. To find the probability that a claim will be less than $2,000:

\begin{align*} P(X \leq 2000) &= F(2000) \\ &= \Phi\left(\frac{\ln(2000) - 7.5}{1.2}\right) \\ &= \Phi\left(\frac{7.6009 - 7.5}{1.2}\right) \\ &= \Phi(0.0841) \\ &\approx 0.5335 \end{align*}

So there's approximately a 53.35% chance that a claim will be less than $2,000.

2. To find the 90th percentile, we need to find the value x such that F(x) = 0.90:

\begin{align*} 0.90 &= \Phi\left(\frac{\ln(x) - 7.5}{1.2}\right) \\ \Phi^{-1}(0.90) &= \frac{\ln(x) - 7.5}{1.2} \\ 1.282 &= \frac{\ln(x) - 7.5}{1.2} \\ 1.282 \times 1.2 &= \ln(x) - 7.5 \\ 1.538 + 7.5 &= \ln(x) \\ 9.038 &= \ln(x) \\ x &= e^{9.038} \\ x &\approx 8,400 \end{align*}

So the 90th percentile is approximately $8,400, meaning 90% of claims are expected to be below this amount.
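Both results can be reproduced with `statistics.NormalDist` from the Python standard library, which provides the standard normal CDF and its inverse. The helper names `lognormal_cdf` and `lognormal_quantile` are illustrative, and the parameters are the ones given in the problem.

```python
from math import exp, log
from statistics import NormalDist

mu, sigma = 7.5, 1.2          # parameters of ln(claim amount)
std_normal = NormalDist()     # mean 0, standard deviation 1

def lognormal_cdf(x):
    """F(x) = Phi((ln x - mu) / sigma) for x > 0."""
    return std_normal.cdf((log(x) - mu) / sigma)

def lognormal_quantile(p):
    """The x with F(x) = p, obtained by inverting the CDF."""
    return exp(mu + sigma * std_normal.inv_cdf(p))

p_below_2000 = lognormal_cdf(2000)
q90 = lognormal_quantile(0.90)

print(round(p_below_2000, 4))  # about 0.5335
print(round(q90))              # a little over 8,400
```

The unrounded quantile differs slightly from the hand calculation because the code does not round the intermediate value 1.282.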

Practical implications:

  • The company can use the CDF to price policies appropriately, ensuring premiums cover expected claim amounts
  • Holding $8,400 in reserve for each claim would fully cover about 90% of individual claims
  • For catastrophic coverage, they might look at even higher percentiles (e.g., 95th or 99th)
  • Understanding the distribution helps with reinsurance decisions for extremely large claims

Interactive Examples

Experiment with different probability distributions to better understand their properties.

Binomial Distribution Simulator

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.
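A text-only version of this experiment can be run locally. The sketch below tabulates the PMF and CDF for n = 10 and p = 0.5, which are illustrative values that can be changed freely; the helper names and the crude text bars are assumptions of this sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """P(X <= k), obtained by summing the PMF from 0 to k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.5  # illustrative parameters; try other values

print(" k   P(X=k)  P(X<=k)")
for k in range(n + 1):
    prob = binomial_pmf(k, n, p)
    bar = "#" * round(prob * 100)   # crude text histogram of the PMF
    print(f"{k:2d}   {prob:.4f}  {binomial_cdf(k, n, p):.4f}  {bar}")
```

Changing `p` away from 0.5 skews the bars toward 0 or n, and increasing `n` makes the shape look more bell-like.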

Normal Distribution Simulator

The normal distribution is a continuous probability distribution that is symmetric about the mean, showing that data near the mean are more frequent than data far from the mean.
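The density, the CDF, and interval probabilities of a normal distribution can be evaluated with `statistics.NormalDist`. The sketch below uses mean 0 and standard deviation 1 as illustrative parameters; `interval_probability` is a hypothetical helper name.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # illustrative parameters

def interval_probability(d, a, b):
    """P(a <= X <= b) = F(b) - F(a) for a normal distribution d."""
    return d.cdf(b) - d.cdf(a)

density_at_mean = dist.pdf(0.0)                        # peak of the bell curve
within_one_sd = interval_probability(dist, -1.0, 1.0)  # about 0.6827
within_two_sd = interval_probability(dist, -2.0, 2.0)  # about 0.9545

print(round(density_at_mean, 4), round(within_one_sd, 4), round(within_two_sd, 4))
```

Because the density is symmetric about the mean, `dist.pdf(x)` and `dist.pdf(-x)` agree here for any `x`.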

Test Your Knowledge

Answer the following questions to test your understanding of random variables and probability distributions.

Question 1:

In a discrete uniform distribution with outcomes {1, 2, 3, 4, 5, 6}, what is P(X = 4)?

A) 1/3
B) 1/4
C) 1/6
D) 1/2

Explanation: In a discrete uniform distribution, all outcomes have the same probability. Since there are 6 possible outcomes, P(X = 4) = 1/6.

Question 2:

For a continuous random variable X, what is P(X = c) for any specific value c?

A) 1
B) 0
C) 0.5
D) It depends on the value of c

Explanation: For a continuous random variable, the probability of any specific point is always 0. A continuous random variable can take uncountably many values in an interval, so probabilities are computed as integrals of the density over ranges rather than as sums over individual points, and the integral over a single point is 0.

Question 3:

In a binomial distribution with n = 10 and p = 0.2, what is the expected value (mean) of X?

A) 2
B) 0.2
C) 10
D) 5

Explanation: For a binomial distribution, the expected value (mean) is E[X] = n·p = 10 × 0.2 = 2. This represents the average number of successes we expect in 10 trials with a success probability of 0.2.

Question 4:

If F(x) is the cumulative distribution function (CDF) of a random variable X, which of the following is NOT a property of F(x)?

A) 0 ≤ F(x) ≤ 1 for all x
B) F(x) is non-decreasing
C) F(-∞) = 0 and F(∞) = 1
D) F(x) must be a continuous function for all random variables

Explanation: Option D is NOT a property of all CDFs. While the CDF of a continuous random variable is continuous, the CDF of a discrete random variable is typically a step function with jumps at the points where the random variable has positive probability. The other properties (A, B, and C) are valid for all CDFs.

Question 5:

A normal distribution has a mean of 70 and a standard deviation of 5. Approximately what percentage of values falls within the range [65, 75]?

A) 50%
B) 68%
C) 95%
D) 99.7%

Explanation: The range [65, 75] corresponds to [μ-σ, μ+σ], or one standard deviation on either side of the mean. According to the 68-95-99.7 rule (empirical rule) for normal distributions, approximately 68% of values fall within one standard deviation of the mean.