Data scientist: Types of Data Distribution – Data Distribution Fundamentals

How can data be distributed?

Data can be distributed in different ways, so it is important to understand how a data set is distributed before drawing inferences about the whole population we are interested in. The main types of data distribution are:

  • Binomial distribution (Discrete)
  • Bernoulli distribution (Discrete)
  • Normal/Gaussian distribution (Continuous)
  • Poisson distribution (Discrete)
  • Power Law Distribution (Discrete)
  • Binomial distribution (Discrete). Each trial has a binary outcome (yes/no, true/false), and the experiment is repeated N times independently with the same success probability p. The distribution gives the probability of each possible number of successes.
  • Bernoulli distribution (Discrete). The special case of the binomial distribution in which the experiment is performed only once, so N = 1. The outcome k is 1 with probability p and 0 with probability 1 - p:

$p(k; p) = p^k (1 - p)^{1 - k}$ for $k \in \{0, 1\}$
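The Bernoulli formula above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the binomial formula C(N, k) p^k (1 - p)^(N - k) is the standard textbook one, which is not written out in this article.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0
    # Standard binomial formula: C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def bernoulli_pmf(k: int, p: float) -> float:
    """Bernoulli case: a binomial with a single trial (n = 1)."""
    return binomial_pmf(k, 1, p)


if __name__ == "__main__":
    p = 0.3  # illustrative success probability
    print("P(k=1):", bernoulli_pmf(1, p))
    print("P(k=0):", bernoulli_pmf(0, p))
    # The probabilities over all possible counts should add up to 1.
    print("sum over k:", sum(binomial_pmf(k, 10, p) for k in range(11)))
```

Writing the Bernoulli case as a call to the binomial function with n = 1 mirrors the relationship between the two distributions described above.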

  • Normal/Gaussian distribution (Continuous). This distribution is fully described by its mean and standard deviation: the mean locates the centre of the data and the standard deviation measures how far values spread around it. For normally distributed data, about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. The density is a symmetric bell-shaped curve, so the part above the mean mirrors the part below it. Normality is usually checked graphically (for example with a histogram or a Q-Q plot) or with quantitative tests.
  • Poisson distribution (Discrete). It gives the probability of observing a given number of events in a fixed interval of time or space, when events occur independently and at a constant average rate Lambda (written L below). It is typically used for rare, unpredictable events, when we want the probability that such an event happens a certain number of times in a specific interval.

$p(k; L) = \frac{L^k e^{-L}}{k!}$

k = number of occurrences in the interval (0, 1, 2, ...)

L = average rate of occurrence (expected number of occurrences per interval)
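As a rough illustration, the Poisson formula above translates directly into Python. The function name and the example rate of 3 events per interval are assumptions chosen for this sketch, not values from the article.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when `rate` events are expected per interval."""
    if k < 0:
        return 0.0
    # L^k * exp(-L) / k!  with L = rate
    return rate**k * exp(-rate) / factorial(k)


if __name__ == "__main__":
    rate = 3.0  # hypothetical: 3 events expected per interval
    for k in range(6):
        print(k, round(poisson_pmf(k, rate), 4))
```

Summing the function over k = 0, 1, 2, ... gives a total that approaches 1, which is a quick sanity check on any implementation.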

  • Power law distribution (Discrete). One quantity varies as a power of another: $P(X = x) = c x^{-a}$, where a is the exponent of the law and c is the normalizing constant that makes the probabilities sum to 1.
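To make the role of the normalizing constant c concrete, here is a minimal sketch. It assumes the variable takes the values 1, 2, ..., x_max (a truncated support picked for this example; the article does not specify one), and the function name is hypothetical.

```python
def power_law_pmf(a: float, x_max: int) -> dict[int, float]:
    """Discrete power law P(X = x) = c * x^(-a) on x = 1..x_max (assumed support)."""
    weights = {x: x ** -a for x in range(1, x_max + 1)}
    # c is chosen so that all probabilities add up to 1.
    c = 1.0 / sum(weights.values())
    return {x: c * w for x, w in weights.items()}


if __name__ == "__main__":
    pmf = power_law_pmf(a=2.0, x_max=1000)
    print("P(X=1):", pmf[1])
    print("P(X=1) / P(X=2):", pmf[1] / pmf[2])  # equals 2^a
    print("total:", sum(pmf.values()))
```

The ratio between two probabilities depends only on the ratio of the x values and on a, which is the defining property of a power law.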

Two further concepts are related to data distribution. They explain why the sample mean and standard deviation, computed with the usual formulas, can be used to infer the mean and standard deviation of the whole population. These are:

  • The law of large numbers
  • The central limit theorem

The law of large numbers explains why the average of a large random sample tends to be close to the average of the population: as the sample size grows, the sample mean converges to the population mean.
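A small simulation can make the law of large numbers tangible. The sketch below assumes a fair six-sided die, whose population mean is 3.5; the function name and the fixed seed are choices made for this illustration only.

```python
import random


def mean_of_die_rolls(n_rolls: int, seed: int = 0) -> float:
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(rng.randint(1, 6) for _ in range(n_rolls))
    return total / n_rolls


if __name__ == "__main__":
    # The population mean of a fair die is (1 + 2 + ... + 6) / 6 = 3.5.
    for n in (10, 100, 10_000, 100_000):
        print(n, mean_of_die_rolls(n))
```

With small samples the average can sit noticeably away from 3.5, while with larger samples it tends to settle close to it.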

The central limit theorem states that the sum Sn of n independent random variables with finite, non-zero variance (identically distributed, in the classical version of the theorem) has a distribution that, once standardized, approaches a normal distribution as n grows.
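The central limit theorem can also be explored by simulation. The sketch below is one possible setup, not a proof: it assumes each term is uniform on [0, 1] (mean 1/2, variance 1/12), sums 30 of them, standardizes the sums, and checks how many fall within 1 and 3 standard deviations. For a normal distribution those fractions are about 68% and 99.7%. All names and parameter values are illustrative.

```python
import random
from math import sqrt


def standardized_sums(n_terms: int, n_trials: int, seed: int = 0) -> list[float]:
    """Standardized sums of n_terms independent Uniform(0, 1) variables."""
    rng = random.Random(seed)
    mean = n_terms * 0.5          # mean of the sum
    sd = sqrt(n_terms / 12.0)     # standard deviation of the sum
    return [
        (sum(rng.random() for _ in range(n_terms)) - mean) / sd
        for _ in range(n_trials)
    ]


def fraction_within(values: list[float], k: float) -> float:
    """Share of values lying within k standard deviations of zero."""
    return sum(abs(v) <= k for v in values) / len(values)


if __name__ == "__main__":
    z = standardized_sums(n_terms=30, n_trials=20_000)
    print("within 1 sd:", fraction_within(z, 1))
    print("within 3 sd:", fraction_within(z, 3))
```

Although each individual term is uniformly distributed, the fractions for the standardized sums should land near the values expected for a bell-shaped curve.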