Poisson

Instance of: Distribution

AKA:

Distinct from:

English: Useful for describing the number of events that occur within a fixed interval of time when the events are independent of one another and happen at a constant average rate.

Formalization:

Probability Mass Function \[ f(k;\lambda)=Pr(X=k)=\frac{\lambda^k e^{-\lambda}}{k!} \]

The Poisson has only one parameter, \(\lambda\), which is both its mean and its variance: \[ \lambda=E(X)=Var(X) \]

k is the number of occurrences {0,1,2,…}

e is Euler’s number

! is factorial
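The formula above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `poisson_pmf`, the choice \(\lambda = 3\), and the truncation of the infinite support at 60 are illustrative assumptions, not part of the definition.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam): lam**k * exp(-lam) / k!."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0          # illustrative rate (assumption)
ks = range(0, 60)  # truncated support; the tail beyond 60 is negligible for lam = 3
probs = [poisson_pmf(k, lam) for k in ks]

# The probabilities should sum to 1, and mean and variance should both equal lam.
total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"sum of probabilities: {total:.6f}")
print(f"mean: {mean:.6f}, variance: {var:.6f}")
```

Both the mean and the variance computed this way should agree with `lam` up to floating-point error.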

Cites/Notes: Wikipedia - Poisson distribution; Wikidata

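The binomial distribution converges to the Poisson in the limit where the number of trials n goes to infinity while the probability of success p shrinks so that the expected number of successes np stays fixed at \(\lambda\) (that is, p = \(\lambda\)/n).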
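A small numerical illustration of this limit, as a sketch under stated assumptions: \(\lambda = 2\), the values of n, and the helper names are arbitrary choices made for this example. It compares the Binomial(n, \(\lambda\)/n) and Poisson(\(\lambda\)) probabilities for k = 0..10 and reports the largest gap for each n.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # illustrative rate (assumption)

def max_abs_diff(n: int) -> float:
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) over k = 0..10."""
    p = lam / n  # success probability shrinks as n grows, so n * p stays equal to lam
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(0, 11))

diffs = {n: max_abs_diff(n) for n in (10, 100, 1000, 10000)}
for n, d in diffs.items():
    print(f"n = {n:5d}  max |binomial - poisson| = {d:.6f}")
```

The gap should shrink as n grows, which is the convergence described above.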

The canonical example of a Poisson process is deaths by horse kick in the Prussian army, data collected by Ladislaus Josephovich Bortkiewicz (1868–1931) in his book “The Law of Small Numbers” (Bortkiewicz and Bortkevič 1898). (Note that the distribution was already well known by then: the result appeared as early as 1711, and it had previously been used for estimating wrongful convictions.)

Poisson distribution – Horse kick data; A Prussian Poisson Process

Sums of independent Poisson-distributed random variables are also Poisson distributed, with rate equal to the sum of the individual rates.
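This closure under addition can be verified directly from the PMF. The sketch below assumes two independent variables with illustrative rates 1.5 and 2.5, convolves their PMFs, and compares the result with a single Poisson PMF at rate 4.0; the helper names are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam1, lam2 = 1.5, 2.5  # illustrative rates (assumption)

def sum_pmf(k: int) -> float:
    """P(X + Y = k) for independent X ~ Poisson(lam1), Y ~ Poisson(lam2), by convolution."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))

# Compare the convolution with Poisson(lam1 + lam2) on k = 0..20.
max_gap = max(abs(sum_pmf(k) - poisson_pmf(k, lam1 + lam2)) for k in range(0, 21))
print(f"largest gap over k = 0..20: {max_gap:.2e}")
```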

Extensions

Varying arrival rate - mixed Poisson distribution

Arrival in groups rather than individuals - compound Poisson

No zero counts - zero-truncated Poisson

Too many zeros - zero-inflated Poisson
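Two of the extensions listed above have simple closed-form PMFs, sketched here with the standard library. The zero-truncated version conditions on X > 0, and the zero-inflated version mixes a point mass at zero with an ordinary Poisson. The parameter name `pi_zero` (the extra probability of a structural zero) and the numeric values are assumptions made for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def zero_truncated_pmf(k: int, lam: float) -> float:
    """P(X = k | X > 0): the Poisson PMF renormalised after removing k = 0."""
    if k < 1:
        return 0.0
    return poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

def zero_inflated_pmf(k: int, lam: float, pi_zero: float) -> float:
    """Mixture: with probability pi_zero emit 0, otherwise draw from Poisson(lam)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

lam, pi_zero = 2.0, 0.3  # illustrative values (assumption)
ks = range(0, 80)        # truncated support; the tail is negligible for lam = 2

zt_total = sum(zero_truncated_pmf(k, lam) for k in ks)
zi_total = sum(zero_inflated_pmf(k, lam, pi_zero) for k in ks)
zt_mean = sum(k * zero_truncated_pmf(k, lam) for k in ks)

print(f"zero-truncated sums to {zt_total:.6f}, mean {zt_mean:.6f}")
print(f"zero-inflated sums to {zi_total:.6f}, P(0) = {zero_inflated_pmf(0, lam, pi_zero):.6f}")
```

The zero-truncated mean should equal \(\lambda / (1 - e^{-\lambda})\), which is larger than \(\lambda\) because the zeros have been removed.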

Code

Imports and spin up toy data objects and databases.
toy_vector_numeric <- c(1,2,3,4,5)
toy_vector_character <- c('a','b','c','d','e')
toy_matrix <- matrix(1:9, nrow=3,ncol=3)
toy_list <- list('a','1',TRUE,c('red','green')) # TRUE rather than T, since T can be reassigned
toy_df <- data.frame(id=c('unit1','unit2','unit3'), y=c(1,2,3), x= c(3,2,1))

toy_dirty_df <- data.frame(id=c('','NA','NaN','inf'), y=c(1,2,3,4), x= c(0,NA,NaN,Inf)) #can't explicitly include NULL 

library(data.table)
toy_dt <- data.table(id=c('unit1','unit2','unit3'), y=c(1,2,3), x= c(3,2,1))
toy_dirty_dt <- as.data.table(toy_dirty_df)
library(tidyverse)
library(arrow)

import numpy as np
toy_vector_numeric = np.array([1,2,3,4,5])
toy_vector_character = np.array(['a','b','c','d','e'])
toy_list = ['a','1',True,['red','green']]
toy_dictionary = { 'a':1 , 'b':2, 'c':3}

from jax import numpy as jnp
toy_vector_numeric_jax = jnp.array([1,2,3,4,5])
#toy_vector_character_jax = jnp.array(['a','b','c','d','e']) #only numeric is allowed in jax
import pandas as pd
toy_df = pd.DataFrame(data={'id': ['unit1','unit2','unit3'], 'y': [1, 2, 3], 'x': [3, 2, 1]})

import torch

import tensorflow as tf

import pyarrow as pa
library(DBI)
# Create an ephemeral in-memory RSQLite database
#con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
#dbListTables(con)
#dbWriteTable(con, "mtcars", mtcars)
#dbListTables(con)

#Configuration failed because libpq was not found. Try installing:
#* deb: libpq-dev libssl-dev (Debian, Ubuntu, etc)
#install.packages('RPostgres')
#remotes::install_github("r-dbi/RPostgres")
#Took forever because my file permissions were broken
#pg_lsclusters
library(RPostgres) # library() stops with an error if the package is missing; require() only warns
# Connect to the default postgres database
#I had to follow these instructions and create both a username and database that matched my ubuntu name
#https://www.digitalocean.com/community/tutorials/how-to-install-postgresql-on-ubuntu-20-04-quickstart
con_Postgres <- dbConnect(RPostgres::Postgres())

DROP TABLE IF EXISTS toy_df;

CREATE TABLE IF NOT EXISTS toy_df (
    id VARCHAR(5),
    y INTEGER,
    x INTEGER
);

INSERT INTO toy_df (id, y, x)
VALUES
    ('unit1',1,3),
    ('unit2',2,2),
    ('unit3',3,1);
    
#install.packages("duckdb")
library("DBI")
con_duckdb <- dbConnect(duckdb::duckdb(), ":memory:")
#pip install duckdb==0.6.0
import duckdb
con_duckdb = duckdb.connect()

0.1 R

A Prussian Poisson Process

Base

The Poisson Distribution

rpois(n=100000, lambda=1)  %>% hist()

rpois(n=100000, lambda=3)  %>% hist()

rpois(n=100000, lambda=6)  %>% hist()

myURL <- 'https://raw.githubusercontent.com/SmilodonCub/MSDS2020_Bridge/master/VonBort.csv' 
gitVonBort <- read.csv( url( myURL ) ) # read.csv is a built in function that will read the csv data as an R dataframe.
#head( gitVonBort ) # head() will by default display the first 6 rows of the dataframe
head( gitVonBort, 4 ) # the second argument customizes the number of lines made visible
  deaths year corps fisher
1      0 1875     G     no
2      0 1875     I     no
3      0 1875    II    yes
4      0 1875   III    yes

deaths: number (int) of deaths in a year
year: year of the data entry given as a number (int)
corps: a factor indicating which corps the data entry corresponds to.
The fourth feature, ‘fisher’, is less intuitive: it is a factor that indicates whether the corps was included in the analysis performed by R.A. Fisher in 1925. Bortkiewicz qualitatively fitted the Poisson distribution to his data, but Fisher was the first to quantitatively demonstrate the goodness of fit of the Poisson probability model to the horse-kick data via the chi-squared test (Merikoski 2017). The remaining data were excluded from that analysis because they were considered to come from heterogeneous corps. For instance, corps ‘G’ was an elite cavalry corps. For this analysis, we will use the same subset of the horse-kick data that Fisher used; that is, the data entries where ‘fisher’ = ‘yes’.

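# Note: the two pipelines below use every row; to restrict to Fisher's subset, add filter(fisher == 'yes') before pull()
gitVonBort %>% pull(deaths) %>% hist()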

gitVonBort %>% pull(deaths) %>% mean()
[1] 0.7
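For a Poisson model, the sample mean is the maximum likelihood estimate of \(\lambda\), so the value 0.7 printed above can be plugged straight into the PMF. The sketch below is in Python (standard library only) rather than R; the variable names are illustrative, and the only number taken from the data is the reported mean.

```python
import math

lam_hat = 0.7  # sample mean of deaths reported above, used as the estimate of lambda

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Fitted probability of k deaths in one corps-year, for k = 0..4.
fitted = {k: poisson_pmf(k, lam_hat) for k in range(5)}
for k, p in fitted.items():
    print(f"P(deaths = {k}) = {p:.4f}")

# Everything not covered by k = 0..4 is the upper tail.
tail = 1.0 - sum(fitted.values())
print(f"P(deaths >= 5) = {tail:.4f}")
```

Multiplying each fitted probability by the number of rows in the data would give expected frequencies to set against the histogram.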

Tidyverse

DataTable

Arrow

0.2 Python

0.2.0.1 3.x / math/ statistics
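A possible standard-library counterpart to the `rpois(...) %>% hist()` calls in the R section. Python's `random` module has no Poisson sampler, so this sketch swaps in Knuth's multiplication method, which is simple but slow for large \(\lambda\). The function name `sample_poisson`, the seed, and the text histogram are assumptions made for illustration.

```python
import math
import random
import statistics
from collections import Counter

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Multiply uniform(0, 1) draws until the running product falls to
    exp(-lam) or below; the number of extra draws needed is the sample.
    """
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(42)  # fixed seed so the run is reproducible
lam = 3.0
draws = [sample_poisson(lam, rng) for _ in range(100_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.pvariance(draws)
print(f"sample mean: {sample_mean:.3f}, sample variance: {sample_var:.3f}")

# Crude text histogram: one '#' per 1000 draws.
counts = Counter(draws)
for k in sorted(counts):
    print(f"{k:2d} {'#' * (counts[k] // 1000)}")
```

The sample mean and variance should both land close to `lam`, matching the equal-mean-and-variance property of the distribution.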

0.2.0.2 NumPy / SciPy / scikit-learn

0.2.0.3 Pandas

0.3 Jax

0.4 Numpyro

0.5 Stan

0.6 Torch

0.7 Tensorflow

References

Bortkiewicz, Ladislaus von, and Vladislav I. Bortkevič. 1898. Das Gesetz der kleinen Zahlen. B.G. Teubner.