Assuming a constant true failure rate λ, the number of failures per time follows a Poisson distribution, Poisson(λ), just as the number of heads above followed a binomial distribution.

Also as above, we’d instead like to understand the distribution of the failure rate λ.

It follows a gamma distribution with parameters ɑ and β.

As with the beta distribution, they have a convenient interpretation.

After seeing f failures in t days, the distribution of λ is Gamma(f,t).

import numpy as np import scipy.

stats as stats from pyspark.

sql.

types import * to_gamma_schema = ArrayType(StructType([StructField(“rates”, DoubleType()), StructField(“pdfs”, DoubleType())])) def build_gamma_udf(df, max_scale_factor=1): max_avg_failure = df.

select(max(“avg(failure)”)).

collect()[0][0] # Computes the PDF of the gamma distribution at points for each models rate param def to_gamma_pdf(count, avgf): # k = alpha = number of failures k = count * avgf # theta = 1/beta, where beta = total observations theta = 1.

0 / count # Pick rates to try with more samples at small rates rates = np.

geomspace(max_avg_failure / 1000, max_avg_failure * max_scale_factor, 100) # Calculate the probability distribution function at those points pdfs = stats.

gamma.

pdf(rates, a=k, scale=theta) return list(zip(rates.

tolist(), pdfs.

tolist())) return udf(to_gamma_pdf, to_gamma_schema) to_gamma_pdf_udf = build_gamma_udf(agg_df) gamma_df = agg_df.

withColumn(“rates_pdfs”, explode(to_gamma_pdf_udf(“count(1)”, “avg(failure)”))).

select(“model”, col(“rates_pdfs.

rates”).

alias(“rate”), col(“rates_pdfs.

pdfs”).

alias(“pdf”)) display(gamma_df.

withColumn(“pct_rate”, col(“rate”) * 100.

0)) Notice how different the distribution of estimated failure rate is for various models.

The ST8000DM004 has a very wide distribution across, admittedly, higher failure rates; it’s cut off for readability here, but the bulk of its distribution extends past 0.

4% per year (again note that the plot is annualized failure rate, for readability).

This is because we have relatively very few observations of this model: Zoom in on the two peaked distributions at the left, for the “ST500LM012 HN” and “WDC WD5000LPVX”: Interestingly, the WDC’s most likely annualized failure rate is lower than the ST500’s — about 0.

015% vs 0.

02%.

However, we have fewer observations of the WDC, and its distribution is wider.

If the question were, which drive is more likely to have a failure rate above 0.

03%, then based on this information, it would be the WDC.

This might actually be relevant if, for example, deciding which drives are most likely to be out of manufacturer’s tolerances.

Distributions matter!.Not just MAP estimates.

Analyzing Failures in a Stream with Priors This data doesn’t arrive all at once, in reality.

It arrives in a stream, and so it’s natural to run these kind of queries continuously.

This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing.

Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data.

This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!.The gamma distribution is, conveniently, conjugate to the Poisson distribution, so we can use the gamma distribution to represent the prior, and can trivially compute the posterior gamma distribution after observing some data without a bunch of math.

Because of the interpretation of the parameters ɑ and β as number of failures and number of days, these values can be computed over historical data and then just added to the parameters computed and updated continuously by a Structured Streaming job.

Observe: priors_df = spark.

read.

option(“header”, True).

option(“inferSchema”, True).

csv(data_path + “/historical/”).

select(“model”, “failure”).

groupBy(“model”).

agg({“*”: “count”, “failure”: “avg”}).

withColumn(“alpha”, col(“count(1)”) * col(“avg(failure)”)).

withColumn(“beta”, col(“count(1)”)).

select(“model”, “alpha”, “beta”).

cache() agg_stream = spark.

readStream.

schema(raw_df.

schema).

option(“header”, True).

option(“maxFilesPerTrigger”, 1).

csv(data_path + “/current/”).

select(“model”, “failure”).

groupBy(“model”).

agg({“*”: “count”, “failure”: “avg”}).

filter(“avg(failure) > 0”).

orderBy(desc(“avg(failure)”)).

limit(6) max_avg_failure = priors_df.

select(avg(col(“alpha”) / col(“beta”)).

alias(“failure_rate”)).

agg(max(“failure_rate”)).

collect()[0][0] def to_gamma_pdf(count, avgf, prior_alpha, prior_beta): # Note: priors are added below k = prior_alpha + count * avgf theta = 1.

0 / (prior_beta + count) rates = np.

geomspace(max_avg_failure / 1000, max_avg_failure, 100) pdfs = stats.

gamma.

pdf(rates, a=k, scale=theta) return list(zip(rates.

tolist(), pdfs.

tolist())) to_gamma_pdf_udf = udf(to_gamma_pdf, to_gamma_schema) gamma_stream = agg_stream.

join(priors_df, “model”, how=left).

fillna(0, subset=[“alpha”, “beta”]).

withColumn(“gamma”, to_gamma_pdf_udf(“count(1)”, “avg(failure)”, “alpha”, “beta”)).

select(“model”, explode(“gamma”).

alias(“rates_pdfs”)).

select(“model”, col(“rates_pdfs.

rates”).

alias(“rate”), col(“rates_pdfs.

pdfs”).

alias(“pdf”)) Historical data yields the ɑ and β from the gamma-distributed priors for each model, and this is easily joined into the same analysis on a stream of data (here, simulated as a ‘stream’ from the data we analyzed above).

When run, this yields an updating plot of failure rate distributions for the models with the highest failure rate.

It could as easily produce alerts or other alerts based on other functions of these distributions, with virtually the same code that produced a simple offline analysis.

Conclusion There’s more going on when you fit a distribution than you may think.

You may be employing ‘frequentist’ or ‘Bayesian’ reasoning without realizing it.

There are advantages to the Bayesian point of view, because it gives a place for prior knowledge about the answer, and yields a distribution over answers instead of one estimate.

This perspective can easily be put into play in real-world applications with Spark’s Structured Streaming and understanding of prior and posterior distributions.

Try Databricks for free.

Get started today.. More details