
Simple Fraud Detection with PyMC

26.1.2023 | 7 minutes of reading time

In one of my recent projects, we were facing a prediction problem with very limited data. Each set of data took a specialist hours to compile, and the results were not always successful. We were therefore looking for a tool that could handle these requirements, as artificial intelligence could not be trained with such a limited amount of raw data. Thus, we turned to statistical approaches, namely Bayesian statistics with the Python package PyMC. I will explain the theory it is based on and illustrate it with the example of fraud detection on dice.

Bayesian statistics

Bayesian statistics is a branch of statistics that utilises Bayes' theorem to update our beliefs about the probability of a hypothesis as new data becomes available. Bayes' theorem states that the probability of a hypothesis H given some data D, p(H|D), is equal to the probability of the data given the hypothesis, p(D|H), multiplied by the prior probability of the hypothesis, p(H), divided by the total probability of the data, p(D). This allows us to update our beliefs about a hypothesis as new data becomes available, rather than relying solely on the data at hand.
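
To make this concrete, here is a tiny numerical sketch with made-up values, in the spirit of the dice example below: H is the hypothesis "the die is tampered", D is the observation "a single roll shows a 6".

# made-up numbers for illustration only
p_h = 0.5                 # prior p(H): before seeing data we are undecided
p_d_given_h = 7 / 12      # likelihood p(D|H): the tampered die shows a 6 with probability 7/12
p_d_given_not_h = 1 / 6   # p(D|not H): a fair die shows a 6 with probability 1/6

# total probability of the data, p(D)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' theorem: p(H|D) = p(D|H) * p(H) / p(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)        # about 0.78: a single 6 already shifts our belief towards "tampered"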

One of the most popular Python packages for implementing Bayesian statistics is PyMC. PyMC is a powerful package that allows users to easily define, fit, and analyse Bayesian models. It includes many built-in distributions, such as the normal, binomial, and Poisson distributions, as well as several samplers, such as Metropolis-Hastings and the No-U-Turn Sampler (NUTS). PyMC also provides convenient tools for diagnosing and visualising the results of your model.
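
As a first impression of the API, here is a minimal sketch (not part of the dice example) that estimates the mean of some noisy measurements; PyMC picks NUTS automatically for the continuous parameter.

import numpy as np
import pymc as pm

data = np.random.normal(loc=1.0, scale=0.5, size=50)    # some noisy measurements

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=10)                 # prior on the unknown mean
    pm.Normal("obs", mu=mu, sigma=0.5, observed=data)    # likelihood of the observed data
    trace = pm.sample()                                  # MCMC sampling, NUTS by default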

An important thing to keep in mind when using PyMC is that you should think about the problem in terms of probability distributions and not in terms of point estimates. This can take some getting used to, but it is essential for accurately modelling complex systems. Every result you obtain is a probability distribution, and credible intervals are therefore a crucial part of any result from PyMC. This is an advantage over neural networks, which typically return a single point estimate without any measure of uncertainty.
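
For example, ArviZ (installed alongside PyMC) can summarise a trace as distributions and credible intervals; a short sketch, assuming trace comes from a pm.sample() call as in the snippet above:

import arviz as az

print(az.summary(trace))   # posterior mean, standard deviation and highest-density interval per parameter
az.plot_posterior(trace)   # plots the full posterior distributions instead of point estimates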

Bayesian statistics uses the terms prior probability and posterior probability. The prior probability is the probability distribution before any data has been seen: it describes the plausible range of outcomes for all assumed parameters based purely on our initial guess and is written p(H). The posterior probability, however, is the distribution after taking the data into account, i.e., it is proportional to p(D|H) times p(H).
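
For the dice model used below, this update can even be written down by hand, because the Dirichlet prior is conjugate to the categorical likelihood: the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed face counts. A small sketch with hypothetical counts:

import numpy as np

alpha_prior = np.ones(6)                   # Dirichlet(1, ..., 1): every face equally plausible a priori
counts = np.array([9, 11, 8, 10, 12, 50])  # hypothetical face counts from 100 rolls

alpha_posterior = alpha_prior + counts     # conjugate update: prior parameters plus observed counts
posterior_mean = alpha_posterior / alpha_posterior.sum()
print(posterior_mean)                      # face 6 ends up with a posterior mean of roughly 0.48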

Rolling a tampered dice

As an example, we are using PyMC to model dice rolls. We want to find out whether a die has been tampered with or is fair. We model the probability of each face (1–6) with a Dirichlet distribution that assigns equal weight to all faces a priori. We also assume that the observed data, i.e., the outcomes of the dice rolls, follows a categorical distribution with parameter p, the inferred probability of each face. Without any data, this is the prior probability. We then perform inference on the model using Markov Chain Monte Carlo (MCMC) sampling and use the samples to infer the probability of each face, i.e., the posterior probability. The tampered die is loaded so that a 6 comes up with probability 7/12 – a bias that should be really obvious.

Here is some sample code:

import numpy as np
import pymc as pm
from matplotlib import pyplot as plt

# Defining the dice and preparing data
tampered_dice = [1 / 12, 1 / 12, 1 / 12, 1 / 12, 1 / 12, 7 / 12]
good_dice = [1 / 6] * 6
num_rolls = 10000

for idx, dice in enumerate([tampered_dice, good_dice]):
    # generate the dice rolls data
    true_p = np.array(dice)
    dice_rolls = np.random.choice(6, size=num_rolls, p=true_p)

    # specify the dice model
    with pm.Model() as dice_model:
        p = pm.Dirichlet("p", a=np.ones(6))

        # specify the likelihood
        face = pm.Categorical("face", p=p, observed=dice_rolls)

        # perform inference using the data
        trace = pm.sample(draws=100, tune=100, chains=2)

        # sampling data from before and after data is available
        prior_predictive = pm.sample_prior_predictive()
        post_pred = pm.sample_posterior_predictive(trace)

    # presenting prior predictions
    fig, ax = plt.subplots()
    ax.hist(prior_predictive.prior_predictive["face"].values.flatten())
    plt.xlabel("Die face")
    plt.ylabel("Occurrences")
    plt.title("Tampered dice" if idx == 0 else "Good dice")
    plt.savefig(f"prior_predictive_{idx}.png", dpi=100)
    plt.show()

    # presenting posterior predictions
    trace.extend(post_pred)

    fig, ax = plt.subplots()
    ax.hist(trace.posterior_predictive["face"].values.flatten())
    plt.xlabel("Die face")
    plt.ylabel("Occurrences")
    plt.title("Tampered dice" if idx == 0 else "Good dice")
    plt.savefig(f"posterior_predictive_{idx}.png", dpi=100)
    plt.show()

    # calculate the expected probabilities of a fair dice
    expected_probs = np.ones(6) / 6

    # calculate the difference between the posterior probabilities and the expected probabilities
    prob_diff = np.abs(trace.posterior["p"] - expected_probs)

    # calculate the mean and standard deviation of the difference over all samples
    mean_diff = prob_diff.mean(dim=("chain", "draw"))
    std_diff = prob_diff.std(dim=("chain", "draw"))

    # set a threshold for the difference
    threshold = 0.05

    # check if the difference between the inferred probabilities and the expected probabilities is above the threshold
    tampered = mean_diff > threshold

    if tampered.any():
        print("Dice may have been tampered with.")
    else:
        print("Dice does not seem to have been tampered with.")

By running it, the program checks whether the die faces stay within 5 % of the allowed deviation from the ideal die. Your regular at-home die could probably be flagged as tampered by this code, because the face with the 1 is heavier than the face with the 6, making the 6 appear slightly more often than the 1 on average. For pm.sample, I have chosen 100 draws per chain. The draws are the number of samples the sampler takes while exploring potential model parameters. It is advisable to use larger values, e.g., 1000, to have enough granularity in the posterior distribution.
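
Whether 100 draws are enough can be checked with the usual sampler diagnostics; a short sketch using ArviZ, assuming trace from the code above:

import arviz as az

print(az.summary(trace, var_names=["p"]))  # ess_bulk and r_hat hint at whether more draws are needed

# re-run the sampler with more draws and tuning steps for a smoother posterior
# trace = pm.sample(draws=1000, tune=1000, chains=2)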

Why PyMC?

Now, what are the advantages and disadvantages of PyMC? PyMC is a powerful package that allows users to easily define, fit, and analyse models. It provides a variety of built-in distributions and samplers suitable for a wide range of models, and it is flexible enough to handle complex ones. However, PyMC is a probabilistic programming library, which can take some getting used to: it is important to understand the concepts of probability distributions and Bayesian statistics. PyMC can also be computationally expensive, especially for large and complex models, and may not be suitable for real-time applications or large datasets – not least because it is built in Python. Bayesian methods are a powerful tool for data analysis, and PyMC makes it easy to implement these methods in Python. PyMC is, however, far more tedious to use and not as well documented as TensorFlow – but then again, TensorFlow is the current standard.

Alternatives to PyMC with different requirements and advantages are:

  • Stan: Written in C++ and thus faster than PyMC for large and complex models.
  • JAGS (Just Another Gibbs Sampler): Also written in C++ and built around Gibbs sampling.
  • Edward: Built on top of TensorFlow, which allows for the use of deep learning models. Useful for Bayesian deep learning problems.

Relevant applications are found in probabilistic forecasting, a type of forecasting that provides a range of possible outcomes rather than a single point estimate. Some examples include weather forecasting, financial forecasting and, most relevant currently in Europe, energy forecasting.

Sometimes, neural networks and deep learning yield results that appear almost magical. The most prominent examples right now are ChatGPT and MidJourney. How they achieve their results is only partially understandable. Bayesian statistics, in contrast, requires us to think about the model that underlies the observed data and thus to understand the problem at hand more deeply. While not as powerful as artificial intelligence, probabilistic approaches can help us prepare data and set requirements for AI. More understanding of the problem is often a key benefit.

Conclusion

PyMC is an interesting tool to have in your toolbox. It uses Bayesian statistics for powerful data analysis, giving the data scientist or engineer the means to fine-tune their model and find the best representation of the observed data. PyMC may not be as convenient and powerful as neural networks or deep learning approaches in artificial intelligence, but it helps humans understand more about the problem at hand. Furthermore, it returns a range of results with associated probabilities rather than a single point estimate.
