A Python package for Geometric Nearest Neighbors (GNN) workflows: data rescaling, fast anomalies finding, similarity matrices derivation.
This Python package is a translation to Python of the Wolfram Language software monad "MonadicGeometricNearestNeighbors", [AAp1]. (The R package "GNNMon-R", [AAp4], is another translation of [AAp1].)
Remark: In order to keep outputs consistent with R and WL package the data frame outputs of the Python package use camel case for the column names. (That might change in the future.)
Consider the following computational tasks for a given set of
- Find the points of
$P$ that are outliers or anomalies - Find the points of another set
$P_1$ that can be seen anomalies wrt to$P$ - For a given nD point find its Nearest Neighbors (NNs) in
$P$ - The points of
$P$ can have labels - It might be desired to get the distances and labels of the NNs
- The points of
- Plot the points
$P$ with minimal setup or specification writing - Give the (sparse) proximity matrix of
$P$ for a specified number of neighbors
Let us define an anomalous point as one that is "too far" from the other points.
Which points are "too far" from the rest can be determined by examining statistics of distances between each point
and
More concretely, point anomalies are found in the following way:
- Input:
- Points
Pas a data frame, list, or dictionary - Number of nearest neighbors
n - Distance function
d - Aggregation function
a
- Points
- For each point of
P- Find its
nnearest neighbors - Aggregate with
athe correspondingndistance
- Find its
- Using the statistics for the previous step -- a 1D array -- find outlier identification parameters
- Like, Hampel-, SPLUS-, Quartile parameters
- Identify anomalies using the parameters of the previous step
Load packages:
from RandomDataGenerators import *
from OutlierIdentifiers import *
import numpy
import random
import pandas as pd
import plotly.express as pxGenerate random points:
help(random_data_frame)dfPoints = random_data_frame(n_rows=30, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints.shape)
print(dfPoints[1:6])Here is a summary:
dfPoints.describe()Here is a plot of the points:
fig = px.scatter(dfPoints, x="X", y="Y", template="plotly_dark")
fig.show()A typical pipeline of geometric nearest neighbors processing:
gnnObj = (GeometricNearestNeighborsProcessor(dsPoints)
.make_nearest_function(distance_function = "EuclideanDistance")
.compute_thresholds(number_of_nearest_neighbors = 10, aggregation_function = "mean", outlier_identifier = "QuartileIdentifierParameters")
.find_anomalies()
.echo_function_value("Anomaly points:", lambda x: print(x))
.plot(title="Random points", template="plotly_dark")
)Show the plot obtained above:
gnnObj.take_value().show()Here we generate another set of random points using the same random point generators:
dfPoints2 = random_data_frame(n_rows=40, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints2.shape)Here the points of second set are classified into being anomalous or not:
gnnObj.classify(dfPoints2).take_value()See the notebook "Usage-examples.ipynb" for more detailed examples.
-
The package provides the class
GeometricNearestNeighborsProcessorthat can be used to construct chainable, monadic pipeline-like behavior. -
Plotting of the data points is done via "plotly" -- just a scatter plot of 2D points for now.
-
The chainable behavior of the methods of the class
GeometricNearestNeighborsProcessoris implemented by following the principle that all methods returnself, except the so-called "takers".- I.e., methods with names that start with "take_".
-
The core Nearest Neighbors (NNs) finding functionality is provided by the "scikit-learn" class NearestNeighbors.
-
The NNs finding algorithms used by
GeometricNearestNeighborsProcessorare "scan" and "kdtree".- "scan" is implemented in the class
GeometricNearestNeighborsProcessorinstead of delegating to scikit-learn'sNearestNeighbors"brute" algorithm. - "kdtree" delegates to
NearestNeighbors"kd_tree" algorithm.
- "scan" is implemented in the class
[AAp1] Anton Antonov, MonadicGeometricNearestNeighbors, Wolfram Language paclet, (2023-2025), Wolfram Language Paclet Repository.
[AAp2] Anton Antonov, OutlierIdentifiers, Wolfram Language paclet, (2023), Wolfram Language Paclet Repository.
[AAp3] Anton Antonov, OutlierIdentifiers, R package, (2019-2024), GitHub/antononcube.
[AAp4] Anton Antonov, GNNMon-R, R package, (2019-2025), GitHub/antononcube.
[AAp5] Anton Antonov, KDTreeAlgorithm, R package, (2025), GitHub/antononcube.
[AAp6] Anton Antonov, RandomDataGenerators, Python package, (2021-2026), GitHub/antononcube. (PIPy.org.)
[AAp7] Anton Antonov, OutlierIdentifiers, Python package, (2024), GitHub/antononcube. (PIPy.org.)