Skip to content

antononcube/Python-GeometricNearestNeighborsProcessor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GeometricNearestNeighborsProcessor

A Python package for Geometric Nearest Neighbors (GNN) workflows: data rescaling, fast anomalies finding, similarity matrices derivation.

This Python package is a translation to Python of the Wolfram Language software monad "MonadicGeometricNearestNeighbors", [AAp1]. (The R package "GNNMon-R", [AAp4], is another translation of [AAp1].)

Remark: In order to keep outputs consistent with R and WL package the data frame outputs of the Python package use camel case for the column names. (That might change in the future.)


Purpose and theoretical background

Consider the following computational tasks for a given set of $n$-dimensional (nD) points $P$:

  1. Find the points of $P$ that are outliers or anomalies
  2. Find the points of another set $P_1$ that can be seen anomalies wrt to $P$
  3. For a given nD point find its Nearest Neighbors (NNs) in $P$
    • The points of $P$ can have labels
    • It might be desired to get the distances and labels of the NNs
  4. Plot the points $P$ with minimal setup or specification writing
  5. Give the (sparse) proximity matrix of $P$ for a specified number of neighbors

Let us define an anomalous point as one that is "too far" from the other points.
Which points are "too far" from the rest can be determined by examining statistics of distances between each point and $k$ nearest neighbors of it.

More concretely, point anomalies are found in the following way:

  1. Input:
    • Points P as a data frame, list, or dictionary
    • Number of nearest neighbors n
    • Distance function d
    • Aggregation function a
  2. For each point of P
    • Find its n nearest neighbors
    • Aggregate with a the corresponding n distance
  3. Using the statistics for the previous step -- a 1D array -- find outlier identification parameters
    • Like, Hampel-, SPLUS-, Quartile parameters
  4. Identify anomalies using the parameters of the previous step

Usage examples

Load packages:

from RandomDataGenerators import *
from OutlierIdentifiers import *

import numpy
import random

import pandas as pd
import plotly.express as px

Generate random points:

help(random_data_frame)
dfPoints = random_data_frame(n_rows=30, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints.shape)
print(dfPoints[1:6])

Here is a summary:

dfPoints.describe()

Here is a plot of the points:

fig = px.scatter(dfPoints, x="X", y="Y", template="plotly_dark")
fig.show()

A typical pipeline of geometric nearest neighbors processing:

gnnObj = (GeometricNearestNeighborsProcessor(dsPoints)
   .make_nearest_function(distance_function = "EuclideanDistance")
   .compute_thresholds(number_of_nearest_neighbors = 10, aggregation_function = "mean", outlier_identifier = "QuartileIdentifierParameters")
   .find_anomalies()
   .echo_function_value("Anomaly points:", lambda x: print(x))
   .plot(title="Random points", template="plotly_dark")
)

Show the plot obtained above:

gnnObj.take_value().show()

Here we generate another set of random points using the same random point generators:

dfPoints2 = random_data_frame(n_rows=40, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints2.shape)

Here the points of second set are classified into being anomalous or not:

gnnObj.classify(dfPoints2).take_value()

See the notebook "Usage-examples.ipynb" for more detailed examples.


Implementation details

  • The package provides the class GeometricNearestNeighborsProcessor that can be used to construct chainable, monadic pipeline-like behavior.

  • Plotting of the data points is done via "plotly" -- just a scatter plot of 2D points for now.

  • The chainable behavior of the methods of the class GeometricNearestNeighborsProcessoris implemented by following the principle that all methods return self, except the so-called "takers".

    • I.e., methods with names that start with "take_".
  • The core Nearest Neighbors (NNs) finding functionality is provided by the "scikit-learn" class NearestNeighbors.

  • The NNs finding algorithms used by GeometricNearestNeighborsProcessor are "scan" and "kdtree".

    • "scan" is implemented in the class GeometricNearestNeighborsProcessor instead of delegating to scikit-learn's NearestNeighbors "brute" algorithm.
    • "kdtree" delegates to NearestNeighbors "kd_tree" algorithm.

References

Wolfram Language

[AAp1] Anton Antonov, MonadicGeometricNearestNeighbors, Wolfram Language paclet, (2023-2025), Wolfram Language Paclet Repository.

[AAp2] Anton Antonov, OutlierIdentifiers, Wolfram Language paclet, (2023), Wolfram Language Paclet Repository.

R

[AAp3] Anton Antonov, OutlierIdentifiers, R package, (2019-2024), GitHub/antononcube.

[AAp4] Anton Antonov, GNNMon-R, R package, (2019-2025), GitHub/antononcube.

[AAp5] Anton Antonov, KDTreeAlgorithm, R package, (2025), GitHub/antononcube.

Python

[AAp6] Anton Antonov, RandomDataGenerators, Python package, (2021-2026), GitHub/antononcube. (PIPy.org.)

[AAp7] Anton Antonov, OutlierIdentifiers, Python package, (2024), GitHub/antononcube. (PIPy.org.)

About

Python package for Geometric Nearest Neighbors (GNN) workflows: data rescaling, fast anomalies finding, similarity matrices derivation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages