Channel: Is there a way to make this Python kNN function more efficient? - Stack Overflow

Is there a way to make this Python kNN function more efficient?


After having trouble with MATLAB I decided to try Python.

I wrote a function that computes kNN when the samples are instances of my own class, using my own distance function:

def closestK(sample, otherSamples, distFunc, k):
    "Returns the closest k samples to sample based on distFunc"
    n = len(otherSamples)
    d = [distFunc(sample, otherSamples[i]) for i in range(n)]
    idx = sorted(range(len(d)), key=lambda j: d[j])
    return idx[1:(k + 1)]

def kNN(samples, distFunc, k):
    return [[closestK(samples[i], samples, distFunc, k)] for i in range(len(samples))]
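To make the expected behaviour concrete, here is a toy run of the same logic on plain 1-D vectors (the sample values are invented for illustration; note that index 0 of the sorted list is skipped because the sample itself is in the candidate list with distance 0):

```python
import numpy as np

def closestK(sample, otherSamples, distFunc, k):
    # Distances from `sample` to every candidate, then indices sorted by distance.
    d = [distFunc(sample, s) for s in otherSamples]
    idx = sorted(range(len(d)), key=lambda j: d[j])
    # Skip index 0: the nearest hit is the sample itself (distance 0).
    return idx[1:k + 1]

samples = [np.array([0.0]), np.array([1.0]), np.array([3.0]), np.array([7.0])]
dist = lambda a, b: np.linalg.norm(a - b)

print(closestK(samples[0], samples, dist, 2))  # [1, 2]
```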

and this is the distance function:

@staticmethod
def distanceRepr(c1, c2):
    r1 = c1.repr
    r2 = c2.repr
    # because cdist needs 2D arrays
    if r1.ndim == 1:
        r1 = np.vstack([r1, r1])
    if r2.ndim == 1:
        r2 = np.vstack([r2, r2])
    return scipy.spatial.distance.cdist(r1, r2, 'euclidean').min()
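For scale: when every repr holds a single vector, the whole neighbour search collapses into one pairwise-distance computation instead of n² Python-level calls. A minimal sketch, assuming X stacks all the repr rows (the name kNN_vectorized is invented here; cdist(X, X) would do the same job as the broadcasted expression):

```python
import numpy as np

def kNN_vectorized(X, k):
    # All pairwise Euclidean distances in one broadcasted operation,
    # equivalent to scipy.spatial.distance.cdist(X, X, 'euclidean').
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))       # shape (n, n)
    # Sort each row; column 0 is the sample itself (distance 0), so skip it.
    return np.argsort(D, axis=1)[:, 1:k + 1]

X = np.array([[0.0], [1.0], [3.0], [7.0]])
print(kNN_vectorized(X, 2))  # rows: [1 2], [0 2], [1 0], [2 1]
```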

But it still runs far slower than the "normal" kNN function, even when using the "brute" algorithm. Am I doing something wrong?

UPDATE

I'm adding the constructor of the class. The attribute repr holds a set of vectors (one or more), and the distance between two clusters is defined as the minimal Euclidean distance between their repr sets.

class myCluster:
    def __init__(self, index=-1, P=np.array([])):
        if index == -1:
            self.repr = np.array([])
            self.IDs = np.array([])
            self.n = 0
            self.center = np.array([])
        else:
            self.repr = np.array(P)
            self.IDs = np.array(index)
            self.n = 1
            self.center = np.array(P)

and the rest of relevant code (X is a matrix whose rows are samples and columns are variables):

level = [myCluster(i, X[i, :]) for i in range(n)]
kNN(level, myCluster.distanceRepr, 3)

UPDATE 2

I've made some measurements and the line that takes most of the time is

d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]

So the problem is in distFunc. When I change it to return

np.linalg.norm(c1.repr-c2.repr)

i.e. a plain vector norm with no vstack or cdist, the running time stays the same. So the bottleneck lies in the calling of this function. Does it make sense that the use of classes changes the running time by a factor of 60?
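It can: each of the n² calls pays for Python function-call dispatch, attribute lookups, the ndim checks, vstack allocations, and a cdist on a tiny array, and none of that amortises. A hedged micro-benchmark sketch comparing per-pair calls against one broadcasted computation (sizes and timings are illustrative and machine-dependent):

```python
import timeit
import numpy as np

X = np.random.rand(200, 10)

def per_pair():
    # n^2 Python-level calls, one tiny norm each.
    return [[np.linalg.norm(X[i] - X[j]) for j in range(len(X))]
            for i in range(len(X))]

def vectorized():
    # One broadcasted computation for all pairs at once.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

t1 = timeit.timeit(per_pair, number=3)
t2 = timeit.timeit(vectorized, number=3)
print(f"per-pair: {t1:.3f}s  vectorized: {t2:.3f}s")
```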

