
IndexError when using 'average' for linkage #2

Open
Terry1004 opened this issue May 29, 2018 · 2 comments

@Terry1004

Sorry to bother you again, but I ran into another error when calling sparsehc_dm.linkage in the following script:

# -*- coding: utf-8 -*-
from sparsehc_dm import sparsehc_dm
import numpy as np
from con_sql import get_vectors_db
import resource
import time

# Test with 10,000 points

SIZE = 10000
SORT_RAM = int(0.5 * 1024 * 1024 * 1024)
INDEX_TO_PRINT = 20
HOST = 'XXX'
USER = 'XXX'
PASSWD = 'XXX'
DB = 'XXX'
GET_SENTENCES_SQL = 'XXX'
GET_VECTORS_URL = 'XXX'

def print_use(start_time):
    print('total time taken: %s seconds' % (time.time() - start_time))
    # ru_maxrss is in kilobytes on Linux (bytes on macOS), so this prints GB on Linux only.
    print('memory usage: %f GB' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024. / 1024.))

def get_distance(i, j, vectors):
    return np.linalg.norm(vectors[i] - vectors[j])

def cluster(size, sort_ram, index_to_print, vectors):
    distances = sparsehc_dm.InMatrix(sort_ram)
    for i in range(size - 1):
        for j in range(i + 1, size):
            distances.push(i, j, get_distance(i, j, vectors))
    
    l_history = sparsehc_dm.linkage(distances, 'average')
    print(l_history[:index_to_print])
    
def main():
    start_time = time.time()
    vectors = get_vectors_db(HOST, USER, PASSWD, DB, SIZE, GET_SENTENCES_SQL, GET_VECTORS_URL)
    print('time take to read from database: %s seconds' % (time.time() - start_time))
    cluster(SIZE, SORT_RAM, INDEX_TO_PRINT, vectors)
    print_use(start_time)

if __name__ == '__main__':
    main()
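
The script above depends on a private database (get_vectors_db), so it cannot be run by anyone else. Below is a minimal sketch of a database-free reproduction, assuming that synthetic random points trigger the same behaviour, which has not been verified. The helpers `random_points`, `pairwise_distances` and `reproduce`, and the default `size`, `dim` and `sort_ram` values, are hypothetical; only `InMatrix`, `push` and `linkage` are taken from the script. Note that the double loop pushes n(n-1)/2 pairs, i.e. 49,995,000 for SIZE = 10000.

```python
import math
import random


def random_points(size, dim, seed=0):
    """Return `size` pseudo-random points in `dim` dimensions (synthetic stand-in data)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(size)]


def pairwise_distances(points):
    """Yield (i, j, distance) for every pair i < j, in the same order as the original loops."""
    for i in range(len(points) - 1):
        for j in range(i + 1, len(points)):
            yield i, j, math.dist(points[i], points[j])


def reproduce(size=100, dim=8, sort_ram=512 * 1024 * 1024, method='average'):
    """Fill an InMatrix with synthetic distances and run linkage on it."""
    # Imported lazily so the helpers above work without the library installed.
    from sparsehc_dm import sparsehc_dm

    distances = sparsehc_dm.InMatrix(sort_ram)
    for i, j, d in pairwise_distances(random_points(size, dim)):
        distances.push(i, j, d)
    return sparsehc_dm.linkage(distances, method)
```

Calling `reproduce(size=5000)` or `reproduce(size=10000)` would mirror the two runs reported in this issue; smaller sizes may or may not fail.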

This script reads data from a database and processes it into a 2D NumPy array (via get_vectors_db) whose rows are the points to be clustered, while monitoring time and memory usage. During execution, however, the following was printed:

Using 0.5GB RAM for on-disc sorting.
[STXXL-MSG] STXXL v1.4.99 (prerelease/Release) (git 263df0c54dc168212d1c7620e3c10c93791c9c29) + gnu parallel(20180415)
[STXXL-MSG] Disk '/tmp/stxxl' is allocated, space: 953 MiB, I/O implementation: syscall queue=0 devid=0 unlink_on_open
Traceback (most recent call last):
  File "cluster.py", line 95, in <module>
    main()
  File "cluster.py", line 91, in main
    cluster(SIZE, SORT_RAM, INDEX_TO_PRINT, vectors)
  File "cluster.py", line 85, in cluster
    l_history = sparsehc_dm.linkage(distances, 'average')
IndexError: vector::_M_range_check: __n (which is 8145) >= this->size() (which is 8145)

A similar error also occurred when I set SIZE = 5000:

Using 0.5GB RAM for on-disc sorting.
[STXXL-MSG] STXXL v1.4.99 (prerelease/Release) (git 263df0c54dc168212d1c7620e3c10c93791c9c29) + gnu parallel(20180415)
[STXXL-MSG] Disk '/tmp/stxxl' is allocated, space: 953 MiB, I/O implementation: syscall queue=0 devid=0 unlink_on_open
Traceback (most recent call last):
  File "cluster.py", line 92, in <module>
    main()
  File "cluster.py", line 88, in main
    cluster(SIZE, SORT_RAM, INDEX_TO_PRINT, vectors)
  File "cluster.py", line 82, in cluster
    l_history = sparsehc_dm.linkage(distances, 'average')
IndexError: vector::_M_range_check: __n (which is 3736) >= this->size() (which is 3736)
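
The message format `vector::_M_range_check` is what libstdc++ emits when `std::vector::at()` is called with an out-of-range index. In both runs the index equals the vector size, i.e. the access is exactly one past the end. That pattern is consistent with an off-by-one in the 'average' code path, though the messages alone do not prove it. A small sketch for extracting the two numbers (the helper name `parse_range_check` is hypothetical):

```python
import re

# Matches the libstdc++ range-check text quoted in the tracebacks above.
_RANGE_CHECK = re.compile(
    r"vector::_M_range_check: __n \(which is (\d+)\) "
    r">= this->size\(\) \(which is (\d+)\)"
)


def parse_range_check(message):
    """Return (index, size) from a range-check message, or None if it does not match."""
    match = _RANGE_CHECK.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


messages = [
    "IndexError: vector::_M_range_check: __n (which is 8145) >= this->size() (which is 8145)",
    "IndexError: vector::_M_range_check: __n (which is 3736) >= this->size() (which is 3736)",
]

for message in messages:
    index, size = parse_range_check(message)
    print("index=%d size=%d one_past_end=%s" % (index, size, index == size))
```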

However, I got no error when I replaced 'average' with 'complete' in the sparsehc_dm.linkage call.
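
To make the 'average' versus 'complete' comparison repeatable, the two calls can be run side by side on identically built input. This is a sketch only: `compare_methods` and its parameters are hypothetical names, and the matrix is rebuilt for every method on the assumption (not confirmed here) that `linkage` may consume or modify it.

```python
def compare_methods(build_matrix, linkage, methods=('average', 'complete')):
    """Run `linkage` once per method on a freshly built matrix and record the outcome.

    build_matrix: zero-argument callable returning a filled distance matrix.
    linkage:      callable taking (matrix, method), e.g. sparsehc_dm.linkage.
    """
    results = {}
    for method in methods:
        try:
            linkage(build_matrix(), method)
        except IndexError as exc:
            # Keep the message so the reported index and size can be compared later.
            results[method] = 'IndexError: %s' % exc
        else:
            results[method] = 'ok'
    return results
```

With the library installed, `build_matrix` would create an `sparsehc_dm.InMatrix`, push the pairwise distances into it and return it, and `linkage` would be `sparsehc_dm.linkage`.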

@mdimura
Owner

mdimura commented May 30, 2018

Strange. I don't know what could cause this; one would have to debug it.

@Terry1004
Author

Actually, I tried pushing a constant value to every entry of sparsehc_dm.InMatrix(), and I got the same error, except that the number reported in the message is 0:
IndexError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
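
Since a constant distance already triggers the error, the input data can be ruled out and the case can be shrunk further. The sketch below searches for the smallest number of points that fails; `smallest_failing_size` and its parameters are hypothetical, and it assumes only the `push(i, j, value)` and `linkage(matrix, method)` calls used in the script above.

```python
def smallest_failing_size(make_matrix, linkage, max_size, value=1.0, method='average'):
    """Return the smallest point count in [2, max_size] for which linkage raises IndexError.

    make_matrix: zero-argument callable returning an empty distance matrix
                 (for example, one that wraps sparsehc_dm.InMatrix).
    linkage:     callable taking (matrix, method).
    Returns None if no size up to max_size fails.
    """
    for size in range(2, max_size + 1):
        matrix = make_matrix()
        # Push the same constant distance for every pair i < j.
        for i in range(size - 1):
            for j in range(i + 1, size):
                matrix.push(i, j, value)
        try:
            linkage(matrix, method)
        except IndexError:
            return size
    return None
```

A small failing size would give the maintainer a case that can be stepped through in a debugger in seconds rather than minutes.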
