Fast and Simple Deterministic Seeding of KMeans for Text Document Clustering

Ehsan Sherkat; Julien Velcin; Evangelos E. Milios

doi:10.1007/978-3-319-98932-7_7

Conference paper, Year: 2018


Ehsan Sherkat

Role: Author

Julien Velcin

Role: Author
PersonId: 967191
IdHAL: julien-velcin
ORCID: 0000-0002-2262-045X

Entrepôts, Représentation et Ingénierie des Connaissances

Evangelos E. Milios

Role: Author

Abstract

KMeans is one of the most popular document clustering algorithms. It is usually initialized with random seeds, which can drastically affect the final performance of the algorithm. Many random or order-sensitive methods attempt to initialize KMeans properly, but their results are non-deterministic and unrepeatable. KMeans therefore needs to be initialized several times to obtain a better result, which is time-consuming. In this paper, we introduce a novel deterministic seeding method for KMeans that is specifically designed for text document clustering. Due to its simplicity, it is fast and can be scaled to large datasets. Experimental results on several real-world datasets demonstrate that the proposed method has overall better performance compared to several deterministic, random, or order-sensitive methods in terms of clustering quality and runtime.
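The sensitivity to seeding described in the abstract can be illustrated with a minimal sketch. This is not the paper's seeding method, which the abstract does not detail. It is a plain Lloyd-style KMeans on hypothetical one-dimensional toy data, and the helper name `kmeans` and all seed choices are assumptions made for illustration. It shows that explicit seeds give a repeatable result, and that a poor choice of seeds can leave the algorithm in a much worse local optimum.

```python
import random

def kmeans(points, seeds, iters=100):
    """Plain Lloyd-style KMeans on tuples of floats, started from explicit seeds."""
    centroids = [tuple(s) for s in seeds]
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    # Sum of squared distances from each point to its assigned centroid.
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
              for p, lab in zip(points, labels))
    return labels, sse

# Hypothetical toy data: three well-separated groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# One seed per group: converges to the natural clustering.
good_labels, good_sse = kmeans(points, [(0.0,), (10.0,), (20.0,)])

# Two seeds in the same group: stuck in a poor local optimum.
bad_labels, bad_sse = kmeans(points, [(0.0,), (1.0,), (10.0,)])

# Random seeding: the outcome depends on the random generator's state.
random_sses = [kmeans(points, random.Random(s).sample(points, 3))[1]
               for s in range(5)]

print(good_sse, bad_sse, random_sses)
```

Because the seeds are passed explicitly, rerunning `kmeans` with the same seeds always returns the same labels. Random seeding has no such guarantee unless the generator state is fixed, which is why it is usually repeated several times.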

Domains

Web; Document and Text Processing; Information Retrieval [cs.IR]; Artificial Intelligence [cs.AI]; Machine Learning [stat.ML]

Main file

clef-2018.pdf (1.59 MB)

Origin: Files produced by the author(s)


https://hal.univ-lyon2.fr/hal-01953432

Submitted on: Wednesday, December 12, 2018, 22:52:48

Last modified on: Wednesday, September 4, 2024, 17:34:07

Long-term archiving on: Wednesday, March 13, 2019, 16:28:53

Dates and versions

hal-01953432, version 1 (12-12-2018)

Identifiers

HAL Id: hal-01953432, version 1
DOI: 10.1007/978-3-319-98932-7_7

Cite

Ehsan Sherkat, Julien Velcin, Evangelos E. Milios. Fast and Simple Deterministic Seeding of KMeans for Text Document Clustering. 9th Conference and Labs of the Evaluation Forum (CLEF), Sep 2018, Avignon, France. pp.76-88, ⟨10.1007/978-3-319-98932-7_7⟩. ⟨hal-01953432⟩


Collections

UNIV-LYON1 UNIV-LYON2 ERIC UDL

68 views

551 downloads