How the cluster was developed
From the handpicked keywords, we sent a query to the Lund University publication database LUCRIS, which returns all publications with any field matching at least one of the keywords. Publications with an abstract were then selected. Each abstract was split into sentences and tokens (words) using spaCy with the en_core_web_lg model, and lemmas and noun chunks were extracted using the same library. Some publications were incorrectly tagged as English; these were filtered out prior to annotation using an automatic language-detection method, the same method deployed by the Chromium browser. This removed 11 publications. We also performed standard stop-word filtering and removed punctuation and numbers.
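The filtering step above can be sketched as follows. This is a minimal illustration operating on lemmas assumed to already come from spaCy's en_core_web_lg pipeline; the stop-word list here is a tiny illustrative subset, not the one actually used.

```python
# Hypothetical sketch of the stop-word / punctuation / number filtering.
# STOP_WORDS is a small illustrative subset, not the real list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "we"}

def clean_lemmas(lemmas):
    """Drop stop words, punctuation-only tokens, and numbers from a list of lemmas."""
    kept = []
    for lemma in lemmas:
        lemma = lemma.lower()
        if lemma in STOP_WORDS:
            continue
        if not any(ch.isalpha() for ch in lemma):  # punctuation and numbers
            continue
        kept.append(lemma)
    return kept

print(clean_lemmas(["We", "cluster", "1024", "abstracts", ",", "in", "English", "."]))
# → ['cluster', 'abstracts', 'english']
```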
The clustering is based on two methods: a bag-of-words (BOW) method using Term Frequency-Inverse Document Frequency (TF-IDF) weighting, where each document is a single abstract, and a pretrained Deep Averaging Network (DAN) model, the Universal Sentence Encoder (Large).
Each abstract was transformed into two representations: a high-dimensional sparse BOW vector, and a 512-dimensional vector we call an embedding, as it embeds semantically relevant information in a lower-dimensional space.
Each dimension in the BOW vector corresponds to a term (in this case a lemma, the canonical form of a word), and its weight is a TF-IDF weight indicating the term's relevance for that abstract. The IDF factor of TF-IDF reduces the impact of words occurring frequently throughout the collection, as they tend to be less relevant. The 512-dimensional embedding was produced by applying the Universal Sentence Encoder (USE) to a reduced abstract: the title and the first two sentences. This cutoff is motivated by a need to increase separation between abstracts, as it is harder for the model to produce quality embeddings when the input is very long. The purpose of the Universal Sentence Encoder is to transform a sequence of words into a high-dimensional vector that, loosely described, lives in a space of semantics. In this space, completely different sentences containing few or none of the same words can have similar directions if they are semantically similar. USE and similar methods try to overcome one of the primary limitations of BOW methods: accounting for synonyms with completely different word forms.
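The BOW side of the pipeline can be sketched with scikit-learn's TfidfVectorizer. This is a simplified stand-in (toy abstracts in place of the LUCRIS data, and scikit-learn defaults rather than our exact pre-processing):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for pre-processed (lemmatised) abstracts.
abstracts = [
    "neural network image classification",
    "deep neural network speech recognition",
    "protein folding molecular dynamics",
]

# Each abstract becomes one sparse TF-IDF row; terms that are rare in the
# collection receive a higher IDF weight than terms appearing everywhere.
vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(abstracts)

print(bow.shape)  # (number of abstracts, vocabulary size)
```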
For clustering, the BOW vectors were embedded by reducing their dimensionality using Truncated Singular Value Decomposition (SVD), i.e. keeping only the top k singular values, essentially compressing the BOW vector. We tested dimensions from 32 to 128; 48 produced the best embeddings for our purposes, the tradeoff being between very separable clusters and clusters that were not separable at all. This method was compared against the USE embedding. We clustered the BOW embedding and the USE embedding using K-Means for 2 to 48 clusters. K-Means converges to a local minimum that depends on the initial conditions, so we reran it 25 times for each number of clusters from 2 to 48. We computed the min, mean, max, and standard deviation of the silhouette scores to get an idea of how much the clusterings varied for each k. We ultimately selected, for each k, the clustering with the highest silhouette score.
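The SVD compression and repeated K-Means selection can be sketched as below, using scikit-learn and random data in place of the real TF-IDF matrix, with smaller ranges than the 2-48 clusters and 25 reruns described above:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for the sparse TF-IDF matrix (200 abstracts x 500 terms).
bow = rng.random((200, 500))

# Compress the BOW vectors; the text above settled on k = 48 dimensions.
embedding = TruncatedSVD(n_components=48, random_state=0).fit_transform(bow)

# Rerun K-Means several times per cluster count and keep the best silhouette.
best = {}
for k in range(2, 6):          # the real range was 2 to 48
    scores = []
    for run in range(5):       # the real pipeline used 25 reruns
        labels = KMeans(n_clusters=k, n_init=1, random_state=run).fit_predict(embedding)
        scores.append(silhouette_score(embedding, labels))
    best[k] = max(scores)

print(best)  # highest silhouette score observed for each k
```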
We do not have a gold standard with which to objectively evaluate the clustering, so we had to gain insight in a different way. To evaluate the clusterings, we down-projected the BOW embedding and the USE embedding to 2 dimensions using t-SNE, a method well suited to visualizing high-dimensional vectors. It is approximate, and the actual distances on the down-projected 2D map are hard to interpret, as the transformation is non-linear: t-SNE will place similar clusters close together, but the actual distances should be taken with a grain of salt.
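The down-projection step can be sketched with scikit-learn's TSNE, here on random vectors standing in for the 512-dimensional USE embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for the 512-dimensional USE embeddings of 100 abstracts.
embedding = rng.random((100, 512))

# Down-project to 2D for visualization; perplexity must stay below the
# number of samples, and the result is approximate by construction.
projection = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embedding)

print(projection.shape)  # one 2D point per abstract, ready to plot
```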
We then picked the number of clusters that balances coverage of the map, so as to show the dominant topics being researched without clumping clusters together into something with little meaning. Meaning was judged by producing word clouds: for both embeddings, we summed the BOW TF-IDF vectors residing in each cluster, providing a way to see which dominant words reside in each cluster compared to the rest of the corpus.
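The per-cluster term weighting behind the word clouds can be sketched as follows, on a toy TF-IDF matrix and cluster labels standing in for the real vectorizer and K-Means output:

```python
import numpy as np

# Toy stand-ins: a TF-IDF matrix (6 abstracts x 4 terms), cluster labels,
# and the vocabulary. In the real pipeline these come from the vectorizer
# and the selected K-Means clustering.
tfidf = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.1, 0.0, 0.9, 0.2],
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.4],
])
labels = np.array([0, 0, 1, 1, 0, 1])
vocab = np.array(["network", "image", "protein", "dynamics"])

# Sum the TF-IDF vectors per cluster and rank the terms: the heaviest
# terms are the ones a word cloud would draw largest.
for cluster in np.unique(labels):
    weights = tfidf[labels == cluster].sum(axis=0)
    top = vocab[np.argsort(weights)[::-1][:2]]
    print(cluster, list(top))
```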
Finally, we had the choice between the simpler BOW embedding and the more semantically relevant USE embedding. The language used in scientific abstracts is diverse, including technical vocabulary and rare words that the general public might find difficult to understand. This diversity and distance from everyday language presents a challenge in general. Our BOW embeddings were based solely on the statistics of our abstract collection, and the diversity of the research meant that the overlap in words between abstracts was low. In the end we chose the USE embedding, as it produced more semantically coherent clusters. Both methods produced acceptable t-SNE projections, but similarity was harder to judge with BOW, as synonyms would pop up in different clusters more frequently; we would essentially be measuring the similarity of vocabulary in clusters, not the similarity of topics in abstracts.