Topic modeling of Arxiv articles

NLP using tidy tools

An attempt at topic modeling on a large dataset to uncover which topics make up the abstract of an article.
R
NLP
Topic modeling
big data
Author

Dan A. Tshisungu

Published

May 25, 2024

1. Introduction

ArXiv is a free distribution service and open-access archive for nearly 2.5 million scholarly articles in the fields of: (i) physics, (ii) mathematics, (iii) quantitative biology, (iv) computer science, (v) quantitative finance, (vi) statistics, (vii) electrical engineering and systems science, and (viii) economics.

When publishing an article on ArXiv, an author must select the most applicable field and subject area. For example, an author can choose the field of computer science and the subject area artificial intelligence. Authors can also select multiple areas by cross-listing an article and choosing additional subject areas.

Much like data science, subject areas on ArXiv have expanded over the years, and definitions have evolved. Consequently, users now encounter challenges in finding articles of interest.

We load the necessary libraries:

library(tidyverse)
library(tidymodels)
library(learntidymodels)
library(tidytext)
library(data.table)
library(tidyr)
library(dplyr)
library(here)
library(quanteda)
library(tm)
library(stm)
library(lda)
library(ldatuning)
library(skimr)
library(SnowballC)
library(Matrix)
library(text2vec)
library(textstem)
library(parallel)
library(doMC)
library(parallelly)
library(doParallel)
library(kableExtra)
library(ggthemes)
library(furrr)
library(DT)

2. Data overview

We start by loading and getting a glance at the data that we have:

# A tibble: 131,565 × 4
   Date       Title                                        Abstract Subject_area
   <chr>      <chr>                                        <chr>    <chr>       
 1 26/12/2009 A User's Guide to Zot                        "Zot is… LO          
 2 05/10/2009 Prediction of Zoonosis Incidence in Human u… "Zoonos… LG          
 3 08/05/2015 Wireless Multicast for Zoomable Video Strea… "Zoomab… NI          
 4 09/12/2015 On Computing the Minkowski Difference of Zo… "Zonoto… CG          
 5 26/01/2007 The Zones Algorithm for Finding Points-Near… "Zones … DB          
 6 21/02/2017 Occupancy Counting with Burst and Intermitt… "Zone-l… NI          
 7 16/12/2009 Zone Diagrams in Euclidean Spaces and in Ot… "Zone d… CG          
 8 08/05/2015 Decomposition of Power Flow Used for Optimi… "Zonal … CE          
 9 23/12/2008 Some sufficient conditions on Hamiltonian d… "Z-mapp… DM          
10 25/03/2013 ZKCM: a C++ library for multiprecision matr… "ZKCM i… MS          
# ℹ 131,555 more rows

The dataset has 131565 rows, one per abstract. The printout above shows four variables; the summary below also lists an ID column, for 5 variables in total.

A quick summary tells us that 894 observations have no value for Subject_area, that one abstract is missing, and that we have 131384 unique abstracts.

Data summary

Name                     Piped data
Number of rows           131565
Number of columns        5
Key                      NULL
_______________________
Column type frequency:
  character              5
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min   max  empty  n_unique  whitespace
ID                     0           1.00   11    14      0    131565           0
Date                   0           1.00   10    10      0      6176           0
Title                  0           1.00    4   256      0    131346           0
Abstract               1           1.00    9  3806      0    131384           0
Subject_area         894           0.99    2     2      0        39           0

We will try something different today: both topic modeling and clustering. I know it may seem a bit redundant, since topic modeling already clusters text in a sense, but bear with me🙏.

To achieve this, we will follow an approach that can be summarized in 2 steps:

  • Conversion of character data into numerical data, since most algorithms operate on numbers rather than text;

  • Dimension reduction from a large corpus of words to a set of topics.

3. Topic modeling

3.1. Conversion of text data into numerical data: Corpus extraction

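arxiv_corpus <- ArXiv %>%
  na.omit() %>%                 # drop rows with a missing value in any column
  distinct(Abstract) %>%        # keep each abstract once
  # one token per row; the pattern keeps hyphenated terms (e.g. "real-time") whole
  unnest_tokens(word, Abstract, token = stringr::str_extract_all,
                drop = FALSE,
                pattern = "\\b\\w[-\\w]*\\b") %>%
  mutate(word = lemmatize_words(word)) %>%
  # remove stop words (snowball is also the default list of get_stopwords())
  anti_join(get_stopwords(source = "snowball"), by = "word") %>%
  filter(word != "") %>%
  filter(!str_detect(word, "\\d")) %>%   # drop tokens that contain digits
  filter(nchar(word) >= 2)               # drop single-character tokens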


vocabulary <- arxiv_corpus %>% 
  select(word) %>% 
  unique()

arxiv_corpus %>%
    count(word, sort = TRUE)
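
To see what the custom tokenisation pattern does, here is a minimal sketch on a made-up sentence (the sentence is an assumption for illustration, not a row of the dataset). It calls stringr::str_extract_all() directly with the same regular expression that is passed to unnest_tokens() above:

```r
library(stringr)

# Made-up example text (an assumption for illustration only)
toy_abstract <- "we study state-of-the-art real-time systems"

# Same pattern as in the pipeline: a word character followed by any run of
# word characters or hyphens, so hyphenated terms stay together
pattern <- "\\b\\w[-\\w]*\\b"

tokens <- str_extract_all(toy_abstract, pattern)[[1]]
print(tokens)
```

The hyphenated terms come out as single tokens, which a plain word tokenizer would typically split apart.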

Next we build a document-term matrix from arxiv_corpus, first dropping rare words to shrink the vocabulary. We tried keeping only the words that appear more than 100 times and more than 500 times in the corpus, and finally stuck with the threshold of 100:

tidy_arxiv <- arxiv_corpus %>% 
  add_count(word) %>%
  filter(n > 100) %>%
  select(-n)

tidy_arxiv_500 <- arxiv_corpus %>% 
  add_count(word) %>%
  filter(n > 500) %>%
  select(-n)




arxiv_sparse <- tidy_arxiv %>%
  count(Abstract, word) %>%
  cast_sparse(Abstract, word, n)

arxiv_sparse_500 <- tidy_arxiv_500 %>%
  count(Abstract, word) %>%
  cast_sparse(Abstract, word, n)

arxiv_corpus_sparse <- arxiv_corpus %>%
  count(Abstract, word) %>%
  cast_sparse(Abstract, word, n)

  • The dimension of the sparse matrix with the original corpus: 130491 rows and 166150 columns

  • The dimension of the sparse matrix keeping only words that appear more than 100 times in the corpus: 130491 rows and 6284 columns

  • The dimension of the sparse matrix keeping only words that appear more than 500 times in the corpus: 130488 rows and 2604 columns
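
As a small illustration of the filter-then-cast step, the sketch below runs the same verbs on a toy corpus. The documents, words, and the threshold of 1 are made up for the example; only add_count(), count() and cast_sparse() are taken from the code above.

```r
library(dplyr)
library(tidytext)

# Toy one-token-per-row data (made up for illustration)
toy_tokens <- tibble(
  doc  = c("a", "a", "a", "b", "c", "d"),
  word = c("graph", "graph", "node", "graph", "node", "rare")
)

# Keep only words that occur more than once in the whole toy corpus
toy_filtered <- toy_tokens %>%
  add_count(word) %>%
  filter(n > 1) %>%
  select(-n)

# One row per document, one column per remaining word, counts as values
toy_sparse <- toy_filtered %>%
  count(doc, word) %>%
  cast_sparse(doc, word, n)

dim(toy_sparse)
```

Document "d" contains only the rare word, so it vanishes from the matrix once that word is filtered out. This is presumably why the matrix built with the 500 threshold has 130488 rows instead of 130491.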

3.2. Dimension reduction: train and evaluate topic models

Now we start training. Because we cannot know in advance how many topics our dataset contains, we fit models for several values of K and pick one based on their performance.

With 130491 documents, training took over 16 hours😆, so if you want to reproduce it, do not be in a hurry.

Furthermore, I’d like to give credit where it’s due to Julia Silge 🙏🏼 for her immense work and guidance.

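set.seed(2024)

# furrr runs on the future framework, so the parallel workers are set with plan()
plan(multisession, workers = max(1, availableCores() - 1))

# fit one structural topic model per candidate number of topics K
many_models <- tibble(K = c(20, 40, 50, 60, 70, 80, 100)) %>%
  mutate(topic_model = future_map(K, ~stm(arxiv_sparse, K = .,
                                          verbose = FALSE),
                                  .options = furrr_options(seed = TRUE)))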

We now have topic models with different K values. We can assess their performance and decide which K is the most appropriate for our data, based on metrics such as semantic coherence, exclusivity, residuals, and held-out likelihood.

# hold out part of the words in some documents for likelihood evaluation
heldout <- make.heldout(arxiv_sparse)

# one row per K; the list columns hold per-topic or per-model diagnostics
k_result <- many_models %>%
  mutate(exclusivity = map(topic_model, exclusivity),
         semantic_coherence = map(topic_model, semanticCoherence, arxiv_sparse),
         eval_heldout = map(topic_model, eval.heldout, heldout$missing),
         residual = map(topic_model, checkResiduals, arxiv_sparse),
         bound =  map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         # lower bound plus log(K!), which corrects for label switching
         lbound = bound + lfact,
         iterations = map_dbl(topic_model, function(x) length(x$convergence$bound)))

Diagnostic plots of these metrics help us choose the best K:

The residuals are lowest at K = 100, and the held-out likelihood is highest at K = 100 as well.
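
For readers who want to redraw such diagnostic plots, here is a minimal sketch. It assumes a table laid out like k_result, where checkResiduals() results carry a dispersion element and eval.heldout() results carry an expected.heldout element. To stay self-contained it uses a stand-in called toy_k_result (a hypothetical name) filled with invented placeholder numbers, which are not results from this analysis.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)

# Placeholder stand-in for k_result: same column layout, INVENTED numbers
toy_k_result <- tibble(
  K = c(20, 60, 100),
  lbound = c(-300, -280, -270),
  residual = list(list(dispersion = 3.0),
                  list(dispersion = 2.5),
                  list(dispersion = 2.0)),
  semantic_coherence = list(c(-60, -70), c(-75, -85), c(-90, -100)),
  eval_heldout = list(list(expected.heldout = -7.4),
                      list(expected.heldout = -7.2),
                      list(expected.heldout = -7.0))
)

# Collapse each list column to one number per model, then reshape to long form
diagnostics <- toy_k_result %>%
  transmute(K,
            `Lower bound` = lbound,
            Residuals = map_dbl(residual, "dispersion"),
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
  pivot_longer(-K, names_to = "Metric", values_to = "Value")

# One panel per metric, each with its own y scale
diagnostic_plot <- ggplot(diagnostics, aes(K, Value, color = Metric)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)", y = NULL)

print(diagnostic_plot)
```

With the real k_result in place of the stand-in, the same transmute() and ggplot() calls should apply unchanged.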

Semantic coherence tends to be traded off against exclusivity, so it is good practice to look for a balance between the two.

Let’s take our minimum, mean, and maximum K values (20, 60, and 100) for this comparison:

If we set a threshold of 9.8 on exclusivity, K = 100 appears to be the most appropriate choice here.
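
The comparison can also be summarised numerically. The sketch below assumes per-topic exclusivity and semantic coherence stored as list columns, as in k_result; toy_scores is a hypothetical stand-in and its numbers are invented placeholders, not values from this analysis.

```r
library(dplyr)
library(tidyr)

# Placeholder per-topic scores for two models (INVENTED numbers)
toy_scores <- tibble(
  K = c(20, 100),
  exclusivity = list(c(9.5, 9.7), c(9.8, 9.9)),
  semantic_coherence = list(c(-60, -80), c(-90, -110))
)

# One row per (model, topic)
per_topic <- toy_scores %>%
  unnest(c(exclusivity, semantic_coherence))

# Average scores per K and share of topics at or above the exclusivity threshold
summary_by_k <- per_topic %>%
  group_by(K) %>%
  summarise(mean_exclusivity = mean(exclusivity),
            mean_coherence = mean(semantic_coherence),
            share_above_threshold = mean(exclusivity >= 9.8))

print(summary_by_k)
```

A scatter plot of per_topic, with semantic coherence on one axis and exclusivity on the other, coloured by K, gives the visual version of the same comparison.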

However, because the first plot puts the best residuals and held-out likelihood at the largest K we tried, I may extend the grid to 120 and 140 and see how they perform. Maybe in another analysis. But for now, let’s stick to K = 100.

A topic model with 100 topics, 130491 documents and a 6284 word dictionary.
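
The post does not show how the final model is picked out of the results table. One way it could be done is sketched below, assuming the fitted models sit in a list column as in many_models and k_result. Here toy_models is a hypothetical stand-in in which plain strings take the place of the fitted stm objects.

```r
library(dplyr)

# Stand-in for k_result: strings take the place of fitted stm objects
toy_models <- tibble(
  K = c(20, 60, 100),
  topic_model = list("model with 20 topics",
                     "model with 60 topics",
                     "model with 100 topics")
)

# Keep the row for the chosen K and unwrap the single list element
topic_model_final <- toy_models %>%
  filter(K == 100) %>%
  pull(topic_model) %>%
  .[[1]]

print(topic_model_final)
```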

Let us have a look at our reduced data. Because we replaced the word columns with topics, we get a matrix where:

  • each row represents an abstract

  • each column represents a topic

  • each value is the probability (gamma) that a topic is found within a given abstract, so each row sums to 1.

# A tibble: 130,491 × 101
   document          `Topic 1` `Topic 2` `Topic 3` `Topic 4` `Topic 5` `Topic 6`
   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 "!-graphs provid…   0.00253   0.00227  0.00111    0.00738  0.0475    0.0332  
 2 "\" How well con…   0.00249   0.00331  0.00130    0.0145   0.00292   0.00646 
 3 "\" Yet another …   0.00241   0.00154  0.00114    0.0163   0.00265   0.00237 
 4 "\"Academia 2.0\…   0.00369   0.00147  0.00127    0.00377  0.00103   0.000324
 5 "\"Amplify and F…   0.0205    0.00127  0.00130    0.0113   0.00328   0.000501
 6 "\"Background su…   0.00361   0.00354  0.00185    0.0102   0.00224   0.00137 
 7 "\"Bibliometrics…   0.00270   0.00176  0.000860   0.00520  0.00111   0.00138 
 8 "\"Big Data is t…   0.00404   0.00187  0.00108    0.00305  0.00455   0.000274
 9 "\"Big data\" ha…   0.00209   0.00159  0.00106    0.00477  0.000911  0.00246 
10 "\"Citation clas…   0.00338   0.00197  0.000635   0.00313  0.00250   0.00176 
# ℹ 130,481 more rows
# ℹ 94 more variables: `Topic 7` <dbl>, `Topic 8` <dbl>, `Topic 9` <dbl>,
#   `Topic 10` <dbl>, `Topic 11` <dbl>, `Topic 12` <dbl>, `Topic 13` <dbl>,
#   `Topic 14` <dbl>, `Topic 15` <dbl>, `Topic 16` <dbl>, `Topic 17` <dbl>,
#   `Topic 18` <dbl>, `Topic 19` <dbl>, `Topic 20` <dbl>, `Topic 21` <dbl>,
#   `Topic 22` <dbl>, `Topic 23` <dbl>, `Topic 24` <dbl>, `Topic 25` <dbl>,
#   `Topic 26` <dbl>, `Topic 27` <dbl>, `Topic 28` <dbl>, `Topic 29` <dbl>, …

Now we have a standard tibble that we can easily manipulate as we wish.
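
A wide table of this shape can be derived from a long document-topic table with pivot_wider(). The sketch below is one possible way, not necessarily the one used for the output above; toy_gamma and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy long-format gamma table: one row per (document, topic), made-up values
toy_gamma <- tibble(
  document = rep(c("abstract A", "abstract B"), each = 3),
  topic = rep(1:3, times = 2),
  gamma = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6)
)

# Spread the topics into columns named "Topic 1", "Topic 2", ...
wide_gamma <- toy_gamma %>%
  mutate(topic = paste("Topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = gamma)

print(wide_gamma)
```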

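Before that, let’s feed our eyes with some graphs to see the words that compose each topic.

We start by extracting the beta and gamma matrices. Beta holds the probability of each term within each topic, and gamma the probability of each topic within each document: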

        topic     term         beta
        <int>   <char>        <num>
     1:     1 abstract 5.468170e-12
     2:     2 abstract 3.060442e-12
     3:     3 abstract 6.786207e-13
     4:     4 abstract 1.917353e-15
     5:     5 abstract 9.713773e-17
    ---                            
628396:    96    swipt 1.311521e-57
628397:    97    swipt 3.176655e-50
628398:    98    swipt 7.467630e-64
628399:    99    swipt 5.360673e-67
628400:   100    swipt 2.414906e-47

tidy_gamma <- tidy(topic_model_final,
                   matrix = "gamma",
                   document_names = rownames(arxiv_sparse))

We extract the top terms from the beta matrix:

topic                                                                                   terms
<int>                                                                                  <char>
  1:                          error, due, take, correct, crucial, modification, author, claim
  2:                                      path, long, short, note, frame, find, along, travel
  3:                  robust, hold, uncertainty, robustness, uncertain, swarm, artificial, ca
  4:              problem, solution, solve, constraint, optimal, optimization, consider, find
  5:                 condition, length, field, sufficient, family, necessary, generator, give
  6:                   graph, direct, subgraph, show, parameterize, induce, bipartite, result
  7:           network, link, topology, layer, connect, communication, connectivity, wireless
  8:                                               et, al, de, recently, work, se, la, recent
  9:                       set, measure, define, concept, order, represent, introduce, subset
 10:                        system, hybrid, paper, operate, base, present, dynamical, provide
 11: propose, parameter, filter, optimization, convergence, adaptive, stochastic, performance
 12:                         code, decode, block, construction, decoder, encode, use, propose
 13:                                   edge, vertex, numb, every, emph, color, connect, cycle
 14:                   information, context, side, mutual, available, source, can, additional
 15:                          node, route, delay, traffic, packet, sensor, protocol, wireless
 16:              interference, channel, user, transmitter, receiver, primary, secondary, csi
 17:           human, computer, world, machine, understand, intelligence, artificial, ability
 18:                         dataset, video, visual, question, temporal, answer, action, task
 19:               dynamic, change, behavior, individual, evolution, static, evolve, response
 20:                                     web, content, user, cache, video, page, site, stream
 21:                       implementation, parallel, scale, implement, use, core, large, fast
 22:            face, recognition, localization, variation, local, descriptor, person, facial
 23:                      impact, reference, science, scientific, study, paper, publish, bias
 24:                     function, value, loss, continuous, hash, sensitivity, integral, show
 25:                             strategy, game, equilibrium, player, nash, play, show, study
 26:                      matrix, kernel, vector, subspace, space, projection, norm, spectral
 27:                           analysis, use, tool, case, present, example, provide, describe
 28:            program, abstract, execution, functional, abstraction, language, use, support
 29:                                policy, article, name, request, serve, copy, reward, push
 30:          signal, sense, sparse, measurement, reconstruction, recovery, recover, sparsity
 31:                    network, neural, train, deep, convolutional, layer, architecture, cnn
 32:          student, team, volume, university, course, collaborative, collaboration, member
 33:                         state, initial, configuration, transition, free, art, phase, net
 34:                                      bound, bind, low, upper, numb, tight, optimal, case
 35:           estimate, estimation, noise, mean, observation, gaussian, unknown, statistical
 36:                      couple, spatial, cell, simulation, use, biological, material, brain
 37:          word, translation, task, sentence, representation, recurrent, embed, embeddings
 38:      structure, community, complex, structural, overlap, topological, underlie, identify
 39:                                           log, tree, omega, string, bit, size, use, give
 40:           property, standard, check, verification, verify, formal, specification, ensure
 41:               type, object, recursive, dependent, different, primitive, generic, session
 42:                            scheme, message, share, propose, broadcast, send, secret, key
 43:                      linear, vector, chain, markov, mix, integer, combination, quadratic
 44:                complexity, computational, reduction, show, hard, result, question, count
 45:                      knowledge, detection, text, document, detect, entity, extract, base
 46:              relay, transmission, transmit, power, cooperative, outage, optimal, propose
 47:             compute, polynomial, transform, equation, discrete, decomposition, use, real
 48:               environment, robot, plan, vehicle, trajectory, use, robotic, reinforcement
 49:                           logic, proof, theorem, calculus, prove, term, formula, logical
 50:     feature, classification, use, classifier, recognition, speech, performance, classify
 51:                            process, framework, level, event, unify, base, propose, allow
 52:         distribution, probability, random, sample, entropy, density, expect, conditional
 53:          technology, organization, business, product, patient, customer, health, company
 54:               quality, evaluation, project, study, effort, practice, result, methodology
 55:                      image, segmentation, use, propose, resolution, pixel, result, color
 56:           cluster, pattern, mine, hierarchical, discover, similarity, discovery, propose
 57:                     approach, technique, use, procedure, instance, heuristic, new, novel
 58:           sequence, correlation, compression, number, shift, period, randomness, protein
 59:      service, resource, cloud, application, compute, management, infrastructure, provide
 60:                class, finite, automaton, regular, show, infinite, characterization, word
 61:                                     map, mathcal, leave, alpha, mathbb, frac, let, right
 62:     semantic, reason, relation, interpretation, express, description, trace, equivalence
 63:                         device, mobile, location, use, sensor, monitor, application, can
 64:                     match, rank, pair, list, stable, competitive, preference, assignment
 65:                      language, natural, dependency, tag, use, grammar, parse, linguistic
 66:                  control, task, schedule, controller, stability, real-time, switch, mode
 67:                                 datum, amount, large, big, collect, source, real, record
 68:                      input, generate, component, output, produce, generation, stage, two
 69:                                    time, space, run, update, numb, stream, interval, can
 70:                          learn, label, task, train, representation, domain, machine, can
 71:                         user, security, privacy, key, secure, recommendation, use, trust
 72:                               store, digital, storage, access, file, library, write, use
 73:                               social, user, online, medium, study, much, people, twitter
 74:                             model, capture, can, use, fit, parameter, result, predictive
 75:                                  region, partition, lattice, mu, lambda, nest, two, fine
 76:                         power, energy, consumption, load, grid, circuit, propose, design
 77:                     user, spectrum, access, base, station, cellular, propose, allocation
 78:                        method, propose, use, base, prediction, accuracy, compare, result
 79:                     mechanism, price, utility, market, auction, allocation, good, budget
 80:                    channel, rate, capacity, source, gaussian, achievable, show, feedback
 81:         distribute, local, communication, global, failure, fault, consensus, reliability
 82:                         point, dimension, line, shape, geometric, curve, plane, boundary
 83:                  channel, frequency, antenna, performance, mimo, propose, signal, design
 84:                         agent, action, assumption, may, knowledge, can, evidence, belief
 85:                     weight, distance, metric, binary, sum, ensemble, generalize, average
 86:            computation, protocol, classical, quantum, communication, can, use, agreement
 87:                      size, variable, prove, result, threshold, show, strong, exponential
 88:                               object, track, map, scene, pose, detection, camera, motion
 89:      node, inference, probabilistic, bayesian, propagation, influence, spread, graphical
 90:                                      test, group, item, hypothesis, numb, case, bin, one
 91:                        search, query, database, index, retrieval, engine, result, answer
 92:                         decision, rule, make, choice, candidate, choose, aggregate, vote
 93:                        research, discuss, paper, work, application, focus, issue, recent
 94:                approximation, epsilon, factor, constant, approximate, give, ratio, delta
 95:                             algorithm, good, online, efficient, present, new, much, show
 96:                 attack, security, can, detection, adversary, detect, malicious, attacker
 97:                         performance, memory, cost, reduce, high, hardware, increase, can
 98:             design, software, development, engineer, architecture, module, source, build
 99:   theory, representation, notion, operation, domain, mathematical, construction, algebra
100:                                     new, paper, good, use, present, base, can, introduce
                                                                                        terms
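
A table like this can be built from the long beta table by keeping the highest-beta terms per topic and pasting them together. The sketch below is one way to do it with dplyr, on a made-up toy_beta table; it keeps the top 2 terms only to stay small, where the post keeps 8.

```r
library(dplyr)

# Toy long-format beta table: one row per (topic, term), made-up values
toy_beta <- tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term = c("graph", "vertex", "edge", "neural", "deep", "train"),
  beta = c(0.5, 0.2, 0.3, 0.6, 0.3, 0.1)
)

toy_top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 2, with_ties = FALSE) %>%   # top 2 here; the post keeps 8
  arrange(desc(beta), .by_group = TRUE) %>%       # highest beta first within a topic
  summarise(terms = paste(term, collapse = ", "))

print(toy_top_terms)
```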

We average gamma per topic and join the result with the top terms:

gamma_terms <- tidy_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
    topic = reorder(topic, gamma))

Then the plot:

The most prevalent topic (Topic 93, with a mean gamma of 0.032) is made up of generic research vocabulary: research, discuss, paper, work, application, focus, issue, recent.
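
A bar chart of topic prevalence can be sketched as follows. The five rows are copied from the top of the topic table in this post; the plot styling is an assumption and may differ from the original figure.

```r
library(dplyr)
library(ggplot2)

# The five most prevalent topics, copied from the table in this post
top_topics <- tibble(
  topic = c("Topic 93", "Topic 4", "Topic 95", "Topic 78", "Topic 57"),
  gamma = c(0.032, 0.023, 0.020, 0.020, 0.019),
  terms = c(
    "research, discuss, paper, work, application, focus, issue, recent",
    "problem, solution, solve, constraint, optimal, optimization, consider, find",
    "algorithm, good, online, efficient, present, new, much, show",
    "method, propose, use, base, prediction, accuracy, compare, result",
    "approach, technique, use, procedure, instance, heuristic, new, novel"
  )
) %>%
  mutate(topic = reorder(topic, gamma))   # order the bars by prevalence

# Horizontal bars with the top terms written next to each bar
prevalence_plot <- ggplot(top_topics, aes(topic, gamma, label = terms)) +
  geom_col() +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 0.09)) +
  labs(x = NULL, y = "Mean gamma (topic proportion)")

print(prevalence_plot)
```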

To label each topic, we could use its first 3 words.
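
As a minimal sketch of that labelling idea, the code below takes the first three words of the terms string for two topics from this post. The label column name and the " / " separator are choices made for the example.

```r
library(dplyr)
library(stringr)
library(purrr)

# Two topics and their top terms, copied from the table in this post
topics <- tibble(
  topic = c("Topic 93", "Topic 31"),
  terms = c("research, discuss, paper, work, application, focus, issue, recent",
            "network, neural, train, deep, convolutional, layer, architecture, cnn")
)

# Split each terms string on ", " and glue the first three words into a label
labelled <- topics %>%
  mutate(label = map_chr(str_split(terms, ", "),
                         ~ paste(head(.x, 3), collapse = " / ")))

print(labelled$label)
```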

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Topic proportion", "Top 8 words")) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(1, background = "white")

Topic Topic proportion Top 8 words
Topic 93 0.032 research, discuss, paper, work, application, focus, issue, recent
Topic 4 0.023 problem, solution, solve, constraint, optimal, optimization, consider, find
Topic 95 0.020 algorithm, good, online, efficient, present, new, much, show
Topic 78 0.020 method, propose, use, base, prediction, accuracy, compare, result
Topic 57 0.019 approach, technique, use, procedure, instance, heuristic, new, novel
Topic 70 0.018 learn, label, task, train, representation, domain, machine, can
Topic 27 0.017 analysis, use, tool, case, present, example, provide, describe
Topic 9 0.017 set, measure, define, concept, order, represent, introduce, subset
Topic 31 0.016 network, neural, train, deep, convolutional, layer, architecture, cnn
Topic 97 0.015 performance, memory, cost, reduce, high, hardware, increase, can
Topic 83 0.014 channel, frequency, antenna, performance, mimo, propose, signal, design
Topic 80 0.014 channel, rate, capacity, source, gaussian, achievable, show, feedback
Topic 59 0.014 service, resource, cloud, application, compute, management, infrastructure, provide
Topic 87 0.013 size, variable, prove, result, threshold, show, strong, exponential
Topic 54 0.013 quality, evaluation, project, study, effort, practice, result, methodology
Topic 99 0.013 theory, representation, notion, operation, domain, mathematical, construction, algebra
Topic 49 0.013 logic, proof, theorem, calculus, prove, term, formula, logical
Topic 50 0.013 feature, classification, use, classifier, recognition, speech, performance, classify
Topic 11 0.012 propose, parameter, filter, optimization, convergence, adaptive, stochastic, performance
Topic 74 0.012 model, capture, can, use, fit, parameter, result, predictive
Topic 60 0.012 class, finite, automaton, regular, show, infinite, characterization, word
Topic 73 0.012 social, user, online, medium, study, much, people, twitter
Topic 94 0.012 approximation, epsilon, factor, constant, approximate, give, ratio, delta
Topic 55 0.012 image, segmentation, use, propose, resolution, pixel, result, color
Topic 12 0.012 code, decode, block, construction, decoder, encode, use, propose
Topic 21 0.012 implementation, parallel, scale, implement, use, core, large, fast
Topic 34 0.012 bound, bind, low, upper, numb, tight, optimal, case
Topic 15 0.012 node, route, delay, traffic, packet, sensor, protocol, wireless
Topic 67 0.012 datum, amount, large, big, collect, source, real, record
Topic 82 0.011 point, dimension, line, shape, geometric, curve, plane, boundary
Topic 52 0.011 distribution, probability, random, sample, entropy, density, expect, conditional
Topic 88 0.011 object, track, map, scene, pose, detection, camera, motion
Topic 47 0.011 compute, polynomial, transform, equation, discrete, decomposition, use, real
Topic 44 0.011 complexity, computational, reduction, show, hard, result, question, count
Topic 77 0.011 user, spectrum, access, base, station, cellular, propose, allocation
Topic 28 0.011 program, abstract, execution, functional, abstraction, language, use, support
Topic 98 0.011 design, software, development, engineer, architecture, module, source, build
Topic 7 0.011 network, link, topology, layer, connect, communication, connectivity, wireless
Topic 69 0.011 time, space, run, update, numb, stream, interval, can
Topic 51 0.010 process, framework, level, event, unify, base, propose, allow
Topic 48 0.010 environment, robot, plan, vehicle, trajectory, use, robotic, reinforcement
Topic 13 0.010 edge, vertex, numb, every, emph, color, connect, cycle
Topic 53 0.010 technology, organization, business, product, patient, customer, health, company
Topic 10 0.010 system, hybrid, paper, operate, base, present, dynamical, provide
Topic 71 0.010 user, security, privacy, key, secure, recommendation, use, trust
Topic 35 0.010 estimate, estimation, noise, mean, observation, gaussian, unknown, statistical
Topic 45 0.010 knowledge, detection, text, document, detect, entity, extract, base
Topic 46 0.010 relay, transmission, transmit, power, cooperative, outage, optimal, propose
Topic 23 0.010 impact, reference, science, scientific, study, paper, publish, bias
Topic 84 0.010 agent, action, assumption, may, knowledge, can, evidence, belief
Topic 17 0.010 human, computer, world, machine, understand, intelligence, artificial, ability
Topic 19 0.010 dynamic, change, behavior, individual, evolution, static, evolve, response
Topic 63 0.010 device, mobile, location, use, sensor, monitor, application, can
Topic 6 0.010 graph, direct, subgraph, show, parameterize, induce, bipartite, result
Topic 62 0.010 semantic, reason, relation, interpretation, express, description, trace, equivalence
Topic 18 0.009 dataset, video, visual, question, temporal, answer, action, task
Topic 37 0.009 word, translation, task, sentence, representation, recurrent, embed, embeddings
Topic 30 0.009 signal, sense, sparse, measurement, reconstruction, recovery, recover, sparsity
Topic 76 0.009 power, energy, consumption, load, grid, circuit, propose, design
Topic 26 0.009 matrix, kernel, vector, subspace, space, projection, norm, spectral
Topic 16 0.009 interference, channel, user, transmitter, receiver, primary, secondary, csi
Topic 91 0.008 search, query, database, index, retrieval, engine, result, answer
Topic 40 0.008 property, standard, check, verification, verify, formal, specification, ensure
Topic 68 0.008 input, generate, component, output, produce, generation, stage, two
Topic 66 0.008 control, task, schedule, controller, stability, real-time, switch, mode
Topic 79 0.008 mechanism, price, utility, market, auction, allocation, good, budget
Topic 24 0.008 function, value, loss, continuous, hash, sensitivity, integral, show
Topic 81 0.008 distribute, local, communication, global, failure, fault, consensus, reliability
Topic 65 0.008 language, natural, dependency, tag, use, grammar, parse, linguistic
Topic 25 0.008 strategy, game, equilibrium, player, nash, play, show, study
Topic 92 0.008 decision, rule, make, choice, candidate, choose, aggregate, vote
Topic 36 0.007 couple, spatial, cell, simulation, use, biological, material, brain
Topic 42 0.007 scheme, message, share, propose, broadcast, send, secret, key
Topic 56 0.007 cluster, pattern, mine, hierarchical, discover, similarity, discovery, propose
Topic 96 0.007 attack, security, can, detection, adversary, detect, malicious, attacker
Topic 89 0.007 node, inference, probabilistic, bayesian, propagation, influence, spread, graphical
Topic 14 0.007 information, context, side, mutual, available, source, can, additional
Topic 39 0.007 log, tree, omega, string, bit, size, use, give
Topic 61 0.007 map, mathcal, leave, alpha, mathbb, frac, let, right
Topic 5 0.007 condition, length, field, sufficient, family, necessary, generator, give
Topic 32 0.006 student, team, volume, university, course, collaborative, collaboration, member
Topic 72 0.006 store, digital, storage, access, file, library, write, use
Topic 20 0.006 web, content, user, cache, video, page, site, stream
Topic 38 0.006 structure, community, complex, structural, overlap, topological, underlie, identify
Topic 85 0.006 weight, distance, metric, binary, sum, ensemble, generalize, average
Topic 22 0.006 face, recognition, localization, variation, local, descriptor, person, facial
Topic 43 0.006 linear, vector, chain, markov, mix, integer, combination, quadratic
Topic 33 0.006 state, initial, configuration, transition, free, art, phase, net
Topic 64 0.006 match, rank, pair, list, stable, competitive, preference, assignment
Topic 86 0.005 computation, protocol, classical, quantum, communication, can, use, agreement
Topic 90 0.005 test, group, item, hypothesis, numb, case, bin, one
Topic 58 0.005 sequence, correlation, compression, number, shift, period, randomness, protein
Topic 41 0.005 type, object, recursive, dependent, different, primitive, generic, session
Topic 1 0.005 error, due, take, correct, crucial, modification, author, claim
Topic 2 0.004 path, long, short, note, frame, find, along, travel
Topic 29 0.003 policy, article, name, request, serve, copy, reward, push
Topic 75 0.003 region, partition, lattice, mu, lambda, nest, two, fine
Topic 100 0.003 new, paper, good, use, present, base, can, introduce
Topic 8 0.003 et, al, de, recently, work, se, la, recent
Topic 3 0.003 robust, hold, uncertainty, robustness, uncertain, swarm, artificial, ca

4. Conclusion

Throughout this project, we worked with the text of abstracts published on ArXiv. We transformed the text into a document-term matrix and reduced its dimension from 6284 words to 100 topics with a structural topic model (the stm package).

Next time, we will attempt to cluster our topics and assess whether certain topics are more closely related to one another than others.

Until then, drink water and stay safe🖖🏼.