respapers_topics

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model was built to demonstrate the use of a representation model inspired by KeyBERT within BERTopic.

This model was trained on ~30,000 research paper abstracts with the KeyBERTInspired representation method (bertopic.representation). The dataset was downloaded from Kaggle, and its two subsets (test and train) were merged into a single dataset.

To access the complete code, you can visit this tutorial on my GitHub page: ResPapers

Usage

To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("CCatalao/respapers_topics")

topic_model.get_topic_info()
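
The loaded model can also be used to assign topics to abstracts it has not seen. The sketch below is an illustration, not part of the original training code: `label_new_documents` is a hypothetical helper, and it assumes the saved model is able to embed new text (if it is not, an embedding model may need to be supplied through the `embedding_model` argument of `BERTopic.load`). It relies on `transform`, which returns predicted topic IDs and probabilities, and on the `Topic` and `Name` columns of `get_topic_info()`.

```python
def label_new_documents(topic_model, documents):
    """Return (document, topic_id, topic_name) for each new document.

    `topic_model` is expected to behave like a loaded BERTopic model:
    `transform` yields (topic_ids, probabilities) and `get_topic_info`
    yields a table with "Topic" and "Name" columns.
    """
    topics, _probabilities = topic_model.transform(documents)
    info = topic_model.get_topic_info()
    # Map each topic ID to its label, e.g. 0 -> "0_spin_magnetic_phase_quantum".
    id_to_name = {int(tid): name for tid, name in zip(info["Topic"], info["Name"])}
    return [
        (doc, int(tid), id_to_name.get(int(tid), "unknown"))
        for doc, tid in zip(documents, topics)
    ]
```

With `topic_model` loaded as shown above, `label_new_documents(topic_model, ["An abstract about spin glasses ..."])` would return one tuple per abstract. A topic ID of -1 marks a document that was treated as an outlier.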

To view the KeyBERT-inspired topic representation alongside the default one, use the following:

>>> topic_model.get_topic(0, full=True)
{'Main': [['spin', 0.01852648864225281],
  ['magnetic', 0.015019436257929909],
  ['phase', 0.013081733986038124],
  ['quantum', 0.012942253723133639],
  ['temperature', 0.012591407440537158],
  ['states', 0.011025582290837643],
  ['field', 0.010954775154251296],
  ['electron', 0.010168708734803916],
  ['transition', 0.009728560280580357],
  ['energy', 0.00937042795113575]],
 'KeyBERTInspired': [['quantum', 0.4072583317756653],
  ['phase transition', 0.35542067885398865],
  ['lattice', 0.34462833404541016],
  ['spin', 0.3268473744392395],
  ['magnetic', 0.3024371564388275],
  ['magnetization', 0.2868726849555969],
  ['phases', 0.27178525924682617],
  ['fermi', 0.26290175318717957],
  ['electron', 0.25709500908851624],
  ['phase', 0.23375216126441956]]}
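
To see what the KeyBERT-inspired representation changes, the two keyword lists can be compared directly. The following minimal sketch hard-codes the output shown above for topic 0 (scores rounded to four decimals) rather than loading the model; the helper name `keywords` is only illustrative. Note that the two score columns are on different scales, so only the keywords and their ranks are compared here.

```python
# Output of topic_model.get_topic(0, full=True), copied from above
# with scores rounded to four decimals.
topic_0 = {
    "Main": [
        ["spin", 0.0185], ["magnetic", 0.0150], ["phase", 0.0131],
        ["quantum", 0.0129], ["temperature", 0.0126], ["states", 0.0110],
        ["field", 0.0110], ["electron", 0.0102], ["transition", 0.0097],
        ["energy", 0.0094],
    ],
    "KeyBERTInspired": [
        ["quantum", 0.4073], ["phase transition", 0.3554], ["lattice", 0.3446],
        ["spin", 0.3268], ["magnetic", 0.3024], ["magnetization", 0.2869],
        ["phases", 0.2718], ["fermi", 0.2629], ["electron", 0.2571],
        ["phase", 0.2338],
    ],
}


def keywords(representation):
    """Drop the scores and keep the keywords in ranked order."""
    return [word for word, _score in representation]


main_words = keywords(topic_0["Main"])
keybert_words = keywords(topic_0["KeyBERTInspired"])

# Keywords present in both representations, in the order of the main one.
shared = [w for w in main_words if w in keybert_words]
# Keywords that only the KeyBERT-inspired representation surfaces.
only_keybert = [w for w in keybert_words if w not in main_words]

print("shared:", shared)
print("only in KeyBERTInspired:", only_keybert)
```

For this topic, half of the keywords are shared, and the KeyBERT-inspired list replaces generic terms such as "energy" or "field" with more specific ones such as "phase transition" and "lattice".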

Topic overview

  • Number of topics: 112
  • Number of training documents: 29961
Overview of all topics:

| Topic ID | Topic Keywords | Topic Frequency | Label |
|---|---|---|---|
| -1 | data - model - paper - time - based | 20 | -1_data_model_paper_time |
| 0 | spin - magnetic - phase - quantum - temperature | 12937 | 0_spin_magnetic_phase_quantum |
| 1 | mass - star - stars - 10 - stellar | 3048 | 1_mass_star_stars_10 |
| 2 | reinforcement - reinforcement learning - learning - policy - robot | 2564 | 2_reinforcement_reinforcement learning_learning_policy |
| 3 | logic - semantics - programs - automata - languages | 556 | 3_logic_semantics_programs_automata |
| 4 | neural - networks - neural networks - deep - training | 478 | 4_neural_networks_neural networks_deep |
| 5 | networks - community - network - social - nodes | 405 | 5_networks_community_network_social |
| 6 | word - translation - language - words - sentence | 340 | 6_word_translation_language_words |
| 7 | object - 3d - camera - pose - localization | 298 | 7_object_3d_camera_pose |
| 8 | classification - label - classifier - learning - classifiers | 294 | 8_classification_label_classifier_learning |
| 9 | convex - gradient - stochastic - convergence - optimization | 287 | 9_convex_gradient_stochastic_convergence |
| 10 | graphs - graph - vertices - vertex - edge | 284 | 10_graphs_graph_vertices_vertex |
| 11 | brain - neurons - connectivity - neural - synaptic | 273 | 11_brain_neurons_connectivity_neural |
| 12 | robots - robot - planning - control - motion | 255 | 12_robots_robot_planning_control |
| 13 | prime - numbers - polynomials - integers - zeta | 245 | 13_prime_numbers_polynomials_integers |
| 14 | tensor - rank - matrix - low rank - pca | 226 | 14_tensor_rank_matrix_low rank |
| 15 | power - energy - grid - renewable - load | 222 | 15_power_energy_grid_renewable |
| 16 | channel - power - mimo - interference - wireless | 219 | 16_channel_power_mimo_interference |
| 17 | adversarial - attacks - adversarial examples - attack - examples | 208 | 17_adversarial_attacks_adversarial examples_attack |
| 18 | gan - gans - generative - generative adversarial - adversarial | 200 | 18_gan_gans_generative_generative adversarial |
| 19 | media - social - twitter - users - social media | 196 | 19_media_social_twitter_users |
| 20 | posterior - monte - monte carlo - carlo - bayesian | 190 | 20_posterior_monte_monte carlo_carlo |
| 21 | estimator - estimators - regression - quantile - estimation | 189 | 21_estimator_estimators_regression_quantile |
| 22 | software - code - developers - projects - development | 178 | 22_software_code_developers_projects |
| 23 | regret - bandit - armed - arm - multi armed | 177 | 23_regret_bandit_armed_arm |
| 24 | omega - mathbb - solutions - boundary - equation | 177 | 24_omega_mathbb_solutions_boundary |
| 25 | numerical - scheme - mesh - method - order | 175 | 25_numerical_scheme_mesh_method |
| 26 | causal - treatment - outcome - effects - causal inference | 174 | 26_causal_treatment_outcome_effects |
| 27 | curvature - mean curvature - riemannian - ricci - metric | 164 | 27_curvature_mean curvature_riemannian_ricci |
| 28 | control - distributed - systems - consensus - agents | 156 | 28_control_distributed_systems_consensus |
| 29 | groups - group - subgroup - subgroups - finite | 153 | 29_groups_group_subgroup_subgroups |
| 30 | segmentation - images - image - convolutional - medical | 148 | 30_segmentation_images_image_convolutional |
| 31 | market - portfolio - asset - price - volatility | 144 | 31_market_portfolio_asset_price |
| 32 | recommendation - user - item - items - recommender | 138 | 32_recommendation_user_item_items |
| 33 | algebra - algebras - lie - mathfrak - modules | 131 | 33_algebra_algebras_lie_mathfrak |
| 34 | quantum - classical - circuits - annealing - circuit | 121 | 34_quantum_classical_circuits_annealing |
| 35 | moduli - varieties - projective - curves - bundles | 119 | 35_moduli_varieties_projective_curves |
| 36 | graph - embedding - node - graphs - network | 117 | 36_graph_embedding_node_graphs |
| 37 | codes - decoding - channel - code - capacity | 113 | 37_codes_decoding_channel_code |
| 38 | sparse - signal - recovery - sensing - measurements | 107 | 38_sparse_signal_recovery_sensing |
| 39 | knot - knots - homology - invariants - link | 103 | 39_knot_knots_homology_invariants |
| 40 | spaces - hardy - operators - mathbb - boundedness | 95 | 40_spaces_hardy_operators_mathbb |
| 41 | blockchain - security - privacy - authentication - encryption | 90 | 41_blockchain_security_privacy_authentication |
| 42 | turbulence - turbulent - flow - flows - reynolds | 89 | 42_turbulence_turbulent_flow_flows |
| 43 | privacy - differential privacy - private - differential - data | 86 | 43_privacy_differential privacy_private_differential |
| 44 | epidemic - disease - infection - infected - infectious | 83 | 44_epidemic_disease_infection_infected |
| 45 | citation - scientific - research - journal - papers | 82 | 45_citation_scientific_research_journal |
| 46 | surface - droplet - fluid - liquid - droplets | 81 | 46_surface_droplet_fluid_liquid |
| 47 | chemical - molecules - molecular - protein - learning | 79 | 47_chemical_molecules_molecular_protein |
| 48 | kähler - manifolds - manifold - complex - metrics | 77 | 48_kähler_manifolds_manifold_complex |
| 49 | games - game - players - nash - player | 74 | 49_games_game_players_nash |
| 50 | patients - patient - clinical - ehr - care | 73 | 50_patients_patient_clinical_ehr |
| 51 | music - musical - audio - chord - note | 70 | 51_music_musical_audio_chord |
| 52 | visual - shot - image - cnns - learning | 70 | 52_visual_shot_image_cnns |
| 53 | speaker - speech - end - recognition - speech recognition | 70 | 53_speaker_speech_end_recognition |
| 54 | cell - cells - tissue - active - tumor | 69 | 54_cell_cells_tissue_active |
| 55 | eeg - brain - signals - sleep - subjects | 69 | 55_eeg_brain_signals_sleep |
| 56 | fairness - fair - discrimination - decision - algorithmic | 67 | 56_fairness_fair_discrimination_decision |
| 57 | clustering - clusters - data - based clustering - cluster | 66 | 57_clustering_clusters_data_based clustering |
| 58 | relativity - black - solutions - einstein - spacetime | 65 | 58_relativity_black_solutions_einstein |
| 59 | mathbb - curves - elliptic - conjecture - fields | 62 | 59_mathbb_curves_elliptic_conjecture |
| 60 | stokes - navier - navier stokes - equations - stokes equations | 61 | 60_stokes_navier_navier stokes_equations |
| 61 | species - population - dispersal - ecosystem - populations | 60 | 61_species_population_dispersal_ecosystem |
| 62 | reconstruction - ct - artifacts - image - images | 58 | 62_reconstruction_ct_artifacts_image |
| 63 | algebra - algebras - mathcal - alpha - crossed | 58 | 63_algebra_algebras_mathcal_alpha |
| 64 | tiling - polytopes - set - polygon - polytope | 58 | 64_tiling_polytopes_set_polygon |
| 65 | mobile - video - network - latency - computing | 57 | 65_mobile_video_network_latency |
| 66 | latent - variational - vae - generative - inference | 55 | 66_latent_variational_vae_generative |
| 67 | players - game - team - player - teams | 54 | 67_players_game_team_player |
| 68 | genes - gene - cancer - expression - sequencing | 53 | 68_genes_gene_cancer_expression |
| 69 | forcing - kappa - definable - cardinal - zfc | 51 | 69_forcing_kappa_definable_cardinal |
| 70 | dna - protein - folding - proteins - molecule | 50 | 70_dna_protein_folding_proteins |
| 71 | spaces - space - metric - metric spaces - topology | 49 | 71_spaces_space_metric_metric spaces |
| 72 | speech - separation - source separation - enhancement - speaker | 49 | 72_speech_separation_source separation_enhancement |
| 73 | imaging - resolution - light - diffraction - phase | 47 | 73_imaging_resolution_light_diffraction |
| 74 | traffic - traffic flow - prediction - temporal - transportation | 46 | 74_traffic_traffic flow_prediction_temporal |
| 75 | climate - precipitation - sea - flood - extreme | 45 | 75_climate_precipitation_sea_flood |
| 76 | audio - sound - event detection - event - bird | 43 | 76_audio_sound_event detection_event |
| 77 | memory - storage - cache - performance - write | 40 | 77_memory_storage_cache_performance |
| 78 | wishart - matrices - eigenvalue - free - smallest | 39 | 78_wishart_matrices_eigenvalue_free |
| 79 | domain - domain adaptation - adaptation - transfer - target | 39 | 79_domain_domain adaptation_adaptation_transfer |
| 80 | glass - glasses - glassy - amorphous - liquids | 39 | 80_glass_glasses_glassy_amorphous |
| 81 | gpu - gpus - nvidia - code - performance | 38 | 81_gpu_gpus_nvidia_code |
| 82 | face - face recognition - facial - recognition - faces | 38 | 82_face_face recognition_facial_recognition |
| 83 | stock - market - price - financial - stocks | 37 | 83_stock_market_price_financial |
| 84 | reaction - flux - metabolic - growth - biochemical | 34 | 84_reaction_flux_metabolic_growth |
| 85 | fleet - routing - vehicles - ride - traffic | 34 | 85_fleet_routing_vehicles_ride |
| 86 | cooperation - evolutionary - game - social - payoff | 33 | 86_cooperation_evolutionary_game_social |
| 87 | students - courses - student - course - education | 33 | 87_students_courses_student_course |
| 88 | action - temporal - video - recognition - videos | 33 | 88_action_temporal_video_recognition |
| 89 | irreducible - group - mathcal - representations - let | 32 | 89_irreducible_group_mathcal_representations |
| 90 | phylogenetic - tree - trees - species - gene | 32 | 90_phylogenetic_tree_trees_species |
| 91 | processes - drift - asymptotic - estimators - stationary | 31 | 91_processes_drift_asymptotic_estimators |
| 92 | wave - waves - water - free surface - shallow water | 30 | 92_wave_waves_water_free surface |
| 93 | distributed - gradient - byzantine - communication - sgd | 30 | 93_distributed_gradient_byzantine_communication |
| 94 | voters - voting - election - voter - winner | 30 | 94_voters_voting_election_voter |
| 95 | gaussian process - gaussian - gp - process - gaussian processes | 30 | 95_gaussian process_gaussian_gp_process |
| 96 | mathfrak - gorenstein - ring - rings - modules | 29 | 96_mathfrak_gorenstein_ring_rings |
| 97 | motivic - gw - cohomology - dm - category | 29 | 97_motivic_gw_cohomology_dm |
| 98 | recurrent - lstm - rnn - recurrent neural - memory | 28 | 98_recurrent_lstm_rnn_recurrent neural |
| 99 | semigroup - semigroups - xy - ordered - pt | 27 | 99_semigroup_semigroups_xy_ordered |
| 100 | robot - robots - human - human robot - children | 25 | 100_robot_robots_human_human robot |
| 101 | categories - category - homotopy - functor - grothendieck | 25 | 101_categories_category_homotopy_functor |
| 102 | queue - queues - server - scheduling - customer | 24 | 102_queue_queues_server_scheduling |
| 103 | topic - topics - topic modeling - lda - documents | 24 | 103_topic_topics_topic modeling_lda |
| 104 | synchronization - oscillators - chimera - coupling - coupled | 24 | 104_synchronization_oscillators_chimera_coupling |
| 105 | stochastic - existence - equation - solutions - uniqueness | 24 | 105_stochastic_existence_equation_solutions |
| 106 | fractional - derivative - derivatives - integral - psi | 23 | 106_fractional_derivative_derivatives_integral |
| 107 | lasso - regression - estimator - estimators - bootstrap | 23 | 107_lasso_regression_estimator_estimators |
| 108 | soil - moisture - machine - resolution - seismic | 22 | 108_soil_moisture_machine_resolution |
| 109 | bayesian optimization - optimization - acquisition - bayesian - bo | 21 | 109_bayesian optimization_optimization_acquisition_bayesian |
| 110 | urban - city - mobility - cities - social | 21 | 110_urban_city_mobility_cities |
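
The topic frequencies are strongly skewed towards a few large topics. As a quick check, the sketch below computes the share of the training documents held by the first few topics. The counts are copied from the overview above, and the variable and function names are only illustrative.

```python
TRAINING_DOCS = 29961  # number of training documents reported above

# Frequencies of the outlier topic and the first four topics,
# copied from the topic overview.
topic_counts = {
    "-1_data_model_paper_time": 20,
    "0_spin_magnetic_phase_quantum": 12937,
    "1_mass_star_stars_10": 3048,
    "2_reinforcement_reinforcement learning_learning_policy": 2564,
    "3_logic_semantics_programs_automata": 556,
}


def share(count, total=TRAINING_DOCS):
    """Fraction of the training documents assigned to a topic."""
    return count / total


shares = {label: share(count) for label, count in topic_counts.items()}

# Combined share of the three largest topics.
top3_share = sum(sorted(topic_counts.values(), reverse=True)[:3]) / TRAINING_DOCS

for label, value in shares.items():
    print(f"{label}: {value:.1%}")
print(f"three largest topics combined: {top3_share:.1%}")
```

Topic 0 alone accounts for roughly 43% of the abstracts, and the three largest topics together for a little over 60%, so the remaining topics are comparatively small and specialised.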

Training Procedure

The model was trained as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Representation models
representation_models = {"KeyBERTInspired": KeyBERTInspired()}

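# Fit BERTopic on `docs`, the list of abstract strings (one per paper).
# `embedding_model` defined above is not passed below; pass
# `embedding_model=embedding_model` if the mpnet embeddings are intended.
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    min_topic_size=10,
    n_gram_range=(1, 1),
    nr_topics=None,
    seed_topic_list=None,
    top_n_words=10,
    calculate_probabilities=False,
    language=None,
    verbose=True,
).fit(docs)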
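
Two of the arguments above do not take effect when custom sub-models are supplied: according to the BERTopic documentation, `n_gram_range` is not used when a `CountVectorizer` is passed in, and `min_topic_size` is not used when an `hdbscan_model` is passed in. The n-gram range that applies is therefore the vectorizer's `ngram_range=(1, 3)`, which is why multi-word keywords such as "phase transition" appear in the topics. The sketch below is a simplified pure-Python stand-in for that setting, not scikit-learn's implementation (it skips tokenisation, stop-word removal, and `min_df`); `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """List all word n-grams of `tokens` with lengths in `ngram_range` (inclusive)."""
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["quantum", "phase", "transition"]

# Single words only, as n_gram_range=(1, 1) would give.
unigrams = word_ngrams(tokens, (1, 1))
# One- to three-word terms, as the vectorizer's ngram_range=(1, 3) allows.
up_to_trigrams = word_ngrams(tokens, (1, 3))

print(unigrams)
print(up_to_trigrams)
```

With the (1, 3) range, "phase transition" and "quantum phase transition" become candidate keywords in addition to the single words.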

Training hyperparameters

  • calculate_probabilities: False
  • language: None
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: None
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: True

Framework versions

  • Numpy: 1.22.4
  • HDBSCAN: 0.8.33
  • UMAP: 0.5.3
  • Pandas: 1.5.3
  • Scikit-Learn: 1.2.2
  • Sentence-transformers: 2.2.2
  • Transformers: 4.29.2
  • Numba: 0.56.4
  • Plotly: 5.13.1
  • Python: 3.10.11