respapers_topics

This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

This pre-trained model was built to demonstrate the use of a KeyBERT-inspired representation model within BERTopic.

This model was trained on ~30,000 research paper abstracts with the KeyBERTInspired representation method (bertopic.representation). The dataset was downloaded from Kaggle, and its two subsets (train and test) were merged into a single dataset.

To access the complete code, you can visit this tutorial on my GitHub page: ResPapers

Usage

To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("CCatalao/respapers_topics")

topic_model.get_topic_info()
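
Once the model is loaded, it can also be used to assign topics to new abstracts. The helper below is a minimal sketch, not part of the original tutorial: `assign_topics` is a hypothetical name, and it assumes the loaded model exposes BERTopic's `transform` method and `topic_labels_` attribute. How well `transform` works on this particular checkpoint has not been verified here.

```python
from typing import Any, List, Sequence, Tuple


def assign_topics(topic_model: Any, abstracts: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (topic_id, label) for each abstract using an already loaded model.

    `assign_topics` is an illustrative helper, not a BERTopic function.
    """
    # BERTopic.transform returns the predicted topic ids and their probabilities.
    topics, _ = topic_model.transform(list(abstracts))
    # topic_labels_ maps a topic id to its default label,
    # e.g. 0 -> "0_spin_magnetic_phase_quantum".
    labels = topic_model.topic_labels_
    return [(int(t), labels.get(int(t), "unknown")) for t in topics]
```

With the model loaded as shown above, a call would look like `assign_topics(topic_model, ["We study the magnetic phase diagram of ..."])`. Topic -1 collects documents that were not assigned to any cluster.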

To view the KeyBERT-inspired topic representation, use the following:

>>> topic_model.get_topic(0, full=True)
{'Main': [['spin', 0.01852648864225281],
  ['magnetic', 0.015019436257929909],
  ['phase', 0.013081733986038124],
  ['quantum', 0.012942253723133639],
  ['temperature', 0.012591407440537158],
  ['states', 0.011025582290837643],
  ['field', 0.010954775154251296],
  ['electron', 0.010168708734803916],
  ['transition', 0.009728560280580357],
  ['energy', 0.00937042795113575]],
 'KeyBERTInspired': [['quantum', 0.4072583317756653],
  ['phase transition', 0.35542067885398865],
  ['lattice', 0.34462833404541016],
  ['spin', 0.3268473744392395],
  ['magnetic', 0.3024371564388275],
  ['magnetization', 0.2868726849555969],
  ['phases', 0.27178525924682617],
  ['fermi', 0.26290175318717957],
  ['electron', 0.25709500908851624],
  ['phase', 0.23375216126441956]]}
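
To see what the KeyBERTInspired representation changes, it can help to compare the two keyword lists directly. The sketch below copies the topic 0 output shown above (scores rounded to four decimals) and uses a hypothetical helper, `compare_representations`. In BERTopic the "Main" scores are c-TF-IDF weights while the KeyBERTInspired scores are embedding similarities, so only the keywords are compared, not the raw numbers.

```python
# Topic 0 keywords copied from the output above (scores rounded).
topic_0 = {
    "Main": [
        ("spin", 0.0185), ("magnetic", 0.0150), ("phase", 0.0131),
        ("quantum", 0.0129), ("temperature", 0.0126), ("states", 0.0110),
        ("field", 0.0110), ("electron", 0.0102), ("transition", 0.0097),
        ("energy", 0.0094),
    ],
    "KeyBERTInspired": [
        ("quantum", 0.4073), ("phase transition", 0.3554), ("lattice", 0.3446),
        ("spin", 0.3268), ("magnetic", 0.3024), ("magnetization", 0.2869),
        ("phases", 0.2718), ("fermi", 0.2629), ("electron", 0.2571),
        ("phase", 0.2338),
    ],
}


def compare_representations(representations, first="Main", second="KeyBERTInspired"):
    """Return (keywords in both lists, keywords found only in the second list)."""
    first_words = [word for word, _ in representations[first]]
    second_words = [word for word, _ in representations[second]]
    shared = [word for word in first_words if word in second_words]
    only_second = [word for word in second_words if word not in first_words]
    return shared, only_second


shared, only_keybert = compare_representations(topic_0)
print("Shared keywords:", shared)
print("Only in KeyBERTInspired:", only_keybert)
```

The same function accepts the dictionary returned by `topic_model.get_topic(0, full=True)`, since it only relies on each entry being a (word, score) pair.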

Topic overview

Number of topics: 112
Number of training documents: 29961

Overview of all topics:

Topic ID	Topic Keywords	Topic Frequency	Label
-1	data - model - paper - time - based	20	-1_data_model_paper_time
0	spin - magnetic - phase - quantum - temperature	12937	0_spin_magnetic_phase_quantum
1	mass - star - stars - 10 - stellar	3048	1_mass_star_stars_10
2	reinforcement - reinforcement learning - learning - policy - robot	2564	2_reinforcement_reinforcement learning_learning_policy
3	logic - semantics - programs - automata - languages	556	3_logic_semantics_programs_automata
4	neural - networks - neural networks - deep - training	478	4_neural_networks_neural networks_deep
5	networks - community - network - social - nodes	405	5_networks_community_network_social
6	word - translation - language - words - sentence	340	6_word_translation_language_words
7	object - 3d - camera - pose - localization	298	7_object_3d_camera_pose
8	classification - label - classifier - learning - classifiers	294	8_classification_label_classifier_learning
9	convex - gradient - stochastic - convergence - optimization	287	9_convex_gradient_stochastic_convergence
10	graphs - graph - vertices - vertex - edge	284	10_graphs_graph_vertices_vertex
11	brain - neurons - connectivity - neural - synaptic	273	11_brain_neurons_connectivity_neural
12	robots - robot - planning - control - motion	255	12_robots_robot_planning_control
13	prime - numbers - polynomials - integers - zeta	245	13_prime_numbers_polynomials_integers
14	tensor - rank - matrix - low rank - pca	226	14_tensor_rank_matrix_low rank
15	power - energy - grid - renewable - load	222	15_power_energy_grid_renewable
16	channel - power - mimo - interference - wireless	219	16_channel_power_mimo_interference
17	adversarial - attacks - adversarial examples - attack - examples	208	17_adversarial_attacks_adversarial examples_attack
18	gan - gans - generative - generative adversarial - adversarial	200	18_gan_gans_generative_generative adversarial
19	media - social - twitter - users - social media	196	19_media_social_twitter_users
20	posterior - monte - monte carlo - carlo - bayesian	190	20_posterior_monte_monte carlo_carlo
21	estimator - estimators - regression - quantile - estimation	189	21_estimator_estimators_regression_quantile
22	software - code - developers - projects - development	178	22_software_code_developers_projects
23	regret - bandit - armed - arm - multi armed	177	23_regret_bandit_armed_arm
24	omega - mathbb - solutions - boundary - equation	177	24_omega_mathbb_solutions_boundary
25	numerical - scheme - mesh - method - order	175	25_numerical_scheme_mesh_method
26	causal - treatment - outcome - effects - causal inference	174	26_causal_treatment_outcome_effects
27	curvature - mean curvature - riemannian - ricci - metric	164	27_curvature_mean curvature_riemannian_ricci
28	control - distributed - systems - consensus - agents	156	28_control_distributed_systems_consensus
29	groups - group - subgroup - subgroups - finite	153	29_groups_group_subgroup_subgroups
30	segmentation - images - image - convolutional - medical	148	30_segmentation_images_image_convolutional
31	market - portfolio - asset - price - volatility	144	31_market_portfolio_asset_price
32	recommendation - user - item - items - recommender	138	32_recommendation_user_item_items
33	algebra - algebras - lie - mathfrak - modules	131	33_algebra_algebras_lie_mathfrak
34	quantum - classical - circuits - annealing - circuit	121	34_quantum_classical_circuits_annealing
35	moduli - varieties - projective - curves - bundles	119	35_moduli_varieties_projective_curves
36	graph - embedding - node - graphs - network	117	36_graph_embedding_node_graphs
37	codes - decoding - channel - code - capacity	113	37_codes_decoding_channel_code
38	sparse - signal - recovery - sensing - measurements	107	38_sparse_signal_recovery_sensing
39	knot - knots - homology - invariants - link	103	39_knot_knots_homology_invariants
40	spaces - hardy - operators - mathbb - boundedness	95	40_spaces_hardy_operators_mathbb
41	blockchain - security - privacy - authentication - encryption	90	41_blockchain_security_privacy_authentication
42	turbulence - turbulent - flow - flows - reynolds	89	42_turbulence_turbulent_flow_flows
43	privacy - differential privacy - private - differential - data	86	43_privacy_differential privacy_private_differential
44	epidemic - disease - infection - infected - infectious	83	44_epidemic_disease_infection_infected
45	citation - scientific - research - journal - papers	82	45_citation_scientific_research_journal
46	surface - droplet - fluid - liquid - droplets	81	46_surface_droplet_fluid_liquid
47	chemical - molecules - molecular - protein - learning	79	47_chemical_molecules_molecular_protein
48	kähler - manifolds - manifold - complex - metrics	77	48_kähler_manifolds_manifold_complex
49	games - game - players - nash - player	74	49_games_game_players_nash
50	patients - patient - clinical - ehr - care	73	50_patients_patient_clinical_ehr
51	music - musical - audio - chord - note	70	51_music_musical_audio_chord
52	visual - shot - image - cnns - learning	70	52_visual_shot_image_cnns
53	speaker - speech - end - recognition - speech recognition	70	53_speaker_speech_end_recognition
54	cell - cells - tissue - active - tumor	69	54_cell_cells_tissue_active
55	eeg - brain - signals - sleep - subjects	69	55_eeg_brain_signals_sleep
56	fairness - fair - discrimination - decision - algorithmic	67	56_fairness_fair_discrimination_decision
57	clustering - clusters - data - based clustering - cluster	66	57_clustering_clusters_data_based clustering
58	relativity - black - solutions - einstein - spacetime	65	58_relativity_black_solutions_einstein
59	mathbb - curves - elliptic - conjecture - fields	62	59_mathbb_curves_elliptic_conjecture
60	stokes - navier - navier stokes - equations - stokes equations	61	60_stokes_navier_navier stokes_equations
61	species - population - dispersal - ecosystem - populations	60	61_species_population_dispersal_ecosystem
62	reconstruction - ct - artifacts - image - images	58	62_reconstruction_ct_artifacts_image
63	algebra - algebras - mathcal - alpha - crossed	58	63_algebra_algebras_mathcal_alpha
64	tiling - polytopes - set - polygon - polytope	58	64_tiling_polytopes_set_polygon
65	mobile - video - network - latency - computing	57	65_mobile_video_network_latency
66	latent - variational - vae - generative - inference	55	66_latent_variational_vae_generative
67	players - game - team - player - teams	54	67_players_game_team_player
68	genes - gene - cancer - expression - sequencing	53	68_genes_gene_cancer_expression
69	forcing - kappa - definable - cardinal - zfc	51	69_forcing_kappa_definable_cardinal
70	dna - protein - folding - proteins - molecule	50	70_dna_protein_folding_proteins
71	spaces - space - metric - metric spaces - topology	49	71_spaces_space_metric_metric spaces
72	speech - separation - source separation - enhancement - speaker	49	72_speech_separation_source separation_enhancement
73	imaging - resolution - light - diffraction - phase	47	73_imaging_resolution_light_diffraction
74	traffic - traffic flow - prediction - temporal - transportation	46	74_traffic_traffic flow_prediction_temporal
75	climate - precipitation - sea - flood - extreme	45	75_climate_precipitation_sea_flood
76	audio - sound - event detection - event - bird	43	76_audio_sound_event detection_event
77	memory - storage - cache - performance - write	40	77_memory_storage_cache_performance
78	wishart - matrices - eigenvalue - free - smallest	39	78_wishart_matrices_eigenvalue_free
79	domain - domain adaptation - adaptation - transfer - target	39	79_domain_domain adaptation_adaptation_transfer
80	glass - glasses - glassy - amorphous - liquids	39	80_glass_glasses_glassy_amorphous
81	gpu - gpus - nvidia - code - performance	38	81_gpu_gpus_nvidia_code
82	face - face recognition - facial - recognition - faces	38	82_face_face recognition_facial_recognition
83	stock - market - price - financial - stocks	37	83_stock_market_price_financial
84	reaction - flux - metabolic - growth - biochemical	34	84_reaction_flux_metabolic_growth
85	fleet - routing - vehicles - ride - traffic	34	85_fleet_routing_vehicles_ride
86	cooperation - evolutionary - game - social - payoff	33	86_cooperation_evolutionary_game_social
87	students - courses - student - course - education	33	87_students_courses_student_course
88	action - temporal - video - recognition - videos	33	88_action_temporal_video_recognition
89	irreducible - group - mathcal - representations - let	32	89_irreducible_group_mathcal_representations
90	phylogenetic - tree - trees - species - gene	32	90_phylogenetic_tree_trees_species
91	processes - drift - asymptotic - estimators - stationary	31	91_processes_drift_asymptotic_estimators
92	wave - waves - water - free surface - shallow water	30	92_wave_waves_water_free surface
93	distributed - gradient - byzantine - communication - sgd	30	93_distributed_gradient_byzantine_communication
94	voters - voting - election - voter - winner	30	94_voters_voting_election_voter
95	gaussian process - gaussian - gp - process - gaussian processes	30	95_gaussian process_gaussian_gp_process
96	mathfrak - gorenstein - ring - rings - modules	29	96_mathfrak_gorenstein_ring_rings
97	motivic - gw - cohomology - dm - category	29	97_motivic_gw_cohomology_dm
98	recurrent - lstm - rnn - recurrent neural - memory	28	98_recurrent_lstm_rnn_recurrent neural
99	semigroup - semigroups - xy - ordered - pt	27	99_semigroup_semigroups_xy_ordered
100	robot - robots - human - human robot - children	25	100_robot_robots_human_human robot
101	categories - category - homotopy - functor - grothendieck	25	101_categories_category_homotopy_functor
102	queue - queues - server - scheduling - customer	24	102_queue_queues_server_scheduling
103	topic - topics - topic modeling - lda - documents	24	103_topic_topics_topic modeling_lda
104	synchronization - oscillators - chimera - coupling - coupled	24	104_synchronization_oscillators_chimera_coupling
105	stochastic - existence - equation - solutions - uniqueness	24	105_stochastic_existence_equation_solutions
106	fractional - derivative - derivatives - integral - psi	23	106_fractional_derivative_derivatives_integral
107	lasso - regression - estimator - estimators - bootstrap	23	107_lasso_regression_estimator_estimators
108	soil - moisture - machine - resolution - seismic	22	108_soil_moisture_machine_resolution
109	bayesian optimization - optimization - acquisition - bayesian - bo	21	109_bayesian optimization_optimization_acquisition_bayesian
110	urban - city - mobility - cities - social	21	110_urban_city_mobility_cities
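
In the table above, each label is the topic ID followed by the first four keywords, joined with underscores. The sketch below reproduces that pattern for a few rows copied from the table and reports each topic's share of the 29961 training documents. `make_label` and `share` are illustrative helpers, not BERTopic functions.

```python
# A few rows copied from the table above: (topic id, keywords, frequency).
rows = [
    (-1, "data - model - paper - time - based", 20),
    (0, "spin - magnetic - phase - quantum - temperature", 12937),
    (1, "mass - star - stars - 10 - stellar", 3048),
    (2, "reinforcement - reinforcement learning - learning - policy - robot", 2564),
]
TOTAL_DOCUMENTS = 29961  # "Number of training documents" reported above


def make_label(topic_id, keywords, n_words=4):
    """Build a label as <id>_<word1>_..._<wordN> from a ' - ' separated keyword string."""
    words = keywords.split(" - ")[:n_words]
    return "_".join([str(topic_id)] + words)


def share(frequency, total=TOTAL_DOCUMENTS):
    """Fraction of the training documents that fall into a topic."""
    return frequency / total


for topic_id, keywords, frequency in rows:
    print(f"{make_label(topic_id, keywords)}: {share(frequency):.1%}")
```

Topic 0 alone covers roughly 43% of the training documents, so the topic sizes are strongly skewed toward a single broad physics cluster.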

Training Procedure

The model was trained as follows:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Representation models
representation_models = {"KeyBERTInspired": KeyBERTInspired()}

# Fit BERTopic, passing in the sub-models prepared above
# (docs is the list of abstract strings from the merged dataset)
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    min_topic_size=10,
    n_gram_range=(1, 1),
    nr_topics=None,
    seed_topic_list=None,
    top_n_words=10,
    calculate_probabilities=False,
    language=None,
    verbose=True,
).fit(docs)
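
A note on two of the settings above: in BERTopic, `n_gram_range` and `min_topic_size` only take effect when no custom `vectorizer_model` or `hdbscan_model` is supplied. Since both are passed here, the vectorizer's `ngram_range=(1, 3)` and HDBSCAN's `min_cluster_size=20` are the values that apply, which is consistent with the multi-word keywords in the topic table. The sketch below illustrates in plain Python what an n-gram range of (1, 3) produces. It is a simplified stand-in, not scikit-learn's implementation: `word_ngrams` is a hypothetical helper that skips tokenization, stop-word removal, and the `min_df` filter.

```python
from typing import List


def word_ngrams(tokens: List[str], min_n: int = 1, max_n: int = 3) -> List[str]:
    """Return all contiguous word n-grams with min_n <= n <= max_n, shortest first."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams(["deep", "neural", "networks"]))
# ['deep', 'neural', 'networks', 'deep neural', 'neural networks', 'deep neural networks']
```

Candidate terms such as "neural networks" or "reinforcement learning" can only appear as topic keywords because bigrams and trigrams are part of the vocabulary.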

Training hyperparameters

calculate_probabilities: False
language: None
low_memory: False
min_topic_size: 10
n_gram_range: (1, 1)
nr_topics: None
seed_topic_list: None
top_n_words: 10
verbose: True

Framework versions

Numpy: 1.22.4
HDBSCAN: 0.8.33
UMAP: 0.5.3
Pandas: 1.5.3
Scikit-Learn: 1.2.2
Sentence-transformers: 2.2.2
Transformers: 4.29.2
Numba: 0.56.4
Plotly: 5.13.1
Python: 3.10.11