\n",
""
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from jyquickhelper import add_notebook_menu\n",
"add_notebook_menu()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset : Grand Débat National (Great national debate)\n",
"\n",
"The aim of this exercise is to be familiar with the text-mining and topic models such as LDA. One of the contexts where topic modeling is very useful is in open-ended questions. It allows us to explore the variation of topics addressed in people's responses. For this we will use the french \"Grand Débat National\" dataset. This dataset presents a complete set of responses from the [Grand Débat National](https://granddebat.fr/), the public debate organized by President Macron. The purpose of the debate was to better understand the needs and opinions of the French people following the Yellow Vests protests. The results of this debate are now available as [open data](https://granddebat.fr/pages/donnees-ouvertes)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Import data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 1 :** Download one of the ecological transition csv files and load the content into a pandas dataframe. Name this variable `raw_data`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# TODO"
]
},
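{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch, not the only valid answer: the filename below is a **placeholder** for whichever ecological transition CSV you downloaded from the open-data page; `pd.read_csv` with its default settings should be enough."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A possible solution sketch; replace the placeholder filename with the\n",
"# ecological transition CSV you downloaded from granddebat.fr\n",
"raw_data = pd.read_csv(\"LA_TRANSITION_ECOLOGIQUE.csv\")\n",
"raw_data.shape"
]
},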
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will focus on the last question: ``Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?`` We hope that our LDA model will help us to analyze the topics on which their responses are focused. Let's take a look on the data :"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Multiplier les centrales géothermiques\n",
"1 Les problèmes auxquels se trouve confronté l’e...\n",
"2 NaN\n",
"3 NaN\n",
"4 Une vrai politique écologique et non économique\n",
"5 Les bonnes idées ne grandissent que par le par...\n",
"6 Pédagogie dans ce sens là dés la petite école ...\n",
"7 NaN\n",
"8 faire de l'écologie incitative et non punitive...\n",
"9 Développer le ferroutage pour les poids lourds...\n",
"Name: Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?, dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?\"\n",
"raw_data[question].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we note, there is a lot of missing data (like any open-ended question, people decide whether or not to write a comment). A cleanup step is necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Clean and vectorize documents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training our LDA model, we need to tokenize our text. We will tokenize with the [spaCy] (https://spacy.io/) library because we will only perform some basic preprocessing. We will just initialize a blank template for the French language."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"nlp = spacy.blank(\"fr\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's remove all the rows from the dataframe that don't have an answer for our question (the `NaNs above). This new dataframe will be called ``texts``."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"texts = # TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First preprocessing with spacy :"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"spacy_docs = list(nlp.pipe(texts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a list of spaCy documents. We will transform each spaCy document into a list of tokens. Instead of the original tokens, we will work with lemmas instead. This will allow our model to generalize better"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the full list of preprocessing used: \n",
" \n",
"- remove all **words less than 3 characters**,\n",
"- remove all **stop-words**, and\n",
"- lemmatize** the remaining words and,\n",
"- put these words in **minuscule**."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['multiplier', 'centrales', 'géothermiques'], ['problèmes', 'trouve', 'confronté', 'ensemble', 'planète', 'dénoncent', 'parfait', 'désordre', 'gilets', 'jaunes', 'france', '-ils', 'surpopulation', 'mondiale', 'population', 'passée', 'd’1,5', 'milliards', 'habitants', '1900', 'milliards', '2020', 'montera', 'bientôt', 'milliards', '2040', 'progrès', 'communication', 'village', 'mondial', 'individu', 'fond', 'asie', 'fond', 'afrique', 'passant', 'quartiers', 'campagnes', 'pays', 'aspire', 'vivre', 'blâmer', 'lotis', 'concitoyens', 'logement', 'nourriture', 'biens', 'consommation', 'déplacement', 'mère', 'problèmes', 'solution', 'problèmes', 'stabilisation', 'croissance', 'démographique', 'partage', 'richesses', 'partage', 'terres', 'partage', 'protection', 'biodiversité', 'règlement', 'conflits', 'lutte', 'déforestation', 'lutte', 'dérèglement', 'climatique', 'règlement', 'conflits', 'stabilisation', 'migrations', 'concurrence', 'commerciale', 'mondiale', 'etc.', 'française', 'européenne', 'mondiale', 'france', 'jouer', 'rôle', 'moteur', 'autour', 'déroulera', 'grand', 'débat', 'paraît', 'anecdotique'], ['vrai', 'politique', 'écologique', 'économique']]\n"
]
}
],
"source": [
"docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]\n",
"print(docs[:3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to preserve some word order in our modeling, we will take into account frequent bigrams. For this, we will use the [Gensim](https://radimrehurek.com/gensim/)library. We would like to point out that the Gensim library is an excellent NLP library for topics modeling. \n",
"\n",
"Here is the chosen process: \n",
"\n",
"- We first identify the frequent bigrams in the corpus, \n",
"- then we add them to the list of tokens for the documents in which they appear. This means that the bigrams will not be in their correct position in the text, but this is not a problem: topic models are bag-of-words models that ignore the position of words anyway."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from gensim.models import Phrases\n",
"\n",
"bigram = Phrases(docs, min_count=10)\n",
"\n",
"for idx in range(len(docs)):\n",
" for token in bigram[docs[idx]]:\n",
" if '_' in token: # les bigrammes peuvent être reconnus par \"_\" qui concatène les mots\n",
" docs[idx].append(token)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's move on to the last steps of the Gensim specific preprocessing. First, we will create a dictionary representation of the documents. This dictionary will map each word to a unique identifier and will help us create word-sack representations of each document. These bag-of-words representations contain the identifiers of the words in the document as well as their frequency. In addition, we can remove the least frequent and most frequent words from the vocabulary. This will improve the quality of our model and speed up its training. The minimum frequency of a word is expressed as an absolute number, the maximum frequency is the proportion of documents in which a word can appear."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unique words in original documents : 43522\n",
"Number of unique words after removing rare and common words : 17781\n",
"Example representation of document 3 : [(86, 1), (87, 1), (88, 1), (89, 1)]\n"
]
}
],
"source": [
"from gensim.corpora import Dictionary\n",
"\n",
"dictionary = Dictionary(docs)\n",
"print('Number of unique words in original documents :', len(dictionary))\n",
"\n",
"dictionary.filter_extremes(no_below=3, no_above=0.25)\n",
"print('Number of unique words after removing rare and common words :', len(dictionary))\n",
"\n",
"print(\"Example representation of document 3 :\", dictionary.doc2bow(docs[2]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we create bag-of-words representations for each document in the corpus see method [doc2bow](https://radimrehurek.com/gensim/corpora/dictionary.html) :"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"corpus = [ dictionary.doc2bow(doc) for doc in docs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Topic Modeling with LDA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it's time to train our LDA! To do this, we use the following parameters: \n",
"\n",
"- **corpus**: the bag-of-words representations of our documents\n",
"- **id2token**: the mapping of indexes to words\n",
"- **num_topics** : the number of topics the model should identify (let's set 10)\n",
"- **chunksize**: the number of documents the model sees on each update (let's set to 1,000)\n",
"- **passes**: the number of times we show the total corpus to the model during training (let's set to 5)\n",
"- **random_state**: we use a seed to ensure reproducibility (let's set to 1)\n",
"\n",
"On a corpus of this size, training usually takes one or two minutes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 2 :**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models import LdaModel\n",
"\n",
"# TODO"
]
},
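{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch using the parameters listed above (the cells below assume the trained model is stored in a variable named `model`; note that Gensim's `LdaModel` receives the dictionary through its `id2word` argument):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A possible solution sketch: train the LDA with the parameters given above\n",
"model = LdaModel(\n",
"    corpus=corpus,        # bag-of-words representations of the documents\n",
"    id2word=dictionary,   # mapping from word ids to words\n",
"    num_topics=10,\n",
"    chunksize=1000,\n",
"    passes=5,\n",
"    random_state=1,\n",
")"
]
},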
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Results and visualization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 3 :** Let's see what the model has learned. To do this, let's display the ten most characteristic words for each topic."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"***********\n",
"* topic 1 *\n",
"***********\n",
"1 : 0.023*\"nucléaire\" + 0.019*\"énergies\" + 0.018*\"énergie\" + 0.012*\"france\" + 0.011*\"faut\" + 0.011*\"centrales\" + 0.010*\"renouvelables\" + 0.009*\"production\" + 0.008*\"électricité\" + 0.007*\"nucléaires\"\n",
"\n",
"***********\n",
"* topic 2 *\n",
"***********\n",
"2 : 0.039*\"écologique\" + 0.038*\"transition\" + 0.023*\"transition_écologique\" + 0.022*\"faire\" + 0.019*\"faut\" + 0.012*\"écologie\" + 0.008*\"changer\" + 0.008*\"citoyens\" + 0.007*\"politique\" + 0.007*\"monde\"\n",
"\n",
"***********\n",
"* topic 3 *\n",
"***********\n",
"3 : 0.038*\"transports\" + 0.031*\"transport\" + 0.018*\"routes\" + 0.018*\"camions\" + 0.014*\"commun\" + 0.012*\"vitesse\" + 0.012*\"développer\" + 0.010*\"ferroutage\" + 0.009*\"train\" + 0.008*\"pollution\"\n",
"\n",
"***********\n",
"* topic 4 *\n",
"***********\n",
"4 : 0.012*\"économie\" + 0.011*\"voir\" + 0.009*\"système\" + 0.009*\"entreprises\" + 0.008*\"sanitaires\" + 0.008*\"déchets\" + 0.008*\"environnement\" + 0.007*\"environnemental\" + 0.007*\"respect\" + 0.007*\"informer\"\n",
"\n",
"***********\n",
"* topic 5 *\n",
"***********\n",
"5 : 0.009*\"pouvoir\" + 0.008*\"place\" + 0.007*\"mettre\" + 0.007*\"travail\" + 0.007*\"principe\" + 0.007*\"faire\" + 0.006*\"coût\" + 0.006*\"compte\" + 0.005*\"grande\" + 0.005*\"exemple\"\n",
"\n",
"***********\n",
"* topic 6 *\n",
"***********\n",
"6 : 0.079*\"produits\" + 0.023*\"interdire\" + 0.018*\"emballages\" + 0.015*\"déchets\" + 0.015*\"pesticides\" + 0.014*\"plastique\" + 0.013*\"favoriser\" + 0.012*\"glyphosate\" + 0.012*\"agriculture\" + 0.010*\"recyclage\"\n",
"\n",
"***********\n",
"* topic 7 *\n",
"***********\n",
"7 : 0.027*\"voiture\" + 0.024*\"électrique\" + 0.024*\"véhicules\" + 0.018*\"voitures\" + 0.015*\"véhicule\" + 0.014*\"électriques\" + 0.014*\"diesel\" + 0.011*\"batteries\" + 0.010*\"voiture_électrique\" + 0.008*\"essence\"\n",
"\n",
"***********\n",
"* topic 8 *\n",
"***********\n",
"8 : 0.037*\"taxer\" + 0.026*\"taxe\" + 0.022*\"entreprises\" + 0.017*\"pollueurs\" + 0.016*\"gros\" + 0.014*\"poids\" + 0.014*\"faire\" + 0.013*\"france\" + 0.013*\"payer\" + 0.011*\"avions\"\n",
"\n",
"***********\n",
"* topic 9 *\n",
"***********\n",
"9 : 0.012*\"énergie\" + 0.011*\"panneaux\" + 0.009*\"développer\" + 0.008*\"place\" + 0.008*\"villes\" + 0.008*\"zones\" + 0.008*\"éoliennes\" + 0.007*\"construction\" + 0.007*\"développement\" + 0.006*\"solaires\"\n",
"\n",
"***********\n",
"* topic 10 *\n",
"***********\n",
"10 : 0.016*\"animaux\" + 0.016*\"agriculture\" + 0.012*\"environnement\" + 0.012*\"biodiversité\" + 0.009*\"faut\" + 0.009*\"arrêter\" + 0.009*\"chasse\" + 0.009*\"santé\" + 0.008*\"pollution\" + 0.008*\"consommation\"\n",
"\n"
]
}
],
"source": [
"for (topic, words) in model.print_topics():\n",
" print(\"***********\")\n",
" print(\"* topic\", topic+1, \"*\")\n",
" print(\"***********\")\n",
" print(topic+1, \":\", words)\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to observe topics is to **visualize** them. This can be done with the library [pyLDAvis](https://github.com/bmabey/pyLDAvis). PyLDAvis will show us how popular the topics are in our corpus, how similar the topics are, and which words are most important for that topic. Note that it is important to set ``sort_topics = False`` on the call to pyLDAvis. If you don't, the topics will be sorted differently than in Gensim. This may take a few minutes to load."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 5 :**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pyLDAvis.gensim\n",
"import warnings\n",
"\n",
"pyLDAvis.enable_notebook()\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning) \n",
"\n",
"pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's look at the topics that the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, making its results highly interpretable."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"***********\n",
"* doc 1 *\n",
"***********\n",
"Multiplier les centrales géothermiques\n",
"[(1, 0.7745871)]\n",
"\n",
"***********\n",
"* doc 2 *\n",
"***********\n",
"Les problèmes auxquels se trouve confronté l’ensemble de la planète et que dénoncent, dans le plus parfait désordre, les gilets jaunes de France ne sont-ils pas dus, avant tout, à la surpopulation mondiale ? Cette population est passée d’1,5 milliards d’habitants en 1900 à 7 milliards en 2020 et montera bientôt à 10 milliards vers 2040. Avec les progrès de la communication dans ce village mondial, chaque individu, du fin fond de l’Asie au fin fond de l’Afrique, en passant par les « quartiers » et les « campagnes » de notre pays, aspire à vivre – et on ne peu l’en blâmer – comme les moins mal lotis de nos concitoyens (logement, nourriture, biens de consommation, déplacement,etc.). Voilà la mère de tous les problèmes. Si tel est bien le cas, la solution à tous les problèmes (stabilisation de la croissance démographique, partage des richesses, partage des terres, partage de l’eau, protection de la biodiversité, règlement des conflits, lutte contre la déforestation, lutte contre dérèglement climatique, règlement des conflits, stabilisation des migrations, concurrence commerciale mondiale, etc.) ne sera ni française, ni européenne, mais mondiale. La France se doit d’y jouer un rôle moteur. Le reste, autour duquel se déroulera « le Grand débat », paraît assez anecdotique.\n",
"[(2, 0.6641845)]\n",
"\n",
"***********\n",
"* doc 3 *\n",
"***********\n",
"Une vrai politique écologique et non économique\n",
"[(2, 0.8199877)]\n",
"\n",
"***********\n",
"* doc 4 *\n",
"***********\n",
"Les bonnes idées ne grandissent que par le partage.\n",
"\n",
"En ces jours pénibles où s'affrontent le peuple et ceux entre lesquels ils ont remis leur sort. (tous concernés puis qu’élus).\n",
"\n",
"Une idée qui flotte en ce moment sur la mobilité propre dans les zones non ou mal desservies et qui est encore expérimentale dans quelques communes.\n",
"\n",
"L'avenir est entre vos mains et les nôtres. Pourquoi ne pas planifier l'installation de flotte de véhicules électriques partagés en location, dans ces zones isolées.\n",
"\n",
"En construisant un partenariat avec : état, région, département, commune ou communauté, professionnel de la location, entreprises locales, constructeurs de véhicules, les bailleurs sociaux, pour la mise en place. Avec l'apport financier de la TICPE qui baissera au fil de la décroissance de la consommation, il faudra penser à la déconnecter du budget de fonctionnement de l'état !\n",
"\n",
"Avec un forfait de location pour les utilisateurs, adapté à leur ressources et peut être tout simplement basé sur leurs dépenses de mobilité actuelles. Ainsi pas d'investissements inaccessibles pour eux.\n",
"\n",
"Tout le monde y gagnerait, cela pourrait prendre l'image du relais de poste comme par le passé et réglerait le problème d'autonomie actuel et engendrerait quelques emplois de service, nouveaux.\n",
"[(2, 0.14978762), (5, 0.13328677), (7, 0.29198855), (9, 0.35655138)]\n",
"\n",
"***********\n",
"* doc 5 *\n",
"***********\n",
"Pédagogie dans ce sens là dés la petite école pour sensibilisation via les parcs naturels . Les enfants doivent devenir des prescripteurs pour les générations futures . Il y a urgence\n",
"[(1, 0.28313076), (2, 0.49557972), (10, 0.17458624)]\n",
"\n",
"***********\n",
"* doc 6 *\n",
"***********\n",
"faire de l'écologie incitative et non punitive on n'est pas des gosses à qui on distribue des bonnes ou mauvaises notes\n",
"[(2, 0.8714972)]\n",
"\n",
"***********\n",
"* doc 7 *\n",
"***********\n",
"Développer le ferroutage pour les poids lourds, Instaurer une vignette pour les poids lourds étrangers qui traversent la France\n",
"[(3, 0.23795015), (8, 0.57773215), (9, 0.14310296)]\n",
"\n",
"***********\n",
"* doc 8 *\n",
"***********\n",
"- Favoriser le tri des déchets en mettant en place une redevance incitatrice et non punitive,\n",
"- Arrêt d'une écologie punitive à base de taxes qui est néfaste à la compréhension d'une écologie intelligente, adaptée et durable.\n",
"- Revoir la politique de pose des panneaux solaire en Ferme qui en fait gâche des terrains et sert plus à récolter des taxes pour les caisses des collectivités,\n",
"- Revoir la politique de mise en place des éoliennes qui ne sont pas adaptées, peu rentable (< 30%) et qui produisent des effets néfastes par du bruit et des infra-ondes,\n",
"- Arrêt du rachat, à des prix exorbitants, de l'énergie produite par des panneaux solaire, par des éoliennes ou par tout autre moyen dit propre (Bio-masse, hydraulique),\n",
"- Favoriser le développement de la bio-masse,\n",
"- Développer le stockage intelligent de l'énergie,\n",
"- Réaliser une grande opération pour assurer l'isolation des bâtiments énergivores,\n",
"- Éteindre les éclairages la nuit,\n",
"- Développer les transports en site propre.\n",
"[(2, 0.11118397), (9, 0.8318408)]\n",
"\n"
]
}
],
"source": [
"# Nous en affichons que 8\n",
"i = 0\n",
"for (text, doc) in zip(texts[:8], docs[:8]):\n",
" i += 1\n",
" print(\"***********\")\n",
" print(\"* doc\", i, \" *\")\n",
" print(\"***********\")\n",
" print(text)\n",
" print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.1])\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many collections of unstructured text are not accompanied by labels. Topic models such as LDA are a useful technique for discovering the most important topics in these documents. **Gensim** facilitates learning about these topics and **pyLDAvis** presents the results in a visually appealing way. Together, they form a powerful toolkit for better understanding what's inside large document sets and for exploring subsets of related texts. While these results are often already quite revealing, it is also possible to use them as a starting point, for example, for a labeling exercise for supervised text classification. In sum, thematic models should be in every data scientist's toolbox as a very quick way to gain insight into large document collections."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}