Merge pull request #438 from Living-with-machines/text-on-maps-viz

add text exploration notebook
maps-as-data · Jun 6, 2024 · 8ad642d · 8ad642d
2 parents 1fd75c9 + 27420d2
commit 8ad642d
Showing 1 changed file with 386 additions and 0 deletions.
diff --git a/worked_examples/geospatial/workshops_2024/explore_text_on_maps.ipynb b/worked_examples/geospatial/workshops_2024/explore_text_on_maps.ipynb
@@ -0,0 +1,386 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exploring Text on Maps"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook provides some examples of how to load and visualise the text outputs of MapReader. \n",
+    "\n",
+    "To make the size of the data manageable and quick, we will use only the outputs from the six glasgow maps we have used throughout the workshop. You should download these files from [here](https://drive.google.com/drive/folders/14tSvFbH2DJSnuRX1B9dA7Ic9JhzVPmc1).\n",
+    "\n",
+    "----"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# uncomment the following line to run if you have not yet installed geopandas or plotly\n",
+    "#!pip install -q geopandas==1.0.0a1\n",
+    "#!pip install plotly"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import geopandas as geopd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import plotly.express as px\n",
+    "from ast import literal_eval\n",
+    "from collections import Counter\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 'Urban' vs 'Rural' text\n",
+    "\n",
+    "In this notebook we will investigate the textual description of urban and rural landscape. We compare labels that often appear in the built environment versus the rest of the map. \n",
+    "\n",
+    "This notebooks has the following structure:\n",
+    "- for simplicity we convert all polygons to centroids\n",
+    "- we iterate over the dataframe with patch predictions\n",
+    "- we look which labels fall within a certain distance of the patch centoid\n",
+    "- depending on the patch prediction (building or not building) we save the labels in different lists (`adjacent_text` and `other_text`)\n",
+    "- we compute the probability of labels for each of these classes (`adjacent_text` and `other_text`) and compute the difference in proportions to foreground words that are associated with building patches ('urban' labels) and or not ('rural' labels)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load patch predictions and spotted text\n",
+    "predictions = geopd.read_file(\"/path/to/building_patches.geojson\")\n",
+    "spotted_text = geopd.read_file(\"/path/to/deepsolo_outputs.geojson\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "spotted_text.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# print the shape of the dataframes\n",
+    "print(f\"Building predictions shape: {predictions.shape}\")\n",
+    "print(f\"Spotted text shape: {spotted_text.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To save time for the workshop we will filter to one parent map, `map_75650907.png`.\n",
+    "\n",
+    "> **NOTE**: If you'd like to run on all the results you can skip or comment out the cell below. It will take a long time to run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# filter to one parent map to save time\n",
+    "predictions = predictions[predictions['parent_id'] == 'map_75650907.png']\n",
+    "spotted_text = spotted_text[spotted_text['image_id'] == 'map_75650907.png']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# convert polygons to centroids\n",
+    "predictions['centroid'] = predictions['geometry'].to_crs(epsg=27700).centroid \n",
+    "spotted_text['centroid'] = spotted_text['geometry'].to_crs(epsg=27700).centroid"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The below cell identifies the text that falls within the 100m of building patches. Text within this distance is stored as \"adjacent text\" and any other text is stored as \"other text\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tqdm.pandas()\n",
+    "adjacent_text = [] # here we store labels close to the target category, i.e. building classified as 1\n",
+    "other_text = [] # here we store the other labels\n",
+    "target_label = \"building\"\n",
+    "distance = 100 # maximum distance in meters between patch and text centroid\n",
+    "\n",
+    "for i,row in tqdm(predictions.iterrows(), total=predictions.shape[0]):\n",
+    "    # get text within a certain distance from the patch centroid\n",
+    "    labels = spotted_text[spotted_text.to_crs(epsg=27700).distance(row.centroid) <= distance].text.tolist()\n",
+    "    # if patch is classified as the target label, add text to adjacent_text, otherwise add to other_text\n",
+    "    if row['predicted_label'] == target_label:\n",
+    "        adjacent_text.extend(labels)\n",
+    "    else:\n",
+    "        other_text.extend(labels)\n",
+    "\n",
+    "print('Building labels',len(adjacent_text), 'Other labels',len(other_text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get counts and probabilities of the text labels for the building category\n",
+    "building_text_freq =  Counter([i.lower() for i in adjacent_text])\n",
+    "building_text_prob = {k: v/ sum(building_text_freq.values()) for k,v in building_text_freq.items()}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get counts and probabilities of the text labels for the other category\n",
+    "other_text_freq =  Counter([i.lower() for i in other_text])\n",
+    "other_text_prob = {k: v/ sum(other_text_freq.values()) for k,v in other_text_freq.items()}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# compare both absoluate counts and probabilities of a give word\n",
+    "word = 'street'\n",
+    "print(building_text_freq[word], other_text_freq[word])\n",
+    "print(building_text_prob[word], other_text_prob[word])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# compute the proportional difference\n",
+    "proportional_difference = sorted({w: building_text_prob.get(w,0) - other_text_prob.get(w,0) for w in other_text_prob.keys()}.items(), key=lambda x: x[1], reverse=True)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print('Building labels')\n",
+    "print(proportional_difference[:5])\n",
+    "print('Other labels')\n",
+    "print(proportional_difference[-5:])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pd.DataFrame(proportional_difference[:10]).plot(kind='bar', x=0, y=1, legend=False, \n",
+    "                            title='Top 10 terms in Building labels', \n",
+    "                            xlabel='Term', ylabel='Difference in probability')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pd.DataFrame(proportional_difference[-10:]).plot(kind='bar', x=0, y=1, legend=False, \n",
+    "                            title='Top 10 terms in Other labels', \n",
+    "                            xlabel='Term', ylabel='Difference in probability')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To get a sense of what some of the abbreviations mean, please go to the NLS website: https://maps.nls.uk/os/abbrev/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Visalizing the semantic of text on maps\n",
+    "\n",
+    "In the visualization below we encode each label to a vector using BERT-type language model. This generates a vector for each labels that approximates the 'meaning' of this label. Then we visualize these embeddigns in two dimensional space where you can explore the different semantic regions of the text data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# uncomment the following line to run if you have not yet installed sentence-transformers, scikit-learn and plotly\n",
+    "#!pip install -U -q sentence-transformers scikit-learn plotly"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from sklearn.manifold import TSNE\n",
+    "from sentence_transformers import SentenceTransformer\n",
+    "import plotly.express as px"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get all text labels\n",
+    "text_labels = spotted_text.text.str.lower().tolist()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load pre-trained sentence transformer model\n",
+    "# if you are working with a different language, you can change the model to a multilingual one\n",
+    "# please refer to the documentation for more information: https://www.sbert.net/docs/pretrained_models.html\n",
+    "model = SentenceTransformer('distilbert-base-nli-mean-tokens')\n",
+    "\n",
+    "# encode the sentences\n",
+    "sentence_embeddings = model.encode(text_labels)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# perform dimensionality reduction using TSNE\n",
+    "tsne = TSNE(n_components=2, random_state=42)\n",
+    "embeddings_tsne = tsne.fit_transform(sentence_embeddings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# visualize the labels in 2D scatter plot\n",
+    "data = pd.DataFrame(embeddings_tsne, columns=['x','y'])\n",
+    "data['text'] = text_labels\n",
+    "fig = px.scatter(data, x=\"x\", y=\"y\", text='text', width=1000, height=1000,)\n",
+    "fig.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# visualize only the text labels in 2D scatter plot\n",
+    "# i.e. remove all numbers\n",
+    "data_text = data[data.text.str.isalpha()]\n",
+    "fig = px.scatter(data_text, x=\"x\", y=\"y\", text='text', width=1000, height=1000,)\n",
+    "fig.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# visualize only the unique text labels in 2D scatter plot\n",
+    "# i.e. remove all numbers and duplicates\n",
+    "data_text_unique =data[data.text.str.isalpha()].drop_duplicates(subset='text')\n",
+    "fig = px.scatter(data_text_unique, x=\"x\", y=\"y\", text='text', width=1000, height=1000,)\n",
+    "fig.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ce",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}