Recently someone asked how to find OSM users who’ve left a changeset comment but have never made an edit themselves. (Technically, the initial challenge was to do it as a one-line bash script 😉.)
Here’s how to do it.
In OpenStreetMap, people can change their username, but OSM data provides an unchanging numeric user id (uid) for each user, which we use here.
First, download the dump file from the OpenStreetMap data download service (planet.osm.org) ⁽¹⁾.
aria2c --seed-time=0 https://planet.openstreetmap.org/planet/discussions-latest.osm.bz2.torrent
This will download discussions-YYMMDD.osm.bz2 ⁽²⁾, which is currently about 5 GiB.
I had to write a new tool, anglosaxon, to easily parse large XML files like this into a TSV file format⁽³⁾. The programme works on any XML file, so it may be useful for other problems you have. Install that first.
bzcat discussions-220110.osm.bz2 \
| anglosaxon \
-S -o changeset_id --tab -o changeset_uid --tab -o comment_uid --nl \
-s comment -v ../../id --tab -V ../../uid NO_CHANGESET_UID --tab -V uid NO_COMMENT_UID --nl \
| gzip > changeset-comments.tsv.gz
This took about 45 minutes to run on my machine, and it is the slowest step. The output is about 4 MiB (19 MiB uncompressed) and has about 805,000 lines.
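We create the list of all uids who have opened a changeset in this file (the TSV has one row per comment, so these are the owners of changesets that received at least one comment):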
zcat changeset-comments.tsv.gz | cut -f2 |uniq|sort |uniq > changeset-uids.tsv
Then a list of all uids who have left a changeset comment:
zcat changeset-comments.tsv.gz | cut -f3 |uniq|sort|uniq > comment-uids.tsv
Then we compare the two lists to find the uids that appear in the comment list but not in the changeset list.
comm -13 changeset-uids.tsv comment-uids.tsv |sort -n > uids-comment-without-changeset.tsv
Et voilà! Sin é! And there’s your results. 🙂 The file is 29 KiB, and has ~3,500 entries. I’m surprised it’s so high.⁽⁴⁾
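To see what the sort and comm -13 steps are doing, here is a minimal sketch on a tiny made-up table. The sample rows, the temporary directory and the tail -n +2 header skip are my own additions for illustration and are not part of the commands above.

```shell
# Hypothetical sample data with the same three columns as changeset-comments.tsv
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  changeset_id changeset_uid comment_uid \
  100 11 22 \
  100 11 33 \
  101 22 11 \
  102 44 33 \
  > "$workdir/sample.tsv"

# tail -n +2 drops the header row; sort -u gives comm the sorted, unique input it needs
tail -n +2 "$workdir/sample.tsv" | cut -f2 | sort -u > "$workdir/changeset-uids.tsv"
tail -n +2 "$workdir/sample.tsv" | cut -f3 | sort -u > "$workdir/comment-uids.tsv"

# -1 hides lines found only in the first file, -3 hides lines found in both,
# leaving the uids that commented but never appear as a changeset owner
result=$(comm -13 "$workdir/changeset-uids.tsv" "$workdir/comment-uids.tsv" | sort -n)
echo "$result"

rm -r "$workdir"
```

In this sample, uids 11 and 22 appear in both columns, 44 only opened a changeset, and 33 only commented, so 33 is the only uid left over.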
You can find all the changesets that a uid has commented on with this command (replace UID with the uid):
zcat changeset-comments.tsv.gz | grep -P "\tUID$" | cut -f1
e.g. the changesets that uid 23770 ⁽⁵⁾ has commented on:
zcat changeset-comments.tsv.gz | grep -P "\t23770$" | cut -f1
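grep -P is a GNU grep feature and may not be available everywhere (the stock grep on macOS lacks it, for example). As a hedged alternative, the same lookup can be sketched with awk. The function name changesets_commented_by and the sample rows are hypothetical, and unlike the grep version this one also removes duplicate changeset ids.

```shell
# changesets_commented_by FILE USER_ID
# Prints each changeset id (column 1) whose comment_uid (column 3) equals USER_ID.
changesets_commented_by() {
  local file=$1 uid=$2
  gzip -dc "$file" \
    | awk -F'\t' -v uid="$uid" '$3 == uid { print $1 }' \
    | sort -un   # a user can comment on one changeset several times
}

# Hypothetical sample data, gzipped like changeset-comments.tsv.gz
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  changeset_id changeset_uid comment_uid \
  100 11 23770 \
  101 22 123770 \
  102 23770 11 \
  103 33 23770 \
  100 11 23770 \
  | gzip > "$workdir/sample.tsv.gz"

found=$(changesets_commented_by "$workdir/sample.tsv.gz" 23770)
echo "$found"

rm -r "$workdir"
```

On the sample, only changesets 100 and 103 are printed: uid 123770 is a different user, and in changeset 102 uid 23770 is the owner rather than the commenter.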
The OSM API has several methods to get details on an OSM user, e.g.:
curl "https://api.openstreetmap.org/api/0.6/user/23770"
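That endpoint returns XML containing a user element with a display_name attribute. Here is a rough sketch for pulling the current username out of such a response. The function name display_name_from_xml and the sample response are made up for illustration, and a sed pattern is a brittle stand-in for a real XML parser.

```shell
# Reads an OSM API user response on stdin and prints the display_name attribute.
display_name_from_xml() {
  sed -n 's/.*<user [^>]*display_name="\([^"]*\)".*/\1/p'
}

# Hypothetical, trimmed-down response (not real API output)
sample='<osm version="0.6"><user id="1" display_name="example_mapper" account_created="2010-01-01T00:00:00Z"></user></osm>'

name=$(printf '%s\n' "$sample" | display_name_from_xml)
echo "$name"

# Against the live API this would be used as:
#   curl -s "https://api.openstreetmap.org/api/0.6/user/23770" | display_name_from_xml
```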
⁽¹⁾ Here we use aria2c, which does a regular web/HTTP download and also uses BitTorrent P2P decentralized downloads. --seed-time=0 stops aria2c once the file is fully downloaded, rather than sharing/seeding the file forever over BitTorrent.
⁽²⁾ YYMMDD is the year, month & day that the data was created
⁽³⁾ tab separated values, like CSV, but with tabs
⁽⁴⁾ If you’re curious how to do that on one line using bash(1)’s Process Substitution:
comm -13 <(zcat changeset-comments.tsv.gz | cut -f2 |uniq|sort|uniq) <(zcat changeset-comments.tsv.gz | cut -f3 |uniq|sort|uniq) | sort -n > uids-comment-without-changeset.tsv
⁽⁵⁾ My user id