OpenStreetMap

mboeringa's Diary Comments

Diary Comments added by mboeringa

Minutely updated tile volume: Technical details

Really nice analysis. I know you’ve done similar benchmarking work in the past, and I’ve always enjoyed reading it for its clarity.

Kind of funny to see the almost linear drop in tiles per update as the number of combined minutely updates goes up. Do I interpret this right to mean that a large percentage of editors work in the same tile extent, and likely push changes in multiple changesets affecting the same tile? Of course, working in a small concentrated area is logical, so it is probably not very surprising.

The OSM Iceberg

man_made=dyke / height=current_height + 217ft perhaps? 😉

https://www.nationalgeographic.co.uk/environment-and-conservation/2017/11/what-the-world-would-look-like-if-all-the-ice-melted

The OSM Iceberg

Well, since we are already in climate meltdown, it seems there is more than one OSM problem that will spontaneously resolve itself in the near future…

Just ensure your man_made=dike / embankment=dyke / embankment=yes / waterway=dam / barrier=coupure is high enough for when the flood waters come… 😉

Removing quantity= tags from pitches in the San Francisco Bay Area

You probably meant to suggest “leisure=pitch” & “sport=tennis” as the correct tagging ;-)

Removing quantity= tags from pitches in the San Francisco Bay Area

I think tagging it as leisure=sports_centre is more appropriate than landuse=recreation_ground. A sport centre tagged facility does not have to be a building in OSM, it can be outdoor:

https://wiki.openstreetmap.org/wiki/Tag:leisure%3Dsports_centre

From what I’ve seen, the most established use for landuse=recreation_ground is (parts of) public parks that contain some facilities, like a small sandy beach next to a small lake within the park, mowed and maintained grass fields for sunbathing or picnicking, some outdoor training gear, or e.g. public toilets.

Windows Subsystem for Linux

Nice to see this lesser known option on Windows to run Linux via WSL pointed out.

Maybe it is good though to also include the option to run Ubuntu in a Windows “Hyper-V” virtual machine, especially for those who are more user interface oriented rather than command line:

https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/quick-start/enable-hyper-v

Hyper-V is a free option that is available by default on the Pro, Education and Workstation editions of Windows, but not on Home.

This wasn’t the case a few years ago, but Microsoft added an Ubuntu virtual machine as one of the default options for guest systems in Hyper-V.

I have been happily running such an Ubuntu virtual machine with osm2pgsql and PostgreSQL / PostGIS for the last two years now on a Windows 10 host system, and it has proven both rock solid and performant.

Bridge Tagging Enhancements

If you do start to add man_made=bridge polygons, please make sure the corresponding highway=x + bridge=x match perfectly by connecting the end nodes of the highway=x + bridge=x way to the man_made=bridge polygon.

Never let it over- or undershoot the man_made=bridge polygon.

This will prevent drawing errors in renderers that take into account the layer=x tag of both features (current ‘openstreetmap-carto’ doesn’t for man_made=bridge, but there are implementations that do).

International Cartographic Conference 2021

Tomas and Geonick, you may find my SOTM 2022 poster interesting (although the project is still in development) ;-): https://files.osmfoundation.org/s/xDdDz3rpQX2C7FJ

A smiling farm in Tuscany

I can’t help but see a slightly sad-looking Russian Bear (https://en.wikipedia.org/wiki/Russian_Bear) sitting upright, with feet and paws and all, in the war memorial for the “Sowjetischer Soldat” in Berlin:

https://www.openstreetmap.org/#map=19/52.51636/13.37227

The illusion is so strong to me that I wonder whether it was intended, or was some kind of practical joke by the original designer that the commissioners overlooked…

OpenStreetMap Isn't Unicode

Hi @Andy,

Thanks, I think you are right with your analysis:

“something has taken the original UTF-8 from OSM, converted it to UTF-16 in memory, and then something else is reading that same hex value from memory, is unaware of surrogate pairs and is treating “0xD86DDE6E” as the two Unicode characters U+D86D and U+DE6E - which is completely incorrect”

Still wondering which part of the local chain is failing here, but that requires digging deeper on my side.

For now, the workaround will do.

Yes, we can close this discussion. It still was useful to me to gain some more insights through thoughtful remarks like yours and mmd’s.

Thanks to both of you.

OpenStreetMap Isn't Unicode

I think the difference may be that you use the “one 16-bit code point” from the UTF-8 encoding, while the one I get from the processing (and likely OSM database), is the “two 16-bit surrogate code point” from UTF-16 for what apparently is the same character.
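A minimal sketch of how the two views relate, assuming the character in question is U+2B66E as mentioned elsewhere in this thread. It prints the same single code point in its UTF-8 and UTF-16 forms; the variable names are illustrative only.

```python
# The code point is written as an escape so the source file encoding
# cannot interfere with the demonstration.
ch = "\U0002B66E"

code_point = hex(ord(ch))               # one code point in the Python str
utf8_hex = ch.encode("utf-8").hex()     # four UTF-8 bytes
utf16_hex = ch.encode("utf-16-be").hex()  # two 16-bit UTF-16 code units

print(len(ch))      # 1
print(code_point)   # 0x2b66e
print(utf8_hex)     # f0ab99ae
print(utf16_hex)    # d86dde6e
```

In UTF-8 this character takes four bytes, while in UTF-16 it takes the surrogate pair D86D DE6E. Both are encodings of the same code point, not two different characters.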

OpenStreetMap Isn't Unicode

@mmd,

Are you sure the first character in your example is referring to the same “two 16-bit character” code point as the one I encounter?

It seems highly unlikely to me we wouldn’t receive the same output from the same encoding statement in Python.

EDIT: I now tried your example by copying the:

"𫙮魚坑溪".encode(encoding="utf-8").decode(encoding="UTF-8")

you supplied in your post.

That indeed gives me:

'𫙮魚坑溪'

in the Python output.

So this again seems to me to indicate we are not using the same code points.

OpenStreetMap Isn't Unicode

@mmd,

When I do a “backslashreplace” in the Python encoding/decoding workaround, I get:

\ud86d\ude6e魚坑溪

in the error message, which seems to be consistent with what Andy and the Wiki state about the surrogate code point being a code point consisting of “two 16-bit characters”.

So, is that the same code point as:

U+2B66E

as you pointed out, or should you search for the “two 16-bit character” sequence, in order to be sure none such surrogate features exist in the OSM database?
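The arithmetic behind that question can be sketched as follows. The helper names `combine_surrogates` and `has_surrogates` are hypothetical, written for this illustration; the formula itself is the standard UTF-16 surrogate-pair mapping.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def has_surrogates(text: str) -> bool:
    """Return True if the str holds any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)


code_point = combine_surrogates(0xD86D, 0xDE6E)
print(f"U+{code_point:X}")             # U+2B66E
print(has_surrogates("\ud86d\ude6e"))  # True: two separate surrogate code points
print(has_surrogates("\U0002B66E"))    # False: one supplementary code point
```

So the pair \ud86d\ude6e stands for U+2B66E, but a Python str that stores the two halves as separate code points is not the same string as one that stores U+2B66E directly. A check like `has_surrogates` looks for the former.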

OpenStreetMap Isn't Unicode

The question, though, is whether the unicodebook pages are complete. Clearly, there is a discrepancy between what Python 3.x treats as a surrogate versus valid UTF-8, and the surrogate range that the unicodebook says should be rejected.

There is still a question mark for me as well about which part of the Python code base raises this error: both ‘pyodbc’ and ‘psycopg2’ generated the same error, so it must be something they have in common, some lower-level library imported by both tools to initiate the ODBC transfer (assuming this error is not handed down from the Windows PostgreSQL ODBC driver)?

OpenStreetMap Isn't Unicode

@mmd, ah, sorry, I just assumed that particular character would be part of that range as it errored out as “surrogate” and the Unicode docs referenced that range as surrogates…

No, I do not have an example then of such a surrogate in OSM.

The Python code I am running is Python 3.7.11.

“U+D800 and U+DFFF would be rejected as invalid already”.

Good to hear!

OpenStreetMap Isn't Unicode

Another interesting issue thread regarding UTF and “lone surrogates”:

https://bugs.python.org/issue27971

Which refers to: https://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Which states:

“Surrogates characters are also invalid in UTF-8: characters in U+D800—U+DFFF have to be rejected. See the table 3-7 in the Conformance chapter of the Unicode standard (december 2009); and the section 3 (UTF-8 definition) of UTF-8, a transformation format of ISO 10646 (RFC 3629, november 2003).”

So having these surrogates in OSM is at least discouraged from the point of view of UTF-8 conformance (“…have to be rejected.”).
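As a small illustration of that rule, here is how Python's own strict UTF-8 decoder treats a surrogate written out as bytes. The byte string below is a hypothetical input constructed for this sketch (U+D86D in three-byte form), not data taken from OSM.

```python
# Hypothetical input: the surrogate U+D86D written as a three-byte
# UTF-8-style sequence, which well-formed UTF-8 does not allow.
encoded_surrogate = b"\xed\xa1\xad"

try:
    encoded_surrogate.decode("utf-8")   # strict mode is the default
    rejected = False
except UnicodeDecodeError:
    rejected = True

# The "surrogatepass" error handler is the explicit opt-out that lets
# the surrogate through as a lone code point in the resulting str.
lenient = encoded_surrogate.decode("utf-8", errors="surrogatepass")

print(rejected)               # True
print(hex(ord(lenient)))      # 0xd86d
```

This suggests that a strict decoder on the receiving side would already refuse such bytes, and that a surrogate inside a Python str has to come from a lenient step somewhere in the chain.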

OpenStreetMap Isn't Unicode

Based on some more research, it may also be a Python issue. According to this

https://github.com/elastic/elasticsearch-py/issues/611

Elasticsearch GitHub issue, I should “backslashreplace” the surrogates, so instead of the:

.encode(encoding="UTF-8", errors="replace").decode(encoding="UTF-8")

I should likely be using:

.encode(encoding="UTF-8", errors="backslashreplace").decode(encoding="UTF-8")

as a workaround for this issue.
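A short sketch comparing the error handlers, assuming the problem string reaches Python as two separate surrogate code points followed by ordinary CJK characters. The sample string is constructed for illustration.

```python
# Assumed problem input: surrogate halves stored as separate code points.
text = "\ud86d\ude6e魚坑溪"

# Default (strict) encoding fails on the surrogates.
try:
    text.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason

# "replace" substitutes a question mark for each unencodable code point.
replaced = text.encode("utf-8", errors="replace").decode("utf-8")

# "backslashreplace" keeps the information as a visible escape sequence.
escaped = text.encode("utf-8", errors="backslashreplace").decode("utf-8")

print(strict_error)   # surrogates not allowed
print(replaced)       # ??魚坑溪
print(escaped)        # \ud86d\ude6e魚坑溪
```

Both handlers avoid the exception, but neither restores the original character: "replace" discards it, while "backslashreplace" stores the literal text of the escapes instead.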

OpenStreetMap Isn't Unicode

By the way, I am using the latest “psqlodbc_13_02_0000-x64.zip” official 64-bit Windows PostgreSQL ODBC driver as downloadable from here (and run a PostgreSQL 13.5 database):

https://www.postgresql.org/ftp/odbc/versions/msi/

OpenStreetMap Isn't Unicode

Thanks Andy,

That helps in better understanding the problem.

The database runs in a Windows Hyper-V instance with Ubuntu 20.04 as the guest system. The data processing though, takes place in Windows 10.

However, considering the issue only pops up when I attempt to INSERT the data back into the database from the Windows system using Python and ODBC, this actually makes me conclude that the local toolchain is likely correctly handling the surrogate pair: if it didn’t, and had replaced the pair with e.g. a UTF-8 character in the BMP, then there would be no error about a “surrogate pair” once I attempt to INSERT it back into the database.

It really fails at the stage of the INSERT when I execute the SQL from Python using either ‘pyodbc’ or ‘psycopg2’.

This makes me wonder a little whether it is actually a PostgreSQL Windows ODBC driver issue…

An issue showing at least some similarities to my issue, although involving the Microsoft Access ODBC driver, is listed on the ‘pyodbc’ GitHub repository:

https://github.com/mkleehammer/pyodbc/issues/328

There, the issue is blamed on the ODBC driver…

OpenStreetMap Isn't Unicode

@mmd,

Thanks.

None of the code I wrote in Python actually does anything directly with the OSM ‘name’ tag. It is just general conversions of an entire table / record set, e.g. “PostgreSQL -> SQLite”. It seems likely the issue is caused by one of the intermediate processing steps and conversions outside the database, but it is unlikely there is anything I personally can do about it.

So for now, the workaround I developed will need to do the job.

I’d still be interested to hear a bit more about the “surrogate” thing as mentioned in the error message (https://www.openstreetmap.org/user/bdon/diary/397922#comment51501), and if that part of the error message makes any sense in this context and with the particular object you pointed out as the possible culprit of the processing error I experienced.
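For a generic table conversion like the one described above, one possible alternative to replacing the characters is to re-join the surrogate halves before the INSERT. This is a sketch under assumptions, not the workaround actually used here: `merge_surrogate_pairs` and `clean_row` are hypothetical helpers, and the sample row is made up. It relies on a UTF-16 round trip with Python's "surrogatepass" error handler.

```python
from typing import Any, Iterable, Tuple


def merge_surrogate_pairs(text: str) -> str:
    """Re-join surrogate halves into real code points via a UTF-16 round trip.

    Properly paired halves become one supplementary code point; an
    unpaired half cannot be recovered and is replaced with U+FFFD.
    """
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return raw.decode("utf-16-le", errors="replace")


def clean_row(row: Iterable[Any]) -> Tuple[Any, ...]:
    """Apply the repair to every str value in a row, leaving other types alone."""
    return tuple(
        merge_surrogate_pairs(value) if isinstance(value, str) else value
        for value in row
    )


# Hypothetical row as it might arrive from an intermediate conversion step.
row = (42, "\ud86d\ude6e魚坑溪", None)
cleaned = clean_row(row)

print(cleaned[1] == "\U0002B66E魚坑溪")   # True
print(cleaned[1].encode("utf-8").hex())   # encodes without error now
```

Because the function touches only str values, it could be applied to every row regardless of which column holds the ‘name’ tag, which fits code that never handles individual tags directly.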