Saturday, March 30, 2019

Any port in a storm!

Gopher, of course, runs off of port 70, and has since the Gopher RFC 1436 from March 1993. Since then, many protocols have embraced SSL (or TLS, its more modern and secure form), often preferring to send all SSL-protected and encrypted data over some other port. HTTP, for example, uses port 80 for unencrypted traffic and 443 for encrypted traffic.

Heat map for the most popular ports in GopherSpace


So what about poor Gopher? How do Gophers in the wild handle SSL/TLS? I've seen a gopher/s protocol on the internet where if a port number is > 100,000 then it's assumed to be TLS. That has the problem that although it's a technically valid URL, some programs (like OneNote) don't seem to like them much.

Or there's khzae.net, which treats port 105 (normally the CSO port, and therefore pretty much unused in the real world) as the SSL port for Gopher.

What I want to know, of course, is the real-world distribution of port numbers in the existing GopherSpace. As always, I'm using a bunch of data from an earlier Gopher crawl.

No surprise, the most commonly specified port is port 70, the Gopher port. There was just a single reference to a port > 100,000, so the new standard of using very large ports to indicate SSL/TLS hasn't taken off yet.

The file entry overwhelmingly uses port 70; failing that, port 9999 is popular.
The directory entry is also mostly port 70 (really, not a surprise at all), with port 9999 the second-most popular port along with a smattering of 7070, 7006 and 7005 ports.
The HTTP (h) entry is also mostly port 70, with port 80 (the official HTTP port) being the runner-up.
The info (i) entry has an unsurprising variance. Since the info entry isn't a selectable entry, the port and host for it aren't actually used; developers can pick whatever random values they want. One common value is no value at all.
Lastly, the Image (I) entry is once again most commonly on port 70 with essentially no variation. Interestingly, the GIF (g) entry is also essentially always on port 70.

What can we learn from all this? My biggest takeaway is that there are enough users of non-standard ports that any gopher client that's worth anything should use the given port numbers. I suspect that I'd get a lot more secure gopher (perhaps on port 105 like khzae.net does it) showing up on my Gopher scan if the scanner actually supported TLS/SSL Gopher -- right now, it won't correctly follow most of the secure Gopher links.

What I don't know how to do is to correctly follow a secure link. If I'm at a secure Gopher site (like maybe the gophers://khzae.net:105 site), should I simply assume that all links are secure links? Or should I assume that all non-70 links are? Or just the 105 links? I suspect that the only way to know is to try to pull data from each Gopher directory in both TLS/SSL mode and plain text mode, and see what works.
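The "try both and see what works" idea can be sketched roughly as below. This is only an illustration in Python, not how my scanner works: the names `fetch` and `fetch_either`, the 10-second timeout, and the choice to try TLS first are all assumptions. It also uses Python's default certificate checks, so a server with a self-signed certificate would be treated as "TLS didn't work".

```python
import socket
import ssl


def fetch(host, port, selector="", use_tls=False, timeout=10.0):
    """Send one Gopher request and return the raw response bytes."""
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        if use_tls:
            # Default context: verifies the certificate and the host name.
            context = ssl.create_default_context()
            conn = context.wrap_socket(raw, server_hostname=host)
        else:
            conn = raw
    except Exception:
        raw.close()
        raise
    with conn:
        # A Gopher request is just the selector followed by CR-LF.
        conn.sendall(selector.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def fetch_either(host, port, selector="", fetcher=fetch):
    """Try TLS first, then plain text. Returns (used_tls, data) or None."""
    for use_tls in (True, False):
        try:
            data = fetcher(host, port, selector, use_tls)
        except OSError:
            # Covers refused connections, timeouts and TLS handshake errors.
            continue
        if data:
            return use_tls, data
    return None
```

The `fetcher` parameter exists only so the retry logic can be exercised without a network connection. The obvious cost of this approach is that every plain-text server pays for a failed TLS handshake (or a timeout) first.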

Saturday, March 9, 2019

All my character sets

Yet another post about Gopher. In yet another diversion, I'm looking at what character sets are used on Gopher menu pages (the Gopher type 1 directory listings).
In the table, you can see that plain ASCII is the clear winner; almost every directory entry is printable ASCII characters (i.e., with no control characters, escape, DEL or any type of 8-bit character).

What are the others? UTF8 is about twice as popular as LATIN1 even though the Gopher spec is pretty clear that LATIN1 is the correct encoding. After that comes ASCII with some control characters. 

The last two categories are a little weird: they represent ASCII, but some of the space characters are actually character 160, the LATIN1 "Non-breaking space" character. There were 41 entries where the text was otherwise perfectly ordinary ASCII characters. 

Even more weirdly, in the entries that were ASCII but included some kind of control character, the control characters are most often 1a (17), 7f (8) and 1b (5). I don't know why SUB and DEL are so popular. At least the 1b char (ESC) is understandable; it's used for fancy graphics.

There's often no way to automatically prove that a string is UTF8 or LATIN1. It is possible to prove that a string is not legal UTF8, but as we all know, sometimes strings are malformed, and what's supposed to be UTF8 is instead only almost UTF8. What I do is look at each string; if there are any characters with the 8th bit set, then I check whether it's perfect UTF8. If it's not perfect UTF8, I assume that it's LATIN1.
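Here is a minimal Python sketch of that check. The function name `classify_line` and the bucket labels are made up for illustration, and treating TAB, CR and LF as ordinary (since they are part of the menu format) is an assumption of the sketch, not something the survey data confirms. It also doesn't split out the non-breaking-space cases.

```python
def classify_line(raw: bytes) -> str:
    """Put one menu line into a rough character-set bucket."""
    if any(b >= 0x80 for b in raw):
        try:
            # Strict decoding: raises on any malformed UTF8 sequence.
            raw.decode("utf-8")
            return "UTF8"
        except UnicodeDecodeError:
            # Not proof of LATIN1, just the fallback guess.
            return "LATIN1"
    # Only 7-bit bytes from here on.
    allowed_controls = (0x09, 0x0A, 0x0D)  # TAB, LF, CR
    for b in raw:
        if b == 0x7F or (b < 0x20 and b not in allowed_controls):
            return "ASCII with control characters"
    return "ASCII"
```

Because the decode is strict, a line that is one bad byte away from valid UTF8 lands in the LATIN1 bucket, which matches the "perfect UTF8 or else LATIN1" rule above.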

Yes, there are other character sets in existence. I'm rather hoping that there aren't any in GopherSpace, because I'm not sure how I'd figure out which was which.

For C# programmers, I've learned two important things about UTF8 conversions. Firstly, many of the "Utf8 check" libraries are copies of each other, and they share a common bug: if a buffer ends with a UTF8 sequence, that sequence is wrongly flagged as invalid.

Secondly, the UWP Encoding.UTF8 has the pernicious habit of sometimes throwing on bad UTF8 sequences and sometimes not. I could understand them making a choice either way, but being in an in-between state is just plain programmer-unfriendly.

TL;DR: you have to handle UTF8 and LATIN1 char sets, and as a programmer, you have to double-check your conversion library.

Thursday, March 7, 2019

Dots, more dots, most dots

Welcome back to another part in a series I'm not really entitling, "ways in which servers completely fail to deliver correct Gopher pages". This post is all about the last line of each menu, where a gopher menu is the common type of page that you go to. Each menu has a list of links, files and information. And each menu is supposed to end with a single line consisting of a single dot.


BNF snippet for the Gopher Protocol, RFC 1436
The picture shows a part of RFC 1436, the Gopher RFC. Most internet protocols are defined by one or more "Requests for Comments", a practice that goes back to the very early days of the Arpanet, the network from which the Internet was created.

This particular snippet shows what the "LastLine" should be. The Lastline ::= means that we're going to define a new part of the protocol. On the right-hand side of the ::= is the actual definition: '.' CR-LF. This means that the last line should be a period (the '.') followed by a CR-LF. CR-LF is defined earlier; it's an ASCII carriage return followed by a line feed.

And now the big question: how many menus actually follow this pattern? The answer, of course, is "most" but also "not all". Here are the numbers from a recent Gopher crawl of part of the Gopherverse.

Numerically, the results are:


Result               Menus  Percent
Correct LastLine      1453   69.26%
No LastLine            633   30.17%
Dot and then close      10    0.48%
Dot then CR or LF        2    0.10%


Conclusion: if you're writing a Gopher parser, you have to handle the presence or absence of the dotted line, and several different ways the last line can be messed up.
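As a rough illustration, the four cases in the table can be told apart with a few lines of Python. The function name `classify_last_line` and the returned labels are invented for this sketch; it looks only at the final line of the raw menu bytes.

```python
def classify_last_line(menu: bytes) -> str:
    """Say how (or whether) a raw Gopher menu is terminated."""
    # Ignore the trailing line break, then find where the final line begins.
    body = menu.rstrip(b"\r\n")
    start = max(body.rfind(b"\n"), body.rfind(b"\r")) + 1
    last = menu[start:]
    if last == b".\r\n":
        return "correct LastLine"
    if last == b".":
        return "dot and then close"
    if last in (b".\n", b".\r"):
        return "dot then CR or LF"
    return "no LastLine"
```

A tolerant parser would treat all four results as "end of menu" and simply drop the final line in the three dotted cases.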


Wednesday, March 6, 2019

Gopher: Carriage Returns, Line Feeds and Tabs (oh my)

The Gopher protocol carefully specifies the layout of the menu screens (aka directory listings). Every line ends with CR-LF (a carriage return followed by a line feed). The last line should be just a period (.) followed by a CR-LF. Each line is supposed to have exactly three tabs so that a single directory entry is

Type User_Name Selector Host Port
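To make the layout concrete, here is a small Python sketch that splits one menu line into those five parts. The names `Entry` and `parse_entry` are hypothetical, and the decision to pad missing fields with empty strings (rather than reject the line) is an assumption made so that sloppy real-world menus still parse.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    item_type: str
    user_name: str
    selector: str
    host: str
    port: str  # kept as text, since some menus leave it empty


def parse_entry(line: str) -> Entry:
    """Split one directory line: Type, then four tab-separated fields."""
    line = line.rstrip("\r\n")
    if not line:
        raise ValueError("empty menu line")
    # The type is the first character; there is no tab after it.
    fields = line[1:].split("\t")
    # Pad short lines and drop any extra trailing fields.
    fields = (fields + ["", "", "", ""])[:4]
    return Entry(line[0], *fields)
```

For example, `parse_entry("0About\t/about.txt\texample.org\t70\r\n")` gives an entry of type `0` with host `example.org` and port `70` (the host name here is a placeholder).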

Let's see, from the current Gopher survey data, how many menu (directory) pages match these requirements!

As an FYI, CR, carriage return and \r all refer to the same character. The same goes for LF, line feed and \n. Just to make life extra confusing, in the C programming language a string with a \n in it is often called a new-line and will be "expanded" to a \r\n on some operating systems depending on how the file is written.

First let's check out menus with incorrect line endings. Out of 2098 menus with some data,

  • 93% (1941) were completely correct; all lines ended with CR LF
  • 5% (95) ended with just LF and not CR
  • 3% (54) had a mix of CRLF and either LF or CR line endings
  • .3% (7) are a confusing mix of line endings
  • 1 had no line endings at all. This data was seemingly garbage but might be a TN3270 telnet session. Or it might not be!
  • 0 ended with just CR and no LF
Just for fun I also counted up the number of menus that included any LF CR pairs (where the developer got the line endings the wrong way around). There are 9 such menus.
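A count like this can be approximated in Python with one regular expression. This is a simplified sketch, not the survey code: the names `count_line_endings` and `describe_line_endings` are made up, it lumps every kind of mixture into a single "mixed" bucket, and it doesn't try to spot LF CR pairs.

```python
import re
from collections import Counter

# CRLF is listed first so its two bytes count as one ending,
# not as a lone CR followed by a lone LF.
_ENDING = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_line_endings(menu: bytes) -> Counter:
    """Count each kind of line ending in the raw menu bytes."""
    return Counter(_NAMES[m.group()] for m in _ENDING.finditer(menu))


def describe_line_endings(menu: bytes) -> str:
    """Summarize a menu as all-CRLF, LF only, CR only, mixed, or none."""
    kinds = set(count_line_endings(menu))
    if not kinds:
        return "no line endings"
    if kinds == {"CRLF"}:
        return "all CRLF"
    if kinds == {"LF"}:
        return "LF only"
    if kinds == {"CR"}:
        return "CR only"
    return "mixed"
```

A client that wants to cope with all of these can split on the same pattern instead of on CR-LF alone.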

This mix of line endings makes life for Gopher clients more complicated, of course :-(

Next up: that TAB analysis I promised at the start of this post!


Tuesday, March 5, 2019

Directory entry says what? Current Gopher type field types

Ready to dig into the details of the modern Gopher ecosystem?  

The Gopher network protocol is like a precursor to the modern web. Like a web browser, a Gopher client can display links and text, download files, and display images. Unlike a web browser, a Gopher client is very spare, without the opportunity to show colorful, interactive displays.

A typical Gopher screen is a directory, a menu-like list of lines, each of which does one thing. A line might display some text, or be a link to a file to view or download, or an image, or might be  a link to another Gopher directory, possibly on another server.  
The Amadeus Gopher Server, served up by my own Simple Gopher Client

I started looking more into modern Gopher sites and how they actually work when creating my own Simple Gopher Client for the Windows Store.

The Gopher RFC 1436 lists 14 different directory entry type fields, each of which is given a single-character identifier. 0, for example, means that the entry refers to a text file that can be displayed; 1 stands for a directory, that is, a link to another Gopher menu, possibly on another server. The 'g' type field is for a GIF file and 'I' is for an image type (but the type isn't explicitly given). Uniquely, directory entries can point to other protocols: '8' for a Telnet server, '2' for a particular phone-lookup protocol (CSO), '7' for a search engine, and 'T' for an IBM 3270-style terminal connection.
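For a client, the type field boils down to a lookup on the first character of each menu line. The Python sketch below shows one way to do that; the names `RFC1436_TYPES`, `INFORMAL_TYPES` and `describe_type` are made up for illustration, and the labels for the informal types follow the table at the end of this post rather than any official list.

```python
# The 14 type fields from RFC 1436.
RFC1436_TYPES = {
    "0": "File",
    "1": "Directory",
    "2": "CSO Phone",
    "3": "Error",
    "4": "File (BinHex)",
    "5": "File (DOS Binary)",
    "6": "File (UUEncoded)",
    "7": "Index-search server",
    "8": "Telnet",
    "9": "File (binary)",
    "+": "Duplicated Server",
    "T": "IBM TN3270",
    "g": "Image (gif)",
    "I": "Image",
}

# A few informal additions seen in the wild (not part of the RFC).
INFORMAL_TYPES = {
    "d": "File (document)",
    "h": "HTML Link",
    "i": "Information",
    "P": "PDF File",
    "p": "Image (PNG)",
    "s": "Sound",
    ";": "Video",
}


def describe_type(menu_line: str) -> str:
    """Describe a menu line by its first character, the type field."""
    if not menu_line:
        return "Empty line"
    code = menu_line[0]
    if code in RFC1436_TYPES:
        return RFC1436_TYPES[code] + " (RFC)"
    if code in INFORMAL_TYPES:
        return INFORMAL_TYPES[code] + " (informal)"
    return "Unknown type " + repr(code)
```

Keeping the two tables separate makes it easy for a client to fall back to "treat as a generic file" for anything outside the RFC.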

Over the years other type fields have been informally added to the list. I recently did a crawl of the Gopher space as it exists in February, 2019 to see what kinds of directory entry type fields are in current usage across the current Gopher space. 

Many Gopher files are served up using generic descriptions and not the more precise descriptions. Type 9, "Binary file" is the most common, followed by type 'd', document (a modern addition that's not part of the official Gopher spec). These two account for 87% of the non-image files served up by Gopher. The third most common type is type 5, DOS Binary, followed by BinHex, PDF and UUEncoded. 


Something similar happens with picture image formats. The 'I' generic image field type is used about 30x more often than the more specific 'g' GIF field type. 


Looking at the most popular field types, the 0=file, i=information and 1=directory are the top three field types by far, accounting for about 90% of the field types. 




Some of the original field types are hardly present. The "T" type field that indicated an IBM TN3270-style interaction is entirely missing. The type field '2' CSO Phone book lookup is present on just 4 pages total, but most of them seem to be samples of what a CSO phone book would be like, not a real phone book. There are actual field type '3' error pages, and no surprise, they seem to result from correctly handling errors from the scripts that generate some Gopher pages. There are also no Duplicate Server '+' type field entries.


 (Note: I removed from the numbers pages that are test pages whose purpose is to validate Gopher clients) 

Type Fields (Alphabetical order) 
I'll finish this blog post with a handy table of existing Gopher tag types. Every tag that was found at least 10 times, or is part of the official RFC, is listed here 

Field  Count  Type                            Status
;      11     Video
+      0      Duplicated Server               RFC
0      60976  File                            RFC
1      29335  Directory                       RFC
2      5      CSO Phone                       RFC
3      36     Error                           RFC
4      223    File (BinHex)                   RFC
5      631    File (DOS Binary)               RFC
6      3      File (UUEncoded)                RFC
7      257    Index-search server (Veronica)  RFC
8      479    Telnet                          RFC
9      4799   File (binary)                   RFC
D      12     Some kind of binary file?
d      1590   File (document)
g      102    Image (gif)                     RFC
H      4
h      3914   HTML Link
I      3300   Image                           RFC
i      13216  Information
M      115    Mail file?
P      26     PDF File
p      15     Image (PNG)
s      278    Sound
T      0      IBM TN3270                      RFC
w      9      Wiki edit link