Journal Abbreviator, a web app
This little project was inspired by my forgetting the abbreviated form of the Journal of Engineering for Gas Turbines and Power, which is a bit of a mouthful in its long form. The first search engine result for the title and “abbreviation” is a classic example of everything that is wrong with the internet. With my adblocker disabled, I observed:
- 300 requests transferring 8MB of data;
- a full-page modal advert covering over the content;
- a warning from Firefox that “a web page is slowing down your browser”.
Researchers of the world deserve better than this.
My attempt (admittedly not monetised) needs a mere three requests transferring 18kB. This post describes how I built my first web app, and what I learned in the process.
LTWA
A consistent method for abbreviations is specified by the ISO 4 standard 1. The standard defines an exhaustive list of title word abbreviations, or LTWA for short 2, along with a few general rules: don’t abbreviate one-word titles or proper names, omit prepositions, and so on.
The word list ltwa.csv looks like this,
"WORDS";"ABBREVIATIONS";"LANGUAGES"
...
"nat̡ional-";"natl.";"rum, fre, eng"
...
"power";"n.a.";"eng"
...
"-shire";"-sh.";"eng"
...
"turbomachinery";"turbomach.";"eng"
...
We see there are four types of entries. In alphabetical order:
- Prefixes: “national” or “nationality” both abbreviate to “natl.”;
- Non-abbreviations: “power” must be kept as-is;
- Suffixes: “Dorsetshire” and “Devonshire” abbreviate to “Dorsetsh.” and “Devonsh.”;
- Whole words: the simplest case, “turbomachinery” abbreviates to “turbomach.”
Did you notice the “t̡” in “nat̡ional-”? It seems that, where an abbreviation applies to multiple languages, the list records the long word in the alphabet of the first language. At the cost of generality, we can normalise characters such as these using,
iconv -f UTF-8 -t ASCII//TRANSLIT < ltwa.csv > ltwa_translit.csv
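For readers who would rather stay in Python, a rough equivalent is sketched below. It assumes the accented characters are stored as a base letter followed by a combining mark (or can be decomposed into that form); the function name strip_diacritics is made up for this example. Unlike iconv with TRANSLIT, it does not map letters that have no decomposition, such as “ø” or “ß”.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "t" followed by U+0321 (combining palatalized hook below)
print(strip_diacritics("nat\u0321ional-"))
```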
Abbreviation algorithm
The Python code to read and parse this word list into a WordList class containing four dictionaries, one for each type of entry, is straightforward. We can look up individual words efficiently using the in keyword, as dictionaries are indexed by hash tables 3. There are a couple of complications when applying this method to a general query string containing a publication title.
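As an illustration of the parsing step, here is a minimal sketch of such a class. It is not the code from the repository: the attribute names, the rule used to classify entries (a trailing hyphen marks a prefix, a leading hyphen marks a suffix, “n.a.” marks a non-abbreviation), and the choice to accept an iterable of lines rather than a file name are all assumptions made for this example.

```python
import csv
import io


class WordList:
    """Hypothetical sketch: four dictionaries, one per entry type."""

    def __init__(self, lines):
        self.prefixes = {}
        self.suffixes = {}
        self.whole_words = {}
        self.non_abbreviations = {}
        reader = csv.reader(lines, delimiter=";", quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed rows
            word, abbreviation = row[0].lower(), row[1]
            if abbreviation == "n.a.":
                # Word must be kept as-is
                self.non_abbreviations[word.strip("-")] = word.strip("-")
            elif word.endswith("-"):
                self.prefixes[word[:-1]] = abbreviation
            elif word.startswith("-"):
                self.suffixes[word[1:]] = abbreviation.lstrip("-")
            else:
                self.whole_words[word] = abbreviation


sample = (
    '"WORDS";"ABBREVIATIONS";"LANGUAGES"\n'
    '"national-";"natl.";"rum, fre, eng"\n'
    '"power";"n.a.";"eng"\n'
    '"-shire";"-sh.";"eng"\n'
    '"turbomachinery";"turbomach.";"eng"\n'
)
word_list = WordList(io.StringIO(sample))
print("turbomachinery" in word_list.whole_words)
```

With a real file, the same constructor could be fed an open file handle instead of the io.StringIO object.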
Splitting titles on spaces was the logical first move, but this falls down with abbreviations like,
"United States of America";"U. S. A.";"eng"
where the long form contains spaces. Because hash table lookups are cheap, my workaround is to try all possible combinations of consecutive words of length two up to a maximum of four. If any of these trial words are found in the abbreviations dictionaries, they are joined together, and the rest of the algorithm can proceed as normal.
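The joining step might be sketched as follows. The function name and arguments are hypothetical, and trying the longest run first is an assumption of this sketch rather than something the description above pins down.

```python
def join_multiword_entries(words, known_phrases, max_words=4):
    """Merge runs of 2..max_words consecutive words found in known_phrases."""
    merged = []
    i = 0
    while i < len(words):
        # Try the longest run first so that longer phrases win
        for n in range(min(max_words, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate.lower() in known_phrases:
                merged.append(candidate)
                i += n
                break
        else:
            # No multi-word entry starts here; keep the single word
            merged.append(words[i])
            i += 1
    return merged


known = {"united states of america": "U. S. A."}
title = "Journal of the United States of America"
print(join_multiword_entries(title.split(), known))
```

Each position costs at most three dictionary lookups, so the extra work grows with the length of the title and not with the size of the word list.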
Matching prefixes and suffixes also requires some thought. A naive implementation could do something like,
def match_prefix_naive(word, prefixes):
    # Loop over all keys in prefixes dictionary and return a match
    for p in prefixes:
        if word.startswith(p):
            return prefixes[p]
    # Otherwise, return nothing
    return None
The problem with this implementation is that the runtime is proportional to the number of entries in the dictionary, as we do separate comparisons with each prefix. It is more efficient to iterate over shaved versions of the input word,
def match_prefix(word, prefixes, min_prefix_length):
    # Trim characters off the end of the input, longest candidate first
    for trim_stop in range(len(word), min_prefix_length - 1, -1):
        # Look up the trimmed word in the prefix dictionary
        word_trim = word[:trim_stop]
        print('Trying:', word_trim)
        if word_trim in prefixes:
            print('Matched:', prefixes[word_trim])
            return prefixes[word_trim]
    # Otherwise, return nothing
    return None
Where min_prefix_length is the length of the shortest prefix, which we can pre-calculate while reading in the word list; there is no point trying anything shorter. Now, we have at most len(word) - min_prefix_length + 1 hash table lookups, with a runtime independent of the number of word list entries. For example,
>>> match_prefix('nationality', prefixes, min_prefix_length)
Trying: nationality
Trying: nationalit
Trying: nationali
Trying: national
Matched: natl.
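The same trimming idea carries over to suffixes, this time shaving characters off the front of the word. The sketch below is an assumption about how that could look, not the code from the repository: the name match_suffix, the min_suffix_length argument (length of the shortest suffix in the list) and the choice to return the untouched stem alongside the abbreviation are all hypothetical.

```python
def match_suffix(word, suffixes, min_suffix_length):
    """Return (stem, abbreviation) for the longest matching suffix, else None."""
    # Move the split point rightwards, so the longest candidate is tried first
    for start in range(1, len(word) - min_suffix_length + 1):
        candidate = word[start:]
        if candidate in suffixes:
            return word[:start], suffixes[candidate]
    return None


stem, short = match_suffix("dorsetshire", {"shire": "sh."}, 5)
print(stem + short)
```

Returning the stem makes it easy to rebuild the abbreviated word, since a suffix entry only replaces the tail of the input.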
All this is wrapped into an abbreviate function which, given a string and a word list, does the four types of lookup, applies exceptions for special cases, and outputs the result with the correct capitalisation. Finally, let’s make this script more UNIXy, so we can call it with arguments from the shell or have it read from standard input as part of a pipeline,
#!/usr/bin/env python3
import sys

# ... logic goes here ...

if __name__ == "__main__":
    # Get the abbreviation dictionary from default word list file
    word_list = WordList("ltwa_translit.csv")
    # Collect arguments from standard input or sys.argv
    if not sys.stdin.isatty():
        for line in sys.stdin:
            print(abbreviate(line.strip('\n'), word_list))
    else:
        # If no following arguments, show usage
        if len(sys.argv) == 1:
            print('Usage: jabbrev.py TITLE')
        else:
            # As argv is space-separated we must put title back to a single string
            print(abbreviate(" ".join(sys.argv[1:]), word_list))
After putting this script somewhere on our $PATH and making it executable with chmod u+x, we can do,
$ jabbrev.py Journal of Turbomachinery
J. Turbomach.
$ cat my_favourite_journals.txt
Nature
Journal of Turbomachinery
The Lancet
Journal of Fluid Mechanics
International Journal of Turbomachinery, Propulsion and Power
$ grep Turbo my_favourite_journals.txt | jabbrev.py
J. Turbomach.
Int. J. Turbomach. Propuls. Power
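To make the flow of the abbreviate function more concrete, here is a heavily simplified sketch. It is not the function from the repository: the dictionaries hold toy data, the set of omitted words is an assumed subset, suffix lookup and special-case exceptions are left out for brevity, and the names lookup and abbreviate_sketch are invented for this example.

```python
# Toy data for illustration only; the real entries come from the LTWA
WHOLE_WORDS = {
    "international": "int.",
    "journal": "j.",
    "turbomachinery": "turbomach.",
    "propulsion": "propuls.",
}
PREFIXES = {"national": "natl."}
NON_ABBREVIATIONS = {"power"}
# Assumed subset of words to omit; the standard covers more than this
OMITTED = {"of", "the", "and", "for", "in", "on"}


def lookup(word):
    """Return the lower-case abbreviation of one lower-case word."""
    if word in NON_ABBREVIATIONS:
        return word
    if word in WHOLE_WORDS:
        return WHOLE_WORDS[word]
    # Prefix lookup: trim characters off the end until something matches
    for stop in range(len(word), 0, -1):
        if word[:stop] in PREFIXES:
            return PREFIXES[word[:stop]]
    return word  # unknown words are left alone


def abbreviate_sketch(title):
    words = [w.strip(",") for w in title.split()]
    words = [w for w in words if w]
    if len(words) == 1:
        return title  # one-word titles are not abbreviated
    kept = [w for w in words if w.lower() not in OMITTED]
    short = [lookup(w.lower()) for w in kept]
    # Restore the capital letter of each word that had one in the input
    return " ".join(
        s[0].upper() + s[1:] if w[0].isupper() else s
        for w, s in zip(kept, short)
    )


print(abbreviate_sketch("Journal of Turbomachinery"))
```

A fuller version would also need to handle multi-word entries such as “U. S. A.”, whose capitalisation cannot be restored one word at a time.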
Onto the web
There is an argument that client-side JavaScript would have been the best way to implement the abbreviator as a web page. However, I’m much more proficient in Python than JavaScript, and I had already made a Python version. Furthermore, sending a 1.7MB word list to every client seemed a bit unnecessary. So I went for a basic Flask web app 4.
Flask is a “micro” web framework written in Python that responds to HTTP requests with templated HTML. I did not have to write much code to make this work,
from flask import Flask, render_template
import jabbrev

app = Flask(__name__)

# Load word list on startup
word_list = jabbrev.WordList('ltwa_translit.csv')

# Return an empty page at base url
@app.route('/')
def index():
    return render_template('index.html', short_title="", long_title="")

# Input the url as journal to abbreviate, return a filled in page
@app.route('/<journal>')
def abbreviate(journal):
    # Handle Apache rewrites that replace spaces
    journal = journal.replace("+", " ")
    short_title = jabbrev.abbreviate(journal, word_list)
    return render_template(
        'index.html',
        short_title=short_title,
        long_title=journal,
    )
Here, index.html is a template that contains placeholders {{ short_title }} and {{ long_title }} where the data is to be inserted. There is also some limited JavaScript, to take the contents of an input text box, URL-encode it, and append it to the URL when the user presses the enter key or the “abbreviate” button. Spelling out the example from the introduction,
I imagine this might be useful if you had a tea-time disagreement with a colleague about, say, the correct way to abbreviate Flow, Turbulence and Combustion, and you wanted to send them a link afterwards to prove yourself right.
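To show what such a link looks like once the title has been encoded, here is a small sketch using Python’s standard library. The BASE_URL is a placeholder and not the real address of the app, and both function names are made up for this example; on the actual page the encoding is done by JavaScript in the browser.

```python
from urllib.parse import quote, unquote_plus

# Placeholder address, for illustration only
BASE_URL = "https://example.org/journal-abbreviator/"


def title_to_url(title):
    """Percent-encode a title so it fits in a single URL path segment."""
    return BASE_URL + quote(title, safe="")


def url_segment_to_title(segment):
    """Undo the encoding, treating '+' as a space."""
    return unquote_plus(segment)


print(title_to_url("Flow, Turbulence and Combustion"))
```

The decoding helper mirrors the replacement of “+” by a space in the Flask route above.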
Generous hosts
The web app, like this entire site, is hosted by the Student-Run Computing Facility at the University of Cambridge (for free!). They provide excellent documentation for getting your app up and running 5, 6, to which I have only one thing to add. One needs a rule in .htaccess to tell the web server to route traffic to the Flask process that sits ready watching a socket file. Because I wanted to accept general input in the URL, my rule needed an extra option,
RewriteRule "^journal-abbreviator(.*)$" "unix:/home/jb753/jabbrev/web.sock|http://jamesbrind.uk/$1" [P,NE,L,QSA,B]
where the B flag instructs the web server to escape (percent-encode) the non-alphanumeric characters captured by the (.*) group before the request is passed on.
Summary
In response to the bloat of existing websites, I built my own Journal Abbreviator web app in Python. Once I had a working script locally, only about 30 lines of code were needed to get it accessible using the Flask framework.
The entire codebase is available on my sourcehut. If a reader were to fork the app, host it on another domain and plaster it with adverts, I suppose I would be flattered.
Although I have verified abbreviations for the journals I commonly read, there are likely to be some incorrect answers or bugs for other journals I have not tried. To some extent, this is inevitable due to the many edge cases and ambiguity of the English language. Is “Carpenter” a name (which should not be abbreviated) or a job (which abbreviates to “Carpent.”)? If you spot an incorrect abbreviation, I would be grateful if you could open an issue.