Commit 1455781

Update README
1 parent 3b8b054 commit 1455781

1 file changed

+44
-24
lines changed

README.adoc

Lines changed: 44 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -1,30 +1,33 @@
1-
= Interscript: Interoperable Script Conversion Systems and a Ruby implementation
1+
= Interscript: Interoperable Script Conversion Systems, with a Ruby implementation
22

3-
image:https://github.com/riboseinc/interoperable-transliteration/workflows/test/badge.svg["Build Status", link="https://github.com/riboseinc/interoperable-transliteration/actions?workflow=test"]
3+
image:https://github.com/riboseinc/interscript/workflows/test/badge.svg["Build Status", link="https://github.com/riboseinc/interscript/actions?workflow=test"]
44

55
== Introduction
66

7-
This repository contains a number of transliteration schemes from:
7+
This repository contains interoperable transliteration schemes from:
88

9+
* ALA-LC
910
* BGN/PCGN
1011
* ICAO
1112
* ISO
1213
* UN (by UNGEGN)
14+
* Many, many other script conversion system authorities.
1315

1416
The goal is to achieve interoperable transliteration schemes allowing quality comparisons.
1517

16-
image:demo/20191118-interscript-demo-cast.gif["interscript screencast"]
17-
1818

19-
== STATUS (work in progress!)
19+
== Demonstration
2020

21-
These transliteration systems currently work:
21+
These transliteration systems are used in the demo:
2222

2323
`bgnpcgn-rus-Cyrl-Latn-1947`:: BGN/PCGN Romanization of Russian
2424
`iso-rus-Cyrl-Latn-iso9`:: ISO 9 Romanization of Russian
2525
`icao-rus-Cyrl-Latn-9303`:: ICAO MRZ Romanization of Russian
2626
`bas-rus-Cyrl-Latn-bss`:: Bulgaria Academy of Science Streamlined System for Russian
2727

28+
image:demo/20191118-interscript-demo-cast.gif["interscript screencast"]
29+
30+
2831
== Installation
2932

3033
Interscript depends on Ruby. Once you manage to install Ruby, it's easy.
@@ -88,7 +91,12 @@ diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt
8891

8992
== Adding transliteration system
9093

91-
Transliteration systems stored in a `maps` directory as YAML files. You can create a new file and add it to the directory. The file shout be named as `<system-code>.yaml`.
94+
Transliteration systems are stored in the `maps/` directory as YAML files.
95+
You can create a new file and add it to the directory.
96+
97+
The file should be named as `<system-code>.yaml`, where `system-code`
98+
is in accordance with
99+
http://calconnect.gitlab.io/tc-localization/csd-transcription-systems[ISO/CC 24229].
92100

93101
=== File structure
94102

@@ -127,9 +135,10 @@ map:
127135
"\u0412": "V"
128136
----
129137

138+
130139
=== Rules
131140

132-
The subsection `rules` is placed in the `map` section. All rules are applied in order they are placed before the subsection `characters` applying. Rules apply to an original text, not to a result of previous rules applying.
141+
The subsection `rules` is placed under the `map` key. All rules are applied in order they are placed before the subsection `characters` applying. Rules apply to an original text, not to a result of previous rules applying.
133142

134143
Each rule has `pattern` and `result` elements.
135144

@@ -155,9 +164,9 @@ map:
155164
----
156165
map:
157166
rules:
158-
- pattern: (?<=\b)\u03BC[πΠ] # μπ (initially)
167+
- pattern: (?<=\b)\u03BC[πΠ] # μπ (initially)
159168
result: b
160-
- pattern: \u03BC[πΠ] # μπ (medially)
169+
- pattern: \u03BC[πΠ] # μπ (medially)
161170
result: mb
162171
----
163172

@@ -176,43 +185,50 @@ map:
176185

177186
(This guarantees that any `;` are converted to `?` before any new `;` are introduced; because all three are Latin script, they could be mixed up in ordering.)
178187

179-
Normally rules "bleed" each other: once a rule applies to a segment, that segment cannot trigger other rules, because it is already converted to Roman. Exceptionally, it will be necessary to have a rule add or remove characters in the original script, rather than transliterate them, so that the same context can be invoked by two rules in succession:
188+
Normally rules "`bleed`" each other: once a rule applies to a segment, that segment cannot trigger other rules, because it is already converted to Roman. Exceptionally, it will be necessary to have a rule add or remove characters in the original script, rather than transliterate them, so that the same context can be invoked by two rules in succession:
180189

181190
[source,yaml]
182191
----
183192
map:
184193
rules:
185194
- pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯя])\u042b # Ы after any vowel character
186195
result: "\u00b7Ы"
187-
- pattern: \u042b(?=[АаУуЫыЭэ]) # Ы before а, у, ы, or э
196+
- pattern: \u042b(?=[АаУуЫыЭэ]) # Ы before а, у, ы, or э
188197
result: "Ы\u00b7"
189198
----
190199

191-
(If the result were "\u00B7Y", the second rule could not be applied afterwards; but we want ОЫУ to transliterate as `O·Y·U`. In order to make that happen, we preserve the Ы during the rules phase, resulting in О·Ы·У; we only convert the letters to Roman script in the `characters` phase.)
200+
(If the result were `\u00B7Y`, the second rule could not be applied afterwards; but we want ОЫУ to transliterate as `O·Y·U`. In order to make that happen, we preserve the Ы during the rules phase, resulting in О·Ы·У; we only convert the letters to Roman script in the `characters` phase.)
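
The two-phase behaviour described above can be illustrated with a minimal Ruby sketch. This is not Interscript's actual implementation: the `RULES` and `CHARACTERS` constants and the `transliterate` method are hypothetical names introduced here, and the character table covers only the three letters needed for the ОЫУ example. It assumes rules are applied one after another to the running text, followed by a per-character mapping.

```ruby
# Hypothetical sketch of a rules phase followed by a characters phase.
# Names and structure are assumptions, not Interscript's real API.

RULES = [
  # Ы after any vowel character: keep Ы, insert a middle dot before it
  { pattern: /(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯя])\u042b/, result: "\u00b7Ы" },
  # Ы before а, у, ы, or э: keep Ы, insert a middle dot after it
  { pattern: /\u042b(?=[АаУуЫыЭэ])/, result: "Ы\u00b7" }
].freeze

# Only the letters used in the example; a real map would be much larger.
CHARACTERS = { "О" => "O", "Ы" => "Y", "У" => "U" }.freeze

def transliterate(text)
  # Rules phase: apply each rule in order; Ы stays in the original script.
  after_rules = RULES.reduce(text) do |acc, rule|
    acc.gsub(rule[:pattern], rule[:result])
  end
  # Characters phase: map each remaining character, passing unknowns through.
  after_rules.each_char.map { |ch| CHARACTERS.fetch(ch, ch) }.join
end

puts transliterate("ОЫУ")
```

Because both rules leave Ы in Cyrillic, the second rule can still match after the first has fired; converting Ы to `Y` in the first rule would prevent that.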
192201

193202
=== Testing transliteration systems
194203

195-
To test all transliteration systems in `maps` directory run a command:
204+
To test all transliteration systems in the `maps/` directory, run:
196205

197206
[source,sh]
198207
----
199208
bundle exec rspec
200209
----
201210

202-
The command takes `source` texts from `test` section, transforma it using `rules` and `charmaps` from `map` section and compare resultat with `expected` text form `text` section.
211+
The command takes `source` texts from the `test` section, transforms
212+
them using `rules` and `charmaps` from the `map` key, and compares the
213+
results with `expected:` text from the `source:` section.
203214

204-
To test specific transliteration system set environment variable `TRANSLIT_SYSTEM` to code of desired system. The code is name of YAML file without extension:
215+
To test a specific transliteration system, set the environment variable
216+
`TRANSLIT_SYSTEM` to the system code of the desired system
217+
(i.e. the "`basename`" of the system's YAML file):
205218

206219
[source,sh]
207220
----
208221
TRANSLIT_SYSTEM=bgnpcgn-rus-Cyrl-Latn-1947 bundle exec rspec
209222
----
210223

224+
211225
== ISCS system codes
212226

213-
The system code identifying a script conversion system has a few components:
227+
In accordance with
228+
http://calconnect.gitlab.io/tc-localization/csd-transcription-systems[ISO/CC 24229],
229+
the system code identifying a script conversion system has the following components:
214230

215-
e.g. `bgnpcgn-rus-Cyrl-Latn-1947`
231+
e.g. `bgnpcgn-rus-Cyrl-Latn-1947`:
216232

217233
`bgnpcgn`:: the authority identifier
218234
`rus`:: an ISO 639-2 3-letter language code that this system applies to
@@ -226,7 +242,7 @@ e.g. `bgnpcgn-rus-Cyrl-Latn-1947`
226242
Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.
227243

228244

229-
== Sources
245+
== Samples to play with
230246

231247
* `rus-Cyrl-1.txt`: Copied from the XLS output from http://www.primorsk.vybory.izbirkom.ru/region/primorsk?action=show&global=true&root=254017025&tvd=4254017212287&vrn=100100067795849&prver=0&pronetvd=0&region=25&sub_region=25&type=242&vibid=4254017212287
232248

@@ -235,11 +251,15 @@ Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.
235251

236252
== Links to system definitions
237253

238-
* ALA-LC Romanization systems from 1997 are available here: http://catdir.loc.gov/catdir/cpso/roman.html
239-
* ALA-LC Romanization systems in current use are here: https://www.loc.gov/catdir/cpso/roman.html
240-
* UN systems are available here: http://www.eki.ee/wgrs/
241-
254+
* https://www.iso.org/committee/48750.html[ISO/TC 46 (see standards published by WG 3)]
255+
* http://geonames.nga.mil/gns/html/romanization.html[BGN/PCGN and BGN Romanization systems (BGN)]
256+
* https://www.gov.uk/government/publications/romanization-systems[BGN/PCGN Romanization systems (PCGN)]
257+
* https://www.loc.gov/catdir/cpso/roman.html[ALA-LC Romanization systems in current use]
258+
* http://catdir.loc.gov/catdir/cpso/roman.html[ALA-LC Romanization systems from 1997]
259+
* http://www.eki.ee/wgrs/[UN Romanization systems]
260+
* http://www.eki.ee/knab/kblatyl2.htm[EKI KNAB systems]
242261

243262
== License
244263

245264
Copyright Ribose.
265+
