-Transliteration systems stored in a `maps` directory as YAML files. You can create a new file and add it to the directory. The file shout be named as `<system-code>.yaml`.
+Transliteration systems are stored in the `maps/` directory as YAML files.
+You can create a new file and add it to the directory.
+
+The file should be named `<system-code>.yaml`, where `system-code`
-The subsection `rules` is placed in the `map` section. All rules are applied in order they are placed before the subsection `characters` applying. Rules apply to an original text, not to a result of previous rules applying.
+The subsection `rules` is placed under the `map` key. All rules are applied in the order in which they are listed, before the `characters` subsection is applied. Rules apply to the original text, not to the result of previous rules.

 Each rule has `pattern` and `result` elements.

@@ -155,9 +164,9 @@ map:
 ----
 map:
   rules:
-    - pattern: (?<=\b)\u03BC[πΠ] # μπ (initially)
+    - pattern: (?<=\b)\u03BC[πΠ] # μπ (initially)
       result: b
-    - pattern: \u03BC[πΠ] # μπ (medially)
+    - pattern: \u03BC[πΠ] # μπ (medially)
       result: mb
 ----

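The ordered rules in the hunk above can be sketched in a few lines of Python. This is only an illustration: it assumes Python's `re` module behaves like the library's own regex engine for these two patterns, and `apply_rules` is a hypothetical helper, not a function of the library.

```python
import re

# Rules copied from the YAML above, kept in the same order.
# Assumption: Python's `re` accepts these patterns unchanged.
RULES = [
    (r"(?<=\b)\u03BC[πΠ]", "b"),   # μπ (initially)
    (r"\u03BC[πΠ]", "mb"),         # μπ (medially)
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, result) pair in list order (hypothetical helper)."""
    for pattern, result in rules:
        text = re.sub(pattern, result, text)
    return text


# The word-initial rule runs first, so only the second μπ is left for
# the medial rule.
print(apply_rules("μπαμπάς"))
```

Reversing the list would let the medial rule consume the word-initial μπ as well, which is why the more specific pattern is listed first.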
@@ -176,43 +185,50 @@ map:

 (This guarantees that any `;` are converted to `?` before any new `;` are introduced; because all three are Latin script, they could be mixed up in ordering.)

-Normally rules "bleed" each other: once a rule applies to a segment, that segment cannot trigger other rules, because it is already converted to Roman. Exceptionally, it will be necessary to have a rule add or remove characters in the original script, rather than transliterate them, so that the same context can be invoked by two rules in succession:
+Normally rules "`bleed`" each other: once a rule applies to a segment, that segment cannot trigger other rules, because it is already converted to Roman. Exceptionally, it will be necessary to have a rule add or remove characters in the original script, rather than transliterate them, so that the same context can be invoked by two rules in succession:

 [source,yaml]
 ----
 map:
   rules:
     - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯя])\u042b # Ы after any vowel character
       result: "\u00b7Ы"
-    - pattern: \u042b(?=[АаУуЫыЭэ]) # Ы before а, у, ы, or э
+    - pattern: \u042b(?=[АаУуЫыЭэ]) # Ы before а, у, ы, or э
       result: "Ы\u00b7"
 ----

-(If the result were "\u00B7Y", the second rule could not be applied afterwards; but we want ОЫУ to transliterate as `O·Y·U`. In order to make that happen, we preserve the Ы during the rules phase, resulting in О·Ы·У; we only convert the letters to Roman script in the `characters` phase.)
+(If the result were `\u00B7Y`, the second rule could not be applied afterwards; but we want ОЫУ to transliterate as `O·Y·U`. In order to make that happen, we preserve the Ы during the rules phase, resulting in О·Ы·У; we only convert the letters to Roman script in the `characters` phase.)

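The ОЫУ case described above can be traced with a short Python sketch. It is a sketch under assumptions: Python's `re` stands in for the library's regex engine, `transliterate` is a hypothetical helper, and `CHARACTERS` is a made-up three-letter map covering only the letters of the example, not a real `characters` section.

```python
import re

# The two rules from the YAML above: both keep Ы in the original script.
RULES = [
    (r"(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯя])\u042b", "\u00b7\u042b"),  # Ы after any vowel
    (r"\u042b(?=[АаУуЫыЭэ])", "\u042b\u00b7"),                  # Ы before а, у, ы, э
]

# Assumed minimal character map (Cyrillic О, Ы, У to Latin O, Y, U).
CHARACTERS = {"\u041e": "O", "\u042b": "Y", "\u0423": "U"}


def transliterate(text, rules=RULES, characters=CHARACTERS):
    """Rules phase first, then the per-character phase (hypothetical helper)."""
    for pattern, result in rules:
        text = re.sub(pattern, result, text)
    return "".join(characters.get(ch, ch) for ch in text)


sample = "\u041e\u042b\u0423"  # ОЫУ

# Both rules fire because the first one leaves Ы untouched.
print(transliterate(sample))

# Variant: the first rule emits Latin Y at once, so the second rule
# no longer finds a Ы to match and the second dot is lost.
EARLY = [(RULES[0][0], "\u00b7Y"), RULES[1]]
print(transliterate(sample, rules=EARLY))
```

The second call shows the "bleeding" effect: converting to Roman script inside the rules phase removes the context that the later rule depends on.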
 === Testing transliteration systems

-To test all transliteration systems in `maps` directory run a command:
+To test all transliteration systems in the `maps/` directory, run:

 [source,sh]
 ----
 bundle exec rspec
 ----

-The command takes `source` texts from `test` section, transforma it using `rules` and `charmaps` from `map` section and compare resultat with `expected` text form `text` section.
+The command takes `source` texts from the `test` section, transforms
+them using `rules` and `charmaps` from the `map` key, and compares the
+results with the `expected` texts from the same section.

-To test specific transliteration system set environment variable `TRANSLIT_SYSTEM` to code of desired system. The code is name of YAML file without extension:
+To test a specific transliteration system, set the environment variable
+`TRANSLIT_SYSTEM` to the system code of the desired system
+(i.e. the "`basename`" of the system's YAML file, without the extension):
 the system code identifying a script conversion system has the following components:

-e.g. `bgnpcgn-rus-Cyrl-Latn-1947`
+e.g. `bgnpcgn-rus-Cyrl-Latn-1947`:

 `bgnpcgn`:: the authority identifier
 `rus`:: an ISO 639-2 3-letter language code that this system applies to
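As a rough illustration of how a system code relates to its YAML file name, the following Python sketch derives the code from a path and splits off the two components named above. Both helper names are hypothetical, and the sketch assumes that components are separated by hyphens and that the authority identifier itself contains none; the remaining components are returned unnamed because they are not listed in this hunk.

```python
from pathlib import Path


def system_code(yaml_path):
    """Return the file name without directory or extension (hypothetical helper)."""
    return Path(yaml_path).stem


def split_system_code(code):
    """Split a system code on hyphens (assumed separator).

    Only the first two components are named here; the rest are
    passed through untouched.
    """
    authority, language, *rest = code.split("-")
    return {"authority": authority, "language": language, "remaining": rest}


code = system_code("maps/bgnpcgn-rus-Cyrl-Latn-1947.yaml")
parts = split_system_code(code)
print(code)
print(parts)
```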
@@ -226,7 +242,7 @@ e.g. `bgnpcgn-rus-Cyrl-Latn-1947`
 Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.


-== Sources
+== Samples to play with

 * `rus-Cyrl-1.txt`: Copied from the XLS output from http://www.primorsk.vybory.izbirkom.ru/region/primorsk?action=show&global=true&root=254017025&tvd=4254017212287&vrn=100100067795849&prver=0&pronetvd=0&region=25&sub_region=25&type=242&vibid=4254017212287

@@ -235,11 +251,15 @@ Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.

 == Links to system definitions

-* ALA-LC Romanization systems from 1997 are available here: http://catdir.loc.gov/catdir/cpso/roman.html
-* ALA-LC Romanization systems in current use are here: https://www.loc.gov/catdir/cpso/roman.html
-* UN systems are available here: http://www.eki.ee/wgrs/
-
+* https://www.iso.org/committee/48750.html[ISO/TC 46 (see standards published by WG 3)]
+* http://geonames.nga.mil/gns/html/romanization.html[BGN/PCGN and BGN Romanization systems (BGN)]
+* https://www.gov.uk/government/publications/romanization-systems[BGN/PCGN Romanization systems (PCGN)]
+* https://www.loc.gov/catdir/cpso/roman.html[ALA-LC Romanization systems in current use]
+* http://catdir.loc.gov/catdir/cpso/roman.html[ALA-LC Romanization systems from 1997]