Sitemap Architecture
Sitemaps are a way for Web sites to declare what pages exist on their sites, to make it easier for robots like search engine spiders to find them.
The format is XML, and relatively simple. See the sitemap protocol documentation at sitemaps.org for details.
One option for sitemaps would be to generate static files on a periodic basis -- say, once per day. We don't really have a place to store those files, and it seems like a big pain, so I opted instead to generate them dynamically from our API server.
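For reference, a minimal sitemap looks roughly like this (the hostname and paths are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/some-page</loc>
  </url>
  <url>
    <loc>https://example.com/another-page</loc>
  </url>
</urlset>
```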
One of the rules for sitemaps is that they can only include URLs on the same server, and under the same "directory" (path prefix), as the sitemap itself. So to include all our URLs, the sitemap URLs have to be at the top level of the URL hierarchy -- not under the "/api" path.
Our Ingress rules have two parts:
- /api goes to the API container
- everything else goes to the UI container
We need URLs like /sitemap-*.xml to go to the API server, too. Unfortunately, that doesn't work with the standard Exact and Prefix path types in the Ingress spec, so we need to use paths with the ImplementationSpecific pathType. Fortunately, both the AWS and NGINX Ingress controllers support regular expressions for routes.
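As a rough sketch, the extra rule for the NGINX Ingress controller might look something like this (the service name and port are assumptions, and the AWS setup differs in the details):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sitemap-to-api
  annotations:
    # Ask the NGINX Ingress controller to treat paths as regular expressions.
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  rules:
    - http:
        paths:
          # Route the sitemap index and the per-country / per-company
          # sitemaps to the API service instead of the UI.
          - path: /sitemap-.*\.xml
            pathType: ImplementationSpecific
            backend:
              service:
                name: api        # assumed service name
                port:
                  number: 80     # assumed port
```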
The main limiting factor is that a single sitemap can contain at most 50,000 URLs. Since we have >120,000 actors, each with their own URL, that's too many to fit in one sitemap file.
To handle this situation, it's possible to create a sitemap index: a file that links out to multiple sitemaps of 50,000 URLs or fewer each.
We'll have a single index, and break the URLs down into additional sitemaps like this (a sketch of the index format follows the list):
- Public actors (country, region or province, and city) go into country-wide sitemaps. We have 250 countries, so that's 250 sitemaps.
- Private actors (companies, facilities) go into sitemaps divided by LEI. That way, we don't have to do a lot of queries to figure out where the facilities or companies belong geographically.
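Here's roughly what the index looks like (the hostname and the country identifier are placeholders; the real entries point at the routes described below):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-country-CA.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-company-00.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-company-01.xml</loc>
  </sitemap>
</sitemapindex>
```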
LEIs are 20-character alphanumeric strings with these parts:
- First 4 characters: identifier for the LEI issuer (the Local Operating Unit, or LOU)
- Next 14 characters: random-ish identifier
- Last 2 characters: a 2-digit checksum ('00', '01', ... '99')
There are currently about 2M LEIs assigned in the world.
We could make one sitemap for each initial character (26 letters + 10 digits = 36 characters), but 2,000,000 / 36 ≈ 55,000 LEIs per sitemap already exceeds the 50,000-URL limit before counting facilities, so 36 buckets isn't enough.
Also, the distribution across sitemaps would probably be uneven, since the LOU identifiers are going to be clumped by country.
Instead, I'm using the last two characters (the checksum digits). That gives 100 buckets, which isn't too many sitemaps, and it works out to roughly 2,000,000 / 100 = 20,000 company URLs per sitemap, which leaves plenty of room for facilities.
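As a minimal sketch of the bucketing (the function name is made up; it just takes the two trailing checksum characters):

```python
def company_sitemap_bucket(lei: str) -> str:
    """Map an LEI to one of the 100 company sitemap buckets ('00' .. '99').

    The bucket is just the two trailing checksum digits, so no database
    lookup is needed to decide which sitemap a company belongs in.
    """
    lei = lei.strip().upper()
    if len(lei) != 20 or not lei.isalnum():
        raise ValueError(f"not a valid LEI: {lei!r}")
    return lei[-2:]
```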
There's a route /sitemap-country-{id}.xml for each country.
The endpoint gets IDs for these actors:
- The country
- The level-1 administrative regions in that country
- The level-2 administrative regions in the level-1s or in the country
- The cities that are in the level-2s, level-1s, or directly in the country
That's a lot of queries, but it's not usually a lot of data.
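For illustration, the dynamic generation boils down to something like this (the /actors/{id} URL pattern and the web_root argument are assumptions, not the real routes):

```python
from xml.sax.saxutils import escape

def render_urlset(actor_ids, web_root):
    """Render a sitemap <urlset> for a list of actor IDs.

    Hypothetical sketch: the real endpoint builds URLs for its own routes
    and returns the result from the API server.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for actor_id in actor_ids:
        loc = escape(f"{web_root}/actors/{actor_id}")
        lines.append(f"  <url><loc>{loc}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```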
There's a route /sitemap-company-{dd}.xml for each possible pair of digits ('00', '01', ... '99').
The endpoint gets IDs for these actors:
- The companies with this pair of digits at the end of their actor_id
- The facilities with this pair of digits at the end of their is_owned_by
That's a lot of queries, but it's not usually a lot of data.
PostgreSQL B-tree indexes handle prefix matches much better than suffix matches, so we may want to add an index on the reversed values at some point.
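A hedged sketch of what that might look like (the table name is assumed; the same pattern applies to facilities' is_owned_by):

```sql
-- As written, the suffix match can't use an ordinary B-tree index:
SELECT actor_id FROM actor WHERE actor_id LIKE '%42';

-- Possible future optimization: index the reversed value, then rewrite the
-- suffix match as a prefix match on it.  text_pattern_ops lets LIKE-prefix
-- queries use the index regardless of locale.
CREATE INDEX actor_id_reverse_idx ON actor (reverse(actor_id) text_pattern_ops);
SELECT actor_id FROM actor WHERE reverse(actor_id) LIKE '24%';
```

Note that reverse('…42') starts with '24', hence the reversed pattern in the second query.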
One way that Web spiders can find our sitemaps is if we submit them to a Web form. I already did this for Google and Bing, which covers most of the Web search market.
The other way they can find our sitemap is through the robots.txt file, which has a special Sitemap: field for exactly this purpose.
Unfortunately, the value needs to be an absolute URL, so we have to include the hostname in there.
I put code in the startup script for the UI that reads the $WEB_ROOT environment variable and generates the robots.txt file from that value, using m4.
It's a little tricky including that value in different Kubernetes deployment files, though.
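A sketch of that startup step, assuming a template file, an output path, and a sitemap index name that are made up for illustration (the real script may differ):

```sh
# robots.txt.m4 (template shipped with the UI image):
#   User-agent: *
#   Allow: /
#   Sitemap: WEB_ROOT/sitemap-index.xml
#
# At container startup, substitute the WEB_ROOT macro with the value of the
# $WEB_ROOT environment variable and write the result where the UI serves
# static files from.
m4 -D WEB_ROOT="$WEB_ROOT" robots.txt.m4 > /usr/share/nginx/html/robots.txt
```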