Editor’s note: OpenCorporates is a site that shares information on more than 100 million corporate entities around the world as open data. With its latest project on German companies, OpenCorporates added data on more than five million companies. To gather this data, the team spent months analyzing its identifiers and then extracting it from several sources — for the most part, unstructured gazette notices.
OpenCorporates collaborated on this project with Transparency International Deutschland and Open Knowledge Germany, as well as with German journalists to whom they granted early access to the data to see if they could find anything interesting. The data contributed to Süddeutsche Zeitung’s report “The Owner Remains Secret,” NDR’s report “Who Is Behind Which Company?” and Correctiv’s report “Who Owns Hamburg? Rent Under Palm Trees.”
In this post, the OpenCorporates team explains how they assembled the data.
Since we launched the dataset of over five million German companies, we’ve had lots of questions about how we assembled the data. This post aims to answer them. It’s quite detailed and technical, but we hope it will be of interest to a technical audience at least. Apologies that it is not in German, but do feel free to translate!
(A reminder: As well as being on OpenCorporates, the dataset can be downloaded from offeneregister.de, run by the Open Knowledge Foundation Deutschland, to whom we donated the data, and who in almost zero time built a great website around it.)
First, a general overview:
- There are 5.3 million companies in the dataset, of which 2.3 million are currently registered and 2.9 million are “removed.”
- Virtually every major data point comes not from the Handelsregister.de but from the Gazette Notices at Handelsregisterbekanntmachungen.de. These are simply a set of gazette notices about the incorporation or dissolution of the company, change in officers, change in district court or change in address. The text is unstructured and inconsistent, but we have been able to successfully extract data.
- These notices are numbered sequentially, and so we iterate through them, parsing the notices and extracting data – in particular the district court (Amtsgericht) with which the company is registered, the identifier (company number) issued by that court and the company type (HRB, HRA, etc.), which is used in the next step. We also parse the other information from the notices, including information on officers, addresses and also changes in court registration (when the entity moves from one district court to another).
- Parsing the officers is particularly difficult, given the free-text nature of the data, and in general we prioritize quality over quantity, i.e., we don’t want to generate lots of bad data (false positives) in our hunt for every last genuine result. This means there will definitely be some missing officers, and also a small number of minor parsing issues, particularly in splitting names into their constituent parts. So please email email@example.com with any issues you see – whether missing officers, or other problems. It’s also worth pointing out that if the underlying data were made available to all as open data – and not just to those who pay to have privileged access – this would cease to be a problem.
- We then visit the Handelsregister search page to see if the company with those details exists on the register. Some don’t – mainly as a result of court reorganizations as well as changes in company numbers on the Handelsregister that are not reflected in the original gazette notices, mostly affecting registered clubs and associations (meaning that a court/number-based search using the legacy detail does not return a matching entry). We then retrieve the company name and the current status from the search results. The company name is a little tricky to parse from the gazette notices, given the dirtiness of the data, but it’s not impossible, should we need to rely on just the gazette notices themselves. The status could also be derived from these notices, and that’s something we’re considering.
- We are not currently taking any information from the “entity data” details page for the company on Handelsregister – the incorporation date, share capital, legal form, deletion date, registered address. We looked at scraping this and did some tests, but that would be a lot of requests and we haven’t yet figured out how to do it without putting a strain on the source, which we don’t wish to do. In fact, we hope that now that the data has been made available as open data, the need for others to scrape the register is much reduced.
- A tiny amount of information was manually obtained from searching the Handelsregister website. Twenty-two companies with more than one headquarters (“Doppelsitz” or “Mehrfachsitz”) were identified by means of the advanced search functionality on the website. This information was then manually transformed into “Alternate Registration” data and inserted into the relevant company in our dataset. Manual collection is not something we usually do – but in this case it was necessary.
- Data for around 45,000 companies has been sourced solely from Handelsregister search listings – these are mostly inactive companies that pre-date the publication of electronic gazette notices at Handelsregisterbekanntmachungen.
- A few pieces of data returned in Handelsregister searches are stored in our database but are actually redundant, and we may remove them. These include the registered office town (we extract the full registered office from the gazette notice), and different representations of the company number and Amtsgericht [Editor’s Note: In Germany, the Amtsgericht is the local jurisdiction].
- Another key piece of work was matching different registrations of the same entity together – this happens when a legal entity moves from one area, and thus Amtsgericht, to another. The gazette notices about this are also very messy, and we’ve done a huge amount of work to figure out the situation and represent it as data – something the Handelsregister hasn’t done. For these situations, there are usually two notices: one for the existing court stating that the registration has moved to the new court, and one for the new court stating that the registration has come from the old jurisdiction.
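To make the relocation matching in the last point more concrete, here is a minimal sketch of pairing the two notices. It is not our production code: the `Notice` fields and the `"moved_to"` / `"came_from"` labels are hypothetical, and it assumes both notices name both registrations, which real notices often do not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Notice:
    court: str                            # court that published the notice
    number: str                           # register number at that court
    related_court: Optional[str] = None   # the other court named in the notice
    related_number: Optional[str] = None  # register number at the other court
    direction: Optional[str] = None       # "moved_to" or "came_from" (hypothetical labels)


def pair_relocations(notices):
    """Return (old, new) registration pairs confirmed by both notices."""
    # Collect every move announced by the old court as an (old, new) pair.
    outgoing = {
        ((n.court, n.number), (n.related_court, n.related_number))
        for n in notices
        if n.direction == "moved_to"
    }
    pairs = []
    for n in notices:
        if n.direction != "came_from":
            continue
        old = (n.related_court, n.related_number)
        new = (n.court, n.number)
        # Keep the pair only if the old court announced the same move.
        if (old, new) in outgoing:
            pairs.append((old, new))
    return pairs
```

Requiring confirmation from both sides follows the quality-over-quantity approach described above: an unconfirmed move is left unlinked rather than guessed.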
The Steps in Detail
1. Scrape the gazette notice from Handelsregisterbekanntmachungen.de – e.g. https://www.handelsregisterbekanntmachungen.de/en/skripte/hrb.php?rb_id=350704&land_abk=ni
2. Parse the gazette notice and attempt to extract the following data:
   a. Company number
   b. Court name
   c. Event date
   d. Publication date
   e. Type of notice (New, Amendment, Deletion)
   f. Related registration (subsequent or previous registration), including details of the related court and company number
   g. Officers – name, city, date of birth, position, type (derived: company or person)
   h. Registered address
3. Match the gazette notice court name from 2b to a valid XJustiz court ID so that it can be linked to Handelsregister. This requires fuzzy matching, as the court names on Handelsregisterbekanntmachungen notices are often misspelled.
4. Compute the OpenCorporates company number for the gazette notice.
5. For the computed number, search Handelsregister for that company.
6. From the Handelsregister search listings, capture the name, the current status and whether additional data is available from the register.
7. Construct the final company object:
   a. Attempt to cross-match new officers against the existing list so that any resignation of officers can be marked with an end_date
   b. Construct the officers array based on 7a
   c. Construct related_registrations based on information parsed from 2f
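Two of these steps, the fuzzy court-name matching and the officer cross-matching, can be sketched in a few lines. This is an illustration only: the `COURT_IDS` table uses placeholder IDs rather than real XJustiz IDs, the function names are hypothetical, and the similarity measure (Python's standard `difflib`) is an assumption, not necessarily what we run.

```python
import difflib

# Hypothetical excerpt of a court-name -> court-ID table (placeholder IDs).
COURT_IDS = {
    "Hannover": "COURT-001",
    "Hamburg": "COURT-002",
    "Hildesheim": "COURT-003",
}


def match_court(raw_name, cutoff=0.8):
    """Return the ID of the closest known court name, or None if none is close."""
    cleaned = raw_name.replace("Amtsgericht", "").strip()
    hits = difflib.get_close_matches(cleaned, list(COURT_IDS), n=1, cutoff=cutoff)
    return COURT_IDS[hits[0]] if hits else None


def _normalize(name):
    # Collapse whitespace and ignore case so trivial differences still match.
    return " ".join(name.split()).casefold()


def mark_resignations(existing, resigned_names, event_date):
    """Set end_date on existing officers whose name a notice reports as resigned."""
    resigned = {_normalize(n) for n in resigned_names}
    for officer in existing:
        if officer.get("end_date") is None and _normalize(officer["name"]) in resigned:
            officer["end_date"] = event_date
    return existing
```

A fairly high cutoff keeps a misspelling such as "Hanover" while rejecting names that merely resemble a court, which again trades some coverage for fewer false positives.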
And here’s a concrete example. The JSON for PPP3 UG (haftungsbeschränkt) looks like: https://gist.github.com/CountCulture/e30c192b14018ffa3c563fa0b432f441
This information was derived from the following gazette notices:
- https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=350099&land_abk=ni (Includes the move of the registration from one Amtsgericht to another.)
- https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=351890&land_abk=ni (Includes the move of the registration from one Amtsgericht to another.)
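The notice URLs above carry two query parameters, `rb_id` and `land_abk`. As a rough sketch, and assuming `rb_id` is the sequential notice number and `land_abk` a state abbreviation (our reading of the URLs, not documented behaviour), they can be taken apart and rebuilt with Python's standard library. The helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php"


def notice_params(url):
    """Extract (rb_id, land_abk) from a gazette notice URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["rb_id"][0]), query["land_abk"][0]


def notice_url(rb_id, land_abk):
    """Build the URL for one notice from its two parameters."""
    return f"{BASE}?{urlencode({'rb_id': rb_id, 'land_abk': land_abk})}"
```

Anyone iterating over notice numbers this way should throttle requests so as not to put a strain on the source.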
This article first appeared on the OpenCorporates blog and is reproduced here with permission.