Documentation
Projects
FinRegistry
FinRegistry is a joint research project of the Finnish Institute for Health and Welfare (THL) and the Data Science and Genetic Epidemiology Lab research group at the Institute for Molecular Medicine Finland (FIMM), University of Helsinki. The project aims to develop new ways to model the complex relationships between health and risk factors. To that end, we develop statistical and machine learning models to understand and predict disease occurrences using high-resolution longitudinal data. FinRegistry utilizes the unique registry system in Finland to combine health data with a wide range of other information from nearly the whole population of Finland. FinRegistry includes all individuals alive and living in Finland on 1 January 2010 (FinRegistry index persons) as well as the index persons' parents, siblings, children, and spouses.
FinnGen
FinnGen is a large-scale academic/industrial research collaboration launched in Finland in 2017 with the aim of collecting and analyzing genomic and health data from 500,000 Finnish biobank participants by 2023. The project aims to improve human health through genetic research, and ultimately to identify new therapeutic targets and diagnostics for treating numerous diseases. It produces near-complete genome variant data from all 500,000 participants using GWAS genotyping and imputation, and utilizes the extensive longitudinal national health register data available on all Finns. The latest data freeze, from April 2022, consists of over 392,000 individuals. The study currently involves Finnish biobanks, University Hospitals and their respective Universities, the Finnish Institute for Health and Welfare (THL), the Finnish Red Cross Blood Service, the Finnish Biobanks - FINBB and thirteen pharmaceutical companies. The University of Helsinki is the organization responsible for the study.
Methods
Ontology
Endpoints are linked to the international ontologies DOID, MESH, and EFO, and links to the ontologies are presented on Risteys when available. The mapping is carried out using an automated algorithm followed by manual curation.
First, the following hierarchical algorithm is used to link endpoints to DOID and MESH codes:
- ICD-10 codes are matched to DOID ICD-10-CM codes
- endpoint names are matched to DOID names and synonyms
- endpoints are matched to MESH codes and converted to DOID
- ICD-10 codes are matched one step up in the ICD-10 hierarchy
- endpoint names are matched to DOID names using Ratcliff/Obershelp pattern matching (similarity > 0.69)
The resulting DOID and MESH codes are mapped to EFO when a mapping is available.
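The last step of the list above can be illustrated with a minimal sketch. It assumes Python's standard `difflib.SequenceMatcher`, whose `ratio()` is based on the Ratcliff/Obershelp approach; the source does not say which implementation was used. The DOID-like codes and names below are hypothetical placeholders, and the lower-casing of names is an assumption.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.69  # similarity cut-off stated in the text


def similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style similarity in [0, 1], compared case-insensitively."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(endpoint_name, doid_names):
    """Return (code, score) for the most similar name, or None if not above the threshold."""
    scored = [(code, similarity(endpoint_name, name)) for code, name in doid_names.items()]
    code, score = max(scored, key=lambda pair: pair[1])
    return (code, score) if score > THRESHOLD else None


# Hypothetical entries for illustration only; not real DOID records.
doid_names = {
    "DOID:0000001": "type 2 diabetes mellitus",
    "DOID:0000002": "asthma",
}

match = best_match("Type 2 diabetes", doid_names)  # close enough to the first entry
no_match = best_match("Hip fracture", doid_names)  # nothing above the threshold
```

In a cascade like the one described, a fuzzy step of this kind would only run for endpoints that the earlier, stricter steps left unmatched.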
Next, the fuzzy matching algorithm OnToma and the ontology annotations for endpoints available on the Open Targets portal are used to link endpoints to EFO codes.
Finally, endpoints with discordant EFO annotations between the existing mappings, OnToma, and Open Targets are manually checked and corrected.
Key figures & distributions
Key figures and the year and age distributions were computed using data of all persons in FinRegistry and FinnGen. Figures are presented for FinRegistry index persons, the whole population in FinRegistry, and FinnGen.
The key figures include the following statistics:
- Number of individuals: number of distinct individuals with the endpoint of interest
- Period prevalence: Number of individuals with the endpoint of interest divided by the total number of individuals in the cohort (FinRegistry index persons, FinRegistry, or FinnGen)
- Median age at first event: Median age at the first occurrence of the endpoint in the registry data
Distributions are presented by age and year at the first event. Bars in distributions are aggregated to include at least 5 individuals, given the sensitive nature of the data.
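As a rough illustration of the statistics above, the following sketch computes the three key figures and merges histogram bars so that each holds at least 5 individuals. The function names and the left-to-right merging rule are assumptions made for this example; the text does not specify how bars are aggregated.

```python
from statistics import median


def key_figures(first_event_ages, cohort_size):
    """first_event_ages holds one age at first event per individual with the endpoint."""
    n = len(first_event_ages)
    return {
        "n_individuals": n,
        "period_prevalence": n / cohort_size,
        "median_age_first_event": median(first_event_ages) if n else None,
    }


def aggregate_bins(counts, min_count=5):
    """Merge adjacent bins left to right until each merged bin has at least min_count.

    Returns [first_bin_index, last_bin_index, count] triples. A trailing
    remainder that is still too small is folded into the previous merged bin.
    """
    merged = []
    current = None
    for i, count in enumerate(counts):
        if current is None:
            current = [i, i, count]
        else:
            current[1] = i
            current[2] += count
        if current[2] >= min_count:
            merged.append(current)
            current = None
    if current is not None:
        if merged:
            merged[-1][1] = current[1]
            merged[-1][2] += current[2]
        else:
            merged.append(current)
    return merged


figures = key_figures([34, 51, 47, 62], cohort_size=1000)
bars = aggregate_bins([2, 1, 4, 7, 3, 1])
```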
Cumulative incidence function (CIF)
The cumulative incidence function (CIF) presents the incidence of an endpoint by age and sex. When death is regarded as a competing event, the CIF is interpreted as the probability of getting the endpoint, given that it is also possible to die without the endpoint. CIF was estimated using the Aalen-Johansen estimator in a competing risks framework where death was treated as a competing event. The model was stratified by sex, and age was used as a timescale to obtain CIF estimates by age.
The eligibility criteria for CIF are as follows:
- born before the end of the follow-up (31.12.2019)
- either not dead or died during the follow-up period (1.1.1998 to 31.12.2019)
- sex information is available
- for cases, the outcome endpoint has to occur during the follow-up period
We sampled all cases, or at most 10,000 of them, and 1.5 controls per case among the non-cases. Subjects were weighted by the inverse of the sampling probability to account for the sampling design. We required at least 50 cases and controls during this period for running the analysis. Moreover, CIF is only presented for ages with at least 5 cases due to the sensitive nature of the data.
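The sampling design and the inverse-probability weights can be sketched as follows. This is a simplified illustration, not the project's code: the function name, the use of simple random sampling, and the rounding of the control count are assumptions.

```python
import random

MAX_CASES = 10_000        # cap on sampled cases, from the text
CONTROLS_PER_CASE = 1.5   # control-to-case ratio, from the text


def sample_with_weights(case_ids, noncase_ids, seed=0):
    """Sample cases and controls, and weight each subject by 1 / sampling probability.

    Assumes both input lists are non-empty.
    """
    rng = random.Random(seed)
    n_cases = min(len(case_ids), MAX_CASES)
    n_controls = min(len(noncase_ids), int(CONTROLS_PER_CASE * n_cases))
    cases = rng.sample(case_ids, n_cases)
    controls = rng.sample(noncase_ids, n_controls)
    # Sampling probability is n_sampled / n_available, so the weight is its inverse.
    case_weight = len(case_ids) / n_cases
    control_weight = len(noncase_ids) / n_controls
    weights = {pid: case_weight for pid in cases}
    weights.update({pid: control_weight for pid in controls})
    return cases, controls, weights


# Hypothetical person identifiers: 200 cases and 1,000 non-cases.
cases, controls, weights = sample_with_weights(list(range(200)), list(range(1000, 2000)))
```

With these weights, the sampled subjects together stand in for the full eligible population: the weights sum to the number of cases plus non-cases.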
The Aalen-Johansen estimates were obtained using the Lifelines Python library.
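To show what the estimator computes, here is a small pure-Python sketch of a weighted Aalen-Johansen cumulative incidence for one event of interest with death as the competing event. It is not the Lifelines implementation; it ignores delayed entry and confidence intervals, and the event coding (0, 1, 2) is an assumption made for the example.

```python
from collections import defaultdict


def aalen_johansen_cif(ages, events, weights=None, event_of_interest=1):
    """Cumulative incidence of one event type under competing risks.

    events: 0 = censored, 1 = endpoint, 2 = death without the endpoint.
    Returns (age, cumulative incidence) pairs at each age with any event.
    """
    if weights is None:
        weights = [1.0] * len(ages)
    d_interest = defaultdict(float)  # weighted events of interest per age
    d_any = defaultdict(float)       # weighted events of any type per age
    leaving = defaultdict(float)     # weighted subjects leaving the risk set per age
    for age, event, w in zip(ages, events, weights):
        leaving[age] += w
        if event != 0:
            d_any[age] += w
        if event == event_of_interest:
            d_interest[age] += w

    at_risk = sum(weights)
    event_free = 1.0  # probability of no event of any type before the current age
    cif = 0.0
    result = []
    for age in sorted(leaving):
        if d_any[age] > 0:
            cif += event_free * d_interest[age] / at_risk
            event_free *= 1.0 - d_any[age] / at_risk
            result.append((age, cif))
        at_risk -= leaving[age]
    return result


# Toy data: five subjects followed on the age timescale.
cif_curve = aalen_johansen_cif(ages=[50, 55, 60, 60, 70], events=[1, 2, 0, 1, 1])
```

Passing the inverse sampling probabilities as `weights` turns the event and at-risk counts into weighted sums, which is one common way to account for a case-control style sample.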
Mortality
The goal of the mortality analysis is to estimate the association between an exposure endpoint and death. The results include estimates for the coefficients as well as absolute mortality risk estimations. A Cox proportional hazards model was used to estimate mortality associated with an endpoint. Age was used as a timescale and birth year was included as a covariate to account for calendar effects. The model was stratified by sex.
The eligibility criteria for the mortality analysis are as follows:
- born before the end of the follow-up (31.12.2019)
- either not dead or died during the follow-up period (1.1.1998 to 31.12.2019)
- sex information is available
- for the exposed persons, the exposure endpoint has to occur during the follow-up period and at least 30 days prior to death. Persons exposed less than 30 days before death are considered unexposed.
Exposure-stratified sampling was applied to acquire a sufficient number of persons for the analysis. At least 50 exposed and unexposed cases and controls were required. We sampled all cases, or at most 10,000 of them, and 1.5 controls per case among the non-cases. The model was weighted by the inverse of the sampling probability to account for the sampling design.
Mortality risks can be used to estimate the risk of death given exposure. Conditional mortality risks represent the risk of an event by time t given that no event has occurred by time t0. Conditional mortality risks were computed using the following formula: MR(t | t0) = 1 - S(t) / S(t0), where t0 is the age at baseline, t is the target age, and S is the survival function. The birth year was set to the current year minus the baseline age.
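The formula can be checked with a short sketch. The exponential survival function below is a hypothetical stand-in with a constant hazard; in the analysis described here, S would come from the fitted Cox model.

```python
import math


def conditional_mortality_risk(survival, t0, t):
    """MR(t | t0) = 1 - S(t) / S(t0), for a target age t at or after baseline age t0."""
    if t < t0:
        raise ValueError("target age must not be earlier than baseline age")
    s0 = survival(t0)
    if s0 <= 0:
        raise ValueError("survival at baseline must be positive")
    return 1.0 - survival(t) / s0


def toy_survival(age, hazard=0.02):
    """Hypothetical survival function with a constant yearly hazard."""
    return math.exp(-hazard * age)


# Risk of death by age 70 for a person who is alive at age 60.
risk = conditional_mortality_risk(toy_survival, t0=60, t=70)
```

Because the toy hazard is constant, the result depends only on the 10-year gap and equals 1 - exp(-0.02 * 10).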
The Cox proportional hazards model was fitted using the Lifelines Python library.
Relationships - Survival analysis
The goal of the endpoint-to-endpoint survival analysis is to estimate the association between two clinical endpoints, the prior endpoint and the outcome endpoint. We used a Cox proportional hazards model with age as a timescale to estimate the hazard ratio between the prior endpoint and the outcome. Birth year and sex were used as covariates.
The eligibility criteria for the survival analysis are as follows:
- born before the end of the follow-up (31.12.2019)
- either not dead or died during the follow-up period (1.1.1998 to 31.12.2019)
- sex information is available
- for individuals with the prior endpoint, the prior endpoint has to occur during the follow-up period and no more than 180 days prior to the outcome endpoint
We sampled all cases (persons with the outcome endpoint), or at most 10,000 of them, and 1.5 controls per case among the non-cases, separately for individuals with and without the prior endpoint. For sex-specific endpoints, controls were sampled from the same sex. The model was weighted by the inverse of the sampling probability to account for the sampling design, as in the mortality analysis.
The Cox proportional hazards model was fitted using the Lifelines Python library.