Relational database tools

Tools for loading MEDLINE into a local relational database
May 6, 2022 – 11:37 am

Representation of author information in the database schema. The typical table has a PubMed identifier (pmid) associated with other fields.

Our development team included a group of researchers from the University of California at Berkeley and another group from Stanford University. We shared similar goals in that we all wanted to load MEDLINE into a relational database, but because we were in two different departments at two different institutions, we had different project constraints and timelines. Thus, our groups were loosely associated in the software development process, but not closely integrated, and therefore, the original schema that we shared diverged.

The result was three MEDLINE schemas and three software variants: One schema was used with Java code developed at Berkeley, another schema was used with Berkeley's code modified to run at Stanford, and the third schema was used with Perl code developed at Stanford. Here we describe the underlying design that influenced all three of the schemas. (The schema used for the Java program did not include information from DTD elements DataBankList and AccessionNumberList. This has been corrected in the most recent version of the software available on the Berkeley website.)

The main table in the schema is medline_citation. The table contains the PMID as the primary key and has additional columns that correspond to single-valued elements in the DTD, where the values of those elements depend on the PMID. The medline_abstract table is similar in that it has a PMID as the primary key and columns of data that depend on the PMID. Since document abstracts are larger than the other data types, we initially placed them in a separate table. However, since abstracts are stored as CLOBs (Character Large Objects), they are not stored in the same pages as the rest of the fields in the table. Therefore, in a more recent implementation, we removed the medline_abstract table from the schema and added the abstract_text field as a CLOB in the medline_citation table. This change reduces the number of tables by one and eliminates the need for a join between the medline_citation and medline_abstract tables.

Some tables in the schema have more than one row corresponding to the same PMID. Columns in these tables map to multi-valued elements in the DTD. Examples are the table medline_keyword_list, which stores multiple values of keyword for a given PMID, and medline_gene_symbol_list, which stores multiple values of gene_symbol for a given PMID.
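The layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the schema distributed with the software: the table names, pmid, abstract_text, and keyword come from the text, while article_title, the sample rows, and the use of SQLite (whose TEXT type stands in for a CLOB) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medline_citation (
    pmid          INTEGER PRIMARY KEY,  -- one row per citation
    article_title TEXT,                 -- hypothetical single-valued column
    abstract_text TEXT                  -- a CLOB in the source; TEXT in SQLite
);
CREATE TABLE medline_keyword_list (
    pmid    INTEGER NOT NULL REFERENCES medline_citation (pmid),
    keyword TEXT NOT NULL               -- many rows may share one pmid
);
""")

# One citation (single-valued data) with two keywords (multi-valued data).
conn.execute(
    "INSERT INTO medline_citation VALUES (?, ?, ?)",
    (1, "Example title", "Example abstract."),
)
conn.executemany(
    "INSERT INTO medline_keyword_list VALUES (?, ?)",
    [(1, "genomics"), (1, "databases")],
)

# The abstract is read from the main table without a join.
abstract = conn.execute(
    "SELECT abstract_text FROM medline_citation WHERE pmid = ?", (1,)
).fetchone()[0]

# Multi-valued data come back as several rows for the same pmid.
keywords = [
    row[0]
    for row in conn.execute(
        "SELECT keyword FROM medline_keyword_list WHERE pmid = ? ORDER BY keyword",
        (1,),
    )
]
```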

The element Article in the DTD has a one-to-one relationship with a PubMed identifier: each citation has exactly one article. Rather than giving Article its own table, we put single-valued data from Article into the table medline_citation.

To keep track of the name of the file from which data are read for a given citation, we added the field xml_file_name to the medline_citation table. This field does not correspond to any element in the DTD structure, but allows the database administrator to go back to the original XML file if necessary to find the original source of the data.

We could have stored each author only once in a table of its own, and assigned each author a unique integer primary key to serve as an author identifier. An author is represented by a combination of values in fields for last name, forename, first name, middle name, initials, suffix, affiliation, and collective name. Another table would have stored the set of author identifiers associated with each PMID, and because integer joins are fast, this design would have facilitated rapid search for all PMIDs associated with a given author, by joining the author table with the table of author identifiers and citations. However, there are several drawbacks to this approach. Generating integer primary keys during loading requires that either a lookup be done to see if each author of each citation already exists or not (35 million lookups), or all authors and primary keys must be kept in memory. The former approach is very time consuming during loading; the latter approach strains memory resources. In addition, regardless of how primary keys are managed during loading, it is not possible to determine algorithmically if two different representations of one author are actually the same author, or if one representation is actually two different authors. We therefore avoided generating unique primary keys and repeated all eight fields representing the author for every citation occurrence of that author.
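A minimal sketch of the repeated-author design, again using sqlite3 for illustration. The table name medline_author, the column names, the index, and the sample rows are assumptions; only the idea of storing the eight author fields alongside each pmid comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Eight author fields are repeated for every citation; there is no
# generated author identifier. Table and column names are hypothetical.
conn.execute("""
CREATE TABLE medline_author (
    pmid INTEGER NOT NULL,
    last_name TEXT, fore_name TEXT, first_name TEXT, middle_name TEXT,
    initials TEXT, suffix TEXT, affiliation TEXT, collective_name TEXT
)""")
# An index on the name column is one way to keep author searches fast
# without integer author keys (an assumption, not the authors' stated choice).
conn.execute("CREATE INDEX idx_author_last_name ON medline_author (last_name)")

rows = [
    (101, "Smith", "J"),
    (102, "Smith", "JA"),  # same person as "Smith J"? The data cannot say.
    (102, "Lee", "K"),
    (103, "Smith", "J"),
]
conn.executemany(
    "INSERT INTO medline_author (pmid, last_name, initials) VALUES (?, ?, ?)",
    rows,
)

def pmids_for(last_name, initials):
    """Return the PMIDs whose author rows match this exact representation."""
    cur = conn.execute(
        "SELECT DISTINCT pmid FROM medline_author "
        "WHERE last_name = ? AND initials = ? ORDER BY pmid",
        (last_name, initials),
    )
    return [row[0] for row in cur]

smith_j = pmids_for("Smith", "J")
smith_ja = pmids_for("Smith", "JA")
```

Loading needs no per-author lookup because rows are simply appended. The two representations of "Smith" stay separate, and deciding whether they refer to the same person is left to whoever queries the data.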

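The accompanying figure shows relationships among the tables. The table medline_citation is a parent of thirteen other tables (it contains the primary key pmid, which the child tables reference as a foreign key), and the table medline_mesh_heading is a parent of medline_mesh_heading_qualifier.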

Dependencies in the database schema. Parent tables contain primary keys that child tables reference as foreign keys. The main table, medline_citation, is a parent of thirteen other tables. The table medline_mesh_heading is a parent of medline_mesh_heading_qualifier.

Parsing and loading software

We implemented three versions of software that parses and loads MEDLINE. The first was Java MedlineParser, which was developed at Berkeley [see additional file 1]. The second was the same Java code, modified to run at Stanford. The third was Perl ParseMedline, which was developed at Stanford.

All versions of the software perform two basic tasks: (1) they parse the XML files to collect data, and (2) they load the data into the database. Figure 4 shows the steps involved. Data can be loaded as they are collected, or can be written out to disk initially and loaded later. All three versions offer these two options to the user. Document parsing is processor intensive and data insertion is disk intensive, so if needed, the two tasks can be executed at different times to accommodate other demands on the server.

Figure 4

MEDLINE database development process. In Step 1, the user loads the schema, creating empty tables in the database. In Step 2, the conversion software parses the XML files and either loads the data directly into the database (2a), or writes the data out to intermediate text files (2b). If intermediate text files are generated, data from those files are loaded into the database as a separate step in Step 3.
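The three steps can be sketched in a few lines of Python. This is not the Java MedlineParser or Perl ParseMedline code: the sample XML is a tiny hand-written fragment, the tab-separated intermediate format is an assumption, and the function and column names (other than pmid and xml_file_name) are hypothetical.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>11111</PMID>
    <Article><ArticleTitle>First example</ArticleTitle></Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>22222</PMID>
    <Article><ArticleTitle>Second example</ArticleTitle></Article>
  </MedlineCitation>
</MedlineCitationSet>"""

def create_schema(conn):
    """Step 1: create the empty table."""
    conn.execute(
        "CREATE TABLE medline_citation ("
        "pmid INTEGER PRIMARY KEY, article_title TEXT, xml_file_name TEXT)"
    )

def parse_citations(xml_text, file_name):
    """Step 2: yield one (pmid, title, xml_file_name) tuple per citation."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle")
        yield (pmid, title, file_name)

def load_rows(conn, rows):
    """Insert parsed rows into the database (used by steps 2a and 3)."""
    conn.executemany("INSERT INTO medline_citation VALUES (?, ?, ?)", rows)
    conn.commit()

def write_intermediate(rows, handle):
    """Step 2b: write parsed rows to a tab-separated text file."""
    csv.writer(handle, delimiter="\t").writerows(rows)

def read_intermediate(handle):
    """Step 3: read rows back from the intermediate text file."""
    for pmid, title, file_name in csv.reader(handle, delimiter="\t"):
        yield (int(pmid), title, file_name)

# Option 2a: parse and load in one pass.
direct = sqlite3.connect(":memory:")
create_schema(direct)
load_rows(direct, parse_citations(SAMPLE_XML, "sample.xml"))

# Options 2b + 3: parse now, load later from the intermediate file
# (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
write_intermediate(parse_citations(SAMPLE_XML, "sample.xml"), buffer)
buffer.seek(0)
deferred = sqlite3.connect(":memory:")
create_schema(deferred)
load_rows(deferred, read_intermediate(buffer))

QUERY = "SELECT pmid, article_title, xml_file_name FROM medline_citation ORDER BY pmid"
direct_rows = direct.execute(QUERY).fetchall()
deferred_rows = deferred.execute(QUERY).fetchall()
```

Both paths end with the same rows in the database. The deferred path simply separates the processor-bound parsing from the disk-bound insertion, so each can be scheduled independently.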

Source: bmcbioinformatics.biomedcentral.com