Training

These sections demonstrate how to train a Snowball relationship extraction system to extract Curie temperature relationships from sentences within a scientific article.

The general training process works as follows:

  • Define the entities present in the Chemical Relationship
  • Parse a corpus of articles to find these examples in the text
  • Cluster the sentences to learn common patterns that specify the relationships
  • Assign likelihoods describing the probability that the patterns are correct

Curie Temperature Relationships

The Curie temperature of a magnetic material is the temperature above which the material ceases to be ferromagnetic and becomes paramagnetic. As such, a Curie temperature relationship consists of four entities:

  • A compound
  • A specifier, e.g. "Curie temperature" or "TC"
  • A temperature value
  • A temperature unit, e.g. Kelvin

Defining the Entities

First, define a standard ChemDataExtractor model for the Curie temperature


from chemdataextractor.relex import Snowball, ChemicalRelationship
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound
from chemdataextractor.parse import R, I, W, Optional, merge, join, OneOrMore, Any, ZeroOrMore, Start
from chemdataextractor.parse.cem import chemical_name, chemical_label
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.parse.common import lrb, rrb, delim
from chemdataextractor.utils import first
from chemdataextractor.doc import Paragraph, Heading, Sentence
from lxml import etree
import re

class CurieTemperature(BaseModel):
    specifier = StringType()
    value = StringType()
    units = StringType()

Compound.curie_temperatures = ListType(ModelType(CurieTemperature))
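
As a quick sanity check, you can construct a record by hand and serialize it. This is purely illustrative: the compound and value below are just example data, and the exact serialized layout may differ slightly between ChemDataExtractor versions.


# Illustrative only: build a CurieTemperature record manually and inspect its serialized form
c = Compound(names=['CrO2'],
             curie_temperatures=[CurieTemperature(specifier='TC', value='390', units='K')])
print(c.serialize())
# should give something like:
# {'names': ['CrO2'], 'curie_temperatures': [{'specifier': 'TC', 'value': '390', 'units': 'K'}]}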

Now define expressions for identifying the entities


# Define a very basic entity tagger
specifier = (I('curie') + I('temperature') + Optional(lrb | delim) + Optional(R('^T(C|c)(urie)?')) + Optional(rrb) | R('^T(C|c)(urie)?'))('specifier').add_action(join)
units = (R(r'^[CFK]\.?$'))('units').add_action(merge)
value = (R(r'^\d+(\.\d+)?$'))('value')
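
You can quickly check these taggers against a single sentence. The snippet below is a minimal sketch: it assumes the parse elements expose the usual scan() interface over a sentence's tagged tokens, and the example sentence is made up.


# Illustrative only: scan one sentence with the entity taggers and print any matches
s = Sentence('CrO2 is ferromagnetic with a Curie temperature of 390 K.')
for result, start, end in (specifier | value + units).scan(s.tagged_tokens):
    print(etree.tostring(result))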

Note that we tag each entity with a unique name that will be used later. Now let the entities in a sentence be any combination of these (in whatever order you like). Here we specify that the value and units must appear together, but this does not have to be the case. We also define an extremely general parse phrase; this will be used to identify candidate sentences.


# Let the entities be any combination of chemical names, specifier values and units
entities = (chemical_name | specifier | value + units)

# Now create a very generic parse phrase that will match any combination of these entities
curie_temperature_phrase = (entities + OneOrMore(entities | Any()))('curie_temperature')

# List all the entities
curie_temp_entities = [chemical_name, specifier, value, units]

We are now ready to start Snowballing. Let's formalise our ChemicalRelationship, passing in the entities, the extraction phrase and a name.


curie_temp_relationship = ChemicalRelationship(curie_temp_entities, curie_temperature_phrase, name='curie_temperatures')

Training the system

Create a Snowball object to use on our relationship and point it to a path for training. Here we will use the default parameters:

  • Tc = 0.95, the minimum confidence required for a new relationship to be accepted
  • Tsim = 0.95, the minimum similarity between phrases for them to be clustered together
  • learning_rate = 0.5, how quickly the system updates the confidences based on new information
  • prefix_length = 1, the number of tokens in the phrase prefix
  • suffix_length = 1, the number of tokens in the phrase suffix
  • prefix_weight = 0.1, the weight of the prefix in determining similarity
  • middles_weight = 0.8, the weight of the middles in determining similarity
  • suffix_weight = 0.1, the weight of the suffix in determining similarity

Note that increasing Tc and Tsim yields more extraction patterns but stricter rules for accepting new relations.
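
If you want different values, the parameters can be passed to the Snowball constructor. The keyword names in the sketch below simply mirror the list above; treat them as an assumption and check the Snowball signature in your installation.


# Sketch only: keyword names assumed to match the parameter list above
snowball = Snowball(curie_temp_relationship,
                    tc=0.95,
                    tsim=0.95,
                    learning_rate=0.5,
                    prefix_length=1,
                    suffix_length=1,
                    prefix_weight=0.1,
                    middles_weight=0.8,
                    suffix_weight=0.1)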

Now create a Snowball object and begin training


snowball = Snowball(curie_temp_relationship)
snowball.train(corpus='../tests/data/relex/curie_training_set/')

The training process is online. This means that you can train the system on as many papers as you like, and it will continue to update its knowledge base. For each paper, the sentences are scanned for matches to the parse phrase and, if a sentence matches, candidate relationships are formed. There can be many candidate relationships in a single sentence, so the output presents all available candidates. To accept a relationship, type the number (or numbers) of the candidates you wish to accept: if you want candidate 0 only, type '0' and press enter; if you want 0 and 3, type '0,3' and press enter. If you don't want any of the candidates, press any other key, e.g. 'n' or 'no'.

This training process automatically clusters the sentences you accept and updates the knowledge base. You can check what has been learned by looking in the relex data folder. You can always stop training and start again, or come back to the same training session later, simply by loading in an existing snowball system using Snowball.load()
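
For example, to come back to an earlier session you might do something like the following. The pickle filename is illustrative; point Snowball.load() at whatever file your training run wrote to the relex data folder.


# Sketch only: resume training from a previously saved Snowball system
snowball = Snowball.load('curie_temperatures.pkl')
snowball.train(corpus='../tests/data/relex/curie_training_set/')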

Seeing what has been learned

Looking into data/relex/curie_temperatures_patterns.txt, we see what patterns were learned from our training:

  • "NAME is a ferromagnetic transition metal exhibiting a high SPECIFIER of VALUE UNITS ("
  • "the NAME nanocrystals show a transition temperature SPECIFIER at around VALUE UNITS ("
  • "NAME is ferromagnetic with a SPECIFIER of VALUE UNITS and"
  • ", NAME has recently attracted much attention due to its high SPECIFIER ∼ VALUE UNITS )"