Examples

Context

metaknowledge is a Python library for creating and analyzing scientific metadata. It uses records obtained from Web of Science (WOS), Scopus and other sources. It is intended to be usable by those who do not know much Python, so this page gives a short overview of its capabilities to help you use it for your own work. For complete coverage of the package, as well as install instructions, read the full documentation here.

This document was made from a Jupyter notebook. If you know how to use them and wish to have an interactive version of this page, you can download the notebook here and the sample file here. Now let's begin.

Importing

First you need to import the metaknowledge package

In [1]:
import metaknowledge as mk

And you will often need the networkx package

In [2]:
import networkx as nx

I am using matplotlib to display the graphs and to make them look nice when displayed

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

metaknowledge also has a matplotlib-based graph visualizer that we will use. The module that provides it also contains the titular contour plot generator

In [4]:
import metaknowledge.contour as mkv

pandas is also used in one example

In [5]:
import pandas

Reading Files

The files used here are from WOS, but the instructions apply to any of the sources. Records can be loaded into a RecordCollection by creating a RecordCollection with the path to the files given to it as a string.

In [6]:
RC = mk.RecordCollection("savedrecs.txt")
RC
Out[6]:
<metaknowledge.RecordCollection object savedrecs>

You can also read a whole directory; in this case it is reading the current working directory

In [7]:
RC = mk.RecordCollection(".")
repr(RC)
Out[7]:
'<metaknowledge.RecordCollection object files-from-.>'

metaknowledge can detect if a file is a valid WOS file or not and will read the entire directory and load only those that have the right header. You can also tell it to only read a certain type of file, by using the extension argument.

In [8]:
RC = mk.RecordCollection(".", extension = "txt")
repr(RC)
Out[8]:
'<metaknowledge.RecordCollection object txt-files-from-.>'

Now you have a RecordCollection composed of all the WOS records in the selected file(s).

In [9]:
print("RC is a " + str(RC))
RC is a RecordCollection(txt-files-from-.)

Record object

Record is an object that contains a single record, for example a journal article, book, or conference proceedings. Records are what RecordCollections contain. To see an individual Record at random from a RecordCollection you can use peek()

In [10]:
R = RC.peek()

A single Record can give you all the information it contains about its record. For example, to get its authors:

In [11]:
print(R['authorsFull'])
print(R.get('AF'))
['RIGNEAULT, H', 'FLORY, F', 'MONNERET, S']
['RIGNEAULT, H', 'FLORY, F', 'MONNERET, S']

Converting a Record to a string will give its title

In [12]:
print(R)
WOSRecord(NONLINEAR TOTALLY REFLECTING PRISM COUPLER - THERMOMECHANIC EFFECTS AND INTENSITY-DEPENDENT REFRACTIVE-INDEX OF THIN-FILMS)

If you try to access a tag the Record does not have, it will raise a KeyError unless get() is used

In [13]:
try:
    print(R['GP'])
except KeyError as k:
    print(k)
"'GP' could not be found in the Record"

There are two ways of getting a tag: one is to use the source's names, in this case the WOS two-letter abbreviations, and the other is to use the human-readable name. There is no standard for the human-readable names, so they are specific to metaknowledge. They are: 'year', 'volume', 'beginningPage', 'DOI', 'address', 'j9', 'citations', 'grants', 'selfCitation', 'authorsShort', 'authorsFull', 'title', 'journal', 'keywords', 'abstract' and 'id'.

To see how the WOS names map to the long names look at the complete documentation. If you want all the tags a Record has use keys() like a dict.

In [14]:
print(list(R.keys()))
['PT', 'AU', 'AF', 'TI', 'SO', 'LA', 'DT', 'DE', 'ID', 'AB', 'RP', 'CR', 'NR', 'TC', 'Z9', 'PU', 'PI', 'PA', 'SN', 'J9', 'JI', 'PD', 'PY', 'VL', 'IS', 'BP', 'EP', 'PG', 'WC', 'SC', 'GA', 'UT', 'PM']
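
The two naming schemes can be pictured as a small alias table sitting in front of a dict. The sketch below is not how metaknowledge implements it: AliasRecord and LONG_TO_TAG are hypothetical names, the table holds only the two pairs that can be read off the examples on this page ('authorsFull'/'AF' and 'title'/'TI'), and the title value is a placeholder.

```python
# Hypothetical stand-in for a Record: a dict keyed by WOS tags,
# plus a small alias table for the long names.
LONG_TO_TAG = {"authorsFull": "AF", "title": "TI"}  # partial, for illustration only


class AliasRecord:
    def __init__(self, fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # Translate a long name to its tag; tags pass through unchanged.
        tag = LONG_TO_TAG.get(key, key)
        return self._fields[tag]  # raises KeyError if the tag is absent

    def get(self, key, default=None):
        # Like dict.get(): return a default instead of raising.
        try:
            return self[key]
        except KeyError:
            return default

    def keys(self):
        return self._fields.keys()


rec = AliasRecord({"AF": ["RIGNEAULT, H", "FLORY, F", "MONNERET, S"], "TI": "placeholder title"})
print(rec["authorsFull"] == rec["AF"])  # both names reach the same field
print(rec.get("GP"))                    # missing tag, so the default comes back
```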

RecordCollection object

RecordCollection is the object that metaknowledge uses the most. It is your interface with the data you want.

To iterate over all of the Records you can use a for loop

In [15]:
for R in RC:
    print(R)
WOSRecord(NONLINEAR TOTALLY REFLECTING PRISM COUPLER - THERMOMECHANIC EFFECTS AND INTENSITY-DEPENDENT REFRACTIVE-INDEX OF THIN-FILMS)
WOSRecord(WHY ENERGY FLUX AND ABRAHAMS PHOTON MOMENTUM ARE MACROSCOPICALLY SUBSTITUTED FOR MOMENTUM DENSITY AND MINKOWSKIS PHOTON MOMENTUM)
WOSRecord(SHIFTS OF COHERENT-LIGHT BEAMS ON REFLECTION AT PLANE INTERFACES BETWEEN ISOTROPIC MEDIA)
WOSRecord(RESONANCE EFFECTS ON TOTAL INTERNAL-REFLECTION AND LATERAL (GOOS-HANCHEN) BEAM DISPLACEMENT AT THE INTERFACE BETWEEN NONLOCAL AND LOCAL DIELECTRIC)
WOSRecord(Goos-Hanchen shift as a probe in evanescent slab waveguide sensors)
WOSRecord(Transverse displacement at total reflection near the grazing angle: a way to discriminate between theories)
WOSRecord(GENERAL STUDY OF DISPLACEMENTS AT TOTAL REFLECTION)
WOSRecord(SPIN ANGULAR-MOMENTUM OF A FIELD INTERACTING WITH A PLANE INTERFACE)
WOSRecord(Numerical study of the displacement of a three-dimensional Gaussian beam transmitted at total internal reflection. Near-field applications)
WOSRecord(THEORETICAL NOTES ON AMPLIFICATION OF TRANSVERSE SHIFT BY TOTAL REFLECTION ON MULTILAYERED SYSTEM)
WOSRecord(MECHANICAL INTERPRETATION OF SHIFTS IN TOTAL REFLECTION OF SPINNING PARTICLES)
WOSRecord(Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes)
WOSRecord(ASYMMETRICAL MOMENTUM-ENERGY TENSORS AND 6-COMPONENT ANGULAR-MOMENTUM IN PROBLEM CONCERNING 2 PHOTON MOMENTA AND MAGNETODYNAMIC EFFECT PROBLEM)
WOSRecord(ANGULAR SPECTRUM AS AN ELECTRICAL NETWORK)
WOSRecord(TRANSVERSE DISPLACEMENT OF A TOTALLY REFLECTED LIGHT-BEAM AND PHASE-SHIFT METHOD)
WOSRecord(OBSERVATION OF SHIFTS IN TOTAL REFLECTION OF A LIGHT-BEAM BY A MULTILAYERED STRUCTURE)
WOSRecord(LONGITUDINAL AND TRANSVERSE DISPLACEMENTS OF A BOUNDED MICROWAVE BEAM AT TOTAL INTERNAL-REFLECTION)
WOSRecord(Optical properties of nanostructured thin films)
WOSRecord(PREDICTION OF A RESONANCE-ENHANCED LASER-BEAM DISPLACEMENT AT TOTAL INTERNAL-REFLECTION IN SEMICONDUCTORS)
WOSRecord(EXCHANGED MOMENTUM BETWEEN MOVING ATOMS AND A SURFACE-WAVE - THEORY AND EXPERIMENT)
WOSRecord(A Novel Method for Enhancing Goos-Hanchen Shift in Total Internal Reflection)
WOSRecord(DISCUSSIONS OF PROBLEM OF PONDEROMOTIVE FORCES)
WOSRecord(CALCULATION AND MEASUREMENT OF FORCES AND TORQUES APPLIED TO UNIAXIAL CRYSTAL BY EXTRAORDINARY WAVE)
WOSRecord(Longitudinal and transverse effects of nonspecular reflection)
WOSRecord(Simple technique for measuring the Goos-Hanchen effect with polarization modulation and a position-sensitive detector)
WOSRecord(INTERNAL PHOTON IMPULSE OF DIELECTRIC AND ON COUPLE APPLIED TO ANISOTROPIC CRYSTAL)
WOSRecord(DISPLACEMENT OF A TOTALLY REFLECTED LIGHT-BEAM - FILTERING OF POLARIZATION STATES AND AMPLIFICATION)
WOSRecord(CONSERVATION OF ANGULAR MOMENT WITH SIX COMPONENTS AND ASYMMETRICAL IMPULSE ENERGY TENSORS)
WOSRecord(SPIN ANGULAR-MOMENTUM OF A FIELD INTERACTING WITH A PLANE INTERFACE)
WOSRecord(Experimental observation of the Imbert-Fedorov transverse displacement after a single total reflection)
WOSRecord(INTERFERENCE THEORY OF REFLECTION FROM MULTILAYERED MEDIA)
WOSRecord(EXPERIMENTS IN PHENOMENOLOGICAL ELECTRODYNAMICS AND THE ELECTROMAGNETIC ENERGY-MOMENTUM TENSOR)

The individual Records are indexed by their id numbers, so you can access a specific one in the collection if you know its number.

In [16]:
RC.getID("WOS:A1979GV55600001")
Out[16]:
<metaknowledge.WOSRecord object WOS:A1979GV55600001>

Citation object

Citation is an object that contains the results of parsing a citation. Citations can be created from a Record

In [17]:
Cite = R.createCitation()
print(Cite)
BREVIK I, 1979, PHYS REP, V52, P133, DOI 10.1016/0370-1573(79)90074-7

Citations allow for the raw strings of citations to be manipulated easily by metaknowledge.

Filtering

The for loop shown above is the main way to filter a RecordCollection. There are a few builtin filters, e.g. yearSplit(), but the for loop is an easily generalized way of filtering that is relatively simple to read, so it is the main way you should filter. An example of the workflow is as follows:

First create a new RecordCollection

In [18]:
RCfiltered = mk.RecordCollection()

Then add the records that meet your condition, in this case that their titles start with 'A'

In [19]:
for R in RC:
    if R['title'][0] == 'A':
        RCfiltered.add(R)
In [20]:
print(RCfiltered)
RecordCollection(Empty)

Now you have a RecordCollection RCfiltered of all the Records whose titles begin with 'A'.

One note about implementing this: the above code does not handle the case in which the title is missing, i.e. R['title'] raises a KeyError. You will have to deal with this on your own.
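
One way to deal with a missing title is to ask for it with get() and skip the Record when nothing comes back. The sketch below uses plain dicts in place of Records so that it runs without any data files; starts_with is a hypothetical helper, not part of metaknowledge.

```python
def starts_with(record, letter):
    """Return True if the record has a non-empty title starting with letter."""
    title = record.get("title")  # None when the title is missing
    if not title:                # covers both None and the empty string
        return False
    return title[0] == letter


# Plain dicts standing in for Records; the second one has no title.
records = [
    {"title": "ANGULAR SPECTRUM AS AN ELECTRICAL NETWORK"},
    {"year": 1979},
    {"title": "Optical properties of nanostructured thin films"},
]

filtered = [r for r in records if starts_with(r, "A")]
print(len(filtered))
```

With real Records the same test would go inside the for loop, using R.get('title') in place of R['title'], since get() is the access method that does not raise.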

Two builtin functions to filter collections are yearSplit() and localCitesOf(). To get a RecordCollection of all Records between 1970 and 1979:

In [21]:
RC70 = RC.yearSplit(1970, 1979)
print(RC70)
RecordCollection(txt-files-from-.(1970-1979))

The second function, localCitesOf(), takes in an object that a Citation can be created from and returns a RecordCollection of all the Records that cite it. So to see all the records that cite "Yariv A., 1971, INTRO OPTICAL ELECTR":

In [22]:
RCintroOpt = RC.localCitesOf("Yariv A., 1971, INTRO OPTICAL ELECTR")
print(RCintroOpt)
RecordCollection(Records_citing_Yariv A., 1971, INTRO OPTICAL ELECTR)

Exporting RecordCollections

Now that you have a filtered RecordCollection, you can write it to a file with writeFile()

In [23]:
RCfiltered.writeFile("Records_Starting_with_A.txt")

The written file is identical to one of those produced by WOS.

If you wish to have a more useful file, use writeCSV(), which creates a CSV file with the tags as columns and the Records as rows. If you only care about a few tags, the onlyTheseTags argument lets you choose which ones are written.

In [24]:
selectedTags = ['TI', 'UT', 'CR', 'AF']

This will give only the title, WOS number, citations, and authors.

In [25]:
RCfiltered.writeCSV("Records_Starting_with_A.csv", onlyTheseTags = selectedTags)

The last export feature is for using metaknowledge with other packages, in particular pandas but others should also work. makeDict() creates a dictionary with tags as keys and lists as values with each index of the lists corresponding to a Record. pandas can accept these directly to make DataFrames.

In [26]:
recDataFrame = pandas.DataFrame(RC.makeDict())
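
The structure described above, one list per tag with position i of every list belonging to Record i, can be sketched with plain dicts. This illustrates the shape only, not how makeDict() is implemented: records_to_columns is a hypothetical helper, the values are placeholders, and filling absent tags with None is an assumption of the sketch.

```python
def records_to_columns(records):
    """Turn a list of dict-like records into a dict of equal-length lists."""
    # Collect every tag that appears, keeping first-seen order.
    tags = []
    for rec in records:
        for tag in rec:
            if tag not in tags:
                tags.append(tag)
    # One list per tag; index i of each list belongs to records[i].
    return {tag: [rec.get(tag) for rec in records] for tag in tags}


records = [
    {"TI": "first title", "PY": 1971},
    {"TI": "second title"},  # no 'PY' field
]
columns = records_to_columns(records)
print(columns)
```

Because every list has the same length, a dict of this shape can be handed to pandas.DataFrame() directly.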

Making a co-citation network

To make a basic co-citation network of Records use networkCoCitation().

In [27]:
coCites = RC.networkCoCitation()
print(mk.graphStats(coCites))
The graph has 571 nodes, 17663 edges, 0 isolates, 25 self loops, a density of 0.108538 and a transitivity of 0.684276

graphStats() is a function to extract some of the statistics of a graph and make them into a nice string.
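
The density figure in that string can be checked by hand. networkx defines the density of an undirected graph with n nodes and m edges as 2m / (n(n - 1)), and of a directed graph as m / (n(n - 1)). The sketch below applies the undirected formula to the node and edge counts printed above; the helper name density is introduced here for illustration.

```python
def density(n_nodes, n_edges, directed=False):
    """Graph density following the networkx formulas."""
    if n_nodes < 2:
        return 0.0
    possible = n_nodes * (n_nodes - 1)
    if directed:
        return n_edges / possible
    return 2 * n_edges / possible


# Node and edge counts reported for the co-citation network above.
print(round(density(571, 17663), 6))  # 0.108538, matching the reported density
```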

coCites is now a networkx graph of the co-citation network, with the hashes of the Citations as nodes and the full citations stored as attributes. Let's look at one node

In [28]:
coCites.nodes(data = True)[0]
Out[28]:
('Zeroug S, 1994, J ACOUST SOC AM',
 {'MK-ID': 'None',
  'count': 1,
  'inCore': False,
  'info': 'Zeroug S, 1994, J ACOUST SOC AM, V95, P3075'})

and an edge

In [29]:
coCites.edges(data = True)[0]
Out[29]:
('Zeroug S, 1994, J ACOUST SOC AM',
 'Zhang Sz, 1989, J OPT SOC AM A',
 {'weight': 1})
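
For intuition about what the edges mean: in the usual definition, two citations are co-cited when they appear together in the reference list of the same record, and the edge weight counts how many records do so. The sketch below builds such weights from made-up reference lists with the standard library only. It illustrates the idea and is not metaknowledge's implementation; cocitation_weights is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def cocitation_weights(reference_lists):
    """Count, for every unordered pair of citations, the lists that contain both."""
    weights = Counter()
    for refs in reference_lists:
        # sorted(set(...)) removes duplicates and fixes the order of each pair.
        for a, b in combinations(sorted(set(refs)), 2):
            weights[(a, b)] += 1
    return weights


# Made-up reference lists, one per citing record.
reference_lists = [
    ["cite A", "cite B", "cite C"],
    ["cite A", "cite B"],
    ["cite C"],
]
weights = cocitation_weights(reference_lists)
print(weights[("cite A", "cite B")])  # both lists 1 and 2 contain this pair
```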

All the graphs metaknowledge uses are networkx graphs. A few functions to trim them are implemented in metaknowledge (here is the example section), but many more useful functions are implemented by networkx itself. Read the documentation here for more information.

The networkCoCitation() function has many options for filtering and determining the nodes. The default is to use the Citations themselves. If you wanted to make a network of co-citations of journals you would have to make the node type 'journal' and remove the non-journals.

In [30]:
coCiteJournals = RC.networkCoCitation(nodeType = 'journal', dropNonJournals = True)
print(mk.graphStats(coCiteJournals))
The graph has 89 nodes, 1379 edges, 0 isolates, 41 self loops, a density of 0.352145 and a transitivity of 0.635939

Let's take a look at the graph after a quick spring layout

In [31]:
nx.draw_spring(coCiteJournals)

A bit basic but gives a general idea. If you want to make a much better looking and more informative visualization you could try R, gephi or visone. Exporting to them is covered below in Exporting graphs.

Making a citation network

The networkCitation() method is nearly identical to networkCoCitation() in its parameters. It has one additional keyword argument, directed, that controls whether it produces a directed network. Read Making a co-citation network to learn more about the parameters of networkCitation().

One small example is still worth providing. If you want to make a network of the citations of years by other years, keeping only the citations that have the letter 'A' in them, you would write:

In [32]:
citationsA = RC.networkCitation(nodeType = 'year', keyWords = ['A'])
print(mk.graphStats(citationsA))
The graph has 18 nodes, 24 edges, 0 isolates, 1 self loops, a density of 0.0784314 and a transitivity of 0.0344828
In [33]:
nx.draw_spring(citationsA, with_labels = True)
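
In a citation network the edges run between a record and what it cites, which is why direction matters. A co-citation network instead links the cited works to each other. Below is a minimal sketch with nodeType 'year' in mind, using made-up (citing year, cited year) pairs. year_citation_edges is a hypothetical helper, and the edge direction (citing to cited) is an assumption of the sketch.

```python
from collections import Counter


def year_citation_edges(pairs, directed=True):
    """Weight each (citing year, cited year) edge by how often it occurs."""
    edges = Counter()
    for citing, cited in pairs:
        if directed:
            edges[(citing, cited)] += 1
        else:
            # Undirected: store each pair in a fixed order so (a, b) == (b, a).
            edges[tuple(sorted((citing, cited)))] += 1
    return edges


# Made-up data: the publication year of a record and of one work it cites.
pairs = [(1979, 1971), (1979, 1971), (1979, 1979), (1985, 1979)]

directed_edges = year_citation_edges(pairs)
undirected_edges = year_citation_edges(pairs, directed=False)
print(directed_edges[(1979, 1971)], directed_edges[(1971, 1979)])
print(undirected_edges[(1971, 1979)])
```

The pair (1979, 1979) becomes a self loop, which is how a year citing itself shows up in the statistics.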

Making a co-author network

The networkCoAuthor() function produces the co-authorship network of the RecordCollection and is used as shown

In [34]:
coAuths = RC.networkCoAuthor()
print(mk.graphStats(coAuths))
The graph has 45 nodes, 46 edges, 9 isolates, 0 self loops, a density of 0.0464646 and a transitivity of 0.822581

Making a one-mode network

In addition to the specialized network generators, metaknowledge lets you make a one-mode co-occurrence network of any of the WOS tags with the networkOneMode() function. For example, the WOS subject tag 'WC' can be examined.

In [35]:
wcCoOccurs = RC.networkOneMode('WC')
print(mk.graphStats(wcCoOccurs))
The graph has 9 nodes, 3 edges, 3 isolates, 0 self loops, a density of 0.0833333 and a transitivity of 0
In [36]:
nx.draw_spring(wcCoOccurs, with_labels = True)

Making a two-mode network

If you wish to study the relationships between two tags, you can use the networkTwoMode() function, which creates a two-mode network showing the connections between the tags. For example, to look at the connections between titles ('TI') and subjects ('WC'):

In [37]:
ti_wc = RC.networkTwoMode('WC', 'title')
print(mk.graphStats(ti_wc))
The graph has 40 nodes, 35 edges, 0 isolates, 0 self loops, a density of 0.0448718 and a transitivity of 0

The network is directed by default with the first tag going to the second.
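
The idea of a two-mode network can be sketched without any library: every value of the first tag is linked to every value of the second tag found in the same record, and each node remembers which tag it came from. two_mode_edges is a hypothetical helper and the records are made up. This shows the concept, not metaknowledge's code.

```python
def two_mode_edges(records, tag1, tag2):
    """Link every value of tag1 to every value of tag2 found in the same record."""
    node_types = {}
    edges = set()
    for rec in records:
        for left in rec.get(tag1, []):
            node_types[left] = tag1
            for right in rec.get(tag2, []):
                node_types[right] = tag2
                edges.add((left, right))  # direction: tag1 value -> tag2 value
    return node_types, edges


# Made-up records; each field holds a list of values.
records = [
    {"WC": ["Optics"], "TI": ["first title"]},
    {"WC": ["Optics", "Physics, Applied"], "TI": ["second title"]},
    {"TI": ["third title"]},  # no 'WC' field, so it adds no edges
]
node_types, edges = two_mode_edges(records, "WC", "TI")
print(sorted(edges))
```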

In [38]:
mkv.quickVisual(ti_wc, showLabel = False) #default is False as there are usually lots of labels

quickVisual() makes a graph with the different types of nodes coloured differently and a couple other small visual tweaks from networkx's draw_spring.

Making a multi-mode network

The networkMultiMode() function does the same thing as networkOneMode() but for any number of tags, and it keeps track of their types. For example, to look at the co-occurrence of titles 'TI', WOS numbers 'UT' and authors 'AU':

In [39]:
tags = ['TI', 'UT', 'AU']
multiModeNet = RC.networkMultiMode(tags)
mk.graphStats(multiModeNet)
Out[39]:
'The graph has 108 nodes, 163 edges, 0 isolates, 0 self loops, a density of 0.0282105 and a transitivity of 0.443946'
In [40]:
mkv.quickVisual(multiModeNet)

Beware this can very easily produce hairballs

In [41]:
tags = mk.commonRecordFields #All the tags, twice
sillyMultiModeNet = RC.networkMultiMode(tags)
mk.graphStats(sillyMultiModeNet)
Out[41]:
'The graph has 925 nodes, 14217 edges, 0 isolates, 39 self loops, a density of 0.0332678 and a transitivity of 0.33469'
In [42]:
mkv.quickVisual(sillyMultiModeNet)

Post processing graphs

If you wish to apply a well known algorithm or process to a graph networkx is a good place to look as they do a good job at implementing them.

One feature it lacks, though, is pruning of graphs; metaknowledge provides this. To remove edges outside of some weight range, use dropEdges(). The pruning functions all mutate the given graph, so if you wish to keep the original, make a copy first. For example, to remove the self loops and the edges with weight less than 3 or higher than 10 from coCiteJournals:

In [43]:
minWeight = 3
maxWeight = 10
proccessedCoCiteJournals = coCiteJournals.copy()
mk.dropEdges(proccessedCoCiteJournals, minWeight, maxWeight, dropSelfLoops = True)
mk.graphStats(proccessedCoCiteJournals)
Out[43]:
'The graph has 89 nodes, 459 edges, 1 isolates, 0 self loops, a density of 0.117211 and a transitivity of 0.20841'

Then to remove all the isolates, i.e. nodes with degree less than 1, use dropNodesByDegree()

In [44]:
mk.dropNodesByDegree(proccessedCoCiteJournals, 1)
mk.graphStats(proccessedCoCiteJournals)
Out[44]:
'The graph has 88 nodes, 459 edges, 0 isolates, 0 self loops, a density of 0.119906 and a transitivity of 0.20841'
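
What the two pruning steps do can be sketched on a plain edge dict. This is not the metaknowledge implementation: drop_edges_sketch and drop_nodes_by_degree_sketch are hypothetical helpers, the data is made up, and treating both weight bounds as inclusive is an assumption. Unlike the functions above, the sketch returns new objects instead of mutating its input.

```python
def drop_edges_sketch(edges, min_weight, max_weight, drop_self_loops=False):
    """Keep edges whose weight lies in [min_weight, max_weight]."""
    kept = {}
    for (u, v), weight in edges.items():
        if drop_self_loops and u == v:
            continue
        if min_weight <= weight <= max_weight:
            kept[(u, v)] = weight
    return kept


def drop_nodes_by_degree_sketch(nodes, edges, min_degree):
    """Keep nodes that touch at least min_degree of the remaining edges."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1  # a self loop therefore counts twice, as in networkx
    return {node for node in nodes if degree[node] >= min_degree}


nodes = {"J1", "J2", "J3", "J4"}
edges = {("J1", "J1"): 5, ("J1", "J2"): 4, ("J2", "J3"): 1, ("J3", "J4"): 12}

pruned = drop_edges_sketch(edges, 3, 10, drop_self_loops=True)
remaining = drop_nodes_by_degree_sketch(nodes, pruned, 1)
print(pruned)
print(sorted(remaining))
```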

Before the processing, the graph can be seen here. After the processing it looks like

In [45]:
nx.draw_spring(proccessedCoCiteJournals)

Hm, it looks a bit thinner. Using a visualizer will make the difference a bit more noticeable.

Exporting graphs

Now that you have a graph, the last step is to write it to disk. networkx has a few ways of doing this, but they tend to be slow. metaknowledge can write an edge list and a node attribute file that together contain all the information of the graph. The function to do this is called writeGraph(). You give it the start of the file name and it makes two labeled files containing the graph.

In [46]:
mk.writeGraph(proccessedCoCiteJournals, "FinalJournalCoCites")

These files are simple CSVs and can be read easily by most systems. If you want to read them back into Python, the readGraph() function will do that.

In [47]:
FinalJournalCoCites = mk.readGraph("FinalJournalCoCites_edgeList.csv", "FinalJournalCoCites_nodeAttributes.csv")
mk.graphStats(FinalJournalCoCites)
Out[47]:
'The graph has 88 nodes, 459 edges, 0 isolates, 0 self loops, a density of 0.119906 and a transitivity of 0.20841'
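
To show why an edge list in CSV form is easy for other tools to consume, here is a minimal round trip written with the standard csv module. It does not reproduce the files writeGraph() produces: the column names "From", "To" and "weight" are assumptions, the weights are made up, the node attribute file is left out, and write_edge_list and read_edge_list are hypothetical helpers.

```python
import csv
import io


def write_edge_list(edges, handle):
    """Write (source, target, weight) rows under a header line."""
    writer = csv.writer(handle)
    writer.writerow(["From", "To", "weight"])  # column names are assumptions
    for (u, v), weight in sorted(edges.items()):
        writer.writerow([u, v, weight])


def read_edge_list(handle):
    """Read the rows back into the same dict shape."""
    reader = csv.DictReader(handle)
    return {(row["From"], row["To"]): int(row["weight"]) for row in reader}


# Made-up weights between journal nodes.
edges = {("PHYS REP", "J OPT SOC AM A"): 4, ("J OPT SOC AM A", "J ACOUST SOC AM"): 3}

buffer = io.StringIO()  # stands in for a file on disk
write_edge_list(edges, buffer)
buffer.seek(0)
restored = read_edge_list(buffer)
print(restored == edges)
```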

This is a full example workflow for metaknowledge. The package is flexible and you will hopefully be able to customize it to do what you want (I assume you do not want the Records starting with 'A').


Questions?

If you find bugs, or have questions, please write to:

Reid McIlroy-Young reid@reidmcy.com

John McLevey john.mclevey@uwaterloo.ca


License

metaknowledge is free and open source software, distributed under the GPL License.