Full Documentation 3.1.1

The modules of metaknowledge are:

The classes of metaknowledge are:

  1. Citation(Hashable): Citations are special; here is how they are handled
  2. Grant(Record, MutableMapping): the base for all the other Grants
  3. Collection(MutableSet, Hashable): the base of all other Collections, basically a set
  4. Record(Mapping, Hashable): the base of all the other Records, basically a dict

All the functions and methods of metaknowledge and its objects are as follows:

All the functions of the contour module are as follows:

All the functions of the WOS module are as follows:

All the functions of the medline module are as follows:

All the functions of the proquest module are as follows:

All the functions of the scopus module are as follows:

All the functions of the journalAbbreviations module are as follows:


metaknowledge is a Python3 package that simplifies bibliometric and computational analysis of Web of Science data.

Example

To load the data from files and make a network:

>>> import metaknowledge as mk
>>> RC = mk.RecordCollection("records/")
>>> print(RC)
Collection of 33 records
>>> G = RC.coCiteNetwork(nodeType = 'journal')
Done making a co-citation network of files-from-records                 1.1s
>>> print(len(G.nodes()))
223
>>> mk.writeGraph(G, "Cocitation-Network-of-Journals")

There is also a simple command line program called metaknowledge that comes with the package. It allows for creating networks without any need to know Python. More information about it can be found at networkslab.org/metaknowledge/cli

Overview

This package can read files downloaded from Thomson Reuters' Web of Science (WOS), Elsevier's Scopus, ProQuest, and Medline files from PubMed. These files contain entries with the metadata of scientific records, such as authors, title, and citations. metaknowledge can also read grants from various organizations, including NSF and NSERC, which are handled similarly to records.

The metaknowledge.RecordCollection class can take a path to one or more of these files, then load and parse them. This object is the main way to work with multiple records. For each individual record it creates an instance of the metaknowledge.Record class that contains the results of parsing the record.

The files read by metaknowledge are databases containing a series of tags (implicit or explicit), e.g. 'TI' is the title for WOS. Each tag has one or more values and metaknowledge can read them and extract useful information. As the tags differ between providers, a small set of values can be accessed by special tags, which are listed in commonRecordFields. These special tags can act on the whole Record and as such may contain information provided by any number of other tags.

Citations are handled by a special Citation class. This class can parse the citations given by WOS and the journals cited by Scopus, and allows for better comparisons when they are used in graphs.

Note for those reading the docstrings: metaknowledge's docs are written in Markdown and are processed to produce the documentation found at networkslab.org/metaknowledge/documentation, but you should have no problem reading them with the help function.


The functions provided by metaknowledge are:

The Exceptions defined by metaknowledge are:

  1. mkException(Exception)
  2. TagError(mkException)
  3. RCValueError(mkException)
  4. BadInputFile(mkException)
  5. BadRecord(mkException)
  6. BadPubmedRecord(mkException)
  7. BadPubmedFile(mkException)
  8. BadScopusRecord(mkException)
  9. BadProQuestRecord(mkException)
  10. BadProQuestFile(mkException)
  11. CollectionTypeError(mkException, TypeError)
  12. RecordsNotCompatible(mkException)
  13. cacheError(mkException)
  14. BadWOSRecord(BadRecord)
  15. BadWOSFile(Warning)
  16. RCTypeError(mkException, TypeError)
  17. BadCitation(Warning)
  18. BadGrant(mkException)
  19. GrantCollectionException(mkException)
  20. UnknownFile(mkException)

downloadExtras():

Downloads all the external files used by metaknowledge. This will overwrite existing files.


filterNonJournals(citesLst, invert=False):

Removes the Citations from citesLst that are not journals

Parameters

citesLst : list [Citation]

A list of citations to be filtered

invert : optional [bool]

Default False, if True non-journals will be kept instead of journals

Returns

list [Citation]

A filtered list of Citations from citesLst
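The filtering rule can be sketched with a stand-in Citation class (hypothetical here; the real Citation.isJournal() consults the WOS journal-abbreviation database):

```python
# Minimal sketch of filterNonJournals' logic with a fake Citation.
# The real Citation.isJournal() checks the abbreviation database;
# here it is simulated with a boolean flag.

class FakeCitation:
    def __init__(self, text, journal):
        self.text = text
        self._journal = journal

    def isJournal(self):
        return self._journal

def filterNonJournals(citesLst, invert=False):
    # Keep citations whose isJournal() differs from invert:
    # invert=False keeps journals, invert=True keeps non-journals.
    return [c for c in citesLst if c.isJournal() != invert]

cites = [FakeCitation("Nunez R., 1998, MATH COGNITION", True),
         FakeCitation("Smith J., 2001, SOME BOOK", False)]
journals = filterNonJournals(cites)
others = filterNonJournals(cites, invert=True)
# journals contains only the MATH COGNITION citation
```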


diffusionGraph(source, target, weighted=True, sourceType='raw', targetType='raw', labelEdgesBy=None):

Takes in two RecordCollections and produces a graph of the citations of source by the Records in target. By default the nodes in the graph are Record objects, but this can be changed with the sourceType and targetType keywords. The edges of the graph go from the target to the source.

Each node in the output graph has two boolean attributes, "source" and "target", indicating whether it is a source or a target. Note: if the types of the sources and targets are different, the attributes will not be checked for overlap of the other type. E.g. if the source type is 'TI' (title) and the target type is 'UT' (WOS number), and there is some overlap of the targets and sources, then the Record corresponding to a source node will not be checked for being one of the titles of the targets; only its WOS number will be considered.

Parameters

source : RecordCollection

A metaknowledge RecordCollection containing the Records being cited

target : RecordCollection

A metaknowledge RecordCollection containing the Records citing those in source

weighted : optional [bool]

Default True, if True each edge will have an attribute 'weight' giving the number of times the source has referenced the target.

sourceType : optional [str]

Default 'raw', if 'raw' the returned graph will contain Records as source nodes.

If Records are not wanted then it can be set to a WOS tag, such as 'SO' (for journals ), to make the nodes into the type of object returned by that tag from Records.

targetType : optional [str]

Default 'raw', if 'raw' the returned graph will contain Records as target nodes.

If Records are not wanted then it can be set to a WOS tag, such as 'SO' (for journals ), to make the nodes into the type of object returned by that tag from Records.

labelEdgesBy : optional [str]

Default None, if a WOS tag (or the long name of a WOS tag) then the edges of the output graph will have an attribute 'key' that is the value of the referenced tag of the source Record, i.e. if 'PY' is given then each edge will have a 'key' value equal to the publication year of the source.

This option will cause the output graph to be a MultiDiGraph and is likely to result in parallel edges. If a Record has multiple values for a tag (e.g. 'AF') then each value will create its own edge.

Returns

networkx DiGraph or networkx MultiDiGraph

A directed graph of the diffusion network; if labelEdgesBy is used the graph will allow parallel edges.


diffusionCount(source, target, sourceType='raw', extraValue=None, pandasFriendly=False, compareCounts=False, numAuthors=True, useAllAuthors=True, extraMapping=None):

Takes in two RecordCollections and produces a dict counting the citations of source by the Records of target. By default the dict uses Record objects as keys but this can be changed with the sourceType keyword to any of the WOS tags.

Parameters

source : RecordCollection

A metaknowledge RecordCollection containing the Records being cited

target : RecordCollection

A metaknowledge RecordCollection containing the Records citing those in source

sourceType : optional [str]

default 'raw', if 'raw' the returned dict will contain Records as keys. If it is a WOS tag the keys will be of that type.

pandasFriendly : optional [bool]

default False, makes the output a dict with two keys: "Record" is the list of Records (or the data type requested by sourceType) and "Counts" is their occurrence counts. The lists are the same length.

compareCounts : optional [bool]

default False, if True the diffusion analysis will be run twice, first with source and target set up as the default (global scope), then using only the source RecordCollection (local scope).

extraValue : optional [str]

default None, if a tag the returned dictionary will map Records to maps, where each map takes the entries for the tag to counts. If pandasFriendly is also True the resulting dictionary will have an additional column called 'year'. This column will contain the year the citations occurred; in addition the Records' entries will be duplicated for each year they occur in.

For example if 'year' was given then the count for a single Record could be {1990 : 1, 2000 : 5}

useAllAuthors : optional [bool]

default True, if False only the first author will be used to generate the Citations for the source Records

Returns

dict[:int]

A dictionary with the type given by sourceType as keys and integers as values.

If compareCounts is True the values are tuples with the first integer being the diffusion in the target and the second the diffusion in the source.

If pandasFriendly is True the returned dict has keys with the names of the WOS tags and lists with their values, i.e. a table with labeled columns. The counts are in the column named "TargetCount" and, if compareCounts is True, the local count is in a column called "SourceCount".
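The core counting step can be sketched without metaknowledge itself: for each record in source, count how many citations in target refer to it. Citations are simplified to plain strings here (the real function matches Citation objects generated from the source Records):

```python
# Sketch of the counting behind diffusionCount(): each source entry
# starts at 0, and every matching citation in a target record
# increments it. Entries cited by target but not in source are ignored.
from collections import Counter

def diffusion_count(source_ids, target_citations):
    counts = Counter({s: 0 for s in source_ids})
    for cites in target_citations:      # one citation list per target record
        for c in cites:
            if c in counts:
                counts[c] += 1
    return dict(counts)

source = ["A", "B", "C"]
target = [["A", "B"], ["A"], ["D"]]
result = diffusion_count(source, target)
# result: {"A": 2, "B": 1, "C": 0}
```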


diffusionAddCountsFromSource(grph, source, target, nodeType='citations', extraType=None, diffusionLabel='DiffusionCount', extraKeys=None, countsDict=None, extraMapping=None):

Does a diffusion using diffusionCount() and updates grph with it, using the nodes in the graph as keys in the diffusion, i.e. the source. The name of the attribute the counts are added to is given by diffusionLabel. If the graph is not composed of citations from the source and is instead another tag, nodeType needs to be given the tag string.

Parameters

grph : networkx Graph

The graph to be updated

source : RecordCollection

The RecordCollection that created grph

target : RecordCollection

The RecordCollection that will be counted

nodeType : optional [str]

default 'citations', the tag that contains the values used to create grph

Returns

dict[:int]

The counts dictionary used to add values to grph. Note: grph is modified by the function and the return is provided in case you need it.


downloadData(useUK=False):

Needs to be written


readGraph(edgeList, nodeList=None, directed=False, idKey='ID', eSource='From', eDest='To'):

Reads the files given by edgeList and nodeList and creates a networkx graph for the files.

This is designed only for the files produced by metaknowledge and is meant to be the reverse of writeGraph(), if this does not produce the desired results the networkx builtin networkx.read_edgelist() could be tried as it is aimed at a more general usage.

The read edge list format assumes the column named eSource (default 'From') is the source node, the column eDest (default 'To') gives the destination, and all other columns are attributes of the edges, e.g. weight.

The read node list format assumes the column idKey (default 'ID') is the ID of the node for the edge list and the resulting network. All other columns are considered attributes of the node, e.g. count.

Note: If the names of the columns do not match those given to readGraph() a KeyError exception will be raised.

Note: If nodes appear in the edgelist but not the nodeList they will be created silently with no attributes.

Parameters

edgeList : str

a string giving the path to the edge list file

nodeList : optional [str]

default None, a string giving the path to the node list file

directed : optional [bool]

default False, if True the produced network is directed from eSource to eDest

idKey : optional [str]

default 'ID', the name of the ID column in the node list

eSource : optional [str]

default 'From', the name of the source column in the edge list

eDest : optional [str]

default 'To', the name of the destination column in the edge list

Returns

networkx Graph

the graph described by the input files
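The edge-list format described above can be illustrated with the stdlib csv module alone (no networkx), using hypothetical sample data: the 'From' and 'To' columns give the endpoints and every remaining column becomes an edge attribute.

```python
# Sketch of parsing the edge-list CSV format readGraph() expects.
# The leftover columns (here just 'weight') become edge attributes.
import csv
import io

edge_csv = "From,To,weight\nA,B,3\nB,C,1\n"

edges = {}
for row in csv.DictReader(io.StringIO(edge_csv)):
    src, dst = row.pop("From"), row.pop("To")
    edges[(src, dst)] = row          # remaining columns are attributes

# edges[("A", "B")] == {"weight": "3"}
```

Note that csv values are read as strings; readGraph() is similarly reading text files, so numeric attributes would need conversion by the caller.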


writeEdgeList(grph, name, extraInfo=True, allSameAttribute=False):

Writes an edge list of grph at the destination name.

The edge list has two columns for the source and destination of the edge, 'From' and 'To' respectively, then, if edgeInfo is True, another column is created for each attribute of the edges.

Note: If any edges are missing an attribute it will be left blank by default, enable allSameAttribute to cause a KeyError to be raised.

Parameters

grph : networkx Graph

The graph to be written to name

name : str

The name of the file to be written

edgeInfo : optional [bool]

Default True, if True the attributes of each edge will be written

allSameAttribute : optional [bool]

Default False, if True all the edges must have the same attributes or an exception will be raised. If False the missing attributes will be left blank.


writeNodeAttributeFile(grph, name, allSameAttribute=False):

Writes a node attribute list of grph to the file given by the path name.

The node list has one column called 'ID' with the node ids used by networkx, and all other columns are the node attributes.

Note: If any nodes are missing an attribute it will be left blank by default, enable allSameAttribute to cause a KeyError to be raised.

Parameters

grph : networkx Graph

The graph to be written to name

name : str

The name of the file to be written

allSameAttribute : optional [bool]

Default False, if True all the nodes must have the same attributes or an exception will be raised. If False the missing attributes will be left blank.


writeTnetFile(grph, name, modeNameString, weighted=False, sourceMode=None, timeString=None, nodeIndexString=’tnet-ID’, weightString=’weight’):

Writes an edge list designed for reading by the R package tnet.

The networkx graph provided must be a pure two-mode network: the modes must be 2 different values for the node attribute accessed by modeNameString, and all edges must be between different node types. Each node will be given an integer id, stored in the attribute given by nodeIndexString, and these ids are then written to the file as the endpoints of the edges. Unless sourceMode is given, which mode is the source (first column) and which the target (second column) is random.

Note: grph will be modified by this function; the ids of the nodes will be written to the graph at the attribute nodeIndexString.

Parameters

grph : networkx Graph

The graph that will be written to name

name : str

The path of the file to write

modeNameString : str

The name of the attribute grph’s modes are stored in

weighted : optional bool

Default False, if True then the attribute weightString will be written to the weight column

sourceMode : optional str

Default None, if given the name of the mode used for the source (first column) in the output file

timeString : optional str

Default None, if present the attribute timeString of an edge will be written to the time column surrounded by double quotes (").

Note: the format used by tnet for dates is very strict; it uses the ISO format, down to the second and without time zones.

nodeIndexString : optional str

Default 'tnet-ID', the name of the attribute to save the id for each node

weightString : optional str

Default 'weight', the name of the weight attribute


dropEdges(grph, minWeight=-inf, maxWeight=inf, parameterName='weight', ignoreUnweighted=False, dropSelfLoops=False):

Modifies grph by dropping edges whose weight is not within the inclusive bounds of minWeight and maxWeight, i.e. after running, grph will only have edges whose weights satisfy minWeight <= edge's weight <= maxWeight. A KeyError will be raised if the graph is unweighted unless ignoreUnweighted is True; the weight is determined by examining the attribute parameterName.

Note: none of the default options will result in grph being modified so only specify the relevant ones, e.g. dropEdges(G, dropSelfLoops = True) will remove only the self loops from G.

Parameters

grph : networkx Graph

The graph to be modified.

minWeight : optional [int or double]

default -inf, the minimum weight for an edge to be kept in the graph.

maxWeight : optional [int or double]

default inf, the maximum weight for an edge to be kept in the graph.

parameterName : optional [str]

default 'weight', key to weight field in the edge’s attribute dictionary, the default is the same as networkx and metaknowledge so is likely to be correct

ignoreUnweighted : optional [bool]

default False, if True unweighted edges will be kept

dropSelfLoops : optional [bool]

default False, if True self loops will be removed regardless of their weight
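The filtering rule can be sketched on a plain dict of edges (no networkx, so it runs standalone; the real function mutates the graph in place and handles the ignoreUnweighted option, which is omitted here):

```python
# Sketch of dropEdges()' rule: keep an edge only when
# minWeight <= weight <= maxWeight, and optionally drop self
# loops regardless of weight.
from math import inf

def drop_edges(edges, minWeight=-inf, maxWeight=inf, dropSelfLoops=False):
    return {(u, v): attrs for (u, v), attrs in edges.items()
            if minWeight <= attrs["weight"] <= maxWeight
            and not (dropSelfLoops and u == v)}

edges = {("A", "B"): {"weight": 5},
         ("B", "C"): {"weight": 1},
         ("C", "C"): {"weight": 9}}   # a self loop
kept = drop_edges(edges, minWeight=2, dropSelfLoops=True)
# kept: {("A", "B"): {"weight": 5}}
```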


dropNodesByDegree(grph, minDegree=-inf, maxDegree=inf, useWeight=True, parameterName='weight', includeUnweighted=True):

Modifies grph by dropping nodes whose degree is not within the inclusive bounds of minDegree and maxDegree, i.e. after running, grph will only have nodes whose degrees satisfy minDegree <= node's degree <= maxDegree.

Degree is determined in one of two ways: by default (useWeight is True) the weight attributes of the edges touching a node, named by parameterName, are summed; otherwise the number of edges touching the node is used. If includeUnweighted is True then useWeight will assign a degree of 1 to unweighted edges.

Parameters

grph : networkx Graph

The graph to be modified.

minDegree : optional [int or double]

default -inf, the minimum degree for a node to be kept in the graph.

maxDegree : optional [int or double]

default inf, the maximum degree for a node to be kept in the graph.

useWeight : optional [bool]

default True, if True the edge weights will be summed to get the degree, if False the number of edges will be used to determine the degree.

parameterName : optional [str]

default 'weight', key to weight field in the edge’s attribute dictionary, the default is the same as networkx and metaknowledge so is likely to be correct.

includeUnweighted : optional [bool]

default True, if True edges with no weight will be considered to have a weight of 1, if False they will cause a KeyError to be raised.
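The weighted-degree computation can be sketched on a dict of edges (a standalone illustration, not metaknowledge's implementation):

```python
# Sketch of the useWeight=True rule in dropNodesByDegree(): a node's
# degree is the sum of the weights of its incident edges, and nodes
# outside [minDegree, maxDegree] are dropped.

def weighted_degrees(edges):
    deg = {}
    for (u, v), attrs in edges.items():
        w = attrs["weight"]
        deg[u] = deg.get(u, 0) + w
        deg[v] = deg.get(v, 0) + w
    return deg

edges = {("A", "B"): {"weight": 3}, ("B", "C"): {"weight": 1}}
deg = weighted_degrees(edges)
keep = {n for n, d in deg.items() if d >= 2}   # minDegree = 2
# deg: {"A": 3, "B": 4, "C": 1}; keep: {"A", "B"}
```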


dropNodesByCount(grph, minCount=-inf, maxCount=inf, parameterName='count', ignoreMissing=False):

Modifies grph by dropping nodes whose count is not within the inclusive bounds of minCount and maxCount, i.e. after running, grph will only have nodes whose counts satisfy minCount <= node's count <= maxCount.

Count is determined by the count attribute, parameterName, and if missing will result in a KeyError being raised. ignoreMissing can be set to True to suppress the error.

minCount and maxCount default to negative and positive infinity respectively, so without specifying either the output is the same as the input.

Parameters

grph : networkx Graph

The graph to be modified.

minCount : optional [int or double]

default -inf, the minimum count for a node to be kept in the graph.

maxCount : optional [int or double]

default inf, the maximum count for a node to be kept in the graph.

parameterName : optional [str]

default 'count', key to the count field in the node's attribute dictionary, the default is the same throughout metaknowledge so is likely to be correct.

ignoreMissing : optional [bool]

default False, if True nodes missing a count will be kept in the graph instead of raising an exception


mergeGraphs(targetGraph, addedGraph, incrementedNodeVal='count', incrementedEdgeVal='weight'):

A quick way of merging graphs, intended only for graphs generated by metaknowledge. This does not check anything and as such may cause unexpected results if the source and target were not generated by the same method.

mergeGraphs() will modify targetGraph in place by adding the nodes and edges found in the second, addedGraph. If a node or edge exists targetGraph is given precedence, but the edge and node attributes given by incrementedNodeVal and incrementedEdgeVal are added instead of being overwritten.

Parameters

targetGraph : networkx Graph

the graph to be modified, it has precedence.

addedGraph : networkx Graph

the graph that is unmodified, it is added and does not have precedence.

incrementedNodeVal : optional [str]

default 'count', the name of the count attribute for the graph’s nodes. When merging this attribute will be the sum of the values in the input graphs, instead of targetGraph’s value.

incrementedEdgeVal : optional [str]

default 'weight', the name of the weight attribute for the graph’s edges. When merging this attribute will be the sum of the values in the input graphs, instead of targetGraph’s value.
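The merge rule can be sketched on dict-based node attributes (a simplified stand-in for the networkx graphs the real function operates on): existing attributes in the target win, except the incremented value, which is summed.

```python
# Sketch of mergeGraphs()' precedence rule for nodes: targetGraph's
# attributes are kept, but incrementedNodeVal ('count' here) is the
# sum of the two graphs' values.

def merge_nodes(target, added, incrementedVal="count"):
    for node, attrs in added.items():
        if node not in target:
            target[node] = dict(attrs)
        else:
            target[node][incrementedVal] = (target[node].get(incrementedVal, 0)
                                            + attrs.get(incrementedVal, 0))
    return target

g1 = {"A": {"count": 2, "label": "x"}}
g2 = {"A": {"count": 3, "label": "y"}, "B": {"count": 1}}
merged = merge_nodes(g1, g2)
# merged["A"]: {"count": 5, "label": "x"}  (target's label wins)
```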


graphStats(G, stats=('nodes', 'edges', 'isolates', 'loops', 'density', 'transitivity'), makeString=True, sentenceString=False):

Returns a string or list containing statistics about the graph G.

graphStats() gives 6 different statistics: number of nodes, number of edges, number of isolates, number of loops, density and transitivity. The ones wanted can be given to stats. By default a string giving each stat on a different line is returned; it can also produce a sentence containing all the requested statistics, or the raw values can be accessed instead by setting makeString to False.

Parameters

G : networkx Graph

The graph for the statistics to be determined of

stats : optional [list or tuple [str]]

Default ('nodes', 'edges', 'isolates', 'loops', 'density', 'transitivity'), a list or tuple containing any number or combination of the strings:

"nodes", "edges", "isolates", "loops", "density" and "transitivity"

At least one occurrence of the corresponding string causes the statistics to be provided in the string output. For the non-string (tuple) output the returned tuple has the same length as the input and each output is at the same index as the string that requested it, e.g.

stats = ("edges", "loops", "edges")

The return is a tuple with 3 elements, the first and last of which are the number of edges and the second of which is the number of loops.

makeString : optional [bool]

Default True, if True a string is returned if False a tuple

sentenceString : optional [bool]

Default False, if True the returned string is a sentence, otherwise each value is on a separate line.

Returns

str or tuple [float and int]

The type is determined by makeString and the layout by stats
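The index-preserving tuple output can be sketched with hand-computed statistics on a tiny edge list (the real function uses networkx; the stats dict below is an illustration):

```python
# Sketch of graphStats()' tuple output: each requested statistic
# appears at the index it was requested at, so duplicates in stats
# yield duplicate entries in the result.

edges = [("A", "B"), ("B", "C"), ("C", "C")]       # one self loop
nodes = {n for e in edges for n in e}
values = {"nodes": len(nodes),
          "edges": len(edges),
          "loops": sum(1 for u, v in edges if u == v)}

stats = ("edges", "loops", "edges")
result = tuple(values[s] for s in stats)
# result: (3, 1, 3)
```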


writeGraph(grph, name, edgeInfo=True, typing=False, suffix='csv', overwrite=True, allSameAttribute=False):

Writes both the edge list and the node attribute list of grph to files starting with name.

The output files start with name, then the file type (edgeList, nodeAttributes), then, if typing is True, the type of graph (directed or undirected), then the suffix; the default is as follows:

name_fileType.suffix

Both files are CSVs with comma delimiters and double-quote quoting characters. The edge list has two columns for the source and destination of the edge, 'From' and 'To' respectively, then, if edgeInfo is True, another column for each attribute of the edges. The node list has one column called "ID" with the node ids used by networkx, and all other columns are the node attributes.

To read back these files use readGraph(), and to write only one type of list use writeEdgeList() or writeNodeAttributeFile().

Warning: this function will overwrite files if they are in the way of the output; to prevent this set overwrite to False.

Note: If allSameAttribute is True and any nodes or edges are missing an attribute a KeyError will be raised.

Parameters

grph : networkx Graph

A networkx graph of the network to be written.

name : str

The start of the file name to be written, can include a path.

edgeInfo : optional [bool]

Default True, if True the attributes of each edge are written to the edge list.

typing : optional [bool]

Default False, if True the directedness of the graph will be added to the file names.

suffix : optional [str]

Default "csv", the suffix of the file.

overwrite : optional [bool]

Default True, if True files will be overwritten silently, otherwise an OSError exception will be raised.


updatej9DB(dbname='j9Abbreviations', saveRawHTML=False):

Updates the database of Journal Title Abbreviations. Requires an internet connection. The database is saved relative to the source file, not the working directory.

Parameters

dbname : optional [str]

The name of the database file, default is "j9Abbreviations.db"

saveRawHTML : optional [bool]

Determines if the original HTML of the pages is stored, default False. If True they are saved in a directory inside j9Raws beginning with today's date.


WOSRecord(ExtendedRecord):

WOSRecord.__init__(inRecord, sFile='', sLine=0):

Class for full WOS records

It is meant to be immutable; many of the methods and attributes are evaluated when first called, not when the object is created, and the results are stored privately.

The record’s meta-data is stored in an ordered dictionary labeled by WOS tags. To access the raw data stored in the original record the Tag() method can be used. To access data that has been processed and cleaned the attributes named after the tags are used.

Customizations

The Record's hashing and equality testing are based on the WOS number (the tag is 'UT', also called the accession number). These are strings starting with 'WOS:' followed by 15 or so numbers and letters, although both the length and character set are known to vary. The numbers are unique to each record so are used for comparisons. If a record is bad, all equality checks return False.

When converted to a string the record's title is used, so for a record R, R.TI == R.title == str(R), and its representation uses the WOS number instead of the memory location.

Attributes

When a record is created, if the parsing of the WOS file failed, it is marked as bad. The bad attribute is set to True and the error attribute is created to contain the exception object.

Generally, to get the information from a Record its attributes should be used. For a Record R, calling R.CR causes citations() from the tagProcessing module to be called on the contents of the raw 'CR' field. The result is then saved and returned; in this case, a list of Citation objects. You can also call R.citations to get the same effect, as each known field tag has a longer name (currently there are 61 field tags). These names are meant to make accessing tags more readable, and the mapping from tag to name can be found in the tagToFull dict. If a tag is known (in tagToFull) but not in the raw data, None is returned instead. Most tags when cleaned return a string or list of strings; the exact results can be found in the help for the particular function.

The attribute authors is also defined as a convenience and returns the same as ‘AF’ or if that is not found ‘AU’.

__Init__

Records are generally created in collections by RecordCollections, not as individual objects. If you wish to create one on its own it is possible; the arguments are as follows.

Parameters

inRecord : file stream, dict, str or itertools.chain

If it is a file stream the file must be open at the location of the first tag in the record, usually 'PT', and the file will be read until 'ER' is found, which indicates the end of the record in the file.

If a dict is passed the dictionary is used as the database of fields and tags, so each key is considered a WOS tag and each value a list of the lines of the original associated with the tag. This is the same form of dict that recordParser returns.

For a string the input must be the raw textual data of a single record in the WOS style, like the file stream it must start at the first tag and end in 'ER'.

itertools.chain is treated identically to a file stream and is used by RecordCollections.

sFile : optional [str]

Is the name of the file the raw data was in, by default it is blank. It is mostly used to make error messages more informative.

sLine : optional [int]

Is the line the record starts on in the raw data file. It is mostly used to make error messages more informative.


The WOSRecord class has the following methods:


WOSRecord.tagProcessingFunc(tag):

An abstractmethod, gives the function for processing tag

Parameters

tag : optional [str]

The tag in need of processing

Returns

function

The function to process the raw tag


WOSRecord.specialFuncs(key):

An abstractmethod, process the special tag, key using the whole Record

Parameters

key : str

One of the special tags: 'authorsFull', 'keywords', 'grants', 'j9', 'authorsShort', 'volume', 'selfCitation', 'citations', 'address', 'abstract', 'title', 'month', 'year', 'journal', 'beginningPage' and 'DOI'

Returns

The processed value of key


WOSRecord.writeRecord(infile):

Writes to infile the original contents of the Record. This is intended for use by RecordCollections to write to file. What is written to infile is bit for bit identical to the original record file (if utf-8 is used). No newline is inserted above the write but the last character is a newline.

Parameters

infile : file stream

An open utf-8 encoded file


WOSRecord.encoding():

An abstractmethod, gives the encoding string of the record.

Returns

str

The encoding


WOSRecord.getAltName(tag):

An abstractmethod, gives the alternate name of tag or None

Parameters

tag : str

The requested tag

Returns

str

The alternate name of tag or None


Citation(Hashable):

Citation.__init__(cite, scopusMode=False):

A class to hold citation strings and allow for comparison between them.

The initializer takes in a string representing a WOS citation in the form:

Author, Year, Journal, Volume, Page, DOI

Author is the author's name in the form of last name followed by first initial, sometimes followed by a period. Year is the year of publication. Journal is the 29-character source abbreviation of the journal. Volume is the volume number(s) of the publication preceded by a V. Page is the page number the record starts on. DOI is the DOI number of the cited record preceded by the letters 'DOI'. Combined they look like:

Nunez R., 1998, MATH COGNITION, V4, P85, DOI 10.1080/135467998387343

Note: any of the fields have been known to be missing and the requirements for the fields are not always met. If something is in the source string that cannot be interpreted as any of these it is put in the misc attribute. That is the reason to use this class: it gracefully handles missing information while still allowing for comparison between WOS citation strings.
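The field splitting can be illustrated with a heuristic parser (hypothetical, not metaknowledge's actual implementation): tokens are classified by shape, with V- and P-prefixed numbers as volume and page, a 'DOI '-prefixed token as the DOI, and a 4-digit number as the year.

```python
# Sketch of splitting a WOS citation string into fields by token shape.
import re

def parse_citation(cite):
    fields = {}
    for i, part in enumerate(p.strip() for p in cite.split(",")):
        if re.fullmatch(r"\d{4}", part):
            fields["year"] = int(part)
        elif re.fullmatch(r"V\d+", part):
            fields["V"] = part
        elif re.fullmatch(r"P\d+", part):
            fields["P"] = part
        elif part.startswith("DOI "):
            fields["DOI"] = part[4:]
        elif i == 0:
            fields["author"] = part
        else:
            fields.setdefault("journal", part)
    return fields

c = parse_citation("Nunez R., 1998, MATH COGNITION, V4, P85, "
                   "DOI 10.1080/135467998387343")
# c["author"] == "Nunez R.", c["year"] == 1998, c["journal"] == "MATH COGNITION"
```

A string with fewer fields, e.g. 'Nunez R., 1998', simply yields fewer keys, which mirrors how the real class leaves unmatched attributes undefined.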

Customizations

Citation’s hashing and equality checking are based on ID() and use the values of author, year and journal.

When converted to a string a Citation will return the original string.

Attributes

As noted above, citations are considered to be divided into six distinct fields (Author, Year, Journal, Volume, Page and DOI) with a seventh, misc, for anything not in those. Citations thus have an attribute for each: author, year, journal, V, P, DOI and misc respectively. These are created only if there is something in the field. So a Citation created from the string 'Nunez R., 1998, MATH COGNITION' would have author, year and journal defined, while one from 'Nunez R.' would have only the attribute misc.

If the parsing of a citation string fails the attribute bad is set to True and the attribute error is created to contain said error, which is a BadCitation object. If no errors occur bad is False.

The attribute original is the unmodified string (cite) given to create the Citation, it can also be accessed by converting to a string, e.g. with str().

__Init__

Citations can be created by Records or by giving the initializer a string containing a WOS style citation.

Parameters

cite : str

A str containing a WOS style citation.


The Citation class has the following methods:


Citation.isAnonymous():

Checks if the author is given as '[ANONYMOUS]' and returns True if so.

Returns

bool

True if the author is '[ANONYMOUS]' otherwise False.


Citation.ID():

Returns whichever of author, year and journal are available, separated by ', '. It is intended for shortening labels when creating networks, as the resulting strings are often unique. Extra() gets everything not returned by ID().

This is also used for hashing and equality checking.

Returns

str

A string to use as the ID of a node.
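As a sketch of the idea (the exact separator and formatting are metaknowledge's internals), an ID-style label can be built by joining whichever of the three fields are present:

```python
def citation_id(author=None, year=None, journal=None):
    """Sketch: join only the fields that are present into one label."""
    parts = [str(p) for p in (author, year, journal) if p is not None]
    return ", ".join(parts)

citation_id("Nunez R.", 1998, "MATH COGNITION")
# → 'Nunez R., 1998, MATH COGNITION'
```

A Citation with only an author would produce just that author's name, which is why the resulting labels are usually, but not always, unique.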


Citation.allButDOI():

Returns a string of the normalized values from the Citation excluding the DOI number. Equivalent to getting the ID with ID() then appending the extra values from Extra() and then removing the substring containing the DOI number.

Returns

str

A string containing the data of the Citation.


Citation.Extra():

Returns any V, P, DOI or misc values as a string. These are all the values not returned by ID(); they are separated by ', '.

Returns

str

A string containing the data not in the ID of the Citation.


Citation.isJournal(dbname='j9Abbreviations', manualDB='manualj9Abbreviations', returnDict='both', checkIfExcluded=False):

Returns True if the Citation’s journal field is a journal abbreviation from the WOS listing found at http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html, i.e. checks if the citation is citing a journal.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Note: All parameters are used for getting the database with getj9dict().

Parameters

dbname : optional [str]

The name of the downloaded database file, the default is determined at run time. It is recommended that this remain untouched.

manualDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

returnDict : optional [str]

default 'both', can be used to get both databases or only one with 'WOS' or 'manual'.

Returns

bool

True if the Citation is for a journal


Citation.FullJournalName():

Returns the full name of the Citation’s journal field. Requires the j9Abbreviations database file.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Returns

str

The first full name given for the journal of the Citation (or the first name in the WOS list if multiple names exist); if there is not one, None is returned


Citation.addToDB(manualName=None, manualDB='manualj9Abbreviations', invert=False):

Adds the journal of this Citation to the user created database of journals. This will cause isJournal() to return True for this Citation and all others with its journal.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Parameters

manualName : optional [str]

Default None, the full name of journal to use. If not provided the full name will be the same as the abbreviation.

manualDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

invert : optional [bool]

Default False, if True the journal will be removed instead of added


GrantCollection(CollectionWithIDs):

GrantCollection.__init__(inGrants=None, name='', extension='', cached=False, quietStart=False):

A Collection with a few extra methods that assume all the contained items have an id attribute and a bad attribute, e.g. Records or Grants.

__Init__

As CollectionWithIDs is mostly meant to be a base for other classes, all but one of the arguments of __init__ are not optional and the optional one is not used. The __init__() function is the same as a Collection's.


The GrantCollection class has the following methods:


GrantCollection.networkCoInvestigatorInstitution(targetTags=None, tagSeperator=';', count=True, weighted=True):

This works the same as networkCoInvestigator(); see it for details.


GrantCollection.networkCoInvestigator(targetTags=None, tagSeperator=';', count=True, weighted=True):

Creates a co-investigator network from the collection

Most grants do not have a known investigator tag, so it must be provided by the user in targetTags; the separator character should also be given if it is not a semicolon.

Parameters

targetTags : optional list[str]

A list of all the Grant tags to check for investigators

tagSeperator : optional str

The character that separates the individual investigator’s names

count : optional bool

Default True, if True the number of times a name occurs will be given

weighted : optional bool

Default True, if True the edge weights will be calculated and added to the edges

Returns

networkx Graph

The graph of co-investigators
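The underlying bookkeeping can be sketched without networkx: for each grant, every pair of investigators gets an edge, and repeated pairs increase the edge weight. The investigator lists below are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Hypothetical investigator lists, one per grant.
grants = [["Smith J", "Doe A"], ["Smith J", "Doe A", "Lee K"], ["Lee K"]]

node_counts = Counter()   # how often each name occurs (the 'count' attribute)
edge_weights = Counter()  # co-occurrence counts (the 'weight' attribute)
for investigators in grants:
    names = sorted(set(investigators))
    node_counts.update(names)
    for u, v in combinations(names, 2):
        edge_weights[(u, v)] += 1
```

Here 'Smith J' and 'Doe A' appear on two grants together, so their edge weight is 2; the real method stores these values as node and edge attributes on a networkx Graph.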


FallbackGrant(Grant):

FallbackGrant.__init__(original, grantdDict, sFile='', sLine=0):

A subclass of Grant; it has the same attributes and is returned by the fallback constructor for grants.



Grant(Record, MutableMapping):

Grant.__init__(original, grantdDict, idValue, bad, error, sFile='', sLine=0):

A dictionary with error handling and an id string.

Record is the base class of all objects in metaknowledge that contain information as key-value pairs; these are the grants and the records from different sources.

The error handling of the Record is done with the bad attribute. If there is some issue with the data, bad should be True and error should be given an Exception that caused, or explains, the error.

Customizations

Record is a subclass of collections.abc.Mapping, which means it has almost all the methods a dictionary does; the missing ones are those that modify entries. So to access the value of the key 'title' from a Record R, you would use either the square brace notation t = R['title'] or the get() method t = R.get('title'), just like a dictionary. The other methods like keys() or copy() also work.

In addition to being a mapping, Records are also hashable, with their hashes based on a unique id string they are given on creation, usually some kind of accession number from the source. The two optional arguments sFile and sLine, which should be given the name of the file the record came from and the line it started on respectively, are used to make the errors more useful.

__Init__

fieldDict is the dictionary the Record will use and idValue is the unique identifier of the Record.

Parameters

fieldDict : dict[str:]

A dictionary that maps from strings to values

idValue : str

A unique identifier string for the Record

bad : bool

True if there are issues with the Record, otherwise False

error : Exception

The Exception that caused whatever error made the record be marked as bad or None

sFile : str

A string that gives the source file of the original records

sLine : int

The first line the original record is found on in the source file


The Grant class has the following methods:


Grant.getInvestigators(tags=None, seperator=';'):

Returns a list of the names of investigators. This is done by looking (in order) for any of the fields in tags and splitting the strings found on seperator. If no strings are found an empty list is returned.

Note: for some Grants getInvestigators has been overridden and will ignore the arguments, simply providing the investigators.

Parameters

tags : optional list[str]

A list of the tags to look for investigators in

seperator : optional str

The string that separates each investigator's name within the field

Returns

list [str]

A list of all the found investigator’s names
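A minimal sketch of the described lookup, assuming the grant behaves like a dict (the real method works on the Grant's own fields):

```python
def get_investigators(grant, tags, seperator=";"):
    """Sketch: check the tags in order and split the first non-empty value."""
    for tag in tags:
        value = grant.get(tag)
        if value:
            return [name.strip() for name in value.split(seperator)]
    return []  # no matching field found

get_investigators({"Investigator": "Smith J; Doe A"}, ["PI", "Investigator"])
# → ['Smith J', 'Doe A']
```

The tag names and the grant dict here are illustrative; real grant field names vary by funding organization, which is why targetTags must often be supplied by the user.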


Grant.getInstitutions(tags=None, seperator=';'):

Returns a list of the names of institutions. This is done by looking (in order) for any of the fields in tags and splitting the strings found on seperator (in case of multiple institutions). If no strings are found an empty list is returned.

Note: for some Grants getInstitutions has been overridden and will ignore the arguments, simply providing the institutions.

Parameters

tags : optional list[str]

A list of the tags to look for institutions in

seperator : optional str

The string that separates each institution's name within the field

Returns

list [str]

A list of all the found institution’s names


Grant.update(other):

Adds all the tag-entry pairs from other to the Grant. If there is a conflict other takes precedence.

Parameters

other : Grant

Another Grant of the same type as self
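The merge semantics are those of a dictionary update, where other's values win on conflict; a plain-dict illustration with hypothetical keys:

```python
# Hypothetical tag-entry pairs from two versions of the same grant.
base = {"title": "Old title", "amount": "1000"}
other = {"amount": "2000", "year": "2015"}

# Later entries win: the conflicting key 'amount' takes other's value.
merged = {**base, **other}
# → {'title': 'Old title', 'amount': '2000', 'year': '2015'}
```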


CIHRGrant(Grant):

CIHRGrant.__init__(original, grantdDict, sFile, sLine):

A dictionary with error handling and an id string.

The base-class description, customizations and __init__ parameters are the same as for Grant above.



MedlineGrant(Grant):

MedlineGrant.__init__(grantString):

A dictionary with error handling and an id string.

The base-class description, customizations and __init__ parameters are the same as for Grant above.



NSERCGrant(Grant):

NSERCGrant.__init__(original, grantdDict, sFile, sLine):

A dictionary with error handling and an id string.

The base-class description, customizations and __init__ parameters are the same as for Grant above.


The NSERCGrant class has the following methods:


NSERCGrant.update(other):

Adds all the tag-entry pairs from other to the Grant. If there is a conflict other takes precedence.

Parameters

other : Grant

Another Grant of the same type as self


NSERCGrant.getInvestigators(tags=None, seperator=';'):

Returns a list of the names of investigators. The optional arguments are ignored.

Returns

list [str]

A list of all the found investigator’s names


NSERCGrant.getInstitutions(tags=None, seperator=';'):

Returns a list with the name of the institution. The optional arguments are ignored.

Returns

list [str]

A list with one entry, the name of the institution


NSFGrant(Grant):

NSFGrant.__init__(grantdDict, sFile):

A dictionary with error handling and an id string.

The base-class description, customizations and __init__ parameters are the same as for Grant above.


The NSFGrant class has the following methods:


NSFGrant.getInvestigators(tags=None, seperator=';'):

Returns a list of the names of investigators. The optional arguments are ignored.

Returns

list [str]

A list of all the found investigator’s names


NSFGrant.getInstitutions(tags=None, seperator=';'):

Returns a list with the name of the institution. The optional arguments are ignored.

Returns

list [str]

A list with one entry, the name of the institution


MedlineRecord(ExtendedRecord):

MedlineRecord.__init__(inRecord, sFile='', sLine=0):

Class for full Medline (PubMed) entries.

This class is an ExtendedRecord capable of generating its own id number. You should not create them directly, but instead use medlineParser() on a medline file.


The MedlineRecord class has the following methods:


MedlineRecord.encoding():

An abstractmethod, gives the encoding string of the record.

Returns

str

The encoding


MedlineRecord.getAltName(tag):

An abstractmethod, gives the alternate name of tag or None

Parameters

tag : str

The requested tag

Returns

str

The alternate name of tag or None


MedlineRecord.tagProcessingFunc(tag):

An abstractmethod, gives the function for processing tag

Parameters

tag : optional [str]

The tag in need of processing

Returns

function

The function to process the raw tag


MedlineRecord.specialFuncs(key):

An abstractmethod, process the special tag, key using the whole Record

Parameters

key : str

One of the special tags: 'authorsFull', 'keywords', 'grants', 'j9', 'authorsShort', 'volume', 'selfCitation', 'citations', 'address', 'abstract', 'title', 'month', 'year', 'journal', 'beginningPage' and 'DOI'

Returns

The processed value of key


MedlineRecord.writeRecord(f):

This is nearly identical to the original; the FAU tag is the only tag not written in the same place, as doing so would require changing the parser and adding lots of extra logic.


Collection(MutableSet, Hashable):

Collection.__init__(inSet, allowedTypes, collectedTypes, name, bad, errors, quietStart=False):

A named hashable set with some error reporting.

Collections have all the methods of builtin sets as well as error reporting with bad and error, and control over the contained items with allowedTypes and collectedTypes.

Customizations

When created, name should be a string that allows users to easily determine the source of the Collection.

When created you must provide a set of types, allowedTypes; when new items are added they will be checked and if they are not instances of any of the types a CollectionTypeError exception will be raised. The collectedTypes set that is provided should be a set of only the types in the Collection.

If any of the elements in the Collection are bad then bad should be set to True and the dict errors should map each item to its exception.

All of these customizations are managed when operations occur on the Collection, and if two Collections are combined with one of the binary operators (|, -, etc.) the _collectedTypes and errors attributes will be modified the same way. name will be updated to explain the operation(s) that occurred.
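The allowedTypes check can be sketched with a small set subclass; a plain TypeError stands in for metaknowledge's CollectionTypeError here:

```python
class TypedSet(set):
    """Minimal sketch of a set that type-checks items on add."""

    def __init__(self, allowedTypes):
        super().__init__()
        self.allowedTypes = tuple(allowedTypes)

    def add(self, item):
        # reject anything that is not an instance of an allowed type
        if not isinstance(item, self.allowedTypes):
            raise TypeError("{!r} is not an allowed type".format(item))
        super().add(item)

s = TypedSet({str})
s.add("a record")   # fine
# s.add(1)          # would raise TypeError
```

Passing {object} as the allowed set, as the parameter description below notes, effectively disables the check since everything is an instance of object.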

__Init__

As Collection is mostly meant to be a base for other classes, all but one of the arguments of __init__ are not optional and the optional one is not used.

Parameters

inSet : set

The objects to be contained

allowedTypes : set[type]

A set of types, {object} will allow virtually everything

collectedTypes : set[type]

The types (or supertypes) of the objects in inSet

name : str

The name of the Collection

bad : bool

If any of the elements are bad

errors : dict[:Exception]

A mapping from items to their errors

quietStart : optional [bool]

Default False, does nothing. This is here for use as an interface by subclasses


The Collection class has the following methods:


Collection.add(elem):

Adds elem to the collection.

Parameters

elem : object

The object to be added


Collection.discard(elem):

Removes elem from the collection; will not raise an Exception if elem is missing

Parameters

elem : object

The object to be removed


Collection.remove(elem):

Removes elem from the collection; will raise a KeyError if elem is missing

Parameters

elem : object

The object to be removed


Collection.clear():

Removes all elements from the collection and resets the error handling


Collection.pop():

Removes a random element from the collection and returns it

Returns

object

A random object from the collection


Collection.copy():

Creates a shallow copy of the collection

Returns

Collection

A copy of the Collection


Collection.peek():

Returns a random element from the collection. If run twice, the same element will usually be returned

Returns

object

A random object from the collection


Collection.chunk(maxSize):

Splits the Collection into Collections of size maxSize or smaller

Parameters

maxSize : int

The maximum number of elements in a returned Collection

Returns

list [Collection]

A list of Collections that if all merged (| operator) would create the original
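The chunking logic amounts to filling sub-collections until maxSize is reached; a list-based sketch:

```python
def chunk(items, maxSize):
    """Sketch: group items into lists of at most maxSize elements."""
    chunks, current = [], []
    for item in items:
        current.append(item)
        if len(current) == maxSize:
            chunks.append(current)
            current = []
    if current:          # keep the final, possibly smaller, chunk
        chunks.append(current)
    return chunks

chunk(list(range(5)), 2)
# → [[0, 1], [2, 3], [4]]
```

Concatenating the chunks recovers the original contents, just as merging the returned Collections with | recreates the original Collection.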


Collection.split(maxSize):

Destructively splits the Collection into Collections of size maxSize or smaller. The source Collection will be empty after this operation

Parameters

maxSize : int

The maximum number of elements in a returned Collection

Returns

list [Collection]

A list of Collections that if all merged (| operator) would create the original


CollectionWithIDs(Collection):

CollectionWithIDs.__init__(inSet, allowedTypes, collectedTypes, name, bad, errors, quietStart=False):

A Collection with a few extra methods that assume all the contained items have an id attribute and a bad attribute, e.g. Records or Grants.

__Init__

As CollectionWithIDs is mostly meant to be a base for other classes, all but one of the arguments of __init__ are not optional and the optional one is not used. The __init__() function is the same as a Collection's.


The CollectionWithIDs class has the following methods:


CollectionWithIDs.networkMultiMode(*tags, recordType=True, nodeCount=True, edgeWeight=True, stemmer=None, edgeAttribute=None):

Creates a network of the objects found by all tags in tags; each node is marked by which tag spawned it, making the resultant graph n-partite.

A networkMultiMode() looks at each item in the collection and extracts its values for the tags given by tags. Then an edge is created between all the objects returned, regardless of their type. Each node will have an attribute called 'type' that gives the tag that created it, or both if both created it; e.g. if 'LA' were in tags the node 'English' would have its type attribute be 'LA'.

For example, if tags was set to ['CR', 'UT', 'LA'], a three-mode network would be created, composed of a co-citation network from the 'CR' tag. Each citation would then also have edges to all the languages of the Records that cited it and to the WOS numbers of those Records.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

Parameters

tags : str, str, str, … or list [str]

Any number of tags, or a list of tags

nodeCount : optional [bool]

Default True, if True each node will have an attribute called 'count' that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called 'weight' that contains an int giving the number of times the two objects co-occurred.

stemmer : optional [func]

Default None. If stemmer is a callable object (basically a function, or possibly a class) it will be called with the ID of every node in the graph; note that all IDs are strings.

For example, the function f = lambda x: x[0], if given as the stemmer, will cause all IDs to be the first character of their unstemmed IDs; e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

Returns

networkx Graph

A networkx Graph with the objects of the tags tags as nodes and their co-occurrences as edges
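The core bookkeeping can be sketched with plain dictionaries; the records below are hypothetical tag-to-values mappings, and node typing is resolved naively here:

```python
from itertools import combinations
from collections import Counter

# Hypothetical records, each mapping a tag to its list of values.
records = [
    {"LA": ["English"], "CR": ["cite1", "cite2"]},
    {"LA": ["English"], "CR": ["cite1"]},
]
tags = ["LA", "CR"]

node_type = {}          # which tag spawned each node
edge_weights = Counter()  # co-occurrence counts
for rec in records:
    values = []
    for tag in tags:
        for v in rec.get(tag, []):
            values.append(v)
            node_type.setdefault(v, tag)  # first tag to produce the value wins
    # edges between all co-occurring values, regardless of type
    for u, v in combinations(sorted(set(values)), 2):
        edge_weights[(u, v)] += 1
```

Here 'English' (type 'LA') and 'cite1' (type 'CR') co-occur in both records, so that edge has weight 2; the real method attaches these values as node and edge attributes on a networkx Graph.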


CollectionWithIDs.containsID(idVal):

Checks if the collected items contain the given idVal

Parameters

idVal : str

The queried id string

Returns

bool

True if the item is in the collection


CollectionWithIDs.discardID(idVal):

Checks if the collected items contain the given idVal and discards it if it is found; will not raise an exception if the item is not found

Parameters

idVal : str

The discarded id string


CollectionWithIDs.removeID(idVal):

Checks if the collected items contain the given idVal and removes it if it is found; will raise a KeyError if the item is not found

Parameters

idVal : str

The removed id string


CollectionWithIDs.getID(idVal):

Looks up an item with idVal and returns it if it is found; returns None if it does not find the item

Parameters

idVal : str

The requested item’s id string

Returns

object

The requested object or None


CollectionWithIDs.badEntries():

Creates a new collection of the same type with only the bad entries

Returns

CollectionWithIDs

A collection of only the bad entries


CollectionWithIDs.dropBadEntries():

Removes all the bad entries from the collection


CollectionWithIDs.tags():

Creates a list of all the tags of the contained items

Returns

list [str]

A list of all the tags


CollectionWithIDs.glimpse(*tags, compact=False):

Creates a printable table with the most frequently occurring values of each of the requested tags, or if none are provided the top authors, journals and citations. The table will be as wide and as tall as the terminal (or 80 by 24 if there is no terminal), so print(RC.glimpse()) should always create a nice looking table. Below is a table created from some of the testing files:

>>> print(RC.glimpse())
+RecordCollection glimpse made at: 2016-01-01 12:00:00++++++++++++++++++++++++++
|33 Records from testFile++++++++++++++++++++++++++++++++++++++++++++++++++++++|
|Columns are ranked by num. of occurrences and are independent of one another++|
|-------Top Authors--------+------Top Journals-------+--------Top Cited--------|
|1                Girard, S|1 CANADIAN JOURNAL OF PH.|1 LEVY Y, 1975, OPT COMM.|
|1                Gilles, H|1 JOURNAL OF THE OPTICAL.|2 GOOS F, 1947, ANN PHYS.|
|2                IMBERT, C|2          APPLIED OPTICS|3 LOTSCH HKV, 1970, OPTI.|
|2                Pillon, F|2   OPTICS COMMUNICATIONS|4 RENARD RH, 1964, J OPT.|
|3          BEAUREGARD, OCD|2 NUOVO CIMENTO DELLA SO.|5 IMBERT C, 1972, PHYS R.|
|3               Laroche, M|2 JOURNAL OF THE OPTICAL.|6 ARTMANN K, 1948, ANN P.|
|3                 HUARD, S|2 JOURNAL OF THE OPTICAL.|6 COSTADEB.O, 1973, PHYS.|
|4                  PURI, A|2 NOUVELLE REVUE D OPTIQ.|6 ROOSEN G, 1973, CR ACA.|
|4               COSTADEB.O|3 PHYSICS REPORTS-REVIEW.|7 Imbert C., 1972, Nouve.|
|4           PATTANAYAK, DN|3 PHYSICAL REVIEW LETTERS|8 HOROWITZ BR, 1971, J O.|
|4           Gazibegovic, A|3 USPEKHI FIZICHESKIKH N.|8 BRETENAKER F, 1992, PH.|
|4                ROOSEN, G|3 APPLIED PHYSICS B-LASE.|8 SCHILLIN.H, 1965, ANN .|
|4               BIRMAN, JL|3 AEU-INTERNATIONAL JOUR.|8 FEDOROV FI, 1955, DOKL.|
|4                Kaiser, R|3 COMPTES RENDUS HEBDOMA.|8 MAZET A, 1971, CR ACAD.|
|5                  LEVY, Y|3 CHINESE PHYSICS LETTERS|9 IMBERT C, 1972, CR ACA.|
|5              BEAUREGA.OC|3       PHYSICAL REVIEW B|9 LOTSCH HKV, 1971, OPTI.|
|5               PAVLOV, VI|3 LETTERE AL NUOVO CIMEN.|9 ASHBY N, 1973, PHYS RE.|
|5                BREVIK, I|3 PROGRESS IN QUANTUM EL.|9 BOULWARE DG, 1973, PHY.|
>>>
Parameters

*tags : str, str, ...

Any number of tag strings to be made into columns in the output table

Returns

str

A string containing the table


CollectionWithIDs.rankedSeries(tag, outputFile=None, giveCounts=True, giveRanks=False, greatestFirst=True, pandasMode=True, limitTo=None):

Creates a pandas-ready dict of the ordered list of all the values of tag, ranked by their number of occurrences. A list can also be returned with the counts or ranks added, or it can be written to a file.

Parameters

tag : str

The tag to be ranked

outputFile : optional str

A file path to write a csv with 2 columns, one the tag values the other their counts

giveCounts : optional bool

Default True, if True the returned list will be composed of tuples, the first value being the tag value and the second its count. This supersedes giveRanks.

giveRanks : optional bool

Default False, if True and giveCounts is False, the returned list will be composed of tuples, the first value being the tag value and the second its rank. This is superseded by giveCounts.

greatestFirst : optional bool

Default True, if True the returned list will be ordered with the highest ranked value first, otherwise the lowest ranked will be first.

pandasMode : optional bool

Default True, if True a dict ready for pandas will be returned, otherwise a list

limitTo : optional list[values]

Default None, if a list is provided only those values in the list will be counted or returned

Returns

dict[str:list[value]] or list[str]

A dict or list will be returned depending on if pandasMode is True
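The counting and ranking step is essentially a Counter sort; a sketch with hypothetical tag values:

```python
from collections import Counter

# Hypothetical values of some tag across a collection.
values = ["OPTICS", "PHYSICS", "OPTICS", "MATH", "OPTICS", "PHYSICS"]

counts = Counter(values)
# greatestFirst=True ordering: highest count first
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
# → [('OPTICS', 3), ('PHYSICS', 2), ('MATH', 1)]
```

In pandasMode the real method returns this as a dict of columns ready for a pandas DataFrame rather than a list of tuples.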


CollectionWithIDs.timeSeries(tag=None, outputFile=None, giveYears=True, greatestFirst=True, limitTo=False, pandasMode=True):

Creates a pandas-ready dict of the ordered list of all the values of tag, ranked by the year they occurred in; multiple year occurrences will create multiple entries. A list can also be returned with the counts or years added, or it can be written to a file.

If no tag is given the Records in the collection will be used

Parameters

tag : optional str

Default None, if provided the tag will be ordered

outputFile : optional str

A file path to write a csv with 2 columns, one the tag values the other their years

giveYears : optional bool

Default True, if True the returned list will be composed of tuples, the first value being the tag value and the second its year.

greatestFirst : optional bool

Default True, if True the returned list will be ordered with the highest years first, otherwise the lowest years will be first.

pandasMode : optional bool

Default True, if True a dict ready for pandas will be returned, otherwise a list

limitTo : optional list[values]

Default None, if a list is provided only those values in the list will be counted or returned

Returns

dict[str:list[value]] or list[str]

A dict or list will be returned depending on if pandasMode is True


CollectionWithIDs.cooccurrenceCounts(keyTag, *countedTags):

Counts the number of times values from any of the countedTags occur with keyTag. The counts are returned as a dictionary with the values of keyTag mapping to dictionaries with each of the countedTags' values mapping to their counts.

Parameters

keyTag : str

The tag used as the key for the returned dictionary

*countedTags : str, str, str, ...

The tags used as the key for the returned dictionary’s values

Returns

dict[str:dict[str:int]]

The dictionary of counts
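The nested-dictionary shape can be sketched as follows; the records are hypothetical tag-to-values mappings:

```python
from collections import defaultdict, Counter

# Hypothetical records, each mapping a tag to its list of values.
records = [
    {"journal": ["J PHYS"], "author": ["Smith J", "Doe A"]},
    {"journal": ["J PHYS"], "author": ["Smith J"]},
]
keyTag, countedTags = "journal", ("author",)

counts = defaultdict(Counter)
for rec in records:
    for key in rec.get(keyTag, []):
        for tag in countedTags:
            # count each countedTags value under the keyTag value
            counts[key].update(rec.get(tag, []))
```

Here counts['J PHYS'] maps 'Smith J' to 2 and 'Doe A' to 1, matching the dict[str:dict[str:int]] return shape described above.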


CollectionWithIDs.networkMultiLevel(*modes, nodeCount=True, edgeWeight=True, stemmer=None, edgeAttribute=None, nodeAttribute=None):

Creates a network of the objects found by any number of tags, modes, with edges between all co-occurring values. If you only want edges between co-occurring values from different tags use networkMultiMode().

A networkMultiLevel() looks at each entry in the collection and extracts its values for the tag given by each of the modes, e.g. the 'authorsFull' tag. Then if multiple values are returned an edge is created between them. So in the case of the author tag 'authorsFull' a co-authorship network is created. Then for each other tag the entries are also added and edges between the first tag's nodes and theirs are created.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

Note Do not use this for the construction of co-citation networks; use RecordCollection.networkCoCitation(), which is more accurate and has more options.

Parameters

*modes : str, str, str, ...

A two character WOS tag or one of the full names for a tag

nodeCount : optional [bool]

Default True, if True each node will have an attribute called “count” that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called “weight” that contains an int giving the number of times the two objects co-occurred.

stemmer : optional [func]

Default None, if stemmer is a callable object (basically a function, or possibly a class) it will be called on the ID of every node in the graph; all IDs are strings. For example:

The function f = lambda x: x[0], if given as the stemmer, will cause all IDs to be the first character of their unstemmed IDs, e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

Returns

networkx Graph

A networkx Graph with the objects of the tag mode as nodes and their co-occurrences as edges


CollectionWithIDs.networkOneMode(mode, nodeCount=True, edgeWeight=True, stemmer=None, edgeAttribute=None, nodeAttribute=None):

Creates a network of the objects found by one tag mode. This is the same as networkMultiLevel() with only one tag.

A networkOneMode() looks at each entry in the collection and extracts its values for the tag given by mode, e.g. the 'authorsFull' tag. Then if multiple values are returned an edge is created between them. So in the case of the author tag 'authorsFull' a co-authorship network is created.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

Note Do not use this for the construction of co-citation networks; use RecordCollection.networkCoCitation(), which is more accurate and has more options.

Parameters

mode : str

A two character WOS tag or one of the full names for a tag

nodeCount : optional [bool]

Default True, if True each node will have an attribute called “count” that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called “weight” that contains an int giving the number of times the two objects co-occurred.

stemmer : optional [func]

Default None, if stemmer is a callable object (basically a function, or possibly a class) it will be called on the ID of every node in the graph; all IDs are strings. For example:

The function f = lambda x: x[0], if given as the stemmer, will cause all IDs to be the first character of their unstemmed IDs, e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

Returns

networkx Graph

A networkx Graph with the objects of the tag mode as nodes and their co-occurrences as edges
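A minimal sketch of the node counting and co-occurrence edge weighting networkOneMode() performs, using stdlib counters in place of networkx (the toy records are hypothetical):

```python
from collections import Counter
from itertools import combinations

def one_mode_edges(entries, mode):
    # Count node occurrences and weight edges by the number of entries
    # in which a pair of values co-occurs
    node_counts, edge_weights = Counter(), Counter()
    for entry in entries:
        values = entry.get(mode, [])
        node_counts.update(values)
        for a, b in combinations(sorted(set(values)), 2):
            edge_weights[(a, b)] += 1
    return node_counts, edge_weights

records = [
    {'authorsFull': ['Doe, A', 'Smith, J']},
    {'authorsFull': ['Smith, J', 'Roe, B', 'Doe, A']},
]
nodes, edges = one_mode_edges(records, 'authorsFull')
print(edges[('Doe, A', 'Smith, J')])  # 2
```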


CollectionWithIDs.networkTwoMode(tag1, tag2, directed=False, recordType=True, nodeCount=True, edgeWeight=True, stemmerTag1=None, stemmerTag2=None, edgeAttribute=None):

Creates a network of the objects found by two WOS tags tag1 and tag2, each node marked by which tag spawned it making the resultant graph bipartite.

A networkTwoMode() looks at each Record in the RecordCollection and extracts its values for the tags given by tag1 and tag2, e.g. the 'WC' and 'LA' tags. Then for each object returned by each tag an edge is created between it and every other object of the other tag. So the WOS defined subject tag 'WC' and language tag 'LA' will give a two-mode network showing the connections between subjects and languages. Each node will have an attribute called 'type' that gives the tag that created it, or both if both created it, e.g. the node 'English' would have the 'type' attribute 'LA'.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

The directed parameter if True will cause the network to be directed with the first tag as the source and the second as the destination.

Parameters

tag1 : str

A two character WOS tag or one of the full names for a tag, the source of edges on the graph

tag2 : str

A two character WOS tag or one of the full names for a tag, the target of edges on the graph

directed : optional [bool]

Default False, if True the returned network is directed

nodeCount : optional [bool]

Default True, if True each node will have an attribute called “count” that contains an int giving the number of time the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called “weight” that contains an int giving the number of time the two objects co-occurrenced.

stemmerTag1 : optional [func]

Default None, if stemmerTag1 is a callable object (basically a function, or possibly a class) it will be called on the ID of every node given by tag1 in the graph; all IDs are strings.

For example: the function f = lambda x: x[0] if given as the stemmer will cause all IDs to be the first character of their unstemmed IDs. e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

stemmerTag2 : optional [func]

Default None, see stemmerTag1 as it is the same but for tag2

Returns

networkx Graph or networkx DiGraph

A networkx Graph with the objects of the tags tag1 and tag2 as nodes and their co-occurrences as edges.
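The edge construction can be sketched as a product over the two tags' values per entry; this is an illustrative stand-in with toy records, not the actual implementation:

```python
from collections import Counter
from itertools import product

def two_mode_edges(entries, tag1, tag2):
    # Weight each (tag1 value, tag2 value) edge by the number of
    # entries in which the pair appears together
    weights = Counter()
    for entry in entries:
        for a, b in product(entry.get(tag1, []), entry.get(tag2, [])):
            weights[(a, b)] += 1
    return weights

records = [
    {'WC': ['Physics'], 'LA': ['English']},
    {'WC': ['Physics', 'Optics'], 'LA': ['English']},
]
print(two_mode_edges(records, 'WC', 'LA')[('Physics', 'English')])  # 2
```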


ExtendedRecord(Record):

ExtendedRecord.__init__(fieldDict, idValue, bad, error, sFile='', sLine=0):

A subclass of Record that adds processing to the dictionary. It also cannot be used directly and must be subclassed.

The ExtendedRecord class is an extension of Record that is intended for use with the records on scientific papers provided by different organizations such as WOS or Pubmed. The 5 abstract (virtual) methods must be defined for each subclass; they define how the data in the different fields is processed and how the record can be rewritten to a file.

Processing fields

When an ExtendedRecord is created a dictionary, fieldDict, must be provided; it contains the raw data from the file reader, usually as lists of strings. tagProcessingFunc is a staticmethod that takes in a tag string and returns a function to process it.

Each tag may also be given a second name, as the names used in the raw data are usually not very easy to understand (e.g. 'SO' is the journal name for WOS records). The mapping from the raw tag ('SO') to the human friendly string ('journal') is done with the getAltName staticmethod. getAltName takes in a tag string and returns either None or the other name for that string. Note, getAltName must go both directions: WOSRecord.getAltName(WOSRecord.getAltName('SO')) == 'SO'.

The last method for processing entries is specialFuncs. The following are the special keys for ExtendedRecords; these must be either the alternate names of tags or strings accepted by the specialFuncs method.

  • 'authorsFull'
  • 'keywords'
  • 'grants'
  • 'j9'
  • 'authorsShort'
  • 'volume'
  • 'selfCitation'
  • 'citations'
  • 'address'
  • 'abstract'
  • 'title'
  • 'month'
  • 'year'
  • 'journal'
  • 'beginningPage'
  • 'DOI'

specialFuncs when given one of these must raise a KeyError or return an object of the same type as that returned by the MedlineRecord or WOSRecord. e.g. 'title' would return a string giving the title of the record.

For an example of how this works, let's first look at the 'SO' tag on a WOSRecord accessed with the alternate name 'journal'.

t = R['journal']

First the private dictionary _computedFields is checked for the key 'journal', which will fail if this is the first time 'journal' or 'SO' has been requested; after this the results will be added to the dictionary to speed up future requests.

Then the fieldDict will be checked for the key and, if that fails, the key will go through getAltName and be checked again. If the record has a journal entry this will succeed and the raw data will be given to the tagProcessingFunc using the same key as fieldDict, in this case 'SO'.

The results will then be written to _computedFields and returned.

If the requested key were instead 'grants' (g = R['grants']), then both lookups in fieldDict would have failed and the string 'grants' would have been given to specialFuncs, which would return a list of all the grants in the WOSRecord (this is always [] as WOS does not provide grant information).

What if the key were not present anywhere? Then specialFuncs should raise a KeyError, which will be caught and then re-raised, just as a dictionary would for an invalid key lookup.
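The lookup order just described can be sketched as follows. The dict-based record and the altNames/processors/special tables are hypothetical stand-ins for the real attributes and abstract methods:

```python
def lookup(record, key):
    # Sketch of the resolution order: cache, fieldDict, alternate name,
    # then specialFuncs (a KeyError propagates for unknown keys)
    cache = record['_computedFields']
    if key in cache:
        return cache[key]
    field_dict = record['fieldDict']
    raw_key = key if key in field_dict else record['altNames'].get(key)
    if raw_key in field_dict:
        value = record['processors'][raw_key](field_dict[raw_key])
    else:
        value = record['special'][key]()   # raises KeyError if unknown
    cache[key] = value                     # speeds up future requests
    return value

R = {
    '_computedFields': {},
    'fieldDict': {'SO': ['PHYSICAL REVIEW A']},
    'altNames': {'journal': 'SO'},
    'processors': {'SO': lambda v: v[0].title()},
    'special': {'grants': lambda: []},     # WOS provides no grant data
}
print(lookup(R, 'journal'))  # Physical Review A
print(lookup(R, 'grants'))   # []
```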

File Handling fields

The two other required methods, encoding and writeRecord, define how the records can be rewritten to a file. encoding should return a string giving the encoding Python would use, e.g. 'utf-8' or 'latin-1'. This is the same encoding that the files written by writeRecord should have; writeRecord when called should write the original record to the provided open file, infile. The opening, closing, header and footer of the file are handled by RecordCollection’s writeFile function, which should be modified accordingly. If the order of the fields in a record is important you can use a collections.OrderedDict for fieldDict.

__Init__

The __init__ of ExtendedRecord takes the same arguments as Record


The ExtendedRecord class has the following methods:


ExtendedRecord.get(tag, default=None, raw=False):

Allows access to the raw values or is an Exception safe wrapper to __getitem__.

Parameters

tag : str

The requested tag

default : optional [Object]

Default None, the object returned when tag is not found

raw : optional [bool]

Default False, if True the unprocessed value of tag is returned

Returns

Object

The processed value of tag or default


ExtendedRecord.values(raw=False):

Like values for dicts but with a raw option

Parameters

raw : optional [bool]

Default False, if True the ValuesView contains the raw values

Returns

ValuesView

The values of the record


ExtendedRecord.items(raw=False):

Like items for dicts but with a raw option

Parameters

raw : optional [bool]

Default False, if True the ItemsView contains the raw values as the values

Returns

ItemsView

The key-value pairs of the record


ExtendedRecord.writeRecord(infile):

An abstractmethod, writes the record in its original form to infile

Parameters

infile : writable file

The file to be written to


ExtendedRecord.encoding():

An abstractmethod, gives the encoding string of the record.

Returns

str

The encoding


ExtendedRecord.getAltName(tag):

An abstractmethod, gives the alternate name of tag or None

Parameters

tag : str

The requested tag

Returns

str

The alternate name of tag or None


ExtendedRecord.tagProcessingFunc(tag):

An abstractmethod, gives the function for processing tag

Parameters

tag : optional [str]

The tag in need of processing

Returns

function

The function to process the raw tag


ExtendedRecord.specialFuncs(key):

An abstractmethod, process the special tag, key using the whole Record

Parameters

key : str

One of the special tags: 'authorsFull', 'keywords', 'grants', 'j9', 'authorsShort', 'volume', 'selfCitation', 'citations', 'address', 'abstract', 'title', 'month', 'year', 'journal', 'beginningPage' and 'DOI'

Returns

The processed value of key


ExtendedRecord.getCitations(field=None, values=None, pandasFriendly=True):

Creates a pandas ready dict with each row a different citation and columns containing the original string, year, journal and author’s name.

There are also options to filter the output citations with field and values

Parameters

field : optional str

Default None, if given all citations missing the named field will be dropped.

values : optional str or list[str]

Default None, if field is also given only those citations with one of the strings given in values will be included.

e.g. to get only citations from 1990 or 1991: field = 'year', values = ['1991', '1990']

pandasFriendly : optional bool

Default True, if False a list of the citations will be returned instead of the more complicated pandas dict

Returns

dict

A pandas ready dict with all the citations
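The filtering and columns-of-lists shape can be sketched like this, using toy citation dicts rather than real Citation objects (an illustrative stand-in, not the package's implementation):

```python
def citations_to_pandas_dict(citations, field=None, values=None):
    # Drop citations missing `field`; if `values` is also given keep
    # only citations whose field matches one of them
    if field is not None:
        citations = [c for c in citations if c.get(field) is not None]
        if values is not None:
            wanted = {values} if isinstance(values, str) else set(values)
            citations = [c for c in citations if c[field] in wanted]
    cols = ('original', 'year', 'journal', 'author')
    return {col: [c.get(col) for c in citations] for col in cols}

cites = [
    {'original': 'Doe A, 1990, J THEOR', 'year': '1990', 'author': 'Doe A'},
    {'original': 'Roe B, 1991, J APPL', 'year': '1991', 'author': 'Roe B'},
]
print(citations_to_pandas_dict(cites, field='year', values=['1991'])['author'])
# ['Roe B']
```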


ExtendedRecord.subDict(tags, raw=False):

Creates a dict of values of tags from the Record. The tags are the keys and the values are the values. If the tag is missing the value will be None.

Parameters

tags : list[str]

The list of tags requested

raw : optional [bool]

Default False, if True the returned values of the dict will be unprocessed

Returns

dict

A dictionary with the keys tags and the values from the record


ExtendedRecord.createCitation(multiCite=False):

Creates a citation string, using the same format as other WOS citations, for the Record by reading the relevant special tags ('year', 'J9', 'volume', 'beginningPage', 'DOI') and using it to create a Citation object.

Parameters

multiCite : optional [bool]

Default False, if True a tuple of Citations is returned with each having a different one of the records authors as the author

Returns

Citation

A Citation object containing a citation for the Record.


ExtendedRecord.authGenders(countsOnly=False, fractionsMode=False):

Creates a dict mapping 'Male', 'Female' and 'Unknown' to lists of the names of all the authors.

Parameters

countsOnly : optional bool

Default False, if True the counts (lengths of the lists) will be given instead of the lists of names

fractionsMode : optional bool

Default False, if True the fraction counts (lengths of the lists divided by the total number of authors) will be given instead of the lists of names. This supersedes countsOnly

Returns

dict[str:str or int]

The mapping of genders to author’s names or counts


ExtendedRecord.bibString(maxLength=1000, WOSMode=False, restrictedOutput=False, niceID=True):

Makes a string giving the Record as a bibTex entry. If the Record is of a journal article (PT J) the bibTex type is set to 'article', otherwise it is set to 'misc'. The ID of the entry is the WOS number and all the Record’s fields are given as entries with their long names.

Note This is not meant to be used directly with LaTeX; none of the special characters have been escaped and a large number of unnecessary fields are provided. niceID and maxLength have been provided to make conversions easier.

Note Record entries that are lists have their values separated with the string ' and '

Parameters

maxLength : optional [int]

default 1000, the max length for a continuous string. Most bibTex implementations only allow strings to be up to 1000 characters; this splits them into substrings then uses the native string concatenation (the '#' character) to allow for longer strings

WOSMode : optional [bool]

default False, if True the data produced will be unprocessed and use double curly braces. This is the style WOS produces bib files in and mostly matches that.

restrictedOutput : optional [bool]

default False, if True the tags output will be limited to those found in metaknowledge.commonRecordFields

niceID : optional [bool]

default True, if True the ID used will be derived from the authors, publishing date and title, if False it will be the UT tag

Returns

str

The bibTex string of the Record
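The maxLength splitting can be sketched with bibTeX's '#' concatenation operator; this helper is a hypothetical illustration, not the package's internal function:

```python
def split_bib_value(value, max_length=1000):
    # Break a long field value into <= max_length chunks and join them
    # with bibTeX's native string concatenation operator '#'
    chunks = [value[i:i + max_length]
              for i in range(0, len(value), max_length)]
    return ' # '.join('{%s}' % chunk for chunk in chunks)

print(split_bib_value('abcdef', max_length=4))  # {abcd} # {ef}
```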


Record(Mapping, Hashable):

Record.__init__(fieldDict, idValue, bad, error, sFile='', sLine=0):

A dictionary with error handling and an id string.

Record is the base class of all objects in metaknowledge that contain information as key-value pairs; these are the grants and the records from different sources.

The error handling of the Record is done with the bad attribute. If there is some issue with the data, bad should be True and error should be given an Exception that was caused by, or explains, the error.

Customizations

Record is a subclass of collections.abc.Mapping, which means it has almost all the methods a dictionary does; the missing ones are those that modify entries. So to access the value of the key 'title' from a Record R, you would use either the square brace notation t = R['title'] or the get() function t = R.get('title'), just like a dictionary. The other methods like keys() or copy() also work.

In addition to being a mapping Records are also hashable with their hashes being based on a unique id string they are given on creation, usually some kind of accession number the source gives them. The two optional arguments sFile and sLine, which should be given the name of the file the records came from and the line it started on respectively, are used to make the errors more useful.

__Init__

fieldDict is the dictionary the Record will use and idValue is the unique identifier of the Record.

Parameters

fieldDict : dict[str:]

A dictionary that maps from strings to values

idValue : str

A unique identifier string for the Record

bad : bool

True if there are issues with the Record, otherwise False

error : Exception

The Exception that caused whatever error made the record be marked as bad or None

sFile : str

A string that gives the source file of the original records

sLine : int

The first line the original record is found on in the source file


The Record class has the following methods:


Record.__eq__(other):

Compares Records using their hashes; if the hashes are the same then True is returned.

Parameters

other : Record

Another Record to be compared against

Returns

bool

If the records are the same then True is returned


Record.__str__():

Makes a string with the title of the Record as given by self.title; if there is not one it returns 'Untitled record'

Returns

str

The title of the Record


Record.__repr__():

Makes a string with the id of the Record and its type

Returns

str

The representation of the Record


Record.copy():

Correctly copies the Record

Returns

Record

A completely decoupled copy of the original


Record.__hash__():

Gives a hash of the id or if bad returns a hash of the fields combined with the error messages, either of these could be blank

bad Records are more likely to cause hash collisions due to their lack of entropy when created.

Returns

int

A hopefully unique number


ProQuestRecord(ExtendedRecord):

ProQuestRecord.__init__(inRecord, recNum=None, sFile='', sLine=0):

Class for full ProQuest entries.

This class is an ExtendedRecord capable of generating its own id number. You should not create them directly, but instead use proQuestParser() on a ProQuest file.


The ProQuestRecord class has the following methods:


ProQuestRecord.encoding():

An abstractmethod, gives the encoding string of the record.

Returns

str

The encoding


ProQuestRecord.getAltName(tag):

An abstractmethod, gives the alternate name of tag or None

Parameters

tag : str

The requested tag

Returns

str

The alternate name of tag or None


ProQuestRecord.tagProcessingFunc(tag):

An abstractmethod, gives the function for processing tag

Parameters

tag : optional [str]

The tag in need of processing

Returns

function

The function to process the raw tag


ProQuestRecord.specialFuncs(key):

An abstractmethod, process the special tag, key using the whole Record

Parameters

key : str

One of the special tags: 'authorsFull', 'keywords', 'grants', 'j9', 'authorsShort', 'volume', 'selfCitation', 'citations', 'address', 'abstract', 'title', 'month', 'year', 'journal', 'beginningPage' and 'DOI'

Returns

The processed value of key


ProQuestRecord.writeRecord(infile):

An abstractmethod, writes the record in its original form to infile

Parameters

infile : writable file

The file to be written to


RecordCollection(CollectionWithIDs):

RecordCollection.__init__(inCollection=None, name='', extension='', cached=False, quietStart=False):

A container for a large number of individual records.

RecordCollection provides ways of creating Records from an isi file, string, list of records or directory containing isi files.

When being created, if there are issues the RecordCollection will be declared bad and bad will be set to True; it will then mostly return None or False. The attribute error contains the exception that occurred.

They also possess an attribute name, also accessed with __repr__(); this is used to auto-generate the names of files and can be set at creation. Note though that any operations that modify the RecordCollection’s contents will update the name to include what occurred.

Customizations

The Records are contained within a set and as such many of the set operations are defined: pop, union, in, etc. Records are hashed with their WOS string so no duplication can occur. The comparison operators <, <=, >, >= are based strictly on the number of Records within the collection, while equality looks for an exact match on the Records.

__Init__

inCollection is the object containing the information about the Records to be constructed; it can be an isi file, string, list of records or directory containing isi files.

Parameters

inCollection : optional [str] or None

the name of the source of WOS records. It can be skipped to produce an empty collection.

If a file is provided, it is first checked to see if it is a WOS file (the header is checked). Then records are read from it one by one until the 'EF' string is found, indicating the end of the file.

If a directory is provided, each file in the directory is checked for the correct header and all those that have it are then read like individual files. The records are then collected into a single set in the RecordCollection.

name : optional [str]

The name of the RecordCollection, defaults to empty string. If left empty the name of the RecordCollection is set to the name of the file or directory used to create the collection. If provided, the name is set to name

extension : optional [str]

The extension to search for when reading a directory for files. extension is the suffix searched for when a directory is read for files, by default it is empty so all files are read.

cached : optional [bool]

Default False, if True and the inCollection is a directory (a string giving the path to a directory) then the initialized RecordCollection will be saved in the directory as a Python pickle with the suffix '.mkDirCache'. Then if the RecordCollection is initialized a second time it will be recovered from the file, which is much faster than reparsing every file in the directory.

metaknowledge saves the names of the parsed files as well as their last modification times and will check these when recreating the RecordCollection, so modifying existing files or adding new ones will result in the entire directory being reanalyzed and a new cache file being created. The extension given to __init__() is taken into account as well and each suffix is given its own cache.

Note The pickle allows for arbitrary python code execution so only use caches that you trust.
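The freshness check described above (file names plus last modification times) can be sketched like this; the cache layout is a hypothetical stand-in for the actual '.mkDirCache' pickle:

```python
import os
import pickle
import tempfile

def cache_is_fresh(cache_path, source_files):
    # Stale if the cache is missing or any source file was added,
    # removed or modified since the recorded modification times
    if not os.path.exists(cache_path):
        return False
    with open(cache_path, 'rb') as f:
        recorded = pickle.load(f)
    current = {p: os.path.getmtime(p) for p in source_files}
    return recorded == current

# Demonstration with a throwaway directory
d = tempfile.mkdtemp()
src = os.path.join(d, 'records.isi')
with open(src, 'w') as f:
    f.write('FN Thomson Reuters Web of Science\n')
cache = os.path.join(d, 'dir.mkDirCache')
with open(cache, 'wb') as f:
    pickle.dump({src: os.path.getmtime(src)}, f)
print(cache_is_fresh(cache, [src]))  # True
os.utime(src, (0, 0))                # simulate a modified file
print(cache_is_fresh(cache, [src]))  # False
```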


The RecordCollection class has the following methods:


RecordCollection.networkCoCitation(dropAnon=True, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, addCR=False):

Creates a co-citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default True, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, if True an extra piece of information is stored with each node. The extra information is determined by nodeType.

fullInfo : optional [bool]

default False, if True the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

detailedCore : optional [bool or iterable[WOS tag Strings]]

default True, if True and the nodeType is 'full', all nodes from the core (those of records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. Also a second attribute called inCore is created for all nodes; it is a boolean describing whether the node is in the core.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.networkCoAuthor()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx Graph

A networkx graph with hashes as ID and co-citation as edges


RecordCollection.networkCitation(dropAnon=False, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, directed=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, recordToCite=True, addCR=False):

Creates a citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default False, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, whether an extra piece of information is stored with each node.

fullInfo : optional [bool]

default False, whether the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

directed : optional [bool]

Determines if the output graph is directed, default True

detailedCore : optional [bool or iterable[WOS tag Strings]]

default True, if True and the nodeType is 'full', all nodes from the core (those of records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. Also a second attribute called inCore is created for all nodes; it is a boolean describing whether the node is in the core.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.networkCoAuthor()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx DiGraph or Networkx Graph

See directed for explanation of returned type

A networkx digraph with hashes as ID and citations as edges


RecordCollection.networkBibCoupling(weighted=True, fullInfo=False):

Creates a bibliographic coupling network based on citations for the RecordCollection.

Parameters

weighted : optional bool

Default True, if True the weight of the edges will be added to the network

fullInfo : optional bool

Default False, if True the full citation string will be added to each of the nodes of the network.

Returns

Networkx Graph

A graph of the bibliographic coupling


RecordCollection.yearSplit(startYear, endYear, dropMissingYears=True):

Creates a RecordCollection of Records from the years between startYear and endYear inclusive.

Parameters

startYear : int

The smallest year to be included in the returned RecordCollection

endYear : int

The largest year to be included in the returned RecordCollection

dropMissingYears : optional [bool]

Default True, if True Records with missing years will be dropped. If False a TypeError exception will be raised

Returns

RecordCollection

A RecordCollection of Records from startYear to endYear
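The year-range filter can be sketched over plain dicts; the toy records stand in for Records and the TypeError mirrors the behaviour described for dropMissingYears:

```python
def year_split(records, start_year, end_year, drop_missing_years=True):
    # Keep records whose 'year' falls in [start_year, end_year] inclusive;
    # records without a year are dropped or raise TypeError
    kept = []
    for rec in records:
        year = rec.get('year')
        if year is None:
            if drop_missing_years:
                continue
            raise TypeError('Record is missing a year')
        if start_year <= year <= end_year:
            kept.append(rec)
    return kept

print(year_split([{'year': 1990}, {'year': 2001}], 1990, 1999))
# [{'year': 1990}]
```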


RecordCollection.localCiteStats(pandasFriendly=False, keyType='citation'):

Returns a dict with all the citations in the CR field as keys and the number of times they occur as the values

Parameters

pandasFriendly : optional [bool]

default False, if True the output is a dict with two keys: 'Citations' contains the citations and 'Counts' their occurrence counts.

keyType : optional [str]

default 'citation', the type of key to use for the dictionary; the valid strings are 'citation', 'journal', 'year' or 'author'. If changed from 'citation' all citations matching the requested option will be contracted and their counts added together.

Returns

dict[str, int or Citation : int]

A dictionary with keys as given by keyType and integers giving their rates of occurrence in the collection
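The keyType contraction can be sketched with toy citation dicts (the 'original' key stands in for a full citation string; illustrative only):

```python
from collections import Counter

def local_cite_stats(cited_lists, key_type='citation'):
    # cited_lists: one list of citation dicts per Record; contract by
    # key_type ('citation', 'journal', 'year' or 'author') and count
    counts = Counter()
    for cites in cited_lists:
        for c in cites:
            key = c['original'] if key_type == 'citation' else c[key_type]
            counts[key] += 1
    return dict(counts)

cited = [
    [{'original': 'Doe A, 1990, J THEOR', 'journal': 'J THEOR'}],
    [{'original': 'Roe B, 1991, J THEOR', 'journal': 'J THEOR'}],
]
print(local_cite_stats(cited, key_type='journal'))  # {'J THEOR': 2}
```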


RecordCollection.localCitesOf(rec):

Takes in a Record, WOS string, citation string or Citation and returns a RecordCollection of all records that cite it.

Parameters

rec : Record, str or Citation

The object that is being cited

Returns

RecordCollection

A RecordCollection containing only those Records that cite rec


RecordCollection.citeFilter(keyString='', field='all', reverse=False, caseSensitive=False):

Filters Records by some string, keyString, in their citations and returns all Records with at least one citation possessing keyString in the field given by field.

Parameters

keyString : optional [str]

Default '', gives the string to be searched for; if it is blank then all citations with the specified field will be matched

field : optional [str]

Default 'all', gives the component of the citation to be looked at; it can be one of a few strings. The default 'all' causes the entire original Citation to be searched and can be used to search across fields, e.g. '1970, V2' is a valid keyString. The other options are:

  • 'author', searches the author field
  • 'year', searches the year field
  • 'journal', searches the journal field
  • 'V', searches the volume field
  • 'P', searches the page field
  • 'misc', searches all the remaining uncategorized information
  • 'anonymous', searches for anonymous Citations, keyString is not ignored
  • 'bad', searches for bad citations, keyString is not used

reverse : optional [bool]

Default False, being set to True causes all Records not matching the query to be returned

caseSensitive : optional [bool]

Default False, if True causes the search across the original to be case sensitive, only the 'all' option can be case sensitive
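With field='all' the filter reduces to a substring search over each citation's original text, case-insensitive unless caseSensitive is set, and inverted when reverse is True. A standalone sketch of that logic (the function name and sample citations are invented):

```python
def citeFilterSketch(citations, keyString="", reverse=False, caseSensitive=False):
    """Keep the citations whose text contains keyString (the field='all' case)."""
    needle = keyString if caseSensitive else keyString.lower()
    kept = []
    for cite in citations:
        text = cite if caseSensitive else cite.lower()
        matched = needle in text
        if matched != reverse:   # reverse=True keeps the non-matching citations
            kept.append(cite)
    return kept

cites = ["BREVIK I, 1970, V2, P30", "ANICIN B, 1981, V5"]
print(citeFilterSketch(cites, "1970, v2"))  # ['BREVIK I, 1970, V2, P30']
```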


RecordCollection.dropNonJournals(ptVal='J', dropBad=True, invert=False):

Drops the non-journal Records from the collection; this is done by checking ptVal against the PT tag

Parameters

ptVal : optional [str]

Default 'J', the value of the PT tag to be kept; 'J' is the journal tag, other tags can be substituted.

dropBad : optional [bool]

Default True, if True bad Records will be dropped as well as those that are not journal entries

invert : optional [bool]

Default False, set True to drop journals (or the PT tag given by ptVal) instead of keeping them. Note, it still drops bad Records if dropBad is True


RecordCollection.writeFile(fname=None):

Writes the RecordCollection to a file; the written file's format is identical to those downloaded from WOS. The order of Records written is random.

Parameters

fname : optional [str]

Default None, if given the output file will be written to fname; if None the first 200 characters of the RecordCollection's name are used with the suffix .isi


RecordCollection.writeCSV(fname=None, splitByTag=None, onlyTheseTags=None, numAuthors=True, genderCounts=True, longNames=False, firstTags=None, csvDelimiter=',', csvQuote='"', listDelimiter='|'):

Writes all the Records from the collection into a csv file with each row a record and each column a tag.

Parameters

fname : optional [str]

Default None, the name of the file to write to, if None it uses the collections name suffixed by .csv.

splitByTag : optional [str]

Default None, if a tag is given the output will be divided into different files according to the value of the tag, with each file containing only the records associated with that value. For example if 'authorsFull' is given then each file will only have the lines for Records in which that author is named.

The file names are the values of the tag followed by a dash then the normal name for the file as given by fname, e.g. for the year 2016 the file could be called '2016-fname.csv'.

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc.) only the tags in onlyTheseTags will be used; if not given then all tags in the records are used.

If you want to use all known tags pass metaknowledge.knownTagsList.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

firstTags : optional [iterable]

Default None, if None the iterable ['UT', 'PT', 'TI', 'AF', 'CR'] is used. The tags given by the iterable are the first ones in the csv in the order given.

Note if tags are in firstTags but not in onlyTheseTags, onlyTheseTags will override firstTags

csvDelimiter : optional [str]

Default ',', the delimiter used for the cells of the csv file.

csvQuote : optional [str]

Default '"', the quote character used for the csv.

listDelimiter : optional [str]

Default '|', the delimiter used between values of the same cell if the tag for that record has multiple outputs.


RecordCollection.writeBib(fname=None, maxStringLength=1000, wosMode=False, reducedOutput=False, niceIDs=True):

Writes a bibTex entry to fname for each Record in the collection.

If the Record is a journal article (PT J) the bibtex type is set to 'article', otherwise it is set to 'misc'. The ID of the entry is the WOS number and all the Record's fields are given as entries with their long names.

Note This is not meant to be used directly with LaTeX; none of the special characters have been escaped and a large number of unnecessary fields are included. niceID and maxStringLength are provided only to make conversions easier.

Note Record entries that are lists have their values separated with the string ' and ', as this is the way bibTex understands lists

Parameters

fname : optional [str]

Default None, the name of the file to be written. If not given one will be derived from the collection and the file will be written to .

maxStringLength : optional [int]

Default 1000, the max length for a continuous string. Most bibTex implementations only allow strings up to 1000 characters; this splits them into substrings then uses the native string concatenation (the '#' character) to allow for longer strings

WOSMode : optional [bool]

Default False, if True the data produced will be unprocessed and use double curly braces. This is the style WOS produces bib files in and mostly matches it.

restrictedOutput : optional [bool]

Default False, if True the tags output will be limited to: 'AF', 'BF', 'ED', 'TI', 'SO', 'LA', 'NR', 'TC', 'Z9', 'PU', 'J9', 'PY', 'PD', 'VL', 'IS', 'SU', 'PG', 'DI', 'D2', and 'UT'

niceID : optional [bool]

Default True, if True the IDs used will be derived from the authors, publishing date and title; if False the UT tag is used


RecordCollection.findProbableCopyright():

Finds the (likely) copyright string from all abstracts in the RecordCollection

Returns

list[str]

A deduplicated list of all the copyright strings


RecordCollection.forBurst(tag, outputFile=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, stemmer=None):

Creates a pandas friendly dictionary with 2 columns, one 'year' and the other 'word'. Each row is a word that occurred in the field given by tag in a Record, paired with the year of that Record. Unfortunately getting the month or day with any accuracy has proved impossible, so year is the only option.

Parameters

tag : str

The tag giving the field for the words to be extracted from.

outputFile : optional str

Default None, if a path is given a csv file will be created from the returned dictionary and written to that file

dropList : optional list[str]

Default None, if a list of strings is given each field will be checked, before any other processing, for substrings matching those in dropList. The strings will only be dropped if they are surrounded on both sides with spaces (' '), so if dropList = ['a'] then 'a cat' will become 'cat'.

lower : optional bool

default True, if True the output will be made lower case

removeNumbers : optional bool

default True, if True all numbers will be removed

removeNonWords : optional bool

default True, if True all non-word, non-number characters will be removed

removeWhitespace : optional bool

default True, if True all whitespace will be converted to a single space (' ')

stemmer : optional func

default None, if a function is provided it will be run on each individual word in the field and the output will replace it. For example to use the PorterStemmer in the nltk package you would give nltk.PorterStemmer().stem
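The cleaning flags correspond to simple string transformations applied in sequence; the sketch below reproduces them in plain Python under that assumption (the helper name and sample text are invented):

```python
import re

def cleanWords(text, dropList=None, lower=True, removeNumbers=True,
               removeNonWords=True, removeWhitespace=True, stemmer=None):
    # dropList matching runs first and needs spaces on both sides of the match
    for drop in dropList or []:
        text = text.replace(" {} ".format(drop), " ")
    if lower:
        text = text.lower()
    if removeNumbers:
        text = re.sub(r"[0-9]", "", text)
    if removeNonWords:
        text = re.sub(r"[^\w\s]", "", text)   # keep word characters and spaces
    if removeWhitespace:
        text = re.sub(r"\s+", " ", text).strip()
    words = text.split(" ")
    if stemmer is not None:                    # e.g. nltk.PorterStemmer().stem
        words = [stemmer(w) for w in words]
    return words

print(cleanWords(" a Study of 90 Networks! ", dropList=["a"]))
# ['study', 'of', 'networks']
```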


RecordCollection.forNLP(outputFile=None, extraColumns=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, removeCopyright=False, stemmer=None):

Creates a pandas friendly dictionary with each row a Record in the RecordCollection and the columns the fields natural language processing uses (id, title, publication year, keywords and the abstract). The abstract is by default processed to remove non-word, non-space characters and the case is lowered.

Parameters

outputFile : optional str

default None, if a file path is given a csv of the returned data will be written

extraColumns : optional list[str]

default None, if a list of tags is given each of the tag’s values for a Record will be added to the output(s)

dropList : optional list[str]

default None, if a list of strings is provided they will be dropped from the output’s abstracts. The matching is case sensitive and done before any other processing. The strings will only be dropped if they are surrounded on both sides with spaces (' ') so if dropList = ['a'] then 'a cat' will become 'cat'.

lower : optional bool

default True, if True the abstract will be made lower case

removeNumbers : optional bool

default True, if True all numbers will be removed

removeNonWords : optional bool

default True, if True all non-word, non-number characters will be removed

removeWhitespace : optional bool

default True, if True all whitespace will be converted to a single space (' ')

removeCopyright : optional bool

default False, if True the copyright statement at the end of the abstract will be removed and added to a new column. Note this is heuristic based and will not work for all papers.

stemmer : optional func

default None, if a function is provided it will be run on each individual word in the abstract and the output will replace it. For example to use the PorterStemmer in the nltk package you would give nltk.PorterStemmer().stem


RecordCollection.makeDict(onlyTheseTags=None, longNames=False, raw=False, numAuthors=True, genderCounts=True):

Returns a dict with each key a tag and the values being lists of the values for each of the Records in the collection, None is given when there is no value and they are in the same order across each tag.

When used with pandas: pandas.DataFrame(RC.makeDict()) returns a data frame with each column a tag and each row a Record.

Parameters

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc.) only the tags in onlyTheseTags will be used; if not given then all tags in the records are used.

If you want to use all known tags pass metaknowledge.knownTagsList.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

raw : optional [bool]

Default False, if False the processed values for each Record's field will be provided, otherwise the raw values are given.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.
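The return value is column-oriented: one list per tag, with parallel indices and None padding, which is exactly the shape pandas.DataFrame consumes. A plain-Python illustration with made-up values:

```python
# Made-up makeDict()-style output: one list per tag, one slot per Record
table = {
    "UT": ["WOS:0001", "WOS:0002"],
    "PY": [1970, None],          # None marks a Record missing that tag
    "TI": ["Paper one", "Paper two"],
}

# Reading index i across every column reconstructs Record i, the same
# orientation pandas.DataFrame(table) would give as rows.
rows = [{tag: col[i] for tag, col in table.items()}
        for i in range(len(table["UT"]))]

print(rows[1])  # {'UT': 'WOS:0002', 'PY': None, 'TI': 'Paper two'}
```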


RecordCollection.rpys(minYear=None, maxYear=None, dropYears=None, rankEmptyYears=False):

This implements Referenced Publication Years Spectroscopy, a technique for finding important years in citation data. The authors of the original papers have a website with more information.

This function computes the spectra of the RecordCollection and returns a dictionary mapping strings to lists of ints. Each list is ordered and the values of each with the same index form a row and each list a column. The strings are the names of the columns. This is intended to be read directly by pandas DataFrames.

The columns returned are:

  1. 'year', the years of the counted citations, missing years are inserted with a count of 0, unless they are outside the bounds of the highest year or the lowest year and the default value is used. e.g. if the highest year is 2016, 2017 will not be inserted unless maxYear has been set to 2017 or higher
  2. 'count', the number of times the year was cited
  3. 'abs-deviation', deviation from the 5-year median. Calculated by taking the absolute deviation of the count from the median of it and the next 2 years and the preceding 2 years
  4. 'rank', the rank of the year, the highest ranked year being the one with the highest deviation, the second highest being the second highest deviation and so on. All years with 0 count are given the rank 0 by default
Parameters

minYear : optional int

Default 1000, The lowest year to be returned, note years outside this bound will be used to calculate the deviation from the 5-year median

maxYear : optional int

Default 2100, The highest year to be returned, note years outside this bound will be used to calculate the deviation from the 5-year median

dropYears : optional int or list[int]

Default None, year or collection of years that will be removed from the returned value; note the dropped years will still be used to calculate the deviation from the 5-year median

rankEmptyYears : optional [bool]

Default False, if True years with 0 count will be ranked according to their deviance, if many 0 count years exist their ordering is not guaranteed to be stable

Returns

dict[str:list]

The table of values from the Referenced Publication Years Spectroscopy
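The 'abs-deviation' column described above can be reproduced from the count series alone: each year's value is the absolute difference between its count and the median of the five-year window centred on it (truncated at the ends). A sketch with invented counts:

```python
from statistics import median

# Invented citation counts per referenced year, consecutive and gap-free
counts = {1990: 4, 1991: 6, 1992: 20, 1993: 5, 1994: 7, 1995: 6, 1996: 5}
years = sorted(counts)

absDeviation = {}
for i, year in enumerate(years):
    # the year itself, up to two years before and up to two years after
    window = [counts[years[j]]
              for j in range(max(0, i - 2), min(len(years), i + 3))]
    absDeviation[year] = abs(counts[year] - median(window))

print(absDeviation[1992])  # 14, the kind of peak RPYS is meant to surface
```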


RecordCollection.genderStats(asFractions=False):

Creates a dict ({'Male' : maleCount, 'Female' : femaleCount, 'Unknown' : unknownCount}) with the numbers of male, female and unknown names in the collection.

Parameters

asFractions : optional bool

Default False, if True the counts will be divided by the total number of names, giving the fraction of names in each category instead of the raw counts.

Returns

dict[str:int]

A dict with three keys 'Male', 'Female' and 'Unknown' mapping to their respective counts


RecordCollection.getCitations(field=None, values=None, pandasFriendly=True, counts=True):

Creates a pandas ready dict with each row a different citation in the contained Records and columns containing the original string, year, journal, author's name and the number of times it occurred.

There are also options to filter the output citations with field and values

Parameters

field : optional str

Default None, if given all citations missing the named field will be dropped.

values : optional str or list[str]

Default None, if field is also given only those citations with one of the strings given in values will be included.

e.g. to get only citations from 1990 or 1991: field = 'year', values = [1991, 1990]

pandasFriendly : optional bool

Default True, if False a list of the citations will be returned instead of the more complicated pandas dict

counts : optional bool

Default True, if False the counts columns will be removed

Returns

dict

A pandas ready dict with all the Citations


RecordCollection.networkCoAuthor(detailedInfo=False, weighted=True, dropNonJournals=False, count=True, useShortNames=False):

Creates a coauthorship network for the RecordCollection.

Parameters

detailedInfo : optional [bool or iterable[WOS tag Strings]]

Default False, if True all nodes will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['PY', 'TI', 'SO', 'VL', 'BP'].

If detailedInfo is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attributes.

For each of the selected tags an attribute will be added to the node using the values of those tags on the first Record encountered. Warning iterating over RecordCollection objects is not deterministic; the first Record will not always be the same between runs. The node will be given attributes with the names of the WOS tags for each of the selected tags. The attributes will contain strings of the values (with commas removed); if multiple values are encountered they will be comma separated.

Note: detailedInfo is not identical to the detailedCore argument of Recordcollection.networkCoCitation() or Recordcollection.networkCitation()

weighted : optional [bool]

Default True, whether the edges are weighted. If True the edges are weighted by the number of co-authorships.

dropNonJournals : optional [bool]

Default False, whether to drop authors from non-journals

count : optional [bool]

Default True, causes the number of occurrences of a node to be counted

Returns

Networkx Graph

A networkx graph with author names as nodes and collaborations as edges.


ScopusRecord(ExtendedRecord):

ScopusRecord.__init__(inRecord, sFile='', sLine=0):

Class for full Scopus entries.

This class is an ExtendedRecord capable of generating its own id number. You should not create them directly, but instead use scopusParser() on a scopus CSV file.


The ScopusRecord class has the following methods:


ScopusRecord.encoding():

An abstractmethod, gives the encoding string of the record.

Returns

str

The encoding


ScopusRecord.getAltName(tag):

An abstractmethod, gives the alternate name of tag or None

Parameters

tag : str

The requested tag

Returns

str

The alternate name of tag or None


ScopusRecord.tagProcessingFunc(tag):

An abstractmethod, gives the function for processing tag

Parameters

tag : optional [str]

The tag in need of processing

Returns

function

The function to process the raw tag


ScopusRecord.specialFuncs(key):

An abstractmethod, process the special tag, key using the whole Record

Parameters

key : str

One of the special tags: 'authorsFull', 'keywords', 'grants', 'j9', 'authorsShort', 'volume', 'selfCitation', 'citations', 'address', 'abstract', 'title', 'month', 'year', 'journal', 'beginningPage' and 'DOI'

Returns

The processed value of key


ScopusRecord.writeRecord(f):

An abstractmethod, writes the record in its original form to the file f

Parameters

f : writable file

The file to be written to


contour

Two functions based on matplotlib for generating nicer looking graphs

This is the only module that depends on anything besides networkx; it depends on numpy, scipy and matplotlib.

The contour module provides the following functions:


contour.graphDensityContourPlot(G, iters=50, layout=None, layoutScaleFactor=1, overlay=False, nodeSize=10, axisSamples=100, blurringFactor=0.1, contours=15, graphType='coloured'):

Creates a 3D plot giving the density of nodes on a 2D plane, as a surface in 3D.

Most of the options are for tweaking the final appearance. layout and layoutScaleFactor allow a precomputed layout to be provided. If a layout is not provided networkx.spring_layout() is used after iters iterations. Then, once the graph has been laid out, a grid of axisSamples cells by axisSamples cells is overlaid and the number of nodes in each cell is determined; a gaussian blur is then applied with a sigma of blurringFactor. This forms a surface in 3 dimensions, which is then plotted.

If you find the resultant image looks too banded raise the contours number to ~50.

Parameters

G : networkx Graph

The graph to be plotted

iters : optional [int]

Default 50, the number of iterations for the spring layout if layout is not provided.

layout : optional [networkx layout dictionary]

Default None, if provided will be used as a layout of the graph; the maximum distance from the origin along any axis must also be given as layoutScaleFactor, which is by default 1.

layoutScaleFactor : optional [double]

Default 1, The maximum distance from the origin allowed along any axis given by layout, i.e. the layout must fit in a square centered at the origin with side lengths 2 * layoutScaleFactor

overlay : optional [bool]

Default False, if True the 2D graph will be plotted on the X-Y plane at Z = 0.

nodeSize : optional [double]

Default 10, the size of the nodes drawn in the overlay

axisSamples : optional [int]

Default 100, the number of cells used along each axis for sampling. A larger number will mean a lower average density.

blurringFactor : optional [double]

Default 0.1, the sigma value used for smoothing the surface density. The higher this number the smoother the surface.

contours : optional [int]

Default 15, the number of different heights drawn. If this number is low the resultant image will look very banded. It is recommended this be raised above 50 if you want your images to look good, Warning this will make them much slower to generate and interact with.

graphType : optional [str]

Default 'coloured', if 'coloured' the image will have a density based colourization applied; the only other option is 'solid', which removes the colourization.


contour.quickVisual(G, showLabel=False):

Just makes a simple matplotlib figure and displays it, with each node coloured by its type. You can add labels with showLabel. This looks a bit nicer than the one provided by networkx's defaults.

Parameters

showLabel : optional [bool]

Default False, if True labels will be added to the nodes giving their IDs.


WOS

These are the functions used to process WOS files at the backend. They are meant for internal use by metaknowledge.

The WOS module provides the following functions:


WOS.recordParser(paper):

This is the function that is used to create Records from files.

recordParser() reads the file paper until it reaches 'ER'. For each field tag it adds an entry to the returned dict with the tag as the key and a list of the entries as the value; the list has each line separately, so for the following two lines in a record:

AF BREVIK, I
   ANICIN, B

The entry in the returned dict would be {'AF' : ["BREVIK, I", "ANICIN, B"]}

Record objects can be created with these dictionaries as the initializer.

Parameters

paper : file stream

An open file, with the current line at the beginning of the WOS record.

Returns

OrderedDict[str : List[str]]

A dictionary mapping WOS tags to lists, the lists are of strings, each string is a line of the record associated with the tag.
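The behaviour described above is easy to sketch for a single record: a line whose first two characters are not spaces opens a new tag, indented lines continue the current tag, and 'ER' terminates the record. A simplified stand-in for the real parser (error handling and file-stream details omitted):

```python
from collections import OrderedDict

def parseOneRecord(lines):
    """Map two-character WOS field tags to their lines, stopping at 'ER'."""
    fields = OrderedDict()
    tag = None
    for line in lines:
        if line.startswith("ER"):
            break
        if line[:2] != "  ":         # a new tag starts the line
            tag = line[:2]
            fields[tag] = [line[3:]]
        else:                        # indented continuation of the current tag
            fields[tag].append(line[3:])
    return fields

record = [
    "PT J",
    "AF BREVIK, I",
    "   ANICIN, B",
    "ER",
]
print(parseOneRecord(record))
# OrderedDict([('PT', ['J']), ('AF', ['BREVIK, I', 'ANICIN, B'])])
```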


WOS.getMonth(s):

Known formats:

  • Month ("%b")
  • Month Day ("%b %d")
  • Month-Month ("%b-%b"), coerced to the first %b, dropping the month range
  • Season ("%s"), coerced to use the first month of the given season
  • Month Day Year ("%b %d %Y")
  • Month Year ("%b %Y")
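A sketch of those coercions: take the month (or season) token at the front of the value, dropping any day, year or month-range parts. The season-to-month table here is an assumption for illustration, not the library's own mapping:

```python
from datetime import datetime

# Assumed season abbreviations mapped to their first month (illustrative only)
seasons = {"SPR": 3, "SUM": 6, "FAL": 9, "WIN": 12}

def getMonthSketch(s):
    """Return the month of a WOS PD-style date string as an int 1-12."""
    token = s.split(" ")[0].split("-")[0]   # drop day/year and month ranges
    if token.upper() in seasons:
        return seasons[token.upper()]
    return datetime.strptime(token[:3], "%b").month

print(getMonthSketch("JUL-AUG 2015"))  # 7
print(getMonthSketch("SUM 2016"))      # 6
```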


WOS.confHost(val):

The HO Tag

extracts the host of the conference

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The host


WOS.publisherAddress(val):

The PA Tag

extracts the publisher's address

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The publisher address


WOS.endingPage(val):

The EP Tag

returns the last page the record occurs on as a string, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The final page number


WOS.year(val):

The PY Tag

extracts the year the record was published in as an int

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The year


WOS.authKeywords(val):

The DE Tag

extracts the keywords assigned by the author of the Record. The WOS description is:

Author keywords are included in records of articles from 1991 forward. They are also include in conference proceedings records.
Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The list of keywords


WOS.reprintAddress(val):

The RP Tag

extracts the reprint address string

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The reprint address


WOS.bookAuthor(val):

The BA Tag

extracts a list of the short names of the authors of a book Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of shortened author’s names


WOS.totalTimesCited(val):

The Z9 Tag

extracts the total number of citations of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The total number of citations


WOS.partNumber(val):

The PN Tag

return an integer giving the part of the issue the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The part of the issue of the Record


WOS.specialIssue(val):

The SI Tag

extracts the special issue value

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The special issue value


WOS.subjects(val):

The WC Tag

extracts a list of subjects as assigned by WOS

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The subjects list


WOS.keywords(val):

The ID Tag

extracts the WOS keywords of the Record. The WOS description is:

KeyWords Plus are index terms created by Thomson Reuters from significant, frequently occurring words in the titles of an article's cited references.
Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The keyWords list


WOS.pubMedID(val):

The PM Tag

extracts the pubmed ID of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The pubmed ID


WOS.documentDeliveryNumber(val):

The GA Tag

extracts the document delivery number of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The document delivery number


WOS.bookAuthorFull(val):

The BF Tag

extracts a list of the long names of the authors of a book Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of author’s names


WOS.groupName(val):

The CA Tag

extracts the name of the group associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The group’s name


WOS.title(val):

The TI Tag

extracts the title of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the record


WOS.editors(val):

Needs Work

currently not well understood, returns val


WOS.journal(val):

The SO Tag

extracts the full name of the publication and normalizes it to uppercase

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The name of the journal


WOS.seriesTitle(val):

The SE Tag

extracts the title of the series the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the series


WOS.seriesSubtitle(val):

The BS Tag

extracts the title of the series the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The subtitle of the series


WOS.language(val):

The LA Tag

extracts the languages of the Record as a string with languages separated by ', '; usually there is only one language

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The language(s) of the record


WOS.docType(val):

The DT Tag

extracts the type of document the Record contains

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The type of the Record


WOS.authorsFull(val):

The AF Tag

extracts a list of authors full names

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of author’s names


WOS.confTitle(val):

The CT Tag

extracts the title of the conference associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the conference


WOS.confDate(val):

The CY Tag

extracts the date string of the conference associated with the Record, the date is not normalized

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The date of the conference


WOS.confSponsors(val):

The SP Tag

extracts a list of sponsors for the conference associated with the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The list of sponsors


WOS.wosTimesCited(val):

The TC Tag

extracts the number of times the Record has been cited by records in WOS

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The number of times the Record has been cited


WOS.authAddress(val):

The C1 Tag

extracts the address of the authors as given by WOS. Warning the mapping of author to address is not very good and is given in multiple ways.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of addresses


WOS.confLocation(val):

The CL Tag

extracts the string giving the conference's location

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The conference's address


WOS.j9(val):

The J9 Tag

extracts the J9 (29-Character Source Abbreviation) of the publication

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The 29-Character Source Abbreviation


WOS.funding(val):

The FU Tag

extracts a list of the groups funding the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of funding groups


WOS.subjectCategory(val):

The SC Tag

extracts a list of the subjects associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of the subjects associated with the Record


WOS.group(val):

The GP Tag

extracts the group associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The name of the group


WOS.citations(val):

The CR Tag

extracts a list of all the citations in the record; the citations are metaknowledge.Citation objects.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[metaknowledge.Citation]

A list of Citations


WOS.publisherCity(val):

The PI Tag

extracts the city the publisher is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The city of the publisher


WOS.ISSN(val):

The SN Tag

extracts the ISSN of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The ISSN string


WOS.articleNumber(val):

The AR Tag

extracts a string giving the article number, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The article number


WOS.issue(val):

The IS Tag

extracts a string giving the issue or range of issues the Record was in, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The issue number/range


WOS.email(val):

The EM Tag

extracts a list of emails given by the authors of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of emails


WOS.eISSN(val):

The EI Tag

extracts the EISSN of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The EISSN string


WOS.DOI(val):

The DI Tag

return the DOI number of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The DOI number string


WOS.wosString(val):

The UT Tag

extracts the WOS number of the record as a string preceded by 'WOS:'

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The WOS number


WOS.orcID(val):

The OI Tag

extracts a list of ORCID IDs of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The ORCID IDs


WOS.pubType(val):

The PT Tag

extracts the type of publication as a character: conference, book, journal, book in series, or patent

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

A string giving the publication type


WOS.editedBy(val):

The BE Tag

extracts a list of the editors of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of editors


WOS.meetingAbstract(val):

The MA Tag

extracts the ID of the meeting abstract, prefixed by 'EPA-'

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The meeting abstract ID


WOS.isoAbbreviation(val):

The JI Tag

extracts the iso abbreviation of the journal

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The iso abbreviation of the journal


WOS.pageCount(val):

The PG Tag

returns an integer giving the number of pages of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The page count


WOS.publisher(val):

The PU Tag

extracts the publisher of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The publisher


WOS.ISBN(val):

The BN Tag

extracts a list of ISBNs associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list

The ISBNs


WOS.month(val):

The PD Tag

extracts the month the record was published in as an int with January as 1, February 2, …

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

An integer giving the month
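
The PD-to-integer conversion described above can be sketched as a table lookup; the month table and sample values below are illustrative assumptions, not metaknowledge’s internal code:

```python
# Illustrative sketch of mapping a WOS PD value such as "JAN" or
# "JAN 15" to an integer month, as WOS.month() is described to do.
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

def month_from_pd(pd_value):
    """Return the month as an int, with January as 1."""
    # Take the first word of the field and match on its first 3 letters
    return MONTHS[pd_value.split()[0].upper()[:3]]

print(month_from_pd("JAN 15"))  # 1
```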


WOS.fundingText(val):

The FX Tag

extracts the funding acknowledgement text

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The funding acknowledgement


WOS.bookDOI(val):

The D2 Tag

extracts the book DOI of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The DOI number


WOS.volume(val):

The VL Tag

returns the volume the Record is in as a string, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The volume number


WOS.ResearcherIDnumber(val):

The RI Tag

extracts a list of the ResearcherIDs of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The list of ResearcherIDs


WOS.authorsShort(val):

The AU Tag

extracts a list of the authors’ shortened names

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of the authors’ shortened names


WOS.citedRefsCount(val):

The NR Tag

extracts the number of citations, i.e. the length of the CR list

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The number of CRs


WOS.beginningPage(val):

The BP Tag

extracts the first page the record occurs on, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The first page number


WOS.abstract(val):

The AB Tag

returns the abstract of the Record, with newlines hopefully in the correct places

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The abstract


WOS.supplement(val):

The SU Tag

extracts the supplement number

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The supplement number


WOS.wosParser(isifile):

This is the function used to create RecordCollections from files.

wosParser() reads the file given by the path isifile, checks that the header is correct then reads until it reaches EF. All WOS records it encounters are parsed with recordParser() and converted into Records. A list of these Records is returned.

BadWOSFile is raised if an issue is found with the file.

Parameters

isifile : str

The path to the target file

Returns

List[Record]

All the Records found in isifile


WOS.isWOSFile(infile, checkedLines=3):

Determines if infile is the path to a WOS file. A file is considered to be a WOS file if it has the correct encoding (utf-8 with a BOM) and within the first checkedLines a line starts with "VR 1.0".

Parameters

infile : str

The path to the target file

checkedLines : optional [int]

default 3, the number of lines to check for the header

Returns

bool

True if the file is a WOS file
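
The header check described above can be sketched with the standard library alone; this is an illustrative re-implementation under the stated assumptions, not metaknowledge’s own code (the utf-8-sig codec handles the BOM):

```python
import tempfile, os

def looks_like_wos(path, checked_lines=3):
    """Sketch of the header test isWOSFile() is described to perform:
    decode as UTF-8 with a BOM (the utf-8-sig codec) and look for a
    line starting with "VR 1.0" in the first checked_lines lines."""
    try:
        with open(path, encoding="utf-8-sig") as f:
            for _ in range(checked_lines):
                if f.readline().startswith("VR 1.0"):
                    return True
    except (UnicodeDecodeError, OSError):
        pass
    return False

# A minimal WOS-style header, written to a temporary file for the demo
with tempfile.NamedTemporaryFile("w", delete=False,
                                 encoding="utf-8-sig") as tmp:
    tmp.write("FN Thomson Reuters Web of Science\nVR 1.0\n")
    demo = tmp.name

print(looks_like_wos(demo))  # True
os.remove(demo)
```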


medline

These are the functions used to process medline (pubmed) files at the backend. They are meant for internal use by metaknowledge.

The medline module provides the following functions:


medline.medlineParser(pubFile):

Parses a medline file, pubFile, to extract the individual entries as MedlineRecords.

A medline file is a series of entries, each of which is a series of tags. A tag is a 2 to 4 character string, left-padded with spaces to 4 characters and followed by a dash and a space ('- '). Everything after the tag, and on all following lines not starting with a tag, is considered associated with that tag. Each entry’s first tag is PMID, so a first line looks something like PMID- 26524502. Entries end with a single blank line.

Parameters

pubFile : str

A path to a valid medline file, use isMedlineFile to verify

Returns

set[MedlineRecord]

Records for each of the entries
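
The tag layout described above can be sketched as follows; this toy parser is illustrative only and omits the validation the real medlineParser() performs:

```python
from collections import OrderedDict

def parse_medline_entry(lines):
    """Toy version of the medline tag layout: a 2-4 character tag
    left-padded to width 4, then '- '; lines not starting with a tag
    continue the previous tag's value."""
    entry = OrderedDict()
    tag = None
    for line in lines:
        if len(line) > 5 and line[4:6] == "- " and line[:4].strip():
            tag = line[:4].strip()
            entry.setdefault(tag, []).append(line[6:].rstrip())
        elif tag is not None:
            # Continuation line: append to the tag's last value
            entry[tag][-1] += " " + line.strip()
    return entry

sample = [
    "PMID- 26524502",
    "TI  - A very long title that wraps",
    "      onto a second line",
    "AU  - Smith J",
]
print(parse_medline_entry(sample)["TI"])
# ['A very long title that wraps onto a second line']
```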


medline.isMedlineFile(infile, checkedLines=2):

Determines if infile is the path to a Medline file. A file is considered to be a Medline file if it has the correct encoding (latin-1) and within the first checkedLines a line starts with "PMID- ".

Parameters

infile : str

The path to the target file

checkedLines : optional [int]

default 2, the number of lines to check for the header

Returns

bool

True if the file is a Medline file


medline.medlineRecordParser(record):

The parser MedlineRecords use. This takes an entry from medlineParser() and parses it as part of the creation of a MedlineRecord.

Parameters

record : enumerate object

a file wrapped by enumerate()

Returns

collections.OrderedDict

An ordered dictionary of the key-value pairs in the entry


medline.FPS(val):

FullPersonalNameSubject


medline.TT(val):

TransliteratedTitle


medline.PROF(val):

PartialRetractionOf


medline.PHST(val):

PublicationHistoryStatus


medline.EFR(val):

ErratumFor


medline.PST(val):

PublicationStatus


medline.SPIN(val):

SummaryForPatients


medline.AU(val):

Author


medline.FED(val):

Editor


medline.NM(val):

SubstanceName


medline.SO(val):

Source


medline.IP(val):

Issue


medline.OABL(val):

OtherAbstract


medline.PUBM(val):

PublishingModel


medline.CRDT(val):

CreateDate


medline.DDIN(val):

DatasetIn


medline.MH(val):

MeSHTerms


medline.DP(val):

DatePublication


medline.GN(val):

GeneralNote


medline.CRF(val):

CorrectedRepublishedFrom


medline.TI(val):

Title, only one per record


medline.CRI(val):

CorrectedRepublishedIn


medline.OT(val):

OtherTerm, nothing needs to be done


medline.ROF(val):

RetractionOf


medline.CN(val):

CorporateAuthor


medline.OTO(val):

OtherTermOwner, a one line field


medline.OID(val):

OtherID


medline.PT(val):

PublicationType


medline.RPI(val):

RepublishedIn


medline.AB(val):

Abstract, basically a one-liner after parsing


medline.EN(val):

Edition


medline.AD(val):

Affiliation. Undoes what the parser does, then splits at the semicolons and drops newlines; extra filtering is required because some AD’s end with a semicolon


medline.LA(val):

Language


medline.MHDA(val):

MeSHDate


medline.TA(val):

JournalTitleAbbreviation, one line only


medline.JT(val):

JournalTitle, one line only


medline.IRAD(val):

InvestigatorAffiliation


medline.PS(val):

PersonalNameSubject


medline.IS(val):

ISSN


medline.PL(val):

PlacePublication


medline.CTI(val):

CollectionTitle


medline.FAU(val):

FullAuthor


medline.VTI(val):

VolumeTitle


medline.DCOM(val):

DateCompleted


medline.LID(val):

LocationIdentifier


medline.BTI(val):

BookTitle


medline.CI(val):

CopyrightInformation


medline.STAT(val):

Status


medline.DRIN(val):

DatasetUseReportedIn


medline.RF(val):

NumberReferences


medline.UIN(val):

UpdateIn


medline.LR(val):

DateLastRevised


medline.IR(val):

Investigator


medline.SFM(val):

SpaceFlightMission


medline.EIN(val):

ErratumIn


medline.AID(val):

ArticleIdentifier, the given values do not require any work


medline.EDAT(val):

EntrezDate


medline.PRIN(val):

PartialRetractionIn


medline.DEP(val):

DateElectronicPublication


medline.AUID(val):

AuthorIdentifier, one line only; just need to undo the parser’s effects


medline.SI(val):

SecondarySourceID


medline.ISBN(val):

ISBN


medline.RN(val):

RegistryNumber


medline.JID(val):

NLMID


medline.GR(val):

GrantNumber


medline.OCI(val):

OtherCopyright


medline.SB(val):

Subset


medline.DA(val):

DateCreated


medline.PMCR(val):

PubMedCentralRelease


medline.PG(val):

Pagination, all pagination seen so far seems to be on one line only


medline.GS(val):

GeneSymbol


medline.VI(val):

Volume, the volume as a string, as volume is a single line


medline.UOF(val):

UpdateOf


medline.FIR(val):

InvestigatorFull


medline.OWN(val):

Owner


medline.ORI(val):

OriginalReportIn


medline.MID(val):

ManuscriptIdentifier


medline.PMID(val):

PubMedUniqueIdentifier


medline.PMC(val):

PubMedCentralIdentifier


medline.RIN(val):

RetractionIn


medline.RPF(val):

RepublishedFrom


medline.CIN(val):

CommentIn


proquest

These are the functions used to process ProQuest files at the backend. They are meant for internal use by metaknowledge.

The proquest module provides the following functions:


proquest.proQuestParser(proFile):

Parses a ProQuest file, proFile, to extract the individual entries.

A ProQuest file has three sections: first a list of the contained entries, second the full metadata and finally a bibtex formatted entry for each record. This parser only uses the first two, as the bibtex section contains no information the second section does not. Also, the first section is only used to verify the second section. The returned ProQuestRecords contain the data from the second section, with the same key strings as ProQuest uses; the unlabeled sections are called, in order, 'Name', 'Author' and 'url'.

Parameters

proFile : str

A path to a valid ProQuest file, use isProQuestFile to verify

Returns

set[ProQuestRecord]

Records for each of the entries


proquest.isProQuestFile(infile, checkedLines=2):

Determines if infile is the path to a ProQuest file. A file is considered to be a ProQuest file if it has the correct encoding (utf-8) and the following header starts within the first checkedLines:

____________________________________________________________

Report Information from ProQuest
Parameters

infile : str

The path to the target file

checkedLines : optional [int]

default 2, the number of lines to check for the header

Returns

bool

True if the file is a valid ProQuest file


proquest.proQuestRecordParser(enRecordFile, recNum):

The parser ProQuestRecords use. This takes an entry from proQuestParser() and parses it as part of the creation of a ProQuestRecord.

Parameters

enRecordFile : enumerate object

a file wrapped by enumerate()

recNum : int

The number given to the entry in the first section of the ProQuest file

Returns

collections.OrderedDict

An ordered dictionary of the key-value pairs in the entry


proquest.proQuestTagToFunc(tag):

Takes a tag string, tag, and returns the processing function for its data. If there is no predefined function, the identity function (lambda x : x) is returned.

Parameters

tag : str

The requested tag

Returns

function

A function to process the tag’s data


scopus

These are the functions used to process scopus csv files at the backend. They are meant for internal use by metaknowledge.

The scopus module provides the following functions:


scopus.scopusRecordParser(record):

The parser ScopusRecords use. This takes a line from scopusParser() and parses it as a part of the creation of a ScopusRecord.

Note this is for csv files downloaded from scopus, not the text records, as those are less complete. Also, Scopus uses double quotes (") to quote strings, such as abstracts, in the csv, so double quotes in the string must be escaped. For reasons not fully understandable by mortals they chose to use two double quotes in a row ("") to represent an escaped double quote. This parser does not unescape these quotes, but it does correctly handle their interaction with the outer double quotes.

Parameters

record : str

string ending with a newline containing the record’s entry

Returns

dict

A dictionary of the key-value pairs in the entry
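
The doubled-quote escaping described above is standard CSV quoting; Python’s own csv module unescapes it (unlike scopusRecordParser(), which is noted to leave the quotes escaped). The sample line below is fabricated for illustration:

```python
import csv, io

# A fabricated Scopus-style line: the quoted abstract contains an
# escaped double quote ("") inside the outer double quotes.
line = 'Smith J.,"An ""escaped"" quote in an abstract",2015\r\n'

# Python's csv module strips the outer quotes and unescapes "" to "
row = next(csv.reader(io.StringIO(line)))
print(row)
# ['Smith J.', 'An "escaped" quote in an abstract', '2015']
```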


scopus.scopusParser(scopusFile):

Parses a scopus file, scopusFile, to extract the individual lines as ScopusRecords.

A Scopus file is a csv (Comma-separated values) with a complete header, see scopus.scopusHeader for the entries, and each line after it containing a record’s entry. The string valued entries are quoted with double quotes which means double quotes inside them can cause issues, see scopusRecordParser() for more information.

Parameters

scopusFile : str

A path to a valid scopus file, use isScopusFile() to verify

Returns

set[ScopusRecord]

Records for each of the entries


scopus.isScopusFile(infile, checkedLines=2):

Determines if infile is the path to a Scopus csv file. A file is considered to be a Scopus file if it has the correct encoding (utf-8 with a BOM (Byte Order Mark)) and within the first checkedLines a line contains the complete header; the list of all header entries, in order, is found in scopus.scopusHeader.

Note this is for csv files not plain text files from scopus, plain text files are not complete.

Parameters

infile : str

The path to the target file

checkedLines : optional [int]

default 2, the number of lines to check for the header

Returns

bool

True if the file is a Scopus csv file


journalAbbreviations

This module handles the abbreviations, known as J9 abbreviations and given by the J9 tag in WOS Records, that WOS uses for journal titles in citations.

The citations provided by WOS use abbreviated journal titles instead of the full names. The full list of abbreviations can be found in a series of pages divided by letter, starting at images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html. The function updatej9DB() is used to scrape and parse those pages; it must run without error before the other features can be used. If the database is requested by getj9dict(), which is what Citations use, and it is not found or is corrupted, then updatej9DB() will be run to download the database; if this fails an mkException will be raised. The download and parsing usually take less than a second on a good internet connection.

The other functions of the module are for manually adding and removing abbreviations from the database. It is recommended that this be done with the command-line tool metaknowledge instead of with a script.

The journalAbbreviations module provides the following functions:


journalAbbreviations.getj9dict(dbname='j9Abbreviations', manualDB='manualj9Abbreviations', returnDict='both'):

Returns the dictionary of journal abbreviations mapping to a list of the associated journal names. By default the local database is used. The database is in the file dbname in the same directory as this source file.

Parameters

dbname : optional [str]

The name of the downloaded database file, the default is determined at run time. It is recommended that this remain untouched.

manualDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

returnDict : optional [str]

default 'both', can be used to get both databases or only one with 'WOS' or 'manual'.


journalAbbreviations.addToDB(abbr=None, dbname='manualj9Abbreviations'):

Adds abbr to the database of journals. The database is kept separate from the one scraped from WOS and supersedes it. By default it is stored with the WOS one and its name is given by metaknowledge.journalAbbreviations.manualDBname. To create an empty database, run addToDB() without an abbr argument.

Parameters

abbr : optional [str or dict[str : str]]

The journal abbreviation to be added to the database. It can either be a single string, in which case that string will be added with itself as the full name, or a dict with the abbreviations as keys and their names as strings; use pipes ('|') to separate multiple names. Note, if the empty string is given as a name the abbreviation will be considered manually excluded, i.e. as if excludeFromDB() had been run on it.

dbname : optional [str]

The name of the database file, default is metaknowledge.journalAbbreviations.manualDBname.
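
The two accepted shapes of abbr can be sketched as follows; the journal names below are invented for illustration:

```python
# Sketch of the two shapes abbr can take, per the description above.
abbr_as_string = "J EXAMPLE STUD"   # added with itself as the full name

abbr_as_dict = {
    # Pipes ('|') separate multiple full names for one abbreviation
    "J EXAMPLE STUD": "Journal of Example Studies|J. Example Studies",
    # An empty-string name marks the abbreviation as manually excluded
    "BAD ABBREV": "",
}

names = abbr_as_dict["J EXAMPLE STUD"].split("|")
print(names)  # ['Journal of Example Studies', 'J. Example Studies']
```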


Questions?

If you find bugs, or have questions, please write to:

Reid McIlroy-Young reid@reidmcy.com

John McLevey john.mclevey@uwaterloo.ca


License

metaknowledge is free and open source software, distributed under the GPL License.