Full Documentation


metaknowledge is a Python 3 package that simplifies bibliometric and computational analysis of Web of Science data.

Example

To load the data from files and make a network:

>>> import metaknowledge as mk
>>> RC = mk.RecordCollection("records/")
>>> print(RC)
Collection of 33 records
>>> G = RC.coCiteNetwork(nodeType = 'journal')
Done making a co-citation network of files-from-records                 1.1s
>>> print(len(G.nodes()))
223
>>> mk.writeGraph(G, "Cocitation-Network-of-Journals")

There is also a simple command line program called metaknowledge that comes with the package. It allows for creating networks without any need to know Python. More information about it can be found at networkslab.org/metaknowledge/cli

Overview

This package can read the files downloaded from the Thomson Reuters Web of Science (WOS) as plain text. These files contain metadata about scientific records, such as the authors, title, and citations. The records are exported to files in groups of up to 500 individual records.

The metaknowledge.RecordCollection class can take a path to one or more of these files, then load and parse them. This object is the main way to work with multiple records. For each individual record it creates an instance of the metaknowledge.Record class that contains the results of parsing the record.

The files given by WOS are a flat database containing a series of 2 character tags, e.g. 'TI' is the title. Each WOS tag has one or more values and metaknowledge can read them to extract useful information. The approximate meanings of the tags are listed in the tagProcessing package, along with the parsing functions for each tag. If you simply want the mapping, tagToFull() is a function that maps tags to their full names; it, as well as a few other similar functions, is provided by the base metaknowledge import. Note that the long names can be used in place of the short 2 character codes within metaknowledge. There are no full official public listings of the tag meanings available, and metaknowledge is not attempting to provide the definitive or authoritative meanings.

Citations are handled by a special Citation class. This class can parse the citations given by WOS, look up extra details about the full name of their journal, and allow simple comparisons.

Note for those reading the docstrings: metaknowledge's docs are written in markdown and are processed to produce the documentation found at networkslab.org/metaknowledge/documentation, but you should have no problem reading them from the help function.


The functions provided by metaknowledge are:


filterNonJournals(citesLst, invert=False):

Removes the Citations from citesLst that are not journals

Parameters

citesLst : list [Citation]

A list of citations to be filtered

invert : optional [bool]

Default False, if True non-journals will be kept instead of journals

Returns

list [Citation]

A filtered list of Citations from citesLst
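
As a rough illustration of the invert switch, the sketch below filters a list of stand-in objects. FakeCitation, its is_journal field and the second entry are made up for this example; the real function works on Citation objects and may be implemented quite differently.

```python
from collections import namedtuple

# Hypothetical stand-in for Citation; the flag replaces the real journal check.
FakeCitation = namedtuple("FakeCitation", ["author", "journal", "is_journal"])

def filter_non_journals(cites, invert=False):
    """Keep journal citations, or only the non-journal ones when invert is True."""
    return [c for c in cites if c.is_journal != invert]

cites = [
    FakeCitation("Nunez R.", "MATH COGNITION", True),
    FakeCitation("Smith J.", "SOME BOOK", False),  # made-up non-journal entry
]

journals = filter_non_journals(cites)
others = filter_non_journals(cites, invert=True)
```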


diffusionGraph(source, target, weighted=True, sourceType='raw', targetType='raw', labelEdgesBy=None):

Takes in two RecordCollections and produces a graph of the citations of source by the Records in target. By default the nodes in the graph are Record objects but this can be changed with the sourceType and targetType keywords. The edges of the graph go from the target to the source.

Each node on the output graph has two boolean attributes, "source" and "target", indicating if they are targets or sources. Note that if the types of the sources and targets are different, the attributes will not be checked for overlap of the other type. For example, if the source type is 'TI' (title), the target type is 'UT' (WOS number) and there is some overlap of the targets and sources, then the Record corresponding to a source node will not be checked for being one of the titles of the targets; only its WOS number will be considered.

Parameters

source : RecordCollection

A metaknowledge RecordCollection containing the Records being cited

target : RecordCollection

A metaknowledge RecordCollection containing the Records citing those in source

weighted : optional [bool]

Default True, if True each edge will have an attribute 'weight' giving the number of times the target has referenced the source.

sourceType : optional [str]

Default 'raw', if 'raw' the returned graph will contain Records as source nodes.

If Records are not wanted then it can be set to a WOS tag, such as 'SO' (for journals), to make the nodes into the type of object returned by that tag from Records.

targetType : optional [str]

Default 'raw', if 'raw' the returned graph will contain Records as target nodes.

If Records are not wanted then it can be set to a WOS tag, such as 'SO' (for journals), to make the nodes into the type of object returned by that tag from Records.

labelEdgesBy : optional [str]

Default None, if a WOS tag (or long name of a WOS tag) is given then the edges of the output graph will have an attribute 'key' that is the value of the referenced tag of the source Record, i.e. if 'PY' is given then each edge will have a 'key' value equal to the publication year of the source.

This option will cause the output graph to be a MultiDiGraph and is likely to result in parallel edges. If a Record has multiple values for a tag (e.g. 'AF') then each value will create its own edge.

Returns

networkx Directed Graph or networkx multi Directed Graph

A directed graph of the diffusion network; if labelEdgesBy is used the graph will allow parallel edges.
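
The edge direction and weighting described above can be pictured with plain dictionaries. This is only a sketch under simplifying assumptions: the IDs and the "cites" lists are invented, records are plain dicts rather than Records, and no networkx graph is built.

```python
def diffusion_edges(source_ids, target):
    """Count citations of source nodes by target records.

    Returns a dict mapping (target_id, source_id) to a weight, so every
    edge points from the citing node to the cited node.
    """
    wanted = set(source_ids)
    edges = {}
    for rec in target:
        for cited in rec["cites"]:
            if cited in wanted:
                key = (rec["id"], cited)
                edges[key] = edges.get(key, 0) + 1
    return edges

source_ids = ["S1", "S2"]  # hypothetical node IDs
target = [
    # Repeats can happen when nodes are coarser than Records, e.g. journals.
    {"id": "T1", "cites": ["S1", "S1", "S2"]},
    {"id": "T2", "cites": ["S1", "X9"]},  # X9 is not in source, so it is ignored
]

edges = diffusion_edges(source_ids, target)
```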


diffusionCount(source, target, sourceType='raw', pandasFriendly=False, compareCounts=False, numAuthors=True, byYear=False):

Takes in two RecordCollections and produces a dict counting the citations of source by the Records of target. By default the dict uses Record objects as keys but this can be changed with the sourceType keyword to any of the WOS tags.

Parameters

source : RecordCollection

A metaknowledge RecordCollection containing the Records being cited

target : RecordCollection

A metaknowledge RecordCollection containing the Records citing those in source

sourceType : optional [str]

default 'raw', if 'raw' the returned dict will contain Records as keys. If it is a WOS tag the keys will be of that type.

pandasFriendly : optional [bool]

default False, if True the output is a dict with two keys: "Record" is the list of Records (or the data type requested by sourceType) and "Counts" is the list of their occurrence counts. The lists are the same length.

compareCounts : optional [bool]

default False, if True the diffusion analysis will be run twice, first with source and target setup like the default (global scope) then using only the source RecordCollection (local scope).

byYear : optional [bool]

default False, if True the returned dictionary will have Records mapped to maps, and these maps will map years (ints) to counts. If pandasFriendly is also True the resultant dictionary will have an additional column called 'year'. This column will contain the year the citations occurred; in addition, the Records entries will be duplicated for each year they occur in.

Returns

dict[:int]

A dictionary with the type given by sourceType as keys and integers as values.

If compareCounts is True the values are tuples with the first integer being the diffusion in the target and the second the diffusion in the source.

If pandasFriendly is True the returned dict has keys with the names of the WOS tags and lists with their values, i.e. a table with labeled columns. The counts are in the column named "TargetCount" and if compareCounts the local count is in a column called "SourceCount".
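
To make the plain and byYear return shapes concrete, here is a small sketch using dicts in place of Records. The IDs, years and the "cites" field are invented for illustration, and reporting sources with zero citations is an assumption of this sketch, not a documented behaviour.

```python
def diffusion_count(source_ids, target, by_year=False):
    """Count how often each source ID is cited by the target records."""
    wanted = set(source_ids)
    if by_year:
        counts = {s: {} for s in source_ids}
        for rec in target:
            for cited in rec["cites"]:
                if cited in wanted:
                    years = counts[cited]
                    years[rec["year"]] = years.get(rec["year"], 0) + 1
        return counts
    counts = {s: 0 for s in source_ids}
    for rec in target:
        for cited in rec["cites"]:
            if cited in wanted:
                counts[cited] += 1
    return counts

source_ids = ["A", "B"]  # hypothetical IDs
target = [
    {"year": 2001, "cites": ["A", "B"]},
    {"year": 2001, "cites": ["A"]},
    {"year": 2003, "cites": ["A", "X"]},
]

totals = diffusion_count(source_ids, target)
per_year = diffusion_count(source_ids, target, by_year=True)
```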


readGraph(edgeList, nodeList=None, directed=False, idKey='ID', eSource='From', eDest='To'):

Reads the files given by edgeList and nodeList and creates a networkx graph for the files.

This is designed only for the files produced by metaknowledge and is meant to be the reverse of writeGraph(). If this does not produce the desired results, the networkx built-in networkx.read_edgelist() could be tried, as it is aimed at more general usage.

The edge list format assumes the column named eSource (default 'From') is the source node and the column eDest (default 'To') gives the destination; all other columns are attributes of the edges, e.g. weight.

The read node list format assumes the column idKey (default 'ID') is the ID of the node for the edge list and the resulting network. All other columns are considered attributes of the node, e.g. count.

Note: If the names of the columns do not match those given to readGraph() a KeyError exception will be raised.

Note: If nodes appear in the edge list but not in the node list they will be created silently with no attributes.
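
The column conventions above can be sketched with the standard csv module. This is not the library's implementation: it returns plain dicts instead of a networkx graph, reads from in-memory strings, and leaves all attribute values as strings.

```python
import csv
import io

def read_graph(edge_file, node_file=None, id_key="ID", e_source="From", e_dest="To"):
    """Read edge and node tables into (nodes, edges) dictionaries."""
    nodes, edges = {}, {}
    if node_file is not None:
        for row in csv.DictReader(node_file):
            node_id = row.pop(id_key)  # KeyError if the ID column is missing
            nodes[node_id] = row       # remaining columns are node attributes
    for row in csv.DictReader(edge_file):
        src, dst = row.pop(e_source), row.pop(e_dest)  # KeyError on a name mismatch
        nodes.setdefault(src, {})  # nodes seen only in the edge list get no attributes
        nodes.setdefault(dst, {})
        edges[(src, dst)] = row    # remaining columns are edge attributes
    return nodes, edges

edge_csv = "From,To,weight\nA,B,2\nB,C,1\n"
node_csv = "ID,count\nA,5\nB,3\n"

nodes, edges = read_graph(io.StringIO(edge_csv), io.StringIO(node_csv))
```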

Parameters

edgeList : str

a string giving the path to the edge list file

nodeList : optional [str]

default None, a string giving the path to the node list file

directed : optional [bool]

default False, if True the produced network is directed from eSource to eDest

idKey : optional [str]

default 'ID', the name of the ID column in the node list

eSource : optional [str]

default 'From', the name of the source column in the edge list

eDest : optional [str]

default 'To', the name of the destination column in the edge list

Returns

networkx Graph

the graph described by the input files


writeEdgeList(grph, name, extraInfo=True, allSameAttribute=False):

Writes an edge list of grph at the destination name.

The edge list has two columns for the source and destination of the edge, 'From' and 'To' respectively, then, if edgeInfo is True, for each attribute of the edge another column is created.

Note: If any edges are missing an attribute it will be left blank by default, enable allSameAttribute to cause a KeyError to be raised.

Parameters

grph : networkx Graph

The graph to be written to name

name : str

The name of the file to be written

edgeInfo : optional [bool]

Default True, if True the attributes of each edge will be written

allSameAttribute : optional [bool]

Default False, if True all the edges must have the same attributes or an exception will be raised. If False the missing attributes will be left blank.


writeNodeAttributeFile(grph, name, allSameAttribute=False):

Writes a node attribute list of grph to the file given by the path name.

The node list has one column called 'ID' with the node ids used by networkx and all other columns are the node attributes.

Note: If any nodes are missing an attribute it will be left blank by default, enable allSameAttribute to cause a KeyError to be raised.

Parameters

grph : networkx Graph

The graph to be written to name

name : str

The name of the file to be written

allSameAttribute : optional [bool]

Default False, if True all the nodes must have the same attributes or an exception will be raised. If False the missing attributes will be left blank.


dropEdges(grph, minWeight=-inf, maxWeight=inf, parameterName='weight', ignoreUnweighted=False, dropSelfLoops=False):

Modifies grph by dropping edges whose weight is not within the inclusive bounds of minWeight and maxWeight, i.e. after running, grph will only have edges whose weights meet the following inequality: minWeight <= edge's weight <= maxWeight. The weight is determined by examining the attribute parameterName. A KeyError will be raised if the graph is unweighted unless ignoreUnweighted is True.

Note: none of the default options will result in grph being modified so only specify the relevant ones, e.g. dropEdges(G, dropSelfLoops = True) will remove only the self loops from G.
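
The inclusive bounds and the self-loop option can be sketched on a dictionary of edges. The edge store here is a plain dict keyed by node pairs, not a networkx graph, and the order in which the checks are applied is an assumption of this sketch.

```python
import math

def drop_edges(edges, min_weight=-math.inf, max_weight=math.inf,
               parameter_name="weight", ignore_unweighted=False,
               drop_self_loops=False):
    """Remove edges in place; edges maps (u, v) to an attribute dict."""
    for (u, v), attrs in list(edges.items()):
        if drop_self_loops and u == v:
            del edges[(u, v)]
            continue
        if parameter_name not in attrs:
            if ignore_unweighted:
                continue  # unweighted edges are kept
            raise KeyError(parameter_name)
        # Both bounds are inclusive.
        if not (min_weight <= attrs[parameter_name] <= max_weight):
            del edges[(u, v)]

edges = {
    ("A", "B"): {"weight": 1},
    ("B", "C"): {"weight": 3},
    ("C", "C"): {"weight": 5},
}
drop_edges(edges, min_weight=3, drop_self_loops=True)

loops_only = {("A", "A"): {"weight": 1}, ("A", "B"): {"weight": 1}}
drop_edges(loops_only, drop_self_loops=True)
```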

Parameters

grph : networkx Graph

The graph to be modified.

minWeight : optional [int or double]

default -inf, the minimum weight for an edge to be kept in the graph.

maxWeight : optional [int or double]

default inf, the maximum weight for an edge to be kept in the graph.

parameterName : optional [str]

default 'weight', key to weight field in the edge’s attribute dictionary, the default is the same as networkx and metaknowledge so is likely to be correct

ignoreUnweighted : optional [bool]

default False, if True unweighted edges will be kept

dropSelfLoops : optional [bool]

default False, if True self loops will be removed regardless of their weight


dropNodesByDegree(grph, minDegree=-inf, maxDegree=inf, useWeight=True, parameterName='weight', includeUnweighted=True):

Modifies grph by dropping nodes that do not have a degree that is within the inclusive bounds of minDegree and maxDegree, i.e. after running, grph will only have nodes whose degrees meet the following inequality: minDegree <= node's degree <= maxDegree.

Degree is determined in one of two ways. By default (useWeight is True) the weight attributes of the edges touching a node are summed; the attribute's name is given by parameterName. Otherwise the number of edges touching the node is used. If includeUnweighted is True then, when useWeight is True, unweighted edges are assigned a weight of 1.
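
The two ways of computing degree can be compared in a short sketch. It works on a dict of edges rather than a networkx graph, and the function name node_degrees is made up for this example.

```python
def node_degrees(edges, use_weight=True, parameter_name="weight",
                 include_unweighted=True):
    """Return a dict of node -> degree for edges given as {(u, v): attrs}."""
    degrees = {}
    for (u, v), attrs in edges.items():
        if not use_weight:
            w = 1  # plain edge count
        elif parameter_name in attrs:
            w = attrs[parameter_name]
        elif include_unweighted:
            w = 1  # unweighted edges count as weight 1
        else:
            raise KeyError(parameter_name)
        for node in (u, v):
            degrees[node] = degrees.get(node, 0) + w
    return degrees

edges = {("A", "B"): {"weight": 3}, ("B", "C"): {}}

weighted = node_degrees(edges)
unweighted = node_degrees(edges, use_weight=False)
```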

Parameters

grph : networkx Graph

The graph to be modified.

minDegree : optional [int or double]

default -inf, the minimum degree for a node to be kept in the graph.

maxDegree : optional [int or double]

default inf, the maximum degree for a node to be kept in the graph.

useWeight : optional [bool]

default True, if True the edge weights will be summed to get the degree, if False the number of edges will be used to determine the degree.

parameterName : optional [str]

default 'weight', key to weight field in the edge’s attribute dictionary, the default is the same as networkx and metaknowledge so is likely to be correct.

includeUnweighted : optional [bool]

default True, if True edges with no weight will be considered to have a weight of 1, if False they will cause a KeyError to be raised.


dropNodesByCount(grph, minCount=-inf, maxCount=inf, parameterName='count', ignoreMissing=False):

Modifies grph by dropping nodes that do not have a count that is within the inclusive bounds of minCount and maxCount, i.e. after running, grph will only have nodes whose counts meet the following inequality: minCount <= node's count <= maxCount.

Count is determined by the count attribute, parameterName, and if missing will result in a KeyError being raised. ignoreMissing can be set to True to suppress the error.

minCount and maxCount default to negative and positive infinity respectively, so without specifying either the output should be the input.

Parameters

grph : networkx Graph

The graph to be modified.

minCount : optional [int or double]

default -inf, the minimum count for a node to be kept in the graph.

maxCount : optional [int or double]

default inf, the maximum count for a node to be kept in the graph.

parameterName : optional [str]

default 'count', key to count field in the node's attribute dictionary, the default is the same throughout metaknowledge so is likely to be correct.

ignoreMissing : optional [bool]

default False, if True nodes missing a count will be kept in the graph instead of raising an exception


mergeGraphs(targetGraph, addedGraph, incrementedNodeVal='count', incrementedEdgeVal='weight'):

A quick way of merging graphs; it is only intended for graphs generated by metaknowledge. This does not check anything and as such may cause unexpected results if the two graphs were not generated by the same method.

mergeGraphs() will modify targetGraph in place by adding the nodes and edges found in the second graph, addedGraph. If a node or edge exists in both graphs, targetGraph is given precedence, but the edge and node attributes given by incrementedNodeVal and incrementedEdgeVal are added instead of being overwritten.
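
The precedence rule can be sketched for nodes alone using plain dicts; edges would follow the same pattern with the weight attribute. The "label" attribute and the helper name merge_nodes are invented for the example.

```python
def merge_nodes(target_nodes, added_nodes, incremented_val="count"):
    """Merge added_nodes into target_nodes in place.

    New nodes are copied over. For nodes present in both, only the
    incremented attribute is summed; every other attribute keeps the
    value from target_nodes.
    """
    for node, attrs in added_nodes.items():
        if node not in target_nodes:
            target_nodes[node] = dict(attrs)
        elif incremented_val in attrs:
            current = target_nodes[node].get(incremented_val, 0)
            target_nodes[node][incremented_val] = current + attrs[incremented_val]

target_nodes = {"A": {"count": 2, "label": "old"}}
added_nodes = {"A": {"count": 3, "label": "new"}, "B": {"count": 1}}

merge_nodes(target_nodes, added_nodes)
```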

Parameters

targetGraph : networkx Graph

the graph to be modified, it has precedence.

addedGraph : networkx Graph

the graph that is unmodified, it is added and does not have precedence.

incrementedNodeVal : optional [str]

default 'count', the name of the count attribute for the graph’s nodes. When merging this attribute will be the sum of the values in the input graphs, instead of targetGraph’s value.

incrementedEdgeVal : optional [str]

default 'weight', the name of the weight attribute for the graph’s edges. When merging this attribute will be the sum of the values in the input graphs, instead of targetGraph’s value.


graphStats(G, stats=('nodes', 'edges', 'isolates', 'loops', 'density', 'transitivity'), makeString=True):

Returns a string or list containing statistics about the graph G.

graphStats() gives 6 different statistics: number of nodes, number of edges, number of isolates, number of loops, density and transitivity. The ones wanted can be given to stats. By default a string giving a sentence containing all the requested statistics is returned but the raw values can be accessed instead by setting makeString to False.

Parameters

G : networkx Graph

The graph whose statistics are to be determined

stats : optional [list or tuple [str]]

Default ('nodes', 'edges', 'isolates', 'loops', 'density', 'transitivity'), a list or tuple containing any number or combination of the strings:

"nodes", "edges", "isolates", "loops", "density" and "transitivity"

At least one occurrence of the corresponding string causes the statistics to be provided in the string output. For the non-string (tuple) output the returned tuple has the same length as the input and each output is at the same index as the string that requested it, e.g.

stats = ("edges", "loops", "edges")

The return is a tuple with 3 elements, the first and last of which are the number of edges and the second is the number of loops.
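
The index-for-index correspondence between stats and the tuple output can be sketched as follows. The values dict stands in for statistics already computed from a graph; its numbers and the helper name stats_tuple are made up.

```python
def stats_tuple(values, stats):
    """Return one value per requested name, in the order requested."""
    return tuple(values[name] for name in stats)

values = {"nodes": 4, "edges": 3, "loops": 1}  # hypothetical precomputed statistics

result = stats_tuple(values, ("edges", "loops", "edges"))
```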

makeString : optional [bool]

Default True, if True a string is returned, if False a tuple

Returns

str or tuple [float and int]

The type is determined by makeString and the layout by stats


writeGraph(grph, name, edgeInfo=True, typing=False, suffix='csv', overwrite=True):

Writes both the edge list and the node attribute list of grph to files starting with name.

The output file names consist of name, then the file type (edgeList or nodeAttributes), then, if typing is True, the type of graph (directed or undirected), and finally the suffix. The default is as follows:

name_fileType.suffix

Both files are CSVs with comma delimiters and double quote quoting characters. The edge list has two columns for the source and destination of the edge, 'From' and 'To' respectively, then, if edgeInfo is True, for each attribute of the edge another column is created. The node list has one column called 'ID' with the node ids used by networkx and all other columns are the node attributes.

To read back these files use readGraph() and to write only one type of list use writeEdgeList() or writeNodeAttributeFile().

Warning: this function will overwrite files if they are in the way of the output; to prevent this set overwrite to False.

Note: If any nodes or edges are missing an attribute a KeyError will be raised.

Parameters

grph : networkx Graph

A networkx graph of the network to be written.

name : str

The start of the file name to be written, can include a path.

edgeInfo : optional [bool]

Default True, if True the attributes of each edge are written to the edge list.

typing : optional [bool]

Default False, if True the directedness of the graph will be added to the file names.

suffix : optional [str]

Default "csv", the suffix of the file.

overwrite : optional [bool]

Default True, if True files will be overwritten silently, otherwise an OSError exception will be raised.


recordParser(paper):

This is the function that is used to create Records from files.

recordParser() reads the file paper until it reaches ‘ER’. For each field tag it adds an entry to the returned dict with the tag as the key and a list of the entries as the value, the list has each line separately, so for the following two lines in a record:

AF BREVIK, I
   ANICIN, B

The entry in the returned dict would be {'AF' : ["BREVIK, I", "ANICIN, B"]}

Record objects can be created with these dictionaries as the initializer.
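
A minimal sketch of this kind of parsing is shown below. It assumes tags occupy the first two columns and values start at the fourth, as in the two lines above, and it raises ValueError where the real parser reports a BadWOSRecord; the function name parse_record is made up.

```python
import io
from collections import OrderedDict

def parse_record(lines):
    """Collect tagged lines into an OrderedDict until 'ER' is reached."""
    fields = OrderedDict()
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ER"):
            return fields
        tag, value = line[:2], line[3:]
        if tag.strip():
            # A new tag starts in the first two columns.
            current = tag
            fields[current] = [value]
        else:
            # Indented lines continue the previous tag.
            fields[current].append(value)
    raise ValueError("End of file reached before ER")

sample = io.StringIO("AF BREVIK, I\n   ANICIN, B\nER\n")
record = parse_record(sample)
```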

Parameters

paper : file stream

An open file, with the current line at the beginning of the WOS record.

Returns

OrderedDict[str : List[str]]

A dictionary mapping WOS tags to lists, the lists are of strings, each string is a line of the record associated with the tag.


wosParser(isifile):

This is the function that is used to create RecordCollections from files.

wosParser() reads the file given by the path isifile, checks that the header is correct then reads until it reaches EF. All WOS records it encounters are parsed with recordParser() and converted into Records. A list of these Records is returned.

BadWOSFile is raised if an issue is found with the file.

Parameters

isifile : str

The path to the target file

Returns

List[Record]

All the Records found in isifile


tagToFull(tag):

A wrapper for tagToFullDict; it maps 2 character tags to their full names.

Parameters

tag: str

A two character string giving the tag

Returns

str

The full name of tag


normalizeToTag(val):

Converts tags or full names to 2 character tags, case insensitive

Parameters

val: str

A two character string giving the tag or its full name

Returns

str

The short name of val
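
A case-insensitive lookup of this kind might look like the sketch below. The two-entry table is a hypothetical excerpt (the real mapping is much larger), and raising KeyError for unknown input is an assumption of this example.

```python
# Hypothetical excerpt of the tag table; only two entries are shown.
TAG_TO_FULL = {"TI": "title", "CR": "citations"}
NAME_TO_TAG = {name.lower(): tag for tag, name in TAG_TO_FULL.items()}

def normalize_to_tag(val):
    """Return the 2 character tag for a tag or full name, ignoring case."""
    if val.upper() in TAG_TO_FULL:
        return val.upper()
    if val.lower() in NAME_TO_TAG:
        return NAME_TO_TAG[val.lower()]
    raise KeyError(val)

short_from_tag = normalize_to_tag("ti")
short_from_name = normalize_to_tag("Title")
```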


normalizeToName(val):

Converts tags or full names to full names, case sensitive

Parameters

val: str

A two character string giving the tag or its full name

Returns

str

The full name of val


isTagOrName(val):

Checks if val is a tag or the full name of a tag, and if so returns True.

Parameters

val: str

A string possibly forming a tag or name

Returns

bool

True if val is a tag or name, otherwise False


BadCitation(Warning):

Exception thrown by Citation


BadWOSRecord(Warning):

Exception thrown by the record parser to indicate a misformatted record. This occurs when some component of the record does not parse. The messages will be any of:

* _Missing field on line (line Number):(line)_, which indicates a line was too short, there should have been a tag followed by information

* _End of file reached before ER_, which indicates the file ended before the 'ER' indicator appeared, 'ER' indicates the end of a record. This is often due to a copy and paste error.

* _Duplicate tags in record_, which indicates the record had 2 or more lines with the same tag.

* _Missing WOS number_, which indicates the record did not have a 'UT' tag.

Records with a BadWOSRecord error are likely incomplete or the combination of two or more single records.


Citation(object):

Citation.__init__(cite):

A class to hold citation strings and allow for comparison between them.

The initializer takes in a string representing a WOS citation in the form:

Author, Year, Journal, Volume, Page, DOI

* Author is the author's name, in the form of last name first, then first initial, sometimes followed by a period.

* Year is the year of publication.

* Journal is the 29-Character Source Abbreviation of the journal.

* Volume is the volume number(s) of the publication, preceded by a V.

* Page is the page number the record starts on.

* DOI is the DOI number of the cited record, preceded by the letters 'DOI'.

Combined they look like:

Nunez R., 1998, MATH COGNITION, V4, P85, DOI 10.1080/135467998387343

Note: any of the fields have been known to be missing and the requirements for the fields are not always met. If something is in the source string that cannot be interpreted as any of these it is put in the misc attribute. That is the reason to use this class, it gracefully handles missing information while still allowing for comparison between WOS citation strings.
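
A much simplified sketch of splitting such a string is given below. It assumes fields are separated by ', ', that the first three are always author, year and journal when at least three are present, and it stores the raw tokens unchanged; the real class handles many more irregular cases and may normalize values differently.

```python
import re

def parse_citation(cite):
    """Split a WOS-style citation string into a dict of raw field tokens."""
    fields = {"original": cite}
    parts = [p.strip() for p in cite.split(", ")]
    if len(parts) < 3:
        # Too little information to tell the fields apart.
        fields["misc"] = cite
        return fields
    fields["author"], fields["year"], fields["journal"] = parts[:3]
    misc = []
    for part in parts[3:]:
        if re.fullmatch(r"V\d+", part):
            fields["V"] = part
        elif re.fullmatch(r"P\d+", part):
            fields["P"] = part
        elif part.startswith("DOI "):
            fields["DOI"] = part
        else:
            misc.append(part)
    if misc:
        fields["misc"] = ", ".join(misc)
    return fields

full = parse_citation(
    "Nunez R., 1998, MATH COGNITION, V4, P85, DOI 10.1080/135467998387343"
)
short = parse_citation("Nunez R.")
```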

Customizations

Citation’s hashing and equality checking are based on ID() and use the values of author, year and journal.

When converted to a string a Citation will return the original string.

Attributes

As noted above, citations are considered to be divided into six distinct fields (Author, Year, Journal, Volume, Page and DOI) with a seventh, misc, for anything not in those. Citations thus have an attribute with a name corresponding to each: author, year, journal, V, P, DOI and misc respectively. These are created only if there is anything in the field. So a Citation created from the string 'Nunez R., 1998, MATH COGNITION' would have author, year and journal defined, while one from 'Nunez R.' would have only the attribute misc.

If the parsing of a citation string fails the attribute bad is set to True and the attribute error is created to contain said error, which is a BadCitation object. If no errors occur bad is False.

The attribute original is the unmodified string (cite) given to create the Citation, it can also be accessed by converting to a string, e.g. with str().

__Init__

Citations can be created by Records or by giving the initializer a string containing a WOS style citation.

Parameters

cite : str

A str containing a WOS style citation.


The Citation class has the following methods:


Citation.isAnonymous():

Checks if the author is given as '[ANONYMOUS]' and returns True if so.

Returns

bool

True if the author is '[ANONYMOUS]' otherwise False.


Citation.ID():

Returns all of author, year and journal available separated by ' ,'. It is for shortening labels when creating networks as the resultant strings are often unique. Extra() gets everything not returned by ID().

This is also used for hashing and equality checking.

Returns

str

A string to use as the ID of a node.


Citation.allButDOI():

Returns a string of the normalized values from the Citation excluding the DOI number. Equivalent to getting the ID with ID() then appending the extra values from Extra() and then removing the substring containing the DOI number.

Returns

str

A string containing the data of the Citation.


Citation.Extra():

Returns any V, P, DOI or misc values as a string. These are all the values not returned by ID(), they are separated by ' ,'.

Returns

str

A string containing the data not in the ID of the Citation.


Citation.isJournal(dbname='j9Abbreviations', manaulDB='manualj9Abbreviations', returnDict='both', checkIfExcluded=False):

Returns True if the Citation’s journal field is a journal abbreviation from the WOS listing found at http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html, i.e. checks if the citation is citing a journal.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Note: All parameters are used for getting the database with getj9dict().

Parameters

dbname : optional [str]

The name of the downloaded database file, the default is determined at run time. It is recommended that this remain untouched.

manaulDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

returnDict : optional [str]

default 'both', can be used to get both databases or only one with 'WOS' or 'manual'.

Returns

bool

True if the Citation is for a journal


Citation.FullJournalName():

Returns the full name of the Citation's journal field.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Returns

str

The first full name given for the journal of the Citation (or the first name in the WOS list if multiple names exist), if there is not one then None is returned


Citation.addToDB(manualName=None, manaulDB='manualj9Abbreviations', invert=False):

Adds the journal of this Citation to the user created database of journals. This will cause isJournal() to return True for this Citation and all others with its journal.

Note: Requires the j9Abbreviations database file and will raise an error if it cannot be found.

Parameters

manualName : optional [str]

Default None, the full name of the journal to use. If not provided the full name will be the same as the abbreviation.

manaulDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

invert : optional [bool]

Default False, if True the journal will be removed instead of added


Record(object):

Record.__init__(inRecord, taglist=(), sFile='', sLine=0):

Class for full WOS records

It is meant to be immutable; many of the methods and attributes are evaluated when first called, not when the object is created, and the results are stored privately.

The record’s meta-data is stored in an ordered dictionary labeled by WOS tags. To access the raw data stored in the original record the Tag() method can be used. To access data that has been processed and cleaned the attributes named after the tags are used.

Customizations

The Record’s hashing and equality testing are based on the WOS number (the tag is ‘UT’, and also called the accession number). They are strings starting with 'WOS:' and followed by 15 or so numbers and letters, although both the length and character set are known to vary. The numbers are unique to each record so are used for comparisons. If a record is bad all equality checks return False.

When converted to a string the record's title is used, so for a record R, R.TI == R.title == str(R). Its representation uses the WOS number instead of the memory location.
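
These customizations can be pictured with a small stand-in class. MiniRecord, its constructor and the example WOS numbers are all hypothetical; the sketch only mirrors the behaviours described above and is not how Record itself is written.

```python
class MiniRecord:
    """Toy object that hashes and compares by its WOS number."""

    def __init__(self, wos_id, title, bad=False):
        self.UT = wos_id
        self.title = title
        self.bad = bad

    def __eq__(self, other):
        # Bad records never compare equal, not even to themselves.
        if not isinstance(other, MiniRecord) or self.bad or other.bad:
            return False
        return self.UT == other.UT

    def __hash__(self):
        return hash(self.UT)

    def __str__(self):
        return self.title

    def __repr__(self):
        return "<MiniRecord {}>".format(self.UT)

a = MiniRecord("WOS:EXAMPLE1", "A title")
b = MiniRecord("WOS:EXAMPLE1", "The same record, parsed again")
broken = MiniRecord("WOS:EXAMPLE2", "Half a record", bad=True)
```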

Attributes

When a record is created if the parsing of the WOS file failed it is marked as bad. The bad attribute is set to True and the error attribute is created to contain the exception object.

Generally, to get the information from a Record its attributes should be used. For a Record R, calling R.CR causes citations() from the tagProcessing module to be called on the contents of the raw 'CR' field. Then the result is saved and returned. In this case, a list of Citation objects is returned. You can also call R.citations to get the same effect, as each known field tag has a longer name (currently there are 61 field tags). These names are meant to make accessing tags more readable, and the mapping from tag to name can be found in the tagToFull dict. If a tag is known (in tagToFull) but not in the raw data, None is returned instead. Most tags when cleaned return a string or list of strings; the exact results can be found in the help for the particular function.

The attribute authors is also defined as a convenience and returns the same as 'AF' or, if that is not found, 'AU'.

__Init__

Records are generally created as collections in RecordCollections, and not as individual objects. If you wish to create one on its own it is possible; the arguments are as follows.

Parameters

inRecord: file stream, dict, str or itertools.chain

If it is a file stream the file must be open at the location of the first tag in the record, usually ‘PT’, and the file will be read until ‘ER’ is found, which indicates the end of the record in the file.

If a dict is passed the dictionary is used as the database of fields and tags, so each key is considered a WOS tag and each value a list of the lines of the original associated with the tag. This is the same form of dict that recordParser returns.

For a string the input must be the raw textual data of a single record in the WOS style, like the file stream it must start at the first tag and end in 'ER'.

itertools.chain is treated identically to a file stream and is used by RecordCollections.

sFile : optional [str]

Is the name of the file the raw data was in, by default it is blank. It is mostly used to make error messages more informative.

sLine : optional [int]

Is the line the record starts on in the raw data file. It is mostly used to make error messages more informative.


The Record class has the following methods:


Record.numAuthors():

Returns the number of authors of the record, i.e. len(self.authors)

Returns

int

The number of authors


Record.Tag(tag, clean=False):

Returns a list containing the raw data of the record associated with tag. Each line of the record is one string in the list.

Parameters

tag : str

tag can be a two character string corresponding to a WOS tag e.g. ‘J9’, the matching is case insensitive so ‘j9’ is the same as ‘J9’. Or it can be one of the full names for a tag with the mappings in fullToTag. If the string is not found in the original record or after being translated through fullToTag, None is returned.

clean : optional [bool]

Default False, if True the processed data will be returned instead of the raw data.

Returns

List [str]

Each string in the list is a line from the record associated with tag or None if not found.
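
The lookup rules described for tag can be sketched in plain Python. Everything here is an assumption made for illustration: FULL_TO_TAG is a tiny hypothetical excerpt standing in for fullToTag, and lookup_tag is not part of metaknowledge.

```python
# Hypothetical excerpt of a full-name-to-tag mapping (the real one is fullToTag)
FULL_TO_TAG = {'title': 'TI', 'year': 'PY'}


def lookup_tag(fields, tag):
    """Return the raw lines stored under tag, or None if the tag is absent."""
    key = tag.upper()  # matching is case insensitive
    if key in fields:
        return fields[key]
    # Otherwise try to translate a full name into its two character tag
    key = FULL_TO_TAG.get(tag.lower())
    return fields.get(key)


fields = {'TI': ['An example title'], 'J9': ['EXAMPLE J']}
by_tag = lookup_tag(fields, 'j9')      # lower case two character tag
by_name = lookup_tag(fields, 'title')  # full name
missing = lookup_tag(fields, 'PY')     # tag not present in the record
```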


Record.createCitation(multiCite=False):

Creates a citation string, using the same format as other WOS citations, for the Record by reading the relevant tags (year, J9, volume, beginningPage, DOI) and using it to create a Citation object.

Parameters

multiCite : optional [bool]

Default False, if True a tuple of Citations is returned, each having a different one of the record’s authors as the author

Returns

Citation

A Citation object containing a citation for the Record.


Record.TagsList(taglst, cleaned=False):

Returns a list of the results of Tag() for each tag in taglst; the returned list has the same order as taglst.

Parameters

taglst : List[str]

Each string in taglst can be a two character string corresponding to a WOS tag e.g. ‘J9’, the matching is case insensitive so ‘j9’ is the same as ‘J9’, or it can be one of the full names for a tag with the mappings in fullToTag. If the string is not found in the original record before or after being translated through fullToTag, None is used instead. Same as in Tag()

Then they are compiled into a list in the same order as taglst

Returns

List[str]

a list of the values for each tag in taglst, in the same order


Record.TagsDict(taglst, cleaned=False):

Returns a dict of the results of Tag(), with the elements of taglst as the keys and the results as the values.

Parameters

taglst : List[str]

Each string in taglst can be a two character string corresponding to a WOS tag e.g. ‘J9’, the matching is case insensitive so ‘j9’ is the same as ‘J9’. Or it can be one of the full names for a tag with the mappings in fullToTag. If the string is not found in the original record before or after being translated through fullToTag, None is used instead. Same as in Tag()

Returns

dict[str : List [str]]

a dictionary with keys as the original tags in taglst and the values as the results


Record.activeTags():

Returns a list of all the tags the original WOS record had. These are all the tags that Tag() will not return None for.

Returns

List[str]

a list of WOS tags in the Record


Record.writeRecord(infile):

Writes to infile the original contents of the Record. This is intended for use by RecordCollections to write to file. What is written to infile is bit for bit identical to the original record file (if utf-8 is used). No newline is inserted above the write but the last character is a newline.

Parameters

infile : file stream

An open utf-8 encoded file


Record.bibString(maxLength=1000, WOSMode=False, restrictedOutput=False, niceID=True):

Makes a string giving the Record as a bibTex entry. If the Record is of a journal article (PT J) the bibTex type is set to 'article', otherwise it is set to 'misc'. The ID of the entry is the WOS number and all the Record’s fields are given as entries with their long names.

Note: This is not meant to be used directly with LaTeX; none of the special characters have been escaped and there are a large number of unnecessary fields provided. niceID and maxLength have been provided to make conversions easier.

Note: Record entries that are lists have their values separated with the string ' and '

Parameters

maxLength : optional [int]

default 1000, the maximum length for a continuous string. Most bibTex implementations only allow strings to be up to 1000 characters long, so longer values are split into substrings that are joined with the native string concatenation (the '#' character).

WOSMode : optional [bool]

default False, if True the data produced will be unprocessed and use double curly braces. This is the style WOS produces bib files in, and the output mostly matches it.

restrictedOutput : optional [bool]

default False, if True the tags output will be limited to: 'AF', 'BF', 'ED', 'TI', 'SO', 'LA', 'NR', 'TC', 'Z9', 'PU', 'J9', 'PY', 'PD', 'VL', 'IS', 'SU', 'PG', 'DI', 'D2', and 'UT'

niceID : optional [bool]

default True, if True the ID used will be derived from the authors, publishing date and title, if False it will be the UT tag

Returns

str

The bibTex string of the Record
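
The splitting described for maxLength can be pictured with a short sketch. The helper split_bibtex_string is hypothetical and the use of double quotes around each piece is an assumption; the exact delimiters bibString() emits may differ.

```python
def split_bibtex_string(value, max_length=1000):
    """Break value into quoted pieces of at most max_length characters,
    joined with bibTex's '#' concatenation operator."""
    pieces = [value[i:i + max_length]
              for i in range(0, len(value), max_length)] or ['']
    return ' # '.join('"{}"'.format(piece) for piece in pieces)


short_field = split_bibtex_string('abc', max_length=5)
long_field = split_bibtex_string('abcdefghijkl', max_length=5)
```

A value shorter than the limit is left as a single string, while a longer one becomes several pieces that bibTex joins back together when it reads the entry.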


Record.bibTexType():

Returns the bibTex type corresponding to the record

Returns

str

The bibTex type string


RecordCollection(object):

RecordCollection.__init__(inCollection=None, name='', extension='', cached=False):

A container for a large number of individual WOS records.

RecordCollection provides ways of creating [Records](#Record) from an isi file, string, list of records or directory containing isi files.

If there are issues when a RecordCollection is being created it will be declared bad: bad will be set to True and it will then mostly return None or False. The attribute error contains the exception that occurred.

RecordCollections also possess an attribute name, also accessed with __repr__(). It is used to auto generate the names of files and can be set at creation. Note, though, that any operations that modify the RecordCollection’s contents will update the name to include what occurred.

Customizations

The Records are contained within a set and as such many of the set operations are defined: pop, union, in, etc. Records are also hashed with their WOS string so no duplication can occur. The comparison operators <, <=, >, >= are based strictly on the number of Records within the collection, while equality looks for an exact match on the Records.
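
The difference between the size based comparisons and equality can be seen in a small sketch. TinyCollection is a hypothetical stand-in that stores ID strings in a set; it only mimics the behaviour described above and is not how RecordCollection is implemented.

```python
class TinyCollection:
    """Hypothetical stand-in for a collection that keeps record IDs in a set."""

    def __init__(self, ids):
        self._records = set(ids)  # duplicates collapse, as in any set

    def __len__(self):
        return len(self._records)

    def __contains__(self, record_id):
        return record_id in self._records

    def __eq__(self, other):
        # Equality looks at the contents
        return self._records == other._records

    # The ordering operators only look at the sizes
    def __lt__(self, other):
        return len(self) < len(other)

    def __le__(self, other):
        return len(self) <= len(other)

    def __gt__(self, other):
        return len(self) > len(other)

    def __ge__(self, other):
        return len(self) >= len(other)


first = TinyCollection(['WOS:0001', 'WOS:0001', 'WOS:0002'])
second = TinyCollection(['WOS:0003', 'WOS:0004'])

size_of_first = len(first)                       # the duplicate ID is stored once
same_size = first <= second and first >= second  # sizes match
same_records = first == second                   # contents differ
```

Two collections can therefore satisfy both <= and >= without being equal.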

__Init__

inCollection is the object containing the information about the Records to be constructed. It can be an isi file, string, list of records or directory containing isi files.

Parameters

inCollection : optional [str] or None

the name of the source of WOS records. It can be skipped to produce an empty collection.

If a file is provided, it is first checked to see if it is a WOS file (the header is checked). Then records are read from it one by one until the ‘EF’ string is found, indicating the end of the file.

If a directory is provided, each file in the directory is first checked for the correct header and all those that have it are then read like individual files. The records are then collected into a single set in the RecordCollection.

name : optional [str]

The name of the RecordCollection, defaults to empty string. If left empty the name of the RecordCollection is set to the name of the file or directory used to create the collection. If provided the name is set to name.

extension : optional [str]

The suffix searched for when a directory is read for files. By default it is empty so all files are read.

cached : optional [bool]

Default False, if True and the inCollection is a directory (a string giving the path to a directory) then the initialized RecordCollection will be saved in the directory as a Python pickle with the suffix '.mkDirCache'. Then if the RecordCollection is initialized a second time it will be recovered from the file, which is much faster than reparsing every file in the directory.

metaknowledge saves the names of the parsed files as well as their last modification times and will check these when recreating the RecordCollection, so modifying existing files or adding new ones will result in the entire directory being reanalyzed and a new cache file being created. The extension given to __init__() is taken into account as well and each suffix is given its own cache.

Note: The pickle allows for arbitrary Python code execution, so only use caches that you trust.
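
The check against file names and modification times can be pictured roughly as follows. This is only a sketch of the idea; directory_fingerprint is a hypothetical helper and says nothing about how metaknowledge actually stores or compares its cache.

```python
import os
import tempfile


def directory_fingerprint(path, extension=''):
    """Map each matching file name in path to its last modification time."""
    fingerprint = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if name.endswith(extension) and os.path.isfile(full):
            fingerprint[name] = os.path.getmtime(full)
    return fingerprint


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.isi'), 'w') as f:
        f.write('first file')
    saved = directory_fingerprint(tmp, '.isi')  # what a cache would remember

    # Nothing changed, so a cache built from `saved` would still be usable
    unchanged = directory_fingerprint(tmp, '.isi') == saved

    # Adding a file changes the fingerprint, so the cache would be rebuilt
    with open(os.path.join(tmp, 'b.isi'), 'w') as f:
        f.write('second file')
    after_adding = directory_fingerprint(tmp, '.isi') == saved
```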


The RecordCollection class has the following methods:


RecordCollection.twoModeNetwork(tag1, tag2, directed=False, recordType=True, nodeCount=True, edgeWeight=True, stemmerTag1=None, stemmerTag2=None):

Creates a network of the objects found by two WOS tags tag1 and tag2, each node marked by which tag spawned it making the resultant graph bipartite.

A twoModeNetwork() looks at each Record in the RecordCollection and extracts its values for the tags given by tag1 and tag2, e.g. the 'WC' and 'LA' tags. Then for each object returned by each tag an edge is created between it and every object of the other tag. So the WOS defined subject tag 'WC' and language tag 'LA' will give a two-mode network showing the connections between subjects and languages. Each node will have an attribute called 'type' that gives the tag that created it, or both if both created it, e.g. the node 'English' would have the type attribute 'LA'.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

The directed parameter if True will cause the network to be directed with the first tag as the source and the second as the destination.

Parameters

tag1 : str

A two character WOS tag or one of the full names for a tag, the source of edges on the graph

tag2 : str

A two character WOS tag or one of the full names for a tag, the target of edges on the graph

directed : optional [bool]

Default False, if True the returned network is directed

nodeCount : optional [bool]

Default True, if True each node will have an attribute called “count” that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called “weight” that contains an int giving the number of times the two objects co-occurred.

stemmerTag1 : optional [func]

Default None, If stemmerTag1 is a callable object, basically a function or possibly a class, it will be called for the ID of every node given by tag1 in the graph, all IDs are strings.

For example: the function f = lambda x: x[0] if given as the stemmer will cause all IDs to be the first character of their unstemmed IDs. e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

stemmerTag2 : optional [func]

Default None, see stemmerTag1 as it is the same but for tag2

Returns

networkx Graph or networkx DiGraph

A networkx Graph with the objects of the tags tag1 and tag2 as nodes and their co-occurrences as edges.
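
The counting behind a two-mode network can be sketched without networkx. In this simplified illustration the records are plain dictionaries, nodes are keyed by (tag, value) pairs rather than carrying a 'type' attribute, and two_mode_counts is a hypothetical helper, not the library's implementation.

```python
from collections import Counter
from itertools import product


def two_mode_counts(records, tag1, tag2):
    """Count node occurrences and tag1-to-tag2 co-occurrences."""
    node_count = Counter()   # (tag, value) -> number of occurrences
    edge_weight = Counter()  # ((tag1, v1), (tag2, v2)) -> co-occurrences
    for rec in records:
        values1 = rec.get(tag1) or []
        values2 = rec.get(tag2) or []
        for value in values1:
            node_count[(tag1, value)] += 1
        for value in values2:
            node_count[(tag2, value)] += 1
        # Edges only ever join a tag1 value to a tag2 value
        for v1, v2 in product(values1, values2):
            edge_weight[((tag1, v1), (tag2, v2))] += 1
    return node_count, edge_weight


sample_records = [
    {'WC': ['Physics', 'Optics'], 'LA': ['English']},
    {'WC': ['Optics'], 'LA': ['English', 'German']},
]
mode_nodes, mode_edges = two_mode_counts(sample_records, 'WC', 'LA')
```

Because no edge joins two values of the same tag, the resulting structure is bipartite.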


RecordCollection.nModeNetwork(tags, recordType=True, nodeCount=True, edgeWeight=True, stemmer=None):

Creates a network of the objects found by all WOS tags in tags, each node is marked by which tag spawned it making the resultant graph n-partite.

A nModeNetwork() looks at each Record in the RecordCollection and extracts its values for the tags given by tags. Then for all objects returned an edge is created between them, regardless of their type. Each node will have an attribute called 'type' that gives the tag that created it, or both if both created it, e.g. if 'LA' were in tags the node 'English' would have the type attribute 'LA'.

For example if tags was set to ['CR', 'UT', 'LA'], a three mode network would be created, composed of a co-citation network from the 'CR' tag. Then each citation would also have edges to all the languages of Records that cited it and to the WOS numbers of those Records.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

Parameters

tags : list[str]

A list of two character WOS tags or full names for tags

nodeCount : optional [bool]

Default True, if True each node will have an attribute called 'count' that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called 'weight' that contains an int giving the number of times the two objects co-occurred.

stemmer : optional [func]

Default None, If stemmer is a callable object, basically a function or possibly a class, it will be called for the ID of every node in the graph, note that all IDs are strings.

For example: the function f = lambda x: x[0] if given as the stemmer will cause all IDs to be the first character of their unstemmed IDs. e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

Returns

networkx Graph

A networkx Graph with the objects of the tags tags as nodes and their co-occurrences as edges


RecordCollection.localCiteStats(pandasFriendly=False, keyType='citation'):

Returns a dict with all the citations in the CR field as keys and the number of times they occur as the values

Parameters

pandasFriendly : optional [bool]

default False, makes the output be a dict with two keys one 'Citations' is the citations the other is their occurrence counts as 'Counts'.

keyType : optional [str]

default 'citation', the type of key to use for the dictionary, the valid strings are 'citation', 'journal', 'year' or 'author'. If changed from 'citation' all citations matching the requested option will be contracted and their counts added together.

Returns

dict[str, int or Citation : int]

A dictionary with keys as given by keyType and integers giving their rates of occurrence in the collection
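
The two output shapes can be illustrated with a small sketch. Here citations are plain strings in a 'CR' list and cite_stats is a hypothetical helper; the real method works on Citation objects and also supports keyType.

```python
from collections import Counter


def cite_stats(records, pandas_friendly=False):
    """Count how often each citation string occurs across the records."""
    counts = Counter(cite for rec in records for cite in rec.get('CR', []))
    if pandas_friendly:
        # Two parallel columns instead of one key per citation
        keys = list(counts)
        return {'Citations': keys, 'Counts': [counts[k] for k in keys]}
    return dict(counts)


cited_records = [
    {'CR': ['Doe J, 1999, J EXAMPLE, V1, P1', 'Roe R, 2001, J EXAMPLE, V3, P7']},
    {'CR': ['Doe J, 1999, J EXAMPLE, V1, P1']},
]
stats = cite_stats(cited_records)
table = cite_stats(cited_records, pandas_friendly=True)
```

The second form has two equal length lists, which is the layout a data frame constructor typically expects.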


RecordCollection.localCitesOf(rec):

Takes in a Record, WOS string, citation string or Citation and returns a RecordCollection of all records that cite it.

Parameters

rec : Record, str or Citation

The object that is being cited

Returns

RecordCollection

A RecordCollection containing only those Records that cite rec


RecordCollection.citeFilter(keyString='', field='all', reverse=False, caseSensitive=False):

Filters Records by some string, keyString, in their citations and returns all Records with at least one citation possessing keyString in the field given by field.

Parameters

keyString : optional [str]

Default '', gives the string to be searched for, if it is blank then all citations with the specified field will be matched

field : optional [str]

Default 'all', gives the component of the citation to be looked at, it can be one of a few strings. The default is 'all' which will cause the entire original Citation to be searched. It can be used to search across fields, e.g. '1970, V2' is a valid keyString. The other options are:

  • 'author', searches the author field
  • 'year', searches the year field
  • 'journal', searches the journal field
  • 'V', searches the volume field
  • 'P', searches the page field
  • 'misc', searches all the remaining uncategorized information
  • 'anonymous', searches for anonymous Citations, keyString is not ignored
  • 'bad', searches for bad citations, keyString is not used

reverse : optional [bool]

Default False, being set to True causes all Records not matching the query to be returned

caseSensitive : optional [bool]

Default False, if True causes the search across the original to be case sensitive, only the 'all' option can be case sensitive


RecordCollection.pop():

Returns a random Record from the RecordCollection, the Record is deleted from the collection, use peak() for nondestructive, but slower, access

Returns

Record

A random Record that has been removed from the collection


RecordCollection.peak():

Returns a random Record from the RecordCollection, the Record is kept in the collection, use pop() for faster destructive access.

Returns

Record

A random Record in the collection


RecordCollection.dropWOS(wosNum):

Removes the Record with WOS number (ID number) wosNum from the collection. If it cannot be found nothing happens.

Parameters

wosNum : str

wosNum is the WOS number of the Record to be dropped. wosNum must begin with 'WOS:' or a ValueError is raised.


RecordCollection.addRec(Rec):

Adds a Record or Records to the collection.

Parameters

Rec : Record or iterable[Record]

A Record or some iterable containing Records to add


RecordCollection.WOS(wosNum, drop=False):

Gets the Record from the collection by its WOS number (ID number) wosNum.

Parameters

wosNum : str

wosNum is the WOS number of the Record to be extracted. wosNum must begin with 'WOS:' or a ValueError is raised.

drop : optional [bool]

Default False. If True the Record is dropped from the collection after being extracted, i.e. if False WOS() acts like peak(), if True it acts like pop()

Returns

metaknowledge.Record

The Record whose WOS number is wosNum


RecordCollection.BadRecords():

Creates a RecordCollection containing all the Records which have their bad attribute set to True, i.e. all those removed by dropBadRecords().

Returns

RecordCollection

All the bad Records in one collection


RecordCollection.dropBadRecords():

Removes all Records with bad attribute True from the collection, i.e. drop all those returned by BadRecords().


RecordCollection.dropNonJournals(ptVal='J', dropBad=True, invert=False):

Drops the non journal type Records from the collection, this is done by checking ptVal against the PT tag

Parameters

ptVal : optional [str]

Default 'J', The value of the PT tag to be kept, default is 'J' the journal tag, other tags can be substituted.

dropBad : optional [bool]

Default True, if True bad Records will be dropped as well as those that are not journal entries

invert : optional [bool]

Default False, Set True to drop journals (or the PT tag given by ptVal) instead of keeping them. Note, it still drops bad Records if dropBad is True


RecordCollection.writeFile(fname=None):

Writes the RecordCollection to a file, the written file’s format is identical to those downloaded from WOS. The order of Records written is random.

Parameters

fname : optional [str]

Default None, if given the output file will be written to fname, if None the RecordCollection’s name’s first 200 characters are used with the suffix .isi


RecordCollection.writeCSV(fname=None, onlyTheseTags=None, numAuthors=True, longNames=False, firstTags=None, csvDelimiter=',', csvQuote='"', listDelimiter='|'):

Writes all the Records from the collection into a csv file with each row a record and each column a tag.

Parameters

fname : optional [str]

Default None, the name of the file to write to, if None it uses the collection’s name suffixed by .csv.

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc) only the tags in onlyTheseTags will be used, if not given then all tags in the records are given.

If you want to use all known tags pass metaknowledge.knownTagsList.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

firstTags : optional [iterable]

Default None, if None the iterable ['UT', 'PT', 'TI', 'AF', 'CR'] is used. The tags given by the iterable are the first ones in the csv in the order given.

Note if tags are in firstTags but not in onlyTheseTags, onlyTheseTags will override firstTags

csvDelimiter : optional [str]

Default ',', the delimiter used for the cells of the csv file.

csvQuote : optional [str]

Default '"', the quote character used for the csv.

listDelimiter : optional [str]

Default '|', the delimiter used between values of the same cell if the tag for that record has multiple outputs.
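
How the three delimiters interact can be shown with Python's csv module. This is a sketch of the layout only: records_to_csv and the sample records are hypothetical, the leading tags are a shortened stand-in for firstTags, and placing the remaining tags in sorted order is an assumption.

```python
import csv
import io


def records_to_csv(records, first_tags=('UT', 'TI'), list_delimiter='|'):
    """Write one row per record and one column per tag, as text."""
    all_tags = {tag for rec in records for tag in rec}
    # Leading tags come first, the remaining tags follow in sorted order
    header = [tag for tag in first_tags if tag in all_tags]
    header += sorted(all_tags - set(header))

    out = io.StringIO()
    writer = csv.writer(out, delimiter=',', quotechar='"', lineterminator='\n')
    writer.writerow(header)
    for rec in records:
        # Multiple values of one tag share a cell, joined by list_delimiter
        writer.writerow([list_delimiter.join(rec.get(tag, [])) for tag in header])
    return out.getvalue()


csv_records = [
    {'UT': ['WOS:0001'], 'TI': ['An example title'],
     'AF': ['Doe, Jane', 'Roe, Richard']},
    {'UT': ['WOS:0002'], 'TI': ['Another title']},
]
csv_text = records_to_csv(csv_records)
csv_lines = csv_text.splitlines()
```

The author cell is quoted because the names contain the cell delimiter, while the list delimiter keeps the two authors distinguishable inside that one cell.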


RecordCollection.writeBib(fname=None, maxStringLength=1000, wosMode=False, reducedOutput=False, niceIDs=True):

Writes a bibTex entry to fname for each Record in the collection.

If the Record is of a journal article (PT J) the bibTex type is set to 'article', otherwise it is set to 'misc'. The ID of the entry is the WOS number and all the Record’s fields are given as entries with their long names.

Note: This is not meant to be used directly with LaTeX; none of the special characters have been escaped and there are a large number of unnecessary fields provided. niceIDs and maxStringLength have been provided only to make conversions easier.

Note Record entries that are lists have their values separated with the string ' and ', as this is the way bibTex understands

Parameters

fname : optional [str]

Default None, the name of the file to be written. If not given one will be derived from the collection and the file will be written to '.' (the current directory).

maxStringLength : optional [int]

Default 1000, the maximum length for a continuous string. Most bibTex implementations only allow strings to be up to 1000 characters long, so longer values are split into substrings that are joined with the native string concatenation (the '#' character).

wosMode : optional [bool]

Default False, if True the data produced will be unprocessed and use double curly braces. This is the style WOS produces bib files in, and the output mostly matches it.

reducedOutput : optional [bool]

Default False, if True the tags output will be limited to: 'AF', 'BF', 'ED', 'TI', 'SO', 'LA', 'NR', 'TC', 'Z9', 'PU', 'J9', 'PY', 'PD', 'VL', 'IS', 'SU', 'PG', 'DI', 'D2', and 'UT'

niceIDs : optional [bool]

Default True, if True the IDs used will be derived from the authors, publishing date and title, if False it will be the UT tag


RecordCollection.makeDict(onlyTheseTags=None, longNames=False, cleanedVal=True, numAuthors=True):

Returns a dict with each key a tag and the values being lists of the values for each of the Records in the collection, None is given when there is no value and they are in the same order across each tag.

When used with pandas: pandas.DataFrame(RC.makeDict()) returns a data frame with each column a tag and each row a Record.

Parameters

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc) only the tags in onlyTheseTags will be used, if not given then all tags in the records are given.

If you want to use all known tags pass metaknowledge.knownTagsList.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

cleanedVal : optional [bool]

Default True, if True the processed values for each Record’s field will be provided, otherwise the raw values are given.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.
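
The column-per-tag layout can be sketched as below. The records are plain dictionaries and make_dict is a hypothetical helper written for this example; it ignores longNames, cleanedVal and numAuthors.

```python
def make_dict(records, only_these_tags=None):
    """Return {tag: [value for each record]}, using None where a tag is missing."""
    if only_these_tags is not None:
        tags = list(only_these_tags)
    else:
        tags = sorted({tag for rec in records for tag in rec})
    # Every list has one entry per record, in the same record order
    return {tag: [rec.get(tag) for rec in records] for tag in tags}


dict_records = [
    {'TI': 'First title', 'PY': 1999},
    {'TI': 'Second title'},
]
columns = make_dict(dict_records)
only_titles = make_dict(dict_records, only_these_tags=['TI'])
```

Because every list has the same length and order, a dictionary of this shape can be handed to pandas.DataFrame() directly.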


RecordCollection.coAuthNetwork(detailedInfo=False, weighted=True, dropNonJournals=False, count=True):

Creates a coauthorship network for the RecordCollection.

Parameters

detailedInfo : optional [bool or iterable[WOS tag Strings]]

Default False, if True all nodes will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['PY', 'TI', 'SO', 'VL', 'BP'].

If detailedInfo is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attributes.

For each of the selected tags an attribute will be added to the node using the values of those tags on the first Record encountered. Warning: iterating over RecordCollection objects is not deterministic, so the first Record will not always be the same between runs. The node will be given attributes with the names of the WOS tags for each of the selected tags. The attributes will contain strings of the values (with commas removed); if multiple values are encountered they will be comma separated.

Note: detailedInfo is not identical to the detailedCore argument of RecordCollection.coCiteNetwork() or RecordCollection.citationNetwork()

weighted : optional [bool]

Default True, whether the edges are weighted. If True the edges are weighted by the number of co-authorships.

dropNonJournals : optional [bool]

Default False, whether to drop authors from non-journals

count : optional [bool]

Default True, causes the number of occurrences of a node to be counted

Returns

Networkx Graph

A networkx graph with author names as nodes and collaborations as edges.


RecordCollection.coCiteNetwork(dropAnon=True, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, keyWords=None, detailedCore=None, coreOnly=False, expandedCore=False):

Creates a co-citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default True, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, if True an extra piece of information is stored with each node. The extra information is determined by nodeType.

fullInfo : optional [bool]

default False, if True the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

detailedCore : optional [bool or iterable[WOS tag Strings]]

default False, if True and the nodeType is 'full', all nodes from the core (the Citations of records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. Also a second attribute called inCore is created for all nodes; it is a boolean describing if the node is in the core or not.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.coAuthNetwork()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx Graph

A networkx graph with hashes as ID and co-citation as edges


RecordCollection.citationNetwork(dropAnon=True, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, directed=True, keyWords=None, detailedCore=None, coreOnly=False, expandedCore=False):

Creates a citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default True, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, whether an extra piece of information is stored with each node.

fullInfo : optional [bool]

default False, whether the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

directed : optional [bool]

Determines if the output graph is directed, default True

detailedCore : optional [bool or iterable[WOS tag Strings]]

default False, if True and the nodeType is 'full', all nodes from the core (the Citations of records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. Also a second attribute called inCore is created for all nodes; it is a boolean describing if the node is in the core or not.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.coAuthNetwork()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx DiGraph or Networkx Graph

See directed for explanation of returned type

A networkx digraph with hashes as ID and citations as edges
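
The effect of directed on the edges can be sketched with plain dictionaries. The record layout ('id' and 'cites' keys) and the helper citation_edges are hypothetical, and merging the two directions into one weighted edge in the undirected case is an assumption made for this illustration.

```python
from collections import Counter


def citation_edges(records, directed=True):
    """Count edges from each citing record to the works it cites."""
    edges = Counter()
    for rec in records:
        for cited in rec['cites']:
            edge = (rec['id'], cited)  # citing -> cited
            if not directed:
                # Without direction (A, B) and (B, A) are the same edge
                edge = tuple(sorted(edge))
            edges[edge] += 1
    return edges


citing_records = [
    {'id': 'A', 'cites': ['B', 'C']},
    {'id': 'B', 'cites': ['A']},
]
directed_edges = citation_edges(citing_records)
undirected_edges = citation_edges(citing_records, directed=False)
```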


RecordCollection.yearSplit(startYear, endYear, dropMissingYears=True):

Creates a RecordCollection of Records from the years between startYear and endYear inclusive.

Parameters

startYear : int

The smallest year to be included in the returned RecordCollection

endYear : int

The largest year to be included in the returned RecordCollection

dropMissingYears : optional [bool]

Default True, if True Records with missing years will be dropped. If False a TypeError exception will be raised

Returns

RecordCollection

A RecordCollection of Records from startYear to endYear
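
The inclusive bounds and the handling of missing years can be sketched as follows. Records are plain dictionaries with the year stored as an int under 'PY', and year_split is a hypothetical helper that returns a list instead of a RecordCollection.

```python
def year_split(records, start_year, end_year, drop_missing_years=True):
    """Keep the records whose year lies in [start_year, end_year]."""
    kept = []
    for rec in records:
        year = rec.get('PY')
        if year is None:
            if drop_missing_years:
                continue  # silently skip records without a year
            raise TypeError('record {} has no year'.format(rec.get('UT')))
        if start_year <= year <= end_year:  # both ends are included
            kept.append(rec)
    return kept


dated_records = [
    {'UT': 'WOS:1', 'PY': 1999},
    {'UT': 'WOS:2', 'PY': 2005},
    {'UT': 'WOS:3'},
]
selected = year_split(dated_records, 1999, 2004)
```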


RecordCollection.oneModeNetwork(mode, nodeCount=True, edgeWeight=True, stemmer=None):

Creates a network of the objects found by one WOS tag mode.

A oneModeNetwork() looks at each Record in the RecordCollection and extracts its values for the tag given by mode, e.g. the 'AF' tag. Then if multiple are returned an edge is created between them. So in the case of the author tag 'AF' a co-authorship network is created.

The number of times each object occurs is counted if nodeCount is True and the edges count the number of co-occurrences if edgeWeight is True. Both are True by default.

Note: Do not use this for the construction of co-citation networks; use RecordCollection.coCiteNetwork(), as it is more accurate and has more options.

Parameters

mode : str

A two character WOS tag or one of the full names for a tag

nodeCount : optional [bool]

Default True, if True each node will have an attribute called “count” that contains an int giving the number of times the object occurred.

edgeWeight : optional [bool]

Default True, if True each edge will have an attribute called “weight” that contains an int giving the number of times the two objects co-occurred.

stemmer : optional [func]

Default None, if stemmer is a callable object, basically a function or possibly a class, it will be called on the ID of every node in the graph; all IDs are strings. For example:

The function f = lambda x: x[0], if given as the stemmer, will cause all IDs to be the first character of their unstemmed IDs, e.g. the title 'Goos-Hanchen and Imbert-Fedorov shifts for leaky guided modes' will create the node 'G'.

Returns

networkx Graph

A networkx Graph with the objects of the tag mode as nodes and their co-occurrences as edges
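The counting behind a one-mode network can be sketched with the standard library alone. This is a minimal illustration, not the library's code: records are assumed to be dicts mapping a tag to a list of values, one_mode_counts is a hypothetical name, and collapsing repeated values within a single record is an assumption of the sketch.

```python
from collections import Counter
from itertools import combinations


def one_mode_counts(records, mode, stemmer=None):
    """Count node occurrences and pairwise co-occurrences for one tag."""
    node_count = Counter()
    edge_weight = Counter()
    for record in records:
        values = record.get(mode, [])
        if stemmer is not None:
            values = [stemmer(v) for v in values]  # rewrite every ID
        values = sorted(set(values))  # assumption: one node per distinct value
        node_count.update(values)
        # every unordered pair found in the same record is one co-occurrence
        edge_weight.update(combinations(values, 2))
    return node_count, edge_weight


# made-up author lists, used only for illustration
records = [
    {"AF": ["Smith, Jane", "Lee, Min"]},
    {"AF": ["Smith, Jane", "Lee, Min", "Okafor, Chidi"]},
    {"AF": ["Okafor, Chidi"]},
]
nodes, edges = one_mode_counts(records, "AF")
print(nodes["Smith, Jane"], edges[("Lee, Min", "Smith, Jane")])
```

The two counters correspond to the “count” node attribute and the “weight” edge attribute. Passing stemmer=lambda x: x[0] would merge all IDs that share a first character.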


contour

Two functions based on matplotlib for generating nicer looking graphs

This is the only module that depends on anything besides networkx, as it depends on numpy, scipy and matplotlib as well.

The contour module provides the following functions:


contour.graphDensityContourPlot(G, iters=50, layout=None, layoutScaleFactor=1, overlay=False, nodeSize=10, axisSamples=100, blurringFactor=0.1, contours=15, graphType='coloured'):

Creates a 3D plot giving the density of nodes on a 2D layout as a surface in 3 dimensions.

Most of the options are for tweaking the final appearance. layout and layoutScaleFactor allow a pre-laid-out graph to be provided. If a layout is not provided, networkx.spring_layout() is used with iters iterations. Then, once the graph has been laid out, a grid of axisSamples cells by axisSamples cells is overlaid and the number of nodes in each cell is determined. This forms a surface in 3-space, to which a Gaussian blur is applied with a sigma of blurringFactor. The surface is then plotted.

Parameters

G : networkx Graph

The graph to be plotted

iters : optional [int]

Default 50, the number of iterations for the spring layout if layout is not provided.

layout : optional [networkx layout dictionary]

Default None, if provided will be used as a layout of the graph. The maximum distance from the origin along any axis must also be given as layoutScaleFactor, which is by default 1.

layoutScaleFactor : optional [double]

Default 1, The maximum distance from the origin allowed along any axis given by layout, i.e. the layout must fit in a square centered at the origin with side lengths 2 * layoutScaleFactor

overlay : optional [bool]

Default False, if True the graph will be plotted on the X-Y plane at Z = 0.

nodeSize : optional [double]

Default 10, the size of the nodes drawn in the overlay

axisSamples : optional [int]

Default 100, the number of cells used along each axis for sampling. A larger number will mean a lower average density.

blurringFactor : optional [double]

Default 0.1, the sigma value used for smoothing the surface density. The higher this number the smoother the surface.

contours : optional [int]

Default 15, the number of different heights drawn. If this number is low the resultant image will look very banded. It is recommended this be raised above 50 if you want your images to look good. Warning: this will make them much slower to generate and interact with.

graphType : optional [str]

Default 'coloured', if 'coloured' the image will have a density-based colourization applied. The only other option is 'solid', which removes the colourization.
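The grid-counting step that turns a layout into a density surface can be sketched in plain Python. This is an illustration under stated assumptions, not the module's code: density_grid is a hypothetical name, the layout is a dict of node to (x, y) as networkx layouts are, and the Gaussian blur and the plotting are left out.

```python
def density_grid(layout, layout_scale_factor=1.0, axis_samples=100):
    """Count the nodes in each cell of an axis_samples x axis_samples grid."""
    grid = [[0] * axis_samples for _ in range(axis_samples)]
    side = 2.0 * layout_scale_factor  # layout fits in [-scale, scale] per axis
    for x, y in layout.values():
        # map each coordinate from [-scale, scale] to a cell index
        col = int((x + layout_scale_factor) / side * axis_samples)
        row = int((y + layout_scale_factor) / side * axis_samples)
        # clamp so points on the upper boundary land in the last cell
        col = min(max(col, 0), axis_samples - 1)
        row = min(max(row, 0), axis_samples - 1)
        grid[row][col] += 1
    return grid


# a made-up four-node layout on a coarse 4 x 4 grid
layout = {"a": (-1.0, -1.0), "b": (0.0, 0.0), "c": (0.05, 0.05), "d": (1.0, 1.0)}
grid = density_grid(layout, layout_scale_factor=1.0, axis_samples=4)
for row in grid:
    print(row)
```

The sketch also shows why a larger axisSamples lowers the average density: the same nodes are spread over more cells.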


contour.quickVisual(G, showLabel=False):

Makes a simple matplotlib figure and displays it, with each node coloured by its type. You can add labels with showLabel.

Parameters

showLabel : optional [bool]

Default False, if True labels will be added to the nodes giving their IDs.


journalAbbreviations

This module handles the abbreviations, known as J29 abbreviations and given by the J9 tag in records, for journal titles that WOS employs in citations.

The citations provided by WOS use abbreviated journal titles instead of the full names. The full list of abbreviations can be found on a series of pages divided by letter, starting at images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html. The function updatej9DB() is used to scrape and parse the pages; it must be run without error before the other features can be used. metaknowledge will try running it once during the installation, but it could easily have been cancelled or failed, so it is best to run it manually. updatej9DB() creates a database in the metaknowledge install directory that gives each abbreviation and the titles it corresponds to; note there can be many titles for one abbreviation. The database can be accessed as a dictionary with getj9dict().

The other functions of the module are for manually adding and removing abbreviations from the database. It is recommended that this be done with the command-line tool metaknowledge, unless you know what you are doing.

The journalAbbreviations module provides the following functions:


journalAbbreviations.getj9dict(dbname='j9Abbreviations', manualDB='manualj9Abbreviations', returnDict='both'):

Returns the dictionary of journal abbreviations mapping to a list of the associated journal names. By default the local database is used. The database is in the file dbname in the same directory as this source file

Parameters

dbname : optional [str]

The name of the downloaded database file, the default is determined at run time. It is recommended that this remain untouched.

manualDB : optional [str]

The name of the manually created database file, the default is determined at run time. It is recommended that this remain untouched.

returnDict : optional [str]

Default 'both', can be used to get both databases or only one with 'WOS' or 'manual'.


journalAbbreviations.addToDB(abbr=None, dbname='manualj9Abbreviations'):

Adds abbr to the database of journals. This database is kept separate from the one scraped from WOS and supersedes it. By default it is stored alongside the WOS one, under the name given by metaknowledge.journalAbbreviations.manaulDBname. To create an empty database run addToDB() without an abbr argument.

Parameters

abbr : optional [str or dict[str : str]]

The journal abbreviation to be added to the database. It can either be a single string, in which case that string will be added with itself as the full name, or a dict with the abbreviations as keys and their names as strings; use pipes ('|') to separate multiple names. Note, if the empty string is given as a name the abbreviation will be considered manually excluded, i.e. as having had excludeFromDB() run on it.

dbname : optional [str]

The name of the database file, default is metaknowledge.journalAbbreviations.manaulDBname.
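The accepted shapes of the abbr argument can be illustrated with a small helper. This is a sketch only: normalize_abbreviations is a hypothetical name, and representing an excluded abbreviation as an empty list is an assumption, since how the database stores entries is not described here.

```python
def normalize_abbreviations(abbr):
    """Turn a str or dict argument into {abbreviation: [full names]}."""
    if isinstance(abbr, str):
        # a bare string is stored with itself as its full name
        return {abbr: [abbr]}
    normalized = {}
    for short, names in abbr.items():
        if names == "":
            # assumption: an empty name marks the abbreviation as excluded
            normalized[short] = []
        else:
            # pipes separate multiple full names
            normalized[short] = names.split("|")
    return normalized


# made-up abbreviations, used only for illustration
print(normalize_abbreviations("J FOO"))
print(normalize_abbreviations({"J FOO": "Journal of Foo|Foo Journal", "J BAR": ""}))
```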


journalAbbreviations.excludeFromDB(abbr=None, dbname='manualj9Abbreviations'):

Marks abbr to be excluded from the database of journals. This database is kept separate from the one scraped from WOS and supersedes it. By default it is stored alongside the WOS one, under the name given by metaknowledge.journalAbbreviations.manaulDBname. To create an empty database run addToDB() without an abbr argument.

Parameters

abbr : optional [str or tuple[str] or list[str]]

The journal abbreviation to be excluded from the database. It can either be a single string, in which case that string will be excluded, or a list/tuple of strings giving the abbreviations.

dbname : optional [str]

The name of the database file, default is metaknowledge.journalAbbreviations.manaulDBname.


journalAbbreviations.updatej9DB(dbname='j9Abbreviations', saveRawHTML=False):

Updates the database of Journal Title Abbreviations. Requires an internet connection. The database is saved relative to the source file, not the working directory.

Parameters

dbname : optional [str]

The name of the database file, default is “j9Abbreviations.db”

saveRawHTML : optional [bool]

Determines if the original HTML of the pages is stored, default False. If True they are saved in a directory inside j9Raws beginning with today's date.


tagProcessing

The functions used by metaknowledge to handle WOS tags

Each of the functions in tagProcessing is named after the long name of its tag and is responsible for taking the raw data from a WOS file and returning usable information. The raw data is a list containing each line associated with the tag as a string. So the section of a record:

TI The Motion Behind the Symbols: A Vital Role for Dynamism in the
  Conceptualization of Limits and Continuity in Expert Mathematics

would be the list:

["The Motion Behind the Symbols: A Vital Role for Dynamism in the",
"Conceptualization of Limits and Continuity in Expert Mathematics"
]

The function to process it is called title(), which is determined by looking up the tag in the tagProcessing.tagToFunc dictionary, a dictionary mapping WOS tag strings to their functions. For a simple mapping of tags to their long strings use metaknowledge.tagToFull.
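The path from raw lines to a processed value can be sketched as follows. This is an illustration only: group_by_tag, the title() shown here and the tag_to_func dictionary are hypothetical stand-ins for the library's internals, and joining the title lines with a single space is an assumption.

```python
def group_by_tag(lines):
    """Group the lines of one record into {tag: [raw line values]}."""
    fields = {}
    current = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if not line[0].isspace():
            # a line starting in the first column opens a new two-character tag
            current = line[:2]
            fields[current] = [line[3:].strip()]
        elif current is not None:
            # an indented line continues the value of the current tag
            fields[current].append(line.strip())
    return fields


def title(val):
    """Join the raw title lines into one string (assumed behaviour)."""
    return " ".join(val)


# hypothetical stand-in for tagProcessing.tagToFunc
tag_to_func = {"TI": title}

raw = [
    "TI The Motion Behind the Symbols: A Vital Role for Dynamism in the",
    "   Conceptualization of Limits and Continuity in Expert Mathematics",
    "LA English",
]
fields = group_by_tag(raw)
processed = {tag: tag_to_func[tag](val) for tag, val in fields.items() if tag in tag_to_func}
print(fields["TI"])
print(processed["TI"])
```

Tags with no entry in the lookup dictionary are simply left unprocessed in this sketch.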

The objects tagToFullDict, fullToTagDict, tagNameConverterDict, tagsAndNameSet and knownTagsList are also provided. They are the objects used by metaknowledge to keep track of tag names. tagToFullDict and fullToTagDict are dictionaries that convert from tags to full names and vice versa, respectively, while tagNameConverterDict goes both ways. tagsAndNameSet is a set of all full names and tags, while knownTagsList contains only tags and is a list. For a less raw interface look at the functions provided by the base metaknowledge module, e.g. tagToFull().

The full list of tags and their long names is provided below followed by the descriptions of the functions, they are ordered by their occurrence in WOS records:

tag Name
'PT' pubType
'AF' authorsFull
'GP' group
'BE' editedBy
'AU' authorsShort
'BA' bookAuthor
'BF' bookAuthorFull
'CA' groupName
'ED' editors
'TI' title
'SO' journal
'SE' seriesTitle
'BS' seriesSubtitle
'LA' language
'DT' docType
'CT' confTitle
'CY' confDate
'HO' confHost
'CL' confLocation
'SP' confSponsors
'DE' authKeyWords
'ID' keyWords
'AB' abstract
'C1' authAddress
'RP' reprintAddress
'EM' email
'RI' ResearcherIDnumber
'OI' orcID
'FU' funding
'FX' fundingText
'CR' citations
'NR' citedRefsCount
'TC' wosTimesCited
'Z9' totalTimesCited
'PU' publisher
'PI' publisherCity
'PA' publisherAddress
'SC' subjectCategory
'SN' ISSN
'EI' eISSN
'BN' ISBN
'J9' j9
'JI' isoAbbreviation
'PD' month
'PY' year
'VL' volume
'IS' issue
'PN' partNumber
'SU' supplement
'SI' specialIssue
'MA' meetingAbstract
'BP' beginningPage
'EP' endingPage
'AR' articleNumber
'PG' pageCount
'WC' subjects
'DI' DOI
'D2' bookDOI
'GA' documentDeliveryNumber
'UT' wosString
'PM' pubMedID
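How the lookup objects relate to one another can be shown with a few rows of the table. This is a sketch, not the library's definitions: the lowercase variable names are hypothetical stand-ins for tagToFullDict, fullToTagDict, tagNameConverterDict, tagsAndNameSet and knownTagsList.

```python
# a few rows copied from the table above, in table order
tag_to_full = {"AF": "authorsFull", "TI": "title", "J9": "j9", "PY": "year"}

# invert the mapping to go from full names back to tags
full_to_tag = {full: tag for tag, full in tag_to_full.items()}

# a two-way converter holds both directions in one dictionary
tag_name_converter = dict(tag_to_full)
tag_name_converter.update(full_to_tag)

tags_and_names = set(tag_name_converter)  # every tag and every full name
known_tags = list(tag_to_full)            # tags only

print(tag_name_converter["TI"], tag_name_converter["title"])
print(known_tags)
```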

tagProcessing.pubType(val):

The PT Tag

extracts the type of publication as a character: conference, book, journal, book in series, or patent

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

A string


tagProcessing.authorsFull(val):

The AF Tag

extracts a list of authors' full names

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of authors' names


tagProcessing.group(val):

The GP Tag

extracts the group associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The name of the group


tagProcessing.editedBy(val):

The BE Tag

extracts a list of the editors of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of editors


tagProcessing.authorsShort(val):

The AU Tag

extracts a list of authors' shortened names

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of shortened authors' names


tagProcessing.bookAuthor(val):

The BA Tag

extracts a list of the short names of the authors of a book Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of shortened authors' names


tagProcessing.bookAuthorFull(val):

The BF Tag

extracts a list of the long names of the authors of a book Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of authors' names


tagProcessing.groupName(val):

The CA Tag

extracts the name of the group associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The group’s name


tagProcessing.editors(val):

Needs Work

currently not well understood, returns val


tagProcessing.title(val):

The TI Tag

extracts the title of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the record


tagProcessing.journal(val):

The SO Tag

extracts the full name of the publication and normalizes it to uppercase

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The name of the journal


tagProcessing.seriesTitle(val):

The SE Tag

extracts the title of the series the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the series


tagProcessing.seriesSubtitle(val):

The BS Tag

extracts the subtitle of the series the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The subtitle of the series


tagProcessing.language(val):

The LA Tag

extracts the languages of the Record as a string with languages separated by ‘, ‘, usually there is only one language

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The language(s) of the record


tagProcessing.docType(val):

The DT Tag

extracts the type of document the Record contains

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The type of the Record


tagProcessing.confTitle(val):

The CT Tag

extracts the title of the conference associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The title of the conference


tagProcessing.confDate(val):

The CY Tag

extracts the date string of the conference associated with the Record, the date is not normalized

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The date of the conference


tagProcessing.confHost(val):

The HO Tag

extracts the host of the conference

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The host


tagProcessing.confLocation(val):

The CL Tag

extracts the string giving the conference’s location

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The conference's address


tagProcessing.confSponsors(val):

The SP Tag

extracts a list of sponsors for the conference associated with the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The list of sponsors


tagProcessing.authKeyWords(val):

The DE Tag

extracts the keywords assigned by the author of the Record. The WOS description is:

Author keywords are included in records of articles from 1991 forward. They are also included in conference proceedings records.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The list of keywords


tagProcessing.keyWords(val):

The ID Tag

extracts the WOS keywords of the Record. The WOS description is:

KeyWords Plus are index terms created by Thomson Reuters from significant, frequently occurring words in the titles of an article's cited references.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The keyWords list


tagProcessing.abstract(val):

The AB Tag

returns the abstract of the record, with newlines hopefully in the correct places

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The abstract


tagProcessing.authAddress(val):

The C1 Tag

extracts the addresses of the authors as given by WOS. Warning: the mapping of author to address is not very good and is given in multiple ways.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of addresses


tagProcessing.reprintAddress(val):

The RP Tag

extracts the reprint address string

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The reprint address


tagProcessing.email(val):

The EM Tag

extracts a list of emails given by the authors of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of emails


tagProcessing.ResearcherIDnumber(val):

The RI Tag

extracts a list of the researcher IDs of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The list of the researcher IDs


tagProcessing.orcID(val):

The OI Tag

extracts a list of orc IDs of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The orc ID


tagProcessing.funding(val):

The FU Tag

extracts a list of the groups funding the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of funding groups


tagProcessing.fundingText(val):

The FX Tag

extracts a string of the funding thanks

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The funding thank-you


tagProcessing.citations(val):

The CR Tag

extracts a list of all the citations in the record; the citations are instances of the metaknowledge.Citation class.

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[metaknowledge.Citation]

A list of Citations


tagProcessing.citedRefsCount(val):

The NR Tag

extracts the number of citations, i.e. the length of the CR list

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The number of CRs


tagProcessing.wosTimesCited(val):

The TC Tag

extracts the number of times the Record has been cited by records in WOS

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The number of times the Record has been cited


tagProcessing.totalTimesCited(val):

The Z9 Tag

extracts the total number of citations of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The total number of citations


tagProcessing.publisher(val):

The PU Tag

extracts the publisher of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The publisher


tagProcessing.publisherCity(val):

The PI Tag

extracts the city the publisher is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The city of the publisher


tagProcessing.publisherAddress(val):

The PA Tag

extracts the publisher's address

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The publisher address


tagProcessing.subjectCategory(val):

The SC Tag

extracts a list of the subjects associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

A list of the subjects associated with the Record


tagProcessing.ISSN(val):

The SN Tag

extracts the ISSN of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The ISSN string


tagProcessing.eISSN(val):

The EI Tag

extracts the EISSN of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The EISSN string


tagProcessing.ISBN(val):

The BN Tag

extracts a list of ISBNs associated with the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

list

The ISBNs


tagProcessing.j9(val):

The J9 Tag

extracts the J9 (29-Character Source Abbreviation) of the publication

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The 29-Character Source Abbreviation


tagProcessing.isoAbbreviation(val):

The JI Tag

extracts the ISO abbreviation of the journal

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The ISO abbreviation of the journal


tagProcessing.month(val):

The PD Tag

extracts the month the record was published in as an int with January as 1, February 2, …

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

An integer giving the month
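The numbering convention (January as 1) can be illustrated with a small lookup. This sketch assumes the raw PD value begins with a three-letter English month abbreviation such as "MAR 15"; month_from_pd and MONTHS are hypothetical names, and how the real function treats other values is not described here.

```python
# January is 1, February is 2, and so on
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def month_from_pd(val):
    """Return the month number for a raw PD value (a list of lines)."""
    # assumption: the first line starts with a three-letter month abbreviation
    prefix = val[0].strip()[:3].upper()
    return MONTHS[prefix]  # raises KeyError for anything that is not a month


print(month_from_pd(["MAR 15"]))
```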


tagProcessing.year(val):

The PY Tag

extracts the year the record was published in as an int

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The year


tagProcessing.volume(val):

The VL Tag

returns the volume the record is in as a string, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The volume number


tagProcessing.issue(val):

The IS Tag

extracts a string giving the issue or range of issues the Record was in, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The issue number/range


tagProcessing.partNumber(val):

The PN Tag

returns an integer giving the part of the issue the Record is in

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The part of the issue of the Record


tagProcessing.supplement(val):

The SU Tag

extracts the supplement number

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The supplement number


tagProcessing.specialIssue(val):

The SI Tag

extracts the special issue value

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The special issue value


tagProcessing.meetingAbstract(val):

The MA Tag

extracts the ID of the meeting abstract prefixed by ‘EPA-‘

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The meeting abstract prefixed


tagProcessing.beginningPage(val):

The BP Tag

extracts the first page the record occurs on, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The first page number


tagProcessing.endingPage(val):

The EP Tag

returns the last page the record occurs on as a string, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The final page number


tagProcessing.articleNumber(val):

The AR Tag

extracts a string giving the article number, not all are integers

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The article number


tagProcessing.pageCount(val):

The PG Tag

returns an integer giving the number of pages of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

int

The page count


tagProcessing.subjects(val):

The WC Tag

extracts a list of subjects as assigned by WOS

Parameters

val: list[str]

The raw data from a WOS file

Returns

list[str]

The subjects list


tagProcessing.DOI(val):

The DI Tag

returns the DOI number of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The DOI number string


tagProcessing.bookDOI(val):

The D2 Tag

extracts the book DOI of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The DOI number


tagProcessing.documentDeliveryNumber(val):

The GA Tag

extracts the document delivery number of the Record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The document delivery number


tagProcessing.wosString(val):

The UT Tag

extracts the WOS number of the record as a string preceded by “WOS:”

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The WOS number


tagProcessing.pubMedID(val):

The PM Tag

extracts the pubmed ID of the record

Parameters

val: list[str]

The raw data from a WOS file

Returns

str

The pubmed ID


Questions?

If you find bugs, or have questions, please write to:

Reid McIlroy-Young reid@reidmcy.com

John McLevey john.mclevey@uwaterloo.ca


License

metaknowledge is free and open source software, distributed under the GPL License.