RecordCollection(CollectionWithIDs):

RecordCollection.__init__(inCollection=None, name='', extension='', cached=False, quietStart=False):

A container for a large number of individual records.

RecordCollection provides ways of creating Records from an isi file, string, list of records or directory containing isi files.

If there are issues when the RecordCollection is being created, it will be declared bad: bad will be set to True and most methods will then return None or False. The attribute error contains the exception that occurred.

RecordCollections also possess an attribute name, also accessible with __repr__(). It is used to auto-generate the names of files and can be set at creation. Note, though, that any operation that modifies the RecordCollection’s contents will update the name to include what occurred.

Customizations

The Records are contained within a set, and as such many of the set operations are defined: pop, union, in, and so on. Records are hashed with their WOS string, so no duplication can occur. The comparison operators <, <=, >, >= are based strictly on the number of Records within the collection, while equality looks for an exact match on the Records.
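The comparison rules can be mimicked with a small stand-in class. This is only a sketch of the semantics described here, not metaknowledge's implementation; MiniCollection and its string IDs are hypothetical.

```python
class MiniCollection:
    """Hypothetical stand-in: a set of record IDs with size-based ordering."""

    def __init__(self, ids=()):
        # A set drops duplicates, mirroring records hashed by their WOS string
        self._ids = set(ids)

    def __len__(self):
        return len(self._ids)

    def __contains__(self, item):
        return item in self._ids

    def __eq__(self, other):
        # Equality requires exactly the same members
        return self._ids == other._ids

    def __lt__(self, other):
        # Ordering looks only at the number of members
        return len(self) < len(other)

    def __le__(self, other):
        return len(self) <= len(other)

    def __gt__(self, other):
        return len(self) > len(other)

    def __ge__(self, other):
        return len(self) >= len(other)


small = MiniCollection(["WOS:A", "WOS:A", "WOS:B"])   # duplicate is dropped
large = MiniCollection(["WOS:C", "WOS:D", "WOS:E"])
same_size = MiniCollection(["WOS:X", "WOS:Y"])
```

Here small and same_size compare as <= and >= each other because they hold the same number of members, yet they are not equal because their contents differ.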

__Init__

inCollection is the object containing the information about the Records to be constructed. It can be an isi file, string, list of records or directory containing isi files.

Parameters

inCollection : optional [str] or None

the name of the source of WOS records. It can be skipped to produce an empty collection.

If a file is provided, it is first checked to see if it is a WOS file (the header is checked). Records are then read from it one by one until the ‘EF’ string is found, indicating the end of the file.

If a directory is provided, each file in the directory is first checked for the correct header, and all those that have it are then read like individual files. The records are then collected into a single set in the RecordCollection.

name : optional [str]

The name of the RecordCollection, defaults to an empty string. If left empty, the name of the RecordCollection is set to the name of the file or directory used to create the collection. If provided, the name is set to name.

extension : optional [str]

The suffix searched for when a directory is read for files. By default it is empty, so all files are read.

cached : optional [bool]

Default False, if True and inCollection is a directory (a string giving the path to a directory) then the initialized RecordCollection will be saved in the directory as a Python pickle with the suffix '.mkDirCache'. If the RecordCollection is then initialized a second time it will be recovered from the file, which is much faster than reparsing every file in the directory.

metaknowledge saves the names of the parsed files as well as their last modification times and will check these when recreating the RecordCollection, so modifying existing files or adding new ones will result in the entire directory being reanalyzed and a new cache file being created. The extension given to __init__() is taken into account as well and each suffix is given its own cache.

Note: pickle allows arbitrary Python code execution, so only use caches that you trust.
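The cache check described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code metaknowledge uses: the helper names snapshot and load_or_parse, the cache file name (extension + '.mkDirCache') and the parse callback are all hypothetical.

```python
import os
import pickle
import tempfile


def snapshot(directory, extension=""):
    """Map each matching file name to its last modification time."""
    return {
        name: os.path.getmtime(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.endswith(extension) and not name.endswith(".mkDirCache")
    }


def load_or_parse(directory, parse, extension=""):
    """Return (data, from_cache); reparse whenever the snapshot changed."""
    # Assumed naming scheme: one cache file per extension
    cache_path = os.path.join(directory, extension + ".mkDirCache")
    current = snapshot(directory, extension)
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            saved_snapshot, data = pickle.load(f)  # only load trusted caches
        if saved_snapshot == current:
            return data, True
    # New, removed or modified files invalidate the cache
    data = parse(sorted(current))
    with open(cache_path, "wb") as f:
        pickle.dump((current, data), f)
    return data, False


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.isi"), "w") as f:
        f.write("placeholder")
    # list() stands in for a real parser: it just returns the file names
    first = load_or_parse(folder, list, ".isi")
    second = load_or_parse(folder, list, ".isi")
```

The second call finds a snapshot identical to the saved one and so skips parsing; adding or touching a file would change the snapshot and force a fresh parse.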

The RecordCollection class has the following methods:


RecordCollection.networkCoCitation(dropAnon=True, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, addCR=False):

Creates a co-citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default True, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, if True an extra piece of information is stored with each node. The extra information is determined by nodeType.

fullInfo : optional [bool]

default False, if True the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

detailedCore : optional [bool or iterable[WOS tag Strings]]

default True, if True and nodeType is 'full', all nodes from the core (the Citations of Records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. A second attribute, inCore, is also created for all nodes; it is a boolean describing whether the node is in the core.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.networkCoAuthor()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx Graph

A networkx graph with hashes as ID and co-citation as edges
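As a rough illustration of what a co-citation edge represents, the sketch below counts how often two citations appear together in the same reference list. It uses plain dictionaries rather than networkx, the sample data is made up, and taking the edge weight to be the number of records citing both works is an assumption about what "weighted" means here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each record ID maps to the citations in its CR field
cited_by_record = {
    "rec1": ["Smith 1990", "Jones 1985", "Lee 2001"],
    "rec2": ["Smith 1990", "Jones 1985"],
    "rec3": ["Lee 2001"],
}


def co_citation_edges(cited_by_record):
    """Count how often each pair of citations shares a reference list."""
    edges = Counter()
    for citations in cited_by_record.values():
        # sorted(set(...)) gives each unordered pair one canonical form
        for a, b in combinations(sorted(set(citations)), 2):
            edges[(a, b)] += 1
    return edges


edges = co_citation_edges(cited_by_record)
```

Each key of edges would become an edge in the graph, with the count as its weight; rec3 contributes no edge because it cites only one work.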


RecordCollection.networkCitation(dropAnon=False, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, directed=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, recordToCite=True, addCR=False):

Creates a citation network for the RecordCollection.

Parameters

nodeType : optional [str]

One of "full", "original", "author", "journal" or "year". Specifies the value of the nodes in the graph. The default "full" causes the citations to be compared holistically using the metaknowledge.Citation builtin comparison operators. "original" uses the raw original strings of the citations. While "author", "journal" and "year" each use the author, journal and year respectively.

dropAnon : optional [bool]

default False, if True citations labeled anonymous are removed from the network

nodeInfo : optional [bool]

default True, whether an extra piece of information is stored with each node.

fullInfo : optional [bool]

default False, whether the original citation string is added to the node as an extra value, the attribute is labeled as fullCite

weighted : optional [bool]

default True, whether the edges are weighted. If True the edges are weighted by the number of citations.

dropNonJournals : optional [bool]

default False, whether to drop citations of non-journals

count : optional [bool]

default True, causes the number of occurrences of a node to be counted

keyWords : optional [str] or [list[str]]

A string or list of strings that the citations are checked against, if they contain any of the strings they are removed from the network

directed : optional [bool]

Determines if the output graph is directed, default True

detailedCore : optional [bool or iterable[WOS tag Strings]]

default True, if True and nodeType is 'full', all nodes from the core (the Citations of Records in the RecordCollection) will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['AF', 'PY', 'TI', 'SO', 'VL', 'BP'].

If detailedCore is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attribute.

The resultant string is the values of each tag, with commas removed, separated by ', ', just like the info given by non-core Citations. Note that for tags like 'AF' that return lists only the first entry in the list will be used. A second attribute, inCore, is also created for all nodes; it is a boolean describing whether the node is in the core.

Note: detailedCore is not identical to the detailedInfo argument of RecordCollection.networkCoAuthor()

coreOnly : optional [bool]

default False, if True only Citations from the RecordCollection will be included in the network

expandedCore : optional [bool]

default False, if True all citations in the output graph that are records in the collection will be duplicated for each author. If the nodes are "full", "original" or "author" this will result in new nodes being created; for the other options the results are not defined or tested. Edges will be created between each of the nodes for each record expanded, and attributes will be copied from existing nodes.

Returns

Networkx DiGraph or Networkx Graph

See directed for explanation of returned type

A networkx digraph with hashes as ID and citations as edges


RecordCollection.networkBibCoupling(weighted=True, fullInfo=False):

Creates a bibliographic coupling network based on citations for the RecordCollection.

Parameters

weighted : optional bool

Default True, if True the weight of the edges will be added to the network

fullInfo : optional bool

Default False, if True the full citation string will be added to each of the nodes of the network.

Returns

Networkx Graph

A graph of the bibliographic coupling
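Bibliographic coupling links two records that cite the same work. The following is a minimal sketch of that idea with made-up data; it is not the library's code, and using the number of shared references as the edge weight is an assumption.

```python
from itertools import combinations

# Hypothetical data: each record ID maps to the set of works it cites
references = {
    "rec1": {"Smith 1990", "Jones 1985", "Lee 2001"},
    "rec2": {"Smith 1990", "Jones 1985"},
    "rec3": {"Kim 2010"},
}


def coupling_edges(references):
    """Link two records when they cite at least one common reference."""
    edges = {}
    for a, b in combinations(sorted(references), 2):
        shared = references[a] & references[b]
        if shared:
            edges[(a, b)] = len(shared)  # weight: number of shared references
    return edges


coupling = coupling_edges(references)
```

Only rec1 and rec2 end up connected, since rec3 shares no reference with either of them.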


RecordCollection.yearSplit(startYear, endYear, dropMissingYears=True):

Creates a RecordCollection of Records from the years between startYear and endYear inclusive.

Parameters

startYear : int

The smallest year to be included in the returned RecordCollection

endYear : int

The largest year to be included in the returned RecordCollection

dropMissingYears : optional [bool]

Default True, if True Records with missing years will be dropped. If False a TypeError exception will be raised

Returns

RecordCollection

A RecordCollection of Records from startYear to endYear
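The inclusive bounds and the dropMissingYears behaviour can be illustrated with plain dictionaries standing in for Records. This is a sketch of the described behaviour only; year_split and the record layout are hypothetical.

```python
def year_split(records, startYear, endYear, dropMissingYears=True):
    """Keep records whose year lies in [startYear, endYear], both ends included."""
    kept = []
    for rec in records:
        year = rec.get("year")
        if year is None:
            if dropMissingYears:
                continue  # silently skip records without a year
            raise TypeError("record {!r} has no year".format(rec.get("id")))
        if startYear <= year <= endYear:
            kept.append(rec)
    return kept


records = [
    {"id": "A", "year": 1999},
    {"id": "B", "year": 2000},
    {"id": "C", "year": 2005},
    {"id": "D", "year": None},
]
selected = [rec["id"] for rec in year_split(records, 2000, 2005)]
```

Records B and C are kept because both endpoints are included; D is dropped rather than raising because dropMissingYears is left at True.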


RecordCollection.localCiteStats(pandasFriendly=False, keyType='citation'):

Returns a dict with all the citations in the CR field as keys and the number of times they occur as the values

Parameters

pandasFriendly : optional [bool]

default False, if True the output is a dict with two keys: 'Citations' holds the citations and 'Counts' holds their occurrence counts.

keyType : optional [str]

default 'citation', the type of key to use for the dictionary, the valid strings are 'citation', 'journal', 'year' or 'author'. If changed from 'citation' all citations matching the requested option will be contracted and their counts added together.

Returns

dict[str, int or Citation : int]

A dictionary with keys as given by keyType and integers giving their rates of occurrence in the collection
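The two output shapes can be compared side by side in a small sketch. The function cite_stats and the sample CR fields are hypothetical, citations are plain strings here instead of Citation objects, and the sorted order of the pandas-friendly lists is a choice made for this example.

```python
from collections import Counter

# Hypothetical CR fields, one list of citation strings per record
cr_fields = [
    ["Smith 1990", "Jones 1985"],
    ["Smith 1990"],
]


def cite_stats(cr_fields, pandasFriendly=False):
    """Count citations; optionally return two parallel columns."""
    counts = Counter(c for field in cr_fields for c in field)
    if pandasFriendly:
        citations = sorted(counts)
        return {"Citations": citations, "Counts": [counts[c] for c in citations]}
    return dict(counts)


plain = cite_stats(cr_fields)
columns = cite_stats(cr_fields, pandasFriendly=True)
```

The columnar form keeps the two lists aligned by index, which is the layout a data frame constructor expects.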


RecordCollection.localCitesOf(rec):

Takes in a Record, WOS string, citation string or Citation and returns a RecordCollection of all records that cite it.

Parameters

rec : Record, str or Citation

The object that is being cited

Returns

RecordCollection

A RecordCollection containing only those Records that cite rec


RecordCollection.citeFilter(keyString='', field='all', reverse=False, caseSensitive=False):

Filters Records by some string, keyString, in their citations and returns all Records with at least one citation possessing keyString in the field given by field.

Parameters

keyString : optional [str]

Default '', gives the string to be searched for. If it is blank then all citations with the specified field will be matched

field : optional [str]

Default 'all', gives the component of the citation to be looked at. The default 'all' causes the entire original Citation to be searched, so it can be used to search across fields, e.g. '1970, V2' is a valid keyString. The other options are:

  • 'author', searches the author field
  • 'year', searches the year field
  • 'journal', searches the journal field
  • 'V', searches the volume field
  • 'P', searches the page field
  • 'misc', searches all the remaining uncategorized information
  • 'anonymous', searches for anonymous Citations, keyString is not ignored
  • 'bad', searches for bad citations, keyString is not used

reverse : optional [bool]

Default False, being set to True causes all Records not matching the query to be returned

caseSensitive : optional [bool]

Default False, if True causes the search across the original to be case sensitive, only the 'all' option can be case sensitive


RecordCollection.dropNonJournals(ptVal='J', dropBad=True, invert=False):

Drops the non journal type Records from the collection, this is done by checking ptVal against the PT tag

Parameters

ptVal : optional [str]

Default 'J', The value of the PT tag to be kept, default is 'J' the journal tag, other tags can be substituted.

dropBad : optional [bool]

Default True, if True bad Records will be dropped as well as those that are not journal entries

invert : optional [bool]

Default False, Set True to drop journals (or the PT tag given by ptVal) instead of keeping them. Note, it still drops bad Records if dropBad is True


RecordCollection.writeFile(fname=None):

Writes the RecordCollection to a file. The written file’s format is identical to that of files downloaded from WOS. The order of Records written is random.

Parameters

fname : optional [str]

Default None, if given the output file will be written to fname, if None the first 200 characters of the RecordCollection’s name are used with the suffix .isi


RecordCollection.writeCSV(fname=None, splitByTag=None, onlyTheseTags=None, numAuthors=True, genderCounts=True, longNames=False, firstTags=None, csvDelimiter=',', csvQuote='"', listDelimiter='|'):

Writes all the Records from the collection into a csv file with each row a record and each column a tag.

Parameters

fname : optional [str]

Default None, the name of the file to write to, if None it uses the collections name suffixed by .csv.

splitByTag : optional [str]

Default None, if a tag is given the output will be divided into different files according to the value of the tag, with only the records associated with that tag. For example if 'authorsFull' is given then each file will only have the lines for Records that author is named in.

The file names are the values of the tag followed by a dash then the normal name for the file as given by fname, e.g. for the year 2016 the file could be called '2016-fname.csv'.

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc) only the tags in onlyTheseTags will be used, if not given then all tags in the records are given.

If you want to use all known tags pass metaknowledge.knownTagsList.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

firstTags : optional [iterable]

Default None, if None the iterable ['UT', 'PT', 'TI', 'AF', 'CR'] is used. The tags given by the iterable are the first ones in the csv in the order given.

Note if tags are in firstTags but not in onlyTheseTags, onlyTheseTags will override firstTags

csvDelimiter : optional [str]

Default ',', the delimiter used for the cells of the csv file.

csvQuote : optional [str]

Default '"', the quote character used for the csv.

listDelimiter : optional [str]

Default '|', the delimiter used between values of the same cell if the tag for that record has multiple outputs.


RecordCollection.writeBib(fname=None, maxStringLength=1000, wosMode=False, reducedOutput=False, niceIDs=True):

Writes a bibTex entry to fname for each Record in the collection.

If the Record is of a journal article (PT J) the bibTex type is set to 'article', otherwise it is set to 'misc'. The ID of the entry is the WOS number and all the Record’s fields are given as entries with their long names.

Note: This is not meant to be used directly with LaTeX. None of the special characters have been escaped and there are a large number of unnecessary fields provided. niceIDs and maxStringLength have been provided only to make conversions easier.

Note: Record entries that are lists have their values separated with the string ' and ', as this is the format bibTex understands.

Parameters

fname : optional [str]

Default None, The name of the file to be written. If not given one will be derived from the collection and the file will be written to .

maxStringLength : optional [int]

Default 1000, the maximum length for a continuous string. Most bibTex implementations only allow strings of up to 1000 characters, so longer values are split into substrings that are joined with the native string concatenation (the '#' character).
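The splitting can be pictured with a short helper. This is a sketch under assumptions rather than the writer's real code: split_bibtex_string is a hypothetical name, values are wrapped in double quotes, and no escaping of special characters is attempted.

```python
def split_bibtex_string(value, maxStringLength=1000):
    """Break value into quoted chunks joined with bibTex's '#' operator."""
    if len(value) <= maxStringLength:
        return '"{}"'.format(value)
    # Slice the value into consecutive pieces of at most maxStringLength
    chunks = [
        value[i:i + maxStringLength]
        for i in range(0, len(value), maxStringLength)
    ]
    return " # ".join('"{}"'.format(chunk) for chunk in chunks)


short = split_bibtex_string("abc", maxStringLength=5)
long_value = split_bibtex_string("abcdefghijkl", maxStringLength=5)
# long_value is '"abcde" # "fghij" # "kl"'
```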

wosMode : optional [bool]

Default False, if True the data produced will be unprocessed and use double curly braces. This is the style WOS produces bib files in and the output mostly matches that.

reducedOutput : optional [bool]

Default False, if True the tags output will be limited to: 'AF', 'BF', 'ED', 'TI', 'SO', 'LA', 'NR', 'TC', 'Z9', 'PU', 'J9', 'PY', 'PD', 'VL', 'IS', 'SU', 'PG', 'DI', 'D2', and 'UT'

niceIDs : optional [bool]

Default True, if True the IDs used will be derived from the authors, publishing date and title, if False it will be the UT tag


RecordCollection.findProbableCopyright():

Finds the (likely) copyright string from all abstracts in the RecordCollection

Returns

list[str]

A deduplicated list of all the copyright strings


RecordCollection.forBurst(tag, outputFile=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, stemmer=None):

Creates a pandas friendly dictionary with 2 columns one 'year' and the other 'word'. Each row is a word that occurred in the field given by tag in a Record and the year of the record. Unfortunately getting the month or day with any type of accuracy has proved to be impossible so year is the only option.

Parameters

tag : str

The tag giving the field for the words to be extracted from.

outputFile : optional str

Default None, if a path is given a csv file will be created from the returned dictionary and written to that file

dropList : optional list[str]

Default None, if a list of strings is given each field will be checked, before any other processing, for substrings matching those in dropList, and the matches will be dropped. The strings will only be dropped if they are surrounded on both sides with spaces (' ') so if dropList = ['a'] then 'a cat' will become 'cat'.

lower : optional bool

default True, if True the output will be made lower case

removeNumbers : optional bool

default True, if True all numbers will be removed

removeNonWords : optional bool

default True, if True all non-word characters will be removed

removeWhitespace : optional bool

default True, if True all whitespace will be converted to a single space (' ')

stemmer : optional func

default None, if a function is provided it will be run on each individual word in the field and the output will replace it. For example to use the PorterStemmer in the nltk package you would give nltk.PorterStemmer().stem
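The options above can be read as a small text-cleaning pipeline followed by one output row per word. The sketch below is an approximation: clean_words and for_burst are hypothetical helpers, records are plain dicts, and the order of the steps after dropList (which the text says runs first), the space padding, and the regular expressions are all assumptions.

```python
import re


def clean_words(text, dropList=None, lower=True, removeNumbers=True,
                removeNonWords=True, removeWhitespace=True, stemmer=None):
    """Return the processed words of one field (order of steps is assumed)."""
    # Pad with spaces so words at either end can match ' word '
    padded = " " + text + " "
    for drop in dropList or []:
        # Only whole, space-delimited occurrences are removed (case sensitive)
        padded = padded.replace(" " + drop + " ", " ")
    text = padded
    if lower:
        text = text.lower()
    if removeNumbers:
        text = re.sub(r"\d", "", text)
    if removeNonWords:
        text = re.sub(r"[^\w\s]", "", text)
    if removeWhitespace:
        text = re.sub(r"\s+", " ", text)
    words = text.split()
    if stemmer is not None:
        words = [stemmer(w) for w in words]
    return words


def for_burst(records, tag, **options):
    """Build the two-column dict: one row per word with its record's year."""
    rows = {"year": [], "word": []}
    for rec in records:
        for word in clean_words(rec[tag], **options):
            rows["year"].append(rec["year"])
            rows["word"].append(word)
    return rows


records = [{"year": 2016, "TI": "A Cat, 2 Dogs"}]
table = for_burst(records, "TI", dropList=["A"])
```

With these inputs the title yields two rows, one for 'cat' and one for 'dogs', both paired with 2016. A word such as 'bat' is left alone by dropList = ['a'] because the 'a' inside it is not surrounded by spaces.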


RecordCollection.forNLP(outputFile=None, extraColumns=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, removeCopyright=False, stemmer=None):

Creates a pandas friendly dictionary with each row a Record in the RecordCollection and the columns the fields natural language processing uses (id, title, publication year, keywords and the abstract). By default the abstract is processed to remove non-word, non-space characters and the case is lowered.

Parameters

outputFile : optional str

default None, if a file path is given a csv of the returned data will be written

extraColumns : optional list[str]

default None, if a list of tags is given each of the tag’s values for a Record will be added to the output(s)

dropList : optional list[str]

default None, if a list of strings is provided they will be dropped from the output’s abstracts. The matching is case sensitive and done before any other processing. The strings will only be dropped if they are surrounded on both sides with spaces (' ') so if dropList = ['a'] then 'a cat' will become 'cat'.

lower : optional bool

default True, if True the abstract will be made lower case

removeNumbers : optional bool

default True, if True all numbers will be removed

removeNonWords : optional bool

default True, if True all non-word characters will be removed

removeWhitespace : optional bool

default True, if True all whitespace will be converted to a single space (' ')

removeCopyright : optional bool

default False, if True the copyright statement at the end of the abstract will be removed and added to a new column. Note this is heuristic based and will not work for all papers.

stemmer : optional func

default None, if a function is provided it will be run on each individual word in the abstract and the output will replace it. For example to use the PorterStemmer in the nltk package you would give nltk.PorterStemmer().stem


RecordCollection.makeDict(onlyTheseTags=None, longNames=False, raw=False, numAuthors=True, genderCounts=True):

Returns a dict with each key a tag and the values being lists of the values for each of the Records in the collection, None is given when there is no value and they are in the same order across each tag.

When used with pandas: pandas.DataFrame(RC.makeDict()) returns a data frame with each column a tag and each row a Record.
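The shape of the returned dict can be shown with plain dictionaries in place of Records. This is a sketch of the layout only; make_dict and the sample records are hypothetical, and the alphabetical tag order is a choice made for this example.

```python
# Hypothetical records: the second one has no PY (publication year) value
records = [
    {"UT": "WOS:1", "PY": 2015, "TI": "First title"},
    {"UT": "WOS:2", "TI": "Second title"},
]


def make_dict(records, onlyTheseTags=None):
    """Return {tag: [value per record]}, using None where a tag is missing."""
    if onlyTheseTags is None:
        tags = sorted({tag for rec in records for tag in rec})
    else:
        tags = list(onlyTheseTags)
    # Every list has one entry per record, in the same record order
    return {tag: [rec.get(tag) for rec in records] for tag in tags}


table = make_dict(records)
subset = make_dict(records, onlyTheseTags=["UT", "PY"])
```

Because every list has the same length and order, index i across all the lists describes the same record, which is what makes the dict usable as data frame columns.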

Parameters

onlyTheseTags : optional [iterable]

Default None, if an iterable (list, tuple, etc) only the tags in onlyTheseTags will be used, if not given then all tags in the records are given.

If you want to use all known tags pass metaknowledge.knownTagsList.

longNames : optional [bool]

Default False, if True will convert the tags to their longer names, otherwise the short 2 character ones will be used.

raw : optional [bool]

Default False, if True the raw values for each Record’s field will be provided, otherwise the processed values are given.

numAuthors : optional [bool]

Default True, if True adds the number of authors as the column 'numAuthors'.


RecordCollection.rpys(minYear=None, maxYear=None, dropYears=None, rankEmptyYears=False):

This implements Referenced Publication Years Spectroscopy, a technique for finding important years in citation data. The authors of the original papers have a website with more information.

This function computes the spectra of the RecordCollection and returns a dictionary mapping strings to lists of ints. Each list is ordered and the values of each with the same index form a row and each list a column. The strings are the names of the columns. This is intended to be read directly by pandas DataFrames.

The columns returned are:

  1. 'year', the years of the counted citations, missing years are inserted with a count of 0, unless they are outside the bounds of the highest year or the lowest year and the default value is used. e.g. if the highest year is 2016, 2017 will not be inserted unless maxYear has been set to 2017 or higher
  2. 'count', the number of times the year was cited
  3. 'abs-deviation', deviation from the 5-year median. Calculated by taking the absolute deviation of the count from the median of it and the next 2 years and the preceding 2 years
  4. 'rank', the rank of the year, the highest ranked year being the one with the highest deviation, the second highest being the second highest deviation and so on. All years with 0 count are given the rank 0 by default

Parameters

minYear : optional int

Default 1000, The lowest year to be returned, note years outside this bound will be used to calculate the deviation from the 5-year median

maxYear : optional int

Default 2100, The highest year to be returned, note years outside this bound will be used to calculate the deviation from the 5-year median

dropYears : optional int or list[int]

Default None, year or collection of years that will be removed from the returned value, note the dropped years will still be used to calculate the deviation from the 5-year median

rankEmptyYears : optional [bool]

Default False, if True years with 0 count will be ranked according to their deviance, if many 0 count years exist their ordering is not guaranteed to be stable

Returns

dict[str:list]

The table of values from the Referenced Publication Years Spectroscopy
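The 'count' and 'abs-deviation' columns can be reproduced in a few lines for a toy list of cited years. This is a sketch of the calculation as described, not the library's implementation: rpys_table is a hypothetical helper, years outside the observed range are assumed to count as 0 inside the 5-year window, and the 'rank' column is left out.

```python
from collections import Counter
from statistics import median


def rpys_table(cited_years):
    """Build the year, count and abs-deviation columns for cited years."""
    counts = Counter(cited_years)
    table = {"year": [], "count": [], "abs-deviation": []}
    # Missing years inside the observed range are inserted with a count of 0
    for year in range(min(counts), max(counts) + 1):
        # 5-year window: the year itself, the 2 before it and the 2 after it
        window = [counts.get(y, 0) for y in range(year - 2, year + 3)]
        count = counts.get(year, 0)
        table["year"].append(year)
        table["count"].append(count)
        table["abs-deviation"].append(abs(count - median(window)))
    return table


spectrum = rpys_table([1990, 1990, 1990, 1991, 1993, 1993])
```

For 1990, for instance, the window counts are 0, 0, 3, 1, 0, whose median is 0, so the deviation equals the count of 3.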


RecordCollection.genderStats(asFractions=False):

Creates a dict ({'Male' : maleCount, 'Female' : femaleCount, 'Unknown' : unknownCount}) with the numbers of male, female and unknown names in the collection.

Parameters

asFractions : optional bool

Default False, if True the counts will be divided by the total number of names, giving the fraction of names in each category instead of the raw counts.

Returns

dict[str:int]

A dict with three keys 'Male', 'Female' and 'Unknown' mapping to their respective counts


RecordCollection.getCitations(field=None, values=None, pandasFriendly=True, counts=True):

Creates a pandas ready dict with each row a different citation from the contained Records and columns containing the original string, year, journal, author’s name and the number of times it occurred.

There are also options to filter the output citations with field and values

Parameters

field : optional str

Default None, if given all citations missing the named field will be dropped.

values : optional str or list[str]

Default None, if field is also given only those citations with one of the strings given in values will be included.

e.g. to get only citations from 1990 or 1991: field = year, values = [1991, 1990]

pandasFriendly : optional bool

Default True, if False a list of the citations will be returned instead of the more complicated pandas dict

counts : optional bool

Default True, if False the counts columns will be removed

Returns

dict

A pandas ready dict with all the Citations


RecordCollection.networkCoAuthor(detailedInfo=False, weighted=True, dropNonJournals=False, count=True, useShortNames=False):

Creates a coauthorship network for the RecordCollection.

Parameters

detailedInfo : optional [bool or iterable[WOS tag Strings]]

Default False, if True all nodes will be given info strings composed of information from the Record objects themselves. This is equivalent to passing the list: ['PY', 'TI', 'SO', 'VL', 'BP'].

If detailedInfo is an iterable (that evaluates to True) of WOS tags (or long names), the values of those tags will be used to make the info attributes.

For each of the selected tags an attribute will be added to the node using the values of those tags on the first Record encountered. Warning: iterating over RecordCollection objects is not deterministic, so the first Record will not always be the same between runs. The node will be given attributes with the names of the WOS tags for each of the selected tags. The attributes will contain strings containing the values (with commas removed); if multiple values are encountered they will be comma separated.

Note: detailedInfo is not identical to the detailedCore argument of RecordCollection.networkCoCitation() or RecordCollection.networkCitation()

weighted : optional [bool]

Default True, whether the edges are weighted. If True the edges are weighted by the number of co-authorships.

dropNonJournals : optional [bool]

Default False, whether to drop authors from non-journals

count : optional [bool]

Default True, causes the number of occurrences of a node to be counted

Returns

Networkx Graph

A networkx graph with author names as nodes and collaborations as edges.
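To make the node and edge meaning concrete, here is a minimal sketch that derives co-authorship counts from lists of author names. It uses Counters in place of a networkx graph, the author lists are invented, and treating the node count as the number of records an author appears on is an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per record
author_lists = [
    ["Doe, Jane", "Roe, Richard"],
    ["Doe, Jane", "Roe, Richard", "Poe, Ann"],
    ["Poe, Ann"],
]


def coauthor_network(author_lists):
    """Return (nodes, edges) counters for a co-authorship network."""
    nodes = Counter()   # author -> number of records they appear on
    edges = Counter()   # (author, author) -> number of shared records
    for authors in author_lists:
        unique = sorted(set(authors))
        nodes.update(unique)
        # Every pair of authors on the same record collaborated once
        edges.update(combinations(unique, 2))
    return nodes, edges


nodes, edges = coauthor_network(author_lists)
```

The pair Doe and Roe shares two records and so gets weight 2, while the single-author record adds to Poe's node count without creating an edge.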


Questions?

If you find bugs, or have questions, please write to:

Reid McIlroy-Young reid@reidmcy.com

John McLevey john.mclevey@uwaterloo.ca


License

metaknowledge is free and open source software, distributed under the GPL License.