Publication list creation

with bibtex and python

To keep my publication list up to date, I use a bibtex file, a python script that turns it into markdown, and pandoc to convert the markdown further to pdf. Here I show the python script and try to explain how it works. With some fine tuning, the resulting pdf can be included in applications.

I wrote the script years ago, while I was still employed at BiK-F in Frankfurt. I’m surprised that it still works and that I still understand what I once wrote, meaning it’s reproducible ;-) The bibtex file contains a field keywords, where I define the publication categories (see further down). The underlying bibtex-file can be found here.

There are still a few TODO comments in there, which I might continue to work on.

Load required packages

For Debian the required packages to run the python script are:

  • python3-pandas
  • python3-bibtexparser
  • python3-numpy

I also define a dictionary of three categories here. The keys are the keywords used in the bibtex file, and each value is simply a nicer, longer form of its key.

#!/usr/bin/python3

import os
import re
import argparse
import bibtexparser
from bibtexparser.customization import convert_to_unicode
import pandas as pd
from numpy import nan  # not used below; numpy 2.0 removed the old 'NaN' alias

categories = {'peer-review': 'Peer-reviewed articles',
              'conference':  'Conferences',
              'other':       'Other publication'}
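To illustrate how the dictionary is meant to be used: a keyword with an entry is turned into its longer heading, and any other keyword can simply serve as the heading itself. This is only a minimal sketch; the helper name `heading` is made up for this illustration and is not part of the script.

```python
categories = {'peer-review': 'Peer-reviewed articles',
              'conference':  'Conferences',
              'other':       'Other publication'}

def heading(keyword):
    """Return the markdown heading for a keyword (hypothetical helper)."""
    # Fall back to the raw keyword if no nicer form is defined.
    return "## %s" % categories.get(keyword, keyword)

print(heading('peer-review'))
print(heading('thesis'))
```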

Text-standardization

Author names in the bibtex entries are not consistent, since I wrote some of them by hand and copied others, e.g. from the publishers’ webpages or Google Scholar. These two functions try to harmonize them into the full family name followed by the initials of the given names.

def getInitials(name):
    '''
    return the initial letters of any name.
    input: string
    output: string
    '''
    initials = name[0]+"."
    ## remove duplicate separators; '.-' has to be handled before the
    ## dots are replaced, otherwise 'J.-P.' comes out garbled
    name = re.sub(r'\.-', "-", name)
    name = re.sub(r'\. *', " ", name)
    name = re.sub(r'  *', " ", name)
    # loop over each given name
    names = re.search(r'[\.\-\s]', name)
    while names:
        ## drop the leading word together with its separator
        ALL_CHARS = re.compile(r'[^\W]*'+re.escape(names.group()), re.IGNORECASE | re.UNICODE)
        name = re.sub(ALL_CHARS, "", name, count=1)
        if name == "":
            break
        if names.group() == "-":
            initials += "-"+name[0]+"."
        else:
            initials += name[0]+"."
        names = re.search(r'[\.\-\s]', name)
    return(initials)
	
def unifyAuthors(namelist, nmax=999):
    '''
    return a unified list of all authors separated by 'and' in the input, as usual
    in bibtex. All authors will have the full family name and given name initials.
    input: string ('nmax' is not used yet)
    output: string
    '''
    namelist = namelist.split(" and ")
    new_namelist = []
    for name in namelist:
        if re.search(r',', name):
            name = name.split(",")
            name[1] = re.sub(r"^\s+", "", name[1])
            initials = getInitials(name[1])
            new_namelist.append(name[0]+", "+initials)
        else:
            name = name.split(" ")
            if len(name) > 1:
                initials = getInitials(' '.join(name[0:(len(name) - 1)]))
                new_namelist.append(name[len(name) - 1]+", "+initials)
            else:
                new_namelist.append(name[0])
    return(", ".join(new_namelist))
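The target format is easiest to see with a small example. The following is a compact, self-contained sketch of the same idea, not the script's own functions: the names `to_initials` and `normalize_author` as well as the author names are invented for this illustration. It accepts both the 'Given Family' and the 'Family, Given' form.

```python
import re

def to_initials(given):
    """Reduce given names to initials; hyphenated names keep their hyphen."""
    # 'J.-P.' -> 'J-P.' first, then treat the remaining dots as separators.
    given = re.sub(r'\.-', '-', given)
    words = re.sub(r'\.', ' ', given).split()
    initials = []
    for word in words:
        parts = [p for p in word.split('-') if p]
        initials.append('-'.join(p[0] + '.' for p in parts))
    return ''.join(initials)

def normalize_author(name):
    """Return 'Family, I.I.' for 'Given Family' or 'Family, Given'."""
    if ',' in name:
        family, given = [s.strip() for s in name.split(',', 1)]
    else:
        # The last word is taken as the family name.
        *given_parts, family = name.split()
        given = ' '.join(given_parts)
    return "%s, %s" % (family, to_initials(given)) if given else family

authors = "Jane Mary Doe and Roe, Richard and Jean-Luc Example"
result = ", ".join(normalize_author(a) for a in authors.split(" and "))
print(result)  # Doe, J.M., Roe, R., Example, J.-L.
```

Like the functions above, this sketch assumes that the last word of a 'Given Family' name is the family name, so multi-word family names need the comma form.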

Parse the input

Read a bibtex file given as parameter and return a pandas dataframe.

def bib2pandas(file):
    '''
    Load a bibtex file and return it as pandas dataframe.
    input: filename
    output: pandas dataframe
    '''
    ## close the file again after parsing
    with open(file) as bibfile:
        bib_db = bibtexparser.load(bibfile)
    for i in range(0, len(bib_db.entries)):
        bib_db.entries[i] = convert_to_unicode(bib_db.entries[i])
    pdbib = pd.DataFrame(bib_db.entries)
    ## sort by descending year/ascending author
    pdbib = pdbib.sort_values(['year', 'author'], ascending=[False, True])
    pdbib = pdbib.reset_index()
    return(pdbib)
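To see what the sorting does, and what happens to fields that are missing in some entries, here is a small sketch with invented entries instead of a real bibtex file. It assumes that the parsed entries are plain dictionaries with string values, which is what the dataframe is built from above.

```python
import pandas as pd

# Invented entries, shaped like the parsed bibtex dictionaries.
entries = [
    {"ID": "roe2015", "author": "Roe, R.", "year": "2015", "pages": "1--10"},
    {"ID": "example2019", "author": "Example, E.", "year": "2019"},  # no pages
    {"ID": "doe2019", "author": "Doe, J.", "year": "2019", "pages": "11--20"},
]

df = pd.DataFrame(entries)
# Newest year first, authors alphabetically within a year.
df = df.sort_values(["year", "author"], ascending=[False, True])
# drop=True discards the old index instead of keeping it as a column.
df = df.reset_index(drop=True)

print(list(df["ID"]))
# A field missing from an entry becomes NaN, not an empty string.
print(pd.isna(df.loc[1, "pages"]))
```

The years are strings here and are therefore compared as text, which gives the expected order as long as all years have four digits. The NaN values for missing fields are the reason why the formatting code further down checks whether a field is a float.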

Format the output

Each entry in the previously created pandas dataframe is formatted ‘line by line’, depending on its bibtex entry type. So far, I have implemented three entry types:

  • article
  • conference
  • inproceedings

def bibline(index, bib):
    '''
    Returns a publication as string with 'index' + 1 as list-number.
    '''
    if bib['ENTRYTYPE'] == 'article':
        bibstr = "%d. %s (%s): %s. *%s*, %s, %s." %(index + 1,
                                                    bib['author'],
                                                    bib['year'],
                                                    bib['title'],
                                                    bib['journal'],
                                                    bib['volume'],
                                                    bib['pages'])
    elif bib['ENTRYTYPE'] == 'conference':
        if bib['pages'] == "" or type(bib['pages']) is float:  # empty or missing (NaN)
            bibstr = "%d. %s (%s): %s. *%s*. %s." %(index + 1,
                                                    bib['author'],
                                                    bib['year'],
                                                    bib['title'],
                                                    bib['booktitle'],
                                                    bib['address'])
        else:
            bibstr = "%d. %s (%s): %s. *%s*. %s: %s." %(index + 1,
                                                        bib['author'],
                                                        bib['year'],
                                                        bib['title'],
                                                        bib['booktitle'],
                                                        bib['address'],
                                                        bib['pages'])
    elif bib['ENTRYTYPE'] == 'inproceedings':
        if bib['pages'] == '' or type(bib['pages']) is float:  # empty or missing (NaN)
            bibstr = "%d. %s (%s): %s. in %s (Eds.) *%s*. %s." %(index + 1,
                                                                 bib['author'],
                                                                 bib['year'],
                                                                 bib['title'],
                                                                 bib['editor'],
                                                                 bib['booktitle'],
                                                                 bib['address'])
        else:
            bibstr = "%d. %s (%s): %s. in %s (Eds.) *%s*. %s: %s." %(index + 1,
                                                                     bib['author'],
                                                                     bib['year'],
                                                                     bib['title'],
                                                                     bib['editor'],
                                                                     bib['booktitle'],
                                                                     bib['address'],
                                                                     bib['pages'])
    else:
        ## write the message to stderr (stdout holds the markdown) and stop with an error code
        raise SystemExit("Don't know how to handle %s" % (bib['ENTRYTYPE']))
    if 'doi' in bib.keys():
        if bib['doi'] != '' and not type(bib['doi']) is float:
            bibstr = bibstr+" [DOI:%s](https://dx.doi.org/%s)" % (bib['doi'], bib['doi'])
    return(bibstr)
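The branches above differ mainly in their format strings. One possible way to reduce the repetition, shown here only as a sketch and not as the script's method, is a table of templates keyed by entry type. The names `TEMPLATES`, `is_missing` and `format_entry` and the example entries are invented for this illustration, and only two entry types are covered.

```python
import math

# One template per entry type; field names match the bibtex fields.
TEMPLATES = {
    "article": "{author} ({year}): {title}. *{journal}*, {volume}, {pages}.",
    "conference": "{author} ({year}): {title}. *{booktitle}*. {address}.",
}

def is_missing(value):
    """True for an empty string or a NaN (how pandas marks missing fields)."""
    return value == "" or (isinstance(value, float) and math.isnan(value))

def format_entry(number, entry):
    """Return one numbered markdown list item for a publication."""
    template = TEMPLATES[entry["ENTRYTYPE"]]
    text = "%d. %s" % (number, template.format(**entry))
    doi = entry.get("doi", "")
    if not is_missing(doi):
        text += " [DOI:%s](https://dx.doi.org/%s)" % (doi, doi)
    return text

article = {"ENTRYTYPE": "article", "author": "Doe, J.", "year": "2020",
           "title": "An invented title", "journal": "Journal of Examples",
           "volume": "1", "pages": "1--10", "doi": "10.0000/example"}
talk = {"ENTRYTYPE": "conference", "author": "Roe, R.", "year": "2019",
        "title": "Another invented title", "booktitle": "Example Conference",
        "address": "Frankfurt", "doi": float("nan")}

print(format_entry(1, article))
print(format_entry(2, talk))
```

An unknown entry type raises a KeyError in this sketch, and the optional pages of conference entries would still need their own handling.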

Headline

This is simply the header for the markdown used here.

def printMyHeader():
    print("---")
    print("title: 'Publications'")
    print("blackfriday:")
    print("  fractions: false")
    print("---\n")
    print('<img src="/img/google-scholar-logo.png" width="16"><a href="https://scholar.google.de/citations?user=sc47UnIAAAAJ">Google scholar</a>')
    print('&nbsp;&nbsp;')
    print('<img src="/img/orcid-logo.svg" width="16"><a href="https://orcid.org/0000-0002-7861-8789">ORCID</a>')
    print('')

Main program

The main program accepts a few command-line arguments:

  • -f <filename>: the bibtex file
  • -b <‘Author Name’>: an author name to be highlighted in bold
  • -k <category category …>: the keywords (categories) to list, in the given order

def main():
    parser = argparse.ArgumentParser(description='Read a bibtex file and create a Markdown publication list')
    parser.add_argument('-f', dest='file', required=True, metavar='Filename',
                        help='Path to the bibtex file')
    parser.add_argument('-b', dest='bold', default=None, metavar='Bold name',
                        help='Name of an author to highlight as bold markdown text')
    # append or nargs: https://stackoverflow.com/questions/15753701/argparse-option-for-passing-a-list-as-option
    parser.add_argument('-k', dest='keywords', default=[], nargs='*', metavar='Keywords',
                        help='keyword to extract in the "keywords"-field in the bibtex file')
    args = parser.parse_args()

    #TODO: make markdown header optional
    ## print the header after parsing, so that '-h' only shows the help text
    printMyHeader()
    pdbib = bib2pandas(args.file)

    for k in args.keywords:
        if k in categories:
            print("## %s" % categories[k])
        else:
            print("## %s" % k)

        i = 0
        for index, row in pdbib.iterrows():
            ## skip entries without a keywords field (NaN in the dataframe)
            if not isinstance(row['keywords'], str):
                continue
            keywords = [kw.strip() for kw in row['keywords'].split(",")]
            if k not in keywords:
                continue
            row['author'] = unifyAuthors(row['author'])
            ## Highlight a single author, if requested with -b
            if args.bold is not None:
                ## plain string replacement, so the '.' in a name is not a regex wildcard
                row['author'] = row['author'].replace(args.bold, '**%s**' % args.bold)
            ## Need to check, if all fields are available
            print(bibline(i, row) + "\n")
            i = i+1

if __name__ == "__main__":
    main()
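The first TODO above, an optional markdown header, could be handled with one more argument. The following sketch shows how the parser behaves with such a switch; the `--no-header` flag and the `build_parser` function are assumptions made for this illustration and do not exist in the script.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='Read a bibtex file and create a Markdown publication list')
    parser.add_argument('-f', dest='file', metavar='Filename',
                        help='Path to the bibtex file')
    parser.add_argument('-b', dest='bold', default=None, metavar='Bold name',
                        help='Name of an author to highlight as bold markdown text')
    # nargs='*' collects all following values into one list.
    parser.add_argument('-k', dest='keywords', default=[], nargs='*',
                        metavar='Keywords', help='keywords to extract')
    # Hypothetical switch: args.header is True unless --no-header is given.
    parser.add_argument('--no-header', dest='header', action='store_false',
                        help='Do not print the markdown header')
    return parser

args = build_parser().parse_args(
    ['-f', 'mine.bib', '-k', 'peer-review', 'conference', '--no-header'])
print(args.keywords)  # ['peer-review', 'conference']
print(args.header)    # False
```

In main() the header call would then become conditional, i.e. printMyHeader() is only called if args.header is true.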

Execute and Convert to pdf

Make sure that the environment variable LANG is set to a UTF-8 locale (e.g. en_US.UTF-8, de_DE.UTF-8, …), so that special characters are handled correctly. Finally, once all the code above is saved in the file bibtex2md.py, execute it with the appropriate parameters.

chmod +x bibtex2md.py
./bibtex2md.py -f mine.bib -b "Steinkamp, J." -k peer-review conference other > publications.md
# pandoc >= 2.0; older versions call this option --latex-engine
pandoc -V papersize:a4 --pdf-engine=xelatex publications.md -o publications.pdf
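Whether the locale is set up as required can also be checked from within python. This is only a sketch and not part of the script; the helper name `is_utf8` is invented. It compares the encoding python uses for standard output with UTF-8 and writes a warning to stderr, so that the redirected markdown stays clean.

```python
import codecs
import sys

def is_utf8(encoding_name):
    """True if the given encoding name is an alias of UTF-8."""
    try:
        return codecs.lookup(encoding_name).name == 'utf-8'
    except (LookupError, TypeError):
        # Unknown encoding name, or no encoding reported at all.
        return False

encoding = sys.stdout.encoding
if not is_utf8(encoding):
    print("Warning: output encoding is %s, not UTF-8" % encoding,
          file=sys.stderr)
```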