To keep my publication list up to date, I use a bibtex file, a python script that converts it to markdown, and pandoc to convert the markdown further to pdf. Here I show and try to explain the python script. With some fine tuning, the converted pdf can be included in applications.
I wrote the script years ago, while I was still employed at BiK-F in Frankfurt. I'm surprised that it still works and that I still understand what I once wrote, meaning it's reproducible ;-) The bibtex file contains a field keywords, where I define the publication categories (see further down). The underlying bibtex-file can be found here.
There are still a few TODO comments in there, which I might continue to work on.
Load required packages
For Debian the required packages to run the python script are:
- python3-pandas
- python3-bibtexparser
- python3-numpy
I also define a dictionary of three categories here. The keys are the keywords used in the bibtex file, and the values are simply nicer, longer forms of them.
#!/usr/bin/python3
import os
import re
import argparse
import bibtexparser
from bibtexparser.customization import convert_to_unicode
import pandas as pd
# the alias 'NaN' was removed in NumPy 2.0, 'nan' works in all versions
from numpy import nan
categories = {'peer-review': 'Peer-reviewed articles',
              'conference': 'Conferences',
              'other': 'Other publication'}
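To illustrate how such a dictionary turns a keyword into a section heading, here is a minimal sketch. The helper `heading_for` and the example keywords string are made up for illustration and are not part of the script; unknown keywords simply fall back to the keyword itself.

```python
# Lookup table mirroring the `categories` dictionary above.
categories = {
    'peer-review': 'Peer-reviewed articles',
    'conference': 'Conferences',
    'other': 'Other publication',
}


def heading_for(keyword):
    """Return the long heading for a keyword, or the keyword itself if unknown."""
    return categories.get(keyword, keyword)


# A made-up `keywords` field as it might appear in a bibtex entry.
entry_keywords = "peer-review, vegetation"

for keyword in entry_keywords.split(", "):
    print("## %s" % heading_for(keyword))
```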
Text-standardization
Author names in the bibtex entries are not consistent, since I wrote some of them by hand and copy/pasted others, e.g. from the publishers' webpages or Google Scholar. These two functions try to harmonize them to initials for the given names and the full family name.
def getInitials(name):
    '''
    return the initial letters of any name.
    input: string
    output: string
    '''
    ## the first initial
    initials = name[0]+"."
    ## remove duplicate separators
    ## ('.-' has to be handled before the dots are replaced by spaces)
    name = re.sub(r'\.-', "-", name)
    name = re.sub(r'\. *', " ", name)
    name = re.sub(r' +', " ", name)
    # loop over each further given name
    names = re.search(r'[\.\-\s]', name)
    while names:
        ## remove everything up to and including the separator
        ALL_CHARS = re.compile(r'[^\W]*'+re.escape(names.group()), re.IGNORECASE | re.UNICODE)
        name = re.sub(ALL_CHARS, "", name, count=1)
        if name == "":
            break
        if names.group() == "-":
            initials += "-"+name[0]+"."
        else:
            initials += name[0]+"."
        names = re.search(r'[\.\-\s]', name)
    return(initials)
def unifyAuthors(namelist, nmax=999):
    '''
    return a unified list of all authors separated by 'and' in the input as usual
    in bibtex. All authors will have given name initials and the full family name.
    input: string
    output: string
    '''
    namelist = namelist.split(" and ")
    new_namelist = []
    for name in namelist:
        if re.search(r',', name):
            ## 'Family, Given' notation
            name = name.split(",")
            name[1] = re.sub(r"^\s+", "", name[1])
            initials = getInitials(name[1])
            new_namelist.append(name[0]+", "+initials)
        else:
            ## 'Given Family' notation
            name = name.split(" ")
            if len(name) > 1:
                initials = getInitials(' '.join(name[0:(len(name) - 1)]))
                new_namelist.append(name[len(name) - 1]+", "+initials)
            else:
                new_namelist.append(name[0])
    return(", ".join(new_namelist))
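The idea behind the harmonization can also be sketched more compactly. The following is an independent, simplified illustration and not the script's implementation: the function names `initials` and `normalise` as well as the author names are made up, and edge cases such as empty names are not handled.

```python
import re


def initials(given):
    """Abbreviate given names, e.g. 'Hans Peter' -> 'H.P.'."""
    # Split on whitespace and dots; the capture group keeps hyphens as tokens.
    tokens = [t for t in re.split(r"[\s.]+|(-)", given) if t]
    return "".join("-" if t == "-" else t[0] + "." for t in tokens)


def normalise(author_field):
    """Turn a bibtex author field into 'Family, I.' entries joined by ', '."""
    result = []
    for name in author_field.split(" and "):
        if "," in name:
            # 'Family, Given' notation
            family, given = [part.strip() for part in name.split(",", 1)]
        else:
            # 'Given Family' notation: the last word is the family name
            *given_parts, family = name.split()
            given = " ".join(given_parts)
        result.append(family + ", " + initials(given) if given else family)
    return ", ".join(result)


print(normalise("Doe, Jane Mary and Jean-Pierre Example and J. Q. Public"))
# -> Doe, J.M., Example, J.-P., Public, J.Q.
```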
Parse the input
Read a bibtex file given as parameter and return a pandas dataframe.
def bib2pandas(file):
    '''
    Load a bibtex file and return it as pandas dataframe.
    input: filename
    output: pandas dataframe
    '''
    ## the with-statement closes the file again after parsing
    with open(file) as bibfile:
        bib_db = bibtexparser.load(bibfile)
    for i in range(0, len(bib_db.entries)):
        bib_db.entries[i] = convert_to_unicode(bib_db.entries[i])
    pdbib = pd.DataFrame(bib_db.entries)
    ## sort by descending year/ascending author
    pdbib = pdbib.sort_values(['year', 'author'], ascending=[False, True])
    pdbib = pdbib.reset_index()
    return(pdbib)
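The resulting order (newest year first, authors alphabetically within a year) can be illustrated without pandas. This is only a sketch with made-up entries; it assumes that the year is stored as a four-digit string, so comparing strings gives the same order as comparing numbers.

```python
# Made-up entries with the year stored as a string.
entries = [
    {"ID": "a", "year": "2015", "author": "Miller, A."},
    {"ID": "b", "year": "2017", "author": "Zimmer, B."},
    {"ID": "c", "year": "2017", "author": "Adams, C."},
]

# Python's sort is stable, so sort by the secondary key (author) first ...
by_author = sorted(entries, key=lambda e: e["author"])
# ... and then by the primary key (year), newest first.
ordered = sorted(by_author, key=lambda e: e["year"], reverse=True)

print([e["ID"] for e in ordered])
# -> ['c', 'b', 'a']
```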
Format the output
Each entry in the previously created pandas dataframe is checked 'line by line'. So far, I have implemented three entry types:
- article
- conference
- inproceedings
def bibline(index, bib):
    '''
    Returns a publication as string with 'index' + 1 as list-number.
    '''
    if bib['ENTRYTYPE'] == 'article':
        bibstr = "%d. %s (%s): %s. *%s*, %s, %s." % (index + 1,
            bib['author'],
            bib['year'],
            bib['title'],
            bib['journal'],
            bib['volume'],
            bib['pages'])
    elif bib['ENTRYTYPE'] == 'conference':
        ## fields missing in an entry show up as NaN (a float) in the dataframe
        if bib['pages'] == "" or type(bib['pages']) is float:
            bibstr = "%d. %s (%s): %s. *%s*. %s." % (index + 1,
                bib['author'],
                bib['year'],
                bib['title'],
                bib['booktitle'],
                bib['address'])
        else:
            bibstr = "%d. %s (%s): %s. *%s*. %s: %s." % (index + 1,
                bib['author'],
                bib['year'],
                bib['title'],
                bib['booktitle'],
                bib['address'],
                bib['pages'])
    elif bib['ENTRYTYPE'] == 'inproceedings':
        if bib['pages'] == '' or type(bib['pages']) is float:
            bibstr = "%d. %s (%s): %s. in %s (Eds.) *%s*. %s." % (index + 1,
                bib['author'],
                bib['year'],
                bib['title'],
                bib['editor'],
                bib['booktitle'],
                bib['address'])
        else:
            bibstr = "%d. %s (%s): %s. in %s (Eds.) *%s*. %s: %s." % (index + 1,
                bib['author'],
                bib['year'],
                bib['title'],
                bib['editor'],
                bib['booktitle'],
                bib['address'],
                bib['pages'])
    else:
        print("Don't know how to handle %s" % (bib['ENTRYTYPE']))
        exit()
    ## append the DOI as link, if there is one
    if 'doi' in bib.keys():
        if bib['doi'] != '' and not type(bib['doi']) is float:
            bibstr = bibstr+" [DOI:%s](https://dx.doi.org/%s)" % (bib['doi'], bib['doi'])
    return(bibstr)
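Whether a field is usable has to be decided for every entry, because pandas fills fields that are missing in an entry with NaN, which is a float. A minimal sketch of such a check could look as follows; the helper `has_value` and the example entry are made up for illustration and are not part of the script.

```python
import math


def has_value(entry, field):
    """True if `field` is present, not NaN and not an empty string."""
    value = entry.get(field, "")
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != ""


# Made-up conference entry; 'pages' is NaN, as pandas would fill it in.
entry = {"author": "Doe, J.", "year": "2016", "title": "A title",
         "booktitle": "Some Conference", "address": "Somewhere",
         "pages": float("nan")}

line = "%s (%s): %s. *%s*. %s" % (entry["author"], entry["year"],
                                  entry["title"], entry["booktitle"],
                                  entry["address"])
# Only append the pages if they are really there.
if has_value(entry, "pages"):
    line += ": %s" % entry["pages"]
print("1. " + line + ".")
# -> 1. Doe, J. (2016): A title. *Some Conference*. Somewhere.
```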
Headline
This is simply the header for the markdown used here.
def printMyHeader():
    print("---")
    print("title: 'Publications'")
    print("blackfriday:")
    print(" fractions: false")
    print("---\n")
    print('<img src="/img/google-scholar-logo.png" width="16"><a href="https://scholar.google.de/citations?user=sc47UnIAAAAJ">Google scholar</a>')
    print(' ')
    print('<img src="/img/orcid-logo.svg" width="16"><a href="https://orcid.org/0000-0002-7861-8789">ORCID</a>')
    print('')
Main program
The main program accepts the following command-line arguments:
- -f <Filename>: the bibtex file
- -b <‘Author Name’>: an author name to be highlighted in bold
- -k <category category …>: one or more categories to extract from the keywords field
def main():
    #TODO: make markdown header optional
    printMyHeader()
    parser = argparse.ArgumentParser(description='Read a bibtex file and create a Markdown publication list')
    parser.add_argument('-f', dest='file', default=None, metavar='Filename',
                        help='Path to the bibtex file')
    parser.add_argument('-b', dest='bold', default=None, metavar='Bold name',
                        help='Name of an author to highlight as bold markdown text')
    # append or nargs: https://stackoverflow.com/questions/15753701/argparse-option-for-passing-a-list-as-option
    parser.add_argument('-k', dest='keywords', default=None, nargs='*', metavar='Keywords',
                        help='keyword to extract in the "keywords"-field in the bibtex file')
    args = parser.parse_args()
    pdbib = bib2pandas(args.file)
    ## args.keywords is None, if -k was not given
    for k in args.keywords or []:
        if k in categories:
            print("## %s" % categories[k])
        else:
            print("## %s" % k)
        i = 0
        for index, row in pdbib.iterrows():
            keywords = row['keywords'].split(", ")
            if k not in keywords:
                continue
            row['author'] = unifyAuthors(row['author'])
            ## Highlight a single author
            ##TODO: make this optional
            if args.bold is not None:
                ## plain string replacement, since the name contains
                ## regex special characters like '.'
                row['author'] = row['author'].replace(args.bold, '**%s**' % args.bold)
            ## Need to check, if all fields are available
            print(bibline(i, row) + "\n")
            i = i+1
if __name__ == "__main__":
    main()
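How argparse collects several categories after a single -k can be tried out in isolation. The sketch below rebuilds only the three options (without help texts) and passes an explicit argument list instead of reading the real command line; the file name is just a placeholder.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-f', dest='file', default=None)
parser.add_argument('-b', dest='bold', default=None)
# nargs='*' gathers all values following -k into one list
parser.add_argument('-k', dest='keywords', default=None, nargs='*')

# Explicit argument list, for demonstration only.
args = parser.parse_args(['-f', 'mine.bib', '-b', 'Steinkamp, J.',
                          '-k', 'peer-review', 'conference'])
print(args.keywords)
# -> ['peer-review', 'conference']

# Without -k the value stays None, so a loop over it needs a fallback.
no_keywords = parser.parse_args(['-f', 'mine.bib'])
for k in no_keywords.keywords or []:
    print(k)
```

Since the default of -k is None, looping over the keywords without such a fallback would raise a TypeError when the option is omitted.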
Execute and Convert to pdf
Make sure that the environment variable LANG is set to a UTF-8 locale (e.g. en_US.UTF-8, de_DE.UTF-8, …), so that special characters are handled correctly. Finally, once all the code above is saved in the file bibtex2md.py, execute it with the appropriate parameters. Note that pandoc 2.0 renamed the option --latex-engine to --pdf-engine; with an older pandoc, use --latex-engine=xelatex instead.
chmod +x bibtex2md.py
./bibtex2md.py -f mine.bib -b "Steinkamp, J." -k peer-review conference other > publications.md
pandoc -V papersize:a4 --pdf-engine=xelatex publications.md -o publications.pdf