Core functionality for any reference management software is the ability to create a citable record from a research paper added to a user’s library.
Since its inception, the Mendeley Desktop (MD) application has been able to automatically extract from PDFs the metadata needed to create these citable records: author, title, year and publication details. We’ve augmented this with data from external, open resources such as CrossRef. [This paper describes how MD works in more detail.]
For a number of years, this approach has served us well. In fact, in a recent review of PDF metadata extraction tools, Mendeley Desktop was ranked 2nd out of the 7 freely available tools evaluated.
Our existing system was about 75-80% accurate for author and title extraction, and we wanted to improve on this. It was also built into Mendeley Desktop, and we wanted to make PDF extraction available as a service so that any client could use it, including MD and our forthcoming web library. Finally, we wanted the system to return the best metadata available, which might not be in the PDF itself, without needing to consult resources external to Mendeley.
We spent several months researching the best open-source tools for extracting character, font and coordinate information from binary PDF data. [There are many out there: PDFBox, PDFMiner and XPDF, among others.] We developed a modular system that isn’t tied to any one of these tools, so we can replace the core extraction engine with another as and when newer tools become available.
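To illustrate what “not tied to any one tool” can look like in practice, here is a minimal sketch of a pluggable extraction interface. It is not our production code: the names `Glyph`, `ExtractionEngine`, `FakeEngine` and `largest_font_text` are all hypothetical, and the fake engine returns canned data rather than parsing a real PDF.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Glyph:
    """One character with its font and position on the page (hypothetical shape)."""
    char: str
    font: str
    size: float
    x: float
    y: float
    page: int


class ExtractionEngine(Protocol):
    """Anything that turns raw PDF bytes into positioned glyphs."""

    def extract(self, pdf_bytes: bytes) -> list[Glyph]: ...


class FakeEngine:
    """Stand-in backend for illustration; a real one would wrap an actual PDF library."""

    def extract(self, pdf_bytes: bytes) -> list[Glyph]:
        # Canned output: the word "Title" set in a large bold font.
        return [
            Glyph(c, "Times-Bold", 18.0, 72.0 + 10 * i, 700.0, 1)
            for i, c in enumerate("Title")
        ]


def largest_font_text(engine: ExtractionEngine, pdf_bytes: bytes) -> str:
    """Downstream step that depends only on the interface, not on the backend."""
    glyphs = engine.extract(pdf_bytes)
    if not glyphs:
        return ""
    biggest = max(g.size for g in glyphs)
    return "".join(g.char for g in glyphs if g.size == biggest)
```

Because `largest_font_text` only sees the `ExtractionEngine` interface, swapping one backend for another means writing a new class with an `extract` method and nothing else.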
The raw data is then processed using components from an open-source tool called Grobid, which we retrained with a Conditional Random Fields classifier on thousands of research papers. This turns the raw data into citable fields: title, authors, DOI, publication, volume, issue, year and page ranges. We can now extract many of these fields with a real-world accuracy of over 90% (based on tens of thousands of test cases across a wide range of user-supplied articles), and we have a continual improvement process in place to develop this further.
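As a rough illustration of what a per-field accuracy figure means, the sketch below scores extracted metadata against reference records. It assumes exact matching after simple normalisation, which may well differ from the matching rules we actually use; `normalise` and `field_accuracy` are hypothetical names.

```python
def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't count as errors."""
    return " ".join(value.lower().split())


def field_accuracy(predicted: list[dict], expected: list[dict], field: str) -> float:
    """Fraction of test cases where the extracted field matches the reference."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need one prediction per non-empty list of test cases")
    hits = sum(
        normalise(p.get(field, "")) == normalise(e.get(field, ""))
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)
```

Scoring each field separately makes it easy to see that, say, titles are extracted well while page ranges still need work.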
The new API service takes a PDF and creates a document in your library that represents the best metadata available: either the validated record from our catalogue, if it’s a paper we already know about, or the data extracted from the PDF, if it’s one we’ve not seen yet.
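The “best metadata available” rule can be sketched as a simple fallback. This is an illustration only: how the service actually matches a PDF against the catalogue is not described here, so the DOI lookup, the toy `CATALOGUE` and the function names below are all assumptions.

```python
from typing import Callable, Optional

Metadata = dict[str, str]

# Toy catalogue keyed by DOI; the DOI and record are made up for illustration.
CATALOGUE: dict[str, Metadata] = {
    "10.1000/example": {"title": "A Validated Title", "doi": "10.1000/example"},
}


def lookup_by_doi(extracted: Metadata) -> Optional[Metadata]:
    """Hypothetical matcher: return the catalogue record with the same DOI, if any."""
    return CATALOGUE.get(extracted.get("doi", ""))


def best_metadata(
    extracted: Metadata,
    find_in_catalogue: Callable[[Metadata], Optional[Metadata]],
) -> Metadata:
    """Prefer the validated catalogue record; otherwise keep what was extracted."""
    validated = find_in_catalogue(extracted)
    return validated if validated is not None else extracted
```

Passing the matcher in as a function keeps the fallback rule separate from whatever matching strategy sits behind it.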
To create a document from a file, use the ‘documents’ service. Here is an example of extracting metadata from a PDF using cURL:
curl 'https://api.mendeley.com/documents' \
  -X POST \
  -H 'Authorization: Bearer <ACCESS_TOKEN>' \
  -H 'Content-Type: application/pdf' \
  -H 'Content-Disposition: attachment; filename="example.pdf"' \
  --data-binary @example.pdf
If you prefer Python, we have updated our Python SDK on GitHub with a method that you can call:
document = mendeley.documents.create_from_file('example.pdf')
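If you would rather not depend on the SDK, the same request as the cURL example can be put together with Python’s standard library. The sketch below only builds the request so that it can be inspected; `build_upload_request` is a hypothetical helper name, and the URL and headers are copied from the cURL example above.

```python
import urllib.request

DOCUMENTS_URL = "https://api.mendeley.com/documents"


def build_upload_request(
    pdf_bytes: bytes, filename: str, access_token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request that uploads a PDF."""
    return urllib.request.Request(
        DOCUMENTS_URL,
        data=pdf_bytes,  # raw PDF bytes, like curl's --data-binary
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/pdf",
            "Content-Disposition": f'attachment; filename="{filename}"',
        },
    )
```

To actually send it, read the file with `open('example.pdf', 'rb')`, pass the bytes and a valid access token to the helper, and hand the result to `urllib.request.urlopen`.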