PDFTables has an API (Application Programming Interface) that you can use to convert PDF documents to Excel, CSV (Comma Separated Values), XML or HTML.
Usage is straightforward: send a multipart HTTP POST request containing the content of the file to http://3.124.114.50/api.
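To make the shape of such a request concrete, here is a minimal sketch that builds a one-file multipart/form-data body by hand using only the Python standard library. The helper names `build_multipart_body` and `post_pdf`, and the `application/pdf` part type, are assumptions made for illustration; they are not part of the PDFTables API. In practice an HTTP library will normally build this body for you.

```python
import os
import urllib.request
import uuid


def build_multipart_body(filename: str, data: bytes, field_name: str = "f"):
    """Return (body, content_type) for a one-file multipart/form-data request.

    Hypothetical helper: the filename is inserted as-is, so it must not
    contain double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex  # random marker that separates the parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"  # assumed part type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(url: str, path: str) -> bytes:
    """POST the file at `path` to `url` and return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart_body(os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Build a body from placeholder bytes (no network access needed for this part).
body, content_type = build_multipart_body("example.pdf", b"%PDF-1.4 placeholder")
print(content_type)
```

Calling `post_pdf("http://3.124.114.50/api?format=xml", "example.pdf")` would then send the request; that call is left out here because it needs network access and a real PDF file.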
Here's an example using cURL, a commonly available command-line tool for running HTTP requests.
curl -F f=@example.pdf "http://3.124.114.50/api?format=xml"
The name of the form variable (f= above) is ignored, and only the first file is processed.
The above example converts to an XML file. To specify a different format, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:
curl -F f=@example.pdf "http://3.124.114.50/api?format=xlsx-single"
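When the format is chosen at run time, it can help to build the URL from a checked list of values rather than by string concatenation. The sketch below assumes the format values documented on this page; the helper names `api_url` and `output_name`, and the mapping from format to file extension, are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlencode

API_ENDPOINT = "http://3.124.114.50/api"

# format value -> extension to use when saving the result.
# The format values come from this page; the extensions are an assumption.
FORMAT_EXTENSIONS = {
    "csv": ".csv",
    "html": ".html",
    "xml": ".xml",
    "xlsx-single": ".xlsx",
    "xlsx-multiple": ".xlsx",
}


def api_url(fmt: str) -> str:
    """Return the API URL for a format, rejecting values not listed above."""
    if fmt not in FORMAT_EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    query = urlencode({"format": fmt})
    return f"{API_ENDPOINT}?{query}"


def output_name(pdf_path: str, fmt: str) -> str:
    """Suggest an output file name, e.g. example.pdf -> example.xlsx."""
    return Path(pdf_path).with_suffix(FORMAT_EXTENSIONS[fmt]).name


print(api_url("xlsx-single"))
print(output_name("example.pdf", "xlsx-single"))
```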
Format | URL Parameter | Notes |
---|---|---|
CSV | format=csv | Comma Separated Values, blank row between pages. |
HTML | format=html | Table as HTML fragment. New pages are separated by <h2> elements that have class="pagenumber" and "Page X" as the element text, where X is the page number. |
XML | format=xml | Contains HTML <table> tags; <td> tags may have colspan= attributes. See XML format for details. |
XLSX | format=xlsx-single | Excel, all PDF pages on one sheet, blank row between pages. |
XLSX | format=xlsx-multiple | Excel, one sheet per page of the PDF. |
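Because the CSV output separates pages with a blank row, the pages can be recovered after download. The following is a rough sketch using Python's csv module; `split_csv_pages` is a hypothetical helper, and it assumes a blank row is either an empty line or a row whose cells are all empty. A row that is blank in the source table itself cannot be told apart from a page break with this approach.

```python
import csv
import io


def split_csv_pages(csv_text: str) -> list:
    """Split CSV text into pages, treating each blank row as a page break."""
    pages, current = [], []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        if any(cell.strip() for cell in row):
            current.append(row)      # a row with content belongs to the page
        elif current:
            pages.append(current)    # blank row: close the page in progress
            current = []
    if current:
        pages.append(current)        # last page has no trailing blank row
    return pages


# Hand-written sample, not real API output.
sample = "a,b\n1,2\n\nc,d\n3,4\n"
for index, page in enumerate(split_csv_pages(sample), start=1):
    print(index, page)
```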
We plan to support other formats in the future, according to demand. If you need something else, contact us!
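For the HTML format, the page markers described in the table above (h2 elements with class="pagenumber") can be used to group rows by page. This is a sketch using Python's built-in html.parser; the `PageSplitter` class is hypothetical, the sample markup is hand-written rather than real API output, and it assumes each page heading appears before that page's table.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Group table rows under the preceding <h2 class="pagenumber"> heading."""

    def __init__(self):
        super().__init__()
        self.pages = {}        # heading text -> list of rows (lists of cell text)
        self._heading = None   # text chunks while inside a page heading
        self._current = None   # heading text of the page being filled
        self._row = None       # cells of the row being read
        self._cell = None      # text chunks of the cell being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "pagenumber" in classes:
            self._heading = []
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        elif self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._heading is not None:
            self._current = "".join(self._heading).strip()
            self.pages[self._current] = []
            self._heading = None
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Rows seen before any heading are stored under the key None.
            self.pages.setdefault(self._current, []).append(self._row)
            self._row = None


# Hand-written sample, not real API output.
sample = (
    '<h2 class="pagenumber">Page 1</h2>'
    "<table><tr><td>a</td><td>b</td></tr></table>"
    '<h2 class="pagenumber">Page 2</h2>'
    "<table><tr><td>c</td></tr></table>"
)
splitter = PageSplitter()
splitter.feed(sample)
splitter.close()
print(splitter.pages)
```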
In Python, use the requests library, or another library that handles multipart HTTP requests in a straightforward manner.
This example saves an Excel spreadsheet:
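    import requests

    # Upload the PDF as a multipart form; the form field name ("f") is ignored by the API.
    with open('example.pdf', 'rb') as pdf:
        files = {'f': ('example.pdf', pdf)}
        response = requests.post("http://3.124.114.50/api?format=xlsx-single", files=files)
    response.raise_for_status()  # ensure we notice bad responses

    # Save the spreadsheet returned by the API.
    with open("example.xlsx", "wb") as f:
        f.write(response.content)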
In PHP, use the cURL library, with CURLFile to send the file. This example converts the file test.pdf to XML.
    <?php
    $c = curl_init();
    $cfile = curl_file_create('test.pdf', 'application/pdf');

    curl_setopt($c, CURLOPT_URL, 'http://3.124.114.50/api?format=xml');
    curl_setopt($c, CURLOPT_POSTFIELDS, array('file' => $cfile));
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_ENCODING, "gzip,deflate");

    $result = curl_exec($c);

    if (curl_errno($c)) {
        print('Error calling PDFTables: ' . curl_error($c));
    } else {
        // save the XML we got from PDFTables to a file
        file_put_contents("test.xml", $result);
    }

    curl_close($c);
This C# example uses HttpClient to upload a PDF and print the XML response.

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    class Program
    {
        static string uploadURL = "http://3.124.114.50/api?format=xml";

        static void Main()
        {
            var task = PDFToTable(@"C:\temp\your_test_pdf.pdf");
            task.Wait();

            Console.Write(task.Result);
            Console.WriteLine("Press enter to continue...");
            Console.ReadLine();
        }

        static async Task<string> PDFToTable(string filename)
        {
            using (var f = System.IO.File.OpenRead(filename))
            {
                var client = new HttpClient();
                var upload = new StreamContent(f);
                var mpcontent = new MultipartFormDataContent();
                Console.WriteLine("Uploading content...");
                mpcontent.Add(upload);

                using (var response = await client.PostAsync(uploadURL, mpcontent))
                {
                    Console.WriteLine("Response status {0} {1}",
                        (int)response.StatusCode, response.StatusCode);

                    using (var content = response.Content)
                    {
                        return await content.ReadAsStringAsync();
                    }
                }
            }
        }
    }
This macro lets the user select a file, converts it to Excel and opens it, all in VBA.
    '--- https://support.microsoft.com/en-us/kb/195763
    ' NB: remove PtrSafe if old Excel
    Private Declare PtrSafe Function GetTempPath Lib "kernel32" _
        Alias "GetTempPathA" (ByVal nBufferLength As Long, _
        ByVal lpBuffer As String) As Long

    '--- https://support.microsoft.com/en-us/kb/195763
    ' NB: remove PtrSafe if old Excel
    Private Declare PtrSafe Function GetTempFileName Lib "kernel32" _
        Alias "GetTempFileNameA" (ByVal lpszPath As String, _
        ByVal lpPrefixString As String, ByVal wUnique As Long, _
        ByVal lpTempFileName As String) As Long

    Private Function CreateTempFile(sPrefix As String) As String
        '--- https://support.microsoft.com/en-us/kb/195763
        ' Generate the name of a temporary file
        Dim sTmpPath As String * 512
        Dim sTmpName As String * 576
        Dim nRet As Long

        nRet = GetTempPath(512, sTmpPath)
        If (nRet > 0 And nRet < 512) Then
            nRet = GetTempFileName(sTmpPath, sPrefix, 0, sTmpName)
            If nRet <> 0 Then
                CreateTempFile = Left$(sTmpName, _
                    InStr(sTmpName, vbNullChar) - 1)
            End If
        End If
    End Function

    Private Function pvToByteArray(sText As String) As Byte()
        '--- http://tinyurl.com/vbapost
        pvToByteArray = StrConv(sText, vbFromUnicode)
    End Function

    Private Function pvPostFile(sUrl As String, sFileName As String, Optional ByVal bAsync As Boolean) As Variant
        '--- HTTP POST a file as multipart
        '--- http://tinyurl.com/vbapost -- modified slightly
        Const STR_BOUNDARY As String = "3fbd04f5Rb1edX4060q99b9Nfca7ff59c113"
        Dim nFile As Integer
        Dim baBuffer() As Byte
        Dim sPostData As String

        '--- read file
        nFile = FreeFile
        Open sFileName For Binary Access Read As nFile
        If LOF(nFile) > 0 Then
            ReDim baBuffer(0 To LOF(nFile) - 1) As Byte
            Get nFile, , baBuffer
            sPostData = StrConv(baBuffer, vbUnicode)
        End If
        Close nFile

        '--- prepare body
        sPostData = "--" & STR_BOUNDARY & vbCrLf & _
            "Content-Disposition: form-data; name=""uploadfile""; filename=""" & Mid$(sFileName, InStrRev(sFileName, "\") + 1) & """" & vbCrLf & _
            "Content-Type: application/octet-stream" & vbCrLf & vbCrLf & _
            sPostData & vbCrLf & _
            "--" & STR_BOUNDARY & "--"

        '--- post
        With CreateObject("Microsoft.XMLHTTP")
            .Open "POST", sUrl, bAsync
            .SetRequestHeader "Content-Type", "multipart/form-data; boundary=" & STR_BOUNDARY
            .Send pvToByteArray(sPostData)
            If Not bAsync Then
                pvPostFile = .ResponseBody
            End If
        End With
    End Function

    Private Sub pdftables_worker(filename As String)
        data = pvPostFile("http://3.124.114.50/api?format=xlsx-single", filename, False)
        xls_file = CreateTempFile("pdf")

        nFileNum = FreeFile
        Dim data_bytearray() As Byte 'needed to get rid of header
        data_bytearray = data
        Open xls_file For Binary Lock Read Write As #nFileNum
        Put #nFileNum, , data_bytearray
        Close #nFileNum

        Workbooks.Open (xls_file)
    End Sub

    Sub pdftables()
        '--- https://msdn.microsoft.com/en-us/library/office/aa219843(v=office.11).aspx
        'Declare a variable as a FileDialog object.
        Dim fd As FileDialog

        'Create a FileDialog object as a File Picker dialog box.
        Set fd = Application.FileDialog(msoFileDialogFilePicker)

        'Declare a variable to contain the path
        'of each selected item. Even though the path is a String,
        'the variable must be a Variant because For Each...Next
        'routines only work with Variants and Objects.
        Dim vrtSelectedItem As Variant

        'Use a With...End With block to reference the FileDialog object.
        With fd
            'Use the Show method to display the File Picker dialog box and return the user's action.
            'The user pressed the action button.
            If .Show = -1 Then
                'Step through each string in the FileDialogSelectedItems collection.
                For Each vrtSelectedItem In .SelectedItems
                    'vrtSelectedItem is a String that contains the path of each selected item.
                    'You can use any file I/O functions that you want to work with this path.
                    'This example simply displays the path in a message box.
                    'MsgBox "The path is: " & vrtSelectedItem
                    pdftables_worker (vrtSelectedItem)
                Next vrtSelectedItem
            'The user pressed Cancel.
            Else
            End If
        End With

        'Set the object variable to Nothing.
        Set fd = Nothing
    End Sub
This Java example uses the Apache HttpClient library.
    import java.io.File;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.mime.MultipartEntityBuilder;
    import org.apache.http.entity.mime.content.FileBody;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class PDFTablesExample {
        public static void main(String[] args) throws Exception {
            if (args.length != 1) {
                System.out.println("File path of PDF not given");
                System.exit(1);
            }

            CloseableHttpClient httpclient = HttpClients.createDefault();
            try {
                HttpPost httppost = new HttpPost("http://3.124.114.50/api?format=xml");
                FileBody bin = new FileBody(new File(args[0]));

                HttpEntity reqEntity = MultipartEntityBuilder.create()
                    .addPart("f", bin)
                    .build();
                httppost.setEntity(reqEntity);

                System.out.println("executing request " + httppost.getRequestLine());
                CloseableHttpResponse response = httpclient.execute(httppost);
                try {
                    System.out.println(response.getStatusLine());
                    HttpEntity resEntity = response.getEntity();
                    if (resEntity != null) {
                        System.out.println(EntityUtils.toString(resEntity));
                    }
                    EntityUtils.consume(resEntity);
                } finally {
                    response.close();
                }
            } finally {
                httpclient.close();
            }
        }
    }
There's an unofficial R package on GitHub.
If your favourite language isn't listed here, and you'd like help, contact us.
The XML output format contains HTML style tables.
We strongly recommend you use an XML parsing library. We may later add attributes to tags, and add tags with different names to the XML document.
Currently, the outermost tag is a <document> tag, which corresponds to a single PDF document. In the future, this may be contained in a <documents> tag if multiple PDF files are uploaded.
A <document> tag contains any number of <page> tags. It will not contain text. Its attributes are:
page-count: the number of pages in the PDF document.
A <page> tag represents a single page from the PDF document.
Each page contains any number of <table> tags. It may in future contain text that is not part of a table. Its attributes are:
number: the page number (starting at 1, and ignoring PDF page numbering).
A <table> tag represents a single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.
Each table contains any number of <tr> tags. It will not contain text. Its attributes are:
id: a unique identifier for this page in the XML document. You should not attempt to parse it.
data-filename: should be ignored, internal use only.
data-page: a number matching the number attribute of the <page> tag, i.e. the page number on which the table was found.
data-table: an index number for the tables on a page. Currently always 1, but this should not be relied upon.
A <tr> tag represents a single row from a table.
Each row contains any number of <td> tags. It will not contain text. It currently has no attributes, but we reserve the right to add some.
A <td> tag represents a table cell. It contains text: the value of the cell. Its attributes are:
style: currently used for formatting numbers. Should not be used; we intend to remove it.
class: it is proposed that numbers will contain a class attribute instead of a style attribute; details to follow.
colspan: the width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
rowspan: the height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML. Not yet implemented.
For colspan and rowspan, the HTML 4 spec is informative.
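The structure described above can be read with any XML parsing library. The following sketch uses Python's xml.etree.ElementTree; the function names are hypothetical and the sample document is hand-written from the description on this page, not captured API output. It looks elements up by name, so unknown attributes or extra tags are ignored, and it pads cells that carry a colspan with empty strings so that columns stay aligned.

```python
import xml.etree.ElementTree as ET


def table_rows(table: ET.Element) -> list:
    """Return a table's rows as lists of cell text, expanding colspan."""
    rows = []
    for tr in table.iter("tr"):
        row = []
        for td in tr.iter("td"):
            row.append("".join(td.itertext()).strip())
            span = int(td.get("colspan", "1"))
            row.extend([""] * (span - 1))  # pad so later columns line up
        rows.append(row)
    return rows


def parse_pdftables_xml(xml_text: str) -> dict:
    """Map page number -> list of tables, each table being a list of rows."""
    root = ET.fromstring(xml_text)
    pages = {}
    # iter() searches the whole tree, so this still finds pages if the
    # <document> tag is later wrapped in a <documents> tag. Page numbers
    # from several documents would collide in this simple mapping.
    for page in root.iter("page"):
        number = int(page.get("number"))
        pages[number] = [table_rows(table) for table in page.iter("table")]
    return pages


# Hand-written sample following the structure described on this page.
sample = """<document page-count="1">
  <page number="1">
    <table data-page="1" data-table="1">
      <tr><td colspan="2">Header</td></tr>
      <tr><td>1</td><td>2</td></tr>
    </table>
  </page>
</document>"""

print(parse_pdftables_xml(sample))
```

Since rowspan is documented as not yet implemented, the sketch does not handle it.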