API

PDFTables has an API (Application Programming Interface) which you can use to convert PDF documents to Excel, CSV (Comma Separated Values), XML or HTML.

Using the API

Usage is straightforward: send a multipart HTTP POST request containing the content of the file to http://3.124.114.50/api.

Here's an example using cURL, a commonly available command-line tool for making HTTP requests.

curl -F f=@example.pdf "http://3.124.114.50/api?format=xml"

The name of the form variable (f= above) is ignored, and only the first file is processed.

Choosing format

The above example converts to an XML file. To specify a different format, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:

curl -F f=@example.pdf "http://3.124.114.50/api?format=xlsx-single"
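
The same choice of format can be made programmatically. The following is a minimal Python sketch, not part of the PDFTables API itself: the helper names build_api_url and output_filename are hypothetical, and the mapping from format to file extension is an assumption made for illustration. The format values are the ones documented in this section.

```python
from urllib.parse import urlencode

# Base endpoint taken from the examples in this document.
API_URL = "http://3.124.114.50/api"

# Documented values of the format= parameter, mapped to a file extension
# for saving the result (the extensions are this sketch's own choice).
FORMAT_EXTENSIONS = {
    "csv": ".csv",
    "html": ".html",
    "xml": ".xml",
    "xlsx-single": ".xlsx",
    "xlsx-multiple": ".xlsx",
}


def build_api_url(fmt):
    """Return the API URL for a documented format, or raise ValueError."""
    if fmt not in FORMAT_EXTENSIONS:
        raise ValueError("unknown format: %r" % (fmt,))
    return API_URL + "?" + urlencode({"format": fmt})


def output_filename(pdf_name, fmt):
    """Swap a trailing .pdf for the extension matching the chosen format."""
    stem = pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
    return stem + FORMAT_EXTENSIONS[fmt]
```

Rejecting unknown format values before uploading avoids sending a large file only to have the request fail.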

| Format | URL parameter | Notes |
| --- | --- | --- |
| CSV | format=csv | Comma Separated Values, blank row between pages. |
| HTML | format=html | Table as HTML fragment. New pages are separated by <h2> elements that have class="pagenumber" and "Page X" as the element text, where X is the page number. |
| XML | format=xml | Contains HTML <table> tags; <td> tags may have colspan= attributes. See XML format for details. |
| XLSX | format=xlsx-single | Excel, all PDF pages on one sheet, blank row between pages. |
| XLSX | format=xlsx-multiple | Excel, one sheet per page of the PDF. |
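
Because the CSV output separates pages with a blank row, it can be split back into pages after download. The sketch below is one possible way to do this in Python; the function name split_csv_pages is hypothetical, and it assumes a blank row appears either as an empty line or as a row of empty cells. A row that is genuinely empty inside a page would be mistaken for a page break.

```python
import csv
import io


def split_csv_pages(csv_text):
    """Split CSV text into pages, treating each blank row as a page break."""
    pages = [[]]
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        # csv.reader yields [] for an empty line; a row made only of
        # empty cells is also treated as blank (an assumption).
        if not any(cell.strip() for cell in row):
            if pages[-1]:
                pages.append([])
            continue
        pages[-1].append(row)
    # Drop a trailing empty page left by a final blank row or empty input.
    if not pages[-1]:
        pages.pop()
    return pages
```

Each page is returned as a list of rows, and each row as a list of cell strings.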

We plan to support other formats in the future, according to demand. If you need something else, contact us!

Language examples

Python

Use the requests library, or another library that makes multipart HTTP requests straightforward.

This example saves an Excel spreadsheet:

  import requests

  # Upload the PDF as a multipart form; the with block closes the file afterwards.
  with open('example.pdf', 'rb') as pdf:
      files = {'f': ('example.pdf', pdf)}
      response = requests.post("http://3.124.114.50/api?format=xlsx-single", files=files)
  response.raise_for_status() # ensure we notice bad responses
  with open("example.xlsx", "wb") as f:
      f.write(response.content)

PHP

Use the cURL library, with CURLFile to send the file. This example converts the file test.pdf to XML.

<?php
  $c = curl_init();
  $cfile = curl_file_create('test.pdf', 'application/pdf');

  curl_setopt($c, CURLOPT_URL, 'http://3.124.114.50/api?format=xml');
  curl_setopt($c, CURLOPT_POSTFIELDS, array('file' => $cfile));
  curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($c, CURLOPT_ENCODING, "gzip,deflate");

  $result = curl_exec($c);
  if (curl_errno($c)) {
      print('Error calling PDFTables: ' . curl_error($c));
  } else {
      // save the XML we got from PDFTables to a file
      file_put_contents("test.xml", $result);
  }

  curl_close($c);

C#

  using System;
  using System.Net;
  using System.Net.Http;
  using System.Threading.Tasks;

  class Program
  {
      static string uploadURL = "http://3.124.114.50/api?format=xml";

      static void Main()
      {
          var task = PDFToTable(@"C:\temp\your_test_pdf.pdf");
          task.Wait();

          Console.Write(task.Result);
          Console.WriteLine("Press enter to continue...");
          Console.ReadLine();
      }

      static async Task<string> PDFToTable(string filename)
      {
          using (var f = System.IO.File.OpenRead(filename))
          {
              var client = new HttpClient();
              var upload = new StreamContent(f);
              var mpcontent = new MultipartFormDataContent();
              Console.WriteLine("Uploading content...");
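              // Name the part and pass a filename so it is sent as a file upload.
              mpcontent.Add(upload, "f", System.IO.Path.GetFileName(filename));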

              using (var response = await client.PostAsync(uploadURL, mpcontent))
              {
                  Console.WriteLine("Response status {0} {1}",
                    (int)response.StatusCode, response.StatusCode);

                  using (var content = response.Content)
                  {
                      return await content.ReadAsStringAsync();
                  }
              }
          }
      }
  }
  

Visual Basic for Applications

This macro lets the user select a file, converts it to Excel and opens it, all in VBA.

  '--- https://support.microsoft.com/en-us/kb/195763
  '    NB: remove PtrSafe if old Excel
  Private Declare PtrSafe Function GetTempPath Lib "kernel32" _
           Alias "GetTempPathA" (ByVal nBufferLength As Long, _
           ByVal lpBuffer As String) As Long

  '--- https://support.microsoft.com/en-us/kb/195763
  '    NB: remove PtrSafe if old Excel
  Private Declare PtrSafe Function GetTempFileName Lib "kernel32" _
           Alias "GetTempFileNameA" (ByVal lpszPath As String, _
           ByVal lpPrefixString As String, ByVal wUnique As Long, _
           ByVal lpTempFileName As String) As Long

  Private Function CreateTempFile(sPrefix As String) As String
      '--- https://support.microsoft.com/en-us/kb/195763
      '    Generate the name of a temporary file
           Dim sTmpPath As String * 512
           Dim sTmpName As String * 576
           Dim nRet As Long

           nRet = GetTempPath(512, sTmpPath)
           If (nRet > 0 And nRet < 512) Then
              nRet = GetTempFileName(sTmpPath, sPrefix, 0, sTmpName)
              If nRet <> 0 Then
                 CreateTempFile = Left$(sTmpName, _
                    InStr(sTmpName, vbNullChar) - 1)
              End If
           End If
  End Function

  Private Function pvToByteArray(sText As String) As Byte()
      '--- http://tinyurl.com/vbapost
      pvToByteArray = StrConv(sText, vbFromUnicode)
  End Function

  Private Function pvPostFile(sUrl As String, sFileName As String, Optional ByVal bAsync As Boolean) As Variant
      '--- HTTP POST a file as multipart
      '--- http://tinyurl.com/vbapost -- modified slightly
      Const STR_BOUNDARY  As String = "3fbd04f5Rb1edX4060q99b9Nfca7ff59c113"
      Dim nFile           As Integer
      Dim baBuffer()      As Byte
      Dim sPostData       As String

      '--- read file
      nFile = FreeFile
      Open sFileName For Binary Access Read As nFile
      If LOF(nFile) > 0 Then
          ReDim baBuffer(0 To LOF(nFile) - 1) As Byte
          Get nFile, , baBuffer
          sPostData = StrConv(baBuffer, vbUnicode)
      End If
      Close nFile
      '--- prepare body
      sPostData = "--" & STR_BOUNDARY & vbCrLf & _
          "Content-Disposition: form-data; name=""uploadfile""; filename=""" & Mid$(sFileName, InStrRev(sFileName, "\") + 1) & """" & vbCrLf & _
          "Content-Type: application/octet-stream" & vbCrLf & vbCrLf & _
          sPostData & vbCrLf & _
          "--" & STR_BOUNDARY & "--"
      '--- post
      With CreateObject("Microsoft.XMLHTTP")
          .Open "POST", sUrl, bAsync
          .SetRequestHeader "Content-Type", "multipart/form-data; boundary=" & STR_BOUNDARY
          .Send pvToByteArray(sPostData)
          If Not bAsync Then
              pvPostFile = .ResponseBody
          End If
      End With
  End Function



  Private Sub pdftables_worker(filename As String)
      Dim data As Variant
      Dim xls_file As String
      Dim nFileNum As Integer
      data = pvPostFile("http://3.124.114.50/api?format=xlsx-single", filename, False)
      xls_file = CreateTempFile("pdf")
      nFileNum = FreeFile
      Dim data_bytearray() As Byte 'needed to get rid of header
      data_bytearray = data

      Open xls_file For Binary Lock Read Write As #nFileNum
      Put #nFileNum, , data_bytearray
      Close #nFileNum
      Workbooks.Open (xls_file)

  End Sub


  Sub pdftables()
      '--- https://msdn.microsoft.com/en-us/library/office/aa219843(v=office.11).aspx

      'Declare a variable as a FileDialog object.
      Dim fd As FileDialog

      'Create a FileDialog object as a File Picker dialog box.
      Set fd = Application.FileDialog(msoFileDialogFilePicker)

      'Declare a variable to contain the path
      'of each selected item. Even though the path is a String,
      'the variable must be a Variant because For Each...Next
      'routines only work with Variants and Objects.
      Dim vrtSelectedItem As Variant

      'Use a With...End With block to reference the FileDialog object.
      With fd

          'Use the Show method to display the File Picker dialog box and return the user's action.
          'The user pressed the action button.
          If .Show = -1 Then

              'Step through each string in the FileDialogSelectedItems collection.
              For Each vrtSelectedItem In .SelectedItems

                  'vrtSelectedItem is a String that contains the path of each selected item.
                  'Convert each selected PDF and open the result in Excel.
                  pdftables_worker (vrtSelectedItem)

              Next vrtSelectedItem
          'The user pressed Cancel.
          Else
          End If
      End With

      'Set the object variable to Nothing.
      Set fd = Nothing

  End Sub

Java

This example uses the Apache HttpClient library.

  import java.io.File;

  import org.apache.http.HttpEntity;
  import org.apache.http.client.methods.CloseableHttpResponse;
  import org.apache.http.client.methods.HttpPost;
  import org.apache.http.entity.mime.MultipartEntityBuilder;
  import org.apache.http.entity.mime.content.FileBody;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;
  import org.apache.http.util.EntityUtils;

  public class PDFTablesExample {

      public static void main(String[] args) throws Exception {
          if (args.length != 1)  {
              System.out.println("File path of PDF not given");
              System.exit(1);
          }
          CloseableHttpClient httpclient = HttpClients.createDefault();
          try {
              HttpPost httppost = new HttpPost("http://3.124.114.50/api?format=xml");

              FileBody bin = new FileBody(new File(args[0]));

              HttpEntity reqEntity = MultipartEntityBuilder.create()
                      .addPart("f", bin)
                      .build();
              httppost.setEntity(reqEntity);

              System.out.println("executing request " + httppost.getRequestLine());
              CloseableHttpResponse response = httpclient.execute(httppost);
              try {
                  System.out.println(response.getStatusLine());
                  HttpEntity resEntity = response.getEntity();
                  if (resEntity != null) {
                      System.out.println(EntityUtils.toString(resEntity));
                  }
                  EntityUtils.consume(resEntity);
              } finally {
                  response.close();
              }
          } finally {
              httpclient.close();
          }
      }
  }

R

There's an unofficial R package on GitHub.

Other languages

If your favourite language isn't listed here, and you'd like help, contact us.

Output formats

XML

The XML output format contains HTML-style tables.

We strongly recommend you use an XML parsing library. We may later add attributes to tags, and add tags with different names to the XML document.
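
As an illustration of this advice, here is a minimal Python sketch that reads the XML with the standard library's xml.etree.ElementTree. The function name rows_by_page is hypothetical and the sample document is made up to match the structure described in this section; it is not real PDFTables output. The sketch looks tags up by name and ignores attributes it does not need, so added attributes or extra wrapper tags should not break it.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the structure this section describes.
SAMPLE_XML = """<document page-count="2">
  <page number="1">
    <table data-page="1" data-table="1">
      <tr><td>Item</td><td>Price</td></tr>
      <tr><td>Tea</td><td>1.50</td></tr>
    </table>
  </page>
  <page number="2">
    <table data-page="2" data-table="1">
      <tr><td colspan="2">Total 1.50</td></tr>
    </table>
  </page>
</document>"""


def rows_by_page(xml_text):
    """Map page number -> list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    pages = {}
    # iter() finds tags by name at any depth below (and including) the root.
    for page in root.iter("page"):
        rows = pages.setdefault(int(page.get("number")), [])
        for tr in page.iter("tr"):
            # itertext() collects all text inside the cell, even if the
            # cell later gains child tags.
            rows.append(["".join(td.itertext()) for td in tr.iter("td")])
    return pages
```

Note that page numbers are used as keys, so this sketch assumes a single PDF document per response; if several documents were ever returned together, pages with the same number would be merged.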

<document>

Currently, the outermost tag is a <document> tag, which corresponds to a single PDF document.

In the future, this may be contained in a <documents> tag if multiple PDF files are uploaded.

Contains any number of <page> tags. Will not contain text.

Attributes

  • page-count: the number of pages in the PDF document.

<page>

A single page from the PDF document.

Contains any number of <table> tags. May in future contain text that is not part of a table.

Attributes

  • number: the page number (starting at 1, and ignoring PDF page numbering)

<table>

A single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.

Contains any number of <tr> tags. Will not contain text.

Attributes

  • id: a unique identifier for this table in the XML document. You should not attempt to parse it.
  • data-filename: should be ignored, internal use only.
  • data-page: a number matching the number of the page tag, i.e. the page number on which the table was found.
  • data-table: an index number for the tables on a page. Currently always 1, but this should not be relied upon.

<tr>

A single row from a table.

Contains any number of <td> tags. Will not contain text.

Attributes

Currently none, but we reserve the right to add some.

<td>

A table cell.

Contains text — the value of the cell.

Attributes

  • style: Currently used for formatting numbers. Should not be used; we intend to remove it.
  • class: It is proposed that numbers will contain a class attribute instead of a style; details to follow.
  • colspan: The width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
  • rowspan: The height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML. Not yet implemented.

(The HTML 4 specification is informative on how colspan and rowspan should be interpreted.)
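
To show how colspan might be handled when turning a row into a flat list of columns, here is a small Python sketch. The function name expand_row is hypothetical, and padding the spanned columns with empty strings is a design choice of this sketch rather than anything PDFTables prescribes. A missing colspan is treated as 1, as in HTML. Because rowspan is not yet implemented in the output, it is not handled here.

```python
import xml.etree.ElementTree as ET


def expand_row(tr):
    """Return the cell texts of a <tr>, padding cells that span columns."""
    cells = []
    for td in tr.iter("td"):
        text = "".join(td.itertext())
        # colspan is not always present; treat a missing value as 1.
        span = int(td.get("colspan", "1"))
        # Keep the text in the first column and pad the rest with "".
        cells.append(text)
        cells.extend([""] * (span - 1))
    return cells
```

For example, a row whose first cell has colspan="2" followed by one ordinary cell yields three columns, which keeps it aligned with rows that have three separate cells.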