In today’s digital age, extracting information from documents efficiently and accurately has become a necessity for businesses and individuals alike. Fortunately, there are several services and tools available to simplify this process. In this article, I compare two popular options: Amazon Textract and Camelot. The purpose of this demonstration is to show the differences in the output these tools generate, so that you can understand their capabilities and make an informed decision without having to sign up or install packages yourself.
Test-1: Extract a single large table.
In the image above, the left-hand side is the Camelot output saved as a CSV file, the middle is the AWS Textract output, and the right-hand side is the input PDF page.
In this case, I think AWS Textract produced the better output:
- (1) The entire table is captured;
- (2) Long line items are captured and grouped correctly (see the third underlined section title, "Other comprehensive income");
- (3) Negative numbers keep their original format (e.g., output as (9513) instead of -9513);
- (4) Column header text is grouped together (e.g., Mar 2022 $’000’).
The third point is not necessarily an issue, but personally I prefer the original format.
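If the accounting-style negatives need to become ordinary numbers for later analysis, a small post-processing step can convert them. The following is a minimal sketch, not part of either tool: the function name `parse_accounting_number` is hypothetical, and it assumes cells are plain strings with optional thousands separators.

```python
import re


def parse_accounting_number(text):
    """Convert a cell such as '(9,513)' or '1,204' to a float.

    Returns None when the cell is not a plain number (e.g. a label).
    Hypothetical helper for illustration; adjust to your own data.
    """
    cleaned = text.strip().replace(",", "")
    # Accounting notation wraps negative values in parentheses.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    elif cleaned.startswith("-"):
        negative = True
        cleaned = cleaned[1:]
    # Accept only digits with an optional decimal part.
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


print(parse_accounting_number("(9513)"))  # -9513.0
print(parse_accounting_number("-9513"))   # -9513.0
```

Keeping the raw string in the extracted CSV and converting only when needed preserves the original format while still allowing arithmetic on the values.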
Test-2: Extract multiple tables.
As before, the Camelot output is on the left-hand side and the AWS Textract output is in the middle.
In this test case, AWS Textract again produced the better result:
- (1) Long line items are grouped into one cell (see the last row of the first table);
- (2) The second table at the bottom of the page is captured.
However, one noticeable mistake from AWS Textract is that the Note column in the second table is merged into the left-hand-side row-header column.
The image below is a screenshot of the AWS Textract UI.
Sample Camelot code
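The snippet below is a minimal sketch of how Camelot can be used to extract tables from a PDF page and save each one as a CSV file. It is not necessarily the exact code used for the tests above: the file name `report.pdf`, the page number, the `stream` flavor, and the helper names `csv_path_for` and `extract_tables` are all assumptions for illustration. Camelot itself must be installed separately.

```python
from pathlib import Path


def csv_path_for(pdf_path, index, out_dir="."):
    """Build an output CSV path such as out_dir/report_table_0.csv (hypothetical naming)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_table_{index}.csv"


def extract_tables(pdf_path, pages="1", flavor="stream", out_dir="."):
    """Extract every table Camelot finds on the given pages and save each as CSV."""
    import camelot  # third-party package; imported here so the module loads without it

    # flavor="lattice" suits tables with ruled lines; "stream" suits
    # whitespace-separated tables. Which one fits is an assumption to test.
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)

    saved = []
    for i, table in enumerate(tables):
        out_path = csv_path_for(pdf_path, i, out_dir)
        # table.df is a pandas DataFrame holding the extracted cells.
        table.df.to_csv(out_path, index=False, header=False)
        # parsing_report gives Camelot's own accuracy and whitespace metrics.
        print(out_path, table.parsing_report)
        saved.append(out_path)
    return saved


# Example call (assumes a local file named report.pdf):
# extract_tables("report.pdf", pages="1", flavor="stream")
```

If a table is missed or cells are split, switching between the `stream` and `lattice` flavors is usually the first thing to try.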