
Enhance Camelot's Table Extraction to Exclude Specific Rows Based on Alignment Issues #504

Open
iammkullah opened this issue Feb 23, 2024 · 5 comments

Comments

@iammkullah

I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems during the extraction process, primarily due to their alignment. These rows often differ in format from the rest of the table, affecting the consistency and accuracy of the extracted data. Currently, Camelot does not seem to offer a direct way to exclude specific rows based on their characteristics or alignment.

This feature would be incredibly beneficial for scenarios where table headers or footers consistently deviate in style or alignment from the main table body, leading to extraction inaccuracies. A parameter or method to specify rows to ignore (by index or pattern recognition) during extraction could significantly improve the utility and flexibility of Camelot for users facing similar challenges.

Is there an existing solution or workaround to address this issue, or could this functionality be considered for future updates?

Details:
Page 1:
[screenshot]
Page 2 (a long table that spans pages 2, 3, and 4 in some PDFs):
[screenshot]

As you can see, the first and last rows cause Camelot to produce 11 columns for this DataFrame when there are actually 10. In my PDFs, the table sometimes has a footer (the last row of the table on the page) and an extra first row above the header. I am not interested in extracting either of them; my actual header comes after that first row.

I have already tried adjusting line_tol, joint_tol, split_text, line_scale, shift_text, etc. This works for small differences, as in the first screenshot (page 1), but it fails in the case of the second screenshot.

Here is my function for appending tables, which builds a single result_df for long tables:

```python
import pandas as pd


def append_tables_to_dataframe(tables):
    """Concatenate every Camelot table with at least 10 columns into one DataFrame."""
    try:
        df_list = []

        for i, table in enumerate(tables):
            # Only keep tables that have at least 10 columns
            if table.shape[1] >= 10:
                # Handle header extraction for the first table
                if i == 0:
                    # Find the rows whose first cell contains "Date of Transaction"
                    # (the pattern tolerates stray spaces between letters)
                    date_index = table.df[table.df.iloc[:, 0].str.contains(r"\bD\s*a\s*t\s*e\s*o\s*f\s*T\s*r\s*a\s*n\s*s\s*a\s*c\s*t\s*i\s*o\s*n\b", case=False, regex=True)].index

                    if not date_index.empty:
                        print("Got the row having header ...")
                        # header_index = date_index[0]
                        table.df.columns = range(len(table.df.columns))

                df_list.append(table.df)

        # Concatenate all tables in the DataFrame list
        result_df = pd.concat(df_list, ignore_index=True)

        return result_df

    except Exception as e:
        print("Error in result_df creation:", e)
        return pd.DataFrame()  # Return an empty DataFrame in case of an error
```

Is there any way to search for the text that appears in those rows, or use anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the one containing the data we are interested in)?

I hope this makes sense. If not, do let me know and I would be happy to explain more. If you are able to add this to Camelot, it will make the library more powerful.

Thanks

@bosd

bosd commented Aug 8, 2024

Hey all!

We are building a maintained fork at pypdf_table_extraction.

You are welcome to check it out and contribute there.
@iammkullah Can you open an issue there (if the problem still exists)?

@rodfloripa

I have the same problem.
@iammkullah did you find any other solution?

@iammkullah
Author

@rodfloripa I haven't found a solution, so I handled all of this when processing the data after extraction.
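
For anyone landing here with the same problem, handling it in post-processing might look roughly like the sketch below. It is not the author's actual code and not a Camelot feature. The helper name `trim_to_body`, the `footer_re` argument, and the assumption that the spurious extra column is blank in every body row are all hypothetical; only the "Date of Transaction" pattern comes from the original post.

```python
import re

import pandas as pd

# Same idea as the pattern in the original post: tolerate stray spaces between letters.
HEADER_RE = re.compile(
    r"D\s*a\s*t\s*e\s*o\s*f\s*T\s*r\s*a\s*n\s*s\s*a\s*c\s*t\s*i\s*o\s*n",
    re.IGNORECASE,
)


def trim_to_body(df, header_re=HEADER_RE, footer_re=None):
    """Hypothetical helper: keep the header row and everything below it,
    drop footer rows, then drop columns that are blank in every remaining row."""
    df = df.reset_index(drop=True)

    # 1. Drop everything above the first row whose first cell looks like the header.
    first_col = [str(value) for value in df.iloc[:, 0]]
    header_hits = [i for i, text in enumerate(first_col) if header_re.search(text)]
    if header_hits:
        df = df.iloc[header_hits[0]:]

    # 2. Drop rows whose joined text matches the (assumed) footer pattern.
    if footer_re is not None:
        row_text = [" ".join(map(str, row)) for row in df.itertuples(index=False)]
        keep_rows = [footer_re.search(text) is None for text in row_text]
        df = df.loc[keep_rows]

    # 3. Drop columns that contain only blank cells, e.g. a spurious extra
    #    column created by the misaligned first or last row (an assumption).
    keep_cols = [c for c in df.columns if any(str(v).strip() for v in df[c])]
    df = df[keep_cols]

    # Renumber rows and columns so later concatenation lines up.
    df = df.reset_index(drop=True)
    df.columns = range(df.shape[1])
    return df
```

Something like `trim_to_body(table.df, footer_re=re.compile(r"Page \d+ of \d+"))` could be applied to each `table.df` before appending it to `df_list`. The footer pattern here is only a placeholder and would need to match whatever text your PDFs actually print in the last row.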

@rodfloripa

Can you open an issue on https://github.com/py-pdf/pypdf_table_extraction?

@bosd

bosd commented Aug 20, 2024

> Is there any way we can search the appearing text or anything else to exclude that first row and last while extraction so that Camelot focuses on the main table (containing data we are interested in)?

Have you tried setting table regions (`table_regions`) or table areas (`table_areas`)?
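
As a rough illustration of the table-areas suggestion: Camelot's `read_pdf` accepts `table_areas` as a list of `"x1,y1,x2,y2"` strings, where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page). The sketch below shrinks a known table box vertically so the unwanted first and last rows fall outside it. The helper names `trimmed_area` and `read_body_tables`, and all coordinates, are made up for illustration; the real values have to be measured on your own PDFs.

```python
def trimmed_area(x1, y1, x2, y2, skip_top=0.0, skip_bottom=0.0):
    """Hypothetical helper: build a Camelot-style "x1,y1,x2,y2" area string
    for a box shrunk by skip_top points at the top and skip_bottom at the bottom.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    with y growing upwards from the bottom of the page.
    """
    top = y1 - skip_top        # moving the top edge down lowers its y value
    bottom = y2 + skip_bottom  # moving the bottom edge up raises its y value
    if top <= bottom:
        raise ValueError("nothing left of the area after trimming")
    return ",".join(f"{value:g}" for value in (x1, top, x2, bottom))


def read_body_tables(pdf_path, pages, area):
    """Run Camelot only inside the given area (same area on every listed page)."""
    import camelot  # imported lazily so the helper above works without Camelot

    return camelot.read_pdf(pdf_path, pages=pages, table_areas=[area])
```

Because the same area is applied to every page passed in `pages`, a long table whose first page has the extra header row and whose last page has the footer may need separate calls with different areas per page.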
