Text Summarization

1. Introduction to Text Summarization

Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document. The main idea behind automatic text summarization is to find a short subset of the most essential information in the entire text and present it in a human-readable format. As online textual data grows, automatic text summarization methods have the potential to be very helpful, because more useful information can be read in a short time.

There are two main types of text summarization techniques:

  • Extractive Text Summarization
    Extractive summarization takes the text, ranks all the sentences according to their relevance to and importance within the text, and presents us with the most important sentences.

    This method does not create new words or phrases; it simply takes already existing words and phrases and presents only those. We can imagine this as taking a page of text and marking the most important sentences with a highlighter.

  • Abstractive Text Summarization
    Abstractive summarization, on the other hand, tries to infer the meaning of the whole text and presents that meaning to us, much like a semantic analyzer.

    It creates words and phrases, puts them together in a meaningful way, and along with that, adds the most important facts found in the text. This way, abstractive summarization techniques are more complex than extractive summarization techniques and are also computationally more expensive.

Here, I have performed text summarization following the principles of Extractive Text Summarization, using the PageRank function provided by the networkx package, which works similarly to TextRank. Enough theory, let's start with the practical implementation.



2. Importing and Analyzing the Dataset

Data is one of the most important things in any Machine Learning or NLP program, because "Data doesn't lie". Let's start with the dataset given below.
news.csv

First, I checked whether our dataframe has any null values. If they are comparatively few, they can simply be dropped; if they are large in number, capping or imputation would be required. Since the missing values were few, I decided to drop them:
image
image
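As a rough sketch of this step (the exact code is in the screenshots above), assuming the dataset is loaded with pandas from news.csv:

```python
import pandas as pd

# Load the dataset referenced above
df = pd.read_csv("news.csv")

# Count missing values per column; since only a few rows are affected, drop them
print(df.isnull().sum())
df = df.dropna().reset_index(drop=True)
```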

On analyzing randomly selected blocks of data, I realized there are many HTML tags, HTML entities, punctuation marks, contractions, and many other undesired features in the dataset.

Take a look here: we can see there are many '\n' characters in the data, which contribute nothing and only reduce the model's performance:
image

Also, here's another block of data, in which we can see tons of HTML tags, entities, extra spaces, tabs, punctuation, etc.:
image


3. Data Pre-Processing

Data Pre-processing, or simply data cleaning, is the process of resolving irregularities and removing undesired content from the raw data in a way that does not alter the semantic meaning of the data. It is the most important step in the entire project, as the data forms the basis for the development of vectors. Also, one of the most important things to remember: "The order of Data Pre-Processing steps is crucial, so I had to think and act!!!"

Initially, I decided to get rid of HTML tags and entities. For this task, I used BeautifulSoup, a Python package mostly used for scraping data. Here, however, I am using it to filter the data, as it already provides pre-defined methods for this. In code, this is done using:
image
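A minimal sketch of the idea (the actual function is shown in the screenshot), using BeautifulSoup's get_text to strip both tags and entities:

```python
from bs4 import BeautifulSoup

def strip_html(text: str) -> str:
    # Parse the markup and keep only the visible text;
    # get_text also resolves HTML entities such as &amp;
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")

strip_html("<p>Breaking &amp; entering</p>")  # -> "Breaking & entering"
```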

With the HTML tags and entities removed, I decided to filter out '\n' characters, replace them with '', and also add full stops to the data blocks. The full stop plays a very important role at the time of sentence tokenization, which splits the string on full stops. If a particular data block has no full stop, it will be treated entirely as a single sentence. For example, see the data block given below; it has no full stop:
image

So, I had to figure out where to place the full stops. This entire task is handled by a single function:
image
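The screenshot above holds the actual function; a minimal sketch of the same idea, treating each newline-separated chunk as a sentence and making sure it ends with a full stop, could look like this:

```python
def replace_newlines_with_fullstops(text: str) -> str:
    # Split on '\n', drop empty chunks, and terminate each chunk with a
    # full stop so sentence tokenization has boundaries to work with
    chunks = [c.strip() for c in text.split("\n") if c.strip()]
    return " ".join(c if c.endswith(".") else c + "." for c in chunks)
```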

At this point, I found that sentences which already had a full stop ended up with multiple full stops after applying the above function. To prevent this, the text needs to be passed through this function:
image
Here, the sent_tokenize function from the nltk.tokenize package tokenizes the entire block of data based on full stops.
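A small sketch of that clean-up step, assuming the duplicated full stops appear as consecutive dots:

```python
import re
from nltk.tokenize import sent_tokenize   # requires nltk.download("punkt")

def normalize_fullstops(text: str) -> str:
    # Collapse runs of full stops ("end.." -> "end.") and re-join
    # the sentences that sent_tokenize recognizes
    text = re.sub(r"\.{2,}", ".", text)
    return " ".join(sent_tokenize(text))
```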

I also found that the data contained contractions like "I'd". So, I decided to expand them into their full forms with the help of the contractions package, which covers contractions commonly used in natural speech. This contraction expansion is done by the following function: image
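For reference, contraction expansion with the contractions package essentially boils down to a call to contractions.fix; a sketch:

```python
import contractions

def expand_contractions(text: str) -> str:
    # "I'd" -> "I would", "don't" -> "do not", and so on
    return contractions.fix(text)
```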

I then made a function that performs all of the Data Pre-processing up to this point. Note that I haven't removed stopwords yet. This is because the pre-processing done so far is general for all text, and this is the version of the data that I am going to semantically compare with the summary we get at the end. Hence, there is no point in removing stopwords from the original data. This task is done using: image

Finally, I reached the step where I planned to remove stopwords and punctuation. Stopword removal is done using the stopwords list provided by the nltk.corpus package, as: image
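A sketch of stopword and punctuation removal with nltk (the stopwords and punkt resources need to be downloaded once):

```python
import string
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords_and_punct(text: str) -> str:
    # Keep only tokens that are neither stopwords nor punctuation
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
    return " ".join(kept)
```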

Combining the entire Data Pre-Processing into a single function as: image

However, I was not able to figure out how to remove links, because I couldn't find a suitable regular expression for it, and my research into link removal through BeautifulSoup didn't work out either. The above function works perfectly when given row-by-row input, but that way we cannot utilize multiprocessing. Hence, I created the same function again, but this time passing an entire column rather than a single row:
image
Here, tqdm is used to show a progress bar, which gives a visual idea of how many rows have been processed and how many are left.
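A sketch of that column-wise wrapper; the row-wise cleaning function built above is passed in, and the names in the commented usage line are illustrative:

```python
from tqdm import tqdm

def preprocess_column(column, clean_fn):
    # tqdm wraps the iterable so a progress bar tracks how many
    # rows have been cleaned and how many remain
    return [clean_fn(row) for row in tqdm(column)]

# df["Cleaned Content"] = preprocess_column(df["Content"], preprocess_text)
```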

Next, using spacy, I sped up the Data Pre-processing by allowing concurrent processes to execute together with a batch size of 5000. image image Here, n_process = -1 indicates that spacy should use as many processes as the processor can handle.
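A sketch of the batched spacy processing, assuming an English pipeline such as en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # model name is an assumption

def spacy_batch_tokenize(texts):
    # nlp.pipe streams the texts in batches of 5000; n_process=-1
    # spawns as many worker processes as the CPU can handle
    docs = nlp.pipe(texts, batch_size=5000, n_process=-1)
    return [[token.text for token in doc] for doc in docs]
```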

Finally, checking for any errors that might have occurred during Data Pre-Processing using: image

With this, the entire process of Data Pre-Processing is complete.


4. Generating the summary for any one random row

For converting words to vectors, I used the Word2Vec model provided by the gensim.models package.

Some theory behind Word2Vec:
In the Word2Vec method, unlike One Hot Encoding and TF-IDF, an unsupervised learning process is performed. Unlabeled data is trained via artificial neural networks to create the Word2Vec model that generates word vectors. Unlike other methods, the vector size does not need to match the number of unique words in the corpus. The size of the vector can be selected according to the corpus size and the type of project, which is particularly beneficial for very large data. For example, if we assume there are 300,000 unique words in a large corpus, vector creation with One Hot Encoding produces a vector of size 300,000 for each word, in which only one element is 1 and all others are 0. By choosing a vector size of 300 on the Word2Vec side (it can be more or less depending on the user's choice), unnecessarily large vector operations are avoided.
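A minimal sketch of training such a model with gensim (4.x API), using a toy corpus in place of the real tokenized sentences and a 300-dimensional vector size as in the example above:

```python
from gensim.models import Word2Vec

# Toy stand-in for the real tokenized, pre-processed sentences
tokenized_sentences = [["markets", "rallied", "sharply"],
                       ["central", "bank", "cut", "rates"]]

w2v = Word2Vec(sentences=tokenized_sentences, vector_size=300,
               window=5, min_count=1, workers=4)

vector = w2v.wv["markets"]   # 300-dimensional embedding for one word
```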

Choosing the random data block using: image image

To get the tokens, the sentences are split on intermediate whitespace: image

Getting the embeddings of the words in each sentence: image

The word embeddings generated in the previous stage are then taken as input for generating the cosine similarity matrix: image
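One common way to build that matrix (not necessarily the exact code in the screenshot) is to average the Word2Vec vectors of each sentence and compute pairwise cosine similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(tokens, w2v, dim=300):
    # Average the embeddings of the tokens the model knows about
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_similarity_matrix(tokenized_sentences, w2v):
    vectors = np.array([sentence_vector(t, w2v) for t in tokenized_sentences])
    return cosine_similarity(vectors)
```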

Now, based on the generated similarity matrix, the PageRank algorithm determines which sentences are semantically the most important among all the sentences. This is coded as: image
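A sketch of that ranking step with networkx, turning the similarity matrix into a weighted graph and scoring each sentence node:

```python
import networkx as nx

def rank_sentences(sim_matrix):
    # Edge weights are the pairwise sentence similarities
    graph = nx.from_numpy_array(sim_matrix)
    return nx.pagerank(graph)   # dict: sentence index -> PageRank score
```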

If the original content has 12 sentences, then our summary will be 4 sentences long, which is exactly what is depicted above. Finally, the summary is generated by: image
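A sketch of that selection step, keeping roughly one third of the sentences (so 12 sentences give a 4-sentence summary) and restoring their original order:

```python
def build_summary(sentences, scores, ratio=3):
    top_n = max(1, len(sentences) // ratio)
    # Pick the highest-scoring sentence indices, then sort them back
    # into document order so the summary reads naturally
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(ranked))
```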

So, up to this point, we can generate a summary as: image


5. Generating summary for entire Dataframe

This combines each and every step we performed above into a single function: image

Let's see if it works when provided with a random data block: image


For faster result generation using the spacy library, I framed a column-wise function:
image

It was a long process, taking at most 35 minutes.


6. Generating a new Dataframe

The new dataframe stores the data from the Content column of the original dataframe as Original Content, after applying the initial pre-processing that removes HTML tags, entities, extra characters, etc. image

This is the part where I faced a major, unexpected problem. For the data in some rows (like the one in the picture below: the 229th row), the vectors generated were very long. Because of this, I continuously got a "power iteration failed" error. I even increased the maximum number of iterations to 1 lakh (100,000), which is obviously very high, but the error still persisted.

I rechecked everything from Data Pre-processing up to this point, but found no further pre-processing that could reduce the vectors. Eventually, I had to add a try-except block to the function, so that if the summary for a particular data block cannot be generated, it is simply replaced with an empty string.
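A sketch of that fallback, assuming the build_summary helper sketched earlier; networkx raises PowerIterationFailedConvergence when PageRank does not converge:

```python
import networkx as nx

def safe_summary(sentences, sim_matrix):
    try:
        graph = nx.from_numpy_array(sim_matrix)
        scores = nx.pagerank(graph, max_iter=100_000)  # 1 lakh iterations, as above
        return build_summary(sentences, scores)
    except nx.PowerIterationFailedConvergence:
        # If PageRank still fails to converge, fall back to an empty summary
        return ""
```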


Here, after considering every possibility, I decided to drop the row, since it is just a single row: image
image

Saving my progress up to this point: image


7. Getting Metrics ready

We have already sorted out the lines that were semantically most important, but I also needed to keep track of the lines that were removed, which I was able to do with the following function. The result is stored in the dataframe under the Removed Line column. image
image

Now, for generating the metrics, I decided to use Cosine Similarity and Semantic Similarity between the data in the Original Content and New Content columns. For Cosine Similarity, I used the function provided by the spacy library that I used earlier. image

And for Semantic Similarity, I used the similarity function, again provided by the spacy library. image
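A sketch of that metric, assuming a spacy pipeline that ships word vectors (the model name here is an assumption); Doc.similarity compares the averaged word vectors of the two texts:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # a model with word vectors; name is an assumption

def semantic_similarity(original: str, summary: str) -> float:
    # Cosine similarity of the two documents' averaged word vectors
    return nlp(original).similarity(nlp(summary))
```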

Finally, after the computation, I created separate columns for both of them. I realized there was a very large difference between the mean and median values of the similarity scores generated by the two functions.
image image

So, to get a final value, I decided to take the harmonic mean of both values and add its own column to our dataframe: image
image
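For reference, the harmonic mean of two scores a and b is 2ab / (a + b); a sketch with illustrative column names:

```python
def harmonic_mean(a: float, b: float) -> float:
    # Harmonic mean of the two similarity scores; 0 if both are 0
    return 2 * a * b / (a + b) if (a + b) else 0.0

# df["Harmonic Similarity"] = [harmonic_mean(c, s) for c, s in
#                              zip(df["Cosine Similarity"], df["Semantic Similarity"])]
```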


8. The Result CSV File

After completing everything, I saved the dataframe I was working on to a CSV file: image

Finally, the entire task of text summarization is complete, with an average accuracy of 82.66%.


9. Result files
