Text Summarization

1. Introduction to Text Summarization

Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document. The main idea behind automatic text summarization is to find a short subset of the most essential information in the entire text and present it in a human-readable format. As online textual data grows, automatic text summarization methods have the potential to be very helpful, because more useful information can be read in a short time.

There are two main types of text summarization techniques:

  • Extractive Text Summarization
    Extractive summarization takes the text, ranks all the sentences according to their relevance to and importance within the text, and presents us with the most important sentences.

    This method does not create new words or phrases; it simply takes already existing words and phrases and presents only those. We can imagine this as taking a page of text and marking the most important sentences with a highlighter.

  • Abstractive Text Summarization
    Abstractive summarization, on the other hand, tries to infer the meaning of the whole text and presents that meaning to us, much like a semantic analyzer.

    It creates words and phrases, puts them together in a meaningful way, and along with that, adds the most important facts found in the text. This way, abstractive summarization techniques are more complex than extractive summarization techniques and are also computationally more expensive.

Here, I have performed text summarization following the principles of Extractive Text Summarization, using the PageRank function provided by the networkx package, which works similarly to TextRank. Enough theory, let's start with the practical implementation.



2. Importing and Analyzing the Dataset

Data is one of the most important things in any Machine Learning or NLP program, because "Data doesn't lie". Let's start with the dataset given below.
news.csv

First, I checked whether our dataframe has any null values. If they are comparatively few, they can simply be dropped; if they are large in number, capping or imputation would be required. Since the missing values were few, I decided to drop them:
image
image
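As a rough sketch of this step (the exact code is in the screenshots above), assuming the dataset is loaded with pandas from news.csv:

```python
import pandas as pd

# Load the dataset referenced above
df = pd.read_csv("news.csv")

# Count missing values per column; since only a few rows are affected, drop them
print(df.isnull().sum())
df = df.dropna().reset_index(drop=True)
```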

On analyzing randomly selected blocks of data, I realized there are many HTML tags, HTML entities, punctuation marks, contractions, and many other undesired features in the dataset.

Take a look here: we can see there are many '\n' characters in the data, which contribute nothing and only reduce the model's performance:
image

Also, here's another block of data, in which we can see tons of HTML tags, entities, extra spaces, tabs, punctuation, etc.:
image


3. Data Pre-Processing

Data Pre-processing, or simply data cleaning, is the process of resolving irregularities and removing undesired content from the raw data in a way that does not alter the semantic meaning of the data. It is the most important step in the entire project, as the data forms the basis for the development of vectors. Also, one of the most important things to remember: "The order of Data Pre-Processing steps is crucial, so I had to think and act!!!"

Initially, I decided to get rid of HTML tags and entities. For this task, I used BeautifulSoup, a Python package mostly used for scraping data. Here, however, I am using it to filter the data, as it already provides pre-defined methods for this. In code, this is done using:
image
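A minimal sketch of the idea (the actual function is shown in the screenshot), using BeautifulSoup's get_text to strip both tags and entities:

```python
from bs4 import BeautifulSoup

def strip_html(text: str) -> str:
    # Parse the markup and keep only the visible text;
    # get_text also resolves HTML entities such as &amp;
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")

strip_html("<p>Breaking &amp; entering</p>")  # -> "Breaking & entering"
```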

With the HTML tags and entities removed, I decided to filter out '\n' characters, replace them with '', and also add full stops to the data blocks. The full stop plays a very important role at the time of sentence tokenization, which splits the string on full stops. If a particular data block has no full stop, it will be treated entirely as a single sentence. For example, see the data block given below; it has no full stop:
image

So, I had to figure out where to place the full stops. This entire task is handled by a single function:
image
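The screenshot above holds the actual function; a minimal sketch of the same idea, treating each newline-separated chunk as a sentence and making sure it ends with a full stop, could look like this:

```python
def replace_newlines_with_fullstops(text: str) -> str:
    # Split on '\n', drop empty chunks, and terminate each chunk with a
    # full stop so sentence tokenization has boundaries to work with
    chunks = [c.strip() for c in text.split("\n") if c.strip()]
    return " ".join(c if c.endswith(".") else c + "." for c in chunks)
```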

At this point, I found that sentences which already had a full stop ended up with multiple full stops after applying the above function. To prevent this, the text needs to be passed through this function:
image
Here, the sent_tokenize function from the nltk.tokenize package tokenizes the entire block of data based on full stops.
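A small sketch of that clean-up step, assuming the duplicated full stops appear as consecutive dots:

```python
import re
from nltk.tokenize import sent_tokenize   # requires nltk.download("punkt")

def normalize_fullstops(text: str) -> str:
    # Collapse runs of full stops ("end.." -> "end.") and re-join
    # the sentences that sent_tokenize recognizes
    text = re.sub(r"\.{2,}", ".", text)
    return " ".join(sent_tokenize(text))
```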

I also found that the data contained contractions like "I'd". So, I decided to expand them into their full forms with the help of the contractions package, which covers contractions commonly used in natural speech. This contraction expansion is done by the following function: image
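For reference, contraction expansion with the contractions package essentially boils down to a call to contractions.fix; a sketch:

```python
import contractions

def expand_contractions(text: str) -> str:
    # "I'd" -> "I would", "don't" -> "do not", and so on
    return contractions.fix(text)
```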

I then made a function that performs all of the Data Pre-processing up to this point. Note that I haven't removed stopwords yet. This is because the pre-processing done so far is general for all text, and this is the version of the data that I am going to semantically compare with the summary we get at the end. Hence, there is no point in removing stopwords from the original data. This task is done using: image

Finally, I reached the step where I planned to remove stopwords and punctuation. Stopword removal is done using the stopwords list provided by the nltk.corpus package, as: image
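A sketch of stopword and punctuation removal with nltk (the stopwords and punkt resources need to be downloaded once):

```python
import string
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords_and_punct(text: str) -> str:
    # Keep only tokens that are neither stopwords nor punctuation
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
    return " ".join(kept)
```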

Combining the entire Data Pre-Processing into a single function as: image

However, I was not able to figure out how to remove links, because I couldn't find a suitable regular expression for it, and my research into link removal through BeautifulSoup didn't work out either. The above function works perfectly when given row-by-row input, but that way we cannot utilize multiprocessing. Hence, I created the same function again, but this time passing an entire column rather than a single row:
image
Here, tqdm is used to show a progress bar, which gives a visual idea of how many rows have been processed and how many are left.
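A sketch of that column-wise wrapper; the row-wise cleaning function built above is passed in, and the names in the commented usage line are illustrative:

```python
from tqdm import tqdm

def preprocess_column(column, clean_fn):
    # tqdm wraps the iterable so a progress bar tracks how many
    # rows have been cleaned and how many remain
    return [clean_fn(row) for row in tqdm(column)]

# df["Cleaned Content"] = preprocess_column(df["Content"], preprocess_text)
```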

Next, using spacy, I sped up the Data Pre-processing by allowing concurrent processes to execute together with a batch size of 5000. image image Here, n_process = -1 indicates that spacy should use as many processes as the processor can handle.
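A sketch of the batched spacy processing, assuming an English pipeline such as en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # model name is an assumption

def spacy_batch_tokenize(texts):
    # nlp.pipe streams the texts in batches of 5000; n_process=-1
    # spawns as many worker processes as the CPU can handle
    docs = nlp.pipe(texts, batch_size=5000, n_process=-1)
    return [[token.text for token in doc] for doc in docs]
```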

Finally, checking for any errors that might have occurred during Data Pre-Processing using: image

With this, the entire process of Data Pre-Processing is complete.


4. Generating the summary for any one random row

For converting words to vectors, I used the Word2Vec model provided by the gensim.models package.

Some theory behind Word2Vec:
In the Word2Vec method, unlike One Hot Encoding and TF-IDF, an unsupervised learning process is performed. Unlabeled data is trained via artificial neural networks to create the Word2Vec model that generates word vectors. Unlike other methods, the vector size does not need to match the number of unique words in the corpus. The size of the vector can be selected according to the corpus size and the type of project, which is particularly beneficial for very large data. For example, if we assume there are 300,000 unique words in a large corpus, vector creation with One Hot Encoding produces a vector of size 300,000 for each word, in which only one element is 1 and all others are 0. By choosing a vector size of 300 on the Word2Vec side (it can be more or less depending on the user's choice), unnecessarily large vector operations are avoided.
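A minimal sketch of training such a model with gensim (4.x API), using a toy corpus in place of the real tokenized sentences and a 300-dimensional vector size as in the example above:

```python
from gensim.models import Word2Vec

# Toy stand-in for the real tokenized, pre-processed sentences
tokenized_sentences = [["markets", "rallied", "sharply"],
                       ["central", "bank", "cut", "rates"]]

w2v = Word2Vec(sentences=tokenized_sentences, vector_size=300,
               window=5, min_count=1, workers=4)

vector = w2v.wv["markets"]   # 300-dimensional embedding for one word
```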

Choosing the random data block using: image image

To get the tokens, the sentences are split on intermediate whitespace: image

Getting the embeddings of the words in each sentence: image

The word embeddings generated in the previous stage are then taken as input for generating the cosine similarity matrix: image
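One common way to build that matrix (not necessarily the exact code in the screenshot) is to average the Word2Vec vectors of each sentence and compute pairwise cosine similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(tokens, w2v, dim=300):
    # Average the embeddings of the tokens the model knows about
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_similarity_matrix(tokenized_sentences, w2v):
    vectors = np.array([sentence_vector(t, w2v) for t in tokenized_sentences])
    return cosine_similarity(vectors)
```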

Now, based on the generated similarity matrix, the PageRank algorithm determines which sentences are semantically the most important among all the sentences. This is coded as: image
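A sketch of that ranking step with networkx, turning the similarity matrix into a weighted graph and scoring each sentence node:

```python
import networkx as nx

def rank_sentences(sim_matrix):
    # Edge weights are the pairwise sentence similarities
    graph = nx.from_numpy_array(sim_matrix)
    return nx.pagerank(graph)   # dict: sentence index -> PageRank score
```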

If the original content has 12 sentences, then our summary will be 4 sentences long, which is exactly what is depicted above. Finally, the summary is generated by: image
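A sketch of that selection step, keeping roughly one third of the sentences (so 12 sentences give a 4-sentence summary) and restoring their original order:

```python
def build_summary(sentences, scores, ratio=3):
    top_n = max(1, len(sentences) // ratio)
    # Pick the highest-scoring sentence indices, then sort them back
    # into document order so the summary reads naturally
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(ranked))
```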

So, up to this point, we can generate a summary as: image


5. Generating summary for entire Dataframe

This combines each and every step we performed above into a single function: image

Let's see if it works when provided with a random data block: image


For faster result generation using the spacy library, I framed a column-wise function:
image

It was a long process, taking at most 35 minutes.


6. Generating a new Dataframe

The new dataframe stores the data from the Content column of the original dataframe as Original Content, after applying the initial pre-processing that removes HTML tags, entities, extra characters, etc. image

This is the part where I faced a major, unexpected problem. For the data in some rows (like the one in the picture below: the 229th row), the vectors generated were very long. Because of this, I continuously got a "power iteration failed" error. I even increased the maximum number of iterations to 1 lakh (100,000), which is obviously very high, but the error still persisted.

I rechecked everything from Data Pre-processing up to this point, but found no further pre-processing that could reduce the vectors. Eventually, I had to add a try-except block to the function, so that if the summary for a particular data block cannot be generated, it is simply replaced with an empty string.
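A sketch of that fallback, assuming the build_summary helper sketched earlier; networkx raises PowerIterationFailedConvergence when PageRank does not converge:

```python
import networkx as nx

def safe_summary(sentences, sim_matrix):
    try:
        graph = nx.from_numpy_array(sim_matrix)
        scores = nx.pagerank(graph, max_iter=100_000)  # 1 lakh iterations, as above
        return build_summary(sentences, scores)
    except nx.PowerIterationFailedConvergence:
        # If PageRank still fails to converge, fall back to an empty summary
        return ""
```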


Here, after considering every possibility, I decided to drop the row, since it is just a single row: image
image

Saving my progress up to this point: image


7. Getting Metrics ready

We have already sorted out the lines that were semantically most important, but I also needed to keep track of the lines that were removed, which I was able to do with the following function. The result is stored in the dataframe under the Removed Line column. image
image

Now, for generating the metrics, I decided to use Cosine Similarity and Semantic Similarity between the data in the Original Content and New Content columns. For Cosine Similarity, I used the function provided by the spacy library that I used earlier. image

And for Semantic Similarity, I used the similarity function, again provided by the spacy library. image
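A sketch of that metric, assuming a spacy pipeline that ships word vectors (the model name here is an assumption); Doc.similarity compares the averaged word vectors of the two texts:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # a model with word vectors; name is an assumption

def semantic_similarity(original: str, summary: str) -> float:
    # Cosine similarity of the two documents' averaged word vectors
    return nlp(original).similarity(nlp(summary))
```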

Finally, after the computation, I created separate columns for both of them. I realized there was a very large difference between the mean and median values of the similarity scores generated by the two functions.
image image

So, to get a final value, I decided to take the harmonic mean of both values and add its own column to our dataframe: image
image
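For reference, the harmonic mean of two scores a and b is 2ab / (a + b); a sketch with illustrative column names:

```python
def harmonic_mean(a: float, b: float) -> float:
    # Harmonic mean of the two similarity scores; 0 if both are 0
    return 2 * a * b / (a + b) if (a + b) else 0.0

# df["Harmonic Similarity"] = [harmonic_mean(c, s) for c, s in
#                              zip(df["Cosine Similarity"], df["Semantic Similarity"])]
```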


8. The Result CSV File

After completing everything, I saved the dataframe I was working on to a CSV file: image

Finally, the entire task of text summarization is complete, with an average accuracy of 82.66%.


9. Result files
