merge
Moeen89 committed May 25, 2024
2 parents f39a039 + ab4e92b commit b97d8af
Showing 56 changed files with 2,831 additions and 205 deletions.
38 changes: 38 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,38 @@
name: documentation

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  workflow_dispatch:

permissions:
  contents: write

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - name: Install dependencies
        run: |
          cd Logic/
          pip install sphinx myst_parser sphinx-book-theme
          cd ../UI/
          pip install -r requirements.txt
      - name: Sphinx build
        run: |
          sphinx-build documentation/source _build
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: _build/
          force_orphan: true

56 changes: 43 additions & 13 deletions Logic/README.md
@@ -5,16 +5,16 @@
This module contains files and classes responsible for doing the main tasks of the project.
**Attention:**
The inputs, outputs, and logic of each function are explained in its comments. So, **please read** the comments and docstrings of each class and method to understand the logic and requirements of each part.

## 1. [Crawler](./core/utility/crawler.py)

In the beginning, we need to crawl the required data and create a dataset for our needs. To this end, we implement a [crawler](./core/utility/crawler.py). The structure and functions required for this part are explained in the `crawler.py` file.

To **test** the correctness of your crawler implementation, you can run `tests/test_crawler.py` and check that the data was crawled correctly. Feel free to change the `json_file_path` variable to match the path of your crawled data.

## 2. [Near-duplicate page detection](./core/indexer/LSH.py)
We have provided you with the `MinHashLSH` class, which is responsible for near-duplicate detection. As you know, this section consists of three sub-sections: first, you need to shingle the documents; then, after building the characteristic matrix, you use the min-hashing technique to make near-duplicate detection more efficient; finally, you use LSH to find movies that are suspected of being duplicates. **Note** that you are only allowed to use the `perform_lsh` function outside of the class; the other methods may only be used inside it.

**Another note:** your crawled data has one field named `first_page_summary` and another named `summaries`. The first is a string and the second is a list of strings; you should work with the second one, combining those strings into a single summary per movie, and run LSH on the set of summaries. The final output of this class should be a dictionary where the keys are the hashes of the buckets and the values are lists of document IDs, representing the indices of those summaries in the main list of all summaries.

We have provided you with a file containing some fake movies in JSON format. Specifically for the Locality-Sensitive Hashing (LSH) part, integrate this additional data into your main dataset and then proceed with LSH. It is important to note that the file includes 20 movies, and each pair of consecutive movies is a near duplicate: the first and second movies, the third and fourth movies, and so on. Verify that your code accounts for this characteristic. However, after this stage you must remove all fake movies from your corpus and refrain from using them in further steps.

There is a method in the class called `jaccard_similarity_test`. You can assess your results using this method by passing it the bucket dictionary and the documents containing all the summaries, where the indices correspond to the summaries in the buckets. A minimal sketch of the underlying idea is shown below.
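
A rough, self-contained sketch of the min-hashing and banding idea (this is an illustration, not the `MinHashLSH` API; the shingle size and the `num_hashes` and `bands` values are arbitrary assumptions):

```python
import hashlib
from collections import defaultdict

def shingle(text, k=3):
    # Set of k-word shingles of a summary; fall back to the text itself if too short.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)} or {text.lower()}

def minhash_signature(shingles, num_hashes=100):
    # Simulate random permutations with seeded hash functions.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes)
    ]

def lsh_buckets(signatures, bands=20):
    # Documents whose signatures agree on any band land in the same bucket.
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = hash((b, tuple(sig[b * rows:(b + 1) * rows])))
            buckets[key].append(doc_id)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```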

## 3. [Preprocess](./core/utility/preprocess.py)
This class is responsible for the preprocessing required on the input data. The input is the crawled data, and the output is the same data with the extra information removed.

Using prebuilt libraries for stopwords is an option, but it can be slow to process large amounts of text. For faster performance, we have prepared a `stopword.txt` file containing common stopwords that you can use instead. The stopwords file allows preprocessing to be completed more efficiently by removing common, non-informative words from the text before further analysis.
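
A minimal sketch of that file-based filtering (the location of `stopword.txt` next to your script, and the function names, are assumptions):

```python
def load_stopwords(path="stopword.txt"):
    # One stopword per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    # Keep only the informative tokens.
    return [t for t in tokens if t.lower() not in stopwords]
```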
@@ -32,27 +32,57 @@
Report the results to us.

- **Note** that one or many of the methods (or signatures of methods) in this class may need to be changed based on your implementations. Feel free to do so!

## 5. [Search](./core/search.py)
In this part, you will implement the search feature, which is the most important part of the retrieval process. To accomplish this, you need to create search functions and a scorer that scores each document against the input query. Keep in mind that you may need to index additional information that was not previously indexed. Make sure to carefully review the structure and function documentation of the added files.

## 6. [Spell Correction](./core/utility/spell_correction.py)
In this file, you have a class for the spell correction task. You must implement the shingling and Jaccard similarity approach for this task, aiming to correct misspelled words in the query. Additionally, integrate the Term Frequency (TF) of the token into your candidate selection. For instance, if you input `whle`, both `while` and `whale` should be considered as candidates with the same score. However, it is more likely that the user intended to enter `while`. Therefore, enhance your spell correction module by adding a normalized TF score. Achieve this by dividing the TF of the top 5 candidates by the maximum TF of the top 5 candidates and multiplying this normalized TF by the Jaccard score. In the UI component of your project, present these probable corrections to the user in case there are any mistakes in the query.
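
A minimal sketch of this candidate scoring, assuming a `term_freq` dictionary from your index (the function names and the bigram shingle size are illustrative, not the required signatures):

```python
def char_shingles(word, k=2):
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def correct_word(query_word, term_freq, top_n=5):
    # Rank vocabulary words by Jaccard similarity of character shingles.
    q = char_shingles(query_word)
    top = sorted(term_freq, key=lambda w: jaccard(q, char_shingles(w)), reverse=True)[:top_n]
    # Normalize TF within the top candidates, then combine with the Jaccard score.
    max_tf = max(term_freq[w] for w in top) or 1
    return max(top, key=lambda w: jaccard(q, char_shingles(w)) * term_freq[w] / max_tf)
```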

## 7. [Snippet](./core/utility/snippet.py)
In the snippet module, extract a good summary from the document. To achieve this, focus on the non-stopword tokens of the query. For each such token, locate the token or its variations in the document, and display "n" tokens before and after each occurrence. Merge these windows with '...' to create the snippet. Also, wrap query tokens in the snippet in three stars, with no space between the stars and the word inside them; for example, if token2 is present in the query, the returned snippet should look like "token1 \*\*\*token2\*\*\* token3". Choose these windows carefully: for example, if token1 occurs in two places in the document, and token2 of the query appears three tokens before the second occurrence, you must take the second window rather than the first. Additionally, identify the query tokens that are absent from the document and return them. A simplified sketch follows.
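
A simplified sketch of the windowing (it skips the careful window-selection rule described above and assumes pre-tokenized input; names are illustrative):

```python
def make_snippet(doc_tokens, query_tokens, n=3):
    query = set(query_tokens)
    not_found = [q for q in query_tokens if q not in doc_tokens]
    windows = []
    for i, tok in enumerate(doc_tokens):
        if tok in query:
            # Window of n tokens on each side of the hit, with the hit starred.
            lo, hi = max(0, i - n), min(len(doc_tokens), i + n + 1)
            piece = [f"***{t}***" if t in query else t for t in doc_tokens[lo:hi]]
            windows.append(" ".join(piece))
    return " ... ".join(windows), not_found
```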

## 8. [Utils](./utils.py)

This file contains functions that the UI needs for some of its important functionality. For now, you should complete the `clean_text` function, which the UI uses to apply the preprocessing operations you implemented in the `Preprocessor` class to the user's input query. You can **test** your implementation by running the UI with different inputs and checking how each query is cleaned (so it can be used better as we proceed in the project).

## 9. [Evaluation](./core/utility/evaluation.py)
This file contains code to evaluate the performance of an information retrieval or ranking system. There are several common evaluation metrics that can be implemented to systematically score a system's ability to retrieve and rank relevant results. The metrics calculated here are `precision`, `recall`, `F1 score`, `mean average precision (MAP)`, `normalized discounted cumulative gain (NDCG)`, and `mean reciprocal rank (MRR)`.

Each metric makes use of the actual relevant items and the predicted ranking to calculate an overall score. A higher score indicates better performance for that particular aspect of retrieval or ranking.

- Precision measures the percentage of predicted items that are relevant.
- Recall measures the percentage of relevant items that were correctly predicted.
- The F1 score combines precision and recall into a single measure.
- MAP considers the rank of the relevant items, rewarding systems that rank relevant documents higher.
- NDCG applies greater weight to hits at the top of the ranking.
- MRR looks at the position of the first relevant document in the predicted list.

Together, these metrics provide a more complete picture of how well the system is able to accurately retrieve and highly rank relevant information.
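
For reference, a compact sketch of a few of these metrics over lists of document IDs (the signatures here are illustrative, not the required ones in `evaluation.py`):

```python
def precision(actual, predicted):
    return len(set(predicted) & set(actual)) / len(predicted)

def recall(actual, predicted):
    return len(set(predicted) & set(actual)) / len(actual)

def f1(actual, predicted):
    p, r = precision(actual, predicted), recall(actual, predicted)
    return 2 * p * r / (p + r) if p + r else 0.0

def mrr(actual, predicted):
    # Reciprocal rank of the first relevant document; 0 if none appears.
    for rank, doc in enumerate(predicted, start=1):
        if doc in set(actual):
            return 1.0 / rank
    return 0.0
```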

## 10. [Scorer](./core/utility/scorer.py)
Please refer to the docstrings in `scorer.py` for a complete explanation of each functionality and what you should complete.

# Phase 2

## 1. Extending [Search](./core/search.py)

In this section, you should implement the `find_scores_with_unigram_model` function in the `Search` class, which is responsible for finding document scores based on the unigram model. You can use the new prototype functions that we have added to [Scorer](./core/utility/scorer.py) to calculate these scores.

## 2. Extending [Scorer](./core/utility/scorer.py)

In this section, you should implement the `compute_scores_with_unigram_model` and `compute_score_with_unigram_model` functions in the `Scorer` class. These functions are responsible for computing document scores based on the unigram model, used by [Search](./core/search.py) to find the best-matching documents for a given query. A sketch of such a scorer is shown below.
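
A sketch of the underlying scoring under Jelinek-Mercer smoothing (the smoothing choice and all parameter names here are assumptions; use whatever smoothing the course material specifies):

```python
import math

def unigram_log_score(query_tokens, doc_tf, doc_len, collection_tf, collection_len, alpha=0.5):
    # log P(q | d) under a smoothed unigram language model.
    score = 0.0
    for t in query_tokens:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_col = collection_tf.get(t, 0) / collection_len
        # Small epsilon guards against log(0) for unseen terms.
        score += math.log(alpha * p_doc + (1 - alpha) * p_col + 1e-12)
    return score
```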

## 3. [Link Analysis](./core/link_analysis/analyzer.py)

This section involves analyzing the links between actors and movies using the HITS algorithm, and then determining which actors and movies received the highest scores under the algorithm. We do this step by step in `analyzer.py`. The first step is to initialize the parameters of your link analyzer, such as the lists of hubs and authorities and the link graph built from the given root set. You may need some preprocessing for this, so you can pass these to the `initiate_params` function and call it in your code. The graph derived from the root set can be expanded before the HITS algorithm is run; `expand_graph` is defined for this purpose. You can read the link analysis slides for a better understanding. At the end, run the algorithm by calling the `hits` function and output the ten actors and movies with the highest scores.

**Note**: To implement the HITS algorithm, you need to implement a graph. For this, you can get help from the `LinkGraph` class in `graph.py`, where a template is provided for your implementation. You are free to modify this class in any way you like.
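
A minimal power-iteration sketch of HITS on a dense adjacency matrix (the `LinkGraph` class gives you an equivalent structure; the matrix form here is only for illustration):

```python
import numpy as np

def hits(adj, iters=50):
    # adj[h, a] = 1 if hub h (e.g., an actor) links to authority a (e.g., a movie).
    hub = np.ones(adj.shape[0])
    auth = np.ones(adj.shape[1])
    for _ in range(iters):
        auth = adj.T @ hub            # authority score: sum of incoming hub scores
        auth /= np.linalg.norm(auth)  # normalize to keep scores bounded
        hub = adj @ auth              # hub score: sum of outgoing authority scores
        hub /= np.linalg.norm(hub)
    return hub, auth

# Top-10 indices by score can then be read off with, e.g., np.argsort(auth)[::-1][:10].
```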

## 4. [Word Embedding](./core/word_embedding/README.md)
Please refer to the specific [Readme file](./core/word_embedding/README.md) for the explanation of the word embedding part.

## 5. [Classification](./core/classification/README.md)
Please refer to the specific [Readme file](./core/classification/README.md) for the explanation of the classification part.

## 6. [Clustering](./core/clustering/README.md)
Please refer to the specific [Readme file](./core/clustering/README.md) for the explanation of the clustering part.
5 changes: 5 additions & 0 deletions Logic/__init__.py
@@ -0,0 +1,5 @@
from .core import *
from .utils import *


__all__ = [k for k in globals().keys() if not k.startswith("_")]
9 changes: 9 additions & 0 deletions Logic/core/__init__.py
@@ -0,0 +1,9 @@
from .indexer import *
from .utility import *
from .search import *
from .link_analysis import *
from .classification import *
from .clustering import *
from .word_embedding import *

__all__ = [k for k in globals().keys() if not k.startswith("_")]
27 changes: 27 additions & 0 deletions Logic/core/classification/README.md
@@ -0,0 +1,27 @@
# Classification

This package contains the code for the classification phase of the project.
The classification phase is responsible for classifying the comment data into two classes: positive and negative.
You have to train the models on the training data and then use the trained models to classify the comment data that you crawled in the first phase. You can access the training data using [this link](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/code).

## Classes

Here is a brief description of the files in this package:

### 1. [Basic Classifier](basic_classifier.py)
This file contains an abstract class `BasicClassifier`, which is the base class for all the classifiers in this package. You should implement the `get_percent_of_positive_reviews` function in this class, which is responsible for computing the percentage of positive reviews in a list of reviews. In all classifiers, you have to use the fastText embeddings as the input to the classifier, except for the Naive Bayes classifier, where you have to use a count vectorizer to convert the text data into the classifier's input.

### 2. [Naive Bayes](naive_bayes.py)
This file contains the implementation of the Naive Bayes classifier.
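
A rough sketch of the count-vectorizer pipeline mentioned above (the toy reviews and variable names are assumptions for illustration, not the required interface):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the IMDB reviews and their labels (1 = positive, 0 = negative).
train_texts = ["a wonderful, moving film", "dull plot and terrible acting"]
train_labels = [1, 0]

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)  # raw strings -> token-count matrix
clf = MultinomialNB().fit(x_train, train_labels)
print(clf.predict(vectorizer.transform(["terrible, dull film"])))  # most likely [0]
```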

### 3. [SVM](svm.py)
This file contains the implementation of the Support Vector Machine classifier. You can use the scikit-learn library to implement the SVM classifier.

### 4. [KNN](knn.py)
This file contains the implementation of the K-Nearest Neighbors classifier.

### 5. [Deep Model](deep.py)
This file contains the implementation of the MLP model using the PyTorch library.

### 6. [Data Loader](data_loader.py)
This file contains the implementation of the data loader class, which is responsible for loading the data from disk and using the fastText model to generate the word embeddings. You have to split the data into training and testing sets in this file.
9 changes: 9 additions & 0 deletions Logic/core/classification/__init__.py
@@ -0,0 +1,9 @@
from .basic_classifier import *
from .data_loader import *
from .deep import *
from .knn import *
from .naive_bayes import *
from .svm import *


__all__ = [k for k in globals().keys() if not k.startswith("_")]
34 changes: 34 additions & 0 deletions Logic/core/classification/basic_classifier.py
@@ -0,0 +1,34 @@
import numpy as np
import sklearn
from tqdm import tqdm

from ..word_embedding.fasttext_model import FastText


class BasicClassifier:
    def __init__(self):
        pass

    def fit(self, x, y):
        pass

    def predict(self, x):
        pass

    def prediction_report(self, x, y):
        pass

    def get_percent_of_positive_reviews(self, sentences):
        """
        Get the percentage of positive reviews in the given sentences

        Parameters
        ----------
        sentences: list
            The list of sentences to get the percentage of positive reviews

        Returns
        -------
        float
            The percentage of positive reviews
        """
        pass
58 changes: 58 additions & 0 deletions Logic/core/classification/data_loader.py
@@ -0,0 +1,58 @@
import numpy as np
import pandas as pd
import tqdm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from Logic.core.word_embedding.fasttext_model import FastText


class ReviewLoader:
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.fasttext_model = None
        self.review_tokens = []
        self.sentiments = []
        self.embeddings = []

    def load_data(self):
        """
        Load the data from the csv file and preprocess the text. Then save the normalized tokens and the sentiment labels.
        Also, load the fasttext model.
        """
        self.fasttext_model = FastText(method='skipgram')
        self.fasttext_model.prepare(None, mode="load")
        df = pd.read_csv(self.file_path)
        le = LabelEncoder()
        df['sentiment'] = le.fit_transform(df['sentiment'])
        self.review_tokens = df['review'].to_numpy()
        self.sentiments = df['sentiment'].to_numpy()
        return self.review_tokens, self.sentiments

    def get_embeddings(self):
        """
        Get the embeddings for the reviews using the fasttext model.
        """
        self.embeddings = np.array([self.fasttext_model.get_query_embedding(token) for token in tqdm.tqdm(self.review_tokens)])

    def split_data(self, test_data_ratio=0.2):
        """
        Split the data into training and testing data.

        Parameters
        ----------
        test_data_ratio: float
            The ratio of the test data

        Returns
        -------
        np.ndarray, np.ndarray, np.ndarray, np.ndarray
            Return the training and testing data for the embeddings and the sentiments,
            in the order of x_train, x_test, y_train, y_test.
        """
        x_train, x_test, y_train, y_test = train_test_split(self.embeddings, self.sentiments, test_size=test_data_ratio, random_state=42)
        return x_train, x_test, y_train, y_test
