What's my job?

Our application provides three features: checking for plagiarism between news articles, searching a large corpus of news articles based on a user's query, and scraping news articles from the web.

Plagiarism Checker

  • Data

    Built on 29661 news articles from 2016 to 2017. Each article was trimmed to 500 words per document, and the data was preprocessed before being fed into the model.

  • Approach

    Two documents can be considered similar if their semantic context is similar. Manually identifying similarity across a large number of documents is difficult, so we used Doc2Vec, which represents each document as a vector that preserves its context.

  • Results

    Popular paraphrasing tools such as QuillBot, EditPad and Paraphraser were used to generate plagiarised news articles in the data. After building the Doc2Vec model, we found that it detected the plagiarised content with 90-99% confidence.
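
A minimal sketch of how two document vectors could be compared once a model such as Doc2Vec has produced them. The vectors below are made-up 4-dimensional stand-ins, and `is_plagiarised` with its `threshold` of 0.9 is a hypothetical helper for illustration, not the project's actual code or cut-off.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def is_plagiarised(vec_a, vec_b, threshold=0.9):
    """Hypothetical decision rule: flag pairs at or above the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Toy vectors standing in for Doc2Vec output (assumed values).
original = [0.8, 0.1, 0.3, 0.5]
paraphrased = [0.7, 0.2, 0.3, 0.5]
unrelated = [-0.5, 0.9, -0.2, 0.1]

print(is_plagiarised(original, paraphrased))
print(is_plagiarised(original, unrelated))
```

In a real run, each vector would be inferred by the trained model from an article's preprocessed tokens rather than written by hand.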

Search Engine

  • Data

    Built on 7395 news articles from 2016 to 2017. Each article was trimmed to 500 words per document, and the data was preprocessed before being fed into the model.

  • Approach

    We used Google's pre-trained Universal Sentence Encoder (Deep Averaging Network variant) to build the document search engine. It is pre-trained on a large corpus and can be used for a variety of tasks (sentiment analysis, classification and so on).

  • Results

    The model returned the top ten news articles most relevant to the user's query. We tried many queries, and each time the search engine returned relevant documents, which we verified manually, with confidence scores ranging from 30-50%.
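
A rough sketch of the retrieval step: given an embedding for the query and one per article, rank the articles and keep the best `k`. Ranking by cosine similarity is an assumption here (the similarity measure is not stated above), the 3-dimensional vectors are toy values standing in for sentence-encoder output, and `normalize`, `top_k` and the document ids are hypothetical names.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=10):
    """Return (doc_id, score) pairs for the k highest-scoring documents.

    With unit-length vectors, the dot product equals cosine similarity.
    """
    q = normalize(query_vec)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings (assumed values); real ones would come from the encoder.
docs = {
    "election-recap": [0.9, 0.1, 0.0],
    "football-final": [0.0, 0.2, 0.9],
    "senate-vote": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, docs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Article embeddings can be computed once and stored, so only the query needs to be embedded at search time.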

Web Scraper

  • Data

    Our web scraper is built to retrieve the five latest news articles from Quartz.

  • Approach

    The scraper is built using Beautiful Soup. To create the “Soup” object, the HTML of the web page, fetched from its hyperlink, is passed to BeautifulSoup, and then we selected the components of the web page.

  • Results

    We were able to successfully retrieve the author's name, title and content of each published article, as well as the links to these news articles.
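
A small sketch of the parsing step with Beautiful Soup, run against an inline HTML snippet so that it works offline. The markup, the CSS classes (`title`, `author`, `content`) and the `parse_articles` helper are all assumptions for illustration; the real Quartz pages will use different selectors.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <article>
    <h2 class="title"><a href="https://example.com/a1">First headline</a></h2>
    <span class="author">Jane Doe</span>
    <p class="content">Body of the first article.</p>
  </article>
  <article>
    <h2 class="title"><a href="https://example.com/a2">Second headline</a></h2>
    <span class="author">John Roe</span>
    <p class="content">Body of the second article.</p>
  </article>
</body></html>
"""


def parse_articles(html, limit=5):
    """Extract title, link, author and content from up to `limit` articles."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for node in soup.find_all("article", limit=limit):
        link = node.select_one("h2.title a")
        author = node.select_one(".author")
        content = node.select_one(".content")
        articles.append({
            "title": link.get_text(strip=True) if link else None,
            "url": link.get("href") if link else None,
            "author": author.get_text(strip=True) if author else None,
            "content": content.get_text(strip=True) if content else None,
        })
    return articles


articles = parse_articles(SAMPLE_HTML)
for article in articles:
    print(article["title"], "by", article["author"], "->", article["url"])
```

For a live page, the HTML string would first be downloaded with an HTTP client (for example `requests.get(url).text`) and then passed to `parse_articles`.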

To know more...