News Inspector

What's my job?

Our application provides three functions: checking for plagiarism between news articles, searching a large corpus of news articles based on a query entered by the user, and scraping news articles from the web.

Plagiarism Checker

Data

Built on 29661 news articles published from 2016 to 2017. Each article was trimmed down to 500 words. The data was preprocessed before being fed into the model.
Approach

Two documents can be considered similar if their semantic context is similar. Manually identifying similarity across a large number of documents is difficult, so we use Doc2Vec, which represents each document as a vector that preserves its context.
Results

Popular paraphrasing tools such as QuillBot, EditPad and Paraphraser were used to generate plagiarised versions of news articles in the data. After building the Doc2Vec model, we found that it detected the plagiarised content with 90-99% confidence.

Search Engine

Data

Built on 7395 news articles published from 2016 to 2017. Each article was trimmed down to 500 words. The data was preprocessed before being fed into the model.
Approach

We used Google's pre-trained Universal Sentence Encoder (Deep Averaging Network variant) to build the document search engine. It is pre-trained on a large corpus and can be used for a variety of tasks (sentiment analysis, classification and so on).
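A search engine built on sentence embeddings typically ranks documents by their similarity to the embedded query. The sketch below shows only that ranking step, using cosine similarity as an assumed scoring function. `toy_embed` is a hashed bag-of-words stand-in so the example runs without downloading a model; it is not the Universal Sentence Encoder, and the function names and sample documents are hypothetical.

```python
import math
import zlib

DIM = 4096  # size of the toy embedding space (arbitrary choice for this sketch)


def toy_embed(text):
    """Placeholder embedding: hashed bag-of-words counts.

    A real setup would call the pre-trained encoder here instead.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, documents, embed=toy_embed, top_k=10):
    """Return up to top_k (score, document) pairs, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


documents = [
    "stock markets fell sharply today",
    "the team won the football match",
    "new phone released by tech company",
]
results = search("football match", documents, top_k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Passing the embedding function as a parameter keeps the ranking logic independent of the encoder, so swapping in a real sentence encoder would only require replacing `embed`.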
Results

The model returns the ten news articles most relevant to the user's query. We tried a large number of queries, and each time the search engine returned documents that we manually verified as relevant, with confidence scores ranging from 30% to 50%.

Web Scraper

Data

Our web scraper is built to retrieve the five latest news articles from Quartz.
Approach

The scraper is built using Beautiful Soup. To create the "soup" object, the HTML retrieved from the web page's URL is passed to BeautifulSoup, and we then select the components of the web page that we need.
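The parsing step might look roughly like the sketch below. It runs on an inline HTML string so that no network access is needed; in practice the HTML would first be fetched from the page's URL. The tag names, class names and the `parse_articles` helper are hypothetical and do not reflect Quartz's real markup.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched page (structure invented for this sketch).
SAMPLE_HTML = """
<html><body>
  <article>
    <h2 class="title"><a href="https://example.com/news/1">First headline</a></h2>
    <span class="author">Jane Doe</span>
    <p class="content">Body of the first article.</p>
  </article>
  <article>
    <h2 class="title"><a href="https://example.com/news/2">Second headline</a></h2>
    <span class="author">John Roe</span>
    <p class="content">Body of the second article.</p>
  </article>
</body></html>
"""


def parse_articles(html, limit=5):
    """Extract title, link, author and content from up to `limit` articles."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    for node in soup.find_all("article", limit=limit):
        heading = node.find("h2", class_="title")
        link = heading.find("a") if heading else None
        author = node.find("span", class_="author")
        content = node.find("p", class_="content")
        parsed.append({
            "title": link.get_text(strip=True) if link else None,
            "link": link.get("href") if link else None,
            "author": author.get_text(strip=True) if author else None,
            "content": content.get_text(strip=True) if content else None,
        })
    return parsed


parsed = parse_articles(SAMPLE_HTML)
for item in parsed:
    print(item["title"], "by", item["author"], "->", item["link"])
```

Each lookup is guarded against a missing element, since real pages often omit fields such as the author.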
Results

We successfully retrieved the author name, title and content of each published article, as well as the links to these news articles.
