In the age of information, having quick and reliable access to accurate data is paramount. The AI Research Agent is here to revolutionize the way we gather information. Powered by advanced artificial intelligence and equipped with web scraping capabilities, this agent is adept at sourcing, summarizing, and citing information on a plethora of topics. Whether you're a researcher in academia, a professional in the industry, or simply a curious mind, the AI Research Agent streamlines the data gathering process, ensuring that you receive high-quality information from diverse sources. Welcome to the future of efficient and comprehensive research.
AI Research Agent: LLM 💬 + Memory 🧠+ Web Scraping 🕸️ + Tool Use 🧰
In today's fast-paced research environment, efficiency is essential. With this in mind, I developed an AI Research Agent designed to revolutionize the research process by automating data extraction from the vast expanse of the internet.
Key Features & Workflow:
- User Query Initiation: Researchers simply pose a research question to the AI.
- Web Search: The AI delves into the web, using the provided question as a search criterion, streamlining the hunt for relevant information.
- Relevance Analysis: Once on a site, the AI evaluates the content, determining its relevance to the posed query. This ensures that only pertinent information is extracted.
- Web Scraping: Using advanced web scraping techniques, the AI gathers data from the relevant sites.
- Information Synthesis: Leveraging the power of a Large Language Model, the agent then processes this data, distilling it into concise, key takeaways.
- User Output: The summarized information is then presented to the researcher, providing them with a digestible and relevant answer to their initial query.
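Taken together, these steps amount to a loop of search → filter → scrape → summarize. Below is a simplified, hypothetical sketch in Python (`pick_relevant` is a made-up placeholder; in the actual implementation described later, the LangChain agent itself decides which results are relevant and which URLs to scrape):

```python
# Hypothetical outline of the workflow above; the real control loop is driven
# by the LangChain agent defined in the Technical section below.
def research(question: str) -> str:
    search_results = search(question)                    # step 2: web search
    findings = []
    for result in pick_relevant(search_results):         # step 3: relevance analysis (placeholder helper)
        text = scrape_website(question, result["link"])  # step 4: web scraping
        if text:
            findings.append(text)
    return summary(question, "\n\n".join(findings))      # steps 5-6: synthesis & output
```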
The overarching aim of this project was to significantly reduce the time researchers spend sifting through myriad sources. By centralizing and automating this process, the AI Research Agent ensures that researchers can spend more time analyzing and less time searching.
Experiencing the power of the AI Research Agent for yourself is straightforward: just click on this link and you'll be on your way to a seamless research journey.
- Libraries to install:
```
tweepy
beautifulsoup4
bs4
fastapi
jsonschema
jsonschema-specifications
langchain
langchainplus-sdk
matplotlib
numpy
openai
openapi-schema-pydantic
pandas
pydantic
python-dotenv
streamlit
typer
typing_extensions
typing-inspect
tiktoken
streamlit-js-eval
pymongo
```
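The functions in the Technical section expect a Serper API key (web search) and a Browserless API key (scraping), and LangChain's ChatOpenAI needs an OpenAI key. Here is a minimal setup sketch, assuming the keys are kept in a local `.env` file under the names shown (the env var names are assumptions; adjust them to your own file):

```python
import os

from dotenv import load_dotenv

# Load API keys from a local .env file. The env var names below are assumptions;
# OPENAI_API_KEY is picked up from the environment automatically by ChatOpenAI.
load_dotenv()
serper_api_key = os.getenv("SERP_API_KEY")              # Serper.dev search API key
browserless_api_key = os.getenv("BROWSERLESS_API_KEY")  # Browserless scraping service key
```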
- Defining the Large Language Model
```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")
```
- Description: When the application is running, it dives into the top search results related to your query. Using a sophisticated large language model, the AI Agent evaluates whether a website contains information relevant to your research objective. If a match is found, the AI Agent extracts the necessary text data. This process is repeated for multiple websites, ensuring a comprehensive search. At the end of this research expedition, the AI Agent crafts a summarized output of its findings, neatly packaged with citations for your reference.
- Technical:
- Function to search the web:

```python
import json

import requests

# serper_api_key is assumed to be loaded from the environment (see the .env sketch above).
def search(query):
    """Query the Serper.dev Google Search API and return the raw JSON response."""
    url = "https://google.serper.dev/search"
    payload = json.dumps({"q": query})
    headers = {
        'X-API-KEY': serper_api_key,
        'Content-Type': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload)
    print(response.text)
    return response.text
```
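As a quick sanity check, the function can be called on its own (the query is just an example; the return value is the raw JSON string from the Serper API):

```python
# Example call; prints and returns Serper's JSON search results as a string.
results_json = search("latest developments in solid-state batteries")
```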
- Function to scrape a website:

```python
from bs4 import BeautifulSoup

# browserless_api_key is assumed to be loaded from the environment (see the .env sketch above).
def scrape_website(objective: str, url: str):
    """Fetch a page through the Browserless service and return its text,
    summarizing it first if it is too long to pass to the LLM directly."""
    print("Scraping website...")
    headers = {
        'Cache-Control': 'no-cache',
        'Content-Type': 'application/json',
    }
    data = {"url": url}
    data_json = json.dumps(data)
    post_url = f"https://chrome.browserless.io/content?token={browserless_api_key}"
    response = requests.post(post_url, headers=headers, data=data_json)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        text = soup.get_text()
        print("THIS CONTENT:", text)
        if len(text) > 10000:
            output = summary(objective, text)
            return output
        else:
            return text
    else:
        print(f"HTTP request failed with status code {response.status_code}")
```
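For illustration, a direct call might look like this (the URL is a placeholder; in real runs the agent only passes URLs taken from the search results, and the summary helper defined below must exist first, since long pages get summarized):

```python
# Example call with a placeholder URL; returns the page text, or a summary if the page is long.
text = scrape_website(
    objective="market size of solid-state batteries",
    url="https://example.com/battery-report",
)
```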
- Function to produce the final summary:

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

def summary(objective, content):
    """Split long text into chunks and run a map-reduce summarization focused on the objective."""
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
    docs = text_splitter.create_documents([content])
    map_prompt = """
    Write a summary of the following text for {objective}:
    "{text}"
    SUMMARY:
    """
    map_prompt_template = PromptTemplate(
        template=map_prompt, input_variables=["text", "objective"])
    summary_chain = load_summarize_chain(
        llm=llm,
        chain_type='map_reduce',
        map_prompt=map_prompt_template,
        combine_prompt=map_prompt_template,
        verbose=True
    )
    output = summary_chain.run(input_documents=docs, objective=objective)
    return output
```
- Creating a research AI agent using the functions above with LangChain & OpenAI:

```python
from typing import Type

from pydantic import BaseModel, Field
from langchain.agents import Tool
from langchain.tools import BaseTool


class ScrapeWebsiteInput(BaseModel):
    """Inputs for scrape_website"""
    objective: str = Field(
        description="The objective & task that users give to the agent")
    url: str = Field(description="The url of the website to be scraped")


class ScrapeWebsiteTool(BaseTool):
    name = "scrape_website"
    description = "useful when you need to get data from a website url, pass both the objective and url to the function; DO NOT make up any url, the url shall only be from the search results"
    args_schema: Type[BaseModel] = ScrapeWebsiteInput

    def _run(self, objective: str, url: str):
        return scrape_website(objective, url)

    def _arun(self, objective: str, url: str):
        raise NotImplementedError("scrape_website does not support async execution")


tools = [
    Tool(
        name="Search",
        func=search,
        description="Useful for when you need to answer questions regarding all topics. You shall ask targeted questions."
    ),
    ScrapeWebsiteTool(),
]
```
```python
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    memory_key="memory", return_messages=True, llm=llm, max_token_limit=1000)
```
```python
from langchain.schema import SystemMessage

system_message = SystemMessage(
    content="""You are a top world class researcher who works for the United States of America and is cleared to receive any internal information on any topic in every company or organization on Earth. You will conduct extremely detailed research on any topic and produce facts based on results; you do not hallucinate, you will try as hard as possible to gather facts & data to back up and prove the research.

Please make sure you complete the objective above with the following rules:
1/ You shall do enough research to gather as much information as possible about the objective.
2/ If there are urls of relevant links & articles, you will scrape it to gather more information about the objective.
3/ After scraping & search, you shall think "Is there any new information I should search & scrape based on the data I collected to increase the research quality?" If the answer is yes, continue; But do not do this more than 5 iterations.
4/ You shall not hallucinate, you shall only write facts with the data that you have gathered.
5/ In the final output, You shall include citations of all reference data & links to back up and prove your research; You shall include citations of all reference data & links to back up and prove your research.
6/ In the final output, You shall include citations of all reference data & links to back up and prove your research; You shall include citations of all reference data & links to back up and prove your research."""
)
```
```python
from langchain.agents import AgentType, initialize_agent
from langchain.prompts import MessagesPlaceholder

agent_kwargs = {
    "extra_prompt_messages": [MessagesPlaceholder(variable_name="memory")],
    "system_message": system_message,
}

chat_history = []

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    agent_kwargs=agent_kwargs,
    memory=memory,
    chat_history=chat_history,
)
```
- Function to return the result output & total (query + response) cost:

```python
from langchain.callbacks import get_openai_callback

def run_agent(query):
    """Run the agent on a query and return the result plus the total OpenAI API cost."""
    total = 0.0
    with get_openai_callback() as cb:
        result = agent({"input": query})
        total += cb.total_cost
    return result, total
```
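A minimal usage sketch (the question is just an example; result is the dictionary returned by the LangChain agent, with the answer under its "output" key):

```python
if __name__ == "__main__":
    # Example query; result["output"] holds the agent's final, cited summary.
    result, cost = run_agent("What are the main applications of large language models in healthcare?")
    print(result["output"])
    print(f"Approximate OpenAI API cost for this query: ${cost:.4f}")
```

In the full project this call is presumably wrapped in a Streamlit front end (streamlit and streamlit-js-eval appear in the library list), but the agent can be exercised from a plain script just as well.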
- Remember to use the agent responsibly and always review the summarized content to ensure its relevance and accuracy.
We value your feedback! If you have any suggestions or have found bugs, please report them via email to [email protected].
Stay tuned for regular updates and new features!