Dark Web Crawler Using Python And Tor With Code Explanation

Creating Dark Web Crawler Using Python And Tor

What is the Dark Web?

A primary use of the dark web is e-commerce. By paying with cryptocurrency such as Bitcoin, users can make purchases on the dark web without revealing their identity.

This lends itself well to criminal activity and hidden services, such as:

  • Hitmen
  • Purchasing and selling credit card numbers, bank account numbers or online banking information
  • Money laundering
  • Illegal content like child pornography
  • Purchasing and selling illegal drugs
  • Purchasing and selling counterfeit money
  • Purchasing and selling weapons 

Accessing the dark web and using the tools or services found there is a high-risk activity both for individuals and enterprises. Dangers that users should be aware of before browsing the dark web include:

  • Viruses, ransomware and malware such as keyloggers and Remote Access Trojans (RATs), Distributed Denial of Service (DDoS) or other cyber attacks.
  • Identity theft, credential theft or phishing.
  • Compromise of personal, customer, financial or operational data.
  • Leaks of intellectual property or trade secrets.
  • Spying, webcam hijacking or cyberespionage.

Advantages Of Dark Web Crawler

The dark web is often used for illegal activity, such as the sale of:
  • Drugs
  • Weapons
  • Stolen personal information
Web crawlers can be used to monitor the dark web and gather intelligence on illegal activity, helping law enforcement agencies track down and prosecute those involved. Dark web crawlers are also important in the field of dark web monitoring, because the dark web is a part of the internet that is not indexed by traditional search engines and can be accessed only through special software such as the TOR network.

Why We Are Using Python
  • Python offers a vast number of libraries and frameworks for web scraping and data processing. It has a large and active community of developers, and as a result there are many libraries that simplify the process of building a web crawler. For example, this script uses the BeautifulSoup library to parse HTML and extract links and other information from web pages, and the requests library to send HTTP requests and retrieve web pages (a short illustration of these two libraries follows this list).
  • Python is known for its readability and simplicity, which makes it easier to write and debug code. This matters when building a web crawler, because crawling the web can be complex and error-prone. Clear, easy-to-understand code reduces the risk of errors and improves the overall efficiency of the crawler.
  • The script examined here can search for keywords in websites and perform snowball-sampling crawling. This is useful for finding specific pieces of information, because it lets the crawler focus on websites that are likely to contain the information being sought. The script also prints the titles of the pages it visits, which helps identify relevant pages.
  • In short, Python is well suited to building a crawler that can crawl websites over TOR and search for keywords: it offers a wide choice of scraping libraries, a simple and readable syntax, and straightforward support for keyword search and snowball-sampling crawling. Whether you are a researcher, a business owner, or simply someone looking for information on the internet, a Python-based web crawler can be a valuable tool.
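As a quick illustration of the two libraries mentioned above, the minimal sketch below fetches a page and lists its links. It shows only the requests + BeautifulSoup part, not the TOR routing, and uses http://example.com as a placeholder URL; any reachable page will do.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with the page you want to inspect.
url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}

# requests fetches the page, BeautifulSoup parses the returned HTML.
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Print the page title and every hyperlink found on it.
print('Title:', soup.title.string if soup.title else '(no title)')
for a in soup.find_all('a'):
    href = a.get('href')
    if href:
        print('Link:', href)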
Prerequisites Required Before Starting The Code

To run this code we need to have the following prerequisites:
  • Python installed (https://www.python.org/downloads/)
  • TOR installed (https://www.torproject.org/), with its control port enabled as shown in the torrc sketch below
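The script connects to TOR's control port and authenticates with a password, so TOR must be configured to expose that port. Below is a minimal torrc sketch: the port numbers are TOR's defaults (SOCKS on 9050, control port on 9051), the password matches the 'mypassword' used later in the script, and the hashed value is a placeholder you generate yourself with the tor --hash-password command.

# Generate the hashed control password in a terminal:
#   tor --hash-password mypassword
# Paste the output into the HashedControlPassword line below.

# torrc (TOR configuration file) - minimal sketch
SocksPort 9050
ControlPort 9051
HashedControlPassword 16:REPLACE_WITH_OUTPUT_OF_tor_--hash-password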
Libraries Required

The script uses the following Python Libraries which you will need to install in order to run the script:
  • requests: A library for sending HTTP requests and receiving responses.
  • stem: A library for interacting with the TOR control port.
  • BeautifulSoup: A library for parsing HTML and extracting information from web pages (installed as the beautifulsoup4 package).
How To Install These Libraries In Windows 10

       Steps to be followed:

  • Open Cmd (Command Prompt).
  • Run it as an Administrator.
  • A path such as C:\Windows\System32> will appear in the command prompt.
  • Check that Python is installed by typing python --version after this path (C:\Windows\System32>python --version). A version string such as Python 3.10.5 should be printed. If you instead typed python and landed at the interactive prompt (>>>), type exit() to return to the command prompt, because pip is run from the command prompt, not from inside Python.
  • Install each library by running the following commands at the command prompt:
pip install requests
pip install stem
pip install beautifulsoup4
  • The libraries will be installed.
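To confirm the installation, you can run a short import check such as the sketch below (saved under a hypothetical name like check_libs.py and run with python check_libs.py). It only imports the three libraries and prints a confirmation.

# check_libs.py - minimal sketch to confirm the three libraries import correctly
import requests
import stem
import bs4

print('requests, stem and bs4 imported successfully')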
Execution Code:
    
import time

import requests
from stem import Signal
from stem.control import Controller
from bs4 import BeautifulSoup

# Set the number of links to crawl
num_links_to_crawl = 100

# Set the user agent to use for the request
user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36')

# Set the headers for the request
headers = {'User-Agent': user_agent}

# Route the requests through Tor's SOCKS proxy (default port 9050).
# Requires SOCKS support for requests: pip install requests[socks]
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

# Initialize the controller for the Tor network (control port 9051)
with Controller.from_port(port=9051) as controller:
    # Authenticate with the control port password configured in torrc
    controller.authenticate(password='mypassword')

    # Set the starting URL
    url = 'http://example.com'

    # Initialize the visited set and the link queue
    visited = set()
    queue = [url]

    # Get the list of keywords to search for
    keywords = [k.strip() for k in input(
        'Enter a list of keywords to search for, separated by commas: ').split(',')]

    # Crawl the links
    while queue:
        # Get the next link in the queue
        link = queue.pop(0)

        # Skip the link if it has already been visited
        if link in visited:
            continue

        # Request a new Tor circuit (new exit IP address)
        controller.signal(Signal.NEWNYM)
        time.sleep(controller.get_newnym_wait())

        # Send the request to the URL through Tor
        try:
            response = requests.get(link, headers=headers, proxies=proxies, timeout=30)
        except requests.RequestException:
            visited.add(link)
            continue

        # Parse the response
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all links on the page
        links = soup.find_all('a')

        # Add any links that contain the keywords to the queue
        for a in links:
            href = a.get('href')
            if href and any(keyword in href for keyword in keywords):
                queue.append(href)

        # Add the link to the visited set
        visited.add(link)

        # Print the title and URL of the page
        title = soup.title.string if soup.title else '(no title)'
        print(title, link)

        # Check if the number of visited links has reached the limit
        if len(visited) >= num_links_to_crawl:
            break

# Print the visited links
print('Visited links:')
for link in visited:
    print(link)
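To run the script (assuming it has been saved under a hypothetical name such as crawler.py, and that TOR is already running with the control-port settings shown in the prerequisites), start it from the command prompt and type the keywords when asked:

python crawler.py
Enter a list of keywords to search for, separated by commas: market, forum, shop

The keyword list shown here is only an example; the crawler will follow any link whose URL contains one of the keywords you enter.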



Code Explanation

  • The script imports the necessary modules, including time, requests, stem and BeautifulSoup.
  • It sets the number of links to crawl, the user agent to use for the request, and the Tor SOCKS proxy settings.
  • It initializes the Tor controller with the specified password.
  • It sets the starting URL, initializes the visited set and the link queue, and gets the list of keywords to search for.
  • The script then begins crawling the links by looping through the link queue.
  • For each link in the queue, the script requests a new Tor identity (NEWNYM) so the request leaves from a different circuit and IP address, then sends the HTTP request to the URL through Tor's SOCKS proxy.
  • The response is parsed using BeautifulSoup to find all links on the page.
  • Any links that contain the specified keywords are added to the link queue.
  • The visited link is added to the visited set and the title and URL of the page are printed.
  • The script checks if the number of visited links has reached the limit and exits the loop if it has.
  • Finally, the script prints the list of visited links.
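The NEWNYM step can also be tested on its own. The sketch below is a minimal illustration, assuming TOR's default SOCKS port 9050 and control port 9051, the same 'mypassword' control password as above, and api.ipify.org as an arbitrary what-is-my-IP service; note that a new circuit is not guaranteed to use a different exit address every time.

import time

import requests
from stem import Signal
from stem.control import Controller

# Route traffic through Tor's SOCKS proxy (default port 9050).
proxies = {'http': 'socks5h://127.0.0.1:9050',
           'https': 'socks5h://127.0.0.1:9050'}

def current_exit_ip():
    # api.ipify.org echoes back the caller's public IP address as plain text.
    return requests.get('https://api.ipify.org', proxies=proxies, timeout=30).text

with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='mypassword')

    print('Exit IP before NEWNYM:', current_exit_ip())

    # Ask Tor for a new circuit, waiting as long as Tor requires between requests.
    controller.signal(Signal.NEWNYM)
    time.sleep(controller.get_newnym_wait())

    # The exit address usually changes, but Tor does not guarantee it.
    print('Exit IP after NEWNYM:', current_exit_ip())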
CONCLUSION

Web crawlers are an essential tool for organizing and indexing the vast amount of information on the internet. They play a crucial role in the discovery and ranking of web pages, and are used by search engines to help users find the information they are looking for. Web crawlers can also track changes to websites over time, and they are important in the field of dark web monitoring. Overall, the benefits and importance of web crawlers cannot be overstated, as they help make the vast and constantly changing landscape of the internet more accessible and easier to navigate.

THANK YOU!!!

Contact Us

sagnikbasu54@gmail.com
