
Overview


Udemy Scraper


A web scraper built with Beautiful Soup that fetches Udemy course information and exports it to a JSON, CSV, or XML file.

Installation

Virtual Environment

First, it is recommended to install and run this inside a virtual environment. You can do so by using the virtualenv package and then activating it.

pip install virtualenv

virtualenv somerandomname

Activating for *nix

source somerandomname/bin/activate

Activating for Windows

somerandomname\Scripts\activate

Package Installation

pip install -r requirements.txt

Chrome setup

Be sure to have Chrome installed, along with the corresponding version of chromedriver. A Windows binary is already provided; if you need the Linux binary, you can download chromedriver from its downloads page.
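
If you are pointing Selenium at the driver yourself, the setup is roughly as follows. This is a minimal sketch, not the project's actual configuration: the driver path and the headless flag are illustrative, so adjust them for your OS and layout.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the bundled driver binary; adjust for your OS and layout.
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # optional: no visible browser window

driver = webdriver.Chrome(service=Service("./chromedriver.exe"), options=options)
driver.get("https://www.udemy.com")
print(driver.title)
driver.quit()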

Approach

Most sites are fairly easy to web-scrape; however, some are not that scrape-friendly. Scraping a site is, in itself, perfectly legal, but there have been lawsuits over web scraping: some companies (*cough* Amazon *cough*) consider scraping their website illegal, even though they themselves scrape other websites. And then there are sites like Udemy that actively try to prevent people from scraping them.

Using BS4 on its own doesn't return the required results, so I had to drive a browser engine through Selenium to fetch the course information. Even that didn't work at first, until I realised the courses were being fetched asynchronously, so I had to add a bit of delay. As a result, fetching the data can be a bit slow initially.
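
In practice, the delay-based approach boils down to something like the following. This is a rough sketch rather than the project's actual code; the search URL and the sleep duration are illustrative.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
# The search results are rendered client-side, so the page source is
# incomplete if grabbed immediately after driver.get().
driver.get("https://www.udemy.com/courses/search/?q=javascript")
time.sleep(4)  # crude fixed wait for the asynchronous rendering to finish

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()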

Functionality

As of this commit, the script can search Udemy for the term you enter and fetch the course link, along with all the other overview details such as the description, instructor, duration, and rating.

Here is a JSON representation of the data it can currently fetch:

{
  "query": "The Complete Angular Course: Beginner to Advanced",
  "link": "https://udemy.com/course/the-complete-angular-master-class/",
  "title": "The Complete Angular Course: Beginner to Advanced",
  "headline": "The most comprehensive Angular 4 (Angular 2+) course. Build a real e-commerce app with Angular, Firebase and Bootstrap 4",
  "instructor": "Mosh Hamedani",
  "rating": "4.5",
  "duration": "29.5 total hours",
  "no_of_lectures": "376 lectures",
  "tags": ["Development", "Web Development", "Angular"],
  "no_of_rating": "23,910",
  "no_of_students": "96,174",
  "course_language": "English",
  "objectives": [
    "Establish yourself as a skilled professional developer",
    "Build real-world Angular applications on your own",
    "Troubleshoot common Angular errors",
    "Master the best practices",
    "Write clean and elegant code like a professional developer"
  ],
  "Sections": [
    {
      "name": "Introduction",
      "lessons": [{ "name": "Introduction" }, { "name": "What is Angular" }],
      "no_of_lessons": 12
    },
    {
      "name": "TypeScript Fundamentals",
      "lessons": [
        { "name": "Introduction" },
        { "name": "What is TypeScript?" }
      ],
      "no_of_lessons": 18
    },
    {
      "name": "Angular Fundamentals",
      "lessons": [
        { "name": "Introduction" },
        { "name": "Building Blocks of Angular Apps" }
      ],
      "no_of_lessons": 10
    }
  ],
  "requirements": [
    "Basic familiarity with HTML, CSS and JavaScript",
    "NO knowledge of Angular 1 or Angular 2 is required"
  ],
  "description": "\nAngular is one of the most popular frameworks for building client apps with HTML, CSS and TypeScript. If you want to establish yourself as a front-end or a full-stack developer, you need to learn Angular.\n\nIf you've been confused or frustrated jumping from one Angular 4 tutoria...",
  "target_audience": [
    "Developers who want to upgrade their skills and get better job opportunities",
    "Front-end developers who want to stay up-to-date with the latest technology"
  ],
  "banner": "https://foo.com/somepicture.jpg"
}

Usage

To use the scraper, import it as a module and then create a new course instance like so:

from udemyscraper import UdemyCourse

This will import the UdemyCourse class. You can then create an instance of it and pass the search query to it, preferably the exact course name.

from udemyscraper import UdemyCourse

javascript_course = UdemyCourse("Javascript course for beginners")

This will create an empty instance of UdemyCourse. To fetch the data, you need to call the fetch_course function.

javascript_course.fetch_course()

Now that you have the course, you can access all of its data as shown here:

print(javascript_course.Sections[2].lessons[1].name)  # prints the 3rd section's 2nd lesson's name
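
The release notes further down mention an udemyscraper.export submodule with course_to_csv and course_to_xml functions. Assuming that interface (the function names come from the changelog; the exact signatures are unverified here), exporting a fetched course might look like this:

from udemyscraper import UdemyCourse
from udemyscraper.export import course_to_csv, course_to_xml

course = UdemyCourse("Javascript course for beginners")
course.fetch_course()

course_to_csv([course])  # per the changelog, takes an array of courses
course_to_xml(course)    # serializes the course object to an xml file
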
Comments
  • pip install fails

    Describe the bug: Unable to install udemyscraper via pip install.

    To reproduce: ERROR: Cannot install udemyscraper==0.8.1 and udemyscraper==0.8.2 because these package versions have conflicting dependencies.

    The conflict is caused by: udemyscraper 0.8.2 depends on getopt2==0.0.3; udemyscraper 0.8.1 depends on getopt2==0.0.3.

    Desktop (please complete the following information):

    • OS: macOS
    bug 
    opened by nuggetsnetwork 5
  • udemyscraper times out

    Describe the bug: When running the sample code, all I get is a timeout.

    To reproduce: run the sample code:

    from udemyscraper import UdemyCourse

    course = UdemyCourse()
    course.fetch_course('learn javascript')
    print(course.title)

    Current behavior: Timed out waiting for page to load or could not find a matching course.

    OS: macOS

    bug duplicate 
    opened by nuggetsnetwork 3
  • Switch to browser explicit wait

    EXPERIMENTAL! Needs testing.

    time.sleep() introduces an unnecessary wait, even if the page has already been loaded.

    By using expected_conditions, we can proceed as soon as the element loads. Using the Python time library, I calculated the time taken by the search and course pages to load to be about 2 seconds.

    Theoretically, the change should have reduced execution time by 5 seconds (3 + 4 - 2). However, the gain was only 3 seconds instead of the expected 5.

    This behavior seems unexpected for the moment, unless we can find where the missing 2 seconds went. For reference, the original version using time.sleep() took 17 seconds to execute.

    (All times were measured on my internet connection, by executing the example given in the readme. A sketch of the explicit-wait approach appears after this list.)

    We may need to dig further; I haven't yet had time to read the full code.

    bug optimization 
    opened by flyingcakes85 3
  • Use explicit wait for search query

    Here 4 seconds have been hardcoded; it would be better to wait for the search results to load and then get the source code.

    A basic way to do this is to check whether the search element is visible; once it is, we can proceed to fetch the source code. This way, if you have a really fast connection, you don't have to wait any longer than necessary, and vice versa.

    bug optimization 
    opened by sortedcord 3
  • Classes Frequently Keep Changing

    It seems that on the search page, the classes of the elements keep changing. So it would be best to fetch only the course URL and then fetch all the other data from the course page itself.

    bug 
    opened by sortedcord 3
  • Serialize to xml

    Experimental!!

    Export the entire dictionary to an xml file using the dict2xml library.

    • [x] Make branch even with refactor base
    • [x] Switch to dicttoxml from dict2xml
    • [x] Object arrays of sections and lessons are not grouped under one root called Sections or Lessons. This is also the case for all of the other arrays.
    • [x] Rename List item
    • [x] Rename root tag to course
    enhancement area: module 
    opened by sortedcord 2
  • Automatically fetch drivers

    Set up a way to automatically fetch browser drivers based on the user's choice (chromium/firefox), matching the installed browser version.

    The hard part will be to find the version of the browser installed.

    enhancement help wanted 
    opened by sortedcord 2
  • Timed out waiting for page to load or could not find a matching course

    Whenever I try to scrape a course from udemy I get this error:

    on 1: Timed out waiting for page to load or could not find a matching course
    Scraping Course |████████████████████████████████████████| 1 in 29.5s (0.03/s)
    

    It worked a couple of times before, but now it doesn't.

    Steps to reproduce the behavior:

    1. This happens both when using the script and the module
    2. I used query argument
    3. Output: [image]

    Desktop (please complete the following information):

    • OS: Windows 10
    • Browser: Chromium
    • Version: 92

    I did check by manually opening Chromium and searching for the course, but when I use the scraper it doesn't work.

    bug good first issue wontfix area: module 
    opened by sortedcord 1
  • Optimize element search

    Some tests have shown that CSS selectors are far more efficient than find, especially compared with nested finds, which tend to be much slower and more time-consuming. It would be better to replace all of the find statements with select and use a direct path. (See the selector sketch after this list.)

    optimization 
    opened by sortedcord 1
  • 🌐 Added browser selection argument

    Instead of editing the source code to select which browser you would like to use, you can now specify it while initializing the UdemyCourse class, or simply pass an argument when using the standalone script.

        -b  --browser       Allows you to select the browser you would like to use for Scraping
                            Values: "chrome" or "firefox". Defaults to chrome if no argument is passed.
    

    Also provided a geckodriver.exe binary.

    enhancement optimization 
    opened by sortedcord 1
  • Implementation of Command Line Arguments

    I assume that the main udemyScraper.py file will be used as a module, so I made a separate main.py which can be used for such operations. As of now only some basic arguments have been added; more will come in the future.

        -h  --help          Displays information about udemyscraper and its usage
        -v  --version       Displays the version of the tool
        -n  --no-warn       Disables the warning when initializing the udemyscourse class
    
    enhancement 
    opened by sortedcord 1
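
A couple of the issues above revolve around replacing the hardcoded delay with an explicit wait. Here is a minimal sketch of that technique using Selenium's WebDriverWait and expected_conditions; the CSS selector is a placeholder, since Udemy's real class names change frequently (see "Classes Frequently Keep Changing"):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.udemy.com/courses/search/?q=javascript")

# Proceed as soon as the first result is visible instead of always
# sleeping for a fixed number of seconds; gives up after 10 seconds.
WebDriverWait(driver, timeout=10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".course-card"))
)
html = driver.page_source
driver.quit()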
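
And for the "Optimize element search" issue, here is the find-versus-select point in miniature; the HTML is invented purely to show the two styles side by side:

from bs4 import BeautifulSoup

html = '<div class="card"><div class="body"><h3 class="title">Angular</h3></div></div>'
soup = BeautifulSoup(html, "html.parser")

# Nested finds: one Python-level traversal per call.
title = soup.find("div", class_="card").find("div", class_="body").find("h3")

# A single CSS selector with a direct path does the same in one call.
title = soup.select_one("div.card > div.body > h3.title")
print(title.text)  # Angular
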
Releases(0.8.2)
  • 0.8.2(Oct 2, 2021)

  • Beta(Aug 29, 2021)

    The long-awaited (at least by me) distribution update for udemyscraper. Find this project on PyPI: https://pypi.org/project/udemyscraper/

    Added

    • Udemyscraper can now export multiple courses to csv files!

    • course_to_csv takes an array as an input and dumps each course to a single csv file.
    • Udemyscraper can now export courses to xml files!

    • course_to_xml is a function that can be used to export the course object to an xml file with the appropriate tags and format.
    • udemyscraper.export submodule for exporting scraped course.
    • Support for Microsoft Edge (Chromium Based) browser.
    • Support for Brave Browser.

    Changes

    • Udemyscraper.py has been refactored into 5 different files:

      • __init__.py - Contains the code which will run when imported as a library
      • metadata.py - Contains metadata of the package such as the name, version, author, etc. Used by setup.py
      • output.py - Contains functions for outputting the course information.
      • udscraperscript.py - The script file which will run when you want to use udemyscraper as a script.
      • utils.py - Contains utility related functions for udemyscraper.
    • Now using udemyscraper.export instead of udemyscraper.output.

      • quick_display function has been replaced with print_course function.
    • Now using setup.py instead of setup.cfg

    • Deleted src folder which is now replaced by udemyscraper folder which is the source directory for all the modules

    • Installation Process

      Since udemyscraper is now meant to be used as a package, the installation process has naturally also seen major changes.

      Installation process is documented here

    • Renamed the browser_preference key in Preferences dictionary to browser

    • Relocated browser determination to utils as set_browser function.

    • Removed requirements.txt and pyproject.toml

    Fixed

    • Fixed cache argument bug.
    • Fixed importing preferences bug.
    • Fixed Banner Image scraping.
    • Fixed Progressbar exception handling.
    • Fixed recognition of chrome as a valid browser.
    • Preferences will not be printed while using the script.
    • Fixed browser key error
    Source code(tar.gz)
    Source code(zip)
    udemyscraper-0.8.1-py3-none-any.whl(31.19 KB)
    udemyscraper-0.8.1.tar.gz(4.87 MB)
Owner
Aditya Gupta
🎓 Student 🎨 Front-end Dev & Part-time weeb ϞϞ(๑⚈ ․̫ ⚈๑)∩
VG-Scraper is a Python program using the BeautifulSoup module, which allows anyone to scrape something off a website. This program lets you put in a number through an input, where each number corresponds to one news article.

VG-Scraper VG-Scraper is a convenient program where you can find all the news articles instead of finding one yourself. Installing [Linux] Open a term

3 Feb 13, 2022
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
:arrow_double_down: Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

Mort Yao 46.4k Jan 03, 2023
Hot-search rankings - Python crawler + regex (re) + beautifulsoup + xpath

Repository overview: Weibo hot-search list (parameter wb), Baidu hot-search list (parameter bd), 360 hot-topics list (parameter 360), CSDN hot-list API (see below); other hot searches to be added. How to use? Register on Vercel, fork to your repository, and click deploy in the top-right corner (one-click deployment). Request parameters: the configured Vercel address + api?tit= + parameter (parameter info is in the repository overview

Harry 3 Jul 08, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit

wallstreetbets-tracker Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit.

91 Dec 08, 2022
Linkedin webscraping - Linkedin web scraping with python

linkedin_webscraping This is the first step of a full project called "LinkedIn J

Pedro Dib 4 Apr 24, 2022
Automatically complete the daily body-temperature report (GitHub Actions)

Body Temperature Report Assistant. Overview: automatically submits the body-temperature report every day at 10:30 GMT+8. To change the scheduled run time, edit the schedule property in .github/workflows/SduHealthReport.yml. If there is an anomaly that day, please fill in the report manually on the mini-program/PC side!

Teng Zhang 23 Sep 15, 2022
A scheduled HITsz epidemic-reporting script based on GitHub Actions, ready to use out of the box

HITsz Daily Report: a scheduled auto-reporting script for the "HITsz epidemic system" portal, based on GitHub Actions, ready to use out of the box. Thanks to @JellyBeanXiewh for the original script and idea. Thanks to @bugstop for refactoring the script and adding Easy Connect on-campus proxy access.

Ter 56 Nov 27, 2022
A multithreaded tool for searching and downloading images from popular search engines. It is straightforward to set up and run!

🕳️ CygnusX1 Code by Trong-Dat Ngo. Overviews 🕳️ CygnusX1 is a multithreaded tool 🛠️ , used to search and download images from popular search engine

DatNgo 32 Dec 31, 2022
Fundamentus scrapy

Fundamentus_scrapy Downloads information that the other Fundamentus scrapers do not collect. To start (python main.py), a file will be created named

Guilherme Silva Uchoa 1 Oct 24, 2021
Web3 Pancakeswap Sniper bot written in python3

Pancakeswap_BSC_Sniper_Bot Web3 Pancakeswap Sniper bot written in python3, Please note the license conditions! The first Binance Smart Chain sniper bo

Treading-Tigers 295 Dec 31, 2022
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-lea

Ronie Martinez 326 Dec 15, 2022
This program scrapes information and images for movies and TV shows.

Media-WebScraper This program scrapes information and images for movies and TV shows. Summary For more information on the program, read the WebScrape_

1 Dec 05, 2021
Scraping followers of an instagram account

ScrapInsta A script for scraping data from Instagram. Install First of all you can run: pip install scrapinsta After that you need to install these requ

Matheus Kolln 1 Sep 05, 2021
TarkovScrappy - A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov!

TarkovScrappy A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov! Hideout items

Joshua Smeda 2 Apr 11, 2022
A Very simple free proxy list scraper.

Scrappp A very simple free proxy list scraper, made in Python. The tool scrapes proxies from different sites and APIs. Screenshots About the script !!! RE

Joji aka Moncef 12 Oct 27, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 08, 2023
Simple tool to scrape and download cross country ski timings and results from live.skidor.com

LiveSkidorDownload Simple tool to scrape and download cross country ski timings

0 Jan 07, 2022
Tool to scan for secret files on HTTP servers

snallygaster Finds file leaks and other security problems on HTTP servers. what? snallygaster is a tool that looks for files accessible on web servers

Hanno Böck 2k Dec 28, 2022