Closed

Web Scraping & Data Comparison

This three-part program requires web scraping, database interaction, data comparison, and output. All programming must be completed in Ruby. A flexible web scraping API, such as scRUBYt, must be used. The program should be simple and easy to modify.

## Deliverables

Part 1: Web Scraping

The program should use the scRUBYt plug-in (or another developer-recommended API) to crawl and scrape contact and employment data from various company contact databases publicly available on the internet. This part of the program should be easy to adapt for different websites.

The program must scrape the following data

Last Name

First Name

Company Name

Title

Department

Address 1*

Address 2

Phone & Extension*

Email Address

*If a specific value cannot be found, a default value must be used
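
As a rough illustration of the field list and the default rule for the starred fields, the scraped data could be held in a simple Ruby structure. This is only a sketch: the names `Contact`, `build_contact`, `DEFAULT_ADDRESS_1` and `DEFAULT_PHONE` are hypothetical, and the actual default values are not specified in this posting.

```ruby
# Hypothetical placeholder defaults; the posting does not state the real ones.
DEFAULT_ADDRESS_1 = 'ADDRESS UNKNOWN'
DEFAULT_PHONE     = '000-000-0000'

# One scraped contact, with the nine fields listed above.
Contact = Struct.new(
  :last_name, :first_name, :company_name, :title, :department,
  :address_1, :address_2, :phone, :email,
  keyword_init: true
)

# Build a Contact from a hash of scraped strings. Missing fields become
# empty strings, and the two starred fields fall back to the defaults
# when they are missing or blank.
def build_contact(scraped)
  attrs = Contact.members.to_h { |field| [field, scraped[field].to_s.strip] }
  attrs[:address_1] = DEFAULT_ADDRESS_1 if attrs[:address_1].empty?
  attrs[:phone]     = DEFAULT_PHONE     if attrs[:phone].empty?
  Contact.new(**attrs)
end
```

For example, `build_contact(last_name: 'Smith', email: 'smith@example.com')` would return a contact whose `address_1` and `phone` hold the placeholder defaults.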

Programs that can scrape the following types of web pages need to be developed

[url removed, login to view] - A directory in which names are categorized alphabetically. The program needs to be given a URL string with “X” being replaced with A-Z and then scraped.

[url removed, login to view] - A program that scrapes data appearing on individual department pages, given a URL string in which “X” is replaced

[url removed, login to view] - A program that gathers URLs matching certain criteria on one screen and then scrapes those URLs (individual web pages) to gather the required information. It must submit FORM data (example of an individual page: [url removed, login to view])

<[url removed, login to view]> (search for “a” as first name) - A program that will click a text Next link, an image Next link, or a numbered page link (pages which require you to click a numbered link 1, 2, 3, 4 to continue to the next page)

<[url removed, login to view]> (search name “smith”) - A program that must enter common criteria (first and last names) into a form field to gather a results list
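
For the alphabetical directory case above, the A-Z substitution could be sketched as follows. The helper name `directory_urls` and the example URL are assumptions made for illustration; the real URLs are not shown in this posting, and the fetching and scraping step itself is left to the chosen scraping API.

```ruby
# Expand a URL template into one URL per letter, A through Z.
# Only the first occurrence of the placeholder is replaced, so pick a
# placeholder that does not appear earlier in the URL.
def directory_urls(template, placeholder: 'X')
  ('A'..'Z').map { |letter| template.sub(placeholder, letter) }
end

# Hypothetical usage: each generated URL would then be handed to the scraper.
directory_urls('http://example.com/directory?letter=X').each do |url|
  # scrape(url) would go here
end
```

If a site's URL already contains a capital "X" elsewhere, a more distinctive placeholder can be passed through the `placeholder:` keyword.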

Part 2: Database Interaction

The database should have the following fields

Last Name

First Name

Company Name

Title

Department

Address 1

Address 2

Phone & Extension

Email Address

Created Time

Last Updated

Current Record (increments to 100, with 100 being the most recent)

Unemployed

An object must be created to interface the results of Part 1 with a MySQL database. The following needs to happen

1) If a new entry is detected (via unique e-mail address) then a new database entry is created

2) If an entry is modified (via unique e-mail address) a new record is created with a higher current record field

3) If an entry is not found or someone has left the company (via unique e-mail address) a new record is created with the unique email address and an unemployed flag
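
The three rules above amount to comparing the freshly scraped set against the latest stored records, keyed by e-mail address. A minimal in-memory sketch of that comparison, assuming both sides have already been loaded into hashes keyed by e-mail (the method name `classify_changes` is hypothetical, and the MySQL reads and writes are intentionally left out):

```ruby
# Classify scraped contacts against the latest stored records.
# Both arguments are hashes keyed by e-mail address; each value is a
# hash of the remaining fields. Returns the e-mails that fall under
# each of the three rules.
def classify_changes(stored, scraped)
  shared = scraped.keys & stored.keys
  {
    new:        scraped.keys - stored.keys,                         # rule 1
    updated:    shared.reject { |email| scraped[email] == stored[email] }, # rule 2
    unemployed: stored.keys - scraped.keys                          # rule 3
  }
end
```

How the Current Record value is raised for an updated entry is not covered here, since the exact numbering scheme would need to be confirmed with the employer.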

Before a record is created in the database checks need to be run to ensure that

1) Email address is in a correct format

2) Prefixes (Dr., Mr., Ms., Mrs., Miss, Rev.) and Suffixes (III, Jr., Sr., Ph.D., II, R.N., LCSW, M.D.) should be removed

3) Titles never have numbers

4) Names cannot be a single letter

5) Email addresses must match the company domain

6) First name cannot be the same as last name
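
The six checks above could be sketched in plain Ruby roughly as follows. The names `strip_affixes`, `validation_errors` and `EMAIL_FORMAT` are hypothetical, the e-mail pattern is deliberately loose rather than a full address validator, and the case-insensitive comparisons are an assumption.

```ruby
PREFIXES = %w[Dr. Mr. Ms. Mrs. Miss Rev.].freeze
SUFFIXES = %w[III Jr. Sr. Ph.D. II R.N. LCSW M.D.].freeze

# Loose pattern: something@domain.tld, with no spaces or extra "@".
EMAIL_FORMAT = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Check 2: drop listed prefixes and suffixes (exact, case-sensitive match).
def strip_affixes(name)
  words = name.to_s.split(/[\s,]+/).reject(&:empty?)
  words.reject { |w| PREFIXES.include?(w) || SUFFIXES.include?(w) }.join(' ')
end

# Checks 1 and 3-6: return a list of problems. An empty list means the
# record passed. Names are expected to have gone through strip_affixes.
def validation_errors(first_name:, last_name:, title:, email:, company_domain:)
  errors = []
  errors << 'email format' unless email.match?(EMAIL_FORMAT)
  errors << 'email domain' unless email.downcase.end_with?("@#{company_domain.downcase}")
  errors << 'title contains a number' if title.match?(/\d/)
  # Also rejects empty names, which goes slightly beyond "single letter".
  errors << 'name too short' if [first_name, last_name].any? { |n| n.strip.length < 2 }
  errors << 'first name equals last name' if first_name.casecmp?(last_name)
  errors
end
```

For example, `strip_affixes('Smith, Jr.')` gives `"Smith"`, and a record would only be written to the database when `validation_errors(...)` returns an empty array.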

Part 3: Export

A program must be run that will export the newly created records in a line-by-line CSV format. Three files should be output: 1) new additions, 2) updates, and 3) people who are newly unemployed
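
One possible shape for this export step, using Ruby's standard `csv` library. The file names, the header row, and the helper names `records_to_csv` and `export_changes` are assumptions for illustration only; the posting does not specify them.

```ruby
require 'csv'

EXPORT_COLUMNS = %i[last_name first_name company_name title department
                    address_1 address_2 phone email].freeze

# Hypothetical output file names, one per category.
EXPORT_FILES = {
  new:        'new_additions.csv',
  updated:    'updates.csv',
  unemployed: 'unemployed.csv'
}.freeze

# Render record hashes as CSV text: a header row, then one record per line.
def records_to_csv(records)
  CSV.generate do |csv|
    csv << EXPORT_COLUMNS.map(&:to_s)
    records.each { |record| csv << EXPORT_COLUMNS.map { |col| record[col] } }
  end
end

# Write the three files into +dir+ and return their paths.
# +changes+ maps :new, :updated and :unemployed to arrays of record hashes;
# a missing category produces a file containing only the header row.
def export_changes(changes, dir)
  EXPORT_FILES.map do |kind, filename|
    path = File.join(dir, filename)
    File.write(path, records_to_csv(changes.fetch(kind, [])))
    path
  end
end
```

Calling `export_changes({ new: [...], updated: [...], unemployed: [...] }, '.')` would write the three files into the current directory.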

Skills: Engineering, Linux, MySQL, PHP, Ruby on Rails, Software Architecture, Software Testing


About the Employer:
( 23 reviews ) Bangkok, Thailand

Project ID: #3664643

5 freelancers are bidding on average $425 for this job
