Making data analysis possible by building a Big Data platform

@ Orange

Key achievements

  • Took part in every step of building an industrial Hadoop Big Data platform
  • Used the newly built platform to implement two projects
  • Helped grow the Data Engineering team from two to more than ten people

Client

Orange is the first- or second-largest telecom company in most European countries as well as in Africa, with a total of 266M clients.

I worked for Orange's French B2B data service for two years (2016-2018).

Problem

At the time, Orange's B2B data was distributed among many applications. This made cross-referencing data slow, complex, and hard to standardize. The existing BI architecture and tools also did not allow the storage and analysis of massive, less-structured data.

I joined the team at its creation; it was in charge of building a Hadoop Big Data platform to give access to all of the French B2B data in a single place, in a simple way, and with modern tools to analyze it.

Process

I arrived at the very beginning of the project and took part in every step, from defining the architecture of the Hadoop platform, to building a framework, to implementing each needed module and growing the team.

Here are the steps in which I took part:

1. Defining an Architecture

Defining the development standards, the global processes, the file organization, the naming conventions, and so on.
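To make this concrete, here is a minimal sketch of the kind of file-organization and naming convention such standards can define. The zones, paths, and names below are hypothetical illustrations, not Orange's actual conventions.

```bash
#!/usr/bin/env bash
# Hypothetical HDFS layout: one directory per zone (raw, ods, kpi),
# then per source and table, partitioned by ingestion date.
# All paths and names are illustrative only.

HDFS_ROOT="/data/b2b"

for zone in raw ods kpi; do
  hdfs dfs -mkdir -p "${HDFS_ROOT}/${zone}"
done

# Example: raw files for source "crm", table "contracts", day 2017-06-01
hdfs dfs -mkdir -p "${HDFS_ROOT}/raw/crm/contracts/ingest_date=2017-06-01"
```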

2. Upskilling

Learning how to use the different Big Data tools needed (Spark, Hive, Nifi, Sqoop, Oozie, Hue, Ambari, ...), and sharing best practices.

3. Building a Framework

Building a Bash Shell framework to make all the needed tools easy to use.
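As a rough idea of what such a framework can provide (the function name, JDBC URL, and log layout here are my own illustrative assumptions, not the actual framework), a wrapper like this standardizes how every job calls Hive and where its logs go:

```bash
#!/usr/bin/env bash
# Minimal sketch of a framework-style wrapper around beeline.
# The JDBC URL, log layout, and function name are hypothetical.

JDBC_URL="jdbc:hive2://hiveserver:10000/default"
LOG_DIR="/var/log/b2b_framework"

run_hive() {
  local sql_file="$1"; shift
  local log_file="${LOG_DIR}/$(basename "${sql_file}").$(date +%Y%m%d_%H%M%S).log"

  # Remaining arguments become --hivevar key=value substitutions.
  local hivevars=()
  local kv
  for kv in "$@"; do hivevars+=(--hivevar "${kv}"); done

  beeline -u "${JDBC_URL}" "${hivevars[@]}" -f "${sql_file}" >"${log_file}" 2>&1
  local rc=$?
  if [ "${rc}" -ne 0 ]; then
    echo "FAILED: ${sql_file} (see ${log_file})" >&2
  fi
  return "${rc}"
}

# Usage: run_hive load_contracts.hql ingest_date=2017-06-01
```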

4. Data Collection

Automatically collecting data from multiple sources with Sqoop and Nifi.
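For example, a nightly import from a relational source with Sqoop might look like the sketch below; the connection string, credentials, table, and target path are all placeholders:

```bash
#!/usr/bin/env bash
# Sketch of a daily Sqoop import into the raw zone.
# Connection details, credentials, and names are placeholders.

DAY=$(date +%Y-%m-%d)

sqoop import \
  --connect "jdbc:oracle:thin:@crm-db:1521/CRM" \
  --username "${SQOOP_USER}" \
  --password-file /user/etl/.sqoop_pwd \
  --table CONTRACTS \
  --target-dir "/data/b2b/raw/crm/contracts/ingest_date=${DAY}" \
  --num-mappers 4 \
  --fields-terminated-by '\t'
```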

5. Data Ingestion

Ensuring the quality of the collected data and storing it in Operational Data Stores (ODS) with Hive (SQL-like).
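A typical ingestion step, sketched here with invented table and column names, applies basic quality rules (such as rejecting rows with a missing business key) while loading a day's raw data into a partition of an ODS table:

```bash
#!/usr/bin/env bash
# Sketch: load one day of raw data into the ODS with a basic quality rule.
# All table and column names are illustrative.

DAY="$1"   # e.g. 2017-06-01

beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
  INSERT OVERWRITE TABLE ods.contracts PARTITION (ingest_date='${DAY}')
  SELECT contract_id, client_id, offer_code, start_date
  FROM   raw.crm_contracts
  WHERE  ingest_date = '${DAY}'
    AND  contract_id IS NOT NULL   -- quality rule: reject rows missing the business key
"
```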

6. Data Transformation

Cross-referencing the different sources and building the desired KPIs, following rules defined with the business teams, with Hive or Spark (Scala).
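As a sketch of what such a KPI build can look like (the KPI and every table name here are invented for illustration), a Hive query cross-references two ODS tables and writes an aggregated result:

```bash
#!/usr/bin/env bash
# Sketch of a daily KPI build: join two ODS sources, aggregate, store.
# The KPI definition and all names are illustrative.

DAY="$1"

beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
  INSERT OVERWRITE TABLE kpi.calls_per_client PARTITION (day='${DAY}')
  SELECT ct.client_id,
         COUNT(*)             AS nb_calls,
         AVG(cl.duration_sec) AS avg_duration_sec
  FROM   ods.call_logs  cl
  JOIN   ods.contracts  ct ON ct.contract_id = cl.contract_id
  WHERE  cl.day = '${DAY}'
  GROUP BY ct.client_id
"
```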

7. Data Exposition

Allowing each application or user to access only the data it needs, with Hive or Spark.
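One common way to do this with Hive, sketched below with hypothetical names, is to expose per-consumer views containing only the needed columns, and to grant access on the view rather than on the underlying table:

```bash
#!/usr/bin/env bash
# Sketch: expose a restricted view to one consuming application.
# Names are illustrative; real access control may also rely on Ranger or Sentry.

beeline -u "jdbc:hive2://hiveserver:10000/default" -e "
  CREATE VIEW IF NOT EXISTS expo.callcenter_kpis AS
  SELECT client_id, nb_calls, avg_duration_sec, day
  FROM   kpi.calls_per_client;

  -- assumes Hive SQL-standard authorization is enabled:
  GRANT SELECT ON TABLE expo.callcenter_kpis TO ROLE callcenter_app;
"
```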

8. Data Visualization

Visualizing data with tools like Qlik Sense or Plotly.

9. Defining Deployment Steps

For each module, coding and documenting the deployment steps for installation or updates in production.
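The sketch below shows what such a deployment script could look like for one module; every path, name, and URL is a placeholder:

```bash
#!/usr/bin/env bash
# Sketch of scripted deployment steps for one module.
# Every path, name, and URL is a placeholder.
set -euo pipefail

MODULE="ods_contracts"
VERSION="$1"   # e.g. 1.2.0
INSTALL_DIR="/opt/b2b/${MODULE}"

# 1. Install the released scripts
mkdir -p "${INSTALL_DIR}/${VERSION}"
tar -xzf "/tmp/${MODULE}-${VERSION}.tar.gz" -C "${INSTALL_DIR}/${VERSION}"
ln -sfn "${INSTALL_DIR}/${VERSION}" "${INSTALL_DIR}/current"

# 2. Create or update the Hive objects (DDL shipped with the release)
beeline -u "jdbc:hive2://hiveserver:10000/default" \
        -f "${INSTALL_DIR}/current/ddl/create_tables.hql"

# 3. (Re)submit the Oozie coordinator that schedules the module
oozie job -oozie http://oozie-server:11000/oozie \
          -config "${INSTALL_DIR}/current/conf/coordinator.properties" -run
```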

10. Automating

For each task from the previous steps that became repetitive, coding an automated way to perform it.
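For instance, once onboarding a new source table became routine, a driver script could run the collection and ingestion steps from a one-line-per-table configuration file. Everything below, including the helper scripts it calls, is a simplified hypothetical example:

```bash
#!/usr/bin/env bash
# Sketch: automate the repetitive "onboard one table" task from a config file.
# Hypothetical format, tab-separated: source_db  table  key_column
# The two helper scripts are placeholders for the steps shown earlier.

while IFS=$'\t' read -r source_db table key_column; do
  [ -z "${source_db}" ] && continue   # skip blank lines

  echo "Onboarding ${source_db}.${table} (key: ${key_column})"
  ./collect_table.sh "${source_db}" "${table}"                   # Sqoop import (step 4)
  ./ingest_table.sh  "${source_db}" "${table}" "${key_column}"   # ODS load (step 5)
done < tables.conf
```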

11. Documenting and Transferring Knowledge

For each module, writing enough documentation that a newcomer can understand how it works on their own. Ensuring that everyone knows how everything works, so the team can keep working even in case of absence or resignation.

Results

One year later, more than ten people had joined the team. We had gathered data from many sources and were transforming and exposing it for two major projects in an automated, industrial way.

The first project provided call center managers with a tool to better monitor their activity and improve client service.

The second consisted of building tools to better explore and understand client activity across various digital channels, and to improve customer experience.

Finally, new large and less-structured data was made available for future use, and the platform would allow future Machine Learning projects to be developed easily.

Tools & methods:

Hadoop, Spark, Hive, Nifi, Sqoop, Oozie, Hue, Ambari

SQL, Scala, Bash Shell, Linux (command line), Git

Agile methodology, Knowledge Management

Interested in cooperation, or would like to discuss anything?
