Motion capture is the process of recording human movements. It has many applications in medicine, the military, sports, computer vision, and the film industry. Motion capture (or motion tracking) started as a photogrammetric analysis tool in biomechanics research in the 1970s and 1980s, and as the technology matured it expanded into education, training, sports, and more recently computer animation for television, cinema, and video games. In traditional systems, the performer wears markers near each joint so that the motion can be identified from the positions of, or angles between, the markers. Acoustic, inertial, LED, magnetic, or reflective markers, or combinations of these, are tracked, ideally at a sampling rate of at least twice the frequency of the desired motion. The purpose of motion capture is to record only the movements of the actor, not their visual appearance. The resulting animation data is mapped to a 3D model so that the model performs the same actions as the actor.

Current motion capture systems fall into two main categories: optical and non-optical. In optical systems, the position of the subject is computed from images obtained by two or more calibrated cameras. The image from each camera is processed to find the positions of special markers placed on the subject. Markers can be passive (good reflectors) or active (usually LEDs). In non-optical systems, the subject's position in 3D space is obtained by collecting the signals of special sensors placed on the subject; mechanical, magnetic, and inertial systems are examples of this category.

The main disadvantage of current motion capture systems is the need for special hardware and corresponding software. The cost of such systems can be prohibitive for small productions. In addition, a system may impose requirements on the space it operates in, for example a chroma key (green screen) background or limits on magnetic distortion.
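To illustrate how an optical system recovers a 3D position from two calibrated views, the following is a minimal sketch of linear (DLT) triangulation in NumPy. The projection matrices and the marker point here are hypothetical stand-ins for a real calibrated rig, not part of the proposal itself.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2: 3x4 camera projection matrices (intrinsics times extrinsics).
    x1, x2: the point's 2D image coordinates in each view.
    Returns the 3D point in Euclidean coordinates.
    """
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the null vector of A, i.e. the last right-singular vector.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean

# Hypothetical setup: one camera at the origin, one translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])          # a marker in front of both cameras
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]   # project into view 1
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]   # project into view 2

print(triangulate(P1, P2, x1, x2))  # recovers X_true (noise-free case)
```

With noisy detections from more than two cameras, the same linear system simply gains two rows per view and the SVD gives the least-squares solution.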
This proposal presents a method for human motion capture based on deep convolutional neural networks (CNNs). Our method consists of two main phases: 1) 2D human pose estimation, and 2) aggregation of the 2D data into 3D. First, the image captured by each camera (the cameras are assumed to be calibrated) is processed by the designed CNN. Our CNN is trained to locate human joints in the 2D input image: the input is a raw RGB image and the output is the location of each detected joint in the image. Our model detects all humans in the image and is independent of the number of people present. It is also capable of capturing fingers if the input image has sufficient resolution. Then, the 2D positions from the images are aggregated to construct a single 3D map of the human structure and motion. To do so, we need the intrinsic and extrinsic matrices of all cameras, which are obtained through standard calibration methods. Finally, the 3D points are mapped to a virtual character, which then follows the captured motions.

Alongside body movement, the system can also capture facial points, using the same procedure as for the body joints. Facial points can be used in either 2D or 3D form, owing to the symmetry and geometric constraints of the face.

The main advantage of the proposed method is that it does not require any special hardware: it works with low-cost ordinary cameras and no sensors. The load of the system is shifted from equipment and hardware (as in traditional systems) to computations performed in software. This is because the core task of a motion capture system, detecting special positions on the subject, is carried out algorithmically by our designed CNN. We need no additional hardware; however, because a CNN lies at the core of our mocap system, a GPU is recommended.
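A common way for a pose-estimation CNN to report joint locations is one heatmap per joint, with the joint at the heatmap's peak (as in stacked-hourglass or OpenPose-style models). The sketch below shows only the decoding step, with the network itself stubbed out by synthetic Gaussian heatmaps; the joint names, heatmap size, and threshold are illustrative assumptions, not details of the proposed model.

```python
import numpy as np

JOINTS = ["head", "l_shoulder", "r_shoulder"]  # hypothetical subset of joints

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Synthetic stand-in for one channel of the CNN's output."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_heatmaps(heatmaps, threshold=0.5):
    """Return an (x, y) pixel coordinate per joint, or None if the
    peak response is below threshold (joint not confidently detected)."""
    joints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((int(x), int(y)) if hm[y, x] >= threshold else None)
    return joints

# Fake network output: one 64x64 heatmap per joint.
h, w = 64, 64
fake_output = np.stack([
    gaussian_heatmap(h, w, 32, 10),   # head
    gaussian_heatmap(h, w, 20, 25),   # left shoulder
    gaussian_heatmap(h, w, 44, 25),   # right shoulder
])

print(decode_heatmaps(fake_output))  # [(32, 10), (20, 25), (44, 25)]
```

In the full pipeline, the decoded 2D joints from each camera view would then feed the aggregation phase described above.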